Compare commits

..

179 Commits

Author SHA1 Message Date
ntohidi
a5354f267a Merge branch 'develop' into release/v0.8.0 2026-01-16 11:34:24 +01:00
unclecode
6090629ee0 Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.

Fixes temperature=0.01 error for these models in LLM extraction.
2026-01-16 09:56:38 +00:00
unclecode
a00da6557b Add async agenerate_schema method for schema generation
- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
2026-01-16 07:05:48 +00:00
ntohidi
177e298af0 Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery 2026-01-14 14:19:23 +01:00
ntohidi
f09146c435 Release v0.8.0: The v0.8.0 Update
- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
2026-01-14 13:46:42 +01:00
ntohidi
315eae9e6f Add examples for deep crawl crash recovery and prefetch mode in documentation 2026-01-14 12:58:44 +01:00
unclecode
530cde351f Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery
2026-01-12 13:45:42 +00:00
ntohidi
122b4fe3f0 Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates 2026-01-12 13:46:39 +01:00
ntohidi
acfab80dd4 Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests 2026-01-12 13:46:32 +01:00
unclecode
f24396c23e Fix critical RCE and LFI vulnerabilities in Docker API deployment
Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)
2026-01-12 04:14:37 +00:00
unclecode
6b2dca76c3 Docs: Add multi-sample schema generation section
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
2026-01-04 12:50:08 +00:00
unclecode
0d3f9e65b0 Add MEMORY.md to gitignore 2025-12-30 03:04:30 +00:00
unclecode
db61ab8559 Update URL seeder docs with smart TTL cache parameters
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
2025-12-30 03:03:41 +00:00
unclecode
3d78001c30 Add smart TTL cache for sitemap URL seeder
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
2025-12-30 01:59:09 +00:00
unclecode
2550f3d2d5 Add browser pipeline support for raw:/file:// URLs
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310
2025-12-27 12:32:42 +00:00
unclecode
a43256b27a Add proxy support to HTTP crawler strategy 2025-12-26 13:17:28 +00:00
unclecode
9e7f5aa44b Updates on proxy rotation and proxy configuration 2025-12-26 12:45:57 +00:00
unclecode
fde4e9f0c6 Add prefetch mode for two-phase deep crawling
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-25 01:55:08 +00:00
unclecode
3937efcf0b Add base_url parameter to CrawlerRunConfig for raw HTML processing
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url='raw:{html}', config=config)
2025-12-24 06:05:55 +00:00
unclecode
624e34164d Fix: HTTP strategy raw: URL parsing truncates at # character
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
2025-12-24 04:31:57 +00:00
unclecode
31ebf37252 Add crash recovery for deep crawl strategies
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
2025-12-22 14:51:10 +00:00
unclecode
67e03d64b8 Add PDF and MHTML support for raw: and file:// URLs
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
2025-12-22 01:24:51 +00:00
unclecode
444cb14f82 Add _generate_screenshot_from_html for raw: and file:// URLs
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
2025-12-22 01:10:20 +00:00
unclecode
48426f73f0 Some debugging for caching 2025-12-21 04:48:03 +00:00
unclecode
f6b29a8f9f Update gitignore 2025-12-21 04:48:03 +00:00
unclecode
02acad1dc6 Fix CDP connection handling: support WS URLs and proper cleanup
Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
2025-12-18 22:04:52 +08:00
unclecode
d10ca38599 Add init_scripts support to BrowserConfig for pre-page-load JS injection
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
2025-12-14 01:58:11 +00:00
unclecode
ecedb6113e Add context caching to create_isolated_context branch
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
2025-12-13 08:58:21 +00:00
unclecode
55eb968a8d Add create_isolated_context flag for concurrent CDP crawls
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
2025-12-13 08:29:05 +00:00
unclecode
6185d3cb32 Revert context matching attempts - Playwright cannot see CDP-created contexts 2025-12-13 07:57:29 +00:00
unclecode
8014805c17 Fix: use CDP to find context by browserContextId for concurrent sessions 2025-12-13 07:02:23 +00:00
unclecode
c1e485e0b0 Fix: use target_id to find correct page in get_page 2025-12-13 06:51:54 +00:00
unclecode
b2e4a1f2e3 Fix: find context by target_id for concurrent CDP connections 2025-12-13 06:41:13 +00:00
unclecode
d22825eea4 Fix: add cdp_cleanup_on_close to from_kwargs 2025-12-13 06:33:26 +00:00
unclecode
66941a59e8 Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios 2025-12-13 06:25:25 +00:00
unclecode
8ae908bede Add browser_context_id and target_id parameters to BrowserConfig
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
2025-12-13 02:42:48 +00:00
ntohidi
306ddcbf3d Merge branch 'main' into develop 2025-12-11 11:18:30 +01:00
Nasrin
a87e8c1c9e Release/v0.7.8 (#1662)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
2025-12-11 11:04:52 +01:00
UncleCode
835e3c56fe Add disk cleanup step in Docker release workflow
Added a step to free up disk space before the build process.
2025-12-11 09:49:27 +01:00
Nasrin
5a8fb57795 Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter
[Fix]: Docker server does not decode ContentRelevanceFilter
2025-12-03 18:36:07 +08:00
ntohidi
df4d87ed78 refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 2025-12-03 10:59:18 +01:00
Nasrin
f32cfc6db0 Merge pull request #1645 from unclecode/fix/configurable-backoff
Make LLM backoff configurable end-to-end
2025-12-02 21:07:49 +08:00
Nasrin
d06c39e8ab Merge pull request #1641 from unclecode/fix/serialize-proxy-config
Fix BrowserConfig proxy_config serialization
2025-12-02 21:06:02 +08:00
ntohidi
afc31e144a Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-12-02 13:01:11 +01:00
ntohidi
07ccf13be6 Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 2025-12-02 13:00:54 +01:00
Aravind
3a07c5962c Sponsors/new (#1643) 2025-12-02 00:49:39 +01:00
Chris Murphy
6893094f58 parameterized tests 2025-12-01 16:19:19 -05:00
Chris Murphy
3a8f8298d3 import modules from enhanceable deserialization 2025-12-01 16:18:59 -05:00
Chris Murphy
e95e8e1a97 generalized query in ContentRelevanceFilter to be a str or list 2025-12-01 16:16:31 -05:00
Chris Murphy
eb76df2c0d added missing deep crawling objects to init 2025-12-01 16:15:58 -05:00
Chris Murphy
6ec6bc4d8a pass timeout parameter to docker client request 2025-12-01 16:15:27 -05:00
Chris Murphy
33a3cc3933 reproduced AttributeError from #1642 2025-12-01 11:31:07 -05:00
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
Nasrin
dcb77c94bf Merge pull request #1623 from unclecode/fix/deprecated_pydantic
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
2025-11-27 20:05:42 +08:00
Soham Kukreti
a0c5f0f79a fix: ensure BrowserConfig.to_dict serializes proxy_config 2025-11-26 17:44:06 +05:30
ntohidi
b36c6daa5c Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 2025-11-25 11:51:59 +01:00
Nasrin
94c8a833bf Merge pull request #1447 from rbushri/fix/wrong_url_raw
Fix: Wrong URL variable used for extraction of raw html
2025-11-25 17:49:44 +08:00
ntohidi
84bfea8bd1 Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 2025-11-25 10:46:00 +01:00
Aravind
0024c82cdc Sponsors/new (#1637) 2025-11-24 13:29:33 +01:00
Rachel Bushrian
7771ed3894 Merge branch 'develop' into fix/wrong_url_raw 2025-11-24 13:54:07 +02:00
AHMET YILMAZ
eca04b0368 Refactor Pydantic model configuration to use ConfigDict for arbitrary types 2025-11-18 15:40:17 +08:00
ntohidi
c2c4d42be4 Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
2025-11-17 12:21:23 +01:00
Aravind
f68e7531e3 Sponsors/scrapeless (#1619) 2025-11-17 07:44:52 +01:00
UncleCode
cb637fb5c4 Merge pull request #1613 from unclecode/release/v0.7.7 2025-11-16 12:26:54 +01:00
ntohidi
6244f56f36 Release v0.7.7
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
2025-11-14 10:23:31 +01:00
ntohidi
2c973b1183 Merge branch 'develop' into release/v0.7.7 2025-11-13 14:54:05 +01:00
Nasrin
f3146de969 Merge pull request #1609 from unclecode/fix/update-config-documentation
Update browser and crawler run config documentation to match async_configs.py implementation
2025-11-13 21:52:53 +08:00
Soham Kukreti
d6b6d11a2d docs: update browser and crawler run config documentation to match async_configs.py implementation
Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)

- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth

- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental

- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
2025-11-13 14:54:16 +05:30
ntohidi
b58579548c Bump version to 0.7.7 for stable release 2025-11-13 09:52:18 +01:00
Nasrin
466be69e72 Merge pull request #1607 from unclecode/fix/dfs_deep_crawling
Fix/dfs deep crawling
2025-11-13 16:43:47 +08:00
AHMET YILMAZ
ceade853c3 Enhance DFSDeepCrawlStrategy documentation for clarity and detail 2025-11-13 16:39:08 +08:00
ntohidi
998c809e08 Rename folder name for NSTProxy integration examples for crawl4ai 2025-11-13 09:36:39 +01:00
ntohidi
d0fb53540d Update proxy-security documentation 2025-11-13 09:23:44 +01:00
Nasrin
8116b15b63 Merge pull request #1596 from unclecode/docs-proxy-security
#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2025-11-13 16:22:28 +08:00
AHMET YILMAZ
fe353c4e27 Refactor proxy configuration documentation for clarity and consistency 2025-11-13 11:20:24 +08:00
ntohidi
89cc29fe44 Merge branch 'fix/docker' into develop 2025-11-12 17:06:31 +01:00
Nasrin
cdcb8836b7 Merge pull request #1605 from Nstproxy/feat/nstproxy
feat: Add Nstproxy Proxies
2025-11-12 23:56:14 +08:00
Nasrin
b207ae2848 Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing
Add CDP endpoint verification with exponential backoff for managed browsers
2025-11-12 23:53:57 +08:00
Nasrin
be00fc3a42 Merge pull request #1598 from unclecode/fix/sitemap_seeder
#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…
2025-11-12 18:09:34 +08:00
Nasrin
124ac583bb Merge pull request #1599 from unclecode/docs-llm-strategies-update
#1551 : Fix casing and variable name consistency for LLMConfig in doc…
2025-11-12 17:54:26 +08:00
AHMET YILMAZ
1bd3de6a47 #1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking 2025-11-12 17:44:43 +08:00
nstproxy
80452166c8 feat: Add Nstproxy Proxies 2025-11-12 16:25:39 +08:00
UncleCode
a99cd37c0e Merge pull request #1597 from unclecode/sponsors/capsolver 2025-11-11 14:50:44 +08:00
AHMET YILMAZ
2e8f8c9b49 #1551 : Fix casing and variable name consistency for LLMConfig in documentation 2025-11-10 15:38:14 +08:00
AHMET YILMAZ
80745bceb9 #1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder 2025-11-10 14:15:54 +08:00
Aravind Karnam
4bee230c37 docs: Add a tip for captcha solving usecases using a third party integration 2025-11-10 11:20:48 +05:30
Aravind
006e29f308 Merge pull request #1589 from capsolver/main
Add some examples of using capsolver to solve captcha
2025-11-10 10:45:16 +05:30
AHMET YILMAZ
263ac890fd #1591
: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
2025-11-10 11:42:07 +08:00
unclecode
1a22fb4d4f docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities

- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track

- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00
unclecode
81b5312629 Update gitignore 2025-11-09 10:49:42 +08:00
Nasrin
d56b0eb9a9 Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser
feat(ManagedBrowser): add viewport size configuration for browser launch
2025-11-06 18:42:45 +08:00
Nasrin
66175e132b Merge pull request #1590 from unclecode/fix/async-llm-extraction-arunMany
This commit resolves issue #1055 where LLM extraction was blocking async
2025-11-06 18:40:42 +08:00
ntohidi
a30548a98f This commit resolves issue #1055 where LLM extraction was blocking async
execution, causing URLs to be processed sequentially instead of in parallel.

  Changes:
  - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
  - Implemented arun() method in ExtractionStrategy base class with thread pool fallback
  - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
  - Updated AsyncWebCrawler.arun() to detect and use arun() when available
  - Added comprehensive test suite to verify parallel execution

  Impact:
  - LLM extraction now runs truly in parallel across multiple URLs
  - Significant performance improvement for multi-URL crawls with LLM strategies
  - Backward compatible - existing extraction strategies continue to work
  - No breaking changes to public API

  Technical details:
  - Uses litellm.acompletion for non-blocking LLM calls
  - Leverages asyncio.gather for concurrent chunk processing
  - Maintains backward compatibility via asyncio.to_thread fallback
  - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
2025-11-06 11:22:45 +01:00
CapSolver
2ae9899eac Clarify CapSolver integration instructions
Updated text for clarity and capitalization.
2025-11-06 15:49:30 +08:00
CapSolver
57aeb70f00 Add CapSolver Captcha Solver 2025-11-06 15:37:31 +08:00
Nasrin
2c918155aa Merge pull request #1529 from unclecode/fix/remove_overlay_elements
Fix remove_overlay_elements functionality by calling injected JS function.
2025-11-06 00:10:32 +08:00
Nasrin
854694ef33 Merge pull request #1537 from unclecode/fix/docker-compose-llm-env
fix(docker): Remove environment variable overrides in docker-compose.yml
2025-11-06 00:07:51 +08:00
Nasrin
6534ece026 Merge pull request #1532 from unclecode/fix/update-documentation
Standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
2025-11-05 23:37:05 +08:00
Nasrin
89e28d4eee Merge pull request #1558 from unclecode/claude/fix-update-pyopenssl-security-011CUPexU25DkNvoxfu5ZrnB
Claude/fix update pyopenssl security 011 cu pex u25 dk nvoxfu5 zrn b
2025-10-28 17:09:11 +08:00
ntohidi
c0f1865287 feat(api): update marketplace version and build date in root endpoint response 2025-10-26 11:35:39 +01:00
ntohidi
46ef1116c4 fix(app-detail): enhance tab functionality, hide documentation and support tabs in marketplace 2025-10-26 11:21:29 +01:00
Nasrin
4df83893ac Merge pull request #1560 from unclecode/fix/marketplace
Fix/marketplace
2025-10-23 22:17:06 +08:00
ntohidi
13e116610d fix(marketplace): improve app detail page content rendering and UX
Fixed multiple issues with app detail page content display and formatting
2025-10-23 16:12:30 +02:00
Claude
613097d121 test: add verification tests for pyOpenSSL security update
- Add lightweight security test to verify version requirements
- Add comprehensive integration test for crawl4ai functionality
- Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7
- All tests passing: security vulnerability is resolved

Related to #1545

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 06:57:25 +00:00
Claude
44ef0682b0 fix: update pyOpenSSL to >=25.3.0 to address security vulnerability
- Updates pyOpenSSL from >=24.3.0 to >=25.3.0
- This resolves CVE affecting cryptography package versions >=37.0.0 & <43.0.1
- pyOpenSSL 25.3.0 requires cryptography>=45.0.7, which is above the vulnerable range
- Fixes issue #1545

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 06:51:25 +00:00
Nasrin
40173eeb73 Update Docker hooks and Webhook documents (#1557)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* docs: enhance README and docker-deployment documentation with Job Queue and Webhook API details

* docs: update docker_hooks_examples.py with comprehensive examples and improved structure

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 22:34:19 +08:00
ntohidi
b74524fdfb docs: update docker_hooks_examples.py with comprehensive examples and improved structure 2025-10-22 16:29:19 +02:00
ntohidi
bcac486921 docs: enhance README and docker-deployment documentation with Job Queue and Webhook API details 2025-10-22 16:19:30 +02:00
ntohidi
6aef5a120f Merge branch 'main' into develop 2025-10-22 15:53:54 +02:00
Nasrin
7cac008c10 Release/v0.7.6 (#1556)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* Release v0.7.6: The 0.7.6 Update

- Updated version to 0.7.6
- Added comprehensive demo and release notes
- Updated all documentation
- Update the veriosn in Dockerfile to 0.7.6

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 20:41:06 +08:00
ntohidi
7e8fb3a8f3 Merge branch 'release/v0.7.5' into develop 2025-10-22 13:16:16 +02:00
ntohidi
3efb59fb9a fix: update Crawl4AI Docker container port from 11234 to 11235 2025-10-22 13:14:11 +02:00
ntohidi
c7b7475b92 fix: remove duplicate comma in webhook_config parameter 2025-10-22 13:12:42 +02:00
ntohidi
b71d624168 Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop 2025-10-22 13:12:25 +02:00
ntohidi
d670dcde0a feat: add webhook support for /llm/job endpoint
Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)
2025-10-22 13:03:09 +02:00
unclecode
f8606f6865 fix: properly serialize Pydantic HttpUrl in webhook config
Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.
2025-10-22 15:50:25 +08:00
Claude
52da8d72bc test: add comprehensive webhook feature test script
Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-22 00:35:07 +00:00
Claude
8b7e67566e test: add webhook implementation validation tests
Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-22 00:25:35 +00:00
Claude
7388baa205 docs: add webhook example for Docker deployment
Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:38:53 +00:00
Claude
897bc3a493 docs: add webhook documentation to Docker README
Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:21:07 +00:00
Claude
8a37710313 feat: add webhook notifications for crawl job completion
Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:17:40 +00:00
ntohidi
97c92c4f62 fix(marketplace): replace hardcoded app detail content with database-driven fields.
The app detail page was displaying hardcoded/templated content instead of
using actual data from the database. This prevented admins from controlling
the content shown in Overview, Integration, and Documentation tabs.
2025-10-21 15:39:04 +02:00
ntohidi
f6a02c4358 Merge branch 'develop' into release/v0.7.5 2025-10-21 09:25:29 +02:00
unclecode
6d1a398419 feat(ci): split release pipeline and add Docker caching
- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup
2025-10-21 10:53:12 +08:00
unclecode
c107617920 fix: thoroughly verify and fix all Crawl4AI skill examples
- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)
2025-10-19 17:08:04 +08:00
unclecode
69d0ef89dd fix: update Crawl4AI skill with corrected parameters and examples
- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working
2025-10-19 16:16:20 +08:00
unclecode
1bf85bcb1a fix: remove non-existent wiki link and clarify skill usage instructions 2025-10-19 13:19:14 +08:00
unclecode
749232ba1a feat: add AI assistant skill package for Crawl4AI
- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.
2025-10-19 13:19:14 +08:00
unclecode
c7288dd2f1 docs: add complete SDK reference documentation
Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB
2025-10-19 13:19:14 +08:00
unclecode
73a5a7b0f5 Update gitignore 2025-10-18 12:41:29 +08:00
unclecode
05921811b8 docs: add comprehensive technical architecture documentation
Created ARCHITECTURE.md as a complete technical reference for the
Crawl4AI Docker server, replacing the stress test pipeline document
with production-grade documentation.

Contents:
- System overview with architecture diagrams
- Core components deep-dive (server, API, utils)
- Smart browser pool implementation details
- Real-time monitoring system architecture
- WebSocket implementation and fallback strategy
- Memory management and container detection
- Production optimizations and code review fixes
- Deployment guides (local, Docker, production)
- Comprehensive troubleshooting section
- Debug tools and performance tuning
- Test suite documentation
- Architecture decision log (ADRs)

Target audience: Developers maintaining or extending the system
Goal: Enable rapid onboarding and confident modifications
2025-10-18 12:05:49 +08:00
unclecode
25507adb5b feat(monitor): implement code review fixes and real-time WebSocket monitoring
Backend Improvements (11 fixes applied):

Critical Fixes:
- Add lock protection for browser pool access in monitor stats
- Ensure async track_janitor_event across all call sites
- Improve error handling in monitor request tracking (already in place)

Important Fixes:
- Replace fire-and-forget Redis with background persistence worker
- Add time-based expiry for completed requests/errors (5min cleanup)
- Implement input validation for monitor route parameters
- Add 4s timeout to timeline updater to prevent hangs
- Add warning when killing browsers with active requests
- Implement monitor cleanup on shutdown with final persistence
- Document memory estimates with TODO for actual tracking

Frontend Enhancements:

WebSocket Real-time Updates:
- Add WebSocket endpoint at /monitor/ws for live monitoring
- Implement auto-reconnect with exponential backoff (max 5 attempts)
- Add graceful fallback to HTTP polling on WebSocket failure
- Send comprehensive updates every 2 seconds (health, requests, browsers, timeline, events)

UI/UX Improvements:
- Add live connection status indicator with pulsing animation
  - Green "Live" = WebSocket connected
  - Yellow "Connecting..." = Attempting connection
  - Blue "Polling" = Fallback to HTTP polling
  - Red "Disconnected" = Connection failed
- Restore original beautiful styling for all sections
- Improve request table layout with flex-grow for URL column
- Add browser type text labels alongside emojis
- Add flex layout to browser section header

Testing:
- Add test-websocket.py for WebSocket validation
- All 7 integration tests passing successfully

Summary: 563 additions across 6 files
2025-10-18 11:38:25 +08:00
unclecode
aba4036ab6 Add demo and test scripts for monitor dashboard activity
- Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity.
- Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics.
- Added a logo image to the static assets for branding purposes.
2025-10-17 22:43:06 +08:00
unclecode
e2af031b09 feat(monitor): add real-time monitoring dashboard with Redis persistence
Complete observability solution for production deployments with terminal-style UI.

**Backend Implementation:**
- `monitor.py`: Stats manager tracking requests, browsers, errors, timeline data
- `monitor_routes.py`: REST API endpoints for all monitor functionality
  - GET /monitor/health - System health snapshot
  - GET /monitor/requests - Active & completed requests
  - GET /monitor/browsers - Browser pool details
  - GET /monitor/endpoints/stats - Aggregated endpoint analytics
  - GET /monitor/timeline - Time-series data (memory, requests, browsers)
  - GET /monitor/logs/{janitor,errors} - Event logs
  - POST /monitor/actions/{cleanup,kill_browser,restart_browser} - Control actions
  - POST /monitor/stats/reset - Reset counters
- Redis persistence for endpoint stats (survives restart)
- Timeline tracking (5min window, 5s resolution, 60 data points)

**Frontend Dashboard** (`/dashboard`):
- **System Health Bar**: CPU%, Memory%, Network I/O, Uptime
- **Pool Status**: Live counts (permanent/hot/cold browsers + memory)
- **Live Activity Tabs**:
  - Requests: Active (realtime) + recent completed (last 100)
  - Browsers: Detailed table with actions (kill/restart)
  - Janitor: Cleanup event log with timestamps
  - Errors: Recent errors with stack traces
- **Endpoint Analytics**: Count, avg latency, success%, pool hit%
- **Resource Timeline**: SVG charts (memory/requests/browsers) with terminal aesthetics
- **Control Actions**: Force cleanup, restart permanent, reset stats
- **Auto-refresh**: 5s polling (toggleable)

**Integration:**
- Janitor events tracked (close_cold, close_hot, promote)
- Crawler pool promotion events logged
- Timeline updater background task (5s interval)
- Lifespan hooks for monitor initialization

**UI Design:**
- Terminal vibe matching Crawl4AI theme
- Dark background, cyan/pink accents, monospace font
- Neon glow effects on charts
- Responsive layout, hover interactions
- Cross-navigation: Playground ↔ Monitor

**Key Features:**
- Zero-config: Works out of the box with existing Redis
- Real-time visibility into pool efficiency
- Manual browser management (kill/restart)
- Historical data persistence
- DevOps-friendly UX

Routes:
- API: `/monitor/*` (backend endpoints)
- UI: `/dashboard` (static HTML)
2025-10-17 21:36:25 +08:00
unclecode
b97eaeea4c feat(docker): implement smart browser pool with 10x memory efficiency
Major refactoring to eliminate memory leaks and enable high-scale crawling:

- **Smart 3-Tier Browser Pool**:
  - Permanent browser (always-ready default config)
  - Hot pool (configs used 3+ times, longer TTL)
  - Cold pool (new/rare configs, short TTL)
  - Auto-promotion: cold → hot after 3 uses
  - 100% pool reuse achieved in tests

- **Container-Aware Memory Detection**:
  - Read cgroup v1/v2 memory limits (not host metrics)
  - Accurate memory pressure detection in Docker
  - Memory-based browser creation blocking

- **Adaptive Janitor**:
  - Dynamic cleanup intervals (10s/30s/60s based on memory)
  - Tiered TTLs: cold 30-300s, hot 120-600s
  - Aggressive cleanup at high memory pressure

- **Unified Pool Usage**:
  - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm)
  - Fixed config signature mismatch (permanent browser matches endpoints)
  - get_default_browser_config() helper for consistency

- **Configuration**:
  - Reduced idle_ttl: 1800s → 300s (30min → 5min)
  - Fixed port: 11234 → 11235 (match Gunicorn)

**Performance Results** (from stress tests):
- Memory: 10x reduction (500-700MB × N → 270MB permanent)
- Latency: 30-50x faster (<100ms pool hits vs 3-5s startup)
- Reuse: 100% for default config, 60%+ for variants
- Capacity: 100+ concurrent requests (vs ~20 before)
- Leak: 0 MB/cycle (stable across tests)

**Test Infrastructure**:
- 7-phase sequential test suite (tests/)
- Docker stats integration + log analysis
- Pool promotion verification
- Memory leak detection
- Full endpoint coverage

Fixes memory issues reported in production deployments.
2025-10-17 20:38:39 +08:00
UncleCode
fdbcddbf1a Merge pull request #1546 from unclecode/sponsors 2025-10-17 18:07:16 +08:00
Aravind Karnam
564d437d97 docs: fix order of star history and Current sponsors 2025-10-17 15:31:29 +05:30
Aravind Karnam
9cd06ea7eb docs: fix order of star history and Current sponsors 2025-10-17 15:30:02 +05:30
ntohidi
c91b235cb7 docs: Update 0.7.5 video walkthrough 2025-10-14 13:49:57 +08:00
Aravind Karnam
eb257c2ba3 docs: fixed sponsorship link 2025-10-13 17:47:42 +05:30
Aravind Karnam
8d364a0731 docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:45:10 +05:30
Aravind Karnam
6aff0e55aa docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:42:29 +05:30
Aravind Karnam
38a0742708 docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:41:19 +05:30
Aravind Karnam
a720a3a9fe docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:32:34 +05:30
Aravind Karnam
017144c2dd docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:30:22 +05:30
Aravind Karnam
32887ea40d docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:13:52 +05:30
Aravind Karnam
eea41bf1ca docs: Add a slight background to compensate light theme on github docs 2025-10-13 17:00:24 +05:30
Aravind Karnam
21c302f439 docs: Add Current sponsors section in README file 2025-10-13 16:45:16 +05:30
ntohidi
8fc1747225 docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features. 2025-10-13 13:59:34 +08:00
ntohidi
aadab30c3d fix(docs): clarify Docker Hooks System with function-based API in README 2025-10-13 13:08:47 +08:00
ntohidi
4a04b8506a feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:53:33 +08:00
ntohidi
7dadb65b80 Merge branch 'develop' into release/v0.7.5 2025-10-13 12:34:45 +08:00
ntohidi
a3f057e19f feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:34:08 +08:00
unclecode
216019f29a fix(marketplace): prevent hero image overflow and secondary card stretching
- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height
2025-10-11 12:52:04 +08:00
unclecode
abe8a92561 fix(marketplace): resolve app detail page routing and styling issues
- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.
2025-10-11 11:51:22 +08:00
unclecode
5a4f21fad9 fix(marketplace): isolate api under marketplace prefix 2025-10-09 22:26:15 +08:00
ntohidi
611d48f93b Merge branch 'develop' into release/v0.7.5 2025-10-09 12:53:39 +08:00
ntohidi
936397ee0e Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-10-09 12:53:15 +08:00
unclecode
2c373f0642 fix(marketplace): align admin api with backend endpoints 2025-10-08 18:42:19 +08:00
unclecode
d2c7f345ab feat(docs): add chatgpt quick link to page actions 2025-10-07 11:59:25 +08:00
unclecode
8c62277718 feat(marketplace): add sponsor logo uploads
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-06 20:58:35 +08:00
Soham Kukreti
46e1a67f61 fix(docker): Remove environment variable overrides in docker-compose.yml (#1411)
The docker-compose.yml had an `environment:` section with variable
substitutions (${VAR:-}) that was overriding values from .llm.env with
empty strings.

- Commented out the `environment:` section to prevent overwrites
- Added clear warning comment explaining the override behavior
- .llm.env values now load directly into container without interference
2025-10-06 14:41:22 +05:30
Soham Kukreti
7dfe528d43 fix(docs): standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
- Switch installs to pip install -r requirements.txt (tutorial and app docs)
- Update local run steps to python server.py and http://localhost:8000
- Set default PORT to 8000; update port-in-use commands and alt port 8001
- Replace unsupported :contains() example with accessible attribute selector
- Update example URLs in tutorial servers to 127.0.0.1:8000
- Add “Identity-based crawling” section with crwl profiles CLI workflow and code usage
- Replace legacy-docs note with sponsorship message in docs/md_v2/index.md
- Minor copy and consistency fixes across pages
2025-10-03 22:00:46 +05:30
unclecode
5145d42df7 fix(docs): hide copy menu on non-markdown pages 2025-10-03 20:11:20 +08:00
Nasrin
9900f63f97 Merge pull request #1531 from unclecode/develop
Marketplace and brand book changes
2025-10-03 13:24:51 +08:00
ntohidi
9292b265fc Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-10-03 12:57:23 +08:00
Nasrin
80aa6c11d9 Merge pull request #1530 from Sjoeborg/fix/arun-many-returns-none
Fix: run_urls() returns None, crashing arun_many()
2025-10-03 12:57:06 +08:00
unclecode
749d200866 fix(marketplace): Update URLs to use /marketplace path and relative API endpoints
- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index
2025-10-02 17:08:50 +08:00
unclecode
408ad1b750 feat(marketplace): Add Crawl4AI marketplace with secure configuration
- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security
2025-10-02 16:41:11 +08:00
Martin Sjöborg
35dd206925 fix: always return a list, even if we catch an exception 2025-10-02 09:21:44 +02:00
Martin Sjöborg
8d30662647 fix: remove this import as it causes python to treat "json" as a variable in the except block 2025-10-02 09:19:15 +02:00
unclecode
ef46df10da Update gitignore add local scripts folder 2025-09-30 18:31:57 +08:00
unclecode
0d8d043109 feat(docs): add brand book and page copy functionality
- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design
2025-09-30 18:28:05 +08:00
ntohidi
70af81d9d7 refactor(release): remove memory management section for cleaner documentation. ref #1443 2025-09-30 11:54:21 +08:00
Soham Kukreti
2dc6588573 fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396
- Fix critical bug where overlay removal JS function was injected but never called
  - Change remove_overlay_elements() to properly execute the injected async function
  - Wrap JS execution in async to handle the async overlay removal logic
  - Add test_remove_overlay_elements() test case to verify functionality works
  - Ensure overlay elements (cookie banners, popups, modals) are actually removed

  The remove_overlay_elements feature now works as intended:
  - Before: Function definition injected but never executed (silent failure)
  - After: Function injected and called, successfully removing overlay elements
2025-09-29 20:40:08 +05:30
Soham Kukreti
34c0996ee4 fix: Add CDP endpoint verification with exponential backoff for managed browsers (#1445)
browser_manager:
- Add CDP endpoint verification with retry logic and exponential backoff
- Call verification before connecting to CDP in `start()` method
- Graceful handling of timing issues during browser startup

test_cdp_strategy:
- Fix cookie persistence test by adding storage state management
- Fix session management test to work with managed browser architecture
- Add comprehensive CDP timing tests covering:
  - Fast startup scenarios
  - Delayed browser startup simulation
  - Exponential backoff behavior validation
  - Concurrent browser connections
  - Stress testing with multiple successive startups
  - Retry count verification

Impact:
- Eliminates browser startup failures due to CDP timing issues
- Provides robust fallback with automatic retries
- Maintains fast startup when CDP is immediately available
- Comprehensive test coverage ensures reliability

Resolves CDP connection timing issues in managed browser mode.
2025-09-29 19:31:09 +05:30
ntohidi
361499d291 Release v0.7.5: The Update
- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation
2025-09-29 18:05:26 +08:00
AHMET YILMAZ
e3467c08f6 #1490 feat(ManagedBrowser): add viewport size configuration for browser launch 2025-09-17 17:40:38 +08:00
rbushria
edd0b576b1 Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
2025-09-01 23:15:56 +03:00
232 changed files with 52236 additions and 20748 deletions

100
.github/workflows/docker-release.yml vendored Normal file
View File

@@ -0,0 +1,100 @@
name: Docker Release
on:
release:
types: [published]
push:
tags:
- 'docker-rebuild-v*' # Allow manual Docker rebuilds via tags
jobs:
docker:
runs-on: ubuntu-latest
steps:
- name: Free up disk space
run: |
echo "=== Disk space before cleanup ==="
df -h
# Remove unnecessary tools and libraries (frees ~25GB)
sudo rm -rf /usr/share/dotnet
sudo rm -rf /usr/local/lib/android
sudo rm -rf /opt/ghc
sudo rm -rf /opt/hostedtoolcache/CodeQL
sudo rm -rf /usr/local/share/boost
sudo rm -rf /usr/share/swift
# Clean apt cache
sudo apt-get clean
echo "=== Disk space after cleanup ==="
df -h
- name: Checkout code
uses: actions/checkout@v4
- name: Extract version from release or tag
id: get_version
run: |
if [ "${{ github.event_name }}" == "release" ]; then
# Triggered by release event
VERSION="${{ github.event.release.tag_name }}"
VERSION=${VERSION#v} # Remove 'v' prefix
else
# Triggered by docker-rebuild-v* tag
VERSION=${GITHUB_REF#refs/tags/docker-rebuild-v}
fi
echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
echo "Building Docker images for version: $VERSION"
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
echo "Semantic versions - Major: $MAJOR, Minor: $MINOR"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🐳 Docker Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Published Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Platforms" >> $GITHUB_STEP_SUMMARY
echo "- linux/amd64" >> $GITHUB_STEP_SUMMARY
echo "- linux/arm64" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🚀 Pull Command" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

917
.github/workflows/docs/ARCHITECTURE.md vendored Normal file
View File

@@ -0,0 +1,917 @@
# Workflow Architecture Documentation
## Overview
This document describes the technical architecture of the split release pipeline for Crawl4AI.
---
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ Developer │
│ │ │
│ ▼ │
│ git tag v1.2.3 │
│ git push --tags │
└──────────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ GitHub Repository │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Tag Event: v1.2.3 │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ release.yml (Release Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version │ │ │
│ │ │ v1.2.3 → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Validate Version │ │ │
│ │ │ Tag == __version__.py │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Build Python Package │ │ │
│ │ │ - Source dist (.tar.gz) │ │ │
│ │ │ - Wheel (.whl) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Upload to PyPI │ │ │
│ │ │ - Authenticate with token │ │ │
│ │ │ - Upload dist/* │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Create GitHub Release │ │ │
│ │ │ - Tag: v1.2.3 │ │ │
│ │ │ - Body: Install instructions │ │ │
│ │ │ - Status: Published │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Release Event: published (v1.2.3) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ docker-release.yml (Docker Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version from Release │ │ │
│ │ │ github.event.release.tag_name → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Parse Semantic Versions │ │ │
│ │ │ 1.2.3 → Major: 1, Minor: 1.2 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Setup Multi-Arch Build │ │ │
│ │ │ - Docker Buildx │ │ │
│ │ │ - QEMU emulation │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Authenticate Docker Hub │ │ │
│ │ │ - Username: DOCKER_USERNAME │ │ │
│ │ │ - Token: DOCKER_TOKEN │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Build Multi-Arch Images │ │ │
│ │ │ ┌────────────────┬────────────────┐ │ │ │
│ │ │ │ linux/amd64 │ linux/arm64 │ │ │ │
│ │ │ └────────────────┴────────────────┘ │ │ │
│ │ │ Cache: GitHub Actions (type=gha) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 6. Push to Docker Hub │ │ │
│ │ │ Tags: │ │ │
│ │ │ - unclecode/crawl4ai:1.2.3 │ │ │
│ │ │ - unclecode/crawl4ai:1.2 │ │ │
│ │ │ - unclecode/crawl4ai:1 │ │ │
│ │ │ - unclecode/crawl4ai:latest │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyPI │ │ Docker Hub │ │ GitHub │ │
│ │ │ │ │ │ │ │
│ │ crawl4ai │ │ unclecode/ │ │ Releases │ │
│ │ 1.2.3 │ │ crawl4ai │ │ v1.2.3 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Component Details
### 1. Release Pipeline (release.yml)
#### Purpose
Fast publication of Python package and GitHub release.
#### Input
- **Trigger**: Git tag matching `v*` (excluding `test-v*`)
- **Example**: `v1.2.3`
#### Processing Stages
##### Stage 1: Version Extraction
```bash
Input: refs/tags/v1.2.3
Output: VERSION=1.2.3
```
**Implementation**:
```bash
TAG_VERSION=${GITHUB_REF#refs/tags/v} # Remove 'refs/tags/v' prefix
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
```
##### Stage 2: Version Validation
```bash
Input: TAG_VERSION=1.2.3
Check: crawl4ai/__version__.py contains __version__ = "1.2.3"
Output: Pass/Fail
```
**Implementation**:
```bash
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
exit 1
fi
```
##### Stage 3: Package Build
```bash
Input: Source code + pyproject.toml
Output: dist/crawl4ai-1.2.3.tar.gz
dist/crawl4ai-1.2.3-py3-none-any.whl
```
**Implementation**:
```bash
python -m build
# Uses build backend defined in pyproject.toml
```
##### Stage 4: PyPI Upload
```bash
Input: dist/*.{tar.gz,whl}
Auth: PYPI_TOKEN
Output: Package published to PyPI
```
**Implementation**:
```bash
twine upload dist/*
# Environment:
# TWINE_USERNAME: __token__
# TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
```
##### Stage 5: GitHub Release Creation
```bash
Input: Tag: v1.2.3
Body: Markdown content
Output: Published GitHub release
```
**Implementation**:
```yaml
uses: softprops/action-gh-release@v2
with:
tag_name: v1.2.3
name: Release v1.2.3
body: |
Installation instructions and changelog
draft: false
prerelease: false
```
#### Output
- **PyPI Package**: https://pypi.org/project/crawl4ai/1.2.3/
- **GitHub Release**: Published release on repository
- **Event**: `release.published` (triggers Docker workflow)
#### Timeline
```
0:00 - Tag pushed
0:01 - Checkout + Python setup
0:02 - Version validation
0:03 - Package build
0:04 - PyPI upload starts
0:06 - PyPI upload complete
0:07 - GitHub release created
0:08 - Workflow complete
```
---
### 2. Docker Release Pipeline (docker-release.yml)
#### Purpose
Build and publish multi-architecture Docker images.
#### Inputs
##### Input 1: Release Event (Automatic)
```yaml
Event: release.published
Data: github.event.release.tag_name = "v1.2.3"
```
##### Input 2: Docker Rebuild Tag (Manual)
```yaml
Tag: docker-rebuild-v1.2.3
```
#### Processing Stages
##### Stage 1: Version Detection
```bash
# From release event:
VERSION = github.event.release.tag_name.strip("v")
# Result: "1.2.3"
# From rebuild tag:
VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "")
# Result: "1.2.3"
```
##### Stage 2: Semantic Version Parsing
```bash
Input: VERSION=1.2.3
Output: MAJOR=1
MINOR=1.2
PATCH=3 (implicit)
```
**Implementation**:
```bash
MAJOR=$(echo $VERSION | cut -d. -f1) # Extract first component
MINOR=$(echo $VERSION | cut -d. -f1-2) # Extract first two components
```
##### Stage 3: Multi-Architecture Setup
```yaml
Setup:
- Docker Buildx (multi-platform builder)
- QEMU (ARM emulation on x86)
Platforms:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
```
**Architecture**:
```
GitHub Runner (linux/amd64)
├─ Buildx Builder
│ ├─ Native: Build linux/amd64 image
│ └─ QEMU: Emulate ARM to build linux/arm64 image
└─ Generate manifest list (points to both images)
```
##### Stage 4: Docker Hub Authentication
```bash
Input: DOCKER_USERNAME
DOCKER_TOKEN
Output: Authenticated Docker client
```
##### Stage 5: Build with Cache
```yaml
Cache Configuration:
cache-from: type=gha # Read from GitHub Actions cache
cache-to: type=gha,mode=max # Write all layers
Cache Key Components:
- Workflow file path
- Branch name
- Architecture (amd64/arm64)
```
**Cache Hierarchy**:
```
Cache Entry: main/docker-release.yml/linux-amd64
├─ Layer: sha256:abc123... (FROM python:3.12)
├─ Layer: sha256:def456... (RUN apt-get update)
├─ Layer: sha256:ghi789... (COPY requirements.txt)
├─ Layer: sha256:jkl012... (RUN pip install)
└─ Layer: sha256:mno345... (COPY . /app)
Cache Hit/Miss Logic:
- If layer input unchanged → cache hit → skip build
- If layer input changed → cache miss → rebuild + all subsequent layers
```
##### Stage 6: Tag Generation
```bash
Input: VERSION=1.2.3, MAJOR=1, MINOR=1.2
Output Tags:
- unclecode/crawl4ai:1.2.3 (exact version)
- unclecode/crawl4ai:1.2 (minor version)
- unclecode/crawl4ai:1 (major version)
- unclecode/crawl4ai:latest (latest stable)
```
**Tag Strategy**:
- All tags point to same image SHA
- Users can pin to desired stability level
- Pushing new version updates `1`, `1.2`, and `latest` automatically
##### Stage 7: Push to Registry
```bash
For each tag:
For each platform (amd64, arm64):
Push image to Docker Hub
Create manifest list:
Manifest: unclecode/crawl4ai:1.2.3
├─ linux/amd64: sha256:abc...
└─ linux/arm64: sha256:def...
Docker CLI automatically selects correct platform on pull
```
#### Output
- **Docker Images**: 4 tags × 2 platforms = 8 image variants + 4 manifests
- **Docker Hub**: https://hub.docker.com/r/unclecode/crawl4ai/tags
#### Timeline
**Cold Cache (First Build)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64)
0:08 - Complete amd64 build
0:09 - Start build (arm64)
0:14 - Complete arm64 build
0:15 - Generate manifests
0:16 - Push all tags
0:17 - Workflow complete
```
**Warm Cache (Code Change Only)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64) - cache hit for layers 1-4
0:04 - Complete amd64 build (only layer 5 rebuilt)
0:05 - Start build (arm64) - cache hit for layers 1-4
0:06 - Complete arm64 build (only layer 5 rebuilt)
0:07 - Generate manifests
0:08 - Push all tags
0:09 - Workflow complete
```
---
## Data Flow
### Version Information Flow
```
Developer
crawl4ai/__version__.py
__version__ = "1.2.3"
├─► Git Tag
│ v1.2.3
│ │
│ ▼
│ release.yml
│ │
│ ├─► Validation
│ │ ✓ Match
│ │
│ ├─► PyPI Package
│ │ crawl4ai==1.2.3
│ │
│ └─► GitHub Release
│ v1.2.3
│ │
│ ▼
│ docker-release.yml
│ │
│ └─► Docker Tags
│ 1.2.3, 1.2, 1, latest
└─► Package Metadata
pyproject.toml
version = "1.2.3"
```
### Secrets Flow
```
GitHub Secrets (Encrypted at Rest)
├─► PYPI_TOKEN
│ │
│ ▼
│ release.yml
│ │
│ ▼
│ TWINE_PASSWORD env var (masked in logs)
│ │
│ ▼
│ PyPI API (HTTPS)
├─► DOCKER_USERNAME
│ │
│ ▼
│ docker-release.yml
│ │
│ ▼
│ docker/login-action (masked in logs)
│ │
│ ▼
│ Docker Hub API (HTTPS)
└─► DOCKER_TOKEN
docker-release.yml
docker/login-action (masked in logs)
Docker Hub API (HTTPS)
```
### Artifact Flow
```
Source Code
├─► release.yml
│ │
│ ▼
│ python -m build
│ │
│ ├─► crawl4ai-1.2.3.tar.gz
│ │ │
│ │ ▼
│ │ PyPI Storage
│ │ │
│ │ ▼
│ │ pip install crawl4ai
│ │
│ └─► crawl4ai-1.2.3-py3-none-any.whl
│ │
│ ▼
│ PyPI Storage
│ │
│ ▼
│ pip install crawl4ai
└─► docker-release.yml
docker build
├─► Image: linux/amd64
│ │
│ └─► Docker Hub
│ unclecode/crawl4ai:1.2.3-amd64
└─► Image: linux/arm64
└─► Docker Hub
unclecode/crawl4ai:1.2.3-arm64
```
---
## State Machines
### Release Pipeline State Machine
```
┌─────────┐
│ START │
└────┬────┘
┌──────────────┐
│ Extract │
│ Version │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Validate │─────►│ FAILED │
│ Version │ No │ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Yes
┌──────────────┐
│ Build │
│ Package │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Upload │─────►│ FAILED │
│ to PyPI │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Create │
│ GH Release │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
│ (Emit Event) │
└──────────────┘
```
### Docker Pipeline State Machine
```
┌─────────┐
│ START │
│ (Event) │
└────┬────┘
┌──────────────┐
│ Detect │
│ Version │
│ Source │
└──────┬───────┘
┌──────────────┐
│ Parse │
│ Semantic │
│ Versions │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Authenticate │─────►│ FAILED │
│ Docker Hub │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Build │
│ amd64 │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Build │─────►│ FAILED │
│ arm64 │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Push All │
│ Tags │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
└──────────────┘
```
---
## Security Architecture
### Threat Model
#### Threats Mitigated
1. **Secret Exposure**
- Mitigation: GitHub Actions secret masking
- Evidence: Secrets never appear in logs
2. **Unauthorized Package Upload**
- Mitigation: Scoped PyPI tokens
- Evidence: Token limited to `crawl4ai` project
3. **Man-in-the-Middle**
- Mitigation: HTTPS for all API calls
- Evidence: PyPI, Docker Hub, GitHub all use TLS
4. **Supply Chain Tampering**
- Mitigation: Immutable artifacts, content checksums
- Evidence: PyPI stores SHA256, Docker uses content-addressable storage
#### Trust Boundaries
```
┌─────────────────────────────────────────┐
│ Trusted Zone │
│ ┌────────────────────────────────┐ │
│ │ GitHub Actions Runner │ │
│ │ - Ephemeral VM │ │
│ │ - Isolated environment │ │
│ │ - Access to secrets │ │
│ └────────────────────────────────┘ │
│ │ │
│ │ HTTPS (TLS 1.2+) │
│ ▼ │
└─────────────────────────────────────────┘
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ PyPI │ │ Docker │ │ GitHub │
│ API │ │ Hub │ │ API │
└────────┘ └─────────┘ └──────────┘
External External External
Service Service Service
```
### Secret Management
#### Secret Lifecycle
```
Creation (Developer)
├─► PyPI: Create API token (scoped to project)
├─► Docker Hub: Create access token (read/write)
Storage (GitHub)
├─► Encrypted at rest (AES-256)
├─► Access controlled (repo-scoped)
Usage (Workflow)
├─► Injected as env vars
├─► Masked in logs (GitHub redacts on output)
├─► Never persisted to disk (in-memory only)
Transmission (API Call)
├─► HTTPS only
├─► TLS 1.2+ with strong ciphers
Rotation (Manual)
└─► Regenerate on PyPI/Docker Hub
Update GitHub secret
```
---
## Performance Characteristics
### Release Pipeline Performance
| Metric | Value | Notes |
|--------|-------|-------|
| Cold start | ~2-3 min | First run on new runner |
| Warm start | ~2-3 min | Minimal caching benefit |
| PyPI upload | ~30-60 sec | Network-bound |
| Package build | ~30 sec | CPU-bound |
| Parallelization | None | Sequential by design |
### Docker Pipeline Performance
| Metric | Cold Cache | Warm Cache (code) | Warm Cache (deps) |
|--------|-----------|-------------------|-------------------|
| Total time | 10-15 min | 1-2 min | 3-5 min |
| amd64 build | 5-7 min | 30-60 sec | 1-2 min |
| arm64 build | 5-7 min | 30-60 sec | 1-2 min |
| Push time | 1-2 min | 30 sec | 30 sec |
| Cache hit rate | 0% | 85% | 60% |
### Cache Performance Model
```python
def estimate_build_time(changes):
base_time = 60 # seconds (setup + push)
if "Dockerfile" in changes:
return base_time + (10 * 60) # Full rebuild: ~11 min
elif "requirements.txt" in changes:
return base_time + (3 * 60) # Deps rebuild: ~4 min
elif any(f.endswith(".py") for f in changes):
return base_time + 60 # Code only: ~2 min
else:
return base_time # No changes: ~1 min
```
---
## Scalability Considerations
### Current Limits
| Resource | Limit | Impact |
|----------|-------|--------|
| Workflow concurrency | 20 (default) | Max 20 releases in parallel |
| Artifact storage | 500 MB/artifact | PyPI packages small (<10 MB) |
| Cache storage | 10 GB/repo | Docker layers fit comfortably |
| Workflow run time | 6 hours | Plenty of headroom |
### Scaling Strategies
#### Horizontal Scaling (Multiple Repos)
```
crawl4ai (main)
├─ release.yml
└─ docker-release.yml
crawl4ai-plugins (separate)
├─ release.yml
└─ docker-release.yml
Each repo has independent:
- Secrets
- Cache (10 GB each)
- Concurrency limits (20 each)
```
#### Vertical Scaling (Larger Runners)
```yaml
jobs:
docker:
runs-on: ubuntu-latest-8-cores # GitHub-hosted larger runner
# 4x faster builds for CPU-bound layers
```
---
## Disaster Recovery
### Failure Scenarios
#### Scenario 1: Release Pipeline Fails
**Failure Point**: PyPI upload fails (network error)
**State**:
- ✓ Version validated
- ✓ Package built
- ✗ PyPI upload
- ✗ GitHub release
**Recovery**:
```bash
# Manual upload
twine upload dist/*
# Retry workflow (re-run from GitHub Actions UI)
```
**Prevention**: Add retry logic to PyPI upload
#### Scenario 2: Docker Pipeline Fails
**Failure Point**: ARM build fails (dependency issue)
**State**:
- ✓ PyPI published
- ✓ GitHub release created
- ✓ amd64 image built
- ✗ arm64 image build
**Recovery**:
```bash
# Fix Dockerfile
git commit -am "fix: ARM build dependency"
# Trigger rebuild
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
**Impact**: PyPI package available, only Docker ARM users affected
#### Scenario 3: Partial Release
**Failure Point**: GitHub release creation fails
**State**:
- ✓ PyPI published
- ✗ GitHub release
- ✗ Docker images
**Recovery**:
```bash
# Create release manually
gh release create v1.2.3 \
--title "Release v1.2.3" \
--notes "..."
# This triggers docker-release.yml automatically
```
---
## Monitoring and Observability
### Metrics to Track
#### Release Pipeline
- Success rate (target: >99%)
- Duration (target: <3 min)
- PyPI upload time (target: <60 sec)
#### Docker Pipeline
- Success rate (target: >95%)
- Duration (target: <15 min cold, <2 min warm)
- Cache hit rate (target: >80% for code changes)
### Alerting
**Critical Alerts**:
- Release pipeline failure (blocks release)
- PyPI authentication failure (expired token)
**Warning Alerts**:
- Docker build >15 min (performance degradation)
- Cache hit rate <50% (cache issue)
### Logging
**GitHub Actions Logs**:
- Retention: 90 days
- Downloadable: Yes
- Searchable: Limited
**Recommended External Logging**:
```yaml
- name: Send logs to external service
if: failure()
run: |
curl -X POST https://logs.example.com/api/v1/logs \
-H "Content-Type: application/json" \
-d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}"
```
---
## Future Enhancements
### Planned Improvements
1. **Automated Changelog Generation**
- Use conventional commits
- Generate CHANGELOG.md automatically
2. **Pre-release Testing**
- Test builds on `test-v*` tags
- Upload to TestPyPI
3. **Notification System**
- Slack/Discord notifications on release
- Email on failure
4. **Performance Optimization**
- Parallel Docker builds (amd64 + arm64 simultaneously)
- Persistent runners for better caching
5. **Enhanced Validation**
- Smoke tests after PyPI upload
- Container security scanning
---
## References
- [GitHub Actions Architecture](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions)
- [Docker Build Cache](https://docs.docker.com/build/cache/)
- [PyPI API Documentation](https://warehouse.pypa.io/api-reference/)
---
**Last Updated**: 2025-01-21
**Version**: 2.0

1029
.github/workflows/docs/README.md vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,287 @@
# Workflow Quick Reference
## Quick Commands
### Standard Release
```bash
# 1. Update version
vim crawl4ai/__version__.py # Set to "1.2.3"
# 2. Commit and tag
git add crawl4ai/__version__.py
git commit -m "chore: bump version to 1.2.3"
git tag v1.2.3
git push origin main
git push origin v1.2.3
# 3. Monitor
# - PyPI: ~2-3 minutes
# - Docker: ~1-15 minutes
```
### Docker Rebuild Only
```bash
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### Delete Tag (Undo Release)
```bash
# Local
git tag -d v1.2.3
# Remote
git push --delete origin v1.2.3
# GitHub Release
gh release delete v1.2.3
```
---
## Workflow Triggers
### release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Tag push | `v*` | `v1.2.3` |
| Excludes | `test-v*` | `test-v1.2.3` |
### docker-release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Release published | `release.published` | Automatic |
| Tag push | `docker-rebuild-v*` | `docker-rebuild-v1.2.3` |
---
## Environment Variables
### release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Git tag | `1.2.3` |
| `TWINE_USERNAME` | Static | `__token__` |
| `TWINE_PASSWORD` | Secret | `pypi-Ag...` |
| `GITHUB_TOKEN` | Auto | `ghp_...` |
### docker-release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Release/Tag | `1.2.3` |
| `MAJOR` | Computed | `1` |
| `MINOR` | Computed | `1.2` |
| `DOCKER_USERNAME` | Secret | `unclecode` |
| `DOCKER_TOKEN` | Secret | `dckr_pat_...` |
---
## Docker Tags Generated
| Version | Tags Created |
|---------|-------------|
| v1.0.0 | `1.0.0`, `1.0`, `1`, `latest` |
| v1.1.0 | `1.1.0`, `1.1`, `1`, `latest` |
| v1.2.3 | `1.2.3`, `1.2`, `1`, `latest` |
| v2.0.0 | `2.0.0`, `2.0`, `2`, `latest` |
---
## Workflow Outputs
### release.yml
| Output | Location | Time |
|--------|----------|------|
| PyPI Package | https://pypi.org/project/crawl4ai/ | ~2-3 min |
| GitHub Release | Repository → Releases | ~2-3 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
### docker-release.yml
| Output | Location | Time |
|--------|----------|------|
| Docker Images | https://hub.docker.com/r/unclecode/crawl4ai | ~1-15 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
---
## Common Issues
| Issue | Solution |
|-------|----------|
| Version mismatch | Update `crawl4ai/__version__.py` to match tag |
| PyPI 403 Forbidden | Check `PYPI_TOKEN` secret |
| PyPI 400 File exists | Version already published, increment version |
| Docker auth failed | Regenerate `DOCKER_TOKEN` |
| Docker build timeout | Check Dockerfile, review build logs |
| Cache not working | First build on branch always cold |
---
## Secrets Checklist
- [ ] `PYPI_TOKEN` - PyPI API token (project or account scope)
- [ ] `DOCKER_USERNAME` - Docker Hub username
- [ ] `DOCKER_TOKEN` - Docker Hub access token (read/write)
- [ ] `GITHUB_TOKEN` - Auto-provided (no action needed)
---
## Workflow Dependencies
### release.yml Dependencies
```yaml
Python: 3.12
Actions:
- actions/checkout@v4
- actions/setup-python@v5
- softprops/action-gh-release@v2
PyPI Packages:
- build
- twine
```
### docker-release.yml Dependencies
```yaml
Actions:
- actions/checkout@v4
- docker/setup-buildx-action@v3
- docker/login-action@v3
- docker/build-push-action@v5
Docker:
- Buildx
- QEMU (for multi-arch)
```
---
## Cache Information
### Type
- GitHub Actions Cache (`type=gha`)
### Storage
- **Limit**: 10GB per repository
- **Retention**: 7 days for unused entries
- **Cleanup**: Automatic LRU eviction
### Performance
| Scenario | Cache Hit | Build Time |
|----------|-----------|------------|
| First build | 0% | 10-15 min |
| Code change only | 85% | 1-2 min |
| Dependency update | 60% | 3-5 min |
| No changes | 100% | 30-60 sec |
---
## Build Platforms
| Platform | Architecture | Devices |
|----------|--------------|---------|
| linux/amd64 | x86_64 | Intel/AMD servers, AWS EC2, GCP |
| linux/arm64 | aarch64 | Apple Silicon, AWS Graviton, Raspberry Pi |
---
## Version Validation
### Pre-Tag Checklist
```bash
# Check current version
python -c "from crawl4ai.__version__ import __version__; print(__version__)"
# Verify it matches intended tag
# If tag is v1.2.3, version should be "1.2.3"
```
### Post-Release Verification
```bash
# PyPI
pip install crawl4ai==1.2.3
python -c "import crawl4ai; print(crawl4ai.__version__)"
# Docker
docker pull unclecode/crawl4ai:1.2.3
docker run unclecode/crawl4ai:1.2.3 python -c "import crawl4ai; print(crawl4ai.__version__)"
```
---
## Monitoring URLs
| Service | URL |
|---------|-----|
| GitHub Actions | `https://github.com/{owner}/{repo}/actions` |
| PyPI Project | `https://pypi.org/project/crawl4ai/` |
| Docker Hub | `https://hub.docker.com/r/unclecode/crawl4ai` |
| GitHub Releases | `https://github.com/{owner}/{repo}/releases` |
---
## Rollback Strategy
### PyPI (Cannot Delete)
```bash
# Increment patch version
git tag v1.2.4
git push origin v1.2.4
```
### Docker (Can Overwrite)
```bash
# Rebuild with fix
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### GitHub Release
```bash
# Delete release
gh release delete v1.2.3
# Delete tag
git push --delete origin v1.2.3
```
---
## Status Badge Markdown
```markdown
[![Release Pipeline](https://github.com/{owner}/{repo}/actions/workflows/release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/release.yml)
[![Docker Release](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml)
```
---
## Timeline Example
```
0:00 - Push tag v1.2.3
0:01 - release.yml starts
0:02 - Version validation passes
0:03 - Package built
0:04 - PyPI upload starts
0:06 - PyPI upload complete ✓
0:07 - GitHub release created ✓
0:08 - release.yml complete
0:08 - docker-release.yml triggered
0:10 - Docker build starts
0:12 - amd64 image built (cache hit)
0:14 - arm64 image built (cache hit)
0:15 - Images pushed to Docker Hub ✓
0:16 - docker-release.yml complete
Total: ~16 minutes
Critical path (PyPI + GitHub): ~8 minutes
```
---
## Contact
For workflow issues:
1. Check Actions tab for logs
2. Review this reference
3. See [README.md](./README.md) for detailed docs

View File

@@ -10,53 +10,53 @@ jobs:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
@@ -65,37 +65,7 @@ jobs:
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
@@ -103,26 +73,29 @@ jobs:
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
**Note:** Docker images are being built and will be available shortly.
Check the [Docker Release workflow](https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml) for build status.
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
@@ -132,11 +105,9 @@ jobs:
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "Docker images are being built in a separate workflow." >> $GITHUB_STEP_SUMMARY
echo "Check: https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml" >> $GITHUB_STEP_SUMMARY

142
.github/workflows/release.yml.backup vendored Normal file
View File

@@ -0,0 +1,142 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

30
.gitignore vendored
View File

@@ -1,8 +1,12 @@
# Scripts folder (private tools)
.scripts/
# Docker automation scripts (personal use)
docker-scripts/
# Database files
*.db
# Environment files
.env
.env.local
# Byte-compiled / optimized / DLL files
__pycache__/
@@ -262,9 +266,14 @@ continue_config.json
.llm.env
.private/
.claude/
.context/
CLAUDE_MONITOR.md
CLAUDE.md
.claude/
tests/**/test_site
tests/**/reports
tests/**/benchmark_reports
@@ -274,6 +283,17 @@ docs/**/data
docs/apps/linkdin/debug*/
docs/apps/linkdin/samples/insights/*
.yoyo/
.github/instructions/instructions.instructions.md
.kilocode/mcp.json
scripts/
# Databse files
*.sqlite3
*.sqlite3-journal
*.db-journal
*.db-wal
*.db-shm
*.db
*.rdb
*.ldb
MEMORY.md

View File

@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.8.0] - 2026-01-12
### Security
- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
- Prevents arbitrary module imports in user-provided hook code
- Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
- Credit: Neo by ProjectDiscovery
- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
- Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
- Only allows `http://`, `https://`, and `raw:` URLs
- Credit: Neo by ProjectDiscovery
### Breaking Changes
- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
- **Docker API: file:// URLs blocked**: Use Python library directly for local file processing
### Added
- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
### Fixed
- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
- **Caching System**: Various improvements to cache validation and persistence
### Documentation
- Multi-sample schema generation section
- URL seeder smart TTL cache parameters
- v0.8.0 migration guide
- Security policy and disclosure process
## [Unreleased]
### Added

View File

@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build
# C4ai version
ARG C4AI_VER=0.7.0-r1
ARG C4AI_VER=0.8.0
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER
@@ -124,7 +124,7 @@ COPY . /tmp/project/
# Copy supervisor config first (might need root later, but okay for now)
COPY deploy/docker/supervisord.conf .
COPY deploy/docker/routers ./routers
COPY deploy/docker/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
@@ -167,6 +167,11 @@ RUN mkdir -p /home/appuser/.cache/ms-playwright \
RUN crawl4ai-doctor
# Ensure all cache directories belong to appuser
# This fixes permission issues with .cache/url_seeder and other runtime cache dirs
RUN mkdir -p /home/appuser/.cache \
&& chown -R appuser:appuser /home/appuser/.cache
# Copy application code
COPY deploy/docker/* ${APP_HOME}/

251
README.md
View File

@@ -12,6 +12,16 @@
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_Well be onboarding in phases and working closely with early users.
Limited slots._
---
<p align="center">
<a href="https://x.com/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
@@ -27,11 +37,13 @@
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
[✨ Check out latest update v0.7.4](#-recent-updates)
[✨ Check out latest update v0.8.0](#-recent-updates)
✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
**New in v0.8.0**: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. Critical security fixes for Docker API (hooks disabled by default, file:// URLs blocked). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
✨ Recent v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -177,7 +189,7 @@ No rate-limited APIs. No lock-in. Build and own your data pipeline with direct g
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
@@ -294,6 +306,7 @@ pip install -e ".[all]" # Install all optional features
### New Docker Features
The new Docker implementation includes:
- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
@@ -308,7 +321,8 @@ The new Docker implementation includes:
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Visit the playground at http://localhost:11235/playground
# Visit the monitoring dashboard at http://localhost:11235/dashboard
# Or the playground at http://localhost:11235/playground
```
### Quick Test
@@ -337,7 +351,7 @@ else:
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).
</details>
@@ -542,8 +556,199 @@ async def test_news_crawl():
</details>
---
> **💡 Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as <strong>[CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration)</strong>. They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target websites terms of service and applicable laws.
## ✨ Recent Updates
<details open>
<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>
This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
- **🔄 Deep Crawl Crash Recovery**:
- `on_state_change` callback fires after each URL for real-time state persistence
- `resume_state` parameter to continue from a saved checkpoint
- JSON-serializable state for Redis/database storage
- Works with BFS, DFS, and Best-First strategies
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Continue from checkpoint
on_state_change=save_to_redis, # Called after each URL
)
```
- **⚡ Prefetch Mode for Fast URL Discovery**:
- `prefetch=True` skips markdown, extraction, and media processing
- 5-10x faster than full processing
- Perfect for two-phase crawling: discover first, process selectively
```python
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun("https://example.com", config=config)
# Returns HTML and links only - no markdown generation
```
- **🔒 Security Fixes (Docker API)**:
- Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
- `file://` URLs blocked on API endpoints to prevent LFI
- `__import__` removed from hook execution sandbox
[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
</details>
<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>
This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability.
- **🐳 Docker API Fixes**:
- Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642)
- Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629)
- Fixed `.cache` folder permissions in Docker image (#1638)
- **🤖 LLM Extraction Improvements**:
- Configurable rate limiter backoff with new `LLMConfig` parameters (#1269):
```python
from crawl4ai import LLMConfig
config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5s on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
```
- HTML input format support for `LLMExtractionStrategy` (#1178):
```python
from crawl4ai import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
llm_config=config,
instruction="Extract table data",
input_format="html" # Now supports: "html", "markdown", "fit_markdown"
)
```
- Fixed raw HTML URL variable - extraction strategies now receive `"Raw HTML"` instead of HTML blob (#1116)
- **🔗 URL Handling**:
- Fixed relative URL resolution after JavaScript redirects (#1268)
- Fixed import statement formatting in extracted code (#1181)
- **📦 Dependency Updates**:
- Replaced deprecated PyPDF2 with pypdf (#1412)
- Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678)
- **🧠 AdaptiveCrawler**:
- Fixed query expansion to actually use LLM instead of hardcoded mock data (#1621)
[Full v0.7.8 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
</details>
<details>
<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
```python
# Access the monitoring dashboard
# Visit: http://localhost:11235/dashboard
# Real-time metrics include:
# - System health (CPU, memory, network, uptime)
# - Active and completed request tracking
# - Browser pool management (permanent/hot/cold)
# - Janitor cleanup events
# - Error monitoring with full context
```
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
```python
import httpx
async with httpx.AsyncClient() as client:
# System health
health = await client.get("http://localhost:11235/monitor/health")
# Request tracking
requests = await client.get("http://localhost:11235/monitor/requests")
# Browser pool status
browsers = await client.get("http://localhost:11235/monitor/browsers")
# Endpoint statistics
stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
```
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
- **🧹 Janitor System**: Automatic resource management with event logging
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
- **🐛 Critical Bug Fixes**:
- Fixed async LLM extraction blocking issue (#1055)
- Enhanced DFS deep crawl strategy (#1607)
- Fixed sitemap parsing in AsyncUrlSeeder (#1598)
- Resolved browser viewport configuration (#1495)
- Fixed CDP timing with exponential backoff (#1528)
- Security update for pyOpenSSL (>=25.3.0)
[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
</details>
<details>
<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>
- **🔧 Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
- **✨ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
```python
from crawl4ai import hooks_to_string
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as regular Python functions
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({'X-Crawl4AI': 'v0.7.5'})
return page
# Option 1: Use hooks_to_string() utility for REST API
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Option 2: Docker client with automatic conversion (Recommended)
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
}
)
# ✓ Full IDE support, type checking, and reusability!
```
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
[Full v0.7.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
</details>
<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>
@@ -919,6 +1124,40 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
</details>
## 🌟 Current Sponsors
### 🏢 Enterprise Sponsors & Partners
Our enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.
| Company | About | Sponsorship Tier |
|------|------|----------------------------|
| <a href="https://app.nstproxy.com/register?i=ecOqW9" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.github.com/aravindkarnam/62f82bd4818d3079d9dd3c31df432cf8/raw/nst-light.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://www.nstproxy.com/logo.svg"><img alt="nstproxy" src="ttps://www.nstproxy.com/logo.svg"></picture></a> | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |
| <a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=crawl4ai" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.githubusercontent.com/aravindkarnam/0d275b942705604263e5c32d2db27bc1/raw/Scrapeless-light-logo.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"><img alt="Scrapeless" src="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"></picture></a> | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
| <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="DataSync" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
| <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 518, offering both online and on-campus education. | 🥇 Gold |
| <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based Aleph Null is Asias leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
### 🧑‍🤝 Individual Sponsors
A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving!
<p align="left">
<a href="https://github.com/hafezparast"><img src="https://avatars.githubusercontent.com/u/14273305?s=60&v=4" style="border-radius:50%;" width="64px;"/></a>
<a href="https://github.com/ntohidi"><img src="https://avatars.githubusercontent.com/u/17140097?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Sjoeborg"><img src="https://avatars.githubusercontent.com/u/17451310?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/romek-rozen"><img src="https://avatars.githubusercontent.com/u/30595969?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Kourosh-Kiyani"><img src="https://avatars.githubusercontent.com/u/34105600?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Etherdrake"><img src="https://avatars.githubusercontent.com/u/67021215?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/shaman247"><img src="https://avatars.githubusercontent.com/u/211010067?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/work-flow-manager"><img src="https://avatars.githubusercontent.com/u/217665461?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
</p>
> Want to join them? [Sponsor Crawl4AI →](https://github.com/sponsors/unclecode)
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

122
SECURITY.md Normal file
View File

@@ -0,0 +1,122 @@
# Security Policy
## Supported Versions
| Version | Supported |
| ------- | ------------------ |
| 0.8.x | :white_check_mark: |
| 0.7.x | :x: (upgrade recommended) |
| < 0.7 | :x: |
## Reporting a Vulnerability
We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.
### How to Report
**DO NOT** open a public GitHub issue for security vulnerabilities.
Instead, please report via one of these methods:
1. **GitHub Security Advisories (Preferred)**
- Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
- Click "New draft security advisory"
- Fill in the details
2. **Email**
- Send details to: security@crawl4ai.com
- Use subject: `[SECURITY] Brief description`
- Include:
- Description of the vulnerability
- Steps to reproduce
- Potential impact
- Any suggested fixes
### What to Expect
- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 7 days
- **Resolution Timeline**: Depends on severity
- Critical: 24-72 hours
- High: 7 days
- Medium: 30 days
- Low: 90 days
### Disclosure Policy
- We follow responsible disclosure practices
- We will coordinate with you on disclosure timing
- Credit will be given to reporters (unless anonymity is requested)
- We may request CVE assignment for significant vulnerabilities
## Security Best Practices for Users
### Docker API Deployment
If you're running the Crawl4AI Docker API in production:
1. **Enable Authentication**
```yaml
# config.yml
security:
enabled: true
jwt_enabled: true
```
```bash
# Set a strong secret key
export SECRET_KEY="your-secure-random-key-here"
```
2. **Hooks are Disabled by Default** (v0.8.0+)
- Only enable if you trust all API users
- Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary
3. **Network Security**
- Run behind a reverse proxy (nginx, traefik)
- Use HTTPS in production
- Restrict access to trusted IPs if possible
4. **Container Security**
- Run as non-root user (default in our container)
- Use read-only filesystem where possible
- Limit container resources
### Library Usage
When using Crawl4AI as a Python library:
1. **Validate URLs** before crawling untrusted input
2. **Sanitize extracted content** before using in other systems
3. **Be cautious with hooks** - they execute arbitrary code
## Known Security Issues
### Fixed in v0.8.0
| ID | Severity | Description | Fix |
|----|----------|-------------|-----|
| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |
See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.
## Security Features
### v0.8.0+
- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on API
- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
- **JWT Authentication**: Optional but recommended for production
- **Rate Limiting**: Configurable request limits
- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled
## Acknowledgments
We thank the following security researchers for responsibly disclosing vulnerabilities:
- **[Neo by ProjectDiscovery](https://projectdiscovery.io/blog/introducing-neo)** - RCE and LFI vulnerabilities (December 2025)
---
*Last updated: January 2026*

View File

@@ -25,8 +25,7 @@ from .extraction_strategy import (
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy,
JsonLxmlExtractionStrategy,
RegexExtractionStrategy,
NoExtractionStrategy, # NEW: Import NoExtractionStrategy
RegexExtractionStrategy
)
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
@@ -73,6 +72,8 @@ from .deep_crawling import (
BestFirstCrawlingStrategy,
DFSDeepCrawlStrategy,
DeepCrawlDecorator,
ContentRelevanceFilter,
ContentTypeScorer,
)
# NEW: Import AsyncUrlSeeder
from .async_url_seeder import AsyncUrlSeeder
@@ -104,7 +105,8 @@ from .browser_adapter import (
from .utils import (
start_colab_display_server,
setup_colab_environment
setup_colab_environment,
hooks_to_string
)
__all__ = [
@@ -114,7 +116,6 @@ __all__ = [
"BrowserProfiler",
"LLMConfig",
"GeolocationConfig",
"NoExtractionStrategy",
# NEW: Add SeedingConfig and VirtualScrollConfig
"SeedingConfig",
"VirtualScrollConfig",
@@ -185,6 +186,7 @@ __all__ = [
"ProxyConfig",
"start_colab_display_server",
"setup_colab_environment",
"hooks_to_string",
# C4A Script additions
"c4a_compile",
"c4a_validate",

View File

@@ -1,7 +1,7 @@
# crawl4ai/__version__.py
# This is the version that will be used for stable releases
__version__ = "0.7.4"
__version__ = "0.8.0"
# For nightly builds, this gets set during build process
__nightly_version__ = None

View File

@@ -728,18 +728,18 @@ class EmbeddingStrategy(CrawlStrategy):
provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None
# response = perform_completion_with_backoff(
# provider=provider,
# prompt_with_variables=prompt,
# api_token=api_token,
# json_response=True
# )
response = perform_completion_with_backoff(
provider=provider,
prompt_with_variables=prompt,
api_token=api_token,
json_response=True
)
# variations = json.loads(response.choices[0].message.content)
variations = json.loads(response.choices[0].message.content)
# # Mock data with more variations for split
variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations = {'queries': [

View File

@@ -1,6 +1,7 @@
import importlib
import os
from typing import Union
import warnings
import requests
from .config import (
DEFAULT_PROVIDER,
DEFAULT_PROVIDER_API_KEY,
@@ -26,14 +27,14 @@ from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
from .cache_context import CacheMode
from .proxy_strategy import ProxyRotationStrategy
from typing import Union, List, Callable
import inspect
from typing import Any, Dict, Optional
from typing import Any, Callable, Dict, List, Optional, Union
from enum import Enum
# Type alias for URL matching
UrlMatcher = Union[str, Callable[[str], bool], List[Union[str, Callable[[str], bool]]]]
class MatchMode(Enum):
OR = "or"
AND = "and"
@@ -41,8 +42,7 @@ class MatchMode(Enum):
# from .proxy_strategy import ProxyConfig
def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
def to_serializable_dict(obj: Any, ignore_default_value : bool = False):
"""
Recursively convert an object to a serializable dictionary using {type, params} structure
for complex objects.
@@ -109,8 +109,6 @@ def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
# if value is not None:
# current_values[attr_name] = to_serializable_dict(value)
return {
"type": obj.__class__.__name__,
"params": current_values
@@ -136,12 +134,20 @@ def from_serializable_dict(data: Any) -> Any:
if data["type"] == "dict" and "value" in data:
return {k: from_serializable_dict(v) for k, v in data["value"].items()}
# Import from crawl4ai for class instances
import crawl4ai
if hasattr(crawl4ai, data["type"]):
cls = getattr(crawl4ai, data["type"])
cls = None
# If you are receiving an error while trying to convert a dict to an object:
# Either add a module to `modules_paths` list, or add the `data["type"]` to the crawl4ai __init__.py file
module_paths = ["crawl4ai"]
for module_path in module_paths:
try:
mod = importlib.import_module(module_path)
if hasattr(mod, data["type"]):
cls = getattr(mod, data["type"])
break
except (ImportError, AttributeError):
continue
if cls is not None:
# Handle Enum
if issubclass(cls, Enum):
return cls(data["params"])
@@ -367,6 +373,20 @@ class BrowserConfig:
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
browser_context_id (str or None): Pre-existing CDP browser context ID to use. When provided along with
cdp_url, the crawler will reuse this context instead of creating a new one.
Useful for cloud browser services that pre-create isolated contexts.
Default: None.
target_id (str or None): Pre-existing CDP target ID (page) to use. When provided along with
browser_context_id, the crawler will reuse this target instead of creating
a new page. Default: None.
cdp_cleanup_on_close (bool): When True and using cdp_url, the close() method will still clean up
the local Playwright client resources. Useful for cloud/server scenarios
where you don't own the remote browser but need to prevent memory leaks
from accumulated Playwright instances. Default: False.
create_isolated_context (bool): When True and using cdp_url, forces creation of a new browser context
instead of reusing the default context. Essential for concurrent crawls
on the same browser to prevent navigation conflicts. Default: False.
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
@@ -421,6 +441,10 @@ class BrowserConfig:
browser_mode: str = "dedicated",
use_managed_browser: bool = False,
cdp_url: str = None,
browser_context_id: str = None,
target_id: str = None,
cdp_cleanup_on_close: bool = False,
create_isolated_context: bool = False,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chromium",
@@ -453,13 +477,18 @@ class BrowserConfig:
debugging_port: int = 9222,
host: str = "localhost",
enable_stealth: bool = False,
init_scripts: List[str] = None,
):
self.browser_type = browser_type
self.headless = headless
self.headless = headless
self.browser_mode = browser_mode
self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
self.browser_context_id = browser_context_id
self.target_id = target_id
self.cdp_cleanup_on_close = cdp_cleanup_on_close
self.create_isolated_context = create_isolated_context
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium"
@@ -508,6 +537,7 @@ class BrowserConfig:
self.debugging_port = debugging_port
self.host = host
self.enable_stealth = enable_stealth
self.init_scripts = init_scripts if init_scripts is not None else []
fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode == "random":
@@ -555,6 +585,10 @@ class BrowserConfig:
browser_mode=kwargs.get("browser_mode", "dedicated"),
use_managed_browser=kwargs.get("use_managed_browser", False),
cdp_url=kwargs.get("cdp_url"),
browser_context_id=kwargs.get("browser_context_id"),
target_id=kwargs.get("target_id"),
cdp_cleanup_on_close=kwargs.get("cdp_cleanup_on_close", False),
create_isolated_context=kwargs.get("create_isolated_context", False),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"),
@@ -583,6 +617,7 @@ class BrowserConfig:
debugging_port=kwargs.get("debugging_port", 9222),
host=kwargs.get("host", "localhost"),
enable_stealth=kwargs.get("enable_stealth", False),
init_scripts=kwargs.get("init_scripts", []),
)
def to_dict(self):
@@ -592,12 +627,16 @@ class BrowserConfig:
"browser_mode": self.browser_mode,
"use_managed_browser": self.use_managed_browser,
"cdp_url": self.cdp_url,
"browser_context_id": self.browser_context_id,
"target_id": self.target_id,
"cdp_cleanup_on_close": self.cdp_cleanup_on_close,
"create_isolated_context": self.create_isolated_context,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"proxy_config": self.proxy_config.to_dict() if self.proxy_config else None,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
@@ -618,9 +657,10 @@ class BrowserConfig:
"debugging_port": self.debugging_port,
"host": self.host,
"enable_stealth": self.enable_stealth,
"init_scripts": self.init_scripts,
}
return result
def clone(self, **kwargs):
@@ -649,6 +689,85 @@ class BrowserConfig:
return config
return BrowserConfig.from_kwargs(config)
def set_nstproxy(
self,
token: str,
channel_id: str,
country: str = "ANY",
state: str = "",
city: str = "",
protocol: str = "http",
session_duration: int = 10,
):
"""
Fetch a proxy from NSTProxy API and automatically assign it to proxy_config.
Get your NSTProxy token from: https://app.nstproxy.com/profile
Args:
token (str): NSTProxy API token.
channel_id (str): NSTProxy channel ID.
country (str, optional): Country code (default: "ANY").
state (str, optional): State code (default: "").
city (str, optional): City name (default: "").
protocol (str, optional): Proxy protocol ("http" or "socks5"). Defaults to "http".
session_duration (int, optional): Session duration in minutes (0 = rotate each request). Defaults to 10.
Raises:
ValueError: If the API response format is invalid.
PermissionError: If the API returns an error message.
"""
# --- Validate input early ---
if not token or not channel_id:
raise ValueError("[NSTProxy] token and channel_id are required")
if protocol not in ("http", "socks5"):
raise ValueError(f"[NSTProxy] Invalid protocol: {protocol}")
# --- Build NSTProxy API URL ---
params = {
"fType": 2,
"count": 1,
"channelId": channel_id,
"country": country,
"protocol": protocol,
"sessionDuration": session_duration,
"token": token,
}
if state:
params["state"] = state
if city:
params["city"] = city
url = "https://api.nstproxy.com/api/v1/generate/apiproxies"
try:
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
# --- Handle API error response ---
if isinstance(data, dict) and data.get("err"):
raise PermissionError(f"[NSTProxy] API Error: {data.get('msg', 'Unknown error')}")
if not isinstance(data, list) or not data:
raise ValueError("[NSTProxy] Invalid API response — expected a non-empty list")
proxy_info = data[0]
# --- Apply proxy config ---
self.proxy_config = ProxyConfig(
server=f"{protocol}://{proxy_info['ip']}:{proxy_info['port']}",
username=proxy_info["username"],
password=proxy_info["password"],
)
except Exception as e:
print(f"[NSTProxy] ❌ Failed to set proxy: {e}")
raise
class VirtualScrollConfig:
"""Configuration for virtual scroll handling.
@@ -914,6 +1033,18 @@ class CrawlerRunConfig():
proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
# Sticky Proxy Session Parameters
proxy_session_id (str or None): When set, maintains the same proxy for all requests sharing this session ID.
The proxy is acquired on first request and reused for subsequent requests.
Session expires when explicitly released or crawler context is closed.
Default: None.
proxy_session_ttl (int or None): Time-to-live for sticky session in seconds.
After TTL expires, a new proxy is acquired on next request.
Default: None (session lasts until explicitly released or crawler closes).
proxy_session_auto_release (bool): If True, automatically release the proxy session after a batch operation.
Useful for arun_many() to clean up sessions automatically.
Default: False.
# Browser Location and Identity Parameters
locale (str or None): Locale to use for the browser context (e.g., "en-US").
Default: None.
@@ -942,6 +1073,15 @@ class CrawlerRunConfig():
shared_data (dict or None): Shared data to be passed between hooks.
Default: None.
# Cache Validation Parameters (Smart Cache)
check_cache_freshness (bool): If True, validates cached content freshness using HTTP
conditional requests (ETag/Last-Modified) and head fingerprinting
before returning cached results. Avoids full browser crawls when
content hasn't changed. Only applies when cache_mode allows reads.
Default: False.
cache_validation_timeout (float): Timeout in seconds for cache validation HTTP requests.
Default: 10.0.
# Page Navigation and Timing Parameters
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
Default: "domcontentloaded".
@@ -1048,6 +1188,12 @@ class CrawlerRunConfig():
# Connection Parameters
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
Default: False.
process_in_browser (bool): If True, forces raw:/file:// URLs to be processed through the browser
pipeline (enabling js_code, wait_for, scrolling, etc.). When False (default),
raw:/file:// URLs use a fast path that returns HTML directly without browser
interaction. This is automatically enabled when browser-requiring parameters
are detected (js_code, wait_for, screenshot, pdf, etc.).
Default: False.
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
Default: False.
@@ -1093,6 +1239,10 @@ class CrawlerRunConfig():
scraping_strategy: ContentScrapingStrategy = None,
proxy_config: Union[ProxyConfig, dict, None] = None,
proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
# Sticky Proxy Session Parameters
proxy_session_id: Optional[str] = None,
proxy_session_ttl: Optional[int] = None,
proxy_session_auto_release: bool = False,
# Browser Location and Identity Parameters
locale: Optional[str] = None,
timezone_id: Optional[str] = None,
@@ -1107,6 +1257,9 @@ class CrawlerRunConfig():
no_cache_read: bool = False,
no_cache_write: bool = False,
shared_data: dict = None,
# Cache Validation Parameters (Smart Cache)
check_cache_freshness: bool = False,
cache_validation_timeout: float = 10.0,
# Page Navigation and Timing Parameters
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
@@ -1160,7 +1313,10 @@ class CrawlerRunConfig():
# Connection Parameters
method: str = "GET",
stream: bool = False,
prefetch: bool = False, # When True, return only HTML + links (skip heavy processing)
process_in_browser: bool = False, # Force browser processing for raw:/file:// URLs
url: str = None,
base_url: str = None, # Base URL for markdown link resolution (used with raw: HTML)
check_robots_txt: bool = False,
user_agent: str = None,
user_agent_mode: str = None,
@@ -1179,6 +1335,7 @@ class CrawlerRunConfig():
):
# TODO: Planning to set properties dynamically based on the __init__ signature
self.url = url
self.base_url = base_url # Base URL for markdown link resolution
# Content Processing Parameters
self.word_count_threshold = word_count_threshold
@@ -1203,7 +1360,12 @@ class CrawlerRunConfig():
self.proxy_config = ProxyConfig.from_string(proxy_config)
self.proxy_rotation_strategy = proxy_rotation_strategy
# Sticky Proxy Session Parameters
self.proxy_session_id = proxy_session_id
self.proxy_session_ttl = proxy_session_ttl
self.proxy_session_auto_release = proxy_session_auto_release
# Browser Location and Identity Parameters
self.locale = locale
self.timezone_id = timezone_id
@@ -1220,6 +1382,9 @@ class CrawlerRunConfig():
self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write
self.shared_data = shared_data
# Cache Validation (Smart Cache)
self.check_cache_freshness = check_cache_freshness
self.cache_validation_timeout = cache_validation_timeout
# Page Navigation and Timing Parameters
self.wait_until = wait_until
@@ -1286,6 +1451,8 @@ class CrawlerRunConfig():
# Connection Parameters
self.stream = stream
self.prefetch = prefetch # Prefetch mode: return only HTML + links
self.process_in_browser = process_in_browser # Force browser processing for raw:/file:// URLs
self.method = method
# Robots.txt Handling Parameters
@@ -1483,6 +1650,10 @@ class CrawlerRunConfig():
scraping_strategy=kwargs.get("scraping_strategy"),
proxy_config=kwargs.get("proxy_config"),
proxy_rotation_strategy=kwargs.get("proxy_rotation_strategy"),
# Sticky Proxy Session Parameters
proxy_session_id=kwargs.get("proxy_session_id"),
proxy_session_ttl=kwargs.get("proxy_session_ttl"),
proxy_session_auto_release=kwargs.get("proxy_session_auto_release", False),
# Browser Location and Identity Parameters
locale=kwargs.get("locale", None),
timezone_id=kwargs.get("timezone_id", None),
@@ -1558,6 +1729,8 @@ class CrawlerRunConfig():
# Connection Parameters
method=kwargs.get("method", "GET"),
stream=kwargs.get("stream", False),
prefetch=kwargs.get("prefetch", False),
process_in_browser=kwargs.get("process_in_browser", False),
check_robots_txt=kwargs.get("check_robots_txt", False),
user_agent=kwargs.get("user_agent"),
user_agent_mode=kwargs.get("user_agent_mode"),
@@ -1567,6 +1740,7 @@ class CrawlerRunConfig():
# Link Extraction Parameters
link_preview_config=kwargs.get("link_preview_config"),
url=kwargs.get("url"),
base_url=kwargs.get("base_url"),
# URL Matching Parameters
url_matcher=kwargs.get("url_matcher"),
match_mode=kwargs.get("match_mode", MatchMode.OR),
@@ -1606,6 +1780,9 @@ class CrawlerRunConfig():
"scraping_strategy": self.scraping_strategy,
"proxy_config": self.proxy_config,
"proxy_rotation_strategy": self.proxy_rotation_strategy,
"proxy_session_id": self.proxy_session_id,
"proxy_session_ttl": self.proxy_session_ttl,
"proxy_session_auto_release": self.proxy_session_auto_release,
"locale": self.locale,
"timezone_id": self.timezone_id,
"geolocation": self.geolocation,
@@ -1662,6 +1839,8 @@ class CrawlerRunConfig():
"capture_console_messages": self.capture_console_messages,
"method": self.method,
"stream": self.stream,
"prefetch": self.prefetch,
"process_in_browser": self.process_in_browser,
"check_robots_txt": self.check_robots_txt,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
@@ -1712,7 +1891,10 @@ class LLMConfig:
frequency_penalty: Optional[float] = None,
presence_penalty: Optional[float] = None,
stop: Optional[List[str]] = None,
n: Optional[int] = None,
n: Optional[int] = None,
backoff_base_delay: Optional[int] = None,
backoff_max_attempts: Optional[int] = None,
backoff_exponential_factor: Optional[int] = None,
):
"""Configuaration class for LLM provider and API token."""
self.provider = provider
@@ -1741,6 +1923,9 @@ class LLMConfig:
self.presence_penalty = presence_penalty
self.stop = stop
self.n = n
self.backoff_base_delay = backoff_base_delay if backoff_base_delay is not None else 2
self.backoff_max_attempts = backoff_max_attempts if backoff_max_attempts is not None else 3
self.backoff_exponential_factor = backoff_exponential_factor if backoff_exponential_factor is not None else 2
@staticmethod
def from_kwargs(kwargs: dict) -> "LLMConfig":
@@ -1754,7 +1939,10 @@ class LLMConfig:
frequency_penalty=kwargs.get("frequency_penalty"),
presence_penalty=kwargs.get("presence_penalty"),
stop=kwargs.get("stop"),
n=kwargs.get("n")
n=kwargs.get("n"),
backoff_base_delay=kwargs.get("backoff_base_delay"),
backoff_max_attempts=kwargs.get("backoff_max_attempts"),
backoff_exponential_factor=kwargs.get("backoff_exponential_factor")
)
def to_dict(self):
@@ -1768,7 +1956,10 @@ class LLMConfig:
"frequency_penalty": self.frequency_penalty,
"presence_penalty": self.presence_penalty,
"stop": self.stop,
"n": self.n
"n": self.n,
"backoff_base_delay": self.backoff_base_delay,
"backoff_max_attempts": self.backoff_max_attempts,
"backoff_exponential_factor": self.backoff_exponential_factor
}
def clone(self, **kwargs):
@@ -1805,6 +1996,8 @@ class SeedingConfig:
score_threshold: Optional[float] = None,
scoring_method: str = "bm25",
filter_nonsense_urls: bool = True,
cache_ttl_hours: int = 24,
validate_sitemap_lastmod: bool = True,
):
"""
Initialize URL seeding configuration.
@@ -1836,10 +2029,14 @@ class SeedingConfig:
Requires extract_head=True. Default: None
score_threshold: Minimum relevance score (0.0-1.0) to include URL.
Only applies when query is provided. Default: None
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
Future: "semantic". Default: "bm25"
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
ads.txt, favicon.ico, etc. Default: True
cache_ttl_hours: Hours before sitemap cache expires. Set to 0 to disable TTL
(only lastmod validation). Default: 24
validate_sitemap_lastmod: If True, compares sitemap's <lastmod> with cache
timestamp and refetches if sitemap is newer. Default: True
"""
self.source = source
self.pattern = pattern
@@ -1856,6 +2053,8 @@ class SeedingConfig:
self.score_threshold = score_threshold
self.scoring_method = scoring_method
self.filter_nonsense_urls = filter_nonsense_urls
self.cache_ttl_hours = cache_ttl_hours
self.validate_sitemap_lastmod = validate_sitemap_lastmod
# Add to_dict, from_kwargs, and clone methods for consistency
def to_dict(self) -> Dict[str, Any]:

View File

@@ -452,48 +452,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if url.startswith(("http://", "https://", "view-source:")):
return await self._crawl_web(url, config)
elif url.startswith("file://"):
# initialize empty lists for console messages
captured_console = []
# Process local file
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html = f.read()
if config.screenshot:
screenshot_data = await self._generate_screenshot_from_html(html)
if config.capture_console_messages:
page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
captured_console = await self._capture_console_messages(page, url)
return AsyncCrawlResponse(
html=html,
response_headers=response_headers,
status_code=status_code,
screenshot=screenshot_data,
get_delayed_content=None,
console_messages=captured_console,
elif url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:"):
# Check if browser processing is required for file:// or raw: URLs
needs_browser = (
config.process_in_browser or
config.screenshot or
config.pdf or
config.capture_mhtml or
config.js_code or
config.wait_for or
config.scan_full_page or
config.remove_overlay_elements or
config.simulate_user or
config.magic or
config.process_iframes or
config.capture_console_messages or
config.capture_network_requests
)
#####
# Since both "raw:" and "raw://" start with "raw:", the first condition is always true for both, so "raw://" will be sliced as "//...", which is incorrect.
# Fix: Check for "raw://" first, then "raw:"
# Also, the prefix "raw://" is actually 6 characters long, not 7, so it should be sliced accordingly: url[6:]
#####
elif url.startswith("raw://") or url.startswith("raw:"):
# Process raw HTML content
# raw_html = url[4:] if url[:4] == "raw:" else url[7:]
raw_html = url[6:] if url.startswith("raw://") else url[4:]
html = raw_html
if config.screenshot:
screenshot_data = await self._generate_screenshot_from_html(html)
if needs_browser:
# Route through _crawl_web() for full browser pipeline
# _crawl_web() will detect file:// and raw: URLs and use set_content()
return await self._crawl_web(url, config)
# Fast path: return HTML directly without browser interaction
if url.startswith("file://"):
# Process local file
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html = f.read()
else:
# Process raw HTML content (raw:// or raw:)
html = url[6:] if url.startswith("raw://") else url[4:]
return AsyncCrawlResponse(
html=html,
response_headers=response_headers,
status_code=status_code,
screenshot=screenshot_data,
screenshot=None,
pdf_data=None,
mhtml_data=None,
get_delayed_content=None,
)
else:
@@ -666,67 +666,83 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if not config.js_only:
await self.execute_hook("before_goto", page, context=context, url=url, config=config)
try:
# Generate a unique nonce for this request
if config.experimental.get("use_csp_nonce", False):
nonce = hashlib.sha256(os.urandom(32)).hexdigest()
# Check if this is a file:// or raw: URL that needs set_content() instead of goto()
is_local_content = url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:")
# Add CSP headers to the request
await page.set_extra_http_headers(
{
"Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
}
)
response = await page.goto(
url, wait_until=config.wait_until, timeout=config.page_timeout
)
redirected_url = page.url
except Error as e:
# Allow navigation to be aborted when downloading files
# This is expected behavior for downloads in some browser engines
if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
self.logger.info(
message=f"Navigation aborted, likely due to file download: {url}",
tag="GOTO",
params={"url": url},
)
response = None
if is_local_content:
# Load local content using set_content() instead of network navigation
if url.startswith("file://"):
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html_content = f.read()
else:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
# raw:// or raw:
html_content = url[6:] if url.startswith("raw://") else url[4:]
await page.set_content(html_content, wait_until=config.wait_until)
response = None
redirected_url = config.base_url or url
status_code = 200
response_headers = {}
else:
# Standard web navigation with goto()
try:
# Generate a unique nonce for this request
if config.experimental.get("use_csp_nonce", False):
nonce = hashlib.sha256(os.urandom(32)).hexdigest()
# Add CSP headers to the request
await page.set_extra_http_headers(
{
"Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
}
)
response = await page.goto(
url, wait_until=config.wait_until, timeout=config.page_timeout
)
redirected_url = page.url
except Error as e:
# Allow navigation to be aborted when downloading files
# This is expected behavior for downloads in some browser engines
if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
self.logger.info(
message=f"Navigation aborted, likely due to file download: {url}",
tag="GOTO",
params={"url": url},
)
response = None
else:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
# ──────────────────────────────────────────────────────────────
# Walk the redirect chain. Playwright returns only the last
# hop, so we trace the `request.redirected_from` links until the
# first response that differs from the final one and surface its
# status-code.
# ──────────────────────────────────────────────────────────────
if response is None:
status_code = 200
response_headers = {}
else:
first_resp = response
req = response.request
while req and req.redirected_from:
prev_req = req.redirected_from
prev_resp = await prev_req.response()
if prev_resp: # keep earliest
first_resp = prev_resp
req = prev_req
status_code = first_resp.status
response_headers = first_resp.headers
await self.execute_hook(
"after_goto", page, context=context, url=url, response=response, config=config
)
# ──────────────────────────────────────────────────────────────
# Walk the redirect chain. Playwright returns only the last
# hop, so we trace the `request.redirected_from` links until the
# first response that differs from the final one and surface its
# status-code.
# ──────────────────────────────────────────────────────────────
if response is None:
status_code = 200
response_headers = {}
else:
first_resp = response
req = response.request
while req and req.redirected_from:
prev_req = req.redirected_from
prev_resp = await prev_req.response()
if prev_resp: # keep earliest
first_resp = prev_resp
req = prev_req
status_code = first_resp.status
response_headers = first_resp.headers
# if response is None:
# status_code = 200
# response_headers = {}
# else:
# status_code = response.status
# response_headers = response.headers
else:
status_code = 200
response_headers = {}
@@ -1023,6 +1039,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
final_messages = await self.adapter.retrieve_console_messages(page)
captured_console.extend(final_messages)
###
# This ensures we capture the current page URL at the time we return the response,
# which correctly reflects any JavaScript navigation that occurred.
###
redirected_url = page.url # Use current page URL to capture JS redirects
# Return complete response
return AsyncCrawlResponse(
html=html,
@@ -1383,9 +1405,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
try:
await self.adapter.evaluate(page,
f"""
(() => {{
(async () => {{
try {{
{remove_overlays_js}
const removeOverlays = {remove_overlays_js};
await removeOverlays();
return {{ success: true }};
}} catch (error) {{
return {{
@@ -1517,7 +1540,78 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await page.goto(file_path)
return captured_console
async def _generate_media_from_html(
self, html: str, config: CrawlerRunConfig = None
) -> tuple:
"""
Generate media (screenshot, PDF, MHTML) from raw HTML content.
This method is used for raw: and file:// URLs where we have HTML content
but need to render it in a browser to generate media outputs.
Args:
html (str): The raw HTML content to render
config (CrawlerRunConfig, optional): Configuration for media options
Returns:
tuple: (screenshot_data, pdf_data, mhtml_data) - any can be None
"""
page = None
screenshot_data = None
pdf_data = None
mhtml_data = None
try:
# Get a browser page
config = config or CrawlerRunConfig()
page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
# Load the HTML content into the page
await page.set_content(html, wait_until="domcontentloaded")
# Generate requested media
if config.pdf:
pdf_data = await self.export_pdf(page)
if config.capture_mhtml:
mhtml_data = await self.capture_mhtml(page)
if config.screenshot:
if config.screenshot_wait_for:
await asyncio.sleep(config.screenshot_wait_for)
screenshot_height_threshold = getattr(config, 'screenshot_height_threshold', None)
screenshot_data = await self.take_screenshot(
page, screenshot_height_threshold=screenshot_height_threshold
)
return screenshot_data, pdf_data, mhtml_data
except Exception as e:
error_message = f"Failed to generate media from HTML: {str(e)}"
self.logger.error(
message="HTML media generation failed: {error}",
tag="ERROR",
params={"error": error_message},
)
# Return error image for screenshot if it was requested
if config and config.screenshot:
img = Image.new("RGB", (800, 600), color="black")
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
buffered = BytesIO()
img.save(buffered, format="JPEG")
screenshot_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
return screenshot_data, pdf_data, mhtml_data
finally:
# Clean up the page
if page:
try:
await page.close()
except Exception:
pass
async def take_screenshot(self, page, **kwargs) -> str:
"""
Take a screenshot of the current page.
@@ -2286,9 +2380,28 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
)
def _format_proxy_url(self, proxy_config) -> str:
"""Format ProxyConfig into aiohttp-compatible proxy URL."""
if not proxy_config:
return None
server = proxy_config.server
username = getattr(proxy_config, 'username', None)
password = getattr(proxy_config, 'password', None)
if username and password:
# Insert credentials into URL: http://user:pass@host:port
if '://' in server:
protocol, rest = server.split('://', 1)
return f"{protocol}://{username}:{password}@{rest}"
else:
return f"http://{username}:{password}@{server}"
return server
async def _handle_http(
self,
url: str,
self,
url: str,
config: CrawlerRunConfig
) -> AsyncCrawlResponse:
async with self._session_context() as session:
@@ -2297,7 +2410,7 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
connect=10,
sock_read=30
)
headers = dict(self._BASE_HEADERS)
if self.browser_config.headers:
headers.update(self.browser_config.headers)
@@ -2309,6 +2422,12 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
'headers': headers
}
# Add proxy support - use config.proxy_config (set by arun() from rotation strategy or direct config)
proxy_url = None
if config.proxy_config:
proxy_url = self._format_proxy_url(config.proxy_config)
request_kwargs['proxy'] = proxy_url
if self.browser_config.method == "POST":
if self.browser_config.data:
request_kwargs['data'] = self.browser_config.data
@@ -2379,7 +2498,10 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
if scheme == 'file':
return await self._handle_file(parsed.path)
elif scheme == 'raw':
return await self._handle_raw(parsed.path)
# Don't use parsed.path - urlparse truncates at '#' which is common in CSS
# Strip prefix directly: "raw://" (6 chars) or "raw:" (4 chars)
raw_content = url[6:] if url.startswith("raw://") else url[4:]
return await self._handle_raw(raw_content)
else: # http or https
return await self._handle_http(url, config)

View File

@@ -1,10 +1,11 @@
import os
import time
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Dict
from contextlib import asynccontextmanager
import json
import json
from .models import CrawlResult, MarkdownGenerationResult, StringCompatibleMarkdown
import aiofiles
from .async_logger import AsyncLogger
@@ -262,6 +263,11 @@ class AsyncDatabaseManager:
"screenshot",
"response_headers",
"downloaded_files",
# Smart cache validation columns (added in 0.8.x)
"etag",
"last_modified",
"head_fingerprint",
"cached_at",
]
for column in new_columns:
@@ -275,6 +281,11 @@ class AsyncDatabaseManager:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
)
elif new_column == "cached_at":
# Timestamp column for cache validation
await db.execute(
f"ALTER TABLE crawled_data ADD COLUMN {new_column} REAL DEFAULT 0"
)
else:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
@@ -378,6 +389,92 @@ class AsyncDatabaseManager:
)
return None
async def aget_cache_metadata(self, url: str) -> Optional[Dict]:
"""
Retrieve only cache validation metadata for a URL (lightweight query).
Returns dict with: url, etag, last_modified, head_fingerprint, cached_at, response_headers
This is used for cache validation without loading full content.
"""
async def _get_metadata(db):
async with db.execute(
"""SELECT url, etag, last_modified, head_fingerprint, cached_at, response_headers
FROM crawled_data WHERE url = ?""",
(url,)
) as cursor:
row = await cursor.fetchone()
if not row:
return None
columns = [description[0] for description in cursor.description]
row_dict = dict(zip(columns, row))
# Parse response_headers JSON
try:
row_dict["response_headers"] = (
json.loads(row_dict["response_headers"])
if row_dict["response_headers"] else {}
)
except json.JSONDecodeError:
row_dict["response_headers"] = {}
return row_dict
try:
return await self.execute_with_retry(_get_metadata)
except Exception as e:
self.logger.error(
message="Error retrieving cache metadata: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
return None
async def aupdate_cache_metadata(
self,
url: str,
etag: Optional[str] = None,
last_modified: Optional[str] = None,
head_fingerprint: Optional[str] = None,
):
"""
Update only the cache validation metadata for a URL.
Used to update etag/last_modified after a successful validation.
"""
async def _update(db):
updates = []
values = []
if etag is not None:
updates.append("etag = ?")
values.append(etag)
if last_modified is not None:
updates.append("last_modified = ?")
values.append(last_modified)
if head_fingerprint is not None:
updates.append("head_fingerprint = ?")
values.append(head_fingerprint)
if not updates:
return
values.append(url)
await db.execute(
f"UPDATE crawled_data SET {', '.join(updates)} WHERE url = ?",
tuple(values)
)
try:
await self.execute_with_retry(_update)
except Exception as e:
self.logger.error(
message="Error updating cache metadata: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
async def acache_url(self, result: CrawlResult):
"""Cache CrawlResult data"""
# Store content files and get hashes
@@ -425,15 +522,24 @@ class AsyncDatabaseManager:
for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type)
# Extract cache validation headers from response
response_headers = result.response_headers or {}
etag = response_headers.get("etag") or response_headers.get("ETag") or ""
last_modified = response_headers.get("last-modified") or response_headers.get("Last-Modified") or ""
# head_fingerprint is set by caller via result attribute (if available)
head_fingerprint = getattr(result, "head_fingerprint", None) or ""
cached_at = time.time()
async def _cache(db):
await db.execute(
"""
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
screenshot, response_headers, downloaded_files
screenshot, response_headers, downloaded_files,
etag, last_modified, head_fingerprint, cached_at
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
@@ -445,7 +551,11 @@ class AsyncDatabaseManager:
metadata = excluded.metadata,
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
downloaded_files = excluded.downloaded_files,
etag = excluded.etag,
last_modified = excluded.last_modified,
head_fingerprint = excluded.head_fingerprint,
cached_at = excluded.cached_at
""",
(
result.url,
@@ -460,6 +570,10 @@ class AsyncDatabaseManager:
content_hashes["screenshot"],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or []),
etag,
last_modified,
head_fingerprint,
cached_at,
),
)

View File

@@ -455,8 +455,6 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
# Update priorities for waiting tasks if needed
await self._update_queue_priorities()
return results
except Exception as e:
if self.monitor:
@@ -467,6 +465,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
memory_monitor.cancel()
if self.monitor:
self.monitor.stop()
return results
async def _update_queue_priorities(self):
"""Periodically update priorities of items in the queue to prevent starvation"""

View File

@@ -24,7 +24,7 @@ import os
import pathlib
import re
import time
from datetime import timedelta
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Union
from urllib.parse import quote, urljoin
@@ -78,6 +78,103 @@ _link_rx = re.compile(
# ────────────────────────────────────────────────────────────────────────── helpers
def _parse_sitemap_lastmod(xml_content: bytes) -> Optional[str]:
"""Extract the most recent lastmod from sitemap XML."""
try:
if LXML:
root = etree.fromstring(xml_content)
# Get all lastmod elements (namespace-agnostic)
lastmods = root.xpath("//*[local-name()='lastmod']/text()")
if lastmods:
# Return the most recent one
return max(lastmods)
except Exception:
pass
return None
def _is_cache_valid(
cache_path: pathlib.Path,
ttl_hours: int,
validate_lastmod: bool,
current_lastmod: Optional[str] = None
) -> bool:
"""
Check if sitemap cache is still valid.
Returns False (invalid) if:
- File doesn't exist
- File is corrupted/unreadable
- TTL expired (if ttl_hours > 0)
- Sitemap lastmod is newer than cache (if validate_lastmod=True)
"""
if not cache_path.exists():
return False
try:
with open(cache_path, "r") as f:
data = json.load(f)
# Check version
if data.get("version") != 1:
return False
# Check TTL
if ttl_hours > 0:
created_at = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
age_hours = (datetime.now(timezone.utc) - created_at).total_seconds() / 3600
if age_hours > ttl_hours:
return False
# Check lastmod
if validate_lastmod and current_lastmod:
cached_lastmod = data.get("sitemap_lastmod")
if cached_lastmod and current_lastmod > cached_lastmod:
return False
# Check URL count (sanity check - if 0, likely corrupted)
if data.get("url_count", 0) == 0:
return False
return True
except (json.JSONDecodeError, KeyError, ValueError, IOError):
# Corrupted cache - return False to trigger refetch
return False
def _read_cache(cache_path: pathlib.Path) -> List[str]:
"""Read URLs from cache file. Returns empty list on error."""
try:
with open(cache_path, "r") as f:
data = json.load(f)
return data.get("urls", [])
except Exception:
return []
def _write_cache(
cache_path: pathlib.Path,
urls: List[str],
sitemap_url: str,
sitemap_lastmod: Optional[str]
) -> None:
"""Write URLs to cache with metadata."""
data = {
"version": 1,
"created_at": datetime.now(timezone.utc).isoformat(),
"sitemap_lastmod": sitemap_lastmod,
"sitemap_url": sitemap_url,
"url_count": len(urls),
"urls": urls
}
try:
with open(cache_path, "w") as f:
json.dump(data, f)
except Exception:
pass # Fail silently - cache is optional
def _match(url: str, pattern: str) -> bool:
if fnmatch.fnmatch(url, pattern):
return True
@@ -295,6 +392,10 @@ class AsyncUrlSeeder:
score_threshold = config.score_threshold
scoring_method = config.scoring_method
# Store cache config for use in _from_sitemaps
self._cache_ttl_hours = getattr(config, 'cache_ttl_hours', 24)
self._validate_sitemap_lastmod = getattr(config, 'validate_sitemap_lastmod', True)
# Ensure seeder's logger verbose matches the config's verbose if it's set
if self.logger and hasattr(self.logger, 'verbose') and config.verbose is not None:
self.logger.verbose = config.verbose
@@ -764,68 +865,222 @@ class AsyncUrlSeeder:
# ─────────────────────────────── Sitemaps
async def _from_sitemaps(self, domain: str, pattern: str, force: bool = False):
"""
1. Probe default sitemap locations.
2. If none exist, parse robots.txt for alternative sitemap URLs.
3. Yield only URLs that match `pattern`.
Discover URLs from sitemaps with smart TTL-based caching.
1. Check cache validity (TTL + lastmod)
2. If valid, yield from cache
3. If invalid or force=True, fetch fresh and update cache
4. FALLBACK: If anything fails, bypass cache and fetch directly
"""
# Get config values (passed via self during urls() call)
cache_ttl_hours = getattr(self, '_cache_ttl_hours', 24)
validate_lastmod = getattr(self, '_validate_sitemap_lastmod', True)
# ── cache file (same logic as _from_cc)
# Cache file path (new format: .json instead of .jsonl)
host = re.sub(r'^https?://', '', domain).rstrip('/')
host = re.sub('[/?#]+', '_', domain)
host_safe = re.sub('[/?#]+', '_', host)
digest = hashlib.md5(pattern.encode()).hexdigest()[:8]
path = self.cache_dir / f"sitemap_{host}_{digest}.jsonl"
cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.json"
if path.exists() and not force:
self._log("info", "Loading sitemap URLs for {d} from cache: {p}",
params={"d": host, "p": str(path)}, tag="URL_SEED")
async with aiofiles.open(path, "r") as fp:
async for line in fp:
url = line.strip()
if _match(url, pattern):
yield url
return
# Check for old .jsonl format and delete it
old_cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.jsonl"
if old_cache_path.exists():
try:
old_cache_path.unlink()
self._log("info", "Deleted old cache format: {p}",
params={"p": str(old_cache_path)}, tag="URL_SEED")
except Exception:
pass
# 1⃣ direct sitemap probe
# strip any scheme so we can handle https → http fallback
host = re.sub(r'^https?://', '', domain).rstrip('/')
# Step 1: Find sitemap URL and get lastmod (needed for validation)
sitemap_url = None
sitemap_lastmod = None
sitemap_content = None
schemes = ('https', 'http') # prefer TLS, downgrade if needed
schemes = ('https', 'http')
for scheme in schemes:
for suffix in ("/sitemap.xml", "/sitemap_index.xml"):
sm = f"{scheme}://{host}{suffix}"
sm = await self._resolve_head(sm)
if sm:
self._log("info", "Found sitemap at {url}", params={
"url": sm}, tag="URL_SEED")
async with aiofiles.open(path, "w") as fp:
resolved = await self._resolve_head(sm)
if resolved:
sitemap_url = resolved
# Fetch sitemap content to get lastmod
try:
r = await self.client.get(sitemap_url, timeout=15, follow_redirects=True)
if 200 <= r.status_code < 300:
sitemap_content = r.content
sitemap_lastmod = _parse_sitemap_lastmod(sitemap_content)
except Exception:
pass
break
if sitemap_url:
break
# Step 2: Check cache validity (skip if force=True)
if not force and cache_path.exists():
if _is_cache_valid(cache_path, cache_ttl_hours, validate_lastmod, sitemap_lastmod):
self._log("info", "Loading sitemap URLs from valid cache: {p}",
params={"p": str(cache_path)}, tag="URL_SEED")
cached_urls = _read_cache(cache_path)
for url in cached_urls:
if _match(url, pattern):
yield url
return
else:
self._log("info", "Cache invalid/expired, refetching sitemap for {d}",
params={"d": domain}, tag="URL_SEED")
# Step 3: Fetch fresh URLs
discovered_urls = []
if sitemap_url and sitemap_content:
self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
# Parse sitemap (reuse content we already fetched)
async for u in self._iter_sitemap_content(sitemap_url, sitemap_content):
discovered_urls.append(u)
if _match(u, pattern):
yield u
elif sitemap_url:
# We have a sitemap URL but no content (fetch failed earlier), try again
self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
async for u in self._iter_sitemap(sitemap_url):
discovered_urls.append(u)
if _match(u, pattern):
yield u
else:
# Fallback: robots.txt
robots = f"https://{host}/robots.txt"
try:
r = await self.client.get(robots, timeout=10, follow_redirects=True)
if 200 <= r.status_code < 300:
sitemap_lines = [l.split(":", 1)[1].strip()
for l in r.text.splitlines()
if l.lower().startswith("sitemap:")]
for sm in sitemap_lines:
async for u in self._iter_sitemap(sm):
await fp.write(u + "\n")
discovered_urls.append(u)
if _match(u, pattern):
yield u
else:
self._log("warning", "robots.txt unavailable for {d} HTTP{c}",
params={"d": domain, "c": r.status_code}, tag="URL_SEED")
return
# 2⃣ robots.txt fallback
robots = f"https://{domain.rstrip('/')}/robots.txt"
try:
r = await self.client.get(robots, timeout=10, follow_redirects=True)
if not 200 <= r.status_code < 300:
self._log("warning", "robots.txt unavailable for {d} HTTP{c}", params={
"d": domain, "c": r.status_code}, tag="URL_SEED")
except Exception as e:
self._log("warning", "Failed to fetch robots.txt for {d}: {e}",
params={"d": domain, "e": str(e)}, tag="URL_SEED")
return
sitemap_lines = [l.split(":", 1)[1].strip(
) for l in r.text.splitlines() if l.lower().startswith("sitemap:")]
except Exception as e:
self._log("warning", "Failed to fetch robots.txt for {d}: {e}", params={
"d": domain, "e": str(e)}, tag="URL_SEED")
return
if sitemap_lines:
async with aiofiles.open(path, "w") as fp:
for sm in sitemap_lines:
async for u in self._iter_sitemap(sm):
await fp.write(u + "\n")
if _match(u, pattern):
yield u
# Step 4: Write to cache (FALLBACK: if write fails, URLs still yielded above)
if discovered_urls:
_write_cache(cache_path, discovered_urls, sitemap_url or "", sitemap_lastmod)
self._log("info", "Cached {count} URLs for {d}",
params={"count": len(discovered_urls), "d": domain}, tag="URL_SEED")
async def _iter_sitemap_content(self, url: str, content: bytes):
"""Parse sitemap from already-fetched content."""
data = gzip.decompress(content) if url.endswith(".gz") else content
base_url = url
def _normalize_loc(raw: Optional[str]) -> Optional[str]:
if not raw:
return None
normalized = urljoin(base_url, raw.strip())
if not normalized:
return None
return normalized
# Detect if this is a sitemap index
is_sitemap_index = False
sub_sitemaps = []
regular_urls = []
if LXML:
try:
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)
sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
if sitemap_loc_nodes:
is_sitemap_index = True
for sitemap_elem in sitemap_loc_nodes:
loc = _normalize_loc(sitemap_elem.text)
if loc:
sub_sitemaps.append(loc)
if not is_sitemap_index:
for loc_elem in url_loc_nodes:
loc = _normalize_loc(loc_elem.text)
if loc:
regular_urls.append(loc)
except Exception as e:
self._log("error", "LXML parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
return
else:
import xml.etree.ElementTree as ET
try:
root = ET.fromstring(data)
for elem in root.iter():
if '}' in elem.tag:
elem.tag = elem.tag.split('}')[1]
sitemaps = root.findall('.//sitemap')
url_entries = root.findall('.//url')
if sitemaps:
is_sitemap_index = True
for sitemap in sitemaps:
loc_elem = sitemap.find('loc')
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
sub_sitemaps.append(loc)
if not is_sitemap_index:
for url_elem in url_entries:
loc_elem = url_elem.find('loc')
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
regular_urls.append(loc)
except Exception as e:
self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
return
# Process based on type
if is_sitemap_index and sub_sitemaps:
self._log("info", "Processing sitemap index with {count} sub-sitemaps",
params={"count": len(sub_sitemaps)}, tag="URL_SEED")
queue_size = min(50000, len(sub_sitemaps) * 1000)
result_queue = asyncio.Queue(maxsize=queue_size)
completed_count = 0
total_sitemaps = len(sub_sitemaps)
async def process_subsitemap(sitemap_url: str):
try:
async for u in self._iter_sitemap(sitemap_url):
await result_queue.put(u)
except Exception as e:
self._log("error", "Error processing sub-sitemap {url}: {error}",
params={"url": sitemap_url, "error": str(e)}, tag="URL_SEED")
finally:
await result_queue.put(None)
tasks = [asyncio.create_task(process_subsitemap(sm)) for sm in sub_sitemaps]
while completed_count < total_sitemaps:
item = await result_queue.get()
if item is None:
completed_count += 1
else:
yield item
await asyncio.gather(*tasks, return_exceptions=True)
else:
for u in regular_urls:
yield u
async def _iter_sitemap(self, url: str):
try:
@@ -845,6 +1100,15 @@ class AsyncUrlSeeder:
return
data = gzip.decompress(r.content) if url.endswith(".gz") else r.content
base_url = str(r.url)
def _normalize_loc(raw: Optional[str]) -> Optional[str]:
if not raw:
return None
normalized = urljoin(base_url, raw.strip())
if not normalized:
return None
return normalized
# Detect if this is a sitemap index by checking for <sitemapindex> or presence of <sitemap> elements
is_sitemap_index = False
@@ -857,25 +1121,42 @@ class AsyncUrlSeeder:
# Use XML parser for sitemaps, not HTML parser
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)
# Namespace-agnostic lookups using local-name() so we honor custom or missing namespaces
sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
# Define namespace for sitemap
ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
self._log(
"debug",
"Parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemap_loc_nodes),
"url_count": len(url_loc_nodes),
},
tag="URL_SEED",
)
# Check for sitemap index entries
sitemap_locs = root.xpath('//s:sitemap/s:loc', namespaces=ns)
if sitemap_locs:
if sitemap_loc_nodes:
is_sitemap_index = True
for sitemap_elem in sitemap_locs:
loc = sitemap_elem.text.strip() if sitemap_elem.text else ""
for sitemap_elem in sitemap_loc_nodes:
loc = _normalize_loc(sitemap_elem.text)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for loc_elem in root.xpath('//s:url/s:loc', namespaces=ns):
loc = loc_elem.text.strip() if loc_elem.text else ""
for loc_elem in url_loc_nodes:
loc = _normalize_loc(loc_elem.text)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "LXML parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
@@ -892,19 +1173,39 @@ class AsyncUrlSeeder:
# Check for sitemap index entries
sitemaps = root.findall('.//sitemap')
url_entries = root.findall('.//url')
self._log(
"debug",
"ElementTree parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemaps),
"url_count": len(url_entries),
},
tag="URL_SEED",
)
if sitemaps:
is_sitemap_index = True
for sitemap in sitemaps:
loc_elem = sitemap.find('loc')
if loc_elem is not None and loc_elem.text:
sub_sitemaps.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for url_elem in root.findall('.//url'):
for url_elem in url_entries:
loc_elem = url_elem.find('loc')
if loc_elem is not None and loc_elem.text:
regular_urls.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")

View File

@@ -47,7 +47,9 @@ from .utils import (
get_error_context,
RobotsParser,
preprocess_html_for_schema,
compute_head_fingerprint,
)
from .cache_validator import CacheValidator, CacheValidationResult
class AsyncWebCrawler:
@@ -267,6 +269,51 @@ class AsyncWebCrawler:
if cache_context.should_read():
cached_result = await async_db_manager.aget_cached_url(url)
# Smart Cache: Validate cache freshness if enabled
if cached_result and config.check_cache_freshness:
cache_metadata = await async_db_manager.aget_cache_metadata(url)
if cache_metadata:
async with CacheValidator(timeout=config.cache_validation_timeout) as validator:
validation = await validator.validate(
url=url,
stored_etag=cache_metadata.get("etag"),
stored_last_modified=cache_metadata.get("last_modified"),
stored_head_fingerprint=cache_metadata.get("head_fingerprint"),
)
if validation.status == CacheValidationResult.FRESH:
cached_result.cache_status = "hit_validated"
self.logger.info(
message="Cache validated: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
# Update metadata if we got new values
if validation.new_etag or validation.new_last_modified:
await async_db_manager.aupdate_cache_metadata(
url=url,
etag=validation.new_etag,
last_modified=validation.new_last_modified,
head_fingerprint=validation.new_head_fingerprint,
)
elif validation.status == CacheValidationResult.ERROR:
cached_result.cache_status = "hit_fallback"
self.logger.warning(
message="Cache validation failed, using cached: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
else:
# STALE or UNKNOWN - force recrawl
self.logger.info(
message="Cache stale: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
cached_result = None
elif cached_result:
cached_result.cache_status = "hit"
if cached_result:
html = sanitize_input_encode(cached_result.html)
extracted_content = sanitize_input_encode(
@@ -296,15 +343,32 @@ class AsyncWebCrawler:
# Update proxy configuration from rotation strategy if available
if config and config.proxy_rotation_strategy:
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
if next_proxy:
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server}
# Handle sticky sessions - use same proxy for all requests with same session_id
if config.proxy_session_id:
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_proxy_for_session(
config.proxy_session_id,
ttl=config.proxy_session_ttl
)
config.proxy_config = next_proxy
# config = config.clone(proxy_config=next_proxy)
if next_proxy:
self.logger.info(
message="Using sticky proxy session: {session_id} -> {proxy}",
tag="PROXY",
params={
"session_id": config.proxy_session_id,
"proxy": next_proxy.server
}
)
config.proxy_config = next_proxy
else:
# Existing behavior: rotate on each request
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
if next_proxy:
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server}
)
config.proxy_config = next_proxy
# Fetch fresh content if needed
if not cached_result or not html:
@@ -383,6 +447,14 @@ class AsyncWebCrawler:
crawl_result.success = bool(html)
crawl_result.session_id = getattr(
config, "session_id", None)
crawl_result.cache_status = "miss"
# Compute head fingerprint for cache validation
if html:
head_end = html.lower().find('</head>')
if head_end != -1:
head_html = html[:head_end + 7]
crawl_result.head_fingerprint = compute_head_fingerprint(head_html)
self.logger.url_status(
url=cache_context.display_url,
@@ -459,6 +531,27 @@ class AsyncWebCrawler:
Returns:
CrawlResult: Processed result containing extracted and formatted content
"""
# === PREFETCH MODE SHORT-CIRCUIT ===
if getattr(config, 'prefetch', False):
from .utils import quick_extract_links
# Use base_url from config (for raw: URLs), redirected_url, or original url
effective_url = getattr(config, 'base_url', None) or kwargs.get('redirected_url') or url
links = quick_extract_links(html, effective_url)
return CrawlResult(
url=url,
html=html,
success=True,
links=links,
status_code=kwargs.get('status_code'),
response_headers=kwargs.get('response_headers'),
redirected_url=kwargs.get('redirected_url'),
ssl_certificate=kwargs.get('ssl_certificate'),
# All other fields default to None
)
# === END PREFETCH SHORT-CIRCUIT ===
cleaned_html = ""
try:
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
@@ -563,7 +656,8 @@ class AsyncWebCrawler:
markdown_result: MarkdownGenerationResult = (
markdown_generator.generate_markdown(
input_html=markdown_input_html,
base_url=params.get("redirected_url", url)
# Use explicit base_url if provided (for raw: HTML), otherwise redirected_url, then url
base_url=params.get("base_url") or params.get("redirected_url") or url
# html2text_options=kwargs.get('html2text', {})
)
)
@@ -617,7 +711,17 @@ class AsyncWebCrawler:
else config.chunking_strategy
)
sections = chunking.chunk(content)
extracted_content = config.extraction_strategy.run(url, sections)
# extracted_content = config.extraction_strategy.run(_url, sections)
# Use async version if available for better parallelism
if hasattr(config.extraction_strategy, 'arun'):
extracted_content = await config.extraction_strategy.arun(_url, sections)
else:
# Fallback to sync version run in thread pool to avoid blocking
extracted_content = await asyncio.to_thread(
config.extraction_strategy.run, url, sections
)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)
@@ -746,21 +850,45 @@ class AsyncWebCrawler:
# Handle stream setting - use first config's stream setting if config is a list
if isinstance(config, list):
stream = config[0].stream if config else False
primary_config = config[0] if config else None
else:
stream = config.stream
primary_config = config
# Helper to release sticky session if auto_release is enabled
async def maybe_release_session():
if (primary_config and
primary_config.proxy_session_id and
primary_config.proxy_session_auto_release and
primary_config.proxy_rotation_strategy):
await primary_config.proxy_rotation_strategy.release_session(
primary_config.proxy_session_id
)
self.logger.info(
message="Auto-released proxy session: {session_id}",
tag="PROXY",
params={"session_id": primary_config.proxy_session_id}
)
if stream:
async def result_transformer():
async for task_result in dispatcher.run_urls_stream(
crawler=self, urls=urls, config=config
):
yield transform_result(task_result)
try:
async for task_result in dispatcher.run_urls_stream(
crawler=self, urls=urls, config=config
):
yield transform_result(task_result)
finally:
# Auto-release session after streaming completes
await maybe_release_session()
return result_transformer()
else:
_results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
return [transform_result(res) for res in _results]
try:
_results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
return [transform_result(res) for res in _results]
finally:
# Auto-release session after batch completes
await maybe_release_session()
async def aseed_urls(
self,

View File

@@ -369,6 +369,9 @@ class ManagedBrowser:
]
if self.headless:
flags.append("--headless=new")
# Add viewport flag if specified in config
if self.browser_config.viewport_height and self.browser_config.viewport_width:
flags.append(f"--window-size={self.browser_config.viewport_width},{self.browser_config.viewport_height}")
# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
elif self.browser_type == "firefox":
@@ -658,9 +661,44 @@ class BrowserManager:
if self.config.cdp_url or self.config.use_managed_browser:
self.config.use_managed_browser = True
cdp_url = await self.managed_browser.start() if not self.config.cdp_url else self.config.cdp_url
# Add CDP endpoint verification before connecting
if not await self._verify_cdp_ready(cdp_url):
raise Exception(f"CDP endpoint at {cdp_url} is not ready after startup")
self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
contexts = self.browser.contexts
if contexts:
# If browser_context_id is provided, we're using a pre-created context
if self.config.browser_context_id:
if self.logger:
self.logger.debug(
f"Using pre-existing browser context: {self.config.browser_context_id}",
tag="BROWSER"
)
# When connecting to a pre-created context, it should be in contexts
if contexts:
self.default_context = contexts[0]
if self.logger:
self.logger.debug(
f"Found {len(contexts)} existing context(s), using first one",
tag="BROWSER"
)
else:
# Context was created but not yet visible - wait a bit
await asyncio.sleep(0.2)
contexts = self.browser.contexts
if contexts:
self.default_context = contexts[0]
else:
# Still no contexts - this shouldn't happen with pre-created context
if self.logger:
self.logger.warning(
"Pre-created context not found, creating new one",
tag="BROWSER"
)
self.default_context = await self.create_browser_context()
elif contexts:
self.default_context = contexts[0]
else:
self.default_context = await self.create_browser_context()
@@ -678,6 +716,49 @@ class BrowserManager:
self.default_context = self.browser
async def _verify_cdp_ready(self, cdp_url: str) -> bool:
"""Verify CDP endpoint is ready with exponential backoff.
Supports multiple URL formats:
- HTTP URLs: http://localhost:9222
- HTTP URLs with query params: http://localhost:9222?browser_id=XXX
- WebSocket URLs: ws://localhost:9222/devtools/browser/XXX
"""
import aiohttp
from urllib.parse import urlparse, urlunparse
# If WebSocket URL, Playwright handles connection directly - skip HTTP verification
if cdp_url.startswith(('ws://', 'wss://')):
self.logger.debug(f"WebSocket CDP URL provided, skipping HTTP verification", tag="BROWSER")
return True
# Parse HTTP URL and properly construct /json/version endpoint
parsed = urlparse(cdp_url)
# Build URL with /json/version path, preserving query params
verify_url = urlunparse((
parsed.scheme,
parsed.netloc,
'/json/version', # Always use this path for verification
'', # params
parsed.query, # preserve query string
'' # fragment
))
self.logger.debug(f"Starting CDP verification for {verify_url}", tag="BROWSER")
for attempt in range(5):
try:
async with aiohttp.ClientSession() as session:
async with session.get(verify_url, timeout=aiohttp.ClientTimeout(total=2)) as response:
if response.status == 200:
self.logger.debug(f"CDP endpoint ready after {attempt + 1} attempts", tag="BROWSER")
return True
except Exception as e:
self.logger.debug(f"CDP check attempt {attempt + 1} failed: {e}", tag="BROWSER")
delay = 0.5 * (1.4 ** attempt)
self.logger.debug(f"Waiting {delay:.2f}s before next CDP check...", tag="BROWSER")
await asyncio.sleep(delay)
self.logger.debug(f"CDP verification failed after 5 attempts", tag="BROWSER")
return False
def _build_browser_args(self) -> dict:
"""Build browser launch arguments from config."""
@@ -814,18 +895,27 @@ class BrowserManager:
combined_headers.update(self.config.headers)
await context.set_extra_http_headers(combined_headers)
# Add default cookie
await context.add_cookies(
[
{
"name": "cookiesEnabled",
"value": "true",
"url": crawlerRunConfig.url
if crawlerRunConfig and crawlerRunConfig.url
else "https://crawl4ai.com/",
}
]
)
# Add default cookie (skip for raw:/file:// URLs which are not valid cookie URLs)
cookie_url = None
if crawlerRunConfig and crawlerRunConfig.url:
url = crawlerRunConfig.url
# Only set cookie for http/https URLs
if url.startswith(("http://", "https://")):
cookie_url = url
elif crawlerRunConfig.base_url and crawlerRunConfig.base_url.startswith(("http://", "https://")):
# Use base_url as fallback for raw:/file:// URLs
cookie_url = crawlerRunConfig.base_url
if cookie_url:
await context.add_cookies(
[
{
"name": "cookiesEnabled",
"value": "true",
"url": cookie_url,
}
]
)
# Handle navigator overrides
if crawlerRunConfig:
@@ -834,7 +924,12 @@ class BrowserManager:
or crawlerRunConfig.simulate_user
or crawlerRunConfig.magic
):
await context.add_init_script(load_js_script("navigator_overrider"))
await context.add_init_script(load_js_script("navigator_overrider"))
# Apply custom init_scripts from BrowserConfig (for stealth evasions, etc.)
if self.config.init_scripts:
for script in self.config.init_scripts:
await context.add_init_script(script)
async def create_browser_context(self, crawlerRunConfig: CrawlerRunConfig = None):
"""
@@ -1016,6 +1111,62 @@ class BrowserManager:
params={"error": str(e)}
)
async def _get_page_by_target_id(self, context: BrowserContext, target_id: str):
"""
Get an existing page by its CDP target ID.
This is used when connecting to a pre-created browser context with an existing page.
Playwright may not immediately see targets created via raw CDP commands, so we
use CDP to get all targets and find the matching one.
Args:
context: The browser context to search in
target_id: The CDP target ID to find
Returns:
Page object if found, None otherwise
"""
try:
# First check if Playwright already sees the page
for page in context.pages:
# Playwright's internal target ID might match
if hasattr(page, '_impl_obj') and hasattr(page._impl_obj, '_target_id'):
if page._impl_obj._target_id == target_id:
return page
# If not found, try using CDP to get targets
if hasattr(self.browser, '_impl_obj') and hasattr(self.browser._impl_obj, '_connection'):
cdp_session = await context.new_cdp_session(context.pages[0] if context.pages else None)
if cdp_session:
try:
result = await cdp_session.send("Target.getTargets")
targets = result.get("targetInfos", [])
for target in targets:
if target.get("targetId") == target_id:
# Found the target - if it's a page type, we can use it
if target.get("type") == "page":
# The page exists, let Playwright discover it
await asyncio.sleep(0.1)
# Refresh pages list
if context.pages:
return context.pages[0]
finally:
await cdp_session.detach()
# Fallback: if there are any pages now, return the first one
if context.pages:
return context.pages[0]
return None
except Exception as e:
if self.logger:
self.logger.warning(
message="Failed to get page by target ID: {error}",
tag="BROWSER",
params={"error": str(e)}
)
return None
async def get_page(self, crawlerRunConfig: CrawlerRunConfig):
"""
Get a page for the given session ID, creating a new one if needed.
@@ -1037,7 +1188,25 @@ class BrowserManager:
# If using a managed browser, just grab the shared default_context
if self.config.use_managed_browser:
if self.config.storage_state:
# If create_isolated_context is True, create isolated contexts for concurrent crawls
# Uses the same caching mechanism as non-CDP mode: cache context by config signature,
# but always create a new page. This prevents navigation conflicts while allowing
# context reuse for multiple URLs with the same config (e.g., batch/deep crawls).
if self.config.create_isolated_context:
config_signature = self._make_config_signature(crawlerRunConfig)
async with self._contexts_lock:
if config_signature in self.contexts_by_config:
context = self.contexts_by_config[config_signature]
else:
context = await self.create_browser_context(crawlerRunConfig)
await self.setup_context(context, crawlerRunConfig)
self.contexts_by_config[config_signature] = context
# Always create a new page for each crawl (isolation for navigation)
page = await context.new_page()
await self._apply_stealth_to_page(page)
elif self.config.storage_state:
context = await self.create_browser_context(crawlerRunConfig)
ctx = self.default_context # default context, one window only
ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
@@ -1060,6 +1229,14 @@ class BrowserManager:
pages = context.pages
if pages:
page = pages[0]
elif self.config.browser_context_id and self.config.target_id:
# Pre-existing context/target provided - use CDP to get the page
# This handles the case where Playwright doesn't see the target yet
page = await self._get_page_by_target_id(context, self.config.target_id)
if not page:
# Fallback: create new page in existing context
page = await context.new_page()
await self._apply_stealth_to_page(page)
else:
page = await context.new_page()
await self._apply_stealth_to_page(page)
@@ -1114,8 +1291,44 @@ class BrowserManager:
async def close(self):
"""Close all browser resources and clean up."""
if self.config.cdp_url:
# When using external CDP, we don't own the browser process.
# If cdp_cleanup_on_close is True, properly disconnect from the browser
# and clean up Playwright resources. This frees the browser for other clients.
if self.config.cdp_cleanup_on_close:
# First close all sessions (pages)
session_ids = list(self.sessions.keys())
for session_id in session_ids:
await self.kill_session(session_id)
# Close all contexts we created
for ctx in self.contexts_by_config.values():
try:
await ctx.close()
except Exception:
pass
self.contexts_by_config.clear()
# Disconnect from browser (doesn't terminate it, just releases connection)
if self.browser:
try:
await self.browser.close()
except Exception as e:
if self.logger:
self.logger.debug(
message="Error disconnecting from CDP browser: {error}",
tag="BROWSER",
params={"error": str(e)}
)
self.browser = None
# Allow time for CDP connection to fully release before another client connects
await asyncio.sleep(1.0)
# Stop Playwright instance to prevent memory leaks
if self.playwright:
await self.playwright.stop()
self.playwright = None
return
if self.config.sleep_on_close:
await asyncio.sleep(0.5)

270
crawl4ai/cache_validator.py Normal file
View File

@@ -0,0 +1,270 @@
"""
Cache validation using HTTP conditional requests and head fingerprinting.
Uses httpx for fast, lightweight HTTP requests (no browser needed).
This module enables smart cache validation to avoid unnecessary full browser crawls
when content hasn't changed.
Validation Strategy:
1. Send HEAD request with If-None-Match / If-Modified-Since headers
2. If server returns 304 Not Modified → cache is FRESH
3. If server returns 200 → fetch <head> and compare fingerprint
4. If fingerprint matches → cache is FRESH (minor changes only)
5. Otherwise → cache is STALE, need full recrawl
"""
import httpx
from dataclasses import dataclass
from typing import Optional, Tuple
from enum import Enum
from .utils import compute_head_fingerprint
class CacheValidationResult(Enum):
"""Result of cache validation check."""
FRESH = "fresh" # Content unchanged, use cache
STALE = "stale" # Content changed, need recrawl
UNKNOWN = "unknown" # Couldn't determine, need recrawl
ERROR = "error" # Request failed, use cache as fallback
@dataclass
class ValidationResult:
"""Detailed result of a cache validation attempt."""
status: CacheValidationResult
new_etag: Optional[str] = None
new_last_modified: Optional[str] = None
new_head_fingerprint: Optional[str] = None
reason: str = ""
class CacheValidator:
"""
Validates cache freshness using lightweight HTTP requests.
This validator uses httpx to make fast HTTP requests without needing
a full browser. It supports two validation methods:
1. HTTP Conditional Requests (Layer 3):
- Uses If-None-Match with stored ETag
- Uses If-Modified-Since with stored Last-Modified
- Server returns 304 if content unchanged
2. Head Fingerprinting (Layer 4):
- Fetches only the <head> section (~5KB)
- Compares fingerprint of key meta tags
- Catches changes even without server support for conditional requests
"""
def __init__(self, timeout: float = 10.0, user_agent: Optional[str] = None):
"""
Initialize the cache validator.
Args:
timeout: Request timeout in seconds
user_agent: Custom User-Agent string (optional)
"""
self.timeout = timeout
self.user_agent = user_agent or "Mozilla/5.0 (compatible; Crawl4AI/1.0)"
self._client: Optional[httpx.AsyncClient] = None
async def _get_client(self) -> httpx.AsyncClient:
"""Get or create the httpx client."""
if self._client is None:
self._client = httpx.AsyncClient(
http2=True,
timeout=self.timeout,
follow_redirects=True,
headers={"User-Agent": self.user_agent}
)
return self._client
async def validate(
self,
url: str,
stored_etag: Optional[str] = None,
stored_last_modified: Optional[str] = None,
stored_head_fingerprint: Optional[str] = None,
) -> ValidationResult:
"""
Validate if cached content is still fresh.
Args:
url: The URL to validate
stored_etag: Previously stored ETag header value
stored_last_modified: Previously stored Last-Modified header value
stored_head_fingerprint: Previously computed head fingerprint
Returns:
ValidationResult with status and any updated metadata
"""
client = await self._get_client()
# Build conditional request headers
headers = {}
if stored_etag:
headers["If-None-Match"] = stored_etag
if stored_last_modified:
headers["If-Modified-Since"] = stored_last_modified
try:
# Step 1: Try HEAD request with conditional headers
if headers:
response = await client.head(url, headers=headers)
if response.status_code == 304:
return ValidationResult(
status=CacheValidationResult.FRESH,
reason="Server returned 304 Not Modified"
)
# Got 200, extract new headers for potential update
new_etag = response.headers.get("etag")
new_last_modified = response.headers.get("last-modified")
# If we have fingerprint, compare it
if stored_head_fingerprint:
head_html, _, _ = await self._fetch_head(url)
if head_html:
new_fingerprint = compute_head_fingerprint(head_html)
if new_fingerprint and new_fingerprint == stored_head_fingerprint:
return ValidationResult(
status=CacheValidationResult.FRESH,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint matches"
)
elif new_fingerprint:
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint changed"
)
# Headers changed and no fingerprint match
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
reason="Server returned 200, content may have changed"
)
# Step 2: No conditional headers available, try fingerprint only
if stored_head_fingerprint:
head_html, new_etag, new_last_modified = await self._fetch_head(url)
if head_html:
new_fingerprint = compute_head_fingerprint(head_html)
if new_fingerprint and new_fingerprint == stored_head_fingerprint:
return ValidationResult(
status=CacheValidationResult.FRESH,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint matches"
)
elif new_fingerprint:
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint changed"
)
# Step 3: No validation data available
return ValidationResult(
status=CacheValidationResult.UNKNOWN,
reason="No validation data available (no etag, last-modified, or fingerprint)"
)
except httpx.TimeoutException:
return ValidationResult(
status=CacheValidationResult.ERROR,
reason="Validation request timed out"
)
except httpx.RequestError as e:
return ValidationResult(
status=CacheValidationResult.ERROR,
reason=f"Validation request failed: {type(e).__name__}"
)
except Exception as e:
# On unexpected error, prefer using cache over failing
return ValidationResult(
status=CacheValidationResult.ERROR,
reason=f"Validation error: {str(e)}"
)
async def _fetch_head(self, url: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
"""
Fetch only the <head> section of a page.
Uses streaming to stop reading after </head> is found,
minimizing bandwidth usage.
Args:
url: The URL to fetch
Returns:
Tuple of (head_html, etag, last_modified)
"""
client = await self._get_client()
try:
async with client.stream(
"GET",
url,
headers={"Accept-Encoding": "identity"} # Disable compression for easier parsing
) as response:
etag = response.headers.get("etag")
last_modified = response.headers.get("last-modified")
if response.status_code != 200:
return None, etag, last_modified
# Read until </head> or max 64KB
chunks = []
total_bytes = 0
max_bytes = 65536
async for chunk in response.aiter_bytes(4096):
chunks.append(chunk)
total_bytes += len(chunk)
content = b''.join(chunks)
# Check for </head> (case insensitive)
if b'</head>' in content.lower() or b'</HEAD>' in content:
break
if total_bytes >= max_bytes:
break
html = content.decode('utf-8', errors='replace')
# Extract just the head section
head_end = html.lower().find('</head>')
if head_end != -1:
html = html[:head_end + 7]
return html, etag, last_modified
except Exception:
return None, None, None
async def close(self):
"""Close the HTTP client and release resources."""
if self._client:
await self._client.aclose()
self._client = None
async def __aenter__(self):
"""Async context manager entry."""
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
await self.close()

View File

@@ -980,6 +980,9 @@ class LLMContentFilter(RelevantContentFilter):
prompt,
api_token,
base_url=base_url,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=extra_args,
)

View File

@@ -542,6 +542,19 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
if el.tag in bypass_tags:
continue
# Skip elements inside <pre> or <code> tags where whitespace is significant
# This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
is_in_code_block = False
ancestor = el.getparent()
while ancestor is not None:
if ancestor.tag in ("pre", "code"):
is_in_code_block = True
break
ancestor = ancestor.getparent()
if is_in_code_block:
continue
text_content = (el.text_content() or "").strip()
if (
len(text_content.split()) < word_count_threshold

View File

@@ -2,7 +2,7 @@
import asyncio
import logging
from datetime import datetime
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
from urllib.parse import urlparse
from ..models import TraversalStats
@@ -41,6 +41,9 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
include_external: bool = False,
max_pages: int = infinity,
logger: Optional[logging.Logger] = None,
# Optional resume/callback parameters for crash recovery
resume_state: Optional[Dict[str, Any]] = None,
on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
@@ -57,6 +60,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
# Store for use in arun methods
self._resume_state = resume_state
self._on_state_change = on_state_change
self._last_state: Optional[Dict[str, Any]] = None
# Shadow list for queue items (only used when on_state_change is set)
self._queue_shadow: Optional[List[Tuple[float, int, str, Optional[str]]]] = None
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -135,16 +144,36 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
) -> AsyncGenerator[CrawlResult, None]:
"""
Core best-first crawl method using a priority queue.
The queue items are tuples of (score, depth, url, parent_url). Lower scores
are treated as higher priority. URLs are processed in batches for efficiency.
"""
queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
# Push the initial URL with score 0 and depth 0.
initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
await queue.put((-initial_score, 0, start_url, None))
visited: Set[str] = set()
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
# Restore queue from saved items
queue_items = self._resume_state.get("queue_items", [])
for item in queue_items:
await queue.put((item["score"], item["depth"], item["url"], item["parent_url"]))
# Initialize shadow list if callback is set
if self._on_state_change:
self._queue_shadow = [
(item["score"], item["depth"], item["url"], item["parent_url"])
for item in queue_items
]
else:
# Original initialization
initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
await queue.put((-initial_score, 0, start_url, None))
visited: Set[str] = set()
depths: Dict[str, int] = {start_url: 0}
# Initialize shadow list if callback is set
if self._on_state_change:
self._queue_shadow = [(-initial_score, 0, start_url, None)]
while not queue.empty() and not self._cancel_event.is_set():
# Stop if we've reached the max pages limit
@@ -166,6 +195,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
if queue.empty():
break
item = await queue.get()
# Remove from shadow list if tracking
if self._on_state_change and self._queue_shadow is not None:
try:
self._queue_shadow.remove(item)
except ValueError:
pass # Item may have been removed already
score, depth, url, parent_url = item
if url in visited:
continue
@@ -210,7 +245,26 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
for new_url, new_parent in new_links:
new_depth = depths.get(new_url, depth + 1)
new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
await queue.put((-new_score, new_depth, new_url, new_parent))
queue_item = (-new_score, new_depth, new_url, new_parent)
await queue.put(queue_item)
# Add to shadow list if tracking
if self._on_state_change and self._queue_shadow is not None:
self._queue_shadow.append(queue_item)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change and self._queue_shadow is not None:
state = {
"strategy_type": "best_first",
"visited": list(visited),
"queue_items": [
{"score": s, "depth": d, "url": u, "parent_url": p}
for s, d, u, p in self._queue_shadow
],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
# End of crawl.
@@ -269,3 +323,15 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
"""
self._cancel_event.set()
self.stats.end_time = datetime.now()
def export_state(self) -> Optional[Dict[str, Any]]:
"""
Export current crawl state for external persistence.
Note: This returns the last captured state. For real-time state,
use the on_state_change callback.
Returns:
Dict with strategy state, or None if no state captured yet.
"""
return self._last_state

View File

@@ -2,7 +2,7 @@
import asyncio
import logging
from datetime import datetime
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
from urllib.parse import urlparse
from ..models import TraversalStats
@@ -26,11 +26,14 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self,
max_depth: int,
filter_chain: FilterChain = FilterChain(),
url_scorer: Optional[URLScorer] = None,
url_scorer: Optional[URLScorer] = None,
include_external: bool = False,
score_threshold: float = -infinity,
max_pages: int = infinity,
logger: Optional[logging.Logger] = None,
# Optional resume/callback parameters for crash recovery
resume_state: Optional[Dict[str, Any]] = None,
on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
@@ -48,6 +51,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
# Store for use in arun methods
self._resume_state = resume_state
self._on_state_change = on_state_change
self._last_state: Optional[Dict[str, Any]] = None
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -155,10 +162,21 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
Batch (non-streaming) mode:
Processes one BFS level at a time, then yields all the results.
"""
visited: Set[str] = set()
# current_level holds tuples: (url, parent_url)
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
current_level = [
(item["url"], item["parent_url"])
for item in self._resume_state.get("pending", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
else:
# Original initialization
visited: Set[str] = set()
# current_level holds tuples: (url, parent_url)
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
@@ -174,11 +192,7 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
# Clone the config to disable deep crawling recursion and enforce batch mode.
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
batch_results = await crawler.arun_many(urls=urls, config=batch_config)
# Update pages crawled counter - count only successful crawls
successful_results = [r for r in batch_results if r.success]
self._pages_crawled += len(successful_results)
for result in batch_results:
url = result.url
depth = depths.get(url, 0)
@@ -187,12 +201,27 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
parent_url = next((parent for (u, parent) in current_level if u == url), None)
result.metadata["parent_url"] = parent_url
results.append(result)
# Only discover links from successful crawls
if result.success:
# Increment pages crawled per URL for accurate state tracking
self._pages_crawled += 1
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "bfs",
"visited": list(visited),
"pending": [{"url": u, "parent_url": p} for u, p in next_level],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
current_level = next_level
return results
@@ -207,9 +236,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
Streaming mode:
Processes one BFS level at a time and yields results immediately as they arrive.
"""
visited: Set[str] = set()
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
current_level = [
(item["url"], item["parent_url"])
for item in self._resume_state.get("pending", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
else:
# Original initialization
visited: Set[str] = set()
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
while current_level and not self._cancel_event.is_set():
next_level: List[Tuple[str, Optional[str]]] = []
@@ -244,7 +284,19 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
if result.success:
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "bfs",
"visited": list(visited),
"pending": [{"url": u, "parent_url": p} for u, p in next_level],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
# If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
# by considering these URLs as visited but not counting them toward the max_pages limit
if results_count == 0 and urls:
@@ -258,3 +310,15 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
"""
self._cancel_event.set()
self.stats.end_time = datetime.now()
def export_state(self) -> Optional[Dict[str, Any]]:
"""
Export current crawl state for external persistence.
Note: This returns the last captured state. For real-time state,
use the on_state_change callback.
Returns:
Dict with strategy state, or None if no state captured yet.
"""
return self._last_state

View File

@@ -4,14 +4,26 @@ from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from ..models import CrawlResult
from .bfs_strategy import BFSDeepCrawlStrategy # noqa
from ..types import AsyncWebCrawler, CrawlerRunConfig
from ..utils import normalize_url_for_deep_crawl
class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
"""
Depth-First Search (DFS) deep crawling strategy.
Depth-first deep crawling with familiar BFS rules.
Inherits URL validation and link discovery from BFSDeepCrawlStrategy.
Overrides _arun_batch and _arun_stream to use a stack (LIFO) for DFS traversal.
We reuse the same filters, scoring, and page limits from :class:`BFSDeepCrawlStrategy`,
but walk the graph with a stack so we fully explore one branch before hopping to the
next. DFS also keeps its own ``_dfs_seen`` set so we can drop duplicate links at
discovery time without accidentally marking them as “already crawled”.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._dfs_seen: Set[str] = set()
def _reset_seen(self, start_url: str) -> None:
"""Start each crawl with a clean dedupe set seeded with the root URL."""
self._dfs_seen = {start_url}
async def _arun_batch(
self,
start_url: str,
@@ -19,14 +31,32 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> List[CrawlResult]:
"""
Batch (non-streaming) DFS mode.
Uses a stack to traverse URLs in DFS order, aggregating CrawlResults into a list.
Crawl level-by-level but emit results at the end.
We keep a stack of ``(url, parent, depth)`` tuples, pop one at a time, and
hand it to ``crawler.arun_many`` with deep crawling disabled so we remain
in control of traversal. Every successful page bumps ``_pages_crawled`` and
seeds new stack items discovered via :meth:`link_discovery`.
"""
visited: Set[str] = set()
# Stack items: (url, parent_url, depth)
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
stack = [
(item["url"], item["parent_url"], item["depth"])
for item in self._resume_state.get("stack", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
results: List[CrawlResult] = []
else:
# Original initialization
visited: Set[str] = set()
# Stack items: (url, parent_url, depth)
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -62,6 +92,22 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Capture state after each URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "dfs",
"visited": list(visited),
"stack": [
{"url": u, "parent_url": p, "depth": d}
for u, p, d in stack
],
"depths": depths,
"pages_crawled": self._pages_crawled,
"dfs_seen": list(self._dfs_seen),
}
self._last_state = state
await self._on_state_change(state)
return results
async def _arun_stream(
@@ -71,12 +117,28 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlResult, None]:
"""
Streaming DFS mode.
Uses a stack to traverse URLs in DFS order and yields CrawlResults as they become available.
Same traversal as :meth:`_arun_batch`, but yield pages immediately.
Each popped URL is crawled, its metadata annotated, then the result gets
yielded before we even look at the next stack entry. Successful crawls
still feed :meth:`link_discovery`, keeping DFS order intact.
"""
visited: Set[str] = set()
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
stack = [
(item["url"], item["parent_url"], item["depth"])
for item in self._resume_state.get("stack", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
else:
# Original initialization
visited: Set[str] = set()
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -108,3 +170,108 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Capture state after each URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "dfs",
"visited": list(visited),
"stack": [
{"url": u, "parent_url": p, "depth": d}
for u, p, d in stack
],
"depths": depths,
"pages_crawled": self._pages_crawled,
"dfs_seen": list(self._dfs_seen),
}
self._last_state = state
await self._on_state_change(state)
async def link_discovery(
self,
result: CrawlResult,
source_url: str,
current_depth: int,
_visited: Set[str],
next_level: List[Tuple[str, Optional[str]]],
depths: Dict[str, int],
) -> None:
"""
Find the next URLs we should push onto the DFS stack.
Parameters
----------
result : CrawlResult
Output of the page we just crawled; its ``links`` block is our raw material.
source_url : str
URL of the parent page; stored so callers can track ancestry.
current_depth : int
Depth of the parent; children naturally sit at ``current_depth + 1``.
_visited : Set[str]
Present to match the BFS signature, but we rely on ``_dfs_seen`` instead.
next_level : list of tuples
The stack buffer supplied by the caller; we append new ``(url, parent)`` items here.
depths : dict
Shared depth map so future metadata tagging knows how deep each URL lives.
Notes
-----
- ``_dfs_seen`` keeps us from pushing duplicates without touching the traversal guard.
- Validation, scoring, and capacity trimming mirror the BFS version so behaviour stays consistent.
"""
next_depth = current_depth + 1
if next_depth > self.max_depth:
return
remaining_capacity = self.max_pages - self._pages_crawled
if remaining_capacity <= 0:
self.logger.info(
f"Max pages limit ({self.max_pages}) reached, stopping link discovery"
)
return
links = result.links.get("internal", [])
if self.include_external:
links += result.links.get("external", [])
seen = self._dfs_seen
valid_links: List[Tuple[str, float]] = []
for link in links:
raw_url = link.get("href")
if not raw_url:
continue
normalized_url = normalize_url_for_deep_crawl(raw_url, source_url)
if not normalized_url or normalized_url in seen:
continue
if not await self.can_process_url(raw_url, next_depth):
self.stats.urls_skipped += 1
continue
score = self.url_scorer.score(normalized_url) if self.url_scorer else 0
if score < self.score_threshold:
self.logger.debug(
f"URL {normalized_url} skipped: score {score} below threshold {self.score_threshold}"
)
self.stats.urls_skipped += 1
continue
seen.add(normalized_url)
valid_links.append((normalized_url, score))
if len(valid_links) > remaining_capacity:
if self.url_scorer:
valid_links.sort(key=lambda x: x[1], reverse=True)
valid_links = valid_links[:remaining_capacity]
self.logger.info(
f"Limiting to {remaining_capacity} URLs due to max_pages limit"
)
for url, score in valid_links:
if score:
result.metadata = result.metadata or {}
result.metadata["score"] = score
next_level.append((url, source_url))
depths[url] = next_depth

View File

@@ -509,18 +509,22 @@ class DomainFilter(URLFilter):
class ContentRelevanceFilter(URLFilter):
"""BM25-based relevance filter using head section content"""
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl")
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl", "query")
def __init__(
self,
query: str,
query: Union[str, List[str]],
threshold: float,
k1: float = 1.2,
b: float = 0.75,
avgdl: int = 1000,
):
super().__init__(name="BM25RelevanceFilter")
self.query_terms = self._tokenize(query)
if isinstance(query, list):
self.query = " ".join(query)
else:
self.query = query
self.query_terms = self._tokenize(self.query)
self.threshold = threshold
self.k1 = k1 # TF saturation parameter
self.b = b # Length normalization parameter

View File

@@ -1,4 +1,4 @@
from typing import List, Optional, Union, AsyncGenerator, Dict, Any
from typing import List, Optional, Union, AsyncGenerator, Dict, Any, Callable
import httpx
import json
from urllib.parse import urljoin
@@ -7,6 +7,7 @@ import asyncio
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .async_logger import AsyncLogger, LogLevel
from .utils import hooks_to_string
class Crawl4aiClientError(Exception):
@@ -70,17 +71,41 @@ class Crawl4aiDockerClient:
self.logger.error(f"Server unreachable: {str(e)}", tag="ERROR")
raise ConnectionError(f"Cannot connect to server: {str(e)}")
def _prepare_request(self, urls: List[str], browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None) -> Dict[str, Any]:
def _prepare_request(
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Dict[str, Any]:
"""Prepare request data from configs."""
if self._token:
self._http_client.headers["Authorization"] = f"Bearer {self._token}"
return {
request_data = {
"urls": urls,
"browser_config": browser_config.dump() if browser_config else {},
"crawler_config": crawler_config.dump() if crawler_config else {}
}
# Handle hooks if provided
if hooks:
# Check if hooks are already strings or need conversion
if any(callable(v) for v in hooks.values()):
# Convert function objects to strings
hooks_code = hooks_to_string(hooks)
else:
# Already in string format
hooks_code = hooks
request_data["hooks"] = {
"code": hooks_code,
"timeout": hooks_timeout
}
return request_data
async def _request(self, method: str, endpoint: str, **kwargs) -> httpx.Response:
"""Make an HTTP request with error handling."""
url = urljoin(self.base_url, endpoint)
@@ -102,16 +127,42 @@ class Crawl4aiDockerClient:
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Union[CrawlResult, List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""Execute a crawl operation."""
"""
Execute a crawl operation.
Args:
urls: List of URLs to crawl
browser_config: Browser configuration
crawler_config: Crawler configuration
hooks: Optional hooks - can be either:
- Dict[str, Callable]: Function objects that will be converted to strings
- Dict[str, str]: Already stringified hook code
hooks_timeout: Timeout in seconds for each hook execution (1-120)
Returns:
Single CrawlResult, list of results, or async generator for streaming
Example with function hooks:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> result = await client.crawl(
... ["https://example.com"],
... hooks={"on_page_context_created": my_hook}
... )
"""
await self._check_server()
data = self._prepare_request(urls, browser_config, crawler_config)
data = self._prepare_request(urls, browser_config, crawler_config, hooks, hooks_timeout)
is_streaming = crawler_config and crawler_config.stream
self.logger.info(f"Crawling {len(urls)} URLs {'(streaming)' if is_streaming else ''}", tag="CRAWL")
if is_streaming:
async def stream_results() -> AsyncGenerator[CrawlResult, None]:
async with self._http_client.stream("POST", f"{self.base_url}/crawl/stream", json=data) as response:
@@ -128,12 +179,12 @@ class Crawl4aiDockerClient:
else:
yield CrawlResult(**result)
return stream_results()
response = await self._request("POST", "/crawl", json=data)
response = await self._request("POST", "/crawl", json=data, timeout=hooks_timeout)
result_data = response.json()
if not result_data.get("success", False):
raise RequestError(f"Crawl failed: {result_data.get('msg', 'Unknown error')}")
results = [CrawlResult(**r) for r in result_data.get("results", [])]
self.logger.success(f"Crawl completed with {len(results)} results", tag="CRAWL")
return results[0] if len(results) == 1 else results

View File

@@ -94,6 +94,20 @@ class ExtractionStrategy(ABC):
extracted_content.extend(future.result())
return extracted_content
async def arun(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
"""
Async version: Process sections of text in parallel using asyncio.
Default implementation runs the sync version in a thread pool.
Subclasses can override this for true async processing.
:param url: The URL of the webpage.
:param sections: List of sections (strings) to process.
:return: A list of processed JSON blocks.
"""
import asyncio
return await asyncio.to_thread(self.run, url, sections, *q, **kwargs)
class NoExtractionStrategy(ExtractionStrategy):
"""
@@ -635,6 +649,9 @@ class LLMExtractionStrategy(ExtractionStrategy):
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
) # , json_response=self.extract_type == "schema")
# Track usage
usage = TokenUsage(
@@ -780,6 +797,180 @@ class LLMExtractionStrategy(ExtractionStrategy):
return extracted_content
async def aextract(self, url: str, ix: int, html: str) -> List[Dict[str, Any]]:
"""
Async version: Extract meaningful blocks or chunks from the given HTML using an LLM.
How it works:
1. Construct a prompt with variables.
2. Make an async request to the LLM using the prompt.
3. Parse the response and extract blocks or chunks.
Args:
url: The URL of the webpage.
ix: Index of the block.
html: The HTML content of the webpage.
Returns:
A list of extracted blocks or chunks.
"""
from .utils import aperform_completion_with_backoff
if self.verbose:
print(f"[LOG] Call LLM for {url} - block index: {ix}")
variable_values = {
"URL": url,
"HTML": escape_json_string(sanitize_html(html)),
}
prompt_with_variables = PROMPT_EXTRACT_BLOCKS
if self.instruction:
variable_values["REQUEST"] = self.instruction
prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
if self.extract_type == "schema" and self.schema:
variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
if self.extract_type == "schema" and not self.schema:
prompt_with_variables = PROMPT_EXTRACT_INFERRED_SCHEMA
for variable in variable_values:
prompt_with_variables = prompt_with_variables.replace(
"{" + variable + "}", variable_values[variable]
)
try:
response = await aperform_completion_with_backoff(
self.llm_config.provider,
prompt_with_variables,
self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
)
# Track usage
usage = TokenUsage(
completion_tokens=response.usage.completion_tokens,
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
completion_tokens_details=response.usage.completion_tokens_details.__dict__
if response.usage.completion_tokens_details
else {},
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
if response.usage.prompt_tokens_details
else {},
)
self.usages.append(usage)
# Update totals
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
try:
content = response.choices[0].message.content
blocks = None
if self.force_json_response:
blocks = json.loads(content)
if isinstance(blocks, dict):
if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
blocks = list(blocks.values())[0]
else:
blocks = [blocks]
elif isinstance(blocks, list):
blocks = blocks
else:
blocks = extract_xml_data(["blocks"], content)["blocks"]
blocks = json.loads(blocks)
for block in blocks:
block["error"] = False
except Exception:
parsed, unparsed = split_and_parse_json_objects(
response.choices[0].message.content
)
blocks = parsed
if unparsed:
blocks.append(
{"index": 0, "error": True, "tags": ["error"], "content": unparsed}
)
if self.verbose:
print(
"[LOG] Extracted",
len(blocks),
"blocks from URL:",
url,
"block index:",
ix,
)
return blocks
except Exception as e:
if self.verbose:
print(f"[LOG] Error in LLM extraction: {e}")
return [
{
"index": ix,
"error": True,
"tags": ["error"],
"content": str(e),
}
]
async def arun(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
"""
Async version: Process sections with true parallelism using asyncio.gather.
Args:
url: The URL of the webpage.
sections: List of sections (strings) to process.
Returns:
A list of extracted blocks or chunks.
"""
import asyncio
merged_sections = self._merge(
sections,
self.chunk_token_threshold,
overlap=int(self.chunk_token_threshold * self.overlap_rate),
)
extracted_content = []
# Create tasks for all sections to run in parallel
tasks = [
self.aextract(url, ix, sanitize_input_encode(section))
for ix, section in enumerate(merged_sections)
]
# Execute all tasks concurrently
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results
for result in results:
if isinstance(result, Exception):
if self.verbose:
print(f"Error in async extraction: {result}")
extracted_content.append(
{
"index": 0,
"error": True,
"tags": ["error"],
"content": str(result),
}
)
else:
extracted_content.extend(result)
return extracted_content
def show_usage(self) -> None:
"""Print a detailed token usage report showing total and per-request usage."""
print("\n=== Token Usage Summary ===")
@@ -1086,44 +1277,18 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
}
@staticmethod
def generate_schema(
html: str,
schema_type: str = "CSS", # or XPATH
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = create_llm_config(),
provider: str = None,
api_token: str = None,
**kwargs
) -> dict:
def _build_schema_prompt(html: str, schema_type: str, query: str = None, target_json_example: str = None) -> str:
"""
Generate extraction schema from HTML content and optional query.
Args:
html (str): The HTML content to analyze
query (str, optional): Natural language description of what data to extract
provider (str): Legacy Parameter. LLM provider to use
api_token (str): Legacy Parameter. API token for LLM provider
llm_config (LLMConfig): LLM configuration object
prompt (str, optional): Custom prompt template to use
**kwargs: Additional args passed to LLM processor
Build the prompt for schema generation. Shared by sync and async methods.
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
str: Combined system and user prompt
"""
from .prompts import JSON_SCHEMA_BUILDER
from .utils import perform_completion_with_backoff
for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
if locals()[name] is not None:
raise AttributeError(f"Setting '{name}' is deprecated. {message}")
# Use default or custom prompt
prompt_template = JSON_SCHEMA_BUILDER if schema_type == "CSS" else JSON_SCHEMA_BUILDER_XPATH
# Build the prompt
system_message = {
"role": "system",
"content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
system_content = f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
Generating this HTML manually is not feasible, so you need to generate the JSON schema using the HTML content. The HTML copied from the crawled website is provided below, which we believe contains the repetitive pattern.
@@ -1144,31 +1309,27 @@ In this scenario, use your best judgment to generate the schema. You need to exa
# What are the instructions and details for this schema generation?
{prompt_template}"""
}
user_message = {
"role": "user",
"content": f"""
user_content = f"""
HTML to analyze:
```html
{html}
```
"""
}
if query:
user_message["content"] += f"\n\n## Query or explanation of target/goal data item:\n{query}"
user_content += f"\n\n## Query or explanation of target/goal data item:\n{query}"
if target_json_example:
user_message["content"] += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
user_content += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
if query and not target_json_example:
user_message["content"] += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
user_content += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
elif not query and target_json_example:
user_message["content"] += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
user_content += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
elif not query and not target_json_example:
user_message["content"] += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
user_message["content"] += """IMPORTANT:
user_content += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
user_content += """IMPORTANT:
0/ Ensure your schema remains reliable by avoiding selectors that appear to generate dynamically and are not dependable. You want a reliable schema, as it consistently returns the same data even after many page reloads.
1/ DO NOT USE use base64 kind of classes, they are temporary and not reliable.
2/ Every selector must refer to only one unique element. You should ensure your selector points to a single element and is unique to the place that contains the information. You have to use available techniques based on CSS or XPATH requested schema to make sure your selector is unique and also not fragile, meaning if we reload the page now or in the future, the selector should remain reliable.
@@ -1177,20 +1338,98 @@ In this scenario, use your best judgment to generate the schema. You need to exa
Analyze the HTML and generate a JSON schema that follows the specified format. Only output valid JSON schema, nothing else.
"""
return "\n\n".join([system_content, user_content])
@staticmethod
def generate_schema(
html: str,
schema_type: str = "CSS",
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = create_llm_config(),
provider: str = None,
api_token: str = None,
**kwargs
) -> dict:
"""
Generate extraction schema from HTML content and optional query (sync version).
Args:
html (str): The HTML content to analyze
query (str, optional): Natural language description of what data to extract
provider (str): Legacy Parameter. LLM provider to use
api_token (str): Legacy Parameter. API token for LLM provider
llm_config (LLMConfig): LLM configuration object
**kwargs: Additional args passed to LLM processor
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
"""
from .utils import perform_completion_with_backoff
for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
if locals()[name] is not None:
raise AttributeError(f"Setting '{name}' is deprecated. {message}")
prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
try:
# Call LLM with backoff handling
response = perform_completion_with_backoff(
provider=llm_config.provider,
prompt_with_variables="\n\n".join([system_message["content"], user_message["content"]]),
json_response = True,
prompt_with_variables=prompt,
json_response=True,
api_token=llm_config.api_token,
base_url=llm_config.base_url,
extra_args=kwargs
)
return json.loads(response.choices[0].message.content)
except Exception as e:
raise Exception(f"Failed to generate schema: {str(e)}")
@staticmethod
async def agenerate_schema(
html: str,
schema_type: str = "CSS",
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = None,
**kwargs
) -> dict:
"""
Generate extraction schema from HTML content (async version).
Use this method when calling from async contexts (e.g., FastAPI) to avoid
issues with certain LLM providers (e.g., Gemini/Vertex AI) that require
async execution.
Args:
html (str): The HTML content to analyze
schema_type (str): "CSS" or "XPATH"
query (str, optional): Natural language description of what data to extract
target_json_example (str, optional): Example of desired JSON output
llm_config (LLMConfig): LLM configuration object
**kwargs: Additional args passed to LLM processor
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
"""
from .utils import aperform_completion_with_backoff
if llm_config is None:
llm_config = create_llm_config()
prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
try:
response = await aperform_completion_with_backoff(
provider=llm_config.provider,
prompt_with_variables=prompt,
json_response=True,
api_token=llm_config.api_token,
base_url=llm_config.base_url,
extra_args=kwargs
)
# Extract and return schema
return json.loads(response.choices[0].message.content)
except Exception as e:
raise Exception(f"Failed to generate schema: {str(e)}")

View File

@@ -1,4 +1,4 @@
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field, ConfigDict
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from typing import AsyncGenerator
from typing import Generic, TypeVar
@@ -152,9 +152,12 @@ class CrawlResult(BaseModel):
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
tables: List[Dict] = Field(default_factory=list) # NEW [{headers,rows,caption,summary}]
# Cache validation metadata (Smart Cache)
head_fingerprint: Optional[str] = None
cached_at: Optional[float] = None
cache_status: Optional[str] = None # "hit", "hit_validated", "hit_fallback", "miss"
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
# NOTE: The StringCompatibleMarkdown class, custom __init__ method, property getters/setters,
# and model_dump override all exist to support a smooth transition from markdown as a string
@@ -332,8 +335,7 @@ class AsyncCrawlResponse(BaseModel):
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
###############################
# Scraping Models

View File

@@ -15,9 +15,9 @@ from .utils import (
clean_pdf_text_to_html,
)
# Remove direct PyPDF2 imports from the top
# import PyPDF2
# from PyPDF2 import PdfReader
# Remove direct pypdf imports from the top
# import pypdf
# from pypdf import PdfReader
logger = logging.getLogger(__name__)
@@ -59,9 +59,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
save_images_locally: bool = False, image_save_dir: Optional[Path] = None, batch_size: int = 4):
# Import check at initialization time
try:
import PyPDF2
import pypdf
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
self.image_dpi = image_dpi
self.image_quality = image_quality
@@ -75,9 +75,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
def process(self, pdf_path: Path) -> PDFProcessResult:
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
start_time = time()
result = PDFProcessResult(
@@ -125,15 +125,15 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
"""Like process() but processes PDF pages in parallel batches"""
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
import PyPDF2 # For type checking
from pypdf import PdfReader
import pypdf # For type checking
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
import concurrent.futures
import threading
# Initialize PyPDF2 thread support
# Initialize pypdf thread support
if not hasattr(threading.current_thread(), "_children"):
threading.current_thread()._children = set()
@@ -232,11 +232,11 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
return pdf_page
def _extract_images(self, page, image_dir: Optional[Path]) -> List[Dict]:
# Import PyPDF2 for type checking only when needed
# Import pypdf for type checking only when needed
try:
import PyPDF2
from pypdf.generic import IndirectObject
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
if not self.extract_images:
return []
@@ -266,7 +266,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
width = xobj.get('/Width', 0)
height = xobj.get('/Height', 0)
color_space = xobj.get('/ColorSpace', '/DeviceRGB')
if isinstance(color_space, PyPDF2.generic.IndirectObject):
if isinstance(color_space, IndirectObject):
color_space = color_space.get_object()
# Handle different image encodings
@@ -277,7 +277,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
if '/FlateDecode' in filters:
try:
decode_parms = xobj.get('/DecodeParms', {})
if isinstance(decode_parms, PyPDF2.generic.IndirectObject):
if isinstance(decode_parms, IndirectObject):
decode_parms = decode_parms.get_object()
predictor = decode_parms.get('/Predictor', 1)
@@ -416,10 +416,10 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
# Import inside method to allow dependency to be optional
if reader is None:
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
reader = PdfReader(pdf_path)
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
meta = reader.metadata or {}
created = self._parse_pdf_date(meta.get('/CreationDate', ''))
@@ -459,11 +459,11 @@ if __name__ == "__main__":
from pathlib import Path
try:
# Import PyPDF2 only when running the file directly
import PyPDF2
from PyPDF2 import PdfReader
# Import pypdf only when running the file directly
import pypdf
from pypdf import PdfReader
except ImportError:
print("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
print("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
exit(1)
current_dir = Path(__file__).resolve().parent

View File

@@ -1,12 +1,9 @@
from typing import List, Dict, Optional
from typing import List, Dict, Optional, Tuple
from abc import ABC, abstractmethod
from itertools import cycle
import os
import random
import time
import asyncio
import logging
from collections import defaultdict
import time
########### ATTENTION PEOPLE OF EARTH ###########
@@ -125,7 +122,7 @@ class ProxyConfig:
class ProxyRotationStrategy(ABC):
"""Base abstract class for proxy rotation strategies"""
@abstractmethod
async def get_next_proxy(self) -> Optional[ProxyConfig]:
"""Get next proxy configuration from the strategy"""
@@ -136,18 +133,81 @@ class ProxyRotationStrategy(ABC):
"""Add proxy configurations to the strategy"""
pass
@abstractmethod
async def get_proxy_for_session(
self,
session_id: str,
ttl: Optional[int] = None
) -> Optional[ProxyConfig]:
"""
Get or create a sticky proxy for a session.
If session_id already has an assigned proxy (and hasn't expired), return it.
If session_id is new, acquire a new proxy and associate it.
Args:
session_id: Unique session identifier
ttl: Optional time-to-live in seconds for this session
Returns:
ProxyConfig for this session
"""
pass
@abstractmethod
async def release_session(self, session_id: str) -> None:
"""
Release a sticky session, making the proxy available for reuse.
Args:
session_id: Session to release
"""
pass
@abstractmethod
def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
"""
Get the proxy for an existing session without creating new one.
Args:
session_id: Session to look up
Returns:
ProxyConfig if session exists and hasn't expired, None otherwise
"""
pass
@abstractmethod
def get_active_sessions(self) -> Dict[str, ProxyConfig]:
"""
Get all active sticky sessions.
Returns:
Dictionary mapping session_id to ProxyConfig
"""
pass
class RoundRobinProxyStrategy(ProxyRotationStrategy):
"""Simple round-robin proxy rotation strategy using ProxyConfig objects"""
"""Simple round-robin proxy rotation strategy using ProxyConfig objects.
Supports sticky sessions where a session_id can be bound to a specific proxy
for the duration of the session. This is useful for deep crawling where
you want to maintain the same IP address across multiple requests.
"""
def __init__(self, proxies: List[ProxyConfig] = None):
"""
Initialize with optional list of proxy configurations
Args:
proxies: List of ProxyConfig objects
"""
self._proxies = []
self._proxies: List[ProxyConfig] = []
self._proxy_cycle = None
# Session tracking: maps session_id -> (ProxyConfig, created_at, ttl)
self._sessions: Dict[str, Tuple[ProxyConfig, float, Optional[int]]] = {}
self._session_lock = asyncio.Lock()
if proxies:
self.add_proxies(proxies)
@@ -162,112 +222,120 @@ class RoundRobinProxyStrategy(ProxyRotationStrategy):
return None
return next(self._proxy_cycle)
async def get_proxy_for_session(
self,
session_id: str,
ttl: Optional[int] = None
) -> Optional[ProxyConfig]:
"""
Get or create a sticky proxy for a session.
class RandomProxyStrategy(ProxyRotationStrategy):
"""Random proxy selection strategy for unpredictable traffic patterns."""
def __init__(self, proxies: List[ProxyConfig] = None):
self._proxies = []
self._lock = asyncio.Lock()
if proxies:
self.add_proxies(proxies)
def add_proxies(self, proxies: List[ProxyConfig]):
"""Add new proxies to the rotation pool."""
self._proxies.extend(proxies)
async def get_next_proxy(self) -> Optional[ProxyConfig]:
"""Get randomly selected proxy."""
async with self._lock:
if not self._proxies:
If session_id already has an assigned proxy (and hasn't expired), return it.
If session_id is new, acquire a new proxy and associate it.
Args:
session_id: Unique session identifier
ttl: Optional time-to-live in seconds for this session
Returns:
ProxyConfig for this session
"""
async with self._session_lock:
# Check if session exists and hasn't expired
if session_id in self._sessions:
proxy, created_at, session_ttl = self._sessions[session_id]
# Check TTL expiration
effective_ttl = ttl if ttl is not None else session_ttl
if effective_ttl is not None:
elapsed = time.time() - created_at
if elapsed >= effective_ttl:
# Session expired, remove it and get new proxy
del self._sessions[session_id]
else:
return proxy
else:
return proxy
# Acquire new proxy for this session
proxy = await self.get_next_proxy()
if proxy:
self._sessions[session_id] = (proxy, time.time(), ttl)
return proxy
async def release_session(self, session_id: str) -> None:
"""
Release a sticky session, making the proxy available for reuse.
Args:
session_id: Session to release
"""
async with self._session_lock:
if session_id in self._sessions:
del self._sessions[session_id]
def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
"""
Get the proxy for an existing session without creating new one.
Args:
session_id: Session to look up
Returns:
ProxyConfig if session exists and hasn't expired, None otherwise
"""
if session_id not in self._sessions:
return None
proxy, created_at, ttl = self._sessions[session_id]
# Check TTL expiration
if ttl is not None:
elapsed = time.time() - created_at
if elapsed >= ttl:
return None
return random.choice(self._proxies)
return proxy
class LeastUsedProxyStrategy(ProxyRotationStrategy):
"""Least used proxy strategy for optimal load distribution."""
def __init__(self, proxies: List[ProxyConfig] = None):
self._proxies = []
self._usage_count: Dict[str, int] = defaultdict(int)
self._lock = asyncio.Lock()
if proxies:
self.add_proxies(proxies)
def add_proxies(self, proxies: List[ProxyConfig]):
"""Add new proxies to the rotation pool."""
self._proxies.extend(proxies)
for proxy in proxies:
self._usage_count[proxy.server] = 0
async def get_next_proxy(self) -> Optional[ProxyConfig]:
"""Get least used proxy for optimal load balancing."""
async with self._lock:
if not self._proxies:
return None
# Find proxy with minimum usage
min_proxy = min(self._proxies, key=lambda p: self._usage_count[p.server])
self._usage_count[min_proxy.server] += 1
return min_proxy
def get_active_sessions(self) -> Dict[str, ProxyConfig]:
"""
Get all active sticky sessions (excluding expired ones).
Returns:
Dictionary mapping session_id to ProxyConfig
"""
current_time = time.time()
active_sessions = {}
class FailureAwareProxyStrategy(ProxyRotationStrategy):
"""Failure-aware proxy strategy with automatic recovery and health tracking."""
def __init__(self, proxies: List[ProxyConfig] = None, failure_threshold: int = 3, recovery_time: int = 300):
self._proxies = []
self._healthy_proxies = []
self._failure_count: Dict[str, int] = defaultdict(int)
self._last_failure_time: Dict[str, float] = defaultdict(float)
self._failure_threshold = failure_threshold
self._recovery_time = recovery_time # seconds
self._lock = asyncio.Lock()
if proxies:
self.add_proxies(proxies)
def add_proxies(self, proxies: List[ProxyConfig]):
"""Add new proxies to the rotation pool."""
self._proxies.extend(proxies)
self._healthy_proxies.extend(proxies)
for proxy in proxies:
self._failure_count[proxy.server] = 0
async def get_next_proxy(self) -> Optional[ProxyConfig]:
"""Get next healthy proxy with automatic recovery."""
async with self._lock:
# Recovery check: re-enable proxies after recovery_time
for session_id, (proxy, created_at, ttl) in self._sessions.items():
# Skip expired sessions
if ttl is not None:
elapsed = current_time - created_at
if elapsed >= ttl:
continue
active_sessions[session_id] = proxy
return active_sessions
async def cleanup_expired_sessions(self) -> int:
"""
Remove all expired sessions from tracking.
Returns:
Number of sessions removed
"""
async with self._session_lock:
current_time = time.time()
recovered_proxies = []
for proxy in self._proxies:
if (proxy not in self._healthy_proxies and
current_time - self._last_failure_time[proxy.server] > self._recovery_time):
recovered_proxies.append(proxy)
self._failure_count[proxy.server] = 0
# Add recovered proxies back to healthy pool
self._healthy_proxies.extend(recovered_proxies)
# If no healthy proxies, reset all (emergency fallback)
if not self._healthy_proxies and self._proxies:
logging.warning("All proxies failed, resetting health status")
self._healthy_proxies = self._proxies.copy()
for proxy in self._proxies:
self._failure_count[proxy.server] = 0
if not self._healthy_proxies:
return None
return random.choice(self._healthy_proxies)
async def mark_proxy_failed(self, proxy: ProxyConfig):
"""Mark a proxy as failed and remove from healthy pool if threshold exceeded."""
async with self._lock:
self._failure_count[proxy.server] += 1
self._last_failure_time[proxy.server] = time.time()
if (self._failure_count[proxy.server] >= self._failure_threshold and
proxy in self._healthy_proxies):
self._healthy_proxies.remove(proxy)
logging.warning(f"Proxy {proxy.server} marked as unhealthy after {self._failure_count[proxy.server]} failures")
expired = []
for session_id, (proxy, created_at, ttl) in self._sessions.items():
if ttl is not None:
elapsed = current_time - created_at
if elapsed >= ttl:
expired.append(session_id)
for session_id in expired:
del self._sessions[session_id]
return len(expired)

View File

@@ -795,6 +795,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)
@@ -1116,6 +1119,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)

View File

@@ -1,195 +0,0 @@
from typing import TYPE_CHECKING, Union
# Logger types
AsyncLoggerBase = Union['AsyncLoggerBaseType']
AsyncLogger = Union['AsyncLoggerType']
# Crawler core types
AsyncWebCrawler = Union['AsyncWebCrawlerType']
CacheMode = Union['CacheModeType']
CrawlResult = Union['CrawlResultType']
CrawlerHub = Union['CrawlerHubType']
BrowserProfiler = Union['BrowserProfilerType']
# NEW: Add AsyncUrlSeederType
AsyncUrlSeeder = Union['AsyncUrlSeederType']
# Configuration types
BrowserConfig = Union['BrowserConfigType']
CrawlerRunConfig = Union['CrawlerRunConfigType']
HTTPCrawlerConfig = Union['HTTPCrawlerConfigType']
LLMConfig = Union['LLMConfigType']
# NEW: Add SeedingConfigType
SeedingConfig = Union['SeedingConfigType']
# Content scraping types
ContentScrapingStrategy = Union['ContentScrapingStrategyType']
LXMLWebScrapingStrategy = Union['LXMLWebScrapingStrategyType']
# Backward compatibility alias
WebScrapingStrategy = Union['LXMLWebScrapingStrategyType']
# Proxy types
ProxyRotationStrategy = Union['ProxyRotationStrategyType']
RoundRobinProxyStrategy = Union['RoundRobinProxyStrategyType']
# Extraction types
ExtractionStrategy = Union['ExtractionStrategyType']
LLMExtractionStrategy = Union['LLMExtractionStrategyType']
CosineStrategy = Union['CosineStrategyType']
JsonCssExtractionStrategy = Union['JsonCssExtractionStrategyType']
JsonXPathExtractionStrategy = Union['JsonXPathExtractionStrategyType']
# Chunking types
ChunkingStrategy = Union['ChunkingStrategyType']
RegexChunking = Union['RegexChunkingType']
# Markdown generation types
DefaultMarkdownGenerator = Union['DefaultMarkdownGeneratorType']
MarkdownGenerationResult = Union['MarkdownGenerationResultType']
# Content filter types
RelevantContentFilter = Union['RelevantContentFilterType']
PruningContentFilter = Union['PruningContentFilterType']
BM25ContentFilter = Union['BM25ContentFilterType']
LLMContentFilter = Union['LLMContentFilterType']
# Dispatcher types
BaseDispatcher = Union['BaseDispatcherType']
MemoryAdaptiveDispatcher = Union['MemoryAdaptiveDispatcherType']
SemaphoreDispatcher = Union['SemaphoreDispatcherType']
RateLimiter = Union['RateLimiterType']
CrawlerMonitor = Union['CrawlerMonitorType']
DisplayMode = Union['DisplayModeType']
RunManyReturn = Union['RunManyReturnType']
# Docker client
Crawl4aiDockerClient = Union['Crawl4aiDockerClientType']
# Deep crawling types
DeepCrawlStrategy = Union['DeepCrawlStrategyType']
BFSDeepCrawlStrategy = Union['BFSDeepCrawlStrategyType']
FilterChain = Union['FilterChainType']
ContentTypeFilter = Union['ContentTypeFilterType']
DomainFilter = Union['DomainFilterType']
URLFilter = Union['URLFilterType']
FilterStats = Union['FilterStatsType']
SEOFilter = Union['SEOFilterType']
KeywordRelevanceScorer = Union['KeywordRelevanceScorerType']
URLScorer = Union['URLScorerType']
CompositeScorer = Union['CompositeScorerType']
DomainAuthorityScorer = Union['DomainAuthorityScorerType']
FreshnessScorer = Union['FreshnessScorerType']
PathDepthScorer = Union['PathDepthScorerType']
BestFirstCrawlingStrategy = Union['BestFirstCrawlingStrategyType']
DFSDeepCrawlStrategy = Union['DFSDeepCrawlStrategyType']
DeepCrawlDecorator = Union['DeepCrawlDecoratorType']
# Only import types during type checking to avoid circular imports
if TYPE_CHECKING:
# Logger imports
from .async_logger import (
AsyncLoggerBase as AsyncLoggerBaseType,
AsyncLogger as AsyncLoggerType,
)
# Crawler core imports
from .async_webcrawler import (
AsyncWebCrawler as AsyncWebCrawlerType,
CacheMode as CacheModeType,
)
from .models import CrawlResult as CrawlResultType
from .hub import CrawlerHub as CrawlerHubType
from .browser_profiler import BrowserProfiler as BrowserProfilerType
# NEW: Import AsyncUrlSeeder for type checking
from .async_url_seeder import AsyncUrlSeeder as AsyncUrlSeederType
# Configuration imports
from .async_configs import (
BrowserConfig as BrowserConfigType,
CrawlerRunConfig as CrawlerRunConfigType,
HTTPCrawlerConfig as HTTPCrawlerConfigType,
LLMConfig as LLMConfigType,
# NEW: Import SeedingConfig for type checking
SeedingConfig as SeedingConfigType,
)
# Content scraping imports
from .content_scraping_strategy import (
ContentScrapingStrategy as ContentScrapingStrategyType,
LXMLWebScrapingStrategy as LXMLWebScrapingStrategyType,
)
# Proxy imports
from .proxy_strategy import (
ProxyRotationStrategy as ProxyRotationStrategyType,
RoundRobinProxyStrategy as RoundRobinProxyStrategyType,
)
# Extraction imports
from .extraction_strategy import (
ExtractionStrategy as ExtractionStrategyType,
LLMExtractionStrategy as LLMExtractionStrategyType,
CosineStrategy as CosineStrategyType,
JsonCssExtractionStrategy as JsonCssExtractionStrategyType,
JsonXPathExtractionStrategy as JsonXPathExtractionStrategyType,
)
# Chunking imports
from .chunking_strategy import (
ChunkingStrategy as ChunkingStrategyType,
RegexChunking as RegexChunkingType,
)
# Markdown generation imports
from .markdown_generation_strategy import (
DefaultMarkdownGenerator as DefaultMarkdownGeneratorType,
)
from .models import MarkdownGenerationResult as MarkdownGenerationResultType
# Content filter imports
from .content_filter_strategy import (
RelevantContentFilter as RelevantContentFilterType,
PruningContentFilter as PruningContentFilterType,
BM25ContentFilter as BM25ContentFilterType,
LLMContentFilter as LLMContentFilterType,
)
# Dispatcher imports
from .async_dispatcher import (
BaseDispatcher as BaseDispatcherType,
MemoryAdaptiveDispatcher as MemoryAdaptiveDispatcherType,
SemaphoreDispatcher as SemaphoreDispatcherType,
RateLimiter as RateLimiterType,
CrawlerMonitor as CrawlerMonitorType,
DisplayMode as DisplayModeType,
RunManyReturn as RunManyReturnType,
)
# Docker client
from .docker_client import Crawl4aiDockerClient as Crawl4aiDockerClientType
# Deep crawling imports
from .deep_crawling import (
DeepCrawlStrategy as DeepCrawlStrategyType,
BFSDeepCrawlStrategy as BFSDeepCrawlStrategyType,
FilterChain as FilterChainType,
ContentTypeFilter as ContentTypeFilterType,
DomainFilter as DomainFilterType,
URLFilter as URLFilterType,
FilterStats as FilterStatsType,
SEOFilter as SEOFilterType,
KeywordRelevanceScorer as KeywordRelevanceScorerType,
URLScorer as URLScorerType,
CompositeScorer as CompositeScorerType,
DomainAuthorityScorer as DomainAuthorityScorerType,
FreshnessScorer as FreshnessScorerType,
PathDepthScorer as PathDepthScorerType,
BestFirstCrawlingStrategy as BestFirstCrawlingStrategyType,
DFSDeepCrawlStrategy as DFSDeepCrawlStrategyType,
DeepCrawlDecorator as DeepCrawlDecoratorType,
)
def create_llm_config(*args, **kwargs) -> 'LLMConfigType':
from .async_configs import LLMConfig
return LLMConfig(*args, **kwargs)

View File

@@ -47,6 +47,7 @@ from urllib.parse import (
urljoin, urlparse, urlunparse,
parse_qsl, urlencode, quote, unquote
)
import inspect
# Monkey patch to fix wildcard handling in urllib.robotparser
@@ -1744,6 +1745,9 @@ def perform_completion_with_backoff(
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
@@ -1760,6 +1764,9 @@ def perform_completion_with_backoff(
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
@@ -1768,9 +1775,8 @@ def perform_completion_with_backoff(
from litellm import completion
from litellm.exceptions import RateLimitError
max_attempts = 3
base_delay = 2 # Base delay in seconds, you can adjust this based on your needs
import litellm
litellm.drop_params = True # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
@@ -1797,7 +1803,7 @@ def perform_completion_with_backoff(
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (2**attempt) # Exponential backoff formula
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
time.sleep(delay)
else:
@@ -1824,6 +1830,87 @@ def perform_completion_with_backoff(
# ]
async def aperform_completion_with_backoff(
provider,
prompt_with_variables,
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
Async version: Perform an API completion request with exponential backoff.
How it works:
1. Sends an async completion request to the API.
2. Retries on rate-limit errors with exponential delays (async).
3. Returns the API response or an error after all retries.
Args:
provider (str): The name of the API provider.
prompt_with_variables (str): The input prompt for the completion request.
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
dict: The API response or an error message after all retries.
"""
from litellm import acompletion
from litellm.exceptions import RateLimitError
import litellm
import asyncio
litellm.drop_params = True # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
extra_args["response_format"] = {"type": "json_object"}
if kwargs.get("extra_args"):
extra_args.update(kwargs["extra_args"])
for attempt in range(max_attempts):
try:
response = await acompletion(
model=provider,
messages=[{"role": "user", "content": prompt_with_variables}],
**extra_args,
)
return response # Return the successful response
except RateLimitError as e:
print("Rate limit error:", str(e))
if attempt == max_attempts - 1:
# Last attempt failed, raise the error.
raise
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
await asyncio.sleep(delay)
else:
# Return an error response after exhausting all retries
return [
{
"index": 0,
"tags": ["error"],
"content": ["Rate limit error. Please try again later."],
}
]
except Exception as e:
raise e # Raise any other exceptions immediately
def extract_blocks(url, html, provider=DEFAULT_PROVIDER, api_token=None, base_url=None):
"""
Extract content blocks from website HTML using an AI provider.
@@ -2378,6 +2465,54 @@ def normalize_url_tmp(href, base_url):
return href.strip()
def quick_extract_links(html: str, base_url: str) -> Dict[str, List[Dict[str, str]]]:
"""
Fast link extraction for prefetch mode.
Only extracts <a href> tags - no media, no cleaning, no heavy processing.
Args:
html: Raw HTML string
base_url: Base URL for resolving relative links
Returns:
{"internal": [{"href": "...", "text": "..."}], "external": [...]}
"""
from lxml.html import document_fromstring
try:
doc = document_fromstring(html)
except Exception:
return {"internal": [], "external": []}
base_domain = get_base_domain(base_url)
internal: List[Dict[str, str]] = []
external: List[Dict[str, str]] = []
seen: Set[str] = set()
for a in doc.xpath("//a[@href]"):
href = a.get("href", "").strip()
if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
continue
# Normalize URL
normalized = normalize_url_for_deep_crawl(href, base_url)
if not normalized or normalized in seen:
continue
seen.add(normalized)
# Extract text (truncated for memory efficiency)
text = (a.text_content() or "").strip()[:200]
link_data = {"href": normalized, "text": text}
if is_external_url(normalized, base_domain):
external.append(link_data)
else:
internal.append(link_data)
return {"internal": internal, "external": external}
def get_base_domain(url: str) -> str:
"""
Extract the base domain from a given URL, handling common edge cases.
@@ -2745,6 +2880,67 @@ def generate_content_hash(content: str) -> str:
# return hashlib.sha256(content.encode()).hexdigest()
def compute_head_fingerprint(head_html: str) -> str:
"""
Compute a fingerprint of <head> content for cache validation.
Focuses on content that typically changes when page updates:
- <title>
- <meta name="description">
- <meta property="og:title|og:description|og:image|og:updated_time">
- <meta property="article:modified_time">
- <meta name="last-modified">
Uses xxhash for speed, combines multiple signals into a single hash.
Args:
head_html: The HTML content of the <head> section
Returns:
A hex string fingerprint, or empty string if no signals found
"""
if not head_html:
return ""
head_lower = head_html.lower()
signals = []
# Extract title
title_match = re.search(r'<title[^>]*>(.*?)</title>', head_lower, re.DOTALL)
if title_match:
signals.append(title_match.group(1).strip())
# Meta tags to extract (name or property attribute, and the value to match)
meta_tags = [
("name", "description"),
("name", "last-modified"),
("property", "og:title"),
("property", "og:description"),
("property", "og:image"),
("property", "og:updated_time"),
("property", "article:modified_time"),
]
for attr_type, attr_value in meta_tags:
# Handle both attribute orders: attr="value" content="..." and content="..." attr="value"
patterns = [
rf'<meta[^>]*{attr_type}=["\']{ re.escape(attr_value)}["\'][^>]*content=["\']([^"\']*)["\']',
rf'<meta[^>]*content=["\']([^"\']*)["\'][^>]*{attr_type}=["\']{re.escape(attr_value)}["\']',
]
for pattern in patterns:
match = re.search(pattern, head_lower)
if match:
signals.append(match.group(1).strip())
break # Found this tag, move to next
if not signals:
return ""
# Combine signals and hash
combined = '|'.join(signals)
return xxhash.xxh64(combined.encode()).hexdigest()
def ensure_content_dirs(base_path: str) -> Dict[str, str]:
"""Create content directories if they don't exist"""
dirs = {
@@ -3529,4 +3725,52 @@ def get_memory_stats() -> Tuple[float, float, float]:
available_gb = get_true_available_memory_gb()
used_percent = get_true_memory_usage_percent()
return used_percent, available_gb, total_gb
return used_percent, available_gb, total_gb
# Hook utilities for Docker API
def hooks_to_string(hooks: Dict[str, Callable]) -> Dict[str, str]:
"""
Convert hook function objects to string representations for Docker API.
This utility simplifies the process of using hooks with the Docker API by converting
Python function objects into the string format required by the API.
Args:
hooks: Dictionary mapping hook point names to Python function objects.
Functions should be async and follow hook signature requirements.
Returns:
Dictionary mapping hook point names to string representations of the functions.
Example:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> hooks_dict = {"on_page_context_created": my_hook}
>>> api_hooks = hooks_to_string(hooks_dict)
>>> # api_hooks is now ready to use with Docker API
Raises:
ValueError: If a hook is not callable or source cannot be extracted
"""
result = {}
for hook_name, hook_func in hooks.items():
if not callable(hook_func):
raise ValueError(f"Hook '{hook_name}' must be a callable function, got {type(hook_func)}")
try:
# Get the source code of the function
source = inspect.getsource(hook_func)
# Remove any leading indentation to get clean source
source = textwrap.dedent(source)
result[hook_name] = source
except (OSError, TypeError) as e:
raise ValueError(
f"Cannot extract source code for hook '{hook_name}'. "
f"Make sure the function is defined in a file (not interactively). Error: {e}"
)
return result

File diff suppressed because it is too large Load Diff

View File

@@ -12,8 +12,8 @@
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Asynchronous Jobs with Webhooks](#asynchronous-jobs-with-webhooks)
- [Additional API Endpoints](#additional-api-endpoints)
- [Dispatcher Management](#dispatcher-management)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
@@ -35,8 +35,6 @@
- [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
- [Customizing Your Configuration](#customizing-your-configuration)
- [Configuration Recommendations](#configuration-recommendations)
- [Testing & Validation](#testing--validation)
- [Dispatcher Demo Test Suite](#dispatcher-demo-test-suite)
- [Getting Help](#getting-help)
- [Summary](#summary)
@@ -61,15 +59,13 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
Our latest stable release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
```bash
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Pull the latest stable version (0.8.0)
docker pull unclecode/crawl4ai:0.8.0
# Or pull the current stable version (0.6.0)
# Or use the latest tag (points to 0.8.0)
docker pull unclecode/crawl4ai:latest
```
@@ -104,7 +100,7 @@ EOL
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:0.7.0-r1
unclecode/crawl4ai:0.8.0
```
* **With LLM support:**
@@ -115,7 +111,7 @@ EOL
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:0.7.0-r1
unclecode/crawl4ai:0.8.0
```
> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -188,7 +184,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
IMAGE=unclecode/crawl4ai:0.8.0 docker compose up -d
```
* **Build and Run Locally:**
@@ -335,134 +331,6 @@ Access the MCP tool schemas at `http://localhost:11235/mcp/schema` for detailed
In addition to the core `/crawl` and `/crawl/stream` endpoints, the server provides several specialized endpoints:
### Dispatcher Management
The server supports multiple dispatcher strategies for managing concurrent crawling operations. Dispatchers control how many crawl jobs run simultaneously based on different rules like fixed concurrency limits or system memory availability.
#### Available Dispatchers
**Memory Adaptive Dispatcher** (Default)
- Dynamically adjusts concurrency based on system memory usage
- Monitors memory pressure and adapts crawl sessions accordingly
- Automatically requeues tasks under high memory conditions
- Implements fairness timeout for long-waiting URLs
**Semaphore Dispatcher**
- Fixed concurrency limit using semaphore-based control
- Simple and predictable resource usage
- Ideal for controlled crawling scenarios
#### Dispatcher Endpoints
**List Available Dispatchers**
```bash
GET /dispatchers
```
Returns information about all available dispatcher types, their configurations, and features.
```bash
curl http://localhost:11234/dispatchers | jq
```
**Get Default Dispatcher**
```bash
GET /dispatchers/default
```
Returns the current default dispatcher configuration.
```bash
curl http://localhost:11234/dispatchers/default | jq
```
**Get Dispatcher Statistics**
```bash
GET /dispatchers/{dispatcher_type}/stats
```
Returns real-time statistics for a specific dispatcher including active sessions, memory usage, and configuration.
```bash
# Get memory_adaptive dispatcher stats
curl http://localhost:11234/dispatchers/memory_adaptive/stats | jq
# Get semaphore dispatcher stats
curl http://localhost:11234/dispatchers/semaphore/stats | jq
```
#### Using Dispatchers in Crawl Requests
You can specify which dispatcher to use in your crawl requests by adding the `dispatcher` field:
**Using Default Dispatcher (memory_adaptive)**
```bash
curl -X POST http://localhost:11234/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {}
}'
```
**Using Semaphore Dispatcher**
```bash
curl -X POST http://localhost:11234/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {},
"dispatcher": "semaphore"
}'
```
**Python SDK Example**
```python
import requests
# Crawl with memory adaptive dispatcher (default)
response = requests.post(
"http://localhost:11234/crawl",
json={
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {}
}
)
# Crawl with semaphore dispatcher
response = requests.post(
"http://localhost:11234/crawl",
json={
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {},
"dispatcher": "semaphore"
}
)
```
#### Dispatcher Configuration
Dispatchers are configured with sensible defaults that work well for most use cases:
**Memory Adaptive Dispatcher Defaults:**
- `memory_threshold_percent`: 70.0 - Start adjusting at 70% memory usage
- `critical_threshold_percent`: 85.0 - Critical memory pressure threshold
- `recovery_threshold_percent`: 65.0 - Resume normal operation below 65%
- `check_interval`: 1.0 - Check memory every second
- `max_session_permit`: 20 - Maximum concurrent sessions
- `fairness_timeout`: 600.0 - Prioritize URLs waiting > 10 minutes
- `memory_wait_timeout`: 600.0 - Fail if high memory persists > 10 minutes
**Semaphore Dispatcher Defaults:**
- `semaphore_count`: 5 - Maximum concurrent crawl operations
- `max_session_permit`: 10 - Maximum total sessions allowed
> 💡 **Tip**: Use `memory_adaptive` for dynamic workloads where memory availability varies. Use `semaphore` for predictable, controlled crawling with fixed concurrency limits.
### HTML Extraction Endpoint
```
@@ -779,143 +647,193 @@ async def test_stream_crawl(token: str = None): # Made token optional
# asyncio.run(test_stream_crawl())
```
#### LLM Job with Chunking Strategy
### Asynchronous Jobs with Webhooks
```python
import requests
import time
For long-running crawls or when you want to avoid keeping connections open, use the job queue endpoints. Instead of polling for results, configure a webhook to receive notifications when jobs complete.
# Example: LLM extraction with RegexChunking strategy
# This breaks large documents into smaller chunks before LLM processing
#### Why Use Jobs & Webhooks?
llm_job_payload = {
"url": "https://example.com/long-article",
"q": "Extract all key points and main ideas from this article",
"chunking_strategy": {
"type": "RegexChunking",
"params": {
"patterns": ["\\n\\n"], # Split on double newlines (paragraphs)
"overlap": 50
}
- **No Polling Required** - Get notified when crawls complete instead of constantly checking status
- **Better Resource Usage** - Free up client connections while jobs run in the background
- **Scalable Architecture** - Ideal for high-volume crawling with TypeScript/Node.js clients or microservices
- **Reliable Delivery** - Automatic retry with exponential backoff (5 attempts: 1s → 2s → 4s → 8s → 16s)
#### How It Works
1. **Submit Job** → POST to `/crawl/job` with optional `webhook_config`
2. **Get Task ID** → Receive a `task_id` immediately
3. **Job Runs** → Crawl executes in the background
4. **Webhook Fired** → Server POSTs completion notification to your webhook URL
5. **Fetch Results** → If data wasn't included in webhook, GET `/crawl/job/{task_id}`
#### Quick Example
```bash
# Submit a crawl job with webhook notification
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false
}
}
}'
# Submit LLM job
response = requests.post(
"http://localhost:11235/llm/job",
json=llm_job_payload
)
if response.ok:
job_data = response.json()
job_id = job_data["task_id"]
print(f"Job submitted successfully. Job ID: {job_id}")
# Poll for completion
while True:
status_response = requests.get(f"http://localhost:11235/llm/job/{job_id}")
if status_response.ok:
status_data = status_response.json()
if status_data["status"] == "completed":
print("Job completed!")
print("Extracted content:", status_data["result"])
break
elif status_data["status"] == "failed":
print("Job failed:", status_data.get("error"))
break
else:
print(f"Job status: {status_data['status']}")
time.sleep(2) # Wait 2 seconds before checking again
else:
print(f"Error checking job status: {status_response.text}")
break
else:
print(f"Error submitting job: {response.text}")
# Response: {"task_id": "crawl_a1b2c3d4"}
```
**Available Chunking Strategies:**
**Your webhook receives:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"]
}
```
- **IdentityChunking**: Returns the entire content as a single chunk (no splitting)
```json
{
"type": "IdentityChunking",
"params": {}
Then fetch the results:
```bash
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
```
#### Include Data in Webhook
Set `webhook_data_in_payload: true` to receive the full crawl results directly in the webhook:
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": true
}
}'
```
**Your webhook receives the complete data:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"data": {
"markdown": "...",
"html": "...",
"links": {...},
"metadata": {...}
}
```
}
```
- **RegexChunking**: Split content using regular expression patterns
```json
{
"type": "RegexChunking",
"params": {
"patterns": ["\\n\\n"]
#### Webhook Authentication
Add custom headers for authentication:
```json
{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token",
"X-Service-ID": "crawl4ai-prod"
}
}
```
}
```
- **NlpSentenceChunking**: Split content into sentences using NLP (requires NLTK)
```json
{
"type": "NlpSentenceChunking",
"params": {}
}
```
#### Global Default Webhook
- **TopicSegmentationChunking**: Segment content into topics using TextTiling (requires NLTK)
```json
{
"type": "TopicSegmentationChunking",
"params": {
"num_keywords": 3
Configure a default webhook URL in `config.yml` for all jobs:
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default"
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
```
Now jobs without `webhook_config` automatically use the default webhook.
#### Job Status Polling (Without Webhooks)
If you prefer polling instead of webhooks, just omit `webhook_config`:
```bash
# Submit job
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"]}'
# Response: {"task_id": "crawl_xyz"}
# Poll for status
curl http://localhost:11235/crawl/job/crawl_xyz
```
The response includes `status` field: `"processing"`, `"completed"`, or `"failed"`.
#### LLM Extraction Jobs with Webhooks
The same webhook system works for LLM extraction jobs via `/llm/job`:
```bash
# Submit LLM extraction job with webhook
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and main points",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}'
# Response: {"task_id": "llm_1234567890"}
```
**Your webhook receives:**
```json
{
"task_id": "llm_1234567890",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T12:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"main_points": ["Point 1", "Point 2", "Point 3"]
}
}
```
}
```
- **FixedLengthWordChunking**: Split into fixed-length word chunks
```json
{
"type": "FixedLengthWordChunking",
"params": {
"chunk_size": 100
}
}
```
**Key Differences for LLM Jobs:**
- Task type is `"llm_extraction"` instead of `"crawl"`
- Extracted data is in `data.extracted_content`
- Single URL only (not an array)
- Supports schema-based extraction with `schema` parameter
- **SlidingWindowChunking**: Overlapping word chunks with configurable step size
```json
{
"type": "SlidingWindowChunking",
"params": {
"window_size": 100,
"step": 50
}
}
```
- **OverlappingWindowChunking**: Fixed-size chunks with word overlap
```json
{
"type": "OverlappingWindowChunking",
"params": {
"window_size": 1000,
"overlap": 100
}
}
```
{
"type": "OverlappingWindowChunking",
"params": {
"chunk_size": 1500,
"overlap": 100
}
}
```
**Notes:**
- `chunking_strategy` is optional - if omitted, default token-based chunking is used
- Chunking is applied at the API level without modifying the core SDK
- Results from all chunks are merged into a single response
- Each chunk is processed independently with the same LLM instruction
> 💡 **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples including TypeScript client code, Flask webhook handlers, and failure handling.
---
@@ -1082,93 +1000,6 @@ You can override the default `config.yml`.
- Increase batch_process timeout for large content
- Adjust stream_init timeout based on initial response times
## Testing & Validation
We provide two comprehensive test suites to validate all Docker server functionality:
### 1. Extended Features Test Suite ✅ **100% Pass Rate**
Complete validation of all advanced features including URL seeding, adaptive crawling, browser adapters, proxy rotation, and dispatchers.
```bash
# Run all extended features tests
cd tests/docker/extended_features
./run_extended_tests.sh
# Custom server URL
./run_extended_tests.sh --server http://localhost:8080
```
**Test Coverage (12 tests):**
- ✅ **URL Seeding** (2 tests): Basic seeding + domain filters
- ✅ **Adaptive Crawling** (2 tests): Basic + custom thresholds
- ✅ **Browser Adapters** (3 tests): Default, Stealth, Undetected
- ✅ **Proxy Rotation** (2 tests): Round Robin, Random strategies
- ✅ **Dispatchers** (3 tests): Memory Adaptive, Semaphore, Management APIs
**Current Status:**
```
Total Tests: 12
Passed: 12
Failed: 0
Pass Rate: 100.0% ✅
Average Duration: ~8.8 seconds
```
Features:
- Rich formatted output with tables and panels
- Real-time progress indicators
- Detailed error diagnostics
- Category-based results grouping
- Server health checks
See [`tests/docker/extended_features/README_EXTENDED_TESTS.md`](../../tests/docker/extended_features/README_EXTENDED_TESTS.md) for full documentation and API response format reference.
### 2. Dispatcher Demo Test Suite
Focused tests for dispatcher functionality with performance comparisons:
```bash
# Run all tests
cd test_scripts
./run_dispatcher_tests.sh
# Run specific category
./run_dispatcher_tests.sh -c basic # Basic dispatcher usage
./run_dispatcher_tests.sh -c integration # Integration with other features
./run_dispatcher_tests.sh -c endpoints # Dispatcher management endpoints
./run_dispatcher_tests.sh -c performance # Performance comparison
./run_dispatcher_tests.sh -c error # Error handling
# Custom server URL
./run_dispatcher_tests.sh -s http://your-server:port
```
**Test Coverage (17 tests):**
- **Basic Usage Tests**: Single/multiple URL crawling with different dispatchers
- **Integration Tests**: Dispatchers combined with anti-bot strategies, browser configs, JS execution, screenshots
- **Endpoint Tests**: Dispatcher management API validation
- **Performance Tests**: Side-by-side comparison of memory_adaptive vs semaphore
- **Error Handling**: Edge cases and validation tests
Results are displayed with rich formatting, timing information, and success rates. See `test_scripts/README_DISPATCHER_TESTS.md` for full documentation.
### Quick Test Commands
```bash
# Test all features (recommended)
./tests/docker/extended_features/run_extended_tests.sh
# Test dispatchers only
./test_scripts/run_dispatcher_tests.sh
# Test server health
curl http://localhost:11235/health
# Test dispatcher endpoint
curl http://localhost:11235/dispatchers | jq
```
## Getting Help
We're here to help you succeed with Crawl4AI! Here's how to get support:
@@ -1182,10 +1013,11 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Asynchronous job queues with webhook notifications
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment

View File

@@ -0,0 +1,241 @@
# Crawl4AI Docker Memory & Pool Optimization - Implementation Log
## Critical Issues Identified
### Memory Management
- **Host vs Container**: `psutil.virtual_memory()` reported host memory, not container limits
- **Browser Pooling**: No pool reuse - every endpoint created new browsers
- **Warmup Waste**: Permanent browser sat idle with mismatched config signature
- **Idle Cleanup**: 30min TTL too long, janitor ran every 60s
- **Endpoint Inconsistency**: 75% of endpoints bypassed pool (`/md`, `/html`, `/screenshot`, `/pdf`, `/execute_js`, `/llm`)
### Pool Design Flaws
- **Config Mismatch**: Permanent browser used `config.yml` args, endpoints used empty `BrowserConfig()`
- **Logging Level**: Pool hit markers at DEBUG, invisible with INFO logging
## Implementation Changes
### 1. Container-Aware Memory Detection (`utils.py`)
```python
def get_container_memory_percent() -> float:
# Try cgroup v2 → v1 → fallback to psutil
# Reads /sys/fs/cgroup/memory.{current,max} OR memory/memory.{usage,limit}_in_bytes
```
### 2. Smart Browser Pool (`crawler_pool.py`)
**3-Tier System:**
- **PERMANENT**: Always-ready default browser (never cleaned)
- **HOT_POOL**: Configs used 3+ times (longer TTL)
- **COLD_POOL**: New/rare configs (short TTL)
**Key Functions:**
- `get_crawler(cfg)`: Check permanent → hot → cold → create new
- `init_permanent(cfg)`: Initialize permanent at startup
- `janitor()`: Adaptive cleanup (10s/30s/60s intervals based on memory)
- `_sig(cfg)`: SHA1 hash of config dict for pool keys
**Logging Fix**: Changed `logger.debug()``logger.info()` for pool hits
### 3. Endpoint Unification
**Helper Function** (`server.py`):
```python
def get_default_browser_config() -> BrowserConfig:
return BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
)
```
**Migrated Endpoints:**
- `/html`, `/screenshot`, `/pdf`, `/execute_js` → use `get_default_browser_config()`
- `handle_llm_qa()`, `handle_markdown_request()` → same
**Result**: All endpoints now hit permanent browser pool
### 4. Config Updates (`config.yml`)
- `idle_ttl_sec: 1800``300` (30min → 5min base TTL)
- `port: 11234``11235` (fixed mismatch with Gunicorn)
### 5. Lifespan Fix (`server.py`)
```python
await init_permanent(BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
))
```
Permanent browser now matches endpoint config signatures
## Test Results
### Test 1: Basic Health
- 10 requests to `/health`
- **Result**: 100% success, avg 3ms latency
- **Baseline**: Container starts in ~5s, 270 MB idle
### Test 2: Memory Monitoring
- 20 requests with Docker stats tracking
- **Result**: 100% success, no memory leak (-0.2 MB delta)
- **Baseline**: 269.7 MB container overhead
### Test 3: Pool Validation
- 30 requests to `/html` endpoint
- **Result**: **100% permanent browser hits**, 0 new browsers created
- **Memory**: 287 MB baseline → 396 MB active (+109 MB)
- **Latency**: Avg 4s (includes network to httpbin.org)
### Test 4: Concurrent Load
- Light (10) → Medium (50) → Heavy (100) concurrent
- **Total**: 320 requests
- **Result**: 100% success, **320/320 permanent hits**, 0 new browsers
- **Memory**: 269 MB → peak 1533 MB → final 993 MB
- **Latency**: P99 at 100 concurrent = 34s (expected with single browser)
### Test 5: Pool Stress (Mixed Configs)
- 20 requests with 4 different viewport configs
- **Result**: 4 new browsers, 4 cold hits, **4 promotions to hot**, 8 hot hits
- **Reuse Rate**: 60% (12 pool hits / 20 requests)
- **Memory**: 270 MB → 928 MB peak (+658 MB = ~165 MB per browser)
- **Proves**: Cold → hot promotion at 3 uses working perfectly
### Test 6: Multi-Endpoint
- 10 requests each: `/html`, `/screenshot`, `/pdf`, `/crawl`
- **Result**: 100% success across all 4 endpoints
- **Latency**: 5-8s avg (PDF slowest at 7.2s)
### Test 7: Cleanup Verification
- 20 requests (load spike) → 90s idle
- **Memory**: 269 MB → peak 1107 MB → final 780 MB
- **Recovery**: 327 MB (39%) - partial cleanup
- **Note**: Hot pool browsers persist (by design), janitor working correctly
## Performance Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Pool Reuse | 0% | 100% (default config) | ∞ |
| Memory Leak | Unknown | 0 MB/cycle | Stable |
| Browser Reuse | No | Yes | ~3-5s saved per request |
| Idle Memory | 500-700 MB × N | 270-400 MB | 10x reduction |
| Concurrent Capacity | ~20 | 100+ | 5x |
## Key Learnings
1. **Config Signature Matching**: Permanent browser MUST match endpoint default config exactly (SHA1 hash)
2. **Logging Levels**: Pool diagnostics need INFO level, not DEBUG
3. **Memory in Docker**: Must read cgroup files, not host metrics
4. **Janitor Timing**: 60s interval adequate, but TTLs should be short (5min) for cold pool
5. **Hot Promotion**: 3-use threshold works well for production patterns
6. **Memory Per Browser**: ~150-200 MB per Chromium instance with headless + text_mode
## Test Infrastructure
**Location**: `deploy/docker/tests/`
**Dependencies**: `httpx`, `docker` (Python SDK)
**Pattern**: Sequential build - each test adds one capability
**Files**:
- `test_1_basic.py`: Health check + container lifecycle
- `test_2_memory.py`: + Docker stats monitoring
- `test_3_pool.py`: + Log analysis for pool markers
- `test_4_concurrent.py`: + asyncio.Semaphore for concurrency control
- `test_5_pool_stress.py`: + Config variants (viewports)
- `test_6_multi_endpoint.py`: + Multiple endpoint testing
- `test_7_cleanup.py`: + Time-series memory tracking for janitor
**Run Pattern**:
```bash
cd deploy/docker/tests
pip install -r requirements.txt
# Rebuild after code changes:
cd /path/to/repo && docker buildx build -t crawl4ai-local:latest --load .
# Run test:
python test_N_name.py
```
## Architecture Decisions
**Why Permanent Browser?**
- 90% of requests use default config → single browser serves most traffic
- Eliminates 3-5s startup overhead per request
**Why 3-Tier Pool?**
- Permanent: Zero cost for common case
- Hot: Amortized cost for frequent variants
- Cold: Lazy allocation for rare configs
**Why Adaptive Janitor?**
- Memory pressure triggers aggressive cleanup
- Low memory allows longer TTLs for better reuse
**Why Not Close After Each Request?**
- Browser startup: 3-5s overhead
- Pool reuse: <100ms overhead
- Net: 30-50x faster
## Future Optimizations
1. **Request Queuing**: When at capacity, queue instead of reject
2. **Pre-warming**: Predict common configs, pre-create browsers
3. **Metrics Export**: Prometheus metrics for pool efficiency
4. **Config Normalization**: Group similar viewports (e.g., 1920±50 → 1920)
## Critical Code Paths
**Browser Acquisition** (`crawler_pool.py:34-78`):
```
get_crawler(cfg) →
_sig(cfg) →
if sig == DEFAULT_CONFIG_SIG → PERMANENT
elif sig in HOT_POOL → HOT_POOL[sig]
elif sig in COLD_POOL → promote if count >= 3
else → create new in COLD_POOL
```
**Janitor Loop** (`crawler_pool.py:107-146`):
```
while True:
mem% = get_container_memory_percent()
if mem% > 80: interval=10s, cold_ttl=30s
elif mem% > 60: interval=30s, cold_ttl=60s
else: interval=60s, cold_ttl=300s
sleep(interval)
close idle browsers (COLD then HOT)
```
**Endpoint Pattern** (`server.py` example):
```python
@app.post("/html")
async def generate_html(...):
from crawler_pool import get_crawler
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
# No crawler.close() - returned to pool
```
## Debugging Tips
**Check Pool Activity**:
```bash
docker logs crawl4ai-test | grep -E "(🔥|♨️|❄️|🆕|⬆️)"
```
**Verify Config Signature**:
```python
from crawl4ai import BrowserConfig
import json, hashlib
cfg = BrowserConfig(...)
sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest()
print(sig[:8]) # Compare with logs
```
**Monitor Memory**:
```bash
docker stats crawl4ai-test
```
## Known Limitations
- **Mac Docker Stats**: CPU metrics unreliable, memory works
- **PDF Generation**: Slowest endpoint (~7s), no optimization yet
- **Hot Pool Persistence**: May hold memory longer than needed (trade-off for performance)
- **Janitor Lag**: Up to 60s before cleanup triggers in low-memory scenarios

View File

@@ -0,0 +1,378 @@
# Webhook Feature Examples
This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.
## Overview
The webhook feature allows you to receive notifications when crawl jobs complete, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to ensure reliable delivery.
## Configuration
### Global Configuration (config.yml)
You can configure default webhook settings in `config.yml`:
```yaml
webhooks:
enabled: true
default_url: null # Optional: default webhook URL for all jobs
data_in_payload: false # Optional: default behavior for including data
retry:
max_attempts: 5
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
max_delay_ms: 32000
timeout_ms: 30000 # 30s timeout per webhook call
headers: # Optional: default headers to include
User-Agent: "Crawl4AI-Webhook/1.0"
```
## API Usage Examples
### Example 1: Basic Webhook (Notification Only)
Send a webhook notification without including the crawl data in the payload.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false
}
}'
```
**Response:**
```json
{
"task_id": "crawl_a1b2c3d4"
}
```
**Webhook Payload Received:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"]
}
```
Your webhook handler should then fetch the results:
```bash
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
```
### Example 2: Webhook with Data Included
Include the full crawl results in the webhook payload.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": true
}
}'
```
**Webhook Payload Received:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"data": {
"markdown": "...",
"html": "...",
"links": {...},
"metadata": {...}
}
}
```
### Example 3: Webhook with Custom Headers
Include custom headers for authentication or identification.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "my-secret-token",
"X-Service-ID": "crawl4ai-production"
}
}
}'
```
The webhook will be sent with these additional headers plus the default headers from config.
### Example 4: Failure Notification
When a crawl job fails, a webhook is sent with error details.
**Webhook Payload on Failure:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30s"
}
```
### Example 5: Using Global Default Webhook
If you set a `default_url` in config.yml, jobs without webhook_config will use it:
**config.yml:**
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default"
data_in_payload: false
```
**Request (no webhook_config needed):**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"]
}'
```
The webhook will be sent to the default URL configured in config.yml.
### Example 6: LLM Extraction Job with Webhook
Use webhooks with the LLM extraction endpoint for asynchronous processing.
**Request:**
```bash
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and publication date",
"schema": "{\"type\": \"object\", \"properties\": {\"title\": {\"type\": \"string\"}, \"author\": {\"type\": \"string\"}, \"date\": {\"type\": \"string\"}}}",
"cache": false,
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true
}
}'
```
**Response:**
```json
{
"task_id": "llm_1698765432_12345"
}
```
**Webhook Payload Received:**
```json
{
"task_id": "llm_1698765432_12345",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-21"
}
}
}
```
## Webhook Handler Example
Here's a simple Python Flask webhook handler that supports both crawl and LLM extraction jobs:
```python
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
@app.route('/webhooks/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
if status == 'completed':
# If data not in payload, fetch it
if 'data' not in payload:
# Determine endpoint based on task type
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
data = response.json()
else:
data = payload['data']
# Process based on task type
if task_type == 'crawl':
print(f"Processing crawl results for {task_id}")
# Handle crawl results
results = data.get('results', [])
for result in results:
print(f" - {result.get('url')}: {len(result.get('markdown', ''))} chars")
elif task_type == 'llm_extraction':
print(f"Processing LLM extraction for {task_id}")
# Handle LLM extraction
# Note: Webhook sends 'extracted_content', API returns 'result'
extracted = data.get('extracted_content', data.get('result', {}))
print(f" - Extracted: {extracted}")
# Your business logic here...
elif status == 'failed':
error = payload.get('error', 'Unknown error')
print(f"{task_type} job {task_id} failed: {error}")
# Handle failure...
return jsonify({"status": "received"}), 200
if __name__ == '__main__':
app.run(port=8080)
```
## Retry Logic
The webhook delivery service uses exponential backoff retry logic:
- **Attempts:** Up to 5 attempts by default
- **Delays:** 1s → 2s → 4s → 8s → 16s
- **Timeout:** 30 seconds per attempt
- **Retry Conditions:**
- Server errors (5xx status codes)
- Network errors
- Timeouts
- **No Retry:**
- Client errors (4xx status codes)
- Successful delivery (2xx status codes)
## Benefits
1. **No Polling Required** - Eliminates constant API calls to check job status
2. **Real-time Notifications** - Immediate notification when jobs complete
3. **Reliable Delivery** - Exponential backoff ensures webhooks are delivered
4. **Flexible** - Choose between notification-only or full data delivery
5. **Secure** - Support for custom headers for authentication
6. **Configurable** - Global defaults or per-job configuration
7. **Universal Support** - Works with both `/crawl/job` and `/llm/job` endpoints
## TypeScript Client Example
```typescript
interface WebhookConfig {
webhook_url: string;
webhook_data_in_payload?: boolean;
webhook_headers?: Record<string, string>;
}
interface CrawlJobRequest {
urls: string[];
browser_config?: Record<string, any>;
crawler_config?: Record<string, any>;
webhook_config?: WebhookConfig;
}
interface LLMJobRequest {
url: string;
q: string;
schema?: string;
cache?: boolean;
provider?: string;
webhook_config?: WebhookConfig;
}
async function createCrawlJob(request: CrawlJobRequest) {
const response = await fetch('http://localhost:11235/crawl/job', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(request)
});
const { task_id } = await response.json();
return task_id;
}
async function createLLMJob(request: LLMJobRequest) {
const response = await fetch('http://localhost:11235/llm/job', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(request)
});
const { task_id } = await response.json();
return task_id;
}
// Usage - Crawl Job
const crawlTaskId = await createCrawlJob({
urls: ['https://example.com'],
webhook_config: {
webhook_url: 'https://myapp.com/webhooks/crawl-complete',
webhook_data_in_payload: false,
webhook_headers: {
'X-Webhook-Secret': 'my-secret'
}
}
});
// Usage - LLM Extraction Job
const llmTaskId = await createLLMJob({
url: 'https://example.com/article',
q: 'Extract the main points from this article',
provider: 'openai/gpt-4o-mini',
webhook_config: {
webhook_url: 'https://myapp.com/webhooks/llm-complete',
webhook_data_in_payload: true,
webhook_headers: {
'X-Webhook-Secret': 'my-secret'
}
}
});
```
## Monitoring and Debugging
Webhook delivery attempts are logged at INFO level:
- Successful deliveries
- Retry attempts with delays
- Final failures after max attempts
Check the application logs for webhook delivery status:
```bash
docker logs crawl4ai-container | grep -i webhook
```

File diff suppressed because it is too large Load Diff

View File

@@ -3,7 +3,7 @@ app:
title: "Crawl4AI API"
version: "1.0.0"
host: "0.0.0.0"
port: 11234
port: 11235
reload: False
workers: 1
timeout_keep_alive: 300
@@ -37,6 +37,10 @@ rate_limiting:
storage_uri: "memory://" # Use "redis://localhost:6379" for production
# Security Configuration
# WARNING: For production deployments, enable security and use proper SECRET_KEY:
# - Set jwt_enabled: true for authentication
# - Set SECRET_KEY environment variable to a secure random value
# - Set CRAWL4AI_HOOKS_ENABLED=true only if you need hooks (RCE risk)
security:
enabled: false
jwt_enabled: false
@@ -61,7 +65,7 @@ crawler:
batch_process: 300.0 # Timeout for batch processing
pool:
max_pages: 40 # ← GLOBAL_SEM permits
idle_ttl_sec: 1800 # ← 30 min janitor cutoff
idle_ttl_sec: 300 # ← 30 min janitor cutoff
browser:
kwargs:
headless: true
@@ -87,4 +91,17 @@ observability:
enabled: True
endpoint: "/metrics"
health_check:
endpoint: "/health"
endpoint: "/health"
# Webhook Configuration
webhooks:
enabled: true
default_url: null # Optional: default webhook URL for all jobs
data_in_payload: false # Optional: default behavior for including data
retry:
max_attempts: 5
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
max_delay_ms: 32000
timeout_ms: 30000 # 30s timeout per webhook call
headers: # Optional: default headers to include
User-Agent: "Crawl4AI-Webhook/1.0"

View File

@@ -1,119 +1,170 @@
# crawler_pool.py (new file)
import asyncio
import hashlib
import json
import time
# crawler_pool.py - Smart browser pool with tiered management
import asyncio, json, hashlib, time
from contextlib import suppress
from typing import Dict, Optional
import psutil
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Import browser adapters with fallback
try:
from crawl4ai.browser_adapter import BrowserAdapter, PlaywrightAdapter
except ImportError:
# Fallback for development environment
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from crawl4ai.browser_adapter import BrowserAdapter, PlaywrightAdapter
from utils import load_config
from utils import load_config, get_container_memory_percent
import logging
logger = logging.getLogger(__name__)
CONFIG = load_config()
POOL: Dict[str, AsyncWebCrawler] = {}
# Pool tiers
PERMANENT: Optional[AsyncWebCrawler] = None # Always-ready default browser
HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs
COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs
LAST_USED: Dict[str, float] = {}
USAGE_COUNT: Dict[str, int] = {}
LOCK = asyncio.Lock()
MEM_LIMIT = CONFIG.get("crawler", {}).get(
"memory_threshold_percent", 95.0
) # % RAM refuse new browsers above this
IDLE_TTL = (
CONFIG.get("crawler", {}).get("pool", {}).get("idle_ttl_sec", 1800)
) # close if unused for 30min
# Config
MEM_LIMIT = CONFIG.get("crawler", {}).get("memory_threshold_percent", 95.0)
BASE_IDLE_TTL = CONFIG.get("crawler", {}).get("pool", {}).get("idle_ttl_sec", 300)
DEFAULT_CONFIG_SIG = None # Cached sig for default config
def _sig(cfg: BrowserConfig, adapter: Optional[BrowserAdapter] = None) -> str:
try:
config_payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",", ":"))
except (TypeError, ValueError):
# Fallback to string representation if JSON serialization fails
config_payload = str(cfg.to_dict())
adapter_name = adapter.__class__.__name__ if adapter else "PlaywrightAdapter"
payload = f"{config_payload}:{adapter_name}"
def _sig(cfg: BrowserConfig) -> str:
"""Generate config signature."""
payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
return hashlib.sha1(payload.encode()).hexdigest()
def _is_default_config(sig: str) -> bool:
"""Check if config matches default."""
return sig == DEFAULT_CONFIG_SIG
async def get_crawler(
cfg: BrowserConfig, adapter: Optional[BrowserAdapter] = None
) -> AsyncWebCrawler:
sig = None
try:
sig = _sig(cfg, adapter)
async with LOCK:
if sig in POOL:
LAST_USED[sig] = time.time()
return POOL[sig]
if psutil.virtual_memory().percent >= MEM_LIMIT:
raise MemoryError("RAM pressure new browser denied")
# Create crawler - let it initialize the strategy with proper logger
# Pass browser_adapter as a kwarg so AsyncWebCrawler can use it when creating the strategy
crawler = AsyncWebCrawler(
config=cfg,
thread_safe=False
)
# Set the browser adapter on the strategy after crawler initialization
if adapter:
# Create a new strategy with the adapter and the crawler's logger
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
crawler.crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=cfg,
logger=crawler.logger,
browser_adapter=adapter
)
await crawler.start()
POOL[sig] = crawler
async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
"""Get crawler from pool with tiered strategy."""
sig = _sig(cfg)
async with LOCK:
# Check permanent browser for default config
if PERMANENT and _is_default_config(sig):
LAST_USED[sig] = time.time()
return crawler
except MemoryError as e:
raise MemoryError(f"RAM pressure new browser denied: {e}")
except Exception as e:
raise RuntimeError(f"Failed to start browser: {e}")
finally:
if sig:
if sig in POOL:
LAST_USED[sig] = time.time()
else:
# If we failed to start the browser, we should remove it from the pool
POOL.pop(sig, None)
LAST_USED.pop(sig, None)
# If we failed to start the browser, we should remove it from the pool
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
logger.info("🔥 Using permanent browser")
return PERMANENT
# Check hot pool
if sig in HOT_POOL:
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
logger.info(f"♨️ Using hot pool browser (sig={sig[:8]})")
return HOT_POOL[sig]
# Check cold pool (promote to hot if used 3+ times)
if sig in COLD_POOL:
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
if USAGE_COUNT[sig] >= 3:
logger.info(f"⬆️ Promoting to hot pool (sig={sig[:8]}, count={USAGE_COUNT[sig]})")
HOT_POOL[sig] = COLD_POOL.pop(sig)
# Track promotion in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("promote", sig, {"count": USAGE_COUNT[sig]})
except:
pass
return HOT_POOL[sig]
logger.info(f"❄️ Using cold pool browser (sig={sig[:8]})")
return COLD_POOL[sig]
# Memory check before creating new
mem_pct = get_container_memory_percent()
if mem_pct >= MEM_LIMIT:
logger.error(f"💥 Memory pressure: {mem_pct:.1f}% >= {MEM_LIMIT}%")
raise MemoryError(f"Memory at {mem_pct:.1f}%, refusing new browser")
# Create new in cold pool
logger.info(f"🆕 Creating new browser in cold pool (sig={sig[:8]}, mem={mem_pct:.1f}%)")
crawler = AsyncWebCrawler(config=cfg, thread_safe=False)
await crawler.start()
COLD_POOL[sig] = crawler
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = 1
return crawler
async def init_permanent(cfg: BrowserConfig):
"""Initialize permanent default browser."""
global PERMANENT, DEFAULT_CONFIG_SIG
async with LOCK:
if PERMANENT:
return
DEFAULT_CONFIG_SIG = _sig(cfg)
logger.info("🔥 Creating permanent default browser")
PERMANENT = AsyncWebCrawler(config=cfg, thread_safe=False)
await PERMANENT.start()
LAST_USED[DEFAULT_CONFIG_SIG] = time.time()
USAGE_COUNT[DEFAULT_CONFIG_SIG] = 0
async def close_all():
"""Close all browsers."""
async with LOCK:
await asyncio.gather(
*(c.close() for c in POOL.values()), return_exceptions=True
)
POOL.clear()
tasks = []
if PERMANENT:
tasks.append(PERMANENT.close())
tasks.extend([c.close() for c in HOT_POOL.values()])
tasks.extend([c.close() for c in COLD_POOL.values()])
await asyncio.gather(*tasks, return_exceptions=True)
HOT_POOL.clear()
COLD_POOL.clear()
LAST_USED.clear()
USAGE_COUNT.clear()
async def janitor():
"""Adaptive cleanup based on memory pressure."""
while True:
await asyncio.sleep(60)
mem_pct = get_container_memory_percent()
# Adaptive intervals and TTLs
if mem_pct > 80:
interval, cold_ttl, hot_ttl = 10, 30, 120
elif mem_pct > 60:
interval, cold_ttl, hot_ttl = 30, 60, 300
else:
interval, cold_ttl, hot_ttl = 60, BASE_IDLE_TTL, BASE_IDLE_TTL * 2
await asyncio.sleep(interval)
now = time.time()
async with LOCK:
for sig, crawler in list(POOL.items()):
if now - LAST_USED[sig] > IDLE_TTL:
# Clean cold pool
for sig in list(COLD_POOL.keys()):
if now - LAST_USED.get(sig, now) > cold_ttl:
idle_time = now - LAST_USED[sig]
logger.info(f"🧹 Closing cold browser (sig={sig[:8]}, idle={idle_time:.0f}s)")
with suppress(Exception):
await crawler.close()
POOL.pop(sig, None)
await COLD_POOL[sig].close()
COLD_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
# Track in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("close_cold", sig, {"idle_seconds": int(idle_time), "ttl": cold_ttl})
except:
pass
# Clean hot pool (more conservative)
for sig in list(HOT_POOL.keys()):
if now - LAST_USED.get(sig, now) > hot_ttl:
idle_time = now - LAST_USED[sig]
logger.info(f"🧹 Closing hot browser (sig={sig[:8]}, idle={idle_time:.0f}s)")
with suppress(Exception):
await HOT_POOL[sig].close()
HOT_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
# Track in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("close_hot", sig, {"idle_seconds": int(idle_time), "ttl": hot_ttl})
except:
pass
# Log pool stats
if mem_pct > 60:
logger.info(f"📊 Pool: hot={len(HOT_POOL)}, cold={len(COLD_POOL)}, mem={mem_pct:.1f}%")

View File

@@ -117,18 +117,18 @@ class UserHookManager:
"""
try:
# Create a safe namespace for the hook
# Use a more complete builtins that includes __import__
# SECURITY: No __import__ to prevent arbitrary module imports (RCE risk)
import builtins
safe_builtins = {}
# Add safe built-in functions
# Add safe built-in functions (no __import__ for security)
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__import__', '__build_class__' # Required for exec
'__build_class__' # Required for class definitions in exec
]
for name in allowed_builtins:

View File

@@ -12,6 +12,7 @@ from api import (
handle_crawl_job,
handle_task_status,
)
from schemas import WebhookConfig
# ------------- dependency placeholders -------------
_redis = None # will be injected from server.py
@@ -37,15 +38,16 @@ class LlmJobPayload(BaseModel):
schema: Optional[str] = None
cache: bool = False
provider: Optional[str] = None
webhook_config: Optional[WebhookConfig] = None
temperature: Optional[float] = None
base_url: Optional[str] = None
chunking_strategy: Optional[Dict] = None
class CrawlJobPayload(BaseModel):
urls: list[HttpUrl]
browser_config: Dict = {}
crawler_config: Dict = {}
webhook_config: Optional[WebhookConfig] = None
# ---------- LLM job ---------------------------------------------------------
@@ -56,6 +58,10 @@ async def llm_job_enqueue(
request: Request,
_td: Dict = Depends(lambda: _token_dep()), # late-bound dep
):
webhook_config = None
if payload.webhook_config:
webhook_config = payload.webhook_config.model_dump(mode='json')
return await handle_llm_request(
_redis,
background_tasks,
@@ -66,9 +72,9 @@ async def llm_job_enqueue(
cache=payload.cache,
config=_config,
provider=payload.provider,
webhook_config=webhook_config,
temperature=payload.temperature,
api_base_url=payload.base_url,
chunking_strategy_config=payload.chunking_strategy,
)
@@ -88,6 +94,10 @@ async def crawl_job_enqueue(
background_tasks: BackgroundTasks,
_td: Dict = Depends(lambda: _token_dep()),
):
webhook_config = None
if payload.webhook_config:
webhook_config = payload.webhook_config.model_dump(mode='json')
return await handle_crawl_job(
_redis,
background_tasks,
@@ -95,6 +105,7 @@ async def crawl_job_enqueue(
payload.browser_config,
payload.crawler_config,
config=_config,
webhook_config=webhook_config,
)

382
deploy/docker/monitor.py Normal file
View File

@@ -0,0 +1,382 @@
# monitor.py - Real-time monitoring stats with Redis persistence
import time
import json
import asyncio
from typing import Dict, List, Optional
from datetime import datetime, timezone
from collections import deque
from redis import asyncio as aioredis
from utils import get_container_memory_percent
import psutil
import logging
logger = logging.getLogger(__name__)
class MonitorStats:
"""Tracks real-time server stats with Redis persistence."""
def __init__(self, redis: aioredis.Redis):
self.redis = redis
self.start_time = time.time()
# In-memory queues (fast reads, Redis backup)
self.active_requests: Dict[str, Dict] = {} # id -> request info
self.completed_requests: deque = deque(maxlen=100) # Last 100
self.janitor_events: deque = deque(maxlen=100)
self.errors: deque = deque(maxlen=100)
# Endpoint stats (persisted in Redis)
self.endpoint_stats: Dict[str, Dict] = {} # endpoint -> {count, total_time, errors, ...}
# Background persistence queue (max 10 pending persist requests)
self._persist_queue: asyncio.Queue = asyncio.Queue(maxsize=10)
self._persist_worker_task: Optional[asyncio.Task] = None
# Timeline data (5min window, 5s resolution = 60 points)
self.memory_timeline: deque = deque(maxlen=60)
self.requests_timeline: deque = deque(maxlen=60)
self.browser_timeline: deque = deque(maxlen=60)
async def track_request_start(self, request_id: str, endpoint: str, url: str, config: Dict = None):
"""Track new request start."""
req_info = {
"id": request_id,
"endpoint": endpoint,
"url": url[:100], # Truncate long URLs
"start_time": time.time(),
"config_sig": config.get("sig", "default") if config else "default",
"mem_start": psutil.Process().memory_info().rss / (1024 * 1024)
}
self.active_requests[request_id] = req_info
# Increment endpoint counter
if endpoint not in self.endpoint_stats:
self.endpoint_stats[endpoint] = {
"count": 0, "total_time": 0, "errors": 0,
"pool_hits": 0, "success": 0
}
self.endpoint_stats[endpoint]["count"] += 1
# Queue persistence (handled by background worker)
try:
self._persist_queue.put_nowait(True)
except asyncio.QueueFull:
logger.warning("Persistence queue full, skipping")
async def track_request_end(self, request_id: str, success: bool, error: str = None,
pool_hit: bool = True, status_code: int = 200):
"""Track request completion."""
if request_id not in self.active_requests:
return
req_info = self.active_requests.pop(request_id)
end_time = time.time()
elapsed = end_time - req_info["start_time"]
mem_end = psutil.Process().memory_info().rss / (1024 * 1024)
mem_delta = mem_end - req_info["mem_start"]
# Update stats
endpoint = req_info["endpoint"]
if endpoint in self.endpoint_stats:
self.endpoint_stats[endpoint]["total_time"] += elapsed
if success:
self.endpoint_stats[endpoint]["success"] += 1
else:
self.endpoint_stats[endpoint]["errors"] += 1
if pool_hit:
self.endpoint_stats[endpoint]["pool_hits"] += 1
# Add to completed queue
completed = {
**req_info,
"end_time": end_time,
"elapsed": round(elapsed, 2),
"mem_delta": round(mem_delta, 1),
"success": success,
"error": error,
"status_code": status_code,
"pool_hit": pool_hit
}
self.completed_requests.append(completed)
# Track errors
if not success and error:
self.errors.append({
"timestamp": end_time,
"endpoint": endpoint,
"url": req_info["url"],
"error": error,
"request_id": request_id
})
await self._persist_endpoint_stats()
async def track_janitor_event(self, event_type: str, sig: str, details: Dict):
"""Track janitor cleanup events."""
self.janitor_events.append({
"timestamp": time.time(),
"type": event_type, # "close_cold", "close_hot", "promote"
"sig": sig[:8],
"details": details
})
def _cleanup_old_entries(self, max_age_seconds: int = 300):
"""Remove entries older than max_age_seconds (default 5min)."""
now = time.time()
cutoff = now - max_age_seconds
# Clean completed requests
while self.completed_requests and self.completed_requests[0].get("end_time", 0) < cutoff:
self.completed_requests.popleft()
# Clean janitor events
while self.janitor_events and self.janitor_events[0].get("timestamp", 0) < cutoff:
self.janitor_events.popleft()
# Clean errors
while self.errors and self.errors[0].get("timestamp", 0) < cutoff:
self.errors.popleft()
async def update_timeline(self):
"""Update timeline data points (called every 5s)."""
now = time.time()
mem_pct = get_container_memory_percent()
# Clean old entries (keep last 5 minutes)
self._cleanup_old_entries(max_age_seconds=300)
# Count requests in last 5s
recent_reqs = sum(1 for req in self.completed_requests
if now - req.get("end_time", 0) < 5)
# Browser counts (acquire lock to prevent race conditions)
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LOCK
async with LOCK:
browser_count = {
"permanent": 1 if PERMANENT else 0,
"hot": len(HOT_POOL),
"cold": len(COLD_POOL)
}
self.memory_timeline.append({"time": now, "value": mem_pct})
self.requests_timeline.append({"time": now, "value": recent_reqs})
self.browser_timeline.append({"time": now, "browsers": browser_count})
async def _persist_endpoint_stats(self):
"""Persist endpoint stats to Redis."""
try:
await self.redis.set(
"monitor:endpoint_stats",
json.dumps(self.endpoint_stats),
ex=86400 # 24h TTL
)
except Exception as e:
logger.warning(f"Failed to persist endpoint stats: {e}")
async def _persistence_worker(self):
"""Background worker to persist stats to Redis."""
while True:
try:
await self._persist_queue.get()
await self._persist_endpoint_stats()
self._persist_queue.task_done()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Persistence worker error: {e}")
def start_persistence_worker(self):
"""Start the background persistence worker."""
if not self._persist_worker_task:
self._persist_worker_task = asyncio.create_task(self._persistence_worker())
logger.info("Started persistence worker")
async def stop_persistence_worker(self):
"""Stop the background persistence worker."""
if self._persist_worker_task:
self._persist_worker_task.cancel()
try:
await self._persist_worker_task
except asyncio.CancelledError:
pass
self._persist_worker_task = None
logger.info("Stopped persistence worker")
async def cleanup(self):
"""Cleanup on shutdown - persist final stats and stop workers."""
logger.info("Monitor cleanup starting...")
try:
# Persist final stats before shutdown
await self._persist_endpoint_stats()
# Stop background worker
await self.stop_persistence_worker()
logger.info("Monitor cleanup completed")
except Exception as e:
logger.error(f"Monitor cleanup error: {e}")
async def load_from_redis(self):
"""Load persisted stats from Redis."""
try:
data = await self.redis.get("monitor:endpoint_stats")
if data:
self.endpoint_stats = json.loads(data)
logger.info("Loaded endpoint stats from Redis")
except Exception as e:
logger.warning(f"Failed to load from Redis: {e}")
async def get_health_summary(self) -> Dict:
"""Get current system health snapshot."""
mem_pct = get_container_memory_percent()
cpu_pct = psutil.cpu_percent(interval=0.1)
# Network I/O (delta since last call)
net = psutil.net_io_counters()
# Pool status (acquire lock to prevent race conditions)
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LOCK
async with LOCK:
# TODO: Track actual browser process memory instead of estimates
# These are conservative estimates based on typical Chromium usage
permanent_mem = 270 if PERMANENT else 0 # Estimate: ~270MB for permanent browser
hot_mem = len(HOT_POOL) * 180 # Estimate: ~180MB per hot pool browser
cold_mem = len(COLD_POOL) * 180 # Estimate: ~180MB per cold pool browser
permanent_active = PERMANENT is not None
hot_count = len(HOT_POOL)
cold_count = len(COLD_POOL)
return {
"container": {
"memory_percent": round(mem_pct, 1),
"cpu_percent": round(cpu_pct, 1),
"network_sent_mb": round(net.bytes_sent / (1024**2), 2),
"network_recv_mb": round(net.bytes_recv / (1024**2), 2),
"uptime_seconds": int(time.time() - self.start_time)
},
"pool": {
"permanent": {"active": permanent_active, "memory_mb": permanent_mem},
"hot": {"count": hot_count, "memory_mb": hot_mem},
"cold": {"count": cold_count, "memory_mb": cold_mem},
"total_memory_mb": permanent_mem + hot_mem + cold_mem
},
"janitor": {
"next_cleanup_estimate": "adaptive", # Would need janitor state
"memory_pressure": "LOW" if mem_pct < 60 else "MEDIUM" if mem_pct < 80 else "HIGH"
}
}
def get_active_requests(self) -> List[Dict]:
"""Get list of currently active requests."""
now = time.time()
return [
{
**req,
"elapsed": round(now - req["start_time"], 1),
"status": "running"
}
for req in self.active_requests.values()
]
def get_completed_requests(self, limit: int = 50, filter_status: str = "all") -> List[Dict]:
"""Get recent completed requests."""
requests = list(self.completed_requests)[-limit:]
if filter_status == "success":
requests = [r for r in requests if r.get("success")]
elif filter_status == "error":
requests = [r for r in requests if not r.get("success")]
return requests
async def get_browser_list(self) -> List[Dict]:
"""Get detailed browser pool information."""
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LAST_USED, USAGE_COUNT, DEFAULT_CONFIG_SIG, LOCK
browsers = []
now = time.time()
# Acquire lock to prevent race conditions during iteration
async with LOCK:
if PERMANENT:
browsers.append({
"type": "permanent",
"sig": DEFAULT_CONFIG_SIG[:8] if DEFAULT_CONFIG_SIG else "unknown",
"age_seconds": int(now - self.start_time),
"last_used_seconds": int(now - LAST_USED.get(DEFAULT_CONFIG_SIG, now)),
"memory_mb": 270,
"hits": USAGE_COUNT.get(DEFAULT_CONFIG_SIG, 0),
"killable": False
})
for sig, crawler in HOT_POOL.items():
browsers.append({
"type": "hot",
"sig": sig[:8],
"age_seconds": int(now - self.start_time), # Approximation
"last_used_seconds": int(now - LAST_USED.get(sig, now)),
"memory_mb": 180, # Estimate
"hits": USAGE_COUNT.get(sig, 0),
"killable": True
})
for sig, crawler in COLD_POOL.items():
browsers.append({
"type": "cold",
"sig": sig[:8],
"age_seconds": int(now - self.start_time),
"last_used_seconds": int(now - LAST_USED.get(sig, now)),
"memory_mb": 180,
"hits": USAGE_COUNT.get(sig, 0),
"killable": True
})
return browsers
def get_endpoint_stats_summary(self) -> Dict[str, Dict]:
"""Get aggregated endpoint statistics."""
summary = {}
for endpoint, stats in self.endpoint_stats.items():
count = stats["count"]
avg_time = (stats["total_time"] / count) if count > 0 else 0
success_rate = (stats["success"] / count * 100) if count > 0 else 0
pool_hit_rate = (stats["pool_hits"] / count * 100) if count > 0 else 0
summary[endpoint] = {
"count": count,
"avg_latency_ms": round(avg_time * 1000, 1),
"success_rate_percent": round(success_rate, 1),
"pool_hit_rate_percent": round(pool_hit_rate, 1),
"errors": stats["errors"]
}
return summary
def get_timeline_data(self, metric: str, window: str = "5m") -> Dict:
"""Get timeline data for charts."""
# For now, only 5m window supported
if metric == "memory":
data = list(self.memory_timeline)
elif metric == "requests":
data = list(self.requests_timeline)
elif metric == "browsers":
data = list(self.browser_timeline)
else:
return {"timestamps": [], "values": []}
return {
"timestamps": [int(d["time"]) for d in data],
"values": [d.get("value", d.get("browsers")) for d in data]
}
def get_janitor_log(self, limit: int = 100) -> List[Dict]:
"""Get recent janitor events."""
return list(self.janitor_events)[-limit:]
def get_errors_log(self, limit: int = 100) -> List[Dict]:
"""Get recent errors."""
return list(self.errors)[-limit:]
# Global instance (initialized in server.py)
monitor_stats: Optional[MonitorStats] = None
def get_monitor() -> MonitorStats:
"""Get global monitor instance."""
if monitor_stats is None:
raise RuntimeError("Monitor not initialized")
return monitor_stats

View File

@@ -0,0 +1,405 @@
# monitor_routes.py - Monitor API endpoints
from fastapi import APIRouter, HTTPException, WebSocket, WebSocketDisconnect
from pydantic import BaseModel
from typing import Optional
from monitor import get_monitor
import logging
import asyncio
import json
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/monitor", tags=["monitor"])
@router.get("/health")
async def get_health():
"""Get current system health snapshot."""
try:
monitor = get_monitor()
return await monitor.get_health_summary()
except Exception as e:
logger.error(f"Error getting health: {e}")
raise HTTPException(500, str(e))
@router.get("/requests")
async def get_requests(status: str = "all", limit: int = 50):
"""Get active and completed requests.
Args:
status: Filter by 'active', 'completed', 'success', 'error', or 'all'
limit: Max number of completed requests to return (default 50)
"""
# Input validation
if status not in ["all", "active", "completed", "success", "error"]:
raise HTTPException(400, f"Invalid status: {status}. Must be one of: all, active, completed, success, error")
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
if status == "active":
return {"active": monitor.get_active_requests(), "completed": []}
elif status == "completed":
return {"active": [], "completed": monitor.get_completed_requests(limit)}
elif status in ["success", "error"]:
return {"active": [], "completed": monitor.get_completed_requests(limit, status)}
else: # "all"
return {
"active": monitor.get_active_requests(),
"completed": monitor.get_completed_requests(limit)
}
except Exception as e:
logger.error(f"Error getting requests: {e}")
raise HTTPException(500, str(e))
@router.get("/browsers")
async def get_browsers():
"""Get detailed browser pool information."""
try:
monitor = get_monitor()
browsers = await monitor.get_browser_list()
# Calculate summary stats
total_browsers = len(browsers)
total_memory = sum(b["memory_mb"] for b in browsers)
# Calculate reuse rate from recent requests
recent = monitor.get_completed_requests(100)
pool_hits = sum(1 for r in recent if r.get("pool_hit", False))
reuse_rate = (pool_hits / len(recent) * 100) if recent else 0
return {
"browsers": browsers,
"summary": {
"total_count": total_browsers,
"total_memory_mb": total_memory,
"reuse_rate_percent": round(reuse_rate, 1)
}
}
except Exception as e:
logger.error(f"Error getting browsers: {e}")
raise HTTPException(500, str(e))
@router.get("/endpoints/stats")
async def get_endpoint_stats():
"""Get aggregated endpoint statistics."""
try:
monitor = get_monitor()
return monitor.get_endpoint_stats_summary()
except Exception as e:
logger.error(f"Error getting endpoint stats: {e}")
raise HTTPException(500, str(e))
@router.get("/timeline")
async def get_timeline(metric: str = "memory", window: str = "5m"):
"""Get timeline data for charts.
Args:
metric: 'memory', 'requests', or 'browsers'
window: Time window (only '5m' supported for now)
"""
# Input validation
if metric not in ["memory", "requests", "browsers"]:
raise HTTPException(400, f"Invalid metric: {metric}. Must be one of: memory, requests, browsers")
if window != "5m":
raise HTTPException(400, f"Invalid window: {window}. Only '5m' is currently supported")
try:
monitor = get_monitor()
return monitor.get_timeline_data(metric, window)
except Exception as e:
logger.error(f"Error getting timeline: {e}")
raise HTTPException(500, str(e))
@router.get("/logs/janitor")
async def get_janitor_log(limit: int = 100):
"""Get recent janitor cleanup events."""
# Input validation
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
return {"events": monitor.get_janitor_log(limit)}
except Exception as e:
logger.error(f"Error getting janitor log: {e}")
raise HTTPException(500, str(e))
@router.get("/logs/errors")
async def get_errors_log(limit: int = 100):
"""Get recent errors."""
# Input validation
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
return {"errors": monitor.get_errors_log(limit)}
except Exception as e:
logger.error(f"Error getting errors log: {e}")
raise HTTPException(500, str(e))
# ========== Control Actions ==========
class KillBrowserRequest(BaseModel):
sig: str
@router.post("/actions/cleanup")
async def force_cleanup():
"""Force immediate janitor cleanup (kills idle cold pool browsers)."""
try:
from crawler_pool import COLD_POOL, LAST_USED, USAGE_COUNT, LOCK
import time
from contextlib import suppress
killed_count = 0
now = time.time()
async with LOCK:
for sig in list(COLD_POOL.keys()):
# Kill all cold pool browsers immediately
logger.info(f"🧹 Force cleanup: closing cold browser (sig={sig[:8]})")
with suppress(Exception):
await COLD_POOL[sig].close()
COLD_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
killed_count += 1
monitor = get_monitor()
await monitor.track_janitor_event("force_cleanup", "manual", {"killed": killed_count})
return {"success": True, "killed_browsers": killed_count}
except Exception as e:
logger.error(f"Error during force cleanup: {e}")
raise HTTPException(500, str(e))
@router.post("/actions/kill_browser")
async def kill_browser(req: KillBrowserRequest):
"""Kill a specific browser by signature (hot or cold only).
Args:
sig: Browser config signature (first 8 chars)
"""
try:
from crawler_pool import HOT_POOL, COLD_POOL, LAST_USED, USAGE_COUNT, LOCK, DEFAULT_CONFIG_SIG
from contextlib import suppress
# Find full signature matching prefix
target_sig = None
pool_type = None
async with LOCK:
# Check hot pool
for sig in HOT_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "hot"
break
# Check cold pool
if not target_sig:
for sig in COLD_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "cold"
break
# Check if trying to kill permanent
if DEFAULT_CONFIG_SIG and DEFAULT_CONFIG_SIG.startswith(req.sig):
raise HTTPException(403, "Cannot kill permanent browser. Use restart instead.")
if not target_sig:
raise HTTPException(404, f"Browser with sig={req.sig} not found")
# Warn if there are active requests (browser might be in use)
monitor = get_monitor()
active_count = len(monitor.get_active_requests())
if active_count > 0:
logger.warning(f"Killing browser {target_sig[:8]} while {active_count} requests are active - may cause failures")
# Kill the browser
if pool_type == "hot":
browser = HOT_POOL.pop(target_sig)
else:
browser = COLD_POOL.pop(target_sig)
with suppress(Exception):
await browser.close()
LAST_USED.pop(target_sig, None)
USAGE_COUNT.pop(target_sig, None)
logger.info(f"🔪 Killed {pool_type} browser (sig={target_sig[:8]})")
monitor = get_monitor()
await monitor.track_janitor_event("kill_browser", target_sig, {"pool": pool_type, "manual": True})
return {"success": True, "killed_sig": target_sig[:8], "pool_type": pool_type}
except HTTPException:
raise
except Exception as e:
logger.error(f"Error killing browser: {e}")
raise HTTPException(500, str(e))
@router.post("/actions/restart_browser")
async def restart_browser(req: KillBrowserRequest):
"""Restart a browser (kill + recreate). Works for permanent too.
Args:
sig: Browser config signature (first 8 chars), or "permanent"
"""
try:
from crawler_pool import (PERMANENT, HOT_POOL, COLD_POOL, LAST_USED,
USAGE_COUNT, LOCK, DEFAULT_CONFIG_SIG, init_permanent)
from crawl4ai import AsyncWebCrawler, BrowserConfig
from contextlib import suppress
import time
# Handle permanent browser restart
if req.sig == "permanent" or (DEFAULT_CONFIG_SIG and DEFAULT_CONFIG_SIG.startswith(req.sig)):
async with LOCK:
if PERMANENT:
with suppress(Exception):
await PERMANENT.close()
# Reinitialize permanent
from utils import load_config
config = load_config()
await init_permanent(BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
))
logger.info("🔄 Restarted permanent browser")
return {"success": True, "restarted": "permanent"}
# Handle hot/cold browser restart
target_sig = None
pool_type = None
browser_config = None
async with LOCK:
# Find browser
for sig in HOT_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "hot"
# Would need to reconstruct config (not stored currently)
break
if not target_sig:
for sig in COLD_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "cold"
break
if not target_sig:
raise HTTPException(404, f"Browser with sig={req.sig} not found")
# Kill existing
if pool_type == "hot":
browser = HOT_POOL.pop(target_sig)
else:
browser = COLD_POOL.pop(target_sig)
with suppress(Exception):
await browser.close()
# Note: We can't easily recreate with same config without storing it
# For now, just kill and let new requests create fresh ones
LAST_USED.pop(target_sig, None)
USAGE_COUNT.pop(target_sig, None)
logger.info(f"🔄 Restarted {pool_type} browser (sig={target_sig[:8]})")
monitor = get_monitor()
await monitor.track_janitor_event("restart_browser", target_sig, {"pool": pool_type})
return {"success": True, "restarted_sig": target_sig[:8], "note": "Browser will be recreated on next request"}
except HTTPException:
raise
except Exception as e:
logger.error(f"Error restarting browser: {e}")
raise HTTPException(500, str(e))
@router.post("/stats/reset")
async def reset_stats():
"""Reset today's endpoint counters."""
try:
monitor = get_monitor()
monitor.endpoint_stats.clear()
await monitor._persist_endpoint_stats()
return {"success": True, "message": "Endpoint stats reset"}
except Exception as e:
logger.error(f"Error resetting stats: {e}")
raise HTTPException(500, str(e))
@router.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
"""WebSocket endpoint for real-time monitoring updates.
Sends updates every 2 seconds with:
- Health stats
- Active/completed requests
- Browser pool status
- Timeline data
"""
await websocket.accept()
logger.info("WebSocket client connected")
try:
while True:
try:
# Gather all monitoring data
monitor = get_monitor()
data = {
"timestamp": asyncio.get_event_loop().time(),
"health": await monitor.get_health_summary(),
"requests": {
"active": monitor.get_active_requests(),
"completed": monitor.get_completed_requests(limit=10)
},
"browsers": await monitor.get_browser_list(),
"timeline": {
"memory": monitor.get_timeline_data("memory", "5m"),
"requests": monitor.get_timeline_data("requests", "5m"),
"browsers": monitor.get_timeline_data("browsers", "5m")
},
"janitor": monitor.get_janitor_log(limit=10),
"errors": monitor.get_errors_log(limit=10)
}
# Send update to client
await websocket.send_json(data)
# Wait 2 seconds before next update
await asyncio.sleep(2)
except WebSocketDisconnect:
logger.info("WebSocket client disconnected")
break
except Exception as e:
logger.error(f"WebSocket error: {e}", exc_info=True)
await asyncio.sleep(2) # Continue trying
except Exception as e:
logger.error(f"WebSocket connection error: {e}", exc_info=True)
finally:
logger.info("WebSocket connection closed")

View File

@@ -12,6 +12,6 @@ pydantic>=2.11
rank-bm25==0.2.2
anyio==4.9.0
PyJWT==2.10.1
mcp>=1.6.0
mcp>=1.18.0
websockets>=15.0.1
httpx[http2]>=0.27.2

View File

@@ -1,270 +0,0 @@
import uuid
from typing import Any, Dict
from fastapi import APIRouter, BackgroundTasks, HTTPException
from schemas import AdaptiveConfigPayload, AdaptiveCrawlRequest, AdaptiveJobStatus
from crawl4ai import AsyncWebCrawler
from crawl4ai.adaptive_crawler import AdaptiveConfig, AdaptiveCrawler
from crawl4ai.utils import get_error_context
# --- In-memory storage for job statuses. For production, use Redis or a database. ---
ADAPTIVE_JOBS: Dict[str, Dict[str, Any]] = {}
# --- APIRouter for Adaptive Crawling Endpoints ---
router = APIRouter(
prefix="/adaptive/digest",
tags=["Adaptive Crawling"],
)
# --- Background Worker Function ---
async def run_adaptive_digest(task_id: str, request: AdaptiveCrawlRequest):
"""The actual async worker that performs the adaptive crawl."""
try:
# Update job status to RUNNING
ADAPTIVE_JOBS[task_id]["status"] = "RUNNING"
# Create AdaptiveConfig from payload or use default
if request.config:
adaptive_config = AdaptiveConfig(**request.config.model_dump())
else:
adaptive_config = AdaptiveConfig()
# The adaptive crawler needs an instance of the web crawler
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config=adaptive_config)
# This is the long-running operation
final_state = await adaptive_crawler.digest(
start_url=request.start_url, query=request.query
)
# Process the final state into a clean result
result_data = {
"confidence": final_state.metrics.get("confidence", 0.0),
"is_sufficient": adaptive_crawler.is_sufficient,
"coverage_stats": adaptive_crawler.coverage_stats,
"relevant_content": adaptive_crawler.get_relevant_content(top_k=5),
}
# Update job with the final result
ADAPTIVE_JOBS[task_id].update(
{
"status": "COMPLETED",
"result": result_data,
"metrics": final_state.metrics,
}
)
except Exception as e:
# On failure, update the job with an error message
import sys
error_context = get_error_context(sys.exc_info())
error_message = f"Adaptive crawl failed: {str(e)}\nContext: {error_context}"
ADAPTIVE_JOBS[task_id].update({"status": "FAILED", "error": error_message})
# --- API Endpoints ---
@router.post("/job",
summary="Submit Adaptive Crawl Job",
description="Start a long-running adaptive crawling job that intelligently discovers relevant content.",
response_description="Job ID for status polling",
response_model=AdaptiveJobStatus,
status_code=202
)
async def submit_adaptive_digest_job(
request: AdaptiveCrawlRequest,
background_tasks: BackgroundTasks,
):
"""
Submit a new adaptive crawling job.
This endpoint starts an intelligent, long-running crawl that automatically
discovers and extracts relevant content based on your query. Returns
immediately with a task ID for polling.
**Request Body:**
```json
{
"start_url": "https://example.com",
"query": "Find all product documentation",
"config": {
"max_depth": 3,
"max_pages": 50,
"confidence_threshold": 0.7,
"timeout": 300
}
}
```
**Parameters:**
- `start_url`: Starting URL for the crawl
- `query`: Natural language query describing what to find
- `config`: Optional adaptive configuration (max_depth, max_pages, etc.)
**Response:**
```json
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "PENDING",
"metrics": null,
"result": null,
"error": null
}
```
**Usage:**
```python
# Submit job
response = requests.post(
"http://localhost:11235/adaptive/digest/job",
headers={"Authorization": f"Bearer {token}"},
json={
"start_url": "https://example.com",
"query": "Find all API documentation"
}
)
task_id = response.json()["task_id"]
# Poll for results
while True:
status_response = requests.get(
f"http://localhost:11235/adaptive/digest/job/{task_id}",
headers={"Authorization": f"Bearer {token}"}
)
status = status_response.json()
if status["status"] in ["COMPLETED", "FAILED"]:
print(status["result"])
break
time.sleep(2)
```
**Notes:**
- Job runs in background, returns immediately
- Use task_id to poll status with GET /adaptive/digest/job/{task_id}
- Adaptive crawler intelligently follows links based on relevance
- Automatically stops when sufficient content found
- Returns HTTP 202 Accepted
"""
print("Received adaptive crawl request:", request)
task_id = str(uuid.uuid4())
# Initialize the job in our in-memory store
ADAPTIVE_JOBS[task_id] = {
"task_id": task_id,
"status": "PENDING",
"metrics": None,
"result": None,
"error": None,
}
# Add the long-running task to the background
background_tasks.add_task(run_adaptive_digest, task_id, request)
return ADAPTIVE_JOBS[task_id]
@router.get("/job/{task_id}",
summary="Get Adaptive Job Status",
description="Poll the status and results of an adaptive crawling job.",
response_description="Job status, metrics, and results",
response_model=AdaptiveJobStatus
)
async def get_adaptive_digest_status(task_id: str):
"""
Get the status and result of an adaptive crawling job.
Poll this endpoint with the task_id returned from the submission endpoint
until the status is 'COMPLETED' or 'FAILED'.
**Parameters:**
- `task_id`: Job ID from POST /adaptive/digest/job
**Response (Running):**
```json
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "RUNNING",
"metrics": {
"confidence": 0.45,
"pages_crawled": 15,
"relevant_pages": 8
},
"result": null,
"error": null
}
```
**Response (Completed):**
```json
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "COMPLETED",
"metrics": {
"confidence": 0.85,
"pages_crawled": 42,
"relevant_pages": 28
},
"result": {
"confidence": 0.85,
"is_sufficient": true,
"coverage_stats": {...},
"relevant_content": [...]
},
"error": null
}
```
**Status Values:**
- `PENDING`: Job queued, not started yet
- `RUNNING`: Job actively crawling
- `COMPLETED`: Job finished successfully
- `FAILED`: Job encountered an error
**Usage:**
```python
import time
# Poll until complete
while True:
response = requests.get(
f"http://localhost:11235/adaptive/digest/job/{task_id}",
headers={"Authorization": f"Bearer {token}"}
)
job = response.json()
print(f"Status: {job['status']}")
if job['status'] == 'RUNNING':
print(f"Progress: {job['metrics']['pages_crawled']} pages")
elif job['status'] == 'COMPLETED':
print(f"Found {len(job['result']['relevant_content'])} relevant items")
break
elif job['status'] == 'FAILED':
print(f"Error: {job['error']}")
break
time.sleep(2)
```
**Notes:**
- Poll every 1-5 seconds
- Metrics updated in real-time while running
- Returns 404 if task_id not found
- Results include top relevant content and statistics
"""
job = ADAPTIVE_JOBS.get(task_id)
if not job:
raise HTTPException(status_code=404, detail="Job not found")
# If the job is running, update the metrics from the live state
if job["status"] == "RUNNING" and job.get("live_state"):
job["metrics"] = job["live_state"].metrics
return job

View File

@@ -1,259 +0,0 @@
"""
Router for dispatcher management endpoints.
Provides endpoints to:
- List available dispatchers
- Get default dispatcher info
- Get dispatcher statistics
"""
import logging
from typing import Dict, List
from fastapi import APIRouter, HTTPException, Request
from schemas import DispatcherInfo, DispatcherStatsResponse, DispatcherType
from utils import get_available_dispatchers, get_dispatcher_config
logger = logging.getLogger(__name__)
# --- APIRouter for Dispatcher Endpoints ---
router = APIRouter(
prefix="/dispatchers",
tags=["Dispatchers"],
)
@router.get("",
summary="List Dispatchers",
description="Get information about all available dispatcher types.",
response_description="List of dispatcher configurations and features",
response_model=List[DispatcherInfo]
)
async def list_dispatchers(request: Request):
"""
List all available dispatcher types.
Returns information about each dispatcher type including name, description,
configuration parameters, and key features.
**Dispatchers:**
- `memory_adaptive`: Automatically manages crawler instances based on memory
- `semaphore`: Simple semaphore-based concurrency control
**Response:**
```json
[
{
"type": "memory_adaptive",
"name": "Memory Adaptive Dispatcher",
"description": "Automatically adjusts crawler pool based on memory usage",
"config": {...},
"features": ["Auto-scaling", "Memory monitoring", "Smart throttling"]
},
{
"type": "semaphore",
"name": "Semaphore Dispatcher",
"description": "Simple semaphore-based concurrency control",
"config": {...},
"features": ["Fixed concurrency", "Simple queue"]
}
]
```
**Usage:**
```python
response = requests.get(
"http://localhost:11235/dispatchers",
headers={"Authorization": f"Bearer {token}"}
)
dispatchers = response.json()
for dispatcher in dispatchers:
print(f"{dispatcher['type']}: {dispatcher['description']}")
```
**Notes:**
- Lists all registered dispatcher types
- Shows configuration options for each
- Use with /crawl endpoint's `dispatcher` parameter
"""
try:
dispatchers_info = get_available_dispatchers()
result = []
for dispatcher_type, info in dispatchers_info.items():
result.append(
DispatcherInfo(
type=DispatcherType(dispatcher_type),
name=info["name"],
description=info["description"],
config=info["config"],
features=info["features"],
)
)
return result
except Exception as e:
logger.error(f"Error listing dispatchers: {e}")
raise HTTPException(status_code=500, detail=f"Failed to list dispatchers: {str(e)}")
@router.get("/default",
summary="Get Default Dispatcher",
description="Get information about the currently configured default dispatcher.",
response_description="Default dispatcher information",
response_model=Dict
)
async def get_default_dispatcher(request: Request):
"""
Get information about the current default dispatcher.
Returns the dispatcher type, configuration, and status for the default
dispatcher used when no specific dispatcher is requested.
**Response:**
```json
{
"type": "memory_adaptive",
"config": {
"max_memory_percent": 80,
"check_interval": 10,
"min_instances": 1,
"max_instances": 10
},
"active": true
}
```
**Usage:**
```python
response = requests.get(
"http://localhost:11235/dispatchers/default",
headers={"Authorization": f"Bearer {token}"}
)
default_dispatcher = response.json()
print(f"Default: {default_dispatcher['type']}")
```
**Notes:**
- Shows which dispatcher is used by default
- Default can be configured via server settings
- Override with `dispatcher` parameter in /crawl requests
"""
try:
default_type = request.app.state.default_dispatcher_type
dispatcher = request.app.state.dispatchers.get(default_type)
if not dispatcher:
raise HTTPException(
status_code=500,
detail=f"Default dispatcher '{default_type}' not initialized"
)
return {
"type": default_type,
"config": get_dispatcher_config(default_type),
"active": True,
}
except HTTPException:
raise
except Exception as e:
logger.error(f"Error getting default dispatcher: {e}")
raise HTTPException(
status_code=500,
detail=f"Failed to get default dispatcher: {str(e)}"
)
@router.get("/{dispatcher_type}/stats",
summary="Get Dispatcher Statistics",
description="Get runtime statistics for a specific dispatcher.",
response_description="Dispatcher statistics and metrics",
response_model=DispatcherStatsResponse
)
async def get_dispatcher_stats(dispatcher_type: DispatcherType, request: Request):
"""
Get runtime statistics for a specific dispatcher.
Returns active sessions, configuration, and dispatcher-specific metrics.
Useful for monitoring and debugging dispatcher performance.
**Parameters:**
- `dispatcher_type`: Dispatcher type (memory_adaptive, semaphore)
**Response:**
```json
{
"type": "memory_adaptive",
"active_sessions": 3,
"config": {
"max_memory_percent": 80,
"check_interval": 10
},
"stats": {
"current_memory_percent": 45.2,
"active_instances": 3,
"max_instances": 10,
"throttled_count": 0
}
}
```
**Usage:**
```python
response = requests.get(
"http://localhost:11235/dispatchers/memory_adaptive/stats",
headers={"Authorization": f"Bearer {token}"}
)
stats = response.json()
print(f"Active sessions: {stats['active_sessions']}")
print(f"Memory usage: {stats['stats']['current_memory_percent']}%")
```
**Notes:**
- Real-time statistics
- Stats vary by dispatcher type
- Use for monitoring and capacity planning
- Returns 404 if dispatcher type not found
"""
try:
dispatcher_name = dispatcher_type.value
dispatcher = request.app.state.dispatchers.get(dispatcher_name)
if not dispatcher:
raise HTTPException(
status_code=404,
detail=f"Dispatcher '{dispatcher_name}' not found or not initialized"
)
# Get basic stats
stats = {
"type": dispatcher_type,
"active_sessions": dispatcher.concurrent_sessions,
"config": get_dispatcher_config(dispatcher_name),
"stats": {}
}
# Add dispatcher-specific stats
if dispatcher_name == "memory_adaptive":
stats["stats"] = {
"current_memory_percent": getattr(dispatcher, "current_memory_percent", 0.0),
"memory_pressure_mode": getattr(dispatcher, "memory_pressure_mode", False),
"task_queue_size": dispatcher.task_queue.qsize() if hasattr(dispatcher, "task_queue") else 0,
}
elif dispatcher_name == "semaphore":
# For semaphore dispatcher, show semaphore availability
if hasattr(dispatcher, "semaphore_count"):
stats["stats"] = {
"max_concurrent": dispatcher.semaphore_count,
}
return DispatcherStatsResponse(**stats)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error getting dispatcher stats for '{dispatcher_type}': {e}")
raise HTTPException(
status_code=500,
detail=f"Failed to get dispatcher stats: {str(e)}"
)

View File

@@ -1,746 +0,0 @@
"""
Monitoring and Profiling Router
Provides endpoints for:
- Browser performance profiling
- Real-time crawler statistics
- System resource monitoring
- Session management
"""
from fastapi import APIRouter, HTTPException, BackgroundTasks, Query
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import Dict, List, Optional, Any, AsyncGenerator
from datetime import datetime, timedelta
import uuid
import asyncio
import json
import time
import psutil
import logging
from collections import defaultdict
logger = logging.getLogger(__name__)
router = APIRouter(
prefix="/monitoring",
tags=["Monitoring & Profiling"],
responses={
404: {"description": "Session not found"},
500: {"description": "Internal server error"}
}
)
# ============================================================================
# Data Structures
# ============================================================================
# In-memory storage for profiling sessions
PROFILING_SESSIONS: Dict[str, Dict[str, Any]] = {}
# Real-time crawler statistics
CRAWLER_STATS = {
"active_crawls": 0,
"total_crawls": 0,
"successful_crawls": 0,
"failed_crawls": 0,
"total_bytes_processed": 0,
"average_response_time_ms": 0.0,
"last_updated": datetime.now().isoformat(),
}
# Per-URL statistics
URL_STATS: Dict[str, Dict[str, Any]] = defaultdict(lambda: {
"total_requests": 0,
"success_count": 0,
"failure_count": 0,
"average_time_ms": 0.0,
"last_accessed": None,
})
# ============================================================================
# Pydantic Models
# ============================================================================
class ProfilingStartRequest(BaseModel):
"""Request to start a profiling session."""
url: str = Field(..., description="URL to profile")
browser_config: Optional[Dict[str, Any]] = Field(
default_factory=dict,
description="Browser configuration"
)
crawler_config: Optional[Dict[str, Any]] = Field(
default_factory=dict,
description="Crawler configuration"
)
profile_duration: Optional[int] = Field(
default=30,
ge=5,
le=300,
description="Maximum profiling duration in seconds"
)
collect_network: bool = Field(
default=True,
description="Collect network performance data"
)
collect_memory: bool = Field(
default=True,
description="Collect memory usage data"
)
collect_cpu: bool = Field(
default=True,
description="Collect CPU usage data"
)
class Config:
schema_extra = {
"example": {
"url": "https://example.com",
"profile_duration": 30,
"collect_network": True,
"collect_memory": True,
"collect_cpu": True
}
}
class ProfilingSession(BaseModel):
"""Profiling session information."""
session_id: str = Field(..., description="Unique session identifier")
status: str = Field(..., description="Session status: running, completed, failed")
url: str = Field(..., description="URL being profiled")
start_time: str = Field(..., description="Session start time (ISO format)")
end_time: Optional[str] = Field(None, description="Session end time (ISO format)")
duration_seconds: Optional[float] = Field(None, description="Total duration in seconds")
results: Optional[Dict[str, Any]] = Field(None, description="Profiling results")
error: Optional[str] = Field(None, description="Error message if failed")
class Config:
schema_extra = {
"example": {
"session_id": "abc123",
"status": "completed",
"url": "https://example.com",
"start_time": "2025-10-16T10:30:00",
"end_time": "2025-10-16T10:30:30",
"duration_seconds": 30.5,
"results": {
"performance": {
"page_load_time_ms": 1234,
"dom_content_loaded_ms": 890,
"first_paint_ms": 567
}
}
}
}
class CrawlerStats(BaseModel):
"""Current crawler statistics."""
active_crawls: int = Field(..., description="Number of currently active crawls")
total_crawls: int = Field(..., description="Total crawls since server start")
successful_crawls: int = Field(..., description="Number of successful crawls")
failed_crawls: int = Field(..., description="Number of failed crawls")
success_rate: float = Field(..., description="Success rate percentage")
total_bytes_processed: int = Field(..., description="Total bytes processed")
average_response_time_ms: float = Field(..., description="Average response time")
uptime_seconds: float = Field(..., description="Server uptime in seconds")
memory_usage_mb: float = Field(..., description="Current memory usage in MB")
cpu_percent: float = Field(..., description="Current CPU usage percentage")
last_updated: str = Field(..., description="Last update timestamp")
class URLStatistics(BaseModel):
"""Statistics for a specific URL pattern."""
url_pattern: str
total_requests: int
success_count: int
failure_count: int
success_rate: float
average_time_ms: float
last_accessed: Optional[str]
class SessionListResponse(BaseModel):
"""List of profiling sessions."""
total: int
sessions: List[ProfilingSession]
# ============================================================================
# Helper Functions
# ============================================================================
def get_system_stats() -> Dict[str, Any]:
"""Get current system resource usage."""
try:
process = psutil.Process()
return {
"memory_usage_mb": process.memory_info().rss / 1024 / 1024,
"cpu_percent": process.cpu_percent(interval=0.1),
"num_threads": process.num_threads(),
"open_files": len(process.open_files()),
"connections": len(process.connections()),
}
except Exception as e:
logger.error(f"Error getting system stats: {e}")
return {
"memory_usage_mb": 0.0,
"cpu_percent": 0.0,
"num_threads": 0,
"open_files": 0,
"connections": 0,
}
def cleanup_old_sessions(max_age_hours: int = 24):
"""Remove old profiling sessions to prevent memory leaks."""
cutoff = datetime.now() - timedelta(hours=max_age_hours)
to_remove = []
for session_id, session in PROFILING_SESSIONS.items():
try:
start_time = datetime.fromisoformat(session["start_time"])
if start_time < cutoff:
to_remove.append(session_id)
except (ValueError, KeyError):
continue
for session_id in to_remove:
del PROFILING_SESSIONS[session_id]
logger.info(f"Cleaned up old session: {session_id}")
return len(to_remove)
# ============================================================================
# Profiling Endpoints
# ============================================================================
@router.post(
"/profile/start",
response_model=ProfilingSession,
summary="Start profiling session",
description="Start a new browser profiling session for performance analysis"
)
async def start_profiling_session(
request: ProfilingStartRequest,
background_tasks: BackgroundTasks
):
"""
Start a new profiling session.
Returns a session ID that can be used to retrieve results later.
The profiling runs in the background and collects:
- Page load performance metrics
- Network requests and timing
- Memory usage patterns
- CPU utilization
- Browser-specific metrics
"""
session_id = str(uuid.uuid4())
start_time = datetime.now()
session_data = {
"session_id": session_id,
"status": "running",
"url": request.url,
"start_time": start_time.isoformat(),
"end_time": None,
"duration_seconds": None,
"results": None,
"error": None,
"config": {
"profile_duration": request.profile_duration,
"collect_network": request.collect_network,
"collect_memory": request.collect_memory,
"collect_cpu": request.collect_cpu,
}
}
PROFILING_SESSIONS[session_id] = session_data
# Add background task to run profiling
background_tasks.add_task(
run_profiling_session,
session_id,
request
)
logger.info(f"Started profiling session {session_id} for {request.url}")
return ProfilingSession(**session_data)
@router.get(
"/profile/{session_id}",
response_model=ProfilingSession,
summary="Get profiling results",
description="Retrieve results from a profiling session"
)
async def get_profiling_results(session_id: str):
"""
Get profiling session results.
Returns the current status and results of a profiling session.
If the session is still running, results will be None.
"""
if session_id not in PROFILING_SESSIONS:
raise HTTPException(
status_code=404,
detail=f"Profiling session '{session_id}' not found"
)
session = PROFILING_SESSIONS[session_id]
return ProfilingSession(**session)
@router.get(
"/profile",
response_model=SessionListResponse,
summary="List profiling sessions",
description="List all profiling sessions with optional filtering"
)
async def list_profiling_sessions(
status: Optional[str] = Query(None, description="Filter by status: running, completed, failed"),
limit: int = Query(50, ge=1, le=500, description="Maximum number of sessions to return")
):
"""
List all profiling sessions.
Can be filtered by status and limited in number.
"""
sessions = list(PROFILING_SESSIONS.values())
# Filter by status if provided
if status:
sessions = [s for s in sessions if s["status"] == status]
# Sort by start time (newest first)
sessions.sort(key=lambda x: x["start_time"], reverse=True)
# Limit results
sessions = sessions[:limit]
return SessionListResponse(
total=len(sessions),
sessions=[ProfilingSession(**s) for s in sessions]
)
@router.delete(
"/profile/{session_id}",
summary="Delete profiling session",
description="Delete a profiling session and its results"
)
async def delete_profiling_session(session_id: str):
"""
Delete a profiling session.
Removes the session and all associated data from memory.
"""
if session_id not in PROFILING_SESSIONS:
raise HTTPException(
status_code=404,
detail=f"Profiling session '{session_id}' not found"
)
session = PROFILING_SESSIONS.pop(session_id)
logger.info(f"Deleted profiling session {session_id}")
return {
"success": True,
"message": f"Session {session_id} deleted",
"session": ProfilingSession(**session)
}
@router.post(
"/profile/cleanup",
summary="Cleanup old sessions",
description="Remove old profiling sessions to free memory"
)
async def cleanup_sessions(
max_age_hours: int = Query(24, ge=1, le=168, description="Maximum age in hours")
):
"""
Cleanup old profiling sessions.
Removes sessions older than the specified age.
"""
removed = cleanup_old_sessions(max_age_hours)
return {
"success": True,
"removed_count": removed,
"remaining_count": len(PROFILING_SESSIONS),
"message": f"Removed {removed} sessions older than {max_age_hours} hours"
}
# ============================================================================
# Statistics Endpoints
# ============================================================================
@router.get(
"/stats",
response_model=CrawlerStats,
summary="Get crawler statistics",
description="Get current crawler statistics and system metrics"
)
async def get_crawler_stats():
"""
Get current crawler statistics.
Returns real-time metrics about:
- Active and total crawls
- Success/failure rates
- Response times
- System resource usage
"""
system_stats = get_system_stats()
total = CRAWLER_STATS["successful_crawls"] + CRAWLER_STATS["failed_crawls"]
success_rate = (
(CRAWLER_STATS["successful_crawls"] / total * 100)
if total > 0 else 0.0
)
# Calculate uptime
# In a real implementation, you'd track server start time
uptime_seconds = 0.0 # Placeholder
stats = CrawlerStats(
active_crawls=CRAWLER_STATS["active_crawls"],
total_crawls=CRAWLER_STATS["total_crawls"],
successful_crawls=CRAWLER_STATS["successful_crawls"],
failed_crawls=CRAWLER_STATS["failed_crawls"],
success_rate=success_rate,
total_bytes_processed=CRAWLER_STATS["total_bytes_processed"],
average_response_time_ms=CRAWLER_STATS["average_response_time_ms"],
uptime_seconds=uptime_seconds,
memory_usage_mb=system_stats["memory_usage_mb"],
cpu_percent=system_stats["cpu_percent"],
last_updated=datetime.now().isoformat()
)
return stats
@router.get(
"/stats/stream",
summary="Stream crawler statistics",
description="Server-Sent Events stream of real-time crawler statistics"
)
async def stream_crawler_stats(
interval: int = Query(2, ge=1, le=60, description="Update interval in seconds")
):
"""
Stream real-time crawler statistics.
Returns an SSE (Server-Sent Events) stream that pushes
statistics updates at the specified interval.
Example:
```javascript
const eventSource = new EventSource('/monitoring/stats/stream?interval=2');
eventSource.onmessage = (event) => {
const stats = JSON.parse(event.data);
console.log('Stats:', stats);
};
```
"""
async def generate_stats() -> AsyncGenerator[str, None]:
"""Generate stats stream."""
try:
while True:
# Get current stats
stats = await get_crawler_stats()
# Format as SSE
data = json.dumps(stats.dict())
yield f"data: {data}\n\n"
# Wait for next interval
await asyncio.sleep(interval)
except asyncio.CancelledError:
logger.info("Stats stream cancelled by client")
except Exception as e:
logger.error(f"Error in stats stream: {e}")
yield f"event: error\ndata: {json.dumps({'error': str(e)})}\n\n"
return StreamingResponse(
generate_stats(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no",
}
)
@router.get(
"/stats/urls",
response_model=List[URLStatistics],
summary="Get URL statistics",
description="Get statistics for crawled URLs"
)
async def get_url_statistics(
limit: int = Query(100, ge=1, le=1000, description="Maximum number of URLs to return"),
sort_by: str = Query("total_requests", description="Sort field: total_requests, success_rate, average_time_ms")
):
"""
Get statistics for crawled URLs.
Returns metrics for each URL that has been crawled,
including request counts, success rates, and timing.
"""
stats_list = []
for url, stats in URL_STATS.items():
total = stats["total_requests"]
success_rate = (stats["success_count"] / total * 100) if total > 0 else 0.0
stats_list.append(URLStatistics(
url_pattern=url,
total_requests=stats["total_requests"],
success_count=stats["success_count"],
failure_count=stats["failure_count"],
success_rate=success_rate,
average_time_ms=stats["average_time_ms"],
last_accessed=stats["last_accessed"]
))
# Sort
if sort_by == "success_rate":
stats_list.sort(key=lambda x: x.success_rate, reverse=True)
elif sort_by == "average_time_ms":
stats_list.sort(key=lambda x: x.average_time_ms)
else: # total_requests
stats_list.sort(key=lambda x: x.total_requests, reverse=True)
return stats_list[:limit]
@router.post(
"/stats/reset",
summary="Reset statistics",
description="Reset all crawler statistics to zero"
)
async def reset_statistics():
"""
Reset all statistics.
Clears all accumulated statistics but keeps the server running.
Useful for testing or starting fresh measurements.
"""
global CRAWLER_STATS, URL_STATS
CRAWLER_STATS = {
"active_crawls": 0,
"total_crawls": 0,
"successful_crawls": 0,
"failed_crawls": 0,
"total_bytes_processed": 0,
"average_response_time_ms": 0.0,
"last_updated": datetime.now().isoformat(),
}
URL_STATS.clear()
logger.info("All statistics reset")
return {
"success": True,
"message": "All statistics have been reset",
"timestamp": datetime.now().isoformat()
}
# ============================================================================
# Background Tasks
# ============================================================================
async def run_profiling_session(session_id: str, request: ProfilingStartRequest):
"""
Background task to run profiling session.
This performs the actual profiling work:
1. Creates a crawler with profiling enabled
2. Crawls the target URL
3. Collects performance metrics
4. Stores results in the session
"""
start_time = time.time()
try:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.browser_profiler import BrowserProfiler
logger.info(f"Starting profiling for session {session_id}")
# Create profiler
profiler = BrowserProfiler()
# Configure browser and crawler
browser_config = BrowserConfig.load(request.browser_config)
crawler_config = CrawlerRunConfig.load(request.crawler_config)
# Enable profiling options
browser_config.profiling_enabled = True
results = {}
async with AsyncWebCrawler(config=browser_config) as crawler:
# Start profiling
profiler.start()
# Collect system stats before
stats_before = get_system_stats()
# Crawl with timeout
try:
result = await asyncio.wait_for(
crawler.arun(request.url, config=crawler_config),
timeout=request.profile_duration
)
crawl_success = result.success
except asyncio.TimeoutError:
logger.warning(f"Profiling session {session_id} timed out")
crawl_success = False
result = None
# Stop profiling
profiler_results = profiler.stop()
# Collect system stats after
stats_after = get_system_stats()
# Build results
results = {
"crawl_success": crawl_success,
"url": request.url,
"performance": profiler_results if profiler_results else {},
"system": {
"before": stats_before,
"after": stats_after,
"delta": {
"memory_mb": stats_after["memory_usage_mb"] - stats_before["memory_usage_mb"],
"cpu_percent": stats_after["cpu_percent"] - stats_before["cpu_percent"],
}
}
}
if result:
results["content"] = {
"markdown_length": len(result.markdown) if result.markdown else 0,
"html_length": len(result.html) if result.html else 0,
"links_count": len(result.links["internal"]) + len(result.links["external"]),
"media_count": len(result.media["images"]) + len(result.media["videos"]),
}
# Update session with results
end_time = time.time()
duration = end_time - start_time
PROFILING_SESSIONS[session_id].update({
"status": "completed",
"end_time": datetime.now().isoformat(),
"duration_seconds": duration,
"results": results
})
logger.info(f"Profiling session {session_id} completed in {duration:.2f}s")
except Exception as e:
logger.error(f"Profiling session {session_id} failed: {str(e)}")
PROFILING_SESSIONS[session_id].update({
"status": "failed",
"end_time": datetime.now().isoformat(),
"duration_seconds": time.time() - start_time,
"error": str(e)
})
# ============================================================================
# Middleware Integration Points
# ============================================================================
def track_crawl_start():
"""Call this when a crawl starts."""
CRAWLER_STATS["active_crawls"] += 1
CRAWLER_STATS["total_crawls"] += 1
CRAWLER_STATS["last_updated"] = datetime.now().isoformat()
def track_crawl_end(url: str, success: bool, duration_ms: float, bytes_processed: int = 0):
"""Call this when a crawl ends."""
CRAWLER_STATS["active_crawls"] = max(0, CRAWLER_STATS["active_crawls"] - 1)
if success:
CRAWLER_STATS["successful_crawls"] += 1
else:
CRAWLER_STATS["failed_crawls"] += 1
CRAWLER_STATS["total_bytes_processed"] += bytes_processed
# Update average response time (running average)
total = CRAWLER_STATS["successful_crawls"] + CRAWLER_STATS["failed_crawls"]
current_avg = CRAWLER_STATS["average_response_time_ms"]
CRAWLER_STATS["average_response_time_ms"] = (
(current_avg * (total - 1) + duration_ms) / total
)
# Update URL stats
url_stat = URL_STATS[url]
url_stat["total_requests"] += 1
if success:
url_stat["success_count"] += 1
else:
url_stat["failure_count"] += 1
# Update average time for this URL
total_url = url_stat["total_requests"]
current_avg_url = url_stat["average_time_ms"]
url_stat["average_time_ms"] = (
(current_avg_url * (total_url - 1) + duration_ms) / total_url
)
url_stat["last_accessed"] = datetime.now().isoformat()
CRAWLER_STATS["last_updated"] = datetime.now().isoformat()
# ============================================================================
# Health Check
# ============================================================================
@router.get(
"/health",
summary="Health check",
description="Check if monitoring system is operational"
)
async def health_check():
"""
Health check endpoint.
Returns status of the monitoring system.
"""
system_stats = get_system_stats()
return {
"status": "healthy",
"timestamp": datetime.now().isoformat(),
"active_sessions": len([s for s in PROFILING_SESSIONS.values() if s["status"] == "running"]),
"total_sessions": len(PROFILING_SESSIONS),
"system": system_stats
}

View File

@@ -1,306 +0,0 @@
from typing import Optional
from fastapi import APIRouter, File, Form, HTTPException, UploadFile
from schemas import C4AScriptPayload
from crawl4ai.script import (
CompilationResult,
ValidationResult,
# ErrorDetail
)
# Import all necessary components from the crawl4ai library
# C4A Script Language Support
from crawl4ai.script import (
compile as c4a_compile,
)
from crawl4ai.script import (
validate as c4a_validate,
)
# --- APIRouter for c4a Scripts Endpoints ---
router = APIRouter(
prefix="/c4a",
tags=["c4a Scripts"],
)
# --- Background Worker Function ---
@router.post("/validate",
summary="Validate C4A-Script",
description="Validate the syntax of a C4A-Script without compiling it.",
response_description="Validation result with errors if any",
response_model=ValidationResult
)
async def validate_c4a_script_endpoint(payload: C4AScriptPayload):
"""
Validate the syntax of a C4A-Script.
Checks the script syntax without compiling to executable JavaScript.
Returns detailed error information if validation fails.
**Request Body:**
```json
{
"script": "NAVIGATE https://example.com\\nWAIT 2\\nCLICK button.submit"
}
```
**Response (Valid):**
```json
{
"success": true,
"errors": []
}
```
**Response (Invalid):**
```json
{
"success": false,
"errors": [
{
"line": 3,
"message": "Unknown command: CLCK",
"type": "SyntaxError"
}
]
}
```
**Usage:**
```python
response = requests.post(
"http://localhost:11235/c4a/validate",
headers={"Authorization": f"Bearer {token}"},
json={
"script": "NAVIGATE https://example.com\\nWAIT 2"
}
)
result = response.json()
if result["success"]:
print("Script is valid!")
else:
for error in result["errors"]:
print(f"Line {error['line']}: {error['message']}")
```
**Notes:**
- Validates syntax only, doesn't execute
- Returns detailed error locations
- Use before compiling to check for issues
"""
# The validate function is designed not to raise exceptions
validation_result = c4a_validate(payload.script)
return validation_result
@router.post("/compile",
summary="Compile C4A-Script",
description="Compile a C4A-Script into executable JavaScript code.",
response_description="Compiled JavaScript code or compilation errors",
response_model=CompilationResult
)
async def compile_c4a_script_endpoint(payload: C4AScriptPayload):
"""
Compile a C4A-Script into executable JavaScript.
Transforms high-level C4A-Script commands into JavaScript that can be
executed in a browser context.
**Request Body:**
```json
{
"script": "NAVIGATE https://example.com\\nWAIT 2\\nCLICK button.submit"
}
```
**Response (Success):**
```json
{
"success": true,
"javascript": "await page.goto('https://example.com');\\nawait page.waitForTimeout(2000);\\nawait page.click('button.submit');",
"errors": []
}
```
**Response (Error):**
```json
{
"success": false,
"javascript": null,
"errors": [
{
"line": 2,
"message": "Invalid WAIT duration",
"type": "CompilationError"
}
]
}
```
**Usage:**
```python
response = requests.post(
"http://localhost:11235/c4a/compile",
headers={"Authorization": f"Bearer {token}"},
json={
"script": "NAVIGATE https://example.com\\nCLICK .login-button"
}
)
result = response.json()
if result["success"]:
print("Compiled JavaScript:")
print(result["javascript"])
else:
print("Compilation failed:", result["errors"])
```
**C4A-Script Commands:**
- `NAVIGATE <url>` - Navigate to URL
- `WAIT <seconds>` - Wait for specified time
- `CLICK <selector>` - Click element
- `TYPE <selector> <text>` - Type text into element
- `SCROLL <direction>` - Scroll page
- And many more...
**Notes:**
- Returns HTTP 400 if compilation fails
- JavaScript can be used with /execute_js endpoint
- Simplifies browser automation scripting
"""
# The compile function also returns a result object instead of raising
compilation_result = c4a_compile(payload.script)
if not compilation_result.success:
# You can optionally raise an HTTP exception for failed compilations
# This makes it clearer on the client-side that it was a bad request
raise HTTPException(
status_code=400,
detail=compilation_result.to_dict(), # FastAPI will serialize this
)
return compilation_result
@router.post("/compile-file",
summary="Compile C4A-Script from File",
description="Compile a C4A-Script from an uploaded file or form string.",
response_description="Compiled JavaScript code or compilation errors",
response_model=CompilationResult
)
async def compile_c4a_script_file_endpoint(
file: Optional[UploadFile] = File(None), script: Optional[str] = Form(None)
):
"""
Compile a C4A-Script from file upload or form data.
Accepts either a file upload or a string parameter. Useful for uploading
C4A-Script files or sending multipart form data.
**Parameters:**
- `file`: C4A-Script file upload (multipart/form-data)
- `script`: C4A-Script content as string (form field)
**Note:** Provide either file OR script, not both.
**Request (File Upload):**
```bash
curl -X POST "http://localhost:11235/c4a/compile-file" \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-F "file=@myscript.c4a"
```
**Request (Form String):**
```bash
curl -X POST "http://localhost:11235/c4a/compile-file" \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-F "script=NAVIGATE https://example.com"
```
**Response:**
```json
{
"success": true,
"javascript": "await page.goto('https://example.com');",
"errors": []
}
```
**Usage (Python with file):**
```python
with open('script.c4a', 'rb') as f:
response = requests.post(
"http://localhost:11235/c4a/compile-file",
headers={"Authorization": f"Bearer {token}"},
files={"file": f}
)
result = response.json()
print(result["javascript"])
```
**Usage (Python with string):**
```python
response = requests.post(
"http://localhost:11235/c4a/compile-file",
headers={"Authorization": f"Bearer {token}"},
data={"script": "NAVIGATE https://example.com"}
)
result = response.json()
print(result["javascript"])
```
**Notes:**
- File must be UTF-8 encoded text
- Use for batch script compilation
- Returns HTTP 400 if both or neither parameter provided
- Returns HTTP 400 if compilation fails
"""
script_content = None
# Validate that at least one input is provided
if not file and not script:
raise HTTPException(
status_code=400,
detail={"error": "Either 'file' or 'script' parameter must be provided"},
)
# If both are provided, prioritize the file
if file and script:
raise HTTPException(
status_code=400,
detail={"error": "Please provide either 'file' or 'script', not both"},
)
# Handle file upload
if file:
try:
file_content = await file.read()
script_content = file_content.decode("utf-8")
except UnicodeDecodeError as exc:
raise HTTPException(
status_code=400,
detail={"error": "File must be a valid UTF-8 text file"},
) from exc
except Exception as e:
raise HTTPException(
status_code=400, detail={"error": f"Error reading file: {str(e)}"}
) from e
# Handle string content
elif script:
script_content = script
# Compile the script content
compilation_result = c4a_compile(script_content)
if not compilation_result.success:
# You can optionally raise an HTTP exception for failed compilations
# This makes it clearer on the client-side that it was a bad request
raise HTTPException(
status_code=400,
detail=compilation_result.to_dict(), # FastAPI will serialize this
)
return compilation_result

View File

@@ -1,301 +0,0 @@
"""
Table Extraction Router for Crawl4AI Docker Server
This module provides dedicated endpoints for table extraction from HTML or URLs,
separate from the main crawling functionality.
"""
import logging
from typing import List, Dict, Any
from fastapi import APIRouter, HTTPException
from fastapi.responses import JSONResponse
# Import crawler pool for browser reuse
from crawler_pool import get_crawler
# Import schemas
from schemas import (
TableExtractionRequest,
TableExtractionBatchRequest,
TableExtractionConfig,
)
# Import utilities
from utils import (
extract_tables_from_html,
format_table_response,
create_table_extraction_strategy,
)
# Configure logger
logger = logging.getLogger(__name__)
# Create router
router = APIRouter(prefix="/tables", tags=["Table Extraction"])
@router.post(
"/extract",
summary="Extract Tables from HTML or URL",
description="""
Extract tables from HTML content or by fetching a URL.
Supports multiple extraction strategies: default, LLM-based, or financial.
**Input Options:**
- Provide `html` for direct HTML content extraction
- Provide `url` to fetch and extract from a live page
- Cannot provide both `html` and `url` simultaneously
**Strategies:**
- `default`: Fast regex and HTML structure-based extraction
- `llm`: AI-powered extraction with semantic understanding (requires LLM config)
- `financial`: Specialized extraction for financial tables with numerical formatting
**Returns:**
- List of extracted tables with headers, rows, and metadata
- Each table includes cell-level details and formatting information
""",
response_description="Extracted tables with metadata",
)
async def extract_tables(request: TableExtractionRequest) -> JSONResponse:
"""
Extract tables from HTML content or URL.
Args:
request: TableExtractionRequest with html/url and extraction config
Returns:
JSONResponse with extracted tables and metadata
Raises:
HTTPException: If validation fails or extraction errors occur
"""
try:
# Validate input
if request.html and request.url:
raise HTTPException(
status_code=400,
detail="Cannot provide both 'html' and 'url'. Choose one input method."
)
if not request.html and not request.url:
raise HTTPException(
status_code=400,
detail="Must provide either 'html' or 'url' for table extraction."
)
# Handle URL-based extraction
if request.url:
# Import crawler configs
from async_configs import BrowserConfig, CrawlerRunConfig
try:
# Create minimal browser config
browser_config = BrowserConfig(
headless=True,
verbose=False,
)
# Create crawler config with table extraction
table_strategy = create_table_extraction_strategy(request.config)
crawler_config = CrawlerRunConfig(
table_extraction_strategy=table_strategy,
)
# Get crawler from pool (browser reuse for memory efficiency)
crawler = await get_crawler(browser_config, adapter=None)
# Crawl the URL
result = await crawler.arun(
url=request.url,
config=crawler_config,
)
if not result.success:
raise HTTPException(
status_code=500,
detail=f"Failed to fetch URL: {result.error_message}"
)
# Extract HTML
html_content = result.html
except Exception as e:
logger.error(f"Error fetching URL {request.url}: {e}")
raise HTTPException(
status_code=500,
detail=f"Failed to fetch and extract from URL: {str(e)}"
)
else:
# Use provided HTML
html_content = request.html
# Extract tables from HTML
tables = await extract_tables_from_html(html_content, request.config)
# Format response
formatted_tables = format_table_response(tables)
return JSONResponse({
"success": True,
"table_count": len(formatted_tables),
"tables": formatted_tables,
"strategy": request.config.strategy.value,
})
except HTTPException:
raise
except Exception as e:
logger.error(f"Error extracting tables: {e}", exc_info=True)
raise HTTPException(
status_code=500,
detail=f"Table extraction failed: {str(e)}"
)
@router.post(
"/extract/batch",
summary="Extract Tables from Multiple Sources (Batch)",
description="""
Extract tables from multiple HTML contents or URLs in a single request.
Processes each input independently and returns results for all.
**Batch Processing:**
- Provide list of HTML contents and/or URLs
- Each input is processed with the same extraction strategy
- Partial failures are allowed (returns results for successful extractions)
**Use Cases:**
- Extracting tables from multiple pages simultaneously
- Bulk financial data extraction
- Comparing table structures across multiple sources
""",
response_description="Batch extraction results with per-item success status",
)
async def extract_tables_batch(request: TableExtractionBatchRequest) -> JSONResponse:
"""
Extract tables from multiple HTML contents or URLs in batch.
Args:
request: TableExtractionBatchRequest with list of html/url and config
Returns:
JSONResponse with batch results
Raises:
HTTPException: If validation fails
"""
try:
# Validate batch request
total_items = len(request.html_list or []) + len(request.url_list or [])
if total_items == 0:
raise HTTPException(
status_code=400,
detail="Must provide at least one HTML content or URL in batch request."
)
if total_items > 50: # Reasonable batch limit
raise HTTPException(
status_code=400,
detail=f"Batch size ({total_items}) exceeds maximum allowed (50)."
)
results = []
# Process HTML list
if request.html_list:
for idx, html_content in enumerate(request.html_list):
try:
tables = await extract_tables_from_html(html_content, request.config)
formatted_tables = format_table_response(tables)
results.append({
"success": True,
"source": f"html_{idx}",
"table_count": len(formatted_tables),
"tables": formatted_tables,
})
except Exception as e:
logger.error(f"Error extracting tables from html_{idx}: {e}")
results.append({
"success": False,
"source": f"html_{idx}",
"error": str(e),
})
# Process URL list
if request.url_list:
from async_configs import BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
headless=True,
verbose=False,
)
table_strategy = create_table_extraction_strategy(request.config)
crawler_config = CrawlerRunConfig(
table_extraction_strategy=table_strategy,
)
# Get crawler from pool (reuse browser for all URLs in batch)
crawler = await get_crawler(browser_config, adapter=None)
for url in request.url_list:
try:
result = await crawler.arun(
url=url,
config=crawler_config,
)
if result.success:
html_content = result.html
tables = await extract_tables_from_html(html_content, request.config)
formatted_tables = format_table_response(tables)
results.append({
"success": True,
"source": url,
"table_count": len(formatted_tables),
"tables": formatted_tables,
})
else:
results.append({
"success": False,
"source": url,
"error": result.error_message,
})
except Exception as e:
logger.error(f"Error extracting tables from {url}: {e}")
results.append({
"success": False,
"source": url,
"error": str(e),
})
# Calculate summary
successful = sum(1 for r in results if r["success"])
failed = len(results) - successful
total_tables = sum(r.get("table_count", 0) for r in results if r["success"])
return JSONResponse({
"success": True,
"summary": {
"total_processed": len(results),
"successful": successful,
"failed": failed,
"total_tables_extracted": total_tables,
},
"results": results,
"strategy": request.config.strategy.value,
})
except HTTPException:
raise
except Exception as e:
logger.error(f"Error in batch table extraction: {e}", exc_info=True)
raise HTTPException(
status_code=500,
detail=f"Batch table extraction failed: {str(e)}"
)

View File

@@ -1,249 +1,28 @@
from typing import List, Optional, Dict
from enum import Enum
from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, HttpUrl
from utils import FilterType
# ============================================================================
# Dispatcher Schemas
# ============================================================================
class DispatcherType(str, Enum):
"""Available dispatcher types for crawling."""
MEMORY_ADAPTIVE = "memory_adaptive"
SEMAPHORE = "semaphore"
class DispatcherInfo(BaseModel):
"""Information about a dispatcher type."""
type: DispatcherType
name: str
description: str
config: Dict[str, Any]
features: List[str]
class DispatcherStatsResponse(BaseModel):
"""Response model for dispatcher statistics."""
type: DispatcherType
active_sessions: int
config: Dict[str, Any]
stats: Optional[Dict[str, Any]] = Field(
None,
description="Additional dispatcher-specific statistics"
)
class DispatcherSelection(BaseModel):
"""Model for selecting a dispatcher in crawl requests."""
dispatcher: Optional[DispatcherType] = Field(
None,
description="Dispatcher type to use. Defaults to memory_adaptive if not specified."
)
# ============================================================================
# End Dispatcher Schemas
# ============================================================================
# ============================================================================
# Table Extraction Schemas
# ============================================================================
class TableExtractionStrategy(str, Enum):
"""Available table extraction strategies."""
NONE = "none"
DEFAULT = "default"
LLM = "llm"
FINANCIAL = "financial"
class TableExtractionConfig(BaseModel):
"""Configuration for table extraction."""
strategy: TableExtractionStrategy = Field(
default=TableExtractionStrategy.DEFAULT,
description="Table extraction strategy to use"
)
# Common configuration for all strategies
table_score_threshold: int = Field(
default=7,
ge=0,
le=100,
description="Minimum score for a table to be considered a data table (default strategy)"
)
min_rows: int = Field(
default=0,
ge=0,
description="Minimum number of rows for a valid table"
)
min_cols: int = Field(
default=0,
ge=0,
description="Minimum number of columns for a valid table"
)
# LLM-specific configuration
llm_provider: Optional[str] = Field(
None,
description="LLM provider for LLM strategy (e.g., 'openai/gpt-4')"
)
llm_model: Optional[str] = Field(
None,
description="Specific LLM model to use"
)
llm_api_key: Optional[str] = Field(
None,
description="API key for LLM provider (if not in environment)"
)
llm_base_url: Optional[str] = Field(
None,
description="Custom base URL for LLM API"
)
extraction_prompt: Optional[str] = Field(
None,
description="Custom prompt for LLM table extraction"
)
# Financial-specific configuration
decimal_separator: str = Field(
default=".",
description="Decimal separator for financial tables (e.g., '.' or ',')"
)
thousand_separator: str = Field(
default=",",
description="Thousand separator for financial tables (e.g., ',' or '.')"
)
# General options
verbose: bool = Field(
default=False,
description="Enable verbose logging for table extraction"
)
class Config:
schema_extra = {
"example": {
"strategy": "default",
"table_score_threshold": 7,
"min_rows": 2,
"min_cols": 2
}
}
class TableExtractionRequest(BaseModel):
"""Request for dedicated table extraction endpoint."""
url: Optional[str] = Field(
None,
description="URL to crawl and extract tables from"
)
html: Optional[str] = Field(
None,
description="Raw HTML content to extract tables from"
)
config: TableExtractionConfig = Field(
default_factory=lambda: TableExtractionConfig(),
description="Table extraction configuration"
)
# Browser config (only used if URL is provided)
browser_config: Optional[Dict] = Field(
default_factory=dict,
description="Browser configuration for URL crawling"
)
class Config:
schema_extra = {
"example": {
"url": "https://example.com/data-table",
"config": {
"strategy": "default",
"min_rows": 2
}
}
}
class TableExtractionBatchRequest(BaseModel):
"""Request for batch table extraction."""
html_list: Optional[List[str]] = Field(
None,
description="List of HTML contents to extract tables from"
)
url_list: Optional[List[str]] = Field(
None,
description="List of URLs to extract tables from"
)
config: TableExtractionConfig = Field(
default_factory=lambda: TableExtractionConfig(),
description="Table extraction configuration"
)
browser_config: Optional[Dict] = Field(
default_factory=dict,
description="Browser configuration"
)
# ============================================================================
# End Table Extraction Schemas
# ============================================================================
class CrawlRequest(BaseModel):
urls: List[str] = Field(min_length=1, max_length=100)
browser_config: Optional[Dict] = Field(default_factory=dict)
crawler_config: Optional[Dict] = Field(default_factory=dict)
anti_bot_strategy: Literal["default", "stealth", "undetected", "max_evasion"] = (
Field("default", description="The anti-bot strategy to use for the crawl.")
)
headless: bool = Field(True, description="Run the browser in headless mode.")
# Dispatcher selection
dispatcher: Optional[DispatcherType] = Field(
None,
description="Dispatcher type to use for crawling. Defaults to memory_adaptive if not specified."
)
# Proxy rotation configuration
proxy_rotation_strategy: Optional[Literal["round_robin", "random", "least_used", "failure_aware"]] = Field(
None, description="Proxy rotation strategy to use for the crawl."
)
proxies: Optional[List[Dict[str, Any]]] = Field(
None, description="List of proxy configurations (dicts with server, username, password, etc.)"
)
proxy_failure_threshold: Optional[int] = Field(
3, ge=1, le=10, description="Failure threshold for failure_aware strategy"
)
proxy_recovery_time: Optional[int] = Field(
300, ge=60, le=3600, description="Recovery time in seconds for failure_aware strategy"
)
# Table extraction configuration
table_extraction: Optional[TableExtractionConfig] = Field(
None, description="Optional table extraction configuration to extract tables during crawl"
)
class HookConfig(BaseModel):
"""Configuration for user-provided hooks"""
code: Dict[str, str] = Field(
default_factory=dict, description="Map of hook points to Python code strings"
default_factory=dict,
description="Map of hook points to Python code strings"
)
timeout: int = Field(
default=30,
ge=1,
le=120,
description="Timeout in seconds for each hook execution",
description="Timeout in seconds for each hook execution"
)
class Config:
schema_extra = {
"example": {
@@ -260,81 +39,42 @@ async def hook(page, context, **kwargs):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
return page
""",
"""
},
"timeout": 30,
"timeout": 30
}
}
class CrawlRequestWithHooks(CrawlRequest):
"""Extended crawl request with hooks support"""
hooks: Optional[HookConfig] = Field(
default=None, description="Optional user-provided hook functions"
default=None,
description="Optional user-provided hook functions"
)
class HTTPCrawlRequest(BaseModel):
"""Request model for HTTP-only crawling endpoints."""
urls: List[str] = Field(min_length=1, max_length=100, description="List of URLs to crawl")
http_config: Optional[Dict] = Field(
default_factory=dict,
description="HTTP crawler configuration (method, headers, timeout, etc.)"
)
crawler_config: Optional[Dict] = Field(
default_factory=dict,
description="Crawler run configuration (extraction, filtering, etc.)"
)
# Dispatcher selection (same as browser crawling)
dispatcher: Optional[DispatcherType] = Field(
None,
description="Dispatcher type to use. Defaults to memory_adaptive if not specified."
)
class HTTPCrawlRequestWithHooks(HTTPCrawlRequest):
"""Extended HTTP crawl request with hooks support"""
hooks: Optional[HookConfig] = Field(
default=None, description="Optional user-provided hook functions"
)
class MarkdownRequest(BaseModel):
"""Request body for the /md endpoint."""
url: str = Field(..., description="Absolute http/https URL to fetch")
f: FilterType = Field(
FilterType.FIT, description="Contentfilter strategy: fit, raw, bm25, or llm"
)
q: Optional[str] = Field(None, description="Query string used by BM25/LLM filters")
c: Optional[str] = Field("0", description="Cachebust / revision counter")
provider: Optional[str] = Field(
None, description="LLM provider override (e.g., 'anthropic/claude-3-opus')"
)
temperature: Optional[float] = Field(
None, description="LLM temperature override (0.0-2.0)"
)
url: str = Field(..., description="Absolute http/https URL to fetch")
f: FilterType = Field(FilterType.FIT, description="Contentfilter strategy: fit, raw, bm25, or llm")
q: Optional[str] = Field(None, description="Query string used by BM25/LLM filters")
c: Optional[str] = Field("0", description="Cachebust / revision counter")
provider: Optional[str] = Field(None, description="LLM provider override (e.g., 'anthropic/claude-3-opus')")
temperature: Optional[float] = Field(None, description="LLM temperature override (0.0-2.0)")
base_url: Optional[str] = Field(None, description="LLM API base URL override")
class RawCode(BaseModel):
code: str
class HTMLRequest(BaseModel):
url: str
class ScreenshotRequest(BaseModel):
url: str
screenshot_wait_for: Optional[float] = 2
output_path: Optional[str] = None
class PDFRequest(BaseModel):
url: str
output_path: Optional[str] = None
@@ -343,89 +83,24 @@ class PDFRequest(BaseModel):
class JSEndpointRequest(BaseModel):
url: str
scripts: List[str] = Field(
..., description="List of separated JavaScript snippets to execute"
...,
description="List of separated JavaScript snippets to execute"
)
class SeedRequest(BaseModel):
"""Request model for URL seeding endpoint."""
url: str = Field(..., example="https://docs.crawl4ai.com")
config: Dict[str, Any] = Field(default_factory=dict)
class WebhookConfig(BaseModel):
"""Configuration for webhook notifications."""
webhook_url: HttpUrl
webhook_data_in_payload: bool = False
webhook_headers: Optional[Dict[str, str]] = None
class URLDiscoveryRequest(BaseModel):
"""Request model for URL discovery endpoint."""
domain: str = Field(..., example="docs.crawl4ai.com", description="Domain to discover URLs from")
seeding_config: Dict[str, Any] = Field(
default_factory=dict,
description="Configuration for URL discovery using AsyncUrlSeeder",
example={
"source": "sitemap+cc",
"pattern": "*",
"live_check": False,
"extract_head": False,
"max_urls": -1,
"concurrency": 1000,
"hits_per_sec": 5,
"force": False,
"verbose": False,
"query": None,
"score_threshold": None,
"scoring_method": "bm25",
"filter_nonsense_urls": True
}
)
# --- C4A Script Schemas ---
class C4AScriptPayload(BaseModel):
"""Input model for receiving a C4A-Script."""
script: str = Field(..., description="The C4A-Script content to process.")
# --- Adaptive Crawling Schemas ---
class AdaptiveConfigPayload(BaseModel):
"""Pydantic model for receiving AdaptiveConfig parameters."""
confidence_threshold: float = 0.7
max_pages: int = 20
top_k_links: int = 3
strategy: str = "statistical" # "statistical" or "embedding"
embedding_model: Optional[str] = "sentence-transformers/all-MiniLM-L6-v2"
# Add any other AdaptiveConfig fields you want to expose
class AdaptiveCrawlRequest(BaseModel):
"""Input model for the adaptive digest job."""
start_url: str = Field(..., description="The starting URL for the adaptive crawl.")
query: str = Field(..., description="The user query to guide the crawl.")
config: Optional[AdaptiveConfigPayload] = Field(
None, description="Optional adaptive crawler configuration."
)
class AdaptiveJobStatus(BaseModel):
"""Output model for the job status."""
class WebhookPayload(BaseModel):
"""Payload sent to webhook endpoints."""
task_id: str
status: str
metrics: Optional[Dict[str, Any]] = None
result: Optional[Dict[str, Any]] = None
task_type: str # "crawl", "llm_extraction", etc.
status: str # "completed" or "failed"
timestamp: str # ISO 8601 format
urls: List[str]
error: Optional[str] = None
class LinkAnalysisRequest(BaseModel):
"""Request body for the /links/analyze endpoint."""
url: str = Field(..., description="URL to analyze for links")
config: Optional[Dict] = Field(
default_factory=dict,
description="Optional LinkPreviewConfig dictionary"
)
data: Optional[Dict] = None # Included only if webhook_data_in_payload=True

File diff suppressed because it is too large Load Diff

Binary file not shown.

After

Width:  |  Height:  |  Size: 5.8 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.6 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 11 KiB

File diff suppressed because it is too large Load Diff

View File

@@ -167,11 +167,14 @@
</a>
</h1>
<div class="ml-auto flex space-x-2">
<button id="play-tab"
class="px-3 py-1 rounded-t bg-surface border border-b-0 border-border text-primary">Playground</button>
<button id="stress-tab" class="px-3 py-1 rounded-t border border-border hover:bg-surface">Stress
Test</button>
<div class="ml-auto flex items-center space-x-4">
<a href="/dashboard" class="text-xs text-secondary hover:text-primary underline">Monitor</a>
<div class="flex space-x-2">
<button id="play-tab"
class="px-3 py-1 rounded-t bg-surface border border-b-0 border-border text-primary">Playground</button>
<button id="stress-tab" class="px-3 py-1 rounded-t border border-border hover:bg-surface">Stress
Test</button>
</div>
</div>
</header>

34
deploy/docker/test-websocket.py Executable file
View File

@@ -0,0 +1,34 @@
#!/usr/bin/env python3
"""
Quick WebSocket test - Connect to monitor WebSocket and print updates
"""
import asyncio
import websockets
import json
async def test_websocket():
uri = "ws://localhost:11235/monitor/ws"
print(f"Connecting to {uri}...")
try:
async with websockets.connect(uri) as websocket:
print("✅ Connected!")
# Receive and print 5 updates
for i in range(5):
message = await websocket.recv()
data = json.loads(message)
print(f"\n📊 Update #{i+1}:")
print(f" - Health: CPU {data['health']['container']['cpu_percent']}%, Memory {data['health']['container']['memory_percent']}%")
print(f" - Active Requests: {len(data['requests']['active'])}")
print(f" - Browsers: {len(data['browsers'])}")
except Exception as e:
print(f"❌ Error: {e}")
return 1
print("\n✅ WebSocket test passed!")
return 0
if __name__ == "__main__":
exit(asyncio.run(test_websocket()))

View File

@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
Monitor Dashboard Demo Script
Generates varied activity to showcase all monitoring features for video recording.
"""
import httpx
import asyncio
import time
from datetime import datetime
BASE_URL = "http://localhost:11235"
async def demo_dashboard():
print("🎬 Monitor Dashboard Demo - Starting...\n")
print(f"📊 Dashboard: {BASE_URL}/dashboard")
print("=" * 60)
async with httpx.AsyncClient(timeout=60.0) as client:
# Phase 1: Simple requests (permanent browser)
print("\n🔷 Phase 1: Testing permanent browser pool")
print("-" * 60)
for i in range(5):
print(f" {i+1}/5 Request to /crawl (default config)...")
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"], "crawler_config": {}}
)
print(f" ✅ Status: {r.status_code}, Time: {r.elapsed.total_seconds():.2f}s")
except Exception as e:
print(f" ❌ Error: {e}")
await asyncio.sleep(1) # Small delay between requests
# Phase 2: Create variant browsers (different configs)
print("\n🔶 Phase 2: Testing cold→hot pool promotion")
print("-" * 60)
viewports = [
{"width": 1920, "height": 1080},
{"width": 1280, "height": 720},
{"width": 800, "height": 600}
]
for idx, viewport in enumerate(viewports):
print(f" Viewport {viewport['width']}x{viewport['height']}:")
for i in range(4): # 4 requests each to trigger promotion at 3
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={
"urls": [f"https://httpbin.org/json?v={idx}&r={i}"],
"browser_config": {"viewport": viewport},
"crawler_config": {}
}
)
print(f" {i+1}/4 ✅ {r.status_code} - Should see cold→hot after 3 uses")
except Exception as e:
print(f" {i+1}/4 ❌ {e}")
await asyncio.sleep(0.5)
# Phase 3: Concurrent burst (stress pool)
print("\n🔷 Phase 3: Concurrent burst (10 parallel)")
print("-" * 60)
tasks = []
for i in range(10):
tasks.append(
client.post(
f"{BASE_URL}/crawl",
json={"urls": [f"https://httpbin.org/delay/2?burst={i}"], "crawler_config": {}}
)
)
print(" Sending 10 concurrent requests...")
start = time.time()
results = await asyncio.gather(*tasks, return_exceptions=True)
elapsed = time.time() - start
successes = sum(1 for r in results if not isinstance(r, Exception) and r.status_code == 200)
print(f"{successes}/10 succeeded in {elapsed:.2f}s")
# Phase 4: Multi-endpoint coverage
print("\n🔶 Phase 4: Testing multiple endpoints")
print("-" * 60)
endpoints = [
("/md", {"url": "https://httpbin.org/html", "f": "fit", "c": "0"}),
("/screenshot", {"url": "https://httpbin.org/html"}),
("/pdf", {"url": "https://httpbin.org/html"}),
]
for endpoint, payload in endpoints:
print(f" Testing {endpoint}...")
try:
if endpoint == "/md":
r = await client.post(f"{BASE_URL}{endpoint}", json=payload)
else:
r = await client.post(f"{BASE_URL}{endpoint}", json=payload)
print(f"{r.status_code}")
except Exception as e:
print(f"{e}")
await asyncio.sleep(1)
# Phase 5: Intentional error (to populate errors tab)
print("\n🔷 Phase 5: Generating error examples")
print("-" * 60)
print(" Triggering invalid URL error...")
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={"urls": ["invalid://bad-url"], "crawler_config": {}}
)
print(f" Response: {r.status_code}")
except Exception as e:
print(f" ✅ Error captured: {type(e).__name__}")
# Phase 6: Wait for janitor activity
print("\n🔶 Phase 6: Waiting for janitor cleanup...")
print("-" * 60)
print(" Idle for 40s to allow janitor to clean cold pool browsers...")
for i in range(40, 0, -10):
print(f" {i}s remaining... (Check dashboard for cleanup events)")
await asyncio.sleep(10)
# Phase 7: Final stats check
print("\n🔷 Phase 7: Final dashboard state")
print("-" * 60)
r = await client.get(f"{BASE_URL}/monitor/health")
health = r.json()
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Browsers: Perm={health['pool']['permanent']['active']}, "
f"Hot={health['pool']['hot']['count']}, Cold={health['pool']['cold']['count']}")
r = await client.get(f"{BASE_URL}/monitor/endpoints/stats")
stats = r.json()
print(f"\n Endpoint Stats:")
for endpoint, data in stats.items():
print(f" {endpoint}: {data['count']} req, "
f"{data['avg_latency_ms']:.0f}ms avg, "
f"{data['success_rate_percent']:.1f}% success")
r = await client.get(f"{BASE_URL}/monitor/browsers")
browsers = r.json()
print(f"\n Pool Efficiency:")
print(f" Total browsers: {browsers['summary']['total_count']}")
print(f" Memory usage: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
print("\n" + "=" * 60)
print("✅ Demo complete! Dashboard is now populated with rich data.")
print(f"\n📹 Recording tip: Refresh {BASE_URL}/dashboard")
print(" You should see:")
print(" • Active & completed requests")
print(" • Browser pool (permanent + hot/cold)")
print(" • Janitor cleanup events")
print(" • Endpoint analytics")
print(" • Memory timeline")
if __name__ == "__main__":
try:
asyncio.run(demo_dashboard())
except KeyboardInterrupt:
print("\n\n⚠️ Demo interrupted by user")
except Exception as e:
print(f"\n\n❌ Demo failed: {e}")

View File

@@ -0,0 +1,2 @@
httpx>=0.25.0
docker>=7.0.0

View File

@@ -0,0 +1,196 @@
#!/usr/bin/env python3
"""
Security Integration Tests for Crawl4AI Docker API.
Tests that security fixes are working correctly against a running server.
Usage:
python run_security_tests.py [base_url]
Example:
python run_security_tests.py http://localhost:11235
"""
import subprocess
import sys
import re
# Colors for terminal output
GREEN = '\033[0;32m'
RED = '\033[0;31m'
YELLOW = '\033[1;33m'
NC = '\033[0m' # No Color
PASSED = 0
FAILED = 0
def run_curl(args: list) -> str:
"""Run curl command and return output."""
try:
result = subprocess.run(
['curl', '-s'] + args,
capture_output=True,
text=True,
timeout=30
)
return result.stdout + result.stderr
except subprocess.TimeoutExpired:
return "TIMEOUT"
except Exception as e:
return str(e)
def test_expect(name: str, expect_pattern: str, curl_args: list) -> bool:
"""Run a test and check if output matches expected pattern."""
global PASSED, FAILED
result = run_curl(curl_args)
if re.search(expect_pattern, result, re.IGNORECASE):
print(f"{GREEN}{NC} {name}")
PASSED += 1
return True
else:
print(f"{RED}{NC} {name}")
print(f" Expected pattern: {expect_pattern}")
print(f" Got: {result[:200]}")
FAILED += 1
return False
def main():
global PASSED, FAILED
base_url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:11235"
print("=" * 60)
print("Crawl4AI Security Integration Tests")
print(f"Target: {base_url}")
print("=" * 60)
print()
# Check server availability
print("Checking server availability...")
result = run_curl(['-o', '/dev/null', '-w', '%{http_code}', f'{base_url}/health'])
if '200' not in result:
print(f"{RED}ERROR: Server not reachable at {base_url}{NC}")
print("Please start the server first.")
sys.exit(1)
print(f"{GREEN}Server is running{NC}")
print()
# === Part A: Security Tests ===
print("=== Part A: Security Tests ===")
print("(Vulnerabilities must be BLOCKED)")
print()
test_expect(
"A1: Hooks disabled by default (403)",
r"403|disabled|Hooks are disabled",
['-X', 'POST', f'{base_url}/crawl',
'-H', 'Content-Type: application/json',
'-d', '{"urls":["https://example.com"],"hooks":{"code":{"on_page_context_created":"async def hook(page, context, **kwargs): return page"}}}']
)
test_expect(
"A2: file:// blocked on /execute_js (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd","scripts":["1"]}']
)
test_expect(
"A3: file:// blocked on /screenshot (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/screenshot',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
test_expect(
"A4: file:// blocked on /pdf (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/pdf',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
test_expect(
"A5: file:// blocked on /html (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/html',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
print()
# === Part B: Functionality Tests ===
print("=== Part B: Functionality Tests ===")
print("(Normal operations must WORK)")
print()
test_expect(
"B1: Basic crawl works",
r"success.*true|results",
['-X', 'POST', f'{base_url}/crawl',
'-H', 'Content-Type: application/json',
'-d', '{"urls":["https://example.com"]}']
)
test_expect(
"B2: /md works with https://",
r"success.*true|markdown",
['-X', 'POST', f'{base_url}/md',
'-H', 'Content-Type: application/json',
'-d', '{"url":"https://example.com"}']
)
test_expect(
"B3: Health endpoint works",
r"ok",
[f'{base_url}/health']
)
print()
# === Part C: Edge Cases ===
print("=== Part C: Edge Cases ===")
print("(Malformed input must be REJECTED)")
print()
test_expect(
"C1: javascript: URL rejected (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"javascript:alert(1)","scripts":["1"]}']
)
test_expect(
"C2: data: URL rejected (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"data:text/html,<h1>test</h1>","scripts":["1"]}']
)
print()
print("=" * 60)
print("Results")
print("=" * 60)
print(f"Passed: {GREEN}{PASSED}{NC}")
print(f"Failed: {RED}{FAILED}{NC}")
print()
if FAILED > 0:
print(f"{RED}SOME TESTS FAILED{NC}")
sys.exit(1)
else:
print(f"{GREEN}ALL TESTS PASSED{NC}")
sys.exit(0)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,138 @@
#!/usr/bin/env python3
"""
Test 1: Basic Container Health + Single Endpoint
- Starts container
- Hits /health endpoint 10 times
- Reports success rate and basic latency
"""
import asyncio
import time
import docker
import httpx
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 10
async def test_endpoint(url: str, count: int):
"""Hit endpoint multiple times, return stats."""
results = []
async with httpx.AsyncClient(timeout=30.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.get(url)
elapsed = (time.time() - start) * 1000 # ms
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
"status": resp.status_code
})
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({
"success": False,
"latency_ms": None,
"error": str(e)
})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container, return container object."""
# Clean up existing
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container '{name}'...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container '{name}' from image '{image}'...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
environment={"PYTHON_ENV": "production"}
)
# Wait for health
print(f"⏳ Waiting for container to be healthy...")
for _ in range(30): # 30s timeout
time.sleep(1)
container.reload()
if container.status == "running":
try:
# Quick health check
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop and remove container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
print(f"✅ Container removed")
async def main():
print("="*60)
print("TEST 1: Basic Container Health + Single Endpoint")
print("="*60)
client = docker.from_env()
container = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Test /health endpoint
print(f"\n📊 Testing /health endpoint ({REQUESTS} requests)...")
url = f"http://localhost:{PORT}/health"
results = await test_endpoint(url, REQUESTS)
# Calculate stats
successes = sum(1 for r in results if r["success"])
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if r["latency_ms"] is not None]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
if latencies:
print(f" Min Latency: {min(latencies):.0f}ms")
print(f" Max Latency: {max(latencies):.0f}ms")
print(f"{'='*60}")
# Pass/Fail
if success_rate >= 100:
print(f"✅ TEST PASSED")
return 0
else:
print(f"❌ TEST FAILED (expected 100% success rate)")
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
return 1
finally:
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""
Test 2: Docker Stats Monitoring
- Extends Test 1 with real-time container stats
- Monitors memory % and CPU during requests
- Reports baseline, peak, and final memory
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 20 # More requests to see memory usage
# Stats tracking
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background thread to collect container stats."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
# Extract memory stats
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024) # MB
mem_limit = stat['memory_stats'].get('limit', 1) / (1024 * 1024)
mem_percent = (mem_usage / mem_limit * 100) if mem_limit > 0 else 0
# Extract CPU stats (handle missing fields on Mac)
cpu_percent = 0
try:
cpu_delta = stat['cpu_stats']['cpu_usage']['total_usage'] - \
stat['precpu_stats']['cpu_usage']['total_usage']
system_delta = stat['cpu_stats'].get('system_cpu_usage', 0) - \
stat['precpu_stats'].get('system_cpu_usage', 0)
if system_delta > 0:
num_cpus = stat['cpu_stats'].get('online_cpus', 1)
cpu_percent = (cpu_delta / system_delta * num_cpus * 100.0)
except (KeyError, ZeroDivisionError):
pass
stats_history.append({
'timestamp': time.time(),
'memory_mb': mem_usage,
'memory_percent': mem_percent,
'cpu_percent': cpu_percent
})
except Exception as e:
# Skip malformed stats
pass
time.sleep(0.5) # Sample every 500ms
async def test_endpoint(url: str, count: int):
"""Hit endpoint, return stats."""
results = []
async with httpx.AsyncClient(timeout=30.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.get(url)
elapsed = (time.time() - start) * 1000
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
})
if (i + 1) % 5 == 0: # Print every 5 requests
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({"success": False, "error": str(e)})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container '{name}'...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container '{name}'...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
mem_limit="4g", # Set explicit memory limit
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
async def main():
print("="*60)
print("TEST 2: Docker Stats Monitoring")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Start stats monitoring in background
print(f"\n📊 Starting stats monitor...")
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
# Wait a bit for baseline
await asyncio.sleep(2)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline memory: {baseline_mem:.1f} MB")
# Test /health endpoint
print(f"\n🔄 Running {REQUESTS} requests to /health...")
url = f"http://localhost:{PORT}/health"
results = await test_endpoint(url, REQUESTS)
# Wait a bit to capture peak
await asyncio.sleep(1)
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Calculate stats
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Memory stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
mem_delta = final_mem - baseline_mem
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
print(f"\n Memory Stats:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {mem_delta:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
if success_rate >= 100 and mem_delta < 100: # No significant memory growth
print(f"✅ TEST PASSED")
return 0
else:
if success_rate < 100:
print(f"❌ TEST FAILED (success rate < 100%)")
if mem_delta >= 100:
print(f"⚠️ WARNING: Memory grew by {mem_delta:.1f} MB")
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
return 1
finally:
stop_monitoring.set()
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Test 3: Pool Validation - Permanent Browser Reuse
- Tests /html endpoint (should use permanent browser)
- Monitors container logs for pool hit markers
- Validates browser reuse rate
- Checks memory after browser creation
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 30
# Stats tracking
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({
'timestamp': time.time(),
'memory_mb': mem_usage,
})
except:
pass
time.sleep(0.5)
def count_log_markers(container):
"""Extract pool usage markers from logs."""
logs = container.logs().decode('utf-8')
permanent_hits = logs.count("🔥 Using permanent browser")
hot_hits = logs.count("♨️ Using hot pool browser")
cold_hits = logs.count("❄️ Using cold pool browser")
new_created = logs.count("🆕 Creating new browser")
return {
'permanent_hits': permanent_hits,
'hot_hits': hot_hits,
'cold_hits': cold_hits,
'new_created': new_created,
'total_hits': permanent_hits + hot_hits + cold_hits
}
async def test_endpoint(url: str, count: int):
"""Hit endpoint multiple times."""
results = []
async with httpx.AsyncClient(timeout=60.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"})
elapsed = (time.time() - start) * 1000
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
})
if (i + 1) % 10 == 0:
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({"success": False, "error": str(e)})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
async def main():
print("="*60)
print("TEST 3: Pool Validation - Permanent Browser Reuse")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Wait for permanent browser initialization
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start stats monitoring
print(f"📊 Starting stats monitor...")
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline (with permanent browser): {baseline_mem:.1f} MB")
# Test /html endpoint (uses permanent browser for default config)
print(f"\n🔄 Running {REQUESTS} requests to /html...")
url = f"http://localhost:{PORT}/html"
results = await test_endpoint(url, REQUESTS)
# Wait a bit
await asyncio.sleep(1)
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze logs for pool markers
print(f"\n📋 Analyzing pool usage...")
pool_stats = count_log_markers(container)
# Calculate request stats
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Memory stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
mem_delta = final_mem - baseline_mem
# Calculate reuse rate
total_requests = len(results)
total_pool_hits = pool_stats['total_hits']
reuse_rate = (total_pool_hits / total_requests * 100) if total_requests > 0 else 0
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
print(f"\n Pool Stats:")
print(f" 🔥 Permanent Hits: {pool_stats['permanent_hits']}")
print(f" ♨️ Hot Pool Hits: {pool_stats['hot_hits']}")
print(f" ❄️ Cold Pool Hits: {pool_stats['cold_hits']}")
print(f" 🆕 New Created: {pool_stats['new_created']}")
print(f" 📊 Reuse Rate: {reuse_rate:.1f}%")
print(f"\n Memory Stats:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {mem_delta:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
if success_rate < 100:
print(f"❌ FAIL: Success rate {success_rate:.1f}% < 100%")
passed = False
if reuse_rate < 80:
print(f"❌ FAIL: Reuse rate {reuse_rate:.1f}% < 80% (expected high permanent browser usage)")
passed = False
if pool_stats['permanent_hits'] < (total_requests * 0.8):
print(f"⚠️ WARNING: Only {pool_stats['permanent_hits']} permanent hits out of {total_requests} requests")
if mem_delta > 200:
print(f"⚠️ WARNING: Memory grew by {mem_delta:.1f} MB (possible browser leak)")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""
Test 4: Concurrent Load Testing
- Tests pool under concurrent load
- Escalates: 10 → 50 → 100 concurrent requests
- Validates latency distribution (P50, P95, P99)
- Monitors memory stability
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
from collections import defaultdict
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
LOAD_LEVELS = [
{"name": "Light", "concurrent": 10, "requests": 20},
{"name": "Medium", "concurrent": 50, "requests": 100},
{"name": "Heavy", "concurrent": 100, "requests": 200},
]
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
def count_log_markers(container):
"""Extract pool markers."""
logs = container.logs().decode('utf-8')
return {
'permanent': logs.count("🔥 Using permanent browser"),
'hot': logs.count("♨️ Using hot pool browser"),
'cold': logs.count("❄️ Using cold pool browser"),
'new': logs.count("🆕 Creating new browser"),
}
async def hit_endpoint(client, url, payload, semaphore):
"""Single request with concurrency control."""
async with semaphore:
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=60.0)
elapsed = (time.time() - start) * 1000
return {"success": resp.status_code == 200, "latency_ms": elapsed}
except Exception as e:
return {"success": False, "error": str(e)}
async def run_concurrent_test(url, payload, concurrent, total_requests):
"""Run concurrent requests."""
semaphore = asyncio.Semaphore(concurrent)
async with httpx.AsyncClient() as client:
tasks = [hit_endpoint(client, url, payload, semaphore) for _ in range(total_requests)]
results = await asyncio.gather(*tasks)
return results
def calculate_percentiles(latencies):
"""Calculate P50, P95, P99."""
if not latencies:
return 0, 0, 0
sorted_lat = sorted(latencies)
n = len(sorted_lat)
return (
sorted_lat[int(n * 0.50)],
sorted_lat[int(n * 0.95)],
sorted_lat[int(n * 0.99)],
)
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 4: Concurrent Load Testing")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
url = f"http://localhost:{PORT}/html"
payload = {"url": "https://httpbin.org/html"}
all_results = []
level_stats = []
# Run load levels
for level in LOAD_LEVELS:
print(f"{'='*60}")
print(f"🔄 {level['name']} Load: {level['concurrent']} concurrent, {level['requests']} total")
print(f"{'='*60}")
start_time = time.time()
results = await run_concurrent_test(url, payload, level['concurrent'], level['requests'])
duration = time.time() - start_time
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
p50, p95, p99 = calculate_percentiles(latencies)
avg_lat = sum(latencies) / len(latencies) if latencies else 0
print(f" Duration: {duration:.1f}s")
print(f" Success: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_lat:.0f}ms")
print(f" P50/P95/P99: {p50:.0f}ms / {p95:.0f}ms / {p99:.0f}ms")
level_stats.append({
'name': level['name'],
'concurrent': level['concurrent'],
'success_rate': success_rate,
'avg_latency': avg_lat,
'p50': p50, 'p95': p95, 'p99': p99,
})
all_results.extend(results)
await asyncio.sleep(2) # Cool down between levels
# Stop monitoring
await asyncio.sleep(1)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Final stats
pool_stats = count_log_markers(container)
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"FINAL RESULTS:")
print(f"{'='*60}")
print(f" Total Requests: {len(all_results)}")
print(f"\n Pool Utilization:")
print(f" 🔥 Permanent: {pool_stats['permanent']}")
print(f" ♨️ Hot: {pool_stats['hot']}")
print(f" ❄️ Cold: {pool_stats['cold']}")
print(f" 🆕 New: {pool_stats['new']}")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
for ls in level_stats:
if ls['success_rate'] < 99:
print(f"❌ FAIL: {ls['name']} success rate {ls['success_rate']:.1f}% < 99%")
passed = False
if ls['p99'] > 10000: # 10s threshold
print(f"⚠️ WARNING: {ls['name']} P99 latency {ls['p99']:.0f}ms very high")
if final_mem - baseline_mem > 300:
print(f"⚠️ WARNING: Memory grew {final_mem - baseline_mem:.1f} MB")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,267 @@
#!/usr/bin/env python3
"""
Test 5: Pool Stress - Mixed Configs
- Tests hot/cold pool with different browser configs
- Uses different viewports to create config variants
- Validates cold → hot promotion after 3 uses
- Monitors pool tier distribution
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
import random
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS_PER_CONFIG = 5 # 5 requests per config variant
# Different viewport configs to test pool tiers
VIEWPORT_CONFIGS = [
None, # Default (permanent browser)
{"width": 1920, "height": 1080}, # Desktop
{"width": 1024, "height": 768}, # Tablet
{"width": 375, "height": 667}, # Mobile
]
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
def analyze_pool_logs(container):
"""Extract detailed pool stats from logs."""
logs = container.logs().decode('utf-8')
permanent = logs.count("🔥 Using permanent browser")
hot = logs.count("♨️ Using hot pool browser")
cold = logs.count("❄️ Using cold pool browser")
new = logs.count("🆕 Creating new browser")
promotions = logs.count("⬆️ Promoting to hot pool")
return {
'permanent': permanent,
'hot': hot,
'cold': cold,
'new': new,
'promotions': promotions,
'total': permanent + hot + cold
}
async def crawl_with_viewport(client, url, viewport):
"""Single request with specific viewport."""
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {}
}
# Add viewport if specified
if viewport:
payload["browser_config"] = {
"type": "BrowserConfig",
"params": {
"viewport": {"type": "dict", "value": viewport},
"headless": True,
"text_mode": True,
"extra_args": [
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-gpu",
"--disable-software-rasterizer",
"--disable-web-security",
"--allow-insecure-localhost",
"--ignore-certificate-errors"
]
}
}
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=60.0)
elapsed = (time.time() - start) * 1000
return {"success": resp.status_code == 200, "latency_ms": elapsed, "viewport": viewport}
except Exception as e:
return {"success": False, "error": str(e), "viewport": viewport}
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 5: Pool Stress - Mixed Configs")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
url = f"http://localhost:{PORT}/crawl"
print(f"Testing {len(VIEWPORT_CONFIGS)} different configs:")
for i, vp in enumerate(VIEWPORT_CONFIGS):
vp_str = "Default" if vp is None else f"{vp['width']}x{vp['height']}"
print(f" {i+1}. {vp_str}")
print()
# Run requests: repeat each config REQUESTS_PER_CONFIG times
all_results = []
config_sequence = []
for _ in range(REQUESTS_PER_CONFIG):
for viewport in VIEWPORT_CONFIGS:
config_sequence.append(viewport)
# Shuffle to mix configs
random.shuffle(config_sequence)
print(f"🔄 Running {len(config_sequence)} requests with mixed configs...")
async with httpx.AsyncClient() as http_client:
for i, viewport in enumerate(config_sequence):
result = await crawl_with_viewport(http_client, url, viewport)
all_results.append(result)
if (i + 1) % 5 == 0:
vp_str = "default" if result['viewport'] is None else f"{result['viewport']['width']}x{result['viewport']['height']}"
status = "" if result.get('success') else ""
lat = f"{result.get('latency_ms', 0):.0f}ms" if 'latency_ms' in result else "error"
print(f" [{i+1}/{len(config_sequence)}] {status} {vp_str} - {lat}")
# Stop monitoring
await asyncio.sleep(2)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze results
pool_stats = analyze_pool_logs(container)
successes = sum(1 for r in all_results if r.get("success"))
success_rate = (successes / len(all_results)) * 100
latencies = [r["latency_ms"] for r in all_results if "latency_ms" in r]
avg_lat = sum(latencies) / len(latencies) if latencies else 0
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
print(f" Requests: {len(all_results)}")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(all_results)})")
print(f" Avg Latency: {avg_lat:.0f}ms")
print(f"\n Pool Statistics:")
print(f" 🔥 Permanent: {pool_stats['permanent']}")
print(f" ♨️ Hot: {pool_stats['hot']}")
print(f" ❄️ Cold: {pool_stats['cold']}")
print(f" 🆕 New: {pool_stats['new']}")
print(f" ⬆️ Promotions: {pool_stats['promotions']}")
print(f" 📊 Reuse: {(pool_stats['total'] / len(all_results) * 100):.1f}%")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
if success_rate < 99:
print(f"❌ FAIL: Success rate {success_rate:.1f}% < 99%")
passed = False
# Should see promotions since we repeat each config 5 times
if pool_stats['promotions'] < (len(VIEWPORT_CONFIGS) - 1): # -1 for default
print(f"⚠️ WARNING: Only {pool_stats['promotions']} promotions (expected ~{len(VIEWPORT_CONFIGS)-1})")
# Should have created some browsers for different configs
if pool_stats['new'] == 0:
print(f"⚠️ NOTE: No new browsers created (all used default?)")
if pool_stats['permanent'] == len(all_results):
print(f"⚠️ NOTE: All requests used permanent browser (configs not varying enough?)")
if final_mem - baseline_mem > 500:
print(f"⚠️ WARNING: Memory grew {final_mem - baseline_mem:.1f} MB")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Test 6: Multi-Endpoint Testing
- Tests multiple endpoints together: /html, /screenshot, /pdf, /crawl
- Validates each endpoint works correctly
- Monitors success rates per endpoint
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS_PER_ENDPOINT = 10
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
async def test_html(client, base_url, count):
"""Test /html endpoint."""
url = f"{base_url}/html"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_screenshot(client, base_url, count):
"""Test /screenshot endpoint."""
url = f"{base_url}/screenshot"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_pdf(client, base_url, count):
"""Test /pdf endpoint."""
url = f"{base_url}/pdf"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_crawl(client, base_url, count):
"""Test /crawl endpoint."""
url = f"{base_url}/crawl"
results = []
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {}
}
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 6: Multi-Endpoint Testing")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
base_url = f"http://localhost:{PORT}"
# Test each endpoint
endpoints = {
"/html": test_html,
"/screenshot": test_screenshot,
"/pdf": test_pdf,
"/crawl": test_crawl,
}
all_endpoint_stats = {}
async with httpx.AsyncClient() as http_client:
for endpoint_name, test_func in endpoints.items():
print(f"🔄 Testing {endpoint_name} ({REQUESTS_PER_ENDPOINT} requests)...")
results = await test_func(http_client, base_url, REQUESTS_PER_ENDPOINT)
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_lat = sum(latencies) / len(latencies) if latencies else 0
all_endpoint_stats[endpoint_name] = {
'success_rate': success_rate,
'avg_latency': avg_lat,
'total': len(results),
'successes': successes
}
print(f" ✓ Success: {success_rate:.1f}% ({successes}/{len(results)}), Avg: {avg_lat:.0f}ms")
# Stop monitoring
await asyncio.sleep(1)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Final stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
for endpoint, stats in all_endpoint_stats.items():
print(f" {endpoint:12} Success: {stats['success_rate']:5.1f}% Avg: {stats['avg_latency']:6.0f}ms")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
for endpoint, stats in all_endpoint_stats.items():
if stats['success_rate'] < 100:
print(f"❌ FAIL: {endpoint} success rate {stats['success_rate']:.1f}% < 100%")
passed = False
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,199 @@
#!/usr/bin/env python3
"""
Test 7: Cleanup Verification (Janitor)
- Creates load spike then goes idle
- Verifies memory returns to near baseline
- Tests janitor cleanup of idle browsers
- Monitors memory recovery time
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
SPIKE_REQUESTS = 20 # Create some browsers
IDLE_TIME = 90 # Wait 90s for janitor (runs every 60s)
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(1) # Sample every 1s for this test
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 7: Cleanup Verification (Janitor)")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(2)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
# Create load spike with different configs to populate pool
print(f"🔥 Creating load spike ({SPIKE_REQUESTS} requests with varied configs)...")
url = f"http://localhost:{PORT}/crawl"
viewports = [
{"width": 1920, "height": 1080},
{"width": 1024, "height": 768},
{"width": 375, "height": 667},
]
async with httpx.AsyncClient(timeout=60.0) as http_client:
tasks = []
for i in range(SPIKE_REQUESTS):
vp = viewports[i % len(viewports)]
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {
"type": "BrowserConfig",
"params": {
"viewport": {"type": "dict", "value": vp},
"headless": True,
"text_mode": True,
"extra_args": [
"--no-sandbox", "--disable-dev-shm-usage",
"--disable-gpu", "--disable-software-rasterizer",
"--disable-web-security", "--allow-insecure-localhost",
"--ignore-certificate-errors"
]
}
},
"crawler_config": {}
}
tasks.append(http_client.post(url, json=payload))
results = await asyncio.gather(*tasks, return_exceptions=True)
successes = sum(1 for r in results if hasattr(r, 'status_code') and r.status_code == 200)
print(f" ✓ Spike completed: {successes}/{len(results)} successful")
# Measure peak
await asyncio.sleep(2)
peak_mem = max([s['memory_mb'] for s in stats_history]) if stats_history else baseline_mem
print(f" 📊 Peak memory: {peak_mem:.1f} MB (+{peak_mem - baseline_mem:.1f} MB)")
# Now go idle and wait for janitor
print(f"\n⏸️ Going idle for {IDLE_TIME}s (janitor cleanup)...")
print(f" (Janitor runs every 60s, checking for idle browsers)")
for elapsed in range(0, IDLE_TIME, 10):
await asyncio.sleep(10)
current_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f" [{elapsed+10:3d}s] Memory: {current_mem:.1f} MB")
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze memory recovery
final_mem = stats_history[-1]['memory_mb'] if stats_history else 0
recovery_mb = peak_mem - final_mem
recovery_pct = (recovery_mb / (peak_mem - baseline_mem) * 100) if (peak_mem - baseline_mem) > 0 else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
print(f" Memory Journey:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB (+{peak_mem - baseline_mem:.1f} MB)")
print(f" Final: {final_mem:.1f} MB (+{final_mem - baseline_mem:.1f} MB)")
print(f" Recovered: {recovery_mb:.1f} MB ({recovery_pct:.1f}%)")
print(f"{'='*60}")
# Pass/Fail
passed = True
# Should have created some memory pressure
if peak_mem - baseline_mem < 100:
print(f"⚠️ WARNING: Peak increase only {peak_mem - baseline_mem:.1f} MB (expected more browsers)")
# Should recover most memory (within 100MB of baseline)
if final_mem - baseline_mem > 100:
print(f"⚠️ WARNING: Memory didn't recover well (still +{final_mem - baseline_mem:.1f} MB above baseline)")
else:
print(f"✅ Good memory recovery!")
# Baseline + 50MB tolerance
if final_mem - baseline_mem < 50:
print(f"✅ Excellent cleanup (within 50MB of baseline)")
print(f"✅ TEST PASSED")
return 0
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,57 @@
#!/usr/bin/env python3
"""Quick test to generate monitor dashboard activity"""
import httpx
import asyncio
async def test_dashboard():
async with httpx.AsyncClient(timeout=30.0) as client:
print("📊 Generating dashboard activity...")
# Test 1: Simple crawl
print("\n1⃣ Running simple crawl...")
r1 = await client.post(
"http://localhost:11235/crawl",
json={"urls": ["https://httpbin.org/html"], "crawler_config": {}}
)
print(f" Status: {r1.status_code}")
# Test 2: Multiple URLs
print("\n2⃣ Running multi-URL crawl...")
r2 = await client.post(
"http://localhost:11235/crawl",
json={
"urls": [
"https://httpbin.org/html",
"https://httpbin.org/json"
],
"crawler_config": {}
}
)
print(f" Status: {r2.status_code}")
# Test 3: Check monitor health
print("\n3⃣ Checking monitor health...")
r3 = await client.get("http://localhost:11235/monitor/health")
health = r3.json()
print(f" Memory: {health['container']['memory_percent']}%")
print(f" Browsers: {health['pool']['permanent']['active']}")
# Test 4: Check requests
print("\n4⃣ Checking request log...")
r4 = await client.get("http://localhost:11235/monitor/requests")
reqs = r4.json()
print(f" Active: {len(reqs['active'])}")
print(f" Completed: {len(reqs['completed'])}")
# Test 5: Check endpoint stats
print("\n5⃣ Checking endpoint stats...")
r5 = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = r5.json()
for endpoint, data in stats.items():
print(f" {endpoint}: {data['count']} requests, {data['avg_latency_ms']}ms avg")
print("\n✅ Dashboard should now show activity!")
print(f"\n🌐 Open: http://localhost:11235/dashboard")
if __name__ == "__main__":
asyncio.run(test_dashboard())

View File

@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""
Unit tests for security fixes.
These tests verify the security fixes at the code level without needing a running server.
"""
import sys
import os
# Add parent directory to path to import modules
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import unittest
class TestURLValidation(unittest.TestCase):
"""Test URL scheme validation helper."""
def setUp(self):
"""Set up test fixtures."""
# Import the validation constants and function
self.ALLOWED_URL_SCHEMES = ("http://", "https://")
self.ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")
def validate_url_scheme(self, url: str, allow_raw: bool = False) -> bool:
"""Local version of validate_url_scheme for testing."""
allowed = self.ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else self.ALLOWED_URL_SCHEMES
return url.startswith(allowed)
# === SECURITY TESTS: These URLs must be BLOCKED ===
def test_file_url_blocked(self):
"""file:// URLs must be blocked (LFI vulnerability)."""
self.assertFalse(self.validate_url_scheme("file:///etc/passwd"))
self.assertFalse(self.validate_url_scheme("file:///etc/passwd", allow_raw=True))
def test_file_url_blocked_windows(self):
"""file:// URLs with Windows paths must be blocked."""
self.assertFalse(self.validate_url_scheme("file:///C:/Windows/System32/config/sam"))
def test_javascript_url_blocked(self):
"""javascript: URLs must be blocked (XSS)."""
self.assertFalse(self.validate_url_scheme("javascript:alert(1)"))
def test_data_url_blocked(self):
"""data: URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("data:text/html,<script>alert(1)</script>"))
def test_ftp_url_blocked(self):
"""ftp: URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("ftp://example.com/file"))
def test_empty_url_blocked(self):
"""Empty URLs must be blocked."""
self.assertFalse(self.validate_url_scheme(""))
def test_relative_url_blocked(self):
"""Relative URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("/etc/passwd"))
self.assertFalse(self.validate_url_scheme("../../../etc/passwd"))
# === FUNCTIONALITY TESTS: These URLs must be ALLOWED ===
def test_http_url_allowed(self):
"""http:// URLs must be allowed."""
self.assertTrue(self.validate_url_scheme("http://example.com"))
self.assertTrue(self.validate_url_scheme("http://localhost:8080"))
def test_https_url_allowed(self):
"""https:// URLs must be allowed."""
self.assertTrue(self.validate_url_scheme("https://example.com"))
self.assertTrue(self.validate_url_scheme("https://example.com/path?query=1"))
def test_raw_url_allowed_when_enabled(self):
"""raw: URLs must be allowed when allow_raw=True."""
self.assertTrue(self.validate_url_scheme("raw:<html></html>", allow_raw=True))
self.assertTrue(self.validate_url_scheme("raw://<html></html>", allow_raw=True))
def test_raw_url_blocked_when_disabled(self):
"""raw: URLs must be blocked when allow_raw=False."""
self.assertFalse(self.validate_url_scheme("raw:<html></html>", allow_raw=False))
self.assertFalse(self.validate_url_scheme("raw://<html></html>", allow_raw=False))
class TestHookBuiltins(unittest.TestCase):
"""Test that dangerous builtins are removed from hooks."""
def test_import_not_in_allowed_builtins(self):
"""__import__ must NOT be in allowed_builtins."""
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__build_class__' # Required for class definitions in exec
]
self.assertNotIn('__import__', allowed_builtins)
self.assertNotIn('eval', allowed_builtins)
self.assertNotIn('exec', allowed_builtins)
self.assertNotIn('compile', allowed_builtins)
self.assertNotIn('open', allowed_builtins)
def test_build_class_in_allowed_builtins(self):
"""__build_class__ must be in allowed_builtins (needed for class definitions)."""
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__build_class__'
]
self.assertIn('__build_class__', allowed_builtins)
class TestHooksEnabled(unittest.TestCase):
"""Test HOOKS_ENABLED environment variable logic."""
def test_hooks_disabled_by_default(self):
"""Hooks must be disabled by default."""
# Simulate the default behavior
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
# Clear any existing env var to test default
original = os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
try:
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertFalse(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
def test_hooks_enabled_when_true(self):
"""Hooks must be enabled when CRAWL4AI_HOOKS_ENABLED=true."""
original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
try:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = "true"
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertTrue(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
else:
os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
def test_hooks_disabled_when_false(self):
"""Hooks must be disabled when CRAWL4AI_HOOKS_ENABLED=false."""
original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
try:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = "false"
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertFalse(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
else:
os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
if __name__ == '__main__':
print("=" * 60)
print("Crawl4AI Security Fixes - Unit Tests")
print("=" * 60)
print()
# Run tests with verbosity
unittest.main(verbosity=2)

View File

@@ -6,33 +6,7 @@ from datetime import datetime
from enum import Enum
from pathlib import Path
from fastapi import Request
from typing import Dict, Optional, Any, List
# Import dispatchers from crawl4ai
from crawl4ai.async_dispatcher import (
BaseDispatcher,
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
)
# Import chunking strategies from crawl4ai
from crawl4ai.chunking_strategy import (
ChunkingStrategy,
IdentityChunking,
RegexChunking,
NlpSentenceChunking,
TopicSegmentationChunking,
FixedLengthWordChunking,
SlidingWindowChunking,
OverlappingWindowChunking,
)
# Import dispatchers from crawl4ai
from crawl4ai.async_dispatcher import (
BaseDispatcher,
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
)
from typing import Dict, Optional
class TaskStatus(str, Enum):
PROCESSING = "processing"
@@ -45,124 +19,6 @@ class FilterType(str, Enum):
BM25 = "bm25"
LLM = "llm"
# ============================================================================
# Dispatcher Configuration and Factory
# ============================================================================
# Default dispatcher configurations (hardcoded, no env variables)
DISPATCHER_DEFAULTS = {
"memory_adaptive": {
"memory_threshold_percent": 70.0,
"critical_threshold_percent": 85.0,
"recovery_threshold_percent": 65.0,
"check_interval": 1.0,
"max_session_permit": 20,
"fairness_timeout": 600.0,
"memory_wait_timeout": None, # Disable memory timeout for testing
},
"semaphore": {
"semaphore_count": 5,
"max_session_permit": 10,
}
}
DEFAULT_DISPATCHER_TYPE = "memory_adaptive"
def create_dispatcher(dispatcher_type: str) -> BaseDispatcher:
"""
Factory function to create dispatcher instances.
Args:
dispatcher_type: Type of dispatcher to create ("memory_adaptive" or "semaphore")
Returns:
BaseDispatcher instance
Raises:
ValueError: If dispatcher type is unknown
"""
dispatcher_type = dispatcher_type.lower()
if dispatcher_type == "memory_adaptive":
config = DISPATCHER_DEFAULTS["memory_adaptive"]
return MemoryAdaptiveDispatcher(
memory_threshold_percent=config["memory_threshold_percent"],
critical_threshold_percent=config["critical_threshold_percent"],
recovery_threshold_percent=config["recovery_threshold_percent"],
check_interval=config["check_interval"],
max_session_permit=config["max_session_permit"],
fairness_timeout=config["fairness_timeout"],
memory_wait_timeout=config["memory_wait_timeout"],
)
elif dispatcher_type == "semaphore":
config = DISPATCHER_DEFAULTS["semaphore"]
return SemaphoreDispatcher(
semaphore_count=config["semaphore_count"],
max_session_permit=config["max_session_permit"],
)
else:
raise ValueError(f"Unknown dispatcher type: {dispatcher_type}")
def get_dispatcher_config(dispatcher_type: str) -> Dict:
"""
Get configuration for a dispatcher type.
Args:
dispatcher_type: Type of dispatcher ("memory_adaptive" or "semaphore")
Returns:
Dictionary containing dispatcher configuration
Raises:
ValueError: If dispatcher type is unknown
"""
dispatcher_type = dispatcher_type.lower()
if dispatcher_type not in DISPATCHER_DEFAULTS:
raise ValueError(f"Unknown dispatcher type: {dispatcher_type}")
return DISPATCHER_DEFAULTS[dispatcher_type].copy()
def get_available_dispatchers() -> Dict[str, Dict]:
"""
Get information about all available dispatchers.
Returns:
Dictionary mapping dispatcher types to their metadata
"""
return {
"memory_adaptive": {
"name": "Memory Adaptive Dispatcher",
"description": "Dynamically adjusts concurrency based on system memory usage. "
"Monitors memory pressure and adapts crawl sessions accordingly.",
"config": DISPATCHER_DEFAULTS["memory_adaptive"],
"features": [
"Dynamic concurrency adjustment",
"Memory pressure monitoring",
"Automatic task requeuing under high memory",
"Fairness timeout for long-waiting URLs"
]
},
"semaphore": {
"name": "Semaphore Dispatcher",
"description": "Fixed concurrency limit using semaphore-based control. "
"Simple and predictable for controlled crawling.",
"config": DISPATCHER_DEFAULTS["semaphore"],
"features": [
"Fixed concurrency limit",
"Simple semaphore-based control",
"Predictable resource usage"
]
}
}
# ============================================================================
# End Dispatcher Configuration
# ============================================================================
def load_config() -> Dict:
"""Load and return application configuration with environment variable overrides."""
config_path = Path(__file__).parent / "config.yml"
@@ -324,236 +180,27 @@ def verify_email_domain(email: str) -> bool:
except Exception as e:
return False
def create_chunking_strategy(config: Optional[Dict[str, Any]] = None) -> Optional[ChunkingStrategy]:
"""
Factory function to create chunking strategy instances from configuration.
Args:
config: Dictionary containing 'type' and 'params' keys
Example: {"type": "RegexChunking", "params": {"patterns": ["\\n\\n+"]}}
Returns:
ChunkingStrategy instance or None if config is None
Raises:
ValueError: If chunking strategy type is unknown or config is invalid
"""
if config is None:
return None
if not isinstance(config, dict):
raise ValueError(f"Chunking strategy config must be a dictionary, got {type(config)}")
if "type" not in config:
raise ValueError("Chunking strategy config must contain 'type' field")
strategy_type = config["type"]
params = config.get("params", {})
# Validate params is a dict
if not isinstance(params, dict):
raise ValueError(f"Chunking strategy params must be a dictionary, got {type(params)}")
# Strategy factory mapping
strategies = {
"IdentityChunking": IdentityChunking,
"RegexChunking": RegexChunking,
"NlpSentenceChunking": NlpSentenceChunking,
"TopicSegmentationChunking": TopicSegmentationChunking,
"FixedLengthWordChunking": FixedLengthWordChunking,
"SlidingWindowChunking": SlidingWindowChunking,
"OverlappingWindowChunking": OverlappingWindowChunking,
}
if strategy_type not in strategies:
available = ", ".join(strategies.keys())
raise ValueError(f"Unknown chunking strategy type: {strategy_type}. Available: {available}")
def get_container_memory_percent() -> float:
"""Get actual container memory usage vs limit (cgroup v1/v2 aware)."""
try:
return strategies[strategy_type](**params)
except Exception as e:
raise ValueError(f"Failed to create {strategy_type} with params {params}: {str(e)}")
# Try cgroup v2 first
usage_path = Path("/sys/fs/cgroup/memory.current")
limit_path = Path("/sys/fs/cgroup/memory.max")
if not usage_path.exists():
# Fall back to cgroup v1
usage_path = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")
limit_path = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = int(usage_path.read_text())
limit = int(limit_path.read_text())
# ============================================================================
# Table Extraction Utilities
# ============================================================================
# Handle unlimited (v2: "max", v1: > 1e18)
if limit > 1e18:
import psutil
limit = psutil.virtual_memory().total
def create_table_extraction_strategy(config):
"""
Create a table extraction strategy from configuration.
Args:
config: TableExtractionConfig instance or dict
Returns:
TableExtractionStrategy instance
Raises:
ValueError: If strategy type is unknown or configuration is invalid
"""
from crawl4ai.table_extraction import (
NoTableExtraction,
DefaultTableExtraction,
LLMTableExtraction
)
from schemas import TableExtractionStrategy
# Handle both Pydantic model and dict
if hasattr(config, 'strategy'):
strategy_type = config.strategy
elif isinstance(config, dict):
strategy_type = config.get('strategy', 'default')
else:
strategy_type = 'default'
# Convert string to enum if needed
if isinstance(strategy_type, str):
strategy_type = strategy_type.lower()
# Extract configuration values
def get_config_value(key, default=None):
if hasattr(config, key):
return getattr(config, key)
elif isinstance(config, dict):
return config.get(key, default)
return default
# Create strategy based on type
if strategy_type in ['none', TableExtractionStrategy.NONE]:
return NoTableExtraction()
elif strategy_type in ['default', TableExtractionStrategy.DEFAULT]:
return DefaultTableExtraction(
table_score_threshold=get_config_value('table_score_threshold', 7),
min_rows=get_config_value('min_rows', 0),
min_cols=get_config_value('min_cols', 0),
verbose=get_config_value('verbose', False)
)
elif strategy_type in ['llm', TableExtractionStrategy.LLM]:
from crawl4ai.types import LLMConfig
# Build LLM config
llm_config = None
llm_provider = get_config_value('llm_provider')
llm_api_key = get_config_value('llm_api_key')
llm_model = get_config_value('llm_model')
llm_base_url = get_config_value('llm_base_url')
if llm_provider or llm_api_key:
llm_config = LLMConfig(
provider=llm_provider or "openai/gpt-4",
api_token=llm_api_key,
model=llm_model,
base_url=llm_base_url
)
return LLMTableExtraction(
llm_config=llm_config,
extraction_prompt=get_config_value('extraction_prompt'),
table_score_threshold=get_config_value('table_score_threshold', 7),
min_rows=get_config_value('min_rows', 0),
min_cols=get_config_value('min_cols', 0),
verbose=get_config_value('verbose', False)
)
elif strategy_type in ['financial', TableExtractionStrategy.FINANCIAL]:
# Financial strategy uses DefaultTableExtraction with specialized settings
# optimized for financial data (tables with currency, numbers, etc.)
return DefaultTableExtraction(
table_score_threshold=get_config_value('table_score_threshold', 10), # Higher threshold for financial
min_rows=get_config_value('min_rows', 2), # Financial tables usually have at least 2 rows
min_cols=get_config_value('min_cols', 2), # Financial tables usually have at least 2 columns
verbose=get_config_value('verbose', False)
)
else:
raise ValueError(f"Unknown table extraction strategy: {strategy_type}")
def format_table_response(tables: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""
Format extracted tables for API response.
Args:
tables: List of table dictionaries from table extraction strategy
Returns:
List of formatted table dictionaries with consistent structure
"""
if not tables:
return []
formatted_tables = []
for idx, table in enumerate(tables):
formatted = {
"table_index": idx,
"headers": table.get("headers", []),
"rows": table.get("rows", []),
"caption": table.get("caption"),
"summary": table.get("summary"),
"metadata": table.get("metadata", {}),
"row_count": len(table.get("rows", [])),
"col_count": len(table.get("headers", [])),
}
# Add score if available (from scoring strategies)
if "score" in table:
formatted["score"] = table["score"]
# Add position information if available
if "position" in table:
formatted["position"] = table["position"]
formatted_tables.append(formatted)
return formatted_tables
async def extract_tables_from_html(html: str, config = None):
"""
Extract tables from HTML content (async wrapper for CPU-bound operation).
Args:
html: HTML content as string
config: TableExtractionConfig instance or dict
Returns:
List of formatted table dictionaries
Raises:
ValueError: If HTML parsing fails
"""
import asyncio
from functools import partial
from lxml import html as lxml_html
from schemas import TableExtractionConfig
# Define sync extraction function
def _sync_extract():
try:
# Parse HTML
element = lxml_html.fromstring(html)
except Exception as e:
raise ValueError(f"Failed to parse HTML: {str(e)}")
# Create strategy
cfg = config if config is not None else TableExtractionConfig()
strategy = create_table_extraction_strategy(cfg)
# Extract tables
tables = strategy.extract_tables(element)
# Format response
return format_table_response(tables)
# Run in executor to avoid blocking the event loop
loop = asyncio.get_event_loop()
return await loop.run_in_executor(None, _sync_extract)
# ============================================================================
# End Table Extraction Utilities
# ============================================================================
return (usage / limit) * 100
except:
# Non-container or unsupported: fallback to host
import psutil
return psutil.virtual_memory().percent

159
deploy/docker/webhook.py Normal file
View File

@@ -0,0 +1,159 @@
"""
Webhook delivery service for Crawl4AI.
This module provides webhook notification functionality with exponential backoff retry logic.
"""
import asyncio
import httpx
import logging
from typing import Dict, Optional
from datetime import datetime, timezone
logger = logging.getLogger(__name__)
class WebhookDeliveryService:
"""Handles webhook delivery with exponential backoff retry logic."""
def __init__(self, config: Dict):
"""
Initialize the webhook delivery service.
Args:
config: Application configuration dictionary containing webhook settings
"""
self.config = config.get("webhooks", {})
self.max_attempts = self.config.get("retry", {}).get("max_attempts", 5)
self.initial_delay = self.config.get("retry", {}).get("initial_delay_ms", 1000) / 1000
self.max_delay = self.config.get("retry", {}).get("max_delay_ms", 32000) / 1000
self.timeout = self.config.get("retry", {}).get("timeout_ms", 30000) / 1000
async def send_webhook(
self,
webhook_url: str,
payload: Dict,
headers: Optional[Dict[str, str]] = None
) -> bool:
"""
Send webhook with exponential backoff retry logic.
Args:
webhook_url: The URL to send the webhook to
payload: The JSON payload to send
headers: Optional custom headers
Returns:
bool: True if delivered successfully, False otherwise
"""
default_headers = self.config.get("headers", {})
merged_headers = {**default_headers, **(headers or {})}
merged_headers["Content-Type"] = "application/json"
async with httpx.AsyncClient(timeout=self.timeout) as client:
for attempt in range(self.max_attempts):
try:
logger.info(
f"Sending webhook (attempt {attempt + 1}/{self.max_attempts}) to {webhook_url}"
)
response = await client.post(
webhook_url,
json=payload,
headers=merged_headers
)
# Success or client error (don't retry client errors)
if response.status_code < 500:
if 200 <= response.status_code < 300:
logger.info(f"Webhook delivered successfully to {webhook_url}")
return True
else:
logger.warning(
f"Webhook rejected with status {response.status_code}: {response.text[:200]}"
)
return False # Client error - don't retry
# Server error - retry with backoff
logger.warning(
f"Webhook failed with status {response.status_code}, will retry"
)
except httpx.TimeoutException as exc:
logger.error(f"Webhook timeout (attempt {attempt + 1}): {exc}")
except httpx.RequestError as exc:
logger.error(f"Webhook request error (attempt {attempt + 1}): {exc}")
except Exception as exc:
logger.error(f"Webhook delivery error (attempt {attempt + 1}): {exc}")
# Calculate exponential backoff delay
if attempt < self.max_attempts - 1:
delay = min(self.initial_delay * (2 ** attempt), self.max_delay)
logger.info(f"Retrying in {delay}s...")
await asyncio.sleep(delay)
logger.error(
f"Webhook delivery failed after {self.max_attempts} attempts to {webhook_url}"
)
return False
async def notify_job_completion(
self,
task_id: str,
task_type: str,
status: str,
urls: list,
webhook_config: Optional[Dict],
result: Optional[Dict] = None,
error: Optional[str] = None
):
"""
Notify webhook of job completion.
Args:
task_id: The task identifier
task_type: Type of task (e.g., "crawl", "llm_extraction")
status: Task status ("completed" or "failed")
urls: List of URLs that were crawled
webhook_config: Webhook configuration from the job request
result: Optional crawl result data
error: Optional error message if failed
"""
# Determine webhook URL
webhook_url = None
data_in_payload = self.config.get("data_in_payload", False)
custom_headers = None
if webhook_config:
webhook_url = webhook_config.get("webhook_url")
data_in_payload = webhook_config.get("webhook_data_in_payload", data_in_payload)
custom_headers = webhook_config.get("webhook_headers")
if not webhook_url:
webhook_url = self.config.get("default_url")
if not webhook_url:
logger.debug("No webhook URL configured, skipping notification")
return
# Check if webhooks are enabled
if not self.config.get("enabled", True):
logger.debug("Webhooks are disabled, skipping notification")
return
# Build payload
payload = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls": urls
}
if error:
payload["error"] = error
if data_in_payload and result:
payload["data"] = result
# Send webhook (fire and forget - don't block on completion)
await self.send_webhook(webhook_url, payload, custom_headers)

View File

@@ -6,15 +6,16 @@ x-base-config: &base-config
- "11235:11235" # Gunicorn port
env_file:
- .llm.env # API keys (create from .llm.env.example)
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
- GROQ_API_KEY=${GROQ_API_KEY:-}
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
- LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
# Uncomment to set default environment variables (will overwrite .llm.env)
# environment:
# - OPENAI_API_KEY=${OPENAI_API_KEY:-}
# - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
# - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
# - GROQ_API_KEY=${GROQ_API_KEY:-}
# - TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
# - MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
# - GEMINI_API_KEY=${GEMINI_API_KEY:-}
# - LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
volumes:
- /dev/shm:/dev/shm # Chromium performance
deploy:

View File

@@ -1,431 +0,0 @@
# Proxy Rotation Strategy Documentation
## Overview
The Crawl4AI FastAPI server now includes comprehensive proxy rotation functionality that allows you to distribute requests across multiple proxy servers using different rotation strategies. This feature helps prevent IP blocking, distributes load across proxy infrastructure, and provides redundancy for high-availability crawling operations.
## Available Proxy Rotation Strategies
| Strategy | Description | Use Case | Performance |
|----------|-------------|----------|-------------|
| `round_robin` | Cycles through proxies sequentially | Even distribution, predictable pattern | ⭐⭐⭐⭐⭐ |
| `random` | Randomly selects from available proxies | Unpredictable traffic pattern | ⭐⭐⭐⭐ |
| `least_used` | Uses proxy with lowest usage count | Optimal load balancing | ⭐⭐⭐ |
| `failure_aware` | Avoids failed proxies with auto-recovery | High availability, fault tolerance | ⭐⭐⭐⭐ |
## API Endpoints
### POST /crawl
Standard crawling endpoint with proxy rotation support.
**Request Body:**
```json
{
"urls": ["https://example.com"],
"proxy_rotation_strategy": "round_robin",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
],
"browser_config": {},
"crawler_config": {}
}
```
### POST /crawl/stream
Streaming crawling endpoint with proxy rotation support.
**Request Body:**
```json
{
"urls": ["https://example.com"],
"proxy_rotation_strategy": "failure_aware",
"proxy_failure_threshold": 3,
"proxy_recovery_time": 300,
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
],
"browser_config": {},
"crawler_config": {
"stream": true
}
}
```
## Parameters
### proxy_rotation_strategy (optional)
- **Type:** `string`
- **Default:** `null` (no proxy rotation)
- **Options:** `"round_robin"`, `"random"`, `"least_used"`, `"failure_aware"`
- **Description:** Selects the proxy rotation strategy for distributing requests
### proxies (optional)
- **Type:** `array of objects`
- **Default:** `null`
- **Description:** List of proxy configurations to rotate between
- **Required when:** `proxy_rotation_strategy` is specified
### proxy_failure_threshold (optional)
- **Type:** `integer`
- **Default:** `3`
- **Range:** `1-10`
- **Description:** Number of failures before marking a proxy as unhealthy (failure_aware only)
### proxy_recovery_time (optional)
- **Type:** `integer`
- **Default:** `300` (5 minutes)
- **Range:** `60-3600` seconds
- **Description:** Time to wait before attempting to use a failed proxy again (failure_aware only)
## Proxy Configuration Format
### Full Configuration
```json
{
"server": "http://proxy.example.com:8080",
"username": "proxy_user",
"password": "proxy_pass",
"ip": "192.168.1.100"
}
```
### Minimal Configuration
```json
{
"server": "http://192.168.1.100:8080"
}
```
### SOCKS Proxy Support
```json
{
"server": "socks5://127.0.0.1:1080",
"username": "socks_user",
"password": "socks_pass"
}
```
## Usage Examples
### 1. Round Robin Strategy
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://httpbin.org/ip"],
"proxy_rotation_strategy": "round_robin",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
{"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
]
}'
```
### 2. Random Strategy with Minimal Config
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://httpbin.org/headers"],
"proxy_rotation_strategy": "random",
"proxies": [
{"server": "http://192.168.1.100:8080"},
{"server": "http://192.168.1.101:8080"},
{"server": "http://192.168.1.102:8080"}
]
}'
```
### 3. Least Used Strategy with Load Balancing
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://httpbin.org/html", "https://httpbin.org/json"],
"proxy_rotation_strategy": "least_used",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
],
"crawler_config": {
"cache_mode": "bypass"
}
}'
```
### 4. Failure-Aware Strategy with High Availability
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"proxy_rotation_strategy": "failure_aware",
"proxy_failure_threshold": 2,
"proxy_recovery_time": 180,
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
{"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
],
"headless": true
}'
```
### 5. Streaming with Proxy Rotation
```bash
curl -X POST "http://localhost:11235/crawl/stream" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://httpbin.org/html"],
"proxy_rotation_strategy": "round_robin",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
],
"crawler_config": {
"stream": true,
"cache_mode": "bypass"
}
}'
```
## Combining with Anti-Bot Strategies
You can combine proxy rotation with anti-bot strategies for maximum effectiveness:
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://protected-site.com"],
"anti_bot_strategy": "stealth",
"proxy_rotation_strategy": "failure_aware",
"proxy_failure_threshold": 2,
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
],
"headless": true,
"browser_config": {
"enable_stealth": true
}
}'
```
## Strategy Details
### Round Robin Strategy
- **Algorithm:** Sequential cycling through proxy list
- **Pros:** Predictable, even distribution, simple
- **Cons:** Predictable pattern may be detectable
- **Best for:** General use, development, testing
### Random Strategy
- **Algorithm:** Random selection from available proxies
- **Pros:** Unpredictable pattern, good for evasion
- **Cons:** Uneven distribution possible
- **Best for:** Anti-detection, varying traffic patterns
### Least Used Strategy
- **Algorithm:** Selects proxy with minimum usage count
- **Pros:** Optimal load balancing, prevents overloading
- **Cons:** Slightly more complex, tracking overhead
- **Best for:** High-volume crawling, load balancing
### Failure-Aware Strategy
- **Algorithm:** Tracks proxy health, auto-recovery
- **Pros:** High availability, fault tolerance, automatic recovery
- **Cons:** Most complex, memory overhead for tracking
- **Best for:** Production environments, critical crawling
## Error Handling
### Common Errors
#### Invalid Proxy Configuration
```json
{
"error": "Invalid proxy configuration: Proxy configuration missing 'server' field: {'username': 'user1'}"
}
```
#### Unsupported Strategy
```json
{
"error": "Unsupported proxy rotation strategy: invalid_strategy. Available: round_robin, random, least_used, failure_aware"
}
```
#### Missing Proxies
When `proxy_rotation_strategy` is specified but `proxies` is empty:
```json
{
"error": "proxy_rotation_strategy specified but no proxies provided"
}
```
## Environment Variable Support
You can also configure proxies using environment variables:
```bash
# Set proxy list (comma-separated)
export PROXIES="proxy1.com:8080:user1:pass1,proxy2.com:8080:user2:pass2"
# Set default strategy
export PROXY_ROTATION_STRATEGY="round_robin"
```
## Performance Considerations
1. **Strategy Overhead:**
- Round Robin: Minimal overhead
- Random: Low overhead
- Least Used: Medium overhead (usage tracking)
- Failure Aware: High overhead (health tracking)
2. **Memory Usage:**
- Round Robin: ~O(n) where n = number of proxies
- Random: ~O(n)
- Least Used: ~O(n) + usage counters
- Failure Aware: ~O(n) + health tracking data
3. **Concurrent Safety:**
- All strategies are async-safe with proper locking
- No race conditions in proxy selection
## Best Practices
1. **Production Deployment:**
- Use `failure_aware` strategy for high availability
- Set appropriate failure thresholds (2-3)
- Use recovery times between 3-10 minutes
2. **Development/Testing:**
- Use `round_robin` for predictable behavior
- Start with small proxy pools (2-3 proxies)
3. **Anti-Detection:**
- Combine with `stealth` or `undetected` anti-bot strategies
- Use `random` strategy for unpredictable patterns
- Vary proxy geographic locations
4. **Load Balancing:**
- Use `least_used` for even distribution
- Monitor proxy performance and adjust pools accordingly
5. **Error Monitoring:**
- Monitor failure rates with `failure_aware` strategy
- Set up alerts for proxy pool depletion
- Implement fallback mechanisms
## Integration Examples
### Python Requests
```python
import requests
payload = {
"urls": ["https://example.com"],
"proxy_rotation_strategy": "round_robin",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
]
}
response = requests.post("http://localhost:11235/crawl", json=payload)
print(response.json())
```
### JavaScript/Node.js
```javascript
const axios = require('axios');
const payload = {
urls: ["https://example.com"],
proxy_rotation_strategy: "failure_aware",
proxy_failure_threshold: 2,
proxies: [
{server: "http://proxy1.com:8080", username: "user1", password: "pass1"},
{server: "http://proxy2.com:8080", username: "user2", password: "pass2"}
]
};
axios.post('http://localhost:11235/crawl', payload)
.then(response => console.log(response.data))
.catch(error => console.error(error));
```
### cURL with Multiple URLs
```bash
curl -X POST "http://localhost:11235/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com",
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://httpbin.org/xml"
],
"proxy_rotation_strategy": "least_used",
"proxies": [
{"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
{"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
],
"crawler_config": {
"cache_mode": "bypass",
"wait_for_images": false
}
}'
```
## Troubleshooting
### Common Issues
1. **All proxies failing:**
- Check proxy connectivity
- Verify authentication credentials
- Ensure proxy servers support the target protocols
2. **Uneven distribution:**
- Use `least_used` strategy for better balancing
- Monitor proxy usage patterns
3. **High memory usage:**
- Reduce proxy pool size
- Consider using `round_robin` instead of `failure_aware`
4. **Slow performance:**
- Check proxy response times
- Use geographically closer proxies
- Reduce failure thresholds
### Debug Information
Enable verbose logging to see proxy selection details:
```json
{
"urls": ["https://example.com"],
"proxy_rotation_strategy": "failure_aware",
"proxies": [...],
"crawler_config": {
"verbose": true
}
}
```
This will log which proxy is selected for each request and any failure/recovery events.

View File

@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes
**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate
---
## Highlights
- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below
---
## Breaking Changes
### 1. Docker API: Hooks Disabled by Default
**What changed**: Hooks are now disabled by default on the Docker API.
**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked
**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
**Who is affected**: Users who were reading local files via the Docker API.
**Migration**: Use the Python library directly for local file processing:
```python
# Instead of API call with file:// URL, use library:
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="file:///path/to/file.html")
```
---
## Security Fixes
### Critical: Remote Code Execution via Hooks (CVE Pending)
**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter
**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
### High: Local File Inclusion via file:// URLs (CVE Pending)
**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
### Credits
Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
---
## New Features
### 1. init_scripts Support for BrowserConfig
Pre-page-load JavaScript injection for stealth evasions.
```python
config = BrowserConfig(
init_scripts=[
"Object.defineProperty(navigator, 'webdriver', {get: () => false})"
]
)
```
### 2. CDP Connection Improvements
- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies
All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Resume from checkpoint
on_state_change=save_callback # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs
Generate PDFs and MHTML from cached HTML content.
### 5. Screenshots for raw:/file:// URLs
Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig
Proper URL resolution for raw: HTML processing:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
Fast link extraction without full page processing:
```python
config = CrawlerRunConfig(prefetch=True)
```
### 8. Proxy Rotation and Configuration
Enhanced proxy rotation with sticky sessions support.
### 9. Proxy Support for HTTP Strategy
Non-browser crawler now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs
New `process_in_browser` parameter for browser operations on local content:
```python
config = CrawlerRunConfig(
process_in_browser=True, # Force browser processing
screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder
Intelligent cache invalidation for sitemaps:
```python
config = SeedingConfig(
cache_ttl_hours=24,
validate_sitemap_lastmod=True
)
```
---
## Bug Fixes
### raw: URL Parsing Truncates at # Character
**Problem**: CSS color codes like `#eee` were being truncated.
**Before**: `raw:body{background:#eee}``body{background:`
**After**: `raw:body{background:#eee}``body{background:#eee}`
### Caching System Improvements
Various fixes to cache validation and persistence.
---
## Documentation Updates
- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)
---
## Upgrade Guide
### From v0.7.x to v0.8.0
1. **Update the package**:
```bash
pip install --upgrade crawl4ai
```
2. **Docker API users**:
- Hooks are now disabled by default
- If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
- `file://` URLs no longer work on API (use library directly)
3. **Review security settings**:
```yaml
# config.yml - recommended for production
security:
enabled: true
jwt_enabled: true
```
4. **Test your integration** before deploying to production
### Breaking Change Checklist
- [ ] Check if you use `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration
---
## Full Changelog
See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
---
## Contributors
Thanks to all contributors who made this release possible.
Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
---
*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*

View File

@@ -10,7 +10,6 @@ Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Perform
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
@@ -158,40 +157,6 @@ async with AsyncWebCrawler() as crawler:
- **Monitoring Systems**: Faster health checks and status page monitoring
- **Data Aggregation**: Improved performance for real-time data collection
## 🧹 Memory Management Refactor: Cleaner Architecture
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
### Improved Memory Handling
```python
# All memory utilities now consolidated
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor
# Enhanced memory monitoring
monitor = MemoryMonitor()
monitor.start_monitoring()
async with AsyncWebCrawler() as crawler:
# Memory-efficient batch processing
results = await crawler.arun_many(large_url_list)
# Get accurate memory metrics
memory_usage = get_true_memory_usage_percent()
memory_report = monitor.get_report()
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
```
**Expected Real-World Impact:**
- **Production Stability**: More reliable memory tracking and management
- **Code Maintainability**: Cleaner architecture for easier debugging
- **Import Clarity**: Resolved potential conflicts and import issues
- **Developer Experience**: Simpler API for memory monitoring
## 🔧 Critical Stability Fixes
### Browser Manager Race Condition Resolution

318
docs/blog/release-v0.7.5.md Normal file
View File

@@ -0,0 +1,318 @@
# 🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
*September 29, 2025 • 8 min read*
---
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
## 🎯 What's New at a Glance
- **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API
- **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion
- **Enhanced LLM Integration**: Custom providers with temperature control
- **HTTPS Preservation**: Secure internal link handling
- **Bug Fixes**: Resolved multiple community-reported issues
- **Improved Docker Error Handling**: Better debugging and reliability
## 🔧 Docker Hooks System: Pipeline Customization
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
### Real Example: Authentication & Performance
```python
import requests
# Real working hooks for httpbin.org
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("Hook: Setting up page context")
# Block images to speed up crawling
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
print("Hook: Images blocked")
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print("Hook: Before retrieving HTML")
# Scroll to bottom to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
print("Hook: Scrolled to bottom")
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f"Hook: About to navigate to {url}")
# Add custom headers
await page.set_extra_http_headers({
'X-Test-Header': 'crawl4ai-hooks-test'
})
return page
"""
}
# Test with Docker API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {
"code": hooks_config,
"timeout": 30
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()
if result.get('success'):
print("✅ Hooks executed successfully!")
print(f"Content length: {len(result.get('markdown', ''))} characters")
```
**Available Hook Points:**
- `on_browser_created`: Browser setup
- `on_page_context_created`: Page context configuration
- `before_goto`: Pre-navigation setup
- `after_goto`: Post-navigation processing
- `on_user_agent_updated`: User agent changes
- `on_execution_started`: Crawl initialization
- `before_retrieve_html`: Pre-extraction processing
- `before_return_html`: Final HTML processing
### Function-Based Hooks API
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
**Option 1: Using the `hooks_to_string()` Utility**
```python
from crawl4ai import hooks_to_string
import requests
# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({
'X-Crawl4AI': 'v0.7.5',
'X-Custom-Header': 'my-value'
})
return page
# Convert functions to strings
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Use with REST API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
**Option 2: Docker Client with Automatic Conversion (Recommended!)**
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
async def before_retrieve_html(page, context, **kwargs):
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_retrieve_html": before_retrieve_html
},
hooks_timeout=30
)
if results and results.success:
print(f"✅ Hooks executed! HTML length: {len(results.html)}")
```
**Benefits of Function-Based Hooks:**
- ✅ Full IDE support (autocomplete, syntax highlighting)
- ✅ Type checking and linting
- ✅ Easier to test and debug
- ✅ Reusable across projects
- ✅ Automatic conversion in Docker client
- ✅ No breaking changes - string hooks still work!
## 🤖 Enhanced LLM Integration
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
### Multi-Provider Support
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Test with different providers
async def test_llm_providers():
# OpenAI with custom temperature
openai_strategy = LLMExtractionStrategy(
provider="gemini/gemini-2.5-flash-lite",
api_token="your-api-token",
temperature=0.7, # New in v0.7.5
instruction="Summarize this page in one sentence"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=CrawlerRunConfig(extraction_strategy=openai_strategy)
)
if result.success:
print("✅ LLM extraction completed")
print(result.extracted_content)
# Docker API with enhanced LLM config
llm_payload = {
"url": "https://example.com",
"f": "llm",
"q": "Summarize this page in one sentence.",
"provider": "gemini/gemini-2.5-flash-lite",
"temperature": 0.7
}
response = requests.post("http://localhost:11235/md", json=llm_payload)
```
**New Features:**
- Custom `temperature` parameter for creativity control
- `base_url` for custom API endpoints
- Multi-provider environment variable support
- Docker API integration
## 🔒 HTTPS Preservation
**The Problem:** Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
**Solution:** HTTPS preservation maintains secure protocols throughout crawling.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
async def test_https_preservation():
# Enable HTTPS preservation
url_filter = URLPatternFilter(
patterns=["^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
)
config = CrawlerRunConfig(
exclude_external_links=True,
preserve_https_for_internal_links=True, # New in v0.7.5
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
filter_chain=FilterChain([url_filter])
)
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun(
url="https://quotes.toscrape.com",
config=config
):
# All internal links maintain HTTPS
internal_links = [link['href'] for link in result.links['internal']]
https_links = [link for link in internal_links if link.startswith('https://')]
print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
for link in https_links[:3]:
print(f"{link}")
```
## 🛠️ Bug Fixes and Improvements
### Major Fixes
- **URL Processing**: Fixed '+' sign preservation in query parameters (#1332)
- **Proxy Configuration**: Enhanced proxy string parsing (old `proxy` parameter deprecated)
- **Docker Error Handling**: Comprehensive error messages with status codes
- **Memory Management**: Fixed leaks in long-running sessions
- **JWT Authentication**: Fixed Docker JWT validation issues (#1442)
- **Playwright Stealth**: Fixed stealth features for Playwright integration (#1481)
- **API Configuration**: Fixed config handling to prevent overriding user-provided settings (#1505)
- **Docker Filter Serialization**: Resolved JSON encoding errors in deep crawl strategy (#1419)
- **LLM Provider Support**: Fixed custom LLM provider integration for adaptive crawler (#1291)
- **Performance Issues**: Resolved backoff strategy failures and timeout handling (#989)
### Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization
### Configuration Updates
```python
# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")
# New enhanced proxy config
browser_config = BrowserConfig(
proxy_config={
"server": "http://proxy:8080",
"username": "optional-user",
"password": "optional-pass"
}
)
```
## 🔄 Breaking Changes
1. **Python 3.10+ Required**: Upgrade from Python 3.9
2. **Proxy Parameter Deprecated**: Use new `proxy_config` structure
3. **New Dependency**: Added `cssselect` for better CSS handling
## 🚀 Get Started
```bash
# Install latest version
pip install crawl4ai==0.7.5
# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
```
**Try the Demo:**
```bash
# Run working examples
python docs/releases_review/demo_v0.7.5.py
```
**Resources:**
- 📖 Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- 🐙 GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 💬 Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- 🐦 Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️

314
docs/blog/release-v0.7.6.md Normal file
View File

@@ -0,0 +1,314 @@
# Crawl4AI v0.7.6 Release Notes
*Release Date: October 22, 2025*
I'm excited to announce Crawl4AI v0.7.6, featuring a complete webhook infrastructure for the Docker job queue API! This release eliminates polling and brings real-time notifications to both crawling and LLM extraction workflows.
## 🎯 What's New
### Webhook Support for Docker Job Queue API
The headline feature of v0.7.6 is comprehensive webhook support for asynchronous job processing. No more constant polling to check if your jobs are done - get instant notifications when they complete!
**Key Capabilities:**
-**Universal Webhook Support**: Both `/crawl/job` and `/llm/job` endpoints now support webhooks
-**Flexible Delivery Modes**: Choose notification-only or include full data in the webhook payload
-**Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
-**Custom Authentication**: Add custom headers for webhook authentication
-**Global Configuration**: Set default webhook URL in `config.yml` for all jobs
-**Task Type Identification**: Distinguish between `crawl` and `llm_extraction` tasks
### How It Works
Instead of constantly checking job status:
**OLD WAY (Polling):**
```python
# Submit job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()['task_id']
# Poll until complete
while True:
status = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
if status.json()['status'] == 'completed':
break
time.sleep(5) # Wait and try again
```
**NEW WAY (Webhooks):**
```python
# Submit job with webhook
payload = {
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
response = requests.post("http://localhost:11235/crawl/job", json=payload)
# Done! Webhook will notify you when complete
# Your webhook handler receives the results automatically
```
### Crawl Job Webhooks
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {"headless": true},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}'
```
### LLM Extraction Job Webhooks (NEW!)
```bash
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and publication date",
"schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true
}
}'
```
### Webhook Payload Structure
**Success (with data):**
```json
{
"task_id": "llm_1698765432",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-22"
}
}
}
```
**Failure:**
```json
{
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30s"
}
```
### Simple Webhook Handler Example
```python
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def handle_webhook():
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
if status == 'completed':
if 'data' in payload:
# Process data directly
data = payload['data']
else:
# Fetch from API
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
data = response.json()
# Your business logic here
print(f"Job {task_id} completed!")
elif status == 'failed':
error = payload.get('error', 'Unknown error')
print(f"Job {task_id} failed: {error}")
return jsonify({"status": "received"}), 200
app.run(port=8080)
```
## 📊 Performance Improvements
- **Reduced Server Load**: Eliminates constant polling requests
- **Lower Latency**: Instant notification vs. polling interval delay
- **Better Resource Usage**: Frees up client connections while jobs run in background
- **Scalable Architecture**: Handles high-volume crawling workflows efficiently
## 🐛 Bug Fixes
- Fixed webhook configuration serialization for Pydantic HttpUrl fields
- Improved error handling in webhook delivery service
- Enhanced Redis task storage for webhook config persistence
## 🌍 Expected Real-World Impact
### For Web Scraping Workflows
- **Reduced Costs**: Less API calls = lower bandwidth and server costs
- **Better UX**: Instant notifications improve user experience
- **Scalability**: Handle 100s of concurrent jobs without polling overhead
### For LLM Extraction Pipelines
- **Async Processing**: Submit LLM extraction jobs and move on
- **Batch Processing**: Queue multiple extractions, get notified as they complete
- **Integration**: Easy integration with workflow automation tools (Zapier, n8n, etc.)
### For Microservices
- **Event-Driven**: Perfect for event-driven microservice architectures
- **Decoupling**: Decouple job submission from result processing
- **Reliability**: Automatic retries ensure webhooks are delivered
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- Webhook configuration is optional
- Existing code continues to work without modification
- Polling is still supported for jobs without webhook config
## 📚 Documentation
### New Documentation
- **[WEBHOOK_EXAMPLES.md](../deploy/docker/WEBHOOK_EXAMPLES.md)** - Comprehensive webhook usage guide
- **[docker_webhook_example.py](../docs/examples/docker_webhook_example.py)** - Working code examples
### Updated Documentation
- **[Docker README](../deploy/docker/README.md)** - Added webhook sections
- API documentation with webhook examples
## 🛠️ Migration Guide
No migration needed! Webhooks are opt-in:
1. **To use webhooks**: Add `webhook_config` to your job payload
2. **To keep polling**: Continue using your existing code
### Quick Start
```python
# Just add webhook_config to your existing payload
payload = {
# Your existing configuration
"urls": ["https://example.com"],
"browser_config": {...},
"crawler_config": {...},
# NEW: Add webhook configuration
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
```
## 🔧 Configuration
### Global Webhook Configuration (config.yml)
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default" # Optional
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
headers:
User-Agent: "Crawl4AI-Webhook/1.0"
```
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest image
docker pull unclecode/crawl4ai:0.7.6
# Or use latest tag
docker pull unclecode/crawl4ai:latest
# Run with webhook support
docker run -d \
-p 11235:11235 \
--env-file .llm.env \
--name crawl4ai \
unclecode/crawl4ai:0.7.6
```
### Python Package
```bash
pip install --upgrade crawl4ai
```
## 💡 Pro Tips
1. **Use notification-only mode** for large results - fetch data separately to avoid large webhook payloads
2. **Set custom headers** for webhook authentication and request tracking
3. **Configure global default webhook** for consistent handling across all jobs
4. **Implement idempotent webhook handlers** - same webhook may be delivered multiple times on retry
5. **Use structured schemas** with LLM extraction for predictable webhook data
## 🎬 Demo
Try the release demo:
```bash
python docs/releases_review/demo_v0.7.6.py
```
This comprehensive demo showcases:
- Crawl job webhooks (notification-only and with data)
- LLM extraction webhooks (with JSON schema support)
- Custom headers for authentication
- Webhook retry mechanism
- Real-time webhook receiver
## 🙏 Acknowledgments
Thank you to the community for the feedback that shaped this feature! Special thanks to everyone who requested webhook support for asynchronous job processing.
## 📞 Support
- **Documentation**: https://docs.crawl4ai.com
- **GitHub Issues**: https://github.com/unclecode/crawl4ai/issues
- **Discord**: https://discord.gg/crawl4ai
---
**Happy crawling with webhooks!** 🕷️🪝
*- unclecode*

626
docs/blog/release-v0.7.7.md Normal file
View File

@@ -0,0 +1,626 @@
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
*November 14, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
## 🎯 What's New at a Glance
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
- **📈 Production Metrics**: 6 critical metrics for operational excellence
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
## 📊 Real-time Monitoring Dashboard: Complete Visibility
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
### The Self-Hosting Value Proposition
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
- **🔒 Data Privacy**: Your data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing or rate limits
- **🎯 Full Customization**: Complete control over configurations and strategies
- **📊 Complete Transparency**: Real-time visibility into every aspect
- **⚡ Performance**: Direct access without network overhead
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
### Interactive Monitoring Dashboard
Access the dashboard at `http://localhost:11235/dashboard` to see:
- **System Health Overview**: CPU, memory, network, and uptime in real-time
- **Live Request Tracking**: Active and completed requests with full details
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
- **Janitor Events Log**: Automatic cleanup activities
- **Error Monitoring**: Full context error logs
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
## 🔌 Monitor API: Programmatic Access
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
### System Health Endpoint
```python
import httpx
import asyncio
async def monitor_system_health():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/health")
health = response.json()
print(f"Container Metrics:")
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Uptime: {health['container']['uptime_seconds']}s")
print(f"\nBrowser Pool:")
print(f" Permanent: {health['pool']['permanent']['active']} active")
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
print(f"\nStatistics:")
print(f" Total Requests: {health['stats']['total_requests']}")
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
asyncio.run(monitor_system_health())
```
### Request Tracking
```python
async def track_requests():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/requests")
requests_data = response.json()
print(f"Active Requests: {len(requests_data['active'])}")
print(f"Completed Requests: {len(requests_data['completed'])}")
# See details of recent requests
for req in requests_data['completed'][:5]:
status_icon = "" if req['success'] else ""
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
```
### Browser Pool Management
```python
async def monitor_browser_pool():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"Pool Summary:")
print(f" Total Browsers: {browsers['summary']['total_count']}")
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
# List all browsers
for browser in browsers['permanent']:
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
```
### Endpoint Performance Statistics
```python
async def get_endpoint_stats():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = response.json()
print("Endpoint Analytics:")
for endpoint, data in stats.items():
print(f" {endpoint}:")
print(f" Requests: {data['count']}")
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
```
### Complete API Reference
The Monitor API includes these endpoints:
- `GET /monitor/health` - System health with pool statistics
- `GET /monitor/requests` - Active and completed request tracking
- `GET /monitor/browsers` - Browser pool details and efficiency
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
- `GET /monitor/logs/errors?limit=10` - Error logs with context
- `POST /monitor/actions/cleanup` - Force immediate cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset accumulated statistics
## ⚡ WebSocket Streaming: Real-time Updates
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
### WebSocket Integration Example
```python
import websockets
import json
import asyncio
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to real-time monitoring stream")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Access all monitoring data
print(f"\n--- Update at {update['timestamp']} ---")
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
if update['errors']:
print(f"⚠️ Recent Errors: {len(update['errors'])}")
asyncio.run(monitor_realtime())
```
**Expected Real-World Impact:**
- **Custom Dashboards**: Build tailored monitoring UIs for your team
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
- **Integration**: Feed live data into monitoring tools like Grafana
- **Automation**: React to events in real-time without polling
## 🔥 Smart Browser Pool: 3-Tier Architecture
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
### How It Works
```python
import httpx
async def demonstrate_browser_pool():
async with httpx.AsyncClient() as client:
# Request 1-3: Default config → Uses permanent browser
print("Phase 1: Using permanent browser")
for i in range(3):
await client.post(
"http://localhost:11235/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"]}
)
print(f" Request {i+1}: Reused permanent browser")
# Request 4-6: Custom viewport → Cold pool (first use)
print("\nPhase 2: Custom config creates cold pool browser")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
"http://localhost:11235/crawl",
json={
"urls": [f"https://httpbin.org/json?v={i}"],
"browser_config": viewport_config
}
)
if i < 2:
print(f" Request {i+1}: Cold pool browser")
else:
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
# Check pool status
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"\nPool Status:")
print(f" Permanent: {len(browsers['permanent'])} (always active)")
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
print(f" Cold: {len(browsers['cold'])} (on-demand)")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
asyncio.run(demonstrate_browser_pool())
```
**Pool Tiers:**
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
**Expected Real-World Impact:**
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
- **Performance**: Instant access to frequently-used configurations
- **Automatic Optimization**: Pool adapts to your usage patterns
- **Resource Management**: Janitor automatically cleans up idle browsers
## 🧹 Janitor System: Automatic Cleanup
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
```python
async def monitor_janitor_activity():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
logs = response.json()
print("Recent Cleanup Activities:")
for log in logs:
print(f" {log['timestamp']}: {log['message']}")
# Example output:
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
```
## 🎮 Control Actions: Manual Management
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
**My Solution:** Manual control actions via the API for operational troubleshooting.
### Force Cleanup
```python
async def force_cleanup():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
result = response.json()
print(f"Cleanup completed:")
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
```
### Kill Specific Browser
```python
async def kill_stuck_browser(browser_id: str):
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11235/monitor/actions/kill_browser",
json={"browser_id": browser_id}
)
if response.status_code == 200:
print(f"✅ Browser {browser_id} killed successfully")
```
### Reset Statistics
```python
async def reset_stats():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/stats/reset")
print("📊 Statistics reset for fresh monitoring")
```
## 📈 Production Integration Patterns
### Prometheus Integration
```python
# Export metrics for Prometheus scraping
async def export_prometheus_metrics():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Export in Prometheus format
metrics = f"""
# HELP crawl4ai_memory_usage_percent Memory usage percentage
# TYPE crawl4ai_memory_usage_percent gauge
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
# HELP crawl4ai_request_success_rate Request success rate
# TYPE crawl4ai_request_success_rate gauge
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
# HELP crawl4ai_browser_pool_count Total browsers in pool
# TYPE crawl4ai_browser_pool_count gauge
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
"""
return metrics
```
### Alerting Example
```python
async def check_alerts():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Memory alert
if data['container']['memory_percent'] > 80:
print("🚨 ALERT: Memory usage above 80%")
# Trigger cleanup
await client.post("http://localhost:11235/monitor/actions/cleanup")
# Success rate alert
if data['stats']['success_rate_percent'] < 90:
print("🚨 ALERT: Success rate below 90%")
# Check error logs
errors = await client.get("http://localhost:11235/monitor/logs/errors")
print(f"Recent errors: {len(errors.json())}")
# Latency alert
if data['stats']['avg_latency_ms'] > 5000:
print("🚨 ALERT: Average latency above 5s")
```
### Key Metrics to Track
```python
CRITICAL_METRICS = {
"memory_usage": {
"current": "container.memory_percent",
"target": "<80%",
"alert_threshold": ">80%",
"action": "Force cleanup or scale"
},
"success_rate": {
"current": "stats.success_rate_percent",
"target": ">95%",
"alert_threshold": "<90%",
"action": "Check error logs"
},
"avg_latency": {
"current": "stats.avg_latency_ms",
"target": "<2000ms",
"alert_threshold": ">5000ms",
"action": "Investigate slow requests"
},
"browser_reuse_rate": {
"current": "browsers.summary.reuse_rate_percent",
"target": ">80%",
"alert_threshold": "<60%",
"action": "Check pool configuration"
},
"total_browsers": {
"current": "browsers.summary.total_count",
"target": "<15",
"alert_threshold": ">20",
"action": "Check for browser leaks"
},
"error_frequency": {
"current": "len(errors)",
"target": "<5/hour",
"alert_threshold": ">10/hour",
"action": "Review error patterns"
}
}
```
## 🐛 Critical Bug Fixes
This release includes significant bug fixes that improve stability and performance:
### Async LLM Extraction (#1590)
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
```python
# Before v0.7.7: Sequential processing
# After v0.7.7: True parallel processing
async with AsyncWebCrawler() as crawler:
urls = ["url1", "url2", "url3", "url4"]
# Now processes truly in parallel with LLM extraction
results = await crawler.arun_many(
urls,
config=CrawlerRunConfig(
extraction_strategy=LLMExtractionStrategy(...)
)
)
# 4x faster for parallel LLM extraction!
```
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
### DFS Deep Crawling (#1607)
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
### Browser & Crawler Config Documentation (#1609)
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
### Sitemap Seeder (#1598)
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
### Remove Overlay Elements (#1529)
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
**The Fix:** Fixed by properly calling the injected JavaScript function.
### Viewport Configuration (#1495)
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
**The Fix:** Added proper viewport size configuration support for browser launch.
### Managed Browser CDP Timing (#1528)
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
### Security Updates
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
- Added verification tests for the security update
### Docker Fixes
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
### Other Important Fixes
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
## 📊 Expected Real-World Impact
### For DevOps & Infrastructure Teams
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
- **Proactive Monitoring**: Catch issues before they become problems
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
- **Operational Control**: Manual intervention when automated systems need help
### For Production Deployments
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
- **Debugging**: Real-time logs and error tracking
- **Capacity Planning**: Historical metrics for scaling decisions
- **SLA Monitoring**: Track success rates and latency against targets
### For Development Teams
- **Local Monitoring**: Understand crawler behavior during development
- **Performance Testing**: Measure impact of configuration changes
- **Troubleshooting**: Quickly identify and fix issues
- **Learning**: See exactly how the browser pool works
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- All existing Docker configurations continue to work
- No API changes to existing endpoints
- Monitoring is additive functionality
- No migration required
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.7
# Or use the latest tag
docker pull unclecode/crawl4ai:latest
# Run with monitoring enabled (default)
docker run -d \
-p 11235:11235 \
--shm-size=1g \
--name crawl4ai \
unclecode/crawl4ai:0.7.7
# Access the monitoring dashboard
open http://localhost:11235/dashboard
```
### Python Package
```bash
# Upgrade to latest version
pip install --upgrade crawl4ai
# Or install specific version
pip install crawl4ai==0.7.7
```
## 🎬 Try the Demo
Run the comprehensive demo that showcases all monitoring features:
```bash
python docs/releases_review/demo_v0.7.7.py
```
**The demo includes:**
1. System health overview with live metrics
2. Request tracking with active/completed monitoring
3. Browser pool management (permanent/hot/cold)
4. Complete Monitor API endpoint examples
5. WebSocket streaming demonstration
6. Control actions (cleanup, kill, restart)
7. Production metrics and alerting patterns
8. Self-hosting value proposition
## 📚 Documentation
### New Documentation
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
### Updated Documentation
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
- Added comprehensive monitoring sections
- Production integration patterns
- WebSocket streaming examples
## 💡 Pro Tips
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
7. **Check janitor logs** - Understand automatic cleanup patterns
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
## 🙏 Acknowledgments
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
## 📞 Support & Resources
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
---
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
**Happy crawling with full visibility!** 🕷️📊
*- unclecode*

327
docs/blog/release-v0.7.8.md Normal file
View File

@@ -0,0 +1,327 @@
# Crawl4AI v0.7.8: Stability & Bug Fix Release
*December 2025*
---
I'm releasing Crawl4AI v0.7.8—a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
## What's Fixed at a Glance
- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of hardcoded mock data
## Bug Fixes
### Docker & API Fixes
#### ContentRelevanceFilter Deserialization (#1642)
**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.
**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.
```python
# This now works correctly in Docker API
import httpx
request = {
"urls": ["https://docs.example.com"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 2,
"filter_chain": [
{
"type": "ContentRelevanceFilter",
"query": "API documentation",
"threshold": 0.3
}
]
}
}
}
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/crawl", json=request)
# Previously failed, now works!
```
#### ProxyConfig JSON Serialization (#1629)
**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.
```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="user",
password="pass"
)
config = BrowserConfig(headless=True, proxy_config=proxy)
# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict) # Valid JSON
```
#### Docker Cache Folder Permissions (#1638)
**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.
**The Fix:** Corrected ownership and permissions during image build.
```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
--shm-size=1g \
-v ./my-cache:/app/.cache \
unclecode/crawl4ai:0.7.8
```
---
### LLM & Extraction Fixes
#### Configurable Rate Limiter Backoff (#1269)
**The Problem:** The LLM rate limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.
**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.
```python
from crawl4ai import LLMConfig
# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2
# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5 seconds on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```
#### LLM Strategy HTML Input Support (#1178)
**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with HTML structure preserved.
**The Fix:** Added `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.
```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract product information"
)
# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract the data table preserving structure",
input_format="html"
)
# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Summarize the main content",
input_format="fit_markdown"
)
```
#### Raw HTML URL Variable (#1116)
**The Problem:** When using `url="raw:<html>..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.
**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
html = "<html><body><h1>Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html}",
config=CrawlerRunConfig(extraction_strategy=my_strategy)
)
# extraction_strategy receives url="Raw HTML" instead of the HTML blob
```
---
### URL Handling Fix
#### Relative URLs After Redirects (#1268)
**The Problem:** When JavaScript caused a page redirect, relative links were resolved against the original URL instead of the final URL.
**The Fix:** `redirected_url` now captures the actual page URL after all JavaScript execution completes.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# Page at /old-page redirects via JS to /new-page
result = await crawler.arun(url="https://example.com/old-page")
# BEFORE: redirected_url = "https://example.com/old-page"
# AFTER: redirected_url = "https://example.com/new-page"
# Links are now correctly resolved against the final URL
for link in result.links['internal']:
print(link['href']) # Relative links resolved correctly
```
---
### Dependency & Compatibility Fixes
#### PyPDF2 Replaced with pypdf (#1412)
**The Problem:** PyPDF2 was deprecated in 2022 and is no longer maintained.
**The Fix:** Replaced with the actively maintained `pypdf` library.
```python
# Installation (unchanged)
pip install crawl4ai[pdf]
# The PDF processor now uses pypdf internally
# No code changes required - API remains the same
```
#### Pydantic v2 ConfigDict Compatibility (#678)
**The Problem:** Using the deprecated `class Config` syntax caused deprecation warnings with Pydantic v2.
**The Fix:** Migrated to `model_config = ConfigDict(...)` syntax.
```python
# No more deprecation warnings when importing crawl4ai models
from crawl4ai.models import CrawlResult
from crawl4ai import CrawlerRunConfig, BrowserConfig
# All models are now Pydantic v2 compatible
```
---
### AdaptiveCrawler Fix
#### Query Expansion Using LLM (#1621)
**The Problem:** The `EmbeddingStrategy` in AdaptiveCrawler had commented-out LLM code and was using hardcoded mock query variations instead.
**The Fix:** Uncommented and activated the LLM call for actual query expansion.
```python
# AdaptiveCrawler query expansion now actually uses the LLM
# Instead of hardcoded variations like:
# variations = {'queries': ['what are the best vegetables...']}
# The LLM generates relevant query variations based on your actual query
```
---
### Code Formatting Fix
#### Import Statement Formatting (#1181)
**The Problem:** When extracting code from web pages, import statements were sometimes concatenated without proper line separation.
**The Fix:** Import statements now maintain proper newline separation.
```python
# BEFORE: "import osimport sysfrom pathlib import Path"
# AFTER:
# import os
# import sys
# from pathlib import Path
```
---
## Breaking Changes
**None!** This release is fully backward compatible.
- All existing code continues to work without modification
- New parameters have sensible defaults matching previous behavior
- No API changes to existing functionality
---
## Upgrade Instructions
### Python Package
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.7.8
```
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.8
# Run
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
```
---
## Verification
Run the verification tests to confirm all fixes are working:
```bash
python docs/releases_review/demo_v0.7.8.py
```
This runs actual tests that verify each bug fix is properly implemented.
---
## Acknowledgments
Thank you to everyone who reported these issues and provided detailed reproduction steps. Your bug reports make Crawl4AI better for everyone.
Issues fixed: #1642, #1638, #1629, #1621, #1412, #1269, #1268, #1181, #1178, #1116, #678
---
## Support & Resources
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **Twitter**: [@unclecode](https://x.com/unclecode)
---
**This stability release ensures Crawl4AI works reliably across Docker deployments, LLM extraction workflows, and various edge cases. Thank you for your continued support and feedback!**
**Happy crawling!**
*- unclecode*

243
docs/blog/release-v0.8.0.md Normal file
View File

@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes
**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate
---
## Highlights
- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below
---
## Breaking Changes
### 1. Docker API: Hooks Disabled by Default
**What changed**: Hooks are now disabled by default on the Docker API.
**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked
**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
**Who is affected**: Users who were reading local files via the Docker API.
**Migration**: Use the Python library directly for local file processing:
```python
# Instead of API call with file:// URL, use library:
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="file:///path/to/file.html")
```
---
## Security Fixes
### Critical: Remote Code Execution via Hooks (CVE Pending)
**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter
**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
### High: Local File Inclusion via file:// URLs (CVE Pending)
**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
### Credits
Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
---
## New Features
### 1. init_scripts Support for BrowserConfig
Pre-page-load JavaScript injection for stealth evasions.
```python
config = BrowserConfig(
init_scripts=[
"Object.defineProperty(navigator, 'webdriver', {get: () => false})"
]
)
```
### 2. CDP Connection Improvements
- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies
All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Resume from checkpoint
on_state_change=save_callback # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs
Generate PDFs and MHTML from cached HTML content.
### 5. Screenshots for raw:/file:// URLs
Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig
Proper URL resolution for raw: HTML processing:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
Fast link extraction without full page processing:
```python
config = CrawlerRunConfig(prefetch=True)
```
### 8. Proxy Rotation and Configuration
Enhanced proxy rotation with sticky sessions support.
### 9. Proxy Support for HTTP Strategy
Non-browser crawler now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs
New `process_in_browser` parameter for browser operations on local content:
```python
config = CrawlerRunConfig(
process_in_browser=True, # Force browser processing
screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder
Intelligent cache invalidation for sitemaps:
```python
config = SeedingConfig(
cache_ttl_hours=24,
validate_sitemap_lastmod=True
)
```
---
## Bug Fixes
### raw: URL Parsing Truncates at # Character
**Problem**: CSS color codes like `#eee` were being truncated.
**Before**: `raw:body{background:#eee}``body{background:`
**After**: `raw:body{background:#eee}``body{background:#eee}`
### Caching System Improvements
Various fixes to cache validation and persistence.
---
## Documentation Updates
- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)
---
## Upgrade Guide
### From v0.7.x to v0.8.0
1. **Update the package**:
```bash
pip install --upgrade crawl4ai
```
2. **Docker API users**:
- Hooks are now disabled by default
- If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
- `file://` URLs no longer work on API (use library directly)
3. **Review security settings**:
```yaml
# config.yml - recommended for production
security:
enabled: true
jwt_enabled: true
```
4. **Test your integration** before deploying to production
### Breaking Change Checklist
- [ ] Check if you use `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration
---
## Full Changelog
See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
---
## Contributors
Thanks to all contributors who made this release possible.
Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
---
*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*

View File

@@ -18,7 +18,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
2. **Install Dependencies**
```bash
pip install flask
pip install -r requirements.txt
```
3. **Launch the Server**
@@ -28,7 +28,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
4. **Open in Browser**
```
http://localhost:8080
http://localhost:8000
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
@@ -325,7 +325,7 @@ Powers the recording functionality:
### Configuration
```python
# server.py configuration
PORT = 8080
PORT = 8000
DEBUG = True
THREADED = True
```
@@ -343,9 +343,9 @@ THREADED = True
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
lsof -ti:8000 | xargs kill -9
# Or use different port
python server.py --port 8081
python server.py --port 8001
```
**Blockly Not Loading**

View File

@@ -216,7 +216,7 @@ def get_examples():
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
GO http://127.0.0.1:8000/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''

View File

@@ -0,0 +1,62 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/awsWaf/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://nft.porsche.com/onboarding@6" # page url of your target site
cookie_domain = ".nft.porsche.com" # the domain name to which you want to apply the cookie
captcha_type = "AntiAwsWafTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get aws waf cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
})
cookie = solution["cookie"]
print("aws waf cookie:", cookie)
js_code = """
document.cookie = \'aws-waf-token=""" + cookie + """;domain=""" + cookie_domain + """;path=/\';
location.reload();
"""
wait_condition = """() => {
return document.title === \'Join Porsches journey into Web3\';
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,60 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_challenge/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://gitlab.com/users/sign_in" # page url of your target site
captcha_type = "AntiCloudflareTask" # type of your target captcha
# your http proxy to solve cloudflare challenge
proxy_server = "proxy.example.com:8080"
proxy_username = "myuser"
proxy_password = "mypass"
capsolver.api_key = api_key
async def main():
# get challenge cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"proxy": f"{proxy_server}:{proxy_username}:{proxy_password}",
})
cookies = solution["cookies"]
user_agent = solution["userAgent"]
print("challenge cookies:", cookies)
cookies_list = []
for name, value in cookies.items():
cookies_list.append({
"name": name,
"value": value,
"url": site_url,
})
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
user_agent=user_agent,
cookies=cookies_list,
proxy_config={
"server": f"http://{proxy_server}",
"username": proxy_username,
"password": proxy_password,
},
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,64 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_turnstile/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "0x4AAAAAAAGlwMzq_9z6S9Mh" # site key of your target site
site_url = "https://clifford.io/demo/cloudflare-turnstile" # page url of your target site
captcha_type = "AntiTurnstileTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get turnstile token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["token"]
print("turnstile token:", token)
js_code = """
document.querySelector(\'input[name="cf-turnstile-response"]\').value = \'"""+token+"""\';
document.querySelector(\'button[type="submit"]\').click();
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h1\');
return items.length === 0;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,67 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV2/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LfW6wATAAAAAHLqO2pb8bDBahxlMxNdo9g947u9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php" # page url of your target site
captcha_type = "ReCaptchaV2TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
js_code = """
const textarea = document.getElementById(\'g-recaptcha-response\');
if (textarea) {
textarea.value = \"""" + token + """\";
document.querySelector(\'button.form-field[type="submit"]\').click();
}
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h2\');
return items.length > 1;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,75 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV3/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LdKlZEpAAAAAAOQjzC2v_d36tWxCl6dWsozdSy9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php" # page url of your target site
page_action = "examples/v3scores" # page action of your target site
captcha_type = "ReCaptchaV3TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
"pageAction": page_action,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
js_code = """
const originalFetch = window.fetch;
window.fetch = function(...args) {
if (typeof args[0] === 'string' && args[0].includes('/recaptcha-v3-verify.php')) {
const url = new URL(args[0], window.location.origin);
url.searchParams.set('action', '""" + token + """');
args[0] = url.toString();
document.querySelector('.token').innerHTML = "fetch('/recaptcha-v3-verify.php?action=examples/v3scores&token=""" + token + """')";
console.log('Fetch URL hooked:', args[0]);
}
return originalFetch.apply(this, args);
};
"""
wait_condition = """() => {
return document.querySelector('.step3:not(.hidden)');
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://nft.porsche.com/onboarding@6",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://gitlab.com/users/sign_in",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://clifford.io/demo/cloudflare-turnstile",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,61 @@
import json
import asyncio
from urllib.parse import quote, urlencode
from crawl4ai import CrawlerRunConfig, BrowserConfig, AsyncWebCrawler
# Scrapeless provides a free anti-detection fingerprint browser client and cloud browsers:
# https://www.scrapeless.com/en/blog/scrapeless-nstbrowser-strategic-integration
async def main():
# customize browser fingerprint
fingerprint = {
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36",
"platform": "Windows",
"screen": {
"width": 1280, "height": 1024
},
"localization": {
"languages": ["zh-HK", "en-US", "en"], "timezone": "Asia/Hong_Kong",
}
}
fingerprint_json = json.dumps(fingerprint)
encoded_fingerprint = quote(fingerprint_json)
scrapeless_params = {
"token": "your token",
"sessionTTL": 1000,
"sessionName": "Demo",
"fingerprint": encoded_fingerprint,
# Sets the target country/region for the proxy, sending requests via an IP address from that region. You can specify a country code (e.g., US for the United States, GB for the United Kingdom, ANY for any country). See country codes for all supported options.
# "proxyCountry": "ANY",
# create profile on scrapeless
# "profileId": "your profileId",
# For more usage details, please refer to https://docs.scrapeless.com/en/scraping-browser/quickstart/getting-started
}
query_string = urlencode(scrapeless_params)
scrapeless_connection_url = f"wss://browser.scrapeless.com/api/v2/browser?{query_string}"
async with AsyncWebCrawler(
config=BrowserConfig(
headless=False,
browser_mode="cdp",
cdp_url=scrapeless_connection_url,
)
) as crawler:
result = await crawler.arun(
url="https://www.scrapeless.com/en",
config=CrawlerRunConfig(
wait_for="css:.content",
scan_full_page=True,
),
)
print("-" * 20)
print(f'Status Code: {result.status_code}')
print("-" * 20)
print(f'Title: {result.metadata["title"]}')
print(f'Description: {result.metadata["description"]}')
print("-" * 20)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,297 @@
#!/usr/bin/env python3
"""
Deep Crawl Crash Recovery Example
This example demonstrates how to implement crash recovery for long-running
deep crawls. The feature is useful for:
- Cloud deployments with spot/preemptible instances
- Long-running crawls that may be interrupted
- Distributed crawling with state coordination
Key concepts:
- `on_state_change`: Callback fired after each URL is processed
- `resume_state`: Pass saved state to continue from a checkpoint
- `export_state()`: Get the last captured state manually
Works with all strategies: BFSDeepCrawlStrategy, DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy
"""
import asyncio
import json
import os
from pathlib import Path
from typing import Dict, Any, List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
# File to store crawl state (in production, use Redis/database)
STATE_FILE = Path("crawl_state.json")
async def save_state_to_file(state: Dict[str, Any]) -> None:
"""
Callback to save state after each URL is processed.
In production, you might save to:
- Redis: await redis.set("crawl_state", json.dumps(state))
- Database: await db.execute("UPDATE crawls SET state = ?", json.dumps(state))
- S3: await s3.put_object(Bucket="crawls", Key="state.json", Body=json.dumps(state))
"""
with open(STATE_FILE, "w") as f:
json.dump(state, f, indent=2)
print(f" [State saved] Pages: {state['pages_crawled']}, Pending: {len(state['pending'])}")
def load_state_from_file() -> Dict[str, Any] | None:
"""Load previously saved state, if it exists."""
if STATE_FILE.exists():
with open(STATE_FILE, "r") as f:
return json.load(f)
return None
async def example_basic_state_persistence():
"""
Example 1: Basic state persistence with file storage.
The on_state_change callback is called after each URL is processed,
allowing you to save progress in real-time.
"""
print("\n" + "=" * 60)
print("Example 1: Basic State Persistence")
print("=" * 60)
# Clean up any previous state
if STATE_FILE.exists():
STATE_FILE.unlink()
strategy = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
on_state_change=save_state_to_file, # Save after each URL
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
print("\nStarting crawl with state persistence...")
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config)
# Show final state
if STATE_FILE.exists():
with open(STATE_FILE, "r") as f:
final_state = json.load(f)
print(f"\nFinal state saved to {STATE_FILE}:")
print(f" - Strategy: {final_state['strategy_type']}")
print(f" - Pages crawled: {final_state['pages_crawled']}")
print(f" - URLs visited: {len(final_state['visited'])}")
print(f" - URLs pending: {len(final_state['pending'])}")
print(f"\nCrawled {len(results)} pages total")
async def example_crash_and_resume():
"""
Example 2: Simulate a crash and resume from checkpoint.
This demonstrates the full crash recovery workflow:
1. Start crawling with state persistence
2. "Crash" after N pages
3. Resume from saved state
4. Verify no duplicate work
"""
print("\n" + "=" * 60)
print("Example 2: Crash and Resume")
print("=" * 60)
# Clean up any previous state
if STATE_FILE.exists():
STATE_FILE.unlink()
crash_after = 3
crawled_urls_phase1: List[str] = []
async def save_and_maybe_crash(state: Dict[str, Any]) -> None:
"""Save state, then simulate crash after N pages."""
# Always save state first
await save_state_to_file(state)
crawled_urls_phase1.clear()
crawled_urls_phase1.extend(state["visited"])
# Simulate crash after reaching threshold
if state["pages_crawled"] >= crash_after:
raise Exception("Simulated crash! (This is intentional)")
# Phase 1: Start crawl that will "crash"
print(f"\n--- Phase 1: Crawl until 'crash' after {crash_after} pages ---")
strategy1 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
on_state_change=save_and_maybe_crash,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy1,
verbose=False,
)
try:
async with AsyncWebCrawler(verbose=False) as crawler:
await crawler.arun("https://books.toscrape.com", config=config)
except Exception as e:
print(f"\n Crash occurred: {e}")
print(f" URLs crawled before crash: {len(crawled_urls_phase1)}")
# Phase 2: Resume from checkpoint
print("\n--- Phase 2: Resume from checkpoint ---")
saved_state = load_state_from_file()
if not saved_state:
print(" ERROR: No saved state found!")
return
print(f" Loaded state: {saved_state['pages_crawled']} pages, {len(saved_state['pending'])} pending")
crawled_urls_phase2: List[str] = []
async def track_resumed_crawl(state: Dict[str, Any]) -> None:
"""Track new URLs crawled in phase 2."""
await save_state_to_file(state)
new_urls = set(state["visited"]) - set(saved_state["visited"])
for url in new_urls:
if url not in crawled_urls_phase2:
crawled_urls_phase2.append(url)
strategy2 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
resume_state=saved_state, # Resume from checkpoint!
on_state_change=track_resumed_crawl,
)
config2 = CrawlerRunConfig(
deep_crawl_strategy=strategy2,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config2)
# Verify no duplicates
already_crawled = set(saved_state["visited"])
duplicates = set(crawled_urls_phase2) & already_crawled
print(f"\n--- Results ---")
print(f" Phase 1 URLs: {len(crawled_urls_phase1)}")
print(f" Phase 2 new URLs: {len(crawled_urls_phase2)}")
print(f" Duplicate crawls: {len(duplicates)} (should be 0)")
print(f" Total results: {len(results)}")
if len(duplicates) == 0:
print("\n SUCCESS: No duplicate work after resume!")
else:
print(f"\n WARNING: Found duplicates: {duplicates}")
async def example_export_state():
"""
Example 3: Manual state export using export_state().
If you don't need real-time persistence, you can export
the state manually after the crawl completes.
"""
print("\n" + "=" * 60)
print("Example 3: Manual State Export")
print("=" * 60)
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=3,
# No callback - state is still tracked internally
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
print("\nCrawling without callback...")
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config)
# Export state after crawl completes
# Note: This only works if on_state_change was set during crawl
# For this example, we'd need to set on_state_change to get state
print(f"\nCrawled {len(results)} pages")
print("(For manual export, set on_state_change to capture state)")
async def example_state_structure():
"""
Example 4: Understanding the state structure.
Shows the complete state dictionary that gets saved.
"""
print("\n" + "=" * 60)
print("Example 4: State Structure")
print("=" * 60)
captured_state = None
async def capture_state(state: Dict[str, Any]) -> None:
nonlocal captured_state
captured_state = state
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=2,
on_state_change=capture_state,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
await crawler.arun("https://books.toscrape.com", config=config)
if captured_state:
print("\nState structure:")
print(json.dumps(captured_state, indent=2, default=str)[:1000] + "...")
print("\n\nKey fields:")
print(f" strategy_type: '{captured_state['strategy_type']}'")
print(f" visited: List of {len(captured_state['visited'])} URLs")
print(f" pending: List of {len(captured_state['pending'])} queued items")
print(f" depths: Dict mapping URL -> depth level")
print(f" pages_crawled: {captured_state['pages_crawled']}")
async def main():
"""Run all examples."""
print("=" * 60)
print("Deep Crawl Crash Recovery Examples")
print("=" * 60)
await example_basic_state_persistence()
await example_crash_and_resume()
await example_state_structure()
# # Cleanup
# if STATE_FILE.exists():
# STATE_FILE.unlink()
# print(f"\n[Cleaned up {STATE_FILE}]")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,39 @@
"""
Simple demonstration of the DFS deep crawler visiting multiple pages.
Run with: python docs/examples/dfs_crawl_demo.py
"""
import asyncio
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode
from crawl4ai.deep_crawling.dfs_strategy import DFSDeepCrawlStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main() -> None:
dfs_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=50,
include_external=False,
)
config = CrawlerRunConfig(
deep_crawl_strategy=dfs_strategy,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(),
stream=True,
)
seed_url = "https://docs.python.org/3/" # Plenty of internal links
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
async for result in await crawler.arun(url=seed_url, config=config):
depth = result.metadata.get("depth")
status = "SUCCESS" if result.success else "FAILED"
print(f"[{status}] depth={depth} url={result.url}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,522 @@
#!/usr/bin/env python3
"""
Comprehensive hooks examples using Docker Client with function objects.
This approach is recommended because:
- Write hooks as regular Python functions
- Full IDE support (autocomplete, type checking)
- Automatic conversion to API format
- Reusable and testable code
- Clean, readable syntax
"""
import asyncio
from crawl4ai import Crawl4aiDockerClient
# API_BASE_URL = "http://localhost:11235"
API_BASE_URL = "http://localhost:11234"
# ============================================================================
# Hook Function Definitions
# ============================================================================
# --- All Hooks Demo ---
async def browser_created_hook(browser, **kwargs):
"""Called after browser is created"""
print("[HOOK] Browser created and ready")
return browser
async def page_context_hook(page, context, **kwargs):
"""Setup page environment"""
print("[HOOK] Setting up page environment")
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Add cookies
await context.add_cookies([{
"name": "test_session",
"value": "abc123xyz",
"domain": ".httpbin.org",
"path": "/"
}])
# Block resources
await context.route("**/*.{png,jpg,jpeg,gif}", lambda route: route.abort())
await context.route("**/analytics/*", lambda route: route.abort())
print("[HOOK] Environment configured")
return page
async def user_agent_hook(page, context, user_agent, **kwargs):
"""Called when user agent is updated"""
print(f"[HOOK] User agent: {user_agent[:50]}...")
return page
async def before_goto_hook(page, context, url, **kwargs):
"""Called before navigating to URL"""
print(f"[HOOK] Navigating to: {url}")
await page.set_extra_http_headers({
"X-Custom-Header": "crawl4ai-test",
"Accept-Language": "en-US"
})
return page
async def after_goto_hook(page, context, url, response, **kwargs):
"""Called after page loads"""
print(f"[HOOK] Page loaded: {url}")
await page.wait_for_timeout(1000)
try:
await page.wait_for_selector("body", timeout=2000)
print("[HOOK] Body element ready")
except:
print("[HOOK] Timeout, continuing")
return page
async def execution_started_hook(page, context, **kwargs):
"""Called when custom JS execution starts"""
print("[HOOK] JS execution started")
await page.evaluate("console.log('[HOOK] Custom JS');")
return page
async def before_retrieve_hook(page, context, **kwargs):
"""Called before retrieving HTML"""
print("[HOOK] Preparing HTML retrieval")
# Scroll for lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(500)
await page.evaluate("window.scrollTo(0, 0);")
print("[HOOK] Scrolling complete")
return page
async def before_return_hook(page, context, html, **kwargs):
"""Called before returning HTML"""
print(f"[HOOK] HTML ready: {len(html)} chars")
metrics = await page.evaluate('''() => ({
images: document.images.length,
links: document.links.length,
scripts: document.scripts.length
})''')
print(f"[HOOK] Metrics - Images: {metrics['images']}, Links: {metrics['links']}")
return page
# --- Authentication Hooks ---
async def auth_context_hook(page, context, **kwargs):
"""Setup authentication context"""
print("[HOOK] Setting up authentication")
# Add auth cookies
await context.add_cookies([{
"name": "auth_token",
"value": "fake_jwt_token",
"domain": ".httpbin.org",
"path": "/",
"httpOnly": True
}])
# Set localStorage
await page.evaluate('''
localStorage.setItem('user_id', '12345');
localStorage.setItem('auth_time', new Date().toISOString());
''')
print("[HOOK] Auth context ready")
return page
async def auth_headers_hook(page, context, url, **kwargs):
"""Add authentication headers"""
print(f"[HOOK] Adding auth headers for {url}")
import base64
credentials = base64.b64encode(b"user:passwd").decode('ascii')
await page.set_extra_http_headers({
'Authorization': f'Basic {credentials}',
'X-API-Key': 'test-key-123'
})
return page
# --- Performance Optimization Hooks ---
async def performance_hook(page, context, **kwargs):
"""Optimize page for performance"""
print("[HOOK] Optimizing for performance")
# Block resource-heavy content
await context.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda r: r.abort())
await context.route("**/*.{woff,woff2,ttf}", lambda r: r.abort())
await context.route("**/*.{mp4,webm,ogg}", lambda r: r.abort())
await context.route("**/googletagmanager.com/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
await context.route("**/facebook.com/*", lambda r: r.abort())
# Disable animations
await page.add_style_tag(content='''
*, *::before, *::after {
animation-duration: 0s !important;
transition-duration: 0s !important;
}
''')
print("[HOOK] Optimizations applied")
return page
async def cleanup_hook(page, context, **kwargs):
"""Clean page before extraction"""
print("[HOOK] Cleaning page")
await page.evaluate('''() => {
const selectors = [
'.ad', '.ads', '.advertisement',
'.popup', '.modal', '.overlay',
'.cookie-banner', '.newsletter'
];
selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(el => el.remove());
});
document.querySelectorAll('script, style').forEach(el => el.remove());
}''')
print("[HOOK] Page cleaned")
return page
# --- Content Extraction Hooks ---
async def wait_dynamic_content_hook(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
print(f"[HOOK] Waiting for dynamic content on {url}")
await page.wait_for_timeout(2000)
# Click "Load More" if exists
try:
load_more = await page.query_selector('[class*="load-more"], button:has-text("Load More")')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
print("[HOOK] Clicked 'Load More'")
except:
pass
return page
async def extract_metadata_hook(page, context, **kwargs):
"""Extract page metadata"""
print("[HOOK] Extracting metadata")
metadata = await page.evaluate('''() => {
const getMeta = (name) => {
const el = document.querySelector(`meta[name="${name}"], meta[property="${name}"]`);
return el ? el.getAttribute('content') : null;
};
return {
title: document.title,
description: getMeta('description'),
author: getMeta('author'),
keywords: getMeta('keywords'),
};
}''')
print(f"[HOOK] Metadata: {metadata}")
# Infinite scroll
for i in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
print(f"[HOOK] Scroll {i+1}/3")
return page
# --- Multi-URL Hooks ---
async def url_specific_hook(page, context, url, **kwargs):
"""Apply URL-specific logic"""
print(f"[HOOK] Processing URL: {url}")
# URL-specific headers
if 'html' in url:
await page.set_extra_http_headers({"X-Type": "HTML"})
elif 'json' in url:
await page.set_extra_http_headers({"X-Type": "JSON"})
return page
async def track_progress_hook(page, context, url, response, **kwargs):
"""Track crawl progress"""
status = response.status if response else 'unknown'
print(f"[HOOK] Loaded {url} - Status: {status}")
return page
# ============================================================================
# Test Functions
# ============================================================================
async def test_all_hooks_comprehensive():
"""Test all 8 hook types"""
print("=" * 70)
print("Test 1: All Hooks Comprehensive Demo (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling with all 8 hooks...")
# Define hooks with function objects
hooks = {
"on_browser_created": browser_created_hook,
"on_page_context_created": page_context_hook,
"on_user_agent_updated": user_agent_hook,
"before_goto": before_goto_hook,
"after_goto": after_goto_hook,
"on_execution_started": execution_started_hook,
"before_retrieve_html": before_retrieve_hook,
"before_return_html": before_return_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=30
)
print("\n✅ Success!")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" HTML: {len(result.html)} chars")
async def test_authentication_workflow():
"""Test authentication with hooks"""
print("\n" + "=" * 70)
print("Test 2: Authentication Workflow (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting authentication...")
hooks = {
"on_page_context_created": auth_context_hook,
"before_goto": auth_headers_hook
}
result = await client.crawl(
["https://httpbin.org/basic-auth/user/passwd"],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Authentication completed")
if result.success:
if '"authenticated"' in result.html and 'true' in result.html:
print(" ✅ Basic auth successful!")
else:
print(" ⚠️ Auth status unclear")
else:
print(f" ❌ Failed: {result.error_message}")
async def test_performance_optimization():
"""Test performance optimization"""
print("\n" + "=" * 70)
print("Test 3: Performance Optimization (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting performance hooks...")
hooks = {
"on_page_context_created": performance_hook,
"before_retrieve_html": cleanup_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=10
)
print("\n✅ Optimization completed")
print(f" HTML size: {len(result.html):,} chars")
print(" Resources blocked, ads removed")
async def test_content_extraction():
"""Test content extraction"""
print("\n" + "=" * 70)
print("Test 4: Content Extraction (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting extraction hooks...")
hooks = {
"after_goto": wait_dynamic_content_hook,
"before_retrieve_html": extract_metadata_hook
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Extraction completed")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" Metadata: {result.metadata}")
async def test_multi_url_crawl():
"""Test hooks with multiple URLs"""
print("\n" + "=" * 70)
print("Test 5: Multi-URL Crawl (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling multiple URLs...")
hooks = {
"before_goto": url_specific_hook,
"after_goto": track_progress_hook
}
results = await client.crawl(
[
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://httpbin.org/xml"
],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Multi-URL crawl completed")
print(f"\n Crawled {len(results)} URLs:")
for i, result in enumerate(results, 1):
status = "" if result.success else ""
print(f" {status} {i}. {result.url}")
async def test_reusable_hook_library():
"""Test using reusable hook library"""
print("\n" + "=" * 70)
print("Test 6: Reusable Hook Library (Docker Client)")
print("=" * 70)
# Create a library of reusable hooks
class HookLibrary:
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images"""
await context.route("**/*.{png,jpg,jpeg,gif}", lambda r: r.abort())
print("[LIBRARY] Images blocked")
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics"""
await context.route("**/analytics/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
print("[LIBRARY] Analytics blocked")
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll"""
for i in range(5):
prev = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
curr = await page.evaluate("document.body.scrollHeight")
if curr == prev:
break
print("[LIBRARY] Infinite scroll complete")
return page
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nUsing hook library...")
hooks = {
"on_page_context_created": HookLibrary.block_images,
"before_retrieve_html": HookLibrary.scroll_infinite
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Library hooks completed")
print(f" Success: {result.success}")
# ============================================================================
# Main
# ============================================================================
async def main():
"""Run all Docker client hook examples"""
print("🔧 Crawl4AI Docker Client - Hooks Examples (Function-Based)")
print("Using Python function objects with automatic conversion")
print("=" * 70)
tests = [
("All Hooks Demo", test_all_hooks_comprehensive),
("Authentication", test_authentication_workflow),
("Performance", test_performance_optimization),
("Extraction", test_content_extraction),
("Multi-URL", test_multi_url_crawl),
("Hook Library", test_reusable_hook_library)
]
for i, (name, test_func) in enumerate(tests, 1):
try:
await test_func()
print(f"\n✅ Test {i}/{len(tests)}: {name} completed\n")
except Exception as e:
print(f"\n❌ Test {i}/{len(tests)}: {name} failed: {e}\n")
import traceback
traceback.print_exc()
print("=" * 70)
print("🎉 All Docker client hook examples completed!")
print("\n💡 Key Benefits of Function-Based Hooks:")
print(" • Write as regular Python functions")
print(" • Full IDE support (autocomplete, types)")
print(" • Automatic conversion to API format")
print(" • Reusable across projects")
print(" • Clean, readable code")
print(" • Easy to test and debug")
print("=" * 70)
if __name__ == "__main__":
asyncio.run(main())

Some files were not shown because too many files have changed in this diff Show More