Compare commits

...

63 Commits

Author SHA1 Message Date
unclecode
55de32d925 Add CycloneDX SBOM and generation script
- Add sbom/sbom.cdx.json generated via Syft
- Add scripts/gen-sbom.sh for regenerating SBOM
- Add sbom/README.md with disclaimer
- Update .gitignore to track gen-sbom.sh
2026-01-27 01:45:42 +00:00
Nasrin
f6f7f1b551 Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* Add browser_context_id and target_id parameters to BrowserConfig

Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones

* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios

* Fix: add cdp_cleanup_on_close to from_kwargs

* Fix: find context by target_id for concurrent CDP connections

* Fix: use target_id to find correct page in get_page

* Fix: use CDP to find context by browserContextId for concurrent sessions

* Revert context matching attempts - Playwright cannot see CDP-created contexts

* Add create_isolated_context flag for concurrent CDP crawls

When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.

* Add context caching to create_isolated_context branch

Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.

* Add init_scripts support to BrowserConfig for pre-page-load JS injection

This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization

* Fix CDP connection handling: support WS URLs and proper cleanup

Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.

* Update gitignore

* Some debugging for caching

* Add _generate_screenshot_from_html for raw: and file:// URLs

Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.

* Add PDF and MHTML support for raw: and file:// URLs

- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types

* Add crash recovery for deep crawl strategies

Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.

* Fix: HTTP strategy raw: URL parsing truncates at # character

The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.

* Add base_url parameter to CrawlerRunConfig for raw HTML processing

When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url='raw:{html}', config=config)

* Add prefetch mode for two-phase deep crawling

- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Updates on proxy rotation and proxy configuration

* Add proxy support to HTTP crawler strategy

* Add browser pipeline support for raw:/file:// URLs

- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310

* Add smart TTL cache for sitemap URL seeder

- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely

* Update URL seeder docs with smart TTL cache parameters

- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary

* Add MEMORY.md to gitignore

* Docs: Add multi-sample schema generation section

Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors

* Fix critical RCE and LFI vulnerabilities in Docker API deployment

Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)

* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests

* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery

* Add examples for deep crawl crash recovery and prefetch mode in documentation

* Release v0.8.0: The v0.8.0 Update

- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation

* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery

* Add async agenerate_schema method for schema generation

- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)

* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility

O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.

Fixes temperature=0.01 error for these models in LLM extraction.

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 14:19:15 +01:00
UncleCode
c85f56b085 Merge pull request #1677 from unclecode/sponsors/thor_data 2025-12-25 12:08:21 +08:00
Aravind Karnam
a234959b12 sponsors: Add thor data as sponsor 2025-12-23 20:45:00 +05:30
Aravind Karnam
da82f0ada5 sponsors: Add thor data as sponsor 2025-12-23 16:28:26 +05:30
Nasrin
a87e8c1c9e Release/v0.7.8 (#1662)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
2025-12-11 11:04:52 +01:00
UncleCode
835e3c56fe Add disk cleanup step in Docker release workflow
Added a step to free up disk space before the build process.
2025-12-11 09:49:27 +01:00
Aravind
3a07c5962c Sponsors/new (#1643) 2025-12-02 00:49:39 +01:00
Aravind
0024c82cdc Sponsors/new (#1637) 2025-11-24 13:29:33 +01:00
Aravind
f68e7531e3 Sponsors/scrapeless (#1619) 2025-11-17 07:44:52 +01:00
UncleCode
cb637fb5c4 Merge pull request #1613 from unclecode/release/v0.7.7 2025-11-16 12:26:54 +01:00
ntohidi
6244f56f36 Release v0.7.7
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
2025-11-14 10:23:31 +01:00
ntohidi
2c973b1183 Merge branch 'develop' into release/v0.7.7 2025-11-13 14:54:05 +01:00
Nasrin
f3146de969 Merge pull request #1609 from unclecode/fix/update-config-documentation
Update browser and crawler run config documentation to match async_configs.py implementation
2025-11-13 21:52:53 +08:00
Soham Kukreti
d6b6d11a2d docs: update browser and crawler run config documentation to match async_configs.py implementation
Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)

- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth

- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental

- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
2025-11-13 14:54:16 +05:30
ntohidi
b58579548c Bump version to 0.7.7 for stable release 2025-11-13 09:52:18 +01:00
Nasrin
466be69e72 Merge pull request #1607 from unclecode/fix/dfs_deep_crawling
Fix/dfs deep crawling
2025-11-13 16:43:47 +08:00
AHMET YILMAZ
ceade853c3 Enhance DFSDeepCrawlStrategy documentation for clarity and detail 2025-11-13 16:39:08 +08:00
ntohidi
998c809e08 Rename folder name for NSTProxy integration examples for crawl4ai 2025-11-13 09:36:39 +01:00
ntohidi
d0fb53540d Update proxy-security documentation 2025-11-13 09:23:44 +01:00
Nasrin
8116b15b63 Merge pull request #1596 from unclecode/docs-proxy-security
#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2025-11-13 16:22:28 +08:00
AHMET YILMAZ
fe353c4e27 Refactor proxy configuration documentation for clarity and consistency 2025-11-13 11:20:24 +08:00
ntohidi
89cc29fe44 Merge branch 'fix/docker' into develop 2025-11-12 17:06:31 +01:00
Nasrin
cdcb8836b7 Merge pull request #1605 from Nstproxy/feat/nstproxy
feat: Add Nstproxy Proxies
2025-11-12 23:56:14 +08:00
Nasrin
b207ae2848 Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing
Add CDP endpoint verification with exponential backoff for managed browsers
2025-11-12 23:53:57 +08:00
Nasrin
be00fc3a42 Merge pull request #1598 from unclecode/fix/sitemap_seeder
#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…
2025-11-12 18:09:34 +08:00
Nasrin
124ac583bb Merge pull request #1599 from unclecode/docs-llm-strategies-update
#1551 : Fix casing and variable name consistency for LLMConfig in doc…
2025-11-12 17:54:26 +08:00
AHMET YILMAZ
1bd3de6a47 #1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking 2025-11-12 17:44:43 +08:00
nstproxy
80452166c8 feat: Add Nstproxy Proxies 2025-11-12 16:25:39 +08:00
UncleCode
a99cd37c0e Merge pull request #1597 from unclecode/sponsors/capsolver 2025-11-11 14:50:44 +08:00
AHMET YILMAZ
2e8f8c9b49 #1551 : Fix casing and variable name consistency for LLMConfig in documentation 2025-11-10 15:38:14 +08:00
AHMET YILMAZ
80745bceb9 #1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder 2025-11-10 14:15:54 +08:00
Aravind Karnam
4bee230c37 docs: Add a tip for captcha solving usecases using a third party integration 2025-11-10 11:20:48 +05:30
Aravind
006e29f308 Merge pull request #1589 from capsolver/main
Add some examples of using capsolver to solve captcha
2025-11-10 10:45:16 +05:30
AHMET YILMAZ
263ac890fd #1591
: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
2025-11-10 11:42:07 +08:00
unclecode
1a22fb4d4f docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities

- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track

- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00
unclecode
81b5312629 Update gitignore 2025-11-09 10:49:42 +08:00
Nasrin
d56b0eb9a9 Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser
feat(ManagedBrowser): add viewport size configuration for browser launch
2025-11-06 18:42:45 +08:00
Nasrin
66175e132b Merge pull request #1590 from unclecode/fix/async-llm-extraction-arunMany
This commit resolves issue #1055 where LLM extraction was blocking async
2025-11-06 18:40:42 +08:00
ntohidi
a30548a98f This commit resolves issue #1055 where LLM extraction was blocking async
execution, causing URLs to be processed sequentially instead of in parallel.

  Changes:
  - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
  - Implemented arun() method in ExtractionStrategy base class with thread pool fallback
  - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
  - Updated AsyncWebCrawler.arun() to detect and use arun() when available
  - Added comprehensive test suite to verify parallel execution

  Impact:
  - LLM extraction now runs truly in parallel across multiple URLs
  - Significant performance improvement for multi-URL crawls with LLM strategies
  - Backward compatible - existing extraction strategies continue to work
  - No breaking changes to public API

  Technical details:
  - Uses litellm.acompletion for non-blocking LLM calls
  - Leverages asyncio.gather for concurrent chunk processing
  - Maintains backward compatibility via asyncio.to_thread fallback
  - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
2025-11-06 11:22:45 +01:00
CapSolver
2ae9899eac Clarify CapSolver integration instructions
Updated text for clarity and capitalization.
2025-11-06 15:49:30 +08:00
CapSolver
57aeb70f00 Add CapSolver Captcha Solver 2025-11-06 15:37:31 +08:00
Nasrin
2c918155aa Merge pull request #1529 from unclecode/fix/remove_overlay_elements
Fix remove_overlay_elements functionality by calling injected JS function.
2025-11-06 00:10:32 +08:00
Nasrin
854694ef33 Merge pull request #1537 from unclecode/fix/docker-compose-llm-env
fix(docker): Remove environment variable overrides in docker-compose.yml
2025-11-06 00:07:51 +08:00
Nasrin
6534ece026 Merge pull request #1532 from unclecode/fix/update-documentation
Standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
2025-11-05 23:37:05 +08:00
Nasrin
89e28d4eee Merge pull request #1558 from unclecode/claude/fix-update-pyopenssl-security-011CUPexU25DkNvoxfu5ZrnB
Claude/fix update pyopenssl security 011 cu pex u25 dk nvoxfu5 zrn b
2025-10-28 17:09:11 +08:00
ntohidi
c0f1865287 feat(api): update marketplace version and build date in root endpoint response 2025-10-26 11:35:39 +01:00
ntohidi
46ef1116c4 fix(app-detail): enhance tab functionality, hide documentation and support tabs in marketplace 2025-10-26 11:21:29 +01:00
Nasrin
4df83893ac Merge pull request #1560 from unclecode/fix/marketplace
Fix/marketplace
2025-10-23 22:17:06 +08:00
ntohidi
13e116610d fix(marketplace): improve app detail page content rendering and UX
Fixed multiple issues with app detail page content display and formatting
2025-10-23 16:12:30 +02:00
Nasrin
40173eeb73 Update Docker hooks and Webhook documents (#1557)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* docs: enhance README and docker-deployment documentation with Job Queue and Webhook API details

* docs: update docker_hooks_examples.py with comprehensive examples and improved structure

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 22:34:19 +08:00
ntohidi
97c92c4f62 fix(marketplace): replace hardcoded app detail content with database-driven fields.
The app detail page was displaying hardcoded/templated content instead of
using actual data from the database. This prevented admins from controlling
the content shown in Overview, Integration, and Documentation tabs.
2025-10-21 15:39:04 +02:00
unclecode
73a5a7b0f5 Update gitignore 2025-10-18 12:41:29 +08:00
unclecode
05921811b8 docs: add comprehensive technical architecture documentation
Created ARCHITECTURE.md as a complete technical reference for the
Crawl4AI Docker server, replacing the stress test pipeline document
with production-grade documentation.

Contents:
- System overview with architecture diagrams
- Core components deep-dive (server, API, utils)
- Smart browser pool implementation details
- Real-time monitoring system architecture
- WebSocket implementation and fallback strategy
- Memory management and container detection
- Production optimizations and code review fixes
- Deployment guides (local, Docker, production)
- Comprehensive troubleshooting section
- Debug tools and performance tuning
- Test suite documentation
- Architecture decision log (ADRs)

Target audience: Developers maintaining or extending the system
Goal: Enable rapid onboarding and confident modifications
2025-10-18 12:05:49 +08:00
unclecode
25507adb5b feat(monitor): implement code review fixes and real-time WebSocket monitoring
Backend Improvements (11 fixes applied):

Critical Fixes:
- Add lock protection for browser pool access in monitor stats
- Ensure async track_janitor_event across all call sites
- Improve error handling in monitor request tracking (already in place)

Important Fixes:
- Replace fire-and-forget Redis with background persistence worker
- Add time-based expiry for completed requests/errors (5min cleanup)
- Implement input validation for monitor route parameters
- Add 4s timeout to timeline updater to prevent hangs
- Add warning when killing browsers with active requests
- Implement monitor cleanup on shutdown with final persistence
- Document memory estimates with TODO for actual tracking

Frontend Enhancements:

WebSocket Real-time Updates:
- Add WebSocket endpoint at /monitor/ws for live monitoring
- Implement auto-reconnect with exponential backoff (max 5 attempts)
- Add graceful fallback to HTTP polling on WebSocket failure
- Send comprehensive updates every 2 seconds (health, requests, browsers, timeline, events)

UI/UX Improvements:
- Add live connection status indicator with pulsing animation
  - Green "Live" = WebSocket connected
  - Yellow "Connecting..." = Attempting connection
  - Blue "Polling" = Fallback to HTTP polling
  - Red "Disconnected" = Connection failed
- Restore original beautiful styling for all sections
- Improve request table layout with flex-grow for URL column
- Add browser type text labels alongside emojis
- Add flex layout to browser section header

Testing:
- Add test-websocket.py for WebSocket validation
- All 7 integration tests passing successfully

Summary: 563 additions across 6 files
2025-10-18 11:38:25 +08:00
unclecode
aba4036ab6 Add demo and test scripts for monitor dashboard activity
- Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity.
- Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics.
- Added a logo image to the static assets for branding purposes.
2025-10-17 22:43:06 +08:00
unclecode
e2af031b09 feat(monitor): add real-time monitoring dashboard with Redis persistence
Complete observability solution for production deployments with terminal-style UI.

**Backend Implementation:**
- `monitor.py`: Stats manager tracking requests, browsers, errors, timeline data
- `monitor_routes.py`: REST API endpoints for all monitor functionality
  - GET /monitor/health - System health snapshot
  - GET /monitor/requests - Active & completed requests
  - GET /monitor/browsers - Browser pool details
  - GET /monitor/endpoints/stats - Aggregated endpoint analytics
  - GET /monitor/timeline - Time-series data (memory, requests, browsers)
  - GET /monitor/logs/{janitor,errors} - Event logs
  - POST /monitor/actions/{cleanup,kill_browser,restart_browser} - Control actions
  - POST /monitor/stats/reset - Reset counters
- Redis persistence for endpoint stats (survives restart)
- Timeline tracking (5min window, 5s resolution, 60 data points)

**Frontend Dashboard** (`/dashboard`):
- **System Health Bar**: CPU%, Memory%, Network I/O, Uptime
- **Pool Status**: Live counts (permanent/hot/cold browsers + memory)
- **Live Activity Tabs**:
  - Requests: Active (realtime) + recent completed (last 100)
  - Browsers: Detailed table with actions (kill/restart)
  - Janitor: Cleanup event log with timestamps
  - Errors: Recent errors with stack traces
- **Endpoint Analytics**: Count, avg latency, success%, pool hit%
- **Resource Timeline**: SVG charts (memory/requests/browsers) with terminal aesthetics
- **Control Actions**: Force cleanup, restart permanent, reset stats
- **Auto-refresh**: 5s polling (toggleable)

**Integration:**
- Janitor events tracked (close_cold, close_hot, promote)
- Crawler pool promotion events logged
- Timeline updater background task (5s interval)
- Lifespan hooks for monitor initialization

**UI Design:**
- Terminal vibe matching Crawl4AI theme
- Dark background, cyan/pink accents, monospace font
- Neon glow effects on charts
- Responsive layout, hover interactions
- Cross-navigation: Playground ↔ Monitor

**Key Features:**
- Zero-config: Works out of the box with existing Redis
- Real-time visibility into pool efficiency
- Manual browser management (kill/restart)
- Historical data persistence
- DevOps-friendly UX

Routes:
- API: `/monitor/*` (backend endpoints)
- UI: `/dashboard` (static HTML)
2025-10-17 21:36:25 +08:00
unclecode
b97eaeea4c feat(docker): implement smart browser pool with 10x memory efficiency
Major refactoring to eliminate memory leaks and enable high-scale crawling:

- **Smart 3-Tier Browser Pool**:
  - Permanent browser (always-ready default config)
  - Hot pool (configs used 3+ times, longer TTL)
  - Cold pool (new/rare configs, short TTL)
  - Auto-promotion: cold → hot after 3 uses
  - 100% pool reuse achieved in tests

- **Container-Aware Memory Detection**:
  - Read cgroup v1/v2 memory limits (not host metrics)
  - Accurate memory pressure detection in Docker
  - Memory-based browser creation blocking

- **Adaptive Janitor**:
  - Dynamic cleanup intervals (10s/30s/60s based on memory)
  - Tiered TTLs: cold 30-300s, hot 120-600s
  - Aggressive cleanup at high memory pressure

- **Unified Pool Usage**:
  - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm)
  - Fixed config signature mismatch (permanent browser matches endpoints)
  - get_default_browser_config() helper for consistency

- **Configuration**:
  - Reduced idle_ttl: 1800s → 300s (30min → 5min)
  - Fixed port: 11234 → 11235 (match Gunicorn)

**Performance Results** (from stress tests):
- Memory: 10x reduction (500-700MB × N → 270MB permanent)
- Latency: 30-50x faster (<100ms pool hits vs 3-5s startup)
- Reuse: 100% for default config, 60%+ for variants
- Capacity: 100+ concurrent requests (vs ~20 before)
- Leak: 0 MB/cycle (stable across tests)

**Test Infrastructure**:
- 7-phase sequential test suite (tests/)
- Docker stats integration + log analysis
- Pool promotion verification
- Memory leak detection
- Full endpoint coverage

Fixes memory issues reported in production deployments.
2025-10-17 20:38:39 +08:00
Soham Kukreti
46e1a67f61 fix(docker): Remove environment variable overrides in docker-compose.yml (#1411)
The docker-compose.yml had an `environment:` section with variable
substitutions (${VAR:-}) that was overriding values from .llm.env with
empty strings.

- Commented out the `environment:` section to prevent overwrites
- Added clear warning comment explaining the override behavior
- .llm.env values now load directly into container without interference
2025-10-06 14:41:22 +05:30
Soham Kukreti
7dfe528d43 fix(docs): standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
- Switch installs to pip install -r requirements.txt (tutorial and app docs)
- Update local run steps to python server.py and http://localhost:8000
- Set default PORT to 8000; update port-in-use commands and alt port 8001
- Replace unsupported :contains() example with accessible attribute selector
- Update example URLs in tutorial servers to 127.0.0.1:8000
- Add “Identity-based crawling” section with crwl profiles CLI workflow and code usage
- Replace legacy-docs note with sponsorship message in docs/md_v2/index.md
- Minor copy and consistency fixes across pages
2025-10-03 22:00:46 +05:30
Soham Kukreti
2dc6588573 fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396
- Fix critical bug where overlay removal JS function was injected but never called
  - Change remove_overlay_elements() to properly execute the injected async function
  - Wrap JS execution in async to handle the async overlay removal logic
  - Add test_remove_overlay_elements() test case to verify functionality works
  - Ensure overlay elements (cookie banners, popups, modals) are actually removed

  The remove_overlay_elements feature now works as intended:
  - Before: Function definition injected but never executed (silent failure)
  - After: Function injected and called, successfully removing overlay elements
2025-09-29 20:40:08 +05:30
Soham Kukreti
34c0996ee4 fix: Add CDP endpoint verification with exponential backoff for managed browsers (#1445)
browser_manager:
- Add CDP endpoint verification with retry logic and exponential backoff
- Call verification before connecting to CDP in `start()` method
- Graceful handling of timing issues during browser startup

test_cdp_strategy:
- Fix cookie persistence test by adding storage state management
- Fix session management test to work with managed browser architecture
- Add comprehensive CDP timing tests covering:
  - Fast startup scenarios
  - Delayed browser startup simulation
  - Exponential backoff behavior validation
  - Concurrent browser connections
  - Stress testing with multiple successive startups
  - Retry count verification

Impact:
- Eliminates browser startup failures due to CDP timing issues
- Provides robust fallback with automatic retries
- Maintains fast startup when CDP is immediately available
- Comprehensive test coverage ensures reliability

Resolves CDP connection timing issues in managed browser mode.
2025-09-29 19:31:09 +05:30
AHMET YILMAZ
e3467c08f6 #1490 feat(ManagedBrowser): add viewport size configuration for browser launch 2025-09-17 17:40:38 +08:00
144 changed files with 24622 additions and 3169 deletions

View File

@@ -11,6 +11,25 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Free up disk space
run: |
echo "=== Disk space before cleanup ==="
df -h
# Remove unnecessary tools and libraries (frees ~25GB)
sudo rm -rf /usr/share/dotnet
sudo rm -rf /usr/local/lib/android
sudo rm -rf /opt/ghc
sudo rm -rf /opt/hostedtoolcache/CodeQL
sudo rm -rf /usr/local/share/boost
sudo rm -rf /usr/share/swift
# Clean apt cache
sudo apt-get clean
echo "=== Disk space after cleanup ==="
df -h
- name: Checkout code
uses: actions/checkout@v4

16
.gitignore vendored
View File

@@ -267,10 +267,13 @@ continue_config.json
.private/
.claude/
.context/
CLAUDE_MONITOR.md
CLAUDE.md
.claude/
tests/**/test_site
tests/**/reports
tests/**/benchmark_reports
@@ -282,3 +285,16 @@ docs/apps/linkdin/debug*/
docs/apps/linkdin/samples/insights/*
scripts/
!scripts/gen-sbom.sh
# Databse files
*.sqlite3
*.sqlite3-journal
*.db-journal
*.db-wal
*.db-shm
*.db
*.rdb
*.ldb
MEMORY.md

View File

@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.8.0] - 2026-01-12
### Security
- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
- Prevents arbitrary module imports in user-provided hook code
- Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
- Credit: Neo by ProjectDiscovery
- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
- Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
- Only allows `http://`, `https://`, and `raw:` URLs
- Credit: Neo by ProjectDiscovery
### Breaking Changes
- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
- **Docker API: file:// URLs blocked**: Use Python library directly for local file processing
### Added
- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
### Fixed
- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
- **Caching System**: Various improvements to cache validation and persistence
### Documentation
- Multi-sample schema generation section
- URL seeder smart TTL cache parameters
- v0.8.0 migration guide
- Security policy and disclosure process
## [Unreleased]
### Added

View File

@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build
# C4ai version
ARG C4AI_VER=0.7.6
ARG C4AI_VER=0.8.0
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER
@@ -167,6 +167,11 @@ RUN mkdir -p /home/appuser/.cache/ms-playwright \
RUN crawl4ai-doctor
# Ensure all cache directories belong to appuser
# This fixes permission issues with .cache/url_seeder and other runtime cache dirs
RUN mkdir -p /home/appuser/.cache \
&& chown -R appuser:appuser /home/appuser/.cache
# Copy application code
COPY deploy/docker/* ${APP_HOME}/

174
README.md
View File

@@ -12,6 +12,16 @@
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_Well be onboarding in phases and working closely with early users.
Limited slots._
---
<p align="center">
<a href="https://x.com/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
@@ -27,13 +37,13 @@
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
[✨ Check out latest update v0.7.6](#-recent-updates)
[✨ Check out latest update v0.8.0](#-recent-updates)
**New in v0.7.6**: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. No more polling! [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
**New in v0.8.0**: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. Critical security fixes for Docker API (hooks disabled by default, file:// URLs blocked). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
✨ Recent v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
✨ Recent v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
✨ Previous v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -296,6 +306,7 @@ pip install -e ".[all]" # Install all optional features
### New Docker Features
The new Docker implementation includes:
- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
@@ -310,7 +321,8 @@ The new Docker implementation includes:
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Visit the playground at http://localhost:11235/playground
# Visit the monitoring dashboard at http://localhost:11235/dashboard
# Or the playground at http://localhost:11235/playground
```
### Quick Test
@@ -339,7 +351,7 @@ else:
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).
</details>
@@ -544,8 +556,151 @@ async def test_news_crawl():
</details>
---
> **💡 Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as <strong>[CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration)</strong>. They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target websites terms of service and applicable laws.
## ✨ Recent Updates
<details open>
<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>
This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
- **🔄 Deep Crawl Crash Recovery**:
- `on_state_change` callback fires after each URL for real-time state persistence
- `resume_state` parameter to continue from a saved checkpoint
- JSON-serializable state for Redis/database storage
- Works with BFS, DFS, and Best-First strategies
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Continue from checkpoint
on_state_change=save_to_redis, # Called after each URL
)
```
- **⚡ Prefetch Mode for Fast URL Discovery**:
- `prefetch=True` skips markdown, extraction, and media processing
- 5-10x faster than full processing
- Perfect for two-phase crawling: discover first, process selectively
```python
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun("https://example.com", config=config)
# Returns HTML and links only - no markdown generation
```
- **🔒 Security Fixes (Docker API)**:
- Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
- `file://` URLs blocked on API endpoints to prevent LFI
- `__import__` removed from hook execution sandbox
[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
</details>
<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>
This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability.
- **🐳 Docker API Fixes**:
- Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642)
- Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629)
- Fixed `.cache` folder permissions in Docker image (#1638)
- **🤖 LLM Extraction Improvements**:
- Configurable rate limiter backoff with new `LLMConfig` parameters (#1269):
```python
from crawl4ai import LLMConfig
config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5s on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
```
- HTML input format support for `LLMExtractionStrategy` (#1178):
```python
from crawl4ai import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
llm_config=config,
instruction="Extract table data",
input_format="html" # Now supports: "html", "markdown", "fit_markdown"
)
```
- Fixed raw HTML URL variable - extraction strategies now receive `"Raw HTML"` instead of HTML blob (#1116)
- **🔗 URL Handling**:
- Fixed relative URL resolution after JavaScript redirects (#1268)
- Fixed import statement formatting in extracted code (#1181)
- **📦 Dependency Updates**:
- Replaced deprecated PyPDF2 with pypdf (#1412)
- Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678)
- **🧠 AdaptiveCrawler**:
- Fixed query expansion to actually use LLM instead of hardcoded mock data (#1621)
[Full v0.7.8 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
</details>
<details>
<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
```python
# Access the monitoring dashboard
# Visit: http://localhost:11235/dashboard
# Real-time metrics include:
# - System health (CPU, memory, network, uptime)
# - Active and completed request tracking
# - Browser pool management (permanent/hot/cold)
# - Janitor cleanup events
# - Error monitoring with full context
```
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
```python
import httpx
async with httpx.AsyncClient() as client:
# System health
health = await client.get("http://localhost:11235/monitor/health")
# Request tracking
requests = await client.get("http://localhost:11235/monitor/requests")
# Browser pool status
browsers = await client.get("http://localhost:11235/monitor/browsers")
# Endpoint statistics
stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
```
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
- **🧹 Janitor System**: Automatic resource management with event logging
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
- **🐛 Critical Bug Fixes**:
- Fixed async LLM extraction blocking issue (#1055)
- Enhanced DFS deep crawl strategy (#1607)
- Fixed sitemap parsing in AsyncUrlSeeder (#1598)
- Resolved browser viewport configuration (#1495)
- Fixed CDP timing with exponential backoff (#1528)
- Security update for pyOpenSSL (>=25.3.0)
[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
</details>
<details>
<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>
@@ -977,11 +1132,16 @@ Our enterprise sponsors and technology partners help scale Crawl4AI to power pro
| Company | About | Sponsorship Tier |
|------|------|----------------------------|
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥈 Silver |
| <a href="https://www.thordata.com/?ls=github&lk=crawl4ai" target="_blank"><img src="https://gist.github.com/aravindkarnam/dfc598a67be5036494475acece7e54cf/raw/thor_data.svg" alt="Thor Data" width="120"/></a> | Leveraging Thordata ensures seamless compatibility with any AI/ML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | 🥈 Silver |
| <a href="https://app.nstproxy.com/register?i=ecOqW9" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.github.com/aravindkarnam/62f82bd4818d3079d9dd3c31df432cf8/raw/nst-light.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://www.nstproxy.com/logo.svg"><img alt="nstproxy" src="ttps://www.nstproxy.com/logo.svg"></picture></a> | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |
| <a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=crawl4ai" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.githubusercontent.com/aravindkarnam/0d275b942705604263e5c32d2db27bc1/raw/Scrapeless-light-logo.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"><img alt="Scrapeless" src="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"></picture></a> | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
| <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="DataSync" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
| <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 518, offering both online and on-campus education. | 🥇 Gold |
| <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based Aleph Null is Asias leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
### 🧑‍🤝 Individual Sponsors
A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving!

122
SECURITY.md Normal file
View File

@@ -0,0 +1,122 @@
# Security Policy
## Supported Versions
| Version | Supported |
| ------- | ------------------ |
| 0.8.x | :white_check_mark: |
| 0.7.x | :x: (upgrade recommended) |
| < 0.7 | :x: |
## Reporting a Vulnerability
We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.
### How to Report
**DO NOT** open a public GitHub issue for security vulnerabilities.
Instead, please report via one of these methods:
1. **GitHub Security Advisories (Preferred)**
- Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
- Click "New draft security advisory"
- Fill in the details
2. **Email**
- Send details to: security@crawl4ai.com
- Use subject: `[SECURITY] Brief description`
- Include:
- Description of the vulnerability
- Steps to reproduce
- Potential impact
- Any suggested fixes
### What to Expect
- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 7 days
- **Resolution Timeline**: Depends on severity
- Critical: 24-72 hours
- High: 7 days
- Medium: 30 days
- Low: 90 days
### Disclosure Policy
- We follow responsible disclosure practices
- We will coordinate with you on disclosure timing
- Credit will be given to reporters (unless anonymity is requested)
- We may request CVE assignment for significant vulnerabilities
## Security Best Practices for Users
### Docker API Deployment
If you're running the Crawl4AI Docker API in production:
1. **Enable Authentication**
```yaml
# config.yml
security:
enabled: true
jwt_enabled: true
```
```bash
# Set a strong secret key
export SECRET_KEY="your-secure-random-key-here"
```
2. **Hooks are Disabled by Default** (v0.8.0+)
- Only enable if you trust all API users
- Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary
3. **Network Security**
- Run behind a reverse proxy (nginx, traefik)
- Use HTTPS in production
- Restrict access to trusted IPs if possible
4. **Container Security**
- Run as non-root user (default in our container)
- Use read-only filesystem where possible
- Limit container resources
### Library Usage
When using Crawl4AI as a Python library:
1. **Validate URLs** before crawling untrusted input
2. **Sanitize extracted content** before using in other systems
3. **Be cautious with hooks** - they execute arbitrary code
## Known Security Issues
### Fixed in v0.8.0
| ID | Severity | Description | Fix |
|----|----------|-------------|-----|
| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |
See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.
## Security Features
### v0.8.0+
- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on API
- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
- **JWT Authentication**: Optional but recommended for production
- **Rate Limiting**: Configurable request limits
- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled
## Acknowledgments
We thank the following security researchers for responsibly disclosing vulnerabilities:
- **[Neo by ProjectDiscovery](https://projectdiscovery.io/blog/introducing-neo)** - RCE and LFI vulnerabilities (December 2025)
---
*Last updated: January 2026*

View File

@@ -72,6 +72,8 @@ from .deep_crawling import (
BestFirstCrawlingStrategy,
DFSDeepCrawlStrategy,
DeepCrawlDecorator,
ContentRelevanceFilter,
ContentTypeScorer,
)
# NEW: Import AsyncUrlSeeder
from .async_url_seeder import AsyncUrlSeeder

View File

@@ -1,7 +1,7 @@
# crawl4ai/__version__.py
# This is the version that will be used for stable releases
__version__ = "0.7.6"
__version__ = "0.8.0"
# For nightly builds, this gets set during build process
__nightly_version__ = None

View File

@@ -728,18 +728,18 @@ class EmbeddingStrategy(CrawlStrategy):
provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None
# response = perform_completion_with_backoff(
# provider=provider,
# prompt_with_variables=prompt,
# api_token=api_token,
# json_response=True
# )
response = perform_completion_with_backoff(
provider=provider,
prompt_with_variables=prompt,
api_token=api_token,
json_response=True
)
# variations = json.loads(response.choices[0].message.content)
variations = json.loads(response.choices[0].message.content)
# # Mock data with more variations for split
variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations = {'queries': [

View File

@@ -1,6 +1,7 @@
import importlib
import os
from typing import Union
import warnings
import requests
from .config import (
DEFAULT_PROVIDER,
DEFAULT_PROVIDER_API_KEY,
@@ -26,14 +27,14 @@ from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
from .cache_context import CacheMode
from .proxy_strategy import ProxyRotationStrategy
from typing import Union, List, Callable
import inspect
from typing import Any, Dict, Optional
from typing import Any, Callable, Dict, List, Optional, Union
from enum import Enum
# Type alias for URL matching
UrlMatcher = Union[str, Callable[[str], bool], List[Union[str, Callable[[str], bool]]]]
class MatchMode(Enum):
OR = "or"
AND = "and"
@@ -41,8 +42,7 @@ class MatchMode(Enum):
# from .proxy_strategy import ProxyConfig
def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
def to_serializable_dict(obj: Any, ignore_default_value : bool = False):
"""
Recursively convert an object to a serializable dictionary using {type, params} structure
for complex objects.
@@ -109,8 +109,6 @@ def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
# if value is not None:
# current_values[attr_name] = to_serializable_dict(value)
return {
"type": obj.__class__.__name__,
"params": current_values
@@ -136,12 +134,20 @@ def from_serializable_dict(data: Any) -> Any:
if data["type"] == "dict" and "value" in data:
return {k: from_serializable_dict(v) for k, v in data["value"].items()}
# Import from crawl4ai for class instances
import crawl4ai
if hasattr(crawl4ai, data["type"]):
cls = getattr(crawl4ai, data["type"])
cls = None
# If you are receiving an error while trying to convert a dict to an object:
# Either add a module to `modules_paths` list, or add the `data["type"]` to the crawl4ai __init__.py file
module_paths = ["crawl4ai"]
for module_path in module_paths:
try:
mod = importlib.import_module(module_path)
if hasattr(mod, data["type"]):
cls = getattr(mod, data["type"])
break
except (ImportError, AttributeError):
continue
if cls is not None:
# Handle Enum
if issubclass(cls, Enum):
return cls(data["params"])
@@ -367,6 +373,20 @@ class BrowserConfig:
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
browser_context_id (str or None): Pre-existing CDP browser context ID to use. When provided along with
cdp_url, the crawler will reuse this context instead of creating a new one.
Useful for cloud browser services that pre-create isolated contexts.
Default: None.
target_id (str or None): Pre-existing CDP target ID (page) to use. When provided along with
browser_context_id, the crawler will reuse this target instead of creating
a new page. Default: None.
cdp_cleanup_on_close (bool): When True and using cdp_url, the close() method will still clean up
the local Playwright client resources. Useful for cloud/server scenarios
where you don't own the remote browser but need to prevent memory leaks
from accumulated Playwright instances. Default: False.
create_isolated_context (bool): When True and using cdp_url, forces creation of a new browser context
instead of reusing the default context. Essential for concurrent crawls
on the same browser to prevent navigation conflicts. Default: False.
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
@@ -421,6 +441,10 @@ class BrowserConfig:
browser_mode: str = "dedicated",
use_managed_browser: bool = False,
cdp_url: str = None,
browser_context_id: str = None,
target_id: str = None,
cdp_cleanup_on_close: bool = False,
create_isolated_context: bool = False,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chromium",
@@ -453,13 +477,18 @@ class BrowserConfig:
debugging_port: int = 9222,
host: str = "localhost",
enable_stealth: bool = False,
init_scripts: List[str] = None,
):
self.browser_type = browser_type
self.headless = headless
self.headless = headless
self.browser_mode = browser_mode
self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
self.browser_context_id = browser_context_id
self.target_id = target_id
self.cdp_cleanup_on_close = cdp_cleanup_on_close
self.create_isolated_context = create_isolated_context
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium"
@@ -508,6 +537,7 @@ class BrowserConfig:
self.debugging_port = debugging_port
self.host = host
self.enable_stealth = enable_stealth
self.init_scripts = init_scripts if init_scripts is not None else []
fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode == "random":
@@ -555,6 +585,10 @@ class BrowserConfig:
browser_mode=kwargs.get("browser_mode", "dedicated"),
use_managed_browser=kwargs.get("use_managed_browser", False),
cdp_url=kwargs.get("cdp_url"),
browser_context_id=kwargs.get("browser_context_id"),
target_id=kwargs.get("target_id"),
cdp_cleanup_on_close=kwargs.get("cdp_cleanup_on_close", False),
create_isolated_context=kwargs.get("create_isolated_context", False),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"),
@@ -583,6 +617,7 @@ class BrowserConfig:
debugging_port=kwargs.get("debugging_port", 9222),
host=kwargs.get("host", "localhost"),
enable_stealth=kwargs.get("enable_stealth", False),
init_scripts=kwargs.get("init_scripts", []),
)
def to_dict(self):
@@ -592,12 +627,16 @@ class BrowserConfig:
"browser_mode": self.browser_mode,
"use_managed_browser": self.use_managed_browser,
"cdp_url": self.cdp_url,
"browser_context_id": self.browser_context_id,
"target_id": self.target_id,
"cdp_cleanup_on_close": self.cdp_cleanup_on_close,
"create_isolated_context": self.create_isolated_context,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"proxy_config": self.proxy_config.to_dict() if self.proxy_config else None,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
@@ -618,9 +657,10 @@ class BrowserConfig:
"debugging_port": self.debugging_port,
"host": self.host,
"enable_stealth": self.enable_stealth,
"init_scripts": self.init_scripts,
}
return result
def clone(self, **kwargs):
@@ -649,6 +689,85 @@ class BrowserConfig:
return config
return BrowserConfig.from_kwargs(config)
def set_nstproxy(
self,
token: str,
channel_id: str,
country: str = "ANY",
state: str = "",
city: str = "",
protocol: str = "http",
session_duration: int = 10,
):
"""
Fetch a proxy from NSTProxy API and automatically assign it to proxy_config.
Get your NSTProxy token from: https://app.nstproxy.com/profile
Args:
token (str): NSTProxy API token.
channel_id (str): NSTProxy channel ID.
country (str, optional): Country code (default: "ANY").
state (str, optional): State code (default: "").
city (str, optional): City name (default: "").
protocol (str, optional): Proxy protocol ("http" or "socks5"). Defaults to "http".
session_duration (int, optional): Session duration in minutes (0 = rotate each request). Defaults to 10.
Raises:
ValueError: If the API response format is invalid.
PermissionError: If the API returns an error message.
"""
# --- Validate input early ---
if not token or not channel_id:
raise ValueError("[NSTProxy] token and channel_id are required")
if protocol not in ("http", "socks5"):
raise ValueError(f"[NSTProxy] Invalid protocol: {protocol}")
# --- Build NSTProxy API URL ---
params = {
"fType": 2,
"count": 1,
"channelId": channel_id,
"country": country,
"protocol": protocol,
"sessionDuration": session_duration,
"token": token,
}
if state:
params["state"] = state
if city:
params["city"] = city
url = "https://api.nstproxy.com/api/v1/generate/apiproxies"
try:
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
# --- Handle API error response ---
if isinstance(data, dict) and data.get("err"):
raise PermissionError(f"[NSTProxy] API Error: {data.get('msg', 'Unknown error')}")
if not isinstance(data, list) or not data:
raise ValueError("[NSTProxy] Invalid API response — expected a non-empty list")
proxy_info = data[0]
# --- Apply proxy config ---
self.proxy_config = ProxyConfig(
server=f"{protocol}://{proxy_info['ip']}:{proxy_info['port']}",
username=proxy_info["username"],
password=proxy_info["password"],
)
except Exception as e:
print(f"[NSTProxy] ❌ Failed to set proxy: {e}")
raise
class VirtualScrollConfig:
"""Configuration for virtual scroll handling.
@@ -914,6 +1033,18 @@ class CrawlerRunConfig():
proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
# Sticky Proxy Session Parameters
proxy_session_id (str or None): When set, maintains the same proxy for all requests sharing this session ID.
The proxy is acquired on first request and reused for subsequent requests.
Session expires when explicitly released or crawler context is closed.
Default: None.
proxy_session_ttl (int or None): Time-to-live for sticky session in seconds.
After TTL expires, a new proxy is acquired on next request.
Default: None (session lasts until explicitly released or crawler closes).
proxy_session_auto_release (bool): If True, automatically release the proxy session after a batch operation.
Useful for arun_many() to clean up sessions automatically.
Default: False.
# Browser Location and Identity Parameters
locale (str or None): Locale to use for the browser context (e.g., "en-US").
Default: None.
@@ -942,6 +1073,15 @@ class CrawlerRunConfig():
shared_data (dict or None): Shared data to be passed between hooks.
Default: None.
# Cache Validation Parameters (Smart Cache)
check_cache_freshness (bool): If True, validates cached content freshness using HTTP
conditional requests (ETag/Last-Modified) and head fingerprinting
before returning cached results. Avoids full browser crawls when
content hasn't changed. Only applies when cache_mode allows reads.
Default: False.
cache_validation_timeout (float): Timeout in seconds for cache validation HTTP requests.
Default: 10.0.
# Page Navigation and Timing Parameters
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
Default: "domcontentloaded".
@@ -1048,6 +1188,12 @@ class CrawlerRunConfig():
# Connection Parameters
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
Default: False.
process_in_browser (bool): If True, forces raw:/file:// URLs to be processed through the browser
pipeline (enabling js_code, wait_for, scrolling, etc.). When False (default),
raw:/file:// URLs use a fast path that returns HTML directly without browser
interaction. This is automatically enabled when browser-requiring parameters
are detected (js_code, wait_for, screenshot, pdf, etc.).
Default: False.
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
Default: False.
@@ -1093,6 +1239,10 @@ class CrawlerRunConfig():
scraping_strategy: ContentScrapingStrategy = None,
proxy_config: Union[ProxyConfig, dict, None] = None,
proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
# Sticky Proxy Session Parameters
proxy_session_id: Optional[str] = None,
proxy_session_ttl: Optional[int] = None,
proxy_session_auto_release: bool = False,
# Browser Location and Identity Parameters
locale: Optional[str] = None,
timezone_id: Optional[str] = None,
@@ -1107,6 +1257,9 @@ class CrawlerRunConfig():
no_cache_read: bool = False,
no_cache_write: bool = False,
shared_data: dict = None,
# Cache Validation Parameters (Smart Cache)
check_cache_freshness: bool = False,
cache_validation_timeout: float = 10.0,
# Page Navigation and Timing Parameters
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
@@ -1160,7 +1313,10 @@ class CrawlerRunConfig():
# Connection Parameters
method: str = "GET",
stream: bool = False,
prefetch: bool = False, # When True, return only HTML + links (skip heavy processing)
process_in_browser: bool = False, # Force browser processing for raw:/file:// URLs
url: str = None,
base_url: str = None, # Base URL for markdown link resolution (used with raw: HTML)
check_robots_txt: bool = False,
user_agent: str = None,
user_agent_mode: str = None,
@@ -1179,6 +1335,7 @@ class CrawlerRunConfig():
):
# TODO: Planning to set properties dynamically based on the __init__ signature
self.url = url
self.base_url = base_url # Base URL for markdown link resolution
# Content Processing Parameters
self.word_count_threshold = word_count_threshold
@@ -1203,7 +1360,12 @@ class CrawlerRunConfig():
self.proxy_config = ProxyConfig.from_string(proxy_config)
self.proxy_rotation_strategy = proxy_rotation_strategy
# Sticky Proxy Session Parameters
self.proxy_session_id = proxy_session_id
self.proxy_session_ttl = proxy_session_ttl
self.proxy_session_auto_release = proxy_session_auto_release
# Browser Location and Identity Parameters
self.locale = locale
self.timezone_id = timezone_id
@@ -1220,6 +1382,9 @@ class CrawlerRunConfig():
self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write
self.shared_data = shared_data
# Cache Validation (Smart Cache)
self.check_cache_freshness = check_cache_freshness
self.cache_validation_timeout = cache_validation_timeout
# Page Navigation and Timing Parameters
self.wait_until = wait_until
@@ -1286,6 +1451,8 @@ class CrawlerRunConfig():
# Connection Parameters
self.stream = stream
self.prefetch = prefetch # Prefetch mode: return only HTML + links
self.process_in_browser = process_in_browser # Force browser processing for raw:/file:// URLs
self.method = method
# Robots.txt Handling Parameters
@@ -1483,6 +1650,10 @@ class CrawlerRunConfig():
scraping_strategy=kwargs.get("scraping_strategy"),
proxy_config=kwargs.get("proxy_config"),
proxy_rotation_strategy=kwargs.get("proxy_rotation_strategy"),
# Sticky Proxy Session Parameters
proxy_session_id=kwargs.get("proxy_session_id"),
proxy_session_ttl=kwargs.get("proxy_session_ttl"),
proxy_session_auto_release=kwargs.get("proxy_session_auto_release", False),
# Browser Location and Identity Parameters
locale=kwargs.get("locale", None),
timezone_id=kwargs.get("timezone_id", None),
@@ -1558,6 +1729,8 @@ class CrawlerRunConfig():
# Connection Parameters
method=kwargs.get("method", "GET"),
stream=kwargs.get("stream", False),
prefetch=kwargs.get("prefetch", False),
process_in_browser=kwargs.get("process_in_browser", False),
check_robots_txt=kwargs.get("check_robots_txt", False),
user_agent=kwargs.get("user_agent"),
user_agent_mode=kwargs.get("user_agent_mode"),
@@ -1567,6 +1740,7 @@ class CrawlerRunConfig():
# Link Extraction Parameters
link_preview_config=kwargs.get("link_preview_config"),
url=kwargs.get("url"),
base_url=kwargs.get("base_url"),
# URL Matching Parameters
url_matcher=kwargs.get("url_matcher"),
match_mode=kwargs.get("match_mode", MatchMode.OR),
@@ -1606,6 +1780,9 @@ class CrawlerRunConfig():
"scraping_strategy": self.scraping_strategy,
"proxy_config": self.proxy_config,
"proxy_rotation_strategy": self.proxy_rotation_strategy,
"proxy_session_id": self.proxy_session_id,
"proxy_session_ttl": self.proxy_session_ttl,
"proxy_session_auto_release": self.proxy_session_auto_release,
"locale": self.locale,
"timezone_id": self.timezone_id,
"geolocation": self.geolocation,
@@ -1662,6 +1839,8 @@ class CrawlerRunConfig():
"capture_console_messages": self.capture_console_messages,
"method": self.method,
"stream": self.stream,
"prefetch": self.prefetch,
"process_in_browser": self.process_in_browser,
"check_robots_txt": self.check_robots_txt,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
@@ -1712,7 +1891,10 @@ class LLMConfig:
frequency_penalty: Optional[float] = None,
presence_penalty: Optional[float] = None,
stop: Optional[List[str]] = None,
n: Optional[int] = None,
n: Optional[int] = None,
backoff_base_delay: Optional[int] = None,
backoff_max_attempts: Optional[int] = None,
backoff_exponential_factor: Optional[int] = None,
):
"""Configuaration class for LLM provider and API token."""
self.provider = provider
@@ -1741,6 +1923,9 @@ class LLMConfig:
self.presence_penalty = presence_penalty
self.stop = stop
self.n = n
self.backoff_base_delay = backoff_base_delay if backoff_base_delay is not None else 2
self.backoff_max_attempts = backoff_max_attempts if backoff_max_attempts is not None else 3
self.backoff_exponential_factor = backoff_exponential_factor if backoff_exponential_factor is not None else 2
@staticmethod
def from_kwargs(kwargs: dict) -> "LLMConfig":
@@ -1754,7 +1939,10 @@ class LLMConfig:
frequency_penalty=kwargs.get("frequency_penalty"),
presence_penalty=kwargs.get("presence_penalty"),
stop=kwargs.get("stop"),
n=kwargs.get("n")
n=kwargs.get("n"),
backoff_base_delay=kwargs.get("backoff_base_delay"),
backoff_max_attempts=kwargs.get("backoff_max_attempts"),
backoff_exponential_factor=kwargs.get("backoff_exponential_factor")
)
def to_dict(self):
@@ -1768,7 +1956,10 @@ class LLMConfig:
"frequency_penalty": self.frequency_penalty,
"presence_penalty": self.presence_penalty,
"stop": self.stop,
"n": self.n
"n": self.n,
"backoff_base_delay": self.backoff_base_delay,
"backoff_max_attempts": self.backoff_max_attempts,
"backoff_exponential_factor": self.backoff_exponential_factor
}
def clone(self, **kwargs):
@@ -1805,6 +1996,8 @@ class SeedingConfig:
score_threshold: Optional[float] = None,
scoring_method: str = "bm25",
filter_nonsense_urls: bool = True,
cache_ttl_hours: int = 24,
validate_sitemap_lastmod: bool = True,
):
"""
Initialize URL seeding configuration.
@@ -1836,10 +2029,14 @@ class SeedingConfig:
Requires extract_head=True. Default: None
score_threshold: Minimum relevance score (0.0-1.0) to include URL.
Only applies when query is provided. Default: None
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
Future: "semantic". Default: "bm25"
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
ads.txt, favicon.ico, etc. Default: True
cache_ttl_hours: Hours before sitemap cache expires. Set to 0 to disable TTL
(only lastmod validation). Default: 24
validate_sitemap_lastmod: If True, compares sitemap's <lastmod> with cache
timestamp and refetches if sitemap is newer. Default: True
"""
self.source = source
self.pattern = pattern
@@ -1856,6 +2053,8 @@ class SeedingConfig:
self.score_threshold = score_threshold
self.scoring_method = scoring_method
self.filter_nonsense_urls = filter_nonsense_urls
self.cache_ttl_hours = cache_ttl_hours
self.validate_sitemap_lastmod = validate_sitemap_lastmod
# Add to_dict, from_kwargs, and clone methods for consistency
def to_dict(self) -> Dict[str, Any]:

View File

@@ -452,48 +452,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if url.startswith(("http://", "https://", "view-source:")):
return await self._crawl_web(url, config)
elif url.startswith("file://"):
# initialize empty lists for console messages
captured_console = []
# Process local file
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html = f.read()
if config.screenshot:
screenshot_data = await self._generate_screenshot_from_html(html)
if config.capture_console_messages:
page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
captured_console = await self._capture_console_messages(page, url)
return AsyncCrawlResponse(
html=html,
response_headers=response_headers,
status_code=status_code,
screenshot=screenshot_data,
get_delayed_content=None,
console_messages=captured_console,
elif url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:"):
# Check if browser processing is required for file:// or raw: URLs
needs_browser = (
config.process_in_browser or
config.screenshot or
config.pdf or
config.capture_mhtml or
config.js_code or
config.wait_for or
config.scan_full_page or
config.remove_overlay_elements or
config.simulate_user or
config.magic or
config.process_iframes or
config.capture_console_messages or
config.capture_network_requests
)
#####
# Since both "raw:" and "raw://" start with "raw:", the first condition is always true for both, so "raw://" will be sliced as "//...", which is incorrect.
# Fix: Check for "raw://" first, then "raw:"
# Also, the prefix "raw://" is actually 6 characters long, not 7, so it should be sliced accordingly: url[6:]
#####
elif url.startswith("raw://") or url.startswith("raw:"):
# Process raw HTML content
# raw_html = url[4:] if url[:4] == "raw:" else url[7:]
raw_html = url[6:] if url.startswith("raw://") else url[4:]
html = raw_html
if config.screenshot:
screenshot_data = await self._generate_screenshot_from_html(html)
if needs_browser:
# Route through _crawl_web() for full browser pipeline
# _crawl_web() will detect file:// and raw: URLs and use set_content()
return await self._crawl_web(url, config)
# Fast path: return HTML directly without browser interaction
if url.startswith("file://"):
# Process local file
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html = f.read()
else:
# Process raw HTML content (raw:// or raw:)
html = url[6:] if url.startswith("raw://") else url[4:]
return AsyncCrawlResponse(
html=html,
response_headers=response_headers,
status_code=status_code,
screenshot=screenshot_data,
screenshot=None,
pdf_data=None,
mhtml_data=None,
get_delayed_content=None,
)
else:
@@ -666,67 +666,83 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if not config.js_only:
await self.execute_hook("before_goto", page, context=context, url=url, config=config)
try:
# Generate a unique nonce for this request
if config.experimental.get("use_csp_nonce", False):
nonce = hashlib.sha256(os.urandom(32)).hexdigest()
# Check if this is a file:// or raw: URL that needs set_content() instead of goto()
is_local_content = url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:")
# Add CSP headers to the request
await page.set_extra_http_headers(
{
"Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
}
)
response = await page.goto(
url, wait_until=config.wait_until, timeout=config.page_timeout
)
redirected_url = page.url
except Error as e:
# Allow navigation to be aborted when downloading files
# This is expected behavior for downloads in some browser engines
if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
self.logger.info(
message=f"Navigation aborted, likely due to file download: {url}",
tag="GOTO",
params={"url": url},
)
response = None
if is_local_content:
# Load local content using set_content() instead of network navigation
if url.startswith("file://"):
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
raise FileNotFoundError(f"Local file not found: {local_file_path}")
with open(local_file_path, "r", encoding="utf-8") as f:
html_content = f.read()
else:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
# raw:// or raw:
html_content = url[6:] if url.startswith("raw://") else url[4:]
await page.set_content(html_content, wait_until=config.wait_until)
response = None
redirected_url = config.base_url or url
status_code = 200
response_headers = {}
else:
# Standard web navigation with goto()
try:
# Generate a unique nonce for this request
if config.experimental.get("use_csp_nonce", False):
nonce = hashlib.sha256(os.urandom(32)).hexdigest()
# Add CSP headers to the request
await page.set_extra_http_headers(
{
"Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
}
)
response = await page.goto(
url, wait_until=config.wait_until, timeout=config.page_timeout
)
redirected_url = page.url
except Error as e:
# Allow navigation to be aborted when downloading files
# This is expected behavior for downloads in some browser engines
if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
self.logger.info(
message=f"Navigation aborted, likely due to file download: {url}",
tag="GOTO",
params={"url": url},
)
response = None
else:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
# ──────────────────────────────────────────────────────────────
# Walk the redirect chain. Playwright returns only the last
# hop, so we trace the `request.redirected_from` links until the
# first response that differs from the final one and surface its
# status-code.
# ──────────────────────────────────────────────────────────────
if response is None:
status_code = 200
response_headers = {}
else:
first_resp = response
req = response.request
while req and req.redirected_from:
prev_req = req.redirected_from
prev_resp = await prev_req.response()
if prev_resp: # keep earliest
first_resp = prev_resp
req = prev_req
status_code = first_resp.status
response_headers = first_resp.headers
await self.execute_hook(
"after_goto", page, context=context, url=url, response=response, config=config
)
# ──────────────────────────────────────────────────────────────
# Walk the redirect chain. Playwright returns only the last
# hop, so we trace the `request.redirected_from` links until the
# first response that differs from the final one and surface its
# status-code.
# ──────────────────────────────────────────────────────────────
if response is None:
status_code = 200
response_headers = {}
else:
first_resp = response
req = response.request
while req and req.redirected_from:
prev_req = req.redirected_from
prev_resp = await prev_req.response()
if prev_resp: # keep earliest
first_resp = prev_resp
req = prev_req
status_code = first_resp.status
response_headers = first_resp.headers
# if response is None:
# status_code = 200
# response_headers = {}
# else:
# status_code = response.status
# response_headers = response.headers
else:
status_code = 200
response_headers = {}
@@ -1023,6 +1039,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
final_messages = await self.adapter.retrieve_console_messages(page)
captured_console.extend(final_messages)
###
# This ensures we capture the current page URL at the time we return the response,
# which correctly reflects any JavaScript navigation that occurred.
###
redirected_url = page.url # Use current page URL to capture JS redirects
# Return complete response
return AsyncCrawlResponse(
html=html,
@@ -1383,9 +1405,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
try:
await self.adapter.evaluate(page,
f"""
(() => {{
(async () => {{
try {{
{remove_overlays_js}
const removeOverlays = {remove_overlays_js};
await removeOverlays();
return {{ success: true }};
}} catch (error) {{
return {{
@@ -1517,7 +1540,78 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await page.goto(file_path)
return captured_console
async def _generate_media_from_html(
self, html: str, config: CrawlerRunConfig = None
) -> tuple:
"""
Generate media (screenshot, PDF, MHTML) from raw HTML content.
This method is used for raw: and file:// URLs where we have HTML content
but need to render it in a browser to generate media outputs.
Args:
html (str): The raw HTML content to render
config (CrawlerRunConfig, optional): Configuration for media options
Returns:
tuple: (screenshot_data, pdf_data, mhtml_data) - any can be None
"""
page = None
screenshot_data = None
pdf_data = None
mhtml_data = None
try:
# Get a browser page
config = config or CrawlerRunConfig()
page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
# Load the HTML content into the page
await page.set_content(html, wait_until="domcontentloaded")
# Generate requested media
if config.pdf:
pdf_data = await self.export_pdf(page)
if config.capture_mhtml:
mhtml_data = await self.capture_mhtml(page)
if config.screenshot:
if config.screenshot_wait_for:
await asyncio.sleep(config.screenshot_wait_for)
screenshot_height_threshold = getattr(config, 'screenshot_height_threshold', None)
screenshot_data = await self.take_screenshot(
page, screenshot_height_threshold=screenshot_height_threshold
)
return screenshot_data, pdf_data, mhtml_data
except Exception as e:
error_message = f"Failed to generate media from HTML: {str(e)}"
self.logger.error(
message="HTML media generation failed: {error}",
tag="ERROR",
params={"error": error_message},
)
# Return error image for screenshot if it was requested
if config and config.screenshot:
img = Image.new("RGB", (800, 600), color="black")
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
buffered = BytesIO()
img.save(buffered, format="JPEG")
screenshot_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
return screenshot_data, pdf_data, mhtml_data
finally:
# Clean up the page
if page:
try:
await page.close()
except Exception:
pass
async def take_screenshot(self, page, **kwargs) -> str:
"""
Take a screenshot of the current page.
@@ -2286,9 +2380,28 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
)
def _format_proxy_url(self, proxy_config) -> str:
"""Format ProxyConfig into aiohttp-compatible proxy URL."""
if not proxy_config:
return None
server = proxy_config.server
username = getattr(proxy_config, 'username', None)
password = getattr(proxy_config, 'password', None)
if username and password:
# Insert credentials into URL: http://user:pass@host:port
if '://' in server:
protocol, rest = server.split('://', 1)
return f"{protocol}://{username}:{password}@{rest}"
else:
return f"http://{username}:{password}@{server}"
return server
async def _handle_http(
self,
url: str,
self,
url: str,
config: CrawlerRunConfig
) -> AsyncCrawlResponse:
async with self._session_context() as session:
@@ -2297,7 +2410,7 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
connect=10,
sock_read=30
)
headers = dict(self._BASE_HEADERS)
if self.browser_config.headers:
headers.update(self.browser_config.headers)
@@ -2309,6 +2422,12 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
'headers': headers
}
# Add proxy support - use config.proxy_config (set by arun() from rotation strategy or direct config)
proxy_url = None
if config.proxy_config:
proxy_url = self._format_proxy_url(config.proxy_config)
request_kwargs['proxy'] = proxy_url
if self.browser_config.method == "POST":
if self.browser_config.data:
request_kwargs['data'] = self.browser_config.data
@@ -2379,7 +2498,10 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
if scheme == 'file':
return await self._handle_file(parsed.path)
elif scheme == 'raw':
return await self._handle_raw(parsed.path)
# Don't use parsed.path - urlparse truncates at '#' which is common in CSS
# Strip prefix directly: "raw://" (6 chars) or "raw:" (4 chars)
raw_content = url[6:] if url.startswith("raw://") else url[4:]
return await self._handle_raw(raw_content)
else: # http or https
return await self._handle_http(url, config)

View File

@@ -1,10 +1,11 @@
import os
import time
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Dict
from contextlib import asynccontextmanager
import json
import json
from .models import CrawlResult, MarkdownGenerationResult, StringCompatibleMarkdown
import aiofiles
from .async_logger import AsyncLogger
@@ -262,6 +263,11 @@ class AsyncDatabaseManager:
"screenshot",
"response_headers",
"downloaded_files",
# Smart cache validation columns (added in 0.8.x)
"etag",
"last_modified",
"head_fingerprint",
"cached_at",
]
for column in new_columns:
@@ -275,6 +281,11 @@ class AsyncDatabaseManager:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
)
elif new_column == "cached_at":
# Timestamp column for cache validation
await db.execute(
f"ALTER TABLE crawled_data ADD COLUMN {new_column} REAL DEFAULT 0"
)
else:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
@@ -378,6 +389,92 @@ class AsyncDatabaseManager:
)
return None
async def aget_cache_metadata(self, url: str) -> Optional[Dict]:
"""
Retrieve only cache validation metadata for a URL (lightweight query).
Returns dict with: url, etag, last_modified, head_fingerprint, cached_at, response_headers
This is used for cache validation without loading full content.
"""
async def _get_metadata(db):
async with db.execute(
"""SELECT url, etag, last_modified, head_fingerprint, cached_at, response_headers
FROM crawled_data WHERE url = ?""",
(url,)
) as cursor:
row = await cursor.fetchone()
if not row:
return None
columns = [description[0] for description in cursor.description]
row_dict = dict(zip(columns, row))
# Parse response_headers JSON
try:
row_dict["response_headers"] = (
json.loads(row_dict["response_headers"])
if row_dict["response_headers"] else {}
)
except json.JSONDecodeError:
row_dict["response_headers"] = {}
return row_dict
try:
return await self.execute_with_retry(_get_metadata)
except Exception as e:
self.logger.error(
message="Error retrieving cache metadata: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
return None
async def aupdate_cache_metadata(
self,
url: str,
etag: Optional[str] = None,
last_modified: Optional[str] = None,
head_fingerprint: Optional[str] = None,
):
"""
Update only the cache validation metadata for a URL.
Used to update etag/last_modified after a successful validation.
"""
async def _update(db):
updates = []
values = []
if etag is not None:
updates.append("etag = ?")
values.append(etag)
if last_modified is not None:
updates.append("last_modified = ?")
values.append(last_modified)
if head_fingerprint is not None:
updates.append("head_fingerprint = ?")
values.append(head_fingerprint)
if not updates:
return
values.append(url)
await db.execute(
f"UPDATE crawled_data SET {', '.join(updates)} WHERE url = ?",
tuple(values)
)
try:
await self.execute_with_retry(_update)
except Exception as e:
self.logger.error(
message="Error updating cache metadata: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
async def acache_url(self, result: CrawlResult):
"""Cache CrawlResult data"""
# Store content files and get hashes
@@ -425,15 +522,24 @@ class AsyncDatabaseManager:
for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type)
# Extract cache validation headers from response
response_headers = result.response_headers or {}
etag = response_headers.get("etag") or response_headers.get("ETag") or ""
last_modified = response_headers.get("last-modified") or response_headers.get("Last-Modified") or ""
# head_fingerprint is set by caller via result attribute (if available)
head_fingerprint = getattr(result, "head_fingerprint", None) or ""
cached_at = time.time()
async def _cache(db):
await db.execute(
"""
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
screenshot, response_headers, downloaded_files
screenshot, response_headers, downloaded_files,
etag, last_modified, head_fingerprint, cached_at
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
@@ -445,7 +551,11 @@ class AsyncDatabaseManager:
metadata = excluded.metadata,
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
downloaded_files = excluded.downloaded_files,
etag = excluded.etag,
last_modified = excluded.last_modified,
head_fingerprint = excluded.head_fingerprint,
cached_at = excluded.cached_at
""",
(
result.url,
@@ -460,6 +570,10 @@ class AsyncDatabaseManager:
content_hashes["screenshot"],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or []),
etag,
last_modified,
head_fingerprint,
cached_at,
),
)

View File

@@ -24,7 +24,7 @@ import os
import pathlib
import re
import time
from datetime import timedelta
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Union
from urllib.parse import quote, urljoin
@@ -78,6 +78,103 @@ _link_rx = re.compile(
# ────────────────────────────────────────────────────────────────────────── helpers
def _parse_sitemap_lastmod(xml_content: bytes) -> Optional[str]:
"""Extract the most recent lastmod from sitemap XML."""
try:
if LXML:
root = etree.fromstring(xml_content)
# Get all lastmod elements (namespace-agnostic)
lastmods = root.xpath("//*[local-name()='lastmod']/text()")
if lastmods:
# Return the most recent one
return max(lastmods)
except Exception:
pass
return None
def _is_cache_valid(
cache_path: pathlib.Path,
ttl_hours: int,
validate_lastmod: bool,
current_lastmod: Optional[str] = None
) -> bool:
"""
Check if sitemap cache is still valid.
Returns False (invalid) if:
- File doesn't exist
- File is corrupted/unreadable
- TTL expired (if ttl_hours > 0)
- Sitemap lastmod is newer than cache (if validate_lastmod=True)
"""
if not cache_path.exists():
return False
try:
with open(cache_path, "r") as f:
data = json.load(f)
# Check version
if data.get("version") != 1:
return False
# Check TTL
if ttl_hours > 0:
created_at = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
age_hours = (datetime.now(timezone.utc) - created_at).total_seconds() / 3600
if age_hours > ttl_hours:
return False
# Check lastmod
if validate_lastmod and current_lastmod:
cached_lastmod = data.get("sitemap_lastmod")
if cached_lastmod and current_lastmod > cached_lastmod:
return False
# Check URL count (sanity check - if 0, likely corrupted)
if data.get("url_count", 0) == 0:
return False
return True
except (json.JSONDecodeError, KeyError, ValueError, IOError):
# Corrupted cache - return False to trigger refetch
return False
def _read_cache(cache_path: pathlib.Path) -> List[str]:
"""Read URLs from cache file. Returns empty list on error."""
try:
with open(cache_path, "r") as f:
data = json.load(f)
return data.get("urls", [])
except Exception:
return []
def _write_cache(
cache_path: pathlib.Path,
urls: List[str],
sitemap_url: str,
sitemap_lastmod: Optional[str]
) -> None:
"""Write URLs to cache with metadata."""
data = {
"version": 1,
"created_at": datetime.now(timezone.utc).isoformat(),
"sitemap_lastmod": sitemap_lastmod,
"sitemap_url": sitemap_url,
"url_count": len(urls),
"urls": urls
}
try:
with open(cache_path, "w") as f:
json.dump(data, f)
except Exception:
pass # Fail silently - cache is optional
def _match(url: str, pattern: str) -> bool:
if fnmatch.fnmatch(url, pattern):
return True
@@ -295,6 +392,10 @@ class AsyncUrlSeeder:
score_threshold = config.score_threshold
scoring_method = config.scoring_method
# Store cache config for use in _from_sitemaps
self._cache_ttl_hours = getattr(config, 'cache_ttl_hours', 24)
self._validate_sitemap_lastmod = getattr(config, 'validate_sitemap_lastmod', True)
# Ensure seeder's logger verbose matches the config's verbose if it's set
if self.logger and hasattr(self.logger, 'verbose') and config.verbose is not None:
self.logger.verbose = config.verbose
@@ -764,68 +865,222 @@ class AsyncUrlSeeder:
# ─────────────────────────────── Sitemaps
async def _from_sitemaps(self, domain: str, pattern: str, force: bool = False):
"""
1. Probe default sitemap locations.
2. If none exist, parse robots.txt for alternative sitemap URLs.
3. Yield only URLs that match `pattern`.
Discover URLs from sitemaps with smart TTL-based caching.
1. Check cache validity (TTL + lastmod)
2. If valid, yield from cache
3. If invalid or force=True, fetch fresh and update cache
4. FALLBACK: If anything fails, bypass cache and fetch directly
"""
# Get config values (passed via self during urls() call)
cache_ttl_hours = getattr(self, '_cache_ttl_hours', 24)
validate_lastmod = getattr(self, '_validate_sitemap_lastmod', True)
# ── cache file (same logic as _from_cc)
# Cache file path (new format: .json instead of .jsonl)
host = re.sub(r'^https?://', '', domain).rstrip('/')
host = re.sub('[/?#]+', '_', domain)
host_safe = re.sub('[/?#]+', '_', host)
digest = hashlib.md5(pattern.encode()).hexdigest()[:8]
path = self.cache_dir / f"sitemap_{host}_{digest}.jsonl"
cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.json"
if path.exists() and not force:
self._log("info", "Loading sitemap URLs for {d} from cache: {p}",
params={"d": host, "p": str(path)}, tag="URL_SEED")
async with aiofiles.open(path, "r") as fp:
async for line in fp:
url = line.strip()
if _match(url, pattern):
yield url
return
# Check for old .jsonl format and delete it
old_cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.jsonl"
if old_cache_path.exists():
try:
old_cache_path.unlink()
self._log("info", "Deleted old cache format: {p}",
params={"p": str(old_cache_path)}, tag="URL_SEED")
except Exception:
pass
# 1⃣ direct sitemap probe
# strip any scheme so we can handle https → http fallback
host = re.sub(r'^https?://', '', domain).rstrip('/')
# Step 1: Find sitemap URL and get lastmod (needed for validation)
sitemap_url = None
sitemap_lastmod = None
sitemap_content = None
schemes = ('https', 'http') # prefer TLS, downgrade if needed
schemes = ('https', 'http')
for scheme in schemes:
for suffix in ("/sitemap.xml", "/sitemap_index.xml"):
sm = f"{scheme}://{host}{suffix}"
sm = await self._resolve_head(sm)
if sm:
self._log("info", "Found sitemap at {url}", params={
"url": sm}, tag="URL_SEED")
async with aiofiles.open(path, "w") as fp:
resolved = await self._resolve_head(sm)
if resolved:
sitemap_url = resolved
# Fetch sitemap content to get lastmod
try:
r = await self.client.get(sitemap_url, timeout=15, follow_redirects=True)
if 200 <= r.status_code < 300:
sitemap_content = r.content
sitemap_lastmod = _parse_sitemap_lastmod(sitemap_content)
except Exception:
pass
break
if sitemap_url:
break
# Step 2: Check cache validity (skip if force=True)
if not force and cache_path.exists():
if _is_cache_valid(cache_path, cache_ttl_hours, validate_lastmod, sitemap_lastmod):
self._log("info", "Loading sitemap URLs from valid cache: {p}",
params={"p": str(cache_path)}, tag="URL_SEED")
cached_urls = _read_cache(cache_path)
for url in cached_urls:
if _match(url, pattern):
yield url
return
else:
self._log("info", "Cache invalid/expired, refetching sitemap for {d}",
params={"d": domain}, tag="URL_SEED")
# Step 3: Fetch fresh URLs
discovered_urls = []
if sitemap_url and sitemap_content:
self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
# Parse sitemap (reuse content we already fetched)
async for u in self._iter_sitemap_content(sitemap_url, sitemap_content):
discovered_urls.append(u)
if _match(u, pattern):
yield u
elif sitemap_url:
# We have a sitemap URL but no content (fetch failed earlier), try again
self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
async for u in self._iter_sitemap(sitemap_url):
discovered_urls.append(u)
if _match(u, pattern):
yield u
else:
# Fallback: robots.txt
robots = f"https://{host}/robots.txt"
try:
r = await self.client.get(robots, timeout=10, follow_redirects=True)
if 200 <= r.status_code < 300:
sitemap_lines = [l.split(":", 1)[1].strip()
for l in r.text.splitlines()
if l.lower().startswith("sitemap:")]
for sm in sitemap_lines:
async for u in self._iter_sitemap(sm):
await fp.write(u + "\n")
discovered_urls.append(u)
if _match(u, pattern):
yield u
else:
self._log("warning", "robots.txt unavailable for {d} HTTP{c}",
params={"d": domain, "c": r.status_code}, tag="URL_SEED")
return
# 2⃣ robots.txt fallback
robots = f"https://{domain.rstrip('/')}/robots.txt"
try:
r = await self.client.get(robots, timeout=10, follow_redirects=True)
if not 200 <= r.status_code < 300:
self._log("warning", "robots.txt unavailable for {d} HTTP{c}", params={
"d": domain, "c": r.status_code}, tag="URL_SEED")
except Exception as e:
self._log("warning", "Failed to fetch robots.txt for {d}: {e}",
params={"d": domain, "e": str(e)}, tag="URL_SEED")
return
sitemap_lines = [l.split(":", 1)[1].strip(
) for l in r.text.splitlines() if l.lower().startswith("sitemap:")]
except Exception as e:
self._log("warning", "Failed to fetch robots.txt for {d}: {e}", params={
"d": domain, "e": str(e)}, tag="URL_SEED")
return
if sitemap_lines:
async with aiofiles.open(path, "w") as fp:
for sm in sitemap_lines:
async for u in self._iter_sitemap(sm):
await fp.write(u + "\n")
if _match(u, pattern):
yield u
# Step 4: Write to cache (FALLBACK: if write fails, URLs still yielded above)
if discovered_urls:
_write_cache(cache_path, discovered_urls, sitemap_url or "", sitemap_lastmod)
self._log("info", "Cached {count} URLs for {d}",
params={"count": len(discovered_urls), "d": domain}, tag="URL_SEED")
async def _iter_sitemap_content(self, url: str, content: bytes):
"""Parse sitemap from already-fetched content."""
data = gzip.decompress(content) if url.endswith(".gz") else content
base_url = url
def _normalize_loc(raw: Optional[str]) -> Optional[str]:
if not raw:
return None
normalized = urljoin(base_url, raw.strip())
if not normalized:
return None
return normalized
# Detect if this is a sitemap index
is_sitemap_index = False
sub_sitemaps = []
regular_urls = []
if LXML:
try:
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)
sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
if sitemap_loc_nodes:
is_sitemap_index = True
for sitemap_elem in sitemap_loc_nodes:
loc = _normalize_loc(sitemap_elem.text)
if loc:
sub_sitemaps.append(loc)
if not is_sitemap_index:
for loc_elem in url_loc_nodes:
loc = _normalize_loc(loc_elem.text)
if loc:
regular_urls.append(loc)
except Exception as e:
self._log("error", "LXML parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
return
else:
import xml.etree.ElementTree as ET
try:
root = ET.fromstring(data)
for elem in root.iter():
if '}' in elem.tag:
elem.tag = elem.tag.split('}')[1]
sitemaps = root.findall('.//sitemap')
url_entries = root.findall('.//url')
if sitemaps:
is_sitemap_index = True
for sitemap in sitemaps:
loc_elem = sitemap.find('loc')
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
sub_sitemaps.append(loc)
if not is_sitemap_index:
for url_elem in url_entries:
loc_elem = url_elem.find('loc')
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
regular_urls.append(loc)
except Exception as e:
self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
return
# Process based on type
if is_sitemap_index and sub_sitemaps:
self._log("info", "Processing sitemap index with {count} sub-sitemaps",
params={"count": len(sub_sitemaps)}, tag="URL_SEED")
queue_size = min(50000, len(sub_sitemaps) * 1000)
result_queue = asyncio.Queue(maxsize=queue_size)
completed_count = 0
total_sitemaps = len(sub_sitemaps)
async def process_subsitemap(sitemap_url: str):
try:
async for u in self._iter_sitemap(sitemap_url):
await result_queue.put(u)
except Exception as e:
self._log("error", "Error processing sub-sitemap {url}: {error}",
params={"url": sitemap_url, "error": str(e)}, tag="URL_SEED")
finally:
await result_queue.put(None)
tasks = [asyncio.create_task(process_subsitemap(sm)) for sm in sub_sitemaps]
while completed_count < total_sitemaps:
item = await result_queue.get()
if item is None:
completed_count += 1
else:
yield item
await asyncio.gather(*tasks, return_exceptions=True)
else:
for u in regular_urls:
yield u
async def _iter_sitemap(self, url: str):
try:
@@ -845,6 +1100,15 @@ class AsyncUrlSeeder:
return
data = gzip.decompress(r.content) if url.endswith(".gz") else r.content
base_url = str(r.url)
def _normalize_loc(raw: Optional[str]) -> Optional[str]:
if not raw:
return None
normalized = urljoin(base_url, raw.strip())
if not normalized:
return None
return normalized
# Detect if this is a sitemap index by checking for <sitemapindex> or presence of <sitemap> elements
is_sitemap_index = False
@@ -857,25 +1121,42 @@ class AsyncUrlSeeder:
# Use XML parser for sitemaps, not HTML parser
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)
# Namespace-agnostic lookups using local-name() so we honor custom or missing namespaces
sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
# Define namespace for sitemap
ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
self._log(
"debug",
"Parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemap_loc_nodes),
"url_count": len(url_loc_nodes),
},
tag="URL_SEED",
)
# Check for sitemap index entries
sitemap_locs = root.xpath('//s:sitemap/s:loc', namespaces=ns)
if sitemap_locs:
if sitemap_loc_nodes:
is_sitemap_index = True
for sitemap_elem in sitemap_locs:
loc = sitemap_elem.text.strip() if sitemap_elem.text else ""
for sitemap_elem in sitemap_loc_nodes:
loc = _normalize_loc(sitemap_elem.text)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for loc_elem in root.xpath('//s:url/s:loc', namespaces=ns):
loc = loc_elem.text.strip() if loc_elem.text else ""
for loc_elem in url_loc_nodes:
loc = _normalize_loc(loc_elem.text)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "LXML parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
@@ -892,19 +1173,39 @@ class AsyncUrlSeeder:
# Check for sitemap index entries
sitemaps = root.findall('.//sitemap')
url_entries = root.findall('.//url')
self._log(
"debug",
"ElementTree parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemaps),
"url_count": len(url_entries),
},
tag="URL_SEED",
)
if sitemaps:
is_sitemap_index = True
for sitemap in sitemaps:
loc_elem = sitemap.find('loc')
if loc_elem is not None and loc_elem.text:
sub_sitemaps.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for url_elem in root.findall('.//url'):
for url_elem in url_entries:
loc_elem = url_elem.find('loc')
if loc_elem is not None and loc_elem.text:
regular_urls.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")

View File

@@ -47,7 +47,9 @@ from .utils import (
get_error_context,
RobotsParser,
preprocess_html_for_schema,
compute_head_fingerprint,
)
from .cache_validator import CacheValidator, CacheValidationResult
class AsyncWebCrawler:
@@ -267,6 +269,51 @@ class AsyncWebCrawler:
if cache_context.should_read():
cached_result = await async_db_manager.aget_cached_url(url)
# Smart Cache: Validate cache freshness if enabled
if cached_result and config.check_cache_freshness:
cache_metadata = await async_db_manager.aget_cache_metadata(url)
if cache_metadata:
async with CacheValidator(timeout=config.cache_validation_timeout) as validator:
validation = await validator.validate(
url=url,
stored_etag=cache_metadata.get("etag"),
stored_last_modified=cache_metadata.get("last_modified"),
stored_head_fingerprint=cache_metadata.get("head_fingerprint"),
)
if validation.status == CacheValidationResult.FRESH:
cached_result.cache_status = "hit_validated"
self.logger.info(
message="Cache validated: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
# Update metadata if we got new values
if validation.new_etag or validation.new_last_modified:
await async_db_manager.aupdate_cache_metadata(
url=url,
etag=validation.new_etag,
last_modified=validation.new_last_modified,
head_fingerprint=validation.new_head_fingerprint,
)
elif validation.status == CacheValidationResult.ERROR:
cached_result.cache_status = "hit_fallback"
self.logger.warning(
message="Cache validation failed, using cached: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
else:
# STALE or UNKNOWN - force recrawl
self.logger.info(
message="Cache stale: {reason}",
tag="CACHE",
params={"reason": validation.reason}
)
cached_result = None
elif cached_result:
cached_result.cache_status = "hit"
if cached_result:
html = sanitize_input_encode(cached_result.html)
extracted_content = sanitize_input_encode(
@@ -296,15 +343,32 @@ class AsyncWebCrawler:
# Update proxy configuration from rotation strategy if available
if config and config.proxy_rotation_strategy:
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
if next_proxy:
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server}
# Handle sticky sessions - use same proxy for all requests with same session_id
if config.proxy_session_id:
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_proxy_for_session(
config.proxy_session_id,
ttl=config.proxy_session_ttl
)
config.proxy_config = next_proxy
# config = config.clone(proxy_config=next_proxy)
if next_proxy:
self.logger.info(
message="Using sticky proxy session: {session_id} -> {proxy}",
tag="PROXY",
params={
"session_id": config.proxy_session_id,
"proxy": next_proxy.server
}
)
config.proxy_config = next_proxy
else:
# Existing behavior: rotate on each request
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
if next_proxy:
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server}
)
config.proxy_config = next_proxy
# Fetch fresh content if needed
if not cached_result or not html:
@@ -383,6 +447,14 @@ class AsyncWebCrawler:
crawl_result.success = bool(html)
crawl_result.session_id = getattr(
config, "session_id", None)
crawl_result.cache_status = "miss"
# Compute head fingerprint for cache validation
if html:
head_end = html.lower().find('</head>')
if head_end != -1:
head_html = html[:head_end + 7]
crawl_result.head_fingerprint = compute_head_fingerprint(head_html)
self.logger.url_status(
url=cache_context.display_url,
@@ -459,6 +531,27 @@ class AsyncWebCrawler:
Returns:
CrawlResult: Processed result containing extracted and formatted content
"""
# === PREFETCH MODE SHORT-CIRCUIT ===
if getattr(config, 'prefetch', False):
from .utils import quick_extract_links
# Use base_url from config (for raw: URLs), redirected_url, or original url
effective_url = getattr(config, 'base_url', None) or kwargs.get('redirected_url') or url
links = quick_extract_links(html, effective_url)
return CrawlResult(
url=url,
html=html,
success=True,
links=links,
status_code=kwargs.get('status_code'),
response_headers=kwargs.get('response_headers'),
redirected_url=kwargs.get('redirected_url'),
ssl_certificate=kwargs.get('ssl_certificate'),
# All other fields default to None
)
# === END PREFETCH SHORT-CIRCUIT ===
cleaned_html = ""
try:
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
@@ -563,7 +656,8 @@ class AsyncWebCrawler:
markdown_result: MarkdownGenerationResult = (
markdown_generator.generate_markdown(
input_html=markdown_input_html,
base_url=params.get("redirected_url", url)
# Use explicit base_url if provided (for raw: HTML), otherwise redirected_url, then url
base_url=params.get("base_url") or params.get("redirected_url") or url
# html2text_options=kwargs.get('html2text', {})
)
)
@@ -617,7 +711,17 @@ class AsyncWebCrawler:
else config.chunking_strategy
)
sections = chunking.chunk(content)
extracted_content = config.extraction_strategy.run(url, sections)
# extracted_content = config.extraction_strategy.run(_url, sections)
# Use async version if available for better parallelism
if hasattr(config.extraction_strategy, 'arun'):
extracted_content = await config.extraction_strategy.arun(_url, sections)
else:
# Fallback to sync version run in thread pool to avoid blocking
extracted_content = await asyncio.to_thread(
config.extraction_strategy.run, url, sections
)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)
@@ -746,21 +850,45 @@ class AsyncWebCrawler:
# Handle stream setting - use first config's stream setting if config is a list
if isinstance(config, list):
stream = config[0].stream if config else False
primary_config = config[0] if config else None
else:
stream = config.stream
primary_config = config
# Helper to release sticky session if auto_release is enabled
async def maybe_release_session():
if (primary_config and
primary_config.proxy_session_id and
primary_config.proxy_session_auto_release and
primary_config.proxy_rotation_strategy):
await primary_config.proxy_rotation_strategy.release_session(
primary_config.proxy_session_id
)
self.logger.info(
message="Auto-released proxy session: {session_id}",
tag="PROXY",
params={"session_id": primary_config.proxy_session_id}
)
if stream:
async def result_transformer():
async for task_result in dispatcher.run_urls_stream(
crawler=self, urls=urls, config=config
):
yield transform_result(task_result)
try:
async for task_result in dispatcher.run_urls_stream(
crawler=self, urls=urls, config=config
):
yield transform_result(task_result)
finally:
# Auto-release session after streaming completes
await maybe_release_session()
return result_transformer()
else:
_results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
return [transform_result(res) for res in _results]
try:
_results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
return [transform_result(res) for res in _results]
finally:
# Auto-release session after batch completes
await maybe_release_session()
async def aseed_urls(
self,

View File

@@ -369,6 +369,9 @@ class ManagedBrowser:
]
if self.headless:
flags.append("--headless=new")
# Add viewport flag if specified in config
if self.browser_config.viewport_height and self.browser_config.viewport_width:
flags.append(f"--window-size={self.browser_config.viewport_width},{self.browser_config.viewport_height}")
# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
elif self.browser_type == "firefox":
@@ -658,9 +661,44 @@ class BrowserManager:
if self.config.cdp_url or self.config.use_managed_browser:
self.config.use_managed_browser = True
cdp_url = await self.managed_browser.start() if not self.config.cdp_url else self.config.cdp_url
# Add CDP endpoint verification before connecting
if not await self._verify_cdp_ready(cdp_url):
raise Exception(f"CDP endpoint at {cdp_url} is not ready after startup")
self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
contexts = self.browser.contexts
if contexts:
# If browser_context_id is provided, we're using a pre-created context
if self.config.browser_context_id:
if self.logger:
self.logger.debug(
f"Using pre-existing browser context: {self.config.browser_context_id}",
tag="BROWSER"
)
# When connecting to a pre-created context, it should be in contexts
if contexts:
self.default_context = contexts[0]
if self.logger:
self.logger.debug(
f"Found {len(contexts)} existing context(s), using first one",
tag="BROWSER"
)
else:
# Context was created but not yet visible - wait a bit
await asyncio.sleep(0.2)
contexts = self.browser.contexts
if contexts:
self.default_context = contexts[0]
else:
# Still no contexts - this shouldn't happen with pre-created context
if self.logger:
self.logger.warning(
"Pre-created context not found, creating new one",
tag="BROWSER"
)
self.default_context = await self.create_browser_context()
elif contexts:
self.default_context = contexts[0]
else:
self.default_context = await self.create_browser_context()
@@ -678,6 +716,49 @@ class BrowserManager:
self.default_context = self.browser
async def _verify_cdp_ready(self, cdp_url: str) -> bool:
"""Verify CDP endpoint is ready with exponential backoff.
Supports multiple URL formats:
- HTTP URLs: http://localhost:9222
- HTTP URLs with query params: http://localhost:9222?browser_id=XXX
- WebSocket URLs: ws://localhost:9222/devtools/browser/XXX
"""
import aiohttp
from urllib.parse import urlparse, urlunparse
# If WebSocket URL, Playwright handles connection directly - skip HTTP verification
if cdp_url.startswith(('ws://', 'wss://')):
self.logger.debug(f"WebSocket CDP URL provided, skipping HTTP verification", tag="BROWSER")
return True
# Parse HTTP URL and properly construct /json/version endpoint
parsed = urlparse(cdp_url)
# Build URL with /json/version path, preserving query params
verify_url = urlunparse((
parsed.scheme,
parsed.netloc,
'/json/version', # Always use this path for verification
'', # params
parsed.query, # preserve query string
'' # fragment
))
self.logger.debug(f"Starting CDP verification for {verify_url}", tag="BROWSER")
for attempt in range(5):
try:
async with aiohttp.ClientSession() as session:
async with session.get(verify_url, timeout=aiohttp.ClientTimeout(total=2)) as response:
if response.status == 200:
self.logger.debug(f"CDP endpoint ready after {attempt + 1} attempts", tag="BROWSER")
return True
except Exception as e:
self.logger.debug(f"CDP check attempt {attempt + 1} failed: {e}", tag="BROWSER")
delay = 0.5 * (1.4 ** attempt)
self.logger.debug(f"Waiting {delay:.2f}s before next CDP check...", tag="BROWSER")
await asyncio.sleep(delay)
self.logger.debug(f"CDP verification failed after 5 attempts", tag="BROWSER")
return False
def _build_browser_args(self) -> dict:
"""Build browser launch arguments from config."""
@@ -814,18 +895,27 @@ class BrowserManager:
combined_headers.update(self.config.headers)
await context.set_extra_http_headers(combined_headers)
# Add default cookie
await context.add_cookies(
[
{
"name": "cookiesEnabled",
"value": "true",
"url": crawlerRunConfig.url
if crawlerRunConfig and crawlerRunConfig.url
else "https://crawl4ai.com/",
}
]
)
# Add default cookie (skip for raw:/file:// URLs which are not valid cookie URLs)
cookie_url = None
if crawlerRunConfig and crawlerRunConfig.url:
url = crawlerRunConfig.url
# Only set cookie for http/https URLs
if url.startswith(("http://", "https://")):
cookie_url = url
elif crawlerRunConfig.base_url and crawlerRunConfig.base_url.startswith(("http://", "https://")):
# Use base_url as fallback for raw:/file:// URLs
cookie_url = crawlerRunConfig.base_url
if cookie_url:
await context.add_cookies(
[
{
"name": "cookiesEnabled",
"value": "true",
"url": cookie_url,
}
]
)
# Handle navigator overrides
if crawlerRunConfig:
@@ -834,7 +924,12 @@ class BrowserManager:
or crawlerRunConfig.simulate_user
or crawlerRunConfig.magic
):
await context.add_init_script(load_js_script("navigator_overrider"))
await context.add_init_script(load_js_script("navigator_overrider"))
# Apply custom init_scripts from BrowserConfig (for stealth evasions, etc.)
if self.config.init_scripts:
for script in self.config.init_scripts:
await context.add_init_script(script)
async def create_browser_context(self, crawlerRunConfig: CrawlerRunConfig = None):
"""
@@ -1016,6 +1111,62 @@ class BrowserManager:
params={"error": str(e)}
)
async def _get_page_by_target_id(self, context: BrowserContext, target_id: str):
"""
Get an existing page by its CDP target ID.
This is used when connecting to a pre-created browser context with an existing page.
Playwright may not immediately see targets created via raw CDP commands, so we
use CDP to get all targets and find the matching one.
Args:
context: The browser context to search in
target_id: The CDP target ID to find
Returns:
Page object if found, None otherwise
"""
try:
# First check if Playwright already sees the page
for page in context.pages:
# Playwright's internal target ID might match
if hasattr(page, '_impl_obj') and hasattr(page._impl_obj, '_target_id'):
if page._impl_obj._target_id == target_id:
return page
# If not found, try using CDP to get targets
if hasattr(self.browser, '_impl_obj') and hasattr(self.browser._impl_obj, '_connection'):
cdp_session = await context.new_cdp_session(context.pages[0] if context.pages else None)
if cdp_session:
try:
result = await cdp_session.send("Target.getTargets")
targets = result.get("targetInfos", [])
for target in targets:
if target.get("targetId") == target_id:
# Found the target - if it's a page type, we can use it
if target.get("type") == "page":
# The page exists, let Playwright discover it
await asyncio.sleep(0.1)
# Refresh pages list
if context.pages:
return context.pages[0]
finally:
await cdp_session.detach()
# Fallback: if there are any pages now, return the first one
if context.pages:
return context.pages[0]
return None
except Exception as e:
if self.logger:
self.logger.warning(
message="Failed to get page by target ID: {error}",
tag="BROWSER",
params={"error": str(e)}
)
return None
async def get_page(self, crawlerRunConfig: CrawlerRunConfig):
"""
Get a page for the given session ID, creating a new one if needed.
@@ -1037,7 +1188,25 @@ class BrowserManager:
# If using a managed browser, just grab the shared default_context
if self.config.use_managed_browser:
if self.config.storage_state:
# If create_isolated_context is True, create isolated contexts for concurrent crawls
# Uses the same caching mechanism as non-CDP mode: cache context by config signature,
# but always create a new page. This prevents navigation conflicts while allowing
# context reuse for multiple URLs with the same config (e.g., batch/deep crawls).
if self.config.create_isolated_context:
config_signature = self._make_config_signature(crawlerRunConfig)
async with self._contexts_lock:
if config_signature in self.contexts_by_config:
context = self.contexts_by_config[config_signature]
else:
context = await self.create_browser_context(crawlerRunConfig)
await self.setup_context(context, crawlerRunConfig)
self.contexts_by_config[config_signature] = context
# Always create a new page for each crawl (isolation for navigation)
page = await context.new_page()
await self._apply_stealth_to_page(page)
elif self.config.storage_state:
context = await self.create_browser_context(crawlerRunConfig)
ctx = self.default_context # default context, one window only
ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
@@ -1060,6 +1229,14 @@ class BrowserManager:
pages = context.pages
if pages:
page = pages[0]
elif self.config.browser_context_id and self.config.target_id:
# Pre-existing context/target provided - use CDP to get the page
# This handles the case where Playwright doesn't see the target yet
page = await self._get_page_by_target_id(context, self.config.target_id)
if not page:
# Fallback: create new page in existing context
page = await context.new_page()
await self._apply_stealth_to_page(page)
else:
page = await context.new_page()
await self._apply_stealth_to_page(page)
@@ -1114,8 +1291,44 @@ class BrowserManager:
async def close(self):
"""Close all browser resources and clean up."""
if self.config.cdp_url:
# When using external CDP, we don't own the browser process.
# If cdp_cleanup_on_close is True, properly disconnect from the browser
# and clean up Playwright resources. This frees the browser for other clients.
if self.config.cdp_cleanup_on_close:
# First close all sessions (pages)
session_ids = list(self.sessions.keys())
for session_id in session_ids:
await self.kill_session(session_id)
# Close all contexts we created
for ctx in self.contexts_by_config.values():
try:
await ctx.close()
except Exception:
pass
self.contexts_by_config.clear()
# Disconnect from browser (doesn't terminate it, just releases connection)
if self.browser:
try:
await self.browser.close()
except Exception as e:
if self.logger:
self.logger.debug(
message="Error disconnecting from CDP browser: {error}",
tag="BROWSER",
params={"error": str(e)}
)
self.browser = None
# Allow time for CDP connection to fully release before another client connects
await asyncio.sleep(1.0)
# Stop Playwright instance to prevent memory leaks
if self.playwright:
await self.playwright.stop()
self.playwright = None
return
if self.config.sleep_on_close:
await asyncio.sleep(0.5)

270
crawl4ai/cache_validator.py Normal file
View File

@@ -0,0 +1,270 @@
"""
Cache validation using HTTP conditional requests and head fingerprinting.
Uses httpx for fast, lightweight HTTP requests (no browser needed).
This module enables smart cache validation to avoid unnecessary full browser crawls
when content hasn't changed.
Validation Strategy:
1. Send HEAD request with If-None-Match / If-Modified-Since headers
2. If server returns 304 Not Modified → cache is FRESH
3. If server returns 200 → fetch <head> and compare fingerprint
4. If fingerprint matches → cache is FRESH (minor changes only)
5. Otherwise → cache is STALE, need full recrawl
"""
import httpx
from dataclasses import dataclass
from typing import Optional, Tuple
from enum import Enum
from .utils import compute_head_fingerprint
class CacheValidationResult(Enum):
"""Result of cache validation check."""
FRESH = "fresh" # Content unchanged, use cache
STALE = "stale" # Content changed, need recrawl
UNKNOWN = "unknown" # Couldn't determine, need recrawl
ERROR = "error" # Request failed, use cache as fallback
@dataclass
class ValidationResult:
"""Detailed result of a cache validation attempt."""
status: CacheValidationResult
new_etag: Optional[str] = None
new_last_modified: Optional[str] = None
new_head_fingerprint: Optional[str] = None
reason: str = ""
class CacheValidator:
"""
Validates cache freshness using lightweight HTTP requests.
This validator uses httpx to make fast HTTP requests without needing
a full browser. It supports two validation methods:
1. HTTP Conditional Requests (Layer 3):
- Uses If-None-Match with stored ETag
- Uses If-Modified-Since with stored Last-Modified
- Server returns 304 if content unchanged
2. Head Fingerprinting (Layer 4):
- Fetches only the <head> section (~5KB)
- Compares fingerprint of key meta tags
- Catches changes even without server support for conditional requests
"""
def __init__(self, timeout: float = 10.0, user_agent: Optional[str] = None):
"""
Initialize the cache validator.
Args:
timeout: Request timeout in seconds
user_agent: Custom User-Agent string (optional)
"""
self.timeout = timeout
self.user_agent = user_agent or "Mozilla/5.0 (compatible; Crawl4AI/1.0)"
self._client: Optional[httpx.AsyncClient] = None
async def _get_client(self) -> httpx.AsyncClient:
"""Get or create the httpx client."""
if self._client is None:
self._client = httpx.AsyncClient(
http2=True,
timeout=self.timeout,
follow_redirects=True,
headers={"User-Agent": self.user_agent}
)
return self._client
async def validate(
self,
url: str,
stored_etag: Optional[str] = None,
stored_last_modified: Optional[str] = None,
stored_head_fingerprint: Optional[str] = None,
) -> ValidationResult:
"""
Validate if cached content is still fresh.
Args:
url: The URL to validate
stored_etag: Previously stored ETag header value
stored_last_modified: Previously stored Last-Modified header value
stored_head_fingerprint: Previously computed head fingerprint
Returns:
ValidationResult with status and any updated metadata
"""
client = await self._get_client()
# Build conditional request headers
headers = {}
if stored_etag:
headers["If-None-Match"] = stored_etag
if stored_last_modified:
headers["If-Modified-Since"] = stored_last_modified
try:
# Step 1: Try HEAD request with conditional headers
if headers:
response = await client.head(url, headers=headers)
if response.status_code == 304:
return ValidationResult(
status=CacheValidationResult.FRESH,
reason="Server returned 304 Not Modified"
)
# Got 200, extract new headers for potential update
new_etag = response.headers.get("etag")
new_last_modified = response.headers.get("last-modified")
# If we have fingerprint, compare it
if stored_head_fingerprint:
head_html, _, _ = await self._fetch_head(url)
if head_html:
new_fingerprint = compute_head_fingerprint(head_html)
if new_fingerprint and new_fingerprint == stored_head_fingerprint:
return ValidationResult(
status=CacheValidationResult.FRESH,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint matches"
)
elif new_fingerprint:
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint changed"
)
# Headers changed and no fingerprint match
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
reason="Server returned 200, content may have changed"
)
# Step 2: No conditional headers available, try fingerprint only
if stored_head_fingerprint:
head_html, new_etag, new_last_modified = await self._fetch_head(url)
if head_html:
new_fingerprint = compute_head_fingerprint(head_html)
if new_fingerprint and new_fingerprint == stored_head_fingerprint:
return ValidationResult(
status=CacheValidationResult.FRESH,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint matches"
)
elif new_fingerprint:
return ValidationResult(
status=CacheValidationResult.STALE,
new_etag=new_etag,
new_last_modified=new_last_modified,
new_head_fingerprint=new_fingerprint,
reason="Head fingerprint changed"
)
# Step 3: No validation data available
return ValidationResult(
status=CacheValidationResult.UNKNOWN,
reason="No validation data available (no etag, last-modified, or fingerprint)"
)
except httpx.TimeoutException:
return ValidationResult(
status=CacheValidationResult.ERROR,
reason="Validation request timed out"
)
except httpx.RequestError as e:
return ValidationResult(
status=CacheValidationResult.ERROR,
reason=f"Validation request failed: {type(e).__name__}"
)
except Exception as e:
# On unexpected error, prefer using cache over failing
return ValidationResult(
status=CacheValidationResult.ERROR,
reason=f"Validation error: {str(e)}"
)
async def _fetch_head(self, url: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
"""
Fetch only the <head> section of a page.
Uses streaming to stop reading after </head> is found,
minimizing bandwidth usage.
Args:
url: The URL to fetch
Returns:
Tuple of (head_html, etag, last_modified)
"""
client = await self._get_client()
try:
async with client.stream(
"GET",
url,
headers={"Accept-Encoding": "identity"} # Disable compression for easier parsing
) as response:
etag = response.headers.get("etag")
last_modified = response.headers.get("last-modified")
if response.status_code != 200:
return None, etag, last_modified
# Read until </head> or max 64KB
chunks = []
total_bytes = 0
max_bytes = 65536
async for chunk in response.aiter_bytes(4096):
chunks.append(chunk)
total_bytes += len(chunk)
content = b''.join(chunks)
# Check for </head> (case insensitive)
if b'</head>' in content.lower() or b'</HEAD>' in content:
break
if total_bytes >= max_bytes:
break
html = content.decode('utf-8', errors='replace')
# Extract just the head section
head_end = html.lower().find('</head>')
if head_end != -1:
html = html[:head_end + 7]
return html, etag, last_modified
except Exception:
return None, None, None
async def close(self):
"""Close the HTTP client and release resources."""
if self._client:
await self._client.aclose()
self._client = None
async def __aenter__(self):
"""Async context manager entry."""
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
await self.close()

View File

@@ -980,6 +980,9 @@ class LLMContentFilter(RelevantContentFilter):
prompt,
api_token,
base_url=base_url,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=extra_args,
)

View File

@@ -542,6 +542,19 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
if el.tag in bypass_tags:
continue
# Skip elements inside <pre> or <code> tags where whitespace is significant
# This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
is_in_code_block = False
ancestor = el.getparent()
while ancestor is not None:
if ancestor.tag in ("pre", "code"):
is_in_code_block = True
break
ancestor = ancestor.getparent()
if is_in_code_block:
continue
text_content = (el.text_content() or "").strip()
if (
len(text_content.split()) < word_count_threshold

View File

@@ -2,7 +2,7 @@
import asyncio
import logging
from datetime import datetime
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
from urllib.parse import urlparse
from ..models import TraversalStats
@@ -41,6 +41,9 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
include_external: bool = False,
max_pages: int = infinity,
logger: Optional[logging.Logger] = None,
# Optional resume/callback parameters for crash recovery
resume_state: Optional[Dict[str, Any]] = None,
on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
@@ -57,6 +60,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
# Store for use in arun methods
self._resume_state = resume_state
self._on_state_change = on_state_change
self._last_state: Optional[Dict[str, Any]] = None
# Shadow list for queue items (only used when on_state_change is set)
self._queue_shadow: Optional[List[Tuple[float, int, str, Optional[str]]]] = None
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -135,16 +144,36 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
) -> AsyncGenerator[CrawlResult, None]:
"""
Core best-first crawl method using a priority queue.
The queue items are tuples of (score, depth, url, parent_url). Lower scores
are treated as higher priority. URLs are processed in batches for efficiency.
"""
queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
# Push the initial URL with score 0 and depth 0.
initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
await queue.put((-initial_score, 0, start_url, None))
visited: Set[str] = set()
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
# Restore queue from saved items
queue_items = self._resume_state.get("queue_items", [])
for item in queue_items:
await queue.put((item["score"], item["depth"], item["url"], item["parent_url"]))
# Initialize shadow list if callback is set
if self._on_state_change:
self._queue_shadow = [
(item["score"], item["depth"], item["url"], item["parent_url"])
for item in queue_items
]
else:
# Original initialization
initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
await queue.put((-initial_score, 0, start_url, None))
visited: Set[str] = set()
depths: Dict[str, int] = {start_url: 0}
# Initialize shadow list if callback is set
if self._on_state_change:
self._queue_shadow = [(-initial_score, 0, start_url, None)]
while not queue.empty() and not self._cancel_event.is_set():
# Stop if we've reached the max pages limit
@@ -166,6 +195,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
if queue.empty():
break
item = await queue.get()
# Remove from shadow list if tracking
if self._on_state_change and self._queue_shadow is not None:
try:
self._queue_shadow.remove(item)
except ValueError:
pass # Item may have been removed already
score, depth, url, parent_url = item
if url in visited:
continue
@@ -210,7 +245,26 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
for new_url, new_parent in new_links:
new_depth = depths.get(new_url, depth + 1)
new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
await queue.put((-new_score, new_depth, new_url, new_parent))
queue_item = (-new_score, new_depth, new_url, new_parent)
await queue.put(queue_item)
# Add to shadow list if tracking
if self._on_state_change and self._queue_shadow is not None:
self._queue_shadow.append(queue_item)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change and self._queue_shadow is not None:
state = {
"strategy_type": "best_first",
"visited": list(visited),
"queue_items": [
{"score": s, "depth": d, "url": u, "parent_url": p}
for s, d, u, p in self._queue_shadow
],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
# End of crawl.
@@ -269,3 +323,15 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
"""
self._cancel_event.set()
self.stats.end_time = datetime.now()
def export_state(self) -> Optional[Dict[str, Any]]:
"""
Export current crawl state for external persistence.
Note: This returns the last captured state. For real-time state,
use the on_state_change callback.
Returns:
Dict with strategy state, or None if no state captured yet.
"""
return self._last_state

View File

@@ -2,7 +2,7 @@
import asyncio
import logging
from datetime import datetime
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
from urllib.parse import urlparse
from ..models import TraversalStats
@@ -26,11 +26,14 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self,
max_depth: int,
filter_chain: FilterChain = FilterChain(),
url_scorer: Optional[URLScorer] = None,
url_scorer: Optional[URLScorer] = None,
include_external: bool = False,
score_threshold: float = -infinity,
max_pages: int = infinity,
logger: Optional[logging.Logger] = None,
# Optional resume/callback parameters for crash recovery
resume_state: Optional[Dict[str, Any]] = None,
on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
@@ -48,6 +51,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
# Store for use in arun methods
self._resume_state = resume_state
self._on_state_change = on_state_change
self._last_state: Optional[Dict[str, Any]] = None
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -155,10 +162,21 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
Batch (non-streaming) mode:
Processes one BFS level at a time, then yields all the results.
"""
visited: Set[str] = set()
# current_level holds tuples: (url, parent_url)
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
current_level = [
(item["url"], item["parent_url"])
for item in self._resume_state.get("pending", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
else:
# Original initialization
visited: Set[str] = set()
# current_level holds tuples: (url, parent_url)
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
@@ -174,11 +192,7 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
# Clone the config to disable deep crawling recursion and enforce batch mode.
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
batch_results = await crawler.arun_many(urls=urls, config=batch_config)
# Update pages crawled counter - count only successful crawls
successful_results = [r for r in batch_results if r.success]
self._pages_crawled += len(successful_results)
for result in batch_results:
url = result.url
depth = depths.get(url, 0)
@@ -187,12 +201,27 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
parent_url = next((parent for (u, parent) in current_level if u == url), None)
result.metadata["parent_url"] = parent_url
results.append(result)
# Only discover links from successful crawls
if result.success:
# Increment pages crawled per URL for accurate state tracking
self._pages_crawled += 1
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "bfs",
"visited": list(visited),
"pending": [{"url": u, "parent_url": p} for u, p in next_level],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
current_level = next_level
return results
@@ -207,9 +236,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
Streaming mode:
Processes one BFS level at a time and yields results immediately as they arrive.
"""
visited: Set[str] = set()
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
current_level = [
(item["url"], item["parent_url"])
for item in self._resume_state.get("pending", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
else:
# Original initialization
visited: Set[str] = set()
current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
depths: Dict[str, int] = {start_url: 0}
while current_level and not self._cancel_event.is_set():
next_level: List[Tuple[str, Optional[str]]] = []
@@ -244,7 +284,19 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
if result.success:
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Capture state after EACH URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "bfs",
"visited": list(visited),
"pending": [{"url": u, "parent_url": p} for u, p in next_level],
"depths": depths,
"pages_crawled": self._pages_crawled,
}
self._last_state = state
await self._on_state_change(state)
# If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
# by considering these URLs as visited but not counting them toward the max_pages limit
if results_count == 0 and urls:
@@ -258,3 +310,15 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
"""
self._cancel_event.set()
self.stats.end_time = datetime.now()
def export_state(self) -> Optional[Dict[str, Any]]:
"""
Export current crawl state for external persistence.
Note: This returns the last captured state. For real-time state,
use the on_state_change callback.
Returns:
Dict with strategy state, or None if no state captured yet.
"""
return self._last_state

View File

@@ -4,14 +4,26 @@ from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from ..models import CrawlResult
from .bfs_strategy import BFSDeepCrawlStrategy # noqa
from ..types import AsyncWebCrawler, CrawlerRunConfig
from ..utils import normalize_url_for_deep_crawl
class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
"""
Depth-First Search (DFS) deep crawling strategy.
Depth-first deep crawling with familiar BFS rules.
Inherits URL validation and link discovery from BFSDeepCrawlStrategy.
Overrides _arun_batch and _arun_stream to use a stack (LIFO) for DFS traversal.
We reuse the same filters, scoring, and page limits from :class:`BFSDeepCrawlStrategy`,
but walk the graph with a stack so we fully explore one branch before hopping to the
next. DFS also keeps its own ``_dfs_seen`` set so we can drop duplicate links at
discovery time without accidentally marking them as “already crawled”.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._dfs_seen: Set[str] = set()
def _reset_seen(self, start_url: str) -> None:
"""Start each crawl with a clean dedupe set seeded with the root URL."""
self._dfs_seen = {start_url}
async def _arun_batch(
self,
start_url: str,
@@ -19,14 +31,32 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> List[CrawlResult]:
"""
Batch (non-streaming) DFS mode.
Uses a stack to traverse URLs in DFS order, aggregating CrawlResults into a list.
Crawl level-by-level but emit results at the end.
We keep a stack of ``(url, parent, depth)`` tuples, pop one at a time, and
hand it to ``crawler.arun_many`` with deep crawling disabled so we remain
in control of traversal. Every successful page bumps ``_pages_crawled`` and
seeds new stack items discovered via :meth:`link_discovery`.
"""
visited: Set[str] = set()
# Stack items: (url, parent_url, depth)
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
stack = [
(item["url"], item["parent_url"], item["depth"])
for item in self._resume_state.get("stack", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
results: List[CrawlResult] = []
else:
# Original initialization
visited: Set[str] = set()
# Stack items: (url, parent_url, depth)
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -62,6 +92,22 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Capture state after each URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "dfs",
"visited": list(visited),
"stack": [
{"url": u, "parent_url": p, "depth": d}
for u, p, d in stack
],
"depths": depths,
"pages_crawled": self._pages_crawled,
"dfs_seen": list(self._dfs_seen),
}
self._last_state = state
await self._on_state_change(state)
return results
async def _arun_stream(
@@ -71,12 +117,28 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlResult, None]:
"""
Streaming DFS mode.
Uses a stack to traverse URLs in DFS order and yields CrawlResults as they become available.
Same traversal as :meth:`_arun_batch`, but yield pages immediately.
Each popped URL is crawled, its metadata annotated, then the result gets
yielded before we even look at the next stack entry. Successful crawls
still feed :meth:`link_discovery`, keeping DFS order intact.
"""
visited: Set[str] = set()
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
# Conditional state initialization for resume support
if self._resume_state:
visited = set(self._resume_state.get("visited", []))
stack = [
(item["url"], item["parent_url"], item["depth"])
for item in self._resume_state.get("stack", [])
]
depths = dict(self._resume_state.get("depths", {}))
self._pages_crawled = self._resume_state.get("pages_crawled", 0)
self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
else:
# Original initialization
visited: Set[str] = set()
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -108,3 +170,108 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Capture state after each URL processed (if callback set)
if self._on_state_change:
state = {
"strategy_type": "dfs",
"visited": list(visited),
"stack": [
{"url": u, "parent_url": p, "depth": d}
for u, p, d in stack
],
"depths": depths,
"pages_crawled": self._pages_crawled,
"dfs_seen": list(self._dfs_seen),
}
self._last_state = state
await self._on_state_change(state)
async def link_discovery(
self,
result: CrawlResult,
source_url: str,
current_depth: int,
_visited: Set[str],
next_level: List[Tuple[str, Optional[str]]],
depths: Dict[str, int],
) -> None:
"""
Find the next URLs we should push onto the DFS stack.
Parameters
----------
result : CrawlResult
Output of the page we just crawled; its ``links`` block is our raw material.
source_url : str
URL of the parent page; stored so callers can track ancestry.
current_depth : int
Depth of the parent; children naturally sit at ``current_depth + 1``.
_visited : Set[str]
Present to match the BFS signature, but we rely on ``_dfs_seen`` instead.
next_level : list of tuples
The stack buffer supplied by the caller; we append new ``(url, parent)`` items here.
depths : dict
Shared depth map so future metadata tagging knows how deep each URL lives.
Notes
-----
- ``_dfs_seen`` keeps us from pushing duplicates without touching the traversal guard.
- Validation, scoring, and capacity trimming mirror the BFS version so behaviour stays consistent.
"""
next_depth = current_depth + 1
if next_depth > self.max_depth:
return
remaining_capacity = self.max_pages - self._pages_crawled
if remaining_capacity <= 0:
self.logger.info(
f"Max pages limit ({self.max_pages}) reached, stopping link discovery"
)
return
links = result.links.get("internal", [])
if self.include_external:
links += result.links.get("external", [])
seen = self._dfs_seen
valid_links: List[Tuple[str, float]] = []
for link in links:
raw_url = link.get("href")
if not raw_url:
continue
normalized_url = normalize_url_for_deep_crawl(raw_url, source_url)
if not normalized_url or normalized_url in seen:
continue
if not await self.can_process_url(raw_url, next_depth):
self.stats.urls_skipped += 1
continue
score = self.url_scorer.score(normalized_url) if self.url_scorer else 0
if score < self.score_threshold:
self.logger.debug(
f"URL {normalized_url} skipped: score {score} below threshold {self.score_threshold}"
)
self.stats.urls_skipped += 1
continue
seen.add(normalized_url)
valid_links.append((normalized_url, score))
if len(valid_links) > remaining_capacity:
if self.url_scorer:
valid_links.sort(key=lambda x: x[1], reverse=True)
valid_links = valid_links[:remaining_capacity]
self.logger.info(
f"Limiting to {remaining_capacity} URLs due to max_pages limit"
)
for url, score in valid_links:
if score:
result.metadata = result.metadata or {}
result.metadata["score"] = score
next_level.append((url, source_url))
depths[url] = next_depth

View File

@@ -509,18 +509,22 @@ class DomainFilter(URLFilter):
class ContentRelevanceFilter(URLFilter):
"""BM25-based relevance filter using head section content"""
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl")
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl", "query")
def __init__(
self,
query: str,
query: Union[str, List[str]],
threshold: float,
k1: float = 1.2,
b: float = 0.75,
avgdl: int = 1000,
):
super().__init__(name="BM25RelevanceFilter")
self.query_terms = self._tokenize(query)
if isinstance(query, list):
self.query = " ".join(query)
else:
self.query = query
self.query_terms = self._tokenize(self.query)
self.threshold = threshold
self.k1 = k1 # TF saturation parameter
self.b = b # Length normalization parameter

View File

@@ -180,7 +180,7 @@ class Crawl4aiDockerClient:
yield CrawlResult(**result)
return stream_results()
response = await self._request("POST", "/crawl", json=data)
response = await self._request("POST", "/crawl", json=data, timeout=hooks_timeout)
result_data = response.json()
if not result_data.get("success", False):
raise RequestError(f"Crawl failed: {result_data.get('msg', 'Unknown error')}")

View File

@@ -94,6 +94,20 @@ class ExtractionStrategy(ABC):
extracted_content.extend(future.result())
return extracted_content
async def arun(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
"""
Async version: Process sections of text in parallel using asyncio.
Default implementation runs the sync version in a thread pool.
Subclasses can override this for true async processing.
:param url: The URL of the webpage.
:param sections: List of sections (strings) to process.
:return: A list of processed JSON blocks.
"""
import asyncio
return await asyncio.to_thread(self.run, url, sections, *q, **kwargs)
class NoExtractionStrategy(ExtractionStrategy):
"""
@@ -635,6 +649,9 @@ class LLMExtractionStrategy(ExtractionStrategy):
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
) # , json_response=self.extract_type == "schema")
# Track usage
usage = TokenUsage(
@@ -780,6 +797,180 @@ class LLMExtractionStrategy(ExtractionStrategy):
return extracted_content
async def aextract(self, url: str, ix: int, html: str) -> List[Dict[str, Any]]:
"""
Async version: Extract meaningful blocks or chunks from the given HTML using an LLM.
How it works:
1. Construct a prompt with variables.
2. Make an async request to the LLM using the prompt.
3. Parse the response and extract blocks or chunks.
Args:
url: The URL of the webpage.
ix: Index of the block.
html: The HTML content of the webpage.
Returns:
A list of extracted blocks or chunks.
"""
from .utils import aperform_completion_with_backoff
if self.verbose:
print(f"[LOG] Call LLM for {url} - block index: {ix}")
variable_values = {
"URL": url,
"HTML": escape_json_string(sanitize_html(html)),
}
prompt_with_variables = PROMPT_EXTRACT_BLOCKS
if self.instruction:
variable_values["REQUEST"] = self.instruction
prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
if self.extract_type == "schema" and self.schema:
variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
if self.extract_type == "schema" and not self.schema:
prompt_with_variables = PROMPT_EXTRACT_INFERRED_SCHEMA
for variable in variable_values:
prompt_with_variables = prompt_with_variables.replace(
"{" + variable + "}", variable_values[variable]
)
try:
response = await aperform_completion_with_backoff(
self.llm_config.provider,
prompt_with_variables,
self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
)
# Track usage
usage = TokenUsage(
completion_tokens=response.usage.completion_tokens,
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
completion_tokens_details=response.usage.completion_tokens_details.__dict__
if response.usage.completion_tokens_details
else {},
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
if response.usage.prompt_tokens_details
else {},
)
self.usages.append(usage)
# Update totals
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
try:
content = response.choices[0].message.content
blocks = None
if self.force_json_response:
blocks = json.loads(content)
if isinstance(blocks, dict):
if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
blocks = list(blocks.values())[0]
else:
blocks = [blocks]
elif isinstance(blocks, list):
blocks = blocks
else:
blocks = extract_xml_data(["blocks"], content)["blocks"]
blocks = json.loads(blocks)
for block in blocks:
block["error"] = False
except Exception:
parsed, unparsed = split_and_parse_json_objects(
response.choices[0].message.content
)
blocks = parsed
if unparsed:
blocks.append(
{"index": 0, "error": True, "tags": ["error"], "content": unparsed}
)
if self.verbose:
print(
"[LOG] Extracted",
len(blocks),
"blocks from URL:",
url,
"block index:",
ix,
)
return blocks
except Exception as e:
if self.verbose:
print(f"[LOG] Error in LLM extraction: {e}")
return [
{
"index": ix,
"error": True,
"tags": ["error"],
"content": str(e),
}
]
async def arun(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
"""
Async version: Process sections with true parallelism using asyncio.gather.
Args:
url: The URL of the webpage.
sections: List of sections (strings) to process.
Returns:
A list of extracted blocks or chunks.
"""
import asyncio
merged_sections = self._merge(
sections,
self.chunk_token_threshold,
overlap=int(self.chunk_token_threshold * self.overlap_rate),
)
extracted_content = []
# Create tasks for all sections to run in parallel
tasks = [
self.aextract(url, ix, sanitize_input_encode(section))
for ix, section in enumerate(merged_sections)
]
# Execute all tasks concurrently
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results
for result in results:
if isinstance(result, Exception):
if self.verbose:
print(f"Error in async extraction: {result}")
extracted_content.append(
{
"index": 0,
"error": True,
"tags": ["error"],
"content": str(result),
}
)
else:
extracted_content.extend(result)
return extracted_content
def show_usage(self) -> None:
"""Print a detailed token usage report showing total and per-request usage."""
print("\n=== Token Usage Summary ===")
@@ -1086,44 +1277,18 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
}
@staticmethod
def generate_schema(
html: str,
schema_type: str = "CSS", # or XPATH
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = create_llm_config(),
provider: str = None,
api_token: str = None,
**kwargs
) -> dict:
def _build_schema_prompt(html: str, schema_type: str, query: str = None, target_json_example: str = None) -> str:
"""
Generate extraction schema from HTML content and optional query.
Args:
html (str): The HTML content to analyze
query (str, optional): Natural language description of what data to extract
provider (str): Legacy Parameter. LLM provider to use
api_token (str): Legacy Parameter. API token for LLM provider
llm_config (LLMConfig): LLM configuration object
prompt (str, optional): Custom prompt template to use
**kwargs: Additional args passed to LLM processor
Build the prompt for schema generation. Shared by sync and async methods.
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
str: Combined system and user prompt
"""
from .prompts import JSON_SCHEMA_BUILDER
from .utils import perform_completion_with_backoff
for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
if locals()[name] is not None:
raise AttributeError(f"Setting '{name}' is deprecated. {message}")
# Use default or custom prompt
prompt_template = JSON_SCHEMA_BUILDER if schema_type == "CSS" else JSON_SCHEMA_BUILDER_XPATH
# Build the prompt
system_message = {
"role": "system",
"content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
system_content = f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
Generating this HTML manually is not feasible, so you need to generate the JSON schema using the HTML content. The HTML copied from the crawled website is provided below, which we believe contains the repetitive pattern.
@@ -1144,31 +1309,27 @@ In this scenario, use your best judgment to generate the schema. You need to exa
# What are the instructions and details for this schema generation?
{prompt_template}"""
}
user_message = {
"role": "user",
"content": f"""
user_content = f"""
HTML to analyze:
```html
{html}
```
"""
}
if query:
user_message["content"] += f"\n\n## Query or explanation of target/goal data item:\n{query}"
user_content += f"\n\n## Query or explanation of target/goal data item:\n{query}"
if target_json_example:
user_message["content"] += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
user_content += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
if query and not target_json_example:
user_message["content"] += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
user_content += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
elif not query and target_json_example:
user_message["content"] += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
user_content += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
elif not query and not target_json_example:
user_message["content"] += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
user_message["content"] += """IMPORTANT:
user_content += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
user_content += """IMPORTANT:
0/ Ensure your schema remains reliable by avoiding selectors that appear to generate dynamically and are not dependable. You want a reliable schema, as it consistently returns the same data even after many page reloads.
1/ DO NOT USE use base64 kind of classes, they are temporary and not reliable.
2/ Every selector must refer to only one unique element. You should ensure your selector points to a single element and is unique to the place that contains the information. You have to use available techniques based on CSS or XPATH requested schema to make sure your selector is unique and also not fragile, meaning if we reload the page now or in the future, the selector should remain reliable.
@@ -1177,20 +1338,98 @@ In this scenario, use your best judgment to generate the schema. You need to exa
Analyze the HTML and generate a JSON schema that follows the specified format. Only output valid JSON schema, nothing else.
"""
return "\n\n".join([system_content, user_content])
@staticmethod
def generate_schema(
html: str,
schema_type: str = "CSS",
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = create_llm_config(),
provider: str = None,
api_token: str = None,
**kwargs
) -> dict:
"""
Generate extraction schema from HTML content and optional query (sync version).
Args:
html (str): The HTML content to analyze
query (str, optional): Natural language description of what data to extract
provider (str): Legacy Parameter. LLM provider to use
api_token (str): Legacy Parameter. API token for LLM provider
llm_config (LLMConfig): LLM configuration object
**kwargs: Additional args passed to LLM processor
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
"""
from .utils import perform_completion_with_backoff
for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
if locals()[name] is not None:
raise AttributeError(f"Setting '{name}' is deprecated. {message}")
prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
try:
# Call LLM with backoff handling
response = perform_completion_with_backoff(
provider=llm_config.provider,
prompt_with_variables="\n\n".join([system_message["content"], user_message["content"]]),
json_response = True,
prompt_with_variables=prompt,
json_response=True,
api_token=llm_config.api_token,
base_url=llm_config.base_url,
extra_args=kwargs
)
return json.loads(response.choices[0].message.content)
except Exception as e:
raise Exception(f"Failed to generate schema: {str(e)}")
@staticmethod
async def agenerate_schema(
html: str,
schema_type: str = "CSS",
query: str = None,
target_json_example: str = None,
llm_config: 'LLMConfig' = None,
**kwargs
) -> dict:
"""
Generate extraction schema from HTML content (async version).
Use this method when calling from async contexts (e.g., FastAPI) to avoid
issues with certain LLM providers (e.g., Gemini/Vertex AI) that require
async execution.
Args:
html (str): The HTML content to analyze
schema_type (str): "CSS" or "XPATH"
query (str, optional): Natural language description of what data to extract
target_json_example (str, optional): Example of desired JSON output
llm_config (LLMConfig): LLM configuration object
**kwargs: Additional args passed to LLM processor
Returns:
dict: Generated schema following the JsonElementExtractionStrategy format
"""
from .utils import aperform_completion_with_backoff
if llm_config is None:
llm_config = create_llm_config()
prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
try:
response = await aperform_completion_with_backoff(
provider=llm_config.provider,
prompt_with_variables=prompt,
json_response=True,
api_token=llm_config.api_token,
base_url=llm_config.base_url,
extra_args=kwargs
)
# Extract and return schema
return json.loads(response.choices[0].message.content)
except Exception as e:
raise Exception(f"Failed to generate schema: {str(e)}")

View File

@@ -1,4 +1,4 @@
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field, ConfigDict
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from typing import AsyncGenerator
from typing import Generic, TypeVar
@@ -152,9 +152,12 @@ class CrawlResult(BaseModel):
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
tables: List[Dict] = Field(default_factory=list) # NEW [{headers,rows,caption,summary}]
# Cache validation metadata (Smart Cache)
head_fingerprint: Optional[str] = None
cached_at: Optional[float] = None
cache_status: Optional[str] = None # "hit", "hit_validated", "hit_fallback", "miss"
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
# NOTE: The StringCompatibleMarkdown class, custom __init__ method, property getters/setters,
# and model_dump override all exist to support a smooth transition from markdown as a string
@@ -332,8 +335,7 @@ class AsyncCrawlResponse(BaseModel):
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
###############################
# Scraping Models

View File

@@ -15,9 +15,9 @@ from .utils import (
clean_pdf_text_to_html,
)
# Remove direct PyPDF2 imports from the top
# import PyPDF2
# from PyPDF2 import PdfReader
# Remove direct pypdf imports from the top
# import pypdf
# from pypdf import PdfReader
logger = logging.getLogger(__name__)
@@ -59,9 +59,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
save_images_locally: bool = False, image_save_dir: Optional[Path] = None, batch_size: int = 4):
# Import check at initialization time
try:
import PyPDF2
import pypdf
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
self.image_dpi = image_dpi
self.image_quality = image_quality
@@ -75,9 +75,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
def process(self, pdf_path: Path) -> PDFProcessResult:
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
start_time = time()
result = PDFProcessResult(
@@ -125,15 +125,15 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
"""Like process() but processes PDF pages in parallel batches"""
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
import PyPDF2 # For type checking
from pypdf import PdfReader
import pypdf # For type checking
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
import concurrent.futures
import threading
# Initialize PyPDF2 thread support
# Initialize pypdf thread support
if not hasattr(threading.current_thread(), "_children"):
threading.current_thread()._children = set()
@@ -232,11 +232,11 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
return pdf_page
def _extract_images(self, page, image_dir: Optional[Path]) -> List[Dict]:
# Import PyPDF2 for type checking only when needed
# Import pypdf for type checking only when needed
try:
import PyPDF2
from pypdf.generic import IndirectObject
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
if not self.extract_images:
return []
@@ -266,7 +266,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
width = xobj.get('/Width', 0)
height = xobj.get('/Height', 0)
color_space = xobj.get('/ColorSpace', '/DeviceRGB')
if isinstance(color_space, PyPDF2.generic.IndirectObject):
if isinstance(color_space, IndirectObject):
color_space = color_space.get_object()
# Handle different image encodings
@@ -277,7 +277,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
if '/FlateDecode' in filters:
try:
decode_parms = xobj.get('/DecodeParms', {})
if isinstance(decode_parms, PyPDF2.generic.IndirectObject):
if isinstance(decode_parms, IndirectObject):
decode_parms = decode_parms.get_object()
predictor = decode_parms.get('/Predictor', 1)
@@ -416,10 +416,10 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
# Import inside method to allow dependency to be optional
if reader is None:
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
reader = PdfReader(pdf_path)
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
meta = reader.metadata or {}
created = self._parse_pdf_date(meta.get('/CreationDate', ''))
@@ -459,11 +459,11 @@ if __name__ == "__main__":
from pathlib import Path
try:
# Import PyPDF2 only when running the file directly
import PyPDF2
from PyPDF2 import PdfReader
# Import pypdf only when running the file directly
import pypdf
from pypdf import PdfReader
except ImportError:
print("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
print("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
exit(1)
current_dir = Path(__file__).resolve().parent

View File

@@ -1,7 +1,9 @@
from typing import List, Dict, Optional
from typing import List, Dict, Optional, Tuple
from abc import ABC, abstractmethod
from itertools import cycle
import os
import asyncio
import time
########### ATTENTION PEOPLE OF EARTH ###########
@@ -120,7 +122,7 @@ class ProxyConfig:
class ProxyRotationStrategy(ABC):
"""Base abstract class for proxy rotation strategies"""
@abstractmethod
async def get_next_proxy(self) -> Optional[ProxyConfig]:
"""Get next proxy configuration from the strategy"""
@@ -131,18 +133,81 @@ class ProxyRotationStrategy(ABC):
"""Add proxy configurations to the strategy"""
pass
class RoundRobinProxyStrategy:
"""Simple round-robin proxy rotation strategy using ProxyConfig objects"""
@abstractmethod
async def get_proxy_for_session(
self,
session_id: str,
ttl: Optional[int] = None
) -> Optional[ProxyConfig]:
"""
Get or create a sticky proxy for a session.
If session_id already has an assigned proxy (and hasn't expired), return it.
If session_id is new, acquire a new proxy and associate it.
Args:
session_id: Unique session identifier
ttl: Optional time-to-live in seconds for this session
Returns:
ProxyConfig for this session
"""
pass
@abstractmethod
async def release_session(self, session_id: str) -> None:
"""
Release a sticky session, making the proxy available for reuse.
Args:
session_id: Session to release
"""
pass
@abstractmethod
def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
"""
Get the proxy for an existing session without creating new one.
Args:
session_id: Session to look up
Returns:
ProxyConfig if session exists and hasn't expired, None otherwise
"""
pass
@abstractmethod
def get_active_sessions(self) -> Dict[str, ProxyConfig]:
"""
Get all active sticky sessions.
Returns:
Dictionary mapping session_id to ProxyConfig
"""
pass
class RoundRobinProxyStrategy(ProxyRotationStrategy):
"""Simple round-robin proxy rotation strategy using ProxyConfig objects.
Supports sticky sessions where a session_id can be bound to a specific proxy
for the duration of the session. This is useful for deep crawling where
you want to maintain the same IP address across multiple requests.
"""
def __init__(self, proxies: List[ProxyConfig] = None):
"""
Initialize with optional list of proxy configurations
Args:
proxies: List of ProxyConfig objects
"""
self._proxies = []
self._proxies: List[ProxyConfig] = []
self._proxy_cycle = None
# Session tracking: maps session_id -> (ProxyConfig, created_at, ttl)
self._sessions: Dict[str, Tuple[ProxyConfig, float, Optional[int]]] = {}
self._session_lock = asyncio.Lock()
if proxies:
self.add_proxies(proxies)
@@ -156,3 +221,121 @@ class RoundRobinProxyStrategy:
if not self._proxy_cycle:
return None
return next(self._proxy_cycle)
async def get_proxy_for_session(
self,
session_id: str,
ttl: Optional[int] = None
) -> Optional[ProxyConfig]:
"""
Get or create a sticky proxy for a session.
If session_id already has an assigned proxy (and hasn't expired), return it.
If session_id is new, acquire a new proxy and associate it.
Args:
session_id: Unique session identifier
ttl: Optional time-to-live in seconds for this session
Returns:
ProxyConfig for this session
"""
async with self._session_lock:
# Check if session exists and hasn't expired
if session_id in self._sessions:
proxy, created_at, session_ttl = self._sessions[session_id]
# Check TTL expiration
effective_ttl = ttl if ttl is not None else session_ttl
if effective_ttl is not None:
elapsed = time.time() - created_at
if elapsed >= effective_ttl:
# Session expired, remove it and get new proxy
del self._sessions[session_id]
else:
return proxy
else:
return proxy
# Acquire new proxy for this session
proxy = await self.get_next_proxy()
if proxy:
self._sessions[session_id] = (proxy, time.time(), ttl)
return proxy
async def release_session(self, session_id: str) -> None:
"""
Release a sticky session, making the proxy available for reuse.
Args:
session_id: Session to release
"""
async with self._session_lock:
if session_id in self._sessions:
del self._sessions[session_id]
def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
"""
Get the proxy for an existing session without creating new one.
Args:
session_id: Session to look up
Returns:
ProxyConfig if session exists and hasn't expired, None otherwise
"""
if session_id not in self._sessions:
return None
proxy, created_at, ttl = self._sessions[session_id]
# Check TTL expiration
if ttl is not None:
elapsed = time.time() - created_at
if elapsed >= ttl:
return None
return proxy
def get_active_sessions(self) -> Dict[str, ProxyConfig]:
"""
Get all active sticky sessions (excluding expired ones).
Returns:
Dictionary mapping session_id to ProxyConfig
"""
current_time = time.time()
active_sessions = {}
for session_id, (proxy, created_at, ttl) in self._sessions.items():
# Skip expired sessions
if ttl is not None:
elapsed = current_time - created_at
if elapsed >= ttl:
continue
active_sessions[session_id] = proxy
return active_sessions
async def cleanup_expired_sessions(self) -> int:
"""
Remove all expired sessions from tracking.
Returns:
Number of sessions removed
"""
async with self._session_lock:
current_time = time.time()
expired = []
for session_id, (proxy, created_at, ttl) in self._sessions.items():
if ttl is not None:
elapsed = current_time - created_at
if elapsed >= ttl:
expired.append(session_id)
for session_id in expired:
del self._sessions[session_id]
return len(expired)

View File

@@ -795,6 +795,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)
@@ -1116,6 +1119,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)

View File

@@ -1745,6 +1745,9 @@ def perform_completion_with_backoff(
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
@@ -1761,6 +1764,9 @@ def perform_completion_with_backoff(
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
@@ -1769,9 +1775,8 @@ def perform_completion_with_backoff(
from litellm import completion
from litellm.exceptions import RateLimitError
max_attempts = 3
base_delay = 2 # Base delay in seconds, you can adjust this based on your needs
import litellm
litellm.drop_params = True # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
@@ -1798,7 +1803,7 @@ def perform_completion_with_backoff(
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (2**attempt) # Exponential backoff formula
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
time.sleep(delay)
else:
@@ -1825,6 +1830,87 @@ def perform_completion_with_backoff(
# ]
async def aperform_completion_with_backoff(
provider,
prompt_with_variables,
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
Async version: Perform an API completion request with exponential backoff.
How it works:
1. Sends an async completion request to the API.
2. Retries on rate-limit errors with exponential delays (async).
3. Returns the API response or an error after all retries.
Args:
provider (str): The name of the API provider.
prompt_with_variables (str): The input prompt for the completion request.
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
dict: The API response or an error message after all retries.
"""
from litellm import acompletion
from litellm.exceptions import RateLimitError
import litellm
import asyncio
litellm.drop_params = True # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
extra_args["response_format"] = {"type": "json_object"}
if kwargs.get("extra_args"):
extra_args.update(kwargs["extra_args"])
for attempt in range(max_attempts):
try:
response = await acompletion(
model=provider,
messages=[{"role": "user", "content": prompt_with_variables}],
**extra_args,
)
return response # Return the successful response
except RateLimitError as e:
print("Rate limit error:", str(e))
if attempt == max_attempts - 1:
# Last attempt failed, raise the error.
raise
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
await asyncio.sleep(delay)
else:
# Return an error response after exhausting all retries
return [
{
"index": 0,
"tags": ["error"],
"content": ["Rate limit error. Please try again later."],
}
]
except Exception as e:
raise e # Raise any other exceptions immediately
def extract_blocks(url, html, provider=DEFAULT_PROVIDER, api_token=None, base_url=None):
"""
Extract content blocks from website HTML using an AI provider.
@@ -2379,6 +2465,54 @@ def normalize_url_tmp(href, base_url):
return href.strip()
def quick_extract_links(html: str, base_url: str) -> Dict[str, List[Dict[str, str]]]:
"""
Fast link extraction for prefetch mode.
Only extracts <a href> tags - no media, no cleaning, no heavy processing.
Args:
html: Raw HTML string
base_url: Base URL for resolving relative links
Returns:
{"internal": [{"href": "...", "text": "..."}], "external": [...]}
"""
from lxml.html import document_fromstring
try:
doc = document_fromstring(html)
except Exception:
return {"internal": [], "external": []}
base_domain = get_base_domain(base_url)
internal: List[Dict[str, str]] = []
external: List[Dict[str, str]] = []
seen: Set[str] = set()
for a in doc.xpath("//a[@href]"):
href = a.get("href", "").strip()
if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
continue
# Normalize URL
normalized = normalize_url_for_deep_crawl(href, base_url)
if not normalized or normalized in seen:
continue
seen.add(normalized)
# Extract text (truncated for memory efficiency)
text = (a.text_content() or "").strip()[:200]
link_data = {"href": normalized, "text": text}
if is_external_url(normalized, base_domain):
external.append(link_data)
else:
internal.append(link_data)
return {"internal": internal, "external": external}
def get_base_domain(url: str) -> str:
"""
Extract the base domain from a given URL, handling common edge cases.
@@ -2746,6 +2880,67 @@ def generate_content_hash(content: str) -> str:
# return hashlib.sha256(content.encode()).hexdigest()
def compute_head_fingerprint(head_html: str) -> str:
"""
Compute a fingerprint of <head> content for cache validation.
Focuses on content that typically changes when page updates:
- <title>
- <meta name="description">
- <meta property="og:title|og:description|og:image|og:updated_time">
- <meta property="article:modified_time">
- <meta name="last-modified">
Uses xxhash for speed, combines multiple signals into a single hash.
Args:
head_html: The HTML content of the <head> section
Returns:
A hex string fingerprint, or empty string if no signals found
"""
if not head_html:
return ""
head_lower = head_html.lower()
signals = []
# Extract title
title_match = re.search(r'<title[^>]*>(.*?)</title>', head_lower, re.DOTALL)
if title_match:
signals.append(title_match.group(1).strip())
# Meta tags to extract (name or property attribute, and the value to match)
meta_tags = [
("name", "description"),
("name", "last-modified"),
("property", "og:title"),
("property", "og:description"),
("property", "og:image"),
("property", "og:updated_time"),
("property", "article:modified_time"),
]
for attr_type, attr_value in meta_tags:
# Handle both attribute orders: attr="value" content="..." and content="..." attr="value"
patterns = [
rf'<meta[^>]*{attr_type}=["\']{ re.escape(attr_value)}["\'][^>]*content=["\']([^"\']*)["\']',
rf'<meta[^>]*content=["\']([^"\']*)["\'][^>]*{attr_type}=["\']{re.escape(attr_value)}["\']',
]
for pattern in patterns:
match = re.search(pattern, head_lower)
if match:
signals.append(match.group(1).strip())
break # Found this tag, move to next
if not signals:
return ""
# Combine signals and hash
combined = '|'.join(signals)
return xxhash.xxh64(combined.encode()).hexdigest()
def ensure_content_dirs(base_path: str) -> Dict[str, str]:
"""Create content directories if they don't exist"""
dirs = {

File diff suppressed because it is too large Load Diff

View File

@@ -59,13 +59,13 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest stable release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest stable release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
```bash
# Pull the latest stable version (0.7.6)
docker pull unclecode/crawl4ai:0.7.6
# Pull the latest stable version (0.8.0)
docker pull unclecode/crawl4ai:0.8.0
# Or use the latest tag (points to 0.7.6)
# Or use the latest tag (points to 0.8.0)
docker pull unclecode/crawl4ai:latest
```
@@ -100,7 +100,7 @@ EOL
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:0.7.6
unclecode/crawl4ai:0.8.0
```
* **With LLM support:**
@@ -111,7 +111,7 @@ EOL
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:0.7.6
unclecode/crawl4ai:0.8.0
```
> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -184,7 +184,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:0.7.6 docker compose up -d
IMAGE=unclecode/crawl4ai:0.8.0 docker compose up -d
```
* **Build and Run Locally:**

View File

@@ -0,0 +1,241 @@
# Crawl4AI Docker Memory & Pool Optimization - Implementation Log
## Critical Issues Identified
### Memory Management
- **Host vs Container**: `psutil.virtual_memory()` reported host memory, not container limits
- **Browser Pooling**: No pool reuse - every endpoint created new browsers
- **Warmup Waste**: Permanent browser sat idle with mismatched config signature
- **Idle Cleanup**: 30min TTL too long, janitor ran every 60s
- **Endpoint Inconsistency**: 75% of endpoints bypassed pool (`/md`, `/html`, `/screenshot`, `/pdf`, `/execute_js`, `/llm`)
### Pool Design Flaws
- **Config Mismatch**: Permanent browser used `config.yml` args, endpoints used empty `BrowserConfig()`
- **Logging Level**: Pool hit markers at DEBUG, invisible with INFO logging
## Implementation Changes
### 1. Container-Aware Memory Detection (`utils.py`)
```python
def get_container_memory_percent() -> float:
# Try cgroup v2 → v1 → fallback to psutil
# Reads /sys/fs/cgroup/memory.{current,max} OR memory/memory.{usage,limit}_in_bytes
```
### 2. Smart Browser Pool (`crawler_pool.py`)
**3-Tier System:**
- **PERMANENT**: Always-ready default browser (never cleaned)
- **HOT_POOL**: Configs used 3+ times (longer TTL)
- **COLD_POOL**: New/rare configs (short TTL)
**Key Functions:**
- `get_crawler(cfg)`: Check permanent → hot → cold → create new
- `init_permanent(cfg)`: Initialize permanent at startup
- `janitor()`: Adaptive cleanup (10s/30s/60s intervals based on memory)
- `_sig(cfg)`: SHA1 hash of config dict for pool keys
**Logging Fix**: Changed `logger.debug()``logger.info()` for pool hits
### 3. Endpoint Unification
**Helper Function** (`server.py`):
```python
def get_default_browser_config() -> BrowserConfig:
return BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
)
```
**Migrated Endpoints:**
- `/html`, `/screenshot`, `/pdf`, `/execute_js` → use `get_default_browser_config()`
- `handle_llm_qa()`, `handle_markdown_request()` → same
**Result**: All endpoints now hit permanent browser pool
### 4. Config Updates (`config.yml`)
- `idle_ttl_sec: 1800``300` (30min → 5min base TTL)
- `port: 11234``11235` (fixed mismatch with Gunicorn)
### 5. Lifespan Fix (`server.py`)
```python
await init_permanent(BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
))
```
Permanent browser now matches endpoint config signatures
## Test Results
### Test 1: Basic Health
- 10 requests to `/health`
- **Result**: 100% success, avg 3ms latency
- **Baseline**: Container starts in ~5s, 270 MB idle
### Test 2: Memory Monitoring
- 20 requests with Docker stats tracking
- **Result**: 100% success, no memory leak (-0.2 MB delta)
- **Baseline**: 269.7 MB container overhead
### Test 3: Pool Validation
- 30 requests to `/html` endpoint
- **Result**: **100% permanent browser hits**, 0 new browsers created
- **Memory**: 287 MB baseline → 396 MB active (+109 MB)
- **Latency**: Avg 4s (includes network to httpbin.org)
### Test 4: Concurrent Load
- Light (10) → Medium (50) → Heavy (100) concurrent
- **Total**: 320 requests
- **Result**: 100% success, **320/320 permanent hits**, 0 new browsers
- **Memory**: 269 MB → peak 1533 MB → final 993 MB
- **Latency**: P99 at 100 concurrent = 34s (expected with single browser)
### Test 5: Pool Stress (Mixed Configs)
- 20 requests with 4 different viewport configs
- **Result**: 4 new browsers, 4 cold hits, **4 promotions to hot**, 8 hot hits
- **Reuse Rate**: 60% (12 pool hits / 20 requests)
- **Memory**: 270 MB → 928 MB peak (+658 MB = ~165 MB per browser)
- **Proves**: Cold → hot promotion at 3 uses working perfectly
### Test 6: Multi-Endpoint
- 10 requests each: `/html`, `/screenshot`, `/pdf`, `/crawl`
- **Result**: 100% success across all 4 endpoints
- **Latency**: 5-8s avg (PDF slowest at 7.2s)
### Test 7: Cleanup Verification
- 20 requests (load spike) → 90s idle
- **Memory**: 269 MB → peak 1107 MB → final 780 MB
- **Recovery**: 327 MB (39%) - partial cleanup
- **Note**: Hot pool browsers persist (by design), janitor working correctly
## Performance Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Pool Reuse | 0% | 100% (default config) | ∞ |
| Memory Leak | Unknown | 0 MB/cycle | Stable |
| Browser Reuse | No | Yes | ~3-5s saved per request |
| Idle Memory | 500-700 MB × N | 270-400 MB | 10x reduction |
| Concurrent Capacity | ~20 | 100+ | 5x |
## Key Learnings
1. **Config Signature Matching**: Permanent browser MUST match endpoint default config exactly (SHA1 hash)
2. **Logging Levels**: Pool diagnostics need INFO level, not DEBUG
3. **Memory in Docker**: Must read cgroup files, not host metrics
4. **Janitor Timing**: 60s interval adequate, but TTLs should be short (5min) for cold pool
5. **Hot Promotion**: 3-use threshold works well for production patterns
6. **Memory Per Browser**: ~150-200 MB per Chromium instance with headless + text_mode
## Test Infrastructure
**Location**: `deploy/docker/tests/`
**Dependencies**: `httpx`, `docker` (Python SDK)
**Pattern**: Sequential build - each test adds one capability
**Files**:
- `test_1_basic.py`: Health check + container lifecycle
- `test_2_memory.py`: + Docker stats monitoring
- `test_3_pool.py`: + Log analysis for pool markers
- `test_4_concurrent.py`: + asyncio.Semaphore for concurrency control
- `test_5_pool_stress.py`: + Config variants (viewports)
- `test_6_multi_endpoint.py`: + Multiple endpoint testing
- `test_7_cleanup.py`: + Time-series memory tracking for janitor
**Run Pattern**:
```bash
cd deploy/docker/tests
pip install -r requirements.txt
# Rebuild after code changes:
cd /path/to/repo && docker buildx build -t crawl4ai-local:latest --load .
# Run test:
python test_N_name.py
```
## Architecture Decisions
**Why Permanent Browser?**
- 90% of requests use default config → single browser serves most traffic
- Eliminates 3-5s startup overhead per request
**Why 3-Tier Pool?**
- Permanent: Zero cost for common case
- Hot: Amortized cost for frequent variants
- Cold: Lazy allocation for rare configs
**Why Adaptive Janitor?**
- Memory pressure triggers aggressive cleanup
- Low memory allows longer TTLs for better reuse
**Why Not Close After Each Request?**
- Browser startup: 3-5s overhead
- Pool reuse: <100ms overhead
- Net: 30-50x faster
## Future Optimizations
1. **Request Queuing**: When at capacity, queue instead of reject
2. **Pre-warming**: Predict common configs, pre-create browsers
3. **Metrics Export**: Prometheus metrics for pool efficiency
4. **Config Normalization**: Group similar viewports (e.g., 1920±50 → 1920)
## Critical Code Paths
**Browser Acquisition** (`crawler_pool.py:34-78`):
```
get_crawler(cfg) →
_sig(cfg) →
if sig == DEFAULT_CONFIG_SIG → PERMANENT
elif sig in HOT_POOL → HOT_POOL[sig]
elif sig in COLD_POOL → promote if count >= 3
else → create new in COLD_POOL
```
**Janitor Loop** (`crawler_pool.py:107-146`):
```
while True:
mem% = get_container_memory_percent()
if mem% > 80: interval=10s, cold_ttl=30s
elif mem% > 60: interval=30s, cold_ttl=60s
else: interval=60s, cold_ttl=300s
sleep(interval)
close idle browsers (COLD then HOT)
```
**Endpoint Pattern** (`server.py` example):
```python
@app.post("/html")
async def generate_html(...):
from crawler_pool import get_crawler
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
# No crawler.close() - returned to pool
```
## Debugging Tips
**Check Pool Activity**:
```bash
docker logs crawl4ai-test | grep -E "(🔥|♨️|❄️|🆕|⬆️)"
```
**Verify Config Signature**:
```python
from crawl4ai import BrowserConfig
import json, hashlib
cfg = BrowserConfig(...)
sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest()
print(sig[:8]) # Compare with logs
```
**Monitor Memory**:
```bash
docker stats crawl4ai-test
```
## Known Limitations
- **Mac Docker Stats**: CPU metrics unreliable, memory works
- **PDF Generation**: Slowest endpoint (~7s), no optimization yet
- **Hot Pool Persistence**: May hold memory longer than needed (trade-off for performance)
- **Janitor Lag**: Up to 60s before cleanup triggers in low-memory scenarios

View File

@@ -67,6 +67,7 @@ async def handle_llm_qa(
config: dict
) -> str:
"""Process QA using LLM with crawled content as context."""
from crawler_pool import get_crawler
try:
if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")):
url = 'https://' + url
@@ -75,15 +76,21 @@ async def handle_llm_qa(
if last_q_index != -1:
url = url[:last_q_index]
# Get markdown content
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url)
if not result.success:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=result.error_message
)
content = result.markdown.fit_markdown or result.markdown.raw_markdown
# Get markdown content (use default config)
from utils import load_config
cfg = load_config()
browser_cfg = BrowserConfig(
extra_args=cfg["crawler"]["browser"].get("extra_args", []),
**cfg["crawler"]["browser"].get("kwargs", {}),
)
crawler = await get_crawler(browser_cfg)
result = await crawler.arun(url)
if not result.success:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=result.error_message
)
content = result.markdown.fit_markdown or result.markdown.raw_markdown
# Create prompt and get LLM response
prompt = f"""Use the following content as context to answer the question.
@@ -101,7 +108,10 @@ async def handle_llm_qa(
prompt_with_variables=prompt,
api_token=get_llm_api_key(config), # Returns None to let litellm handle it
temperature=get_llm_temperature(config),
base_url=get_llm_base_url(config)
base_url=get_llm_base_url(config),
base_delay=config["llm"].get("backoff_base_delay", 2),
max_attempts=config["llm"].get("backoff_max_attempts", 3),
exponential_factor=config["llm"].get("backoff_exponential_factor", 2)
)
return response.choices[0].message.content
@@ -272,25 +282,32 @@ async def handle_markdown_request(
cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=decoded_url,
config=CrawlerRunConfig(
markdown_generator=md_generator,
scraping_strategy=LXMLWebScrapingStrategy(),
cache_mode=cache_mode
)
from crawler_pool import get_crawler
from utils import load_config as _load_config
_cfg = _load_config()
browser_cfg = BrowserConfig(
extra_args=_cfg["crawler"]["browser"].get("extra_args", []),
**_cfg["crawler"]["browser"].get("kwargs", {}),
)
crawler = await get_crawler(browser_cfg)
result = await crawler.arun(
url=decoded_url,
config=CrawlerRunConfig(
markdown_generator=md_generator,
scraping_strategy=LXMLWebScrapingStrategy(),
cache_mode=cache_mode
)
if not result.success:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=result.error_message
)
)
return (result.markdown.raw_markdown
if filter_type == FilterType.RAW
else result.markdown.fit_markdown)
if not result.success:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=result.error_message
)
return (result.markdown.raw_markdown
if filter_type == FilterType.RAW
else result.markdown.fit_markdown)
except Exception as e:
logger.error(f"Markdown error: {str(e)}", exc_info=True)
@@ -504,12 +521,22 @@ async def handle_crawl_request(
hooks_config: Optional[dict] = None
) -> dict:
"""Handle non-streaming crawl requests with optional hooks."""
# Track request start
request_id = f"req_{uuid4().hex[:8]}"
try:
from monitor import get_monitor
await get_monitor().track_request_start(
request_id, "/crawl", urls[0] if urls else "batch", browser_config
)
except:
pass # Monitor not critical
start_mem_mb = _get_memory_mb() # <--- Get memory before
start_time = time.time()
mem_delta_mb = None
peak_mem_mb = start_mem_mb
hook_manager = None
try:
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
browser_config = BrowserConfig.load(browser_config)
@@ -614,7 +641,16 @@ async def handle_crawl_request(
"server_memory_delta_mb": mem_delta_mb,
"server_peak_memory_mb": peak_mem_mb
}
# Track request completion
try:
from monitor import get_monitor
await get_monitor().track_request_end(
request_id, success=True, pool_hit=True, status_code=200
)
except:
pass
# Add hooks information if hooks were used
if hooks_config and hook_manager:
from hook_manager import UserHookManager
@@ -643,6 +679,16 @@ async def handle_crawl_request(
except Exception as e:
logger.error(f"Crawl error: {str(e)}", exc_info=True)
# Track request error
try:
from monitor import get_monitor
await get_monitor().track_request_end(
request_id, success=False, error=str(e), status_code=500
)
except:
pass
if 'crawler' in locals() and crawler.ready: # Check if crawler was initialized and started
# try:
# await crawler.close()

View File

@@ -3,7 +3,7 @@ app:
title: "Crawl4AI API"
version: "1.0.0"
host: "0.0.0.0"
port: 11234
port: 11235
reload: False
workers: 1
timeout_keep_alive: 300
@@ -37,6 +37,10 @@ rate_limiting:
storage_uri: "memory://" # Use "redis://localhost:6379" for production
# Security Configuration
# WARNING: For production deployments, enable security and use proper SECRET_KEY:
# - Set jwt_enabled: true for authentication
# - Set SECRET_KEY environment variable to a secure random value
# - Set CRAWL4AI_HOOKS_ENABLED=true only if you need hooks (RCE risk)
security:
enabled: false
jwt_enabled: false
@@ -61,7 +65,7 @@ crawler:
batch_process: 300.0 # Timeout for batch processing
pool:
max_pages: 40 # ← GLOBAL_SEM permits
idle_ttl_sec: 1800 # ← 30 min janitor cutoff
idle_ttl_sec: 300 # ← 30 min janitor cutoff
browser:
kwargs:
headless: true

View File

@@ -1,60 +1,170 @@
# crawler_pool.py (new file)
import asyncio, json, hashlib, time, psutil
# crawler_pool.py - Smart browser pool with tiered management
import asyncio, json, hashlib, time
from contextlib import suppress
from typing import Dict
from typing import Dict, Optional
from crawl4ai import AsyncWebCrawler, BrowserConfig
from typing import Dict
from utils import load_config
from utils import load_config, get_container_memory_percent
import logging
logger = logging.getLogger(__name__)
CONFIG = load_config()
POOL: Dict[str, AsyncWebCrawler] = {}
# Pool tiers
PERMANENT: Optional[AsyncWebCrawler] = None # Always-ready default browser
HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs
COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs
LAST_USED: Dict[str, float] = {}
USAGE_COUNT: Dict[str, int] = {}
LOCK = asyncio.Lock()
MEM_LIMIT = CONFIG.get("crawler", {}).get("memory_threshold_percent", 95.0) # % RAM refuse new browsers above this
IDLE_TTL = CONFIG.get("crawler", {}).get("pool", {}).get("idle_ttl_sec", 1800) # close if unused for 30min
# Config
MEM_LIMIT = CONFIG.get("crawler", {}).get("memory_threshold_percent", 95.0)
BASE_IDLE_TTL = CONFIG.get("crawler", {}).get("pool", {}).get("idle_ttl_sec", 300)
DEFAULT_CONFIG_SIG = None # Cached sig for default config
def _sig(cfg: BrowserConfig) -> str:
"""Generate config signature."""
payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
return hashlib.sha1(payload.encode()).hexdigest()
def _is_default_config(sig: str) -> bool:
"""Check if config matches default."""
return sig == DEFAULT_CONFIG_SIG
async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
try:
sig = _sig(cfg)
async with LOCK:
if sig in POOL:
LAST_USED[sig] = time.time();
return POOL[sig]
if psutil.virtual_memory().percent >= MEM_LIMIT:
raise MemoryError("RAM pressure new browser denied")
crawler = AsyncWebCrawler(config=cfg, thread_safe=False)
await crawler.start()
POOL[sig] = crawler; LAST_USED[sig] = time.time()
return crawler
except MemoryError as e:
raise MemoryError(f"RAM pressure new browser denied: {e}")
except Exception as e:
raise RuntimeError(f"Failed to start browser: {e}")
finally:
if sig in POOL:
LAST_USED[sig] = time.time()
else:
# If we failed to start the browser, we should remove it from the pool
POOL.pop(sig, None)
LAST_USED.pop(sig, None)
# If we failed to start the browser, we should remove it from the pool
async def close_all():
"""Get crawler from pool with tiered strategy."""
sig = _sig(cfg)
async with LOCK:
await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
POOL.clear(); LAST_USED.clear()
# Check permanent browser for default config
if PERMANENT and _is_default_config(sig):
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
logger.info("🔥 Using permanent browser")
return PERMANENT
# Check hot pool
if sig in HOT_POOL:
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
logger.info(f"♨️ Using hot pool browser (sig={sig[:8]})")
return HOT_POOL[sig]
# Check cold pool (promote to hot if used 3+ times)
if sig in COLD_POOL:
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = USAGE_COUNT.get(sig, 0) + 1
if USAGE_COUNT[sig] >= 3:
logger.info(f"⬆️ Promoting to hot pool (sig={sig[:8]}, count={USAGE_COUNT[sig]})")
HOT_POOL[sig] = COLD_POOL.pop(sig)
# Track promotion in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("promote", sig, {"count": USAGE_COUNT[sig]})
except:
pass
return HOT_POOL[sig]
logger.info(f"❄️ Using cold pool browser (sig={sig[:8]})")
return COLD_POOL[sig]
# Memory check before creating new
mem_pct = get_container_memory_percent()
if mem_pct >= MEM_LIMIT:
logger.error(f"💥 Memory pressure: {mem_pct:.1f}% >= {MEM_LIMIT}%")
raise MemoryError(f"Memory at {mem_pct:.1f}%, refusing new browser")
# Create new in cold pool
logger.info(f"🆕 Creating new browser in cold pool (sig={sig[:8]}, mem={mem_pct:.1f}%)")
crawler = AsyncWebCrawler(config=cfg, thread_safe=False)
await crawler.start()
COLD_POOL[sig] = crawler
LAST_USED[sig] = time.time()
USAGE_COUNT[sig] = 1
return crawler
async def init_permanent(cfg: BrowserConfig):
"""Initialize permanent default browser."""
global PERMANENT, DEFAULT_CONFIG_SIG
async with LOCK:
if PERMANENT:
return
DEFAULT_CONFIG_SIG = _sig(cfg)
logger.info("🔥 Creating permanent default browser")
PERMANENT = AsyncWebCrawler(config=cfg, thread_safe=False)
await PERMANENT.start()
LAST_USED[DEFAULT_CONFIG_SIG] = time.time()
USAGE_COUNT[DEFAULT_CONFIG_SIG] = 0
async def close_all():
"""Close all browsers."""
async with LOCK:
tasks = []
if PERMANENT:
tasks.append(PERMANENT.close())
tasks.extend([c.close() for c in HOT_POOL.values()])
tasks.extend([c.close() for c in COLD_POOL.values()])
await asyncio.gather(*tasks, return_exceptions=True)
HOT_POOL.clear()
COLD_POOL.clear()
LAST_USED.clear()
USAGE_COUNT.clear()
async def janitor():
"""Adaptive cleanup based on memory pressure."""
while True:
await asyncio.sleep(60)
mem_pct = get_container_memory_percent()
# Adaptive intervals and TTLs
if mem_pct > 80:
interval, cold_ttl, hot_ttl = 10, 30, 120
elif mem_pct > 60:
interval, cold_ttl, hot_ttl = 30, 60, 300
else:
interval, cold_ttl, hot_ttl = 60, BASE_IDLE_TTL, BASE_IDLE_TTL * 2
await asyncio.sleep(interval)
now = time.time()
async with LOCK:
for sig, crawler in list(POOL.items()):
if now - LAST_USED[sig] > IDLE_TTL:
with suppress(Exception): await crawler.close()
POOL.pop(sig, None); LAST_USED.pop(sig, None)
# Clean cold pool
for sig in list(COLD_POOL.keys()):
if now - LAST_USED.get(sig, now) > cold_ttl:
idle_time = now - LAST_USED[sig]
logger.info(f"🧹 Closing cold browser (sig={sig[:8]}, idle={idle_time:.0f}s)")
with suppress(Exception):
await COLD_POOL[sig].close()
COLD_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
# Track in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("close_cold", sig, {"idle_seconds": int(idle_time), "ttl": cold_ttl})
except:
pass
# Clean hot pool (more conservative)
for sig in list(HOT_POOL.keys()):
if now - LAST_USED.get(sig, now) > hot_ttl:
idle_time = now - LAST_USED[sig]
logger.info(f"🧹 Closing hot browser (sig={sig[:8]}, idle={idle_time:.0f}s)")
with suppress(Exception):
await HOT_POOL[sig].close()
HOT_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
# Track in monitor
try:
from monitor import get_monitor
await get_monitor().track_janitor_event("close_hot", sig, {"idle_seconds": int(idle_time), "ttl": hot_ttl})
except:
pass
# Log pool stats
if mem_pct > 60:
logger.info(f"📊 Pool: hot={len(HOT_POOL)}, cold={len(COLD_POOL)}, mem={mem_pct:.1f}%")

View File

@@ -117,18 +117,18 @@ class UserHookManager:
"""
try:
# Create a safe namespace for the hook
# Use a more complete builtins that includes __import__
# SECURITY: No __import__ to prevent arbitrary module imports (RCE risk)
import builtins
safe_builtins = {}
# Add safe built-in functions
# Add safe built-in functions (no __import__ for security)
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__import__', '__build_class__' # Required for exec
'__build_class__' # Required for class definitions in exec
]
for name in allowed_builtins:

382
deploy/docker/monitor.py Normal file
View File

@@ -0,0 +1,382 @@
# monitor.py - Real-time monitoring stats with Redis persistence
import time
import json
import asyncio
from typing import Dict, List, Optional
from datetime import datetime, timezone
from collections import deque
from redis import asyncio as aioredis
from utils import get_container_memory_percent
import psutil
import logging
logger = logging.getLogger(__name__)
class MonitorStats:
"""Tracks real-time server stats with Redis persistence."""
def __init__(self, redis: aioredis.Redis):
self.redis = redis
self.start_time = time.time()
# In-memory queues (fast reads, Redis backup)
self.active_requests: Dict[str, Dict] = {} # id -> request info
self.completed_requests: deque = deque(maxlen=100) # Last 100
self.janitor_events: deque = deque(maxlen=100)
self.errors: deque = deque(maxlen=100)
# Endpoint stats (persisted in Redis)
self.endpoint_stats: Dict[str, Dict] = {} # endpoint -> {count, total_time, errors, ...}
# Background persistence queue (max 10 pending persist requests)
self._persist_queue: asyncio.Queue = asyncio.Queue(maxsize=10)
self._persist_worker_task: Optional[asyncio.Task] = None
# Timeline data (5min window, 5s resolution = 60 points)
self.memory_timeline: deque = deque(maxlen=60)
self.requests_timeline: deque = deque(maxlen=60)
self.browser_timeline: deque = deque(maxlen=60)
async def track_request_start(self, request_id: str, endpoint: str, url: str, config: Dict = None):
"""Track new request start."""
req_info = {
"id": request_id,
"endpoint": endpoint,
"url": url[:100], # Truncate long URLs
"start_time": time.time(),
"config_sig": config.get("sig", "default") if config else "default",
"mem_start": psutil.Process().memory_info().rss / (1024 * 1024)
}
self.active_requests[request_id] = req_info
# Increment endpoint counter
if endpoint not in self.endpoint_stats:
self.endpoint_stats[endpoint] = {
"count": 0, "total_time": 0, "errors": 0,
"pool_hits": 0, "success": 0
}
self.endpoint_stats[endpoint]["count"] += 1
# Queue persistence (handled by background worker)
try:
self._persist_queue.put_nowait(True)
except asyncio.QueueFull:
logger.warning("Persistence queue full, skipping")
async def track_request_end(self, request_id: str, success: bool, error: str = None,
pool_hit: bool = True, status_code: int = 200):
"""Track request completion."""
if request_id not in self.active_requests:
return
req_info = self.active_requests.pop(request_id)
end_time = time.time()
elapsed = end_time - req_info["start_time"]
mem_end = psutil.Process().memory_info().rss / (1024 * 1024)
mem_delta = mem_end - req_info["mem_start"]
# Update stats
endpoint = req_info["endpoint"]
if endpoint in self.endpoint_stats:
self.endpoint_stats[endpoint]["total_time"] += elapsed
if success:
self.endpoint_stats[endpoint]["success"] += 1
else:
self.endpoint_stats[endpoint]["errors"] += 1
if pool_hit:
self.endpoint_stats[endpoint]["pool_hits"] += 1
# Add to completed queue
completed = {
**req_info,
"end_time": end_time,
"elapsed": round(elapsed, 2),
"mem_delta": round(mem_delta, 1),
"success": success,
"error": error,
"status_code": status_code,
"pool_hit": pool_hit
}
self.completed_requests.append(completed)
# Track errors
if not success and error:
self.errors.append({
"timestamp": end_time,
"endpoint": endpoint,
"url": req_info["url"],
"error": error,
"request_id": request_id
})
await self._persist_endpoint_stats()
async def track_janitor_event(self, event_type: str, sig: str, details: Dict):
"""Track janitor cleanup events."""
self.janitor_events.append({
"timestamp": time.time(),
"type": event_type, # "close_cold", "close_hot", "promote"
"sig": sig[:8],
"details": details
})
def _cleanup_old_entries(self, max_age_seconds: int = 300):
"""Remove entries older than max_age_seconds (default 5min)."""
now = time.time()
cutoff = now - max_age_seconds
# Clean completed requests
while self.completed_requests and self.completed_requests[0].get("end_time", 0) < cutoff:
self.completed_requests.popleft()
# Clean janitor events
while self.janitor_events and self.janitor_events[0].get("timestamp", 0) < cutoff:
self.janitor_events.popleft()
# Clean errors
while self.errors and self.errors[0].get("timestamp", 0) < cutoff:
self.errors.popleft()
async def update_timeline(self):
"""Update timeline data points (called every 5s)."""
now = time.time()
mem_pct = get_container_memory_percent()
# Clean old entries (keep last 5 minutes)
self._cleanup_old_entries(max_age_seconds=300)
# Count requests in last 5s
recent_reqs = sum(1 for req in self.completed_requests
if now - req.get("end_time", 0) < 5)
# Browser counts (acquire lock to prevent race conditions)
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LOCK
async with LOCK:
browser_count = {
"permanent": 1 if PERMANENT else 0,
"hot": len(HOT_POOL),
"cold": len(COLD_POOL)
}
self.memory_timeline.append({"time": now, "value": mem_pct})
self.requests_timeline.append({"time": now, "value": recent_reqs})
self.browser_timeline.append({"time": now, "browsers": browser_count})
async def _persist_endpoint_stats(self):
"""Persist endpoint stats to Redis."""
try:
await self.redis.set(
"monitor:endpoint_stats",
json.dumps(self.endpoint_stats),
ex=86400 # 24h TTL
)
except Exception as e:
logger.warning(f"Failed to persist endpoint stats: {e}")
async def _persistence_worker(self):
"""Background worker to persist stats to Redis."""
while True:
try:
await self._persist_queue.get()
await self._persist_endpoint_stats()
self._persist_queue.task_done()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Persistence worker error: {e}")
def start_persistence_worker(self):
"""Start the background persistence worker."""
if not self._persist_worker_task:
self._persist_worker_task = asyncio.create_task(self._persistence_worker())
logger.info("Started persistence worker")
async def stop_persistence_worker(self):
"""Stop the background persistence worker."""
if self._persist_worker_task:
self._persist_worker_task.cancel()
try:
await self._persist_worker_task
except asyncio.CancelledError:
pass
self._persist_worker_task = None
logger.info("Stopped persistence worker")
async def cleanup(self):
"""Cleanup on shutdown - persist final stats and stop workers."""
logger.info("Monitor cleanup starting...")
try:
# Persist final stats before shutdown
await self._persist_endpoint_stats()
# Stop background worker
await self.stop_persistence_worker()
logger.info("Monitor cleanup completed")
except Exception as e:
logger.error(f"Monitor cleanup error: {e}")
async def load_from_redis(self):
"""Load persisted stats from Redis."""
try:
data = await self.redis.get("monitor:endpoint_stats")
if data:
self.endpoint_stats = json.loads(data)
logger.info("Loaded endpoint stats from Redis")
except Exception as e:
logger.warning(f"Failed to load from Redis: {e}")
async def get_health_summary(self) -> Dict:
"""Get current system health snapshot."""
mem_pct = get_container_memory_percent()
cpu_pct = psutil.cpu_percent(interval=0.1)
# Network I/O (delta since last call)
net = psutil.net_io_counters()
# Pool status (acquire lock to prevent race conditions)
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LOCK
async with LOCK:
# TODO: Track actual browser process memory instead of estimates
# These are conservative estimates based on typical Chromium usage
permanent_mem = 270 if PERMANENT else 0 # Estimate: ~270MB for permanent browser
hot_mem = len(HOT_POOL) * 180 # Estimate: ~180MB per hot pool browser
cold_mem = len(COLD_POOL) * 180 # Estimate: ~180MB per cold pool browser
permanent_active = PERMANENT is not None
hot_count = len(HOT_POOL)
cold_count = len(COLD_POOL)
return {
"container": {
"memory_percent": round(mem_pct, 1),
"cpu_percent": round(cpu_pct, 1),
"network_sent_mb": round(net.bytes_sent / (1024**2), 2),
"network_recv_mb": round(net.bytes_recv / (1024**2), 2),
"uptime_seconds": int(time.time() - self.start_time)
},
"pool": {
"permanent": {"active": permanent_active, "memory_mb": permanent_mem},
"hot": {"count": hot_count, "memory_mb": hot_mem},
"cold": {"count": cold_count, "memory_mb": cold_mem},
"total_memory_mb": permanent_mem + hot_mem + cold_mem
},
"janitor": {
"next_cleanup_estimate": "adaptive", # Would need janitor state
"memory_pressure": "LOW" if mem_pct < 60 else "MEDIUM" if mem_pct < 80 else "HIGH"
}
}
def get_active_requests(self) -> List[Dict]:
"""Get list of currently active requests."""
now = time.time()
return [
{
**req,
"elapsed": round(now - req["start_time"], 1),
"status": "running"
}
for req in self.active_requests.values()
]
def get_completed_requests(self, limit: int = 50, filter_status: str = "all") -> List[Dict]:
"""Get recent completed requests."""
requests = list(self.completed_requests)[-limit:]
if filter_status == "success":
requests = [r for r in requests if r.get("success")]
elif filter_status == "error":
requests = [r for r in requests if not r.get("success")]
return requests
async def get_browser_list(self) -> List[Dict]:
"""Get detailed browser pool information."""
from crawler_pool import PERMANENT, HOT_POOL, COLD_POOL, LAST_USED, USAGE_COUNT, DEFAULT_CONFIG_SIG, LOCK
browsers = []
now = time.time()
# Acquire lock to prevent race conditions during iteration
async with LOCK:
if PERMANENT:
browsers.append({
"type": "permanent",
"sig": DEFAULT_CONFIG_SIG[:8] if DEFAULT_CONFIG_SIG else "unknown",
"age_seconds": int(now - self.start_time),
"last_used_seconds": int(now - LAST_USED.get(DEFAULT_CONFIG_SIG, now)),
"memory_mb": 270,
"hits": USAGE_COUNT.get(DEFAULT_CONFIG_SIG, 0),
"killable": False
})
for sig, crawler in HOT_POOL.items():
browsers.append({
"type": "hot",
"sig": sig[:8],
"age_seconds": int(now - self.start_time), # Approximation
"last_used_seconds": int(now - LAST_USED.get(sig, now)),
"memory_mb": 180, # Estimate
"hits": USAGE_COUNT.get(sig, 0),
"killable": True
})
for sig, crawler in COLD_POOL.items():
browsers.append({
"type": "cold",
"sig": sig[:8],
"age_seconds": int(now - self.start_time),
"last_used_seconds": int(now - LAST_USED.get(sig, now)),
"memory_mb": 180,
"hits": USAGE_COUNT.get(sig, 0),
"killable": True
})
return browsers
def get_endpoint_stats_summary(self) -> Dict[str, Dict]:
"""Get aggregated endpoint statistics."""
summary = {}
for endpoint, stats in self.endpoint_stats.items():
count = stats["count"]
avg_time = (stats["total_time"] / count) if count > 0 else 0
success_rate = (stats["success"] / count * 100) if count > 0 else 0
pool_hit_rate = (stats["pool_hits"] / count * 100) if count > 0 else 0
summary[endpoint] = {
"count": count,
"avg_latency_ms": round(avg_time * 1000, 1),
"success_rate_percent": round(success_rate, 1),
"pool_hit_rate_percent": round(pool_hit_rate, 1),
"errors": stats["errors"]
}
return summary
def get_timeline_data(self, metric: str, window: str = "5m") -> Dict:
"""Get timeline data for charts."""
# For now, only 5m window supported
if metric == "memory":
data = list(self.memory_timeline)
elif metric == "requests":
data = list(self.requests_timeline)
elif metric == "browsers":
data = list(self.browser_timeline)
else:
return {"timestamps": [], "values": []}
return {
"timestamps": [int(d["time"]) for d in data],
"values": [d.get("value", d.get("browsers")) for d in data]
}
def get_janitor_log(self, limit: int = 100) -> List[Dict]:
"""Get recent janitor events."""
return list(self.janitor_events)[-limit:]
def get_errors_log(self, limit: int = 100) -> List[Dict]:
"""Get recent errors."""
return list(self.errors)[-limit:]
# Global instance (initialized in server.py)
monitor_stats: Optional[MonitorStats] = None
def get_monitor() -> MonitorStats:
"""Get global monitor instance."""
if monitor_stats is None:
raise RuntimeError("Monitor not initialized")
return monitor_stats

View File

@@ -0,0 +1,405 @@
# monitor_routes.py - Monitor API endpoints
from fastapi import APIRouter, HTTPException, WebSocket, WebSocketDisconnect
from pydantic import BaseModel
from typing import Optional
from monitor import get_monitor
import logging
import asyncio
import json
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/monitor", tags=["monitor"])
@router.get("/health")
async def get_health():
"""Get current system health snapshot."""
try:
monitor = get_monitor()
return await monitor.get_health_summary()
except Exception as e:
logger.error(f"Error getting health: {e}")
raise HTTPException(500, str(e))
@router.get("/requests")
async def get_requests(status: str = "all", limit: int = 50):
"""Get active and completed requests.
Args:
status: Filter by 'active', 'completed', 'success', 'error', or 'all'
limit: Max number of completed requests to return (default 50)
"""
# Input validation
if status not in ["all", "active", "completed", "success", "error"]:
raise HTTPException(400, f"Invalid status: {status}. Must be one of: all, active, completed, success, error")
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
if status == "active":
return {"active": monitor.get_active_requests(), "completed": []}
elif status == "completed":
return {"active": [], "completed": monitor.get_completed_requests(limit)}
elif status in ["success", "error"]:
return {"active": [], "completed": monitor.get_completed_requests(limit, status)}
else: # "all"
return {
"active": monitor.get_active_requests(),
"completed": monitor.get_completed_requests(limit)
}
except Exception as e:
logger.error(f"Error getting requests: {e}")
raise HTTPException(500, str(e))
@router.get("/browsers")
async def get_browsers():
"""Get detailed browser pool information."""
try:
monitor = get_monitor()
browsers = await monitor.get_browser_list()
# Calculate summary stats
total_browsers = len(browsers)
total_memory = sum(b["memory_mb"] for b in browsers)
# Calculate reuse rate from recent requests
recent = monitor.get_completed_requests(100)
pool_hits = sum(1 for r in recent if r.get("pool_hit", False))
reuse_rate = (pool_hits / len(recent) * 100) if recent else 0
return {
"browsers": browsers,
"summary": {
"total_count": total_browsers,
"total_memory_mb": total_memory,
"reuse_rate_percent": round(reuse_rate, 1)
}
}
except Exception as e:
logger.error(f"Error getting browsers: {e}")
raise HTTPException(500, str(e))
@router.get("/endpoints/stats")
async def get_endpoint_stats():
"""Get aggregated endpoint statistics."""
try:
monitor = get_monitor()
return monitor.get_endpoint_stats_summary()
except Exception as e:
logger.error(f"Error getting endpoint stats: {e}")
raise HTTPException(500, str(e))
@router.get("/timeline")
async def get_timeline(metric: str = "memory", window: str = "5m"):
"""Get timeline data for charts.
Args:
metric: 'memory', 'requests', or 'browsers'
window: Time window (only '5m' supported for now)
"""
# Input validation
if metric not in ["memory", "requests", "browsers"]:
raise HTTPException(400, f"Invalid metric: {metric}. Must be one of: memory, requests, browsers")
if window != "5m":
raise HTTPException(400, f"Invalid window: {window}. Only '5m' is currently supported")
try:
monitor = get_monitor()
return monitor.get_timeline_data(metric, window)
except Exception as e:
logger.error(f"Error getting timeline: {e}")
raise HTTPException(500, str(e))
@router.get("/logs/janitor")
async def get_janitor_log(limit: int = 100):
"""Get recent janitor cleanup events."""
# Input validation
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
return {"events": monitor.get_janitor_log(limit)}
except Exception as e:
logger.error(f"Error getting janitor log: {e}")
raise HTTPException(500, str(e))
@router.get("/logs/errors")
async def get_errors_log(limit: int = 100):
"""Get recent errors."""
# Input validation
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}. Must be between 1 and 1000")
try:
monitor = get_monitor()
return {"errors": monitor.get_errors_log(limit)}
except Exception as e:
logger.error(f"Error getting errors log: {e}")
raise HTTPException(500, str(e))
# ========== Control Actions ==========
class KillBrowserRequest(BaseModel):
sig: str
@router.post("/actions/cleanup")
async def force_cleanup():
"""Force immediate janitor cleanup (kills idle cold pool browsers)."""
try:
from crawler_pool import COLD_POOL, LAST_USED, USAGE_COUNT, LOCK
import time
from contextlib import suppress
killed_count = 0
now = time.time()
async with LOCK:
for sig in list(COLD_POOL.keys()):
# Kill all cold pool browsers immediately
logger.info(f"🧹 Force cleanup: closing cold browser (sig={sig[:8]})")
with suppress(Exception):
await COLD_POOL[sig].close()
COLD_POOL.pop(sig, None)
LAST_USED.pop(sig, None)
USAGE_COUNT.pop(sig, None)
killed_count += 1
monitor = get_monitor()
await monitor.track_janitor_event("force_cleanup", "manual", {"killed": killed_count})
return {"success": True, "killed_browsers": killed_count}
except Exception as e:
logger.error(f"Error during force cleanup: {e}")
raise HTTPException(500, str(e))
@router.post("/actions/kill_browser")
async def kill_browser(req: KillBrowserRequest):
"""Kill a specific browser by signature (hot or cold only).
Args:
sig: Browser config signature (first 8 chars)
"""
try:
from crawler_pool import HOT_POOL, COLD_POOL, LAST_USED, USAGE_COUNT, LOCK, DEFAULT_CONFIG_SIG
from contextlib import suppress
# Find full signature matching prefix
target_sig = None
pool_type = None
async with LOCK:
# Check hot pool
for sig in HOT_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "hot"
break
# Check cold pool
if not target_sig:
for sig in COLD_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "cold"
break
# Check if trying to kill permanent
if DEFAULT_CONFIG_SIG and DEFAULT_CONFIG_SIG.startswith(req.sig):
raise HTTPException(403, "Cannot kill permanent browser. Use restart instead.")
if not target_sig:
raise HTTPException(404, f"Browser with sig={req.sig} not found")
# Warn if there are active requests (browser might be in use)
monitor = get_monitor()
active_count = len(monitor.get_active_requests())
if active_count > 0:
logger.warning(f"Killing browser {target_sig[:8]} while {active_count} requests are active - may cause failures")
# Kill the browser
if pool_type == "hot":
browser = HOT_POOL.pop(target_sig)
else:
browser = COLD_POOL.pop(target_sig)
with suppress(Exception):
await browser.close()
LAST_USED.pop(target_sig, None)
USAGE_COUNT.pop(target_sig, None)
logger.info(f"🔪 Killed {pool_type} browser (sig={target_sig[:8]})")
monitor = get_monitor()
await monitor.track_janitor_event("kill_browser", target_sig, {"pool": pool_type, "manual": True})
return {"success": True, "killed_sig": target_sig[:8], "pool_type": pool_type}
except HTTPException:
raise
except Exception as e:
logger.error(f"Error killing browser: {e}")
raise HTTPException(500, str(e))
@router.post("/actions/restart_browser")
async def restart_browser(req: KillBrowserRequest):
"""Restart a browser (kill + recreate). Works for permanent too.
Args:
sig: Browser config signature (first 8 chars), or "permanent"
"""
try:
from crawler_pool import (PERMANENT, HOT_POOL, COLD_POOL, LAST_USED,
USAGE_COUNT, LOCK, DEFAULT_CONFIG_SIG, init_permanent)
from crawl4ai import AsyncWebCrawler, BrowserConfig
from contextlib import suppress
import time
# Handle permanent browser restart
if req.sig == "permanent" or (DEFAULT_CONFIG_SIG and DEFAULT_CONFIG_SIG.startswith(req.sig)):
async with LOCK:
if PERMANENT:
with suppress(Exception):
await PERMANENT.close()
# Reinitialize permanent
from utils import load_config
config = load_config()
await init_permanent(BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
))
logger.info("🔄 Restarted permanent browser")
return {"success": True, "restarted": "permanent"}
# Handle hot/cold browser restart
target_sig = None
pool_type = None
browser_config = None
async with LOCK:
# Find browser
for sig in HOT_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "hot"
# Would need to reconstruct config (not stored currently)
break
if not target_sig:
for sig in COLD_POOL.keys():
if sig.startswith(req.sig):
target_sig = sig
pool_type = "cold"
break
if not target_sig:
raise HTTPException(404, f"Browser with sig={req.sig} not found")
# Kill existing
if pool_type == "hot":
browser = HOT_POOL.pop(target_sig)
else:
browser = COLD_POOL.pop(target_sig)
with suppress(Exception):
await browser.close()
# Note: We can't easily recreate with same config without storing it
# For now, just kill and let new requests create fresh ones
LAST_USED.pop(target_sig, None)
USAGE_COUNT.pop(target_sig, None)
logger.info(f"🔄 Restarted {pool_type} browser (sig={target_sig[:8]})")
monitor = get_monitor()
await monitor.track_janitor_event("restart_browser", target_sig, {"pool": pool_type})
return {"success": True, "restarted_sig": target_sig[:8], "note": "Browser will be recreated on next request"}
except HTTPException:
raise
except Exception as e:
logger.error(f"Error restarting browser: {e}")
raise HTTPException(500, str(e))
@router.post("/stats/reset")
async def reset_stats():
"""Reset today's endpoint counters."""
try:
monitor = get_monitor()
monitor.endpoint_stats.clear()
await monitor._persist_endpoint_stats()
return {"success": True, "message": "Endpoint stats reset"}
except Exception as e:
logger.error(f"Error resetting stats: {e}")
raise HTTPException(500, str(e))
@router.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
"""WebSocket endpoint for real-time monitoring updates.
Sends updates every 2 seconds with:
- Health stats
- Active/completed requests
- Browser pool status
- Timeline data
"""
await websocket.accept()
logger.info("WebSocket client connected")
try:
while True:
try:
# Gather all monitoring data
monitor = get_monitor()
data = {
"timestamp": asyncio.get_event_loop().time(),
"health": await monitor.get_health_summary(),
"requests": {
"active": monitor.get_active_requests(),
"completed": monitor.get_completed_requests(limit=10)
},
"browsers": await monitor.get_browser_list(),
"timeline": {
"memory": monitor.get_timeline_data("memory", "5m"),
"requests": monitor.get_timeline_data("requests", "5m"),
"browsers": monitor.get_timeline_data("browsers", "5m")
},
"janitor": monitor.get_janitor_log(limit=10),
"errors": monitor.get_errors_log(limit=10)
}
# Send update to client
await websocket.send_json(data)
# Wait 2 seconds before next update
await asyncio.sleep(2)
except WebSocketDisconnect:
logger.info("WebSocket client disconnected")
break
except Exception as e:
logger.error(f"WebSocket error: {e}", exc_info=True)
await asyncio.sleep(2) # Continue trying
except Exception as e:
logger.error(f"WebSocket connection error: {e}", exc_info=True)
finally:
logger.info("WebSocket connection closed")

View File

@@ -16,6 +16,7 @@ from fastapi import Request, Depends
from fastapi.responses import FileResponse
import base64
import re
import logging
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from api import (
handle_markdown_request, handle_llm_qa,
@@ -78,6 +79,18 @@ __version__ = "0.5.1-d1"
MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)
# ── security feature flags ───────────────────────────────────
# Hooks are disabled by default for security (RCE risk). Set to "true" to enable.
HOOKS_ENABLED = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
# ── default browser config helper ─────────────────────────────
def get_default_browser_config() -> BrowserConfig:
"""Get default BrowserConfig from config.yml."""
return BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
)
# import logging
# page_log = logging.getLogger("page_cap")
# orig_arun = AsyncWebCrawler.arun
@@ -103,15 +116,52 @@ AsyncWebCrawler.arun = capped_arun
@asynccontextmanager
async def lifespan(_: FastAPI):
await get_crawler(BrowserConfig(
from crawler_pool import init_permanent
from monitor import MonitorStats
import monitor as monitor_module
# Initialize monitor
monitor_module.monitor_stats = MonitorStats(redis)
await monitor_module.monitor_stats.load_from_redis()
monitor_module.monitor_stats.start_persistence_worker()
# Initialize browser pool
await init_permanent(BrowserConfig(
extra_args=config["crawler"]["browser"].get("extra_args", []),
**config["crawler"]["browser"].get("kwargs", {}),
)) # warmup
app.state.janitor = asyncio.create_task(janitor()) # idle GC
))
# Start background tasks
app.state.janitor = asyncio.create_task(janitor())
app.state.timeline_updater = asyncio.create_task(_timeline_updater())
yield
# Cleanup
app.state.janitor.cancel()
app.state.timeline_updater.cancel()
# Monitor cleanup (persist stats and stop workers)
from monitor import get_monitor
try:
await get_monitor().cleanup()
except Exception as e:
logger.error(f"Monitor cleanup failed: {e}")
await close_all()
async def _timeline_updater():
"""Update timeline data every 5 seconds."""
from monitor import get_monitor
while True:
await asyncio.sleep(5)
try:
await asyncio.wait_for(get_monitor().update_timeline(), timeout=4.0)
except asyncio.TimeoutError:
logger.warning("Timeline update timeout after 4s")
except Exception as e:
logger.warning(f"Timeline update error: {e}")
# ───────────────────── FastAPI instance ──────────────────────
app = FastAPI(
title=config["app"]["title"],
@@ -129,6 +179,25 @@ app.mount(
name="play",
)
# ── static monitor dashboard ────────────────────────────────
MONITOR_DIR = pathlib.Path(__file__).parent / "static" / "monitor"
if not MONITOR_DIR.exists():
raise RuntimeError(f"Monitor assets not found at {MONITOR_DIR}")
app.mount(
"/dashboard",
StaticFiles(directory=MONITOR_DIR, html=True),
name="monitor_ui",
)
# ── static assets (logo, etc) ────────────────────────────────
ASSETS_DIR = pathlib.Path(__file__).parent / "static" / "assets"
if ASSETS_DIR.exists():
app.mount(
"/static/assets",
StaticFiles(directory=ASSETS_DIR),
name="assets",
)
@app.get("/")
async def root():
@@ -171,6 +240,19 @@ async def add_security_headers(request: Request, call_next):
resp.headers.update(config["security"]["headers"])
return resp
# ───────────────── URL validation helper ─────────────────
ALLOWED_URL_SCHEMES = ("http://", "https://")
ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")
def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
"""Validate URL scheme to prevent file:// LFI attacks."""
allowed = ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else ALLOWED_URL_SCHEMES
if not url.startswith(allowed):
schemes = ", ".join(allowed)
raise HTTPException(400, f"URL must start with {schemes}")
# ───────────────── safe configdump helper ─────────────────
ALLOWED_TYPES = {
"CrawlerRunConfig": CrawlerRunConfig,
@@ -212,6 +294,12 @@ def _safe_eval_config(expr: str) -> dict:
# ── job router ──────────────────────────────────────────────
app.include_router(init_job_router(redis, config, token_dep))
# ── monitor router ──────────────────────────────────────────
from monitor_routes import router as monitor_router
app.include_router(monitor_router)
logger = logging.getLogger(__name__)
# ──────────────────────── Endpoints ──────────────────────────
@app.post("/token")
async def get_token(req: TokenRequest):
@@ -266,27 +354,21 @@ async def generate_html(
Crawls the URL, preprocesses the raw HTML for schema extraction, and returns the processed HTML.
Use when you need sanitized HTML structures for building schemas or further processing.
"""
validate_url_scheme(body.url, allow_raw=True)
from crawler_pool import get_crawler
cfg = CrawlerRunConfig()
try:
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
results = await crawler.arun(url=body.url, config=cfg)
# Check if the crawl was successful
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
if not results[0].success:
raise HTTPException(
status_code=500,
detail=results[0].error_message or "Crawl failed"
)
raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
raw_html = results[0].html
from crawl4ai.utils import preprocess_html_for_schema
processed_html = preprocess_html_for_schema(raw_html)
return JSONResponse({"html": processed_html, "url": body.url, "success": True})
except Exception as e:
# Log and raise as HTTP 500 for other exceptions
raise HTTPException(
status_code=500,
detail=str(e)
)
raise HTTPException(500, detail=str(e))
# Screenshot endpoint
@@ -304,16 +386,14 @@ async def generate_screenshot(
Use when you need an image snapshot of the rendered page. Its recommened to provide an output path to save the screenshot.
Then in result instead of the screenshot you will get a path to the saved file.
"""
validate_url_scheme(body.url)
from crawler_pool import get_crawler
try:
cfg = CrawlerRunConfig(
screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
results = await crawler.arun(url=body.url, config=cfg)
cfg = CrawlerRunConfig(screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
if not results[0].success:
raise HTTPException(
status_code=500,
detail=results[0].error_message or "Crawl failed"
)
raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
screenshot_data = results[0].screenshot
if body.output_path:
abs_path = os.path.abspath(body.output_path)
@@ -323,10 +403,7 @@ async def generate_screenshot(
return {"success": True, "path": abs_path}
return {"success": True, "screenshot": screenshot_data}
except Exception as e:
raise HTTPException(
status_code=500,
detail=str(e)
)
raise HTTPException(500, detail=str(e))
# PDF endpoint
@@ -344,15 +421,14 @@ async def generate_pdf(
Use when you need a printable or archivable snapshot of the page. It is recommended to provide an output path to save the PDF.
Then in result instead of the PDF you will get a path to the saved file.
"""
validate_url_scheme(body.url)
from crawler_pool import get_crawler
try:
cfg = CrawlerRunConfig(pdf=True)
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
results = await crawler.arun(url=body.url, config=cfg)
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
if not results[0].success:
raise HTTPException(
status_code=500,
detail=results[0].error_message or "Crawl failed"
)
raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
pdf_data = results[0].pdf
if body.output_path:
abs_path = os.path.abspath(body.output_path)
@@ -362,10 +438,7 @@ async def generate_pdf(
return {"success": True, "path": abs_path}
return {"success": True, "pdf": base64.b64encode(pdf_data).decode()}
except Exception as e:
raise HTTPException(
status_code=500,
detail=str(e)
)
raise HTTPException(500, detail=str(e))
@app.post("/execute_js")
@@ -421,23 +494,18 @@ async def execute_js(
```
"""
validate_url_scheme(body.url)
from crawler_pool import get_crawler
try:
cfg = CrawlerRunConfig(js_code=body.scripts)
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
results = await crawler.arun(url=body.url, config=cfg)
crawler = await get_crawler(get_default_browser_config())
results = await crawler.arun(url=body.url, config=cfg)
if not results[0].success:
raise HTTPException(
status_code=500,
detail=results[0].error_message or "Crawl failed"
)
# Return JSON-serializable dict of the first CrawlResult
raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
data = results[0].model_dump()
return JSONResponse(data)
except Exception as e:
raise HTTPException(
status_code=500,
detail=str(e)
)
raise HTTPException(500, detail=str(e))
@app.get("/llm/{url:path}")
@@ -553,6 +621,8 @@ async def crawl(
"""
if not crawl_request.urls:
raise HTTPException(400, "At least one URL required")
if crawl_request.hooks and not HOOKS_ENABLED:
raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")
# Check whether it is a redirection for a streaming request
crawler_config = CrawlerRunConfig.load(crawl_request.crawler_config)
if crawler_config.stream:
@@ -588,6 +658,8 @@ async def crawl_stream(
):
if not crawl_request.urls:
raise HTTPException(400, "At least one URL required")
if crawl_request.hooks and not HOOKS_ENABLED:
raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")
return await stream_process(crawl_request=crawl_request)

Binary file not shown.

After

Width:  |  Height:  |  Size: 5.8 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.6 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 11 KiB

File diff suppressed because it is too large Load Diff

View File

@@ -167,11 +167,14 @@
</a>
</h1>
<div class="ml-auto flex space-x-2">
<button id="play-tab"
class="px-3 py-1 rounded-t bg-surface border border-b-0 border-border text-primary">Playground</button>
<button id="stress-tab" class="px-3 py-1 rounded-t border border-border hover:bg-surface">Stress
Test</button>
<div class="ml-auto flex items-center space-x-4">
<a href="/dashboard" class="text-xs text-secondary hover:text-primary underline">Monitor</a>
<div class="flex space-x-2">
<button id="play-tab"
class="px-3 py-1 rounded-t bg-surface border border-b-0 border-border text-primary">Playground</button>
<button id="stress-tab" class="px-3 py-1 rounded-t border border-border hover:bg-surface">Stress
Test</button>
</div>
</div>
</header>

34
deploy/docker/test-websocket.py Executable file
View File

@@ -0,0 +1,34 @@
#!/usr/bin/env python3
"""
Quick WebSocket test - Connect to monitor WebSocket and print updates
"""
import asyncio
import websockets
import json
async def test_websocket():
uri = "ws://localhost:11235/monitor/ws"
print(f"Connecting to {uri}...")
try:
async with websockets.connect(uri) as websocket:
print("✅ Connected!")
# Receive and print 5 updates
for i in range(5):
message = await websocket.recv()
data = json.loads(message)
print(f"\n📊 Update #{i+1}:")
print(f" - Health: CPU {data['health']['container']['cpu_percent']}%, Memory {data['health']['container']['memory_percent']}%")
print(f" - Active Requests: {len(data['requests']['active'])}")
print(f" - Browsers: {len(data['browsers'])}")
except Exception as e:
print(f"❌ Error: {e}")
return 1
print("\n✅ WebSocket test passed!")
return 0
if __name__ == "__main__":
exit(asyncio.run(test_websocket()))

View File

@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
Monitor Dashboard Demo Script
Generates varied activity to showcase all monitoring features for video recording.
"""
import httpx
import asyncio
import time
from datetime import datetime
BASE_URL = "http://localhost:11235"
async def demo_dashboard():
print("🎬 Monitor Dashboard Demo - Starting...\n")
print(f"📊 Dashboard: {BASE_URL}/dashboard")
print("=" * 60)
async with httpx.AsyncClient(timeout=60.0) as client:
# Phase 1: Simple requests (permanent browser)
print("\n🔷 Phase 1: Testing permanent browser pool")
print("-" * 60)
for i in range(5):
print(f" {i+1}/5 Request to /crawl (default config)...")
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"], "crawler_config": {}}
)
print(f" ✅ Status: {r.status_code}, Time: {r.elapsed.total_seconds():.2f}s")
except Exception as e:
print(f" ❌ Error: {e}")
await asyncio.sleep(1) # Small delay between requests
# Phase 2: Create variant browsers (different configs)
print("\n🔶 Phase 2: Testing cold→hot pool promotion")
print("-" * 60)
viewports = [
{"width": 1920, "height": 1080},
{"width": 1280, "height": 720},
{"width": 800, "height": 600}
]
for idx, viewport in enumerate(viewports):
print(f" Viewport {viewport['width']}x{viewport['height']}:")
for i in range(4): # 4 requests each to trigger promotion at 3
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={
"urls": [f"https://httpbin.org/json?v={idx}&r={i}"],
"browser_config": {"viewport": viewport},
"crawler_config": {}
}
)
print(f" {i+1}/4 ✅ {r.status_code} - Should see cold→hot after 3 uses")
except Exception as e:
print(f" {i+1}/4 ❌ {e}")
await asyncio.sleep(0.5)
# Phase 3: Concurrent burst (stress pool)
print("\n🔷 Phase 3: Concurrent burst (10 parallel)")
print("-" * 60)
tasks = []
for i in range(10):
tasks.append(
client.post(
f"{BASE_URL}/crawl",
json={"urls": [f"https://httpbin.org/delay/2?burst={i}"], "crawler_config": {}}
)
)
print(" Sending 10 concurrent requests...")
start = time.time()
results = await asyncio.gather(*tasks, return_exceptions=True)
elapsed = time.time() - start
successes = sum(1 for r in results if not isinstance(r, Exception) and r.status_code == 200)
print(f"{successes}/10 succeeded in {elapsed:.2f}s")
# Phase 4: Multi-endpoint coverage
print("\n🔶 Phase 4: Testing multiple endpoints")
print("-" * 60)
endpoints = [
("/md", {"url": "https://httpbin.org/html", "f": "fit", "c": "0"}),
("/screenshot", {"url": "https://httpbin.org/html"}),
("/pdf", {"url": "https://httpbin.org/html"}),
]
for endpoint, payload in endpoints:
print(f" Testing {endpoint}...")
try:
if endpoint == "/md":
r = await client.post(f"{BASE_URL}{endpoint}", json=payload)
else:
r = await client.post(f"{BASE_URL}{endpoint}", json=payload)
print(f"{r.status_code}")
except Exception as e:
print(f"{e}")
await asyncio.sleep(1)
# Phase 5: Intentional error (to populate errors tab)
print("\n🔷 Phase 5: Generating error examples")
print("-" * 60)
print(" Triggering invalid URL error...")
try:
r = await client.post(
f"{BASE_URL}/crawl",
json={"urls": ["invalid://bad-url"], "crawler_config": {}}
)
print(f" Response: {r.status_code}")
except Exception as e:
print(f" ✅ Error captured: {type(e).__name__}")
# Phase 6: Wait for janitor activity
print("\n🔶 Phase 6: Waiting for janitor cleanup...")
print("-" * 60)
print(" Idle for 40s to allow janitor to clean cold pool browsers...")
for i in range(40, 0, -10):
print(f" {i}s remaining... (Check dashboard for cleanup events)")
await asyncio.sleep(10)
# Phase 7: Final stats check
print("\n🔷 Phase 7: Final dashboard state")
print("-" * 60)
r = await client.get(f"{BASE_URL}/monitor/health")
health = r.json()
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Browsers: Perm={health['pool']['permanent']['active']}, "
f"Hot={health['pool']['hot']['count']}, Cold={health['pool']['cold']['count']}")
r = await client.get(f"{BASE_URL}/monitor/endpoints/stats")
stats = r.json()
print(f"\n Endpoint Stats:")
for endpoint, data in stats.items():
print(f" {endpoint}: {data['count']} req, "
f"{data['avg_latency_ms']:.0f}ms avg, "
f"{data['success_rate_percent']:.1f}% success")
r = await client.get(f"{BASE_URL}/monitor/browsers")
browsers = r.json()
print(f"\n Pool Efficiency:")
print(f" Total browsers: {browsers['summary']['total_count']}")
print(f" Memory usage: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
print("\n" + "=" * 60)
print("✅ Demo complete! Dashboard is now populated with rich data.")
print(f"\n📹 Recording tip: Refresh {BASE_URL}/dashboard")
print(" You should see:")
print(" • Active & completed requests")
print(" • Browser pool (permanent + hot/cold)")
print(" • Janitor cleanup events")
print(" • Endpoint analytics")
print(" • Memory timeline")
if __name__ == "__main__":
try:
asyncio.run(demo_dashboard())
except KeyboardInterrupt:
print("\n\n⚠️ Demo interrupted by user")
except Exception as e:
print(f"\n\n❌ Demo failed: {e}")

View File

@@ -0,0 +1,2 @@
httpx>=0.25.0
docker>=7.0.0

View File

@@ -0,0 +1,196 @@
#!/usr/bin/env python3
"""
Security Integration Tests for Crawl4AI Docker API.
Tests that security fixes are working correctly against a running server.
Usage:
python run_security_tests.py [base_url]
Example:
python run_security_tests.py http://localhost:11235
"""
import subprocess
import sys
import re
# Colors for terminal output
GREEN = '\033[0;32m'
RED = '\033[0;31m'
YELLOW = '\033[1;33m'
NC = '\033[0m' # No Color
PASSED = 0
FAILED = 0
def run_curl(args: list) -> str:
"""Run curl command and return output."""
try:
result = subprocess.run(
['curl', '-s'] + args,
capture_output=True,
text=True,
timeout=30
)
return result.stdout + result.stderr
except subprocess.TimeoutExpired:
return "TIMEOUT"
except Exception as e:
return str(e)
def test_expect(name: str, expect_pattern: str, curl_args: list) -> bool:
"""Run a test and check if output matches expected pattern."""
global PASSED, FAILED
result = run_curl(curl_args)
if re.search(expect_pattern, result, re.IGNORECASE):
print(f"{GREEN}{NC} {name}")
PASSED += 1
return True
else:
print(f"{RED}{NC} {name}")
print(f" Expected pattern: {expect_pattern}")
print(f" Got: {result[:200]}")
FAILED += 1
return False
def main():
global PASSED, FAILED
base_url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:11235"
print("=" * 60)
print("Crawl4AI Security Integration Tests")
print(f"Target: {base_url}")
print("=" * 60)
print()
# Check server availability
print("Checking server availability...")
result = run_curl(['-o', '/dev/null', '-w', '%{http_code}', f'{base_url}/health'])
if '200' not in result:
print(f"{RED}ERROR: Server not reachable at {base_url}{NC}")
print("Please start the server first.")
sys.exit(1)
print(f"{GREEN}Server is running{NC}")
print()
# === Part A: Security Tests ===
print("=== Part A: Security Tests ===")
print("(Vulnerabilities must be BLOCKED)")
print()
test_expect(
"A1: Hooks disabled by default (403)",
r"403|disabled|Hooks are disabled",
['-X', 'POST', f'{base_url}/crawl',
'-H', 'Content-Type: application/json',
'-d', '{"urls":["https://example.com"],"hooks":{"code":{"on_page_context_created":"async def hook(page, context, **kwargs): return page"}}}']
)
test_expect(
"A2: file:// blocked on /execute_js (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd","scripts":["1"]}']
)
test_expect(
"A3: file:// blocked on /screenshot (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/screenshot',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
test_expect(
"A4: file:// blocked on /pdf (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/pdf',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
test_expect(
"A5: file:// blocked on /html (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/html',
'-H', 'Content-Type: application/json',
'-d', '{"url":"file:///etc/passwd"}']
)
print()
# === Part B: Functionality Tests ===
print("=== Part B: Functionality Tests ===")
print("(Normal operations must WORK)")
print()
test_expect(
"B1: Basic crawl works",
r"success.*true|results",
['-X', 'POST', f'{base_url}/crawl',
'-H', 'Content-Type: application/json',
'-d', '{"urls":["https://example.com"]}']
)
test_expect(
"B2: /md works with https://",
r"success.*true|markdown",
['-X', 'POST', f'{base_url}/md',
'-H', 'Content-Type: application/json',
'-d', '{"url":"https://example.com"}']
)
test_expect(
"B3: Health endpoint works",
r"ok",
[f'{base_url}/health']
)
print()
# === Part C: Edge Cases ===
print("=== Part C: Edge Cases ===")
print("(Malformed input must be REJECTED)")
print()
test_expect(
"C1: javascript: URL rejected (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"javascript:alert(1)","scripts":["1"]}']
)
test_expect(
"C2: data: URL rejected (400)",
r"400|must start with",
['-X', 'POST', f'{base_url}/execute_js',
'-H', 'Content-Type: application/json',
'-d', '{"url":"data:text/html,<h1>test</h1>","scripts":["1"]}']
)
print()
print("=" * 60)
print("Results")
print("=" * 60)
print(f"Passed: {GREEN}{PASSED}{NC}")
print(f"Failed: {RED}{FAILED}{NC}")
print()
if FAILED > 0:
print(f"{RED}SOME TESTS FAILED{NC}")
sys.exit(1)
else:
print(f"{GREEN}ALL TESTS PASSED{NC}")
sys.exit(0)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,138 @@
#!/usr/bin/env python3
"""
Test 1: Basic Container Health + Single Endpoint
- Starts container
- Hits /health endpoint 10 times
- Reports success rate and basic latency
"""
import asyncio
import time
import docker
import httpx
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 10
async def test_endpoint(url: str, count: int):
"""Hit endpoint multiple times, return stats."""
results = []
async with httpx.AsyncClient(timeout=30.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.get(url)
elapsed = (time.time() - start) * 1000 # ms
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
"status": resp.status_code
})
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({
"success": False,
"latency_ms": None,
"error": str(e)
})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container, return container object."""
# Clean up existing
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container '{name}'...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container '{name}' from image '{image}'...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
environment={"PYTHON_ENV": "production"}
)
# Wait for health
print(f"⏳ Waiting for container to be healthy...")
for _ in range(30): # 30s timeout
time.sleep(1)
container.reload()
if container.status == "running":
try:
# Quick health check
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop and remove container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
print(f"✅ Container removed")
async def main():
print("="*60)
print("TEST 1: Basic Container Health + Single Endpoint")
print("="*60)
client = docker.from_env()
container = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Test /health endpoint
print(f"\n📊 Testing /health endpoint ({REQUESTS} requests)...")
url = f"http://localhost:{PORT}/health"
results = await test_endpoint(url, REQUESTS)
# Calculate stats
successes = sum(1 for r in results if r["success"])
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if r["latency_ms"] is not None]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
if latencies:
print(f" Min Latency: {min(latencies):.0f}ms")
print(f" Max Latency: {max(latencies):.0f}ms")
print(f"{'='*60}")
# Pass/Fail
if success_rate >= 100:
print(f"✅ TEST PASSED")
return 0
else:
print(f"❌ TEST FAILED (expected 100% success rate)")
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
return 1
finally:
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""
Test 2: Docker Stats Monitoring
- Extends Test 1 with real-time container stats
- Monitors memory % and CPU during requests
- Reports baseline, peak, and final memory
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 20 # More requests to see memory usage
# Stats tracking
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background thread to collect container stats."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
# Extract memory stats
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024) # MB
mem_limit = stat['memory_stats'].get('limit', 1) / (1024 * 1024)
mem_percent = (mem_usage / mem_limit * 100) if mem_limit > 0 else 0
# Extract CPU stats (handle missing fields on Mac)
cpu_percent = 0
try:
cpu_delta = stat['cpu_stats']['cpu_usage']['total_usage'] - \
stat['precpu_stats']['cpu_usage']['total_usage']
system_delta = stat['cpu_stats'].get('system_cpu_usage', 0) - \
stat['precpu_stats'].get('system_cpu_usage', 0)
if system_delta > 0:
num_cpus = stat['cpu_stats'].get('online_cpus', 1)
cpu_percent = (cpu_delta / system_delta * num_cpus * 100.0)
except (KeyError, ZeroDivisionError):
pass
stats_history.append({
'timestamp': time.time(),
'memory_mb': mem_usage,
'memory_percent': mem_percent,
'cpu_percent': cpu_percent
})
except Exception as e:
# Skip malformed stats
pass
time.sleep(0.5) # Sample every 500ms
async def test_endpoint(url: str, count: int):
"""Hit endpoint, return stats."""
results = []
async with httpx.AsyncClient(timeout=30.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.get(url)
elapsed = (time.time() - start) * 1000
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
})
if (i + 1) % 5 == 0: # Print every 5 requests
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({"success": False, "error": str(e)})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container '{name}'...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container '{name}'...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
mem_limit="4g", # Set explicit memory limit
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
async def main():
print("="*60)
print("TEST 2: Docker Stats Monitoring")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Start stats monitoring in background
print(f"\n📊 Starting stats monitor...")
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
# Wait a bit for baseline
await asyncio.sleep(2)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline memory: {baseline_mem:.1f} MB")
# Test /health endpoint
print(f"\n🔄 Running {REQUESTS} requests to /health...")
url = f"http://localhost:{PORT}/health"
results = await test_endpoint(url, REQUESTS)
# Wait a bit to capture peak
await asyncio.sleep(1)
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Calculate stats
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Memory stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
mem_delta = final_mem - baseline_mem
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
print(f"\n Memory Stats:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {mem_delta:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
if success_rate >= 100 and mem_delta < 100: # No significant memory growth
print(f"✅ TEST PASSED")
return 0
else:
if success_rate < 100:
print(f"❌ TEST FAILED (success rate < 100%)")
if mem_delta >= 100:
print(f"⚠️ WARNING: Memory grew by {mem_delta:.1f} MB")
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
return 1
finally:
stop_monitoring.set()
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Test 3: Pool Validation - Permanent Browser Reuse
- Tests /html endpoint (should use permanent browser)
- Monitors container logs for pool hit markers
- Validates browser reuse rate
- Checks memory after browser creation
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS = 30
# Stats tracking
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({
'timestamp': time.time(),
'memory_mb': mem_usage,
})
except:
pass
time.sleep(0.5)
def count_log_markers(container):
"""Extract pool usage markers from logs."""
logs = container.logs().decode('utf-8')
permanent_hits = logs.count("🔥 Using permanent browser")
hot_hits = logs.count("♨️ Using hot pool browser")
cold_hits = logs.count("❄️ Using cold pool browser")
new_created = logs.count("🆕 Creating new browser")
return {
'permanent_hits': permanent_hits,
'hot_hits': hot_hits,
'cold_hits': cold_hits,
'new_created': new_created,
'total_hits': permanent_hits + hot_hits + cold_hits
}
async def test_endpoint(url: str, count: int):
"""Hit endpoint multiple times."""
results = []
async with httpx.AsyncClient(timeout=60.0) as client:
for i in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"})
elapsed = (time.time() - start) * 1000
results.append({
"success": resp.status_code == 200,
"latency_ms": elapsed,
})
if (i + 1) % 10 == 0:
print(f" [{i+1}/{count}] ✓ {resp.status_code} - {elapsed:.0f}ms")
except Exception as e:
results.append({"success": False, "error": str(e)})
print(f" [{i+1}/{count}] ✗ Error: {e}")
return results
def start_container(client, image: str, name: str, port: int):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image,
name=name,
ports={f"{port}/tcp": port},
detach=True,
shm_size="1g",
mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
resp = requests.get(f"http://localhost:{port}/health", timeout=2)
if resp.status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
def stop_container(container):
"""Stop container."""
print(f"🛑 Stopping container...")
container.stop()
container.remove()
async def main():
print("="*60)
print("TEST 3: Pool Validation - Permanent Browser Reuse")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
# Start container
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
# Wait for permanent browser initialization
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start stats monitoring
print(f"📊 Starting stats monitor...")
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline (with permanent browser): {baseline_mem:.1f} MB")
# Test /html endpoint (uses permanent browser for default config)
print(f"\n🔄 Running {REQUESTS} requests to /html...")
url = f"http://localhost:{PORT}/html"
results = await test_endpoint(url, REQUESTS)
# Wait a bit
await asyncio.sleep(1)
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze logs for pool markers
print(f"\n📋 Analyzing pool usage...")
pool_stats = count_log_markers(container)
# Calculate request stats
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_latency = sum(latencies) / len(latencies) if latencies else 0
# Memory stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
mem_delta = final_mem - baseline_mem
# Calculate reuse rate
total_requests = len(results)
total_pool_hits = pool_stats['total_hits']
reuse_rate = (total_pool_hits / total_requests * 100) if total_requests > 0 else 0
# Print results
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_latency:.0f}ms")
print(f"\n Pool Stats:")
print(f" 🔥 Permanent Hits: {pool_stats['permanent_hits']}")
print(f" ♨️ Hot Pool Hits: {pool_stats['hot_hits']}")
print(f" ❄️ Cold Pool Hits: {pool_stats['cold_hits']}")
print(f" 🆕 New Created: {pool_stats['new_created']}")
print(f" 📊 Reuse Rate: {reuse_rate:.1f}%")
print(f"\n Memory Stats:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {mem_delta:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
if success_rate < 100:
print(f"❌ FAIL: Success rate {success_rate:.1f}% < 100%")
passed = False
if reuse_rate < 80:
print(f"❌ FAIL: Reuse rate {reuse_rate:.1f}% < 80% (expected high permanent browser usage)")
passed = False
if pool_stats['permanent_hits'] < (total_requests * 0.8):
print(f"⚠️ WARNING: Only {pool_stats['permanent_hits']} permanent hits out of {total_requests} requests")
if mem_delta > 200:
print(f"⚠️ WARNING: Memory grew by {mem_delta:.1f} MB (possible browser leak)")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
stop_container(container)
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""
Test 4: Concurrent Load Testing
- Tests pool under concurrent load
- Escalates: 10 → 50 → 100 concurrent requests
- Validates latency distribution (P50, P95, P99)
- Monitors memory stability
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
from collections import defaultdict
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
LOAD_LEVELS = [
{"name": "Light", "concurrent": 10, "requests": 20},
{"name": "Medium", "concurrent": 50, "requests": 100},
{"name": "Heavy", "concurrent": 100, "requests": 200},
]
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
def count_log_markers(container):
"""Extract pool markers."""
logs = container.logs().decode('utf-8')
return {
'permanent': logs.count("🔥 Using permanent browser"),
'hot': logs.count("♨️ Using hot pool browser"),
'cold': logs.count("❄️ Using cold pool browser"),
'new': logs.count("🆕 Creating new browser"),
}
async def hit_endpoint(client, url, payload, semaphore):
"""Single request with concurrency control."""
async with semaphore:
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=60.0)
elapsed = (time.time() - start) * 1000
return {"success": resp.status_code == 200, "latency_ms": elapsed}
except Exception as e:
return {"success": False, "error": str(e)}
async def run_concurrent_test(url, payload, concurrent, total_requests):
"""Run concurrent requests."""
semaphore = asyncio.Semaphore(concurrent)
async with httpx.AsyncClient() as client:
tasks = [hit_endpoint(client, url, payload, semaphore) for _ in range(total_requests)]
results = await asyncio.gather(*tasks)
return results
def calculate_percentiles(latencies):
"""Calculate P50, P95, P99."""
if not latencies:
return 0, 0, 0
sorted_lat = sorted(latencies)
n = len(sorted_lat)
return (
sorted_lat[int(n * 0.50)],
sorted_lat[int(n * 0.95)],
sorted_lat[int(n * 0.99)],
)
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 4: Concurrent Load Testing")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
url = f"http://localhost:{PORT}/html"
payload = {"url": "https://httpbin.org/html"}
all_results = []
level_stats = []
# Run load levels
for level in LOAD_LEVELS:
print(f"{'='*60}")
print(f"🔄 {level['name']} Load: {level['concurrent']} concurrent, {level['requests']} total")
print(f"{'='*60}")
start_time = time.time()
results = await run_concurrent_test(url, payload, level['concurrent'], level['requests'])
duration = time.time() - start_time
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
p50, p95, p99 = calculate_percentiles(latencies)
avg_lat = sum(latencies) / len(latencies) if latencies else 0
print(f" Duration: {duration:.1f}s")
print(f" Success: {success_rate:.1f}% ({successes}/{len(results)})")
print(f" Avg Latency: {avg_lat:.0f}ms")
print(f" P50/P95/P99: {p50:.0f}ms / {p95:.0f}ms / {p99:.0f}ms")
level_stats.append({
'name': level['name'],
'concurrent': level['concurrent'],
'success_rate': success_rate,
'avg_latency': avg_lat,
'p50': p50, 'p95': p95, 'p99': p99,
})
all_results.extend(results)
await asyncio.sleep(2) # Cool down between levels
# Stop monitoring
await asyncio.sleep(1)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Final stats
pool_stats = count_log_markers(container)
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"FINAL RESULTS:")
print(f"{'='*60}")
print(f" Total Requests: {len(all_results)}")
print(f"\n Pool Utilization:")
print(f" 🔥 Permanent: {pool_stats['permanent']}")
print(f" ♨️ Hot: {pool_stats['hot']}")
print(f" ❄️ Cold: {pool_stats['cold']}")
print(f" 🆕 New: {pool_stats['new']}")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
for ls in level_stats:
if ls['success_rate'] < 99:
print(f"❌ FAIL: {ls['name']} success rate {ls['success_rate']:.1f}% < 99%")
passed = False
if ls['p99'] > 10000: # 10s threshold
print(f"⚠️ WARNING: {ls['name']} P99 latency {ls['p99']:.0f}ms very high")
if final_mem - baseline_mem > 300:
print(f"⚠️ WARNING: Memory grew {final_mem - baseline_mem:.1f} MB")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,267 @@
#!/usr/bin/env python3
"""
Test 5: Pool Stress - Mixed Configs
- Tests hot/cold pool with different browser configs
- Uses different viewports to create config variants
- Validates cold → hot promotion after 3 uses
- Monitors pool tier distribution
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
import random
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS_PER_CONFIG = 5 # 5 requests per config variant
# Different viewport configs to test pool tiers
VIEWPORT_CONFIGS = [
None, # Default (permanent browser)
{"width": 1920, "height": 1080}, # Desktop
{"width": 1024, "height": 768}, # Tablet
{"width": 375, "height": 667}, # Mobile
]
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
def analyze_pool_logs(container):
"""Extract detailed pool stats from logs."""
logs = container.logs().decode('utf-8')
permanent = logs.count("🔥 Using permanent browser")
hot = logs.count("♨️ Using hot pool browser")
cold = logs.count("❄️ Using cold pool browser")
new = logs.count("🆕 Creating new browser")
promotions = logs.count("⬆️ Promoting to hot pool")
return {
'permanent': permanent,
'hot': hot,
'cold': cold,
'new': new,
'promotions': promotions,
'total': permanent + hot + cold
}
async def crawl_with_viewport(client, url, viewport):
"""Single request with specific viewport."""
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {}
}
# Add viewport if specified
if viewport:
payload["browser_config"] = {
"type": "BrowserConfig",
"params": {
"viewport": {"type": "dict", "value": viewport},
"headless": True,
"text_mode": True,
"extra_args": [
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-gpu",
"--disable-software-rasterizer",
"--disable-web-security",
"--allow-insecure-localhost",
"--ignore-certificate-errors"
]
}
}
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=60.0)
elapsed = (time.time() - start) * 1000
return {"success": resp.status_code == 200, "latency_ms": elapsed, "viewport": viewport}
except Exception as e:
return {"success": False, "error": str(e), "viewport": viewport}
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 5: Pool Stress - Mixed Configs")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
url = f"http://localhost:{PORT}/crawl"
print(f"Testing {len(VIEWPORT_CONFIGS)} different configs:")
for i, vp in enumerate(VIEWPORT_CONFIGS):
vp_str = "Default" if vp is None else f"{vp['width']}x{vp['height']}"
print(f" {i+1}. {vp_str}")
print()
# Run requests: repeat each config REQUESTS_PER_CONFIG times
all_results = []
config_sequence = []
for _ in range(REQUESTS_PER_CONFIG):
for viewport in VIEWPORT_CONFIGS:
config_sequence.append(viewport)
# Shuffle to mix configs
random.shuffle(config_sequence)
print(f"🔄 Running {len(config_sequence)} requests with mixed configs...")
async with httpx.AsyncClient() as http_client:
for i, viewport in enumerate(config_sequence):
result = await crawl_with_viewport(http_client, url, viewport)
all_results.append(result)
if (i + 1) % 5 == 0:
vp_str = "default" if result['viewport'] is None else f"{result['viewport']['width']}x{result['viewport']['height']}"
status = "" if result.get('success') else ""
lat = f"{result.get('latency_ms', 0):.0f}ms" if 'latency_ms' in result else "error"
print(f" [{i+1}/{len(config_sequence)}] {status} {vp_str} - {lat}")
# Stop monitoring
await asyncio.sleep(2)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze results
pool_stats = analyze_pool_logs(container)
successes = sum(1 for r in all_results if r.get("success"))
success_rate = (successes / len(all_results)) * 100
latencies = [r["latency_ms"] for r in all_results if "latency_ms" in r]
avg_lat = sum(latencies) / len(latencies) if latencies else 0
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
print(f" Requests: {len(all_results)}")
print(f" Success Rate: {success_rate:.1f}% ({successes}/{len(all_results)})")
print(f" Avg Latency: {avg_lat:.0f}ms")
print(f"\n Pool Statistics:")
print(f" 🔥 Permanent: {pool_stats['permanent']}")
print(f" ♨️ Hot: {pool_stats['hot']}")
print(f" ❄️ Cold: {pool_stats['cold']}")
print(f" 🆕 New: {pool_stats['new']}")
print(f" ⬆️ Promotions: {pool_stats['promotions']}")
print(f" 📊 Reuse: {(pool_stats['total'] / len(all_results) * 100):.1f}%")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
if success_rate < 99:
print(f"❌ FAIL: Success rate {success_rate:.1f}% < 99%")
passed = False
# Should see promotions since we repeat each config 5 times
if pool_stats['promotions'] < (len(VIEWPORT_CONFIGS) - 1): # -1 for default
print(f"⚠️ WARNING: Only {pool_stats['promotions']} promotions (expected ~{len(VIEWPORT_CONFIGS)-1})")
# Should have created some browsers for different configs
if pool_stats['new'] == 0:
print(f"⚠️ NOTE: No new browsers created (all used default?)")
if pool_stats['permanent'] == len(all_results):
print(f"⚠️ NOTE: All requests used permanent browser (configs not varying enough?)")
if final_mem - baseline_mem > 500:
print(f"⚠️ WARNING: Memory grew {final_mem - baseline_mem:.1f} MB")
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Test 6: Multi-Endpoint Testing
- Tests multiple endpoints together: /html, /screenshot, /pdf, /crawl
- Validates each endpoint works correctly
- Monitors success rates per endpoint
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
REQUESTS_PER_ENDPOINT = 10
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(0.5)
async def test_html(client, base_url, count):
"""Test /html endpoint."""
url = f"{base_url}/html"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_screenshot(client, base_url, count):
"""Test /screenshot endpoint."""
url = f"{base_url}/screenshot"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_pdf(client, base_url, count):
"""Test /pdf endpoint."""
url = f"{base_url}/pdf"
results = []
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json={"url": "https://httpbin.org/html"}, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
async def test_crawl(client, base_url, count):
"""Test /crawl endpoint."""
url = f"{base_url}/crawl"
results = []
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {}
}
for _ in range(count):
start = time.time()
try:
resp = await client.post(url, json=payload, timeout=30.0)
elapsed = (time.time() - start) * 1000
results.append({"success": resp.status_code == 200, "latency_ms": elapsed})
except Exception as e:
results.append({"success": False, "error": str(e)})
return results
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 6: Multi-Endpoint Testing")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(1)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
base_url = f"http://localhost:{PORT}"
# Test each endpoint
endpoints = {
"/html": test_html,
"/screenshot": test_screenshot,
"/pdf": test_pdf,
"/crawl": test_crawl,
}
all_endpoint_stats = {}
async with httpx.AsyncClient() as http_client:
for endpoint_name, test_func in endpoints.items():
print(f"🔄 Testing {endpoint_name} ({REQUESTS_PER_ENDPOINT} requests)...")
results = await test_func(http_client, base_url, REQUESTS_PER_ENDPOINT)
successes = sum(1 for r in results if r.get("success"))
success_rate = (successes / len(results)) * 100
latencies = [r["latency_ms"] for r in results if "latency_ms" in r]
avg_lat = sum(latencies) / len(latencies) if latencies else 0
all_endpoint_stats[endpoint_name] = {
'success_rate': success_rate,
'avg_latency': avg_lat,
'total': len(results),
'successes': successes
}
print(f" ✓ Success: {success_rate:.1f}% ({successes}/{len(results)}), Avg: {avg_lat:.0f}ms")
# Stop monitoring
await asyncio.sleep(1)
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Final stats
memory_samples = [s['memory_mb'] for s in stats_history]
peak_mem = max(memory_samples) if memory_samples else 0
final_mem = memory_samples[-1] if memory_samples else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
for endpoint, stats in all_endpoint_stats.items():
print(f" {endpoint:12} Success: {stats['success_rate']:5.1f}% Avg: {stats['avg_latency']:6.0f}ms")
print(f"\n Memory:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB")
print(f" Final: {final_mem:.1f} MB")
print(f" Delta: {final_mem - baseline_mem:+.1f} MB")
print(f"{'='*60}")
# Pass/Fail
passed = True
for endpoint, stats in all_endpoint_stats.items():
if stats['success_rate'] < 100:
print(f"❌ FAIL: {endpoint} success rate {stats['success_rate']:.1f}% < 100%")
passed = False
if passed:
print(f"✅ TEST PASSED")
return 0
else:
return 1
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,199 @@
#!/usr/bin/env python3
"""
Test 7: Cleanup Verification (Janitor)
- Creates load spike then goes idle
- Verifies memory returns to near baseline
- Tests janitor cleanup of idle browsers
- Monitors memory recovery time
"""
import asyncio
import time
import docker
import httpx
from threading import Thread, Event
# Config
IMAGE = "crawl4ai-local:latest"
CONTAINER_NAME = "crawl4ai-test"
PORT = 11235
SPIKE_REQUESTS = 20 # Create some browsers
IDLE_TIME = 90 # Wait 90s for janitor (runs every 60s)
# Stats
stats_history = []
stop_monitoring = Event()
def monitor_stats(container):
"""Background stats collector."""
for stat in container.stats(decode=True, stream=True):
if stop_monitoring.is_set():
break
try:
mem_usage = stat['memory_stats'].get('usage', 0) / (1024 * 1024)
stats_history.append({'timestamp': time.time(), 'memory_mb': mem_usage})
except:
pass
time.sleep(1) # Sample every 1s for this test
def start_container(client, image, name, port):
"""Start container."""
try:
old = client.containers.get(name)
print(f"🧹 Stopping existing container...")
old.stop()
old.remove()
except docker.errors.NotFound:
pass
print(f"🚀 Starting container...")
container = client.containers.run(
image, name=name, ports={f"{port}/tcp": port},
detach=True, shm_size="1g", mem_limit="4g",
)
print(f"⏳ Waiting for health...")
for _ in range(30):
time.sleep(1)
container.reload()
if container.status == "running":
try:
import requests
if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
print(f"✅ Container healthy!")
return container
except:
pass
raise TimeoutError("Container failed to start")
async def main():
print("="*60)
print("TEST 7: Cleanup Verification (Janitor)")
print("="*60)
client = docker.from_env()
container = None
monitor_thread = None
try:
container = start_container(client, IMAGE, CONTAINER_NAME, PORT)
print(f"\n⏳ Waiting for permanent browser init (3s)...")
await asyncio.sleep(3)
# Start monitoring
stop_monitoring.clear()
stats_history.clear()
monitor_thread = Thread(target=monitor_stats, args=(container,), daemon=True)
monitor_thread.start()
await asyncio.sleep(2)
baseline_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f"📏 Baseline: {baseline_mem:.1f} MB\n")
# Create load spike with different configs to populate pool
print(f"🔥 Creating load spike ({SPIKE_REQUESTS} requests with varied configs)...")
url = f"http://localhost:{PORT}/crawl"
viewports = [
{"width": 1920, "height": 1080},
{"width": 1024, "height": 768},
{"width": 375, "height": 667},
]
async with httpx.AsyncClient(timeout=60.0) as http_client:
tasks = []
for i in range(SPIKE_REQUESTS):
vp = viewports[i % len(viewports)]
payload = {
"urls": ["https://httpbin.org/html"],
"browser_config": {
"type": "BrowserConfig",
"params": {
"viewport": {"type": "dict", "value": vp},
"headless": True,
"text_mode": True,
"extra_args": [
"--no-sandbox", "--disable-dev-shm-usage",
"--disable-gpu", "--disable-software-rasterizer",
"--disable-web-security", "--allow-insecure-localhost",
"--ignore-certificate-errors"
]
}
},
"crawler_config": {}
}
tasks.append(http_client.post(url, json=payload))
results = await asyncio.gather(*tasks, return_exceptions=True)
successes = sum(1 for r in results if hasattr(r, 'status_code') and r.status_code == 200)
print(f" ✓ Spike completed: {successes}/{len(results)} successful")
# Measure peak
await asyncio.sleep(2)
peak_mem = max([s['memory_mb'] for s in stats_history]) if stats_history else baseline_mem
print(f" 📊 Peak memory: {peak_mem:.1f} MB (+{peak_mem - baseline_mem:.1f} MB)")
# Now go idle and wait for janitor
print(f"\n⏸️ Going idle for {IDLE_TIME}s (janitor cleanup)...")
print(f" (Janitor runs every 60s, checking for idle browsers)")
for elapsed in range(0, IDLE_TIME, 10):
await asyncio.sleep(10)
current_mem = stats_history[-1]['memory_mb'] if stats_history else 0
print(f" [{elapsed+10:3d}s] Memory: {current_mem:.1f} MB")
# Stop monitoring
stop_monitoring.set()
if monitor_thread:
monitor_thread.join(timeout=2)
# Analyze memory recovery
final_mem = stats_history[-1]['memory_mb'] if stats_history else 0
recovery_mb = peak_mem - final_mem
recovery_pct = (recovery_mb / (peak_mem - baseline_mem) * 100) if (peak_mem - baseline_mem) > 0 else 0
print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"{'='*60}")
print(f" Memory Journey:")
print(f" Baseline: {baseline_mem:.1f} MB")
print(f" Peak: {peak_mem:.1f} MB (+{peak_mem - baseline_mem:.1f} MB)")
print(f" Final: {final_mem:.1f} MB (+{final_mem - baseline_mem:.1f} MB)")
print(f" Recovered: {recovery_mb:.1f} MB ({recovery_pct:.1f}%)")
print(f"{'='*60}")
# Pass/Fail
passed = True
# Should have created some memory pressure
if peak_mem - baseline_mem < 100:
print(f"⚠️ WARNING: Peak increase only {peak_mem - baseline_mem:.1f} MB (expected more browsers)")
# Should recover most memory (within 100MB of baseline)
if final_mem - baseline_mem > 100:
print(f"⚠️ WARNING: Memory didn't recover well (still +{final_mem - baseline_mem:.1f} MB above baseline)")
else:
print(f"✅ Good memory recovery!")
# Baseline + 50MB tolerance
if final_mem - baseline_mem < 50:
print(f"✅ Excellent cleanup (within 50MB of baseline)")
print(f"✅ TEST PASSED")
return 0
except Exception as e:
print(f"\n❌ TEST ERROR: {e}")
import traceback
traceback.print_exc()
return 1
finally:
stop_monitoring.set()
if container:
print(f"🛑 Stopping container...")
container.stop()
container.remove()
if __name__ == "__main__":
exit_code = asyncio.run(main())
exit(exit_code)

View File

@@ -0,0 +1,57 @@
#!/usr/bin/env python3
"""Quick test to generate monitor dashboard activity"""
import httpx
import asyncio
async def test_dashboard():
async with httpx.AsyncClient(timeout=30.0) as client:
print("📊 Generating dashboard activity...")
# Test 1: Simple crawl
print("\n1⃣ Running simple crawl...")
r1 = await client.post(
"http://localhost:11235/crawl",
json={"urls": ["https://httpbin.org/html"], "crawler_config": {}}
)
print(f" Status: {r1.status_code}")
# Test 2: Multiple URLs
print("\n2⃣ Running multi-URL crawl...")
r2 = await client.post(
"http://localhost:11235/crawl",
json={
"urls": [
"https://httpbin.org/html",
"https://httpbin.org/json"
],
"crawler_config": {}
}
)
print(f" Status: {r2.status_code}")
# Test 3: Check monitor health
print("\n3⃣ Checking monitor health...")
r3 = await client.get("http://localhost:11235/monitor/health")
health = r3.json()
print(f" Memory: {health['container']['memory_percent']}%")
print(f" Browsers: {health['pool']['permanent']['active']}")
# Test 4: Check requests
print("\n4⃣ Checking request log...")
r4 = await client.get("http://localhost:11235/monitor/requests")
reqs = r4.json()
print(f" Active: {len(reqs['active'])}")
print(f" Completed: {len(reqs['completed'])}")
# Test 5: Check endpoint stats
print("\n5⃣ Checking endpoint stats...")
r5 = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = r5.json()
for endpoint, data in stats.items():
print(f" {endpoint}: {data['count']} requests, {data['avg_latency_ms']}ms avg")
print("\n✅ Dashboard should now show activity!")
print(f"\n🌐 Open: http://localhost:11235/dashboard")
if __name__ == "__main__":
asyncio.run(test_dashboard())

View File

@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""
Unit tests for security fixes.
These tests verify the security fixes at the code level without needing a running server.
"""
import sys
import os
# Add parent directory to path to import modules
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import unittest
class TestURLValidation(unittest.TestCase):
"""Test URL scheme validation helper."""
def setUp(self):
"""Set up test fixtures."""
# Import the validation constants and function
self.ALLOWED_URL_SCHEMES = ("http://", "https://")
self.ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")
def validate_url_scheme(self, url: str, allow_raw: bool = False) -> bool:
"""Local version of validate_url_scheme for testing."""
allowed = self.ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else self.ALLOWED_URL_SCHEMES
return url.startswith(allowed)
# === SECURITY TESTS: These URLs must be BLOCKED ===
def test_file_url_blocked(self):
"""file:// URLs must be blocked (LFI vulnerability)."""
self.assertFalse(self.validate_url_scheme("file:///etc/passwd"))
self.assertFalse(self.validate_url_scheme("file:///etc/passwd", allow_raw=True))
def test_file_url_blocked_windows(self):
"""file:// URLs with Windows paths must be blocked."""
self.assertFalse(self.validate_url_scheme("file:///C:/Windows/System32/config/sam"))
def test_javascript_url_blocked(self):
"""javascript: URLs must be blocked (XSS)."""
self.assertFalse(self.validate_url_scheme("javascript:alert(1)"))
def test_data_url_blocked(self):
"""data: URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("data:text/html,<script>alert(1)</script>"))
def test_ftp_url_blocked(self):
"""ftp: URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("ftp://example.com/file"))
def test_empty_url_blocked(self):
"""Empty URLs must be blocked."""
self.assertFalse(self.validate_url_scheme(""))
def test_relative_url_blocked(self):
"""Relative URLs must be blocked."""
self.assertFalse(self.validate_url_scheme("/etc/passwd"))
self.assertFalse(self.validate_url_scheme("../../../etc/passwd"))
# === FUNCTIONALITY TESTS: These URLs must be ALLOWED ===
def test_http_url_allowed(self):
"""http:// URLs must be allowed."""
self.assertTrue(self.validate_url_scheme("http://example.com"))
self.assertTrue(self.validate_url_scheme("http://localhost:8080"))
def test_https_url_allowed(self):
"""https:// URLs must be allowed."""
self.assertTrue(self.validate_url_scheme("https://example.com"))
self.assertTrue(self.validate_url_scheme("https://example.com/path?query=1"))
def test_raw_url_allowed_when_enabled(self):
"""raw: URLs must be allowed when allow_raw=True."""
self.assertTrue(self.validate_url_scheme("raw:<html></html>", allow_raw=True))
self.assertTrue(self.validate_url_scheme("raw://<html></html>", allow_raw=True))
def test_raw_url_blocked_when_disabled(self):
"""raw: URLs must be blocked when allow_raw=False."""
self.assertFalse(self.validate_url_scheme("raw:<html></html>", allow_raw=False))
self.assertFalse(self.validate_url_scheme("raw://<html></html>", allow_raw=False))
class TestHookBuiltins(unittest.TestCase):
"""Test that dangerous builtins are removed from hooks."""
def test_import_not_in_allowed_builtins(self):
"""__import__ must NOT be in allowed_builtins."""
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__build_class__' # Required for class definitions in exec
]
self.assertNotIn('__import__', allowed_builtins)
self.assertNotIn('eval', allowed_builtins)
self.assertNotIn('exec', allowed_builtins)
self.assertNotIn('compile', allowed_builtins)
self.assertNotIn('open', allowed_builtins)
def test_build_class_in_allowed_builtins(self):
"""__build_class__ must be in allowed_builtins (needed for class definitions)."""
allowed_builtins = [
'print', 'len', 'str', 'int', 'float', 'bool',
'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
'__build_class__'
]
self.assertIn('__build_class__', allowed_builtins)
class TestHooksEnabled(unittest.TestCase):
"""Test HOOKS_ENABLED environment variable logic."""
def test_hooks_disabled_by_default(self):
"""Hooks must be disabled by default."""
# Simulate the default behavior
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
# Clear any existing env var to test default
original = os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
try:
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertFalse(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
def test_hooks_enabled_when_true(self):
"""Hooks must be enabled when CRAWL4AI_HOOKS_ENABLED=true."""
original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
try:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = "true"
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertTrue(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
else:
os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
def test_hooks_disabled_when_false(self):
"""Hooks must be disabled when CRAWL4AI_HOOKS_ENABLED=false."""
original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
try:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = "false"
hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
self.assertFalse(hooks_enabled)
finally:
if original is not None:
os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
else:
os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
if __name__ == '__main__':
print("=" * 60)
print("Crawl4AI Security Fixes - Unit Tests")
print("=" * 60)
print()
# Run tests with verbosity
unittest.main(verbosity=2)

View File

@@ -178,4 +178,29 @@ def verify_email_domain(email: str) -> bool:
records = dns.resolver.resolve(domain, 'MX')
return True if records else False
except Exception as e:
return False
return False
def get_container_memory_percent() -> float:
"""Get actual container memory usage vs limit (cgroup v1/v2 aware)."""
try:
# Try cgroup v2 first
usage_path = Path("/sys/fs/cgroup/memory.current")
limit_path = Path("/sys/fs/cgroup/memory.max")
if not usage_path.exists():
# Fall back to cgroup v1
usage_path = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")
limit_path = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = int(usage_path.read_text())
limit = int(limit_path.read_text())
# Handle unlimited (v2: "max", v1: > 1e18)
if limit > 1e18:
import psutil
limit = psutil.virtual_memory().total
return (usage / limit) * 100
except:
# Non-container or unsupported: fallback to host
import psutil
return psutil.virtual_memory().percent

View File

@@ -6,15 +6,16 @@ x-base-config: &base-config
- "11235:11235" # Gunicorn port
env_file:
- .llm.env # API keys (create from .llm.env.example)
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
- GROQ_API_KEY=${GROQ_API_KEY:-}
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
- LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
# Uncomment to set default environment variables (will overwrite .llm.env)
# environment:
# - OPENAI_API_KEY=${OPENAI_API_KEY:-}
# - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
# - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
# - GROQ_API_KEY=${GROQ_API_KEY:-}
# - TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
# - MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
# - GEMINI_API_KEY=${GEMINI_API_KEY:-}
# - LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
volumes:
- /dev/shm:/dev/shm # Chromium performance
deploy:

View File

@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes
**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate
---
## Highlights
- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below
---
## Breaking Changes
### 1. Docker API: Hooks Disabled by Default
**What changed**: Hooks are now disabled by default on the Docker API.
**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked
**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
**Who is affected**: Users who were reading local files via the Docker API.
**Migration**: Use the Python library directly for local file processing:
```python
# Instead of API call with file:// URL, use library:
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="file:///path/to/file.html")
```
---
## Security Fixes
### Critical: Remote Code Execution via Hooks (CVE Pending)
**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter
**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
### High: Local File Inclusion via file:// URLs (CVE Pending)
**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
### Credits
Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
---
## New Features
### 1. init_scripts Support for BrowserConfig
Pre-page-load JavaScript injection for stealth evasions.
```python
config = BrowserConfig(
init_scripts=[
"Object.defineProperty(navigator, 'webdriver', {get: () => false})"
]
)
```
### 2. CDP Connection Improvements
- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies
All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Resume from checkpoint
on_state_change=save_callback # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs
Generate PDFs and MHTML from cached HTML content.
### 5. Screenshots for raw:/file:// URLs
Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig
Proper URL resolution for raw: HTML processing:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
Fast link extraction without full page processing:
```python
config = CrawlerRunConfig(prefetch=True)
```
### 8. Proxy Rotation and Configuration
Enhanced proxy rotation with sticky sessions support.
### 9. Proxy Support for HTTP Strategy
Non-browser crawler now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs
New `process_in_browser` parameter for browser operations on local content:
```python
config = CrawlerRunConfig(
process_in_browser=True, # Force browser processing
screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder
Intelligent cache invalidation for sitemaps:
```python
config = SeedingConfig(
cache_ttl_hours=24,
validate_sitemap_lastmod=True
)
```
---
## Bug Fixes
### raw: URL Parsing Truncates at # Character
**Problem**: CSS color codes like `#eee` were being truncated.
**Before**: `raw:body{background:#eee}``body{background:`
**After**: `raw:body{background:#eee}``body{background:#eee}`
### Caching System Improvements
Various fixes to cache validation and persistence.
---
## Documentation Updates
- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)
---
## Upgrade Guide
### From v0.7.x to v0.8.0
1. **Update the package**:
```bash
pip install --upgrade crawl4ai
```
2. **Docker API users**:
- Hooks are now disabled by default
- If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
- `file://` URLs no longer work on API (use library directly)
3. **Review security settings**:
```yaml
# config.yml - recommended for production
security:
enabled: true
jwt_enabled: true
```
4. **Test your integration** before deploying to production
### Breaking Change Checklist
- [ ] Check if you use `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration
---
## Full Changelog
See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
---
## Contributors
Thanks to all contributors who made this release possible.
Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
---
*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*

626
docs/blog/release-v0.7.7.md Normal file
View File

@@ -0,0 +1,626 @@
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
*November 14, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
## 🎯 What's New at a Glance
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
- **📈 Production Metrics**: 6 critical metrics for operational excellence
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
## 📊 Real-time Monitoring Dashboard: Complete Visibility
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
### The Self-Hosting Value Proposition
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
- **🔒 Data Privacy**: Your data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing or rate limits
- **🎯 Full Customization**: Complete control over configurations and strategies
- **📊 Complete Transparency**: Real-time visibility into every aspect
- **⚡ Performance**: Direct access without network overhead
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
### Interactive Monitoring Dashboard
Access the dashboard at `http://localhost:11235/dashboard` to see:
- **System Health Overview**: CPU, memory, network, and uptime in real-time
- **Live Request Tracking**: Active and completed requests with full details
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
- **Janitor Events Log**: Automatic cleanup activities
- **Error Monitoring**: Full context error logs
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
## 🔌 Monitor API: Programmatic Access
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
### System Health Endpoint
```python
import httpx
import asyncio
async def monitor_system_health():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/health")
health = response.json()
print(f"Container Metrics:")
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Uptime: {health['container']['uptime_seconds']}s")
print(f"\nBrowser Pool:")
print(f" Permanent: {health['pool']['permanent']['active']} active")
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
print(f"\nStatistics:")
print(f" Total Requests: {health['stats']['total_requests']}")
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
asyncio.run(monitor_system_health())
```
### Request Tracking
```python
async def track_requests():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/requests")
requests_data = response.json()
print(f"Active Requests: {len(requests_data['active'])}")
print(f"Completed Requests: {len(requests_data['completed'])}")
# See details of recent requests
for req in requests_data['completed'][:5]:
status_icon = "" if req['success'] else ""
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
```
### Browser Pool Management
```python
async def monitor_browser_pool():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"Pool Summary:")
print(f" Total Browsers: {browsers['summary']['total_count']}")
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
# List all browsers
for browser in browsers['permanent']:
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
```
### Endpoint Performance Statistics
```python
async def get_endpoint_stats():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = response.json()
print("Endpoint Analytics:")
for endpoint, data in stats.items():
print(f" {endpoint}:")
print(f" Requests: {data['count']}")
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
```
### Complete API Reference
The Monitor API includes these endpoints:
- `GET /monitor/health` - System health with pool statistics
- `GET /monitor/requests` - Active and completed request tracking
- `GET /monitor/browsers` - Browser pool details and efficiency
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
- `GET /monitor/logs/errors?limit=10` - Error logs with context
- `POST /monitor/actions/cleanup` - Force immediate cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset accumulated statistics
## ⚡ WebSocket Streaming: Real-time Updates
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
### WebSocket Integration Example
```python
import websockets
import json
import asyncio
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to real-time monitoring stream")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Access all monitoring data
print(f"\n--- Update at {update['timestamp']} ---")
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
if update['errors']:
print(f"⚠️ Recent Errors: {len(update['errors'])}")
asyncio.run(monitor_realtime())
```
**Expected Real-World Impact:**
- **Custom Dashboards**: Build tailored monitoring UIs for your team
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
- **Integration**: Feed live data into monitoring tools like Grafana
- **Automation**: React to events in real-time without polling
## 🔥 Smart Browser Pool: 3-Tier Architecture
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
### How It Works
```python
import httpx
async def demonstrate_browser_pool():
async with httpx.AsyncClient() as client:
# Request 1-3: Default config → Uses permanent browser
print("Phase 1: Using permanent browser")
for i in range(3):
await client.post(
"http://localhost:11235/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"]}
)
print(f" Request {i+1}: Reused permanent browser")
# Request 4-6: Custom viewport → Cold pool (first use)
print("\nPhase 2: Custom config creates cold pool browser")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
"http://localhost:11235/crawl",
json={
"urls": [f"https://httpbin.org/json?v={i}"],
"browser_config": viewport_config
}
)
if i < 2:
print(f" Request {i+1}: Cold pool browser")
else:
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
# Check pool status
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"\nPool Status:")
print(f" Permanent: {len(browsers['permanent'])} (always active)")
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
print(f" Cold: {len(browsers['cold'])} (on-demand)")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
asyncio.run(demonstrate_browser_pool())
```
**Pool Tiers:**
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
**Expected Real-World Impact:**
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
- **Performance**: Instant access to frequently-used configurations
- **Automatic Optimization**: Pool adapts to your usage patterns
- **Resource Management**: Janitor automatically cleans up idle browsers
## 🧹 Janitor System: Automatic Cleanup
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
```python
async def monitor_janitor_activity():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
logs = response.json()
print("Recent Cleanup Activities:")
for log in logs:
print(f" {log['timestamp']}: {log['message']}")
# Example output:
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
```
## 🎮 Control Actions: Manual Management
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
**My Solution:** Manual control actions via the API for operational troubleshooting.
### Force Cleanup
```python
async def force_cleanup():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
result = response.json()
print(f"Cleanup completed:")
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
```
### Kill Specific Browser
```python
async def kill_stuck_browser(browser_id: str):
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11235/monitor/actions/kill_browser",
json={"browser_id": browser_id}
)
if response.status_code == 200:
print(f"✅ Browser {browser_id} killed successfully")
```
### Reset Statistics
```python
async def reset_stats():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/stats/reset")
print("📊 Statistics reset for fresh monitoring")
```
## 📈 Production Integration Patterns
### Prometheus Integration
```python
# Export metrics for Prometheus scraping
async def export_prometheus_metrics():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Export in Prometheus format
metrics = f"""
# HELP crawl4ai_memory_usage_percent Memory usage percentage
# TYPE crawl4ai_memory_usage_percent gauge
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
# HELP crawl4ai_request_success_rate Request success rate
# TYPE crawl4ai_request_success_rate gauge
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
# HELP crawl4ai_browser_pool_count Total browsers in pool
# TYPE crawl4ai_browser_pool_count gauge
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
"""
return metrics
```
### Alerting Example
```python
async def check_alerts():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Memory alert
if data['container']['memory_percent'] > 80:
print("🚨 ALERT: Memory usage above 80%")
# Trigger cleanup
await client.post("http://localhost:11235/monitor/actions/cleanup")
# Success rate alert
if data['stats']['success_rate_percent'] < 90:
print("🚨 ALERT: Success rate below 90%")
# Check error logs
errors = await client.get("http://localhost:11235/monitor/logs/errors")
print(f"Recent errors: {len(errors.json())}")
# Latency alert
if data['stats']['avg_latency_ms'] > 5000:
print("🚨 ALERT: Average latency above 5s")
```
### Key Metrics to Track
```python
CRITICAL_METRICS = {
"memory_usage": {
"current": "container.memory_percent",
"target": "<80%",
"alert_threshold": ">80%",
"action": "Force cleanup or scale"
},
"success_rate": {
"current": "stats.success_rate_percent",
"target": ">95%",
"alert_threshold": "<90%",
"action": "Check error logs"
},
"avg_latency": {
"current": "stats.avg_latency_ms",
"target": "<2000ms",
"alert_threshold": ">5000ms",
"action": "Investigate slow requests"
},
"browser_reuse_rate": {
"current": "browsers.summary.reuse_rate_percent",
"target": ">80%",
"alert_threshold": "<60%",
"action": "Check pool configuration"
},
"total_browsers": {
"current": "browsers.summary.total_count",
"target": "<15",
"alert_threshold": ">20",
"action": "Check for browser leaks"
},
"error_frequency": {
"current": "len(errors)",
"target": "<5/hour",
"alert_threshold": ">10/hour",
"action": "Review error patterns"
}
}
```
## 🐛 Critical Bug Fixes
This release includes significant bug fixes that improve stability and performance:
### Async LLM Extraction (#1590)
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
```python
# Before v0.7.7: Sequential processing
# After v0.7.7: True parallel processing
async with AsyncWebCrawler() as crawler:
urls = ["url1", "url2", "url3", "url4"]
# Now processes truly in parallel with LLM extraction
results = await crawler.arun_many(
urls,
config=CrawlerRunConfig(
extraction_strategy=LLMExtractionStrategy(...)
)
)
# 4x faster for parallel LLM extraction!
```
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
### DFS Deep Crawling (#1607)
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
### Browser & Crawler Config Documentation (#1609)
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
### Sitemap Seeder (#1598)
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
### Remove Overlay Elements (#1529)
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
**The Fix:** Fixed by properly calling the injected JavaScript function.
### Viewport Configuration (#1495)
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
**The Fix:** Added proper viewport size configuration support for browser launch.
### Managed Browser CDP Timing (#1528)
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
### Security Updates
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
- Added verification tests for the security update
### Docker Fixes
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
### Other Important Fixes
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
## 📊 Expected Real-World Impact
### For DevOps & Infrastructure Teams
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
- **Proactive Monitoring**: Catch issues before they become problems
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
- **Operational Control**: Manual intervention when automated systems need help
### For Production Deployments
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
- **Debugging**: Real-time logs and error tracking
- **Capacity Planning**: Historical metrics for scaling decisions
- **SLA Monitoring**: Track success rates and latency against targets
### For Development Teams
- **Local Monitoring**: Understand crawler behavior during development
- **Performance Testing**: Measure impact of configuration changes
- **Troubleshooting**: Quickly identify and fix issues
- **Learning**: See exactly how the browser pool works
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- All existing Docker configurations continue to work
- No API changes to existing endpoints
- Monitoring is additive functionality
- No migration required
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.7
# Or use the latest tag
docker pull unclecode/crawl4ai:latest
# Run with monitoring enabled (default)
docker run -d \
-p 11235:11235 \
--shm-size=1g \
--name crawl4ai \
unclecode/crawl4ai:0.7.7
# Access the monitoring dashboard
open http://localhost:11235/dashboard
```
### Python Package
```bash
# Upgrade to latest version
pip install --upgrade crawl4ai
# Or install specific version
pip install crawl4ai==0.7.7
```
## 🎬 Try the Demo
Run the comprehensive demo that showcases all monitoring features:
```bash
python docs/releases_review/demo_v0.7.7.py
```
**The demo includes:**
1. System health overview with live metrics
2. Request tracking with active/completed monitoring
3. Browser pool management (permanent/hot/cold)
4. Complete Monitor API endpoint examples
5. WebSocket streaming demonstration
6. Control actions (cleanup, kill, restart)
7. Production metrics and alerting patterns
8. Self-hosting value proposition
## 📚 Documentation
### New Documentation
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
### Updated Documentation
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
- Added comprehensive monitoring sections
- Production integration patterns
- WebSocket streaming examples
## 💡 Pro Tips
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
7. **Check janitor logs** - Understand automatic cleanup patterns
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
## 🙏 Acknowledgments
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
## 📞 Support & Resources
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
---
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
**Happy crawling with full visibility!** 🕷️📊
*- unclecode*

327
docs/blog/release-v0.7.8.md Normal file
View File

@@ -0,0 +1,327 @@
# Crawl4AI v0.7.8: Stability & Bug Fix Release
*December 2025*
---
I'm releasing Crawl4AI v0.7.8—a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
## What's Fixed at a Glance
- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of hardcoded mock data
## Bug Fixes
### Docker & API Fixes
#### ContentRelevanceFilter Deserialization (#1642)
**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.
**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.
```python
# This now works correctly in Docker API
import httpx
request = {
"urls": ["https://docs.example.com"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 2,
"filter_chain": [
{
"type": "ContentRelevanceFilter",
"query": "API documentation",
"threshold": 0.3
}
]
}
}
}
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/crawl", json=request)
# Previously failed, now works!
```
#### ProxyConfig JSON Serialization (#1629)
**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.
```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="user",
password="pass"
)
config = BrowserConfig(headless=True, proxy_config=proxy)
# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict) # Valid JSON
```
#### Docker Cache Folder Permissions (#1638)
**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.
**The Fix:** Corrected ownership and permissions during image build.
```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
--shm-size=1g \
-v ./my-cache:/app/.cache \
unclecode/crawl4ai:0.7.8
```
---
### LLM & Extraction Fixes
#### Configurable Rate Limiter Backoff (#1269)
**The Problem:** The LLM rate limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.
**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.
```python
from crawl4ai import LLMConfig
# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2
# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5 seconds on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```
#### LLM Strategy HTML Input Support (#1178)
**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with HTML structure preserved.
**The Fix:** Added `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.
```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract product information"
)
# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract the data table preserving structure",
input_format="html"
)
# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Summarize the main content",
input_format="fit_markdown"
)
```
#### Raw HTML URL Variable (#1116)
**The Problem:** When using `url="raw:<html>..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.
**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
html = "<html><body><h1>Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html}",
config=CrawlerRunConfig(extraction_strategy=my_strategy)
)
# extraction_strategy receives url="Raw HTML" instead of the HTML blob
```
---
### URL Handling Fix
#### Relative URLs After Redirects (#1268)
**The Problem:** When JavaScript caused a page redirect, relative links were resolved against the original URL instead of the final URL.
**The Fix:** `redirected_url` now captures the actual page URL after all JavaScript execution completes.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# Page at /old-page redirects via JS to /new-page
result = await crawler.arun(url="https://example.com/old-page")
# BEFORE: redirected_url = "https://example.com/old-page"
# AFTER: redirected_url = "https://example.com/new-page"
# Links are now correctly resolved against the final URL
for link in result.links['internal']:
print(link['href']) # Relative links resolved correctly
```
---
### Dependency & Compatibility Fixes
#### PyPDF2 Replaced with pypdf (#1412)
**The Problem:** PyPDF2 was deprecated in 2022 and is no longer maintained.
**The Fix:** Replaced with the actively maintained `pypdf` library.
```python
# Installation (unchanged)
pip install crawl4ai[pdf]
# The PDF processor now uses pypdf internally
# No code changes required - API remains the same
```
#### Pydantic v2 ConfigDict Compatibility (#678)
**The Problem:** Using the deprecated `class Config` syntax caused deprecation warnings with Pydantic v2.
**The Fix:** Migrated to `model_config = ConfigDict(...)` syntax.
```python
# No more deprecation warnings when importing crawl4ai models
from crawl4ai.models import CrawlResult
from crawl4ai import CrawlerRunConfig, BrowserConfig
# All models are now Pydantic v2 compatible
```
---
### AdaptiveCrawler Fix
#### Query Expansion Using LLM (#1621)
**The Problem:** The `EmbeddingStrategy` in AdaptiveCrawler had commented-out LLM code and was using hardcoded mock query variations instead.
**The Fix:** Uncommented and activated the LLM call for actual query expansion.
```python
# AdaptiveCrawler query expansion now actually uses the LLM
# Instead of hardcoded variations like:
# variations = {'queries': ['what are the best vegetables...']}
# The LLM generates relevant query variations based on your actual query
```
---
### Code Formatting Fix
#### Import Statement Formatting (#1181)
**The Problem:** When extracting code from web pages, import statements were sometimes concatenated without proper line separation.
**The Fix:** Import statements now maintain proper newline separation.
```python
# BEFORE: "import osimport sysfrom pathlib import Path"
# AFTER:
# import os
# import sys
# from pathlib import Path
```
---
## Breaking Changes
**None!** This release is fully backward compatible.
- All existing code continues to work without modification
- New parameters have sensible defaults matching previous behavior
- No API changes to existing functionality
---
## Upgrade Instructions
### Python Package
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.7.8
```
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.8
# Run
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
```
---
## Verification
Run the verification tests to confirm all fixes are working:
```bash
python docs/releases_review/demo_v0.7.8.py
```
This runs actual tests that verify each bug fix is properly implemented.
---
## Acknowledgments
Thank you to everyone who reported these issues and provided detailed reproduction steps. Your bug reports make Crawl4AI better for everyone.
Issues fixed: #1642, #1638, #1629, #1621, #1412, #1269, #1268, #1181, #1178, #1116, #678
---
## Support & Resources
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **Twitter**: [@unclecode](https://x.com/unclecode)
---
**This stability release ensures Crawl4AI works reliably across Docker deployments, LLM extraction workflows, and various edge cases. Thank you for your continued support and feedback!**
**Happy crawling!**
*- unclecode*

243
docs/blog/release-v0.8.0.md Normal file
View File

@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes
**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate
---
## Highlights
- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below
---
## Breaking Changes
### 1. Docker API: Hooks Disabled by Default
**What changed**: Hooks are now disabled by default on the Docker API.
**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked
**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
**Who is affected**: Users who were reading local files via the Docker API.
**Migration**: Use the Python library directly for local file processing:
```python
# Instead of API call with file:// URL, use library:
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="file:///path/to/file.html")
```
---
## Security Fixes
### Critical: Remote Code Execution via Hooks (CVE Pending)
**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter
**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
### High: Local File Inclusion via file:// URLs (CVE Pending)
**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
### Credits
Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
---
## New Features
### 1. init_scripts Support for BrowserConfig
Pre-page-load JavaScript injection for stealth evasions.
```python
config = BrowserConfig(
init_scripts=[
"Object.defineProperty(navigator, 'webdriver', {get: () => false})"
]
)
```
### 2. CDP Connection Improvements
- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies
All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Resume from checkpoint
on_state_change=save_callback # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs
Generate PDFs and MHTML from cached HTML content.
### 5. Screenshots for raw:/file:// URLs
Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig
Proper URL resolution for raw: HTML processing:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
Fast link extraction without full page processing:
```python
config = CrawlerRunConfig(prefetch=True)
```
### 8. Proxy Rotation and Configuration
Enhanced proxy rotation with sticky sessions support.
### 9. Proxy Support for HTTP Strategy
Non-browser crawler now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs
New `process_in_browser` parameter for browser operations on local content:
```python
config = CrawlerRunConfig(
process_in_browser=True, # Force browser processing
screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder
Intelligent cache invalidation for sitemaps:
```python
config = SeedingConfig(
cache_ttl_hours=24,
validate_sitemap_lastmod=True
)
```
---
## Bug Fixes
### raw: URL Parsing Truncates at # Character
**Problem**: CSS color codes like `#eee` were being truncated.
**Before**: `raw:body{background:#eee}``body{background:`
**After**: `raw:body{background:#eee}``body{background:#eee}`
### Caching System Improvements
Various fixes to cache validation and persistence.
---
## Documentation Updates
- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)
---
## Upgrade Guide
### From v0.7.x to v0.8.0
1. **Update the package**:
```bash
pip install --upgrade crawl4ai
```
2. **Docker API users**:
- Hooks are now disabled by default
- If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
- `file://` URLs no longer work on API (use library directly)
3. **Review security settings**:
```yaml
# config.yml - recommended for production
security:
enabled: true
jwt_enabled: true
```
4. **Test your integration** before deploying to production
### Breaking Change Checklist
- [ ] Check if you use `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration
---
## Full Changelog
See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
---
## Contributors
Thanks to all contributors who made this release possible.
Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
---
*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*

View File

@@ -18,7 +18,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
2. **Install Dependencies**
```bash
pip install flask
pip install -r requirements.txt
```
3. **Launch the Server**
@@ -28,7 +28,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
4. **Open in Browser**
```
http://localhost:8080
http://localhost:8000
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
@@ -325,7 +325,7 @@ Powers the recording functionality:
### Configuration
```python
# server.py configuration
PORT = 8080
PORT = 8000
DEBUG = True
THREADED = True
```
@@ -343,9 +343,9 @@ THREADED = True
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
lsof -ti:8000 | xargs kill -9
# Or use different port
python server.py --port 8081
python server.py --port 8001
```
**Blockly Not Loading**

View File

@@ -216,7 +216,7 @@ def get_examples():
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
GO http://127.0.0.1:8000/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''

View File

@@ -0,0 +1,62 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/awsWaf/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://nft.porsche.com/onboarding@6" # page url of your target site
cookie_domain = ".nft.porsche.com" # the domain name to which you want to apply the cookie
captcha_type = "AntiAwsWafTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get aws waf cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
})
cookie = solution["cookie"]
print("aws waf cookie:", cookie)
js_code = """
document.cookie = \'aws-waf-token=""" + cookie + """;domain=""" + cookie_domain + """;path=/\';
location.reload();
"""
wait_condition = """() => {
return document.title === \'Join Porsches journey into Web3\';
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,60 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_challenge/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://gitlab.com/users/sign_in" # page url of your target site
captcha_type = "AntiCloudflareTask" # type of your target captcha
# your http proxy to solve cloudflare challenge
proxy_server = "proxy.example.com:8080"
proxy_username = "myuser"
proxy_password = "mypass"
capsolver.api_key = api_key
async def main():
# get challenge cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"proxy": f"{proxy_server}:{proxy_username}:{proxy_password}",
})
cookies = solution["cookies"]
user_agent = solution["userAgent"]
print("challenge cookies:", cookies)
cookies_list = []
for name, value in cookies.items():
cookies_list.append({
"name": name,
"value": value,
"url": site_url,
})
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
user_agent=user_agent,
cookies=cookies_list,
proxy_config={
"server": f"http://{proxy_server}",
"username": proxy_username,
"password": proxy_password,
},
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,64 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_turnstile/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "0x4AAAAAAAGlwMzq_9z6S9Mh" # site key of your target site
site_url = "https://clifford.io/demo/cloudflare-turnstile" # page url of your target site
captcha_type = "AntiTurnstileTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get turnstile token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["token"]
print("turnstile token:", token)
js_code = """
document.querySelector(\'input[name="cf-turnstile-response"]\').value = \'"""+token+"""\';
document.querySelector(\'button[type="submit"]\').click();
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h1\');
return items.length === 0;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,67 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV2/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LfW6wATAAAAAHLqO2pb8bDBahxlMxNdo9g947u9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php" # page url of your target site
captcha_type = "ReCaptchaV2TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
js_code = """
const textarea = document.getElementById(\'g-recaptcha-response\');
if (textarea) {
textarea.value = \"""" + token + """\";
document.querySelector(\'button.form-field[type="submit"]\').click();
}
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h2\');
return items.length > 1;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,75 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV3/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LdKlZEpAAAAAAOQjzC2v_d36tWxCl6dWsozdSy9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php" # page url of your target site
page_action = "examples/v3scores" # page action of your target site
captcha_type = "ReCaptchaV3TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
"pageAction": page_action,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
js_code = """
const originalFetch = window.fetch;
window.fetch = function(...args) {
if (typeof args[0] === 'string' && args[0].includes('/recaptcha-v3-verify.php')) {
const url = new URL(args[0], window.location.origin);
url.searchParams.set('action', '""" + token + """');
args[0] = url.toString();
document.querySelector('.token').innerHTML = "fetch('/recaptcha-v3-verify.php?action=examples/v3scores&token=""" + token + """')";
console.log('Fetch URL hooked:', args[0]);
}
return originalFetch.apply(this, args);
};
"""
wait_condition = """() => {
return document.querySelector('.step3:not(.hidden)');
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://nft.porsche.com/onboarding@6",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://gitlab.com/users/sign_in",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://clifford.io/demo/cloudflare-turnstile",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,61 @@
import json
import asyncio
from urllib.parse import quote, urlencode
from crawl4ai import CrawlerRunConfig, BrowserConfig, AsyncWebCrawler
# Scrapeless provides a free anti-detection fingerprint browser client and cloud browsers:
# https://www.scrapeless.com/en/blog/scrapeless-nstbrowser-strategic-integration
async def main():
# customize browser fingerprint
fingerprint = {
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36",
"platform": "Windows",
"screen": {
"width": 1280, "height": 1024
},
"localization": {
"languages": ["zh-HK", "en-US", "en"], "timezone": "Asia/Hong_Kong",
}
}
fingerprint_json = json.dumps(fingerprint)
encoded_fingerprint = quote(fingerprint_json)
scrapeless_params = {
"token": "your token",
"sessionTTL": 1000,
"sessionName": "Demo",
"fingerprint": encoded_fingerprint,
# Sets the target country/region for the proxy, sending requests via an IP address from that region. You can specify a country code (e.g., US for the United States, GB for the United Kingdom, ANY for any country). See country codes for all supported options.
# "proxyCountry": "ANY",
# create profile on scrapeless
# "profileId": "your profileId",
# For more usage details, please refer to https://docs.scrapeless.com/en/scraping-browser/quickstart/getting-started
}
query_string = urlencode(scrapeless_params)
scrapeless_connection_url = f"wss://browser.scrapeless.com/api/v2/browser?{query_string}"
async with AsyncWebCrawler(
config=BrowserConfig(
headless=False,
browser_mode="cdp",
cdp_url=scrapeless_connection_url,
)
) as crawler:
result = await crawler.arun(
url="https://www.scrapeless.com/en",
config=CrawlerRunConfig(
wait_for="css:.content",
scan_full_page=True,
),
)
print("-" * 20)
print(f'Status Code: {result.status_code}')
print("-" * 20)
print(f'Title: {result.metadata["title"]}')
print(f'Description: {result.metadata["description"]}')
print("-" * 20)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,297 @@
#!/usr/bin/env python3
"""
Deep Crawl Crash Recovery Example
This example demonstrates how to implement crash recovery for long-running
deep crawls. The feature is useful for:
- Cloud deployments with spot/preemptible instances
- Long-running crawls that may be interrupted
- Distributed crawling with state coordination
Key concepts:
- `on_state_change`: Callback fired after each URL is processed
- `resume_state`: Pass saved state to continue from a checkpoint
- `export_state()`: Get the last captured state manually
Works with all strategies: BFSDeepCrawlStrategy, DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy
"""
import asyncio
import json
import os
from pathlib import Path
from typing import Dict, Any, List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
# File to store crawl state (in production, use Redis/database)
STATE_FILE = Path("crawl_state.json")
async def save_state_to_file(state: Dict[str, Any]) -> None:
"""
Callback to save state after each URL is processed.
In production, you might save to:
- Redis: await redis.set("crawl_state", json.dumps(state))
- Database: await db.execute("UPDATE crawls SET state = ?", json.dumps(state))
- S3: await s3.put_object(Bucket="crawls", Key="state.json", Body=json.dumps(state))
"""
with open(STATE_FILE, "w") as f:
json.dump(state, f, indent=2)
print(f" [State saved] Pages: {state['pages_crawled']}, Pending: {len(state['pending'])}")
def load_state_from_file() -> Dict[str, Any] | None:
"""Load previously saved state, if it exists."""
if STATE_FILE.exists():
with open(STATE_FILE, "r") as f:
return json.load(f)
return None
async def example_basic_state_persistence():
"""
Example 1: Basic state persistence with file storage.
The on_state_change callback is called after each URL is processed,
allowing you to save progress in real-time.
"""
print("\n" + "=" * 60)
print("Example 1: Basic State Persistence")
print("=" * 60)
# Clean up any previous state
if STATE_FILE.exists():
STATE_FILE.unlink()
strategy = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
on_state_change=save_state_to_file, # Save after each URL
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
print("\nStarting crawl with state persistence...")
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config)
# Show final state
if STATE_FILE.exists():
with open(STATE_FILE, "r") as f:
final_state = json.load(f)
print(f"\nFinal state saved to {STATE_FILE}:")
print(f" - Strategy: {final_state['strategy_type']}")
print(f" - Pages crawled: {final_state['pages_crawled']}")
print(f" - URLs visited: {len(final_state['visited'])}")
print(f" - URLs pending: {len(final_state['pending'])}")
print(f"\nCrawled {len(results)} pages total")
async def example_crash_and_resume():
"""
Example 2: Simulate a crash and resume from checkpoint.
This demonstrates the full crash recovery workflow:
1. Start crawling with state persistence
2. "Crash" after N pages
3. Resume from saved state
4. Verify no duplicate work
"""
print("\n" + "=" * 60)
print("Example 2: Crash and Resume")
print("=" * 60)
# Clean up any previous state
if STATE_FILE.exists():
STATE_FILE.unlink()
crash_after = 3
crawled_urls_phase1: List[str] = []
async def save_and_maybe_crash(state: Dict[str, Any]) -> None:
"""Save state, then simulate crash after N pages."""
# Always save state first
await save_state_to_file(state)
crawled_urls_phase1.clear()
crawled_urls_phase1.extend(state["visited"])
# Simulate crash after reaching threshold
if state["pages_crawled"] >= crash_after:
raise Exception("Simulated crash! (This is intentional)")
# Phase 1: Start crawl that will "crash"
print(f"\n--- Phase 1: Crawl until 'crash' after {crash_after} pages ---")
strategy1 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
on_state_change=save_and_maybe_crash,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy1,
verbose=False,
)
try:
async with AsyncWebCrawler(verbose=False) as crawler:
await crawler.arun("https://books.toscrape.com", config=config)
except Exception as e:
print(f"\n Crash occurred: {e}")
print(f" URLs crawled before crash: {len(crawled_urls_phase1)}")
# Phase 2: Resume from checkpoint
print("\n--- Phase 2: Resume from checkpoint ---")
saved_state = load_state_from_file()
if not saved_state:
print(" ERROR: No saved state found!")
return
print(f" Loaded state: {saved_state['pages_crawled']} pages, {len(saved_state['pending'])} pending")
crawled_urls_phase2: List[str] = []
async def track_resumed_crawl(state: Dict[str, Any]) -> None:
"""Track new URLs crawled in phase 2."""
await save_state_to_file(state)
new_urls = set(state["visited"]) - set(saved_state["visited"])
for url in new_urls:
if url not in crawled_urls_phase2:
crawled_urls_phase2.append(url)
strategy2 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
resume_state=saved_state, # Resume from checkpoint!
on_state_change=track_resumed_crawl,
)
config2 = CrawlerRunConfig(
deep_crawl_strategy=strategy2,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config2)
# Verify no duplicates
already_crawled = set(saved_state["visited"])
duplicates = set(crawled_urls_phase2) & already_crawled
print(f"\n--- Results ---")
print(f" Phase 1 URLs: {len(crawled_urls_phase1)}")
print(f" Phase 2 new URLs: {len(crawled_urls_phase2)}")
print(f" Duplicate crawls: {len(duplicates)} (should be 0)")
print(f" Total results: {len(results)}")
if len(duplicates) == 0:
print("\n SUCCESS: No duplicate work after resume!")
else:
print(f"\n WARNING: Found duplicates: {duplicates}")
async def example_export_state():
"""
Example 3: Manual state export using export_state().
If you don't need real-time persistence, you can export
the state manually after the crawl completes.
"""
print("\n" + "=" * 60)
print("Example 3: Manual State Export")
print("=" * 60)
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=3,
# No callback - state is still tracked internally
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
print("\nCrawling without callback...")
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config)
# Export state after crawl completes
# Note: This only works if on_state_change was set during crawl
# For this example, we'd need to set on_state_change to get state
print(f"\nCrawled {len(results)} pages")
print("(For manual export, set on_state_change to capture state)")
async def example_state_structure():
"""
Example 4: Understanding the state structure.
Shows the complete state dictionary that gets saved.
"""
print("\n" + "=" * 60)
print("Example 4: State Structure")
print("=" * 60)
captured_state = None
async def capture_state(state: Dict[str, Any]) -> None:
nonlocal captured_state
captured_state = state
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=2,
on_state_change=capture_state,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
await crawler.arun("https://books.toscrape.com", config=config)
if captured_state:
print("\nState structure:")
print(json.dumps(captured_state, indent=2, default=str)[:1000] + "...")
print("\n\nKey fields:")
print(f" strategy_type: '{captured_state['strategy_type']}'")
print(f" visited: List of {len(captured_state['visited'])} URLs")
print(f" pending: List of {len(captured_state['pending'])} queued items")
print(f" depths: Dict mapping URL -> depth level")
print(f" pages_crawled: {captured_state['pages_crawled']}")
async def main():
"""Run all examples."""
print("=" * 60)
print("Deep Crawl Crash Recovery Examples")
print("=" * 60)
await example_basic_state_persistence()
await example_crash_and_resume()
await example_state_structure()
# # Cleanup
# if STATE_FILE.exists():
# STATE_FILE.unlink()
# print(f"\n[Cleaned up {STATE_FILE}]")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,39 @@
"""
Simple demonstration of the DFS deep crawler visiting multiple pages.
Run with: python docs/examples/dfs_crawl_demo.py
"""
import asyncio
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode
from crawl4ai.deep_crawling.dfs_strategy import DFSDeepCrawlStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main() -> None:
dfs_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=50,
include_external=False,
)
config = CrawlerRunConfig(
deep_crawl_strategy=dfs_strategy,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(),
stream=True,
)
seed_url = "https://docs.python.org/3/" # Plenty of internal links
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
async for result in await crawler.arun(url=seed_url, config=config):
depth = result.metadata.get("depth")
status = "SUCCESS" if result.success else "FAILED"
print(f"[{status}] depth={depth} url={result.url}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,48 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio, requests
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Dynamically fetch a proxy from NSTProxy API before crawling.
"""
NST_TOKEN = "YOUR_NST_PROXY_TOKEN" # Get from https://app.nstproxy.com/profile
CHANNEL_ID = "YOUR_NST_PROXY_CHANNEL_ID" # Your NSTProxy Channel ID
country = "ANY" # e.g. "ANY", "US", "DE"
# Fetch proxy from NSTProxy API
api_url = (
f"https://api.nstproxy.com/api/v1/generate/apiproxies"
f"?fType=2&channelId={CHANNEL_ID}&country={country}"
f"&protocol=http&sessionDuration=10&count=1&token={NST_TOKEN}"
)
response = requests.get(api_url, timeout=10).json()
proxy = response[0]
ip = proxy.get("ip")
port = proxy.get("port")
username = proxy.get("username", "")
password = proxy.get("password", "")
browser_config = BrowserConfig(proxy_config={
"server": f"http://{ip}:{port}",
"username": username,
"password": password,
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[API Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,31 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Use NSTProxy with manual username/password authentication.
"""
browser_config = BrowserConfig(proxy_config={
"server": "http://gate.nstproxy.io:24125",
"username": "your_username",
"password": "your_password",
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[Auth Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,29 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
# Using HTTP proxy
browser_config = BrowserConfig(proxy_config={"server": "http://gate.nstproxy.io:24125"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[HTTP Proxy] Status:", result.status_code)
# Using SOCKS proxy
browser_config = BrowserConfig(proxy_config={"server": "socks5://gate.nstproxy.io:24125"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[SOCKS5 Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,39 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Using NSTProxy with AsyncWebCrawler.
"""
NST_TOKEN = "YOUR_NST_PROXY_TOKEN" # Get from https://app.nstproxy.com/profile
CHANNEL_ID = "YOUR_NST_PROXY_CHANNEL_ID" # Your NSTProxy Channel ID
browser_config = BrowserConfig()
browser_config.set_nstproxy(
token=NST_TOKEN,
channel_id=CHANNEL_ID,
country="ANY", # e.g. "US", "JP", or "ANY"
state="", # optional, leave empty if not needed
city="", # optional, leave empty if not needed
session_duration=0 # Session duration in minutes,0 = rotate on every request
)
# === Run crawler ===
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[Nstproxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,279 @@
#!/usr/bin/env python3
"""
Prefetch Mode and Two-Phase Crawling Example
Prefetch mode is a fast path that skips heavy processing and returns
only HTML + links. This is ideal for:
- Site mapping: Quickly discover all URLs
- Selective crawling: Find URLs first, then process only what you need
- Link validation: Check which pages exist without full processing
- Crawl planning: Estimate size before committing resources
Key concept:
- `prefetch=True` in CrawlerRunConfig enables fast link-only extraction
- Skips: markdown generation, content scraping, media extraction, LLM extraction
- Returns: HTML and links dictionary
Performance benefit: ~5-10x faster than full processing
"""
import asyncio
import time
from typing import List, Dict
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def example_basic_prefetch():
"""
Example 1: Basic prefetch mode.
Shows how prefetch returns HTML and links without heavy processing.
"""
print("\n" + "=" * 60)
print("Example 1: Basic Prefetch Mode")
print("=" * 60)
async with AsyncWebCrawler(verbose=False) as crawler:
# Enable prefetch mode
config = CrawlerRunConfig(prefetch=True)
print("\nFetching with prefetch=True...")
result = await crawler.arun("https://books.toscrape.com", config=config)
print(f"\nResult summary:")
print(f" Success: {result.success}")
print(f" HTML length: {len(result.html) if result.html else 0} chars")
print(f" Internal links: {len(result.links.get('internal', []))}")
print(f" External links: {len(result.links.get('external', []))}")
# These should be None/empty in prefetch mode
print(f"\n Skipped processing:")
print(f" Markdown: {result.markdown}")
print(f" Cleaned HTML: {result.cleaned_html}")
print(f" Extracted content: {result.extracted_content}")
# Show some discovered links
internal_links = result.links.get("internal", [])
if internal_links:
print(f"\n Sample internal links:")
for link in internal_links[:5]:
print(f" - {link['href'][:60]}...")
async def example_performance_comparison():
"""
Example 2: Compare prefetch vs full processing performance.
"""
print("\n" + "=" * 60)
print("Example 2: Performance Comparison")
print("=" * 60)
url = "https://books.toscrape.com"
async with AsyncWebCrawler(verbose=False) as crawler:
# Warm up - first request is slower due to browser startup
await crawler.arun(url, config=CrawlerRunConfig())
# Prefetch mode timing
start = time.time()
prefetch_result = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
prefetch_time = time.time() - start
# Full processing timing
start = time.time()
full_result = await crawler.arun(url, config=CrawlerRunConfig())
full_time = time.time() - start
print(f"\nTiming comparison:")
print(f" Prefetch mode: {prefetch_time:.3f}s")
print(f" Full processing: {full_time:.3f}s")
print(f" Speedup: {full_time / prefetch_time:.1f}x faster")
print(f"\nOutput comparison:")
print(f" Prefetch - Links found: {len(prefetch_result.links.get('internal', []))}")
print(f" Full - Links found: {len(full_result.links.get('internal', []))}")
print(f" Full - Markdown length: {len(full_result.markdown.raw_markdown) if full_result.markdown else 0}")
async def example_two_phase_crawl():
"""
Example 3: Two-phase crawling pattern.
Phase 1: Fast discovery with prefetch
Phase 2: Full processing on selected URLs
"""
print("\n" + "=" * 60)
print("Example 3: Two-Phase Crawling")
print("=" * 60)
async with AsyncWebCrawler(verbose=False) as crawler:
# ═══════════════════════════════════════════════════════════
# Phase 1: Fast URL discovery
# ═══════════════════════════════════════════════════════════
print("\n--- Phase 1: Fast Discovery ---")
prefetch_config = CrawlerRunConfig(prefetch=True)
start = time.time()
discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
discovery_time = time.time() - start
all_urls = [link["href"] for link in discovery.links.get("internal", [])]
print(f" Discovered {len(all_urls)} URLs in {discovery_time:.2f}s")
# Filter to URLs we care about (e.g., book detail pages)
# On books.toscrape.com, book pages contain "catalogue/" but not "category/"
book_urls = [
url for url in all_urls
if "catalogue/" in url and "category/" not in url
][:5] # Limit to 5 for demo
print(f" Filtered to {len(book_urls)} book pages")
# ═══════════════════════════════════════════════════════════
# Phase 2: Full processing on selected URLs
# ═══════════════════════════════════════════════════════════
print("\n--- Phase 2: Full Processing ---")
full_config = CrawlerRunConfig(
word_count_threshold=10,
remove_overlay_elements=True,
)
results = []
start = time.time()
for url in book_urls:
result = await crawler.arun(url, config=full_config)
if result.success:
results.append(result)
title = result.url.split("/")[-2].replace("-", " ").title()[:40]
md_len = len(result.markdown.raw_markdown) if result.markdown else 0
print(f" Processed: {title}... ({md_len} chars)")
processing_time = time.time() - start
print(f"\n Processed {len(results)} pages in {processing_time:.2f}s")
# ═══════════════════════════════════════════════════════════
# Summary
# ═══════════════════════════════════════════════════════════
print(f"\n--- Summary ---")
print(f" Discovery phase: {discovery_time:.2f}s ({len(all_urls)} URLs)")
print(f" Processing phase: {processing_time:.2f}s ({len(results)} pages)")
print(f" Total time: {discovery_time + processing_time:.2f}s")
print(f" URLs skipped: {len(all_urls) - len(book_urls)} (not matching filter)")
async def example_prefetch_with_deep_crawl():
"""
Example 4: Combine prefetch with deep crawl strategy.
Use prefetch mode during deep crawl for maximum speed.
"""
print("\n" + "=" * 60)
print("Example 4: Prefetch with Deep Crawl")
print("=" * 60)
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
async with AsyncWebCrawler(verbose=False) as crawler:
# Deep crawl with prefetch - maximum discovery speed
config = CrawlerRunConfig(
prefetch=True, # Fast mode
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
max_pages=10,
)
)
print("\nDeep crawling with prefetch mode...")
start = time.time()
result_container = await crawler.arun("https://books.toscrape.com", config=config)
# Handle iterator result from deep crawl
if hasattr(result_container, '__iter__'):
results = list(result_container)
else:
results = [result_container]
elapsed = time.time() - start
# Collect all discovered links
all_internal_links = set()
all_external_links = set()
for result in results:
for link in result.links.get("internal", []):
all_internal_links.add(link["href"])
for link in result.links.get("external", []):
all_external_links.add(link["href"])
print(f"\nResults:")
print(f" Pages crawled: {len(results)}")
print(f" Total internal links discovered: {len(all_internal_links)}")
print(f" Total external links discovered: {len(all_external_links)}")
print(f" Time: {elapsed:.2f}s")
async def example_prefetch_with_raw_html():
"""
Example 5: Prefetch with raw HTML input.
You can also use prefetch mode with raw: URLs for cached content.
"""
print("\n" + "=" * 60)
print("Example 5: Prefetch with Raw HTML")
print("=" * 60)
sample_html = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Hello World</h1>
<nav>
<a href="/page1">Internal Page 1</a>
<a href="/page2">Internal Page 2</a>
<a href="https://example.com/external">External Link</a>
</nav>
<main>
<p>This is the main content with <a href="/page3">another link</a>.</p>
</main>
</body>
</html>
"""
async with AsyncWebCrawler(verbose=False) as crawler:
config = CrawlerRunConfig(
prefetch=True,
base_url="https://mysite.com", # For resolving relative links
)
result = await crawler.arun(f"raw:{sample_html}", config=config)
print(f"\nExtracted from raw HTML:")
print(f" Internal links: {len(result.links.get('internal', []))}")
for link in result.links.get("internal", []):
print(f" - {link['href']} ({link['text']})")
print(f"\n External links: {len(result.links.get('external', []))}")
for link in result.links.get("external", []):
print(f" - {link['href']} ({link['text']})")
async def main():
"""Run all examples."""
print("=" * 60)
print("Prefetch Mode and Two-Phase Crawling Examples")
print("=" * 60)
await example_basic_prefetch()
await example_performance_comparison()
await example_two_phase_crawl()
await example_prefetch_with_deep_crawl()
await example_prefetch_with_raw_html()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -82,6 +82,42 @@ If you installed Crawl4AI (which installs Playwright under the hood), you alread
---
### Creating a Profile Using the Crawl4AI CLI (Easiest)
If you prefer a guided, interactive setup, use the built-in CLI to create and manage persistent browser profiles.
1.Launch the profile manager:
```bash
crwl profiles
```
2.Choose "Create new profile" and enter a profile name. A Chromium window opens so you can log in to sites and configure settings. When finished, return to the terminal and press `q` to save the profile.
3.Profiles are saved under `~/.crawl4ai/profiles/<profile_name>` (for example: `/home/<you>/.crawl4ai/profiles/test_profile_1`) along with a `storage_state.json` for cookies and session data.
4.Optionally, choose "List profiles" in the CLI to view available profiles and their paths.
5.Use the saved path with `BrowserConfig.user_data_dir`:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
profile_path = "/home/<you>/.crawl4ai/profiles/test_profile_1"
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir=profile_path,
browser_type="chromium",
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com/private")
```
The CLI also supports listing and deleting profiles, and even testing a crawl directly from the menu.
---
## 3. Using Managed Browsers in Crawl4AI
Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

View File

@@ -1,98 +1,304 @@
# Proxy
# Proxy & Security
This guide covers proxy configuration and security features in Crawl4AI, including SSL certificate analysis and proxy rotation strategies.
## Understanding Proxy Configuration
Crawl4AI recommends configuring proxies per request through `CrawlerRunConfig.proxy_config`. This gives you precise control, enables rotation strategies, and keeps examples simple enough to copy, paste, and run.
## Basic Proxy Setup
Simple proxy configuration with `BrowserConfig`:
Configure proxies that apply to each crawl operation:
```python
from crawl4ai.async_configs import BrowserConfig
# Using HTTP proxy
browser_config = BrowserConfig(proxy_config={"server": "http://proxy.example.com:8080"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
# Using SOCKS proxy
browser_config = BrowserConfig(proxy_config={"server": "socks5://proxy.example.com:1080"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Authenticated Proxy
Use an authenticated proxy with `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig
browser_config = BrowserConfig(proxy_config={
"server": "http://[host]:[port]",
"username": "[username]",
"password": "[password]",
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Rotating Proxies
Example using a proxy rotation service dynamically:
```python
import re
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
RoundRobinProxyStrategy,
)
import asyncio
from crawl4ai import ProxyConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig
run_config = CrawlerRunConfig(proxy_config=ProxyConfig(server="http://proxy.example.com:8080"))
# run_config = CrawlerRunConfig(proxy_config={"server": "http://proxy.example.com:8080"})
# run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
async def main():
# Load proxies and create rotation strategy
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
print(f"Success: {result.success} -> {result.url}")
if __name__ == "__main__":
asyncio.run(main())
```
!!! note "Why request-level?"
`CrawlerRunConfig.proxy_config` keeps each request self-contained, so swapping proxies or rotation strategies is just a matter of building a new run configuration.
## Supported Proxy Formats
The `ProxyConfig.from_string()` method supports multiple formats:
```python
from crawl4ai import ProxyConfig
# HTTP proxy with authentication
proxy1 = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")
# HTTPS proxy
proxy2 = ProxyConfig.from_string("https://proxy.example.com:8080")
# SOCKS5 proxy
proxy3 = ProxyConfig.from_string("socks5://proxy.example.com:1080")
# Simple IP:port format
proxy4 = ProxyConfig.from_string("192.168.1.1:8080")
# IP:port:user:pass format
proxy5 = ProxyConfig.from_string("192.168.1.1:8080:user:pass")
```
## Authenticated Proxies
For proxies requiring authentication:
```python
import asyncio
from crawl4ai import AsyncWebCrawler,BrowserConfig, CrawlerRunConfig, ProxyConfig
run_config = CrawlerRunConfig(
proxy_config=ProxyConfig(
server="http://proxy.example.com:8080",
username="your_username",
password="your_password",
)
)
# Or dictionary style:
# run_config = CrawlerRunConfig(proxy_config={
# "server": "http://proxy.example.com:8080",
# "username": "your_username",
# "password": "your_password",
# })
async def main():
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
print(f"Success: {result.success} -> {result.url}")
if __name__ == "__main__":
asyncio.run(main())
```
## Environment Variable Configuration
Load proxies from environment variables for easy configuration:
```python
import os
from crawl4ai import ProxyConfig, CrawlerRunConfig
# Set environment variable
os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3"
# Load all proxies
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")
# Use first proxy
if proxies:
run_config = CrawlerRunConfig(proxy_config=proxies[0])
```
## Rotating Proxies
Crawl4AI supports automatic proxy rotation to distribute requests across multiple proxy servers. Rotation is applied per request using a rotation strategy on `CrawlerRunConfig`.
### Proxy Rotation (recommended)
```python
import asyncio
import re
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
async def main():
# Load proxies from environment
proxies = ProxyConfig.from_env()
#eg: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
if not proxies:
print("No proxies found in environment. Set PROXIES env variable!")
print("No proxies found! Set PROXIES environment variable.")
return
# Create rotation strategy
proxy_strategy = RoundRobinProxyStrategy(proxies)
# Create configs
# Configure per-request with proxy rotation
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=proxy_strategy
proxy_rotation_strategy=proxy_strategy,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
urls = ["https://httpbin.org/ip"] * (len(proxies) * 2) # Test each proxy twice
print("\n📈 Initializing crawler with proxy rotation...")
async with AsyncWebCrawler(config=browser_config) as crawler:
print("\n🚀 Starting batch crawl with proxy rotation...")
results = await crawler.arun_many(
urls=urls,
config=run_config
)
for result in results:
if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
current_proxy = run_config.proxy_config if run_config.proxy_config else None
print(f"🚀 Testing {len(proxies)} proxies with rotation...")
results = await crawler.arun_many(urls=urls, config=run_config)
if current_proxy and ip_match:
print(f"URL {result.url}")
print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
verified = ip_match.group(0) == current_proxy.ip
if verified:
print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
else:
print("❌ Proxy failed or IP mismatch!")
print("---")
for i, result in enumerate(results):
if result.success:
# Extract IP from response
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
if ip_match:
detected_ip = ip_match.group(0)
proxy_index = i % len(proxies)
expected_ip = proxies[proxy_index].ip
asyncio.run(main())
print(f"✅ Request {i+1}: Proxy {proxy_index+1} -> IP {detected_ip}")
if detected_ip == expected_ip:
print(" 🎯 IP matches proxy configuration")
else:
print(f" ⚠️ IP mismatch (expected {expected_ip})")
else:
print(f"❌ Request {i+1}: Could not extract IP from response")
else:
print(f"❌ Request {i+1}: Failed - {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
## SSL Certificate Analysis
Combine proxy usage with SSL certificate inspection for enhanced security analysis. SSL certificate fetching is configured per request via `CrawlerRunConfig`.
### Per-Request SSL Certificate Analysis
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
run_config = CrawlerRunConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass",
},
fetch_ssl_certificate=True, # Enable SSL certificate analysis for this request
)
async def main():
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
if result.success:
print(f"✅ Crawled via proxy: {result.url}")
# Analyze SSL certificate
if result.ssl_certificate:
cert = result.ssl_certificate
print("🔒 SSL Certificate Info:")
print(f" Issuer: {cert.issuer}")
print(f" Subject: {cert.subject}")
print(f" Valid until: {cert.valid_until}")
print(f" Fingerprint: {cert.fingerprint}")
# Export certificate
cert.to_json("certificate.json")
print("💾 Certificate exported to certificate.json")
else:
print("⚠️ No SSL certificate information available")
if __name__ == "__main__":
asyncio.run(main())
```
## Security Best Practices
### 1. Proxy Rotation for Anonymity
```python
from crawl4ai import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
# Use multiple proxies to avoid IP blocking
proxies = ProxyConfig.from_env("PROXIES")
strategy = RoundRobinProxyStrategy(proxies)
# Configure rotation per request (recommended)
run_config = CrawlerRunConfig(proxy_rotation_strategy=strategy)
# For a fixed proxy across all requests, just reuse the same run_config instance
static_run_config = run_config
```
### 2. SSL Certificate Verification
```python
from crawl4ai import CrawlerRunConfig
# Always verify SSL certificates when possible
# Per-request (affects specific requests)
run_config = CrawlerRunConfig(fetch_ssl_certificate=True)
```
### 3. Environment Variable Security
```bash
# Use environment variables for sensitive proxy credentials
# Avoid hardcoding usernames/passwords in code
export PROXIES="ip1:port1:user1:pass1,ip2:port2:user2:pass2"
```
### 4. SOCKS5 for Enhanced Security
```python
from crawl4ai import CrawlerRunConfig
# Prefer SOCKS5 proxies for better protocol support
run_config = CrawlerRunConfig(proxy_config="socks5://proxy.example.com:1080")
```
## Migration from Deprecated `proxy` Parameter
- "Deprecation Notice"
The legacy `proxy` argument on `BrowserConfig` is deprecated. Configure proxies through `CrawlerRunConfig.proxy_config` so each request fully describes its network settings.
```python
# Old (deprecated) approach
# from crawl4ai import BrowserConfig
# browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
# New (preferred) approach
from crawl4ai import CrawlerRunConfig
run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
```
### Safe Logging of Proxies
```python
from crawl4ai import ProxyConfig
def safe_proxy_repr(proxy: ProxyConfig):
if getattr(proxy, "username", None):
return f"{proxy.server} (auth: ****)"
return proxy.server
```
## Troubleshooting
### Common Issues
- "Proxy connection failed"
- Verify the proxy server is reachable from your network.
- Double-check authentication credentials.
- Ensure the protocol matches (`http`, `https`, or `socks5`).
- "SSL certificate errors"
- Some proxies break SSL inspection; switch proxies if you see repeated failures.
- Consider temporarily disabling certificate fetching to isolate the issue.
- "Environment variables not loading"
- Confirm `PROXIES` (or your custom env var) is set before running the script.
- Check formatting: `ip:port:user:pass,ip:port:user:pass`.
- "Proxy rotation not working"
- Ensure `ProxyConfig.from_env()` actually loaded entries (`len(proxies) > 0`).
- Attach `proxy_rotation_strategy` to `CrawlerRunConfig`.
- Validate the proxy definitions you pass into the strategy.

View File

@@ -21,21 +21,35 @@ browser_cfg = BrowserConfig(
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`browser_mode`** | `str` (default: `"dedicated"`) | How browser is initialized: `"dedicated"` (new instance), `"builtin"` (CDP background), `"custom"` (explicit CDP), `"docker"` (container). |
| **`use_managed_browser`** | `bool` (default: `False`) | Launch browser via CDP for advanced control. Set automatically based on `browser_mode`. |
| **`cdp_url`** | `str` (default: `None`) | Chrome DevTools Protocol endpoint URL (e.g., `"ws://localhost:9222/devtools/browser/"`). Set automatically based on `browser_mode`. |
| **`debugging_port`** | `int` (default: `9222`) | Port for browser debugging protocol. |
| **`host`** | `str` (default: `"localhost"`) | Host for browser connection. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`viewport`** | `dict` (default: `None`) | Viewport dimensions dict. If set, overrides `viewport_width` and `viewport_height`. |
| **`proxy`** | `str` (deprecated) | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`proxy_config`** | `ProxyConfig or dict` (default: `None`)| For advanced or multi-proxy needs, specify `ProxyConfig` object or dict like `{"server": "...", "username": "...", "password": "..."}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`chrome_channel`** | `str` (default: `"chromium"`) | Chrome channel to launch (e.g., "chrome", "msedge"). Only for `browser_type="chromium"`. Auto-set to empty for Firefox/WebKit. |
| **`channel`** | `str` (default: `"chromium"`) | Alias for `chrome_channel`. |
| **`accept_downloads`** | `bool` (default: `False`) | Whether to allow file downloads. Requires `downloads_path` if `True`. |
| **`downloads_path`** | `str or None` (default: `None`) | Directory to store downloaded files. |
| **`storage_state`** | `str or dict or None` (default: `None`)| In-memory storage state (cookies, localStorage) to restore browser state. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`sleep_on_close`** | `bool` (default: `False`) | Add a small delay when closing browser (can help with cleanup issues). |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom user agent string. |
| **`user_agent_mode`** | `str` (default: `""`) | Set to `"random"` to randomize user agent from a pool (helps with bot detection). |
| **`user_agent_generator_config`** | `dict` (default: `{}`) | Configuration dict for user agent generation when `user_agent_mode="random"`. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
| **`enable_stealth`** | `bool` (default: `False`) | Enable playwright-stealth mode to bypass bot detection. Cannot be used with `browser_mode="builtin"`. |
**Tips**:
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
@@ -70,6 +84,7 @@ We group them by category.
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`chunking_strategy`** | `ChunkingStrategy` (default: RegexChunking()) | Strategy to chunk content before extraction. Can be customized for different chunking approaches. |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
@@ -78,32 +93,50 @@ We group them by category.
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
| **`keep_attrs`** | `list` (default: []) | List of HTML attributes to keep during processing (e.g., `["id", "class", "data-value"]`). |
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
| **`parser_type`** | `str` (default: "lxml") | HTML parser to use (e.g., "lxml", "html.parser"). |
| **`scraping_strategy`** | `ContentScrapingStrategy` (default: LXMLWebScrapingStrategy()) | Strategy to use for content scraping. Can be customized for different scraping needs (e.g., PDF extraction). |
---
### B) **Caching & Session**
### B) **Browser Location and Identity**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|--------------------------------------------------------------------------------------------------------|
| **`locale`** | `str or None` (None) | Browser's locale (e.g., "en-US", "fr-FR") for language preferences. |
| **`timezone_id`** | `str or None` (None) | Browser's timezone (e.g., "America/New_York", "Europe/Paris"). |
| **`geolocation`** | `GeolocationConfig or None` (None) | GPS coordinates configuration. Use `GeolocationConfig(latitude=..., longitude=..., accuracy=...)`. |
| **`fetch_ssl_certificate`** | `bool` (False) | If `True`, fetches and includes SSL certificate information in the result. |
| **`proxy_config`** | `ProxyConfig or dict or None` (None) | Proxy configuration for this specific crawl. Can override browser-level proxy settings. |
| **`proxy_rotation_strategy`** | `ProxyRotationStrategy` (None) | Strategy for rotating proxies during crawl operations. |
---
### C) **Caching & Session**
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
| **`bypass_cache`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.BYPASS`. Use `cache_mode` instead. |
| **`disable_cache`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.DISABLED`. Use `cache_mode` instead. |
| **`no_cache_read`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). Use `cache_mode` instead. |
| **`no_cache_write`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). Use `cache_mode` instead. |
| **`shared_data`** | `dict or None` (None) | Shared data to be passed between hooks and accessible across crawl operations. |
Use these for controlling whether you read or write from a local content cache. Handy for large batch crawls or repeated site visits.
---
### C) **Page Navigation & Timing**
### D) **Page Navigation & Timing**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to complete. Often `"networkidle"` or `"domcontentloaded"`. |
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to "complete". Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_timeout`** | `int or None` (None) | Specific timeout in ms for the `wait_for` condition. If None, uses `page_timeout`. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. |
@@ -112,15 +145,17 @@ Use these for controlling whether you read or write from a local content cache.
---
### D) **Page Interaction**
### E) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates were reusing an existing session and only applying JS. No full reload. |
| **`c4a_script`** | `str or list[str]` (None) | C4A script that compiles to JavaScript. Alternative to writing raw JS. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
@@ -132,7 +167,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
---
### E) **Media Handling**
### F) **Media Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
@@ -141,13 +176,16 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an images alt text or description to be considered valid. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image's alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
| **`exclude_all_images`** | `bool` (False) | If `True`, excludes all images from processing (both internal and external). |
| **`table_score_threshold`** | `int` (7) | Minimum score threshold for processing a table. Lower values include more tables. |
| **`table_extraction`** | `TableExtractionStrategy` (DefaultTableExtraction) | Strategy for table extraction. Defaults to DefaultTableExtraction with configured threshold. |
---
### F) **Link/Domain Handling**
### G) **Link/Domain Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
@@ -155,23 +193,39 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
| **`exclude_internal_links`** | `bool` (False) | If `True`, excludes internal links from the results. |
| **`score_links`** | `bool` (False) | If `True`, calculates intrinsic quality scores for all links using URL structure, text quality, and contextual metrics. |
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
---
### G) **Debug & Logging**
### H) **Debug, Logging & Network Monitoring**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the pages JavaScript console output if you want deeper JS debugging.|
| **`log_console`** | `bool` (False) | Logs the page's JavaScript console output if you want deeper JS debugging.|
| **`capture_network_requests`** | `bool` (False) | If `True`, captures network requests made by the page in `result.captured_requests`. |
| **`capture_console_messages`** | `bool` (False) | If `True`, captures console messages from the page in `result.console_messages`. |
---
### I) **Connection & HTTP Parameters**
### H) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`method`** | `str` ("GET") | HTTP method to use when using AsyncHTTPCrawlerStrategy (e.g., "GET", "POST"). |
| **`stream`** | `bool` (False) | If `True`, enables streaming mode for `arun_many()` to process URLs as they complete rather than waiting for all. |
| **`url`** | `str or None` (None) | URL for this specific config. Not typically set directly but used internally for URL-specific configurations. |
| **`user_agent`** | `str or None` (None) | Custom User-Agent string for this crawl. Can override browser-level user agent. |
| **`user_agent_mode`** | `str or None` (None) | Set to `"random"` to randomize user agent. Can override browser-level setting. |
| **`user_agent_generator_config`** | `dict` ({}) | Configuration for user agent generation when `user_agent_mode="random"`. |
---
### J) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -211,7 +265,7 @@ See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detaile
---
### I) **URL Matching Configuration**
### K) **URL Matching Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -274,7 +328,25 @@ default_config = CrawlerRunConfig() # No url_matcher = matches everything
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
- Always include a default config as the last item if you want to handle all URLs
---## 2.2 Helper Methods
---
### L) **Advanced Crawling Features**
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`deep_crawl_strategy`** | `DeepCrawlStrategy or None` (None) | Strategy for deep/recursive crawling. Enables automatic link following and multi-level site crawling. |
| **`link_preview_config`** | `LinkPreviewConfig or dict or None` (None) | Configuration for link head extraction and scoring. Fetches and scores link metadata without full page loads. |
| **`experimental`** | `dict or None` (None) | Dictionary for experimental/beta features not yet integrated into main parameters. Use with caution. |
**Deep Crawl Strategy** enables automatic site exploration by following links according to defined rules. Useful for sitemap generation or comprehensive site archiving.
**Link Preview Config** allows efficient link discovery and scoring by fetching only the `<head>` section of linked pages, enabling smart crawl prioritization without the overhead of full page loads.
**Experimental** parameters are features in beta testing. They may change or be removed in future versions. Check documentation for currently available experimental features.
---
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
@@ -367,10 +439,19 @@ LLMConfig is useful to pass LLM provider config to strategies and functions that
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use.
| **`api_token`** |1.Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables <br/> 2. API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` <br/> 3. Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"` | API token to use for the given provider
| **`base_url`** |Optional. Custom API endpoint | If your provider has a custom endpoint
| **`backoff_base_delay`** |Optional. `int` *(default: `2`)* | Seconds to wait before the first retry when the provider throttles a request.
| **`backoff_max_attempts`** |Optional. `int` *(default: `3`)* | Total tries (initial call + retries) before surfacing an error.
| **`backoff_exponential_factor`** |Optional. `int` *(default: `2`)* | Multiplier that increases the wait time for each retry (`delay = base_delay * factor^attempt`).
## 3.2 Example Usage
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
backoff_base_delay=1, # optional
backoff_max_attempts=5, # optional
backoff_exponential_factor=3, # optional
)
```
## 4. Putting It All Together

View File

@@ -18,7 +18,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
2. **Install Dependencies**
```bash
pip install flask
pip install -r requirements.txt
```
3. **Launch the Server**
@@ -28,7 +28,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
4. **Open in Browser**
```
http://localhost:8080
http://localhost:8000
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
@@ -325,7 +325,7 @@ Powers the recording functionality:
### Configuration
```python
# server.py configuration
PORT = 8080
PORT = 8000
DEBUG = True
THREADED = True
```
@@ -343,9 +343,9 @@ THREADED = True
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
lsof -ti:8000 | xargs kill -9
# Or use different port
python server.py --port 8081
python server.py --port 8001
```
**Blockly Not Loading**

View File

@@ -216,7 +216,7 @@ def get_examples():
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
GO http://127.0.0.1:8000/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''
@@ -283,7 +283,7 @@ WAIT `.success-message` 5'''
return jsonify(examples)
if __name__ == '__main__':
port = int(os.environ.get('PORT', 8080))
port = int(os.environ.get('PORT', 8000))
print(f"""
╔══════════════════════════════════════════════════════════╗
║ C4A-Script Interactive Tutorial Server ║

View File

@@ -20,48 +20,69 @@ Ever wondered why your AI coding assistant struggles with your library despite c
## Latest Release
### [Crawl4AI v0.8.0 Crash Recovery & Prefetch Mode](../blog/release-v0.8.0.md)
*January 2026*
Crawl4AI v0.8.0 introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
Key highlights:
- **🔄 Deep Crawl Crash Recovery**: `on_state_change` callback for real-time state persistence, `resume_state` to continue from checkpoints
- **⚡ Prefetch Mode**: `prefetch=True` for 5-10x faster URL discovery, perfect for two-phase crawling patterns
- **🔒 Security Fixes**: Hooks disabled by default, `file://` URLs blocked on Docker API, `__import__` removed from sandbox
[Read full release notes →](../blog/release-v0.8.0.md)
## Recent Releases
### [Crawl4AI v0.7.8 Stability & Bug Fix Release](../blog/release-v0.7.8.md)
*December 2025*
Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. Fixes for Docker deployments, LLM extraction, URL handling, and dependency compatibility.
Key highlights:
- **🐳 Docker API Fixes**: ContentRelevanceFilter deserialization, ProxyConfig serialization, cache folder permissions
- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support
- **📦 Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
[Read full release notes →](../blog/release-v0.7.8.md)
### [Crawl4AI v0.7.7 The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
*November 14, 2025*
Crawl4AI v0.7.7 transforms Docker into a complete self-hosting platform with enterprise-grade real-time monitoring, comprehensive observability, and full operational control.
Key highlights:
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds
- **🔥 Smart Browser Pool**: 3-tier architecture with automatic promotion and cleanup
[Read full release notes →](../blog/release-v0.7.7.md)
### [Crawl4AI v0.7.6 The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
*October 22, 2025*
Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows. No more polling!
Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows.
Key highlights:
- **🪝 Complete Webhook Support**: Real-time notifications for both `/crawl/job` and `/llm/job` endpoints
- **🔄 Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
- **🔄 Reliable Delivery**: Exponential backoff retry mechanism
- **🔐 Custom Authentication**: Add custom headers for webhook authentication
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs
- **🎯 Zero Breaking Changes**: Fully backward compatible, webhooks are opt-in
[Read full release notes →](../blog/release-v0.7.6.md)
## Recent Releases
### [Crawl4AI v0.7.5 The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
*September 29, 2025*
Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.
Key highlights:
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
[Read full release notes →](../blog/release-v0.7.5.md)
## Recent Releases
### [Crawl4AI v0.7.4 The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
*August 17, 2025*
Revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes.
[Read full release notes →](../blog/release-v0.7.4.md)
---
## Older Releases
| Version | Date | Highlights |
|---------|------|------------|
| [v0.7.5](../blog/release-v0.7.5.md) | September 2025 | Docker Hooks System, enhanced LLM integration, HTTPS preservation |
| [v0.7.4](../blog/release-v0.7.4.md) | August 2025 | LLM-powered table extraction, performance improvements |
| [v0.7.3](../blog/release-v0.7.3.md) | July 2025 | Undetected browser, multi-URL config, memory monitoring |
| [v0.7.1](../blog/release-v0.7.1.md) | June 2025 | Bug fixes and stability improvements |
| [v0.7.0](../blog/release-v0.7.0.md) | May 2025 | Adaptive crawling, virtual scroll, link analysis |
## Project History
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.

View File

@@ -0,0 +1,626 @@
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
*November 14, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
## 🎯 What's New at a Glance
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
- **📈 Production Metrics**: 6 critical metrics for operational excellence
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
## 📊 Real-time Monitoring Dashboard: Complete Visibility
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
### The Self-Hosting Value Proposition
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
- **🔒 Data Privacy**: Your data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing or rate limits
- **🎯 Full Customization**: Complete control over configurations and strategies
- **📊 Complete Transparency**: Real-time visibility into every aspect
- **⚡ Performance**: Direct access without network overhead
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
### Interactive Monitoring Dashboard
Access the dashboard at `http://localhost:11235/dashboard` to see:
- **System Health Overview**: CPU, memory, network, and uptime in real-time
- **Live Request Tracking**: Active and completed requests with full details
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
- **Janitor Events Log**: Automatic cleanup activities
- **Error Monitoring**: Full context error logs
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
## 🔌 Monitor API: Programmatic Access
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
### System Health Endpoint
```python
import httpx
import asyncio
async def monitor_system_health():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/health")
health = response.json()
print(f"Container Metrics:")
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Uptime: {health['container']['uptime_seconds']}s")
print(f"\nBrowser Pool:")
print(f" Permanent: {health['pool']['permanent']['active']} active")
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
print(f"\nStatistics:")
print(f" Total Requests: {health['stats']['total_requests']}")
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
asyncio.run(monitor_system_health())
```
### Request Tracking
```python
async def track_requests():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/requests")
requests_data = response.json()
print(f"Active Requests: {len(requests_data['active'])}")
print(f"Completed Requests: {len(requests_data['completed'])}")
# See details of recent requests
for req in requests_data['completed'][:5]:
status_icon = "" if req['success'] else ""
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
```
### Browser Pool Management
```python
async def monitor_browser_pool():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"Pool Summary:")
print(f" Total Browsers: {browsers['summary']['total_count']}")
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
# List all browsers
for browser in browsers['permanent']:
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
```
### Endpoint Performance Statistics
```python
async def get_endpoint_stats():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = response.json()
print("Endpoint Analytics:")
for endpoint, data in stats.items():
print(f" {endpoint}:")
print(f" Requests: {data['count']}")
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
```
### Complete API Reference
The Monitor API includes these endpoints:
- `GET /monitor/health` - System health with pool statistics
- `GET /monitor/requests` - Active and completed request tracking
- `GET /monitor/browsers` - Browser pool details and efficiency
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
- `GET /monitor/logs/errors?limit=10` - Error logs with context
- `POST /monitor/actions/cleanup` - Force immediate cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset accumulated statistics
## ⚡ WebSocket Streaming: Real-time Updates
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
### WebSocket Integration Example
```python
import websockets
import json
import asyncio
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to real-time monitoring stream")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Access all monitoring data
print(f"\n--- Update at {update['timestamp']} ---")
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
if update['errors']:
print(f"⚠️ Recent Errors: {len(update['errors'])}")
asyncio.run(monitor_realtime())
```
**Expected Real-World Impact:**
- **Custom Dashboards**: Build tailored monitoring UIs for your team
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
- **Integration**: Feed live data into monitoring tools like Grafana
- **Automation**: React to events in real-time without polling
## 🔥 Smart Browser Pool: 3-Tier Architecture
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
### How It Works
```python
import httpx
async def demonstrate_browser_pool():
async with httpx.AsyncClient() as client:
# Request 1-3: Default config → Uses permanent browser
print("Phase 1: Using permanent browser")
for i in range(3):
await client.post(
"http://localhost:11235/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"]}
)
print(f" Request {i+1}: Reused permanent browser")
# Request 4-6: Custom viewport → Cold pool (first use)
print("\nPhase 2: Custom config creates cold pool browser")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
"http://localhost:11235/crawl",
json={
"urls": [f"https://httpbin.org/json?v={i}"],
"browser_config": viewport_config
}
)
if i < 2:
print(f" Request {i+1}: Cold pool browser")
else:
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
# Check pool status
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"\nPool Status:")
print(f" Permanent: {len(browsers['permanent'])} (always active)")
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
print(f" Cold: {len(browsers['cold'])} (on-demand)")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
asyncio.run(demonstrate_browser_pool())
```
**Pool Tiers:**
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
**Expected Real-World Impact:**
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
- **Performance**: Instant access to frequently-used configurations
- **Automatic Optimization**: Pool adapts to your usage patterns
- **Resource Management**: Janitor automatically cleans up idle browsers
## 🧹 Janitor System: Automatic Cleanup
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
```python
async def monitor_janitor_activity():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
logs = response.json()
print("Recent Cleanup Activities:")
for log in logs:
print(f" {log['timestamp']}: {log['message']}")
# Example output:
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
```
## 🎮 Control Actions: Manual Management
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
**My Solution:** Manual control actions via the API for operational troubleshooting.
### Force Cleanup
```python
async def force_cleanup():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
result = response.json()
print(f"Cleanup completed:")
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
```
### Kill Specific Browser
```python
async def kill_stuck_browser(browser_id: str):
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11235/monitor/actions/kill_browser",
json={"browser_id": browser_id}
)
if response.status_code == 200:
print(f"✅ Browser {browser_id} killed successfully")
```
### Reset Statistics
```python
async def reset_stats():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/stats/reset")
print("📊 Statistics reset for fresh monitoring")
```
## 📈 Production Integration Patterns
### Prometheus Integration
```python
# Export metrics for Prometheus scraping
async def export_prometheus_metrics():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Export in Prometheus format
metrics = f"""
# HELP crawl4ai_memory_usage_percent Memory usage percentage
# TYPE crawl4ai_memory_usage_percent gauge
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
# HELP crawl4ai_request_success_rate Request success rate
# TYPE crawl4ai_request_success_rate gauge
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
# HELP crawl4ai_browser_pool_count Total browsers in pool
# TYPE crawl4ai_browser_pool_count gauge
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
"""
return metrics
```
### Alerting Example
```python
async def check_alerts():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Memory alert
if data['container']['memory_percent'] > 80:
print("🚨 ALERT: Memory usage above 80%")
# Trigger cleanup
await client.post("http://localhost:11235/monitor/actions/cleanup")
# Success rate alert
if data['stats']['success_rate_percent'] < 90:
print("🚨 ALERT: Success rate below 90%")
# Check error logs
errors = await client.get("http://localhost:11235/monitor/logs/errors")
print(f"Recent errors: {len(errors.json())}")
# Latency alert
if data['stats']['avg_latency_ms'] > 5000:
print("🚨 ALERT: Average latency above 5s")
```
### Key Metrics to Track
```python
CRITICAL_METRICS = {
"memory_usage": {
"current": "container.memory_percent",
"target": "<80%",
"alert_threshold": ">80%",
"action": "Force cleanup or scale"
},
"success_rate": {
"current": "stats.success_rate_percent",
"target": ">95%",
"alert_threshold": "<90%",
"action": "Check error logs"
},
"avg_latency": {
"current": "stats.avg_latency_ms",
"target": "<2000ms",
"alert_threshold": ">5000ms",
"action": "Investigate slow requests"
},
"browser_reuse_rate": {
"current": "browsers.summary.reuse_rate_percent",
"target": ">80%",
"alert_threshold": "<60%",
"action": "Check pool configuration"
},
"total_browsers": {
"current": "browsers.summary.total_count",
"target": "<15",
"alert_threshold": ">20",
"action": "Check for browser leaks"
},
"error_frequency": {
"current": "len(errors)",
"target": "<5/hour",
"alert_threshold": ">10/hour",
"action": "Review error patterns"
}
}
```
## 🐛 Critical Bug Fixes
This release includes significant bug fixes that improve stability and performance:
### Async LLM Extraction (#1590)
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
```python
# Before v0.7.7: Sequential processing
# After v0.7.7: True parallel processing
async with AsyncWebCrawler() as crawler:
urls = ["url1", "url2", "url3", "url4"]
# Now processes truly in parallel with LLM extraction
results = await crawler.arun_many(
urls,
config=CrawlerRunConfig(
extraction_strategy=LLMExtractionStrategy(...)
)
)
# 4x faster for parallel LLM extraction!
```
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
### DFS Deep Crawling (#1607)
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
### Browser & Crawler Config Documentation (#1609)
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
### Sitemap Seeder (#1598)
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
### Remove Overlay Elements (#1529)
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
**The Fix:** Fixed by properly calling the injected JavaScript function.
### Viewport Configuration (#1495)
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
**The Fix:** Added proper viewport size configuration support for browser launch.
### Managed Browser CDP Timing (#1528)
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
### Security Updates
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
- Added verification tests for the security update
### Docker Fixes
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
### Other Important Fixes
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
## 📊 Expected Real-World Impact
### For DevOps & Infrastructure Teams
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
- **Proactive Monitoring**: Catch issues before they become problems
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
- **Operational Control**: Manual intervention when automated systems need help
### For Production Deployments
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
- **Debugging**: Real-time logs and error tracking
- **Capacity Planning**: Historical metrics for scaling decisions
- **SLA Monitoring**: Track success rates and latency against targets
### For Development Teams
- **Local Monitoring**: Understand crawler behavior during development
- **Performance Testing**: Measure impact of configuration changes
- **Troubleshooting**: Quickly identify and fix issues
- **Learning**: See exactly how the browser pool works
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- All existing Docker configurations continue to work
- No API changes to existing endpoints
- Monitoring is additive functionality
- No migration required
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.7
# Or use the latest tag
docker pull unclecode/crawl4ai:latest
# Run with monitoring enabled (default)
docker run -d \
-p 11235:11235 \
--shm-size=1g \
--name crawl4ai \
unclecode/crawl4ai:0.7.7
# Access the monitoring dashboard
open http://localhost:11235/dashboard
```
### Python Package
```bash
# Upgrade to latest version
pip install --upgrade crawl4ai
# Or install specific version
pip install crawl4ai==0.7.7
```
## 🎬 Try the Demo
Run the comprehensive demo that showcases all monitoring features:
```bash
python docs/releases_review/demo_v0.7.7.py
```
**The demo includes:**
1. System health overview with live metrics
2. Request tracking with active/completed monitoring
3. Browser pool management (permanent/hot/cold)
4. Complete Monitor API endpoint examples
5. WebSocket streaming demonstration
6. Control actions (cleanup, kill, restart)
7. Production metrics and alerting patterns
8. Self-hosting value proposition
## 📚 Documentation
### New Documentation
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
### Updated Documentation
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
- Added comprehensive monitoring sections
- Production integration patterns
- WebSocket streaming examples
## 💡 Pro Tips
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
7. **Check janitor logs** - Understand automatic cleanup patterns
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
## 🙏 Acknowledgments
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
## 📞 Support & Resources
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
---
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
**Happy crawling with full visibility!** 🕷️📊
*- unclecode*

View File

@@ -0,0 +1,327 @@
# Crawl4AI v0.7.8: Stability & Bug Fix Release
*December 2025*
---
I'm releasing Crawl4AI v0.7.8—a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
## What's Fixed at a Glance
- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of hardcoded mock data
## Bug Fixes
### Docker & API Fixes
#### ContentRelevanceFilter Deserialization (#1642)
**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.
**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.
```python
# This now works correctly in Docker API
import httpx
request = {
"urls": ["https://docs.example.com"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 2,
"filter_chain": [
{
"type": "ContentRelevanceFilter",
"query": "API documentation",
"threshold": 0.3
}
]
}
}
}
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/crawl", json=request)
# Previously failed, now works!
```
#### ProxyConfig JSON Serialization (#1629)
**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.
```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="user",
password="pass"
)
config = BrowserConfig(headless=True, proxy_config=proxy)
# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict) # Valid JSON
```
#### Docker Cache Folder Permissions (#1638)
**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.
**The Fix:** Corrected ownership and permissions during image build.
```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
--shm-size=1g \
-v ./my-cache:/app/.cache \
unclecode/crawl4ai:0.7.8
```
---
### LLM & Extraction Fixes
#### Configurable Rate Limiter Backoff (#1269)
**The Problem:** The LLM rate limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.
**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.
```python
from crawl4ai import LLMConfig
# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2
# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5 seconds on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```
#### LLM Strategy HTML Input Support (#1178)
**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with HTML structure preserved.
**The Fix:** Added `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.
```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract product information"
)
# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract the data table preserving structure",
input_format="html"
)
# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Summarize the main content",
input_format="fit_markdown"
)
```
#### Raw HTML URL Variable (#1116)
**The Problem:** When using `url="raw:<html>..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.
**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
html = "<html><body><h1>Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html}",
config=CrawlerRunConfig(extraction_strategy=my_strategy)
)
# extraction_strategy receives url="Raw HTML" instead of the HTML blob
```
---
### URL Handling Fix
#### Relative URLs After Redirects (#1268)
**The Problem:** When JavaScript caused a page redirect, relative links were resolved against the original URL instead of the final URL.
**The Fix:** `redirected_url` now captures the actual page URL after all JavaScript execution completes.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# Page at /old-page redirects via JS to /new-page
result = await crawler.arun(url="https://example.com/old-page")
# BEFORE: redirected_url = "https://example.com/old-page"
# AFTER: redirected_url = "https://example.com/new-page"
# Links are now correctly resolved against the final URL
for link in result.links['internal']:
print(link['href']) # Relative links resolved correctly
```
---
### Dependency & Compatibility Fixes
#### PyPDF2 Replaced with pypdf (#1412)
**The Problem:** PyPDF2 was deprecated in 2022 and is no longer maintained.
**The Fix:** Replaced with the actively maintained `pypdf` library.
```python
# Installation (unchanged)
pip install crawl4ai[pdf]
# The PDF processor now uses pypdf internally
# No code changes required - API remains the same
```
#### Pydantic v2 ConfigDict Compatibility (#678)
**The Problem:** Using the deprecated `class Config` syntax caused deprecation warnings with Pydantic v2.
**The Fix:** Migrated to `model_config = ConfigDict(...)` syntax.
```python
# No more deprecation warnings when importing crawl4ai models
from crawl4ai.models import CrawlResult
from crawl4ai import CrawlerRunConfig, BrowserConfig
# All models are now Pydantic v2 compatible
```
---
### AdaptiveCrawler Fix
#### Query Expansion Using LLM (#1621)
**The Problem:** The `EmbeddingStrategy` in AdaptiveCrawler had commented-out LLM code and was using hardcoded mock query variations instead.
**The Fix:** Uncommented and activated the LLM call for actual query expansion.
```python
# AdaptiveCrawler query expansion now actually uses the LLM
# Instead of hardcoded variations like:
# variations = {'queries': ['what are the best vegetables...']}
# The LLM generates relevant query variations based on your actual query
```
---
### Code Formatting Fix
#### Import Statement Formatting (#1181)
**The Problem:** When extracting code from web pages, import statements were sometimes concatenated without proper line separation.
**The Fix:** Import statements now maintain proper newline separation.
```python
# BEFORE: "import osimport sysfrom pathlib import Path"
# AFTER:
# import os
# import sys
# from pathlib import Path
```
---
## Breaking Changes
**None!** This release is fully backward compatible.
- All existing code continues to work without modification
- New parameters have sensible defaults matching previous behavior
- No API changes to existing functionality
---
## Upgrade Instructions
### Python Package
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.7.8
```
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.8
# Run
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
```
---
## Verification
Run the verification tests to confirm all fixes are working:
```bash
python docs/releases_review/demo_v0.7.8.py
```
This runs actual tests that verify each bug fix is properly implemented.
---
## Acknowledgments
Thank you to everyone who reported these issues and provided detailed reproduction steps. Your bug reports make Crawl4AI better for everyone.
Issues fixed: #1642, #1638, #1629, #1621, #1412, #1269, #1268, #1181, #1178, #1116, #678
---
## Support & Resources
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **Twitter**: [@unclecode](https://x.com/unclecode)
---
**This stability release ensures Crawl4AI works reliably across Docker deployments, LLM extraction workflows, and various edge cases. Thank you for your continued support and feedback!**
**Happy crawling!**
*- unclecode*

View File

@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes
**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate
---
## Highlights
- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below
---
## Breaking Changes
### 1. Docker API: Hooks Disabled by Default
**What changed**: Hooks are now disabled by default on the Docker API.
**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked
**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
**Who is affected**: Users who were reading local files via the Docker API.
**Migration**: Use the Python library directly for local file processing:
```python
# Instead of API call with file:// URL, use library:
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="file:///path/to/file.html")
```
---
## Security Fixes
### Critical: Remote Code Execution via Hooks (CVE Pending)
**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter
**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
### High: Local File Inclusion via file:// URLs (CVE Pending)
**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
### Credits
Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
---
## New Features
### 1. init_scripts Support for BrowserConfig
Pre-page-load JavaScript injection for stealth evasions.
```python
config = BrowserConfig(
init_scripts=[
"Object.defineProperty(navigator, 'webdriver', {get: () => false})"
]
)
```
### 2. CDP Connection Improvements
- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies
All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Resume from checkpoint
on_state_change=save_callback # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs
Generate PDFs and MHTML from cached HTML content.
### 5. Screenshots for raw:/file:// URLs
Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig
Proper URL resolution for raw: HTML processing:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
Fast link extraction without full page processing:
```python
config = CrawlerRunConfig(prefetch=True)
```
### 8. Proxy Rotation and Configuration
Enhanced proxy rotation with sticky sessions support.
### 9. Proxy Support for HTTP Strategy
Non-browser crawler now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs
New `process_in_browser` parameter for browser operations on local content:
```python
config = CrawlerRunConfig(
process_in_browser=True, # Force browser processing
screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder
Intelligent cache invalidation for sitemaps:
```python
config = SeedingConfig(
cache_ttl_hours=24,
validate_sitemap_lastmod=True
)
```
---
## Bug Fixes
### raw: URL Parsing Truncates at # Character
**Problem**: CSS color codes like `#eee` were being truncated.
**Before**: `raw:body{background:#eee}``body{background:`
**After**: `raw:body{background:#eee}``body{background:#eee}`
### Caching System Improvements
Various fixes to cache validation and persistence.
---
## Documentation Updates
- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)
---
## Upgrade Guide
### From v0.7.x to v0.8.0
1. **Update the package**:
```bash
pip install --upgrade crawl4ai
```
2. **Docker API users**:
- Hooks are now disabled by default
- If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
- `file://` URLs no longer work on API (use library directly)
3. **Review security settings**:
```yaml
# config.yml - recommended for production
security:
enabled: true
jwt_enabled: true
```
4. **Test your integration** before deploying to production
### Breaking Change Checklist
- [ ] Check if you use `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration
---
## Full Changelog
See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
---
## Contributors
Thanks to all contributors who made this release possible.
Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
---
*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*

View File

@@ -1593,8 +1593,20 @@ The `clone()` method:
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
3. **`base_url`**:
- If your provider has a custom endpoint
4. **Backoff controls** *(optional)*:
- `backoff_base_delay` *(default `2` seconds)* how long to pause before the first retry if the provider rate-limits you.
- `backoff_max_attempts` *(default `3`)* total tries for the same prompt (initial call + retries).
- `backoff_exponential_factor` *(default `2`)* how quickly the pause grows between retries. A factor of 2 yields waits like 2s → 4s → 8s.
- Because these plug into Crawl4AIs retry helper, every LLM strategy automatically follows the pacing you define here.
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
backoff_base_delay=1, # optional
backoff_max_attempts=5, # optional
backoff_exponential_factor=3, # optional
)
```
## 4. Putting It All Together
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:

View File

@@ -17,6 +17,11 @@ class BrowserConfig:
def __init__(
browser_type="chromium",
headless=True,
browser_mode="dedicated",
use_managed_browser=False,
cdp_url=None,
debugging_port=9222,
host="localhost",
proxy_config=None,
viewport_width=1080,
viewport_height=600,
@@ -25,7 +30,13 @@ class BrowserConfig:
user_data_dir=None,
cookies=None,
headers=None,
user_agent=None,
user_agent=(
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
# "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
),
user_agent_mode="",
text_mode=False,
light_mode=False,
extra_args=None,
@@ -37,17 +48,33 @@ class BrowserConfig:
### Key Fields to Note
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
1.**`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
2.**`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
- A dictionary with fields like:
3.**`browser_mode`**
- Determines how the browser should be initialized:
- `"dedicated"` (default): Creates a new browser instance each time
- `"builtin"`: Uses the builtin CDP browser running in background
- `"custom"`: Uses explicit CDP settings provided in `cdp_url`
- `"docker"`: Runs browser in Docker container with isolation
4.**`use_managed_browser`** & **`cdp_url`**
- `use_managed_browser=True`: Launch browser using Chrome DevTools Protocol (CDP) for advanced control
- `cdp_url`: URL for CDP endpoint (e.g., `"ws://localhost:9222/devtools/browser/"`)
- Automatically set based on `browser_mode`
5.**`debugging_port`** & **`host`**
- `debugging_port`: Port for browser debugging protocol (default: 9222)
- `host`: Host for browser connection (default: "localhost")
6.**`proxy_config`**
- A `ProxyConfig` object or dictionary with fields like:
```json
{
"server": "http://proxy.example.com:8080",
@@ -57,35 +84,35 @@ class BrowserConfig:
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:
7.**`viewport_width` & `viewport_height`**
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
8.**`verbose`**
- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:
9.**`use_persistent_context`**
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
- If you want to start with specific cookies or add universal HTTP headers, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
10.**`cookies`** & **`headers`**
- If you want to start with specific cookies or add universal HTTP headers to the browser context, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
11.**`user_agent`** & **`user_agent_mode`**
- `user_agent`: Custom User-Agent string. If `None`, a default is used.
- `user_agent_mode`: Set to `"random"` for randomization (helps fight bot detection).
9. **`text_mode`** & **`light_mode`**:
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
12.**`text_mode`** & **`light_mode`**
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
13.**`extra_args`**
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
11. **`enable_stealth`**:
14.**`enable_stealth`**
- If `True`, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is `False`. Recommended for sites with bot protection.
@@ -134,9 +161,11 @@ class CrawlerRunConfig:
def __init__(
word_count_threshold=200,
extraction_strategy=None,
chunking_strategy=RegexChunking(),
markdown_generator=None,
cache_mode=None,
cache_mode=CacheMode.BYPASS,
js_code=None,
c4a_script=None,
wait_for=None,
screenshot=False,
pdf=False,
@@ -145,13 +174,18 @@ class CrawlerRunConfig:
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode=None,
# Proxy Configuration
proxy_config=None,
proxy_rotation_strategy=None,
# Page Interaction Parameters
scan_full_page=False,
scroll_delay=0.2,
wait_until="domcontentloaded",
page_timeout=60000,
delay_before_return_html=0.1,
# URL Matching Parameters
url_matcher=None, # For URL-specific configurations
match_mode=MatchMode.OR,
verbose=True,
stream=False, # Enable streaming for arun_many()
# ... other advanced parameters omitted
@@ -161,69 +195,68 @@ class CrawlerRunConfig:
### Key Fields to Note
1. **`word_count_threshold`**:
1.**`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
2.**`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
3.**`chunking_strategy`**:
- Strategy to chunk content before extraction.
- Defaults to `RegexChunking()`. Can be customized for different chunking approaches.
4.**`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
5.**`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
- Defaults to `CacheMode.BYPASS`.
5. **`js_code`**:
- A string or list of JS strings to execute.
6.**`js_code`** & **`c4a_script`**:
- `js_code`: A string or list of JavaScript strings to execute.
- `c4a_script`: C4A script that compiles to JavaScript.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
7.**`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
8.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **Location Parameters**:
9.**Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
10.**Proxy Configuration**:
- **`proxy_config`**: Proxy server configuration (ProxyConfig object or dict) e.g. {"server": "...", "username": "...", "password"}
- **`proxy_rotation_strategy`**: Strategy for rotating proxies during crawls
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
11.**Page Interaction Parameters**:
- **`scan_full_page`**: If `True`, scroll through the entire page to load all content
- **`wait_until`**: Condition to wait for when navigating (e.g., "domcontentloaded", "networkidle")
- **`page_timeout`**: Timeout in milliseconds for page operations (default: 60000)
- **`delay_before_return_html`**: Delay in seconds before retrieving final HTML.
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
14. **`url_matcher`** & **`match_mode`**:
12.**`url_matcher`** & **`match_mode`**:
- Enable URL-specific configurations when used with `arun_many()`.
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
- See [URL-Specific Configurations](../api/arun_many.md#url-specific-configurations) for examples.
15. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
13.**`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
14.**`stream`**:
- If `True`, enables streaming mode for `arun_many()` to process URLs as they complete.
- Allows handling results incrementally instead of waiting for all URLs to finish.
### Helper Methods
@@ -263,20 +296,32 @@ The `clone()` method:
### Key fields to note
1. **`provider`**:
1.**`provider`**:
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
2.**`api_token`**:
- Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
3. **`base_url`**:
3.**`base_url`**:
- If your provider has a custom endpoint
4.**Retry/backoff controls** *(optional)*:
- `backoff_base_delay` *(default `2` seconds)* base delay inserted before the first retry when the provider returns a rate-limit response.
- `backoff_max_attempts` *(default `3`)* total number of attempts (initial call plus retries) before the request is surfaced as an error.
- `backoff_exponential_factor` *(default `2`)* growth rate for the retry delay (`delay = base_delay * factor^attempt`).
- These values are forwarded to the shared `perform_completion_with_backoff` helper, ensuring every strategy that consumes your `LLMConfig` honors the same throttling policy.
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
backoff_base_delay=1, # optional
backoff_max_attempts=5, # optional
backoff_exponential_factor=3, #optional
)
```
## 4. Putting It All Together

View File

@@ -69,12 +69,12 @@ The tutorial includes a Flask-based web interface with:
cd docs/examples/c4a_script/tutorial/
# Install dependencies
pip install flask
pip install -r requirements.txt
# Launch the tutorial server
python app.py
python server.py
# Open http://localhost:5000 in your browser
# Open http://localhost:8000 in your browser
```
## Core Concepts
@@ -111,8 +111,8 @@ CLICK `.submit-btn`
# By attribute
CLICK `button[type="submit"]`
# By text content
CLICK `button:contains("Sign In")`
# By accessible attributes
CLICK `button[aria-label="Search"][title="Search"]`
# Complex selectors
CLICK `.form-container input[name="email"]`

View File

@@ -4,11 +4,13 @@ One of Crawl4AI's most powerful features is its ability to perform **configurabl
In this tutorial, you'll learn:
1. How to set up a **Basic Deep Crawler** with BFS strategy
2. Understanding the difference between **streamed and non-streamed** output
3. Implementing **filters and scorers** to target specific content
4. Creating **advanced filtering chains** for sophisticated crawls
5. Using **BestFirstCrawling** for intelligent exploration prioritization
1. How to set up a **Basic Deep Crawler** with BFS strategy
2. Understanding the difference between **streamed and non-streamed** output
3. Implementing **filters and scorers** to target specific content
4. Creating **advanced filtering chains** for sophisticated crawls
5. Using **BestFirstCrawling** for intelligent exploration prioritization
6. **Crash recovery** for long-running production crawls
7. **Prefetch mode** for fast URL discovery
> **Prerequisites**
> - Youve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
@@ -485,7 +487,249 @@ This is especially useful for security-conscious crawling or when dealing with s
---
## 10. Summary & Next Steps
## 10. Crash Recovery for Long-Running Crawls
For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.
### 10.1 Enabling State Persistence
All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:
- **`resume_state`**: Pass a previously saved state to resume from a checkpoint
- **`on_state_change`**: Async callback fired after each URL is processed
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
import json
# Callback to save state after each URL
async def save_state_to_redis(state: dict):
await redis.set("crawl_state", json.dumps(state))
strategy = BFSDeepCrawlStrategy(
max_depth=3,
on_state_change=save_state_to_redis, # Called after each URL
)
```
### 10.2 State Structure
The state dictionary is JSON-serializable and contains:
```python
{
"strategy_type": "bfs", # or "dfs", "best_first"
"visited": ["url1", "url2", ...], # Already crawled URLs
"pending": [{"url": "...", "parent_url": "..."}], # Queue/stack
"depths": {"url1": 0, "url2": 1}, # Depth tracking
"pages_crawled": 42 # Counter
}
```
### 10.3 Resuming from a Checkpoint
```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
# Load saved state (e.g., from Redis, database, or file)
saved_state = json.loads(await redis.get("crawl_state"))
# Resume crawling from where we left off
strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Continue from checkpoint
on_state_change=save_state_to_redis, # Keep saving progress
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)
async with AsyncWebCrawler() as crawler:
# Will skip already-visited URLs and continue from pending queue
results = await crawler.arun(start_url, config=config)
```
### 10.4 Manual State Export
You can export the last captured state using `export_state()`. Note that this requires `on_state_change` to be set (state is captured in the callback):
```python
import json
captured_state = None
async def capture_state(state: dict):
global captured_state
captured_state = state
strategy = BFSDeepCrawlStrategy(
max_depth=2,
on_state_change=capture_state, # Required for state capture
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun(start_url, config=config)
# Get the last captured state
state = strategy.export_state()
if state:
# Save to your preferred storage
with open("crawl_checkpoint.json", "w") as f:
json.dump(state, f)
```
### 10.5 Complete Example: Redis-Based Recovery
```python
import asyncio
import json
import redis.asyncio as redis
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
REDIS_KEY = "crawl4ai:crawl_state"
async def main():
redis_client = redis.Redis(host='localhost', port=6379, db=0)
# Check for existing state
saved_state = None
existing = await redis_client.get(REDIS_KEY)
if existing:
saved_state = json.loads(existing)
print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")
# State persistence callback
async def persist_state(state: dict):
await redis_client.set(REDIS_KEY, json.dumps(state))
# Create strategy with recovery support
strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=100,
resume_state=saved_state,
on_state_change=persist_state,
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
try:
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://example.com", config=config):
print(f"Crawled: {result.url}")
except Exception as e:
print(f"Crawl interrupted: {e}")
print("State saved - restart to resume")
finally:
await redis_client.close()
if __name__ == "__main__":
asyncio.run(main())
```
### 10.6 Zero Overhead
When `resume_state=None` and `on_state_change=None` (the defaults), there is no performance impact. State tracking only activates when you enable these features.
---
## 11. Prefetch Mode for Fast URL Discovery
When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.
### 11.1 Enabling Prefetch Mode
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
config = CrawlerRunConfig(prefetch=True)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
# Result contains only HTML and links - no markdown, no extraction
print(f"Found {len(result.links['internal'])} internal links")
print(f"Found {len(result.links['external'])} external links")
```
### 11.2 What Gets Skipped
Prefetch mode uses a fast path that bypasses heavy processing:
| Processing Step | Normal Mode | Prefetch Mode |
|----------------|-------------|---------------|
| Fetch HTML | ✅ | ✅ |
| Extract links | ✅ | ✅ (fast `quick_extract_links()`) |
| Generate markdown | ✅ | ❌ Skipped |
| Content scraping | ✅ | ❌ Skipped |
| Media extraction | ✅ | ❌ Skipped |
| LLM extraction | ✅ | ❌ Skipped |
### 11.3 Performance Benefit
- **Normal mode**: Full pipeline (~2-5 seconds per page)
- **Prefetch mode**: HTML + links only (~200-500ms per page)
This makes prefetch mode **5-10x faster** for URL discovery.
### 11.4 Two-Phase Crawling Pattern
The most common use case is two-phase crawling:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def two_phase_crawl(start_url: str):
async with AsyncWebCrawler() as crawler:
# ═══════════════════════════════════════════════
# Phase 1: Fast discovery (prefetch mode)
# ═══════════════════════════════════════════════
prefetch_config = CrawlerRunConfig(prefetch=True)
discovery = await crawler.arun(start_url, config=prefetch_config)
all_urls = [link["href"] for link in discovery.links.get("internal", [])]
print(f"Discovered {len(all_urls)} URLs")
# Filter to URLs you care about
blog_urls = [url for url in all_urls if "/blog/" in url]
print(f"Found {len(blog_urls)} blog posts to process")
# ═══════════════════════════════════════════════
# Phase 2: Full processing on selected URLs only
# ═══════════════════════════════════════════════
full_config = CrawlerRunConfig(
# Your normal extraction settings
word_count_threshold=100,
remove_overlay_elements=True,
)
results = []
for url in blog_urls:
result = await crawler.arun(url, config=full_config)
if result.success:
results.append(result)
print(f"Processed: {url}")
return results
if __name__ == "__main__":
results = asyncio.run(two_phase_crawl("https://example.com"))
print(f"Fully processed {len(results)} pages")
```
### 11.5 Use Cases
- **Site mapping**: Quickly discover all URLs before deciding what to process
- **Link validation**: Check which pages exist without heavy processing
- **Selective deep crawl**: Prefetch to find URLs, filter by pattern, then full crawl
- **Crawl planning**: Estimate crawl size before committing resources
---
## 12. Summary & Next Steps
In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
@@ -495,5 +739,7 @@ In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
- Use scorers to prioritize the most relevant pages
- Limit crawls with `max_pages` and `score_threshold` parameters
- Build a complete advanced crawler with combined techniques
- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
- **Use prefetch mode** for fast URL discovery and two-phase crawling
With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.

View File

@@ -11,6 +11,12 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
## Proxies
| Example | Description | Link |
|----------|--------------|------|
| **NSTProxy** | [NSTProxy](https://www.nstproxy.com/?utm_source=crawl4ai) Seamlessly integrates with crawl4ai — no setup required. Access high-performance residential, datacenter, ISP, and IPv6 proxies with smart rotation and anti-blocking technology. Starts from $0.1/GB. Use code crawl4ai for 10% off. | [View Code](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/proxy) |
## Browser & Crawling Features
| Example | Description | Link |
@@ -56,13 +62,14 @@ This page provides a comprehensive list of example scripts that demonstrate vari
## Anti-Bot & Stealth Features
| Example | Description | Link |
|---------|-------------|------|
| Stealth Mode Quick Start | Five practical examples showing how to use stealth mode for bypassing basic bot detection. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_quick_start.py) |
| Example | Description | Link |
|----------------------------|-------------|------|
| Stealth Mode Quick Start | Five practical examples showing how to use stealth mode for bypassing basic bot detection. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_quick_start.py) |
| Stealth Mode Comprehensive | Comprehensive demonstration of stealth mode features with bot detection testing and comparisons. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_example.py) |
| Undetected Browser | Simple example showing how to use the undetected browser adapter. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world_undetected.py) |
| Undetected Browser Demo | Basic demo comparing regular and undetected browser modes. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/undetected_simple_demo.py) |
| Undetected Tests | Advanced tests comparing regular vs undetected browsers on various bot detection services. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/undetectability/) |
| Undetected Browser | Simple example showing how to use the undetected browser adapter. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world_undetected.py) |
| Undetected Browser Demo | Basic demo comparing regular and undetected browser modes. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/undetected_simple_demo.py) |
| Undetected Tests | Advanced tests comparing regular vs undetected browsers on various bot detection services. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/undetectability/) |
| CapSolver Captcha Solver | Seamlessly integrate with [CapSolver](https://www.capsolver.com/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration) to automatically solve reCAPTCHA v2/v3, Cloudflare Turnstile / Challenges, AWS WAF and more for uninterrupted scraping and automation. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/capsolver_captcha_solver/) |
## Customization & Security

View File

@@ -1,4 +1,20 @@
# Crawl4AI Docker Guide 🐳
# Self-Hosting Crawl4AI 🚀
**Take Control of Your Web Crawling Infrastructure**
Self-hosting Crawl4AI gives you complete control over your web crawling and data extraction pipeline. Unlike cloud-based solutions, you own your data, infrastructure, and destiny.
## Why Self-Host?
- **🔒 Data Privacy**: Your crawled data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing - scale within your own resources
- **🎯 Customization**: Full control over browser configurations, extraction strategies, and performance tuning
- **📊 Transparency**: Real-time monitoring dashboard shows exactly what's happening
- **⚡ Performance**: Direct access without API rate limits or geographic restrictions
- **🛡️ Security**: Keep sensitive data extraction workflows behind your firewall
- **🔧 Flexibility**: Customize, extend, and integrate with your existing infrastructure
When you self-host, you can scale from a single container to a full browser infrastructure, all while maintaining complete control and visibility.
## Table of Contents
- [Prerequisites](#prerequisites)
@@ -13,36 +29,14 @@
- [Available MCP Tools](#available-mcp-tools)
- [Testing MCP Connections](#testing-mcp-connections)
- [MCP Schemas](#mcp-schemas)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [User-Provided Hooks API](#user-provided-hooks-api)
- [Hook Information Endpoint](#hook-information-endpoint)
- [Available Hook Points](#available-hook-points)
- [Using Hooks in Requests](#using-hooks-in-requests)
- [Hook Examples with Real URLs](#hook-examples-with-real-urls)
- [Security Best Practices](#security-best-practices)
- [Hook Response Information](#hook-response-information)
- [Error Handling](#error-handling)
- [Hooks Utility: Function-Based Approach (Python)](#hooks-utility-function-based-approach-python)
- [Job Queue & Webhook API](#job-queue-webhook-api)
- [Why Use the Job Queue API?](#why-use-the-job-queue-api)
- [Available Endpoints](#available-endpoints)
- [Webhook Configuration](#webhook-configuration)
- [Usage Examples](#usage-examples)
- [Webhook Best Practices](#webhook-best-practices)
- [Use Cases](#use-cases)
- [Troubleshooting](#troubleshooting)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [LLM Configuration Examples](#llm-configuration-examples)
- [Metrics & Monitoring](#metrics--monitoring)
- [Real-time Monitoring & Operations](#real-time-monitoring--operations)
- [Monitoring Dashboard](#monitoring-dashboard)
- [Monitor API Endpoints](#monitor-api-endpoints)
- [WebSocket Streaming](#websocket-streaming)
- [Control Actions](#control-actions)
- [Production Integration](#production-integration)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Server Configuration](#server-configuration)
- [Understanding config.yml](#understanding-configyml)
- [JWT Authentication](#jwt-authentication)
@@ -73,13 +67,13 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> 💡 **Note**: The `latest` tag points to the stable `0.7.6` version.
> 💡 **Note**: The `latest` tag points to the stable `0.8.0` version.
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.6
docker pull unclecode/crawl4ai:0.8.0
# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest
@@ -151,7 +145,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.6`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.8.0`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
@@ -1957,22 +1951,469 @@ async def test_stream_crawl(token: str = None): # Made token optional
---
## Metrics & Monitoring
## Real-time Monitoring & Operations
Keep an eye on your crawler with these endpoints:
One of the key advantages of self-hosting is complete visibility into your infrastructure. Crawl4AI includes a comprehensive real-time monitoring system that gives you full transparency and control.
- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema
### Monitoring Dashboard
Example health check:
Access the **built-in real-time monitoring dashboard** for complete operational visibility:
```
http://localhost:11235/monitor
```
![Monitoring Dashboard](https://via.placeholder.com/800x400?text=Crawl4AI+Monitoring+Dashboard)
**Dashboard Features:**
#### 1. System Health Overview
- **CPU & Memory**: Live usage with progress bars and percentage indicators
- **Network I/O**: Total bytes sent/received since startup
- **Server Uptime**: How long your server has been running
- **Browser Pool Status**:
- 🔥 Permanent browser (always-on default config, ~270MB)
- ♨️ Hot pool (frequently used configs, ~180MB each)
- ❄️ Cold pool (idle browsers awaiting cleanup, ~180MB each)
- **Memory Pressure**: LOW/MEDIUM/HIGH indicator for janitor behavior
#### 2. Live Request Tracking
- **Active Requests**: Currently running crawls with:
- Request ID for tracking
- Target URL (truncated for display)
- Endpoint being used
- Elapsed time (updates in real-time)
- Memory usage from start
- **Completed Requests**: Last 10 finished requests showing:
- Success/failure status (color-coded)
- Total execution time
- Memory delta (how much memory changed)
- Pool hit (was browser reused?)
- HTTP status code
- **Filtering**: View all, success only, or errors only
#### 3. Browser Pool Management
Interactive table showing all active browsers:
| Type | Signature | Age | Last Used | Hits | Actions |
|------|-----------|-----|-----------|------|---------|
| permanent | abc12345 | 2h | 5s ago | 1,247 | Restart |
| hot | def67890 | 45m | 2m ago | 89 | Kill / Restart |
| cold | ghi11213 | 30m | 15m ago | 3 | Kill / Restart |
- **Reuse Rate**: Percentage of requests that reused existing browsers
- **Memory Estimates**: Total memory used by browser pool
- **Manual Control**: Kill or restart individual browsers
#### 4. Janitor Events Log
Real-time log of browser pool cleanup events:
- When cold browsers are closed due to memory pressure
- When browsers are promoted from cold to hot pool
- Forced cleanups triggered manually
- Detailed cleanup reasons and browser signatures
#### 5. Error Monitoring
Recent errors with full context:
- Timestamp
- Endpoint where error occurred
- Target URL
- Error message
- Request ID for correlation
**Live Updates:**
The dashboard connects via WebSocket and refreshes every **2 seconds** with the latest data. Connection status indicator shows when you're connected/disconnected.
---
### Monitor API Endpoints
For programmatic monitoring, automation, and integration with your existing infrastructure:
#### Health & Statistics
**Get System Health**
```bash
curl http://localhost:11235/health
GET /monitor/health
```
Returns current system snapshot:
```json
{
"container": {
"memory_percent": 45.2,
"cpu_percent": 23.1,
"network_sent_mb": 1250.45,
"network_recv_mb": 3421.12,
"uptime_seconds": 7234
},
"pool": {
"permanent": {"active": true, "memory_mb": 270},
"hot": {"count": 3, "memory_mb": 540},
"cold": {"count": 1, "memory_mb": 180},
"total_memory_mb": 990
},
"janitor": {
"next_cleanup_estimate": "adaptive",
"memory_pressure": "MEDIUM"
}
}
```
**Get Request Statistics**
```bash
GET /monitor/requests?status=all&limit=50
```
Query parameters:
- `status`: Filter by `all`, `active`, `completed`, `success`, or `error`
- `limit`: Number of completed requests to return (1-1000)
**Get Browser Pool Details**
```bash
GET /monitor/browsers
```
Returns detailed information about all active browsers:
```json
{
"browsers": [
{
"type": "permanent",
"sig": "abc12345",
"age_seconds": 7234,
"last_used_seconds": 5,
"memory_mb": 270,
"hits": 1247,
"killable": false
},
{
"type": "hot",
"sig": "def67890",
"age_seconds": 2701,
"last_used_seconds": 120,
"memory_mb": 180,
"hits": 89,
"killable": true
}
],
"summary": {
"total_count": 5,
"total_memory_mb": 990,
"reuse_rate_percent": 87.3
}
}
```
**Get Endpoint Performance Statistics**
```bash
GET /monitor/endpoints/stats
```
Returns aggregated metrics per endpoint:
```json
{
"/crawl": {
"count": 1523,
"avg_latency_ms": 2341.5,
"success_rate_percent": 98.2,
"pool_hit_rate_percent": 89.1,
"errors": 27
},
"/md": {
"count": 891,
"avg_latency_ms": 1823.7,
"success_rate_percent": 99.4,
"pool_hit_rate_percent": 92.3,
"errors": 5
}
}
```
**Get Timeline Data**
```bash
GET /monitor/timeline?metric=memory&window=5m
```
Parameters:
- `metric`: `memory`, `requests`, or `browsers`
- `window`: Currently only `5m` (5-minute window, 5-second resolution)
Returns time-series data for charts:
```json
{
"timestamps": [1699564800, 1699564805, 1699564810, ...],
"values": [42.1, 43.5, 41.8, ...]
}
```
#### Logs
**Get Janitor Events**
```bash
GET /monitor/logs/janitor?limit=100
```
**Get Error Log**
```bash
GET /monitor/logs/errors?limit=100
```
---
*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)*
### WebSocket Streaming
For real-time monitoring in your own dashboards or applications:
```bash
WS /monitor/ws
```
**Connection Example (Python):**
```python
import asyncio
import websockets
import json
async def monitor_server():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to Crawl4AI monitor")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Extract key metrics
health = update['health']
active_requests = len(update['requests']['active'])
browsers = len(update['browsers'])
print(f"Memory: {health['container']['memory_percent']:.1f}% | "
f"Active: {active_requests} | "
f"Browsers: {browsers}")
# Check for high memory pressure
if health['janitor']['memory_pressure'] == 'HIGH':
print("⚠️ HIGH MEMORY PRESSURE - Consider cleanup")
asyncio.run(monitor_server())
```
**Update Payload Structure:**
```json
{
"timestamp": 1699564823.456,
"health": { /* System health snapshot */ },
"requests": {
"active": [ /* Currently running */ ],
"completed": [ /* Last 10 completed */ ]
},
"browsers": [ /* All active browsers */ ],
"timeline": {
"memory": { /* Last 5 minutes */ },
"requests": { /* Request rate */ },
"browsers": { /* Pool composition */ }
},
"janitor": [ /* Last 10 cleanup events */ ],
"errors": [ /* Last 10 errors */ ]
}
```
---
### Control Actions
Take manual control when needed:
**Force Immediate Cleanup**
```bash
POST /monitor/actions/cleanup
```
Kills all cold pool browsers immediately (useful when memory is tight):
```json
{
"success": true,
"killed_browsers": 3
}
```
**Kill Specific Browser**
```bash
POST /monitor/actions/kill_browser
Content-Type: application/json
{
"sig": "abc12345" // First 8 chars of browser signature
}
```
Response:
```json
{
"success": true,
"killed_sig": "abc12345",
"pool_type": "hot"
}
```
**Restart Browser**
```bash
POST /monitor/actions/restart_browser
Content-Type: application/json
{
"sig": "permanent" // Or first 8 chars of signature
}
```
For permanent browser, this will close and reinitialize it. For hot/cold browsers, it kills them and lets new requests create fresh ones.
**Reset Statistics**
```bash
POST /monitor/stats/reset
```
Clears endpoint counters (useful for starting fresh after testing).
---
### Production Integration
#### Integration with Existing Monitoring Systems
**Prometheus Integration:**
```bash
# Scrape metrics endpoint
curl http://localhost:11235/metrics
```
**Custom Dashboard Integration:**
```python
# Example: Push metrics to your monitoring system
import asyncio
import websockets
import json
from your_monitoring import push_metric
async def integrate_monitoring():
async with websockets.connect("ws://localhost:11235/monitor/ws") as ws:
while True:
data = json.loads(await ws.recv())
# Push to your monitoring system
push_metric("crawl4ai.memory.percent",
data['health']['container']['memory_percent'])
push_metric("crawl4ai.active_requests",
len(data['requests']['active']))
push_metric("crawl4ai.browser_count",
len(data['browsers']))
```
**Alerting Example:**
```python
import requests
import time
def check_health():
"""Poll health endpoint and alert on issues"""
response = requests.get("http://localhost:11235/monitor/health")
health = response.json()
# Alert on high memory
if health['container']['memory_percent'] > 85:
send_alert(f"High memory: {health['container']['memory_percent']}%")
# Alert on high error rate
stats = requests.get("http://localhost:11235/monitor/endpoints/stats").json()
for endpoint, metrics in stats.items():
if metrics['success_rate_percent'] < 95:
send_alert(f"{endpoint} success rate: {metrics['success_rate_percent']}%")
# Run every minute
while True:
check_health()
time.sleep(60)
```
**Log Aggregation:**
```python
import requests
from datetime import datetime
def aggregate_errors():
"""Fetch and aggregate errors for logging system"""
response = requests.get("http://localhost:11235/monitor/logs/errors?limit=100")
errors = response.json()['errors']
for error in errors:
log_to_system({
'timestamp': datetime.fromtimestamp(error['timestamp']),
'service': 'crawl4ai',
'endpoint': error['endpoint'],
'url': error['url'],
'message': error['error'],
'request_id': error['request_id']
})
```
#### Key Metrics to Track
For production self-hosted deployments, monitor these metrics:
1. **Memory Usage Trends**
- Track `container.memory_percent` over time
- Alert when consistently above 80%
- Prevents OOM kills
2. **Request Success Rates**
- Monitor per-endpoint success rates
- Alert when below 95%
- Indicates crawling issues
3. **Average Latency**
- Track `avg_latency_ms` per endpoint
- Detect performance degradation
- Optimize slow endpoints
4. **Browser Pool Efficiency**
- Monitor `reuse_rate_percent`
- Should be >80% for good efficiency
- Low rates indicate pool churn
5. **Error Frequency**
- Count errors per time window
- Alert on sudden spikes
- Track error patterns
6. **Janitor Activity**
- Monitor cleanup frequency
- Excessive cleanup indicates memory pressure
- Adjust pool settings if needed
---
### Quick Health Check
For simple uptime monitoring:
```bash
curl http://localhost:11235/health
```
Returns:
```json
{
"status": "healthy",
"version": "0.7.4"
}
```
Other useful endpoints:
- `/metrics` - Prometheus metrics
- `/schema` - Full API schema
---
@@ -2132,43 +2573,46 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
## Summary
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK with **automatic hook conversion**
- **Working with hooks** - both string-based (REST API) and function-based (Python SDK)
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment
Congratulations! You now have everything you need to self-host your own Crawl4AI infrastructure with complete control and visibility.
### Key Features
**What You've Learned:**
- ✅ Multiple deployment options (Docker Hub, Docker Compose, manual builds)
- ✅ Environment configuration and LLM integration
- ✅ Using the interactive playground for testing
- ✅ Making API requests with proper typing (SDK and REST)
- ✅ Specialized endpoints (screenshots, PDFs, JavaScript execution)
- ✅ MCP integration for AI-assisted development
- ✅ **Real-time monitoring dashboard** for operational transparency
- ✅ **Monitor API** for programmatic control and integration
- ✅ Production deployment best practices
**Hooks Support**: Crawl4AI offers two approaches for working with hooks:
- **String-based** (REST API): Works with any language, requires manual string formatting
- **Function-based** (Python SDK): Write hooks as regular Python functions with full IDE support and automatic conversion
**Why This Matters:**
**Playground Interface**: The built-in playground at `http://localhost:11235/playground` makes it easy to test configurations and generate corresponding JSON for API requests.
By self-hosting Crawl4AI, you:
- 🔒 **Own Your Data**: Everything stays in your infrastructure
- 📊 **See Everything**: Real-time dashboard shows exactly what's happening
- 💰 **Control Costs**: Scale within your resources, no per-request fees
- ⚡ **Maximize Performance**: Direct access with smart browser pooling (10x memory efficiency)
- 🛡️ **Stay Secure**: Keep sensitive workflows behind your firewall
- 🔧 **Customize Freely**: Full control over configs, strategies, and optimizations
**MCP Integration**: For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
**Next Steps:**
### Next Steps
1. **Start Simple**: Deploy with Docker Hub image and test with the playground
2. **Monitor Everything**: Open `http://localhost:11235/monitor` to watch your server
3. **Integrate**: Connect your applications using the Python SDK or REST API
4. **Scale Smart**: Use the monitoring data to optimize your deployment
5. **Go Production**: Set up alerting, log aggregation, and automated cleanup
1. **Explore Examples**: Check out the comprehensive examples in:
- `/docs/examples/hooks_docker_client_example.py` - Python function-based hooks
- `/docs/examples/hooks_rest_api_example.py` - REST API string-based hooks
- `/docs/examples/README_HOOKS.md` - Comparison and guide
**Key Resources:**
- 🎮 **Playground**: `http://localhost:11235/playground` - Interactive testing
- 📊 **Monitor Dashboard**: `http://localhost:11235/monitor` - Real-time visibility
- 📖 **Architecture Docs**: `deploy/docker/ARCHITECTURE.md` - Deep technical dive
- 💬 **Discord Community**: Get help and share experiences
- ⭐ **GitHub**: Report issues, contribute, show support
2. **Read Documentation**:
- `/docs/hooks-utility-guide.md` - Complete hooks utility guide
- API documentation for detailed configuration options
Remember: The monitoring dashboard is your window into your infrastructure. Use it to understand performance, troubleshoot issues, and optimize your deployment. The examples in the `examples` folder show real-world usage patterns you can adapt.
3. **Join the Community**:
- GitHub: Report issues and contribute
- Discord: Get help and share your experiences
- Documentation: Comprehensive guides and tutorials
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
**You're now in control of your web crawling destiny!** 🚀
Happy crawling! 🕷️

View File

@@ -255,6 +255,8 @@ The `SeedingConfig` object is your control panel. Here's everything you can conf
| `scoring_method` | str | None | Scoring method (currently "bm25") |
| `score_threshold` | float | None | Minimum score to include URL |
| `filter_nonsense_urls` | bool | True | Filter out utility URLs (robots.txt, etc.) |
| `cache_ttl_hours` | int | 24 | Hours before sitemap cache expires (0 = no TTL) |
| `validate_sitemap_lastmod` | bool | True | Check sitemap's lastmod and refetch if newer |
#### Pattern Matching Examples
@@ -968,10 +970,49 @@ config = SeedingConfig(
The seeder automatically caches results to speed up repeated operations:
- **Common Crawl cache**: `~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl`
- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl`
- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].json`
- **HEAD data cache**: `~/.cache/url_seeder/head/[hash].json`
Cache expires after 7 days by default. Use `force=True` to refresh.
#### Smart TTL Cache for Sitemaps
Sitemap caches now include intelligent validation:
```python
# Default: 24-hour TTL with lastmod validation
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=24, # Cache expires after 24 hours
validate_sitemap_lastmod=True # Also check if sitemap was updated
)
# Aggressive caching (1 week, no lastmod check)
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=168, # 7 days
validate_sitemap_lastmod=False # Trust TTL only
)
# Always validate (no TTL, only lastmod)
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=0, # Disable TTL
validate_sitemap_lastmod=True # Refetch if sitemap has newer lastmod
)
# Always fresh (bypass cache completely)
config = SeedingConfig(
source="sitemap",
force=True # Ignore all caching
)
```
**Cache validation priority:**
1. `force=True` → Always refetch
2. Cache doesn't exist → Fetch fresh
3. `validate_sitemap_lastmod=True` and sitemap has newer `<lastmod>` → Refetch
4. `cache_ttl_hours > 0` and cache is older than TTL → Refetch
5. Cache corrupted → Refetch (automatic recovery)
6. Otherwise → Use cache
### Pattern Matching Strategies
@@ -1060,6 +1101,9 @@ config = SeedingConfig(
| Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
| Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
| Connection not closed | Use context manager or call `await seeder.close()` |
| Stale/outdated URLs | Set `cache_ttl_hours=0` or use `force=True` |
| Cache not updating | Check `validate_sitemap_lastmod=True`, or use `force=True` |
| Incomplete URL list | Delete cache file and refetch, or use `force=True` |
### Performance Benchmarks
@@ -1119,6 +1163,7 @@ config = SeedingConfig(
3. **Context Manager Support**: Automatic cleanup with `async with` statement
4. **URL-Based Scoring**: Smart filtering even without head extraction
5. **Smart URL Filtering**: Automatically excludes utility/nonsense URLs
6. **Dual Caching**: Separate caches for URL lists and metadata
6. **Smart TTL Cache**: Sitemap caches with TTL expiry and lastmod validation
7. **Automatic Cache Recovery**: Corrupted or incomplete caches are automatically refreshed
Now go forth and seed intelligently! 🌱🚀
Now go forth and seed intelligently!

View File

@@ -20,10 +20,10 @@ In some cases, you need to extract **complex or unstructured** information from
## 2. Provider-Agnostic via LiteLLM
You can use LlmConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LlmConfig [here](/api/parameters).
You can use LLMConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LLMConfig [here](/api/parameters).
```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
@@ -58,7 +58,7 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`llmConfig`** (LlmConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
1. **`llm_config`** (LLMConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
3. **`extraction_type`** (str): `"schema"` or `"block"`.
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
@@ -112,7 +112,7 @@ async def main():
# 1. Define the LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
schema=Product.schema_json(), # Or use model_json_schema()
schema=Product.model_json_schema(), # Or use model_json_schema()
extraction_type="schema",
instruction="Extract all product objects with 'name' and 'price' from the content.",
chunk_token_threshold=1000,
@@ -238,7 +238,7 @@ class KnowledgeGraph(BaseModel):
async def main():
# LLM extraction strategy
llm_strat = LLMExtractionStrategy(
llmConfig = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
llm_config = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="Extract entities and relationships from the content. Return valid JSON.",

Some files were not shown because too many files have changed in this diff Show More