* Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
Fix: Wrong URL variable used for extraction of raw html
* Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: ensure BrowserConfig.to_dict serializes proxy_config
* feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
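For illustration only, a minimal sketch of what a configurable backoff looks like. The names `backoff_delays`, `call_with_backoff`, `base_delay`, `max_attempts`, and `factor` are hypothetical; they are not the actual LLMConfig field names or the signature of perform_completion_with_backoff.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base_delay: float, max_attempts: int, factor: float) -> List[float]:
    """Delays slept between attempts: base, base*factor, base*factor**2, ..."""
    return [base_delay * (factor ** i) for i in range(max_attempts - 1)]


def call_with_backoff(
    fn: Callable[[], T],
    base_delay: float = 2.0,
    max_attempts: int = 3,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(base_delay, max_attempts, factor)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Threading the three knobs through every caller, as the commit describes, amounts to passing them down to a helper of this shape instead of relying on hard-coded values.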
* reproduced AttributeError from #1642
* pass timeout parameter to docker client request
* added missing deep crawling objects to init
* generalized query in ContentRelevanceFilter to be a str or list
* import modules from enhanceable deserialization
* parameterized tests
* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
* Add browser_context_id and target_id parameters to BrowserConfig
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.
Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality
This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios
* Fix: add cdp_cleanup_on_close to from_kwargs
* Fix: find context by target_id for concurrent CDP connections
* Fix: use target_id to find correct page in get_page
* Fix: use CDP to find context by browserContextId for concurrent sessions
* Revert context matching attempts - Playwright cannot see CDP-created contexts
* Add create_isolated_context flag for concurrent CDP crawls
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
* Add context caching to create_isolated_context branch
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
* Add init_scripts support to BrowserConfig for pre-page-load JS injection
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).
- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
* Fix CDP connection handling: support WS URLs and proper cleanup
Changes to browser_manager.py:
1. _verify_cdp_ready(): Support multiple URL formats
- WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
- HTTP URLs with query params: Properly parse with urlparse to preserve query string
- Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params
2. close(): Proper cleanup when cdp_cleanup_on_close=True
- Close all sessions (pages)
- Close all contexts
- Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
- Wait 1 second for CDP connection to fully release
- Stop Playwright instance to prevent memory leaks
This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)
Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
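A rough sketch of the URL handling described for _verify_cdp_ready(), assuming the readiness probe targets /json/version. The helper name `build_version_url` is hypothetical and this is not the code in browser_manager.py.

```python
from typing import Optional
from urllib.parse import urlparse, urlunparse


def build_version_url(cdp_url: str) -> Optional[str]:
    """Return the HTTP URL to probe for readiness, or None for WebSocket URLs."""
    parsed = urlparse(cdp_url)
    if parsed.scheme in ("ws", "wss"):
        return None  # nothing to probe over HTTP; the client connects directly
    path = parsed.path.rstrip("/") + "/json/version"
    # _replace swaps only the path, so netloc and query string survive intact
    return urlunparse(parsed._replace(path=path))


# The naive f-string approach appends the path after the query string:
naive = f"{'http://host:9222?token=abc'}/json/version"
# naive == "http://host:9222?token=abc/json/version"
```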
* Update gitignore
* Some debugging for caching
* Add _generate_screenshot_from_html for raw: and file:// URLs
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward
This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
* Add PDF and MHTML support for raw: and file:// URLs
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
* Add crash recovery for deep crawl strategies
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.
Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)
State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
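To make the checkpoint idea concrete, here is a self-contained sketch over a toy in-memory link graph. Only the parameter names resume_state and on_state_change come from this commit; `bfs_crawl`, `_snapshot`, the `links` dict, and the exact state keys are assumptions for illustration, and depth tracking is omitted.

```python
import asyncio
import json
from collections import deque
from typing import Awaitable, Callable, Dict, List, Optional


def _snapshot(visited: set, pending: deque, pages_crawled: int) -> dict:
    """Build a JSON-serializable snapshot of the crawl state."""
    return {
        "visited": sorted(visited),
        "pending": list(pending),
        "pages_crawled": pages_crawled,
    }


async def bfs_crawl(
    start: str,
    links: Dict[str, List[str]],
    max_pages: int,
    resume_state: Optional[dict] = None,
    on_state_change: Optional[Callable[[dict], Awaitable[None]]] = None,
) -> dict:
    """Breadth-first walk that can resume from a previously saved state."""
    if resume_state:
        visited = set(resume_state["visited"])
        pending = deque(resume_state["pending"])
        pages_crawled = resume_state["pages_crawled"]
    else:
        visited, pending, pages_crawled = set(), deque([start]), 0

    state = _snapshot(visited, pending, pages_crawled)
    while pending and pages_crawled < max_pages:
        url = pending.popleft()
        if url in visited:
            continue
        visited.add(url)
        pages_crawled += 1
        pending.extend(u for u in links.get(url, []) if u not in visited)
        state = _snapshot(visited, pending, pages_crawled)
        if on_state_change is not None:
            await on_state_change(state)  # e.g. write to Redis or a database
    return state
```

A caller would persist each snapshot from the callback and, after a crash, pass the last one back as resume_state.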
* Fix: HTTP strategy raw: URL parsing truncates at # character
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.
Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After: raw:body{background:#eee} -> raw_content = 'body{background:#eee}'
Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
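The difference can be reproduced with the standard library alone. The helper name `strip_raw_prefix` is hypothetical and only illustrates the prefix-stripping approach; it is not the code in AsyncHTTPCrawlerStrategy.

```python
from urllib.parse import urlparse


def strip_raw_prefix(url: str) -> str:
    """Return the HTML payload of a raw: or raw:// URL without URL parsing."""
    for prefix in ("raw://", "raw:"):  # check the longer prefix first
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


html_url = "raw:body{background:#eee}"
truncated = urlparse(html_url).path      # '#' starts a fragment, so the path is cut short
preserved = strip_raw_prefix(html_url)   # the full payload survives
```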
* Add base_url parameter to CrawlerRunConfig for raw HTML processing
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.
Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation
Usage:
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url=f'raw:{html}', config=config)
* Add prefetch mode for two-phase deep crawling
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Updates on proxy rotation and proxy configuration
* Add proxy support to HTTP crawler strategy
* Add browser pipeline support for raw:/file:// URLs
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params
Fixes #310
* Add smart TTL cache for sitemap URL seeder
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
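A sketch of the two validation checks, assuming ISO 8601 timestamps in the cache metadata. The function name `is_cache_valid` and the rule "a newer sitemap lastmod invalidates the cache" are assumptions; only cache_ttl_hours and the metadata keys come from this commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_cache_valid(
    meta: dict,
    cache_ttl_hours: float,
    sitemap_lastmod: Optional[str] = None,
    now: Optional[datetime] = None,
) -> bool:
    """Check a cache entry against its TTL and, optionally, the sitemap lastmod."""
    now = now or datetime.now(timezone.utc)
    created_at = datetime.fromisoformat(meta["created_at"])
    if now - created_at > timedelta(hours=cache_ttl_hours):
        return False  # expired by TTL

    cached_lastmod = meta.get("lastmod")
    if sitemap_lastmod and cached_lastmod:
        if datetime.fromisoformat(sitemap_lastmod) > datetime.fromisoformat(cached_lastmod):
            return False  # sitemap changed after the cache was written
    return True
```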
* Update URL seeder docs with smart TTL cache parameters
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
* Add MEMORY.md to gitignore
* Docs: Add multi-sample schema generation section
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.
Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
* Fix critical RCE and LFI vulnerabilities in Docker API deployment
Security fixes for vulnerabilities reported by ProjectDiscovery:
1. Remote Code Execution via Hooks (CVE pending)
- Remove __import__ from allowed_builtins in hook_manager.py
- Prevents arbitrary module imports (os, subprocess, etc.)
- Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var
2. Local File Inclusion via file:// URLs (CVE pending)
- Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
- Block file://, javascript:, data: and other dangerous schemes
- Only allow http://, https://, and raw: (where appropriate)
3. Security hardening
- Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
- Add security warning comments in config.yml
- Add validate_url_scheme() helper for consistent validation
Testing:
- Add unit tests (test_security_fixes.py) - 16 tests
- Add integration tests (run_security_tests.py) for live server
Affected endpoints:
- POST /crawl (hooks disabled by default)
- POST /crawl/stream (hooks disabled by default)
- POST /execute_js (URL validation added)
- POST /screenshot (URL validation added)
- POST /pdf (URL validation added)
- POST /html (URL validation added)
Breaking changes:
- Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
- file:// URLs no longer work on API endpoints (use library directly)
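An allowlist check of the kind described could look roughly like this. The name validate_url_scheme appears in this commit, but the signature, the `allow_raw` flag, and the error type below are assumptions, not the shipped helper.

```python
ALLOWED_PREFIXES = ("http://", "https://")


def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
    """Raise ValueError unless the URL uses an allowed scheme.

    An allowlist is used rather than a blocklist, so file://, javascript:,
    data: and any scheme not listed are rejected without being enumerated.
    """
    lowered = url.strip().lower()
    if lowered.startswith(ALLOWED_PREFIXES):
        return
    if allow_raw and lowered.startswith("raw:"):
        return
    raise ValueError(f"URL scheme not allowed: {url[:40]!r}")


def is_allowed(url: str, allow_raw: bool = False) -> bool:
    """Boolean convenience wrapper around validate_url_scheme."""
    try:
        validate_url_scheme(url, allow_raw=allow_raw)
        return True
    except ValueError:
        return False
```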
* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests
* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:
- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes
Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints
Security fixes credited to Neo by ProjectDiscovery
* Add examples for deep crawl crash recovery and prefetch mode in documentation
* Release v0.8.0: The v0.8.0 Update
- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery
* Add async agenerate_schema method for schema generation
- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.
Fixes temperature=0.01 error for these models in LLM extraction.
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* announcement: add application form for cloud API closed beta
* Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
* docs: add section for Crawl4AI Cloud API closed beta with application link
* fix: add disk cleanup step to Docker workflow
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
execution, causing URLs to be processed sequentially instead of in parallel.
Changes:
- Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
- Implemented arun() method in ExtractionStrategy base class with thread pool fallback
- Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
- Updated AsyncWebCrawler.arun() to detect and use arun() when available
- Added comprehensive test suite to verify parallel execution
Impact:
- LLM extraction now runs truly in parallel across multiple URLs
- Significant performance improvement for multi-URL crawls with LLM strategies
- Backward compatible - existing extraction strategies continue to work
- No breaking changes to public API
Technical details:
- Uses litellm.acompletion for non-blocking LLM calls
- Leverages asyncio.gather for concurrent chunk processing
- Maintains backward compatibility via asyncio.to_thread fallback
- Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
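The dispatch pattern described above, sketched with stand-in classes. `SyncStrategy`, `AsyncStrategy`, and `extract_all` are hypothetical; real strategies would call an LLM instead of sleeping.

```python
import asyncio
import time
from typing import List


class SyncStrategy:
    """Stand-in for a legacy strategy that only has a blocking run()."""

    def run(self, chunk: str) -> str:
        time.sleep(0.01)  # simulates a blocking call
        return chunk.upper()


class AsyncStrategy(SyncStrategy):
    """Stand-in for a strategy that also offers a native async arun()."""

    async def arun(self, chunk: str) -> str:
        await asyncio.sleep(0.01)  # simulates a non-blocking call
        return chunk.upper()


async def extract_all(strategy, chunks: List[str]) -> List[str]:
    """Process chunks concurrently, using arun() when the strategy has one."""
    if hasattr(strategy, "arun"):
        tasks = [strategy.arun(c) for c in chunks]
    else:
        # Fallback: run the blocking method in worker threads
        tasks = [asyncio.to_thread(strategy.run, c) for c in chunks]
    # gather preserves input order in its result list
    return await asyncio.gather(*tasks)
```

Either branch keeps the event loop free, so several URLs can make progress at once instead of queuing behind one blocking call.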
- Add lightweight security test to verify version requirements
- Add comprehensive integration test for crawl4ai functionality
- Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7
- All tests passing: security vulnerability is resolved
Related to #1545
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added end-to-end test script that automates webhook feature testing:
Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion
Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch
Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features
Usage: ./tests/test_webhook_feature.sh
Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude <noreply@anthropic.com>
Add hooks_to_string() utility function that converts Python function objects
to string representations for the Docker API, enabling developers to write hooks
as regular Python functions instead of strings.
Core Changes:
- New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
- Docker client now accepts both function objects and strings for hooks
- Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
- New hooks and hooks_timeout parameters in client.crawl() method
Documentation:
- Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
- Updated main Docker deployment guide with comprehensive hooks section
- Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
- Fix critical bug where overlay removal JS function was injected but never called
- Change remove_overlay_elements() to properly execute the injected async function
- Wrap JS execution in async to handle the async overlay removal logic
- Add test_remove_overlay_elements() test case to verify functionality works
- Ensure overlay elements (cookie banners, popups, modals) are actually removed
The remove_overlay_elements feature now works as intended:
- Before: Function definition injected but never executed (silent failure)
- After: Function injected and called, successfully removing overlay elements
Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
- Return comprehensive error messages along with status codes for API internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
Implement hierarchical configuration for LLM parameters with support for:
- Temperature control (0.0-2.0) to adjust response creativity
- Custom base_url for proxy servers and alternative endpoints
- 4-tier priority: request params > provider env > global env > defaults
Add helper functions in utils.py, update API schemas and handlers,
support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
and provide comprehensive documentation with examples.
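The priority order can be sketched as a first-non-None lookup. The environment variable names come from this commit; the function name `resolve_temperature`, the way the provider prefix is derived, and the default value are assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_TEMPERATURE = 0.7  # hypothetical default, not taken from the project


def resolve_temperature(
    request_value: Optional[float],
    provider: str,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Resolve temperature: request > provider env > global env > default."""
    provider_prefix = provider.split("/")[0].upper()  # "openai/gpt-4o-mini" -> "OPENAI"
    candidates = [
        request_value,
        env.get(f"{provider_prefix}_TEMPERATURE"),
        env.get("LLM_TEMPERATURE"),
    ]
    for value in candidates:
        if value is not None:
            temperature = float(value)
            if not 0.0 <= temperature <= 2.0:
                raise ValueError(f"temperature out of range: {temperature}")
            return temperature
    return DEFAULT_TEMPERATURE
```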
- Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
- Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
- Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
- Ensure filter chains work correctly with Docker client and REST API
The issue occurred because:
1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors
Changes:
- async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
- filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
- models.py: Convert property descriptors to strings in model_dump() instead of including them directly
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern
This commit introduces a strategy-based approach to table extraction in Crawl4AI:
✨ NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts
🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully
⚡ PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables
🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)
📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach
This covers everything from simple tables to large, complex data grids with merged cells and nested structures. Chunking is transparent to users: results are merged back into a single table.
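A simplified sketch of chunking at row boundaries with header preservation. The parameter names chunk_token_threshold and min_rows_per_chunk mirror the configuration listed above; `estimate_tokens`, `chunk_rows`, `merge_chunks`, and the four-characters-per-token heuristic are assumptions, and parallel processing is left out.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token."""
    return max(1, len(text) // 4)


def chunk_rows(
    headers: List[str],
    rows: List[List[str]],
    chunk_token_threshold: int = 3000,
    min_rows_per_chunk: int = 10,
) -> List[Dict[str, list]]:
    """Split rows into chunks at row boundaries; every chunk keeps the headers."""
    chunks: List[Dict[str, list]] = []
    current: List[List[str]] = []
    current_tokens = 0
    for row in rows:
        row_tokens = estimate_tokens(" ".join(row))
        over_budget = current_tokens + row_tokens > chunk_token_threshold
        # Only split once the chunk is large enough to be meaningful
        if over_budget and len(current) >= min_rows_per_chunk:
            chunks.append({"headers": headers, "rows": current})
            current, current_tokens = [], 0
        current.append(row)
        current_tokens += row_tokens
    if current:
        chunks.append({"headers": headers, "rows": current})
    return chunks


def merge_chunks(chunks: List[Dict[str, list]]) -> List[List[str]]:
    """Concatenate chunk rows back into one table body, in order."""
    return [row for chunk in chunks for row in chunk["rows"]]
```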
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
- Remove deprecated API token authentication from all Docker examples
- Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling
- Fix sync endpoint: /crawl_sync -> /crawl (synchronous)
- Remove non-existent /crawl_direct endpoint
- Update request format to use new structure with browser_config and crawler_config
- Fix response handling for both async and sync calls
- Update extraction strategy format to use proper nested structure
- Add Ollama connectivity check before running tests
- Update test schemas and selectors for current website structures
This makes the Docker examples work out-of-the-box with the current API structure.
- Extract base href from <head><base> tag using XPath in _process_element method
- Use base URL as the primary URL for link normalization when present
- Add error handling with logging for malformed or problematic base tags
- Maintain backward compatibility when no base tag is present
- Add test to verify the functionality of the base tag extraction.
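For reference, the resolution rule in isolation. The commit uses XPath inside _process_element; this sketch swaps in the standard library's html.parser so it runs without lxml, and the names `_BaseHrefFinder` and `effective_base_url` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _BaseHrefFinder(HTMLParser):
    """Record the href of the first <base> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.base_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base_url(html: str, page_url: str) -> str:
    """Return the URL that relative links should be resolved against."""
    finder = _BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # A relative base href is itself resolved against the page URL
        return urljoin(page_url, finder.base_href)
    return page_url  # no base tag: fall back to the page URL
```

Links are then normalized with `urljoin(effective_base_url(html, page_url), href)`.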
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options
Closes the need to hardcode providers in config.yml
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date: Mon Aug 4 18:59:10 2025 +0800
refactor: consolidate WebScrapingStrategy to use LXML implementation only
BREAKING CHANGE: None - full backward compatibility maintained
This commit simplifies the content scraping architecture by removing the
redundant BeautifulSoup-based WebScrapingStrategy implementation and making
it an alias for LXMLWebScrapingStrategy.
Changes:
- Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
- Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
- Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
- Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
- Maintain 100% backward compatibility - existing code continues to work
Code changes:
- crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
- crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
- crawl4ai/__init__.py: Update imports to show alias relationship
- crawl4ai/types.py: Update type definitions
- crawl4ai/legacy/web_crawler.py: Update import to use alias
- tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
- docs/examples/scraping_strategies_performance.py: Update to use single strategy
Documentation updates:
- docs/md_v2/core/content-selection.md: Update scraping modes section
- docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
- CHANGELOG.md: Document the refactoring under [Unreleased]
Benefits:
- 10-20x faster HTML parsing for large documents
- Reduced memory usage and simplified codebase
- Consistent parsing behavior
- No migration required for existing users
All existing code using WebScrapingStrategy continues to work without
modification, while benefiting from LXML's superior performance.
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.
Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring
Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last
Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy
Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing
Breaking changes: None - this restores the intended behavior
This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
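The matching rules can be illustrated with a small stand-in. `RunConfig` is a hypothetical dataclass, not CrawlerRunConfig, and glob matching via fnmatchcase is an assumption about one kind of matcher; only the names select_config, is_match, and url_matcher come from this commit.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass
class RunConfig:
    name: str
    url_matcher: Optional[str] = None  # glob pattern; None matches every URL

    def is_match(self, url: str) -> bool:
        if self.url_matcher is None:
            return True  # default/fallback config
        return fnmatchcase(url, self.url_matcher)


def select_config(url: str, configs: Sequence[RunConfig]) -> Optional[RunConfig]:
    """Return the first matching config, or None instead of guessing."""
    for config in configs:
        if config.is_match(url):
            return config
    return None


pdf = RunConfig("pdf", "*.pdf")
blog = RunConfig("blog", "*/blog/*")
default = RunConfig("default")  # listed last so specific configs win
```

With only `pdf` and `blog` configured, an unrelated URL yields None and the caller can report "No matching configuration found"; appending `default` restores catch-all behavior.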
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>