- Fix critical bug where overlay removal JS function was injected but never called
- Change remove_overlay_elements() to properly execute the injected async function
- Wrap JS execution in async to handle the async overlay removal logic
- Add test_remove_overlay_elements() test case to verify functionality works
- Ensure overlay elements (cookie banners, popups, modals) are actually removed
The remove_overlay_elements feature now works as intended:
- Before: Function definition injected but never executed (silent failure)
- After: Function injected and called, successfully removing overlay elements
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
Fix: Wrong URL variable used for extraction of raw html
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.
## Core Features
### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
- Sitemaps (including nested and gzipped)
- Common Crawl index
- Combined sources for maximum coverage
- Extracts page metadata without full crawling:
- Title, description, keywords
- Open Graph and Twitter Card tags
- JSON-LD structured data
- Language and charset information
- BM25 relevance scoring for intelligent filtering:
- Query-based URL discovery
- Configurable score thresholds
- Automatic ranking by relevance
- Performance optimizations:
- Async/concurrent processing with configurable workers
- Rate limiting (hits per second)
- Automatic caching with TTL
- Streaming results for large datasets
### SeedingConfig
- Comprehensive configuration for URL seeding:
- Source selection (sitemap, cc, or both)
- URL pattern filtering with wildcards
- Live URL validation options
- Metadata extraction controls
- BM25 scoring parameters
- Concurrency and rate limiting
### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs
## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler
## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
- Basic discovery
- Cache management
- Live validation
- BM25 scoring
- Multi-domain discovery
- Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
- Pattern-based filtering
- Metadata exploration
- Smart search with BM25
## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests
## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic
## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage
This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction
Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details
BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.
BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests
This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.