Implement a comprehensive health monitoring system for crawler configurations
that automatically detects failures and applies resolution strategies.
Features
- **Continuous Health Monitoring**: Periodic health checks for multiple crawler
configurations with configurable check intervals
- **Automatic Failure Detection**: Detects failures based on HTTP status codes,
empty HTML responses, and logger errors
- **Resolution Strategies**: Built-in and custom resolution strategies that
automatically attempt to fix failing configurations
- **Resolution Chains**: Support for sequential resolution strategies that
validate each step before proceeding
- **Metrics Collection**: Comprehensive metrics tracking including success rates,
response times, resolution attempts, and uptime statistics
- **Graceful Shutdown**: Robust cleanup mechanism that waits for active health
checks to complete before shutting down
- **Error Tracking**: Integrated logger error tracking to detect non-critical
errors that don't fail HTTP requests but indicate issues
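The failure checks and resolution chains listed above can be sketched roughly as follows. This is a minimal illustration, not the module's code: `is_failure`, `run_chain`, the dict-based config, and the strategy and check signatures are all hypothetical. It also assumes one reading of "validate each step": re-check after every strategy and stop at the first healthy result.

```python
import asyncio
import copy
from typing import Awaitable, Callable, List, Optional, Tuple

# Hypothetical shapes: a config is a plain dict, and a check returns
# (status_code, html, logged_error_count) for that config.
Config = dict
Check = Callable[[Config], Awaitable[Tuple[int, str, int]]]
Strategy = Callable[[Config], Awaitable[Config]]


def is_failure(status_code: int, html: str, logged_errors: int) -> bool:
    """Fail on an HTTP error status, empty HTML, or any logged error."""
    return status_code >= 400 or not html.strip() or logged_errors > 0


async def run_chain(config: Config, strategies: List[Strategy], check: Check) -> Optional[Config]:
    """Apply strategies in order, re-checking after each; return the first healthy config."""
    candidate = copy.deepcopy(config)  # never mutate the registered config
    for strategy in strategies:
        candidate = await strategy(candidate)
        status_code, html, logged_errors = await check(candidate)
        if not is_failure(status_code, html, logged_errors):
            return candidate
    return None  # every strategy was tried and the config is still failing


async def _demo() -> Optional[Config]:
    async def fake_check(cfg: Config) -> Tuple[int, str, int]:
        # Pretend the site only answers once the timeout reaches 30 seconds.
        ok = cfg["timeout"] >= 30
        return (200 if ok else 504, "<html>ok</html>" if ok else "", 0)

    async def double_timeout(cfg: Config) -> Config:
        return {**cfg, "timeout": cfg["timeout"] * 2}

    return await run_chain({"timeout": 10}, [double_timeout, double_timeout], fake_check)


fixed = asyncio.run(_demo())
```

In this toy run the first doubling (20 s) still fails the check, so the chain proceeds to the second step before returning a healthy config.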
Implementation Details
- New module `crawl4ai/config_health_monitor.py` containing:
  - `ConfigHealthMonitor`: Main monitoring class
  - `ConfigHealthState`: Health state tracking dataclass
  - `ResolutionResult`: Resolution strategy result dataclass
  - `ResolutionStrategy`: Type alias for resolution callables
  - `_ErrorTrackingLogger`: Proxy logger for error event tracking
- Key capabilities:
  - Register/unregister configurations for monitoring
  - Manual and automatic health checks
  - Config-specific or global resolution strategies
  - Concurrency-safe state management with asyncio locks (these guard
    coroutines on one event loop, not OS threads)
  - Per-config and global metrics reporting
  - Context manager support for automatic cleanup
Testing
- Comprehensive test suite in `tests/general/test_config_health_monitor.py`:
  - Basic functionality tests (initialization, registration)
  - Lifecycle management tests (start/stop, context manager)
  - Health checking tests (success/failure scenarios)
  - Resolution strategy tests
  - Metrics and status query tests
  - Property validation tests
Examples
- Example usage in `docs/examples/config_health_monitor_example.py`:
  - Demonstrates monitor initialization and configuration
  - Shows custom resolution strategies (incremental backoff, magic mode toggle)
  - Implements resolution chains with validation
  - Displays metrics reporting and status monitoring
  - Includes context manager usage pattern
Technical Notes
- Uses `copy.deepcopy()` for safe configuration mutation
- Implements `_ErrorTrackingLogger` to capture logger errors during health checks
- Tracks active health check tasks for graceful shutdown
- Uses `CacheMode.BYPASS` for health check configs to ensure fresh data
- Minimum check interval enforced at 10 seconds
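Two of the notes above, the deep copy before mutation and the 10-second floor, can be illustrated with a small sketch. `ToyConfig`, `clamp_interval`, and `health_check_copy` are hypothetical stand-ins, and clamping is an assumption: the monitor may instead reject intervals below the minimum.

```python
import copy
from dataclasses import dataclass, field

MIN_CHECK_INTERVAL = 10.0  # seconds; the floor named in the notes above


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a crawler run config."""
    cache_mode: str = "enabled"
    headers: dict = field(default_factory=dict)


def clamp_interval(requested: float) -> float:
    """Raise too-small intervals to the minimum (assumed behaviour)."""
    return max(MIN_CHECK_INTERVAL, float(requested))


def health_check_copy(config: ToyConfig) -> ToyConfig:
    """Return an independent copy with caching bypassed; the original is untouched."""
    probe = copy.deepcopy(config)
    probe.cache_mode = "bypass"  # stands in for CacheMode.BYPASS
    return probe


original = ToyConfig(headers={"User-Agent": "demo"})
probe = health_check_copy(original)
probe.headers["X-Health-Check"] = "1"  # nested mutation stays on the copy
```

A shallow copy would share the `headers` dict between both objects, which is why the deep copy matters for nested settings.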
This feature enables production-grade monitoring of crawler configurations,
automatically detecting and resolving issues before they impact crawling
operations.
- Fix critical bug where overlay removal JS function was injected but never called
- Change remove_overlay_elements() to properly execute the injected async function
- Wrap JS execution in async to handle the async overlay removal logic
- Add test_remove_overlay_elements() test case to verify the fix
- Ensure overlay elements (cookie banners, popups, modals) are actually removed
The remove_overlay_elements feature now works as intended:
- Before: Function definition injected but never executed (silent failure)
- After: Function injected and called, successfully removing overlay elements
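The before/after difference amounts to whether the injected script only defines the function or also invokes it. A minimal sketch of the "after" shape, built as a Python string, is below. `wrap_and_call`, `removeOverlays`, and the CSS selectors are hypothetical; the project's actual script is not shown in this message.

```python
def wrap_and_call(function_source: str, function_name: str) -> str:
    """Wrap a JS function definition in an async IIFE that also awaits a call to it."""
    return (
        "(async () => {\n"
        f"{function_source}\n"
        f"await {function_name}();\n"
        "})()"
    )


# Hypothetical overlay-removal function, used only to show the wrapping.
OVERLAY_JS = """async function removeOverlays() {
  document.querySelectorAll('.modal, .cookie-banner').forEach(el => el.remove());
}"""

script = wrap_and_call(OVERLAY_JS, "removeOverlays")
```

Evaluating only `OVERLAY_JS` in the page would define `removeOverlays` and then do nothing, which matches the silent failure described above.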
Fix: Wrong URL variable used for extraction of raw HTML
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
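A minimal sketch of the idea behind this fix: when the input is raw HTML rather than a web address, pass a short label to the extraction step instead of the whole document. The `raw:` prefix, the `url_for_extraction` helper, and the placeholder text are assumptions made for illustration.

```python
def url_for_extraction(url: str, placeholder: str = "Raw HTML") -> str:
    """Return a short label when `url` actually carries raw HTML, else the URL itself."""
    # Assumption: raw HTML input is marked with a "raw:" prefix.
    if url.startswith("raw:"):
        return placeholder
    return url


label = url_for_extraction("raw:<html><body><h1>Hello</h1></body></html>")
regular = url_for_extraction("https://example.com/page")
```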
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.
## Core Features
### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets
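To make the BM25 filtering step concrete, here is a hand-rolled Okapi BM25 scorer applied to page metadata text. It is a sketch only: the seeder lists `rank_bm25` as an optional dependency and its internals may differ, and `tokenize`, `bm25_scores`, `rank_urls`, and the sample pages are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / max(n_docs, 1)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_urls(query: str, pages: Dict[str, str], threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Keep URLs whose metadata scores above the threshold, best first."""
    urls = list(pages)
    scores = bm25_scores(query, [pages[u] for u in urls])
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [(u, s) for u, s in ranked if s > threshold]


pages = {
    "https://example.com/docs/async": "Async crawling guide for Python",
    "https://example.com/blog/news": "Company news and announcements",
    "https://example.com/docs/python": "Python API reference",
}
ranked = rank_urls("python async", pages)
```

The page that matches both query terms ranks first, the page that matches one ranks second, and the page that matches none is dropped by the threshold.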
### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting
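Wildcard URL filtering of the kind listed above can be approximated with shell-style patterns from the standard library. Whether `SeedingConfig` uses `fnmatch` internally is an assumption, and `filter_urls` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def filter_urls(urls: Iterable[str], pattern: str = "*") -> List[str]:
    """Keep URLs matching a shell-style wildcard pattern (case-sensitive)."""
    # Note: in fnmatch patterns "*" also matches "/" characters.
    return [u for u in urls if fnmatchcase(u, pattern)]


candidates = [
    "https://example.com/blog/intro",
    "https://example.com/docs/api",
    "https://example.com/blog/2024/launch",
]
blog_only = filter_urls(candidates, "*/blog/*")
```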
### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Saves crawl resources by filtering out irrelevant URLs before any page is fetched
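The hand-off between the filter and crawl stages reduces to turning seeder results into a plain list of URL strings. The sketch below assumes results are dicts with a `url` key and an optional `relevance_score`. That shape and the `urls_for_crawl` helper are assumptions for illustration, not a statement of the seeder's return type.

```python
from typing import Dict, List


def urls_for_crawl(seed_results: List[Dict], min_score: float = 0.0) -> List[str]:
    """Extract plain URL strings from seeder-style result dicts, dropping low scores."""
    return [
        result["url"]
        for result in seed_results
        if result.get("relevance_score", 0.0) >= min_score
    ]


seed_results = [
    {"url": "https://example.com/docs/async", "relevance_score": 0.82},
    {"url": "https://example.com/blog/news", "relevance_score": 0.05},
    {"url": "https://example.com/about"},  # no score recorded
]
to_crawl = urls_for_crawl(seed_results, min_score=0.3)
```

A list like `to_crawl` is what a batch call such as `arun_many()` would then receive, rather than the result dicts themselves.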
## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler
## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25
## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests
## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic
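A file cache with a TTL usually comes down to comparing a file's modification time with the current time. The sketch below shows that check in isolation. `is_fresh`, the file name, and the one-hour TTL are hypothetical, and the seeder's real cache layout under `~/.crawl4ai/seeder_cache/` is not described here.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional


def is_fresh(path: Path, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True if the cache file exists and was modified less than ttl_seconds ago."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl_seconds


# Demonstrate against a throwaway directory instead of the real cache location.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "example.com.jsonl"
    cache_file.write_text('{"url": "https://example.com/"}\n')
    written_at = cache_file.stat().st_mtime
    fresh_now = is_fresh(cache_file, ttl_seconds=3600, now=written_at + 60)
    stale_later = is_fresh(cache_file, ttl_seconds=3600, now=written_at + 7200)
    missing = is_fresh(Path(tmp) / "absent.jsonl", ttl_seconds=3600)
```

Passing `now` explicitly keeps the check deterministic in tests; callers would normally omit it.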
## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage
This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction
Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details
BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
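The source selection described above is essentially a lookup from the `content_source` value to one of the available HTML variants. A minimal sketch follows. `select_html` is a hypothetical helper, and rejecting unknown values with `ValueError` is an assumption about the behaviour.

```python
from typing import Optional


def select_html(
    content_source: str,
    *,
    raw_html: str,
    cleaned_html: str,
    fit_html: Optional[str] = None,
) -> Optional[str]:
    """Pick the HTML variant that markdown generation should consume."""
    sources = {
        "cleaned_html": cleaned_html,  # default: post-processed HTML
        "raw_html": raw_html,          # original page HTML
        "fit_html": fit_html,          # preprocessed HTML for schema extraction
    }
    if content_source not in sources:
        raise ValueError(f"Unknown content_source: {content_source!r}")
    return sources[content_source]


input_html = select_html(
    "raw_html",
    raw_html="<html><body><nav>menu</nav><p>Hi</p></body></html>",
    cleaned_html="<p>Hi</p>",
)
```

The variable name `input_html` mirrors the renamed parameter: the value handed to markdown generation is no longer necessarily the cleaned HTML.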
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.
BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
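For code that has to run against versions from both before and after this move, one possible migration shim is a guarded import. The two import paths come from the note above; the fallback to `None` when the package is absent is only there to keep the sketch runnable anywhere.

```python
try:
    # New location after this change.
    from crawl4ai import ProxyConfig
except ImportError:
    try:
        # Old location, for releases that predate the move.
        from crawl4ai.proxy_strategy import ProxyConfig
    except ImportError:
        ProxyConfig = None  # crawl4ai is not installed in this environment
```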
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests
This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
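The listener-based capture can be sketched without a browser by using a toy event emitter that mimics the `page.on(event, handler)` shape. `ToyPage`, `attach_capture`, and the payload fields are hypothetical. Only the two flag names and the two result field names are taken from the list above.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyPage:
    """Tiny stand-in for a browser page that emits named events."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable]] = defaultdict(list)

    def on(self, event: str, handler: Callable) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: dict) -> None:
        for handler in self._handlers[event]:
            handler(payload)


def attach_capture(
    page: ToyPage,
    capture_network_requests: bool = False,
    capture_console_messages: bool = False,
) -> Dict[str, List[dict]]:
    """Register listeners that append events to lists, depending on the flags."""
    captured: Dict[str, List[dict]] = {"network_requests": [], "console_messages": []}
    if capture_network_requests:
        for event in ("request", "response"):
            page.on(
                event,
                # Bind `event` now so each listener records its own event type.
                lambda payload, event=event: captured["network_requests"].append(
                    {"event_type": event, **payload}
                ),
            )
    if capture_console_messages:
        page.on("console", lambda payload: captured["console_messages"].append(payload))
    return captured


page = ToyPage()
captured = attach_capture(page, capture_network_requests=True)
page.emit("request", {"url": "https://example.com/api/items", "method": "GET"})
page.emit("response", {"url": "https://example.com/api/items", "status": 200})
page.emit("console", {"type": "error", "text": "Uncaught TypeError"})
```

Because console capture was left off in this run, the console event is emitted but nothing records it, which is the opt-in behaviour the two flags suggest.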