Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc.).
Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication
Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server
Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors
This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
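A minimal usage sketch, assuming the `VirtualScrollConfig` fields shown here (`container_selector`, `scroll_count`, `scroll_by`, `wait_after_scroll`) match the shipped API:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def main():
    # Configure virtual scroll for a Twitter-style recycled timeline.
    scroll_cfg = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # scrollable container
        scroll_count=30,               # number of scroll steps to perform
        scroll_by="container_height",  # scroll one container height per step
        wait_after_scroll=0.5,         # seconds to let new items render
    )
    run_cfg = CrawlerRunConfig(virtual_scroll_config=scroll_cfg)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://twitter.com/example", config=run_cfg)
        print(len(result.html))  # merged HTML from all captured chunks

asyncio.run(main())
```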
Enhance browser session management with the following improvements:
- Add state cloning between browser contexts
- Implement smarter page closing logic based on total pages and browser config
- Add storage state persistence during profile creation
- Improve managed browser context handling with storage state support
This change improves browser session reliability and persistence across runs.
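A sketch of the persistence angle, assuming `storage_state` on `BrowserConfig` accepts a Playwright-style storage-state file (the path below is hypothetical):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Reuse cookies/localStorage captured during an earlier profile run.
    browser_cfg = BrowserConfig(headless=True, storage_state="state.json")
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com/account")
        print(result.success)

asyncio.run(main())
```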
Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying
a separate timeout for the wait_for condition, independent of the page_timeout.
This provides more granular control over waiting behaviors in the crawler.
Also removes unused colorama dependency and updates LinkedIn crawler example.
BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing
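A minimal sketch of the new knob, assuming `wait_for_timeout` is expressed in milliseconds like `page_timeout`:

```python
from crawl4ai import CrawlerRunConfig

# Allow 60s for the overall navigation, but only 10s for the
# wait_for condition before giving up on it.
config = CrawlerRunConfig(
    wait_for="css:.feed-item",  # condition to wait for after load
    wait_for_timeout=10_000,    # budget for wait_for alone (assumed ms)
    page_timeout=60_000,        # budget for overall page load, in ms
)
```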
Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False`
(the default) raised `UnboundLocalError` because `captured_console` was accessed before assignment.
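The fix follows the usual defensive-initialization pattern; a schematic sketch (not the actual crawler source):

```python
def get_console_messages(capture: bool) -> list:
    # Initialize unconditionally so the return below never reads an
    # unbound name when the capture flag is off.
    captured_console = []
    if capture:
        captured_console = ["log: example"]  # stand-in for real collection
    return captured_console

print(get_console_messages(False))  # [] instead of UnboundLocalError
```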
Removes automatic page closure in take_screenshot and take_screenshot_naive methods
to prevent premature closure of pages that might still be needed in the calling context.
This allows for more flexible page lifecycle management by the caller.
BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots.
Callers must explicitly handle page closure when appropriate.
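The caller-side pattern this enables, shown with plain Playwright (not crawl4ai's internal method):

```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        try:
            await page.goto("https://example.com")
            await page.screenshot(path="shot.png")
            # The page stays open: keep interacting with it as needed.
            print(await page.title())
        finally:
            await page.close()  # closure is now the caller's job
            await browser.close()

asyncio.run(main())
```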
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.
BREAKING CHANGE: Browser page handling now persists when using session_id
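A usage sketch, assuming `session_id` is passed via `CrawlerRunConfig` (early versions accepted it as an `arun` keyword argument):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    session_cfg = CrawlerRunConfig(session_id="news_session")
    async with AsyncWebCrawler() as crawler:
        # Both crawls reuse the same underlying browser page.
        page1 = await crawler.arun("https://example.com/page/1", config=session_cfg)
        page2 = await crawler.arun("https://example.com/page/2", config=session_cfg)
        # view-source: URLs are also handled per this change.
        src = await crawler.arun("view-source:https://example.com")
        print(page1.success, page2.success, src.success)

asyncio.run(main())
```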
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture
Breaking changes: None
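A minimal sketch combining two built-in patterns; the flag names (`Email`, `Url`) and the shape of `extracted_content` follow the feature description and may differ slightly by version:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy

async def main():
    # OR the built-in pattern flags together; no LLM call is involved.
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/contact", config=config)
        for item in json.loads(result.extracted_content):
            print(item["label"], item["value"])

asyncio.run(main())
```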
Add ability to capture browser console messages during crawling:
- Implement _capture_console_messages method to collect console logs
- Update crawl method to support console message capture
- Modify browser_manager page creation to accept full CrawlerRunConfig
- Fix request failure text formatting
This enhancement allows debugging and monitoring of JavaScript console output during crawling operations.
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests
This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
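Both capture flags and their result fields in one sketch (the entry shapes come from the Playwright event listeners):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(len(result.network_requests or []), "network events")
        for msg in (result.console_messages or [])[:5]:
            print(msg.get("type"), msg.get("text"))

asyncio.run(main())
```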
Add ability to capture web pages in MHTML format, which bundles all page resources
into a single file. This enables complete page archival and offline viewing.
- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management
Breaking changes: None
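Usage sketch:

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(capture_mhtml=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.mhtml:
            # One self-contained archive: HTML, CSS, images, and so on.
            Path("page.mhtml").write_text(result.mhtml)

asyncio.run(main())
```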
Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler
BREAKING CHANGE: Default cookie injection has been removed from page initialization
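A hedged sketch of the new dictionary; the parameter name `experimental` follows the description above, and the key shown is purely hypothetical:

```python
from crawl4ai import CrawlerRunConfig

# "csp_nonce" is a made-up illustration of a beta flag, not a
# documented setting; consult the release notes for real keys.
config = CrawlerRunConfig(experimental={"csp_nonce": True})
```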
Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.
Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path
BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
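Migration sketch using the new import path; the `ProxyConfig` field names mirror Playwright's proxy settings:

```python
# Old: from crawl4ai.configs import ProxyConfig
from crawl4ai.proxy_strategy import ProxyConfig
from crawl4ai import CrawlerRunConfig

proxy = ProxyConfig(
    server="http://proxy.example.com:8080",
    username="user",
    password="pass",
)
config = CrawlerRunConfig(proxy_config=proxy)
```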
Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include:
- Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts
- File and raw content handling capabilities
- Streaming response processing for large files
- Customizable request/response hooks
- Comprehensive error handling
Also refactors browser management code into separate module for better organization.
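A minimal sketch; the import path may vary by version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    # HTTP-only strategy: no browser process is launched.
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy()) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.status_code, len(result.html))

asyncio.run(main())
```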
Major reorganization of the project structure:
- Moved legacy synchronous crawler code to legacy folder
- Removed deprecated CLI and docs manager
- Consolidated version manager into utils.py
- Added CrawlerHub to __init__.py exports
- Fixed type hints in async_webcrawler.py
- Fixed minor bugs in chunking and crawler strategies
BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.
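Migration sketch for users of the removed synchronous API:

```python
# Before (removed):
#   from crawl4ai import WebCrawler
#   result = WebCrawler().run(url="https://example.com")

# After:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:200])

asyncio.run(main())
```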
Add JavaScript execution result handling and improve PDF processing capabilities:
- Add js_execution_result to CrawlResult and AsyncCrawlResponse models
- Implement execution result capture in AsyncPlaywrightCrawlerStrategy
- Add batch processing for PDF pages with configurable batch size
- Enhance JsonElementExtractionStrategy with better schema generation
- Add HTML optimization utilities
BREAKING CHANGE: PDF processing now uses batch processing by default
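A sketch of the new result field; the exact payload shape is version-dependent:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # The evaluated value of js_code is surfaced on the result.
    config = CrawlerRunConfig(js_code="document.querySelectorAll('a').length")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.js_execution_result)

asyncio.run(main())
```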
Redesign user agent generation to be more modular and reliable:
- Add abstract base class UAGen for user agent generation
- Implement ValidUAGenerator using fake-useragent library
- Add OnlineUAGenerator for fetching real-world user agents
- Update browser configurations to use new UA generation system
- Improve client hints generation
This change makes the user agent system more maintainable and provides better real-world user agent coverage.
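A hedged sketch; the `user_agent_mode` string is an assumption about how the new generator stack is selected:

```python
from crawl4ai import BrowserConfig

# "random" would route UA selection through the generator classes
# described above (ValidUAGenerator by default).
browser_cfg = BrowserConfig(user_agent_mode="random")
```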
Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser.
- Added cdp_url parameter to BrowserConfig
- Added cdp_url support in ManagedBrowser.start() method
- Updated documentation for new parameters
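Usage sketch, attaching to an already-running browser (start one with e.g. `chrome --remote-debugging-port=9222`):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_cfg = BrowserConfig(cdp_url="http://localhost:9222")
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.success)

asyncio.run(main())
```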
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples
No functional changes, purely naming consistency improvement.
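Caller-side sketch of the renamed field:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("http://example.com")  # may 301 to https
        # Previously result.final_url:
        if result.redirected_url and result.redirected_url != result.url:
            print("redirected to", result.redirected_url)

asyncio.run(main())
```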