Add a new example script demonstrating Docker API usage with extensive features:
- Basic crawling with single/multi URL support
- Markdown generation with various filters
- Parameter demonstrations (CSS, JS, screenshots, SSL, proxies)
- Extraction strategies using CSS and LLM
- Deep crawling capabilities with streaming
- Integration examples with proxy rotation and SSL certificate fetching
Also includes minor formatting improvements in async_webcrawler.py
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation (usage sketch after this commit):
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction
Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details
BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
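A minimal usage sketch for the new parameter (generator and field names follow the crawl4ai docs; treat the specifics as illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # content_source selects which HTML feeds the markdown generator:
    # "cleaned_html" (default), "raw_html", or "fit_html".
    md_generator = DefaultMarkdownGenerator(content_source="raw_html")
    config = CrawlerRunConfig(markdown_generator=md_generator)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.markdown[:300])

asyncio.run(main())
```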
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.
BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests
This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
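A minimal sketch of the two new flags (the config parameters and result fields come from this commit; the exact shape of each captured entry is an assumption, so check the docs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(len(result.network_requests or []), "network events captured")
        print(len(result.console_messages or []), "console messages captured")

asyncio.run(main())
```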
Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve the Docker implementation:
- Add Alpine and Debian-based Dockerfiles for better container options
- Enhance Docker registry to share configuration with BuiltinBrowserStrategy
- Add CPU and memory limits to container configuration
- Improve error handling and logging
- Update documentation and examples
BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies
BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
Implements a persistent browser management system that allows a single shared browser instance
to be reused across multiple crawler sessions (usage sketch after this commit). Key changes include:
- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation
BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
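A sketch of selecting the builtin mode, assuming browser_mode lives on BrowserConfig as this commit describes:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # "builtin" connects to the long-lived shared browser instance;
    # "dedicated" launches a fresh browser per crawler instead.
    browser_config = BrowserConfig(browser_mode="builtin", headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.success)

asyncio.run(main())
```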
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)
BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management
This addition helps operators track crawler performance and manage memory usage more effectively.
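A hedged sketch wiring the monitor into a dispatcher (class and parameter names follow the crawl4ai dispatcher docs; verify them before relying on this):

```python
import asyncio
from crawl4ai import (
    AsyncWebCrawler, CrawlerMonitor, CrawlerRunConfig,
    DisplayMode, MemoryAdaptiveDispatcher,
)

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=90.0,  # back off when memory pressure is high
        monitor=CrawlerMonitor(display_mode=DisplayMode.DETAILED),
    )
    urls = [f"https://example.com/page{i}" for i in range(10)]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(), dispatcher=dispatcher)
        print(sum(r.success for r in results), "pages succeeded")

asyncio.run(main())
```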
Adds a new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media (usage sketch after this commit).
Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py
BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
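A sketch contrasting the new parameter with css_selector (the selector values are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Markdown and extraction focus on these nodes, but links and media
        # are still collected from the whole page (unlike css_selector).
        target_elements=["article.main-content", "div.product-list"],
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.markdown[:200])
        print(len(result.links.get("internal", [])), "internal links from the full page")

asyncio.run(main())
```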
Add comprehensive table detection and extraction functionality to the web scraping system:
- Implement intelligent table detection algorithm with scoring system
- Add table extraction with support for headers, rows, captions
- Update models to include tables in Media class
- Add table_score_threshold configuration option
- Add documentation and examples for table extraction
- Include crypto analysis example demonstrating table usage
This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.
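A sketch of the new table extraction (the threshold option comes from this commit; the per-table dict keys are assumptions based on the headers/rows/captions support described above):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        table_score_threshold=7,  # higher = stricter data-table vs layout-table filtering
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/stats", config=config)
        for table in result.media.get("tables", []):
            print(table.get("caption"), "-", len(table.get("rows", [])), "rows")

asyncio.run(main())
```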
Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.
Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path
BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries
BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
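A hedged sketch of the lxml-backed strategy used as a drop-in for the CSS strategy (the schema shape follows the existing JsonCssExtractionStrategy docs):

```python
import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonLxmlExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article",  # repeated container element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonLxmlExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/blog", config=config)
        print(json.loads(result.extracted_content))

asyncio.run(main())
```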
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.
BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
Make 'crawl' the default command when no command is specified.
This improves user experience by allowing direct URL input without
explicitly specifying the 'crawl' command.
Also removes unnecessary blank lines in example code for better readability.
Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code.
No breaking changes.
Add max_pages parameter to all deep crawling strategies to limit the total number of pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control (usage sketch after this commit).
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.
BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
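A sketch of the new limits on a BFS deep crawl (module path and streaming iteration per the deep crawling docs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=50,         # new: hard cap on total pages crawled
        score_threshold=0.4,  # new: skip links scoring below this value
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            print(result.url, "depth:", result.metadata.get("depth"))

asyncio.run(main())
```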
Adds new functionality to crawl websites using saved browser profiles directly from the CLI.
This includes:
- New CLI option to use profiles for crawling
- Helper functions for profile-based crawling
- Fixed type hints for config parameters
- Updated example to show browser window by default
This makes it easier for users to leverage saved browser profiles for crawling without writing code.
Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling (usage sketch after this commit). Features include:
- Interactive profile creation and management
- Profile listing, retrieval, and deletion
- Guided console interface
- Migration of profile management from ManagedBrowser
- New example script for identity-based browsing
ALSO:
- Updates logging format in AsyncWebCrawler
- Removes content filter from hello_world example
- Relaxes httpx version constraint
BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler
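A hedged sketch of the profiler workflow (method names follow the identity-based crawling docs; create_profile opens a browser for manual login and returns the saved profile path):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, BrowserProfiler

async def main():
    profiler = BrowserProfiler()
    # Interactive step: log in by hand, then close the browser to save the profile.
    profile_path = await profiler.create_profile(profile_name="my-site-login")
    browser_config = BrowserConfig(use_managed_browser=True, user_data_dir=profile_path)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com/account")
        print(result.success)

asyncio.run(main())
```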
* fix: Update export of URLPatternFilter
* chore: add dependency for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* docs: update the example for the memory dispatcher
* docs: updated example for crawl strategies
* refactor: removed wrapping in if __name__ == "__main__" block since this is a markdown file
* chore: removed cchardet from the dependency list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependencies
* feat: make PyPDF2 a conditional dependency
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor: (1) deprecate markdown_v2; (2) make markdown backward compatible so it behaves as a string when needed; (3) fix LlmConfig usage in the CLI; (4) deprecate markdown_v2 in the CLI; (5) update AsyncWebCrawler for changes in CrawlResult
* fix: bug in serialization of markdown in acache_url
* refactor: added deprecation errors for accessing fit_html and fit_markdown directly on the result; access them via markdown instead
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Make PyPDF2 an optional dependency and improve import handling in PDF processor.
Move imports inside methods to allow for lazy loading and better error handling.
Add new 'pdf' optional dependency group in pyproject.toml.
Clean up unused imports and remove deprecated files.
BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.
Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization.
BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.
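A sketch, assuming AsyncFileLogger takes a log-file path and can replace the default logger on AsyncWebCrawler:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_logger import AsyncFileLogger

async def main():
    logger = AsyncFileLogger("/tmp/crawl4ai.log")  # file-only; nothing hits the console
    async with AsyncWebCrawler(logger=logger) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.success)

asyncio.run(main())
```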
* feat: add LlmConfig to easily configure and pass LLM configs to different strategies
* pulled in next branch and resolved conflicts
* feat: add Gemini and DeepSeek providers; default ignore_cache to true in the LLM content filter to avoid confusion
* refactor: update LlmConfig in the LLMExtractionStrategy class and deprecate old params
* updated tests, docs and readme
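A hedged sketch of the new config object (spelled LlmConfig as of this commit and renamed LLMConfig later; provider strings are illustrative):

```python
from crawl4ai import LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_config = LlmConfig(
    provider="gemini/gemini-1.5-pro",  # deepseek/... providers also work now
    api_token="env:GEMINI_API_KEY",    # read the token from the environment
)
strategy = LLMExtractionStrategy(
    llm_config=llm_config,  # replaces the deprecated provider/api_token params
    instruction="Extract the product name and price as JSON.",
)
```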
- Add ignore_default_value option to to_serializable_dict
- Add viewport dict support in BrowserConfig
- Replace FastFilterChain with FilterChain
- Add deprecation warnings for unwanted properties
- Clean up unused imports
- Rename example files for consistency
- Add comprehensive Docker configuration tutorial
BREAKING CHANGE: FastFilterChain has been replaced with FilterChain
Add JWT token-based authentication to the Docker server and client (client-side sketch after this commit).
Refactor server architecture for better code organization and error handling.
Move Dockerfile to root deploy directory and update configuration.
Add comprehensive documentation and examples.
BREAKING CHANGE: Docker server now requires authentication by default.
Endpoints require JWT tokens when security.jwt_enabled is true in config.
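A hedged client-side sketch: obtain a token, then call a crawl endpoint with a Bearer header. The endpoint paths, payloads, and port are assumptions drawn from this description, so check the Docker server docs for the real contract:

```python
import requests

BASE = "http://localhost:11235"  # assumed default server port

# Request a JWT (only needed when security.jwt_enabled is true in config).
token = requests.post(f"{BASE}/token", json={"email": "user@example.com"}).json()["access_token"]

resp = requests.post(
    f"{BASE}/crawl",
    headers={"Authorization": f"Bearer {token}"},
    json={"urls": ["https://example.com"]},
)
print(resp.status_code, resp.json())
```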
Add a comprehensive example demonstrating a Google Search Engine Results Page (SERP) API implementation using crawl4ai. The example includes:
- Basic web crawling setup
- LLM-based extraction
- Schema generation
- Golden standard implementation
- CrawlerHub usage
The example serves as a reference for implementing SERP API functionality with various extraction strategies.
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples
Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
Implements a new proxy rotation system (usage sketch after this commit) with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation
BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
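A sketch of the rotation API (ProxyConfig arrived in a later commit and its import path moved again afterward, so treat the import and constructor details as illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.proxy_strategy import ProxyConfig, RoundRobinProxyStrategy

async def main():
    proxies = [
        ProxyConfig(server="http://proxy1.example.com:8080"),
        ProxyConfig(server="http://proxy2.example.com:8080"),
    ]
    config = CrawlerRunConfig(
        proxy_rotation_strategy=RoundRobinProxyStrategy(proxies),
        cache_mode=CacheMode.BYPASS,  # hit the network so each request rotates
    )
    async with AsyncWebCrawler() as crawler:
        for _ in range(4):  # each request should use the next proxy in turn
            result = await crawler.arun("https://httpbin.org/ip", config=config)
            print(result.html[:120] if result.success else result.error_message)

asyncio.run(main())
```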
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64
BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
Uncomments demonstration code for memory dispatcher, streaming support,
content scraping, JSON schema generation, LLM markdown, and robots compliance
in the v0.4.3b2 features demo file. Also adds fake-useragent package as a
project dependency.
This change makes all feature demonstrations active by default and ensures
proper user agent handling capabilities.
Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality.
No breaking changes.
Clean up whitespace and improve readability in v0_4_3b2_features_demo.py:
- Remove excessive blank lines between functions
- Improve config formatting for better readability
- Uncomment memory dispatcher demo in main function
No breaking changes.
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script
BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
Rename and replace the features demo file to reflect the beta 2 version number.
The old v0.4.3 demo file is removed and replaced with a new beta 2 version.
Renames:
- docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py
Update example scripts to reflect latest API changes and improve demonstrations:
- Increase the number of test URLs in the dispatcher example from 20 to 40
- Comment out unused dispatcher strategies for cleaner output
- Fix scraping strategies performance script to use correct object notation
- Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections
These changes make the examples more current and better aligned with the actual API.
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples
No functional changes, purely naming consistency improvement.
Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include:
- Added proxy rotation demo in features example
- Updated proxy configuration handling in BrowserManager
- Added proxy rotation documentation
- Updated README with new proxy rotation feature
- Bumped version to 0.4.3b2
This change enables users to dynamically switch between proxies and verify IP addresses for each request.
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format
BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks (usage sketch after this commit).
Implement browser context reuse based on config signatures to improve memory usage.
Fix Firefox/Webkit channel settings.
Add config parameter to hook callbacks for better context access.
Remove debug print statements.
BREAKING CHANGE: Hook callback signatures now include config parameter
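A sketch of shared_data flowing into a hook via the new config argument (the hook name and set_hook registration follow the existing strategy API; the extra config parameter is what this commit adds):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def on_page_context_created(page, context, config=None, **kwargs):
    # Hooks now receive the run config, so shared_data is visible here.
    if config and config.shared_data is not None:
        config.shared_data["pages_seen"] = config.shared_data.get("pages_seen", 0) + 1
    return page

async def main():
    run_config = CrawlerRunConfig(shared_data={"pages_seen": 0})
    async with AsyncWebCrawler() as crawler:
        crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
        await crawler.arun("https://example.com", config=run_config)
        print(run_config.shared_data)  # e.g. {'pages_seen': 1}

asyncio.run(main())
```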