Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation
BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
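A minimal usage sketch for this entry; the import path and ProxyConfig fields are assumptions based on the class names above, not confirmed API:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.proxy_strategy import ProxyConfig, RoundRobinProxyStrategy  # import path assumed

async def main():
    # Hypothetical proxy endpoints; the round-robin strategy cycles through
    # them on successive requests.
    proxies = [
        ProxyConfig(server="http://proxy1:8080"),
        ProxyConfig(server="http://proxy2:8080"),
    ]
    config = CrawlerRunConfig(proxy_rotation_strategy=RoundRobinProxyStrategy(proxies))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.status_code)

asyncio.run(main())
```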
Restructure deep crawling code into a dedicated module with improved organization:
- Move deep crawl logic from async_deep_crawl.py to deep_crawling/
- Create separate files for BFS strategy, filters, and scorers
- Improve code organization and maintainability
- Add optimized implementations for URL filtering and scoring
- Rename DeepCrawlHandler to DeepCrawlDecorator for clarity
BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports must be updated to the new package structure
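A sketch of the import migration implied by the breaking change above; the exact re-export surface of the new package is an assumption:

```python
# Before: single module
# from crawl4ai.async_deep_crawl import DeepCrawlStrategy, BreadthFirstSearchStrategy

# After: dedicated package (assumed to re-export the strategy classes)
from crawl4ai.deep_crawling import DeepCrawlStrategy, BreadthFirstSearchStrategy
```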
Implements deep crawling functionality with a new BreadthFirstSearch strategy:
- Add DeepCrawlStrategy base class and BFS implementation
- Integrate deep crawling with AsyncWebCrawler via decorator pattern
- Update CrawlerRunConfig to support deep crawling parameters
- Add pagination support for Google Search crawler
BREAKING CHANGE: AsyncWebCrawler.arun and arun_many return types now include deep crawl results
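A minimal sketch of the deep-crawl flow in this entry; max_depth is a hypothetical parameter name for the BFS depth limit:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BreadthFirstSearchStrategy  # path per the restructure entry above

async def main():
    config = CrawlerRunConfig(deep_crawl_strategy=BreadthFirstSearchStrategy(max_depth=2))
    async with AsyncWebCrawler() as crawler:
        # Per the breaking change, the return value now covers every page
        # reached by the deep crawl, not just the seed URL.
        results = await crawler.arun("https://example.com", config=config)
        for page in results:
            print(page.url)

asyncio.run(main())
```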
Add Docker service integration with FastAPI server and client implementation.
Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support
Docker service communication. Clean up imports and improve error handling.
- Add Crawl4aiDockerClient class
- Implement config serialization/deserialization
- Add FastAPI server with streaming support
- Add health check endpoint
- Clean up imports and type hints
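A usage sketch for the new client; the context-manager support and crawl() signature are assumptions based on this entry's class and feature list:

```python
import asyncio
from crawl4ai import BrowserConfig, CrawlerRunConfig
from crawl4ai.docker_client import Crawl4aiDockerClient  # import path assumed

async def main():
    # Configs are serialized client-side and sent to the FastAPI server.
    async with Crawl4aiDockerClient(base_url="http://localhost:8000") as client:
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(),
        )
        for result in results:
            print(result.url, result.success)

asyncio.run(main())
```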
Major reorganization of the project structure:
- Moved legacy synchronous crawler code to legacy folder
- Removed deprecated CLI and docs manager
- Consolidated version manager into utils.py
- Added CrawlerHub to __init__.py exports
- Fixed type hints in async_webcrawler.py
- Fixed minor bugs in chunking and crawler strategies
BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.
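A minimal migration sketch for the breaking change above, moving from the removed synchronous WebCrawler to AsyncWebCrawler:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Replaces the old synchronous WebCrawler.run() pattern.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(str(result.markdown)[:200])

asyncio.run(main())
```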
Add JavaScript execution result handling and improve PDF processing capabilities:
- Add js_execution_result to CrawlResult and AsyncCrawlResponse models
- Implement execution result capture in AsyncPlaywrightCrawlerStrategy
- Add batch processing for PDF pages with configurable batch size
- Enhance JsonElementExtractionStrategy with better schema generation
- Add HTML optimization utilities
BREAKING CHANGE: PDF processing now uses batch processing by default
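A sketch of the new result field; whether js_code accepts a bare expression whose value is captured is an assumption:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # The script's value should surface on the new js_execution_result field.
    config = CrawlerRunConfig(js_code="document.title")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.js_execution_result)

asyncio.run(main())
```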
Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files
Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
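A sketch of the module's surface; every name below (class, parameters, result fields) is a hypothetical reading of the feature list above, not confirmed API:

```python
from crawl4ai.processors.pdf import PDFProcessorStrategy  # hypothetical name/path

processor = PDFProcessorStrategy(extract_images=True, image_format="png")
result = processor.process("https://example.com/report.pdf")  # local paths also supported
print(result.markdown)   # extracted text formatted as Markdown
print(result.links)      # links found in the document
print(result.metadata)   # title, author, dates
```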
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples
No functional changes; this is purely a naming-consistency improvement.
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality
The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
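A minimal sketch of the new flag; how a disallowed URL is reported on the result is an assumption:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Disallowed URLs are skipped; rules are cached in SQLite (WAL mode, 7-day TTL).
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if not result.success:
            print("Blocked or failed:", result.error_message)

asyncio.run(main())
```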
Add capability to track and return final URLs after redirects in crawler responses, so users can see the actual destination of each crawled URL.
Changes include:
- Added final_url tracking in AsyncPlaywrightCrawlerStrategy
- Added redirected_url field to CrawlResult model
- Updated AsyncWebCrawler to properly handle and store redirect URLs
- Fixed typo in documentation signature
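A minimal sketch of reading the new field:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("http://example.com")  # may redirect to https
        # redirected_url holds the final destination after any redirects.
        print(result.url, "->", result.redirected_url)

asyncio.run(main())
```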
Add streaming capability to crawler configurations and introduce clone() methods
for both BrowserConfig and CrawlerRunConfig to support immutable config updates.
Move stream parameter from arun_many() method to CrawlerRunConfig.
BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method.
Use config.stream=True instead.
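A sketch of the clone() pattern this entry introduces:

```python
from crawl4ai import CrawlerRunConfig

base = CrawlerRunConfig(word_count_threshold=10)
# clone() returns an updated copy, leaving the original config untouched.
streaming = base.clone(stream=True)
print(base.stream, streaming.stream)  # False True
```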
Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler
to allow processing URLs with real-time result streaming. This enables
processing results as they become available rather than waiting for all
URLs to complete.
Key changes:
- Add run_urls_stream method to MemoryAdaptiveDispatcher
- Update AsyncWebCrawler.arun_many to support streaming mode
- Add result queue for better result handling
- Improve type hints and documentation
BREAKING CHANGE: The return type of arun_many now depends on the 'stream'
parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]
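A sketch of both return shapes:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://example.com", "https://example.org"]
    async with AsyncWebCrawler() as crawler:
        # stream=True: AsyncGenerator[CrawlResult, None], yielding each result
        # as soon as its URL finishes.
        async for result in await crawler.arun_many(urls, config=CrawlerRunConfig(stream=True)):
            print(result.url, result.success)
        # stream=False (default): List[CrawlResult] after all URLs complete.
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(stream=False))
        print(len(results))

asyncio.run(main())
```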
- Increase memory threshold from 70% to 90% for better resource utilization
- Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization
These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.
- Remove .pre-commit-config.yaml and duplicate mkdocs configuration files
- Add Optional type hint for proxy parameter in BrowserConfig
- Fix type annotation for results list in AsyncWebCrawler
- Move calculate_batch_size function import to model_loader
- Update prompt imports in extraction_strategy.py
No breaking changes.
Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping.
This change introduces:
- New ContentScrapingStrategy abstract base class
- Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations
- New Pydantic models for structured scraping results
- Updated documentation reflecting the new strategy-based approach
BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.
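A sketch of the strategy-based replacement; the import path and the config field name scraping_strategy are assumptions:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy  # path assumed

async def main():
    # A strategy object replaces the removed ScrapingMode enum.
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(len(result.cleaned_html))

asyncio.run(main())
```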
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.
Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency
BREAKING CHANGE: None
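A sketch of the enum-based switch as it existed at this version (it was later replaced by the strategy classes in the entry above); the import path and field name are assumptions:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.models import ScrapingMode  # import path assumed

# LXML mode: per this entry, 10-20x faster on large HTML documents.
config = CrawlerRunConfig(scraping_mode=ScrapingMode.LXML)
```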
Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes
BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
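A sketch of the new configuration approach; import paths and constructor arguments are assumptions based on the class names above:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter  # paths assumed

async def main():
    # Dispatcher settings now live on the dispatcher, not on CrawlerRunConfig.
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=90.0,
        rate_limiter=RateLimiter(base_delay=(1.0, 3.0)),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            ["https://example.com", "https://example.org"],
            config=CrawlerRunConfig(),
            dispatcher=dispatcher,
        )
        print(len(results))

asyncio.run(main())
```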
Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include:
- Added RateLimitConfig dataclass for configuring rate limiting behavior
- Extended CrawlerRunConfig with dispatcher-related settings
- Refactored arun_many to use the new dispatcher system
- Added memory threshold and session permit controls
- Integrated optional progress monitoring display
BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior
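A sketch of the config surface at this version, before dispatcher settings moved to dedicated classes (see the entry above); all field names here are assumptions:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.async_dispatcher import RateLimitConfig  # import path assumed

config = CrawlerRunConfig(
    memory_threshold_percent=70.0,                              # throttle above this memory usage
    rate_limit_config=RateLimitConfig(base_delay=(1.0, 3.0)),  # hypothetical fields
)
```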
- Set wait_for_images default to False for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
- Added examples for Amazon product data extraction methods
- Updated configuration options and enhanced documentation
- Minor refactoring for improved performance and readability
- Cleaned up version control settings.
Standardizes the parameter naming convention across the codebase by renaming browser_config to the more concise config in the AsyncWebCrawler constructor.
Updates all documentation examples and internal usages to reflect the new parameter name.
Also improves hook execution by adding url/response parameters to the goto hooks and fixes parameter ordering in the before_return_html hook (see the sketch below).
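A sketch of the updated goto hook; the exact signature and keyword handling are assumptions based on this entry:

```python
from crawl4ai import AsyncWebCrawler

async def after_goto(page, context, url=None, response=None, **kwargs):
    # url/response are the parameters this entry adds to the goto hooks.
    print("navigated to", url, "status", response.status if response else None)
    return page

crawler = AsyncWebCrawler()
crawler.crawler_strategy.set_hook("after_goto", after_goto)
```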
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
- Reimplemented JsonXPathExtractionStrategy for enhanced JSON data extraction.
- Updated existing extraction strategies for better performance.
- Improved handling of response status codes during crawls.
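A sketch of the SSL-certificate extraction mentioned above; the flag and result-field names are assumptions:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(fetch_ssl_certificate=True)  # flag name assumed
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.ssl_certificate:
            print(result.ssl_certificate.issuer)

asyncio.run(main())
```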
- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented Markdown generation logic in the content-scraping flow in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.
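A sketch of the new parameter; the generator's import path is an assumption:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator  # path assumed

# markdown_generator lets callers swap the generator used during content scraping.
config = CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())
```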
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
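A minimal sketch of the split, shown with the config parameter name adopted by the later rename entry in this log:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Browser-level options live in BrowserConfig, per-run options in
    # CrawlerRunConfig, replacing long flat parameter lists.
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(word_count_threshold=10)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(str(result.markdown)[:200])

asyncio.run(main())
```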
- Added support for exporting pages as PDFs
- Enhanced screenshot functionality for long pages
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
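A sketch of the capture flags; the exact flag and field names are assumptions based on this entry:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(pdf=True, screenshot=True)  # flag names assumed
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.pdf:  # raw PDF bytes
            with open("page.pdf", "wb") as f:
                f.write(result.pdf)

asyncio.run(main())
```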
- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.
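A sketch of the session management mentioned above, expressed with the config API from later entries in this log; session_id as the handle is an assumption:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Reusing the same session_id keeps the same browser page across calls.
        config = CrawlerRunConfig(session_id="login-flow")
        await crawler.arun("https://example.com/login", config=config)
        result = await crawler.arun("https://example.com/account", config=config)
        print(result.success)

asyncio.run(main())
```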
Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
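A sketch of the storage-state handling; passing a Playwright storage-state file to the constructor is an assumption based on this entry's wording:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # storage_state restores cookies/localStorage from a prior session.
    async with AsyncWebCrawler(storage_state="auth_state.json") as crawler:
        result = await crawler.arun("https://example.com/dashboard")
        print(result.success)

asyncio.run(main())
```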
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
- This commit also introduces the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies extracts the main content from a page - the part that actually matters and is useful to process. The first strategy uses the BM25 algorithm: it finds chunks of text relevant to the page's title, description, and keywords, or to a user-supplied query, and returns the matches to the main engine for conversion to Markdown (see the sketch after this list). Plans include adding language-model-based approaches as well.
- The cache database was updated to hold information about response headers and downloaded files.
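A sketch of the BM25-based filter described above, expressed with the config API from later entries in this log; import paths and the fit_markdown attribute location are assumptions:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter               # path assumed
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator  # path assumed

async def main():
    # BM25 scores each text chunk against the page's title/description/keywords,
    # or against an explicit user query as here; matches become "fit" markdown.
    md_generator = DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="pricing plans")
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(markdown_generator=md_generator),
        )
        print(result.markdown.fit_markdown)  # attribute location varies by version

asyncio.run(main())
```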