Implement more robust browser executable path handling using Playwright's built-in browser management. This change:
- Adds async browser path resolution
- Implements path caching in the home folder
- Removes hardcoded browser paths
- Adds httpx dependency
- Removes obsolete test result files
This change makes the browser path resolution more reliable across different platforms and environments.
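
A minimal sketch of the resolve-then-cache idea described above, not the project's actual implementation. The cache file location, the `get_browser_path` name, and the injected `resolve` coroutine are all assumptions made for illustration.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable

# Hypothetical cache location; the commit only says the cache lives in the home folder.
DEFAULT_CACHE = Path.home() / ".crawl4ai" / "browser_path.txt"


async def get_browser_path(
    resolve: Callable[[], Awaitable[str]],
    cache_file: Path = DEFAULT_CACHE,
) -> str:
    """Return the cached browser path if it still exists on disk, else resolve and cache it."""
    if cache_file.exists():
        cached = cache_file.read_text().strip()
        # A stale entry (browser removed or upgraded) falls through to a fresh lookup.
        if cached and Path(cached).exists():
            return cached

    path = await resolve()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(path)
    return path
```

With Playwright's async API, `resolve` could be a coroutine that opens `async_playwright()` and returns `p.chromium.executable_path`. Passing the resolver in keeps the caching logic testable without a browser installed.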
- Increase memory threshold from 70% to 90% for better resource utilization
- Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization
These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.
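
To illustrate what the threshold change means in practice, here is a small sketch. The `should_throttle` helper is hypothetical, and the exact comparison the dispatcher performs is an assumption.

```python
def should_throttle(memory_percent: float, threshold_percent: float = 90.0) -> bool:
    """Return True when memory usage has reached the threshold and new tasks should wait."""
    return memory_percent >= threshold_percent


# At 80% usage the old 70% threshold throttles, while the new 90% default does not.
print(should_throttle(80.0, threshold_percent=70.0))  # True
print(should_throttle(80.0))                          # False
```

A reading such as `psutil.virtual_memory().percent` could supply `memory_percent`.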
Make fields in MediaItem and Link models optional with default values to prevent validation errors when data is incomplete. Also expose BaseDispatcher in __init__ and fix markdown field handling in database manager.
BREAKING CHANGE: MediaItem and Link model fields are now optional with default values which may affect existing code expecting required fields.
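
The sketch below shows how calling code might adapt to fields that can now be empty. It uses a stdlib dataclass as a stand-in for the project's model class, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """Stand-in for the real Link model; field names here are hypothetical."""
    href: Optional[str] = ""
    text: Optional[str] = ""
    title: Optional[str] = None


def link_label(link: Link) -> str:
    """Pick a display label without assuming any field is populated."""
    return link.title or link.text or link.href or "(empty link)"


print(link_label(Link(href="https://example.com")))  # https://example.com
print(link_label(Link()))                            # (empty link)
```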
- Remove .pre-commit-config.yaml and duplicate mkdocs configuration files
- Add Optional type hint for proxy parameter in BrowserConfig
- Fix type annotation for results list in AsyncWebCrawler
- Move calculate_batch_size function import to model_loader
- Update prompt imports in extraction_strategy.py
No breaking changes.
Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping.
This change introduces:
- New ContentScrapingStrategy abstract base class
- Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations
- New Pydantic models for structured scraping results
- Updated documentation reflecting the new strategy-based approach
BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.
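
A minimal sketch of the strategy pattern this commit describes. Only the `ContentScrapingStrategy` name comes from the commit; the `scrape` method, the `ScrapeResult` type, and both concrete classes are hypothetical stdlib stand-ins for the BeautifulSoup and LXML implementations.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ScrapeResult:
    """Hypothetical structured result; the real models carry far more data."""
    title: str


class ContentScrapingStrategy(ABC):
    @abstractmethod
    def scrape(self, html: str) -> ScrapeResult:
        """Extract structured content from raw HTML."""


class RegexTitleStrategy(ContentScrapingStrategy):
    """Stand-in implementation that finds the title with a regular expression."""

    def scrape(self, html: str) -> ScrapeResult:
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return ScrapeResult(title=match.group(1).strip() if match else "")


class _TitleParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.parts: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


class ParserTitleStrategy(ContentScrapingStrategy):
    """Stand-in implementation that walks the document with html.parser."""

    def scrape(self, html: str) -> ScrapeResult:
        parser = _TitleParser()
        parser.feed(html)
        parser.close()
        return ScrapeResult(title="".join(parser.parts).strip())


def crawl(html: str, strategy: ContentScrapingStrategy) -> ScrapeResult:
    # The caller depends only on the abstract interface, so strategies are interchangeable.
    return strategy.scrape(html)
```

Compared with an enum switch, adding a new parser means adding a class rather than editing every branch that inspects the mode.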
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.
Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency
No breaking changes.
Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes
BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
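
To show what a dedicated dispatcher class that owns its own configuration might look like, here is a semaphore-based sketch. The class name, constructor argument, and `run_urls` method are hypothetical and are not the library's API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


class SimpleSemaphoreDispatcher:
    """Hypothetical dispatcher that caps concurrency with an asyncio.Semaphore."""

    def __init__(self, max_concurrent: int = 5) -> None:
        # Concurrency settings live on the dispatcher, not on the per-run crawl config.
        self.max_concurrent = max_concurrent

    async def run_urls(
        self,
        urls: Sequence[str],
        crawl_one: Callable[[str], Awaitable[str]],
    ) -> List[str]:
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def guarded(url: str) -> str:
            # At most max_concurrent crawls hold the semaphore at any moment.
            async with semaphore:
                return await crawl_one(url)

        # gather preserves input order in its results.
        return await asyncio.gather(*(guarded(url) for url in urls))
```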
Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include:
- Added RateLimitConfig dataclass for configuring rate limiting behavior
- Extended CrawlerRunConfig with dispatcher-related settings
- Refactored arun_many to use the new dispatcher system
- Added memory threshold and session permit controls
- Integrated optional progress monitoring display
BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior
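
As an illustration of the kind of settings a rate limit dataclass can hold, the sketch below pairs one with an exponential backoff helper. Only the `RateLimitConfig` name comes from the commit; every field, default value, and helper function is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RateLimitConfig:
    """Field names and defaults are hypothetical, chosen only for this sketch."""
    base_delay: float = 1.0
    max_delay: float = 60.0
    max_retries: int = 3
    rate_limit_codes: Tuple[int, ...] = (429, 503)


def next_delay(config: RateLimitConfig, attempt: int) -> float:
    """Exponential backoff: double the delay on each attempt, capped at max_delay."""
    return min(config.base_delay * (2 ** attempt), config.max_delay)


def should_retry(config: RateLimitConfig, status_code: int, attempt: int) -> bool:
    """Retry only on rate-limit status codes and while attempts remain."""
    return status_code in config.rate_limit_codes and attempt < config.max_retries


config = RateLimitConfig()
print([next_delay(config, n) for n in range(3)])  # [1.0, 2.0, 4.0]
print(next_delay(config, 10))                     # 60.0
```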
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com
Improve badges styling and layout in documentation
Increase code font size in documentation CSS
BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com
Revise the README's personal story section to better reflect the project's
origins, motivation, and vision for open-source data accessibility. Add more
detail about the creator's background and the project's mission to
democratize AI through open data access.
Also adds a minor TODO comment in the async crawler strategy.
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.
BREAKING CHANGE: Documentation structure has been significantly reorganized
Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection.
Removed trailing whitespace at end of file.
- Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one
- Add robust error handling to page_need_scroll with default fallback
- Improve JSON extraction strategies documentation
- Refactor content scraping strategy
- Update version to 0.4.247
- Fixes critical memory leak issue where browser pages remained open
- Ensures proper cleanup of Playwright resources after page operations
- Improves resource management in browser farm implementation
This is an urgent fix to address resource leakage that could impact system stability.
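
One common way to guarantee that pages are closed is a `try`/`finally` wrapper, sketched below. `FakePage` and `FakeContext` are stand-ins so the example runs without a browser, and `managed_page` is a hypothetical helper rather than the project's actual fix.

```python
import asyncio
from contextlib import asynccontextmanager


class FakePage:
    """Stand-in for a Playwright page; only tracks whether close() was awaited."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class FakeContext:
    """Stand-in for a Playwright browser context."""

    async def new_page(self) -> FakePage:
        return FakePage()


@asynccontextmanager
async def managed_page(context):
    """Open a page and close it even if the body raises."""
    page = await context.new_page()
    try:
        yield page
    finally:
        await page.close()


async def demo() -> bool:
    seen = []
    try:
        async with managed_page(FakeContext()) as page:
            seen.append(page)
            raise RuntimeError("simulated crawl failure")
    except RuntimeError:
        pass
    # The page was closed despite the exception.
    return seen[0].closed


print(asyncio.run(demo()))  # True
```

The same wrapper should work with a real Playwright `BrowserContext`, since its async API also exposes `new_page()` and `Page.close()`.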
- Set wait_for_images default to false for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
BREAKING CHANGE: Updated `chrome_channel` to "chromium" to fix compatibility with the new Chromium headless implementation. This resolves the error `playwright._impl._errors.Error: BrowserType.launch: Chromium distribution 'chrome' is not found`, caused by the removal of the old headless mode in Chromium.
With this change, channels like "chrome" and "msedge" now default to the new headless mode, aligning with upstream updates in Playwright v1.49. The new headless mode uses the real Chrome browser, offering more authenticity, reliability, and feature parity with the full browser.
Additionally, simplified fallback logic by directly assigning `chrome_channel` based on `browser_type` or defaulting to "chromium".
Refer to:
- https://playwright.dev/python/docs/browsers#chromium
- https://github.com/microsoft/playwright/issues/33566
- Replace explicit package listing with setuptools automatic package discovery (`find`)
- Include all crawl4ai.* packages automatically
- Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax
- Bump version to 0.4.243
This change simplifies package maintenance by automatically discovering
all subpackages under crawl4ai namespace instead of listing them manually.
- Add --force flag to Playwright browser installation
- Add doctor command to test crawling functionality
- Install Chrome and Chromium browsers explicitly
- Add crawl4ai-doctor entry point in pyproject.toml
- Implement simple health check focused on crawling test
- Add pyproject.toml for PEP 517 build system support
- Configure dependencies, scripts, and metadata in pyproject.toml
- Set Python requirement to >=3.9 and add support up to 3.13
- Keep setup.py for backwards compatibility
- Move package dependencies and entry points to pyproject.toml
- Remove setup_docs() call from post_install()
- Simplify error messages for Playwright installation failures
- Use sys.executable for more accurate Python path in error messages
- Add --with-deps flag to Playwright install command
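
A sketch of how the install command and its fallback hint could be assembled with `sys.executable`. The helper names are hypothetical, and combining `--with-deps` and `--force` in a single invocation is an assumption, since the two flags were added in separate commits.

```python
import sys
from typing import List, Sequence


def playwright_install_command(browsers: Sequence[str] = ("chrome", "chromium")) -> List[str]:
    """Build the install command using the interpreter that is currently running."""
    # sys.executable ensures the Playwright module of the active environment is used,
    # rather than whichever "python" happens to be first on PATH.
    return [sys.executable, "-m", "playwright", "install", "--with-deps", "--force", *browsers]


def manual_install_hint() -> str:
    """Short message to show when automatic installation fails."""
    return "Playwright installation failed. Run manually: " + " ".join(playwright_install_command())


print(manual_install_hint())
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting.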