Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain.
Also add todo/ directory to .gitignore.
- Fixes critical memory leak issue where browser pages remained open
- Ensures proper cleanup of Playwright resources after page operations
- Improves resource management in browser farm implementation
This is an urgent fix to address resource leakage that could impact system stability.
- Added examples for Amazon product data extraction methods
- Updated configuration options and enhance documentation
- Minor refactoring for improved performance and readability
- Cleaned up version control settings.
Enhance crawler capabilities and documentation
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation management to streamline user experience.
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
Enhance Crawl4AI with CLI and documentation updates
- Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
- Added chunking strategies and their documentation in `llm.txt`
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.
Resolves Issue #256
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues