- Added examples for Amazon product data extraction methods
- Updated configuration options and enhance documentation
- Minor refactoring for improved performance and readability
- Cleaned up version control settings.
Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.
Updates all documentation examples and internal usages to reflect the new parameter name for consistency.
Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.
Enhance crawler capabilities and documentation
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation management to streamline user experience.
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
@Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now, when you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to contributor name as well.
Enhance Crawl4AI with CLI and documentation updates
- Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
- Added chunking strategies and their documentation in `llm.txt`
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
- ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction.
- Updated existing extraction strategies for better performance.
- Improved handling of response status codes during crawls.
- Added detailed CrawlerRunConfig parameters documentation.
- Introduced plans for real-time event-driven crawling.
- Updated async logger default level to DEBUG for better insights.
- Improved structure and readability in configuration file.
- Enhanced documentation on future capabilities in new blog entries.
- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
- Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
- Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
- Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
- Added support for exporting pages as PDFs
- Enhanced screenshot functionality for long pages
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.
Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.
### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.
### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.
### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.
### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
- Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.
This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
- Enhanced the web scraping strategy with new methods for optimized media handling.
- Added new utility functions for better content processing.
- Refined existing features for improved accuracy and efficiency in scraping tasks.
- Introduced more robust filtering criteria for media elements.