Commit Graph

204 Commits

Author SHA1 Message Date
Guilume
07b4c1c0ed fix: not working long page screenshot (#403) 2025-01-05 17:04:34 +08:00
UncleCode
da1bc0f7bf Update version file 2025-01-01 19:42:35 +08:00
UncleCode
a96e05d4ae refactor(crawler): optimize response handling and default settings
- Set wait_for_images default to false for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
2025-01-01 19:39:02 +08:00
UncleCode
5b4fad9e25 - Bump version to 0.4.244 2025-01-01 18:58:43 +08:00
UncleCode
ea0ac25f38 refactor(browser):
Update browser channel default to 'chromium' in BrowserConfig.from_args method
2025-01-01 18:58:15 +08:00
UncleCode
7688aca7d6 Update Version 2025-01-01 18:44:27 +08:00
UncleCode
a7215ad972 fix(browser): update default browser channel to chromium and simplify channel selection logic 2025-01-01 18:38:33 +08:00
UncleCode
bfe21b29d4 build: streamline package discovery and bump to v0.4.243
- Replace explicit package listing with setuptools.find
- Include all crawl4ai.* packages automatically
- Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax
- Bump version to 0.4.243

This change simplifies package maintenance by automatically discovering
all subpackages under crawl4ai namespace instead of listing them manually.
2025-01-01 17:55:59 +08:00
UncleCode
e9d9a6ffe8 fix: ensure js_snippet files are included in package
- Add js_snippet to packages list in pyproject.toml
- Verified JS files are properly included in installed package
- Bump version to 0.4.242
2025-01-01 17:38:59 +08:00
UncleCode
d36ef3d424 refactor(install): use chromium as default browser
- Remove Chrome installation to reduce setup time
- Keep Chromium as default browser for better cross-platform compatibility
2025-01-01 17:19:54 +08:00
UncleCode
dc6a24618e feat(install): add doctor command and force browser install
- Add --force flag to Playwright browser installation
- Add doctor command to test crawling functionality
- Install Chrome and Chromium browsers explicitly
- Add crawl4ai-doctor entry point in pyproject.toml
- Implement simple health check focused on crawling test
2025-01-01 16:33:43 +08:00
UncleCode
74a7c6dbb6 feat(install): specify chrome and chromium for playwright
- Install Chrome and Chromium browsers explicitly
- Split browser installation into separate commands
2025-01-01 16:10:08 +08:00
UncleCode
304260e484 refactor(install): simplify Playwright installation error handling
- Remove setup_docs() call from post_install()
- Simplify error messages for Playwright installation failures
- Use sys.executable for more accurate Python path in error messages
- Add --with-deps flag to Playwright install command
2025-01-01 15:33:36 +08:00
UncleCode
704bd66b63 Uphrade plawyright installation command to install dependencies 2025-01-01 15:23:16 +08:00
UncleCode
1acc162c18 Bumb version v0.4.241 2025-01-01 15:16:06 +08:00
UncleCode
553c97a0c1 Fix bug reported in issue https://github.com/unclecode/crawl4ai/issues/396 2025-01-01 15:15:14 +08:00
UncleCode
19b0a5ae82 Update 0.4.24 walkthrough 2024-12-31 21:01:46 +08:00
UncleCode
bd71f7f4ea Add 0.4.24 walkthrough 2024-12-31 20:22:33 +08:00
UncleCode
6c5a44f774 chore: bump version to 0.4.25 2024-12-31 19:45:48 +08:00
UncleCode
553a4622bf chore: prepare for version 0.4.24 2024-12-31 19:18:36 +08:00
UncleCode
0ec593fa90 Update the Tutorial section for new document version 2024-12-31 17:27:31 +08:00
UncleCode
fb33a24891 Commit Message:
- Added examples for Amazon product data extraction methods
  - Updated configuration options and enhance documentation
  - Minor refactoring for improved performance and readability
  - Cleaned up version control settings.
2024-12-29 20:05:18 +08:00
UncleCode
f2d9912697 Renames browser_config param to config in AsyncWebCrawler
Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.

Updates all documentation examples and internal usages to reflect the new parameter name for consistency.

Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.
2024-12-26 16:34:36 +08:00
UncleCode
9a4ed6bbd7 Commit Message:
Enhance crawler capabilities and documentation

  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation management to streamline user experience.
2024-12-26 15:17:07 +08:00
UncleCode
d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00
UncleCode
84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
UncleCode
8fbc2e0463 Refactor deployment configuration and enhance browser debugging options 2024-12-20 20:35:28 +08:00
UncleCode
849765712f Enhance Crawl4AI with new features and documentation
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.
2024-12-19 21:02:29 +08:00
UncleCode
393bb911c0 Enhance crawler strategies with new features
- ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction.
  - Updated existing extraction strategies for better performance.
  - Improved handling of response status codes during crawls.
2024-12-17 22:40:10 +08:00
UncleCode
4a5f1aebee Bump version to 0.4.23 2024-12-16 18:53:11 +08:00
UncleCode
a11d9646e3 Enhance crawler features and improve documentation
- Added detailed CrawlerRunConfig parameters documentation.
  - Introduced plans for real-time event-driven crawling.
  - Updated async logger default level to DEBUG for better insights.
  - Improved structure and readability in configuration file.
  - Enhanced documentation on future capabilities in new blog entries.
2024-12-16 18:52:51 +08:00
UncleCode
ed7bc1909c Bump version to 0.4.22 2024-12-15 19:49:38 +08:00
UncleCode
7524aa7b5e Feature: Add Markdown generation to CrawlerRunConfig
- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
  - Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
  - Updated version number to 0.4.21 in `__version__.py`.
2024-12-13 21:51:38 +08:00
UncleCode
399af801a1 Merge branch 'next' 2024-12-12 20:17:27 +08:00
UncleCode
de1766d565 Bump version to 0.4.2 2024-12-12 19:35:30 +08:00
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
  - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
  - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
  - Improved error handling with detailed context extraction during exceptions.
  - Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
5188b7a6a0 Add full-page screenshot and PDF export features
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
  - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
  - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
  - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
2024-12-10 20:59:31 +08:00
lvzhengri
759164831d Update async_webcrawler.py (#337)
add @asynccontextmanager
2024-12-10 20:56:52 +08:00
UncleCode
5431fa2d0c Add PDF & screenshot functionality, new tutorial
- Added support for exporting pages as PDFs
  - Enhanced screenshot functionality for long pages
  - Created a tutorial on dynamic content loading with 'Load More' buttons.
  - Updated web crawler to handle PDF data in responses.
2024-12-10 20:10:39 +08:00
UncleCode
e130fd8db9 Implement new async crawler features and stability updates
- Introduced new async crawl strategy with session management.
  - Added BrowserManager for improved browser management.
  - Enhanced documentation, focusing on storage state and usage examples.
  - Improved error handling and logging for sessions.
  - Added JavaScript snippets for customizing navigator properties.
2024-12-10 17:55:29 +08:00
UncleCode
2d31915f0a Commit Message:
Enhance Async Crawler with storage state handling
  - Updated Async Crawler to support storage state management.
  - Added error handling for URL validation in Async Web Crawler.
  - Modified README logo and improved .gitignore entries.
  - Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
lu4nx
ba3e808802 fix: The extract method logs output only when self.verbose is set to True. (#314)
Co-authored-by: lu4nx <lu4nx@lx-pc>
2024-12-09 17:19:26 +08:00
UncleCode
c51e901f68 feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
2024-12-08 20:04:44 +08:00
UncleCode
8c611dcb4b Refactored web scraping components
- Enhanced the web scraping strategy with new methods for optimized media handling.
  - Added new utility functions for better content processing.
  - Refined existing features for improved accuracy and efficiency in scraping tasks.
  - Introduced more robust filtering criteria for media elements.
2024-12-05 22:33:47 +08:00
UncleCode
486db3a771 Updated to version 0.4.0 with new features
- Enhanced error handling in async crawler.
  - Added flexible options in Markdown generation.
  - Updated user agent settings for improved reliability.
  - Reflected changes in documentation and examples.
2024-12-04 20:26:39 +08:00
UncleCode
e9639ad189 refactor: improve error handling in DataProcessor and optimize data parsing logic 2024-12-03 19:44:38 +08:00
UncleCode
95a4f74d2a fix: pass logger to WebScrapingStrategy and update score computation in PruningContentFilter 2024-12-02 20:37:28 +08:00
unclecode
293f299c08 Add PruningContentFilter with unit tests and update documentation
- Introduced the PruningContentFilter for better content relevance.
  - Implemented comprehensive unit tests for verification of functionality.
  - Enhanced existing BM25ContentFilter tests for edge case coverage.
  - Updated documentation to include usage examples for new filter.
2024-12-01 19:17:33 +08:00
UncleCode
80d58ad24c bump version to 0.3.747 2024-11-30 22:00:15 +08:00
UncleCode
3e83893b3f Enhance User-Agent Handling
- Added a new UserAgentGenerator class for generating random User-Agents.
  - Integrated User-Agent generation in AsyncPlaywrightCrawlerStrategy for randomization.
  - Enhanced HTTP headers with generated Client Hints.
2024-11-30 18:13:12 +08:00