Commit Graph

29 Commits

Author SHA1 Message Date
UncleCode
60ba131ac8 [v0.3.72] Enhance content extraction and proxy support
- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility
2024-10-22 20:19:22 +08:00
UncleCode
04d16e6d2b Fix Base64 image parsing in WebScrappingStrategy (issue 182)
- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
2024-10-20 19:25:25 +08:00
UncleCode
6ec4cb33ca Enhance Markdown generation and external content control
- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading
2024-10-20 18:56:58 +08:00
UncleCode
e7cd8a1c2d Update Changelog 2024-10-19 18:37:12 +08:00
UncleCode
b8147b64e0 chore: Bump version to 0.3.71 and improve error handling
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
2024-10-18 13:31:12 +08:00
UncleCode
768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
2024-10-17 21:37:48 +08:00
unclecode
2b73bdf6b0 Update changelog 2024-10-14 21:04:02 +08:00
unclecode
68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
unclecode
9b2b267820 CHANGELOG UPDATE 2024-10-12 13:42:56 +08:00
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
e5e6a34e80 ## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
-  **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
2024-08-04 14:54:18 +08:00
unclecode
9200a6731d ## [v0.2.76] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀

- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
-  **Performance boost**: Various improvements to enhance overall speed and performance.
2024-08-02 16:02:42 +08:00
unclecode
7715623430 chore: Fix typos and update .gitignore
These changes fix typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability. Additionally, the `.test_pads/` directory is removed from the `.gitignore` file to keep the repository clean and organized.
2024-07-19 17:42:39 +08:00
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
unclecode
3cd1b3719f Bump version to v0.2.73, update documentation, and resolve installation issues 2024-07-03 15:26:43 +08:00
unclecode
940df4631f Update ChangeLog 2024-06-30 00:18:40 +08:00
unclecode
f8a11779fe Update change log 2024-06-26 16:48:36 +08:00
unclecode
d11a83c232 ## [0.2.71] 2024-06-26
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues
2024-06-26 15:34:15 +08:00
unclecode
3255c7a3fa Update CHANGELOG.md with recent commits 2024-06-26 15:20:34 +08:00
unclecode
2c2362b4d3 issue 19 is resolved
- Update Dockerfile to install mkdocs and build documentation
2024-06-22 17:18:00 +08:00
unclecode
350ca1511b chore: Update configuration values, create new example, and update Dockerfile and README 2024-06-19 18:48:20 +08:00
unclecode
3f0e265baf Merge branch 'format-inline-tags' 2024-06-19 00:48:38 +08:00
unclecode
853b9d59d8 feat: Add hooks for enhanced control over Selenium drivers
- Added six hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated.
- Included example usage in quickstart.py.
- Updated README and changelog.
2024-06-18 20:00:51 +08:00
unclecode
413595542a Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24 2024-06-17 15:14:34 +08:00
unclecode
42a5da854d Update version and change log. 2024-06-17 14:47:58 +08:00
unclecode
b3a0edaa6d - User agent
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
1635a92218 chore: Update Crawl4AI quickstart script in README.md 2024-05-17 18:25:32 +08:00
unclecode
aac4e07389 chore: Update README.md and project structure 2024-05-12 12:39:31 +08:00