111 Commits

Author SHA1 Message Date
UncleCode
4239654722 Update Documentation 2024-10-27 19:24:46 +08:00
UncleCode
38474bd66a Update version 2024-10-24 20:24:21 +08:00
UncleCode
bcfe83f702 feat: enhance crawler with overlay removal and improved screenshot capabilities
• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.

Breaking changes: None
Issue numbers: None
2024-10-24 20:22:47 +08:00
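The URL normalization and duplicate detection mentioned in the commit above could look roughly like the following. This is a minimal sketch using only the standard library, not the crawler's actual code; the names `normalize_url` and `unique_links` and the specific rules (lower-casing the host, dropping fragments, trimming trailing slashes) are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(href: str, base_url: str) -> str:
    """Resolve href against base_url and return a canonical form (hypothetical helper)."""
    absolute = urljoin(base_url, href.strip())
    parts = urlsplit(absolute)
    # Assumed rules: lower-case scheme and host, drop the fragment,
    # and trim a trailing slash from non-root paths.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def unique_links(hrefs, base_url):
    """Return normalized links in first-seen order, skipping duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        url = normalize_url(href, base_url)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

With these rules, `/docs/`, `/docs#intro`, and the absolute form of the same page collapse to one entry.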
UncleCode
60ba131ac8 [v0.3.72] Enhance content extraction and proxy support
- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility
2024-10-22 20:19:22 +08:00
UncleCode
04d16e6d2b Fix Base64 image parsing in WebScrappingStrategy (issue 182)
- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
2024-10-20 19:25:25 +08:00
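As an illustration of the Base64 image handling described in the commit above, the sketch below detects a `data:image/...;base64,` source and decodes it. It is a standard-library approximation, not the code in WebScrappingStrategy; `parse_base64_image` and `DATA_URI_RE` are hypothetical names.

```python
import base64
import re

# Matches sources such as "data:image/png;base64,iVBORw0..." (assumed pattern).
DATA_URI_RE = re.compile(
    r"^data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,(?P<data>.+)$", re.DOTALL
)


def parse_base64_image(src: str):
    """Return (format, raw_bytes) for a Base64 data URI, or None for any other source."""
    match = DATA_URI_RE.match(src.strip())
    if match is None:
        return None
    return match.group("fmt").lower(), base64.b64decode(match.group("data"))
```

A regular `https://` image URL returns `None`, so a caller can fall back to its usual handling for linked images.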
UncleCode
1dd36f9035 Refactor content scrapping strategy and improve error handling 2024-10-20 19:11:18 +08:00
UncleCode
6ec4cb33ca Enhance Markdown generation and external content control
- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading
2024-10-20 18:56:58 +08:00
UncleCode
4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance
- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
2024-10-19 18:36:59 +08:00
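The idea behind overlapping window chunking, as named in the commit above, can be sketched as follows. This is a word-based illustration only; the function name, parameters, and defaults are assumptions and do not reflect the signature of OverlappingWindowChunking in the library.

```python
def overlapping_window_chunks(text: str, window_size: int = 5, overlap: int = 2):
    """Split text into word windows where consecutive windows share `overlap` words."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("window_size must be positive and overlap smaller than window_size")
    words = text.split()
    step = window_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window already reaches the end of the text.
        if start + window_size >= len(words):
            break
    return chunks
```

The overlap lets text near a chunk boundary appear in both neighbouring chunks, at the cost of some repeated tokens.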
UncleCode
b8147b64e0 chore: Bump version to 0.3.71 and improve error handling
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
2024-10-18 13:31:12 +08:00
UncleCode
aab6ea022e Update requirements and switch to 0.3.8 2024-10-18 12:51:23 +08:00
UncleCode
dd17ed0e63 Rename some flags and introduce the magic flag. 2024-10-18 12:35:09 +08:00
UncleCode
768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method.
2024-10-17 21:37:48 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
unclecode
68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
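The commit above refers to a completion call with backoff. The general retry pattern can be sketched as below; this is not the implementation of perform_completion_with_backoff, and `call_with_backoff` with its parameters is a hypothetical helper. Catching every exception is a simplification for the sketch.

```python
import random
import time


def call_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; a little jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.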
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
e0e0db4247 Bump version to 0.3.4 2024-09-29 17:07:52 +08:00
unclecode
bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range 2024-09-29 17:07:06 +08:00
unclecode
0759503e50 Extend numpy version range to support Python 3.9 2024-09-29 00:08:02 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
64190dd0c4 Update README 2024-09-25 17:26:13 +08:00
unclecode
f1eee09cf4 Update README, add manifest, make selenium optional library 2024-09-25 16:35:14 +08:00
unclecode
4d48bd31ca Push async version last changes for merge to main branch 2024-09-24 20:52:08 +08:00
unclecode
d628bc4034 Refactor content_scrapping_strategy.py to remove excluded tags 2024-09-12 17:35:45 +08:00
unclecode
30807f5535 Remove excluded tags from website content 2024-09-12 16:11:20 +08:00
unclecode
396f430022 Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse
This commit refactors the AsyncCrawlerStrategy class in the async_crawler_strategy.py file to modify the return types of the crawl and crawl_many methods. Instead of returning strings, these methods now return instances of the AsyncCrawlResponse class from the pydantic module. The AsyncCrawlResponse class contains the crawled HTML, response headers, and status code. This change improves the clarity and consistency of the code.
2024-09-12 15:49:49 +08:00
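To show why a structured response is easier to work with than a bare string, here is a minimal sketch of a response object carrying the three pieces of data named in the commit above. It uses a standard-library dataclass rather than pydantic, and the class name, field names, and defaults are assumptions rather than the actual AsyncCrawlResponse definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CrawlResponseSketch:
    """Hypothetical stand-in for a crawl response: HTML, headers, status code."""
    html: str
    response_headers: Dict[str, str] = field(default_factory=dict)
    status_code: int = 200


def successful_pages(responses: List[CrawlResponseSketch]) -> List[str]:
    """Keep only the HTML of responses with a 2xx status code."""
    return [r.html for r in responses if 200 <= r.status_code < 300]
```

With plain strings, a caller could not tell an error page from a successful one; here the status code travels with the HTML.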
unclecode
5c15837677 chore: Update README, generate new notebook for quickstart 2024-09-04 14:46:22 +08:00
unclecode
2fada16abb chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy 2024-09-03 23:32:27 +08:00
unclecode
c37614cbc8 Add Async Version, JsonCss Extractor 2024-09-03 01:27:00 +08:00
unclecode
3116f95c1a Merge branch 'pull-84' into staging 2024-09-01 16:44:06 +08:00
unclecode
b0e8b66666 Merge branch 'proxy-support' into staging 2024-09-01 16:35:14 +08:00
unclecode
3caf48c9be refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided 2024-09-01 16:34:51 +08:00
Umut CAN
3c6ebb73ae Update web_crawler.py
Improve code efficiency, readability, and maintainability in web_crawler.py
2024-08-30 15:30:06 +03:00
UncleCode
0d9b638636 Merge pull request #75 from aravindkarnam/main
Added support to source tags wrapped inside video and audio tags. Ext…
2024-08-30 12:54:15 +02:00
datehoer
16f98cebc0 replace base64 image url with '' 2024-08-27 09:44:35 +08:00
datehoer
fe9ff498ce add proxy and add ai base_url 2024-08-26 16:12:49 +08:00
Datehoer
eba831ca30 fix spelling mistake 2024-08-26 15:29:23 +08:00
unclecode
dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema
This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.
2024-08-19 15:37:07 +08:00
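The condition described in the commit above amounts to a two-part check. The sketch below states it as a standalone function; `should_extract_schema` is a hypothetical name, and the real check lives inside `LLMExtractionStrategy`.

```python
def should_extract_schema(extract_type: str, schema) -> bool:
    """Schema extraction runs only for extract_type "schema" with a non-empty schema."""
    return extract_type == "schema" and bool(schema)
```

Before the change, only the first half of this condition applied, so an empty schema could still trigger schema extraction.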
Aravind Karnam
9ed1551125 Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71 2024-08-14 11:07:26 +05:30
unclecode
e5e6a34e80 ## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
2024-08-04 14:54:18 +08:00
unclecode
9ee988753d refactor: Update image description minimum word threshold in get_content_of_website_optimized 2024-08-02 14:53:11 +08:00
unclecode
b6713870ef refactor: Update Dockerfile to install Crawl4AI with specified options
This commit updates the Dockerfile to install Crawl4AI with the specified options. The `INSTALL_OPTION` build argument is used to determine which additional packages to install. If the option is set to "all", all models will be downloaded. If the option is set to "torch", only torch models will be downloaded. If the option is set to "transformer", only transformer models will be downloaded. If no option is specified, the default installation will be used. This change improves the flexibility and customization of the Crawl4AI installation process.
2024-08-01 17:56:19 +08:00
unclecode
40477493d3 refactor: Remove image format dot in get_content_of_website_optimized
The code change removes the dot from the image format in the `get_content_of_website_optimized` function. This change ensures consistency in the image format and improves the functionality.
2024-07-31 16:15:55 +08:00
Kevin Moturi
efcf3ac6eb Update LocalSeleniumCrawlerStrategy to resolve ChromeDriver version mismatch issue
This resolves the following error that Windows users are getting: `selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 114`
2024-07-31 13:33:09 +08:00
unclecode
9e43f7beda refactor: Temporarily disable fetching image file size in get_content_of_website_optimized
Set the `image_size` variable to 0 in the `get_content_of_website_optimized` function to temporarily disable fetching the image file size. This change addresses performance issues and will be improved in a future update.

Update Dockerfile for Linux users
2024-07-31 13:29:23 +08:00
unclecode
aa9412e1b4 refactor: Set image_size to 0 in get_content_of_website_optimized
The code change sets the `image_size` variable to 0 in the `get_content_of_website_optimized` function. This change is made to temporarily disable fetching the image file size, which was causing performance issues. The image size will be fetched in a future update to improve the functionality.
2024-07-23 13:08:53 +08:00
Aravind Karnam
cf6c835e18 Moved the score threshold to config.py and changed the separator for tag.get_text in the find_closest_parent_with_useful_text function from a period (.) to a space ( ) to keep the text more neutral. 2024-07-21 15:18:23 +05:30
Aravind Karnam
e5ecf291f3 Implemented filtering for images and grabbing the contextual text from nearest parent 2024-07-21 15:03:17 +05:30
Aravind Karnam
9d0cafcfa6 fixed import error in model_loader.py 2024-07-21 14:55:58 +05:30
unclecode
f5a4e80e2c chore: Fix typo in chunking_strategy.py and crawler_strategy.py
The commit fixes a typo in the `chunking_strategy.py` file where `nl.toknize.TextTilingTokenizer()` was corrected to `nl.tokenize.TextTilingTokenizer()`. Additionally, in the `crawler_strategy.py` file, the commit converts the screenshot image to RGB mode before saving it as a JPEG. This ensures consistent image quality and compression.
2024-07-19 17:40:31 +08:00