Commit Graph

54 Commits

Author SHA1 Message Date
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
  - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
  - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
  - Improved error handling with detailed context extraction during exceptions.
  - Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
5188b7a6a0 Add full-page screenshot and PDF export features
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
  - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
  - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
  - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
2024-12-10 20:59:31 +08:00
UncleCode
e130fd8db9 Implement new async crawler features and stability updates
- Introduced new async crawl strategy with session management.
  - Added BrowserManager for improved browser management.
  - Enhanced documentation, focusing on storage state and usage examples.
  - Improved error handling and logging for sessions.
  - Added JavaScript snippets for customizing navigator properties.
2024-12-10 17:55:29 +08:00
UncleCode
2d31915f0a Commit Message:
Enhance Async Crawler with storage state handling
  - Updated Async Crawler to support storage state management.
  - Added error handling for URL validation in Async Web Crawler.
  - Modified README logo and improved .gitignore entries.
  - Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
UncleCode
8c611dcb4b Refactored web scraping components
- Enhanced the web scraping strategy with new methods for optimized media handling.
  - Added new utility functions for better content processing.
  - Refined existing features for improved accuracy and efficiency in scraping tasks.
  - Introduced more robust filtering criteria for media elements.
2024-12-05 22:33:47 +08:00
UncleCode
a036b7f122 feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler 2024-11-28 19:24:07 +08:00
UncleCode
24723b2f10 Enhance features and documentation
- Updated version to 0.3.743
  - Improved ManagedBrowser configuration with dynamic host/port
  - Implemented fast HTML formatting in web crawler
  - Enhanced markdown generation with a new generator class
  - Improved sanitization and utility functions
  - Added contributor details and pull request acknowledgments
  - Updated documentation for clearer usage scenarios
  - Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
UncleCode
dbb751c8f0 In this commit, we introduce the new concept of MakrdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. 2024-11-21 18:21:43 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
d0014c6793 New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
  - Added migration feature to transition to file-based storage.
  - Enhanced web crawler with improved caching logic.
  - Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
3d00fee6c2 - In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crowd result object.
- Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or supports a given user query and matches them. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well.
- The cache database was updated to hold information about response headers and downloaded files.
2024-11-14 22:50:59 +08:00
UncleCode
c38ac29edb perf(crawler): major performance improvements & raw HTML support
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns

Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
Mahesh
00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments 2024-11-12 14:52:51 -07:00
UncleCode
c5aa1bec18 Merge pull request #229 from bizrockman/main
Preventing NoneType has no attribute get Errors
2024-11-06 07:31:07 +01:00
bizrockman
0bba0e074f Preventing NoneType has no attribute get Errors
Sometimes the list contains Tag elements that do not have attrs set, resulting in this Error.
2024-11-04 20:12:24 +01:00
unclecode
54d5a3a259 Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies. 2024-11-04 13:22:13 +08:00
UncleCode
bcfe83f702 feat: enhance crawler with overlay removal and improved screenshot capabilities
• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.

Breaking changes: None
Issue numbers: None
2024-10-24 20:22:47 +08:00
UncleCode
6ec4cb33ca Enhance Markdown generation and external content control
- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading
2024-10-20 18:56:58 +08:00
UncleCode
768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
2024-10-17 21:37:48 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
unclecode
68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
unclecode
bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range 2024-09-29 17:07:06 +08:00
unclecode
30807f5535 Remove excluded tags from website content 2024-09-12 16:11:20 +08:00
unclecode
b0e8b66666 Merge branch 'proxy-support' into staging 2024-09-01 16:35:14 +08:00
UncleCode
0d9b638636 Merge pull request #75 from aravindkarnam/main
Added support to source tags wrapped inside video and audio tags. Ext…
2024-08-30 12:54:15 +02:00
datehoer
16f98cebc0 replace base64 image url to '' 2024-08-27 09:44:35 +08:00
datehoer
fe9ff498ce add proxy and add ai base_url 2024-08-26 16:12:49 +08:00
unclecode
dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema
This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.
2024-08-19 15:37:07 +08:00
Aravind Karnam
9ed1551125 Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71 2024-08-14 11:07:26 +05:30
unclecode
9ee988753d refactor: Update image description minimum word threshold in get_content_of_website_optimized 2024-08-02 14:53:11 +08:00
unclecode
40477493d3 refactor: Remove image format dot in get_content_of_website_optimized
The code change removes the dot from the image format in the `get_content_of_website_optimized` function. This change ensures consistency in the image format and improves the functionality.
2024-07-31 16:15:55 +08:00
unclecode
aa9412e1b4 refactor: Set image_size to 0 in get_content_of_website_optimized
The code change sets the `image_size` variable to 0 in the `get_content_of_website_optimized` function. This change is made to temporarily disable fetching the image file size, which was causing performance issues. The image size will be fetched in a future update to improve the functionality.
2024-07-23 13:08:53 +08:00
Aravind Karnam
cf6c835e18 moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral. 2024-07-21 15:18:23 +05:30
Aravind Karnam
e5ecf291f3 Implemented filtering for images and grabbing the contextual text from nearest parent 2024-07-21 15:03:17 +05:30
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
unclecode
9926eb9f95 feat: Bump version to v0.2.73 and update documentation
This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile.

Docker file install the default mode, this resolve many of installation issues.

Additionally, the installation instructions are updated to include support for different modes. Setup.py doesn't have anymore dependancy on Spacy.

The change log is also updated to reflect these changes.

Supporting websites need with-head browser.
2024-07-03 15:19:22 +08:00
unclecode
88d8cd8650 feat: Add page load check for LocalSeleniumCrawlerStrategy
This commit adds a page load check for the LocalSeleniumCrawlerStrategy in the `crawl` method. The `_ensure_page_load` method is introduced to ensure that the page has finished loading before proceeding. This helps to prevent issues with incomplete page sources and improves the reliability of the crawler.
2024-07-01 00:07:32 +08:00
unclecode
7ba2142363 chore: Refactor get_content_of_website_optimized function in utils.py 2024-06-26 14:43:09 +08:00
unclecode
96d1eb0d0d Some updated ins utils.py 2024-06-26 13:03:03 +08:00
unclecode
78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed 2024-06-24 22:39:56 +08:00
unclecode
d6182bedd7 chore:
- Add demo page to the new mkdocs
- Set website home page to mkdocs
2024-06-22 20:36:01 +08:00
unclecode
3f0e265baf Merge branch 'format-inline-tags' 2024-06-19 00:48:38 +08:00
unclecode
413595542a Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24 2024-06-17 15:14:34 +08:00
unclecode
b3a0edaa6d - User agent
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
9c34b30723 Extract internal and external links. 2024-06-08 16:53:06 +08:00
unclecode
8e73a482a2 feat: Add screenshot functionality to crawl_urls
The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.

This commit message follows the established convention of starting with a type (feat for feature) and providing a concise and descriptive summary of the changes made.
2024-06-07 15:23:32 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
c8589f8da3 Update:
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
5b80be956d Update:
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.hml
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00