crawl4ai

Author	SHA1	Message	Date
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5188b7a6a0	Add full-page screenshot and PDF export features - Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance. - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`. - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters. - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.	2024-12-10 20:59:31 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
UncleCode	2d31915f0a	Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
UncleCode	8c611dcb4b	Refactored web scraping components - Enhanced the web scraping strategy with new methods for optimized media handling. - Added new utility functions for better content processing. - Refined existing features for improved accuracy and efficiency in scraping tasks. - Introduced more robust filtering criteria for media elements.	2024-12-05 22:33:47 +08:00
UncleCode	a036b7f122	feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler	2024-11-28 19:24:07 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
UncleCode	dbb751c8f0	In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have a reference section at the very end, where we add all the links with their descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction.	2024-11-21 18:21:43 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	d0014c6793	New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.	2024-11-16 14:54:41 +08:00
UncleCode	3d00fee6c2	- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crawl result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or to a given user query. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files.	2024-11-14 22:50:59 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00
bizrockman	0bba0e074f	Preventing NoneType has no attribute get Errors Sometimes the list contains Tag elements that do not have attrs set, resulting in this Error.	2024-11-04 20:12:24 +01:00
unclecode	54d5a3a259	Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.	2024-11-04 13:22:13 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by defining a new flag `include_links_on_markdown` in `crawl` method.	2024-10-17 21:37:48 +08:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	bccadec887	Remove dependency on psutil, PyYaml, and extend requests version range	2024-09-29 17:07:06 +08:00
unclecode	30807f5535	Remove excluded tags from website content	2024-09-12 16:11:20 +08:00
unclecode	b0e8b66666	Merge branch 'proxy-support' into staging	2024-09-01 16:35:14 +08:00
UncleCode	0d9b638636	Merge pull request #75 from aravindkarnam/main Added support to source tags wrapped inside video and audio tags. Ext…	2024-08-30 12:54:15 +02:00
datehoer	16f98cebc0	replace base64 image url to ''	2024-08-27 09:44:35 +08:00
datehoer	fe9ff498ce	add proxy and add ai base_url	2024-08-26 16:12:49 +08:00
unclecode	dec3d44224	refactor: Update extraction strategy to handle schema extraction with non-empty schema This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.	2024-08-19 15:37:07 +08:00
Aravind Karnam	9ed1551125	Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71	2024-08-14 11:07:26 +05:30
unclecode	9ee988753d	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 14:53:11 +08:00
unclecode	40477493d3	refactor: Remove image format dot in get_content_of_website_optimized The code change removes the dot from the image format in the `get_content_of_website_optimized` function. This change ensures consistency in the image format and improves the functionality.	2024-07-31 16:15:55 +08:00
unclecode	aa9412e1b4	refactor: Set image_size to 0 in get_content_of_website_optimized The code change sets the `image_size` variable to 0 in the `get_content_of_website_optimized` function. This change is made to temporarily disable fetching the image file size, which was causing performance issues. The image size will be fetched in a future update to improve the functionality.	2024-07-23 13:08:53 +08:00
Aravind Karnam	cf6c835e18	moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.	2024-07-21 15:18:23 +05:30
Aravind Karnam	e5ecf291f3	Implemented filtering for images and grabbing the contextual text from nearest parent	2024-07-21 15:03:17 +05:30
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. Setup.py no longer depends on Spacy. The change log is also updated to reflect these changes. Adds support for websites that need a headed browser.	2024-07-03 15:19:22 +08:00
unclecode	88d8cd8650	feat: Add page load check for LocalSeleniumCrawlerStrategy This commit adds a page load check for the LocalSeleniumCrawlerStrategy in the `crawl` method. The `_ensure_page_load` method is introduced to ensure that the page has finished loading before proceeding. This helps to prevent issues with incomplete page sources and improves the reliability of the crawler.	2024-07-01 00:07:32 +08:00
unclecode	7ba2142363	chore: Refactor get_content_of_website_optimized function in utils.py	2024-06-26 14:43:09 +08:00
unclecode	96d1eb0d0d	Some updates in utils.py	2024-06-26 13:03:03 +08:00
unclecode	78cfad8b2f	chore: Update version to 0.2.7 and improve extraction function speed	2024-06-24 22:39:56 +08:00
unclecode	d6182bedd7	chore: - Add demo page to the new mkdocs - Set website home page to mkdocs	2024-06-22 20:36:01 +08:00
unclecode	3f0e265baf	Merge branch 'format-inline-tags'	2024-06-19 00:48:38 +08:00
unclecode	413595542a	Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24	2024-06-17 15:14:34 +08:00
unclecode	b3a0edaa6d	- User agent - Extract Links - Extract Metadata - Update Readme - Update REST API document	2024-06-08 17:59:42 +08:00
unclecode	9c34b30723	Extract internal and external links.	2024-06-08 16:53:06 +08:00
unclecode	8e73a482a2	feat: Add screenshot functionality to crawl_urls The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.	2024-06-07 15:23:32 +08:00
unclecode	0533aeb814	v0.2.3: - Extract all media tags - Take screenshot of the page	2024-06-07 15:23:13 +08:00
unclecode	c8589f8da3	Update: - Fix Spacy model issue - Update Readme and requirements.txt	2024-05-16 19:50:20 +08:00
unclecode	5b80be956d	Update: - Debug - Refactor code for new version	2024-05-16 17:31:44 +08:00
unclecode	f6e59157bf	- Test all methods - Update index.html - Update Readme - Resolve some bugs	2024-05-14 21:27:41 +08:00

54 Commits