crawl4ai

Author	SHA1	Message	Date
UncleCode	dbb751c8f0	In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction.	2024-11-21 18:21:43 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
UncleCode	bf91adf3f8	fix: Resolve unexpected BrowserContext closure during crawl in Docker - Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers. - Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess. - Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes. - Improved error handling and resource cleanup for browser instances, particularly in Docker environments. Resolves Issue #256	2024-11-13 15:37:16 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
unclecode	4d48bd31ca	Push async version last changes for merge to main branch	2024-09-24 20:52:08 +08:00
unclecode	c37614cbc8	Add Async Version, JsonCss Extractor	2024-09-03 01:27:00 +08:00
Umut CAN	3c6ebb73ae	Update web_crawler.py Improve code efficiency, readability, and maintainability in web_crawler.py	2024-08-30 15:30:06 +03:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. setup.py no longer depends on Spacy. The change log is also updated to reflect these changes. Support websites that need a headed (non-headless) browser.	2024-07-03 15:19:22 +08:00
unclecode	4756d0a532	Refactor crawler_strategy.py to handle exceptions and improve error messages	2024-06-26 15:04:33 +08:00
unclecode	7ba2142363	chore: Refactor get_content_of_website_optimized function in utils.py	2024-06-26 14:43:09 +08:00
unclecode	a0dff192ae	Update README for speed example	2024-06-24 23:06:12 +08:00
unclecode	1fffeeedd2	Update Readme: Showcase the speed	2024-06-24 23:02:08 +08:00
unclecode	78cfad8b2f	chore: Update version to 0.2.7 and improve extraction function speed	2024-06-24 22:39:56 +08:00
unclecode	d6182bedd7	chore: - Add demo page to the new mkdocs - Set website home page to mkdocs	2024-06-22 20:36:01 +08:00
unclecode	539263a8ba	chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README	2024-06-19 18:32:20 +08:00
unclecode	3f0e265baf	Merge branch 'format-inline-tags'	2024-06-19 00:48:38 +08:00
unclecode	f7e0cee1b0	vital: Right now, only raw HTML is retrieved from the database; therefore, CSS selectors and other filters will be executed every time.	2024-06-08 18:37:40 +08:00
unclecode	b3a0edaa6d	- User agent - Extract Links - Extract Metadata - Update Readme - Update REST API document	2024-06-08 17:59:42 +08:00
unclecode	9c34b30723	Extract internal and external links.	2024-06-08 16:53:06 +08:00
unclecode	a19379aa58	Add recipe images, update README, and REST api example	2024-06-07 20:43:50 +08:00
unclecode	8e73a482a2	feat: Add screenshot functionality to crawl_urls The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`. This commit message follows the established convention of starting with a type (feat for feature) and providing a concise and descriptive summary of the changes made.	2024-06-07 15:23:32 +08:00
unclecode	0533aeb814	v0.2.3: - Extract all media tags - Take screenshot of the page	2024-06-07 15:23:13 +08:00
unclecode	b6319c6f6e	chore: Add support for GPU, MPS, and CPU	2024-05-17 21:56:13 +08:00
unclecode	36e46be23d	chore: Add verbose option to ExtractionStrategy classes This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.	2024-05-17 18:06:10 +08:00
unclecode	f52f526002	chore: Update web_crawler.py to use NoExtractionStrategy as default	2024-05-17 16:03:35 +08:00
unclecode	3f8576f870	chore: Update model_loader.py to use pretrained models without resume_download	2024-05-17 15:26:15 +08:00
unclecode	a317dc5e1d	Load CosineStrategy in the function	2024-05-17 15:13:06 +08:00
unclecode	a5f9d07dbf	Remove dependency on Spacy model.	2024-05-17 15:08:03 +08:00
UncleCode	5b4a586b2d	Update web_crawler.py Set CosineExtraction as default strategy	2024-05-16 22:28:24 +08:00
UncleCode	a856319499	Update web_crawler.py Set NoExtractionStrategy for FetchPages	2024-05-16 22:06:33 +08:00
UncleCode	5ce1dc1622	Update web_crawler.py Set all extraction strategies default to NoExtractionStrategy	2024-05-16 21:58:11 +08:00
unclecode	ea16dec587	Improve library loading	2024-05-16 21:19:02 +08:00
unclecode	c8589f8da3	Update: - Fix Spacy model issue - Update Readme and requirements.txt	2024-05-16 19:50:20 +08:00
unclecode	5b80be956d	Update: - Debug - Refactor code for new version	2024-05-16 17:31:44 +08:00
unclecode	f6e59157bf	- Test all methods - Update index.html - Update Readme - Resolve some bugs	2024-05-14 21:27:41 +08:00
unclecode	5fea6c064b	Improve libraries import	2024-05-13 02:46:35 +08:00
unclecode	7679064521	Add model parameter for clustering.	2024-05-13 00:06:16 +08:00
unclecode	5693e324a4	Add time measurements.	2024-05-12 23:35:27 +08:00
unclecode	82706129f5	Update: - Text Categorization - Crawler, Extraction, and Chunking strategies - Clustering for semantic segmentation	2024-05-12 22:37:21 +08:00
unclecode	7039e3c1ee	- Issue Resolved: Every `<pre>` tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a `<span>`. This avoids issues where the minimum word threshold might ignore them.	2024-05-12 14:08:22 +08:00
unclecode	372c921429	Update: Fix bug when user sets extract_blocks to False	2024-05-10 20:12:31 +08:00
unclecode	3ff1d15702	Change the project folder name from crawler to crawl4ai	2024-05-09 22:16:28 +08:00

45 Commits