crawl4ai

Author	SHA1	Message	Date
unclecode	5d4e92db7d	Update quickstart_async.py to improve performance and add Firecrawl simulation	2024-09-28 00:11:39 +08:00
unclecode	8b6e88c85c	Update .gitignore to ignore temporary and test directories	2024-09-26 15:09:49 +08:00
unclecode	10cdad039d	Update documents and README	2024-09-25 16:52:11 +08:00
unclecode	4d48bd31ca	Push async version last changes for merge to main branch	2024-09-24 20:52:08 +08:00
unclecode	eb131bebdf	Create series of quickstart files.	2024-09-04 15:33:24 +08:00
unclecode	5c15837677	chore: Update README, generate new notebook for quickstart	2024-09-04 14:46:22 +08:00
unclecode	2fada16abb	chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy	2024-09-03 23:32:27 +08:00
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.	2024-08-04 14:54:18 +08:00
unclecode	9200a6731d	## [v0.2.76] - 2024-08-02 Major improvements in functionality, performance, and cross-platform compatibility! 🚀 - 🐳 Docker enhancements: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows. - 🌐 Official Docker Hub image: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai). - 🔧 Selenium upgrade: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility. - 🖼️ Image description: Implemented ability to generate textual descriptions for extracted images from web pages. - ⚡ Performance boost: Various improvements to enhance overall speed and performance.	2024-08-02 16:02:42 +08:00
unclecode	659c8cd953	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 15:55:32 +08:00
unclecode	8ae6c43ca4	refactor: Update Dockerfile to install Crawl4AI with specified options	2024-08-01 20:13:06 +08:00
unclecode	b6713870ef	refactor: Update Dockerfile to install Crawl4AI with specified options This commit updates the Dockerfile to install Crawl4AI with the specified options. The `INSTALL_OPTION` build argument is used to determine which additional packages to install. If the option is set to "all", all models will be downloaded. If the option is set to "torch", only torch models will be downloaded. If the option is set to "transformer", only transformer models will be downloaded. If no option is specified, the default installation will be used. This change improves the flexibility and customization of the Crawl4AI installation process.	2024-08-01 17:56:19 +08:00
unclecode	7715623430	chore: Fix typos and update .gitignore These changes fix typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability. Additionally, the `.test_pads/` directory is removed from the `.gitignore` file to keep the repository clean and organized.	2024-07-19 17:42:39 +08:00
unclecode	ca0336af9e	feat: Add error handling for rate limit exceeded in form submission This commit adds error handling for rate limit exceeded in the form submission process. If the server returns a 429 status code, the client will display an error message indicating the rate limit has been exceeded and provide information on when the user can try again. This improves the user experience by providing clear feedback and guidance when rate limits are reached.	2024-07-08 20:24:00 +08:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00
unclecode	3cd1b3719f	Bump version to v0.2.73, update documentation, and resolve installation issues	2024-07-03 15:26:43 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. Setup.py no longer depends on spaCy. The change log is also updated to reflect these changes. Adds support for websites that need a headed (non-headless) browser.	2024-07-03 15:19:22 +08:00
unclecode	d58286989c	UPDATE DOCUMENTS	2024-06-30 00:34:02 +08:00
unclecode	b58af3349c	chore: Update installation instructions with support for different modes	2024-06-30 00:22:17 +08:00
unclecode	685706e0aa	Update version, and change log	2024-06-30 00:17:43 +08:00
unclecode	61ae2de841	1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well.	2024-06-30 00:15:29 +08:00
unclecode	f8a11779fe	Update change log	2024-06-26 16:48:36 +08:00
unclecode	d11a83c232	## [0.2.71] 2024-06-26 • Refactored `crawler_strategy.py` to handle exceptions and improve error messages • Improved `get_content_of_website_optimized` function in `utils.py` for better performance • Updated `utils.py` with latest changes • Migrated to `ChromeDriverManager` for resolving Chrome driver download issues	2024-06-26 15:34:15 +08:00
unclecode	78cfad8b2f	chore: Update version to 0.2.7 and improve extraction function speed	2024-06-24 22:39:56 +08:00
unclecode	68b3dff74a	Update CSS	2024-06-23 00:36:03 +08:00
unclecode	bfc4abd6e8	Update documents	2024-06-22 20:57:03 +08:00
unclecode	d6182bedd7	chore: - Add demo page to the new mkdocs - Set website home page to mkdocs	2024-06-22 20:36:01 +08:00
unclecode	2c2362b4d3	issue 19 is resolved - Update Dockerfile to install mkdocs and build documentation	2024-06-22 17:18:00 +08:00
unclecode	e7705e661a	ADD MKDocs	2024-06-21 17:56:54 +08:00
unclecode	21b110bfd7	Update LLMExtractionStrategy to disable chunking if specified, Add example of summarization for a web page.	2024-06-19 19:03:35 +08:00
unclecode	539263a8ba	chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README	2024-06-19 18:32:20 +08:00
unclecode	3f0e265baf	Merge branch 'format-inline-tags'	2024-06-19 00:48:38 +08:00
unclecode	21e2538e57	Update quickstart.py	2024-06-19 00:37:53 +08:00
unclecode	77da48050d	chore: Add custom headers to LocalSeleniumCrawlerStrategy	2024-06-17 15:50:03 +08:00
unclecode	9a97aacd85	chore: Add hooks for customizing the LocalSeleniumCrawlerStrategy	2024-06-17 15:37:18 +08:00
unclecode	b3a0edaa6d	- User agent - Extract Links - Extract Metadata - Update Readme - Update REST API document	2024-06-08 17:59:42 +08:00
unclecode	36a5847df5	Add css selector example	2024-06-07 20:47:20 +08:00
unclecode	a19379aa58	Add recipe images, update README, and REST api example	2024-06-07 20:43:50 +08:00
unclecode	768d048e1c	Update REST call usage instructions	2024-06-07 18:10:45 +08:00
unclecode	94c11a0262	Add image	2024-06-07 18:09:21 +08:00
unclecode	aeb2114170	Add example of REST API call	2024-06-07 16:24:40 +08:00
unclecode	226a62a3c0	feat: Add screenshot functionality to crawl_urls	2024-06-07 15:33:15 +08:00
unclecode	8e73a482a2	feat: Add screenshot functionality to crawl_urls The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.	2024-06-07 15:23:32 +08:00
unclecode	c7553b1280	Update research assistant example with package installation instructions	2024-06-04 23:18:19 +08:00
unclecode	8b8683f22e	Add research assistant example using Chainlit	2024-06-04 22:43:09 +08:00
unclecode	51f26d12fe	Update for v0.2.2 - Support multiple JS scripts - Fixed some bugs - Resolved a few issues related to Colab installation	2024-06-02 15:40:18 +08:00
unclecode	13a3b21d19	- Add ONNX embedding model for CPU devices, update the similarity threshold, improve the embedding speed.	2024-05-19 22:30:10 +08:00
unclecode	eb6423875f	chore: Update Selenium options in crawler_strategy.py and add verbose logging in CosineStrategy	2024-05-18 14:13:06 +08:00
unclecode	b6319c6f6e	chore: Add support for GPU, MPS, and CPU	2024-05-17 21:56:13 +08:00
unclecode	957a2458b1	chore: Update web crawler URLs to use NBC News business section	2024-05-17 18:11:13 +08:00

57 Commits