crawl4ai

Author	SHA1	Message	Date
unclecode	9ffa34b697	Update README v0.3.6	2024-10-14 22:58:27 +08:00
unclecode	740802c491	Merge branch '0.3.6'	2024-10-14 22:55:24 +08:00
unclecode	b9ac96c332	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-10-14 22:54:23 +08:00
unclecode	d06535388a	Update gitignore	2024-10-14 22:53:56 +08:00
unclecode	2b73bdf6b0	Update changelog	2024-10-14 21:04:02 +08:00
unclecode	6aa803d712	Update gitignore	2024-10-14 21:03:40 +08:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
UncleCode	ccbe72cfc1	Merge pull request #135 from hitesh22rana/fix/docs-example docs: fixed css_selector for example	2024-10-13 14:39:07 +08:00
unclecode	b9bbd42373	Update Quickstart examples	2024-10-13 14:37:45 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	9b2b267820	CHANGELOG UPDATE	2024-10-12 13:42:56 +08:00
unclecode	ff3524d9b1	feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts - Implement screenshot capture functionality - Add delayed content retrieval method - Introduce custom page timeout parameter - Enhance LLM support with multiple providers - Improve database schema auto-updates - Optimize image processing in WebScrappingStrategy - Update error handling and logging - Expand examples in quickstart_async.py	2024-10-12 13:42:42 +08:00
unclecode	b99d20b725	Add pypi_build.sh to .gitignore	2024-10-08 18:10:57 +08:00
hitesh22rana	768b93140f	docs: fixed css_selector for example	2024-10-05 00:25:41 +09:00
unclecode	4750810a67	Enhance AsyncWebCrawler with smart waiting and screenshot capabilities - Implement smart_wait function in AsyncPlaywrightCrawlerStrategy - Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler - Improve error handling and timeout management in crawling process - Fix typo in CrawlResult model (responser_headers -> response_headers) - Update .gitignore to exclude additional files - Adjust import path in test_basic_crawling.py	2024-10-02 17:34:56 +08:00
unclecode	e0e0db4247	Bump version to 0.3.4 0.3.4	2024-09-29 17:07:52 +08:00
unclecode	bccadec887	Remove dependency on psutil, PyYaml, and extend requests version range	2024-09-29 17:07:06 +08:00
unclecode	0759503e50	Extend numpy version range to support Python 3.9 v0.3.3	2024-09-29 00:08:02 +08:00
unclecode	7f1c020746	Update README to add link to previous version in branch V0.2.76	2024-09-28 00:31:53 +08:00
unclecode	5d4e92db7d	Update quickstart_async.py to improve performance and add Firecrawl simulation v0.3.0	2024-09-28 00:11:39 +08:00
unclecode	8b6e88c85c	Update .gitignore to ignore temporary and test directories	2024-09-26 15:09:49 +08:00
unclecode	64190dd0c4	Update README	2024-09-25 17:26:13 +08:00
unclecode	7100bcdf04	Add session based crawling documentation	2024-09-25 17:16:55 +08:00
unclecode	10cdad039d	Update documents and README	2024-09-25 16:52:11 +08:00
unclecode	f1eee09cf4	Update README, add manifest, make selenium optional library	2024-09-25 16:35:14 +08:00
unclecode	4d48bd31ca	Push async version last changes for merge to main branch	2024-09-24 20:52:08 +08:00
unclecode	d628bc4034	Refactor content_scrapping_strategy.py to remove excluded tags	2024-09-12 17:35:45 +08:00
unclecode	b179aa9b6f	Refactor website content and setup.py descriptions for consistent terminology	2024-09-12 16:50:52 +08:00
unclecode	30807f5535	Remove excluded tags from website content	2024-09-12 16:11:20 +08:00
unclecode	396f430022	Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse This commit refactors the AsyncCrawlerStrategy class in the async_crawler_strategy.py file to modify the return types of the crawl and crawl_many methods. Instead of returning strings, these methods now return instances of the AsyncCrawlResponse class from the pydantic module. The AsyncCrawlResponse class contains the crawled HTML, response headers, and status code. This change improves the clarity and consistency of the code.	2024-09-12 15:49:49 +08:00
unclecode	eb131bebdf	Create series of quickstart files.	2024-09-04 15:33:24 +08:00
unclecode	5c15837677	chore: Update README, generate new notbook for quickstart	2024-09-04 14:46:22 +08:00
unclecode	2fada16abb	chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy	2024-09-03 23:32:27 +08:00
unclecode	c37614cbc8	Add Async Version, JsonCss Extrator	2024-09-03 01:27:00 +08:00
unclecode	3116f95c1a	Merge branch 'pull-84' into staging	2024-09-01 16:44:06 +08:00
unclecode	b0e8b66666	Merge branch 'proxy-support' into staging	2024-09-01 16:35:14 +08:00
unclecode	3caf48c9be	refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided	2024-09-01 16:34:51 +08:00
Umut CAN	3c6ebb73ae	Update web_crawler.py Improve code efficiency, readability, and maintainability in web_crawler.py	2024-08-30 15:30:06 +03:00
UncleCode	0d9b638636	Merge pull request #75 from aravindkarnam/main Added support to source tags wrapped inside video and audio tags. Ext…	2024-08-30 12:54:15 +02:00
datehoer	2ba70b9501	add use proxy and llm baseurl examples	2024-08-27 10:14:54 +08:00
datehoer	16f98cebc0	replace base64 image url to ''	2024-08-27 09:44:35 +08:00
datehoer	fe9ff498ce	add proxy and add ai base_url	2024-08-26 16:12:49 +08:00
Datehoer	eba831ca30	fix spelling mistake	2024-08-26 15:29:23 +08:00
unclecode	dec3d44224	refactor: Update extraction strategy to handle schema extraction with non-empty schema This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.	2024-08-19 15:37:07 +08:00
Aravind Karnam	9ed1551125	Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71	2024-08-14 11:07:26 +05:30
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI. v0.2.77	2024-08-04 14:54:18 +08:00
unclecode	897e766728	Update README	2024-08-02 16:04:14 +08:00
unclecode	9200a6731d	## [v0.2.76] - 2024-08-02 Major improvements in functionality, performance, and cross-platform compatibility! 🚀 - 🐳 Docker enhancements: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows. - 🌐 Official Docker Hub image: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai). - 🔧 Selenium upgrade: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility. - 🖼️ Image description: Implemented ability to generate textual descriptions for extracted images from web pages. - ⚡ Performance boost: Various improvements to enhance overall speed and performance.	2024-08-02 16:02:42 +08:00
unclecode	61c166ab19	refactor: Update Crawl4AI version to v0.2.76 This commit updates the Crawl4AI version from v0.2.7765 to v0.2.76. The version number is updated in the README.md file. This change ensures consistency and reflects the correct version of the software.	2024-08-02 15:55:53 +08:00
unclecode	659c8cd953	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 15:55:32 +08:00

252 Commits