crawl4ai

Author	SHA1	Message	Date
UncleCode	7047422e48	Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74	2024-11-19 19:33:08 +08:00
UncleCode	2bdec1fa5a	chore: add manage-collab.sh to .gitignore	2024-11-19 19:33:04 +08:00
UncleCode	f2cb7d506d	Delete test3.txt	2024-11-19 19:12:14 +08:00
ntohidikplay	a6dad3fc6d	test: trying to push to 0.3.74	2024-11-19 12:09:33 +01:00
UncleCode	73658c758a	chore: update .gitignore to include manage-collab.sh	2024-11-19 16:10:43 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	152ac35bc2	feat(docs): update README for version 0.3.74 with new features and improvements fix(version): update version number to 0.3.74 refactor(async_webcrawler): enhance logging and add domain-based request delay	2024-11-17 21:09:26 +08:00
UncleCode	df63a40606	feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity	2024-11-17 19:44:45 +08:00
UncleCode	a59c107b23	Update changelog for 0.3.74	2024-11-17 18:42:43 +08:00
UncleCode	f9fe6f89fe	feat(database): implement version management and migration checks during initialization	2024-11-17 18:09:33 +08:00
UncleCode	2a82455b3d	feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control	2024-11-17 17:17:34 +08:00
UncleCode	3a524a3bdd	fix(docs): remove unnecessary blank line in README for improved readability	2024-11-17 16:00:39 +08:00
UncleCode	3a66aa8a60	feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior chore(requirements): add colorama dependency refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code fix(docs): update example scripts for clarity and consistency	2024-11-17 15:30:56 +08:00
UncleCode	4b45b28f25	feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples	2024-11-16 18:44:47 +08:00
UncleCode	9139ef3125	feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security	2024-11-16 18:19:44 +08:00
UncleCode	6360d0545a	feat(api): add API token authentication and update Dockerfile description	2024-11-16 18:08:56 +08:00
UncleCode	1961adb530	refactor(docker): remove shared memory size configuration to streamline Dockerfile	2024-11-16 17:35:27 +08:00
UncleCode	79feab89c4	refactor(deploy): remove memory utilization alert configuration from deployment template	2024-11-16 17:28:42 +08:00
UncleCode	5d0b13294c	feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds	2024-11-16 17:25:07 +08:00
UncleCode	67edc2d641	feat(deploy): update instance size to professional-xs and add memory utilization alert parameters	2024-11-16 17:23:32 +08:00
UncleCode	6b569cceb5	feat(deploy): update branch to 0.3.74 and change instance size to basic-xs	2024-11-16 17:21:45 +08:00
UncleCode	6f2fe5954f	feat(deploy): update instance size to professional-xs and add memory utilization alert	2024-11-16 17:12:41 +08:00
UncleCode	fca1319b7d	feat(docker): add MkDocs installation and build step for documentation	2024-11-16 17:10:30 +08:00
UncleCode	f77f06a3bd	feat(deploy): add deployment configuration and templates for crawl4ai	2024-11-16 16:43:31 +08:00
UncleCode	e62c807295	feat(deploy): add Railway deployment configuration and setup instructions	2024-11-16 16:38:13 +08:00
UncleCode	90df6921b7	feat(crawl_sync): add synchronous crawl endpoint and corresponding test	2024-11-16 15:34:30 +08:00
UncleCode	5098442086	refactor: migrate versioning to __version__.py and remove deprecated _version.py	2024-11-16 15:30:24 +08:00
UncleCode	d0014c6793	New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.	2024-11-16 14:54:41 +08:00
UncleCode	ae7ebc0bd8	chore: update .gitignore and enhance changelog with major feature additions and examples	2024-11-15 20:16:13 +08:00
UncleCode	1f269f9834	test(content_filter): add comprehensive tests for BM25ContentFilter functionality	2024-11-15 18:11:11 +08:00
UncleCode	7f1ae5adcf	Update changelog	2024-11-14 22:51:51 +08:00
UncleCode	3d00fee6c2	- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crawl result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or supports a given user query and matches them. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files.	2024-11-14 22:50:59 +08:00
UncleCode	17913f5acf	feat(crawler): support local files and raw HTML input in AsyncWebCrawler	2024-11-13 20:00:29 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
UncleCode	38044d4afe	Merge pull request #255 from maheshpec/feature/configure-cache-directory feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-13 09:43:29 +01:00
UncleCode	61b93ebf36	Update change log	2024-11-13 15:38:30 +08:00
UncleCode	bf91adf3f8	fix: Resolve unexpected BrowserContext closure during crawl in Docker - Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers. - Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess. - Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes. - Improved error handling and resource cleanup for browser instances, particularly in Docker environments. Resolves Issue #256	2024-11-13 15:37:16 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	8c22396d8b	Merge pull request #234 from devatnull/patch-1 Fix typo: scrapper → scraper	2024-11-12 08:37:14 +01:00
UncleCode	b6d6631b12	Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth.	2024-11-12 12:10:58 +08:00
UncleCode	a098483cbb	Update Roadmap	2024-11-09 20:40:30 +08:00
UncleCode	f9a297e08d	Add Docker example script for testing Crawl4AI functionality	2024-11-08 19:39:05 +08:00
UncleCode	bcdd80911f	Remove some old files.	2024-11-08 19:08:58 +08:00
UncleCode	b120965b6a	Fixed issues with the Manage Browser, including its inability to connect to the user directory and inability to create new pages within the Manage Browser context; all issues are now resolved.	2024-11-07 20:15:03 +08:00
UncleCode	16f918621f	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-11-07 19:30:22 +08:00
UncleCode	f7574230a1	Update API server request object. text_docker file and Readme	2024-11-07 19:29:31 +08:00
devatnull	2879344d9c	Update README.md	2024-11-06 17:36:46 +03:00
UncleCode	9f5eef1f38	Refactored the `CustomHTML2Text` class in `content_scrapping_strategy.py` to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead.	2024-11-06 21:50:09 +08:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00

364 Commits