364 Commits

Author SHA1 Message Date
UncleCode
7047422e48 Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74 2024-11-19 19:33:08 +08:00
UncleCode
2bdec1fa5a chore: add manage-collab.sh to .gitignore 2024-11-19 19:33:04 +08:00
UncleCode
f2cb7d506d Delete test3.txt 2024-11-19 19:12:14 +08:00
ntohidikplay
a6dad3fc6d test: trying to push to 0.3.74 2024-11-19 12:09:33 +01:00
UncleCode
73658c758a chore: update .gitignore to include manage-collab.sh 2024-11-19 16:10:43 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00
UncleCode
152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00
UncleCode
df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity 2024-11-17 19:44:45 +08:00
UncleCode
a59c107b23 Update changelog for 0.3.74 2024-11-17 18:42:43 +08:00
UncleCode
f9fe6f89fe feat(database): implement version management and migration checks during initialization 2024-11-17 18:09:33 +08:00
UncleCode
2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control 2024-11-17 17:17:34 +08:00
UncleCode
3a524a3bdd fix(docs): remove unnecessary blank line in README for improved readability 2024-11-17 16:00:39 +08:00
UncleCode
3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples 2024-11-16 18:44:47 +08:00
UncleCode
9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security 2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a feat(api): add API token authentication and update Dockerfile description 2024-11-16 18:08:56 +08:00
UncleCode
1961adb530 refactor(docker): remove shared memory size configuration to streamline Dockerfile 2024-11-16 17:35:27 +08:00
UncleCode
79feab89c4 refactor(deploy): remove memory utilization alert configuration from deployment template 2024-11-16 17:28:42 +08:00
UncleCode
5d0b13294c feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds 2024-11-16 17:25:07 +08:00
UncleCode
67edc2d641 feat(deploy): update instance size to professional-xs and add memory utilization alert parameters 2024-11-16 17:23:32 +08:00
UncleCode
6b569cceb5 feat(deploy): update branch to 0.3.74 and change instance size to basic-xs 2024-11-16 17:21:45 +08:00
UncleCode
6f2fe5954f feat(deploy): update instance size to professional-xs and add memory utilization alert 2024-11-16 17:12:41 +08:00
UncleCode
fca1319b7d feat(docker): add MkDocs installation and build step for documentation 2024-11-16 17:10:30 +08:00
UncleCode
f77f06a3bd feat(deploy): add deployment configuration and templates for crawl4ai 2024-11-16 16:43:31 +08:00
UncleCode
e62c807295 feat(deploy): add Railway deployment configuration and setup instructions 2024-11-16 16:38:13 +08:00
UncleCode
90df6921b7 feat(crawl_sync): add synchronous crawl endpoint and corresponding test 2024-11-16 15:34:30 +08:00
UncleCode
5098442086 refactor: migrate versioning to __version__.py and remove deprecated _version.py 2024-11-16 15:30:24 +08:00
UncleCode
d0014c6793 New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples 2024-11-15 20:16:13 +08:00
UncleCode
1f269f9834 test(content_filter): add comprehensive tests for BM25ContentFilter functionality 2024-11-15 18:11:11 +08:00
UncleCode
7f1ae5adcf Update changelog 2024-11-14 22:51:51 +08:00
UncleCode
3d00fee6c2 - This commit adds support for file downloads. Users can specify a download folder and trigger downloads via JavaScript or other means; all files are saved, and the list of downloaded files is added to the crawl result object.
- It also introduces the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies extracts the main content of a page, the part that is actually worth processing. The first strategy uses the BM25 algorithm to find chunks of page text relevant to the page's title, description, and keywords, or to a user-supplied query. The result is returned to the main engine for conversion to Markdown. Approaches based on language models are planned.
- The cache database now stores information about response headers and downloaded files.
2024-11-14 22:50:59 +08:00
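The BM25-based filtering described in the commit above can be illustrated with a minimal, standard-library sketch. This is not the library's BM25ContentFilter; the function names (`tokenize`, `bm25_scores`, `filter_relevant`), the parameters `k1` and `b`, and the keep-if-positive rule are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query.

    Hypothetical helper: k1 and b are the usual BM25 tuning parameters,
    set here to commonly used defaults.
    """
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n if n else 0.0

    # Document frequency: in how many chunks each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_relevant(chunks, query, threshold=0.0):
    """Keep the chunks whose BM25 score exceeds the threshold."""
    scores = bm25_scores(chunks, query)
    return [chunk for chunk, score in zip(chunks, scores) if score > threshold]


if __name__ == "__main__":
    page_chunks = [
        "Crawl4AI is an async web crawler for LLM pipelines",
        "Subscribe to our newsletter and follow us",
        "The crawler caches pages and exports Markdown",
    ]
    # The query could equally be built from the page title and keywords.
    for chunk in filter_relevant(page_chunks, "async web crawler"):
        print(chunk)
```

In this sketch the query stands in for either the page metadata or a user query; chunks that share no terms with it score zero and are dropped before any Markdown conversion.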
UncleCode
17913f5acf feat(crawler): support local files and raw HTML input in AsyncWebCrawler 2024-11-13 20:00:29 +08:00
UncleCode
c38ac29edb perf(crawler): major performance improvements & raw HTML support
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns

Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
UncleCode
38044d4afe Merge pull request #255 from maheshpec/feature/configure-cache-directory
feat(config): Adding a configurable way of setting the cache directory for constrained environments
2024-11-13 09:43:29 +01:00
UncleCode
61b93ebf36 Update change log 2024-11-13 15:38:30 +08:00
UncleCode
bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.

Resolves Issue #256
2024-11-13 15:37:16 +08:00
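The move from `__del__` to explicit context managers mentioned in the commit above can be sketched as follows. This is only an illustration of the general pattern, not the code in AsyncPlaywrightCrawlerStrategy: `FakeBrowser` and `BrowserSession` are hypothetical stand-ins, and no Playwright API is used.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a browser handle (not a Playwright class)."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class BrowserSession:
    """Async context manager that owns the browser's lifetime.

    Cleanup runs deterministically in __aexit__, even when the body raises,
    instead of relying on __del__, whose timing depends on garbage collection.
    """

    async def __aenter__(self):
        self.browser = FakeBrowser()
        return self.browser

    async def __aexit__(self, exc_type, exc, tb):
        await self.browser.close()
        return False  # do not swallow exceptions from the body


async def crawl_once():
    async with BrowserSession() as browser:
        # Placeholder for real crawling work.
        assert not browser.closed
    return browser


if __name__ == "__main__":
    browser = asyncio.run(crawl_once())
    print("closed after exit:", browser.closed)
```

The design point is that the `async with` block makes the close step part of normal control flow, so it can be awaited while the event loop is still running.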
Mahesh
00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments 2024-11-12 14:52:51 -07:00
UncleCode
8c22396d8b Merge pull request #234 from devatnull/patch-1
Fix typo: scrapper → scraper
2024-11-12 08:37:14 +01:00
UncleCode
b6d6631b12 Enhance Async Crawler with Playwright support
- Implemented new async crawler strategy using Playwright.
- Introduced ManagedBrowser for better browser management.
- Added support for persistent browser sessions and improved error handling.
- Updated version from 0.3.73 to 0.3.731.
- Enhanced logic in main.py for conditional mounting of static files.
- Updated requirements to replace playwright_stealth with tf-playwright-stealth.
2024-11-12 12:10:58 +08:00
UncleCode
a098483cbb Update Roadmap 2024-11-09 20:40:30 +08:00
UncleCode
f9a297e08d Add Docker example script for testing Crawl4AI functionality 2024-11-08 19:39:05 +08:00
UncleCode
bcdd80911f Remove some old files. 2024-11-08 19:08:58 +08:00
UncleCode
b120965b6a Fixed issues with ManagedBrowser, including its inability to connect to the user directory and to create new pages within the ManagedBrowser context. 2024-11-07 20:15:03 +08:00
UncleCode
16f918621f Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-07 19:30:22 +08:00
UncleCode
f7574230a1 Update API server request object, text_docker file, and README 2024-11-07 19:29:31 +08:00
devatnull
2879344d9c Update README.md 2024-11-06 17:36:46 +03:00
UncleCode
9f5eef1f38 Refactored the CustomHTML2Text class in content_scrapping_strategy.py to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead. 2024-11-06 21:50:09 +08:00
UncleCode
c5aa1bec18 Merge pull request #229 from bizrockman/main
Preventing NoneType has no attribute get Errors
2024-11-06 07:31:07 +01:00