UncleCode
e02935dc5b
chore: update README to reflect new features and improvements in version 0.3.74
2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd
feat: enhance Markdown generation to include fit_html attribute
2024-11-22 18:47:17 +08:00
UncleCode
571dda6549
Update README
2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a
feat: enhance image processing capabilities
...
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0
This commit introduces MarkdownGenerationStrategy, an extension point for future strategies that generate better markdown. For now it still produces raw markdown as before, alongside the new BM25-based algorithm for fit markdown, and it adds the ability to refine markdown into citation form: each link is extracted and replaced by a citation reference number, and a references section at the very end lists every link with its description. This format suits large language models better. When links are not needed, dropping them shrinks the markdown significantly, and the reference list can be passed to the model as a separate file. This commit contains the changes in this direction.
2024-11-21 18:21:43 +08:00
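The citation form described in the commit above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name to_citations, the regular expression, and the "[n] url: description" reference layout are assumptions made for illustration only.

```python
import re
from typing import Dict, Tuple

# Matches inline markdown links [text](url) but skips images (![alt](src)).
# Hypothetical pattern; real-world markdown needs a more careful parser.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown: str) -> Tuple[str, str]:
    """Replace inline links with numbered citations.

    Returns (body, references). Each distinct URL gets one number,
    assigned in order of first appearance.
    """
    numbers: Dict[str, int] = {}  # url -> citation number
    labels: Dict[str, str] = {}   # url -> text of the first link to it

    def replace(match: "re.Match[str]") -> str:
        text, url = match.group(1), match.group(2)
        if url not in numbers:
            numbers[url] = len(numbers) + 1
            labels[url] = text
        return f"{text}[{numbers[url]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(
        f"[{n}] {url}: {labels[url]}" for url, n in numbers.items()
    )
    return body, references
```

Returning the body and the reference list separately mirrors the point made in the commit message: the references can be appended to the markdown or handed to a model as a separate file.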
程序员阿江(Relakkes)
3439f7886d
fix: crawler strategy exception handling and fixes (#271)
2024-11-20 20:30:25 +08:00
Darwing Medina
d418a04602
Fix #260: prevent passing duplicated kwargs to scrapping_strategy (#269)
...
Thank you for the suggestions; it makes sense now. Changed to use pop().
2024-11-20 18:52:11 +08:00
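The duplicated-kwargs problem fixed above can be reproduced in a few lines. The function names and the word_count_threshold parameter below are hypothetical stand-ins, not the library's actual signatures; the sketch only shows why reading a key with get() and then forwarding **kwargs fails, and why pop() avoids it.

```python
def scrap(url, html, word_count_threshold=10, **kwargs):
    # Stand-in for a scraping routine that takes one explicit keyword
    # plus arbitrary extra options.
    return {"url": url, "threshold": word_count_threshold, "extra": kwargs}


def process_buggy(url, html, **kwargs):
    # get() leaves the key inside kwargs, so the keyword is passed twice:
    # once explicitly and once through **kwargs -> TypeError.
    threshold = kwargs.get("word_count_threshold", 10)
    return scrap(url, html, word_count_threshold=threshold, **kwargs)


def process_fixed(url, html, **kwargs):
    # pop() reads the value and removes the key, so **kwargs no longer
    # carries a duplicate.
    threshold = kwargs.pop("word_count_threshold", 10)
    return scrap(url, html, word_count_threshold=threshold, **kwargs)
```

The buggy variant only fails when the caller actually supplies the keyword, which is why this kind of bug can go unnoticed until a specific option is used.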
UncleCode
7047422e48
Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74
2024-11-19 19:33:08 +08:00
UncleCode
2bdec1fa5a
chore: add manage-collab.sh to .gitignore
2024-11-19 19:33:04 +08:00
UncleCode
b654c49e55
Update .gitignore to exclude additional scripts and files
2024-11-19 19:32:06 +08:00
UncleCode
f2cb7d506d
Delete test3.txt
2024-11-19 19:12:14 +08:00
ntohidikplay
a6dad3fc6d
test: trying to push to 0.3.74
2024-11-19 12:09:33 +01:00
UncleCode
fbcff85ecb
Remove test files
2024-11-19 19:03:23 +08:00
UncleCode
788c67c29a
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-19 19:02:44 +08:00
UncleCode
2f19d38693
Update .gitignore to include .gitboss/ and todo_executor.md
2024-11-19 19:02:41 +08:00
ntohidikplay
3aae30ed2a
test1: trying to push to main
2024-11-19 11:57:07 +01:00
ntohidikplay
593c7ad307
test: trying to push to main
2024-11-19 11:45:26 +01:00
UncleCode
73658c758a
chore: update .gitignore to include manage-collab.sh
2024-11-19 16:10:43 +08:00
UncleCode
b6af94cbbb
Merge remote-tracking branch 'origin/main' into 0.3.74
2024-11-18 21:15:04 +08:00
UncleCode
852729ff38
feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
...
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00
UncleCode
152ac35bc2
feat(docs): update README for version 0.3.74 with new features and improvements
...
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00
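A domain-based request delay, as mentioned in the commit above, can be sketched as a small bookkeeping class. This is an illustration under assumptions, not the code in async_webcrawler: the class name DomainThrottle, the reserve method, and the fixed min_delay are all hypothetical.

```python
from urllib.parse import urlparse


class DomainThrottle:
    """Tracks, per domain, the earliest time the next request may start."""

    def __init__(self, min_delay: float = 1.0) -> None:
        self.min_delay = min_delay
        self._next_allowed = {}  # domain -> earliest allowed start time

    def reserve(self, url: str, now: float) -> float:
        """Reserve a slot for `url` and return how long the caller should wait.

        `now` is passed in explicitly (e.g. time.monotonic()) so the logic
        stays deterministic and easy to test.
        """
        domain = urlparse(url).netloc
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.min_delay
        return start - now
```

In async code the caller would typically do something like await asyncio.sleep(throttle.reserve(url, time.monotonic())) before issuing the request. Requests to different domains never delay each other.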
UncleCode
df63a40606
feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity
2024-11-17 19:44:45 +08:00
UncleCode
a59c107b23
Update changelog for 0.3.74
2024-11-17 18:42:43 +08:00
UncleCode
f9fe6f89fe
feat(database): implement version management and migration checks during initialization
2024-11-17 18:09:33 +08:00
UncleCode
2a82455b3d
feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control
2024-11-17 17:17:34 +08:00
UncleCode
3a524a3bdd
fix(docs): remove unnecessary blank line in README for improved readability
2024-11-17 16:00:39 +08:00
UncleCode
3a66aa8a60
feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
...
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
4b45b28f25
feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples
2024-11-16 18:44:47 +08:00
UncleCode
9139ef3125
feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security
2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a
feat(api): add API token authentication and update Dockerfile description
2024-11-16 18:08:56 +08:00
UncleCode
1961adb530
refactor(docker): remove shared memory size configuration to streamline Dockerfile
2024-11-16 17:35:27 +08:00
UncleCode
79feab89c4
refactor(deploy): remove memory utilization alert configuration from deployment template
2024-11-16 17:28:42 +08:00
UncleCode
5d0b13294c
feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds
2024-11-16 17:25:07 +08:00
UncleCode
67edc2d641
feat(deploy): update instance size to professional-xs and add memory utilization alert parameters
2024-11-16 17:23:32 +08:00
UncleCode
6b569cceb5
feat(deploy): update branch to 0.3.74 and change instance size to basic-xs
2024-11-16 17:21:45 +08:00
UncleCode
6f2fe5954f
feat(deploy): update instance size to professional-xs and add memory utilization alert
2024-11-16 17:12:41 +08:00
UncleCode
fca1319b7d
feat(docker): add MkDocs installation and build step for documentation
2024-11-16 17:10:30 +08:00
UncleCode
f77f06a3bd
feat(deploy): add deployment configuration and templates for crawl4ai
2024-11-16 16:43:31 +08:00
UncleCode
e62c807295
feat(deploy): add Railway deployment configuration and setup instructions
2024-11-16 16:38:13 +08:00
UncleCode
90df6921b7
feat(crawl_sync): add synchronous crawl endpoint and corresponding test
2024-11-16 15:34:30 +08:00
UncleCode
5098442086
refactor: migrate versioning to __version__.py and remove deprecated _version.py
2024-11-16 15:30:24 +08:00
UncleCode
d0014c6793
New async database manager and migration support
...
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
ae7ebc0bd8
chore: update .gitignore and enhance changelog with major feature additions and examples
2024-11-15 20:16:13 +08:00
UncleCode
1f269f9834
test(content_filter): add comprehensive tests for BM25ContentFilter functionality
2024-11-15 18:11:11 +08:00
UncleCode
7f1ae5adcf
Update changelog
2024-11-14 22:51:51 +08:00
UncleCode
3d00fee6c2
- This commit adds file download handling. Users can specify a download folder and trigger downloads via JavaScript or other means, and every downloaded file is saved. The list of downloaded files is also added to the crawl result object.
...
- This commit also introduces the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies aims to extract the main content of a page: the part that matters and is worth processing. The first strategy uses the BM25 algorithm to find chunks of text relevant to the page's title, description, and keywords, or to a user-supplied query. The result is returned to the main engine for conversion to Markdown. Approaches based on language models are planned as well.
- The cache database was updated to hold information about response headers and downloaded files.
2024-11-14 22:50:59 +08:00
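The BM25-based relevance filtering described in the commit above can be sketched with a plain Okapi BM25 scorer over text chunks. This is a simplified illustration, not the library's BM25ContentFilter: the function names, the tokenizer, and the k1 and b defaults are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Naive lowercase word tokenizer; a real filter might also stem terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks: List[str], query: str,
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The query here could be built from the page's title, description, and keywords, or supplied by the user; chunks scoring above some threshold would be kept and passed on for Markdown conversion.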
UncleCode
17913f5acf
feat(crawler): support local files and raw HTML input in AsyncWebCrawler
2024-11-13 20:00:29 +08:00
UncleCode
c38ac29edb
perf(crawler): major performance improvements & raw HTML support
...
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns
Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
UncleCode
38044d4afe
Merge pull request #255 from maheshpec/feature/configure-cache-directory
...
feat(config): Adding a configurable way of setting the cache directory for constrained environments
2024-11-13 09:43:29 +01:00
UncleCode
61b93ebf36
Update change log
2024-11-13 15:38:30 +08:00