ce7d49484f
docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments
UncleCode
2024-11-28 13:06:46 +08:00
e4acd18429
docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments
UncleCode
2024-11-28 13:06:30 +08:00
76bea6c577
Merge branch 'main' into 0.3.743
UncleCode
2024-11-28 12:53:30 +08:00
3ff0b0b2c4
feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments
UncleCode
2024-11-28 12:48:07 +08:00
24723b2f10
Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes
UncleCode
2024-11-28 12:45:05 +08:00
f998e9e949
Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
Hamza Farhan
2024-11-27 16:20:54 +05:00
c6a022132b
docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug
UncleCode
2024-11-27 14:55:56 +08:00
2f5e0598bb
updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url
Aravind Karnam
2024-11-26 18:26:57 +05:30
ff731e4ea1
fixed the final scraper_quickstart.py example
Aravind Karnam
2024-11-26 17:08:32 +05:30
9530ded83a
fixed the final scraper_quickstart.py example
Aravind Karnam
2024-11-26 17:05:54 +05:30
a888c91790
Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop
Aravind Karnam
2024-11-26 14:05:02 +05:30
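The "Future attached to a different loop" fix above addresses a common asyncio pitfall: a loop-bound primitive (queue, future, task) created outside the loop that later awaits it. A minimal sketch of the pattern, with illustrative names that are not from the crawl4ai codebase:

```python
import asyncio

class Dispatcher:
    """Create loop-bound primitives lazily, inside the running loop."""

    def __init__(self):
        # Deliberately NOT created here: __init__ may run before the loop
        # starts, or under a different loop entirely.
        self._queue = None

    async def run(self):
        # Creating the queue inside the coroutine binds it to the currently
        # running loop, avoiding "Future attached to a different loop"
        # when it is awaited later.
        self._queue = asyncio.Queue()
        await self._queue.put("task-1")
        return await self._queue.get()

print(asyncio.run(Dispatcher().run()))  # task-1
```

The same rule applies to `asyncio.create_task`: call it only from code already running inside the target loop.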
a98d51a62c
Remove the can_process_url check from _process_links since it's already being checked in process_url
Aravind Karnam
2024-11-26 11:11:49 +05:30
ee3001b1f7
fix: moved depth into a param of can_process_url and apply the filter chain only when depth is not zero. This way the filter chain is skipped for the start URL while the other validations remain in place.
Aravind Karnam
2024-11-26 10:22:14 +05:30
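The depth-gated filtering described in this commit can be sketched as follows; the signature and filter-chain shape are illustrative assumptions, not the library's actual code:

```python
from urllib.parse import urlparse

def can_process_url(url: str, depth: int, filter_chain=()) -> bool:
    """Validate a URL; apply the filter chain only beyond the start URL.

    Hypothetical helper mirroring the commit description: structural
    validations always run, but user filters are skipped when depth == 0
    so the start URL can never be filtered out.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False  # basic validation applies even to the start URL
    if depth == 0:
        return True   # start URL: skip the filter chain
    return all(f(url) for f in filter_chain)

# The start URL passes even though the filter would reject it:
block_docs = lambda u: "/docs/" not in u
print(can_process_url("https://example.com/docs/", depth=0, filter_chain=[block_docs]))  # True
print(can_process_url("https://example.com/docs/", depth=1, filter_chain=[block_docs]))  # False
```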
b13fd71040
chore: 1. Expose process_external_links as a param 2. Remove a few unused imports 3. Remove the separate URL normalisation for external links, as it won't be necessary
Aravind Karnam
2024-11-26 10:07:11 +05:30
195c0ccf8a
chore: remove deprecated Docker Compose configurations for crawl4ai service
unclecode
2024-11-24 19:40:27 +08:00
b09a86c0c1
chore: remove deprecated Docker Compose configurations for crawl4ai service
unclecode
2024-11-24 19:40:10 +08:00
de43505ae4
feat: update version to 0.3.742
0.3.742
unclecode
2024-11-24 19:36:30 +08:00
d7c5b900b8
feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose
unclecode
2024-11-24 19:35:53 +08:00
edad7b6a74
chore: remove Railway deployment configuration and related documentation
unclecode
2024-11-24 18:48:39 +08:00
829a1f7992
feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixes the issue where the HTML passed to the BM25 content filter contains no HTML elements.
UncleCode
2024-11-23 19:45:41 +08:00
d729aa7d5e
refactor: Add group ID for images extracted from srcset.
UncleCode
2024-11-23 18:00:32 +08:00
2226ef53c8
fix: Exempting the start_url from can_process_url
Aravind Karnam
2024-11-23 14:59:14 +05:30
f8e85b1499
Fixed a bug in _process_links, handled the case where url_scorer is passed as None, and renamed the scrapper folder to scraper.
Aravind Karnam
2024-11-23 13:52:34 +05:30
c1797037c0
Fixed a few bugs and import errors, and switched to asyncio.wait_for instead of asyncio.timeout to support Python versions < 3.11
Aravind Karnam
2024-11-23 12:39:25 +05:30
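The compatibility change above exists because `asyncio.timeout()` is a 3.11+ context manager, while `asyncio.wait_for()` works on 3.8+. A minimal sketch of the backward-compatible form (the helper name is illustrative):

```python
import asyncio

async def fetch_with_timeout(coro, seconds: float):
    """Timeout helper compatible with Python < 3.11.

    asyncio.wait_for() cancels the awaited coroutine and raises
    TimeoutError when the deadline passes; here we map that to None.
    """
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None

async def slow():
    await asyncio.sleep(10)
    return "done"

print(asyncio.run(fetch_with_timeout(slow(), 0.01)))  # None
```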
0d0cef3438
feat: add enhanced markdown generation example with citations and file output
UncleCode
2024-11-22 20:14:58 +08:00
006bee4a5a
feat: enhance image processing with srcset support and validation checks for better image selection
UncleCode
2024-11-22 16:00:17 +08:00
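The srcset support mentioned above boils down to parsing an `img` element's candidate list and picking the best resource. A simplified sketch of that idea (not the library's actual code): each candidate is `URL [width]w` or `URL [density]x`, and we score by the numeric descriptor, falling back to 1 when it is absent.

```python
import re

def best_from_srcset(srcset: str) -> str:
    """Pick the highest-resolution candidate from an img srcset attribute."""
    best_url, best_score = "", 0.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url, score = parts[0], 1.0  # bare URL implies density 1x
        if len(parts) > 1:
            m = re.match(r"([\d.]+)[wx]$", parts[1])
            if m:
                score = float(m.group(1))
        if score > best_score:
            best_url, best_score = url, score
    return best_url

print(best_from_srcset("a.jpg 480w, b.jpg 1024w, c.jpg 800w"))  # b.jpg
```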
dbb751c8f0
In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we did before. We have a new algorithm for fit markdown based on BM25, and we now add the ability to refine markdown into a citation form. Links are extracted and replaced by citation reference numbers, and a references section at the very end lists all the links with their descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and attach the list of references as a separate file for a large language model. This commit contains changes in this direction.
UncleCode
2024-11-21 18:21:43 +08:00
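The citation refinement described in this commit can be sketched as a single pass that swaps inline links for reference numbers and accumulates a references section; this is a minimal illustration, not the library's MarkdownGenerationStrategy implementation:

```python
import re

def links_to_citations(markdown: str):
    """Replace inline markdown links with citation numbers and collect
    a references section listing each unique URL once."""
    refs = {}  # url -> citation number, in order of first appearance

    def repl(m):
        text, url = m.group(1), m.group(2)
        if url not in refs:
            refs[url] = len(refs) + 1
        return f"{text} [{refs[url]}]"

    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    references = "\n".join(f"[{n}] {url}" for url, n in refs.items())
    return body, references

body, references = links_to_citations(
    "See [docs](https://example.com/docs) and [home](https://example.com)."
)
print(body)        # See docs [1] and home [2].
print(references)  # [1] https://example.com/docs  /  [2] https://example.com
```

Dropping the references section (or shipping it as a separate file) is what shrinks the markdown for LLM consumption.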
852729ff38
feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
UncleCode
2024-11-18 21:00:06 +08:00
152ac35bc2
feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
UncleCode
2024-11-17 21:09:26 +08:00
df63a40606
feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity
UncleCode
2024-11-17 19:44:45 +08:00
a59c107b23
Update changelog for 0.3.74
UncleCode
2024-11-17 18:42:43 +08:00
f9fe6f89fe
feat(database): implement version management and migration checks during initialization
UncleCode
2024-11-17 18:09:33 +08:00
2a82455b3d
feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control
UncleCode
2024-11-17 17:17:34 +08:00
3a524a3bdd
fix(docs): remove unnecessary blank line in README for improved readability
UncleCode
2024-11-17 16:00:39 +08:00
3a66aa8a60
feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
UncleCode
2024-11-17 15:30:56 +08:00
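The CacheMode introduced here replaces boolean flags like `bypass_cache` with an explicit enum. A sketch of the idea follows; the member names and read/write semantics are this illustration's assumptions, not necessarily the library's exact API:

```python
from enum import Enum

class CacheMode(Enum):
    """Explicit cache control replacing scattered boolean flags."""
    ENABLED = "enabled"        # read and write the cache
    DISABLED = "disabled"      # no cache interaction at all
    READ_ONLY = "read_only"    # serve cached results, never store new ones
    WRITE_ONLY = "write_only"  # store results, never serve from cache
    BYPASS = "bypass"          # skip reads for this call, still write

def should_read(mode: CacheMode) -> bool:
    return mode in (CacheMode.ENABLED, CacheMode.READ_ONLY)

def should_write(mode: CacheMode) -> bool:
    return mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY, CacheMode.BYPASS)

print(should_read(CacheMode.BYPASS), should_write(CacheMode.BYPASS))  # False True
```

An enum makes invalid flag combinations unrepresentable, which is the usual motivation for this kind of migration.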
4b45b28f25
feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples
UncleCode
2024-11-16 18:44:47 +08:00
9139ef3125
feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security
UncleCode
2024-11-16 18:19:44 +08:00
6360d0545a
feat(api): add API token authentication and update Dockerfile description
UncleCode
2024-11-16 18:08:56 +08:00
90df6921b7
feat(crawl_sync): add synchronous crawl endpoint and corresponding test
UncleCode
2024-11-16 15:34:30 +08:00
5098442086
refactor: migrate versioning to __version__.py and remove deprecated _version.py
UncleCode
2024-11-16 15:30:24 +08:00
d0014c6793
New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.
UncleCode
2024-11-16 14:54:41 +08:00
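An async database manager over sqlite3 typically offloads the blocking driver calls so the event loop stays responsive. A minimal sketch of that direction, with an illustrative class that is not crawl4ai's actual AsyncDatabaseManager:

```python
import asyncio
import sqlite3

class AsyncDatabaseManager:
    """Tiny async wrapper over sqlite3 using a thread offload."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False because asyncio.to_thread may use
        # different worker threads across calls.
        self.conn = sqlite3.connect(path, check_same_thread=False)

    async def execute(self, sql, params=()):
        # Run the blocking sqlite3 call in a worker thread so the
        # event loop is free to serve other crawl tasks.
        def run():
            cur = self.conn.execute(sql, params)
            self.conn.commit()
            return cur.fetchall()
        return await asyncio.to_thread(run)

async def main():
    db = AsyncDatabaseManager()
    await db.execute("CREATE TABLE cache (url TEXT PRIMARY KEY, html TEXT)")
    await db.execute("INSERT INTO cache VALUES (?, ?)", ("https://example.com", "<html/>"))
    return await db.execute("SELECT html FROM cache WHERE url = ?", ("https://example.com",))

print(asyncio.run(main()))  # [('<html/>',)]
```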
3d00fee6c2
- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files is also added to the crawl result object.
- This commit also introduces the concept of the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to process. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, description, and keywords, or to a given user query, and matches them. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well.
- The cache database was updated to hold information about response headers and downloaded files.
UncleCode
2024-11-14 22:50:59 +08:00
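The BM25-based relevance filtering described above can be sketched compactly: score each page chunk against the page's title/query and keep only chunks that score above a threshold. This is an illustration of the idea, not the library's BM25ContentFilter:

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_filter(chunks, query, k1=1.5, b=0.75, threshold=0.0):
    """Keep chunks whose BM25 score against the query exceeds threshold."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = {}  # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    kept = []
    for chunk, d in zip(chunks, docs):
        score = 0.0
        for term in tokenize(query):
            f = d.count(term)
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        if score > threshold:
            kept.append(chunk)
    return kept

chunks = [
    "Subscribe to our newsletter for updates",
    "Asynchronous web crawling with Python and Playwright",
    "Cookie policy and privacy settings",
]
print(bm25_filter(chunks, "python web crawling"))
```

Boilerplate chunks score zero against the query and are dropped, leaving only the content-bearing chunk for markdown conversion.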
17913f5acf
feat(crawler): support local files and raw HTML input in AsyncWebCrawler
UncleCode
2024-11-13 20:00:29 +08:00
3a2cb7dacf
test: Add comprehensive unit tests for AsyncExecutor functionality
0.3.75
UncleCode
2024-11-13 19:46:05 +08:00
c38ac29edb
perf(crawler): major performance improvements & raw HTML support
UncleCode
2024-11-13 19:40:40 +08:00
b6d6631b12
Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth.
UncleCode
2024-11-12 12:10:58 +08:00
b120965b6a
Fixed issues with ManagedBrowser, including its inability to connect to the user directory and to create new pages within the ManagedBrowser context; all issues are now resolved.
UncleCode
2024-11-07 20:15:03 +08:00