Commit Graph

412 Commits

Author SHA1 Message Date
Aravind Karnam
9530ded83a fixed the final scraper_quickstart.py example 2024-11-26 17:05:54 +05:30
Aravind Karnam
155c756238 <Future pending> issue fix was incorrect. Reverting 2024-11-26 17:04:04 +05:30
Aravind Karnam
a888c91790 Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop
- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
2024-11-26 14:05:02 +05:30
Aravind Karnam
a98d51a62c Remove the can_process_url check from _process_links since it's already being checked in process_url 2024-11-26 11:11:49 +05:30
Aravind Karnam
ee3001b1f7 fix: moved depth as a param to can_process_url and applying filter chain only when depth is not zero. This way
filter chain is skipped but other validations are in place even for start URL
2024-11-26 10:22:14 +05:30
Aravind Karnam
b13fd71040 chore: 1. Expose process_external_links as a param
2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary
2024-11-26 10:07:11 +05:30
Aravind Karnam
2226ef53c8 fix: Exempting the start_url from can_process_url 2024-11-23 14:59:14 +05:30
aravind
3d52b551f2 Merge pull request #8 from aravindkarnam/main
Pulling in 0.3.74
2024-11-23 13:57:36 +05:30
Aravind Karnam
f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. 2024-11-23 13:52:34 +05:30
Aravind Karnam
c1797037c0 Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11 2024-11-23 12:39:25 +05:30
UncleCode
0d0cef3438 feat: add enhanced markdown generation example with citations and file output 2024-11-22 20:14:58 +08:00
UncleCode
d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-22 19:56:56 +08:00
UncleCode
a5decaa7cf Merge branch '0.3.74' 2024-11-22 19:55:52 +08:00
UncleCode
8dea3f470f chore: update README to include new features and improvements for version 0.3.74 2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute 2024-11-22 18:47:17 +08:00
UncleCode
571dda6549 Update Redme 2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a feat: enhance image processing capabilities
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0 In this commit, we introduce the new concept of MakrdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. 2024-11-21 18:21:43 +08:00
程序员阿江(Relakkes)
3439f7886d fix: crawler strategy exception handling and fixes (#271) 2024-11-20 20:30:25 +08:00
Darwing Medina
d418a04602 Fix #260 prevent pass duplicated kwargs to scrapping_strategy (#269)
Thank you for the suggestions. It totally makes sense now. Change to pop operator.
2024-11-20 18:52:11 +08:00
UncleCode
7047422e48 Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74 2024-11-19 19:33:08 +08:00
UncleCode
2bdec1fa5a chore: add manage-collab.sh to .gitignore 2024-11-19 19:33:04 +08:00
UncleCode
b654c49e55 Update .gitignore to exclude additional scripts and files 2024-11-19 19:32:06 +08:00
UncleCode
f2cb7d506d Delete test3.txt 2024-11-19 19:12:14 +08:00
ntohidikplay
a6dad3fc6d test: trying to push to 0.3.74 2024-11-19 12:09:33 +01:00
UncleCode
fbcff85ecb Remove test files 2024-11-19 19:03:23 +08:00
UncleCode
788c67c29a Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-19 19:02:44 +08:00
UncleCode
2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md 2024-11-19 19:02:41 +08:00
ntohidikplay
3aae30ed2a test1: trying to push to main 2024-11-19 11:57:07 +01:00
ntohidikplay
593c7ad307 test: trying to push to main 2024-11-19 11:45:26 +01:00
UncleCode
73658c758a chore: update .gitignore to include manage-collab.sh 2024-11-19 16:10:43 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00
UncleCode
152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00
UncleCode
df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity 2024-11-17 19:44:45 +08:00
UncleCode
a59c107b23 Update changelog for 0.3.74 2024-11-17 18:42:43 +08:00
UncleCode
f9fe6f89fe feat(database): implement version management and migration checks during initialization 2024-11-17 18:09:33 +08:00
UncleCode
2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control 2024-11-17 17:17:34 +08:00
UncleCode
3a524a3bdd fix(docs): remove unnecessary blank line in README for improved readability 2024-11-17 16:00:39 +08:00
UncleCode
3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples 2024-11-16 18:44:47 +08:00
UncleCode
9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security 2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a feat(api): add API token authentication and update Dockerfile description 2024-11-16 18:08:56 +08:00
UncleCode
1961adb530 refactor(docker): remove shared memory size configuration to streamline Dockerfile 2024-11-16 17:35:27 +08:00
UncleCode
79feab89c4 refactor(deploy): remove memory utilization alert configuration from deployment template 2024-11-16 17:28:42 +08:00
UncleCode
5d0b13294c feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds 2024-11-16 17:25:07 +08:00
UncleCode
67edc2d641 feat(deploy): update instance size to professional-xs and add memory utilization alert parameters 2024-11-16 17:23:32 +08:00
UncleCode
6b569cceb5 feat(deploy): update branch to 0.3.74 and change instance size to basic-xs 2024-11-16 17:21:45 +08:00
UncleCode
6f2fe5954f feat(deploy): update instance size to professional-xs and add memory utilization alert 2024-11-16 17:12:41 +08:00