Commit Graph

  • ce7d49484f docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments UncleCode 2024-11-28 13:06:46 +08:00
  • e4acd18429 docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments UncleCode 2024-11-28 13:06:30 +08:00
  • c2d4784810 fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation 0.3.743 UncleCode 2024-11-28 12:56:31 +08:00
  • 76bea6c577 Merge branch 'main' into 0.3.743 UncleCode 2024-11-28 12:53:30 +08:00
  • 3ff0b0b2c4 feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments UncleCode 2024-11-28 12:48:07 +08:00
  • a1c7dc17ce Merge branch 'next' of https://github.com/unclecode/crawl4ai into next UncleCode 2024-11-28 12:45:57 +08:00
  • 24723b2f10 Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes UncleCode 2024-11-28 12:45:05 +08:00
  • f998e9e949 Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293) Hamza Farhan 2024-11-27 16:20:54 +05:00
  • 73661f7d1f docs: enhance development installation instructions (#286) zhounan 2024-11-27 15:04:20 +08:00
  • b5d4db07d1 Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-27 14:55:58 +08:00
  • c6a022132b docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug UncleCode 2024-11-27 14:55:56 +08:00
  • 2f5e0598bb updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url Aravind Karnam 2024-11-26 18:26:57 +05:30
  • ff731e4ea1 fixed the final scraper_quickstart.py example Aravind Karnam 2024-11-26 17:08:32 +05:30
  • 9530ded83a fixed the final scraper_quickstart.py example Aravind Karnam 2024-11-26 17:05:54 +05:30
  • 155c756238 <Future pending> issue fix was incorrect. Reverting Aravind Karnam 2024-11-26 17:04:04 +05:30
  • a888c91790 Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop Aravind Karnam 2024-11-26 14:05:02 +05:30
  • a98d51a62c Remove the can_process_url check from _process_links since it's already being checked in process_url Aravind Karnam 2024-11-26 11:11:49 +05:30
  • ee3001b1f7 fix: moved depth as a param to can_process_url and applying filter chain only when depth is not zero. This way filter chain is skipped but other validations are in place even for start URL Aravind Karnam 2024-11-26 10:22:14 +05:30
  • b13fd71040 chore: 1. Expose process_external_links as a param 2. Removed a few unused imports 3. Removed URL normalisation for external links separately as that won't be necessary Aravind Karnam 2024-11-26 10:07:11 +05:30
  • 195c0ccf8a chore: remove deprecated Docker Compose configurations for crawl4ai service unclecode 2024-11-24 19:40:27 +08:00
  • b09a86c0c1 chore: remove deprecated Docker Compose configurations for crawl4ai service unclecode 2024-11-24 19:40:10 +08:00
  • de43505ae4 feat: update version to 0.3.742 0.3.742 unclecode 2024-11-24 19:36:30 +08:00
  • d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose unclecode 2024-11-24 19:35:53 +08:00
  • edad7b6a74 chore: remove Railway deployment configuration and related documentation unclecode 2024-11-24 18:48:39 +08:00
  • 829a1f7992 feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixes the case where the HTML passed to the BM25 content filter does not contain any HTML elements. UncleCode 2024-11-23 19:45:41 +08:00
  • d729aa7d5e refactor: Add group ID for images extracted from srcset. UncleCode 2024-11-23 18:00:32 +08:00
  • 2226ef53c8 fix: Exempting the start_url from can_process_url Aravind Karnam 2024-11-23 14:59:14 +05:30
  • 3d52b551f2 Merge pull request #8 from aravindkarnam/main aravind 2024-11-23 13:57:36 +05:30
  • f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. Aravind Karnam 2024-11-23 13:52:34 +05:30
  • c1797037c0 Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11 Aravind Karnam 2024-11-23 12:39:25 +05:30
  • 0d0cef3438 feat: add enhanced markdown generation example with citations and file output UncleCode 2024-11-22 20:14:58 +08:00
  • d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-22 19:56:56 +08:00
  • a5decaa7cf Merge branch '0.3.74' UncleCode 2024-11-22 19:55:52 +08:00
  • 8dea3f470f chore: update README to include new features and improvements for version 0.3.74 0.3.74 UncleCode 2024-11-22 18:50:12 +08:00
  • e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 UncleCode 2024-11-22 18:49:22 +08:00
  • 24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute UncleCode 2024-11-22 18:47:17 +08:00
  • 571dda6549 Update README UncleCode 2024-11-22 18:27:43 +08:00
  • 006bee4a5a feat: enhance image processing capabilities - Enhanced image processing with srcset support and validation checks for better image selection. UncleCode 2024-11-22 16:00:17 +08:00
  • dbb751c8f0 In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Links are extracted and replaced by a citation reference number, and a reference section at the very end lists all the links with their descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. UncleCode 2024-11-21 18:21:43 +08:00
  • 3439f7886d fix: crawler strategy exception handling and fixes (#271) 程序员阿江(Relakkes) 2024-11-20 20:30:25 +08:00
  • d418a04602 Fix #260 prevent passing duplicated kwargs to scrapping_strategy (#269) Darwing Medina 2024-11-20 04:52:11 -06:00
  • 8179cae765 feat: adding test file to my branch feature/content-filter-nasrin-1 feature/content-filter ntohidikplay 2024-11-19 13:23:25 +01:00
  • fde35f644d feat: adding test file to my branch ntohidikplay 2024-11-19 13:02:52 +01:00
  • 7047422e48 Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74 UncleCode 2024-11-19 19:33:08 +08:00
  • 2bdec1fa5a chore: add manage-collab.sh to .gitignore UncleCode 2024-11-19 19:33:04 +08:00
  • b654c49e55 Update .gitignore to exclude additional scripts and files UncleCode 2024-11-19 19:32:06 +08:00
  • f2cb7d506d Delete test3.txt UncleCode 2024-11-19 19:12:14 +08:00
  • a6dad3fc6d test: trying to push to 0.3.74 ntohidikplay 2024-11-19 12:09:33 +01:00
  • fbcff85ecb Remove test files UncleCode 2024-11-19 19:03:23 +08:00
  • 788c67c29a Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-19 19:02:44 +08:00
  • 2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md UncleCode 2024-11-19 19:02:41 +08:00
  • 3aae30ed2a test1: trying to push to main ntohidikplay 2024-11-19 11:57:07 +01:00
  • 5eeb682719 Delete test.txt unclecode-patch-1 UncleCode 2024-11-19 18:55:11 +08:00
  • 593c7ad307 test: trying to push to main ntohidikplay 2024-11-19 11:45:26 +01:00
  • 73658c758a chore: update .gitignore to include manage-collab.sh UncleCode 2024-11-19 16:10:43 +08:00
  • b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 UncleCode 2024-11-18 21:15:04 +08:00
  • 852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose UncleCode 2024-11-18 21:00:06 +08:00
  • 152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements fix(version): update version number to 0.3.74 refactor(async_webcrawler): enhance logging and add domain-based request delay UncleCode 2024-11-17 21:09:26 +08:00
  • df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity UncleCode 2024-11-17 19:44:45 +08:00
  • a59c107b23 Update changelog for 0.3.74 UncleCode 2024-11-17 18:42:43 +08:00
  • f9fe6f89fe feat(database): implement version management and migration checks during initialization UncleCode 2024-11-17 18:09:33 +08:00
  • 2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control UncleCode 2024-11-17 17:17:34 +08:00
  • 3a524a3bdd fix(docs): remove unnecessary blank line in README for improved readability UncleCode 2024-11-17 16:00:39 +08:00
  • 3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior chore(requirements): add colorama dependency refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code fix(docs): update example scripts for clarity and consistency UncleCode 2024-11-17 15:30:56 +08:00
  • 4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples UncleCode 2024-11-16 18:44:47 +08:00
  • 9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security UncleCode 2024-11-16 18:19:44 +08:00
  • 6360d0545a feat(api): add API token authentication and update Dockerfile description UncleCode 2024-11-16 18:08:56 +08:00
  • 1961adb530 refactor(docker): remove shared memory size configuration to streamline Dockerfile UncleCode 2024-11-16 17:35:27 +08:00
  • 79feab89c4 refactor(deploy): remove memory utilization alert configuration from deployment template UncleCode 2024-11-16 17:28:42 +08:00
  • 5d0b13294c feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds UncleCode 2024-11-16 17:25:07 +08:00
  • 67edc2d641 feat(deploy): update instance size to professional-xs and add memory utilization alert parameters UncleCode 2024-11-16 17:23:32 +08:00
  • 6b569cceb5 feat(deploy): update branch to 0.3.74 and change instance size to basic-xs UncleCode 2024-11-16 17:21:45 +08:00
  • 6f2fe5954f feat(deploy): update instance size to professional-xs and add memory utilization alert UncleCode 2024-11-16 17:12:41 +08:00
  • fca1319b7d feat(docker): add MkDocs installation and build step for documentation UncleCode 2024-11-16 17:10:30 +08:00
  • f77f06a3bd feat(deploy): add deployment configuration and templates for crawl4ai UncleCode 2024-11-16 16:43:31 +08:00
  • e62c807295 feat(deploy): add Railway deployment configuration and setup instructions UncleCode 2024-11-16 16:38:13 +08:00
  • 90df6921b7 feat(crawl_sync): add synchronous crawl endpoint and corresponding test UncleCode 2024-11-16 15:34:30 +08:00
  • 5098442086 refactor: migrate versioning to __version__.py and remove deprecated _version.py UncleCode 2024-11-16 15:30:24 +08:00
  • d0014c6793 New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing. UncleCode 2024-11-16 14:54:41 +08:00
  • 60670b2af6 Merge pull request #7 from aravindkarnam/main aravind 2024-11-15 20:43:54 +05:30
  • ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples UncleCode 2024-11-15 20:16:13 +08:00
  • 1f269f9834 test(content_filter): add comprehensive tests for BM25ContentFilter functionality UncleCode 2024-11-15 18:11:11 +08:00
  • 7f1ae5adcf Update changelog UncleCode 2024-11-14 22:51:51 +08:00
  • 3d00fee6c2 - In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crawl result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or supports a given user query and matches them. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files. UncleCode 2024-11-14 22:50:59 +08:00
  • 17913f5acf feat(crawler): support local files and raw HTML input in AsyncWebCrawler UncleCode 2024-11-13 20:00:29 +08:00
  • 3a2cb7dacf test: Add comprehensive unit tests for AsyncExecutor functionality 0.3.75 UncleCode 2024-11-13 19:46:05 +08:00
  • c38ac29edb perf(crawler): major performance improvements & raw HTML support UncleCode 2024-11-13 19:40:40 +08:00
  • 38044d4afe Merge pull request #255 from maheshpec/feature/configure-cache-directory UncleCode 2024-11-13 09:43:29 +01:00
  • 61b93ebf36 Update change log UncleCode 2024-11-13 15:38:30 +08:00
  • bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker UncleCode 2024-11-13 15:37:16 +08:00
  • 00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments Mahesh 2024-11-12 14:52:51 -07:00
  • 8c22396d8b Merge pull request #234 from devatnull/patch-1 UncleCode 2024-11-12 08:37:14 +01:00
  • b6d6631b12 Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth. UncleCode 2024-11-12 12:10:58 +08:00
  • a098483cbb Update Roadmap UncleCode 2024-11-09 20:40:30 +08:00
  • f9a297e08d Add Docker example script for testing Crawl4AI functionality UncleCode 2024-11-08 19:39:05 +08:00
  • bcdd80911f Remove some old files. UncleCode 2024-11-08 19:08:58 +08:00
  • 0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems scraper-uc UncleCode 2024-11-08 19:02:28 +08:00
  • bae4665949 feat(scraper): Enhance URL filtering and scoring systems UncleCode 2024-11-08 18:45:12 +08:00
  • d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration UncleCode 2024-11-08 15:57:23 +08:00
  • b120965b6a Fixed issues with the ManagedBrowser, including its inability to connect to the user directory and inability to create new pages within the ManagedBrowser context; all issues are now resolved. UncleCode 2024-11-07 20:15:03 +08:00
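
Commit dbb751c8f0 above describes turning inline links into numbered citations with a reference section at the end. The following is only a minimal sketch of that idea, not the project's actual implementation: the function name `to_citations`, the regular expression, and the reference line format are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches inline markdown links such as [text](url).
# The negative lookbehind skips image links of the form ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown: str) -> tuple[str, str]:
    """Replace inline links with numbered citations and build a reference list.

    Returns (body, references). A URL that appears more than once keeps the
    number it was given on first sight. (Illustrative helper, not a crawl4ai API.)
    """
    numbers: dict[str, int] = {}  # url -> citation number
    labels: dict[str, str] = {}   # url -> first link text seen for that url

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        if url not in numbers:
            numbers[url] = len(numbers) + 1
            labels[url] = text
        return f"{text}[{numbers[url]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(
        f"[{n}] {url}: {labels[url]}" for url, n in numbers.items()
    )
    return body, references


if __name__ == "__main__":
    sample = "See [docs](https://example.com/docs) and [repo](https://example.com/repo)."
    body, references = to_citations(sample)
    print(body)
    print(references)
```

Returning the body and the reference list separately mirrors the commit's note that the references can be dropped, or attached as a separate file, when the links are not needed.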