Commit Graph

  • 7fe220dbd5 1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing 2. Introduced a dictionary for depth tracking across various tasks 3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object. Aravind Karnam 2024-10-03 11:17:11 +05:30
  • 65e013d9d1 Merge pull request #3 from aravindkarnam/main aravind 2024-10-03 09:52:12 +05:30
  • 4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities unclecode 2024-10-02 17:34:56 +08:00
  • e0e0db4247 Bump version to 0.3.4 0.3.4 unclecode 2024-09-29 17:07:52 +08:00
  • bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range unclecode 2024-09-29 17:07:06 +08:00
  • 0759503e50 Extend numpy version range to support Python 3.9 v0.3.3 unclecode 2024-09-29 00:08:02 +08:00
  • 7f1c020746 Update README to add link to previous version in branch V0.2.76 unclecode 2024-09-28 00:31:53 +08:00
  • 7afa11a02f Update .gitignore to include test_env/ and tmp/ directories v0.2.76 unclecode 2024-09-28 00:12:58 +08:00
  • 5d4e92db7d Update quickstart_async.py to improve performance and add Firecrawl simulation v0.3.0 staging unclecode 2024-09-28 00:11:39 +08:00
  • 8b6e88c85c Update .gitignore to ignore temporary and test directories unclecode 2024-09-26 15:09:49 +08:00
  • 64190dd0c4 Update README unclecode 2024-09-25 17:26:13 +08:00
  • 7100bcdf04 Add session based crawling documentation unclecode 2024-09-25 17:16:55 +08:00
  • 10cdad039d Update documents and README unclecode 2024-09-25 16:52:11 +08:00
  • f1eee09cf4 Update README, add manifest, make selenium optional library unclecode 2024-09-25 16:35:14 +08:00
  • 4d48bd31ca Push async version last changes for merge to main branch unclecode 2024-09-24 20:52:08 +08:00
  • 7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt Aravind Karnam 2024-09-19 12:34:12 +05:30
  • 78f26ac263 Merge pull request #2 from aravindkarnam/staging aravind 2024-09-18 18:16:23 +05:30
  • d628bc4034 Refactor content_scrapping_strategy.py to remove excluded tags unclecode 2024-09-12 17:35:45 +08:00
  • b179aa9b6f Refactor website content and setup.py descriptions for consistent terminology unclecode 2024-09-12 16:50:52 +08:00
  • 30807f5535 Remove excluded tags from website content unclecode 2024-09-12 16:11:20 +08:00
  • 396f430022 Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse unclecode 2024-09-12 15:49:49 +08:00
  • 44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy Aravind Karnam 2024-09-09 13:13:34 +05:30
  • eb131bebdf Create series of quickstart files. unclecode 2024-09-04 15:33:24 +08:00
  • 5c15837677 chore: Update README, generate new notebook for quickstart unclecode 2024-09-04 14:46:22 +08:00
  • 2fada16abb chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy unclecode 2024-09-03 23:32:27 +08:00
  • c37614cbc8 Add Async Version, JsonCss Extractor unclecode 2024-09-03 01:27:00 +08:00
  • 3116f95c1a Merge branch 'pull-84' into staging unclecode 2024-09-01 16:44:06 +08:00
  • b0e8b66666 Merge branch 'proxy-support' into staging unclecode 2024-09-01 16:35:14 +08:00
  • 3caf48c9be refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided proxy-support unclecode 2024-09-01 16:34:51 +08:00
  • 3c6ebb73ae Update web_crawler.py pull-84 Umut CAN 2024-08-30 15:30:06 +03:00
  • 0d9b638636 Merge pull request #75 from aravindkarnam/main UncleCode 2024-08-30 12:54:15 +02:00
  • 2ba70b9501 add use proxy and llm baseurl examples datehoer 2024-08-27 10:14:54 +08:00
  • 16f98cebc0 replace base64 image url to '' datehoer 2024-08-27 09:44:35 +08:00
  • fe9ff498ce add proxy and add ai base_url datehoer 2024-08-26 16:12:49 +08:00
  • eba831ca30 fix spelling mistake Datehoer 2024-08-26 15:29:23 +08:00
  • dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema unclecode 2024-08-19 15:37:07 +08:00
  • 9ed1551125 Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71 Aravind Karnam 2024-08-14 10:59:49 +05:30
  • 14e537fdd3 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-08-04 06:57:16 +00:00
  • e5e6a34e80 ## [v0.2.77] - 2024-08-04 v0.2.77 unclecode 2024-08-04 14:54:18 +08:00
  • 64b33af0e0 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-08-02 08:04:54 +00:00
  • 897e766728 Update README unclecode 2024-08-02 16:04:14 +08:00
  • 9200a6731d ## [v0.2.76] - 2024-08-02 unclecode 2024-08-02 16:02:42 +08:00
  • 61c166ab19 refactor: Update Crawl4AI version to v0.2.76 main-75 unclecode 2024-08-02 15:55:53 +08:00
  • 659c8cd953 refactor: Update image description minimum word threshold in get_content_of_website_optimized unclecode 2024-08-02 15:55:32 +08:00
  • 9ee988753d refactor: Update image description minimum word threshold in get_content_of_website_optimized unclecode 2024-08-02 14:53:11 +08:00
  • 8ae6c43ca4 refactor: Update Dockerfile to install Crawl4AI with specified options unclecode 2024-08-01 20:13:06 +08:00
  • b6713870ef refactor: Update Dockerfile to install Crawl4AI with specified options unclecode 2024-08-01 17:56:19 +08:00
  • 40477493d3 refactor: Remove image format dot in get_content_of_website_optimized unclecode 2024-07-31 16:15:55 +08:00
  • efcf3ac6eb Update LocalSeleniumCrawlerStrategy to resolve ChromeDriver version mismatch issue Kevin Moturi 2024-07-27 06:11:57 -05:00
  • 9e43f7beda refactor: Temporarily disable fetching image file size in get_content_of_website_optimized unclecode 2024-07-31 13:29:23 +08:00
  • aa9412e1b4 refactor: Set image_size to 0 in get_content_of_website_optimized unclecode 2024-07-23 13:08:53 +08:00
  • cf6c835e18 moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral. image-description Aravind Karnam 2024-07-21 15:18:23 +05:30
  • e5ecf291f3 Implemented filtering for images and grabbing the contextual text from nearest parent Aravind Karnam 2024-07-21 15:03:17 +05:30
  • 9d0cafcfa6 fixed import error in model_loader.py Aravind Karnam 2024-07-21 14:55:58 +05:30
  • 7715623430 chore: Fix typos and update .gitignore v0.0.75 unclecode 2024-07-19 17:42:39 +08:00
  • f5a4e80e2c chore: Fix typo in chunking_strategy.py and crawler_strategy.py unclecode 2024-07-19 17:40:31 +08:00
  • 8463aabedf chore: Remove .test_pads/ directory from .gitignore main-img-captionify unclecode 2024-07-19 17:09:29 +08:00
  • 7f30144ef2 chore: Remove .tests/ directory from .gitignore unclecode 2024-07-09 15:10:18 +08:00
  • fa5516aad6 chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner. unclecode 2024-07-09 13:25:00 +08:00
  • 1afcdb6996 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 12:24:13 +00:00
  • ca0336af9e feat: Add error handling for rate limit exceeded in form submission unclecode 2024-07-08 20:24:00 +08:00
  • ca625b3152 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 12:02:19 +00:00
  • 65ed1aeade feat: Add rate limiting functionality with custom handlers unclecode 2024-07-08 20:02:12 +08:00
  • 6521b4745f Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 08:35:49 +00:00
  • 4d283ab386 ## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 v0.2.74 unclecode 2024-07-08 16:33:25 +08:00
  • 2101540819 chore: Update version to 0.2.74 in setup.py v0.2.74 unclecode 2024-07-08 16:30:28 +08:00
  • 9d98393606 Prepare branch for release 0.2.74 unclecode 2024-07-08 16:30:14 +08:00
  • 6f99368744 Add UTF encoding to resolve the Windows machine "charmap" error. unclecode 2024-07-08 16:18:07 +08:00
  • ea2f83ac10 feat: Add delay after fetching URL in crawler hooks unclecode 2024-07-08 15:59:59 +08:00
  • 7f41ff4a74 The after_get_url hook is executed after getting the URL, allowing for further customization. unclecode 2024-07-06 14:28:01 +08:00
  • 236bdb4035 feat: Add MaxRetryError exception handling in LocalSeleniumCrawlerStrategy unclecode 2024-07-06 14:08:30 +08:00
  • 1368248254 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:59:26 +08:00
  • b0ec54b9e9 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:37:25 +08:00
  • fb6ed5f000 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:30:58 +08:00
  • 597fe8bdb7 chore: Delete existing database file and initialize new database unclecode 2024-07-05 17:04:57 +08:00
  • 241862bfe6 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-03 07:27:37 +00:00
  • 3ff2a0d0e7 Merge branch 'main' of https://github.com/unclecode/crawl4ai v0.2.73 unclecode 2024-07-03 15:26:47 +08:00
  • 3cd1b3719f Bump version to v0.2.73, update documentation, and resolve installation issues unclecode 2024-07-03 15:26:43 +08:00
  • 9926eb9f95 feat: Bump version to v0.2.73 and update documentation unclecode 2024-07-03 15:19:22 +08:00
  • 3abaa82501 Merge pull request #37 from shivkumar0757/fix-readme-encoding UncleCode 2024-07-01 07:31:07 +02:00
  • 88d8cd8650 feat: Add page load check for LocalSeleniumCrawlerStrategy unclecode 2024-07-01 00:07:32 +08:00
  • a08f21d66c Fix UnicodeDecodeError by reading README.md with UTF-8 encoding shiv 2024-06-30 20:27:33 +05:30
  • f2491b6c1a Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-29 16:34:15 +00:00
  • d58286989c UPDATE DOCUMENTS unclecode 2024-06-30 00:34:02 +08:00
  • 886622cb1e Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-29 16:23:44 +00:00
  • b58af3349c chore: Update installation instructions with support for different modes v0.2.72 unclecode 2024-06-30 00:22:17 +08:00
  • 940df4631f Update ChangeLog unclecode 2024-06-30 00:18:40 +08:00
  • 685706e0aa Update version, and change log main-v0.2.72 unclecode 2024-06-30 00:17:43 +08:00
  • 7b0979e134 Update README and Dockerfile unclecode 2024-06-30 00:15:43 +08:00
  • 61ae2de841 1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well. unclecode 2024-06-30 00:15:29 +08:00
  • 5b28eed2c0 Add a temporary solution for when we can't crawl websites in headless mode. unclecode 2024-06-29 23:25:50 +08:00
  • f8a11779fe Update change log unclecode 2024-06-26 16:48:36 +08:00
  • 13dc254438 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-26 07:35:06 +00:00
  • d11a83c232 ## [0.2.71] 2024-06-26 • Refactored crawler_strategy.py to handle exceptions and improve error messages • Improved get_content_of_website_optimized function in utils.py for better performance • Updated utils.py with latest changes • Migrated to ChromeDriverManager for resolving Chrome driver download issues v0.2.71 main-1 unclecode 2024-06-26 15:34:15 +08:00
  • 3255c7a3fa Update CHANGELOG.md with recent commits unclecode 2024-06-26 15:20:34 +08:00
  • 4756d0a532 Refactor crawler_strategy.py to handle exceptions and improve error messages unclecode 2024-06-26 15:04:33 +08:00
  • 7ba2142363 chore: Refactor get_content_of_website_optimized function in utils.py unclecode 2024-06-26 14:43:09 +08:00
  • 096929153f Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-26 05:45:25 +00:00
  • 96d1eb0d0d Some updates in utils.py image-filterizer unclecode 2024-06-26 13:03:03 +08:00
  • 144cfa0eda Switch to ChromeDriverManager due to some issues with downloading the Chrome driver unclecode 2024-06-26 13:00:17 +08:00