Commit Graph

  • 16f918621f Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-07 19:30:22 +08:00
  • f7574230a1 Update API server request object. text_docker file and Readme UncleCode 2024-11-07 19:29:31 +08:00
  • 3d1c9a8434 Reviewing the BFS strategy. UncleCode 2024-11-07 18:54:53 +08:00
  • 2879344d9c Update README.md devatnull 2024-11-06 17:36:46 +03:00
  • 9f5eef1f38 Refactored the CustomHTML2Text class in content_scrapping_strategy.py to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead. UncleCode 2024-11-06 21:50:09 +08:00
  • be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. UncleCode 2024-11-06 21:09:47 +08:00
  • 06b21dcc50 Update .gitignore to include new directories for issues and documentation UncleCode 2024-11-06 18:44:03 +08:00
  • c5aa1bec18 Merge pull request #229 from bizrockman/main UncleCode 2024-11-06 07:31:07 +01:00
  • 0f0f60527d Merge pull request #172 from aravindkarnam/scraper UncleCode 2024-11-06 07:00:44 +01:00
  • 11721eb0ce Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-11-05 13:02:59 +00:00
  • b51263664e feat(api): add CORS support and static file serving, update root redirect UncleCode 2024-11-05 21:02:47 +08:00
  • 1222e456fb Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-11-05 12:58:30 +00:00
  • 1e7db0d293 docs(README): update release notes for version 0.3.73 with new features and improvements UncleCode 2024-11-05 20:12:20 +08:00
  • 2a54f3c048 refactor(core): remove main_v0.py file and associated functionality UncleCode 2024-11-05 20:11:07 +08:00
  • 1c20b815b3 docs(README): update Docker usage instructions and add deployment options UncleCode 2024-11-05 20:10:24 +08:00
  • 43a2b26f63 Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-05 20:08:20 +08:00
  • 3cf19a1bc2 chore(version): bump version to 0.3.73 0.3.73 UncleCode 2024-11-05 20:05:58 +08:00
  • 67a23c3182 feat(core): Release v0.3.73 with Browser Takeover and Docker Support UncleCode 2024-11-05 20:04:18 +08:00
  • 796dbaf08c Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md bizrockman 2024-11-04 20:19:43 +01:00
  • 3a3c88a2d0 Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md bizrockman 2024-11-04 20:19:20 +01:00
  • 870296fa7e Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md bizrockman 2024-11-04 20:18:58 +01:00
  • a28046c233 Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md bizrockman 2024-11-04 20:18:26 +01:00
  • 0bba0e074f Preventing NoneType has no attribute get Errors bizrockman 2024-11-04 20:12:24 +01:00
  • c4c6227962 Creating the API server component UncleCode 2024-11-04 20:33:15 +08:00
  • e6c914d2fa Refactor version management and remove deprecated gitignore.dev file UncleCode 2024-11-04 16:51:59 +08:00
  • be8f4fc59a Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73 UncleCode 2024-11-04 14:12:07 +08:00
  • fbdf870fbf Update CHANGELOG unclecode 2024-11-04 14:10:27 +08:00
  • 7b0cca41b4 Update gitignore UncleCode 2024-11-04 13:48:26 +08:00
  • 33d0e9ec8c Update dev gitignore UncleCode 2024-11-04 13:42:37 +08:00
  • 42f1c67ca8 Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73 UncleCode 2024-11-04 13:39:39 +08:00
  • e28c49a8fe Refactor .gitignore.dev file: Add ignore patterns for various files and directories UncleCode 2024-11-04 13:39:38 +08:00
  • 54d5a3a259 Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies. unclecode 2024-11-04 13:22:13 +08:00
  • de6b43f334 Merge pull request #215 from mjvankampen/build/flexible-requirements UncleCode 2024-11-03 08:30:06 +01:00
  • 07f508bd0c Merge pull request #218 from timoa/main UncleCode 2024-11-03 06:59:30 +01:00
  • 62a86dbe8d Refactor mission section in README and add mission diagram UncleCode 2024-10-31 16:38:56 +08:00
  • 492ada0ed4 Add mission diagram to MISSION.md UncleCode 2024-10-31 15:26:43 +08:00
  • d8eef02867 Add link to mission statement in README UncleCode 2024-10-31 15:23:58 +08:00
  • 6c7235d6a7 Add mission.md file UncleCode 2024-10-31 15:22:00 +08:00
  • 0a09d78fa5 chore(docs): fix documentation links + markdown lint Damien Laureaux 2024-10-31 05:50:22 +01:00
  • e8aaa57cb2 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-30 12:59:34 +00:00
  • 19c3f3efb2 Refactor tutorial markdown files: Update numbering and formatting UncleCode 2024-10-30 20:58:07 +08:00
  • a661b3173d Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-30 12:47:07 +00:00
  • e97e8df6ba Update README: Fix typo in project name UncleCode 2024-10-30 20:45:20 +08:00
  • cb6f5323ae Update README UncleCode 2024-10-30 20:44:57 +08:00
  • 47464cedec Update README UncleCode 2024-10-30 20:42:27 +08:00
  • 982d203d91 Merge branch '0.3.73' UncleCode 2024-10-30 20:40:09 +08:00
  • 9307c19f35 Update documents, upload new version of quickstart. UncleCode 2024-10-30 20:39:35 +08:00
  • 605a82793b fix dev requirements and lock playwright due to failing tests Mark Jan van Kampen 2024-10-30 10:41:37 +01:00
  • df9ee44d42 build: make requirements more flexible Mark Jan van Kampen 2024-10-30 10:03:22 +01:00
  • e9f7d5e73a Merge branch '0.3.73' UncleCode 2024-10-30 00:16:49 +08:00
  • 3529c2e732 Update new tutorial documents and added to the docs folder. UncleCode 2024-10-30 00:16:18 +08:00
  • d9e0b7abab Fix README badge UncleCode 2024-10-28 15:14:16 +08:00
  • b2800fefc6 Add badges to README UncleCode 2024-10-28 15:10:12 +08:00
  • d913e20edc Update Readme UncleCode 2024-10-28 15:09:37 +08:00
  • b781b6df96 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-27 11:42:23 +00:00
  • c2a71a5abe Update Docs folder, prepare branch for new version 0.3.73 v.3.72 UncleCode 2024-10-27 19:35:13 +08:00
  • d61615e0b0 Merge branch '0.3.72' UncleCode 2024-10-27 19:33:05 +08:00
  • ac9d83c72f Update gitignore main-0.3.7 UncleCode 2024-10-27 19:29:04 +08:00
  • ff9149b5c9 Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-10-27 19:28:05 +08:00
  • 4239654722 Update Documentation 0.3.72 UncleCode 2024-10-27 19:24:46 +08:00
  • 38474bd66a Update version UncleCode 2024-10-24 20:24:21 +08:00
  • bcfe83f702 feat: enhance crawler with overlay removal and improved screenshot capabilities UncleCode 2024-10-24 20:22:47 +08:00
  • 32f57c49d6 Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-directory UncleCode 2024-10-24 13:09:27 +02:00
  • 60ba131ac8 [v0.3.72] Enhance content extraction and proxy support UncleCode 2024-10-22 20:19:22 +08:00
  • a5f627ba1a feat: customize crawl base directory Idris Hanafi 2024-10-21 17:58:39 -04:00
  • 04d16e6d2b Fix Base64 image parsing in WebScrappingStrategy (issue 182) UncleCode 2024-10-20 19:25:25 +08:00
  • 1dd36f9035 Refactor content scrapping strategy and improve error handling UncleCode 2024-10-20 19:11:18 +08:00
  • 6ec4cb33ca Enhance Markdown generation and external content control UncleCode 2024-10-20 18:56:58 +08:00
  • e7cd8a1c2d Update Changelog UncleCode 2024-10-19 18:37:12 +08:00
  • 4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance UncleCode 2024-10-19 18:36:59 +08:00
  • b309bc34e1 Fix the model name in quick start example 0.3.7 UncleCode 2024-10-18 15:32:25 +08:00
  • b8147b64e0 chore: Bump version to 0.3.71 and improve error handling UncleCode 2024-10-18 13:31:12 +08:00
  • aab6ea022e Update requirements and switch to 0.3.8 UncleCode 2024-10-18 12:51:23 +08:00
  • dd17ed0e63 Rename some flags name, introducing magic flag. UncleCode 2024-10-18 12:35:09 +08:00
  • dbb587d681 Update gitignore UncleCode 2024-10-17 21:38:48 +08:00
  • 768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling UncleCode 2024-10-17 21:37:48 +08:00
  • 8105fd178e Removed stubs for remove_from_future_crawls since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url) since retry with exponential backoff with the help of tenacity is going to take care of it. Aravind Karnam 2024-10-17 15:42:43 +05:30
  • ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches. 2. Moved visted.add(url) to before the task is put in the queue rather than after the crawl is completed. This makes sure that duplicate crawls don't happen when the same URL is found at a different depth and gets queued too, because the crawl is not yet completed and the visited set is not updated. 3. Renamed the yield_results attribute to stream, since that seems to be popularly used in other AI libraries for intermediate results. Aravind Karnam 2024-10-17 12:25:17 +05:30
  • de28b59aca removed unused imports Aravind Karnam 2024-10-16 22:36:48 +05:30
  • 04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy Aravind Karnam 2024-10-16 22:34:54 +05:30
  • 2943feeecf 1. Added a flag to yield each crawl result as it becomes ready, along with the final scraper result as another option. 2. Removed the ascrape_many method, as I'm currently not focusing on it in the first cut of the scraper. 3. Added some error handling for cases where robots.txt cannot be fetched or parsed. Aravind Karnam 2024-10-16 22:05:29 +05:30
  • 8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter Aravind Karnam 2024-10-16 15:59:37 +05:30
  • 159bd875bd Merge pull request #5 from aravindkarnam/main aravind 2024-10-16 10:41:22 +05:30
  • 9ffa34b697 Update README v0.3.6 unclecode/issue167 unclecode/issue157 unclecode 2024-10-14 22:58:27 +08:00
  • 740802c491 Merge branch '0.3.6' unclecode 2024-10-14 22:55:24 +08:00
  • b9ac96c332 Merge branch 'main' of https://github.com/unclecode/crawl4ai unclecode 2024-10-14 22:54:23 +08:00
  • d06535388a Update gitignore unclecode 2024-10-14 22:53:56 +08:00
  • 5b84ac9186 Merge branch '0.3.5' of https://github.com/unclecode/crawl4ai into 0.3.5 0.3.5 unclecode 2024-10-14 22:53:09 +08:00
  • 7ea5603576 Update gitignore unclecode 2024-10-14 22:52:00 +08:00
  • 2b73bdf6b0 Update changelog 0.3.6 unclecode 2024-10-14 21:04:02 +08:00
  • 6aa803d712 Update gitignore unclecode 2024-10-14 21:03:40 +08:00
  • 320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities unclecode 2024-10-14 21:03:28 +08:00
  • ccbe72cfc1 Merge pull request #135 from hitesh22rana/fix/docs-example UncleCode 2024-10-13 14:39:07 +08:00
  • b9bbd42373 Update Quickstart examples unclecode 2024-10-13 14:37:45 +08:00
  • 68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility unclecode 2024-10-12 14:48:22 +08:00
  • 9b2b267820 CHANGELOG UPDATE unclecode 2024-10-12 13:42:56 +08:00
  • ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts unclecode 2024-10-12 13:42:42 +08:00
  • b99d20b725 Add pypi_build.sh to .gitignore unclecode 2024-10-08 18:10:57 +08:00
  • 768b93140f docs: fixed css_selector for example hitesh22rana 2024-10-05 00:25:41 +09:00
  • d743adac68 Fixed some bugs in robots.txt processing Aravind Karnam 2024-10-03 15:58:57 +05:30