Commit Graph

  • 5cee084340 fix(main): UnicodeDecodeError QIN2DIM 2024-05-18 23:31:11 +08:00
  • bf00c26a83 chore: Update Dockerfile to install chromium-chromedriver and spacy library Unclecode 2024-05-18 09:16:52 +00:00
  • 3846648c12 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices unclecode 2024-05-18 15:42:19 +08:00
  • eb6423875f chore: Update Selenium options in crawler_strategy.py and add verbose logging in CosineStrategy unclecode 2024-05-18 14:13:06 +08:00
  • e3524a10a7 chore: Update REST API base URL in README.md unclecode 2024-05-17 23:28:29 +08:00
  • 468dad6169 chore: Update Dockerfile to install chromium-chromedriver and spacy library unclecode 2024-05-17 23:15:39 +08:00
  • bc27982992 Update setup.py Handle Spacy installation UncleCode 2024-05-17 22:11:00 +08:00
  • 57e5decb55 Update requirements.txt UncleCode 2024-05-17 22:02:08 +08:00
  • b6319c6f6e chore: Add support for GPU, MPS, and CPU unclecode 2024-05-17 21:56:13 +08:00
  • 0a902f562f Update requirements.txt Add Spacy UncleCode 2024-05-17 21:41:35 +08:00
  • 454135856e Update extraction_strategy.py Support GPU, MPS, and CPU UncleCode 2024-05-17 21:40:48 +08:00
  • 33fddc27ad Update model loader to support GPU, MPS, and CPU UncleCode 2024-05-17 21:39:22 +08:00
  • ce052a4eb5 Update README unclecode 2024-05-17 18:29:59 +08:00
  • b43d77a56b Update README unclecode 2024-05-17 18:28:39 +08:00
  • 1635a92218 chore: Update Crawl4AI quickstart script in README.md unclecode 2024-05-17 18:25:32 +08:00
  • 2a8a1b27e1 chore: Update Readme unclecode 2024-05-17 18:24:47 +08:00
  • f5f3cce2c8 Merge new-release-0.0.2-no-spacy into main for v0.2.0 release v0.2.0 unclecode 2024-05-17 18:23:27 +08:00
  • a085e6315b Merge branch 'main' of https://github.com/unclecode/crawl4ai unclecode 2024-05-17 18:21:02 +08:00
  • a8d600a3b4 chore: Add test_pad.py, requirements0.txt, and a.txt to .gitignore v0.1.0 unclecode 2024-05-17 18:13:43 +08:00
  • 6f96dcd649 chore: Update README new-release-0.0.2-no-spacy unclecode 2024-05-17 18:12:50 +08:00
  • 957a2458b1 chore: Update web crawler URLs to use NBC News business section unclecode 2024-05-17 18:11:13 +08:00
  • 36e46be23d chore: Add verbose option to ExtractionStrategy classes unclecode 2024-05-17 18:06:10 +08:00
  • 32c87f0388 chore: Update NlpSentenceChunking constructor parameters to None unclecode 2024-05-17 17:00:43 +08:00
  • 647cfda225 chore: Update Crawl4AI quickstart script in README.md unclecode 2024-05-17 16:55:34 +08:00
  • 1cc67df301 chore: Update pip installation command and requirements, add new dependencies unclecode 2024-05-17 16:53:03 +08:00
  • d7b37e849d chore: Update CrawlRequest model to use NoExtractionStrategy as default unclecode 2024-05-17 16:50:38 +08:00
  • f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default unclecode 2024-05-17 16:03:35 +08:00
  • 3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies unclecode 2024-05-17 16:01:04 +08:00
  • e7bb76f19b chore: Update torch dependency to version 2.3.0 unclecode 2024-05-17 15:52:39 +08:00
  • 593b928967 Update requirements.txt to include latest versions of dependencies unclecode 2024-05-17 15:48:14 +08:00
  • bb3d37face chore: Update requirements.txt to include latest versions of dependencies unclecode 2024-05-17 15:32:37 +08:00
  • 3f8576f870 chore: Update model_loader.py to use pretrained models without resume_download unclecode 2024-05-17 15:26:15 +08:00
  • bf3b040f10 chore: Update pip installation command and requirements, add new dependencies unclecode 2024-05-17 15:21:45 +08:00
  • a317dc5e1d Load CosineStrategy in the function unclecode 2024-05-17 15:13:06 +08:00
  • a5f9d07dbf Remove dependency on Spacy model. unclecode 2024-05-17 15:08:03 +08:00
  • f85df91ca6 chore: Update README.md with Colab badge new-release-0.0.2 unclecode 2024-05-17 00:21:16 +08:00
  • 6fcaf26b4f Update quickstart.py: Add counting items UncleCode 2024-05-16 22:49:12 +08:00
  • 5b4a586b2d Update web_crawler.py UncleCode 2024-05-16 22:28:24 +08:00
  • a856319499 Update web_crawler.py UncleCode 2024-05-16 22:06:33 +08:00
  • 5ce1dc1622 Update web_crawler.py UncleCode 2024-05-16 21:58:11 +08:00
  • ea16dec587 Improve library loading unclecode 2024-05-16 21:19:02 +08:00
  • d19488a821 chore: Update model_loader.py to create necessary folders in the home directory unclecode 2024-05-16 21:05:24 +08:00
  • 199c66114c chore: Update pip installation command and requirements, add new dependencies unclecode 2024-05-16 20:58:36 +08:00
  • 45569d058d chore: Update pip installation command and requirements for Crawl4AI unclecode 2024-05-16 20:42:53 +08:00
  • 5bb0b0b378 chore: Update pip installation command and requirements for Crawl4AI unclecode 2024-05-16 20:36:29 +08:00
  • 4006f5f4e2 chore: Update pip installation command to use sys.executable unclecode 2024-05-16 20:24:48 +08:00
  • 7e0682e0de chore: Update dependencies and installation process unclecode 2024-05-16 20:22:50 +08:00
  • 8e28eb9efb Add model loader, update requirements.txt unclecode 2024-05-16 20:08:21 +08:00
  • c8589f8da3 Update: - Fix Spacy model issue - Update Readme and requirements.txt unclecode 2024-05-16 19:50:20 +08:00
  • 6a6365ae0a Refactor code to exclude the extraction of semantical blocks of text from the HTML unclecode 2024-05-16 18:10:55 +08:00
  • 5b80be956d Update: - Debug - Refactor code for new version unclecode 2024-05-16 17:31:44 +08:00
  • 4a2e17447b Update README.md UncleCode 2024-05-16 08:57:58 +08:00
  • f6e59157bf - Test all methods - Update index.html - Update Readme - Resolve some bugs unclecode 2024-05-14 21:27:41 +08:00
  • 5fea6c064b Improve libraries import unclecode 2024-05-13 02:46:35 +08:00
  • 11393183f7 Add Colab setup script. unclecode 2024-05-13 00:39:06 +08:00
  • 7679064521 Add model parameter for clustering. unclecode 2024-05-13 00:06:16 +08:00
  • cf087cfa58 Replace embedding model with smaller one unclecode 2024-05-12 23:55:57 +08:00
  • 5693e324a4 Add time measurements. unclecode 2024-05-12 23:35:27 +08:00
  • b38bf64490 Exclude spaCy from requirements.txt unclecode 2024-05-12 22:59:26 +08:00
  • 82706129f5 Update: - Text Categorization - Crawler, Extraction, and Chunking strategies - Clustering for semantic segmentation unclecode 2024-05-12 22:37:21 +08:00
  • 7039e3c1ee - Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them. unclecode 2024-05-12 14:08:22 +08:00
  • 8e536b9717 chore: Refactor README.md and project structure unclecode 2024-05-12 12:41:42 +08:00
  • aac4e07389 chore: Update README.md and project structure unclecode 2024-05-12 12:39:31 +08:00
  • e3960ace68 Update README.md UncleCode 2024-05-11 22:11:16 +08:00
  • b0f97ab2b3 Update README.md UncleCode 2024-05-11 08:56:19 +08:00
  • 372c921429 Update: Fix bug, when user set extract_blocks to False unclecode 2024-05-10 20:12:31 +08:00
  • aa126e436b Add CORS middleware for allowing all origins to make requests ntohidi 2024-05-10 12:27:40 +02:00
  • 20ef255c7f Update README unclecode 2024-05-09 23:28:47 +08:00
  • da7748a780 Update README file unclecode 2024-05-09 22:51:10 +08:00
  • f74f4e88c0 Update README file unclecode 2024-05-09 22:48:42 +08:00
  • a8e7218769 chore: Update README.md and project structure unclecode 2024-05-09 22:40:08 +08:00
  • 84f093593a Update README unclecode 2024-05-09 22:37:45 +08:00
  • 88643612e8 chore: Update environment variable usage in config files unclecode 2024-05-09 22:37:01 +08:00
  • 6f99bad6f0 Update web application URL in README.md unclecode 2024-05-09 22:28:37 +08:00
  • 50d7a7e45d chore: Update forced flag for single page fetch to use default value unclecode 2024-05-09 22:21:12 +08:00
  • c71dd9189b chore: Update import statements to use crawl4ai package unclecode 2024-05-09 22:17:15 +08:00
  • 3ff1d15702 Change the project folder name from crawler to crawl4ai unclecode 2024-05-09 22:16:28 +08:00
  • 7ee8001b7d Update README.md UncleCode 2024-05-09 21:49:04 +08:00
  • b9d9d2bbd4 chore: Update URL for single page fetch to NBC News unclecode 2024-05-09 20:05:59 +08:00
  • 6320d07a93 chore: Update landing page URL and min words threshold unclecode 2024-05-09 20:05:31 +08:00
  • 181250cb93 chore: Add function to clear the database unclecode 2024-05-09 19:42:43 +08:00
  • f7c031c097 chore: Remove unused code from test.py unclecode 2024-05-09 19:26:37 +08:00
  • 51095062d4 Update file names unclecode 2024-05-09 19:26:16 +08:00
  • c71adb29ce chore: Update .gitignore and README.md unclecode 2024-05-09 19:25:25 +08:00
  • 898ec30a18 chore: Update license information in README.md unclecode 2024-05-09 19:14:48 +08:00
  • 343c4477f8 Update Crawl4AI web application URL in README.md unclecode 2024-05-09 19:13:20 +08:00
  • 99e0dd1ccd chore: Update README.md with installation instructions for Crawl4AI library and local server unclecode 2024-05-09 19:12:39 +08:00
  • b8e743cd8d Initial Commit unclecode 2024-05-09 19:10:25 +08:00