Commit Graph

322 Commits

Author SHA1 Message Date
unclecode
f5f3cce2c8 Merge new-release-0.0.2-no-spacy into main for v0.2.0 release v0.2.0 2024-05-17 18:23:27 +08:00
unclecode
a085e6315b Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-05-17 18:21:02 +08:00
unclecode
a8d600a3b4 chore: Add test_pad.py, requirements0.txt, and a.txt to .gitignore v0.1.0 2024-05-17 18:13:43 +08:00
unclecode
6f96dcd649 chore: Update README 2024-05-17 18:12:50 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
36e46be23d chore: Add verbose option to ExtractionStrategy classes
This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.
2024-05-17 18:06:10 +08:00
unclecode
32c87f0388 chore: Update NlpSentenceChunking constructor parameters to None
The NlpSentenceChunking constructor parameters now default to None to simplify use of the class. This removes the need to specify a spaCy model for sentence detection.
2024-05-17 17:00:43 +08:00
unclecode
647cfda225 chore: Update Crawl4AI quickstart script in README.md
This commit updates the Crawl4AI quickstart script in the README.md file. The script is now properly formatted and aligned, making it easier to read and understand. The unnecessary indentation has been removed, and the script is now more concise and efficient.
2024-05-17 16:55:34 +08:00
unclecode
1cc67df301 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 16:53:03 +08:00
unclecode
d7b37e849d chore: Update CrawlRequest model to use NoExtractionStrategy as default 2024-05-17 16:50:38 +08:00
unclecode
f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default 2024-05-17 16:03:35 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
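The dependency split described in this commit can be sketched as follows. This is only an illustration under stated assumptions: the requirement list, the `without` helper, and the extras names `all`, `colab`, and `crawl` are hypothetical and are not taken from the project's actual setup.py.

```python
import re

# Hypothetical full requirement list; the real project list may differ.
ALL_REQUIREMENTS = ["requests", "torch==2.3.0", "transformers", "nltk"]


def without(requirements, excluded):
    """Return the requirements whose package name is not in `excluded`."""
    kept = []
    for req in requirements:
        # Strip version specifiers, extras, and markers to get the bare name.
        name = re.split(r"[\s<>=!~;\[]", req, maxsplit=1)[0].lower()
        if name not in excluded:
            kept.append(req)
    return kept


HEAVY = {"torch", "transformers", "nltk"}

# The base install leaves out the heavy ML dependencies.
install_requires = without(ALL_REQUIREMENTS, HEAVY)

# Assumed extras names, one per environment mentioned in the commit message.
extras_require = {
    "all": list(ALL_REQUIREMENTS),
    "colab": without(ALL_REQUIREMENTS, {"torch"}),
    "crawl": without(ALL_REQUIREMENTS, HEAVY),
}

# These two values would then be passed to setuptools.setup(
#     install_requires=install_requires, extras_require=extras_require, ...)
```

With a layout like this, a plain install stays lightweight and the heavier packages are pulled in only when an extra is requested.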
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00
unclecode
593b928967 Update requirements.txt to include latest versions of dependencies 2024-05-17 15:48:14 +08:00
unclecode
bb3d37face chore: Update requirements.txt to include latest versions of dependencies 2024-05-17 15:32:37 +08:00
unclecode
3f8576f870 chore: Update model_loader.py to use pretrained models without resume_download 2024-05-17 15:26:15 +08:00
unclecode
bf3b040f10 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 15:21:45 +08:00
unclecode
a317dc5e1d Load CosineStrategy in the function 2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf Remove dependency on spaCy model. 2024-05-17 15:08:03 +08:00
unclecode
f85df91ca6 chore: Update README.md with Colab badge 2024-05-17 00:21:16 +08:00
UncleCode
6fcaf26b4f Update quickstart.py: Add counting items 2024-05-16 22:49:12 +08:00
UncleCode
5b4a586b2d Update web_crawler.py
Set CosineExtraction as default strategy
2024-05-16 22:28:24 +08:00
UncleCode
a856319499 Update web_crawler.py
Set NoExtractionStrategy for FetchPages
2024-05-16 22:06:33 +08:00
UncleCode
5ce1dc1622 Update web_crawler.py
Set all extraction strategies default to NoExtractionStrategy
2024-05-16 21:58:11 +08:00
unclecode
ea16dec587 Improve library loading 2024-05-16 21:19:02 +08:00
unclecode
d19488a821 chore: Update model_loader.py to create necessary folders in the home directory 2024-05-16 21:05:24 +08:00
unclecode
199c66114c chore: Update pip installation command and requirements, add new dependencies 2024-05-16 20:58:36 +08:00
unclecode
45569d058d chore: Update pip installation command and requirements for Crawl4AI 2024-05-16 20:42:53 +08:00
unclecode
5bb0b0b378 chore: Update pip installation command and requirements for Crawl4AI 2024-05-16 20:36:29 +08:00
unclecode
4006f5f4e2 chore: Update pip installation command to use sys.executable 2024-05-16 20:24:48 +08:00
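A minimal sketch of what installing through `sys.executable` typically looks like. The helper names `pip_install_command` and `pip_install` are hypothetical and are not taken from the repository. The point of the pattern is that pip runs under the same interpreter as the calling script, rather than whichever `pip` happens to be first on PATH.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Install `package` into the current interpreter's environment.

    Raises subprocess.CalledProcessError if pip exits with a non-zero status.
    """
    subprocess.check_call(pip_install_command(package))
```

This matters in notebooks and virtual environments, where a bare `pip` may belong to a different Python installation.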
unclecode
7e0682e0de chore: Update dependencies and installation process 2024-05-16 20:22:50 +08:00
unclecode
8e28eb9efb Add model loader, update requirements.txt 2024-05-16 20:08:21 +08:00
unclecode
c8589f8da3 Update:
- Fix spaCy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
6a6365ae0a Refactor code to exclude the extraction of semantic blocks of text from the HTML 2024-05-16 18:10:55 +08:00
unclecode
5b80be956d Update:
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
UncleCode
4a2e17447b Update README.md 2024-05-16 08:57:58 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.html
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
5fea6c064b Improve libraries import 2024-05-13 02:46:35 +08:00
unclecode
11393183f7 Add Colab setup script. 2024-05-13 00:39:06 +08:00
unclecode
7679064521 Add model parameter for clustering. 2024-05-13 00:06:16 +08:00
unclecode
cf087cfa58 Replace embedding model with smaller one 2024-05-12 23:55:57 +08:00
unclecode
5693e324a4 Add time measurements. 2024-05-12 23:35:27 +08:00
unclecode
b38bf64490 Exclude spaCy from requirements.txt 2024-05-12 22:59:26 +08:00
unclecode
82706129f5 Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee - Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them. 2024-05-12 14:08:22 +08:00
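The fix described in this commit can be illustrated with a small standard-library sketch. It is not the project's implementation: the function name `flatten_pre` and the regex-based approach are assumptions chosen to keep the example self-contained, and nested `<pre>` elements are not handled.

```python
import re
from html import escape, unescape

# Matches one <pre ...> ... </pre> block; nested <pre> is not supported.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>)(.*?)(</pre>)", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def flatten_pre(html_text):
    """Replace the markup inside every <pre> element with its inner text."""

    def _flatten(match):
        open_tag, inner, close_tag = match.groups()
        # Drop inner tags such as per-character <span>s, keeping only the text.
        text = unescape(ANY_TAG.sub("", inner))
        # Re-escape so characters like "<" remain valid inside the HTML.
        return open_tag + escape(text, quote=False) + close_tag

    return PRE_BLOCK.sub(_flatten, html_text)


highlighted = "<pre><span>p</span><span>r</span><span>i</span>nt(1)</pre>"
flattened = flatten_pre(highlighted)
```

After flattening, the code sample is one contiguous run of text, so a word-count threshold applied to the element sees whole words instead of single-character fragments.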
unclecode
8e536b9717 chore: Refactor README.md and project structure 2024-05-12 12:41:42 +08:00
unclecode
aac4e07389 chore: Update README.md and project structure 2024-05-12 12:39:31 +08:00
UncleCode
e3960ace68 Update README.md
Explain more about `extract_blocks_flag`
2024-05-11 22:11:16 +08:00
UncleCode
b0f97ab2b3 Update README.md
Public server is available now
2024-05-11 08:56:19 +08:00
unclecode
372c921429 Update: Fix bug when user sets extract_blocks to False 2024-05-10 20:12:31 +08:00