unclecode
32c87f0388
chore: Update NlpSentenceChunking constructor parameters to None
The NlpSentenceChunking constructor parameters now default to None, so callers no longer need to specify a spaCy model for sentence detection. This simplifies use of the class.
2024-05-17 17:00:43 +08:00
unclecode
647cfda225
chore: Update Crawl4AI quickstart script in README.md
Reformats the Crawl4AI quickstart script in README.md: fixes alignment and removes unnecessary indentation so the script is easier to read.
2024-05-17 16:55:34 +08:00
unclecode
1cc67df301
chore: Update pip installation command and requirements, add new dependencies
2024-05-17 16:53:03 +08:00
unclecode
d7b37e849d
chore: Update CrawlRequest model to use NoExtractionStrategy as default
2024-05-17 16:50:38 +08:00
unclecode
f52f526002
chore: Update web_crawler.py to use NoExtractionStrategy as default
2024-05-17 16:03:35 +08:00
unclecode
3593f017d7
chore: Update setup.py to exclude torch, transformers, and nltk dependencies
Removes torch, transformers, and nltk from the install_requires section of setup.py. Adds separate extras_require sections per environment: one with all requirements, one for Colab that excludes torch, and one for the crawl environment that excludes torch, transformers, and nltk.
2024-05-17 16:01:04 +08:00
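Note on 3593f017d7: the split between install_requires and extras_require described above could look roughly like the sketch below. The package list and the extras names ("all", "colab", "crawl") are assumptions for illustration, not taken from the actual setup.py.

```python
# Hypothetical, unpinned requirement list; the real one lives in requirements.txt.
ALL_REQUIREMENTS = ["requests", "beautifulsoup4", "torch", "transformers", "nltk"]

# The heavy dependencies the commit moves out of install_requires.
HEAVY = {"torch", "transformers", "nltk"}


def without(requirements, excluded):
    """Return the requirements whose names are not in `excluded`."""
    return [name for name in requirements if name not in excluded]


# Base install: everything except the heavy packages.
install_requires = without(ALL_REQUIREMENTS, HEAVY)

# One extras section per environment (section names are assumed).
extras_require = {
    "all": list(ALL_REQUIREMENTS),                   # every requirement
    "colab": without(ALL_REQUIREMENTS, {"torch"}),   # Colab ships its own torch
    "crawl": without(ALL_REQUIREMENTS, HEAVY),       # crawling only
}

# In setup.py these two values would be passed to
# setuptools.setup(install_requires=..., extras_require=...).
```

With a layout like this, a plain install pulls only the light dependencies, and the heavier sets are opt-in through pip's extras syntax.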
unclecode
e7bb76f19b
chore: Update torch dependency to version 2.3.0
2024-05-17 15:52:39 +08:00
unclecode
593b928967
Update requirements.txt to include latest versions of dependencies
2024-05-17 15:48:14 +08:00
unclecode
bb3d37face
chore: Update requirements.txt to include latest versions of dependencies
2024-05-17 15:32:37 +08:00
unclecode
3f8576f870
chore: Update model_loader.py to use pretrained models without resume_download
2024-05-17 15:26:15 +08:00
unclecode
bf3b040f10
chore: Update pip installation command and requirements, add new dependencies
2024-05-17 15:21:45 +08:00
unclecode
a317dc5e1d
Load CosineStrategy in the function
2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf
Remove dependency on spaCy model.
2024-05-17 15:08:03 +08:00
unclecode
f85df91ca6
chore: Update README.md with Colab badge
2024-05-17 00:21:16 +08:00
UncleCode
6fcaf26b4f
Update quickstart.py: Add counting items
2024-05-16 22:49:12 +08:00
UncleCode
5b4a586b2d
Update web_crawler.py
Set CosineExtraction as default strategy
2024-05-16 22:28:24 +08:00
UncleCode
a856319499
Update web_crawler.py
Set NoExtractionStrategy for FetchPages
2024-05-16 22:06:33 +08:00
UncleCode
5ce1dc1622
Update web_crawler.py
Set all extraction strategies default to NoExtractionStrategy
2024-05-16 21:58:11 +08:00
unclecode
ea16dec587
Improve library loading
2024-05-16 21:19:02 +08:00
unclecode
d19488a821
chore: Update model_loader.py to create necessary folders in the home directory
2024-05-16 21:05:24 +08:00
unclecode
199c66114c
chore: Update pip installation command and requirements, add new dependencies
2024-05-16 20:58:36 +08:00
unclecode
45569d058d
chore: Update pip installation command and requirements for Crawl4AI
2024-05-16 20:42:53 +08:00
unclecode
5bb0b0b378
chore: Update pip installation command and requirements for Crawl4AI
2024-05-16 20:36:29 +08:00
unclecode
4006f5f4e2
chore: Update pip installation command to use sys.executable
2024-05-16 20:24:48 +08:00
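Note on 4006f5f4e2: invoking pip through sys.executable ensures packages are installed into the interpreter that is currently running, not whichever pip happens to be first on PATH. A minimal sketch of the pattern follows; the helper name and the package name are placeholders, not the project's code.

```python
import sys


def pip_install_command(*packages):
    """Build a pip command bound to the running interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]


# "some-package" is a placeholder name used only for illustration.
command = pip_install_command("some-package")

# To actually run the installation:
#   import subprocess
#   subprocess.check_call(command)
```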
unclecode
7e0682e0de
chore: Update dependencies and installation process
2024-05-16 20:22:50 +08:00
unclecode
8e28eb9efb
Add model loader, update requirements.txt
2024-05-16 20:08:21 +08:00
unclecode
c8589f8da3
Update:
- Fix spaCy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
6a6365ae0a
Refactor code to exclude the extraction of semantic blocks of text from the HTML
2024-05-16 18:10:55 +08:00
unclecode
5b80be956d
Update:
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
UncleCode
4a2e17447b
Update README.md
2024-05-16 08:57:58 +08:00
unclecode
f6e59157bf
- Test all methods
- Update index.html
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
5fea6c064b
Improve libraries import
2024-05-13 02:46:35 +08:00
unclecode
11393183f7
Add Colab setup script.
2024-05-13 00:39:06 +08:00
unclecode
7679064521
Add model parameter for clustering.
2024-05-13 00:06:16 +08:00
unclecode
cf087cfa58
Replace embedding model with smaller one
2024-05-12 23:55:57 +08:00
unclecode
5693e324a4
Add time measurements.
2024-05-12 23:35:27 +08:00
unclecode
b38bf64490
Exclude spaCy from requirements.txt
2024-05-12 22:59:26 +08:00
unclecode
82706129f5
Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee
- Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them.
2024-05-12 14:08:22 +08:00
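Note on 7039e3c1ee: the fix described above, replacing each <pre> tag's HTML content with its inner text, can be sketched with the standard library alone. This is an illustrative regex-based version under the assumption that <pre> blocks are not nested; the project itself may well use an HTML parser instead, and the function name is hypothetical.

```python
import re

# Matches a whole <pre>...</pre> block: opening tag, inner HTML, closing tag.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>)(.*?)(</pre>)", re.DOTALL | re.IGNORECASE)
# Matches any single tag, used to strip markup inside the block.
TAG = re.compile(r"<[^>]+>")


def flatten_pre_blocks(html: str) -> str:
    """Replace the inner HTML of every <pre> block with its plain text.

    Character entities such as &lt; are left untouched so the result
    remains valid HTML.
    """
    def _flatten(match):
        opening, inner, closing = match.groups()
        return opening + TAG.sub("", inner) + closing

    return PRE_BLOCK.sub(_flatten, html)


# A syntax highlighter may wrap each character or token in its own <span>.
sample = (
    '<p>Run:</p><pre class="hl"><span>p</span><span>i</span>'
    '<span>p</span> <span>install</span></pre>'
)
flattened = flatten_pre_blocks(sample)
```

After flattening, the block contains the contiguous text "pip install", so a word-count threshold sees whole words instead of single-character fragments.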
unclecode
8e536b9717
chore: Refactor README.md and project structure
2024-05-12 12:41:42 +08:00
unclecode
aac4e07389
chore: Update README.md and project structure
2024-05-12 12:39:31 +08:00
UncleCode
e3960ace68
Update README.md
Explain more about `extract_blocks_flag`
2024-05-11 22:11:16 +08:00
UncleCode
b0f97ab2b3
Update README.md
Public server is available now
2024-05-11 08:56:19 +08:00
unclecode
372c921429
Update: Fix bug when user sets extract_blocks to False
2024-05-10 20:12:31 +08:00
ntohidi
aa126e436b
Add CORS middleware for allowing all origins to make requests
2024-05-10 12:27:40 +02:00
unclecode
20ef255c7f
Update README
2024-05-09 23:28:47 +08:00
unclecode
da7748a780
Update README file
2024-05-09 22:51:10 +08:00
unclecode
f74f4e88c0
Update README file
2024-05-09 22:48:42 +08:00
unclecode
a8e7218769
chore: Update README.md and project structure
2024-05-09 22:40:08 +08:00
unclecode
84f093593a
Update README
2024-05-09 22:37:45 +08:00