unclecode
d7b37e849d
chore: Update CrawlRequest model to use NoExtractionStrategy as default
2024-05-17 16:50:38 +08:00
unclecode
f52f526002
chore: Update web_crawler.py to use NoExtractionStrategy as default
2024-05-17 16:03:35 +08:00
unclecode
3593f017d7
chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
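The dependency split described in the commit above could look roughly like the sketch below. The package lists and the extra names (`all`, `colab`, `crawl`) are assumptions made for illustration; the log does not show the actual contents of setup.py.

```python
# Hypothetical package lists; the real setup.py may use other names and pins.
CORE_REQUIREMENTS = ["requests", "beautifulsoup4"]
HEAVY_REQUIREMENTS = ["torch", "transformers", "nltk"]


def build_extras(core, heavy):
    """Return an extras_require-style mapping (extra names are illustrative)."""
    return {
        # Everything, including the heavy ML stack.
        "all": core + heavy,
        # The Colab extra leaves out torch only.
        "colab": core + [pkg for pkg in heavy if pkg != "torch"],
        # The crawl extra leaves out torch, transformers, and nltk.
        "crawl": list(core),
    }


extras_require = build_extras(CORE_REQUIREMENTS, HEAVY_REQUIREMENTS)
print(extras_require)
```

A mapping like this would be passed to `setuptools.setup(..., extras_require=extras_require)`, so that an install such as `pip install "package[colab]"` pulls in only the matching set.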
unclecode
e7bb76f19b
chore: Update torch dependency to version 2.3.0
2024-05-17 15:52:39 +08:00
unclecode
593b928967
Update requirements.txt to include latest versions of dependencies
2024-05-17 15:48:14 +08:00
unclecode
bb3d37face
chore: Update requirements.txt to include latest versions of dependencies
2024-05-17 15:32:37 +08:00
unclecode
3f8576f870
chore: Update model_loader.py to use pretrained models without resume_download
2024-05-17 15:26:15 +08:00
unclecode
bf3b040f10
chore: Update pip installation command and requirements, add new dependencies
2024-05-17 15:21:45 +08:00
unclecode
a317dc5e1d
Load CosineStrategy in the function
2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf
Remove dependency on Spacy model.
2024-05-17 15:08:03 +08:00
unclecode
f85df91ca6
chore: Update README.md with Colab badge
2024-05-17 00:21:16 +08:00
UncleCode
6fcaf26b4f
Update quickstart.py: Add counting items
2024-05-16 22:49:12 +08:00
UncleCode
5b4a586b2d
Update web_crawler.py
Set CosineExtraction as default strategy
2024-05-16 22:28:24 +08:00
UncleCode
a856319499
Update web_crawler.py
Set NoExtractionStrategy for FetchPages
2024-05-16 22:06:33 +08:00
UncleCode
5ce1dc1622
Update web_crawler.py
Set all extraction strategies default to NoExtractionStrategy
2024-05-16 21:58:11 +08:00
unclecode
ea16dec587
Improve library loading
2024-05-16 21:19:02 +08:00
unclecode
d19488a821
chore: Update model_loader.py to create necessary folders in the home directory
2024-05-16 21:05:24 +08:00
unclecode
199c66114c
chore: Update pip installation command and requirements, add new dependencies
2024-05-16 20:58:36 +08:00
unclecode
45569d058d
chore: Update pip installation command and requirements for Crawl4AI
2024-05-16 20:42:53 +08:00
unclecode
5bb0b0b378
chore: Update pip installation command and requirements for Crawl4AI
2024-05-16 20:36:29 +08:00
unclecode
4006f5f4e2
chore: Update pip installation command to use sys.executable
2024-05-16 20:24:48 +08:00
unclecode
7e0682e0de
chore: Update dependencies and installation process
2024-05-16 20:22:50 +08:00
unclecode
8e28eb9efb
Add model loader, update requirements.txt
2024-05-16 20:08:21 +08:00
unclecode
c8589f8da3
Update:
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
6a6365ae0a
Refactor code to exclude the extraction of semantic blocks of text from the HTML
2024-05-16 18:10:55 +08:00
unclecode
5b80be956d
Update:
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
UncleCode
4a2e17447b
Update README.md
2024-05-16 08:57:58 +08:00
unclecode
f6e59157bf
- Test all methods
- Update index.html
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
5fea6c064b
Improve libraries import
2024-05-13 02:46:35 +08:00
unclecode
11393183f7
Add Colab setup script.
2024-05-13 00:39:06 +08:00
unclecode
7679064521
Add model parameter for clustering.
2024-05-13 00:06:16 +08:00
unclecode
cf087cfa58
Replace embedding model with smaller one
2024-05-12 23:55:57 +08:00
unclecode
5693e324a4
Add time measurements.
2024-05-12 23:35:27 +08:00
unclecode
b38bf64490
Exclude spaCy from requirements.txt
2024-05-12 22:59:26 +08:00
unclecode
82706129f5
Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee
- Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them.
2024-05-12 14:08:22 +08:00
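The `<pre>` fix described in the commit above can be illustrated with a minimal standard-library sketch. The function name `flatten_pre_blocks` and the regex-based approach are assumptions for illustration only; the project's actual implementation is not shown in this log and may use an HTML parser instead.

```python
import re

# Matches one <pre>...</pre> block: opening tag, inner markup, closing tag.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>)(.*?)(</pre\s*>)", re.IGNORECASE | re.DOTALL)
# Matches any tag nested inside the block (e.g. highlighter <span>s).
INNER_TAG = re.compile(r"<[^>]+>")


def flatten_pre_blocks(markup: str) -> str:
    """Replace the markup inside every <pre> with its plain inner text."""

    def _flatten(match: re.Match) -> str:
        opening, inner, closing = match.groups()
        # Drop nested tags; character entities such as &lt; are left untouched.
        return opening + INNER_TAG.sub("", inner) + closing

    return PRE_BLOCK.sub(_flatten, markup)


sample = '<pre class="hl"><span>d</span><span>e</span><span>f</span> <span>f</span>():</pre>'
print(flatten_pre_blocks(sample))
```

After flattening, the code reads as ordinary words rather than one-character fragments, so a minimum word threshold applied later has contiguous text to count.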
unclecode
8e536b9717
chore: Refactor README.md and project structure
2024-05-12 12:41:42 +08:00
unclecode
aac4e07389
chore: Update README.md and project structure
2024-05-12 12:39:31 +08:00
UncleCode
e3960ace68
Update README.md
Explain more about `extract_blocks_flag`
2024-05-11 22:11:16 +08:00
UncleCode
b0f97ab2b3
Update README.md
Public server is available now
2024-05-11 08:56:19 +08:00
unclecode
372c921429
Update: Fix bug when user sets extract_blocks to False
2024-05-10 20:12:31 +08:00
ntohidi
aa126e436b
Add CORS middleware for allowing all origins to make requests
2024-05-10 12:27:40 +02:00
unclecode
20ef255c7f
Update README
2024-05-09 23:28:47 +08:00
unclecode
da7748a780
Update README file
2024-05-09 22:51:10 +08:00
unclecode
f74f4e88c0
Update README file
2024-05-09 22:48:42 +08:00
unclecode
a8e7218769
chore: Update README.md and project structure
2024-05-09 22:40:08 +08:00
unclecode
84f093593a
Update README
2024-05-09 22:37:45 +08:00
unclecode
88643612e8
chore: Update environment variable usage in config files
2024-05-09 22:37:01 +08:00
unclecode
6f99bad6f0
Update web application URL in README.md
2024-05-09 22:28:37 +08:00
unclecode
50d7a7e45d
chore: Update forced flag for single page fetch to use default value
2024-05-09 22:21:12 +08:00