Commit Graph

1109 Commits

Author SHA1 Message Date
unclecode
aead6de888 Merge branch 'main' of https://github.com/unclecode/crawl4ai into extract-media 2024-06-07 13:41:48 +08:00
UncleCode
8d82fd4cfe Merge pull request #14 from gkhngyk/main
Update README.md
2024-06-07 13:30:10 +08:00
Gökhan Geyik
8f44db6499 Update README.md 2024-06-05 17:16:02 +03:00
unclecode
c7553b1280 Update research assistant example with package installation instructions 2024-06-04 23:18:19 +08:00
unclecode
8b8683f22e Add research assistant example using Chainlit 2024-06-04 22:43:09 +08:00
unclecode
774ace6e3b Update html page for tutorial. 2024-06-02 18:00:53 +08:00
unclecode
4a8f91a0fc Set bypass_cached to True 2024-06-02 16:12:25 +08:00
unclecode
18c9784b61 Update index.html (hide extract block check box) 2024-06-02 16:09:20 +08:00
unclecode
e5d401c67c Update generated code sample 2024-06-02 16:06:43 +08:00
unclecode
ae77589a98 Update Readme 2024-06-02 15:42:13 +08:00
unclecode
ad373c0e19 Update Readme 2024-06-02 15:41:24 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some of bugs
- Resolved a few issue relevant to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
f1b60b2016 chore: Update ONNX model loading process 2024-05-31 18:07:05 +08:00
UncleCode
8c2dc2b1e4 Create Dockerfile 2024-05-29 17:56:57 +08:00
UncleCode
dc9a44c12a Update and rename Dockerfile to Dockerfile-version-0 2024-05-29 17:56:34 +08:00
UncleCode
d9753b6349 Update requirements.txt
Remove tokenizer version from requirements.txt
2024-05-24 14:49:48 +08:00
UncleCode
a554c0b143 Update requirements.txt 2024-05-23 12:52:31 +08:00
UncleCode
7381fa95e6 Merge pull request #3 from QIN2DIM/main
fix(main): UnicodeDecodeError
2024-05-23 09:29:28 +08:00
Unclecode
53d1176d53 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices 2024-05-19 16:18:58 +00:00
unclecode
52c4be0696 Update setup.py version to 0.2.1 v0.2.1 2024-05-19 22:30:59 +08:00
unclecode
13a3b21d19 - Add ONNX embedding model for CPU devices, Update the similarithy threshold, improve the embedding speed. 2024-05-19 22:30:10 +08:00
QIN2DIM
5cee084340 fix(main): UnicodeDecodeError
File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index
    partials[filename[:-5]] = file.read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence
2024-05-18 23:31:11 +08:00
Unclecode
bf00c26a83 chore: Update Dockerfile to install chromium-chromedriver and spacy library 2024-05-18 09:16:52 +00:00
unclecode
3846648c12 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch procesing for CPU devices 2024-05-18 15:42:19 +08:00
unclecode
eb6423875f chore: Update Selenium options in crawler_strategy.py and add verbose logging in CosineStrategy 2024-05-18 14:13:06 +08:00
unclecode
e3524a10a7 chore: Update REST API base URL in README.md 2024-05-17 23:28:29 +08:00
unclecode
468dad6169 chore: Update Dockerfile to install chromium-chromedriver and spacy library 2024-05-17 23:15:39 +08:00
UncleCode
bc27982992 Update setup.py Handle Spacy installation 2024-05-17 22:11:00 +08:00
UncleCode
57e5decb55 Update requirements.txt 2024-05-17 22:02:08 +08:00
unclecode
b6319c6f6e chore: Add support for GPU, MPS, and CPU 2024-05-17 21:56:13 +08:00
UncleCode
0a902f562f Update requirements.txt Add Spacy 2024-05-17 21:41:35 +08:00
UncleCode
454135856e Update extraction_strategy.py Support GPU, MPS, and CPU 2024-05-17 21:40:48 +08:00
UncleCode
33fddc27ad Update model loader to support GPU, MPS, and CPU 2024-05-17 21:39:22 +08:00
unclecode
ce052a4eb5 Update README 2024-05-17 18:29:59 +08:00
unclecode
b43d77a56b Update README 2024-05-17 18:28:39 +08:00
unclecode
1635a92218 chore: Update Crawl4AI quickstart script in README.md 2024-05-17 18:25:32 +08:00
unclecode
2a8a1b27e1 chore: Update Readme 2024-05-17 18:24:47 +08:00
unclecode
f5f3cce2c8 Merge new-release-0.0.2-no-spacy into main for v0.2.0 release v0.2.0 2024-05-17 18:23:27 +08:00
unclecode
a085e6315b Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-05-17 18:21:02 +08:00
unclecode
a8d600a3b4 chore: Add test_pad.py, requirements0.txt, and a.txt to .gitignore v0.1.0 2024-05-17 18:13:43 +08:00
unclecode
6f96dcd649 chore: Update README 2024-05-17 18:12:50 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
36e46be23d chore: Add verbose option to ExtractionStrategy classes
This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.
2024-05-17 18:06:10 +08:00
unclecode
32c87f0388 chore: Update NlpSentenceChunking constructor parameters to None
The NlpSentenceChunking constructor parameters have been updated to None in order to simplify the usage of the class. This change removes the need for specifying the SpaCy model for sentence detection, making the code more concise and easier to understand.
2024-05-17 17:00:43 +08:00
unclecode
647cfda225 chore: Update Crawl4AI quickstart script in README.md
This commit updates the Crawl4AI quickstart script in the README.md file. The script is now properly formatted and aligned, making it easier to read and understand. The unnecessary indentation has been removed, and the script is now more concise and efficient.
2024-05-17 16:55:34 +08:00
unclecode
1cc67df301 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 16:53:03 +08:00
unclecode
d7b37e849d chore: Update CrawlRequest model to use NoExtractionStrategy as default 2024-05-17 16:50:38 +08:00
unclecode
f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default 2024-05-17 16:03:35 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00