237 Commits

Author SHA1 Message Date
unclecode
4a50781453 chore: Remove local and .files folders from .gitignore 2024-06-17 15:57:34 +08:00
unclecode
18561c55ce Remove .files folder from repository 2024-06-17 15:56:56 +08:00
unclecode
77da48050d chore: Add custom headers to LocalSeleniumCrawlerStrategy 2024-06-17 15:50:03 +08:00
unclecode
9a97aacd85 chore: Add hooks for customizing the LocalSeleniumCrawlerStrategy 2024-06-17 15:37:18 +08:00
unclecode
52daf3936a Fix typo in README 2024-06-17 15:15:37 +08:00
unclecode
2f246d19f4 Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #45 2024-06-17 15:14:56 +08:00
unclecode
413595542a Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24 2024-06-17 15:14:34 +08:00
unclecode
42a5da854d Update version and change log. v0.2.4 2024-06-17 14:47:58 +08:00
unclecode
d1d83a6ef7 Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs 2024-06-17 14:44:01 +08:00
unclecode
194050705d chore: Add pillow library to requirements.txt 2024-06-10 23:03:32 +08:00
unclecode
989f8c91c8 Update README 2024-06-08 18:50:35 +08:00
unclecode
edba5fb5e9 Update README 2024-06-08 18:48:21 +08:00
unclecode
faa1defa5c Update README 2024-06-08 18:47:23 +08:00
unclecode
f7e0cee1b0 vital: Right now, only raw HTML is retrieved from the database, so the CSS selector and other filters are executed every time. 2024-06-08 18:37:40 +08:00
unclecode
b3a0edaa6d - User agent
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
9c34b30723 Extract internal and external links. 2024-06-08 16:53:06 +08:00
unclecode
36a5847df5 Add css selector example 2024-06-07 20:47:20 +08:00
unclecode
a19379aa58 Add recipe images, update README, and REST api example 2024-06-07 20:43:50 +08:00
unclecode
768d048e1c Update REST call usage instructions 2024-06-07 18:10:45 +08:00
unclecode
94c11a0262 Add image 2024-06-07 18:09:21 +08:00
unclecode
649b0bfd02 feat: Remove default checked state for bypass-cache-checkbox
The code changes in this commit remove the default checked state for the bypass-cache-checkbox in the try_it.html file. This allows users to manually select whether they want to bypass the cache or not.
2024-06-07 16:26:36 +08:00
unclecode
57a00ec677 Update Readme 2024-06-07 16:25:30 +08:00
unclecode
aeb2114170 Add example of REST API call 2024-06-07 16:24:40 +08:00
unclecode
b8d405fddd Update version number in landing page header 2024-06-07 16:19:30 +08:00
unclecode
b32013cb97 Fix README file hyperlink 2024-06-07 15:37:05 +08:00
unclecode
226a62a3c0 feat: Add screenshot functionality to crawl_urls 2024-06-07 15:33:15 +08:00
unclecode
8e73a482a2 feat: Add screenshot functionality to crawl_urls
The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.
2024-06-07 15:23:32 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
aead6de888 Merge branch 'main' of https://github.com/unclecode/crawl4ai into extract-media 2024-06-07 13:41:48 +08:00
UncleCode
8d82fd4cfe Merge pull request #14 from gkhngyk/main
Update README.md
2024-06-07 13:30:10 +08:00
Gökhan Geyik
8f44db6499 Update README.md 2024-06-05 17:16:02 +03:00
unclecode
c7553b1280 Update research assistant example with package installation instructions 2024-06-04 23:18:19 +08:00
unclecode
8b8683f22e Add research assistant example using Chainlit 2024-06-04 22:43:09 +08:00
unclecode
774ace6e3b Update html page for tutorial. 2024-06-02 18:00:53 +08:00
unclecode
4a8f91a0fc Set bypass_cached to True 2024-06-02 16:12:25 +08:00
unclecode
18c9784b61 Update index.html (hide extract block check box) 2024-06-02 16:09:20 +08:00
unclecode
e5d401c67c Update generated code sample 2024-06-02 16:06:43 +08:00
unclecode
ae77589a98 Update Readme 2024-06-02 15:42:13 +08:00
unclecode
ad373c0e19 Update Readme 2024-06-02 15:41:24 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some bugs
- Resolved a few issues related to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
f1b60b2016 chore: Update ONNX model loading process 2024-05-31 18:07:05 +08:00
UncleCode
8c2dc2b1e4 Create Dockerfile 2024-05-29 17:56:57 +08:00
UncleCode
dc9a44c12a Update and rename Dockerfile to Dockerfile-version-0 2024-05-29 17:56:34 +08:00
UncleCode
d9753b6349 Update requirements.txt
Remove tokenizer version from requirements.txt
2024-05-24 14:49:48 +08:00
UncleCode
a554c0b143 Update requirements.txt 2024-05-23 12:52:31 +08:00
UncleCode
7381fa95e6 Merge pull request #3 from QIN2DIM/main
fix(main): UnicodeDecodeError
2024-05-23 09:29:28 +08:00
Unclecode
53d1176d53 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices 2024-05-19 16:18:58 +00:00
unclecode
52c4be0696 Update setup.py version to 0.2.1 v0.2.1 2024-05-19 22:30:59 +08:00
unclecode
13a3b21d19 - Add ONNX embedding model for CPU devices, update the similarity threshold, improve the embedding speed. 2024-05-19 22:30:10 +08:00
QIN2DIM
5cee084340 fix(main): UnicodeDecodeError
File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index
    partials[filename[:-5]] = file.read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence
2024-05-18 23:31:11 +08:00