unclecode
96d1eb0d0d
Some updates in utils.py
2024-06-26 13:03:03 +08:00
unclecode
78cfad8b2f
chore: Update version to 0.2.7 and improve extraction function speed
2024-06-24 22:39:56 +08:00
unclecode
d6182bedd7
chore:
...
- Add demo page to the new mkdocs
- Set website home page to mkdocs
2024-06-22 20:36:01 +08:00
unclecode
3f0e265baf
Merge branch 'format-inline-tags'
2024-06-19 00:48:38 +08:00
unclecode
413595542a
Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24
2024-06-17 15:14:34 +08:00
unclecode
b3a0edaa6d
- User agent
...
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
9c34b30723
Extract internal and external links.
2024-06-08 16:53:06 +08:00
unclecode
8e73a482a2
feat: Add screenshot functionality to crawl_urls
...
This commit adds the `screenshot` parameter to the `crawl_urls` function in `main.py`, letting users choose whether to take a screenshot of the page during crawling. The default value is `False`.
2024-06-07 15:23:32 +08:00
unclecode
0533aeb814
v0.2.3:
...
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
c8589f8da3
Update:
...
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
5b80be956d
Update:
...
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
unclecode
f6e59157bf
- Test all methods
...
- Update index.html
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
5fea6c064b
Improve libraries import
2024-05-13 02:46:35 +08:00
unclecode
82706129f5
Update:
...
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee
- Issue resolved: Every <pre> tag's HTML content is now replaced with its inner text. This handles cases such as syntax highlighters, where each character might be wrapped in its own <span>, so the minimum word threshold no longer causes such blocks to be ignored.
2024-05-12 14:08:22 +08:00
unclecode
3ff1d15702
Change the project folder name from crawler to crawl4ai
2024-05-09 22:16:28 +08:00