Commit Graph

66 Commits

Author SHA1 Message Date
unclecode
96d1eb0d0d Some updated ins utils.py 2024-06-26 13:03:03 +08:00
unclecode
78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed 2024-06-24 22:39:56 +08:00
unclecode
d6182bedd7 chore:
- Add demo page to the new mkdocs
- Set website home page to mkdocs
2024-06-22 20:36:01 +08:00
unclecode
3f0e265baf Merge branch 'format-inline-tags' 2024-06-19 00:48:38 +08:00
unclecode
413595542a Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24 2024-06-17 15:14:34 +08:00
unclecode
b3a0edaa6d - User agent
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
9c34b30723 Extract internal and external links. 2024-06-08 16:53:06 +08:00
unclecode
8e73a482a2 feat: Add screenshot functionality to crawl_urls
The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.

This commit message follows the established convention of starting with a type (feat for feature) and providing a concise and descriptive summary of the changes made.
2024-06-07 15:23:32 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
c8589f8da3 Update:
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
5b80be956d Update:
- Debug
- Refactor code for new version
2024-05-16 17:31:44 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.hml
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
5fea6c064b Improve libraries import 2024-05-13 02:46:35 +08:00
unclecode
82706129f5 Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee - Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them. 2024-05-12 14:08:22 +08:00
unclecode
3ff1d15702 Change the project folder name from crawler to crawl4ai 2024-05-09 22:16:28 +08:00