16f918621f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
UncleCode
2024-11-07 19:30:22 +08:00
f7574230a1
Update API server request object. text_docker file and Readme
UncleCode
2024-11-07 19:29:31 +08:00
3d1c9a8434
Reviewing the BFS strategy.
UncleCode
2024-11-07 18:54:53 +08:00
2879344d9c
Update README.md
devatnull
2024-11-06 17:36:46 +03:00
9f5eef1f38
Refactored the CustomHTML2Text class in content_scrapping_strategy.py to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead.
UncleCode
2024-11-06 21:50:09 +08:00
be472c624c
Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.
UncleCode
2024-11-06 21:09:47 +08:00
06b21dcc50
Update .gitignore to include new directories for issues and documentation
UncleCode
2024-11-06 18:44:03 +08:00
c5aa1bec18
Merge pull request #229 from bizrockman/main
UncleCode
2024-11-06 07:31:07 +01:00
0f0f60527d
Merge pull request #172 from aravindkarnam/scraper
UncleCode
2024-11-06 07:00:44 +01:00
11721eb0ce
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-11-05 13:02:59 +00:00
b51263664e
feat(api): add CORS support and static file serving, update root redirect
UncleCode
2024-11-05 21:02:47 +08:00
1222e456fb
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-11-05 12:58:30 +00:00
1e7db0d293
docs(README): update release notes for version 0.3.73 with new features and improvements
UncleCode
2024-11-05 20:12:20 +08:00
2a54f3c048
refactor(core): remove main_v0.py file and associated functionality
UncleCode
2024-11-05 20:11:07 +08:00
1c20b815b3
docs(README): update Docker usage instructions and add deployment options
UncleCode
2024-11-05 20:10:24 +08:00
43a2b26f63
Merge branch 'main' of https://github.com/unclecode/crawl4ai
UncleCode
2024-11-05 20:08:20 +08:00
3cf19a1bc2
chore(version): bump version to 0.3.73
0.3.73
UncleCode
2024-11-05 20:05:58 +08:00
67a23c3182
feat(core): Release v0.3.73 with Browser Takeover and Docker Support
UncleCode
2024-11-05 20:04:18 +08:00
796dbaf08c
Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md
bizrockman
2024-11-04 20:19:43 +01:00
3a3c88a2d0
Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md
bizrockman
2024-11-04 20:19:20 +01:00
870296fa7e
Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md
bizrockman
2024-11-04 20:18:58 +01:00
a28046c233
Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md
bizrockman
2024-11-04 20:18:26 +01:00
0bba0e074f
Preventing NoneType has no attribute get Errors
bizrockman
2024-11-04 20:12:24 +01:00
c4c6227962
Creating the API server component
UncleCode
2024-11-04 20:33:15 +08:00
e6c914d2fa
Refactor version management and remove deprecated gitignore.dev file
UncleCode
2024-11-04 16:51:59 +08:00
be8f4fc59a
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73
UncleCode
2024-11-04 14:12:07 +08:00
fbdf870fbf
Update CHANGELOG
unclecode
2024-11-04 14:10:27 +08:00
7b0cca41b4
Update gitignore
UncleCode
2024-11-04 13:48:26 +08:00
33d0e9ec8c
Update dev gitignore
UncleCode
2024-11-04 13:42:37 +08:00
42f1c67ca8
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73
UncleCode
2024-11-04 13:39:39 +08:00
e28c49a8fe
Refactor .gitignore.dev file: Add ignore patterns for various files and directories
UncleCode
2024-11-04 13:39:38 +08:00
54d5a3a259
Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.
unclecode
2024-11-04 13:22:13 +08:00
de6b43f334
Merge pull request #215 from mjvankampen/build/flexible-requirements
UncleCode
2024-11-03 08:30:06 +01:00
07f508bd0c
Merge pull request #218 from timoa/main
UncleCode
2024-11-03 06:59:30 +01:00
62a86dbe8d
Refactor mission section in README and add mission diagram
UncleCode
2024-10-31 16:38:56 +08:00
492ada0ed4
Add mission diagram to MISSION.md
UncleCode
2024-10-31 15:26:43 +08:00
d8eef02867
Add link to mission statement in README
UncleCode
2024-10-31 15:23:58 +08:00
6c7235d6a7
Add mission.md file
UncleCode
2024-10-31 15:22:00 +08:00
0a09d78fa5
chore(docs): fix documentation links + markdown lint
Damien Laureaux
2024-10-31 05:50:22 +01:00
e8aaa57cb2
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-10-30 12:59:34 +00:00
19c3f3efb2
Refactor tutorial markdown files: Update numbering and formatting
UncleCode
2024-10-30 20:58:07 +08:00
a661b3173d
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-10-30 12:47:07 +00:00
e97e8df6ba
Update README: Fix typo in project name
UncleCode
2024-10-30 20:45:20 +08:00
cb6f5323ae
Update README
UncleCode
2024-10-30 20:44:57 +08:00
47464cedec
Update README
UncleCode
2024-10-30 20:42:27 +08:00
982d203d91
Merge branch '0.3.73'
UncleCode
2024-10-30 20:40:09 +08:00
9307c19f35
Update documents, upload new version of quickstart.
UncleCode
2024-10-30 20:39:35 +08:00
605a82793b
fix dev requirements and lock playwright due to failing tests
Mark Jan van Kampen
2024-10-30 10:41:37 +01:00
df9ee44d42
build: make requirements more flexible
Mark Jan van Kampen
2024-10-30 10:03:22 +01:00
e9f7d5e73a
Merge branch '0.3.73'
UncleCode
2024-10-30 00:16:49 +08:00
3529c2e732
Update new tutorial documents and add them to the docs folder.
UncleCode
2024-10-30 00:16:18 +08:00
d9e0b7abab
Fix README badge
UncleCode
2024-10-28 15:14:16 +08:00
b2800fefc6
Add badges to README
UncleCode
2024-10-28 15:10:12 +08:00
d913e20edc
Update Readme
UncleCode
2024-10-28 15:09:37 +08:00
b781b6df96
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-10-27 11:42:23 +00:00
c2a71a5abe
Update Docs folder, prepare branch for new version 0.3.73
v.3.72
UncleCode
2024-10-27 19:35:13 +08:00
d61615e0b0
Merge branch '0.3.72'
UncleCode
2024-10-27 19:33:05 +08:00
ac9d83c72f
Update gitignore
main-0.3.7
UncleCode
2024-10-27 19:29:04 +08:00
ff9149b5c9
Merge branch 'main' of https://github.com/unclecode/crawl4ai
UncleCode
2024-10-27 19:28:05 +08:00
4239654722
Update Documentation
0.3.72
UncleCode
2024-10-27 19:24:46 +08:00
38474bd66a
Update version
UncleCode
2024-10-24 20:24:21 +08:00
bcfe83f702
feat: enhance crawler with overlay removal and improved screenshot capabilities
UncleCode
2024-10-24 20:22:47 +08:00
32f57c49d6
Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-directory
UncleCode
2024-10-24 13:09:27 +02:00
60ba131ac8
[v0.3.72] Enhance content extraction and proxy support
UncleCode
2024-10-22 20:19:22 +08:00
a5f627ba1a
feat: customize crawl base directory
Idris Hanafi
2024-10-21 17:58:39 -04:00
04d16e6d2b
Fix Base64 image parsing in WebScrappingStrategy (issue 182)
UncleCode
2024-10-20 19:25:25 +08:00
1dd36f9035
Refactor content scraping strategy and improve error handling
UncleCode
2024-10-20 19:11:18 +08:00
6ec4cb33ca
Enhance Markdown generation and external content control
UncleCode
2024-10-20 18:56:58 +08:00
e7cd8a1c2d
Update Changelog
UncleCode
2024-10-19 18:37:12 +08:00
4e2852d5ff
[v0.3.71] Enhance chunking strategies and improve overall performance
UncleCode
2024-10-19 18:36:59 +08:00
b309bc34e1
Fix the model name in quick start example
0.3.7
UncleCode
2024-10-18 15:32:25 +08:00
b8147b64e0
chore: Bump version to 0.3.71 and improve error handling
UncleCode
2024-10-18 13:31:12 +08:00
aab6ea022e
Update requirements and switch to 0.3.8
UncleCode
2024-10-18 12:51:23 +08:00
dd17ed0e63
Rename some flags and introduce the magic flag.
UncleCode
2024-10-18 12:35:09 +08:00
dbb587d681
Update gitignore
UncleCode
2024-10-17 21:38:48 +08:00
768aa06ceb
feat(crawler): Enhance stealth and flexibility, improve error handling
UncleCode
2024-10-17 21:37:48 +08:00
8105fd178e
Removed stubs for remove_from_future_crawls, since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url), since retry with exponential backoff (with the help of tenacity) is going to take care of it.
Aravind Karnam
2024-10-17 15:42:43 +05:30
ce7fce4b16
1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches. 2. Moved visited.add(url) to before the task is put in the queue, rather than after the crawl is completed. This prevents duplicate crawls when the same URL is found at a different depth and gets queued too, because the first crawl has not completed yet and the visited set has not been updated. 3. Renamed the yield_results attribute to stream, since that name seems to be popularly used in other AI libraries for intermediate results.
Aravind Karnam
2024-10-17 12:25:17 +05:30
de28b59aca
removed unused imports
Aravind Karnam
2024-10-16 22:36:48 +05:30
04d8b47b92
Exposed min_crawl_delay for BFSScraperStrategy
Aravind Karnam
2024-10-16 22:34:54 +05:30
2943feeecf
1. Added a flag to yield each crawl result as it becomes ready, with returning only the final scraper result as the other option. 2. Removed the ascrape_many method, as I'm currently not focusing on it in the first cut of the scraper. 3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
Aravind Karnam
2024-10-16 22:05:29 +05:30
8a7d29ce85
updated some comments and removed content type checking functionality from core as it's implemented as a filter
Aravind Karnam
2024-10-16 15:59:37 +05:30
159bd875bd
Merge pull request #5 from aravindkarnam/main
aravind
2024-10-16 10:41:22 +05:30
9ffa34b697
Update README
v0.3.6
unclecode/issue167
unclecode/issue157
unclecode
2024-10-14 22:58:27 +08:00
740802c491
Merge branch '0.3.6'
unclecode
2024-10-14 22:55:24 +08:00
b9ac96c332
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode
2024-10-14 22:54:23 +08:00
d06535388a
Update gitignore
unclecode
2024-10-14 22:53:56 +08:00
5b84ac9186
Merge branch '0.3.5' of https://github.com/unclecode/crawl4ai into 0.3.5
0.3.5
unclecode
2024-10-14 22:53:09 +08:00
7ea5603576
Update gitignore
unclecode
2024-10-14 22:52:00 +08:00
2b73bdf6b0
Update changelog
0.3.6
unclecode
2024-10-14 21:04:02 +08:00
6aa803d712
Update gitignore
unclecode
2024-10-14 21:03:40 +08:00
320afdea64
feat: Enhance crawler flexibility and LLM extraction capabilities
unclecode
2024-10-14 21:03:28 +08:00
ccbe72cfc1
Merge pull request #135 from hitesh22rana/fix/docs-example
UncleCode
2024-10-13 14:39:07 +08:00
b9bbd42373
Update Quickstart examples
unclecode
2024-10-13 14:37:45 +08:00
68e9144ce3
feat: Enhance crawling control and LLM extraction flexibility
unclecode
2024-10-12 14:48:22 +08:00
9b2b267820
CHANGELOG UPDATE
unclecode
2024-10-12 13:42:56 +08:00
ff3524d9b1
feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
unclecode
2024-10-12 13:42:42 +08:00
b99d20b725
Add pypi_build.sh to .gitignore
unclecode
2024-10-08 18:10:57 +08:00
768b93140f
docs: fixed css_selector for example
hitesh22rana
2024-10-05 00:25:41 +09:00
d743adac68
Fixed some bugs in robots.txt processing
Aravind Karnam
2024-10-03 15:58:57 +05:30
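
Commit ce7fce4b16 above describes two changes to the BFS scraper: yielding results as soon as each crawl finishes (asyncio.wait instead of gather), and marking a URL as visited when it is queued rather than when its crawl completes. The sketch below illustrates that pattern only. It is not the project's code: FAKE_SITE, fake_fetch, bfs_stream, and collect are hypothetical names, and the in-memory link graph stands in for real page fetches.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Set, Tuple

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": [],
}


async def fake_fetch(url: str) -> Tuple[str, List[str]]:
    """Stand-in for a real crawl: returns the URL and its outgoing links."""
    await asyncio.sleep(0)  # give other tasks a chance to run
    return url, FAKE_SITE.get(url, [])


async def bfs_stream(start: str, max_depth: int) -> AsyncIterator[str]:
    """Yield each crawled URL as soon as its task finishes."""
    # Mark the start URL as visited at queue time, not at completion time.
    visited: Set[str] = {start}
    pending = {asyncio.create_task(fake_fetch(start)): 0}  # task -> depth

    while pending:
        # Wake up as soon as any one task is done instead of waiting for all.
        done, _ = await asyncio.wait(
            set(pending), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            depth = pending.pop(task)
            url, links = task.result()
            yield url
            if depth >= max_depth:
                continue
            for link in links:
                # Adding to `visited` before queuing prevents a second task
                # for the same URL when it is discovered again at another depth.
                if link not in visited:
                    visited.add(link)
                    pending[asyncio.create_task(fake_fetch(link))] = depth + 1


async def collect(start: str, max_depth: int) -> List[str]:
    """Gather the streamed results into a list."""
    return [url async for url in bfs_stream(start, max_depth)]


if __name__ == "__main__":
    print(asyncio.run(collect("/", 3)))
```

In this sketch "/b" and "/c" are each linked from two pages, but each is crawled once because the visited check happens before the task is created. The order in which URLs are yielded depends on task completion order, so callers should not rely on it.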