7fe220dbd5
1. Introduced a bool flag on the ascrape method to switch between sequential and concurrent processing. 2. Introduced a dictionary for depth tracking across tasks. 3. Removed the redundant crawled_urls variable; instead, the returned object builds a list from the visited set variable.
Aravind Karnam
2024-10-03 11:17:11 +05:30
65e013d9d1
Merge pull request #3 from aravindkarnam/main
aravind
2024-10-03 09:52:12 +05:30
4750810a67
Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
unclecode
2024-10-02 17:34:56 +08:00
e0e0db4247
Bump version to 0.3.4
0.3.4
unclecode
2024-09-29 17:07:52 +08:00
bccadec887
Remove dependency on psutil, PyYaml, and extend requests version range
unclecode
2024-09-29 17:07:06 +08:00
0759503e50
Extend numpy version range to support Python 3.9
v0.3.3
unclecode
2024-09-29 00:08:02 +08:00
7f1c020746
Update README to add link to previous version in branch V0.2.76
unclecode
2024-09-28 00:31:53 +08:00
7afa11a02f
Update .gitignore to include test_env/ and tmp/ directories
v0.2.76
unclecode
2024-09-28 00:12:58 +08:00
5d4e92db7d
Update quickstart_async.py to improve performance and add Firecrawl simulation
v0.3.0
staging
unclecode
2024-09-28 00:11:39 +08:00
8b6e88c85c
Update .gitignore to ignore temporary and test directories
unclecode
2024-09-26 15:09:49 +08:00
64190dd0c4
Update README
unclecode
2024-09-25 17:26:13 +08:00
7100bcdf04
Add session based crawling documentation
unclecode
2024-09-25 17:16:55 +08:00
10cdad039d
Update documents and README
unclecode
2024-09-25 16:52:11 +08:00
f1eee09cf4
Update README, add manifest, make selenium optional library
unclecode
2024-09-25 16:35:14 +08:00
4d48bd31ca
Push async version last changes for merge to main branch
unclecode
2024-09-24 20:52:08 +08:00
7f3e2e47ed
Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - Respecting robots.txt
Aravind Karnam
2024-09-19 12:34:12 +05:30
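The retry-with-exponential-backoff idea named in the commit above can be sketched as follows. This is a minimal illustration, not the code from that commit: `retry_with_backoff`, `fetch_all`, and their parameters are hypothetical names, and the injectable `sleep` argument is an assumption added so the delays can be observed without waiting.

```python
import asyncio


async def retry_with_backoff(task, retries=3, base_delay=0.5, sleep=asyncio.sleep):
    """Await task() and retry on failure, doubling the delay each time.

    Hypothetical helper: `task` is any zero-argument coroutine function.
    """
    for attempt in range(retries + 1):
        try:
            return await task()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == retries:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await sleep(base_delay * 2 ** attempt)


async def fetch_all(tasks, **kwargs):
    """Run every task concurrently, each with its own retry loop."""
    return await asyncio.gather(*(retry_with_backoff(t, **kwargs) for t in tasks))
```

Each task retries independently, so one slow or failing URL does not hold up the others inside `asyncio.gather`.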
78f26ac263
Merge pull request #2 from aravindkarnam/staging
aravind
2024-09-18 18:16:23 +05:30
d628bc4034
Refactor content_scrapping_strategy.py to remove excluded tags
unclecode
2024-09-12 17:35:45 +08:00
b179aa9b6f
Refactor website content and setup.py descriptions for consistent terminology
unclecode
2024-09-12 16:50:52 +08:00
30807f5535
Remove excluded tags from website content
unclecode
2024-09-12 16:11:20 +08:00
396f430022
Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse
unclecode
2024-09-12 15:49:49 +08:00
44ce12c62c
Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy
Aravind Karnam
2024-09-09 13:13:34 +05:30
eb131bebdf
Create series of quickstart files.
unclecode
2024-09-04 15:33:24 +08:00
5c15837677
chore: Update README, generate new notebook for quickstart
unclecode
2024-09-04 14:46:22 +08:00
2fada16abb
chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy
unclecode
2024-09-03 23:32:27 +08:00
c37614cbc8
Add Async Version, JsonCss Extractor
unclecode
2024-09-03 01:27:00 +08:00
3116f95c1a
Merge branch 'pull-84' into staging
unclecode
2024-09-01 16:44:06 +08:00
b0e8b66666
Merge branch 'proxy-support' into staging
unclecode
2024-09-01 16:35:14 +08:00
3caf48c9be
refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided
proxy-support
unclecode
2024-09-01 16:34:51 +08:00
3c6ebb73ae
Update web_crawler.py
pull-84
Umut CAN
2024-08-30 15:30:06 +03:00
0d9b638636
Merge pull request #75 from aravindkarnam/main
UncleCode
2024-08-30 12:54:15 +02:00
2ba70b9501
add proxy usage and LLM base URL examples
datehoer
2024-08-27 10:14:54 +08:00
16f98cebc0
replace base64 image URL with ''
datehoer
2024-08-27 09:44:35 +08:00
fe9ff498ce
add proxy and add ai base_url
datehoer
2024-08-26 16:12:49 +08:00
eba831ca30
fix spelling mistake
Datehoer
2024-08-26 15:29:23 +08:00
dec3d44224
refactor: Update extraction strategy to handle schema extraction with non-empty schema
unclecode
2024-08-19 15:37:07 +08:00
9ed1551125
Added support for source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71
Aravind Karnam
2024-08-14 10:59:49 +05:30
14e537fdd3
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-08-04 06:57:16 +00:00
e5e6a34e80
## [v0.2.77] - 2024-08-04
v0.2.77
unclecode
2024-08-04 14:54:18 +08:00
64b33af0e0
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-08-02 08:04:54 +00:00
897e766728
Update README
unclecode
2024-08-02 16:04:14 +08:00
9200a6731d
## [v0.2.76] - 2024-08-02
unclecode
2024-08-02 16:02:42 +08:00
61c166ab19
refactor: Update Crawl4AI version to v0.2.76
main-75
unclecode
2024-08-02 15:55:53 +08:00
659c8cd953
refactor: Update image description minimum word threshold in get_content_of_website_optimized
unclecode
2024-08-02 15:55:32 +08:00
9ee988753d
refactor: Update image description minimum word threshold in get_content_of_website_optimized
unclecode
2024-08-02 14:53:11 +08:00
8ae6c43ca4
refactor: Update Dockerfile to install Crawl4AI with specified options
unclecode
2024-08-01 20:13:06 +08:00
b6713870ef
refactor: Update Dockerfile to install Crawl4AI with specified options
unclecode
2024-08-01 17:56:19 +08:00
40477493d3
refactor: Remove image format dot in get_content_of_website_optimized
unclecode
2024-07-31 16:15:55 +08:00
efcf3ac6eb
Update LocalSeleniumCrawlerStrategy to resolve ChromeDriver version mismatch issue
Kevin Moturi
2024-07-27 06:11:57 -05:00
9e43f7beda
refactor: Temporarily disable fetching image file size in get_content_of_website_optimized
unclecode
2024-07-31 13:29:23 +08:00
aa9412e1b4
refactor: Set image_size to 0 in get_content_of_website_optimized
unclecode
2024-07-23 13:08:53 +08:00
cf6c835e18
moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.
image-description
Aravind Karnam
2024-07-21 15:18:23 +05:30
e5ecf291f3
Implemented filtering for images and grabbing the contextual text from nearest parent
Aravind Karnam
2024-07-21 15:03:17 +05:30
9d0cafcfa6
fixed import error in model_loader.py
Aravind Karnam
2024-07-21 14:55:58 +05:30
7715623430
chore: Fix typos and update .gitignore
v0.0.75
unclecode
2024-07-19 17:42:39 +08:00
f5a4e80e2c
chore: Fix typo in chunking_strategy.py and crawler_strategy.py
unclecode
2024-07-19 17:40:31 +08:00
8463aabedf
chore: Remove .test_pads/ directory from .gitignore
main-img-captionify
unclecode
2024-07-19 17:09:29 +08:00
7f30144ef2
chore: Remove .tests/ directory from .gitignore
unclecode
2024-07-09 15:10:18 +08:00
fa5516aad6
chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner.
unclecode
2024-07-09 13:25:00 +08:00
1afcdb6996
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-07-08 12:24:13 +00:00
ca0336af9e
feat: Add error handling for rate limit exceeded in form submission
unclecode
2024-07-08 20:24:00 +08:00
ca625b3152
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-07-08 12:02:19 +00:00
65ed1aeade
feat: Add rate limiting functionality with custom handlers
unclecode
2024-07-08 20:02:12 +08:00
6521b4745f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-07-08 08:35:49 +00:00
4d283ab386
## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉
v0.2.74
unclecode
2024-07-08 16:33:25 +08:00
2101540819
chore: Update version to 0.2.74 in setup.py
v0.2.74
unclecode
2024-07-08 16:30:28 +08:00
9d98393606
Prepare branch for release 0.2.74
unclecode
2024-07-08 16:30:14 +08:00
6f99368744
Add UTF encoding to resolve the Windows machine "charmap" error.
unclecode
2024-07-08 16:18:07 +08:00
ea2f83ac10
feat: Add delay after fetching URL in crawler hooks
unclecode
2024-07-08 15:59:59 +08:00
7f41ff4a74
The after_get_url hook is executed after getting the URL, allowing for further customization.
unclecode
2024-07-06 14:28:01 +08:00
236bdb4035
feat: Add MaxRetryError exception handling in LocalSeleniumCrawlerStrategy
unclecode
2024-07-06 14:08:30 +08:00
1368248254
feat: Sanitize input and handle encoding issues in LLMExtractionStrategy
unclecode
2024-07-05 17:59:26 +08:00
b0ec54b9e9
feat: Sanitize input and handle encoding issues in LLMExtractionStrategy
unclecode
2024-07-05 17:37:25 +08:00
fb6ed5f000
feat: Sanitize input and handle encoding issues in LLMExtractionStrategy
unclecode
2024-07-05 17:30:58 +08:00
597fe8bdb7
chore: Delete existing database file and initialize new database
unclecode
2024-07-05 17:04:57 +08:00
241862bfe6
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-07-03 07:27:37 +00:00
3ff2a0d0e7
Merge branch 'main' of https://github.com/unclecode/crawl4ai
v0.2.73
unclecode
2024-07-03 15:26:47 +08:00
3cd1b3719f
Bump version to v0.2.73, update documentation, and resolve installation issues
unclecode
2024-07-03 15:26:43 +08:00
9926eb9f95
feat: Bump version to v0.2.73 and update documentation
unclecode
2024-07-03 15:19:22 +08:00
3abaa82501
Merge pull request #37 from shivkumar0757/fix-readme-encoding
UncleCode
2024-07-01 07:31:07 +02:00
88d8cd8650
feat: Add page load check for LocalSeleniumCrawlerStrategy
unclecode
2024-07-01 00:07:32 +08:00
a08f21d66c
Fix UnicodeDecodeError by reading README.md with UTF-8 encoding
shiv
2024-06-30 20:27:33 +05:30
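The fix described in the commit above comes down to passing an explicit encoding when reading the file. The sketch below assumes a helper named `read_long_description`, which is hypothetical and not taken from the repository's setup.py.

```python
from pathlib import Path


def read_long_description(path="README.md"):
    """Return the README text, decoded as UTF-8.

    Without an explicit encoding, Python falls back to the locale default
    (often cp1252 on Windows), which can raise UnicodeDecodeError on
    non-ASCII characters such as emoji.
    """
    return Path(path).read_text(encoding="utf-8")
```

The same explicit `encoding="utf-8"` argument works with the built-in `open()` if a file handle is preferred.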
f2491b6c1a
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-06-29 16:34:15 +00:00
d58286989c
UPDATE DOCUMENTS
unclecode
2024-06-30 00:34:02 +08:00
886622cb1e
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-06-29 16:23:44 +00:00
b58af3349c
chore: Update installation instructions with support for different modes
v0.2.72
unclecode
2024-06-30 00:22:17 +08:00
940df4631f
Update ChangeLog
unclecode
2024-06-30 00:18:40 +08:00
685706e0aa
Update version, and change log
main-v0.2.72
unclecode
2024-06-30 00:17:43 +08:00
7b0979e134
Update README and Dockerfile
unclecode
2024-06-30 00:15:43 +08:00
61ae2de841
1. Update setup.py to support the following modes: default (most frequent mode), torch, transformers, all. 2. Update Dockerfile. 3. Update documentation as well.
unclecode
2024-06-30 00:15:29 +08:00
5b28eed2c0
Add a temporary solution for when we can't crawl websites in headless mode.
unclecode
2024-06-29 23:25:50 +08:00
f8a11779fe
Update change log
unclecode
2024-06-26 16:48:36 +08:00
13dc254438
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-06-26 07:35:06 +00:00
d11a83c232
## [0.2.71] 2024-06-26 • Refactored crawler_strategy.py to handle exceptions and improve error messages • Improved get_content_of_website_optimized function in utils.py for better performance • Updated utils.py with latest changes • Migrated to ChromeDriverManager for resolving Chrome driver download issues
v0.2.71
main-1
unclecode
2024-06-26 15:34:15 +08:00
3255c7a3fa
Update CHANGELOG.md with recent commits
unclecode
2024-06-26 15:20:34 +08:00
4756d0a532
Refactor crawler_strategy.py to handle exceptions and improve error messages
unclecode
2024-06-26 15:04:33 +08:00
7ba2142363
chore: Refactor get_content_of_website_optimized function in utils.py
unclecode
2024-06-26 14:43:09 +08:00
096929153f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
Unclecode
2024-06-26 05:45:25 +00:00
96d1eb0d0d
Some updates in utils.py
image-filterizer
unclecode
2024-06-26 13:03:03 +08:00
144cfa0eda
Switch to ChromeDriverManager due to some issues with downloading the Chrome driver
unclecode
2024-06-26 13:00:17 +08:00