aravind
159bd875bd
Merge pull request #5 from aravindkarnam/main
Merging 0.3.6
2024-10-16 10:41:22 +05:30
unclecode
9ffa34b697
Update README
v0.3.6
2024-10-14 22:58:27 +08:00
unclecode
740802c491
Merge branch '0.3.6'
2024-10-14 22:55:24 +08:00
unclecode
b9ac96c332
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-10-14 22:54:23 +08:00
unclecode
d06535388a
Update gitignore
2024-10-14 22:53:56 +08:00
unclecode
2b73bdf6b0
Update changelog
2024-10-14 21:04:02 +08:00
unclecode
6aa803d712
Update gitignore
2024-10-14 21:03:40 +08:00
unclecode
320afdea64
feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
UncleCode
ccbe72cfc1
Merge pull request #135 from hitesh22rana/fix/docs-example
docs: fixed css_selector for example
2024-10-13 14:39:07 +08:00
unclecode
b9bbd42373
Update Quickstart examples
2024-10-13 14:37:45 +08:00
unclecode
68e9144ce3
feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
unclecode
9b2b267820
CHANGELOG UPDATE
2024-10-12 13:42:56 +08:00
unclecode
ff3524d9b1
feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
b99d20b725
Add pypi_build.sh to .gitignore
2024-10-08 18:10:57 +08:00
hitesh22rana
768b93140f
docs: fixed css_selector for example
2024-10-05 00:25:41 +09:00
Aravind Karnam
d743adac68
Fixed some bugs in robots.txt processing
2024-10-03 15:58:57 +05:30
Aravind Karnam
7fe220dbd5
1. Introduced a bool flag on the ascrape method to switch between sequential and concurrent processing
2. Introduced a dictionary for depth tracking across various tasks
3. Removed the redundant crawled_urls variable; the returned object's list is now created from the visited set.
2024-10-03 11:17:11 +05:30
aravind
65e013d9d1
Merge pull request #3 from aravindkarnam/main
Merging latest changes from main branch
2024-10-03 09:52:12 +05:30
unclecode
4750810a67
Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
e0e0db4247
Bump version to 0.3.4
0.3.4
2024-09-29 17:07:52 +08:00
unclecode
bccadec887
Remove dependency on psutil, PyYaml, and extend requests version range
2024-09-29 17:07:06 +08:00
unclecode
0759503e50
Extend numpy version range to support Python 3.9
v0.3.3
2024-09-29 00:08:02 +08:00
unclecode
7f1c020746
Update README to add link to previous version in branch V0.2.76
2024-09-28 00:31:53 +08:00
unclecode
5d4e92db7d
Update quickstart_async.py to improve performance and add Firecrawl simulation
v0.3.0
2024-09-28 00:11:39 +08:00
unclecode
8b6e88c85c
Update .gitignore to ignore temporary and test directories
2024-09-26 15:09:49 +08:00
unclecode
64190dd0c4
Update README
2024-09-25 17:26:13 +08:00
unclecode
7100bcdf04
Add session based crawling documentation
2024-09-25 17:16:55 +08:00
unclecode
10cdad039d
Update documents and README
2024-09-25 16:52:11 +08:00
unclecode
f1eee09cf4
Update README, add manifest, make selenium optional library
2024-09-25 16:35:14 +08:00
unclecode
4d48bd31ca
Push async version last changes for merge to main branch
2024-09-24 20:52:08 +08:00
Aravind Karnam
7f3e2e47ed
Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt
2024-09-19 12:34:12 +05:30
aravind
78f26ac263
Merge pull request #2 from aravindkarnam/staging
Staging
2024-09-18 18:16:23 +05:30
unclecode
d628bc4034
Refactor content_scrapping_strategy.py to remove excluded tags
2024-09-12 17:35:45 +08:00
unclecode
b179aa9b6f
Refactor website content and setup.py descriptions for consistent terminology
2024-09-12 16:50:52 +08:00
unclecode
30807f5535
Remove excluded tags from website content
2024-09-12 16:11:20 +08:00
unclecode
396f430022
Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse
Refactors the AsyncCrawlerStrategy class in async_crawler_strategy.py so that the crawl and crawl_many methods return AsyncCrawlResponse instances instead of strings. AsyncCrawlResponse is a pydantic model that holds the crawled HTML, the response headers, and the status code, which makes the return values clearer and more consistent.
2024-09-12 15:49:49 +08:00
Aravind Karnam
44ce12c62c
Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy
2024-09-09 13:13:34 +05:30
unclecode
eb131bebdf
Create series of quickstart files.
2024-09-04 15:33:24 +08:00
unclecode
5c15837677
chore: Update README, generate new notebook for quickstart
2024-09-04 14:46:22 +08:00
unclecode
2fada16abb
chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy
2024-09-03 23:32:27 +08:00
unclecode
c37614cbc8
Add Async Version, JsonCss Extractor
2024-09-03 01:27:00 +08:00
unclecode
3116f95c1a
Merge branch 'pull-84' into staging
2024-09-01 16:44:06 +08:00
unclecode
b0e8b66666
Merge branch 'proxy-support' into staging
2024-09-01 16:35:14 +08:00
unclecode
3caf48c9be
refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided
2024-09-01 16:34:51 +08:00
Umut CAN
3c6ebb73ae
Update web_crawler.py
Improve code efficiency, readability, and maintainability in web_crawler.py
2024-08-30 15:30:06 +03:00
UncleCode
0d9b638636
Merge pull request #75 from aravindkarnam/main
Added support to source tags wrapped inside video and audio tags. Ext…
2024-08-30 12:54:15 +02:00
datehoer
2ba70b9501
add use proxy and llm baseurl examples
2024-08-27 10:14:54 +08:00
datehoer
16f98cebc0
replace base64 image url with ''
2024-08-27 09:44:35 +08:00
datehoer
fe9ff498ce
add proxy and add ai base_url
2024-08-26 16:12:49 +08:00
Datehoer
eba831ca30
fix spelling mistake
2024-08-26 15:29:23 +08:00