Commit Graph

268 Commits

Author SHA1 Message Date
UncleCode
d21ffad3a2 chore(git): update gitignore patterns
Add new development and tooling related patterns to gitignore:
- Add Next.js build directory (.next/)
- Add various script and documentation files
- Add local development directories (.local, .scripts, .do)
- Add tool-specific files (.codeiumignore, .windsurfrules)

Removes duplicate entries and organizes patterns more clearly.
2025-01-22 17:22:26 +08:00
UncleCode
06b21dcc50 Update .gitignore to include new directories for issues and documentation 2024-11-06 18:44:03 +08:00
UncleCode
0f0f60527d Merge pull request #172 from aravindkarnam/scraper
Scraper
2024-11-06 07:00:44 +01:00
Aravind Karnam
8105fd178e Removed stubs for remove_from_future_crawls since the visited set is updated soon as the URL was queued, Removed add_to_retry_queue(url) since retry with exponential backoff with help of tenacity is going to take care of it. 2024-10-17 15:42:43 +05:30
Aravind Karnam
ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yeilded just as they are ready, rather than in batches
2. Moved the visted.add(url), to before the task is put in queue rather than after the crawl is completed. This makes sure that  duplicate crawls doesn't happen when same URL is found at different depth and that get's queued too because the crawl is not yet completed and visted set is not updated.
3. Named the yield_results attribute to stream instead. Since that seems to be popularly used in all other AI libraries for intermediate results.
2024-10-17 12:25:17 +05:30
Aravind Karnam
de28b59aca removed unused imports 2024-10-16 22:36:48 +05:30
Aravind Karnam
04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy 2024-10-16 22:34:54 +05:30
Aravind Karnam
2943feeecf 1. Added a flag to yield each crawl result,as they become ready along with the final scraper result as another option
2. Removed ascrape_many method, as I'm currently not focusing on it in the first cut of scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
2024-10-16 22:05:29 +05:30
Aravind Karnam
8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter 2024-10-16 15:59:37 +05:30
aravind
159bd875bd Merge pull request #5 from aravindkarnam/main
Merging 0.3.6
2024-10-16 10:41:22 +05:30
unclecode
9ffa34b697 Update README v0.3.6 2024-10-14 22:58:27 +08:00
unclecode
740802c491 Merge branch '0.3.6' 2024-10-14 22:55:24 +08:00
unclecode
b9ac96c332 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-10-14 22:54:23 +08:00
unclecode
d06535388a Update gitignore 2024-10-14 22:53:56 +08:00
unclecode
2b73bdf6b0 Update changelog 2024-10-14 21:04:02 +08:00
unclecode
6aa803d712 Update gitignore 2024-10-14 21:03:40 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
UncleCode
ccbe72cfc1 Merge pull request #135 from hitesh22rana/fix/docs-example
docs: fixed css_selector for example
2024-10-13 14:39:07 +08:00
unclecode
b9bbd42373 Update Quickstart examples 2024-10-13 14:37:45 +08:00
unclecode
68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
unclecode
9b2b267820 CHANGELOG UPDATE 2024-10-12 13:42:56 +08:00
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
b99d20b725 Add pypi_build.sh to .gitignore 2024-10-08 18:10:57 +08:00
hitesh22rana
768b93140f docs: fixed css_selector for example 2024-10-05 00:25:41 +09:00
Aravind Karnam
d743adac68 Fixed some bugs in robots.txt processing 2024-10-03 15:58:57 +05:30
Aravind Karnam
7fe220dbd5 1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing
2. Introduced a dictionary for depth tracking across various tasks
3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object.
2024-10-03 11:17:11 +05:30
aravind
65e013d9d1 Merge pull request #3 from aravindkarnam/main
Merging latest changes from main branch
2024-10-03 09:52:12 +05:30
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
e0e0db4247 Bump version to 0.3.4 0.3.4 2024-09-29 17:07:52 +08:00
unclecode
bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range 2024-09-29 17:07:06 +08:00
unclecode
0759503e50 Extend numpy version range to support Python 3.9 v0.3.3 2024-09-29 00:08:02 +08:00
unclecode
7f1c020746 Update README to add link to previous version in branch V0.2.76 2024-09-28 00:31:53 +08:00
unclecode
5d4e92db7d Update quickstart_async.py to improve performance and add Firecrawl simulation v0.3.0 2024-09-28 00:11:39 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
64190dd0c4 Update README 2024-09-25 17:26:13 +08:00
unclecode
7100bcdf04 Add session based crawling documentation 2024-09-25 17:16:55 +08:00
unclecode
10cdad039d Update documents and README 2024-09-25 16:52:11 +08:00
unclecode
f1eee09cf4 Update README, add manifest, make selenium optional library 2024-09-25 16:35:14 +08:00
unclecode
4d48bd31ca Push async version last changes for merge to main branch 2024-09-24 20:52:08 +08:00
Aravind Karnam
7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt 2024-09-19 12:34:12 +05:30
aravind
78f26ac263 Merge pull request #2 from aravindkarnam/staging
Staging
2024-09-18 18:16:23 +05:30
unclecode
d628bc4034 Refactor content_scrapping_strategy.py to remove excluded tags 2024-09-12 17:35:45 +08:00
unclecode
b179aa9b6f Refactor website content and setup.py descriptions for consistent terminology 2024-09-12 16:50:52 +08:00
unclecode
30807f5535 Remove excluded tags from website content 2024-09-12 16:11:20 +08:00
unclecode
396f430022 Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse
This commit refactors the AsyncCrawlerStrategy class in the async_crawler_strategy.py file to modify the return types of the crawl and crawl_many methods. Instead of returning strings, these methods now return instances of the AsyncCrawlResponse class from the pydantic module. The AsyncCrawlResponse class contains the crawled HTML, response headers, and status code. This change improves the clarity and consistency of the code.
2024-09-12 15:49:49 +08:00
Aravind Karnam
44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy 2024-09-09 13:13:34 +05:30
unclecode
eb131bebdf Create series of quickstart files. 2024-09-04 15:33:24 +08:00
unclecode
5c15837677 chore: Update README, generate new notbook for quickstart 2024-09-04 14:46:22 +08:00
unclecode
2fada16abb chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy 2024-09-03 23:32:27 +08:00
unclecode
c37614cbc8 Add Async Version, JsonCss Extrator 2024-09-03 01:27:00 +08:00