ayrisdev / crawl4ai
crawl4ai / crawl4ai / scraper (at commit be472c624c625b5f240705112036fd5ef6f1eb8f)
Latest commit by UncleCode (be472c624c, 2024-11-06 21:09:47 +08:00):
Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.
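The commit above introduces a ScrapingProgress data class for monitoring processed and failed URLs. The repository's actual class isn't shown on this page; a minimal sketch of what such a tracker might look like (field and method names are assumptions, not the repo's API):

```python
from dataclasses import dataclass, field


@dataclass
class ScrapingProgress:
    """Hypothetical progress tracker: which URLs succeeded, which failed and why."""
    processed_urls: set = field(default_factory=set)
    failed_urls: dict = field(default_factory=dict)  # url -> error message

    def record_success(self, url: str) -> None:
        self.processed_urls.add(url)

    def record_failure(self, url: str, error: Exception) -> None:
        # Store only the message so the tracker stays easily serializable
        self.failed_urls[url] = str(error)

    @property
    def total(self) -> int:
        return len(self.processed_urls) + len(self.failed_urls)
```

A dataclass keeps the stats object cheap to construct and easy to log at the end of a crawl.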
filters/
  Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy. (2024-09-09 13:13:34 +05:30)
scorers/
  Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy. (2024-09-09 13:13:34 +05:30)
__init__.py
  Parallel processing with retry on failure using exponential backoff; simplified URL validation and normalisation; respects robots.txt. (2024-09-19 12:34:12 +05:30)
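The commit messages in this directory mention retry on failure with exponential backoff, with tenacity handling it in the actual code. A stdlib-only sketch of the same idea, so the mechanism is visible without the library (the function name and parameters are illustrative, not the repo's API):

```python
import asyncio


async def fetch_with_retry(fetch, url, max_attempts=3, base_delay=0.5):
    """Retry an async fetch with exponential backoff: base, 2x base, 4x base, ..."""
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except Exception:
            # Out of attempts: let the final error propagate to the caller
            if attempt == max_attempts - 1:
                raise
            # Double the wait after each failure to back off from a struggling host
            await asyncio.sleep(base_delay * (2 ** attempt))
```

In the repository this is delegated to tenacity's decorators, which add jitter, stop conditions, and logging hooks on top of the same core loop.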
async_web_scraper.py
  Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. (2024-11-06 21:09:47 +08:00)
bfs_scraper_strategy.py
  Removed the remove_from_future_crawls stub, since the visited set is updated as soon as a URL is queued; removed add_to_retry_queue(url), since retry with exponential backoff (via tenacity) covers it. (2024-10-17 15:42:43 +05:30)
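The bfs_scraper_strategy.py commit hinges on a BFS invariant: if a URL is marked visited at enqueue time rather than at processing time, it can never enter the frontier twice, so no separate "remove from future crawls" step is needed. A small sketch of that invariant over an in-memory link graph (names are illustrative; the real strategy crawls live pages):

```python
from collections import deque


def bfs_crawl_order(start, links):
    """Return URLs in BFS order over a {url: [linked urls]} graph.

    A URL enters `visited` the moment it is queued, not when it is
    dequeued, so duplicates are rejected before they reach the frontier.
    """
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)   # mark at enqueue time
                queue.append(nxt)
    return order
```

Marking at dequeue time instead would let the same URL be queued by several parents, which is exactly the duplicate work the removed stub was meant to clean up.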
models.py
  Parallel processing with retry on failure using exponential backoff; simplified URL validation and normalisation; respects robots.txt. (2024-09-19 12:34:12 +05:30)
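Two of the commits above mention respecting robots.txt. Python's standard library already provides the parsing side of this; a small sketch of how a scraper might gate URLs with it (the helper name is an assumption, not code from this repo):

```python
from urllib import robotparser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check whether `agent` may fetch `url` under the given robots.txt text."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # feed rules directly instead of fetching
    return rp.can_fetch(agent, url)
```

In a real crawler the robots.txt would be fetched once per host (RobotFileParser.set_url plus read does this) and the parsed rules cached, so the check costs nothing per URL.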
scraper_strategy.py
  Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches. (2024-10-17 12:25:17 +05:30)
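The scraper_strategy.py commit swaps asyncio.gather for asyncio.wait so results stream out as they complete. A minimal sketch of that pattern (fetch is simulated with sleeps; the real strategy yields crawl results):

```python
import asyncio


async def fetch(url, delay):
    # Stand-in for an HTTP request; delay simulates network latency
    await asyncio.sleep(delay)
    return url


async def scrape_as_ready(jobs):
    """Collect results in completion order using wait(FIRST_COMPLETED).

    gather() would block until every task finished and return one batch;
    this loop wakes each time any task completes, so results can be
    handled (or yielded) immediately.
    """
    pending = {asyncio.create_task(fetch(u, d)) for u, d in jobs}
    results = []
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            results.append(task.result())
    return results
```

Turning `results.append(...)` into a `yield` inside an async generator gives the streaming behaviour the commit describes.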