crawl4ai/crawl4ai/scraper at a677c2b61d1e451e4b6a80e7c7cca993ed1863c6

ayrisdev/crawl4ai

Aravind Karnam 7a5f83b76f fix: Added browser config and crawler run config from 0.4.22

2024-12-18 10:33:09 +05:30

__init__.py

Fixed a few bugs and import errors, and switched from asyncio.timeout to asyncio.wait_for to support Python versions < 3.11

2024-11-23 12:39:25 +05:30
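
The commit message for __init__.py mentions asyncio wait_for as the timeout mechanism that works on Python versions before 3.11. A minimal sketch of that pattern follows; `slow_fetch` and `fetch_with_timeout` are hypothetical names used only for illustration and are not taken from the crawl4ai code.

```python
import asyncio


async def slow_fetch(delay: float) -> str:
    # Hypothetical stand-in for a page fetch; not part of crawl4ai.
    await asyncio.sleep(delay)
    return "done"


async def fetch_with_timeout(delay: float, timeout: float) -> str:
    # asyncio.wait_for is available on Python versions before 3.11,
    # whereas the asyncio.timeout() context manager was added in 3.11.
    try:
        return await asyncio.wait_for(slow_fetch(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(fetch_with_timeout(0.01, 1.0)))
    print(asyncio.run(fetch_with_timeout(1.0, 0.01)))
```

Catching `asyncio.TimeoutError` keeps the sketch portable, since that name exists on both older and newer Python versions.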

async_web_scraper.py

Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.

2024-11-06 21:09:47 +08:00

bfs_scraper_strategy.py

fix: Added browser config and crawler run config from 0.4.22

2024-12-18 10:33:09 +05:30

filters.py

feat(scraper): Enhance URL filtering and scoring systems

2024-11-08 19:02:28 +08:00

models.py

Parallel processing with retry on failure and exponential backoff; simplified URL validation and normalisation; respects robots.txt

2024-09-19 12:34:12 +05:30
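
The commit message for models.py refers to retry on failure with exponential backoff. The general technique can be sketched as below; this is an illustration under assumptions, not the repository's implementation, and `retry_with_backoff`, `max_retries`, and `base_delay` are hypothetical names.

```python
import asyncio


async def retry_with_backoff(operation, max_retries=3, base_delay=0.5):
    # Hypothetical helper: awaits operation() and retries on any exception.
    # The wait before retry number n is base_delay * 2**n seconds,
    # so the delay doubles after each failed attempt.
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

A production version would usually retry only on specific exception types and add random jitter to the delay so that parallel workers do not retry in lockstep.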

scorers.py

feat(scraper): Enhance URL filtering and scoring systems

2024-11-08 19:02:28 +08:00

scraper_strategy.py

Updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url

2024-11-26 18:26:57 +05:30