feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:

- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
@@ -1,3 +1,9 @@
### [Added] 2025-01-21
- Added robots.txt compliance support with efficient SQLite-based caching
- New `check_robots_txt` parameter in CrawlerRunConfig to enable robots.txt checking
- Documentation updates for robots.txt compliance features and examples
- Automated robots.txt checking integrated into AsyncWebCrawler with 403 status codes for blocked URLs
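
The caching behavior described in this entry can be illustrated with a minimal standard-library sketch. It is not the project's `RobotsParser`: the class name `RobotsCacheSketch`, the table layout, and the injected `fetch` callable are all assumptions made for illustration. Only the WAL journal mode and the 7-day default TTL come from the commit message.

```python
import sqlite3
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL from the commit message, in seconds


class RobotsCacheSketch:
    """Hypothetical stand-in for RobotsParser; not the project's actual code."""

    def __init__(self, db_path, fetch, ttl=SEVEN_DAYS):
        self.fetch = fetch  # assumed callable: robots.txt URL -> body text
        self.ttl = ttl
        self.db = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active.
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS robots ("
            "domain TEXT PRIMARY KEY, rules TEXT, fetched_at REAL)"
        )
        self.db.commit()

    def _rules_for(self, url):
        parts = urlparse(url)
        domain = parts.netloc
        row = self.db.execute(
            "SELECT rules, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        # Serve from the cache while the entry is younger than the TTL.
        if row and time.time() - row[1] < self.ttl:
            return row[0]
        # Otherwise fetch again and overwrite the cached entry.
        rules = self.fetch(f"{parts.scheme}://{domain}/robots.txt")
        self.db.execute(
            "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)",
            (domain, rules, time.time()),
        )
        self.db.commit()
        return rules

    def can_fetch(self, url, user_agent="*"):
        parser = RobotFileParser()
        parser.parse(self._rules_for(url).splitlines())
        return parser.can_fetch(user_agent, url)
```

In this sketch the caller supplies `fetch`, so the cache can be exercised without network access. A crawler built on it would call `can_fetch` before requesting a page and report a 403 status code for a disallowed URL, as the entry above describes.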

### [Added] 2025-01-20
- Added proxy configuration support to CrawlerRunConfig allowing dynamic proxy settings per crawl request
- Updated documentation with examples for using proxy configuration in crawl operations