feat(robots): add robots.txt compliance support

Add support for checking and respecting robots.txt rules before crawling websites:

- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
parent 9247877037
commit d09c611d15
11 changed files with 482 additions and 12 deletions
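
The caching behaviour described in the commit message (SQLite storage, WAL mode, 7-day default TTL) can be sketched roughly as follows. This is an illustration only, not the `RobotsParser` added by this commit: the class name `RobotsCacheSketch`, the `robots_cache` table layout, and the injected `fetch_rules` callable are all assumptions made for the example, and rule matching is delegated to the standard library's `urllib.robotparser`.

```python
import sqlite3
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL from the commit message, in seconds


class RobotsCacheSketch:
    """Hypothetical stand-in for the commit's RobotsParser; not its real API."""

    def __init__(self, db_path, fetch_rules, ttl=SEVEN_DAYS):
        self.ttl = ttl
        # Assumed hook: a callable mapping a domain to its robots.txt text.
        self.fetch_rules = fetch_rules
        self.db = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active (file-backed DBs only).
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS robots_cache ("
            "domain TEXT PRIMARY KEY, rules TEXT NOT NULL, fetched_at REAL NOT NULL)"
        )
        self.db.commit()

    def _rules_for(self, domain):
        row = self.db.execute(
            "SELECT rules, fetched_at FROM robots_cache WHERE domain = ?", (domain,)
        ).fetchone()
        # Serve from the cache while the entry is younger than the TTL.
        if row and time.time() - row[1] < self.ttl:
            return row[0]
        # Missing or expired: fetch again and overwrite the cached row.
        rules = self.fetch_rules(domain)
        self.db.execute(
            "INSERT OR REPLACE INTO robots_cache VALUES (?, ?, ?)",
            (domain, rules, time.time()),
        )
        self.db.commit()
        return rules

    def can_fetch(self, url, user_agent="*"):
        parser = RobotFileParser()
        parser.parse(self._rules_for(urlparse(url).netloc).splitlines())
        return parser.can_fetch(user_agent, url)
```

Passing the fetcher in as a callable keeps the sketch free of network access, so the cache and TTL logic can be exercised in isolation. A caller that gets `False` from `can_fetch` would then skip the request, which is where the 403 status mentioned in the changelog would come in.
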
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,9 @@
+### [Added] 2025-01-21
+- Added robots.txt compliance support with efficient SQLite-based caching
+- New `check_robots_txt` parameter in CrawlerRunConfig to enable robots.txt checking
+- Documentation updates for robots.txt compliance features and examples
+- Automated robots.txt checking integrated into AsyncWebCrawler with 403 status codes for blocked URLs
+
 ### [Added] 2025-01-20
 - Added proxy configuration support to CrawlerRunConfig allowing dynamic proxy settings per crawl request
 - Updated documentation with examples for using proxy configuration in crawl operations