feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:

- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
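
The caching behaviour described in this message can be sketched roughly as follows. This is an illustration only, not the RobotsParser class added by the commit: the class name RobotsCacheSketch, the table layout, and the method names are all hypothetical. The sketch keeps raw robots.txt text per domain in SQLite with WAL journaling, treats entries older than 7 days as expired, and evaluates rules with the standard library's urllib.robotparser.

```python
import sqlite3
import time
from urllib.robotparser import RobotFileParser

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, the default TTL named in the commit message


class RobotsCacheSketch:
    """Hypothetical illustration; not the RobotsParser class from this commit."""

    def __init__(self, db_path, ttl=CACHE_TTL_SECONDS):
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL journaling lets readers proceed while a writer is active.
        self.conn.execute("PRAGMA journal_mode=WAL")
        # Assumed schema: one row of raw robots.txt text per domain.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots_cache ("
            "domain TEXT PRIMARY KEY, "
            "rules TEXT NOT NULL, "
            "fetched_at REAL NOT NULL)"
        )
        self.conn.commit()

    def store(self, domain, rules, now=None):
        """Insert or refresh the cached robots.txt text for a domain."""
        fetched_at = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO robots_cache (domain, rules, fetched_at) "
            "VALUES (?, ?, ?)",
            (domain, rules, fetched_at),
        )
        self.conn.commit()

    def load(self, domain, now=None):
        """Return cached rules, or None if the entry is missing or expired."""
        current = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT rules, fetched_at FROM robots_cache WHERE domain = ?",
            (domain,),
        ).fetchone()
        if row is None or current - row[1] > self.ttl:
            return None
        return row[0]

    def can_fetch(self, domain, url, user_agent="*", now=None):
        """Return True/False from cached rules, or None on a cache miss."""
        rules = self.load(domain, now)
        if rules is None:
            return None  # caller should fetch robots.txt and call store()
        parser = RobotFileParser()
        parser.parse(rules.splitlines())
        return parser.can_fetch(user_agent, url)
```

Returning None on a cache miss leaves the network fetch to the caller, which keeps the sketch free of HTTP code. How the actual implementation fetches robots.txt and handles failures is not shown in this commit excerpt.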
5
.gitignore
vendored
@@ -227,4 +227,7 @@ tree.md
 .do
 /plans
 .codeiumignore
-todo/
+todo/
+
+# windsurf rules
+.windsurfrules