feat(robots): add robots.txt compliance support

Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig (see the usage sketch after this list)
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for the robots parser functionality
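
For context, a minimal usage sketch of the new flag. The AsyncWebCrawler / arun call shape follows the existing crawl4ai API; how a robots.txt block is surfaced on the result (e.g. via error_message) is an assumption here, not confirmed by this commit.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Opt in to robots.txt checks for this crawl run.
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if not result.success:
            # Assumption: a disallowed URL is skipped and the result
            # carries the failure reason instead of page content.
            print("Not crawled:", result.error_message)
        else:
            print(result.markdown[:200])

asyncio.run(main())
```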

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
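
As an illustration of that caching scheme only (not the actual RobotsParser internals; the class, table schema, and method names below are invented for the sketch):

```python
import sqlite3
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL in seconds

class RobotsCacheSketch:
    """Hypothetical cache showing WAL mode plus a TTL check."""

    def __init__(self, db_path="robots_cache.db", ttl=SEVEN_DAYS):
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL mode lets readers proceed while a writer is active,
        # which helps when many concurrent crawl tasks share the cache.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots "
            "(domain TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
        )

    def get(self, domain):
        row = self.conn.execute(
            "SELECT body, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return row[0]  # fresh entry, reuse cached robots.txt
        return None        # missing or expired: caller should refetch

    def put(self, domain, body):
        self.conn.execute(
            "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)",
            (domain, body, time.time()),
        )
        self.conn.commit()
```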
Author: UncleCode
Date:   2025-01-21 17:54:13 +08:00
Parent: 9247877037
Commit: d09c611d15

11 changed files with 482 additions and 12 deletions

.gitignore (5 changes)

@@ -227,4 +227,7 @@ tree.md
.do
/plans
.codeiumignore
todo/
todo/
# windsurf rules
.windsurfrules