feat(robots): add robots.txt compliance support

Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for RobotsParser functionality

The cache is backed by SQLite in WAL mode for better concurrency under parallel crawls, and cached entries expire after a default TTL of 7 days. A rough sketch of this approach follows.
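For illustration, here is a minimal sketch of the caching scheme described above, assuming hypothetical names (`RobotsCacheSketch`, the `robots` table, `_fetch`) that do not mirror the actual RobotsParser API in this commit. It only shows how a SQLite cache in WAL mode with a TTL can gate fetches against robots.txt:

```python
import sqlite3
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL described above

class RobotsCacheSketch:
    """Hypothetical stand-in for the commit's RobotsParser; illustrative only."""

    def __init__(self, db_path="robots_cache.db", ttl=SEVEN_DAYS):
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL mode lets readers proceed while another connection writes.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots"
            " (domain TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
        )

    def _fetch(self, domain):
        try:
            with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=5) as r:
                return r.read().decode("utf-8", errors="ignore")
        except OSError:
            return ""  # unreachable/missing robots.txt -> treat as allow-all

    def can_fetch(self, url, user_agent="*"):
        domain = urlparse(url).netloc
        row = self.conn.execute(
            "SELECT body, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None or time.time() - row[1] >= self.ttl:
            body = self._fetch(domain)  # cache miss or expired entry
            self.conn.execute(
                "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)",
                (domain, body, time.time()),
            )
            self.conn.commit()
        else:
            body = row[0]
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(body.splitlines())
        return parser.can_fetch(user_agent, url)
```

A crawler integration would then skip the request and surface a 403-style failure when `can_fetch` returns False, matching the error path shown in the diff below.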
Author: UncleCode
Date:   2025-01-21 17:54:13 +08:00
parent 9247877037
commit d09c611d15
11 changed files with 482 additions and 12 deletions

@@ -22,6 +22,7 @@ async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )
@@ -30,8 +31,10 @@ async def main():
        url="https://example.com",
        config=run_config
    )
    print(result.cleaned_html[:500])
    # Check if blocked by robots.txt
    if not result.success and result.status_code == 403:
        print(f"Error: {result.error_message}")
```
**Key Fields**:
@@ -226,6 +229,7 @@ async def main():
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules
        # Content
        word_count_threshold=10,