feat(robots): add robots.txt compliance support

Add support for checking and respecting robots.txt rules before crawling websites:

- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
parent 9247877037
commit d09c611d15
11 changed files with 482 additions and 12 deletions
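
Reviewer note: the commit message describes a SQLite-backed robots.txt cache in WAL mode with a 7-day TTL, but the `RobotsParser` source is not part of the hunks shown here. The sketch below is only an illustration of that idea, not the code added by this commit. The class name `RobotsCacheSketch`, the table layout, and the `now` parameter are all hypothetical. It uses only the standard library (`sqlite3`, `urllib.robotparser`) and skips the network fetch entirely.

```python
import sqlite3
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL named in the commit message, in seconds


class RobotsCacheSketch:
    """Hypothetical stand-in; NOT the RobotsParser class from this commit."""

    def __init__(self, db_path=":memory:", ttl=SEVEN_DAYS):
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active.
        # (It has no effect on an in-memory database, which this demo uses.)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots_cache ("
            "domain TEXT PRIMARY KEY, rules TEXT, fetched_at REAL)"
        )
        self.conn.commit()

    def put(self, domain, rules, now=None):
        """Store the raw robots.txt text for a domain with a timestamp."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO robots_cache VALUES (?, ?, ?)",
            (domain, rules, now),
        )
        self.conn.commit()

    def get(self, domain, now=None):
        """Return cached rules, or None if missing or older than the TTL."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT rules, fetched_at FROM robots_cache WHERE domain = ?",
            (domain,),
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return None
        return row[0]

    def can_fetch(self, url, user_agent="*", now=None):
        rules = self.get(urlparse(url).netloc, now)
        if rules is None:
            # A real implementation would fetch robots.txt here. This sketch
            # skips the network and allows the crawl, matching the documented
            # fallback for a robots.txt that cannot be fetched.
            return True
        parser = RobotFileParser()
        parser.parse(rules.splitlines())
        return parser.can_fetch(user_agent, url)
```

Passing `now` explicitly keeps the expiry logic testable without sleeping. A file-backed `db_path` would be needed for WAL mode to actually take effect.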
--- a/docs/md_v2/advanced/advanced-features.md
+++ b/docs/md_v2/advanced/advanced-features.md
@@ -8,6 +8,7 @@ Crawl4AI offers multiple power-user features that go beyond simple crawling. Thi
 3. **Handling SSL Certificates**  
 4. **Custom Headers**  
 5. **Session Persistence & Local Storage**
+6. **Robots.txt Compliance**

 > **Prerequisites**  
 > - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)  
@@ -251,6 +252,42 @@ You can sign in once, export the browser context, and reuse it later—without r

 ---

+## 6. Robots.txt Compliance
+
+Crawl4AI supports respecting robots.txt rules with efficient caching:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+    # Enable robots.txt checking in config
+    config = CrawlerRunConfig(
+        check_robots_txt=True  # Will check and respect robots.txt rules
+    )
+
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            "https://example.com",
+            config=config
+        )
+        
+        if not result.success and result.status_code == 403:
+            print("Access denied by robots.txt")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Key Points**
+- Robots.txt files are cached locally for efficiency
+- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
+- Cache has a default TTL of 7 days
+- If robots.txt can't be fetched, crawling is allowed
+- Returns 403 status code if URL is disallowed
+
+---
+
 ## Putting It All Together

 Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
@@ -321,6 +358,7 @@ You’ve now explored several **advanced** features:
 - **SSL Certificate** retrieval & exporting  
 - **Custom Headers** for language or specialized requests  
 - **Session Persistence** via storage state
+- **Robots.txt Compliance**

 With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.

--- a/docs/md_v2/advanced/multi-url-crawling.md
+++ b/docs/md_v2/advanced/multi-url-crawling.md
@@ -189,6 +189,44 @@ async def crawl_with_semaphore(urls):
        return results
 ```

+### 4.4 Robots.txt Consideration
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+
+async def main():
+    urls = [
+        "https://example1.com",
+        "https://example2.com",
+        "https://example3.com"
+    ]
+    
+    config = CrawlerRunConfig(
+        cache_mode=CacheMode.ENABLED,
+        check_robots_txt=True,  # Will respect robots.txt for each URL
+        semaphore_count=3      # Max concurrent requests
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        for result in await crawler.arun_many(urls, config=config):
+            if result.success:
+                print(f"Successfully crawled {result.url}")
+            elif result.status_code == 403 and "robots.txt" in result.error_message:
+                print(f"Skipped {result.url} - blocked by robots.txt")
+            else:
+                print(f"Failed to crawl {result.url}: {result.error_message}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Key Points**:
+- When `check_robots_txt=True`, each URL's robots.txt is checked before crawling
+- Robots.txt files are cached for efficiency
+- Failed robots.txt checks return 403 status code
+- Dispatcher handles robots.txt checks automatically for each URL
+
 ## 5. Dispatch Results

 Each crawl result includes dispatch information:
--- a/docs/md_v2/api/arun.md
+++ b/docs/md_v2/api/arun.md
@@ -22,6 +22,7 @@ async def main():
    run_config = CrawlerRunConfig(
        verbose=True,            # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
+        check_robots_txt=True,   # Respect robots.txt rules
        # ... other parameters
    )

@@ -30,8 +31,10 @@ async def main():
            url="https://example.com",
            config=run_config
        )
-        print(result.cleaned_html[:500])
-
+        
+        # Check if blocked by robots.txt
+        if not result.success and result.status_code == 403:
+            print(f"Error: {result.error_message}")
 ```

 **Key Fields**:
@@ -226,6 +229,7 @@ async def main():
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
+        check_robots_txt=True,   # Respect robots.txt rules
        
        # Content
        word_count_threshold=10,
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -106,6 +106,7 @@ Use these for controlling whether you read or write from a local content cache.
 | **`wait_for`**             | `str or None`           | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction.                     |
 | **`wait_for_images`**      | `bool` (False)          | Wait for images to load before finishing. Slows down if you only want text.                                          |
 | **`delay_before_return_html`** | `float` (0.1)       | Additional pause (seconds) before final HTML is captured. Good for last-second updates.                               |
+| **`check_robots_txt`**     | `bool` (False)          | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency.            |
 | **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
 | **`semaphore_count`**      | `int` (5)               | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls.                                |

@@ -266,17 +267,21 @@ async def main():

 if __name__ == "__main__":
    asyncio.run(main())
+```
+## 2.4 Compliance & Ethics
+
+| **Parameter**          | **Type / Default**      | **What It Does**                                                                                                    |
+|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
+| **`check_robots_txt`**| `bool` (False)          | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend.          |
+| **`user_agent`**      | `str` (None)            | User agent string to identify your crawler. Used for robots.txt checking when enabled.                                |
+
+```python
+run_config = CrawlerRunConfig(
+    check_robots_txt=True,  # Enable robots.txt compliance
+    user_agent="MyBot/1.0"  # Identify your crawler
+)
 ```

-**What’s Happening**:
- **`text_mode=True`** avoids loading images and other heavy resources, speeding up the crawl.  
- We disable caching (`cache_mode=CacheMode.BYPASS`) to always fetch fresh content.  
- We only keep `main.article` content by specifying `css_selector="main.article"`.  
- We exclude external links (`exclude_external_links=True`).  
- We do a quick screenshot (`screenshot=True`) before finishing.
-
---
-
 ## 3. Putting It All Together

 - **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.