feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:

- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robots parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
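The RobotsParser class itself is not part of the documentation diff below. As a rough, illustrative sketch of the caching approach the message describes (SQLite storage, WAL journal mode, 7-day TTL), with table and column names that are assumptions rather than the actual schema; WAL mode is called out because it lets concurrent crawler tasks read the cache while another task writes a freshly fetched robots.txt:

```python
import sqlite3
import time
from pathlib import Path

CACHE_PATH = Path.home() / ".crawl4ai" / "robots" / "robots_cache.db"
TTL_SECONDS = 7 * 24 * 3600  # default 7-day TTL mentioned in the commit message

def _connect() -> sqlite3.Connection:
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(CACHE_PATH))
    conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS robots_cache ("
        "domain TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
    )
    return conn

def get_cached_robots(domain: str):
    """Return the cached robots.txt body for a domain, or None if missing or expired."""
    with _connect() as conn:
        row = conn.execute(
            "SELECT body, fetched_at FROM robots_cache WHERE domain = ?", (domain,)
        ).fetchone()
    if row and time.time() - row[1] < TTL_SECONDS:
        return row[0]
    return None
```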
@@ -8,6 +8,7 @@ Crawl4AI offers multiple power-user features that go beyond simple crawling. Thi
3. **Handling SSL Certificates**
4. **Custom Headers**
5. **Session Persistence & Local Storage**
6. **Robots.txt Compliance**

> **Prerequisites**
> - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
@@ -251,6 +252,42 @@ You can sign in once, export the browser context, and reuse it later—without r
---

## 6. Robots.txt Compliance

Crawl4AI supports respecting robots.txt rules with efficient caching:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )

        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Points**

- Robots.txt files are cached locally for efficiency
- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
- Cache has a default TTL of 7 days
- If robots.txt can't be fetched, crawling is allowed
- Returns 403 status code if URL is disallowed
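If cached rules need to be discarded before the 7-day TTL expires (for example, after a site updates its robots.txt), deleting the cache file at the path above forces fresh fetches on the next run. A minimal helper, assuming only the documented cache location:

```python
from pathlib import Path

# Cache location documented above; deleting it forces fresh robots.txt fetches.
ROBOTS_CACHE = Path.home() / ".crawl4ai" / "robots" / "robots_cache.db"

def clear_robots_cache() -> bool:
    """Remove the local robots.txt cache if it exists."""
    if ROBOTS_CACHE.exists():
        ROBOTS_CACHE.unlink()
        return True
    return False
```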
---
## Putting It All Together

Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
@@ -321,6 +358,7 @@ You’ve now explored several **advanced** features:
- **SSL Certificate** retrieval & exporting
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
- **Robots.txt Compliance**
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
@@ -189,6 +189,44 @@ async def crawl_with_semaphore(urls):
    return results
```
### 4.4 Robots.txt Consideration
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Will respect robots.txt for each URL
        semaphore_count=3       # Max concurrent requests
    )

    async with AsyncWebCrawler() as crawler:
        async for result in crawler.arun_many(urls, config=config):
            if result.success:
                print(f"Successfully crawled {result.url}")
            elif result.status_code == 403 and "robots.txt" in result.error_message:
                print(f"Skipped {result.url} - blocked by robots.txt")
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Points**:

- When `check_robots_txt=True`, each URL's robots.txt is checked before crawling
- Robots.txt files are cached for efficiency
- Failed robots.txt checks return 403 status code
- Dispatcher handles robots.txt checks automatically for each URL
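For intuition about how a disallow rule translates into a blocked URL, the snippet below uses Python's standard-library `urllib.robotparser` on a sample robots.txt. Crawl4AI's own parser and caching layer are separate; this is only a self-contained illustration of the rule evaluation itself:

```python
from urllib import robotparser

# A sample robots.txt that disallows one path for all user agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```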
## 5. Dispatch Results
Each crawl result includes dispatch information: