Files
crawl4ai/docs/scraper/bfs_scraper_strategy.md

245 lines
6.8 KiB
Markdown

# BFS Scraper Strategy: Smart Web Traversal
The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
```mermaid
flowchart TB
Start([Start]) --> Init[Initialize BFS Strategy]
Init --> InitStats[Initialize CrawlStats]
InitStats --> InitQueue[Initialize Priority Queue]
InitQueue --> AddStart[Add Start URL to Queue]
AddStart --> CheckState{Queue Empty or\nTasks Pending?}
CheckState -->|No| Cleanup[Cleanup & Stats]
Cleanup --> End([End])
CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
CheckCancel -->|Yes| Cleanup
CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
WaitComplete --> YieldResult[Yield Result]
YieldResult --> CheckState
CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
GetNextURL --> ValidateURL{Already\nVisited?}
ValidateURL -->|Yes| CheckState
ValidateURL -->|No| ProcessURL[Process URL]
subgraph URL_Processing [URL Processing]
ProcessURL --> CheckValid{URL Valid?}
CheckValid -->|No| UpdateStats[Update Skip Stats]
CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
FetchContent --> CheckError{Error?}
CheckError -->|Yes| Retry{Retry\nNeeded?}
Retry -->|Yes| FetchContent
Retry -->|No| UpdateFailStats[Update Fail Stats]
CheckError -->|No| ExtractLinks[Extract & Process Links]
ExtractLinks --> ScoreURLs[Score New URLs]
ScoreURLs --> AddToQueue[Add to Priority Queue]
end
ProcessURL --> CreateTask{Parallel\nProcessing?}
CreateTask -->|Yes| AddTask[Add to Pending Tasks]
CreateTask -->|No| DirectProcess[Process Directly]
AddTask --> CheckState
DirectProcess --> YieldResult
UpdateStats --> CheckState
UpdateRobotStats --> CheckState
UpdateFailStats --> CheckState
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End stats;
class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
```
## How It Works
The BFS strategy crawls a website by:
1. Starting from a root URL
2. Processing all URLs at the current depth
3. Moving to URLs at the next depth level
4. Continuing until maximum depth is reached
This ensures systematic coverage of the website while maintaining control over the crawling process.
## Key Features
### 1. Smart URL Processing
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=my_filters,
url_scorer=my_scorer,
max_concurrent=5
)
```
- Controls crawl depth
- Filters unwanted URLs
- Scores URLs for priority
- Manages concurrent requests
### 2. Polite Crawling
The strategy automatically implements web crawling best practices:
- Respects robots.txt
- Implements rate limiting
- Adds politeness delays
- Manages concurrent requests
### 3. Link Processing Control
```python
strategy = BFSScraperStrategy(
...,
process_external_links=False # Only process internal links
)
```
- Control whether to follow external links
- Default: internal links only
- Enable external links when needed
## Configuration Options
| Parameter | Description | Default |
|-----------|-------------|---------|
| max_depth | Maximum crawl depth | Required |
| filter_chain | URL filtering rules | Required |
| url_scorer | URL priority scoring | Required |
| max_concurrent | Max parallel requests | 5 |
| min_crawl_delay | Seconds between requests | 1 |
| process_external_links | Follow external links | False |
## Best Practices
1. **Set Appropriate Depth**
- Start with smaller depths (2-3)
- Increase based on needs
- Consider site structure
2. **Configure Filters**
- Use URL patterns
- Filter by content type
- Avoid unwanted sections
3. **Tune Performance**
- Adjust max_concurrent
- Set appropriate delays
- Monitor resource usage
4. **Handle External Links**
- Keep external_links=False for focused crawls
- Enable only when needed
- Consider additional filtering
## Example Usage
```python
from crawl4ai.scraper import BFSScraperStrategy
from crawl4ai.scraper.filters import FilterChain
from crawl4ai.scraper.scorers import BasicURLScorer
# Configure strategy
strategy = BFSScraperStrategy(
max_depth=3,
filter_chain=FilterChain([
URLPatternFilter("*.example.com/*"),
ContentTypeFilter(["text/html"])
]),
url_scorer=BasicURLScorer(),
max_concurrent=5,
min_crawl_delay=1,
process_external_links=False
)
# Use with AsyncWebScraper
scraper = AsyncWebScraper(crawler, strategy)
results = await scraper.ascrape("https://example.com")
```
## Common Use Cases
### 1. Site Mapping
```python
strategy = BFSScraperStrategy(
max_depth=5,
filter_chain=site_filter,
url_scorer=depth_scorer,
process_external_links=False
)
```
Perfect for creating complete site maps or understanding site structure.
### 2. Content Aggregation
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=content_filter,
url_scorer=relevance_scorer,
max_concurrent=3
)
```
Ideal for collecting specific types of content (articles, products, etc.).
### 3. Link Analysis
```python
strategy = BFSScraperStrategy(
max_depth=1,
filter_chain=link_filter,
url_scorer=link_scorer,
process_external_links=True
)
```
Useful for analyzing both internal and external link structures.
## Advanced Features
### Progress Monitoring
```python
async for result in scraper.ascrape(url):
print(f"Current depth: {strategy.stats.current_depth}")
print(f"Processed URLs: {strategy.stats.urls_processed}")
```
### Custom URL Scoring
```python
class CustomScorer(URLScorer):
def score(self, url: str) -> float:
# Lower scores = higher priority
return score_based_on_criteria(url)
```
## Troubleshooting
1. **Slow Crawling**
- Increase max_concurrent
- Adjust min_crawl_delay
- Check network conditions
2. **Missing Content**
- Verify max_depth
- Check filter settings
- Review URL patterns
3. **High Resource Usage**
- Reduce max_concurrent
- Increase crawl delay
- Add more specific filters