Refactor: Renamed scrape to traverse and deep_crawl in a few sections where it applies

This commit is contained in:
Aravind Karnam
2025-01-29 16:24:11 +05:30
parent 9ef43bc5f0
commit 2c8f2ec5a6
7 changed files with 3 additions and 169 deletions

View File

@@ -0,0 +1,244 @@
# BFS Scraper Strategy: Smart Web Traversal
The BFS (Breadth-First Search) Scraper Strategy traverses websites systematically, level by level, ensuring thorough coverage while respecting web crawling etiquette.
```mermaid
flowchart TB
Start([Start]) --> Init[Initialize BFS Strategy]
Init --> InitStats[Initialize CrawlStats]
InitStats --> InitQueue[Initialize Priority Queue]
InitQueue --> AddStart[Add Start URL to Queue]
AddStart --> CheckState{Queue Not Empty or\nTasks Pending?}
CheckState -->|No| Cleanup[Cleanup & Stats]
Cleanup --> End([End])
CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
CheckCancel -->|Yes| Cleanup
CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
WaitComplete --> YieldResult[Yield Result]
YieldResult --> CheckState
CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
GetNextURL --> ValidateURL{Already\nVisited?}
ValidateURL -->|Yes| CheckState
ValidateURL -->|No| ProcessURL[Process URL]
subgraph URL_Processing [URL Processing]
ProcessURL --> CheckValid{URL Valid?}
CheckValid -->|No| UpdateStats[Update Skip Stats]
CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
FetchContent --> CheckError{Error?}
CheckError -->|Yes| Retry{Retry\nNeeded?}
Retry -->|Yes| FetchContent
Retry -->|No| UpdateFailStats[Update Fail Stats]
CheckError -->|No| ExtractLinks[Extract & Process Links]
ExtractLinks --> ScoreURLs[Score New URLs]
ScoreURLs --> AddToQueue[Add to Priority Queue]
end
ProcessURL --> CreateTask{Parallel\nProcessing?}
CreateTask -->|Yes| AddTask[Add to Pending Tasks]
CreateTask -->|No| DirectProcess[Process Directly]
AddTask --> CheckState
DirectProcess --> YieldResult
UpdateStats --> CheckState
UpdateRobotStats --> CheckState
UpdateFailStats --> CheckState
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End stats;
class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
```
## How It Works
The BFS strategy crawls a website by:
1. Starting from a root URL
2. Processing all URLs at the current depth
3. Moving to URLs at the next depth level
4. Continuing until maximum depth is reached
This ensures systematic coverage of the website while maintaining control over the crawling process.
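To make the traversal order concrete, here is a minimal, library-independent sketch of breadth-first crawling with a plain queue; the actual strategy layers filtering, scoring, politeness, and concurrency on top of this loop. The `get_links` callable (fetch a page, return its links) is a stand-in for the crawler:
```python
from collections import deque

def bfs_order(start_url, get_links, max_depth):
    """Yield URLs breadth-first: every URL at depth N before any at depth N+1."""
    queue = deque([(start_url, 0)])
    visited = {start_url}
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if depth >= max_depth:
            continue  # do not descend past the maximum depth
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
```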
## Key Features
### 1. Smart URL Processing
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=my_filters,
url_scorer=my_scorer,
max_concurrent=5
)
```
- Controls crawl depth
- Filters unwanted URLs
- Scores URLs for priority
- Manages concurrent requests
### 2. Polite Crawling
The strategy automatically implements web crawling best practices:
- Respects robots.txt
- Implements rate limiting
- Adds politeness delays
- Manages concurrent requests
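As a rough illustration of the pattern (not the strategy's internal code), a robots.txt check plus a fixed delay can be written as follows; `fetch` is an assumed async callable, and a real crawler would cache the parsed robots.txt per host:
```python
import asyncio
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

async def polite_fetch(url, fetch, min_delay=1.0, user_agent="crawl4ai"):
    """Fetch a URL only if robots.txt allows it, then pause before the next request."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # blocking call; cache this per host in real code
    if not parser.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt
    result = await fetch(url)
    await asyncio.sleep(min_delay)  # politeness delay between requests
    return result
```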
### 3. Link Processing Control
```python
strategy = BFSScraperStrategy(
...,
process_external_links=False # Only process internal links
)
```
- Control whether to follow external links
- Default: internal links only
- Enable external links when needed
## Configuration Options
| Parameter | Description | Default |
|-----------|-------------|---------|
| max_depth | Maximum crawl depth | Required |
| filter_chain | URL filtering rules | Required |
| url_scorer | URL priority scoring | Required |
| max_concurrent | Max parallel requests | 5 |
| min_crawl_delay | Seconds between requests | 1 |
| process_external_links | Follow external links | False |
## Best Practices
1. **Set Appropriate Depth**
- Start with smaller depths (2-3)
- Increase based on needs
- Consider site structure
2. **Configure Filters**
- Use URL patterns
- Filter by content type
- Avoid unwanted sections
3. **Tune Performance**
- Adjust max_concurrent
- Set appropriate delays
- Monitor resource usage
4. **Handle External Links**
- Keep process_external_links=False for focused crawls
- Enable only when needed
- Consider additional filtering
## Example Usage
```python
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy
from crawl4ai.scraper.filters import FilterChain, URLPatternFilter, ContentTypeFilter
from crawl4ai.scraper.scorers import BasicURLScorer
# Configure strategy
strategy = BFSScraperStrategy(
max_depth=3,
filter_chain=FilterChain([
URLPatternFilter("*.example.com/*"),
ContentTypeFilter(["text/html"])
]),
url_scorer=BasicURLScorer(),
max_concurrent=5,
min_crawl_delay=1,
process_external_links=False
)
# Use with AsyncWebScraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
results = await scraper.ascrape("https://example.com")
```
## Common Use Cases
### 1. Site Mapping
```python
strategy = BFSScraperStrategy(
max_depth=5,
filter_chain=site_filter,
url_scorer=depth_scorer,
process_external_links=False
)
```
Perfect for creating complete site maps or understanding site structure.
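As a rough sketch, the collected results can be grouped by depth to outline the site's structure; this assumes each result exposes `url` and `depth` attributes, as the streaming examples elsewhere in these docs do:
```python
from collections import defaultdict

site_map = defaultdict(list)
for result in results:  # results returned by scraper.ascrape(...)
    site_map[result.depth].append(result.url)

for depth in sorted(site_map):
    print(f"Depth {depth}: {len(site_map[depth])} pages")
```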
### 2. Content Aggregation
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=content_filter,
url_scorer=relevance_scorer,
max_concurrent=3
)
```
Ideal for collecting specific types of content (articles, products, etc.).
### 3. Link Analysis
```python
strategy = BFSScraperStrategy(
max_depth=1,
filter_chain=link_filter,
url_scorer=link_scorer,
process_external_links=True
)
```
Useful for analyzing both internal and external link structures.
## Advanced Features
### Progress Monitoring
```python
async for result in scraper.ascrape(url, stream=True):
print(f"Current depth: {strategy.stats.current_depth}")
print(f"Processed URLs: {strategy.stats.urls_processed}")
```
### Custom URL Scoring
```python
class CustomScorer(URLScorer):
def score(self, url: str) -> float:
        # Higher scores = higher priority (see the URL Scorers guide)
return score_based_on_criteria(url)
```
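A custom scorer plugs into the strategy like any built-in one; for example (a sketch reusing the `my_filters` chain from the configuration snippet above):
```python
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=my_filters,
    url_scorer=CustomScorer(),  # orders the priority queue
    max_concurrent=5
)
```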
## Troubleshooting
1. **Slow Crawling**
- Increase max_concurrent
- Adjust min_crawl_delay
- Check network conditions
2. **Missing Content**
- Verify max_depth
- Check filter settings
- Review URL patterns
3. **High Resource Usage**
- Reduce max_concurrent
- Increase crawl delay
- Add more specific filters

View File

@@ -0,0 +1,203 @@
# basic_scraper_example.py
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.traversal import (
BFSTraversalStrategy,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
)
from crawl4ai.async_webcrawler import AsyncWebCrawler
import re
import time
browser_config = BrowserConfig(headless=True, viewport_width=800, viewport_height=600)
async def basic_scraper_example():
"""
Basic example: Scrape a blog site for articles
- Crawls only HTML pages
- Stays within the blog section
- Collects all results at once
"""
# Create a simple filter chain
filter_chain = FilterChain(
[
            # Only crawl pages within the /basic/ section
URLPatternFilter("*/basic/*"),
# Only process HTML pages
ContentTypeFilter(["text/html"]),
]
)
# Initialize the strategy with basic configuration
bfs_strategy = BFSTraversalStrategy(
max_depth=2, # Only go 2 levels deep
filter_chain=filter_chain,
url_scorer=None, # Use default scoring
process_external_links=True,
)
# Create the crawler and scraper
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Start scraping
try:
results = await crawler.adeep_crawl(
"https://crawl4ai.com/mkdocs", strategy=bfs_strategy
)
# Process results
print(f"Crawled {len(results)} pages:")
for result in results:
print(f"- {result.url}: {len(result.html)} bytes")
except Exception as e:
print(f"Error during scraping: {e}")
# advanced_scraper_example.py
import logging
from crawl4ai.traversal import (
BFSTraversalStrategy,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
DomainFilter,
KeywordRelevanceScorer,
PathDepthScorer,
FreshnessScorer,
CompositeScorer,
)
async def advanced_scraper_example():
"""
Advanced example: Intelligent news site scraping
- Uses all filter types
- Implements sophisticated scoring
- Streams results
- Includes monitoring and logging
"""
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("advanced_scraper")
# Create sophisticated filter chain
filter_chain = FilterChain(
[
# Domain control
DomainFilter(
allowed_domains=["techcrunch.com"],
blocked_domains=["login.techcrunch.com", "legal.yahoo.com"],
),
# URL patterns
URLPatternFilter(
[
"*/article/*",
"*/news/*",
"*/blog/*",
re.compile(r"\d{4}/\d{2}/.*"), # Date-based URLs
]
),
# Content types
ContentTypeFilter(["text/html", "application/xhtml+xml"]),
]
)
# Create composite scorer
scorer = CompositeScorer(
[
# Prioritize by keywords
KeywordRelevanceScorer(
keywords=["news", "breaking", "update", "latest"], weight=1.0
),
# Prefer optimal URL structure
PathDepthScorer(optimal_depth=3, weight=0.7),
# Prioritize fresh content
FreshnessScorer(weight=0.9),
]
)
# Initialize strategy with advanced configuration
bfs_strategy = BFSTraversalStrategy(
max_depth=2, filter_chain=filter_chain, url_scorer=scorer
)
# Create crawler and scraper
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Track statistics
stats = {"processed": 0, "errors": 0, "total_size": 0}
try:
# Use streaming mode
results = []
result_generator = await crawler.adeep_crawl(
"https://techcrunch.com",
strategy=bfs_strategy,
crawler_run_config=CrawlerRunConfig(
scraping_strategy=LXMLWebScrapingStrategy()
),
stream=True,
)
async for result in result_generator:
stats["processed"] += 1
if result.success:
stats["total_size"] += len(result.html)
logger.info(
f"Processed at depth: {result.depth} with score: {result.score:.3f} : \n {result.url}"
)
results.append(result)
else:
stats["errors"] += 1
logger.error(
f"Failed to process {result.url}: {result.error_message}"
)
# Log progress regularly
if stats["processed"] % 10 == 0:
logger.info(f"Progress: {stats['processed']} URLs processed")
except Exception as e:
logger.error(f"Scraping error: {e}")
finally:
# Print final statistics
logger.info("Scraping completed:")
logger.info(f"- URLs processed: {stats['processed']}")
logger.info(f"- Errors: {stats['errors']}")
logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
# Print filter statistics
for filter_ in filter_chain.filters:
logger.info(f"{filter_.name} stats:")
logger.info(f"- Passed: {filter_.stats.passed_urls}")
logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
# Print scorer statistics
logger.info("Scoring statistics:")
logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
logger.info(
f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}"
)
if __name__ == "__main__":
import asyncio
# Run basic example
start_time = time.perf_counter()
print("Running basic scraper example...")
asyncio.run(basic_scraper_example())
end_time = time.perf_counter()
print(f"Basic scraper example completed in {end_time - start_time:.2f} seconds")
    # Run advanced example
print("\nRunning advanced scraper example...")
asyncio.run(advanced_scraper_example())

View File

@@ -0,0 +1,342 @@
# URL Filters and Scorers
The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
```mermaid
flowchart TB
Start([URL Input]) --> Chain[Filter Chain]
subgraph Chain Process
Chain --> Pattern{URL Pattern\nFilter}
Pattern -->|Match| Content{Content Type\nFilter}
Pattern -->|No Match| Reject1[Reject URL]
Content -->|Allowed| Domain{Domain\nFilter}
Content -->|Not Allowed| Reject2[Reject URL]
Domain -->|Allowed| Accept[Accept URL]
Domain -->|Blocked| Reject3[Reject URL]
end
subgraph Statistics
Pattern --> UpdatePattern[Update Pattern Stats]
Content --> UpdateContent[Update Content Stats]
Domain --> UpdateDomain[Update Domain Stats]
Accept --> UpdateChain[Update Chain Stats]
Reject1 --> UpdateChain
Reject2 --> UpdateChain
Reject3 --> UpdateChain
end
Accept --> End([End])
Reject1 --> End
Reject2 --> End
Reject3 --> End
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End accept;
class Pattern,Content,Domain decision;
class Reject1,Reject2,Reject3 reject;
class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
```
## URL Filters
URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
### Available Filters
1. **URL Pattern Filter**
```python
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
```
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
2. **Content Type Filter**
```python
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
```
- Filter by MIME types
- Extension checking
- Support for multiple content types
3. **Domain Filter**
```python
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
```
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
### Creating Filter Chains
```python
# Create and configure a filter chain
filter_chain = FilterChain([
URLPatternFilter(["*.example.com/*"]),
ContentTypeFilter(["text/html"]),
DomainFilter(blocked_domains=["ads.*"])
])
# Add more filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*"])
)
```
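A chain can also be exercised directly, which is useful for testing rules before a crawl. This sketch assumes `FilterChain` exposes the same `apply(url) -> bool` interface as the individual filters shown under Custom Filters below:
```python
test_urls = [
    "https://www.example.com/article/intro",
    "https://ads.example.com/banner",
]
for url in test_urls:
    verdict = "accepted" if filter_chain.apply(url) else "rejected"
    print(f"{url}: {verdict}")
```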
```mermaid
flowchart TB
Start([URL Input]) --> Composite[Composite Scorer]
subgraph Scoring Process
Composite --> Keywords[Keyword Relevance]
Composite --> Path[Path Depth]
Composite --> Content[Content Type]
Composite --> Fresh[Freshness]
Composite --> Domain[Domain Authority]
Keywords --> KeywordScore[Calculate Score]
Path --> PathScore[Calculate Score]
Content --> ContentScore[Calculate Score]
Fresh --> FreshScore[Calculate Score]
Domain --> DomainScore[Calculate Score]
KeywordScore --> Weight1[Apply Weight]
PathScore --> Weight2[Apply Weight]
ContentScore --> Weight3[Apply Weight]
FreshScore --> Weight4[Apply Weight]
DomainScore --> Weight5[Apply Weight]
end
Weight1 --> Combine[Combine Scores]
Weight2 --> Combine
Weight3 --> Combine
Weight4 --> Combine
Weight5 --> Combine
Combine --> Normalize{Normalize?}
Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
Normalize -->|No| FinalScore[Final Score]
NormalizeScore --> FinalScore
FinalScore --> Stats[Update Statistics]
Stats --> End([End])
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
class Start,End calc;
class Keywords,Path,Content,Fresh,Domain scorer;
class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
class Normalize decision;
```
## URL Scorers
URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
### Available Scorers
1. **Keyword Relevance Scorer**
```python
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
```
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
2. **Path Depth Scorer**
```python
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
```
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths (one possible formula is sketched after this list)
3. **Content Type Scorer**
```python
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
```
- Score based on file types
- Configurable type weights
- Pattern matching support
4. **Freshness Scorer**
```python
freshness_scorer = FreshnessScorer(weight=0.9)
```
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting
5. **Domain Authority Scorer**
```python
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
```
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains
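For intuition, the "diminishing returns" behavior of path-depth scoring can be modeled as a simple distance penalty around the optimal depth; the formula below is an illustrative assumption, not necessarily the library's exact curve:
```python
from urllib.parse import urlparse

def path_depth_score(url: str, optimal_depth: int, weight: float) -> float:
    """Highest at the optimal depth, decaying as URLs get shallower or deeper."""
    depth = len([segment for segment in urlparse(url).path.split("/") if segment])
    return weight / (1 + abs(depth - optimal_depth))
```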
### Combining Scorers
```python
# Create a composite scorer
composite_scorer = CompositeScorer([
KeywordRelevanceScorer(["python"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8)
], normalize=True)
```
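Composite scores can then be used to rank candidate URLs, highest first; this assumes scorers expose a `score(url)` method, as in the `CustomScorer` examples in these docs:
```python
candidates = [
    "https://example.com/python/guides/asyncio",
    "https://example.com/about",
]
ranked = sorted(candidates, key=composite_scorer.score, reverse=True)
for url in ranked:
    print(f"{composite_scorer.score(url):.2f}  {url}")
```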
## Best Practices
### Filter Configuration
1. **Start Restrictive**
```python
# Begin with strict filters
filter_chain = FilterChain([
DomainFilter(allowed_domains=["example.com"]),
ContentTypeFilter(["text/html"])
])
```
2. **Layer Filters**
```python
# Add more specific filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*", "*/blog/*"])
)
```
3. **Monitor Filter Statistics**
```python
# Check filter performance
for url_filter in filter_chain.filters:
    print(f"{url_filter.name}: {url_filter.stats.rejected_urls} rejected")
```
### Scorer Configuration
1. **Balance Weights**
```python
# Balanced scoring configuration; one possible definition of this
# helper is sketched after this list
scorer = create_balanced_scorer()
```
2. **Customize for Content**
```python
# News site configuration
news_scorer = CompositeScorer([
KeywordRelevanceScorer(["news", "article"], weight=1.0),
FreshnessScorer(weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.5)
])
```
3. **Monitor Scoring Statistics**
```python
# Check scoring distribution
print(f"Average score: {scorer.stats.average_score}")
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
```
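`create_balanced_scorer` is not defined in this guide; one plausible, purely illustrative definition (the keywords and weights here are assumptions) is:
```python
def create_balanced_scorer() -> CompositeScorer:
    """Combine the built-in scorers with roughly equal influence."""
    return CompositeScorer([
        KeywordRelevanceScorer(["guide", "docs"], weight=1.0),  # assumed keywords
        PathDepthScorer(optimal_depth=3, weight=1.0),
        FreshnessScorer(weight=1.0),
    ], normalize=True)
```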
## Common Use Cases
### Blog Crawling
```python
blog_config = {
'filters': FilterChain([
URLPatternFilter(["*/blog/*", "*/post/*"]),
ContentTypeFilter(["text/html"])
]),
'scorer': CompositeScorer([
FreshnessScorer(weight=1.0),
KeywordRelevanceScorer(["blog", "article"], weight=0.8)
])
}
```
### Documentation Sites
```python
docs_config = {
'filters': FilterChain([
URLPatternFilter(["*/docs/*", "*/guide/*"]),
ContentTypeFilter(["text/html", "application/pdf"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=3, weight=1.0),
KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
])
}
```
### E-commerce Sites
```python
ecommerce_config = {
'filters': FilterChain([
URLPatternFilter(["*/product/*", "*/category/*"]),
DomainFilter(blocked_domains=["ads.*", "tracker.*"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=2, weight=1.0),
ContentTypeScorer({
r'/product/': 1.0,
r'/category/': 0.8
})
])
}
```
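These config dicts are just a packaging convention; they can be unpacked into a strategy roughly like this (a sketch using the `BFSScraperStrategy` constructor from the strategy guide):
```python
strategy = BFSScraperStrategy(
    max_depth=3,
    filter_chain=blog_config['filters'],
    url_scorer=blog_config['scorer'],
)
```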
## Advanced Topics
### Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
```
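For example, a filter that rejects URLs carrying query strings (often session or tracking parameters) might look like this, building on the `URLFilter` skeleton above:
```python
from urllib.parse import urlparse

class NoQueryStringFilter(URLFilter):
    """Reject URLs that carry a query string."""
    def apply(self, url: str) -> bool:
        return not urlparse(url).query

filter_chain.add_filter(NoQueryStringFilter())
```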
### Custom Scorers
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
```
For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).

View File

@@ -0,0 +1,206 @@
# Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
## Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    URLPatternFilter,
    ContentTypeFilter,
)
# Create simple filter chain
filter_chain = FilterChain([
URLPatternFilter("*/blog/*"),
ContentTypeFilter(["text/html"])
])
# Initialize strategy
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=filter_chain,
url_scorer=None,
max_concurrent=3
)
# Create and run scraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
### Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
## Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
import re  # used below for regex URL patterns

# Create comprehensive filter chain
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*/article/*",
re.compile(r"\d{4}/\d{2}/.*")
]),
ContentTypeFilter(["text/html"])
])
# Create intelligent scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(
keywords=["news", "breaking"],
weight=1.0
),
PathDepthScorer(optimal_depth=3, weight=0.7),
FreshnessScorer(weight=0.9)
])
# Initialize advanced strategy
strategy = BFSScraperStrategy(
max_depth=4,
filter_chain=filter_chain,
url_scorer=scorer,
max_concurrent=5
)
```
### Features Demonstrated
1. **Advanced Filtering**
- Domain filtering
- Pattern matching
- Content type control
2. **Intelligent Scoring**
- Keyword relevance
- Path optimization
- Freshness priority
3. **Monitoring**
- Progress tracking
- Error handling
- Statistics collection
4. **Resource Management**
- Concurrent processing
- Rate limiting
- Cleanup handling
## Running the Examples
```bash
# Basic usage
python basic_scraper_example.py
# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
## Example Output
### Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
### Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
## Customization
### Adding Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
filter_chain.add_filter(CustomFilter())
```
### Custom Scoring Logic
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
scorer = CompositeScorer([
CustomScorer(weight=1.0),
...
])
```
## Best Practices
1. **Start Simple**
- Begin with basic filtering
- Add features incrementally
- Test thoroughly at each step
2. **Monitor Performance**
- Watch memory usage
- Track processing times
- Adjust concurrency as needed
3. **Handle Errors**
- Implement proper error handling
- Log important events
- Track error statistics
4. **Optimize Resources**
- Set appropriate delays
- Limit concurrent requests
- Use streaming for large crawls
## Troubleshooting
Common issues and solutions:
1. **Too Many Requests**
```python
strategy = BFSScraperStrategy(
    ...,                  # existing filter/scorer configuration
    max_concurrent=3,     # Reduce concurrent requests
    min_crawl_delay=2     # Increase delay between requests
)
```
2. **Memory Issues**
```python
# Use streaming mode for large crawls
async for result in scraper.ascrape(url, stream=True):
process_result(result)
```
3. **Missing Content**
```python
# Check your filter chain
filter_chain = FilterChain([
URLPatternFilter("*"), # Broaden patterns
ContentTypeFilter(["*"]) # Accept all content
])
```
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).