Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai import AsyncWebCrawler
# URLPatternFilter and ContentTypeFilter are assumed to be exported by
# crawl4ai.scraper alongside FilterChain
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    URLPatternFilter,
    ContentTypeFilter,
)

# Create simple filter chain
filter_chain = FilterChain([
    URLPatternFilter("*/blog/*"),
    ContentTypeFilter(["text/html"]),
])

# Initialize strategy
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=filter_chain,
    url_scorer=None,
    max_concurrent=3,
)

# Create and run scraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
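The result returned by `ascrape` can then be inspected to produce the kind of summary shown in the example output later in this guide. Below is a minimal sketch of result collection; it assumes the result behaves like a mapping from URL to a page object with an `html` attribute, which is an illustrative assumption rather than documented API:

```python
# Illustrative only: the shape of `result` is assumed, not documented.
print(f"Crawled {len(result)} pages:")
for url, page in result.items():
    # `page.html` is assumed to hold the raw page content
    print(f"- {url}: {len(page.html)} bytes")
```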
Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
import re

# DomainFilter and the scorers are assumed to be exported by
# crawl4ai.scraper alongside the classes used in the basic example
from crawl4ai.scraper import (
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
    FilterChain,
    KeywordRelevanceScorer,
    PathDepthScorer,
    FreshnessScorer,
    CompositeScorer,
    BFSScraperStrategy,
)

# Create comprehensive filter chain
filter_chain = FilterChain([
    DomainFilter(
        allowed_domains=["example.com"],
        blocked_domains=["ads.example.com"],
    ),
    URLPatternFilter([
        "*/article/*",
        re.compile(r"\d{4}/\d{2}/.*"),
    ]),
    ContentTypeFilter(["text/html"]),
])

# Create intelligent scorer
scorer = CompositeScorer([
    KeywordRelevanceScorer(
        keywords=["news", "breaking"],
        weight=1.0,
    ),
    PathDepthScorer(optimal_depth=3, weight=0.7),
    FreshnessScorer(weight=0.9),
])

# Initialize advanced strategy
strategy = BFSScraperStrategy(
    max_depth=4,
    filter_chain=filter_chain,
    url_scorer=scorer,
    max_concurrent=5,
)
```
Features Demonstrated
- Advanced Filtering
  - Domain filtering
  - Pattern matching
  - Content type control
- Intelligent Scoring
  - Keyword relevance
  - Path optimization
  - Freshness priority
- Monitoring
  - Progress tracking
  - Error handling
  - Statistics collection
- Resource Management
  - Concurrent processing
  - Rate limiting
  - Cleanup handling
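The advanced snippet stops at the strategy; running it follows the same pattern as the basic example. The sketch below also reads back filter statistics, since statistics collection is listed above, but the `stats` attribute name is an assumption rather than confirmed API:

```python
# Run the advanced configuration the same way as the basic example
# (reuses the AsyncWebCrawler / AsyncWebScraper imports shown earlier)
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/news/")

# Hypothetical statistics readout: filters and scorers track stats,
# but the attribute name used here is illustrative only.
print(getattr(filter_chain, "stats", "stats attribute not found"))
```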
Running the Examples
```bash
# Basic usage
python basic_scraper_example.py

# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
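The DEBUG and INFO lines in the advanced output below come from Python's standard `logging` module. A minimal way to surface them, assuming the scraper components log through the standard library (an assumption, not a documented guarantee):

```python
import logging

# Show INFO progress lines and DEBUG scorer lines, assuming the scraper
# components use Python's standard logging module.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s: %(message)s",
)
```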
Example Output
Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
Customization
Adding Custom Filters
```python
from crawl4ai.scraper import URLFilter  # assumed export

class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Your custom filtering logic
        return True

filter_chain.add_filter(CustomFilter())
```
Custom Scoring Logic
```python
from crawl4ai.scraper import URLScorer, CompositeScorer  # assumed exports

class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Your custom scoring logic
        return 1.0

scorer = CompositeScorer([
    CustomScorer(weight=1.0),
    ...
])
```
Best Practices
- Start Simple
  - Begin with basic filtering
  - Add features incrementally
  - Test thoroughly at each step
- Monitor Performance
  - Watch memory usage
  - Track processing times
  - Adjust concurrency as needed
- Handle Errors
  - Implement proper error handling
  - Log important events
  - Track error statistics
- Optimize Resources
  - Set appropriate delays
  - Limit concurrent requests
  - Use streaming for large crawls
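To make the error-handling and performance-monitoring points above concrete, here is a hedged helper that wraps a crawl with timing and error logging. The helper and logger names are illustrative, not part of the library:

```python
import logging
import time

logger = logging.getLogger("scraper_example")

async def run_crawl(scraper, start_url: str):
    """Wrap a crawl with timing and error logging (illustrative helper)."""
    started = time.perf_counter()
    try:
        return await scraper.ascrape(start_url)
    except Exception as exc:
        # Log and re-raise so failures show up in error statistics
        logger.error("Crawl of %s failed: %s", start_url, exc)
        raise
    finally:
        logger.info("Crawl of %s took %.1fs", start_url, time.perf_counter() - started)
```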
Troubleshooting
Common issues and solutions:
- Too Many Requests

  ```python
  strategy = BFSScraperStrategy(
      max_concurrent=3,   # Reduce concurrent requests
      min_crawl_delay=2,  # Increase delay between requests
  )
  ```

- Memory Issues

  ```python
  # Use streaming mode for large crawls
  async for result in scraper.ascrape(url, stream=True):
      process_result(result)
  ```

- Missing Content

  ```python
  # Check your filter chain
  filter_chain = FilterChain([
      URLPatternFilter("*"),      # Broaden patterns
      ContentTypeFilter(["*"]),   # Accept all content
  ])
  ```
For more examples and use cases, visit our GitHub repository.