Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai import AsyncWebCrawler
# URLPatternFilter and ContentTypeFilter are assumed to be exported by
# crawl4ai.scraper alongside FilterChain
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    URLPatternFilter,
    ContentTypeFilter,
)

# Create simple filter chain
filter_chain = FilterChain([
    URLPatternFilter("*/blog/*"),
    ContentTypeFilter(["text/html"]),
])

# Initialize strategy
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=filter_chain,
    url_scorer=None,
    max_concurrent=3,
)

# Create and run scraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
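The result returned by `ascrape` can then be inspected to produce the kind of summary shown in the example output later in this guide. Below is a minimal sketch of result collection; it assumes the result behaves like a mapping from URL to a page object with an `html` attribute, which is an illustrative assumption rather than documented API:

```python
# Illustrative only: the shape of `result` is assumed, not documented.
print(f"Crawled {len(result)} pages:")
for url, page in result.items():
    # `page.html` is assumed to hold the raw page content
    print(f"- {url}: {len(page.html)} bytes")
```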
Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
import re

# DomainFilter and the scorers are assumed to be exported by
# crawl4ai.scraper alongside the classes used in the basic example
from crawl4ai.scraper import (
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
    FilterChain,
    KeywordRelevanceScorer,
    PathDepthScorer,
    FreshnessScorer,
    CompositeScorer,
    BFSScraperStrategy,
)

# Create comprehensive filter chain
filter_chain = FilterChain([
    DomainFilter(
        allowed_domains=["example.com"],
        blocked_domains=["ads.example.com"],
    ),
    URLPatternFilter([
        "*/article/*",
        re.compile(r"\d{4}/\d{2}/.*"),
    ]),
    ContentTypeFilter(["text/html"]),
])

# Create intelligent scorer
scorer = CompositeScorer([
    KeywordRelevanceScorer(
        keywords=["news", "breaking"],
        weight=1.0,
    ),
    PathDepthScorer(optimal_depth=3, weight=0.7),
    FreshnessScorer(weight=0.9),
])

# Initialize advanced strategy
strategy = BFSScraperStrategy(
    max_depth=4,
    filter_chain=filter_chain,
    url_scorer=scorer,
    max_concurrent=5,
)
```
Features Demonstrated
- Advanced Filtering
  - Domain filtering
  - Pattern matching
  - Content type control
- Intelligent Scoring
  - Keyword relevance
  - Path optimization
  - Freshness priority
- Monitoring
  - Progress tracking
  - Error handling
  - Statistics collection
- Resource Management
  - Concurrent processing
  - Rate limiting
  - Cleanup handling
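The advanced snippet stops at the strategy; running it follows the same pattern as the basic example. The sketch below also reads back filter statistics, since statistics collection is listed above, but the `stats` attribute name is an assumption rather than confirmed API:

```python
# Run the advanced configuration the same way as the basic example
# (reuses the AsyncWebCrawler / AsyncWebScraper imports shown earlier)
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/news/")

# Hypothetical statistics readout: filters and scorers track stats,
# but the attribute name used here is illustrative only.
print(getattr(filter_chain, "stats", "stats attribute not found"))
```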
Running the Examples
```bash
# Basic usage
python basic_scraper_example.py

# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
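The DEBUG and INFO lines in the advanced output below come from Python's standard `logging` module. A minimal way to surface them, assuming the scraper components log through the standard library (an assumption, not a documented guarantee):

```python
import logging

# Show INFO progress lines and DEBUG scorer lines, assuming the scraper
# components use Python's standard logging module.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s: %(message)s",
)
```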
Example Output
Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
Customization
Adding Custom Filters
```python
from crawl4ai.scraper import URLFilter  # assumed export

class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Your custom filtering logic
        return True

filter_chain.add_filter(CustomFilter())
```
Custom Scoring Logic
```python
from crawl4ai.scraper import URLScorer, CompositeScorer  # assumed exports

class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Your custom scoring logic
        return 1.0

scorer = CompositeScorer([
    CustomScorer(weight=1.0),
    ...
])
```
Best Practices
- Start Simple
  - Begin with basic filtering
  - Add features incrementally
  - Test thoroughly at each step
- Monitor Performance
  - Watch memory usage
  - Track processing times
  - Adjust concurrency as needed
- Handle Errors
  - Implement proper error handling
  - Log important events
  - Track error statistics
- Optimize Resources
  - Set appropriate delays
  - Limit concurrent requests
  - Use streaming for large crawls
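To make the error-handling and performance-monitoring points above concrete, here is a hedged helper that wraps a crawl with timing and error logging. The helper and logger names are illustrative, not part of the library:

```python
import logging
import time

logger = logging.getLogger("scraper_example")

async def run_crawl(scraper, start_url: str):
    """Wrap a crawl with timing and error logging (illustrative helper)."""
    started = time.perf_counter()
    try:
        return await scraper.ascrape(start_url)
    except Exception as exc:
        # Log and re-raise so failures show up in error statistics
        logger.error("Crawl of %s failed: %s", start_url, exc)
        raise
    finally:
        logger.info("Crawl of %s took %.1fs", start_url, time.perf_counter() - started)
```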
Troubleshooting
Common issues and solutions:
- Too Many Requests

  ```python
  strategy = BFSScraperStrategy(
      max_concurrent=3,   # Reduce concurrent requests
      min_crawl_delay=2,  # Increase delay between requests
  )
  ```

- Memory Issues

  ```python
  # Use streaming mode for large crawls
  async for result in scraper.ascrape(url, stream=True):
      process_result(result)
  ```

- Missing Content

  ```python
  # Check your filter chain
  filter_chain = FilterChain([
      URLPatternFilter("*"),      # Broaden patterns
      ContentTypeFilter(["*"]),   # Accept all content
  ])
  ```
For more examples and use cases, visit our GitHub repository.