Scraper Examples Guide

This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.

Basic Example

The basic example demonstrates a simple blog scraping scenario:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    URLPatternFilter,
    ContentTypeFilter,
)

# Create a simple filter chain: only blog pages served as HTML
filter_chain = FilterChain([
    URLPatternFilter("*/blog/*"),
    ContentTypeFilter(["text/html"])
])

# Initialize the strategy: crawl two levels deep, three requests at a time
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=filter_chain,
    url_scorer=None,
    max_concurrent=3
)

# Create and run the scraper
async def main():
    crawler = AsyncWebCrawler()
    scraper = AsyncWebScraper(crawler, strategy)
    result = await scraper.ascrape("https://example.com/blog/")

asyncio.run(main())

Features Demonstrated

  • Basic URL filtering
  • Simple content type filtering
  • Depth control
  • Concurrent request limiting
  • Result collection

Advanced Example

The advanced example shows a sophisticated news site scraping setup with all features enabled:

import re

from crawl4ai import AsyncWebCrawler
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
    KeywordRelevanceScorer,
    PathDepthScorer,
    FreshnessScorer,
    CompositeScorer,
)

# Create a comprehensive filter chain
filter_chain = FilterChain([
    DomainFilter(
        allowed_domains=["example.com"],
        blocked_domains=["ads.example.com"]
    ),
    URLPatternFilter([
        "*/article/*",
        re.compile(r"\d{4}/\d{2}/.*")
    ]),
    ContentTypeFilter(["text/html"])
])

# Create intelligent scorer
scorer = CompositeScorer([
    KeywordRelevanceScorer(
        keywords=["news", "breaking"],
        weight=1.0
    ),
    PathDepthScorer(optimal_depth=3, weight=0.7),
    FreshnessScorer(weight=0.9)
])

# Initialize advanced strategy
strategy = BFSScraperStrategy(
    max_depth=4,
    filter_chain=filter_chain,
    url_scorer=scorer,
    max_concurrent=5
)
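
The advanced strategy plugs into the scraper exactly as in the basic example. A minimal sketch of kicking it off (the start URL matches the example output further down; run it inside an async function as before):

# Run the advanced strategy (inside an async function)
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/news/")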

Features Demonstrated

  1. Advanced Filtering

    • Domain filtering
    • Pattern matching
    • Content type control
  2. Intelligent Scoring

    • Keyword relevance
    • Path optimization
    • Freshness priority
  3. Monitoring (see the statistics sketch after this list)

    • Progress tracking
    • Error handling
    • Statistics collection
  4. Resource Management

    • Concurrent processing
    • Rate limiting
    • Cleanup handling
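
The statistics collection listed above is tracked by the filters and scorers while the crawl runs. The accessors below are illustrative assumptions, not the confirmed API; check the FilterChain and scorer classes for the actual attribute names:

# Hypothetical sketch: inspect per-filter and per-scorer stats after a run.
# Both `filters` and `stats` are assumed attribute names, not confirmed API.
for url_filter in filter_chain.filters:
    print(type(url_filter).__name__, url_filter.stats)
print("Composite scorer stats:", scorer.stats)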

Running the Examples

# Basic usage
python basic_scraper_example.py

# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
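
The INFO and DEBUG lines in the advanced output below come from Python's standard logging module. A minimal setup, enabled before the crawl starts (configuring the root logger is the safe default; the logger names used internally may differ):

import logging

# DEBUG surfaces the per-scorer lines; use INFO for progress messages only
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s: %(message)s",
)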

Example Output

Basic Example

Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...

Advanced Example

INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB

Customization

Adding Custom Filters

from crawl4ai.scraper import URLFilter

class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Return True to keep the URL, False to discard it
        return True

filter_chain.add_filter(CustomFilter())
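
As a concrete illustration, here is a hypothetical filter (the name and logic are our own, not part of the library) that skips URLs carrying query strings:

class NoQueryStringFilter(URLFilter):
    """Hypothetical example: drop URLs with query parameters."""

    def apply(self, url: str) -> bool:
        return "?" not in url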

Custom Scoring Logic

from crawl4ai.scraper import URLScorer, CompositeScorer

class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Return a relevance score; higher-scoring URLs are crawled sooner
        return 1.0

scorer = CompositeScorer([
    CustomScorer(weight=1.0),
    ...
])
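
Again purely for illustration, a hypothetical scorer that favors shorter URLs:

class ShortUrlScorer(URLScorer):
    """Hypothetical example: shorter URLs score higher."""

    def _calculate_score(self, url: str) -> float:
        # Decays from 1.0 toward 0.0 as the URL grows longer
        return 1.0 / (1.0 + len(url) / 100.0)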

Best Practices

  1. Start Simple

    • Begin with basic filtering
    • Add features incrementally
    • Test thoroughly at each step
  2. Monitor Performance

    • Watch memory usage
    • Track processing times
    • Adjust concurrency as needed
  3. Handle Errors

    • Implement proper error handling
    • Log important events
    • Track error statistics
  4. Optimize Resources (see the combined sketch after this list)

    • Set appropriate delays
    • Limit concurrent requests
    • Use streaming for large crawls
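
Putting the resource guidelines together, a conservative starting configuration might look like this (the parameter names reuse those from the examples above and the troubleshooting section below; process_result is a placeholder for your own handler):

# Start shallow and polite, then tune upward as needed
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=filter_chain,
    url_scorer=None,       # add scoring once filtering behaves
    max_concurrent=3,      # cap parallel requests
    min_crawl_delay=1      # delay between requests
)

# Stream results instead of holding the whole crawl in memory
async for result in scraper.ascrape("https://example.com/blog/", stream=True):
    process_result(result)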

Troubleshooting

Common issues and solutions:

  1. Too Many Requests

    strategy = BFSScraperStrategy(
        max_depth=2,
        filter_chain=filter_chain,
        max_concurrent=3,  # Reduce concurrent requests
        min_crawl_delay=2  # Increase delay between requests
    )
    
  2. Memory Issues

    # Use streaming mode for large crawls
    async for result in scraper.ascrape(url, stream=True):
        process_result(result)
    
  3. Missing Content

    # Check your filter chain
    filter_chain = FilterChain([
        URLPatternFilter("*"),  # Broaden patterns
        ContentTypeFilter(["*"])  # Accept all content
    ])
    

For more examples and use cases, visit our GitHub repository.