# Scraper Examples Guide

This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.

## Basic Example

The basic example demonstrates a simple blog scraping scenario:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.scraper import (
    AsyncWebScraper, BFSScraperStrategy, FilterChain,
    URLPatternFilter, ContentTypeFilter
)

# Create simple filter chain: only HTML pages under /blog/
filter_chain = FilterChain([
    URLPatternFilter("*/blog/*"),
    ContentTypeFilter(["text/html"])
])

# Initialize strategy: crawl two levels deep, three requests at a time
strategy = BFSScraperStrategy(
    max_depth=2,
    filter_chain=filter_chain,
    url_scorer=None,
    max_concurrent=3
)

# Create and run the scraper (ascrape is a coroutine,
# so it must be awaited inside an async function)
async def main():
    crawler = AsyncWebCrawler()
    scraper = AsyncWebScraper(crawler, strategy)
    return await scraper.ascrape("https://example.com/blog/")

result = asyncio.run(main())
```

### Features Demonstrated

- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection (see the sketch below)

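This guide does not document the shape of the returned `result`; purely as a sketch, assume it exposes a `pages` mapping from URL to page object (hypothetical names, not a confirmed crawl4ai API):

```python
# Hypothetical result handling: `result.pages` and `page.content`
# are assumed attribute names, not documented crawl4ai API
print(f"Crawled {len(result.pages)} pages:")
for url, page in result.pages.items():
    print(f"- {url}: {len(page.content)} bytes")
```
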
## Advanced Example

The advanced example shows a sophisticated news site scraping setup with all features enabled:

```python
import re

from crawl4ai.scraper import (
    BFSScraperStrategy, FilterChain, DomainFilter,
    URLPatternFilter, ContentTypeFilter, CompositeScorer,
    KeywordRelevanceScorer, PathDepthScorer, FreshnessScorer
)

# Create comprehensive filter chain: stay on example.com, block
# the ads subdomain, and keep only article-style HTML pages
filter_chain = FilterChain([
    DomainFilter(
        allowed_domains=["example.com"],
        blocked_domains=["ads.example.com"]
    ),
    URLPatternFilter([
        "*/article/*",
        re.compile(r"\d{4}/\d{2}/.*")  # date-based paths, e.g. 2024/05/...
    ]),
    ContentTypeFilter(["text/html"])
])

# Create intelligent scorer: keyword relevance weighted highest,
# then freshness, then closeness to the optimal path depth
scorer = CompositeScorer([
    KeywordRelevanceScorer(
        keywords=["news", "breaking"],
        weight=1.0
    ),
    PathDepthScorer(optimal_depth=3, weight=0.7),
    FreshnessScorer(weight=0.9)
])

# Initialize advanced strategy
strategy = BFSScraperStrategy(
    max_depth=4,
    filter_chain=filter_chain,
    url_scorer=scorer,
    max_concurrent=5
)
```

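Running the advanced strategy works the same way as in the basic example. A minimal sketch, reusing the `strategy` built above (the starting URL matches the sample output later in this guide):

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.scraper import AsyncWebScraper

async def main():
    crawler = AsyncWebCrawler()
    scraper = AsyncWebScraper(crawler, strategy)
    return await scraper.ascrape("https://example.com/news/")

result = asyncio.run(main())
```
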
### Features Demonstrated

1. **Advanced Filtering**
   - Domain filtering
   - Pattern matching
   - Content type control

2. **Intelligent Scoring**
   - Keyword relevance
   - Path optimization
   - Freshness priority

3. **Monitoring** (see the logging sketch after this list)
   - Progress tracking
   - Error handling
   - Statistics collection

4. **Resource Management**
   - Concurrent processing
   - Rate limiting
   - Cleanup handling

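The INFO/DEBUG lines in the sample output below suggest standard Python logging. A minimal sketch for surfacing that output; the assumption that the scraper logs through the stdlib `logging` module is not confirmed by this guide:

```python
import logging

# Assumption: progress and per-URL scorer output are emitted via
# Python's standard logging module; DEBUG level surfaces the
# scorer lines shown in the sample output below
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s: %(message)s"
)
```
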
## Running the Examples

```bash
# Basic usage
python basic_scraper_example.py

# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```

## Example Output
### Basic Example

```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```

### Advanced Example

```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```

## Customization
### Adding Custom Filters

```python
class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Your custom filtering logic: return True to keep the URL,
        # False to drop it from the crawl
        return True

filter_chain.add_filter(CustomFilter())
```

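For instance, a filter that drops URLs carrying query strings might look like this. A sketch only: `NoQueryStringFilter` is a hypothetical name, and it assumes `URLFilter` subclasses need nothing beyond `apply`:

```python
from urllib.parse import urlparse

class NoQueryStringFilter(URLFilter):
    """Hypothetical filter: reject URLs that carry query parameters."""

    def apply(self, url: str) -> bool:
        # Keep the URL only when its query string is empty
        return urlparse(url).query == ""

filter_chain.add_filter(NoQueryStringFilter())
```
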
### Custom Scoring Logic

```python
class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Your custom scoring logic: higher-scoring URLs are
        # prioritized during the crawl
        return 1.0

scorer = CompositeScorer([
    CustomScorer(weight=1.0),
    ...
])
```

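As a concrete sketch, here is a scorer that favors shallow URLs; `ShortPathScorer` is a hypothetical name, and it assumes `URLScorer` subclasses implement `_calculate_score` and accept a `weight` argument as above:

```python
class ShortPathScorer(URLScorer):
    """Hypothetical scorer: favor URLs with fewer path segments."""

    def _calculate_score(self, url: str) -> float:
        # Strip the scheme and host, then count path segments
        path = url.split("://", 1)[-1].split("/")[1:]
        segments = len([p for p in path if p])
        # 1.0 at the site root, decaying toward 0 as depth grows
        return 1.0 / (1 + segments)
```
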
## Best Practices

1. **Start Simple**
   - Begin with basic filtering
   - Add features incrementally
   - Test thoroughly at each step

2. **Monitor Performance** (a timing sketch follows this list)
   - Watch memory usage
   - Track processing times
   - Adjust concurrency as needed

3. **Handle Errors**
   - Implement proper error handling
   - Log important events
   - Track error statistics

4. **Optimize Resources**
   - Set appropriate delays
   - Limit concurrent requests
   - Use streaming for large crawls

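A minimal sketch of the second practice, using only the standard library to watch peak memory and wall-clock time around a crawl (assumes the `scraper` object from the basic example):

```python
import time
import tracemalloc

async def timed_scrape(scraper, url):
    # Record peak memory and elapsed time for a single crawl
    tracemalloc.start()
    start = time.perf_counter()
    result = await scraper.ascrape(url)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Crawled {url} in {elapsed:.1f}s, peak memory {peak / 1024:.0f} KiB")
    return result
```
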
## Troubleshooting

Common issues and solutions:

1. **Too Many Requests**
```python
strategy = BFSScraperStrategy(
    max_concurrent=3,   # Reduce concurrent requests
    min_crawl_delay=2,  # Increase the delay between requests
    # ... plus the max_depth and filter_chain arguments shown above
)
```

2. **Memory Issues**
```python
# Use streaming mode for large crawls
async for result in scraper.ascrape(url, stream=True):
    process_result(result)
```

3. **Missing Content**
```python
# Check your filter chain: overly strict filters are a common
# cause of skipped pages
filter_chain = FilterChain([
    URLPatternFilter("*"),     # Broaden patterns
    ContentTypeFilter(["*"])   # Accept all content types
])
```

For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).