# AsyncWebScraper: Smart Web Crawling Made Easy

AsyncWebScraper is a web scraping tool for collecting data from websites, whether you need a few pages or an entire site. It handles link following, parallel fetching, and error recovery while leaving the crawling strategy under your control.

## How It Works

```mermaid
flowchart TB
    Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
    Init --> InputURL[Receive URL to scrape]
    InputURL --> Decision{Stream or\nCollect?}

    %% Streaming Path
    Decision -->|Stream| StreamInit[Initialize Streaming Mode]
    StreamInit --> StreamStrategy[Call Strategy.ascrape]
    StreamStrategy --> AsyncGen[Create Async Generator]
    AsyncGen --> ProcessURL[Process Next URL]
    ProcessURL --> FetchContent[Fetch Page Content]
    FetchContent --> Extract[Extract Data]
    Extract --> YieldResult[Yield CrawlResult]
    YieldResult --> CheckMore{More URLs?}
    CheckMore -->|Yes| ProcessURL
    CheckMore -->|No| StreamEnd([End Stream])

    %% Collecting Path
    Decision -->|Collect| CollectInit[Initialize Collection Mode]
    CollectInit --> CollectStrategy[Call Strategy.ascrape]
    CollectStrategy --> CollectGen[Create Async Generator]
    CollectGen --> ProcessURLColl[Process Next URL]
    ProcessURLColl --> FetchContentColl[Fetch Page Content]
    FetchContentColl --> ExtractColl[Extract Data]
    ExtractColl --> StoreColl[Store in Dictionary]
    StoreColl --> CheckMoreColl{More URLs?}
    CheckMoreColl -->|Yes| ProcessURLColl
    CheckMoreColl -->|No| CreateResult[Create ScraperResult]
    CreateResult --> ReturnResult([Return Result])

    %% Parallel Processing
    subgraph Parallel
        ProcessURL
        FetchContent
        Extract
        ProcessURLColl
        FetchContentColl
        ExtractColl
    end

    %% Error Handling
    FetchContent --> ErrorCheck{Error?}
    ErrorCheck -->|Yes| LogError[Log Error]
    LogError --> UpdateStats[Update Error Stats]
    UpdateStats --> CheckMore
    ErrorCheck -->|No| Extract

    FetchContentColl --> ErrorCheckColl{Error?}
    ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
    LogErrorColl --> UpdateStatsColl[Update Error Stats]
    UpdateStatsColl --> CheckMoreColl
    ErrorCheckColl -->|No| ExtractColl

    %% Style definitions
    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
    classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;

    class Start,StreamEnd,ReturnResult start;
    class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
    class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
    class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;
```

AsyncWebScraper navigates a website by following links according to the strategy you specify. It supports two modes of operation:

### 1. Streaming Mode
```python
async for result in scraper.ascrape(url, stream=True):
    print(f"Found data on {result.url}")
    process_data(result.data)
```
- Perfect for processing large websites
- Memory efficient: each result is handed to you as soon as it is ready instead of being accumulated
- Ideal for real-time data processing
- Great for monitoring or continuous scraping tasks

### 2. Collection Mode
```python
result = await scraper.ascrape(url)
print(f"Scraped {len(result.crawled_urls)} pages")
process_all_data(result.extracted_data)
```
- Collects all data before returning
- Best for when you need the complete dataset
- Easier to work with for batch processing
- Includes comprehensive statistics
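
The relationship between the two modes can be sketched with a plain async generator. This is a minimal, standard-library illustration and not the library's implementation: `crawl_pages`, `stream`, and `collect` are hypothetical names, and the "fetch" is a placeholder that does no network I/O. It shows that collecting amounts to draining the same stream into a dictionary.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


async def crawl_pages(urls: List[str]) -> AsyncIterator[Tuple[str, str]]:
    """Hypothetical crawl: yields (url, data) pairs one at a time."""
    for url in urls:
        await asyncio.sleep(0)  # placeholder for the real network fetch
        yield url, f"content of {url}"


async def stream(urls: List[str]) -> AsyncIterator[Tuple[str, str]]:
    """Streaming mode: pass each page to the caller as soon as it is ready."""
    async for url, data in crawl_pages(urls):
        yield url, data


async def collect(urls: List[str]) -> Dict[str, str]:
    """Collection mode: drain the same generator into a dictionary."""
    extracted = {}
    async for url, data in crawl_pages(urls):
        extracted[url] = data
    return extracted


async def demo():
    urls = ["https://example.com/a", "https://example.com/b"]
    streamed = [url async for url, _ in stream(urls)]
    collected = await collect(urls)
    return streamed, collected


streamed, collected = asyncio.run(demo())
```

In the streaming variant only the current page needs to be held by the caller, while the collecting variant keeps every page until the crawl finishes.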

## Key Features

- **Smart Crawling**: Automatically follows relevant links while avoiding duplicates
- **Parallel Processing**: Scrapes multiple pages simultaneously for better performance
- **Memory Efficient**: Choose between streaming and collecting based on your needs
- **Error Resilient**: Continues working even if some pages fail to load
- **Progress Tracking**: Monitor the scraping progress in real-time
- **Customizable**: Configure crawling strategy, filters, and scoring to match your needs
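
To make the duplicate avoidance, parallel fetching, and error resilience listed above concrete, here is a small breadth-first crawl over an in-memory link graph. It is a sketch under stated assumptions, not the code behind `BFSStrategy`: the `SITE` dictionary, `fetch_links`, and `bfs_crawl` are all hypothetical, and a missing key stands in for a page that fails to load.

```python
import asyncio

# Hypothetical site: URL -> outgoing links. "/broken" has no entry,
# so fetching it simulates a page that fails to load.
SITE = {
    "/": ["/a", "/b", "/broken"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": ["/d"],
    "/d": [],
}


async def fetch_links(url):
    await asyncio.sleep(0)  # stands in for network I/O
    if url not in SITE:
        raise ConnectionError(f"cannot load {url}")
    return SITE[url]


async def bfs_crawl(start, max_depth):
    seen = {start}  # every URL ever queued, so nothing is fetched twice
    crawled, failed = [], []
    frontier = [start]
    for _depth in range(max_depth + 1):
        # Fetch the whole level concurrently; failures come back as values.
        results = await asyncio.gather(
            *(fetch_links(url) for url in frontier), return_exceptions=True
        )
        next_frontier = []
        for url, links in zip(frontier, results):
            if isinstance(links, Exception):
                failed.append(url)  # record the failure and keep going
                continue
            crawled.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break
    return crawled, failed


crawled, failed = asyncio.run(bfs_crawl("/", max_depth=2))
print(crawled)  # ['/', '/a', '/b', '/c']
print(failed)   # ['/broken']
```

With `max_depth=2`, `/d` is discovered but never fetched, and the failure of `/broken` does not stop the rest of the level from being processed.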

## Quick Start

```python
import asyncio

from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
from crawl4ai.async_webcrawler import AsyncWebCrawler

# Initialize the scraper
crawler = AsyncWebCrawler()
strategy = BFSStrategy(
    max_depth=2,  # How deep to crawl
    url_pattern="*.example.com/*"  # What URLs to follow
)
scraper = AsyncWebScraper(crawler, strategy)

# Start scraping
async def main():
    # Collect all results
    result = await scraper.ascrape("https://example.com")
    print(f"Found {len(result.extracted_data)} pages")

    # Or stream results
    async for page in scraper.ascrape("https://example.com", stream=True):
        print(f"Processing {page.url}")

# Run the coroutine; without this call, main() is defined but never executed
asyncio.run(main())
```

## Best Practices

1. **Choose the Right Mode**
   - Use streaming for large websites or real-time processing
   - Use collecting for smaller sites or when you need the complete dataset

2. **Configure Depth**
   - Start with a small depth (2-3) and increase if needed
   - The number of pages can grow exponentially with depth: at roughly 10 new links per page, depth 3 alone can mean on the order of 1,000 pages

3. **Set Appropriate Filters**
   - Use URL patterns to stay within relevant sections
   - Set content type filters to only process useful pages

4. **Handle Resources Responsibly**
   - Enable parallel processing for faster results
   - Consider the target website's capacity
   - Implement appropriate delays between requests

## Common Use Cases

- **Content Aggregation**: Collect articles, blog posts, or news from multiple pages
- **Data Extraction**: Gather product information, prices, or specifications
- **Site Mapping**: Create a complete map of a website's structure
- **Content Monitoring**: Track changes or updates across multiple pages
- **Data Mining**: Extract and analyze patterns across web pages

## Advanced Features

- Custom scoring algorithms for prioritizing important pages
- URL filters for focusing on specific site sections
- Content type filtering for processing only relevant pages
- Progress tracking for monitoring long-running scrapes
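
As an illustration of how URL filtering and scoring can work together, the following sketch drops URLs that fail a glob pattern and orders the rest with a priority queue. It relies only on the standard library. `keyword_score` and `prioritize` are hypothetical helpers and do not reflect the scorer or filter interfaces shipped with the library.

```python
import heapq
from fnmatch import fnmatchcase


def keyword_score(url, keywords):
    """Hypothetical scorer: count how many keywords appear in the URL."""
    lowered = url.lower()
    return sum(1 for keyword in keywords if keyword in lowered)


def prioritize(urls, pattern, keywords):
    """Drop URLs that fail the filter, then return the rest, highest score first."""
    heap = []
    for order, url in enumerate(urls):
        if not fnmatchcase(url, pattern):
            continue  # outside the section we care about
        # heapq is a min-heap, so negate the score; `order` keeps ties stable.
        heapq.heappush(heap, (-keyword_score(url, keywords), order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


candidates = [
    "https://example.com/blog/intro",
    "https://example.com/docs/api/pricing",
    "https://other.org/docs/api",
    "https://example.com/docs/install",
    "https://example.com/docs/api/reference",
]
ordered = prioritize(
    candidates,
    pattern="https://example.com/docs/*",
    keywords=["api", "reference"],
)
for url in ordered:
    print(url)
```

The blog page and the off-site URL are filtered out, and the remaining pages are visited in order of how many keywords their URLs contain.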

Need more help? Check out our [examples repository](https://github.com/example/crawl4ai/examples) or join our [community Discord](https://discord.gg/example).