AsyncWebScraper: Smart Web Crawling Made Easy

AsyncWebScraper is a powerful and flexible web scraping tool that makes it easy to collect data from websites efficiently. Whether you need to scrape a few pages or an entire website, AsyncWebScraper handles the complexity of web crawling while giving you fine-grained control over the process.

How It Works

flowchart TB
    Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
    Init --> InputURL[Receive URL to scrape]
    InputURL --> Decision{Stream or\nCollect?}
    
    %% Streaming Path
    Decision -->|Stream| StreamInit[Initialize Streaming Mode]
    StreamInit --> StreamStrategy[Call Strategy.ascrape]
    StreamStrategy --> AsyncGen[Create Async Generator]
    AsyncGen --> ProcessURL[Process Next URL]
    ProcessURL --> FetchContent[Fetch Page Content]
    FetchContent --> Extract[Extract Data]
    Extract --> YieldResult[Yield CrawlResult]
    YieldResult --> CheckMore{More URLs?}
    CheckMore -->|Yes| ProcessURL
    CheckMore -->|No| StreamEnd([End Stream])
    
    %% Collecting Path
    Decision -->|Collect| CollectInit[Initialize Collection Mode]
    CollectInit --> CollectStrategy[Call Strategy.ascrape]
    CollectStrategy --> CollectGen[Create Async Generator]
    CollectGen --> ProcessURLColl[Process Next URL]
    ProcessURLColl --> FetchContentColl[Fetch Page Content]
    FetchContentColl --> ExtractColl[Extract Data]
    ExtractColl --> StoreColl[Store in Dictionary]
    StoreColl --> CheckMoreColl{More URLs?}
    CheckMoreColl -->|Yes| ProcessURLColl
    CheckMoreColl -->|No| CreateResult[Create ScraperResult]
    CreateResult --> ReturnResult([Return Result])
    
    %% Parallel Processing
    subgraph Parallel
        ProcessURL
        FetchContent
        Extract
        ProcessURLColl
        FetchContentColl
        ExtractColl
    end
    
    %% Error Handling
    FetchContent --> ErrorCheck{Error?}
    ErrorCheck -->|Yes| LogError[Log Error]
    LogError --> UpdateStats[Update Error Stats]
    UpdateStats --> CheckMore
    ErrorCheck -->|No| Extract
    
    FetchContentColl --> ErrorCheckColl{Error?}
    ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
    LogErrorColl --> UpdateStatsColl[Update Error Stats]
    UpdateStatsColl --> CheckMoreColl
    ErrorCheckColl -->|No| ExtractColl
    
    %% Style definitions
    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
    classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;
    
    class Start,StreamEnd,ReturnResult start;
    class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
    class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
    class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;

AsyncWebScraper uses an intelligent crawling system that can navigate through websites following your specified strategy. It supports two main modes of operation:

1. Streaming Mode

async for result in scraper.ascrape(url, stream=True):
    print(f"Found data on {result.url}")
    process_data(result.data)

Perfect for processing large websites
Memory efficient - handles one page at a time
Ideal for real-time data processing
Great for monitoring or continuous scraping tasks

2. Collection Mode

result = await scraper.ascrape(url)
print(f"Scraped {len(result.crawled_urls)} pages")
process_all_data(result.extracted_data)

Collects all data before returning
Best for when you need the complete dataset
Easier to work with for batch processing
Includes comprehensive statistics

Key Features

Smart Crawling: Automatically follows relevant links while avoiding duplicates
Parallel Processing: Scrapes multiple pages simultaneously for better performance
Memory Efficient: Choose between streaming and collecting based on your needs
Error Resilient: Continues working even if some pages fail to load
Progress Tracking: Monitor the scraping progress in real-time
Customizable: Configure crawling strategy, filters, and scoring to match your needs

Quick Start

from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
from crawl4ai.async_webcrawler import AsyncWebCrawler

# Initialize the scraper
crawler = AsyncWebCrawler()
strategy = BFSStrategy(
    max_depth=2,  # How deep to crawl
    url_pattern="*.example.com/*"  # What URLs to follow
)
scraper = AsyncWebScraper(crawler, strategy)

# Start scraping
async def main():
    # Collect all results
    result = await scraper.ascrape("https://example.com")
    print(f"Found {len(result.extracted_data)} pages")
    
    # Or stream results
    async for page in scraper.ascrape("https://example.com", stream=True):
        print(f"Processing {page.url}")

Best Practices

Choose the Right Mode
- Use streaming for large websites or real-time processing
- Use collecting for smaller sites or when you need the complete dataset
Configure Depth
- Start with a small depth (2-3) and increase if needed
- Higher depths mean exponentially more pages to crawl
Set Appropriate Filters
- Use URL patterns to stay within relevant sections
- Set content type filters to only process useful pages
Handle Resources Responsibly
- Enable parallel processing for faster results
- Consider the target website's capacity
- Implement appropriate delays between requests

Common Use Cases

Content Aggregation: Collect articles, blog posts, or news from multiple pages
Data Extraction: Gather product information, prices, or specifications
Site Mapping: Create a complete map of a website's structure
Content Monitoring: Track changes or updates across multiple pages
Data Mining: Extract and analyze patterns across web pages

Advanced Features

Custom scoring algorithms for prioritizing important pages
URL filters for focusing on specific site sections
Content type filtering for processing only relevant pages
Progress tracking for monitoring long-running scrapes

Need more help? Check out our examples repository or join our community Discord.

6.3 KiB Raw Blame History