Extended Documentation: Asynchronous Crawling with AsyncWebCrawler
This document provides a comprehensive, human-oriented overview of the AsyncWebCrawler class and related components from the crawl4ai package. It explains the motivations behind asynchronous crawling, shows how to configure and run crawls, and provides examples for advanced features like dynamic content handling, extraction strategies, caching, containerization, and troubleshooting.
Introduction
[EDIT: This is not a good way to introduce the library. The library excels at generating crawl data, as markdown or extracted JSON, as quickly as possible, and it is designed to be efficient in memory and CPU usage. Users should choose it because it generates markdown suitable for large language models and AI. It can also create structured data, either by attaching a large language model or through JSON CSS and JSON XPath extraction, which let users define patterns and extract data quickly. Another strength is that it works everywhere: it can crawl any website by offering capabilities such as connecting to a remote browser or using persistent data. This lets developers establish their own identity on websites where they have authenticated access, so they can crawl without being mistaken for a bot. This is a better way to introduce the library. In these documents, we discuss the main class, AsyncWebCrawler, and all the functionality it provides.]
Crawling websites can be slow if done sequentially, especially when handling large numbers of URLs or rendering dynamic pages. Asynchronous crawling helps you run multiple operations concurrently, improving throughput and performance. The AsyncWebCrawler class leverages asynchronous I/O and browser automation tools to fetch content efficiently, handle complex DOM interactions, and extract structured data.
Quick Start
Before diving into advanced features, here is a quick start example that shows how to run a simple asynchronous crawl with a headless Chromium browser, extract basic text, and print the results.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Basic browser configuration
    browser_config = BrowserConfig(browser_type="chromium", headless=True)

    # Run the crawler asynchronously
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
This snippet initializes a headless Chromium browser, crawls the page, processes the HTML, and prints extracted content as Markdown.
Browser Configuration
The BrowserConfig class defines browser-related settings and behaviors. You can customize:
- browser_type: Browser to use, such as chromium or firefox.
- headless: Run the browser in headless mode (no visible UI).
- viewport_width and viewport_height: Control viewport dimensions for rendering.
- proxy: Configure proxies to bypass IP restrictions.
- verbose: Control logging verbosity.
Example: Customizing Browser Settings
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Visible Firefox window with a 1920x1080 viewport and verbose logging
browser_config = BrowserConfig(
    browser_type="firefox",
    headless=False,
    viewport_width=1920,
    viewport_height=1080,
    verbose=True
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://yourwebsite.com")
        print(result.markdown)

asyncio.run(main())
Running in Docker
For scalability and reproducibility, consider running your crawler inside a Docker container. A simple Dockerfile might look like this:
FROM python:3.10-slim
RUN apt-get update && apt-get install -y wget
RUN pip install crawl4ai playwright
RUN playwright install --with-deps chromium
COPY your_script.py /app/your_script.py
WORKDIR /app
CMD ["python", "your_script.py"]
You can then run:
docker build -t mycrawler .
docker run mycrawler
Within this container, AsyncWebCrawler will launch Chromium using Playwright and crawl sites as configured.
Asynchronous Crawling Strategies
By default, AsyncWebCrawler uses AsyncPlaywrightCrawlerStrategy, which relies on Playwright for browser automation. This lets you interact with DOM elements, scroll, click buttons, and handle dynamic content. If other strategies are available, you can specify them during initialization.
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
Handling Dynamic Content
Modern websites often load data via JavaScript or require user interactions. You can inject custom JavaScript snippets to manipulate the page, click buttons, or wait for certain elements to appear before extracting content.
Example: Loading More Content
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# JavaScript that clicks every "load more" button, then waits for new content
js_code = """
(async () => {
    const loadButtons = document.querySelectorAll(".load-more");
    for (const btn of loadButtons) btn.click();
    await new Promise(r => setTimeout(r, 2000)); // Wait for new content
})();
"""

async def main():
    config = CrawlerRunConfig(js_code=[js_code])
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/infinite-scroll", config=config)
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
You can also use Playwright selectors to wait for specific elements before extraction.
Extraction and Filtering
AsyncWebCrawler supports various extraction strategies to convert raw HTML into structured data. For example, JsonCssExtractionStrategy allows you to specify CSS selectors and get structured JSON from the page. LLMExtractionStrategy can feed extracted text into a language model for intelligent data extraction.
You can also apply content filters and chunking strategies to split large documents into smaller pieces before processing.
Example: Using a JSON CSS Extraction Strategy
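import asyncio
from crawl4ai import JsonCssExtractionStrategy, CrawlerRunConfig, AsyncWebCrawler, RegexChunking

# Assumption: the strategy takes a schema dict (name, baseSelector, fields)
# rather than a bare selectors mapping; check this against your installed version.
schema = {"name": "Page", "baseSelector": "body",
          "fields": [{"name": "title", "selector": "h1", "type": "text"}]}

async def main():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        chunking_strategy=RegexChunking()
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print("Extracted Content:")
        print(result.extracted_content)

asyncio.run(main())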
Comparing Chunking Strategies:
- Regex-based chunking: Splits text by patterns, good for basic splitting.
- NLP-based chunking (if available): Splits text into semantically meaningful units, ideal for LLM-based extraction.
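To make the regex-based option concrete, here is a minimal standard-library sketch of pattern-based splitting. It is not the implementation behind RegexChunking; the function name regex_chunks and the default blank-line pattern are assumptions chosen for illustration.

```python
import re

def regex_chunks(text, patterns=(r"\n\s*\n",)):
    """Split text on each pattern in turn and drop empty pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Trim surrounding whitespace and discard chunks that end up empty
    return [c.strip() for c in chunks if c.strip()]

sample = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
chunks = regex_chunks(sample)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Splitting on blank lines keeps each paragraph intact, which is usually enough for basic splitting. Sentence-aware or semantic chunking needs more than a pattern.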
Caching and Performance
Caching helps avoid repeatedly fetching and rendering the same page. By default, caching is enabled (CacheMode.ENABLED), so subsequent crawls of the same URL can skip the network fetch if the data is still fresh. You can control the cache mode, clear the cache, or bypass it when needed.
Cache Modes:
- CacheMode.ENABLED: Use cache if available, write new results to cache.
- CacheMode.BYPASS: Skip cache reading, but still write new results.
- CacheMode.DISABLED: Do not use cache at all.
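The read/write behavior of the three modes can be sketched with a toy in-memory cache. This models only the semantics described above and is not crawl4ai's cache implementation; Mode, TinyCache, and the fake fetch method are hypothetical names used for illustration.

```python
from enum import Enum

class Mode(Enum):
    """Hypothetical stand-in for CacheMode."""
    ENABLED = "enabled"
    BYPASS = "bypass"
    DISABLED = "disabled"

class TinyCache:
    def __init__(self):
        self.store = {}
        self.fetches = 0

    def fetch(self, url):
        # Pretend network fetch; the counter makes each response distinguishable
        self.fetches += 1
        return f"<html>{url} v{self.fetches}</html>"

    def get(self, url, mode):
        can_read = mode is Mode.ENABLED
        can_write = mode in (Mode.ENABLED, Mode.BYPASS)
        if can_read and url in self.store:
            return self.store[url]
        html = self.fetch(url)
        if can_write:
            self.store[url] = html
        return html

cache = TinyCache()
url = "https://example.com"
first = cache.get(url, Mode.ENABLED)    # miss: fetches and stores
second = cache.get(url, Mode.ENABLED)   # hit: served from the store
third = cache.get(url, Mode.BYPASS)     # skips the read, refreshes the store
fourth = cache.get(url, Mode.DISABLED)  # fetches, leaves the store untouched
fifth = cache.get(url, Mode.ENABLED)    # hit: sees the entry written by BYPASS
print(cache.fetches, "fetches for 5 requests")
```

In this model, BYPASS is the mode to reach for when data looks stale, because it forces a fresh fetch and still leaves an updated entry for later cached runs.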
Clearing and Flushing the Cache:
async with AsyncWebCrawler() as crawler:
    await crawler.aclear_cache()  # Clear entire cache
    # ... run some crawls ...
    await crawler.aflush_cache()  # Flush partial entries if needed
Use caching to speed up development, repeated tests, or partial re-runs of large crawls.
Batch Crawling and Parallelization
The arun_many method lets you process multiple URLs concurrently, improving throughput. You can limit concurrency with semaphore_count and apply rate limiting via CrawlerRunConfig parameters like mean_delay and max_range.
Example: Batch Crawling
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
]

async def main():
    # Limit concurrency to 10 crawls and add a randomized delay between requests
    config = CrawlerRunConfig(semaphore_count=10, mean_delay=1.0, max_range=0.5)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for res in results:
            print(res.url, res.markdown)

asyncio.run(main())
This allows you to process large URL lists efficiently. Adjust semaphore_count to match your resource limits.
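For intuition about what a concurrency limit and a randomized delay do, the following standard-library sketch bounds parallel work with asyncio.Semaphore. It does not call crawl4ai: fake_fetch is a hypothetical stand-in for a page fetch, and computing the delay as mean_delay plus a random value up to max_range is an assumption about how those two parameters combine.

```python
import asyncio
import random

async def fake_fetch(url):
    """Hypothetical stand-in for a real page fetch."""
    await asyncio.sleep(0.01)
    return f"content of {url}"

async def crawl_many(urls, semaphore_count=3, mean_delay=0.01, max_range=0.01):
    semaphore = asyncio.Semaphore(semaphore_count)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:  # at most semaphore_count workers pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                # Randomized politeness delay before the request
                await asyncio.sleep(mean_delay + random.uniform(0, max_range))
                return url, await fake_fetch(url)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input URLs
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak

urls = [f"https://example.com/page/{i}" for i in range(10)]
results, peak = asyncio.run(crawl_many(urls))
print(len(results), "pages fetched; peak concurrency:", peak)
```

Raising the limit increases throughput until the browser, the network, or the target site becomes the bottleneck, which is why the value should follow your resource limits.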
Scaling Crawls
To scale beyond a single machine, consider:
- Distributing URL lists across multiple workers or containers.
- Using a job queue like Celery or Redis Queue to schedule crawls.
- Integrating with cloud-based solutions for browser automation.
Always ensure you respect target site policies and comply with legal and ethical guidelines for web scraping.
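One simple way to distribute a URL list across workers is to shard it by a stable hash, so that every worker can compute its own share without coordination. The sketch below is one possible approach under that assumption; assign_worker and partition are hypothetical helper names, not part of crawl4ai.

```python
import hashlib

def assign_worker(url, worker_count):
    """Map a URL to a worker index using a hash that is stable across runs."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count

def partition(urls, worker_count):
    """Group URLs into one list per worker."""
    shards = [[] for _ in range(worker_count)]
    for url in urls:
        shards[assign_worker(url, worker_count)].append(url)
    return shards

urls = [f"https://example.com/item/{i}" for i in range(20)]
shards = partition(urls, worker_count=4)
for index, shard in enumerate(shards):
    print(f"worker {index}: {len(shard)} URLs")
```

A cryptographic hash is used instead of Python's built-in hash() because the built-in is randomized per process for strings, which would give different assignments on different machines.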
Screenshots and PDFs
If you need visual confirmation, you can enable screenshots or PDFs:
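import asyncio
import base64
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

async def main():
    config = CrawlerRunConfig(screenshot=True, pdf=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
    # Assumption: result.screenshot is a base64-encoded string and result.pdf is raw bytes
    with open("page_screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)

asyncio.run(main())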
This is helpful for debugging rendering issues or retaining visual copies of crawled pages.
Troubleshooting and Common Issues
Common Problems and Direct Fixes:
- Browser not launching:
  - Check that you have installed Playwright and run playwright install for the chosen browser.
  - Ensure all required dependencies are installed.
- Timeouts or partial loads:
  - Increase timeouts or add delays between requests using mean_delay and max_range.
  - Wait for specific DOM elements to appear before proceeding.
- JavaScript not executing as expected:
  - Use js_code in CrawlerRunConfig to inject scripts.
  - Check browser console for errors or consider headless=False to debug UI interactions.
- Content Extraction fails:
  - Validate CSS selectors or extraction strategies.
  - Try a different extraction strategy if the current one is not producing results.
- Stale Data due to Caching:
  - Call await crawler.aclear_cache() to remove old entries.
  - Use cache_mode=CacheMode.BYPASS to fetch fresh data.
Direct Code Fixes:
If you experience missing content after injecting JS, try waiting longer:
js_code = """
(async () => {
document.querySelector(".load-more").click();
await new Promise(r => setTimeout(r, 3000));
})();
"""
config = CrawlerRunConfig(js_code=[js_code])
Or run headless=False to visually verify that the UI is changing as expected.
Best Practices and Tips
- Structuring your code: Keep crawl logic modular. Have separate functions for configuring crawls, extracting data, and processing results.
- Error Handling: Wrap crawl operations in try/except blocks and log errors with crawler.logger.
- Avoiding Getting Blocked: Use proxies or rotate user agents if you crawl frequently. Randomize delays between requests.
- Authentication and Session Management: If the site requires login, provide the crawler with login steps via js_code or Playwright selectors. Consider using cookies or session storage retrieval in CrawlerRunConfig.
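The error-handling tip above can be sketched as a small retry wrapper. This is an illustration under assumptions, not a crawl4ai feature: crawl_with_retries and flaky_crawl are hypothetical names, the standard logging module stands in for crawler.logger, and the exponential backoff is one common choice among several.

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl")

async def crawl_with_retries(crawl, url, attempts=3, base_delay=0.01):
    """Await crawl(url); on failure, log the error and retry with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await crawl(url)
        except Exception as exc:
            logger.warning("attempt %d/%d failed for %s: %s", attempt, attempts, url, exc)
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical crawl function that fails twice before succeeding
calls = {"count": 0}

async def flaky_crawl(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"ok: {url}"

outcome = asyncio.run(crawl_with_retries(flaky_crawl, "https://example.com"))
print(outcome)
```

In a real crawl, the function passed in would wrap a call such as crawler.arun, and the except clause would ideally catch only the error types you expect to be transient.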
Reference and Additional Resources
- GitHub Repository: crawl4ai GitHub
- Playwright Docs: https://playwright.dev/
- AsyncIO in Python: Python Asyncio Docs
FAQ
Q: How do I customize user agents?
A: Pass user_agent="MyUserAgentString" to arun or arun_many, or update crawler_strategy directly.
Q: Can I crawl local HTML files?
A: Yes, provide a file:// URL or raw: prefix with raw HTML strings.
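As a small sketch of how those two kinds of input can be built, the helpers below produce a file:// URL from a local path and a raw:-prefixed string from HTML. The helper names are hypothetical, and the example path assumes a POSIX-style filesystem.

```python
from pathlib import Path

def local_file_target(path):
    """Return a file:// URL for a local path (made absolute first)."""
    return Path(path).absolute().as_uri()

def raw_html_target(html):
    """Prefix an HTML string so it is treated as raw content, not a URL."""
    return "raw:" + html

print(local_file_target("/tmp/page.html"))
print(raw_html_target("<h1>Hello</h1>"))
```

Either string can then be passed where a URL is expected, for example as the first argument to arun.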
Q: How do I integrate LLM-based extraction?
A: Set extraction_strategy=LLMExtractionStrategy(...) and provide a chunking strategy. This allows using large language models for context-aware data extraction.