crawl4ai/docs/llm.txt/3_async_webcrawler.md
UncleCode fb33a24891 Commit Message:
- Added examples for Amazon product data extraction methods
  - Updated configuration options and enhance documentation
  - Minor refactoring for improved performance and readability
  - Cleaned up version control settings.
2024-12-29 20:05:18 +08:00


# Extended Documentation: Asynchronous Crawling with `AsyncWebCrawler`
This document provides a comprehensive, human-oriented overview of the `AsyncWebCrawler` class and related components from the `crawl4ai` package. It explains the motivations behind asynchronous crawling, shows how to configure and run crawls, and provides examples for advanced features like dynamic content handling, extraction strategies, caching, containerization, and troubleshooting.
## Introduction
[EDIT: This is not a good way to introduce the library. The library excels at generating crawl data, as markdown or extracted JSON, as quickly as possible, and it is designed to be efficient in memory and CPU usage. Users should choose it because it generates markdown suitable for large language models and AI. It can also produce structured data, either by attaching a large language model or through JSON CSS and JSON XPath extraction, which let users define patterns and extract data quickly. Another strength is that it works everywhere: it can crawl any website by offering capabilities such as connecting to a remote browser or using persistent data. Persistent data lets developers keep their own identity on websites where they have authenticated access, so they can crawl without being mistaken for a bot. This is a better way to introduce the library. These documents discuss the main class, `AsyncWebCrawler`, and all the functionality it offers.]
Crawling websites can be slow if done sequentially, especially when handling large numbers of URLs or rendering dynamic pages. Asynchronous crawling helps you run multiple operations concurrently, improving throughput and performance. The `AsyncWebCrawler` class leverages asynchronous I/O and browser automation tools to fetch content efficiently, handle complex DOM interactions, and extract structured data.
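To make the throughput claim concrete, here is a minimal sketch using only the standard library. `fake_fetch` and `InFlightTracker` are hypothetical stand-ins (the fetch just sleeps) and are not part of `crawl4ai`. The sketch records how many fetches overlap in time: one when run sequentially, and as many as there are URLs when started together with `asyncio.gather`.

```python
import asyncio

class InFlightTracker:
    """Counts how many simulated fetches overlap in time."""
    def __init__(self):
        self.current = 0
        self.peak = 0

async def fake_fetch(url, tracker, delay=0.01):
    # Hypothetical stand-in for a network fetch: it only sleeps.
    tracker.current += 1
    tracker.peak = max(tracker.peak, tracker.current)
    await asyncio.sleep(delay)
    tracker.current -= 1
    return f"content of {url}"

async def crawl_sequential(urls, tracker):
    # One fetch at a time: each await finishes before the next starts.
    return [await fake_fetch(u, tracker) for u in urls]

async def crawl_concurrent(urls, tracker):
    # All fetches are started together and awaited as a group.
    return await asyncio.gather(*(fake_fetch(u, tracker) for u in urls))

if __name__ == "__main__":
    demo_urls = [f"https://example.com/page{i}" for i in range(5)]
    seq, con = InFlightTracker(), InFlightTracker()
    asyncio.run(crawl_sequential(demo_urls, seq))
    asyncio.run(crawl_concurrent(demo_urls, con))
    print("peak in flight, sequential:", seq.peak)
    print("peak in flight, concurrent:", con.peak)
```

With real pages the waiting time is network and rendering latency rather than a sleep, but the overlap works the same way.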
### Quick Start
Before diving into advanced features, here is a quick start example that shows how to run a simple asynchronous crawl with a headless Chromium browser, extract basic text, and print the results.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Basic browser configuration
    browser_config = BrowserConfig(browser_type="chromium", headless=True)
    # Run the crawler asynchronously
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
This snippet initializes a headless Chromium browser, crawls the page, processes the HTML, and prints extracted content as Markdown.
## Browser Configuration
The `BrowserConfig` class defines browser-related settings and behaviors. You can customize:
- `browser_type`: Browser to use, such as `chromium` or `firefox`.
- `headless`: Run the browser in headless mode (no visible UI).
- `viewport_width` and `viewport_height`: Control viewport dimensions for rendering.
- `proxy`: Configure proxies to bypass IP restrictions.
- `verbose`: Control logging verbosity.
**Example: Customizing Browser Settings**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Visible Firefox window with a full-HD viewport and verbose logging
    browser_config = BrowserConfig(
        browser_type="firefox",
        headless=False,
        viewport_width=1920,
        viewport_height=1080,
        verbose=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://yourwebsite.com")
        print(result.markdown)

asyncio.run(main())
```
### Running in Docker
For scalability and reproducibility, consider running your crawler inside a Docker container. A simple Dockerfile might look like this:
```dockerfile
FROM python:3.10-slim
RUN apt-get update && apt-get install -y wget
RUN pip install crawl4ai playwright
RUN playwright install --with-deps chromium
COPY your_script.py /app/your_script.py
WORKDIR /app
CMD ["python", "your_script.py"]
```
You can then run:
```bash
docker build -t mycrawler .
docker run mycrawler
```
Within this container, `AsyncWebCrawler` will launch Chromium using Playwright and crawl sites as configured.
## Asynchronous Crawling Strategies
By default, `AsyncWebCrawler` uses `AsyncPlaywrightCrawlerStrategy`, which relies on Playwright for browser automation. This lets you interact with DOM elements, scroll, click buttons, and handle dynamic content. If other strategies are available, you can specify them during initialization.
```python
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
```
## Handling Dynamic Content
Modern websites often load data via JavaScript or require user interactions. You can inject custom JavaScript snippets to manipulate the page, click buttons, or wait for certain elements to appear before extracting content.
**Example: Loading More Content**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Click every "load more" button, then give new content time to render
js_code = """
(async () => {
    const loadButtons = document.querySelectorAll(".load-more");
    for (const btn of loadButtons) btn.click();
    await new Promise(r => setTimeout(r, 2000)); // Wait for new content
})();
"""

async def main():
    config = CrawlerRunConfig(js_code=[js_code])
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/infinite-scroll", config=config)
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
You can also use Playwright selectors to wait for specific elements before extraction.
## Extraction and Filtering
`AsyncWebCrawler` supports various extraction strategies to convert raw HTML into structured data. For example, `JsonCssExtractionStrategy` allows you to specify CSS selectors and get structured JSON from the page. `LLMExtractionStrategy` can feed extracted text into a language model for intelligent data extraction.
You can also apply content filters and chunking strategies to split large documents into smaller pieces before processing.
**Example: Using a JSON CSS Extraction Strategy**
```python
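import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy, RegexChunking

# Schema: one record per <body>, taking the <h1> text as "title"
schema = {"name": "Page", "baseSelector": "body",
          "fields": [{"name": "title", "selector": "h1", "type": "text"}]}

async def main():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        chunking_strategy=RegexChunking(),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print("Extracted Content:")
        print(result.extracted_content)

asyncio.run(main())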
```
**Comparing Chunking Strategies:**
- Regex-based chunking: Splits text by patterns, good for basic splitting.
- NLP-based chunking (if available): Splits text into semantically meaningful units, ideal for LLM-based extraction.
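As an illustration of the regex-based approach, the following sketch splits text on blank lines with `re.split`. The function `regex_chunk` and its default pattern are assumptions made for this example; this is not the implementation of the library's `RegexChunking` class.

```python
import re

def regex_chunk(text, pattern=r"\n\s*\n"):
    """Split text wherever `pattern` matches and drop empty pieces."""
    pieces = re.split(pattern, text)
    return [p.strip() for p in pieces if p.strip()]

if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
    for i, chunk in enumerate(regex_chunk(sample), start=1):
        print(i, repr(chunk))
```

Pattern-based splitting is cheap and predictable, but it has no notion of meaning, which is why a semantic chunker may suit LLM-based extraction better.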
## Caching and Performance
Caching helps avoid repeatedly fetching and rendering the same page. By default, caching is enabled (`CacheMode.ENABLED`), so subsequent crawls of the same URL can skip the network fetch if the data is still fresh. You can control the cache mode, clear the cache, or bypass it when needed.
**Cache Modes:**
- `CacheMode.ENABLED`: Use cache if available, write new results to cache.
- `CacheMode.BYPASS`: Skip cache reading, but still write new results.
- `CacheMode.DISABLED`: Do not use cache at all.
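The difference between the three modes comes down to whether the cache is read and whether it is written. The sketch below models that with a plain dictionary. `SketchCacheMode`, `fetch_with_cache`, and the `fetch` callback are hypothetical names for illustration only; the real logic lives inside `crawl4ai` and may differ in detail.

```python
from enum import Enum

class SketchCacheMode(Enum):
    # Hypothetical mirror of the three modes described above.
    ENABLED = "enabled"    # read from cache, write to cache
    BYPASS = "bypass"      # skip the read, still write
    DISABLED = "disabled"  # neither read nor write

def fetch_with_cache(url, mode, cache, fetch):
    """Return content for `url`, consulting `cache` according to `mode`."""
    if mode is SketchCacheMode.ENABLED and url in cache:
        return cache[url]            # cache hit: no fetch needed
    content = fetch(url)             # cache miss, BYPASS, or DISABLED
    if mode is not SketchCacheMode.DISABLED:
        cache[url] = content         # ENABLED and BYPASS both store the result
    return content

if __name__ == "__main__":
    calls = []

    def fake_fetch(url):
        calls.append(url)
        return f"<html>{url} v{len(calls)}</html>"

    cache = {}
    for mode in (SketchCacheMode.ENABLED, SketchCacheMode.ENABLED, SketchCacheMode.BYPASS):
        print(mode.name, fetch_with_cache("https://example.com", mode, cache, fake_fetch))
    print("fetch calls:", len(calls))
```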
**Clearing and Flushing the Cache:**
```python
from crawl4ai import AsyncWebCrawler

async def reset_cache():
    async with AsyncWebCrawler() as crawler:
        await crawler.aclear_cache()  # Clear entire cache
        # ... run some crawls ...
        await crawler.aflush_cache()  # Flush partial entries if needed
```
Use caching to speed up development, repeated tests, or partial re-runs of large crawls.
## Batch Crawling and Parallelization
The `arun_many` method lets you process multiple URLs concurrently, improving throughput. You can limit concurrency with `semaphore_count` and apply rate limiting via `CrawlerRunConfig` parameters like `mean_delay` and `max_range`.
**Example: Batch Crawling**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com",
]

async def main():
    # At most 10 concurrent crawls, with a randomized delay between requests
    config = CrawlerRunConfig(semaphore_count=10, mean_delay=1.0, max_range=0.5)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for res in results:
            print(res.url, res.markdown)

asyncio.run(main())
```
This allows you to process large URL lists efficiently. Adjust `semaphore_count` to match your resource limits.
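For intuition about what `semaphore_count`, `mean_delay`, and `max_range` control, here is a standard-library sketch of bounded concurrency with a randomized delay. `bounded_crawl` and `fake_fetch` are hypothetical, and the delay formula (`mean_delay` plus a random value between 0 and `max_range`) is an assumption about how such settings are commonly combined, not a statement of the library's internals.

```python
import asyncio
import random

async def bounded_crawl(urls, fetch, semaphore_count=3, mean_delay=0.0, max_range=0.0):
    """Run `fetch` over `urls` with at most `semaphore_count` calls in flight."""
    semaphore = asyncio.Semaphore(semaphore_count)

    async def worker(url):
        async with semaphore:
            # Assumed delay model: a base delay plus random jitter.
            await asyncio.sleep(mean_delay + random.uniform(0, max_range))
            return await fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(worker(u) for u in urls))

if __name__ == "__main__":
    async def fake_fetch(url):
        await asyncio.sleep(0.01)  # stand-in for network and rendering time
        return f"fetched {url}"

    demo_urls = [f"https://example.com/{i}" for i in range(6)]
    results = asyncio.run(
        bounded_crawl(demo_urls, fake_fetch, semaphore_count=2, mean_delay=0.01, max_range=0.01)
    )
    for line in results:
        print(line)
```

A lower limit reduces memory and load on the target site at the cost of total run time; a higher one does the opposite.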
## Scaling Crawls
To scale beyond a single machine, consider:
- Distributing URL lists across multiple workers or containers.
- Using a job queue like Celery or Redis Queue to schedule crawls.
- Integrating with cloud-based solutions for browser automation.
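One simple way to distribute a URL list across workers is to assign each URL to a shard by hashing it, so every worker can compute its own share without coordination. The helper `shard_urls` below is a hypothetical sketch using only the standard library; a job queue such as Celery would replace it in a larger deployment.

```python
import hashlib

def shard_urls(urls, num_workers):
    """Assign each URL to one of `num_workers` shards using a stable hash."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards = [[] for _ in range(num_workers)]
    for url in urls:
        # sha256 is stable across processes, unlike Python's built-in hash().
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_workers
        shards[index].append(url)
    return shards

if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(10)]
    for worker_id, shard in enumerate(shard_urls(demo_urls, 3)):
        print(f"worker {worker_id}: {len(shard)} URLs")
```

Each worker or container would then pass only its own shard to `arun_many`.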
Always ensure you respect target site policies and comply with legal and ethical guidelines for web scraping.
## Screenshots and PDFs
If you need visual confirmation, you can enable screenshots or PDFs:
```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(screenshot=True, pdf=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # Assumes the screenshot is returned as a base64 string; decode before writing
        with open("page_screenshot.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))
        with open("page.pdf", "wb") as f:
            f.write(result.pdf)

asyncio.run(main())
```
This is helpful for debugging rendering issues or retaining visual copies of crawled pages.
## Troubleshooting and Common Issues
**Common Problems and Direct Fixes:**
1. **Browser not launching**:
   - Check that you have installed Playwright and run `playwright install` for the chosen browser.
   - Ensure all required dependencies are installed.
2. **Timeouts or partial loads**:
   - Increase timeouts or add delays between requests using `mean_delay` and `max_range`.
   - Wait for specific DOM elements to appear before proceeding.
3. **JavaScript not executing as expected**:
   - Use `js_code` in `CrawlerRunConfig` to inject scripts.
   - Check the browser console for errors, or set `headless=False` to debug UI interactions.
4. **Content extraction fails**:
   - Validate CSS selectors or extraction strategies.
   - Try a different extraction strategy if the current one is not producing results.
5. **Stale data due to caching**:
   - Call `await crawler.aclear_cache()` to remove old entries.
   - Use `cache_mode=CacheMode.BYPASS` to fetch fresh data.
**Direct Code Fixes:**
If you experience missing content after injecting JS, try waiting longer:
```python
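from crawl4ai import CrawlerRunConfig

js_code = """
(async () => {
    document.querySelector(".load-more")?.click();  // no-op if the button is absent
    await new Promise(r => setTimeout(r, 3000));    // wait 3 seconds for new content
})();
"""
config = CrawlerRunConfig(js_code=[js_code])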
```
Or run with `headless=False` to visually verify that the UI is changing as expected.
## Best Practices and Tips
- **Structuring your code**: Keep crawl logic modular. Have separate functions for configuring crawls, extracting data, and processing results.
- **Error Handling**: Wrap crawl operations in try/except blocks and log errors with `crawler.logger`.
- **Avoiding Getting Blocked**: Use proxies or rotate user agents if you crawl frequently. Randomize delays between requests.
- **Authentication and Session Management**: If the site requires login, provide the crawler with login steps via `js_code` or Playwright selectors. Consider using cookies or session storage retrieval in `CrawlerRunConfig`.
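The error-handling advice can be sketched as a small wrapper that catches exceptions per URL, logs them, and retries a limited number of times. `crawl_with_retries` and its parameters are hypothetical, the `crawl` argument stands in for a call such as `crawler.arun`, and the standard `logging` module is used here in place of `crawler.logger`.

```python
import asyncio
import logging

logger = logging.getLogger("crawl_sketch")

async def crawl_with_retries(crawl, url, attempts=3, backoff=0.0):
    """Call `await crawl(url)`, retrying on any exception up to `attempts` times.

    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await crawl(url)
        except Exception as exc:  # broad on purpose: one bad URL should not stop a batch
            logger.warning("attempt %d/%d failed for %s: %s", attempt, attempts, url, exc)
            if attempt < attempts:
                await asyncio.sleep(backoff * attempt)  # linear backoff between tries
    return None

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    failures = {"left": 2}

    async def flaky_crawl(url):
        # Hypothetical crawl that fails twice, then succeeds.
        if failures["left"] > 0:
            failures["left"] -= 1
            raise RuntimeError("simulated timeout")
        return f"ok: {url}"

    print(asyncio.run(crawl_with_retries(flaky_crawl, "https://example.com")))
```

Returning `None` on failure keeps a batch loop simple; raising after the last attempt is an equally reasonable choice when failures must not pass silently.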
## Reference and Additional Resources
- **GitHub Repository**: [crawl4ai GitHub](https://github.com/yourusername/crawl4ai)
- **Playwright Docs**: [https://playwright.dev/](https://playwright.dev/)
- **AsyncIO in Python**: [Python Asyncio Docs](https://docs.python.org/3/library/asyncio.html)
## FAQ
**Q**: How do I customize user agents?
**A**: Pass `user_agent="MyUserAgentString"` to `arun` or `arun_many`, or update `crawler_strategy` directly.
**Q**: Can I crawl local HTML files?
**A**: Yes, provide a `file://` URL or `raw:` prefix with raw HTML strings.
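Building a `file://` URL by hand is error-prone, so a small helper can do it with `pathlib`. The function names below are hypothetical; the `raw:` prefix is the one mentioned in the answer above, and how the crawler interprets these strings is not shown here.

```python
from pathlib import Path

def local_file_url(path):
    """Return an absolute file:// URL for a local path."""
    return Path(path).resolve().as_uri()

def raw_html_url(html):
    """Prefix an HTML string with `raw:` as described above."""
    return "raw:" + html

if __name__ == "__main__":
    print(local_file_url("page.html"))
    print(raw_html_url("<html><body><h1>Hello</h1></body></html>"))
```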
**Q**: How do I integrate LLM-based extraction?
**A**: Set `extraction_strategy=LLMExtractionStrategy(...)` and provide a chunking strategy. This allows using large language models for context-aware data extraction.