Files
crawl4ai/docs/llm.txt/3_async_webcrawler.xs.md
UncleCode f2d9912697 Renames browser_config param to config in AsyncWebCrawler
Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.

Updates all documentation examples and internal usages to reflect the new parameter name for consistency.

Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.
2024-12-26 16:34:36 +08:00

112 lines
3.3 KiB
Markdown

# Crawl4AI: AsyncWebCrawler Reference
> Minimal code-oriented reference. Focus on parameters, usage patterns, and code.
(See full code: [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py))
## Setup & Quick Start
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
import asyncio
async def main():
async with AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)) as c:
r = await c.arun("https://example.com")
print(r.markdown)
asyncio.run(main())
```
## BrowserConfig & Docker
**Params:** `browser_type`, `headless`, `viewport_width`, `viewport_height`, `verbose`, `proxy`.
```python
browser_config = BrowserConfig(browser_type="firefox", headless=False)
async with AsyncWebCrawler(config=browser_config) as c:
r = await c.arun("https://site.com")
```
**Docker Example:**
```dockerfile
FROM python:3.10-slim
RUN pip install crawl4ai playwright
RUN playwright install chromium
COPY script.py /app/
WORKDIR /app
CMD ["python", "script.py"]
```
## Asynchronous Strategies
Default: `AsyncPlaywrightCrawlerStrategy`
```python
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
```
## Dynamic Content (js_code)
```python
js_code = ["""
(async () => {
document.querySelector(".load-more").click();
await new Promise(r => setTimeout(r, 2000));
})();
"""]
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(js_code=js_code)
```
## Extraction & Filtering
**Strategies:** `JsonCssExtractionStrategy`, `LLMExtractionStrategy`, `NoExtractionStrategy`.
**Chunking:** `RegexChunking`, NLP-based.
```python
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(selectors={"title": "h1"}))
```
## Caching & Performance
**Cache Modes:** `ENABLED`, `BYPASS`, `DISABLED`
```python
await c.aclear_cache()
await c.aflush_cache()
```
## Batch Crawling & Parallelization
```python
urls = ["https://site1.com", "https://site2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as c:
results = await c.arun_many(urls, config=config)
```
## Screenshots & PDFs
```python
config = CrawlerRunConfig(screenshot=True, pdf=True)
result = await c.arun("https://example.com", config=config)
with open("page.png","wb") as f: f.write(result.screenshot)
with open("page.pdf","wb") as f: f.write(result.pdf)
```
## Common Issues & Fixes
- Browser not launching: `playwright install chromium`
- Timeouts: Increase delays in `CrawlerRunConfig`
- JS not executing: Use `js_code` or headless=False
- Stale cache: `await c.aclear_cache()`
- Extraction fail: Check CSS selectors or try different strategy
## Best Practices & Tips
- Modularize crawl logic
- Use proxies/rotating user agents
- Add delays to avoid blocking
- Use `async with` for resource cleanup
## Links & FAQ
- GitHub: [crawl4ai](https://github.com/yourusername/crawl4ai)
- Playwright Docs: [https://playwright.dev/](https://playwright.dev/)
**FAQ:**
- Custom user agent: `user_agent="MyUserAgent"`
- Local files: `file://` or `raw:`
- LLM extraction: Set `extraction_strategy=LLMExtractionStrategy(...)`
## Links
- [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)