Files
crawl4ai/docs/llm.txt/3_async_webcrawler.xs.md
UncleCode f2d9912697 Renames browser_config param to config in AsyncWebCrawler
Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.

Updates all documentation examples and internal usages to reflect the new parameter name for consistency.

Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.
2024-12-26 16:34:36 +08:00

3.3 KiB

Crawl4AI: AsyncWebCrawler Reference

Minimal code-oriented reference. Focus on parameters, usage patterns, and code.

(See full code: async_webcrawler.py)

Setup & Quick Start

from crawl4ai import AsyncWebCrawler, BrowserConfig
import asyncio

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)) as c:
        r = await c.arun("https://example.com")
        print(r.markdown)

asyncio.run(main())

BrowserConfig & Docker

Params: browser_type, headless, viewport_width, viewport_height, verbose, proxy.

browser_config = BrowserConfig(browser_type="firefox", headless=False)
async with AsyncWebCrawler(config=browser_config) as c:
    r = await c.arun("https://site.com")

Docker Example:

FROM python:3.10-slim
RUN pip install crawl4ai playwright
RUN playwright install chromium
COPY script.py /app/
WORKDIR /app
CMD ["python", "script.py"]

Asynchronous Strategies

Default: AsyncPlaywrightCrawlerStrategy

from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())

Dynamic Content (js_code)

js_code = ["""
(async () => {
    document.querySelector(".load-more").click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""]
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(js_code=js_code)

Extraction & Filtering

Strategies: JsonCssExtractionStrategy, LLMExtractionStrategy, NoExtractionStrategy.
Chunking: RegexChunking, NLP-based.

config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(selectors={"title": "h1"}))

Caching & Performance

Cache Modes: ENABLED, BYPASS, DISABLED

await c.aclear_cache()
await c.aflush_cache()

Batch Crawling & Parallelization

urls = ["https://site1.com", "https://site2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as c:
    results = await c.arun_many(urls, config=config)

Screenshots & PDFs

config = CrawlerRunConfig(screenshot=True, pdf=True)
result = await c.arun("https://example.com", config=config)
with open("page.png","wb") as f: f.write(result.screenshot)
with open("page.pdf","wb") as f: f.write(result.pdf)

Common Issues & Fixes

  • Browser not launching: playwright install chromium
  • Timeouts: Increase delays in CrawlerRunConfig
  • JS not executing: Use js_code or headless=False
  • Stale cache: await c.aclear_cache()
  • Extraction fail: Check CSS selectors or try different strategy

Best Practices & Tips

  • Modularize crawl logic
  • Use proxies/rotating user agents
  • Add delays to avoid blocking
  • Use async with for resource cleanup

FAQ:

  • Custom user agent: user_agent="MyUserAgent"
  • Local files: file:// or raw:
  • LLM extraction: Set extraction_strategy=LLMExtractionStrategy(...)