Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
3.3 KiB
3.3 KiB
Crawl4AI: AsyncWebCrawler Reference
Minimal code-oriented reference. Focus on parameters, usage patterns, and code.
(See full code: async_webcrawler.py)
Setup & Quick Start
from crawl4ai import AsyncWebCrawler, BrowserConfig
import asyncio
async def main():
async with AsyncWebCrawler(browser_config=BrowserConfig(browser_type="chromium", headless=True)) as c:
r = await c.arun("https://example.com")
print(r.markdown)
asyncio.run(main())
BrowserConfig & Docker
Params: browser_type, headless, viewport_width, viewport_height, verbose, proxy.
browser_config = BrowserConfig(browser_type="firefox", headless=False)
async with AsyncWebCrawler(browser_config=browser_config) as c:
r = await c.arun("https://site.com")
Docker Example:
FROM python:3.10-slim
RUN pip install crawl4ai playwright
RUN playwright install chromium
COPY script.py /app/
WORKDIR /app
CMD ["python", "script.py"]
Asynchronous Strategies
Default: AsyncPlaywrightCrawlerStrategy
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
Dynamic Content (js_code)
js_code = ["""
(async () => {
document.querySelector(".load-more").click();
await new Promise(r => setTimeout(r, 2000));
})();
"""]
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(js_code=js_code)
Extraction & Filtering
Strategies: JsonCssExtractionStrategy, LLMExtractionStrategy, NoExtractionStrategy.
Chunking: RegexChunking, NLP-based.
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(selectors={"title": "h1"}))
Caching & Performance
Cache Modes: ENABLED, BYPASS, DISABLED
await c.aclear_cache()
await c.aflush_cache()
Batch Crawling & Parallelization
urls = ["https://site1.com", "https://site2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as c:
results = await c.arun_many(urls, config=config)
Screenshots & PDFs
config = CrawlerRunConfig(screenshot=True, pdf=True)
result = await c.arun("https://example.com", config=config)
with open("page.png","wb") as f: f.write(result.screenshot)
with open("page.pdf","wb") as f: f.write(result.pdf)
Common Issues & Fixes
- Browser not launching:
playwright install chromium - Timeouts: Increase delays in
CrawlerRunConfig - JS not executing: Use
js_codeor headless=False - Stale cache:
await c.aclear_cache() - Extraction fail: Check CSS selectors or try different strategy
Best Practices & Tips
- Modularize crawl logic
- Use proxies/rotating user agents
- Add delays to avoid blocking
- Use
async withfor resource cleanup
Links & FAQ
- GitHub: crawl4ai
- Playwright Docs: https://playwright.dev/
FAQ:
- Custom user agent:
user_agent="MyUserAgent" - Local files:
file://orraw: - LLM extraction: Set
extraction_strategy=LLMExtractionStrategy(...)