crawl4ai/docs/llm.txt/1_introduction.xs.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00


Crawl4AI LLM Reference

Minimal, code-focused reference for LLM-based retrieval and answer generation.

Intended usage: A language model trained on this document can provide quick answers to developers integrating Crawl4AI.

Installation

  • Basic:
pip install crawl4ai
crawl4ai-setup
  • If browser setup fails, install Chromium manually:
playwright install chromium

Basic Usage

  • Asynchronous crawl:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as c:
        r = await c.arun(url="https://example.com")
        print(r.markdown)

asyncio.run(main())

Concurrent Crawling

  • Multiple URLs:
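urls = ["https://example.com/page1", "https://example.com/page2"]
# Run inside an async function (needs asyncio and AsyncWebCrawler imported).
async with AsyncWebCrawler() as c:
    # The arun calls run concurrently; results keep the order of urls.
    results = await asyncio.gather(*[c.arun(url=u) for u in urls])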
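A plain asyncio.gather starts every crawl at once and raises on the first failure. The sketch below is one possible way to cap the number of crawls in flight and keep per-URL errors. The helper name crawl_all and its limit parameter are hypothetical and not part of Crawl4AI; fetch stands for any async callable that takes a URL.

```python
import asyncio


async def crawl_all(fetch, urls, limit=5):
    """Run fetch(url) for every URL with at most `limit` calls in flight.

    Returns a list of (url, result_or_exception) pairs in the order of `urls`.
    `fetch` is any async callable; this helper is illustrative, not a Crawl4AI API.
    """
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:  # blocks while `limit` fetches are already running
            try:
                return url, await fetch(url)
            except Exception as exc:  # record the failure instead of aborting the batch
                return url, exc

    return await asyncio.gather(*(one(u) for u in urls))
```

With a crawler open as c, a call such as await crawl_all(lambda u: c.arun(url=u), urls, limit=3) would apply it to the URLs above.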

Configuration

  • CacheMode:
from crawl4ai import CacheMode
r = await c.arun(url="...", cache_mode=CacheMode.ENABLED)
  • Proxies:
async with AsyncWebCrawler(proxies={"http": "http://user:pass@proxy:port"}) as c:
    r = await c.arun("https://example.com")
  • Headers & Viewport:
async with AsyncWebCrawler(headers={"User-Agent": "MyUA"}, viewport={"width":1024,"height":768}) as c:
    r = await c.arun("https://example.com")

JavaScript Injection

  • Custom JS:
js_code = ["""
(async () => {
    const btn = document.querySelector('#load-more');
    if (btn) btn.click();
    await new Promise(r => setTimeout(r, 1000));
})();
"""]

r = await c.arun(url="...", js_code=js_code)
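When several pages need the same click-then-wait pattern, the snippet can be generated rather than hand-written. This is a minimal sketch: click_and_wait_js is a hypothetical helper, not a Crawl4AI function, and the selectors are placeholders.

```python
import json


def click_and_wait_js(selector, delay_ms=1000):
    """Build a JS snippet that clicks `selector` if it exists, then waits."""
    sel = json.dumps(selector)  # quote the selector as a JS string literal
    return (
        "(async () => {\n"
        f"    const el = document.querySelector({sel});\n"
        "    if (el) el.click();\n"
        f"    await new Promise(r => setTimeout(r, {int(delay_ms)}));\n"
        "})();"
    )


# Hypothetical selectors; pass the list as js_code=js_code.
js_code = [click_and_wait_js("#load-more"), click_and_wait_js("button.next", 500)]
```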

Extraction Strategies

  • JSON CSS Extraction:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {...}
r = await c.arun(url="...", extraction_strategy=JsonCssExtractionStrategy(schema))
  • LLM Extraction:
from crawl4ai.extraction_strategy import LLMExtractionStrategy

r = await c.arun(url="...",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="YOUR_API_KEY",
        schema={...},
        extraction_type="schema"
    )
)
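The schemas above are elided, so the following is only an illustration of what one might look like and how the output could be consumed. It assumes the baseSelector/fields layout commonly shown for JsonCssExtractionStrategy and assumes the extracted data comes back as a JSON string (for example in r.extracted_content). Neither assumption is stated in this reference, so verify both against the installed version. The selectors and field names are made up.

```python
import json

# Hypothetical schema: selectors and field names are illustrative only.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


def parse_extracted(raw):
    """Decode an extraction result string into a list of records."""
    if not raw:  # None or empty string: nothing was extracted
        return []
    data = json.loads(raw)
    return data if isinstance(data, list) else [data]
```

Returning a list in every case lets calling code loop over records without checking whether the strategy produced one object or many.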

Common Issues

  • Playwright errors: Run playwright install chromium to install the browser.
  • Empty output: Increase the wait before the page is captured, or use js_code to trigger content loading.
  • SSL issues: Check certificates or use verify_ssl=False (not recommended for production).
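For intermittent empty output, one option is to retry with a growing delay. This is a sketch under stated assumptions: fetch_until_nonempty is a hypothetical helper, not a Crawl4AI function, and fetch is any async callable that returns the page text (or None) for a URL.

```python
import asyncio


async def fetch_until_nonempty(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url) until it returns non-blank text, backing off between tries.

    Returns the last text seen, which is "" if every attempt came back empty.
    """
    text = ""
    for i in range(attempts):
        text = await fetch(url) or ""
        if text.strip():
            return text
        if i < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** i)
    return text
```

Here fetch could be a small wrapper that awaits c.arun(url=...) and returns the result's markdown as a string.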