Extraction Strategies (Condensed LLM-Friendly Reference)
Extract structured data (JSON) and text blocks from HTML with LLM-based, clustering, or selector-based (CSS/XPath) methods.
Streamlined parameters, usage, and code snippets for quick LLM reference.
LLMExtractionStrategy
- Uses LLM to extract structured data from HTML.
- Supports instruction, schema, extraction_type, chunk_token_threshold, overlap_rate.
from crawl4ai.extraction_strategy import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
    provider="openai",
    api_token="your_api_token",
    instruction="Extract prices",
    schema={"fields": [...]},
    extraction_type="schema"
)
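To fill in the elided schema, one common pattern is to derive it from a Pydantic model. A sketch, with an illustrative Product model; provider and api_token are placeholders as in the snippet above:

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Illustrative model; define fields to match the data you want back
class Product(BaseModel):
    name: str
    price: str

strategy = LLMExtractionStrategy(
    provider="openai",
    api_token="your_api_token",
    instruction="Extract product names and prices",
    schema=Product.model_json_schema(),  # JSON schema drives "schema" extraction
    extraction_type="schema"
)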
CosineStrategy
- Clusters content via semantic embeddings.
- Key params: semantic_filter, word_count_threshold, sim_threshold, top_k.
from crawl4ai.extraction_strategy import CosineStrategy
strategy = CosineStrategy(
    semantic_filter="product reviews",
    word_count_threshold=20,
    sim_threshold=0.3,
    top_k=5
)
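Parameter intuition, as a sketch (the values below are illustrative, not tuned recommendations): a higher sim_threshold keeps only tightly related content, and top_k caps how many clusters come back.

# Looser matching: broader clusters, more of them
loose = CosineStrategy(semantic_filter="product reviews", sim_threshold=0.2, top_k=10)

# Stricter matching: only strongly related blocks, fewer clusters
tight = CosineStrategy(semantic_filter="product reviews", sim_threshold=0.5, top_k=3)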
JsonCssExtractionStrategy
- Extracts data using CSS selectors.
- The schema defines baseSelector and fields.
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
strategy = JsonCssExtractionStrategy(schema=schema)
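Fields are not limited to text. A sketch of reading an attribute value instead (the url field and selectors here are illustrative):

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        # "attribute" type reads an HTML attribute instead of text content
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}
strategy = JsonCssExtractionStrategy(schema=schema)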
JsonXPathExtractionStrategy
- Similar to CSS but uses XPath.
- Can be more robust against changing class names, since XPath can also target structure and position.
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
schema = {
    "baseSelector": "//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"}
    ]
}
strategy = JsonXPathExtractionStrategy(schema=schema)
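A sketch of class-free structural selectors (the paths are illustrative and depend entirely on the target page's layout):

schema = {
    "baseSelector": "//main//article",
    "fields": [
        # Position- and structure-based selectors survive class renames
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "first_paragraph", "selector": ".//p[1]", "type": "text"}
    ]
}
strategy = JsonXPathExtractionStrategy(schema=schema)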
Example Usage
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # strategy is any of the strategies constructed above
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.extracted_content)

asyncio.run(main())
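extracted_content typically comes back as a JSON string for the schema-based strategies, so it can be parsed for downstream use. A minimal sketch, added to the body of main() above after arun() returns:

import json

# Assumes the strategy produced JSON output
data = json.loads(result.extracted_content)
print(data)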