
# Extraction Strategies (Condensed LLM-Friendly Reference)

> Extract structured data (JSON) and text blocks from HTML with LLM-based or clustering methods.

Streamlined parameters, usage, and code snippets for quick LLM reference.

## Input Formats

- **markdown** (default): Raw markdown from HTML
- **html**: Raw HTML content
- **fit_markdown**: Cleaned markdown (requires a `markdown_generator` and a `content_filter`)

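```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

strategy = LLMExtractionStrategy(
    input_format="html",  # "markdown" (default), "html", or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract data"
)

config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(),  # Needed only for fit_markdown
    content_filter=PruningContentFilter()           # Needed only for fit_markdown
)
```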

## LLMExtractionStrategy

- Uses an LLM to extract structured data from page content.
- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`, `input_format`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",  # "<provider>/<model>", as in the Input Formats example
    api_token="your_api_token",
    instruction="Extract prices",
    schema={"fields": [...]},
    extraction_type="schema",
    input_format="html"
)
```
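
To make `chunk_token_threshold` and `overlap_rate` more concrete, here is a minimal sketch of overlapping chunking. It is an illustration only, not the library's implementation: the helper `chunk_words` is hypothetical, and it assumes whitespace-separated words stand in for tokens.

```python
def chunk_words(words, chunk_token_threshold, overlap_rate):
    """Split words into windows of at most chunk_token_threshold items.

    Consecutive windows share roughly overlap_rate of a window's length.
    Hypothetical helper for illustration; words are treated as tokens.
    """
    overlap = int(chunk_token_threshold * overlap_rate)
    step = max(1, chunk_token_threshold - overlap)  # always move forward
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_token_threshold])
        if start + chunk_token_threshold >= len(words):
            break  # the last window already reaches the end
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_token_threshold=4, overlap_rate=0.5)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap rate of 0.5, each chunk repeats the last two words of the previous one, so content that straddles a boundary appears whole in at least one chunk.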

## CosineStrategy

- Clusters content via semantic embeddings.
- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.

```python
from crawl4ai.extraction_strategy import CosineStrategy
strategy = CosineStrategy(
    semantic_filter="product reviews",
    word_count_threshold=20,
    sim_threshold=0.3,
    top_k=5
)
```
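
As a rough intuition for what these parameters control, the sketch below filters text blocks by length and by cosine similarity to a query, then keeps the best few. It is not how `CosineStrategy` works internally (the strategy clusters content using semantic embeddings); this version swaps in plain word-count vectors and skips clustering, and `filter_blocks` is a hypothetical helper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_blocks(blocks, semantic_filter, word_count_threshold, sim_threshold, top_k):
    """Keep up to top_k blocks that are long enough and similar enough to the filter."""
    query = Counter(semantic_filter.lower().split())
    scored = []
    for block in blocks:
        words = block.lower().split()
        if len(words) < word_count_threshold:
            continue  # too short to be useful
        score = cosine(Counter(words), query)
        if score >= sim_threshold:
            scored.append((score, block))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:top_k]]


blocks = [
    "great product reviews from customers",
    "shipping policy and returns info here",
    "reviews",
]
kept = filter_blocks(blocks, "product reviews", word_count_threshold=3,
                     sim_threshold=0.3, top_k=5)
print(kept)
```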

## JsonCssExtractionStrategy

- Extracts data using CSS selectors.
- `schema` defines a `baseSelector` (the repeating container element) and `fields` (the values read from each match).

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
  "baseSelector": ".product",
  "fields": [
    {"name":"title","selector":"h2","type":"text"},
    {"name":"price","selector":".price","type":"text"}
  ]
}
strategy = JsonCssExtractionStrategy(schema=schema)
```
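
A quick sanity check on a schema dict can catch typos before a crawl. The sketch below is a hypothetical helper, not part of crawl4ai, and it assumes every field carries the three keys used in the example above (`name`, `selector`, `type`).

```python
REQUIRED_FIELD_KEYS = {"name", "selector", "type"}  # assumption: keys from the example


def check_schema(schema):
    """Return a list of problems found in a selector schema (empty if none)."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("missing baseSelector")
    for i, field in enumerate(schema.get("fields", [])):
        missing = REQUIRED_FIELD_KEYS - field.keys()
        if missing:
            problems.append(f"field {i} missing {sorted(missing)}")
    return problems


good = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}
bad = {"fields": [{"name": "price", "selector": ".price"}]}

print(check_schema(good))
print(check_schema(bad))
```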

## JsonXPathExtractionStrategy

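- Same schema shape as the CSS strategy, but selectors are XPath expressions.
- Can be more stable when class names change, since XPath can also match on document structure or other attributes.

```python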
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
schema = {
  "baseSelector": "//div[@class='product']",
  "fields": [
    {"name":"title","selector":".//h2","type":"text"},
    {"name":"price","selector":".//span[@class='price']","type":"text"}
  ]
}
strategy = JsonXPathExtractionStrategy(schema=schema)
```
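
To show how a `baseSelector` and relative field selectors combine, here is a minimal sketch that applies a schema of this shape with the standard library's `xml.etree.ElementTree`. It is not the strategy's implementation: `apply_schema` is a hypothetical helper, the sample markup is invented, ElementTree supports only a subset of XPath, and the base selector is written as `.//div[...]` because ElementTree expects relative paths.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup for illustration
markup = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.50</span></div>
</body></html>
""".strip()

schema = {
    "baseSelector": ".//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
    ],
}


def apply_schema(markup, schema):
    """Return one dict per element matched by baseSelector."""
    root = ET.fromstring(markup)
    rows = []
    for base in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            # Field selectors are evaluated relative to the matched base element
            node = base.find(field["selector"])
            has_text = node is not None and node.text
            row[field["name"]] = node.text.strip() if has_text else None
        rows.append(row)
    return rows


rows = apply_schema(markup, schema)
print(rows)
```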

## Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # `strategy` is any extraction strategy defined in the sections above
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.extracted_content)

asyncio.run(main())
```
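
Once a crawl returns, the extracted data usually needs decoding before use. The sketch below assumes `result.extracted_content` holds a JSON string; the sample value and the `load_items` helper are hypothetical.

```python
import json

# Hypothetical sample standing in for result.extracted_content
extracted_content = (
    '[{"title": "Widget", "price": "$9.99"},'
    ' {"title": "Gadget", "price": "$19.50"}]'
)


def load_items(raw):
    """Decode extracted content into a list; return [] when nothing was extracted."""
    if not raw:
        return []
    data = json.loads(raw)
    return data if isinstance(data, list) else [data]


items = load_items(extracted_content)
for item in items:
    print(item["title"], item["price"])
```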

## Optional

- [extraction_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)