Files
crawl4ai/docs/llm.txt/7_extraction_strategies.sm.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00

81 lines
2.2 KiB
Markdown

# Extraction Strategies (Condensed LLM-Friendly Reference)
> Extract structured data (JSON) and text blocks from HTML with LLM-based or clustering methods.
Streamlined parameters, usage, and code snippets for quick LLM reference.
## LLMExtractionStrategy
- Uses LLM to extract structured data from HTML.
- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`.
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
provider="openai",
api_token="your_api_token",
instruction="Extract prices",
schema={"fields": [...]},
extraction_type="schema"
)
```
## CosineStrategy
- Clusters content via semantic embeddings.
- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.
```python
from crawl4ai.extraction_strategy import CosineStrategy
strategy = CosineStrategy(
semantic_filter="product reviews",
word_count_threshold=20,
sim_threshold=0.3,
top_k=5
)
```
## JsonCssExtractionStrategy
- Extracts data using CSS selectors.
- `schema` defines `baseSelector`, `fields`.
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
"baseSelector": ".product",
"fields": [
{"name":"title","selector":"h2","type":"text"},
{"name":"price","selector":".price","type":"text"}
]
}
strategy = JsonCssExtractionStrategy(schema=schema)
```
## JsonXPathExtractionStrategy
- Similar to CSS but uses XPath.
- More stable against changing class names.
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
schema = {
"baseSelector": "//div[@class='product']",
"fields": [
{"name":"title","selector":".//h2","type":"text"},
{"name":"price","selector":".//span[@class='price']","type":"text"}
]
}
strategy = JsonXPathExtractionStrategy(schema=schema)
```
## Example Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
print(result.extracted_content)
```
## Optional
- [extraction_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategies.py)