Extraction Strategies (Condensed LLM-Friendly Reference)
Extract structured data (JSON) and text blocks from HTML with LLM-based, clustering, or selector-based (CSS/XPath) methods.
Streamlined parameters, usage, and code snippets for quick LLM reference.
Input Formats
- markdown (default): Raw markdown from HTML
- html: Raw HTML content
- fit_markdown: Cleaned markdown (needs markdown_generator + content_filter)
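from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
strategy = LLMExtractionStrategy(
    input_format="html",  # Choose format: "markdown", "html", or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract data"
)
config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(),  # Only needed for fit_markdown
    content_filter=PruningContentFilter()  # Only needed for fit_markdown
)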
LLMExtractionStrategy
- Uses an LLM to extract structured data from HTML.
- Supports instruction, schema, extraction_type, chunk_token_threshold, overlap_rate, input_format.
from crawl4ai.extraction_strategy import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",  # provider/model string, same form as in Input Formats
    api_token="your_api_token",
    instruction="Extract prices",
    schema={"fields": [...]},
    extraction_type="schema",
    input_format="html"
)
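The chunk_token_threshold and overlap_rate parameters suggest that long input is split into overlapping chunks before being sent to the LLM. The following is a minimal sketch of that idea only, not the library's implementation: chunk_words is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_words(text: str, chunk_size: int, overlap_rate: float) -> list[str]:
    """Split text into word chunks; each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")
    words = text.split()
    overlap = int(chunk_size * overlap_rate)  # words shared between neighbours
    step = chunk_size - overlap               # at least 1 because overlap_rate < 1
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Hypothetical input: 10 words, 4 words per chunk, 50% overlap.
sample_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap_rate=0.5)
```

A higher overlap rate lowers the chance that a record is cut in half at a chunk boundary, at the cost of sending more text to the model.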
CosineStrategy
- Clusters content via semantic embeddings.
- Key params: semantic_filter, word_count_threshold, sim_threshold, top_k.
from crawl4ai.extraction_strategy import CosineStrategy
strategy = CosineStrategy(
    semantic_filter="product reviews",
    word_count_threshold=20,
    sim_threshold=0.3,
    top_k=5
)
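To make the role of a similarity threshold and a top-k cutoff concrete, here is a rough standard-library sketch of cosine-similarity filtering. It is an illustration under loud assumptions: word counts stand in for semantic embeddings, the helper names (bow, cosine, filter_blocks) are hypothetical, and the way CosineStrategy actually applies each parameter internally may differ.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_blocks(blocks, query, word_count_threshold=20, sim_threshold=0.3, top_k=5):
    """Keep blocks that are long enough and similar enough, best matches first."""
    q = bow(query)
    scored = []
    for block in blocks:
        if len(block.split()) < word_count_threshold:
            continue  # too short to be meaningful content
        score = cosine(bow(block), q)
        if score >= sim_threshold:
            scored.append((score, block))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:top_k]]


# Hypothetical text blocks, with a low word threshold to keep the demo short.
blocks = [
    "great product reviews from customers who bought it",
    "shipping policy and returns information page",
    "reviews",
]
kept = filter_blocks(blocks, "product reviews", word_count_threshold=3)
```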
JsonCssExtractionStrategy
- Extracts data using CSS selectors.
- schema defines baseSelector and fields.
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
strategy = JsonCssExtractionStrategy(schema=schema)
JsonXPathExtractionStrategy
- Similar to CSS but uses XPath.
- Can be more stable than CSS when class names change, because XPath can also select by document structure or other attributes.
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
schema = {
    "baseSelector": "//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"}
    ]
}
strategy = JsonXPathExtractionStrategy(schema=schema)
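The sketch below shows how a baseSelector plus fields schema of this shape maps onto a list of records: the base selector picks the repeating elements, and each field selector is evaluated relative to one of them. It uses only Python's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed markup, so it is an approximation and not the library's code. extract_with_schema and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_with_schema(markup: str, schema: dict) -> list[dict]:
    """Apply a baseSelector/fields schema to well-formed markup."""
    root = ET.fromstring(markup)
    base = schema["baseSelector"]
    # ElementTree rejects absolute paths on an element, so "//x" becomes ".//x".
    if base.startswith("//"):
        base = "." + base
    rows = []
    for node in root.findall(base):
        row = {}
        for field in schema["fields"]:
            match = node.find(field["selector"])  # relative to the base element
            if match is not None and field["type"] == "text":
                row[field["name"]] = "".join(match.itertext()).strip()
        rows.append(row)
    return rows


# Hypothetical sample page and the XPath schema from this section.
sample_markup = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.50</span></div>
  <div class="ad"><h2>Buy now</h2></div>
</body></html>
"""
xpath_schema = {
    "baseSelector": "//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
    ],
}
records = extract_with_schema(sample_markup, xpath_schema)
```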
Example Usage
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # `strategy` is any of the extraction strategies defined above
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.extracted_content)

asyncio.run(main())
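Downstream code usually wants Python objects, not printed text. A small sketch, assuming result.extracted_content is a JSON-encoded string (or None when nothing was extracted); parse_extracted is a hypothetical helper, not part of crawl4ai.

```python
import json


def parse_extracted(extracted_content):
    """Decode extracted content into a list of records; empty input gives []."""
    if not extracted_content:
        return []
    data = json.loads(extracted_content)
    # Normalise a single object to a one-element list for uniform handling.
    return data if isinstance(data, list) else [data]


# Hypothetical payload shaped like the product schemas in this reference.
items = parse_extracted('[{"title": "Widget", "price": "$9.99"}]')
```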