feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation
This commit is contained in:
@@ -124,6 +124,36 @@ async with AsyncWebCrawler() as crawler:
|
||||
|
||||
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
|
||||
|
||||
> **New!** Crawl4AI now provides a powerful utility to automatically generate extraction schemas using LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
# Generate a schema (one-time cost)
|
||||
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"
|
||||
|
||||
# Using OpenAI (requires API token)
|
||||
schema = JsonCssExtractionStrategy.generate_schema(
|
||||
html,
|
||||
llm_provider="openai/gpt-4o", # Default provider
|
||||
api_token="your-openai-token" # Required for OpenAI
|
||||
)
|
||||
|
||||
# Or using Ollama (open source, no token needed)
|
||||
schema = JsonCssExtractionStrategy.generate_schema(
|
||||
html,
|
||||
llm_provider="ollama/llama3.3", # Open source alternative
|
||||
api_token=None # Not needed for Ollama
|
||||
)
|
||||
|
||||
# Use the schema for fast, repeated extractions
|
||||
strategy = JsonCssExtractionStrategy(schema)
|
||||
```
|
||||
|
||||
For a complete guide on schema generation and advanced usage, see [No-LLM Extraction Strategies](../extraction/no-llm-strategies.md).
|
||||
|
||||
Here's a basic extraction example:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import json
|
||||
|
||||
Reference in New Issue
Block a user