feat(extraction): add LLM-powered schema generation utility

Adds a new static method, generate_schema(), to the JsonElementExtractionStrategy
classes. It automatically generates extraction schemas using an LLM (OpenAI or
Ollama), providing a convenient way to bootstrap extraction schemas while keeping
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
UncleCode
2025-01-20 17:28:00 +08:00
parent 4b1309cbf2
commit 2cec527a22
6 changed files with 1052 additions and 3 deletions


@@ -124,6 +124,36 @@ async with AsyncWebCrawler() as crawler:
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
> **New!** Crawl4AI now provides a utility that automatically generates extraction schemas using an LLM. Generation is a one-time cost that yields a reusable schema for fast, LLM-free extractions:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="openai/gpt-4o",  # Default provider
    api_token="your-openai-token"  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="ollama/llama3.3",  # Open source alternative
    api_token=None  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```
For a complete guide on schema generation and advanced usage, see [No-LLM Extraction Strategies](../extraction/no-llm-strategies.md).
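Because generation is a one-time cost, it can help to persist the generated schema and reload it on later runs. The following is a minimal sketch, not part of Crawl4AI: the helper `load_or_generate_schema`, its `cache_path` argument, and the `generate` callable are hypothetical names, and it assumes the schema is a plain JSON-serializable dictionary.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_generate_schema(cache_path: Path, generate: Callable[[], dict]) -> dict:
    """Return the cached schema if present; otherwise generate and cache it.

    `generate` is a hypothetical zero-argument callable that wraps the
    LLM-backed schema generation; it is only invoked on a cache miss.
    """
    if cache_path.exists():
        # Cache hit: no LLM call is made.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    schema = generate()  # the only step that would call the LLM

    # Assumption: the schema is a JSON-serializable dict.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

With Crawl4AI, `generate` could be something like `lambda: JsonCssExtractionStrategy.generate_schema(html, llm_provider="openai/gpt-4o", api_token="your-openai-token")`, and the returned dictionary could then be passed to `JsonCssExtractionStrategy(schema)` as in the example above.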
Here's a basic extraction example:
```python
import asyncio
import json