# Extracting JSON (No LLM)

One of Crawl4AI’s **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. By defining a **schema** with CSS or XPath selectors, you can extract data instantly, even from complex or nested HTML structures, without the cost, latency, or environmental impact of an LLM.

**Why avoid an LLM for basic extractions?**

1. **Faster & Cheaper**: No API calls or GPU overhead.
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. Schema-based extraction needs only ordinary HTML parsing, so its footprint is negligible by comparison.
3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.

Below, we’ll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We’ll also highlight advanced features like **nested fields** and **base element attributes**.

---

## 1. Intro to Schema-Based Extraction

A schema defines:

1. A **base selector** that identifies each “container” element on the page (e.g., a product row, a blog post card).
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
3. **Nested** or **list** types for repeated or hierarchical structures.

For example, if you have a list of products, each one might have a name, price, reviews, and “related products.” This approach is faster and more reliable than an LLM for consistent, structured pages.

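
To make the three parts concrete, here is a toy sketch of how a base selector and per-field selectors combine into one JSON object per container. It uses only Python's standard-library `xml.etree.ElementTree`, which accepts a small XPath subset and needs well-formed markup. The `apply_schema` helper, the sample markup, and `toy_schema` are illustrative assumptions, not Crawl4AI's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def apply_schema(markup: str, schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Toy illustration only: build one dict per element matched by baseSelector."""
    root = ET.fromstring(markup)
    rows = []
    for container in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            # Field selectors are evaluated relative to the container element
            node = container.find(field["selector"])
            if node is None:
                # Selector matched nothing: fall back to the field's default
                row[field["name"]] = field.get("default")
            elif field["type"] == "attribute":
                row[field["name"]] = node.get(field["attribute"], field.get("default"))
            else:
                # "text": concatenate all text inside the matched node
                row[field["name"]] = "".join(node.itertext()).strip()
        rows.append(row)
    return rows


# Hypothetical, well-formed sample markup
sample_markup = """
<body>
  <div class="product"><h3>Widget</h3><a href="/widget">details</a></div>
  <div class="product"><h3>Gadget</h3></div>
</body>
"""

toy_schema = {
    "name": "Products",
    "baseSelector": ".//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h3", "type": "text"},
        {"name": "link", "selector": ".//a", "type": "attribute", "attribute": "href"},
    ],
}

products = apply_schema(sample_markup, toy_schema)
print(products)
```

The second product has no link, so its `link` value falls back to the default (`None` here). The same idea of "one object per container, one key per field" carries over to the real strategies shown in the rest of this guide.
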

---

## 2. Simple Example: Crypto Prices

Let’s begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don’t** call any LLM:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_crypto_prices():
    # 1. Define a simple extraction schema
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "div.crypto-row",    # Repeated elements
        "fields": [
            {
                "name": "coin_name",
                "selector": "h2.coin-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.coin-price",
                "type": "text"
            }
        ]
    }

    # 2. Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # 3. Set up your crawler config (if needed)
    config = CrawlerRunConfig(
        # e.g., pass js_code or wait_for if the page is dynamic
        # wait_for="css:.crypto-row:nth-child(20)"
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # 4. Run the crawl and extraction
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # 5. Parse the extracted JSON
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin entries")
        print(json.dumps(data[0], indent=2) if data else "No data found")

asyncio.run(extract_crypto_prices())
```

**Highlights**:

- **`baseSelector`**: Tells us where each “item” (crypto row) is.
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).

No LLM is needed, and the performance is **near-instant** for hundreds or thousands of items.

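
Because `text` fields come back as strings, a price such as `$28,000` usually needs a small post-processing step before you can do arithmetic on it. The following is a minimal sketch, assuming the extracted JSON is a list of objects with the `coin_name` and `price` keys from the schema above. `parse_price` is a hypothetical helper and the sample payload is invented for illustration.

```python
import json
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Turn a display string such as "$28,000" into a Decimal, or None if not numeric."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


# Stand-in for result.extracted_content (illustrative values only)
extracted_content = json.dumps([
    {"coin_name": "Bitcoin", "price": "$28,000"},
    {"coin_name": "Ethereum", "price": "$1,800.50"},
    {"coin_name": "Mystery", "price": "N/A"},
])

rows = json.loads(extracted_content)
prices = {row["coin_name"]: parse_price(row["price"]) for row in rows}
print(prices)
```

Using `Decimal` rather than `float` avoids rounding surprises with currency values, and returning `None` for unparseable text keeps bad rows visible instead of silently turning them into zero.
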

---

### **XPath Example with `raw://` HTML**

Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We’ll pass a **dummy HTML** string directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

async def extract_crypto_prices_xpath():
    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

    # 3. Place the strategy in the CrawlerRunConfig
    config = CrawlerRunConfig(
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
    )

    # 4. Use raw:// scheme to pass dummy_html directly
    raw_url = f"raw://{dummy_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin rows")
        if data:
            print("First item:", data[0])

asyncio.run(extract_crypto_prices_xpath())
```

**Key Points**:

1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
2. **`baseSelector`** and each field’s `"selector"` use **XPath** instead of CSS.
3. **`raw://`** lets us pass `dummy_html` with no real network request, which is handy for local testing.
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.

This keeps the config self-contained and shows both **XPath** selectors and the **raw** scheme for direct HTML input, while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.

---

## 3. Advanced Schema & Nested Structures

Real sites often have **nested** or repeated data, such as categories containing products, which in turn have lists of reviews or features. For that, we can define **nested** or **list** (and even **nested_list**) fields.

### Sample E-Commerce HTML

We have a **sample e-commerce** HTML file on GitHub (example):
```
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
```
This sample page includes categories, products, features, reviews, and related items. Let’s see how to define a schema that fully captures that structure **without an LLM**.

```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",    # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
```

Key Takeaways:

- **Nested vs. List**:
    - **`type: "nested"`** means a **single** sub-object (like `details`).
    - **`type: "list"`** means multiple items that are **simple** dictionaries or single text fields.
    - **`type: "nested_list"`** means repeated **complex** objects (like `products` or `reviews`).
- **Base Fields**: We can extract **attributes** from the container element via `"baseFields"`. For instance, `"data_cat_id"` might be `data-cat-id="elect123"`.
- **Transforms**: We can also define a `transform` if we want to lower/upper case, strip whitespace, or even run a custom function.

### Running the Extraction

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)

    # Pass the strategy through CrawlerRunConfig rather than directly to arun()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
```

If all goes well, you get a **structured** JSON array with one object per category, each containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.

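
Downstream code often wants flat rows rather than a nested tree. The sketch below walks a result shaped like the schema above and emits one row per review. The sample data is invented for illustration, and `iter_review_rows` is a hypothetical helper; it only assumes that `products` and `reviews` are lists of objects and that `details` is a single object.

```python
from typing import Any, Dict, Iterator, List


def iter_review_rows(categories: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per review, carrying category and product context along."""
    for category in categories:
        for product in category.get("products", []):
            # "details" is a single nested object and may be missing
            brand = (product.get("details") or {}).get("brand")
            for review in product.get("reviews", []):
                yield {
                    "category": category.get("category_name"),
                    "product": product.get("name"),
                    "brand": brand,
                    "reviewer": review.get("reviewer"),
                    "rating": review.get("rating"),
                }


# Invented sample shaped like the e-commerce schema output
sample_catalog = [
    {
        "category_name": "Electronics",
        "products": [
            {
                "name": "Phone X",
                "price": "$699",
                "details": {"brand": "Acme", "model": "X1"},
                "reviews": [
                    {"reviewer": "Ana", "rating": "5", "comment": "Great"},
                    {"reviewer": "Bo", "rating": "3", "comment": "Fine"},
                ],
            },
            {"name": "Cable", "price": "$9", "reviews": []},
        ],
    }
]

review_rows = list(iter_review_rows(sample_catalog))
for row in review_rows:
    print(row)
```

Products without reviews simply contribute no rows, which is usually what a review-level table should look like.
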

---

## 4. Why “No LLM” Is Often Better

1. **Zero Hallucination**: Schema-based extraction doesn’t guess text. It either finds it or not.
2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.

**When might you consider an LLM?** Mainly when the site is extremely unstructured or you want AI summarization. But always try a schema approach first for repeated or consistent data patterns.

---

## 5. Base Element Attributes & Additional Fields

It’s easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:

```json
{
  "name": "href",
  "type": "attribute",
  "attribute": "href",
  "default": null
}
```

You can define them in **`baseFields`** (extracted from the main container element) or in each field’s sub-lists. This is especially helpful if you need an item’s link or ID stored in the parent `<div>`.

---

## 6. Putting It All Together: Larger Example

Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:

```python
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}
```

Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.

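
`href` attributes are often relative, so `post_url` may need resolving against the page URL before you can crawl the linked posts. Here is a minimal sketch using the standard library's `urllib.parse.urljoin`. The base URL and the sample rows are placeholders, and `resolve_post_urls` is a hypothetical helper.

```python
from typing import Any, Dict, List
from urllib.parse import urljoin


def resolve_post_urls(posts: List[Dict[str, Any]], page_url: str) -> List[Dict[str, Any]]:
    """Return copies of the posts with post_url made absolute (missing URLs are left alone)."""
    resolved = []
    for post in posts:
        updated = dict(post)
        if updated.get("post_url"):
            updated["post_url"] = urljoin(page_url, updated["post_url"])
        resolved.append(updated)
    return resolved


# Placeholder data shaped like the blog schema output
page_url = "https://example.com/blog/"
extracted_posts = [
    {"post_url": "/blog/first-post", "title": "First Post"},
    {"post_url": "second-post", "title": "Second Post"},
    {"post_url": "https://other.example.org/guest", "title": "Guest Post"},
    {"post_url": None, "title": "No Link"},
]

absolute_posts = resolve_post_urls(extracted_posts, page_url)
for post in absolute_posts:
    print(post["post_url"], "-", post["title"])
```

`urljoin` leaves already-absolute URLs untouched, so the same helper is safe to run on mixed input.
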

---

## 7. Tips & Best Practices

1. **Inspect the DOM** in Chrome DevTools or Firefox’s Inspector to find stable selectors.
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
3. **Test** your schema on partial HTML or a test page before a big crawl.
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it’ll often show warnings.
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the “parent” item.
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.

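
Some of the checks behind tips 2, 3, and 5 can be automated with a quick structural pass over the schema before a large crawl. The helper below is a hypothetical sketch, not part of Crawl4AI. It only knows about the keys used in this guide (`baseSelector`, `baseFields`, `fields`, `name`, `type`, `attribute`) and assumes that `nested`, `list`, and `nested_list` fields should carry their own `fields` list, as they do in the examples above.

```python
from typing import Any, Dict, List

CONTAINER_TYPES = {"nested", "list", "nested_list"}


def find_schema_problems(schema: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    problems: List[str] = []
    if not schema.get("baseSelector"):
        problems.append("schema: missing baseSelector")
    if not schema.get("fields"):
        problems.append("schema: no fields defined")

    def check_fields(fields: List[Dict[str, Any]], path: str) -> None:
        for index, field in enumerate(fields):
            name = field.get("name")
            label = f"{path}.{name or index}"
            if not name:
                problems.append(f"{label}: missing name")
            field_type = field.get("type")
            if not field_type:
                problems.append(f"{label}: missing type")
            if field_type == "attribute" and not field.get("attribute"):
                problems.append(f"{label}: attribute field without 'attribute' key")
            if field_type in CONTAINER_TYPES:
                if field.get("fields"):
                    # Recurse so deeply nested definitions are checked too
                    check_fields(field["fields"], label)
                else:
                    problems.append(f"{label}: {field_type} field without sub-fields")

    check_fields(schema.get("baseFields", []), "baseFields")
    check_fields(schema.get("fields", []), "fields")
    return problems


good_schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [{"name": "post_url", "type": "attribute", "attribute": "href"}],
    "fields": [{"name": "title", "selector": "h2.post-title", "type": "text"}],
}
broken_schema = {
    "name": "Broken",
    "fields": [
        {"name": "link", "type": "attribute"},
        {"name": "reviews", "selector": "div.review", "type": "nested_list"},
    ],
}

good_problems = find_schema_problems(good_schema)
broken_problems = find_schema_problems(broken_schema)
print(good_problems)
print(broken_problems)
```

A check like this catches typos early, but it cannot tell whether a selector matches anything on the real page, so a test run against sample HTML is still worthwhile.
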

---

## 8. Schema Generation Utility

While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using an LLM. This is particularly useful when:

1. You're dealing with a new website structure and want a quick starting point
2. You need to extract complex nested data structures
3. You want to avoid the learning curve of CSS/XPath selector syntax

### Using the Schema Generator

The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4o and an open-source model served through Ollama for schema generation:

```python
import os

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, JsonXPathExtractionStrategy

# Sample HTML with product information
html = """
<div class="product-card">
    <h2 class="title">Gaming Laptop</h2>
    <div class="price">$999.99</div>
    <div class="specs">
        <ul>
            <li>16GB RAM</li>
            <li>1TB SSD</li>
        </ul>
    </div>
</div>
"""

# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",  # This is the default
    provider="openai/gpt-4o",  # Default provider
    api_token=os.getenv("OPENAI_API_KEY")  # Required for OpenAI; read from the environment, never hardcoded
)

# Option 2: Using Ollama (open source, no token needed)
xpath_schema = JsonXPathExtractionStrategy.generate_schema(
    html,
    schema_type="xpath",
    provider="ollama/llama3.3",  # Open source alternative
    api_token=None  # Not needed for Ollama
)

# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema)
```

### LLM Provider Options

1. **OpenAI GPT-4o (`openai/gpt-4o`)**
    - Default provider
    - Requires an API token
    - Generally provides more accurate schemas
    - Set via environment variable: `OPENAI_API_KEY`

2. **Ollama (`ollama/llama3.3`)**
    - Open source alternative
    - No API token required
    - Self-hosted option
    - Good for development and testing

### Benefits of Schema Generation

1. **One-Time Cost**: While schema generation uses an LLM, it's a one-time cost. The generated schema can be reused for unlimited extractions without further LLM calls.
2. **Smart Pattern Recognition**: The LLM analyzes the HTML structure and identifies common patterns, often producing more robust selectors than manual attempts.
3. **Automatic Nesting**: Complex nested structures are automatically detected and properly represented in the schema.
4. **Learning Tool**: The generated schemas serve as excellent examples for learning how to write your own schemas.

### Best Practices

1. **Review Generated Schemas**: While the generator is smart, always review and test the generated schema before using it in production.
2. **Provide Representative HTML**: The better your sample HTML represents the overall structure, the more accurate the generated schema will be.
3. **Consider Both CSS and XPath**: Try both schema types and choose the one that works best for your specific case.
4. **Cache Generated Schemas**: Since generation uses an LLM, save successful schemas for reuse.
5. **API Token Security**: Never hardcode API tokens. Use environment variables or secure configuration management.
6. **Choose Provider Wisely**:
    - Use OpenAI for production-quality schemas
    - Use Ollama for development, testing, or when you need a self-hosted solution

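
Practice 4 is easy to put into code: generate a schema once, store it as JSON, and load it from disk on later runs. In this sketch `load_or_generate_schema` is a hypothetical helper, and the `generate` callable stands in for a call such as `JsonCssExtractionStrategy.generate_schema(...)` (with the token read from `OPENAI_API_KEY`, per practice 5) so the caching logic can be shown without an LLM.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def load_or_generate_schema(
    cache_file: Path, generate: Callable[[], Dict[str, Any]]
) -> Dict[str, Any]:
    """Load a cached schema if present; otherwise generate it once and save it."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    schema = generate()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


generator_calls: List[int] = []


def fake_generate() -> Dict[str, Any]:
    # Stand-in for the real (LLM-backed) generator; counts how often it runs
    generator_calls.append(1)
    return {
        "name": "Product Card",
        "baseSelector": "div.product-card",
        "fields": [{"name": "title", "selector": "h2.title", "type": "text"}],
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "schemas" / "product_card.json"
    first_schema = load_or_generate_schema(cache_path, fake_generate)
    second_schema = load_or_generate_schema(cache_path, fake_generate)

print(f"Generator ran {len(generator_calls)} time(s)")
```

The second call is served from the cache file, so the generator (and, in real use, the LLM) runs only once. Keeping cached schemas in version control also makes it easy to review changes when a site's markup shifts.
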

---

## 9. Conclusion

With **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy**), you can build powerful, **LLM-free** pipelines that:

- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or advanced transformations.
- Scale to thousands of pages quickly and reliably.

**Next Steps**:

- Combine your extracted JSON with advanced filtering or summarization in a second pass if needed.
- For dynamic pages, combine strategies with `js_code` or infinite scroll hooking to ensure all content is loaded.

**Remember**: For repeated, structured data, you don’t need to pay for or wait on an LLM. A well-crafted schema plus CSS or XPath gets you the data faster, cleaner, and cheaper. That is **the real power** of Crawl4AI.

**Last Updated**: 2025-01-01

---

That’s it for **Extracting JSON (No LLM)**! You’ve seen how schema-based approaches (either CSS or XPath) can handle everything from simple lists to deeply nested product catalogs instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!