# Extracting JSON (No LLM)

One of Crawl4AI's **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. Crawl4AI offers several strategies for LLM-free extraction:

1. **Schema-based extraction** with CSS or XPath selectors via `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
2. **Regular expression extraction** with `RegexExtractionStrategy` for fast pattern matching

These approaches let you extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.

**Why avoid LLM for basic extractions?**

1. **Faster & Cheaper**: No API calls or GPU overhead.
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. Pattern-based extraction is practically carbon-free.
3. **Precise & Repeatable**: CSS/XPath selectors and regex patterns do exactly what you specify. LLM outputs can vary or hallucinate.
4. **Scales Readily**: For thousands of pages, pattern-based extraction runs quickly and in parallel.

Below, we'll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We'll also highlight advanced features like **nested fields** and **base element attributes**.

---

## 1. Intro to Schema-Based Extraction

A schema defines:

1. A **base selector** that identifies each "container" element on the page (e.g., a product row, a blog post card).
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
3. **Nested** or **list** types for repeated or hierarchical structures.

For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.
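The three parts above map directly onto keys of a plain Python dictionary. Here is a minimal skeleton (the selectors are placeholders, not from a real page):

```python
# Skeleton of an extraction schema; selectors are illustrative placeholders.
schema = {
    "name": "Example Items",          # label for this schema
    "baseSelector": "div.item",       # one extracted object per matching element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

print(sorted(schema.keys()))
```

Every example in the rest of this page is a variation on this shape, adding nesting, lists, or base-element attributes.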

---

## 2. Simple Example: Crypto Prices

Let's begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don't** call any LLM:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_crypto_prices():
    # 1. Define a simple extraction schema
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "div.crypto-row",    # Repeated elements
        "fields": [
            {
                "name": "coin_name",
                "selector": "h2.coin-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.coin-price",
                "type": "text"
            }
        ]
    }

    # 2. Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # 3. Set up your crawler config (if needed)
    config = CrawlerRunConfig(
        # e.g., pass js_code or wait_for if the page is dynamic
        # wait_for="css:.crypto-row:nth-child(20)"
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # 4. Run the crawl and extraction
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # 5. Parse the extracted JSON
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin entries")
        print(json.dumps(data[0], indent=2) if data else "No data found")

asyncio.run(extract_crypto_prices())
```

**Highlights**:

- **`baseSelector`**: Tells us where each "item" (crypto row) is.
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).

No LLM is needed, and the performance is **near-instant** for hundreds or thousands of items.

---

### **XPath Example with `raw://` HTML**

Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We'll pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonXPathExtractionStrategy

async def extract_crypto_prices_xpath():
    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

    # 3. Place the strategy in the CrawlerRunConfig
    config = CrawlerRunConfig(
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
    )

    # 4. Use raw:// scheme to pass dummy_html directly
    raw_url = f"raw://{dummy_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin rows")
        if data:
            print("First item:", data[0])

asyncio.run(extract_crypto_prices_xpath())
```

**Key Points**:

1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
2. **`baseSelector`** and each field's `"selector"` use **XPath** instead of CSS.
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.

That's how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.

---

## 3. Advanced Schema & Nested Structures

Real sites often have **nested** or repeated data—like categories containing products, which themselves have a list of reviews or features. For that, we can define **nested** or **list** (and even **nested_list**) fields.

### Sample E-Commerce HTML

We have a **sample e-commerce** HTML file on GitHub (example):

```
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
```

This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure **without LLM**.
```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",    # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",    # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
```

Key Takeaways:

- **Nested vs. List**:
  - **`type: "nested"`** means a **single** sub-object (like `details`).
  - **`type: "list"`** means multiple items that are **simple** dictionaries or single text fields.
  - **`type: "nested_list"`** means repeated **complex** objects (like `products` or `reviews`).
- **Base Fields**: We can extract **attributes** from the container element via `"baseFields"`. For instance, `"data_cat_id"` might be `data-cat-id="elect123"`.
- **Transforms**: We can also define a `transform` if we want to lower/upper case, strip whitespace, or even run a custom function.
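A transform is just an extra key on a field definition. The sketch below assumes transform names like `"lowercase"` and `"strip"`; the exact set of supported values may differ between versions, so check your installed release:

```python
# Illustrative field definitions with a "transform" key; the transform
# names ("lowercase", "strip") are assumptions, not a verified API list.
fields = [
    {"name": "brand", "selector": "span.brand", "type": "text",
     "transform": "lowercase"},    # normalize casing before storing
    {"name": "model", "selector": "span.model", "type": "text",
     "transform": "strip"},        # trim surrounding whitespace
]

print([f["transform"] for f in fields])
```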

### Running the Extraction

Note that the strategy goes inside `CrawlerRunConfig`, consistent with the earlier examples (not passed to `arun()` directly):

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
```

If all goes well, you get a **structured** JSON array with each "category," containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.
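To make the nesting concrete, here is the shape of the output with made-up values, and how you would walk it (the values are illustrative, not actual crawl results):

```python
# Illustrative output shape for the e-commerce schema; all values are fake.
data = [
    {
        "data_cat_id": "elect123",
        "category_name": "Electronics",
        "products": [
            {
                "name": "Laptop X",
                "price": "$999",
                "details": {"brand": "Acme", "model": "X-1"},      # "nested": one dict
                "features": [{"feature": "16GB RAM"}],             # "list": simple dicts
                "reviews": [                                       # "nested_list": complex dicts
                    {"reviewer": "Ana", "rating": "5", "comment": "Great"}
                ],
                "related_products": [{"name": "Mouse", "price": "$25"}],
            }
        ],
    }
]

for category in data:
    for product in category["products"]:
        print(category["category_name"], "-", product["name"], product["details"]["brand"])
```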

---

## 4. RegexExtractionStrategy - Fast Pattern-Based Extraction

Crawl4AI now offers a powerful new zero-LLM extraction strategy: `RegexExtractionStrategy`. This strategy provides lightning-fast extraction of common data types like emails, phone numbers, URLs, dates, and more using pre-compiled regular expressions.

### Key Features

- **Zero LLM Dependency**: Extracts data without any AI model calls
- **Blazing Fast**: Uses pre-compiled regex patterns for maximum performance
- **Built-in Patterns**: Includes ready-to-use patterns for common data types
- **Custom Patterns**: Add your own regex patterns for domain-specific extraction
- **LLM-Assisted Pattern Generation**: Optionally use an LLM once to generate optimized patterns, then reuse them without further LLM calls

### Simple Example: Extracting Common Entities

The easiest way to start is by using the built-in pattern catalog:

```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_with_regex():
    # Create a strategy using built-in patterns for URLs and currencies
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:5]:  # Show first 5 matches
                print(f"{item['label']}: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_regex())
```

### Available Built-in Patterns

`RegexExtractionStrategy` provides these common patterns as IntFlag attributes for easy combining:

```python
# Use individual patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)

# Combine multiple patterns
strategy = RegexExtractionStrategy(
    pattern=(
        RegexExtractionStrategy.Email |
        RegexExtractionStrategy.PhoneUS |
        RegexExtractionStrategy.Url
    )
)

# Use all available patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
```

Available patterns include:

- `Email` - Email addresses
- `PhoneIntl` - International phone numbers
- `PhoneUS` - US-format phone numbers
- `Url` - HTTP/HTTPS URLs
- `IPv4` - IPv4 addresses
- `IPv6` - IPv6 addresses
- `Uuid` - UUIDs
- `Currency` - Currency values (USD, EUR, etc.)
- `Percentage` - Percentage values
- `Number` - Numeric values
- `DateIso` - ISO format dates
- `DateUS` - US format dates
- `Time24h` - 24-hour format times
- `PostalUS` - US postal codes
- `PostalUK` - UK postal codes
- `HexColor` - HTML hex color codes
- `TwitterHandle` - Twitter handles
- `Hashtag` - Hashtags
- `MacAddr` - MAC addresses
- `Iban` - International bank account numbers
- `CreditCard` - Credit card numbers

### Custom Pattern Example

For more targeted extraction, you can provide custom patterns:

```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_prices():
    # Define a custom pattern for US Dollar prices
    price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}

    # Create strategy with custom pattern
    strategy = RegexExtractionStrategy(custom=price_pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"Found price: {item['value']}")

asyncio.run(extract_prices())
```
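Since a custom pattern is plain Python regex, you can sanity-check it locally with the standard `re` module before spending time on a crawl:

```python
import re

# Same pattern as in the custom-pattern example above
usd_price = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

sample = "Now $1,299.99 (was $1,499.99), save $200"
matches = usd_price.findall(sample)
print(matches)  # → ['$1,299.99', '$1,499.99', '$200']
```

Because the pattern contains only non-capturing groups, `findall()` returns the full matched strings rather than group tuples.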

### LLM-Assisted Pattern Generation

For complex or site-specific patterns, you can use an LLM once to generate an optimized pattern, then save and reuse it without further LLM calls:

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy,
    LLMConfig
)

async def extract_with_generated_pattern():
    cache_dir = Path("./pattern_cache")
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "price_pattern.json"

    # 1. Generate or load pattern
    if pattern_file.exists():
        pattern = json.load(pattern_file.open())
        print(f"Using cached pattern: {pattern}")
    else:
        print("Generating pattern via LLM...")

        # Configure LLM
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY",
        )

        # Get sample HTML for context
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/products")
            html = result.fit_html

        # Generate pattern (one-time LLM usage)
        pattern = RegexExtractionStrategy.generate_pattern(
            label="price",
            html=html,
            query="Product prices in USD format",
            llm_config=llm_config,
        )

        # Cache pattern for future use
        json.dump(pattern, pattern_file.open("w"), indent=2)

    # 2. Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:10]:
                print(f"Extracted: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_generated_pattern())
```

This pattern allows you to:

1. Use an LLM once to generate a highly optimized regex for your specific site
2. Save the pattern to disk for reuse
3. Extract data using only regex (no further LLM calls) in production

### Extraction Results Format

The `RegexExtractionStrategy` returns results in a consistent format:

```json
[
  {
    "url": "https://example.com",
    "label": "email",
    "value": "contact@example.com",
    "span": [145, 163]
  },
  {
    "url": "https://example.com",
    "label": "url",
    "value": "https://support.example.com",
    "span": [210, 235]
  }
]
```

Each match includes:

- `url`: The source URL
- `label`: The pattern name that matched (e.g., "email", "phone_us")
- `value`: The extracted text
- `span`: The start and end positions in the source content
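Because `span` holds character offsets, you can slice surrounding context out of the source text. A small sketch, assuming `content` is the text the patterns ran over (which field that corresponds to depends on your crawler configuration):

```python
# Sketch: use a match's "span" offsets to recover surrounding context.
# `content` and the match dict are illustrative stand-ins for real results.
content = "Reach us at contact@example.com for help."
match = {"label": "email", "value": "contact@example.com", "span": [12, 31]}

start, end = match["span"]
assert content[start:end] == match["value"]   # span slices out exactly the value

# Grab up to 10 characters of context on each side of the match
context = content[max(0, start - 10):end + 10]
print(context)
```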

---

## 5. Why "No LLM" Is Often Better

1. **Zero Hallucination**: Pattern-based extraction doesn't guess text. It either finds the text or it doesn't.
2. **Guaranteed Structure**: The same schema or regex yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema or regex, not re-tuning a model.

**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema or regex approach first for repeated or consistent data patterns.

---

## 6. Base Element Attributes & Additional Fields

It's easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:

```json
{
  "name": "href",
  "type": "attribute",
  "attribute": "href",
  "default": null
}
```

You can define them in **`baseFields`** (extracted from the main container element) or in each field's sub-lists. This is especially helpful if you need an item's link or ID stored in the parent `<div>`.

---

## 7. Putting It All Together: Larger Example

Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:

```python
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}
```

Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.

---

## 8. Tips & Best Practices

1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
3. **Test** your schema on partial HTML or a test page before a big crawl.
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.

---

## 9. Schema Generation Utility

While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using an LLM. This is particularly useful when:

1. You're dealing with a new website structure and want a quick starting point
2. You need to extract complex nested data structures
3. You want to avoid the learning curve of CSS/XPath selector syntax

### Using the Schema Generator

The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4 or the open-source Ollama for schema generation:

```python
from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai import LLMConfig

# Sample HTML with product information
html = """
<div class="product-card">
  <h2 class="title">Gaming Laptop</h2>
  <div class="price">$999.99</div>
  <div class="specs">
    <ul>
      <li>16GB RAM</li>
      <li>1TB SSD</li>
    </ul>
  </div>
</div>
"""

# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")
)

# Option 2: Using Ollama (open source, no token needed)
xpath_schema = JsonXPathExtractionStrategy.generate_schema(
    html,
    schema_type="xpath",
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema)
```

### LLM Provider Options

1. **OpenAI GPT-4 (`openai/gpt-4o`)**
   - Default provider
   - Requires an API token
   - Generally provides more accurate schemas
   - Set via environment variable: `OPENAI_API_KEY`

2. **Ollama (`ollama/llama3.3`)**
   - Open source alternative
   - No API token required
   - Self-hosted option
   - Good for development and testing

### Benefits of Schema Generation

1. **One-Time Cost**: While schema generation uses an LLM, it's a one-time cost. The generated schema can be reused for unlimited extractions without further LLM calls.
2. **Smart Pattern Recognition**: The LLM analyzes the HTML structure and identifies common patterns, often producing more robust selectors than manual attempts.
3. **Automatic Nesting**: Complex nested structures are automatically detected and properly represented in the schema.
4. **Learning Tool**: The generated schemas serve as excellent examples for learning how to write your own schemas.

### Best Practices

1. **Review Generated Schemas**: While the generator is smart, always review and test the generated schema before using it in production.
2. **Provide Representative HTML**: The better your sample HTML represents the overall structure, the more accurate the generated schema will be.
3. **Consider Both CSS and XPath**: Try both schema types and choose the one that works best for your specific case.
4. **Cache Generated Schemas**: Since generation uses an LLM, save successful schemas for reuse.
5. **API Token Security**: Never hardcode API tokens. Use environment variables or secure configuration management.
6. **Choose Provider Wisely**:
   - Use OpenAI for production-quality schemas
   - Use Ollama for development, testing, or when you need a self-hosted solution
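
Points 1 and 4 combine naturally: sanity-check a generated schema before caching it. The sketch below is illustrative; the required keys mirror the schema shape used in this guide, not validation performed by Crawl4AI itself:

```python
def check_schema(schema):
    """Raise if a generated schema is structurally unusable.

    Required keys follow the schema shape used in this guide;
    this is an illustrative check, not library-enforced validation.
    """
    missing = {"name", "baseSelector", "fields"} - schema.keys()
    if missing:
        raise ValueError(f"schema is missing top-level keys: {sorted(missing)}")
    for field in schema["fields"]:
        if "name" not in field or "type" not in field:
            raise ValueError(f"malformed field entry: {field}")
    return schema

# Only cache schemas that pass the check
schema = check_schema({
    "name": "Products",
    "baseSelector": "div.product-card",
    "fields": [{"name": "title", "selector": "h2.title", "type": "text"}],
})
```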

### Multi-Sample Schema Generation

When scraping multiple pages with varying DOM structures (e.g., product pages where table rows appear in different positions), single-sample schema generation may produce **fragile selectors** like `tr:nth-child(6)` that break on other pages.

**The Problem:**
```
Page A: Manufacturer is in row 6 → selector: tr:nth-child(6) td a
Page B: Manufacturer is in row 5 → selector FAILS
Page C: Manufacturer is in row 7 → selector FAILS
```

**The Solution:** Provide multiple HTML samples so the LLM identifies stable patterns that work across all pages.

```python
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

# Collect HTML samples from different pages
html_sample_1 = """
<table class="specs">
  <tr><td>Brand</td><td>Apple</td></tr>
  <tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr>
</table>
"""

html_sample_2 = """
<table class="specs">
  <tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>
  <tr><td>Brand</td><td>Galaxy</td></tr>
</table>
"""

html_sample_3 = """
<table class="specs">
  <tr><td>Model</td><td>Pixel 8</td></tr>
  <tr><td>Brand</td><td>Google</td></tr>
  <tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr>
</table>
"""

# Combine samples under labeled headers. Building the fenced blocks
# programmatically keeps literal triple-backtick markers out of this
# page's own code fences.
fence = "`" * 3
samples = [html_sample_1, html_sample_2, html_sample_3]
combined_html = "\n\n".join(
    f"## HTML Sample {i} (Product {label}):\n{fence}html\n{sample.strip()}\n{fence}"
    for i, (label, sample) in enumerate(zip("ABC", samples), start=1)
)

# Provide instructions for stable selectors
query = """
IMPORTANT: I'm providing 3 HTML samples from different product pages.
The manufacturer field appears in different row positions across pages.
Generate selectors using stable attributes like href patterns (e.g., a[href*='/m/'])
instead of fragile positional selectors like nth-child().
Extract: manufacturer name and link.
"""

# Generate schema with multi-sample awareness
schema = JsonCssExtractionStrategy.generate_schema(
    html=combined_html,
    query=query,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-token")
)

# The generated schema will use stable selectors like:
# a[href*="/m/"] instead of tr:nth-child(6) td a
print(schema)
```

**Key Points for Multi-Sample Queries:**

1. **Format samples clearly** - Use markdown headers and code blocks to separate samples
2. **State the number of samples** - "I'm providing 3 HTML samples..."
3. **Explain the variation** - "...the manufacturer field appears in different row positions"
4. **Request stable selectors** - "Use href patterns, data attributes, or class names instead of nth-child"

**Stable vs Fragile Selectors:**

| Fragile (single sample) | Stable (multi-sample) |
|------------------------|----------------------|
| `tr:nth-child(6) td a` | `a[href*="/m/"]` |
| `div:nth-child(3) .price` | `.price, [data-price]` |
| `ul li:first-child` | `li[data-featured="true"]` |
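
You can check the intuition behind this table with plain Python before spending a generation call. The sketch below uses `re` rather than Crawl4AI, purely to show that an attribute-keyed pattern matches every sample no matter which row the manufacturer lands in:

```python
import re

# Manufacturer rows from the three sample pages (different positions on each)
rows = [
    '<tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr>',
    '<tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>',
    '<tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr>',
]

# Stable: keyed on the href pattern, not on row position
stable = re.compile(r'<a href="(/m/[^"]+)">([^<]+)</a>')

for row in rows:
    link, name = stable.search(row).groups()
    print(f"{name}: {link}")
```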

This approach lets you generate schemas once that work reliably across hundreds of similar pages with varying structures.

---

## 10. Conclusion

With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:

- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or pattern-based extraction.
- Scale to thousands of pages quickly and reliably.

**Choosing the Right Strategy**:

- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns
- If you need both: first extract structured data with JSON strategies, then use regex on specific fields
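
The hybrid route can be as simple as a second pass over one field of the structured output. A standard-library sketch, where the `items` list stands in for output from a JSON strategy:

```python
import re

# Hypothetical output from a JsonCssExtractionStrategy pass
items = [
    {"title": "Contact", "body": "Reach us at support@example.com or sales@example.com."},
    {"title": "About", "body": "Founded in 2020."},
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}")

# Second pass: regex applied only to the field that needs it
for item in items:
    item["emails"] = EMAIL.findall(item["body"])

print(items[0]["emails"])  # ['support@example.com', 'sales@example.com']
```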

**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.

**Last Updated**: 2025-05-02

---

That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!