Add source (sibling selector) support to JSON extraction strategies

Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, and the two lxml-based CSS variants)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
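As a rough sketch of the dispatch change (helper name and signature hypothetical, not the actual implementation), the `source` value can be split into its combinator and selector before a field is extracted:

```python
def parse_source(source: str) -> tuple[str, str]:
    """Split a source spec like "+ tr" into (combinator, selector).

    Only the next-sibling combinator "+" is sketched here; the real
    _resolve_source logic lives in each strategy subclass.
    """
    combinator, _, selector = source.strip().partition(" ")
    if combinator != "+" or not selector.strip():
        raise ValueError(f"unsupported source spec: {source!r}")
    return combinator, selector.strip()
```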
Commit d267c650cb by unclecode
2026-02-17 09:04:40 +00:00
parent ccd24aa824
7 changed files with 1054 additions and 28 deletions


@@ -120,7 +120,8 @@ schema = {
"attribute": str, # For type="attribute"
"pattern": str, # For type="regex"
"transform": str, # Optional: "lowercase", "uppercase", "strip"
"default": Any, # Default value if extraction fails
"source": str, # Optional: navigate to sibling first, e.g. "+ tr"
}
]
}


@@ -232,6 +232,7 @@ if __name__ == "__main__":
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
- For sites where data is split across sibling elements (e.g. Hacker News), use the `"source"` field key to navigate to a sibling before extracting: `{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"}`.
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
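A minimal sketch of the `raw://` tip (HTML and variable names illustrative):

```python
# Prefix inline HTML with raw:// so the crawler parses it instead of fetching a URL.
html = '<div class="item"><span class="name">Widget</span></div>'
raw_url = f"raw://{html}"
# Then pass raw_url wherever a normal URL would go, e.g.
# result = await crawler.arun(url=raw_url, config=run_config)
```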
## 6. Simple Data Extraction (LLM-based)
- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)


@@ -92,9 +92,10 @@ asyncio.run(extract_crypto_prices())
**Highlights**:
- **`baseSelector`**: Tells us where each "item" (crypto row) is.
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).
- Optional keys: **`transform`**, **`default`**, **`attribute`**, **`pattern`**, and **`source`** (for sibling data — see [Extracting Sibling Data](#sibling-data)).
No LLM is needed, and the performance is **near-instant** for hundreds or thousands of items.
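For instance, a single field can combine several of these optional keys (selector and values illustrative):

```python
field = {
    "name": "price_usd",
    "selector": "span.price",
    "type": "regex",
    "pattern": r"[\d,.]+",   # keep only the numeric part of e.g. "$1,234.56"
    "transform": "strip",    # trim surrounding whitespace
    "default": None,         # returned when nothing matches
}
```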
@@ -623,7 +624,60 @@ Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post o
---
## 8. Extracting Sibling Data with `source` {#sibling-data}
Some websites split a single logical item across **sibling elements** rather than nesting everything inside one container. A classic example is Hacker News, where each submission spans two adjacent `<tr>` rows:
```html
<tr class="athing submission"> <!-- rank, title, url -->
<td><span class="rank">1.</span></td>
<td><span class="titleline"><a href="https://example.com">Example Title</a></span></td>
</tr>
<tr> <!-- score, author, comments (sibling!) -->
<td class="subtext">
<span class="score">100 points</span>
<a class="hnuser">johndoe</a>
</td>
</tr>
```
Normally, field selectors only search **descendants** of the base element — siblings are unreachable. The `source` field key solves this by navigating to a sibling element before running the selector.
### Syntax
```
"source": "+ <selector>"
```
- **`+ tr`** — next sibling `<tr>`
- **`+ div.details`** — next sibling `<div>` with class `details`
- **`+ .subtext`** — next sibling with class `subtext`
### Example: Hacker News
```python
schema = {
"name": "HN Submissions",
"baseSelector": "tr.athing.submission",
"fields": [
{"name": "rank", "selector": "span.rank", "type": "text"},
{"name": "title", "selector": "span.titleline a", "type": "text"},
{"name": "url", "selector": "span.titleline a", "type": "attribute", "attribute": "href"},
{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"},
{"name": "author", "selector": "a.hnuser", "type": "text", "source": "+ tr"},
],
}
strategy = JsonCssExtractionStrategy(schema)
```
The `score` and `author` fields first navigate to the next sibling `<tr>`, then run their selectors inside that element. Fields without `source` work as before — searching descendants of the base element.
`source` works with all field types (`text`, `attribute`, `nested`, `list`, etc.) and with both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. If the sibling isn't found, the field returns its `default` value.
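Applied to the Hacker News markup shown above, each extracted item merges base-row and sibling-row data, roughly:

```python
item = {
    "rank": "1.",
    "title": "Example Title",
    "url": "https://example.com",
    "score": "100 points",  # reached via source: "+ tr"
    "author": "johndoe",    # likewise from the sibling row
}
```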
---
## 9. Tips & Best Practices
1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
@@ -636,7 +690,7 @@ Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post o
---
## 10. Schema Generation Utility
While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using an LLM. This is particularly useful when:
@@ -669,7 +723,7 @@ html = """
# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
html,
schema_type="css",
llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")
)
@@ -684,6 +738,29 @@ xpath_schema = JsonXPathExtractionStrategy.generate_schema(
strategy = JsonCssExtractionStrategy(css_schema)
```
### Schema Validation
By default, `generate_schema` **validates** the generated schema against the HTML to ensure that it actually extracts the data you expect. If the schema doesn't produce results, it automatically refines the selectors before returning.
You can control this with the `validate` parameter:
```python
# Default: validated (recommended)
schema = JsonCssExtractionStrategy.generate_schema(
url="https://news.ycombinator.com",
query="Extract each story: title, url, score, author",
)
# Skip validation if you want raw LLM output
schema = JsonCssExtractionStrategy.generate_schema(
url="https://news.ycombinator.com",
query="Extract each story: title, url, score, author",
validate=False,
)
```
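Conceptually, the validate-and-refine behavior works like this sketch (callables and feedback wording hypothetical; the real logic lives inside `generate_schema`):

```python
def generate_with_validation(html, generate, extract, max_attempts=3):
    """Regenerate the schema until it extracts at least one item."""
    feedback = None
    for _ in range(max_attempts):
        schema = generate(html, feedback=feedback)
        if extract(html, schema):   # non-empty result: accept the schema
            return schema
        feedback = "The schema extracted no items; revise the selectors."
    return schema                   # last attempt, even if still empty
```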
The generator also understands sibling layouts — for sites like Hacker News where data is split across sibling elements, it will automatically use the [`source` field](#sibling-data) to reach sibling data.
### LLM Provider Options
1. **OpenAI GPT-4o (`openai/gpt-4o`)**
@@ -814,7 +891,7 @@ This approach lets you generate schemas once that work reliably across hundreds
---
## 11. Conclusion
With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that: