Add source (sibling selector) support to JSON extraction strategies

Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
This commit is contained in:
unclecode
2026-02-17 09:04:40 +00:00
parent ccd24aa824
commit d267c650cb
7 changed files with 1054 additions and 28 deletions

View File

@@ -232,6 +232,7 @@ if __name__ == "__main__":
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
- For sites where data is split across sibling elements (e.g. Hacker News), use the `"source"` field key to navigate to a sibling before extracting: `{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"}`.
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
## 6. Simple Data Extraction (LLM-based)
- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)