Add source (sibling selector) support to JSON extraction strategies

Many sites (e.g. Hacker News) split a single item's data across sibling elements. Field selectors only search descendants, making sibling data unreachable.

The new "source" field key navigates to a sibling element before running the selector: {"source": "+ tr"} finds the next sibling <tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
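
The sibling navigation described in this message can be illustrated with a small standalone sketch. It is not the code from this commit: it uses only the standard library, the helper names (`resolve_source`, `extract_items`) and the sample markup are hypothetical, and it assumes "+ tr" means the immediately following sibling element, as the CSS adjacent-sibling combinator does.

```python
import xml.etree.ElementTree as ET

# Hypothetical HN-like markup: title and score live in sibling rows.
HTML = """
<table>
  <tr class="athing"><td><a class="title">First story</a></td></tr>
  <tr><td><span class="score">120 points</span></td></tr>
  <tr class="athing"><td><a class="title">Second story</a></td></tr>
  <tr><td><span class="score">45 points</span></td></tr>
</table>
"""


def resolve_source(root, element, source):
    """Resolve a source like "+ tr" to the adjacent sibling, or None.

    Assumption: only the "+ <tag>" form is handled in this sketch.
    """
    tag = source.lstrip("+ ").strip()
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    idx = siblings.index(element)
    if idx + 1 < len(siblings) and siblings[idx + 1].tag == tag:
        return siblings[idx + 1]
    return None


def extract_items(html):
    """Extract title from each base row and score from its sibling row."""
    root = ET.fromstring(html.strip())
    items = []
    for row in root.findall(".//tr[@class='athing']"):
        title = row.find(".//a[@class='title']")
        # Navigate to the sibling first, then run the selector from there.
        sibling = resolve_source(root, row, "+ tr")
        score = sibling.find(".//span[@class='score']") if sibling is not None else None
        items.append({
            "title": title.text if title is not None else None,
            "score": score.text if score is not None else None,
        })
    return items


if __name__ == "__main__":
    print(extract_items(HTML))
```

When no matching sibling exists, the sketch returns None for that field rather than raising, which is one plausible way to handle the edge case.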
2026-02-17 09:04:40 +00:00
parent ccd24aa824
commit d267c650cb
7 changed files with 1054 additions and 28 deletions
--- a/docs/md_v2/complete-sdk-reference.md
+++ b/docs/md_v2/complete-sdk-reference.md
@@ -232,6 +232,7 @@ if __name__ == "__main__":
 - Great for repetitive page structures (e.g., item listings, articles).
 - No AI usage or costs.
 - The crawler returns a JSON string you can parse or store.
+- For sites where data is split across sibling elements (e.g. Hacker News), use the `"source"` field key to navigate to a sibling before extracting: `{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"}`.
 > Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
 ## 6. Simple Data Extraction (LLM-based)
 - **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)