Improve schema generation prompt for sibling-based layouts

This commit is contained in:
unclecode
2026-02-10 08:34:22 +00:00
parent fbc52813a4
commit 44b8afb6dc

View File

@@ -311,6 +311,17 @@ Available field types:
- nested: Object containing other fields - nested: Object containing other fields
- list: Array of similar items - list: Array of similar items
- regex: Pattern-based extraction - regex: Pattern-based extraction
CRITICAL - How selectors work at each level:
- baseSelector runs against the FULL document and returns all matching elements.
- Field selectors run INSIDE each base element (descendants only, not siblings).
- This means a field selector will NEVER match sibling elements of the base element.
- Therefore: NEVER use the same (or equivalent) selector as baseSelector in a field.
It would search for the element inside itself, which returns nothing for flat/sibling layouts.
When repeating items are siblings (e.g. table rows, flat divs):
- CORRECT: Use baseSelector to match each item, then use flat fields (text/attribute) to extract data directly from within each item.
- WRONG: Using baseSelector as a "list" field selector inside itself — this produces empty arrays.
</type_definitions> </type_definitions>
<behavior_rules> <behavior_rules>
@@ -606,6 +617,40 @@ Generated Schema:
} }
] ]
} }
7. Sibling Rows Example (e.g. table rows, flat lists):
<html>
<table>
<tr class="item"><td class="title"><a href="/1">First</a></td></tr>
<tr class="item"><td class="title"><a href="/2">Second</a></td></tr>
</table>
</html>
WRONG Schema (baseSelector reused as list field — produces empty results):
{
"name": "Items",
"baseSelector": ".item",
"fields": [
{
"name": "entries",
"type": "list",
"selector": ".item",
"fields": [
{"name": "title", "selector": ".title a", "type": "text"}
]
}
]
}
CORRECT Schema (flat fields directly on base element):
{
"name": "Items",
"baseSelector": ".item",
"fields": [
{"name": "title", "selector": ".title a", "type": "text"},
{"name": "link", "selector": ".title a", "type": "attribute", "attribute": "href"}
]
}
</examples> </examples>
@@ -687,6 +732,17 @@ Available field types:
- nested: Object containing other fields - nested: Object containing other fields
- list: Array of similar items - list: Array of similar items
- regex: Pattern-based extraction - regex: Pattern-based extraction
CRITICAL - How selectors work at each level:
- baseSelector runs against the FULL document and returns all matching elements.
- Field selectors run INSIDE each base element (descendants only, not siblings).
- This means a field selector will NEVER match sibling elements of the base element.
- Therefore: NEVER use the same (or equivalent) selector as baseSelector in a field.
It would search for the element inside itself, which returns nothing for flat/sibling layouts.
When repeating items are siblings (e.g. table rows, flat divs):
- CORRECT: Use baseSelector to match each item, then use flat fields (text/attribute) to extract data directly from within each item.
- WRONG: Using baseSelector as a "list" field selector inside itself — this produces empty arrays.
</type_definitions> </type_definitions>
<behavior_rules> <behavior_rules>
@@ -982,6 +1038,40 @@ Generated Schema:
} }
] ]
} }
7. Sibling Rows Example (e.g. table rows, flat lists):
<html>
<table>
<tr class="item"><td class="title"><a href="/1">First</a></td></tr>
<tr class="item"><td class="title"><a href="/2">Second</a></td></tr>
</table>
</html>
WRONG Schema (baseSelector reused as list field — produces empty results):
{
"name": "Items",
"baseSelector": ".//tr[@class='item']",
"fields": [
{
"name": "entries",
"type": "list",
"selector": ".//tr[@class='item']",
"fields": [
{"name": "title", "selector": ".//td[@class='title']/a", "type": "text"}
]
}
]
}
CORRECT Schema (flat fields directly on base element):
{
"name": "Items",
"baseSelector": ".//tr[@class='item']",
"fields": [
{"name": "title", "selector": ".//td[@class='title']/a", "type": "text"},
{"name": "link", "selector": ".//td[@class='title']/a", "type": "attribute", "attribute": "href"}
]
}
</examples> </examples>
<output_requirements> <output_requirements>