feat(content): add target_elements parameter for selective content extraction

Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media.

Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py

BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
This commit is contained in:
UncleCode
2025-03-10 18:54:51 +08:00
parent 9d69fce834
commit 9547bada3a
7 changed files with 188 additions and 47 deletions

View File

@@ -71,7 +71,8 @@ We group them by category.
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
| **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. |
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |