crawl4ai/docs/llm.txt/8_content_selection.xs.md
UncleCode d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00


# Crawl4AI Content Selection (LLM-Friendly Reference)
> Minimal, code-oriented reference for selecting and filtering webpage content using Crawl4AI.
## Quick Start
```python
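import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def run():
    # Restrict processing to elements matching the CSS selector
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)  # extracted_content is only populated when an extraction_strategy is set

asyncio.run(run())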
```
## CSS Selectors
- Use `css_selector="selector"` to target specific content.
```python
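# Comma-separated selectors match any of the listed elements
config = CrawlerRunConfig(css_selector="article h1, article .content")
# Call inside an async function, with `crawler` opened as in Quick Start
result = await crawler.arun(url="...", config=config)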
```
## Content Filtering
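- `word_count_threshold`: int, minimum number of words a text block needs to be kept
- `excluded_tags`: list of HTML tag names to drop from the output
- `exclude_external_links`: bool, drop links that point to other domains
- `exclude_social_media_links`: bool, drop links to social media sites
- `exclude_external_images`: bool, drop images hosted on other domains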
```python
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["form", "header", "footer", "nav"],
    exclude_external_links=True,
    exclude_social_media_links=True,
    exclude_external_images=True
)
```
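To make the effect of these options concrete, here is a minimal pure-Python sketch of the kind of filtering they describe. It is not Crawl4AI's implementation: the helpers `keep_block` and `is_external`, the sample blocks, and the URLs are hypothetical, and the library's actual rules (for example, how it treats subdomains) may differ.

```python
from urllib.parse import urljoin, urlparse

def keep_block(text: str, word_count_threshold: int) -> bool:
    """Keep a text block only if it has at least `word_count_threshold` words."""
    return len(text.split()) >= word_count_threshold

def is_external(link: str, page_url: str) -> bool:
    """Treat a link as external when its host differs from the page's host."""
    absolute = urljoin(page_url, link)  # resolve relative links first
    return urlparse(absolute).netloc.lower() != urlparse(page_url).netloc.lower()

# Hypothetical sample data, for illustration only
page_url = "https://example.com/news/"
blocks = [
    "Subscribe now",
    "A longer paragraph with enough words to pass the threshold easily today",
]
links = ["/about", "https://example.com/contact", "https://other.example.org/ad"]

kept_blocks = [b for b in blocks if keep_block(b, word_count_threshold=10)]
internal_links = [l for l in links if not is_external(l, page_url)]

print(kept_blocks)
print(internal_links)
```

The short "Subscribe now" block falls below the threshold and is dropped, and only the link to a different host is treated as external.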
## Iframe Content
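- `process_iframes`: bool, include content loaded inside `<iframe>` elements
- `remove_overlay_elements`: bool, strip popups, modals, and similar overlays before extraction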
```python
config = CrawlerRunConfig(
    process_iframes=True,
    remove_overlay_elements=True
)
```
## LLM-Based Extraction
- Use `LLMExtractionStrategy(provider="...")` with `schema=...` (a JSON schema describing the output) and `instruction="..."` (the prompt telling the model what to extract).
```python
from pydantic import BaseModel
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleContent(BaseModel):
    title: str
    main_points: list[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),  # JSON schema dict; Pydantic v2 prefers model_json_schema()
    instruction="Extract title, points, conclusion"
)
config = CrawlerRunConfig(extraction_strategy=strategy)
```
## Pattern-Based Selection (JsonCssExtractionStrategy)
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # one output record per matching element
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",  # groups the sub-fields under one key
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
```
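Because these schemas are plain dictionaries, a typo in a key only surfaces at crawl time. The sketch below is a small pre-flight check written for illustration: `validate_schema` is a hypothetical helper, not part of Crawl4AI, and the rules it enforces (a top-level `baseSelector`, a `name` and `type` on every field, a `selector` on non-nested fields) are assumptions inferred from the example schema above rather than the library's own validation.

```python
def validate_schema(schema: dict, path: str = "schema", top_level: bool = True) -> list[str]:
    """Return a list of problems found in a JsonCssExtractionStrategy-style schema."""
    problems: list[str] = []
    if top_level and "baseSelector" not in schema:
        problems.append(f"{path}: missing 'baseSelector'")
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        problems.append(f"{path}: 'fields' must be a non-empty list")
        return problems
    for i, field in enumerate(fields):
        where = f"{path}.fields[{i}]"
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"{where}: missing '{key}'")
        if field.get("type") == "nested":
            # Nested fields carry their own "fields" list instead of a selector
            problems.extend(validate_schema(field, where, top_level=False))
        elif "selector" not in field:
            problems.append(f"{where}: missing 'selector'")
    return problems

good = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "metadata", "type": "nested", "fields": [
            {"name": "author", "selector": ".author", "type": "text"},
        ]},
    ],
}
bad = {"name": "Broken", "fields": [{"name": "headline", "type": "text"}]}

print(validate_schema(good))
print(validate_schema(bad))
```

Running a check like this before constructing the strategy keeps malformed schemas from silently producing empty results.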
## Combined Example
```python
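import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

article_schema = {
    "name": "Article",
    "baseSelector": "article.main",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

# Structured extraction combined with content filtering
config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(article_schema),
    word_count_threshold=10,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True
)

async def run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.extracted_content)  # JSON string, one object per matched article

asyncio.run(run())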
```
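After a crawl with a config like this, `result.extracted_content` is expected to hold a JSON string with one object per element matched by `baseSelector`. The sketch below shows one way to post-process it. The `sample_extracted_content` payload is hypothetical stand-in data, not output from a real crawl, and `parse_articles` is an illustrative helper rather than a Crawl4AI function.

```python
import json

def parse_articles(extracted_content):
    """Decode the JSON string and keep records that have both a title and content."""
    if not extracted_content:  # None or empty string when nothing was extracted
        return []
    records = json.loads(extracted_content)
    return [
        {"title": r["title"].strip(), "content": r["content"].strip()}
        for r in records
        if r.get("title", "").strip() and r.get("content", "").strip()
    ]

# Hypothetical payload shaped like the article_schema fields (title, content)
sample_extracted_content = json.dumps([
    {"title": " First post ", "content": "Body text of the first post."},
    {"title": "", "content": "Orphaned content without a title."},
    {"title": "Title only"},
])

articles = parse_articles(sample_extracted_content)
print(articles)
```

Records with a missing or empty field are dropped here; whether to drop, log, or repair them is a design choice that depends on the downstream consumer.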
## Optional
- [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
- [async_crawler_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)