Enhance crawler capabilities and documentation
- Added llm.txt generator.
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
docs/llm.txt/7_extraction_strategies.xs.md
@@ -0,0 +1,102 @@
# Extraction Strategies (Condensed LLM-Friendly Reference)

> Extract structured data (JSON) and text blocks from HTML with LLM-based, clustering, or selector-based (CSS/XPath) methods.

Streamlined parameters, usage, and code snippets for quick LLM reference.

## Input Formats

- **markdown** (default): Raw markdown generated from the HTML
- **html**: Raw HTML content
- **fit_markdown**: Cleaned markdown (needs `markdown_generator` + `content_filter`)

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

strategy = LLMExtractionStrategy(
    input_format="html",  # Choose format: "markdown", "html", or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract data"
)

config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(),  # For fit_markdown
    content_filter=PruningContentFilter()  # For fit_markdown
)
```

## LLMExtractionStrategy

- Uses an LLM to extract structured data from HTML.
- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`, `input_format`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",  # "<provider>/<model>", matching the string used under Input Formats
    api_token="your_api_token",
    instruction="Extract prices",
    schema={"fields": [...]},
    extraction_type="schema",
    input_format="html"
)
```

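The `chunk_token_threshold` and `overlap_rate` parameters suggest that long inputs are split into overlapping chunks before they reach the LLM. The sketch below illustrates that general idea only and is not crawl4ai's implementation: `chunk_words` is a hypothetical helper, and it approximates tokens with whitespace-separated words.

```python
def chunk_words(text, chunk_token_threshold=8, overlap_rate=0.25):
    """Split text into word chunks; each chunk repeats the tail of the previous one.

    Hypothetical helper: approximates tokens with whitespace-separated words.
    """
    words = text.split()
    if not words:
        return []
    # Number of words shared between consecutive chunks.
    overlap = int(chunk_token_threshold * overlap_rate)
    # Always advance by at least one word so the loop terminates.
    step = max(1, chunk_token_threshold - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_token_threshold]))
        if start + chunk_token_threshold >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_token_threshold=4, overlap_rate=0.5)
print(chunks)
```

A higher overlap repeats more context across chunk boundaries at the cost of sending more total text to the model.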
## CosineStrategy

- Clusters content via semantic embeddings.
- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.

```python
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",
    word_count_threshold=20,
    sim_threshold=0.3,
    top_k=5
)
```

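To make the roles of these parameters concrete, here is a minimal sketch of similarity-based filtering. It is an illustration under stated assumptions, not the library's algorithm: plain word counts stand in for semantic embeddings, no clustering is performed, and `cosine_similarity` and `filter_blocks` are hypothetical helpers.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_blocks(blocks, semantic_filter, word_count_threshold=3,
                  sim_threshold=0.3, top_k=2):
    """Keep the top_k blocks that are long enough and similar enough to the filter."""
    query = Counter(semantic_filter.lower().split())
    scored = []
    for block in blocks:
        words = block.lower().split()
        if len(words) < word_count_threshold:
            continue  # drop blocks with too few words
        score = cosine_similarity(Counter(words), query)
        if score >= sim_threshold:
            scored.append((score, block))
    # Highest similarity first, then truncate to top_k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:top_k]]


blocks = [
    "great product reviews from verified buyers",
    "contact us",
    "shipping policy and returns information",
]
kept = filter_blocks(blocks, "product reviews")
print(kept)
```

Here "contact us" is dropped by the word-count threshold and the shipping block by the similarity threshold, so only the reviews block survives.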
## JsonCssExtractionStrategy

- Extracts data using CSS selectors.
- `schema` defines `baseSelector`, `fields`.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
strategy = JsonCssExtractionStrategy(schema=schema)
```

## JsonXPathExtractionStrategy

- Works like `JsonCssExtractionStrategy` but uses XPath selectors.
- Can be more stable when class names change, since XPath can also target document structure.

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

schema = {
    "baseSelector": "//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"}
    ]
}
strategy = JsonXPathExtractionStrategy(schema=schema)
```

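The way such a schema maps onto a page can be sketched with the standard library alone. This is a rough illustration of the `baseSelector` / `fields` idea, not how crawl4ai evaluates the schema: `apply_schema` and the sample markup are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
SAMPLE = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$10</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$25</span></div>
</body></html>
"""

schema = {
    "baseSelector": "//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
    ],
}


def apply_schema(markup, schema):
    """Return one dict per base element, with one key per schema field."""
    root = ET.fromstring(markup.strip())
    # ElementTree rejects absolute paths on an element, so make the path relative.
    base_path = "." + schema["baseSelector"]
    items = []
    for element in root.findall(base_path):
        item = {}
        for field in schema["fields"]:
            match = element.find(field["selector"])
            has_text = match is not None and match.text
            item[field["name"]] = match.text.strip() if has_text else None
        items.append(item)
    return items


items = apply_schema(SAMPLE, schema)
print(items)
```

Each element matched by `baseSelector` becomes one record, and each field selector is evaluated relative to that element, which is why the field selectors start with `.//`.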
## Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # `strategy` is any extraction strategy built in the sections above
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.extracted_content)

asyncio.run(main())
```

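Since the strategies above are described as producing structured JSON, a caller will typically decode `result.extracted_content` before using it. The sketch below assumes the field holds a JSON string; the sample payload and the `parse_extracted` helper are hypothetical and not part of crawl4ai.

```python
import json

# Hypothetical stand-in for result.extracted_content (assumed to be a JSON string).
extracted_content = '[{"title": "Widget", "price": "$10"}, {"title": "Gadget", "price": "$25"}]'


def parse_extracted(content):
    """Decode extracted content into a list of records; return [] when empty or invalid."""
    if not content:
        return []
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return []
    # Normalize a single object into a one-element list.
    return data if isinstance(data, list) else [data]


records = parse_extracted(extracted_content)
for record in records:
    print(record["title"], record["price"])
```

Returning an empty list on missing or malformed content keeps downstream loops simple, though a stricter caller may prefer to raise instead.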
## Optional

- [extraction_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategies.py)