Update Documentation
This commit is contained in:
133
docs/md_v2/extraction/chunking.md
Normal file
133
docs/md_v2/extraction/chunking.md
Normal file
@@ -0,0 +1,133 @@
|
||||
## Chunking Strategies 📚
|
||||
|
||||
Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.
|
||||
|
||||
### RegexChunking
|
||||
|
||||
`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.
|
||||
|
||||
#### When to Use
|
||||
- Great for structured text with consistent delimiters.
|
||||
- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.
|
||||
|
||||
#### Parameters
|
||||
- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).
|
||||
|
||||
#### Example
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import RegexChunking
|
||||
|
||||
# Define patterns for splitting text
|
||||
patterns = [r'\n\n', r'\. ']
|
||||
chunker = RegexChunking(patterns=patterns)
|
||||
|
||||
# Sample text
|
||||
text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
|
||||
|
||||
# Chunk the text
|
||||
chunks = chunker.chunk(text)
|
||||
print(chunks)
|
||||
```
|
||||
|
||||
### NlpSentenceChunking
|
||||
|
||||
`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.
|
||||
|
||||
#### When to Use
|
||||
- Ideal for texts where sentence boundaries are crucial.
|
||||
- Useful for creating chunks that preserve grammatical structures.
|
||||
|
||||
#### Parameters
|
||||
- None.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import NlpSentenceChunking
|
||||
|
||||
chunker = NlpSentenceChunking()
|
||||
|
||||
# Sample text
|
||||
text = "This is a sample text. It will be split into sentences. Here's another sentence."
|
||||
|
||||
# Chunk the text
|
||||
chunks = chunker.chunk(text)
|
||||
print(chunks)
|
||||
```
|
||||
|
||||
### TopicSegmentationChunking
|
||||
|
||||
`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.
|
||||
|
||||
#### When to Use
|
||||
- Perfect for long documents with distinct topics.
|
||||
- Useful when preserving topic continuity is more important than maintaining text order.
|
||||
|
||||
#### Parameters
|
||||
- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import TopicSegmentationChunking
|
||||
|
||||
chunker = TopicSegmentationChunking(num_keywords=3)
|
||||
|
||||
# Sample text
|
||||
text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."
|
||||
|
||||
# Chunk the text
|
||||
chunks = chunker.chunk(text)
|
||||
print(chunks)
|
||||
```
|
||||
|
||||
### FixedLengthWordChunking
|
||||
|
||||
`FixedLengthWordChunking` splits text into chunks based on a fixed number of words. This ensures each chunk has approximately the same length.
|
||||
|
||||
#### When to Use
|
||||
- Suitable for processing large texts where uniform chunk size is important.
|
||||
- Useful when the number of words per chunk needs to be controlled.
|
||||
|
||||
#### Parameters
|
||||
- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import FixedLengthWordChunking
|
||||
|
||||
chunker = FixedLengthWordChunking(chunk_size=10)
|
||||
|
||||
# Sample text
|
||||
text = "This is a sample text. It will be split into chunks of fixed length."
|
||||
|
||||
# Chunk the text
|
||||
chunks = chunker.chunk(text)
|
||||
print(chunks)
|
||||
```
|
||||
|
||||
### SlidingWindowChunking
|
||||
|
||||
`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.
|
||||
|
||||
#### When to Use
|
||||
- Ideal for creating overlapping chunks to preserve context.
|
||||
- Useful for tasks where context from adjacent chunks is needed.
|
||||
|
||||
#### Parameters
|
||||
- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
|
||||
- `step` (int, optional): Number of words to slide the window. Default is `50`.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import SlidingWindowChunking
|
||||
|
||||
chunker = SlidingWindowChunking(window_size=10, step=5)
|
||||
|
||||
# Sample text
|
||||
text = "This is a sample text. It will be split using a sliding window approach to preserve context."
|
||||
|
||||
# Chunk the text
|
||||
chunks = chunker.chunk(text)
|
||||
print(chunks)
|
||||
```
|
||||
|
||||
With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking! 📝✨
|
||||
222
docs/md_v2/extraction/cosine.md
Normal file
222
docs/md_v2/extraction/cosine.md
Normal file
@@ -0,0 +1,222 @@
|
||||
# Cosine Strategy
|
||||
|
||||
The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
|
||||
|
||||
## How It Works
|
||||
|
||||
The Cosine Strategy:
|
||||
1. Breaks down page content into meaningful chunks
|
||||
2. Converts text into vector representations
|
||||
3. Calculates similarity between chunks
|
||||
4. Clusters similar content together
|
||||
5. Ranks and filters content based on relevance
|
||||
|
||||
## Basic Usage
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import CosineStrategy
|
||||
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product reviews", # Target content type
|
||||
word_count_threshold=10, # Minimum words per cluster
|
||||
sim_threshold=0.3 # Similarity threshold
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/reviews",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
|
||||
content = result.extracted_content
|
||||
```
|
||||
|
||||
## Configuration Options
|
||||
|
||||
### Core Parameters
|
||||
|
||||
```python
|
||||
CosineStrategy(
|
||||
# Content Filtering
|
||||
semantic_filter: str = None, # Keywords/topic for content filtering
|
||||
word_count_threshold: int = 10, # Minimum words per cluster
|
||||
sim_threshold: float = 0.3, # Similarity threshold (0.0 to 1.0)
|
||||
|
||||
# Clustering Parameters
|
||||
max_dist: float = 0.2, # Maximum distance for clustering
|
||||
linkage_method: str = 'ward', # Clustering linkage method
|
||||
top_k: int = 3, # Number of top categories to extract
|
||||
|
||||
# Model Configuration
|
||||
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
|
||||
|
||||
verbose: bool = False # Enable logging
|
||||
)
|
||||
```
|
||||
|
||||
### Parameter Details
|
||||
|
||||
1. **semantic_filter**
|
||||
- Sets the target topic or content type
|
||||
- Use keywords relevant to your desired content
|
||||
- Example: "technical specifications", "user reviews", "pricing information"
|
||||
|
||||
2. **sim_threshold**
|
||||
- Controls how similar content must be to be grouped together
|
||||
- Higher values (e.g., 0.8) mean stricter matching
|
||||
- Lower values (e.g., 0.3) allow more variation
|
||||
```python
|
||||
# Strict matching
|
||||
strategy = CosineStrategy(sim_threshold=0.8)
|
||||
|
||||
# Loose matching
|
||||
strategy = CosineStrategy(sim_threshold=0.3)
|
||||
```
|
||||
|
||||
3. **word_count_threshold**
|
||||
- Filters out short content blocks
|
||||
- Helps eliminate noise and irrelevant content
|
||||
```python
|
||||
# Only consider substantial paragraphs
|
||||
strategy = CosineStrategy(word_count_threshold=50)
|
||||
```
|
||||
|
||||
4. **top_k**
|
||||
- Number of top content clusters to return
|
||||
- Higher values return more diverse content
|
||||
```python
|
||||
# Get top 5 most relevant content clusters
|
||||
strategy = CosineStrategy(top_k=5)
|
||||
```
|
||||
|
||||
## Use Cases
|
||||
|
||||
### 1. Article Content Extraction
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="main article content",
|
||||
word_count_threshold=100, # Longer blocks for articles
|
||||
top_k=1 # Usually want single main content
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/blog/post",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Product Review Analysis
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="customer reviews and ratings",
|
||||
word_count_threshold=20, # Reviews can be shorter
|
||||
top_k=10, # Get multiple reviews
|
||||
sim_threshold=0.4 # Allow variety in review content
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Technical Documentation
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="technical specifications documentation",
|
||||
word_count_threshold=30,
|
||||
sim_threshold=0.6, # Stricter matching for technical content
|
||||
max_dist=0.3 # Allow related technical sections
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Custom Clustering
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
linkage_method='complete', # Alternative clustering method
|
||||
max_dist=0.4, # Larger clusters
|
||||
model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # Multilingual support
|
||||
)
|
||||
```
|
||||
|
||||
### Content Filtering Pipeline
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="pricing plans features",
|
||||
word_count_threshold=15,
|
||||
sim_threshold=0.5,
|
||||
top_k=3
|
||||
)
|
||||
|
||||
async def extract_pricing_features(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
|
||||
if result.success:
|
||||
content = json.loads(result.extracted_content)
|
||||
return {
|
||||
'pricing_features': content,
|
||||
'clusters': len(content),
|
||||
'similarity_scores': [item['score'] for item in content]
|
||||
}
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Adjust Thresholds Iteratively**
|
||||
- Start with default values
|
||||
- Adjust based on results
|
||||
- Monitor clustering quality
|
||||
|
||||
2. **Choose Appropriate Word Count Thresholds**
|
||||
- Higher for articles (100+)
|
||||
- Lower for reviews/comments (20+)
|
||||
- Medium for product descriptions (50+)
|
||||
|
||||
3. **Optimize Performance**
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
word_count_threshold=10, # Filter early
|
||||
top_k=5, # Limit results
|
||||
verbose=True # Monitor performance
|
||||
)
|
||||
```
|
||||
|
||||
4. **Handle Different Content Types**
|
||||
```python
|
||||
# For mixed content pages
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product features",
|
||||
sim_threshold=0.4, # More flexible matching
|
||||
max_dist=0.3, # Larger clusters
|
||||
top_k=3 # Multiple relevant sections
|
||||
)
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
try:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
|
||||
if result.success:
|
||||
content = json.loads(result.extracted_content)
|
||||
if not content:
|
||||
print("No relevant content found")
|
||||
else:
|
||||
print(f"Extraction failed: {result.error_message}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error during extraction: {str(e)}")
|
||||
```
|
||||
|
||||
The Cosine Strategy is particularly effective when:
|
||||
- Content structure is inconsistent
|
||||
- You need semantic understanding
|
||||
- You want to find similar content blocks
|
||||
- Structure-based extraction (CSS/XPath) isn't reliable
|
||||
|
||||
It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
|
||||
282
docs/md_v2/extraction/css-advanced.md
Normal file
282
docs/md_v2/extraction/css-advanced.md
Normal file
@@ -0,0 +1,282 @@
|
||||
# Advanced Usage of JsonCssExtractionStrategy
|
||||
|
||||
While the basic usage of JsonCssExtractionStrategy is powerful for simple structures, its true potential shines when dealing with complex, nested HTML structures. This section will explore advanced usage scenarios, demonstrating how to extract nested objects, lists, and nested lists.
|
||||
|
||||
## Hypothetical Website Example
|
||||
|
||||
Let's consider a hypothetical e-commerce website that displays product categories, each containing multiple products. Each product has details, reviews, and related items. This complex structure will allow us to demonstrate various advanced features of JsonCssExtractionStrategy.
|
||||
|
||||
Assume the HTML structure looks something like this:
|
||||
|
||||
```html
|
||||
<div class="category">
|
||||
<h2 class="category-name">Electronics</h2>
|
||||
<div class="product">
|
||||
<h3 class="product-name">Smartphone X</h3>
|
||||
<p class="product-price">$999</p>
|
||||
<div class="product-details">
|
||||
<span class="brand">TechCorp</span>
|
||||
<span class="model">X-2000</span>
|
||||
</div>
|
||||
<ul class="product-features">
|
||||
<li>5G capable</li>
|
||||
<li>6.5" OLED screen</li>
|
||||
<li>128GB storage</li>
|
||||
</ul>
|
||||
<div class="product-reviews">
|
||||
<div class="review">
|
||||
<span class="reviewer">John D.</span>
|
||||
<span class="rating">4.5</span>
|
||||
<p class="review-text">Great phone, love the camera!</p>
|
||||
</div>
|
||||
<div class="review">
|
||||
<span class="reviewer">Jane S.</span>
|
||||
<span class="rating">5</span>
|
||||
<p class="review-text">Best smartphone I've ever owned.</p>
|
||||
</div>
|
||||
</div>
|
||||
<ul class="related-products">
|
||||
<li>
|
||||
<span class="related-name">Phone Case</span>
|
||||
<span class="related-price">$29.99</span>
|
||||
</li>
|
||||
<li>
|
||||
<span class="related-name">Screen Protector</span>
|
||||
<span class="related-price">$9.99</span>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<!-- More products... -->
|
||||
</div>
|
||||
```
|
||||
|
||||
Now, let's create a schema to extract this complex structure:
|
||||
|
||||
```python
|
||||
schema = {
|
||||
"name": "E-commerce Product Catalog",
|
||||
"baseSelector": "div.category",
|
||||
"fields": [
|
||||
{
|
||||
"name": "category_name",
|
||||
"selector": "h2.category-name",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "products",
|
||||
"selector": "div.product",
|
||||
"type": "nested_list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "name",
|
||||
"selector": "h3.product-name",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "p.product-price",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "details",
|
||||
"selector": "div.product-details",
|
||||
"type": "nested",
|
||||
"fields": [
|
||||
{
|
||||
"name": "brand",
|
||||
"selector": "span.brand",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "model",
|
||||
"selector": "span.model",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "features",
|
||||
"selector": "ul.product-features li",
|
||||
"type": "list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "feature",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "reviews",
|
||||
"selector": "div.review",
|
||||
"type": "nested_list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "reviewer",
|
||||
"selector": "span.reviewer",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "rating",
|
||||
"selector": "span.rating",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "comment",
|
||||
"selector": "p.review-text",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "related_products",
|
||||
"selector": "ul.related-products li",
|
||||
"type": "list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "name",
|
||||
"selector": "span.related-name",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "span.related-price",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This schema demonstrates several advanced features:
|
||||
|
||||
1. **Nested Objects**: The `details` field is a nested object within each product.
|
||||
2. **Simple Lists**: The `features` field is a simple list of text items.
|
||||
3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
|
||||
4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects.
|
||||
|
||||
Let's break down the key concepts:
|
||||
|
||||
### Nested Objects
|
||||
|
||||
To create a nested object, use `"type": "nested"` and provide a `fields` array for the nested structure:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "details",
|
||||
"selector": "div.product-details",
|
||||
"type": "nested",
|
||||
"fields": [
|
||||
{
|
||||
"name": "brand",
|
||||
"selector": "span.brand",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "model",
|
||||
"selector": "span.model",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Simple Lists
|
||||
|
||||
For a simple list of identical items, use `"type": "list"`:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "features",
|
||||
"selector": "ul.product-features li",
|
||||
"type": "list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "feature",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Nested Lists
|
||||
|
||||
For a list of complex objects, use `"type": "nested_list"`:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "products",
|
||||
"selector": "div.product",
|
||||
"type": "nested_list",
|
||||
"fields": [
|
||||
// ... fields for each product
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Lists of Objects
|
||||
|
||||
Similar to nested lists, but typically used for simpler objects within the list:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "related_products",
|
||||
"selector": "ul.related-products li",
|
||||
"type": "list",
|
||||
"fields": [
|
||||
{
|
||||
"name": "name",
|
||||
"selector": "span.related-name",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "span.related-price",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Using the Advanced Schema
|
||||
|
||||
To use this advanced schema with AsyncWebCrawler:
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def extract_complex_product_data():
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
assert result.success, "Failed to crawl the page"
|
||||
|
||||
product_data = json.loads(result.extracted_content)
|
||||
print(json.dumps(product_data, indent=2))
|
||||
|
||||
asyncio.run(extract_complex_product_data())
|
||||
```
|
||||
|
||||
This will produce a structured JSON output that captures the complex hierarchy of the product catalog, including nested objects, lists, and nested lists.
|
||||
|
||||
## Tips for Advanced Usage
|
||||
|
||||
1. **Start Simple**: Begin with a basic schema and gradually add complexity.
|
||||
2. **Test Incrementally**: Test each part of your schema separately before combining them.
|
||||
3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
|
||||
4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing.
|
||||
5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data (e.g., converting prices to numbers).
|
||||
6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
|
||||
|
||||
By mastering these advanced techniques, you can use JsonCssExtractionStrategy to extract highly structured data from even the most complex web pages, making it a powerful tool for web scraping and data analysis tasks.
|
||||
142
docs/md_v2/extraction/css.md
Normal file
142
docs/md_v2/extraction/css.md
Normal file
@@ -0,0 +1,142 @@
|
||||
# JSON CSS Extraction Strategy with AsyncWebCrawler
|
||||
|
||||
The `JsonCssExtractionStrategy` is a powerful feature of Crawl4AI that allows you to extract structured data from web pages using CSS selectors. This method is particularly useful when you need to extract specific data points from a consistent HTML structure, such as tables or repeated elements. Here's how to use it with the AsyncWebCrawler.
|
||||
|
||||
## Overview
|
||||
|
||||
The `JsonCssExtractionStrategy` works by defining a schema that specifies:
|
||||
1. A base CSS selector for the repeating elements
|
||||
2. Fields to extract from each element, each with its own CSS selector
|
||||
|
||||
This strategy is fast and efficient, as it doesn't rely on external services like LLMs for extraction.
|
||||
|
||||
## Example: Extracting Cryptocurrency Prices from Coinbase
|
||||
|
||||
Let's look at an example that extracts cryptocurrency prices from the Coinbase explore page.
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def extract_structured_data_using_css_extractor():
|
||||
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
|
||||
|
||||
# Define the extraction schema
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
}
|
||||
],
|
||||
}
|
||||
|
||||
# Create the extraction strategy
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
# Use the AsyncWebCrawler with the extraction strategy
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.coinbase.com/explore",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
assert result.success, "Failed to crawl the page"
|
||||
|
||||
# Parse the extracted content
|
||||
crypto_prices = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
|
||||
print(json.dumps(crypto_prices[0], indent=2))
|
||||
|
||||
return crypto_prices
|
||||
|
||||
# Run the async function
|
||||
asyncio.run(extract_structured_data_using_css_extractor())
|
||||
```
|
||||
|
||||
## Explanation of the Schema
|
||||
|
||||
The schema defines how to extract the data:
|
||||
|
||||
- `name`: A descriptive name for the extraction task.
|
||||
- `baseSelector`: The CSS selector for the repeating elements (in this case, table rows).
|
||||
- `fields`: An array of fields to extract from each element:
|
||||
- `name`: The name to give the extracted data.
|
||||
- `selector`: The CSS selector to find the specific data within the base element.
|
||||
- `type`: The type of data to extract (usually "text" for textual content).
|
||||
|
||||
## Advantages of JsonCssExtractionStrategy
|
||||
|
||||
1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
|
||||
2. **Precision**: You can target exactly the elements you need.
|
||||
3. **Structured Output**: The result is already structured as JSON, ready for further processing.
|
||||
4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.
|
||||
|
||||
## Tips for Using JsonCssExtractionStrategy
|
||||
|
||||
1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
|
||||
2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
|
||||
3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
|
||||
4. **Error Handling**: Always check the `result.success` flag and handle potential failures.
|
||||
|
||||
## Advanced Usage: Combining with JavaScript Execution
|
||||
|
||||
For pages that load data dynamically, you can combine the `JsonCssExtractionStrategy` with JavaScript execution:
|
||||
|
||||
```python
|
||||
async def extract_dynamic_structured_data():
|
||||
schema = {
|
||||
"name": "Dynamic Crypto Prices",
|
||||
"baseSelector": ".crypto-row",
|
||||
"fields": [
|
||||
{"name": "name", "selector": ".crypto-name", "type": "text"},
|
||||
{"name": "price", "selector": ".crypto-price", "type": "text"},
|
||||
]
|
||||
}
|
||||
|
||||
js_code = """
|
||||
window.scrollTo(0, document.body.scrollHeight);
|
||||
await new Promise(resolve => setTimeout(resolve, 2000)); // Wait for 2 seconds
|
||||
"""
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/crypto-prices",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_code,
|
||||
wait_for=".crypto-row:nth-child(20)", # Wait for 20 rows to load
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
crypto_data = json.loads(result.extracted_content)
|
||||
print(f"Extracted {len(crypto_data)} cryptocurrency entries")
|
||||
|
||||
asyncio.run(extract_dynamic_structured_data())
|
||||
```
|
||||
|
||||
This advanced example demonstrates how to:
|
||||
1. Execute JavaScript to trigger dynamic content loading.
|
||||
2. Wait for a specific condition (20 rows loaded) before extraction.
|
||||
3. Extract data from the dynamically loaded content.
|
||||
|
||||
By mastering the `JsonCssExtractionStrategy`, you can efficiently extract structured data from a wide variety of web pages, making it a valuable tool in your web scraping toolkit.
|
||||
|
||||
For more details on schema definitions and advanced extraction strategies, check out the[Advanced JsonCssExtraction](./css-advanced.md).
|
||||
185
docs/md_v2/extraction/extraction_strategies.md
Normal file
185
docs/md_v2/extraction/extraction_strategies.md
Normal file
@@ -0,0 +1,185 @@
|
||||
## Extraction Strategies 🧠
|
||||
|
||||
Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into three of the most important strategies: `CosineStrategy`, `LLMExtractionStrategy`, and the new `JsonCssExtractionStrategy`.
|
||||
|
||||
### LLMExtractionStrategy
|
||||
|
||||
`LLMExtractionStrategy` leverages a Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions.
|
||||
|
||||
#### When to Use
|
||||
- Suitable for complex extraction tasks requiring nuanced understanding.
|
||||
- Ideal for scenarios where detailed instructions can guide the extraction process.
|
||||
- Perfect for extracting specific types of information or content with precise guidelines.
|
||||
|
||||
#### Parameters
|
||||
- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
|
||||
- `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
|
||||
- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
|
||||
|
||||
#### Example Without Instructions
|
||||
```python
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Define extraction strategy without instructions
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider='openai',
|
||||
api_token=os.getenv('OPENAI_API_KEY')
|
||||
)
|
||||
|
||||
# Sample URL
|
||||
url = "https://www.nbcnews.com/business"
|
||||
|
||||
# Run the crawler with the extraction strategy
|
||||
result = await crawler.arun(url=url, extraction_strategy=strategy)
|
||||
print(result.extracted_content)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
#### Example With Instructions
|
||||
```python
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Define extraction strategy with instructions
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider='openai',
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="Extract only financial news and summarize key points."
|
||||
)
|
||||
|
||||
# Sample URL
|
||||
url = "https://www.nbcnews.com/business"
|
||||
|
||||
# Run the crawler with the extraction strategy
|
||||
result = await crawler.arun(url=url, extraction_strategy=strategy)
|
||||
print(result.extracted_content)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### JsonCssExtractionStrategy
|
||||
|
||||
`JsonCssExtractionStrategy` is a powerful tool for extracting structured data from HTML using CSS selectors. It allows you to define a schema that maps CSS selectors to specific fields, enabling precise and efficient data extraction.
|
||||
|
||||
#### When to Use
|
||||
- Ideal for extracting structured data from websites with consistent HTML structures.
|
||||
- Perfect for scenarios where you need to extract specific elements or attributes from a webpage.
|
||||
- Suitable for creating datasets from web pages with tabular or list-based information.
|
||||
|
||||
#### Parameters
|
||||
- `schema` (Dict[str, Any]): A dictionary defining the extraction schema, including base selector and field definitions.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Define the extraction schema
|
||||
schema = {
|
||||
"name": "News Articles",
|
||||
"baseSelector": "article.tease-card",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "summary",
|
||||
"selector": "div.tease-card__info",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "link",
|
||||
"selector": "a",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
}
|
||||
],
|
||||
}
|
||||
|
||||
# Create the extraction strategy
|
||||
strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
# Sample URL
|
||||
url = "https://www.nbcnews.com/business"
|
||||
|
||||
# Run the crawler with the extraction strategy
|
||||
result = await crawler.arun(url=url, extraction_strategy=strategy)
|
||||
|
||||
# Parse and print the extracted content
|
||||
extracted_data = json.loads(result.extracted_content)
|
||||
print(json.dumps(extracted_data, indent=2))
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
#### Use Cases for JsonCssExtractionStrategy
|
||||
- Extracting product information from e-commerce websites.
|
||||
- Gathering news articles and their metadata from news portals.
|
||||
- Collecting user reviews and ratings from review websites.
|
||||
- Extracting job listings from job boards.
|
||||
|
||||
By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy`, nuanced, instruction-based extraction with `LLMExtractionStrategy`, or precise structured data extraction with `JsonCssExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️♂️✨
|
||||
|
||||
For more details on schema definitions and advanced extraction strategies, check out the[Advanced JsonCssExtraction](./css-advanced.md).
|
||||
|
||||
|
||||
### CosineStrategy
|
||||
|
||||
`CosineStrategy` uses hierarchical clustering based on cosine similarity to group text chunks into meaningful clusters. This method converts each chunk into its embedding and then clusters them to form semantical chunks.
|
||||
|
||||
#### When to Use
|
||||
- Ideal for fast, accurate semantic segmentation of text.
|
||||
- Perfect for scenarios where LLMs might be overkill or too slow.
|
||||
- Suitable for narrowing down content based on specific queries or keywords.
|
||||
|
||||
#### Parameters
|
||||
- `semantic_filter` (str, optional): Keywords for filtering relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
|
||||
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
|
||||
- `max_dist` (float, optional): Maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
|
||||
- `linkage_method` (str, optional): Linkage method for hierarchical clustering. Default is `'ward'`.
|
||||
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
|
||||
- `model_name` (str, optional): Model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
|
||||
|
||||
#### Example
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import CosineStrategy
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Define extraction strategy
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="finance economy stock market",
|
||||
word_count_threshold=10,
|
||||
max_dist=0.2,
|
||||
linkage_method='ward',
|
||||
top_k=3,
|
||||
model_name='BAAI/bge-small-en-v1.5'
|
||||
)
|
||||
|
||||
# Sample URL
|
||||
url = "https://www.nbcnews.com/business"
|
||||
|
||||
# Run the crawler with the extraction strategy
|
||||
result = await crawler.arun(url=url, extraction_strategy=strategy)
|
||||
print(result.extracted_content)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
179
docs/md_v2/extraction/llm.md
Normal file
179
docs/md_v2/extraction/llm.md
Normal file
@@ -0,0 +1,179 @@
|
||||
# LLM Extraction with AsyncWebCrawler
|
||||
|
||||
Crawl4AI's AsyncWebCrawler allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages asynchronously. Below are two examples demonstrating how to use `LLMExtractionStrategy` for different purposes with the AsyncWebCrawler.
|
||||
|
||||
## Example 1: Extract Structured Data
|
||||
|
||||
In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.
|
||||
|
||||
```python
|
||||
import os
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
|
||||
|
||||
async def extract_openai_fees():
|
||||
url = 'https://openai.com/api/pricing/'
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o", # Or use ollama like provider="ollama/nemotron"
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="From the crawled content, extract all mentioned model names along with their "
|
||||
"fees for input and output tokens. Make sure not to miss anything in the entire content. "
|
||||
'One extracted model JSON format should look like this: '
|
||||
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
model_fees = json.loads(result.extracted_content)
|
||||
print(f"Number of models extracted: {len(model_fees)}")
|
||||
|
||||
with open(".data/openai_fees.json", "w", encoding="utf-8") as f:
|
||||
json.dump(model_fees, f, indent=2)
|
||||
|
||||
asyncio.run(extract_openai_fees())
|
||||
```
|
||||
|
||||
## Example 2: Extract Relevant Content
|
||||
|
||||
In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.
|
||||
|
||||
```python
|
||||
import os
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
async def extract_tech_content():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="Extract only content related to technology"
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
tech_content = json.loads(result.extracted_content)
|
||||
print(f"Number of tech-related items extracted: {len(tech_content)}")
|
||||
|
||||
with open(".data/tech_content.json", "w", encoding="utf-8") as f:
|
||||
json.dump(tech_content, f, indent=2)
|
||||
|
||||
asyncio.run(extract_tech_content())
|
||||
```
|
||||
|
||||
## Advanced Usage: Combining JS Execution with LLM Extraction
|
||||
|
||||
This example demonstrates how to combine JavaScript execution with LLM extraction to handle dynamic content:
|
||||
|
||||
```python
|
||||
async def extract_dynamic_content():
|
||||
js_code = """
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
if (loadMoreButton) {
|
||||
loadMoreButton.click();
|
||||
await new Promise(resolve => setTimeout(resolve, 2000));
|
||||
}
|
||||
"""
|
||||
|
||||
wait_for = """
|
||||
() => {
|
||||
const articles = document.querySelectorAll('article.tease-card');
|
||||
return articles.length > 10;
|
||||
}
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
wait_for=wait_for,
|
||||
css_selector="article.tease-card",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="Summarize each article, focusing on technology-related content"
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
summaries = json.loads(result.extracted_content)
|
||||
print(f"Number of summarized articles: {len(summaries)}")
|
||||
|
||||
with open(".data/tech_summaries.json", "w", encoding="utf-8") as f:
|
||||
json.dump(summaries, f, indent=2)
|
||||
|
||||
asyncio.run(extract_dynamic_content())
|
||||
```
|
||||
|
||||
## Customizing LLM Provider
|
||||
|
||||
Crawl4AI uses the `litellm` library under the hood, which allows you to use any LLM provider you want. Just pass the correct model name and API token:
|
||||
|
||||
```python
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="your_llm_provider/model_name",
|
||||
api_token="your_api_token",
|
||||
instruction="Your extraction instruction"
|
||||
)
|
||||
```
|
||||
|
||||
This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.
|
||||
|
||||
## Error Handling and Retries
|
||||
|
||||
When working with external LLM APIs, it's important to handle potential errors and implement retry logic. Here's an example of how you might do this:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
|
||||
class LLMExtractionError(Exception):
|
||||
pass
|
||||
|
||||
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
|
||||
async def extract_with_retry(crawler, url, extraction_strategy):
|
||||
try:
|
||||
result = await crawler.arun(url=url, extraction_strategy=extraction_strategy, bypass_cache=True)
|
||||
return json.loads(result.extracted_content)
|
||||
except Exception as e:
|
||||
raise LLMExtractionError(f"Failed to extract content: {str(e)}")
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
try:
|
||||
content = await extract_with_retry(
|
||||
crawler,
|
||||
"https://www.example.com",
|
||||
LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="Extract and summarize main points"
|
||||
)
|
||||
)
|
||||
print("Extracted content:", content)
|
||||
except LLMExtractionError as e:
|
||||
print(f"Extraction failed after retries: {e}")
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
This example uses the `tenacity` library to implement a retry mechanism with exponential backoff, which can help handle temporary failures or rate limiting from the LLM API.
|
||||
197
docs/md_v2/extraction/overview.md
Normal file
197
docs/md_v2/extraction/overview.md
Normal file
@@ -0,0 +1,197 @@
|
||||
# Extraction Strategies Overview
|
||||
|
||||
Crawl4AI provides powerful extraction strategies to help you get structured data from web pages. Each strategy is designed for specific use cases and offers different approaches to data extraction.
|
||||
|
||||
## Available Strategies
|
||||
|
||||
### [LLM-Based Extraction](llm.md)
|
||||
|
||||
`LLMExtractionStrategy` uses Language Models to extract structured data from web content. This approach is highly flexible and can understand content semantically.
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
class Product(BaseModel):
|
||||
name: str
|
||||
price: float
|
||||
description: str
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/llama2",
|
||||
schema=Product.schema(),
|
||||
instruction="Extract product details from the page"
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/product",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
**Best for:**
|
||||
- Complex data structures
|
||||
- Content requiring interpretation
|
||||
- Flexible content formats
|
||||
- Natural language processing
|
||||
|
||||
### [CSS-Based Extraction](css.md)
|
||||
|
||||
`JsonCssExtractionStrategy` extracts data using CSS selectors. This is fast, reliable, and perfect for consistently structured pages.
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "Product Listing",
|
||||
"baseSelector": ".product-card",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"},
|
||||
{"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
|
||||
]
|
||||
}
|
||||
|
||||
strategy = JsonCssExtractionStrategy(schema)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/products",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
**Best for:**
|
||||
- E-commerce product listings
|
||||
- News article collections
|
||||
- Structured content pages
|
||||
- High-performance needs
|
||||
|
||||
### [Cosine Strategy](cosine.md)
|
||||
|
||||
`CosineStrategy` uses similarity-based clustering to identify and extract relevant content sections.
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import CosineStrategy
|
||||
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product reviews", # Content focus
|
||||
word_count_threshold=10, # Minimum words per cluster
|
||||
sim_threshold=0.3, # Similarity threshold
|
||||
max_dist=0.2, # Maximum cluster distance
|
||||
top_k=3 # Number of top clusters to extract
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/reviews",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
**Best for:**
|
||||
- Content similarity analysis
|
||||
- Topic clustering
|
||||
- Relevant content extraction
|
||||
- Pattern recognition in text
|
||||
|
||||
## Strategy Selection Guide
|
||||
|
||||
Choose your strategy based on these factors:
|
||||
|
||||
1. **Content Structure**
|
||||
- Well-structured HTML → Use CSS Strategy
|
||||
- Natural language text → Use LLM Strategy
|
||||
- Mixed/Complex content → Use Cosine Strategy
|
||||
|
||||
2. **Performance Requirements**
|
||||
- Fastest: CSS Strategy
|
||||
- Moderate: Cosine Strategy
|
||||
- Variable: LLM Strategy (depends on provider)
|
||||
|
||||
3. **Accuracy Needs**
|
||||
- Highest structure accuracy: CSS Strategy
|
||||
- Best semantic understanding: LLM Strategy
|
||||
- Best content relevance: Cosine Strategy
|
||||
|
||||
## Combining Strategies
|
||||
|
||||
You can combine strategies for more powerful extraction:
|
||||
|
||||
```python
|
||||
# First use CSS strategy for initial structure
|
||||
css_result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=css_strategy
|
||||
)
|
||||
|
||||
# Then use LLM for semantic analysis
|
||||
llm_result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=llm_strategy
|
||||
)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
1. **E-commerce Scraping**
|
||||
```python
|
||||
# CSS Strategy for product listings
|
||||
schema = {
|
||||
"name": "Products",
|
||||
"baseSelector": ".product",
|
||||
"fields": [
|
||||
{"name": "name", "selector": ".title", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
2. **News Article Extraction**
|
||||
```python
|
||||
# LLM Strategy for article content
|
||||
class Article(BaseModel):
|
||||
title: str
|
||||
content: str
|
||||
author: str
|
||||
date: str
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/llama2",
|
||||
schema=Article.schema()
|
||||
)
|
||||
```
|
||||
|
||||
3. **Content Analysis**
|
||||
```python
|
||||
# Cosine Strategy for topic analysis
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="technology trends",
|
||||
top_k=5
|
||||
)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose the Right Strategy**
|
||||
- Start with CSS for structured data
|
||||
- Use LLM for complex interpretation
|
||||
- Try Cosine for content relevance
|
||||
|
||||
2. **Optimize Performance**
|
||||
- Cache LLM results
|
||||
- Keep CSS selectors specific
|
||||
- Tune similarity thresholds
|
||||
|
||||
3. **Handle Errors**
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
print(f"Extraction failed: {result.error_message}")
|
||||
else:
|
||||
data = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
Each strategy has its strengths and optimal use cases. Explore the detailed documentation for each strategy to learn more about their specific features and configurations.
|
||||
Reference in New Issue
Block a user