## Extraction Strategies 🧠

Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into two of the most important strategies: `CosineStrategy` and `LLMExtractionStrategy`.

### CosineStrategy

`CosineStrategy` uses hierarchical clustering based on cosine similarity to group text chunks into meaningful clusters. It converts each chunk into an embedding and then clusters the embeddings so that each cluster holds semantically related chunks.
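
To make the clustering idea concrete, here is a minimal, dependency-free sketch. It is not the library's implementation: the helper names (`cosine_distance`, `cluster_by_cosine`) and the toy 2-D vectors are hypothetical, and it uses simple average linkage rather than any particular linkage method from the parameters. Real embeddings would come from an embedding model.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cluster_by_cosine(vectors, max_dist):
    """Hypothetical helper: agglomerative clustering with average linkage.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than max_dist. Returns lists of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy 2-D "embeddings": two vectors point along x, two along y.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
groups = cluster_by_cosine(toy_embeddings, max_dist=0.2)
print(groups)  # [[0, 1], [2, 3]]
```

A smaller `max_dist` keeps clusters tighter and more numerous; a larger one merges more chunks together.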

#### When to Use
- Ideal for fast, accurate semantic segmentation of text.
- Perfect for scenarios where LLMs might be overkill or too slow.
- Suitable for narrowing down content based on specific queries or keywords.

#### Parameters
- `semantic_filter` (str, optional): Keywords for filtering relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
- `max_dist` (float, optional): Maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
- `linkage_method` (str, optional): Linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
- `model_name` (str, optional): Model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
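
As a rough illustration of what `semantic_filter` and `word_count_threshold` describe, the sketch below filters documents by cosine similarity to a keyword string and then drops clusters that are too short. Everything here is an assumption for illustration: the bag-of-words "embedding" stands in for a real embedding model, and `filter_by_keywords`, `drop_short_clusters`, and the `min_similarity` cutoff are hypothetical names, not part of the Crawl4AI API.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_by_keywords(documents, semantic_filter, min_similarity):
    """Hypothetical helper: keep documents similar enough to the keywords."""
    query = bag_of_words(semantic_filter)
    return [
        doc for doc in documents
        if cosine_similarity(bag_of_words(doc), query) >= min_similarity
    ]

def drop_short_clusters(clusters, word_count_threshold):
    """Hypothetical helper: keep clusters with at least the minimum word count."""
    return [
        cluster for cluster in clusters
        if len(" ".join(cluster).split()) >= word_count_threshold
    ]

documents = [
    "The stock market rallied as finance shares led gains today",
    "Local bakery wins award for best sourdough bread in town",
]
relevant = filter_by_keywords(
    documents, "finance economy stock market", min_similarity=0.2
)
print(relevant)  # only the first document shares words with the filter

clusters = [
    ["Stocks rose sharply", "Bond yields fell again today"],  # 8 words
    ["Short one"],                                            # 2 words
]
print(drop_short_clusters(clusters, word_count_threshold=5))
```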

#### Example
```python
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Define extraction strategy
strategy = CosineStrategy(
    semantic_filter="finance economy stock market",
    word_count_threshold=10,
    max_dist=0.2,
    linkage_method='ward',
    top_k=3,
    model_name='BAAI/bge-small-en-v1.5'
)

# Sample URL
url = "https://www.nbcnews.com/business"

# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```

### LLMExtractionStrategy

`LLMExtractionStrategy` uses a large language model (LLM) to extract meaningful content from HTML. It sends the content to an external completion provider and performs the extraction according to the instructions you supply.

#### When to Use
- Suitable for complex extraction tasks requiring nuanced understanding.
- Ideal for scenarios where detailed instructions can guide the extraction process.
- Perfect for extracting specific types of information or content with precise guidelines.

#### Parameters
- `provider` (str, optional): Provider for language model completions (e.g., `openai/gpt-4`). Default is `DEFAULT_PROVIDER`.
- `api_token` (str, optional): API token for the provider. If omitted, the strategy tries to load the token from the `OPENAI_API_KEY` environment variable.
- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
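
The token fallback described for `api_token` can be pictured with a small sketch. This is not the library's code: `resolve_api_token` is a hypothetical helper, and raising an error when no token is found is an assumption made here for clarity.

```python
import os

def resolve_api_token(api_token=None, env_var="OPENAI_API_KEY"):
    """Hypothetical helper: prefer an explicit token, else read env_var."""
    token = api_token or os.environ.get(env_var)
    if not token:
        # Assumption: fail loudly rather than call the provider without a token.
        raise ValueError(f"No API token given and {env_var} is not set")
    return token

# An explicit token always wins over the environment variable.
print(resolve_api_token("explicit-token"))
```

Reading the token from the environment keeps credentials out of source files, which is usually preferable to hardcoding a value in the strategy call.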

#### Example Without Instructions
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Define extraction strategy without instructions
strategy = LLMExtractionStrategy(
    provider='openai/gpt-4',  # provider/model format, as in the parameter description
    api_token='your_api_token'
)

# Sample URL
url = "https://www.nbcnews.com/business"

# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```

#### Example With Instructions
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Define extraction strategy with instructions
strategy = LLMExtractionStrategy(
    provider='openai/gpt-4',  # provider/model format, as in the parameter description
    api_token='your_api_token',
    instruction="Extract only financial news and summarize key points."
)

# Sample URL
url = "https://www.nbcnews.com/business"

# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```

#### Use Cases for LLMExtractionStrategy
- Extracting specific data types from structured or semi-structured content.
- Generating summaries, extracting key information, or transforming content into different formats.
- Performing detailed extractions based on custom instructions.

For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation.

---

By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy` or nuanced, instruction-based extraction with `LLMExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨