## Extraction Strategies 🧠 Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into two of the most important strategies: `CosineStrategy` and `LLMExtractionStrategy`. ### CosineStrategy `CosineStrategy` uses hierarchical clustering based on cosine similarity to group text chunks into meaningful clusters. This method converts each chunk into its embedding and then clusters them to form semantical chunks. #### When to Use - Ideal for fast, accurate semantic segmentation of text. - Perfect for scenarios where LLMs might be overkill or too slow. - Suitable for narrowing down content based on specific queries or keywords. #### Parameters - `semantic_filter` (str, optional): Keywords for filtering relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`. - `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`. - `max_dist` (float, optional): Maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`. - `linkage_method` (str, optional): Linkage method for hierarchical clustering. Default is `'ward'`. - `top_k` (int, optional): Number of top categories to extract. Default is `3`. - `model_name` (str, optional): Model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`. #### Example ```python from crawl4ai.extraction_strategy import CosineStrategy from crawl4ai import WebCrawler crawler = WebCrawler() crawler.warmup() # Define extraction strategy strategy = CosineStrategy( semantic_filter="finance economy stock market", word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5' ) # Sample URL url = "https://www.nbcnews.com/business" # Run the crawler with the extraction strategy result = crawler.run(url=url, extraction_strategy=strategy) print(result.extracted_content) ``` ### LLMExtractionStrategy `LLMExtractionStrategy` leverages a Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions. #### When to Use - Suitable for complex extraction tasks requiring nuanced understanding. - Ideal for scenarios where detailed instructions can guide the extraction process. - Perfect for extracting specific types of information or content with precise guidelines. #### Parameters - `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`. - `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`. - `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`. #### Example Without Instructions ```python from crawl4ai.extraction_strategy import LLMExtractionStrategy from crawl4ai import WebCrawler crawler = WebCrawler() crawler.warmup() # Define extraction strategy without instructions strategy = LLMExtractionStrategy( provider='openai', api_token='your_api_token' ) # Sample URL url = "https://www.nbcnews.com/business" # Run the crawler with the extraction strategy result = crawler.run(url=url, extraction_strategy=strategy) print(result.extracted_content) ``` #### Example With Instructions ```python from crawl4ai.extraction_strategy import LLMExtractionStrategy from crawl4ai import WebCrawler crawler = WebCrawler() crawler.warmup() # Define extraction strategy with instructions strategy = LLMExtractionStrategy( provider='openai', api_token='your_api_token', instruction="Extract only financial news and summarize key points." ) # Sample URL url = "https://www.nbcnews.com/business" # Run the crawler with the extraction strategy result = crawler.run(url=url, extraction_strategy=strategy) print(result.extracted_content) ``` #### Use Cases for LLMExtractionStrategy - Extracting specific data types from structured or semi-structured content. - Generating summaries, extracting key information, or transforming content into different formats. - Performing detailed extractions based on custom instructions. For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation. --- By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy` or nuanced, instruction-based extraction with `LLMExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨