# Examples for crawl4ai - `extraction` Component

**Target Document Type:** Examples Collection
**Target Output Filename Suggestion:** `llm_examples_extraction.md`
**Library Version Context:** 0.6.3
**Outline Generation Date:** 2024-05-24

---

This document provides a collection of runnable code examples demonstrating various features and configurations of the `extraction` component of the `crawl4ai` library.

## 1. Introduction to Extraction Strategies

### 1.1. Overview: Purpose of Extraction Strategies in Crawl4ai.

Extraction strategies in Crawl4ai take raw or processed content (such as HTML or Markdown) and extract structured data or specific blocks of information from it. This step transforms web content into a more usable format, often for feeding into Large Language Models (LLMs) or other data-processing pipelines.

### 1.2. Example: Basic `CrawlerRunConfig` Setup with an `extraction_strategy`.

This example shows how to integrate an extraction strategy (here, `NoExtractionStrategy` for simplicity) into the `AsyncWebCrawler` workflow using `CrawlerRunConfig`.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import NoExtractionStrategy

async def basic_config_with_extraction_strategy():
    # Initialize a simple extraction strategy
    no_extraction = NoExtractionStrategy()

    # Configure the crawler run to use this strategy
    run_config = CrawlerRunConfig(extraction_strategy=no_extraction)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="http://example.com", config=run_config)
        if result.success:
            print("Crawl successful.")
            # For NoExtractionStrategy, extracted_content will likely be None or empty
            print(f"Extracted Content: {result.extracted_content}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(basic_config_with_extraction_strategy())
```

---

## 2. `NoExtractionStrategy`: Baseline (No Extraction)

The `NoExtractionStrategy` is a pass-through strategy. It performs no actual data extraction, so `result.extracted_content` will typically be `None` or an empty representation. It is useful as a baseline, or when you only need the raw/cleaned HTML or Markdown.

### 2.1. Example: Using `NoExtractionStrategy` to demonstrate that no structured data is extracted.

#### 2.1.1. Scenario: `AsyncWebCrawler` with `NoExtractionStrategy`.

This example demonstrates how `AsyncWebCrawler` behaves when `NoExtractionStrategy` is employed.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import NoExtractionStrategy

async def no_extraction_with_crawler():
    no_extraction_strat = NoExtractionStrategy()
    run_config = CrawlerRunConfig(extraction_strategy=no_extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="http://example.com", config=run_config)
        if result.success:
            print(f"Crawled URL: {result.url}")
            print(f"Markdown content (first 100 chars): {result.markdown.raw_markdown[:100]}...")
            # Extracted content should be None or an empty representation
            print(f"Extracted Content: {result.extracted_content}")
            assert result.extracted_content is None or len(result.extracted_content) == 0, \
                "Extracted content should be empty with NoExtractionStrategy"
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(no_extraction_with_crawler())
```

#### 2.1.2. Scenario: Direct call to `NoExtractionStrategy.extract()`.

You can also use extraction strategies directly if you already have the content.

```python
from crawl4ai.extraction_strategy import NoExtractionStrategy

def direct_no_extraction():
    strategy = NoExtractionStrategy()
    sample_html = "<p>Some text.</p>"

    # The 'extract' method expects certain parameters, like url, even if this strategy ignores them
    extracted_data = strategy.extract(url="http://dummy.com", html_content=sample_html)
    print(f"Direct call to NoExtractionStrategy.extract() returned: {extracted_data}")

    # NoExtractionStrategy passes the content through: it returns a list with a single
    # block containing the original input, e.g. [{'index': 0, 'content': sample_html}].
    # The key point is that no "structured" extraction happens.
    assert isinstance(extracted_data, list)
    assert len(extracted_data) == 1
    assert extracted_data[0]['content'] == sample_html

if __name__ == "__main__":
    direct_no_extraction()
```

---

## 3. `LLMExtractionStrategy`: LLM-Powered Structured Data Extraction

This is the primary strategy for extracting structured data using Large Language Models (LLMs). It allows you to define schemas (using Pydantic models or dictionaries) or provide natural-language instructions to guide the LLM in extracting the desired information.

*Note: In the following examples, actual LLM calls are often mocked for brevity and to avoid requiring API keys. In a real application, you would configure your LLM provider and API key.*

### 3.1. Core Concepts and Basic Usage

#### 3.1.1. Example: Basic initialization of `LLMExtractionStrategy` with default parameters.

This example shows how to initialize `LLMExtractionStrategy`. By default, it may use OpenAI if `OPENAI_API_KEY` is set. We fall back to a local Ollama setup if no API key is found.

```python
import os

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Basic initialization - defaults to OpenAI if OPENAI_API_KEY is set,
# or you can specify a provider like Ollama.
try:
    # Attempt to use OpenAI if a key is available
    llm_config = LLMConfig(api_token=os.environ.get("OPENAI_API_KEY"))
    if not llm_config.api_token:
        raise ValueError("OpenAI API key not found, using Ollama for example.")
    strategy = LLMExtractionStrategy(llm_config=llm_config)
    print("Initialized LLMExtractionStrategy with default provider (likely OpenAI).")
except Exception as e:
    print(f"OpenAI init failed ({e}), trying Ollama (make sure Ollama is running with a model like 'llama3').")
    try:
        # Fall back to Ollama; ensure it is running with a model like 'llama3'
        ollama_config = LLMConfig(provider="ollama/llama3", api_token="ollama")
        strategy = LLMExtractionStrategy(llm_config=ollama_config)
        print("Initialized LLMExtractionStrategy with Ollama (llama3).")
    except Exception as e_ollama:
        print(f"Ollama init also failed: {e_ollama}")
        print("Please set up an LLM (OpenAI API key or local Ollama) for these examples.")
        strategy = None

if strategy:
    print(f"Strategy initialized. Provider: {strategy.llm_config.provider}")
    # You can now use this 'strategy' object for extraction.
    # For a basic initialization, we won't run an extraction here to keep it simple.
```

#### 3.1.2. Example: Direct usage of `LLMExtractionStrategy.extract()` with simple Markdown content.

This shows how to use the strategy directly with some Markdown text. We'll mock the LLM call.

```python
import json
from unittest.mock import patch, MagicMock

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Mocking the LLM call
mock_llm_response_block = MagicMock()
mock_llm_response_block.choices = [MagicMock()]
mock_llm_response_block.choices[0].message.content = """This is paragraph text from HTML.
Another paragraph."""
mock_llm_response_block.usage = MagicMock(completion_tokens=10, prompt_tokens=40, total_tokens=50)
mock_llm_response_block.usage.completion_tokens_details = {}
mock_llm_response_block.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_block)
def direct_markdown_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="markdown",
        )
    except Exception:
        print("Ollama not available, skipping direct markdown extraction test.")
        return

    sample_markdown = "# A Heading\n\nSome paragraph text from a Markdown document."
    extracted = strategy.extract(url="http://dummy.com/markdown_page", html_content=sample_markdown)
    print(f"Direct extraction (mocked LLM) returned: {extracted}")

if __name__ == "__main__":
    direct_markdown_extraction()
```

#### 3.7.2. Example: Extracting from raw HTML (`input_format="html"`).

```python
import json
from unittest.mock import patch, MagicMock

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Mock the LLM call
mock_llm_raw_html = MagicMock()
mock_llm_raw_html.choices = [MagicMock(message=MagicMock(content=json.dumps({"heading": "Content"})))]
mock_llm_raw_html.usage = MagicMock(completion_tokens=5, prompt_tokens=30, total_tokens=35)
mock_llm_raw_html.usage.completion_tokens_details = {}
mock_llm_raw_html.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_raw_html)
def extract_from_raw_html(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html",
            extraction_type="schema_from_instruction",
            instruction="Extract the main heading from this HTML.",
        )
    except Exception:
        print("Ollama not available, skipping raw HTML extraction test.")
        return

    sample_html = "<h1>Content</h1>"
    extracted_json = strategy.extract(url="http://dummy.com/html_page", html_content=sample_html)
    print("Extraction from Raw HTML (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

if __name__ == "__main__":
    extract_from_raw_html()
```

#### 3.7.3. Example: Extracting from filtered HTML (`input_format="fit_html"`) after a `MarkdownGenerator` with a `ContentFilterStrategy` has run.

This example shows a two-step process: first filtering HTML using `MarkdownGenerator` and a `ContentFilterStrategy`, then feeding its `fit_html` output to `LLMExtractionStrategy`.

```python
import asyncio
import json
from unittest.mock import patch, MagicMock

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator, LLMConfig
from crawl4ai.content_filter_strategy import PruningContentFilter  # Example filter
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Mock for the LLMExtractionStrategy part
mock_llm_fit_html = MagicMock()
mock_llm_fit_html.choices = [MagicMock(message=MagicMock(content=json.dumps({"main_content_summary": "Summary of pruned content."})))]
mock_llm_fit_html.usage = MagicMock(completion_tokens=10, prompt_tokens=50, total_tokens=60)
mock_llm_fit_html.usage.completion_tokens_details = {}
mock_llm_fit_html.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_fit_html)
async def extract_from_fit_html(mock_perform_completion):
    # Step 1: Set up MarkdownGenerator with a content filter to produce fit_html.
    # For this example, we use PruningContentFilter; more advanced filters may need an LLM.
    # We'll use simple mock HTML for this part.
    sample_raw_html = """<html><body>
    <main>
      <p>This is the core content we want to keep.</p>
      <p>Another paragraph of important stuff.</p>
    </main>
    </body></html>"""

    # Simulate the fit_html a content filter would produce from sample_raw_html
    fit_html_content = "<main><p>Filtered content.</p></main>"
    print(f"--- Simulated Fit HTML (for LLM input) ---\n{fit_html_content}\n--------------------------------------")

    # Step 2: Use LLMExtractionStrategy with input_format="fit_html" (or just "html" if it's valid HTML)
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html",  # fit_html is still HTML; use "fit_html" if specific handling is added
            extraction_type="schema_from_instruction",
            instruction="Summarize the main content provided.",
        )
    except Exception:
        print("Ollama not available, skipping fit_html extraction test.")
        return

    extracted_json = strategy.extract(url="http://dummy.com/filtered_page", html_content=fit_html_content)
    print("\nExtraction from Fit HTML (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))
    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_from_fit_html())
```

#### 3.7.4. Example: Extracting from plain text content (`input_format="text"`).

```python
import json
from unittest.mock import patch, MagicMock

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Set up the mock LLM response
mock_llm_text_input = MagicMock()
mock_llm_text_input.choices = [MagicMock(message=MagicMock(content=json.dumps({"sentiment": "positive"})))]
mock_llm_text_input.usage = MagicMock(completion_tokens=3, prompt_tokens=25, total_tokens=28)
mock_llm_text_input.usage.completion_tokens_details = {}
mock_llm_text_input.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_text_input)
def extract_from_plain_text(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="text",
            extraction_type="schema_from_instruction",
            instruction="Determine the sentiment of this text.",
        )
    except Exception:
        print("Ollama not available, skipping plain text input test.")
        return

    sample_text = "Crawl4ai is an amazing library for web scraping and data extraction!"
    # The html_content parameter carries any text-based input, despite its name
    extracted_json = strategy.extract(url="http://dummy.com/text_page", html_content=sample_text)
    print("Extraction from Plain Text (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

if __name__ == "__main__":
    extract_from_plain_text()
```

---

### 3.8. Forcing JSON Response (`force_json_response`)

#### 3.8.1. Example: Using `force_json_response=True` with `extraction_type="schema"` or `"schema_from_instruction"`.

This is particularly useful with LLMs that do not strictly adhere to JSON output, or with providers that support a native JSON mode.

```python
import json
import os
from unittest.mock import patch, MagicMock

from pydantic import BaseModel

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class UserProfile(BaseModel):
    username: str
    email: str

# Mock LLM: simulate it trying to return JSON but maybe with extra text
# if force_json_response was False. With True, it should ensure clean JSON.
mock_llm_force_json = MagicMock()
mock_llm_force_json.choices = [MagicMock()]
# LiteLLM's JSON mode (which force_json_response=True often enables)
# typically ensures the LLM's output is directly the JSON object string.
mock_llm_force_json.choices[0].message.content = json.dumps(
    {"username": "testuser", "email": "test@example.com"}
)
mock_llm_force_json.usage = MagicMock(completion_tokens=15, prompt_tokens=70, total_tokens=85)
mock_llm_force_json.usage.completion_tokens_details = {}
mock_llm_force_json.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_force_json)
def force_json_response_example(mock_perform_completion):
    try:
        # Note: some providers/models have better native JSON mode support;
        # OpenAI models often benefit from this.
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key")),
            schema=UserProfile.model_json_schema(),
            extraction_type="schema",
            force_json_response=True,  # Enable JSON mode
        )
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mocking will proceed, but real behavior might differ.")
    except Exception:
        print("LLM provider not available, skipping force_json_response test.")
        return

    sample_content = "User: testuser, Email: test@example.com"
    extracted_json_string = strategy.extract(url="http://dummy.com/user", html_content=sample_content)
    print("Force JSON Response Example (mocked LLM):")
    if extracted_json_string:
        print(f"Raw output from LLM (should be a clean JSON string): {extracted_json_string}")
        try:
            extracted_data = json.loads(extracted_json_string)
            print("Parsed data:", json.dumps(extracted_data, indent=2))
            UserProfile(**extracted_data)  # Validate
            print("Successfully parsed and validated JSON.")
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON even with force_json_response: {e}")
            print("This might indicate an issue with the LLM's JSON mode or the mock setup.")
    else:
        print("No data extracted.")

    # Check whether 'response_format' was passed through to litellm.
    # This depends on how force_json_response is passed internally;
    # assuming it sets 'response_format': {'type': 'json_object'} in extra_args:
    # mock_perform_completion.assert_called_once()
    # call_kwargs = mock_perform_completion.call_args.kwargs
    # assert call_kwargs.get("extra_args", {}).get("response_format") == {"type": "json_object"}
    # print("LLM call included JSON response format.")

if __name__ == "__main__":
    force_json_response_example()
```

#### 3.8.2. Example: Comparing LLM output with and without `force_json_response=True` to show its effect on non-JSON-compliant LLMs.

This example requires an LLM that sometimes produces non-JSON output, or a more sophisticated mock.

```python
import json
import os
from unittest.mock import patch, MagicMock

from pydantic import BaseModel

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class SimpleData(BaseModel):
    key: str

# Mock 1: LLM returns a non-JSON-compliant string
mock_llm_non_json = MagicMock()
mock_llm_non_json.choices = [MagicMock()]
mock_llm_non_json.choices[0].message.content = "Here is the JSON you asked for: ```json\n{\"key\": \"value_one\"}\n``` Some extra text."
mock_llm_non_json.usage = MagicMock(completion_tokens=30, prompt_tokens=80, total_tokens=110)
mock_llm_non_json.usage.completion_tokens_details = {}
mock_llm_non_json.usage.prompt_tokens_details = {}

# Mock 2: LLM returns clean JSON (as if force_json_response worked)
mock_llm_forced_json = MagicMock()
mock_llm_forced_json.choices = [MagicMock()]
mock_llm_forced_json.choices[0].message.content = json.dumps({"key": "value_one"})
mock_llm_forced_json.usage = MagicMock(completion_tokens=10, prompt_tokens=80, total_tokens=90)
mock_llm_forced_json.usage.completion_tokens_details = {}
mock_llm_forced_json.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff')
def compare_force_json_response(mock_perform_completion):
    sample_content = "The key is value_one."
    schema_def = SimpleData.model_json_schema()

    try:
        # Using a provider that might benefit from force_json_response
        llm_config_for_comparison = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_compare"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set for comparison. Mocking will show the intended difference.")
    except Exception:
        print("LLM provider not available, skipping force_json comparison test.")
        return

    # Case 1: force_json_response = False (default)
    mock_perform_completion.return_value = mock_llm_non_json
    strategy_no_force = LLMExtractionStrategy(
        llm_config=llm_config_for_comparison,
        schema=schema_def,
        extraction_type="schema",
        force_json_response=False,
    )
    print("--- Without force_json_response ---")
    result_no_force_json_str = strategy_no_force.extract("url", sample_content)
    print(f"Raw output: {result_no_force_json_str}")
    try:
        data_no_force = json.loads(result_no_force_json_str)  # Likely fails with this mock
        print(f"Parsed data: {data_no_force}")
    except json.JSONDecodeError as e:
        print(f"Failed to parse as JSON (expected for this mock): {e}")

    # Case 2: force_json_response = True
    mock_perform_completion.return_value = mock_llm_forced_json
    strategy_with_force = LLMExtractionStrategy(
        llm_config=llm_config_for_comparison,
        schema=schema_def,
        extraction_type="schema",
        force_json_response=True,
    )
    print("\n--- With force_json_response = True ---")
    result_with_force_json_str = strategy_with_force.extract("url", sample_content)
    print(f"Raw output: {result_with_force_json_str}")
    try:
        data_with_force = json.loads(result_with_force_json_str)
        SimpleData(**data_with_force)  # Validate
        print(f"Parsed data: {data_with_force} (successfully parsed and validated)")
    except json.JSONDecodeError as e:
        print(f"Failed to parse as JSON (unexpected with a good JSON mode): {e}")

if __name__ == "__main__":
    compare_force_json_response()
```

---
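The brittleness in the first case can be reproduced without any LLM or crawl4ai code at all. Below is a plain-Python sketch (the `parse_llm_json` helper is illustrative, not part of crawl4ai) of the fence-stripping fallback you end up writing when a model wraps its JSON in prose:

```python
import json
import re

def parse_llm_json(reply: str) -> dict:
    """Best-effort JSON recovery from a chatty LLM reply (illustrative helper).

    Tries a direct parse first; if that fails, looks for a fenced
    ```json block, then for the first brace-delimited span.
    """
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    # Look for a ```json ... ``` fenced block
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", reply, re.DOTALL)
    if fence:
        return json.loads(fence.group(1))
    # Fall back to the first brace-delimited span
    brace = re.search(r"\{.*\}", reply, re.DOTALL)
    if brace:
        return json.loads(brace.group(0))
    raise ValueError("No JSON object found in reply")

chatty = 'Here is the JSON you asked for: ```json\n{"key": "value_one"}\n``` Some extra text.'
print(parse_llm_json(chatty))  # {'key': 'value_one'}
```

With `force_json_response=True` (and a provider that supports a native JSON mode), none of this scraping is necessary: the reply is the JSON object itself.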
### 3.9. Verbosity and Logging

#### 3.9.1. Example: Using `verbose=True` to see detailed LLM interaction logs.

Setting `verbose=True` in `LLMExtractionStrategy` enables detailed logging of prompts sent to, and responses received from, the LLM.

```python
import io
import json
import sys
from unittest.mock import patch, MagicMock

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import DefaultLogger

# Mock LLM response
mock_llm_verbose = MagicMock()
mock_llm_verbose.choices = [MagicMock(message=MagicMock(content=json.dumps({"data": "verbose example"})))]
mock_llm_verbose.usage = MagicMock(completion_tokens=5, prompt_tokens=10, total_tokens=15)
mock_llm_verbose.usage.completion_tokens_details = {}
mock_llm_verbose.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_verbose)
def verbose_logging_example(mock_perform_completion):
    # Capture stdout to check for verbose logs
    old_stdout = sys.stdout
    sys.stdout = captured_output = io.StringIO()
    try:
        # Use a simple logger for this example that prints to stdout
        logger = DefaultLogger(verbose=True)
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            verbose=True,   # Enable verbose logging in the strategy
            logger=logger,  # Pass the logger
            extraction_type="schema_from_instruction",
            instruction="Extract something.",
        )
    except Exception:
        # Fallback if the Ollama/logger setup fails in the test environment
        sys.stdout = old_stdout
        print("Ollama/Logger not available, skipping verbose logging test.")
        return

    strategy.extract(url="http://dummy.com/verbose", html_content="Some sample content.")
    sys.stdout = old_stdout  # Restore stdout

    output_log = captured_output.getvalue()
    print("\n--- Captured Verbose Log Output (should contain LLM prompt/response details) ---")
    print(output_log)
    # Check for typical verbose log messages (actual messages might vary)
    assert "LLM Request" in output_log or "Prompt for LLM" in output_log
    assert "LLM Response" in output_log or "Response from LLM" in output_log
    print("\nVerbose logging appeared to work.")

if __name__ == "__main__":
    verbose_logging_example()
```

#### 3.9.2. Example: Providing a custom `logger` instance to `LLMExtractionStrategy`.

You can integrate `LLMExtractionStrategy` with your existing logging setup.

```python
import io
import json
import logging
from unittest.mock import patch, MagicMock

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import DefaultLogger

# Set up a custom Python logger
custom_logger = logging.getLogger("MyCustomExtractorLogger")
custom_logger.setLevel(logging.INFO)
log_capture_string = io.StringIO()
ch = logging.StreamHandler(log_capture_string)
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
custom_logger.addHandler(ch)
custom_logger.propagate = False  # Prevent duplicate logs if the root logger also has a handler

# For LLMExtractionStrategy, wrap the Python logger in a Crawl4ai-compatible logger
class CustomCrawl4aiLogger(DefaultLogger):
    def __init__(self, py_logger, verbose=False):
        super().__init__(verbose=verbose)
        self.py_logger = py_logger

    def _log(self, level_str, message, tag=None, params=None, colors=None):
        # Customize how messages are formatted and routed here
        log_message = f"[{tag or 'C4AI'}] {message}"
        if params:
            log_message = log_message.format(**params)
        if level_str.lower() == "info":
            self.py_logger.info(log_message)
        elif level_str.lower() == "error":
            self.py_logger.error(log_message)
        elif level_str.lower() == "warning":
            self.py_logger.warning(log_message)
        elif self.verbose and level_str.lower() == "debug":  # Only log debug if verbose
            self.py_logger.debug(log_message)

# Mock the LLM call for this example to focus on logging
mock_llm_custom_log = MagicMock()
mock_llm_custom_log.choices = [MagicMock(message=MagicMock(content=json.dumps({"info": "logged"})))]
mock_llm_custom_log.usage = MagicMock(completion_tokens=3, prompt_tokens=10, total_tokens=13)
mock_llm_custom_log.usage.completion_tokens_details = {}
mock_llm_custom_log.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_custom_log)
def custom_logger_example(mock_perform_completion):
    crawl4ai_custom_logger = CustomCrawl4aiLogger(custom_logger, verbose=True)
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            logger=crawl4ai_custom_logger,  # Pass the custom logger instance
            verbose=True,  # Ensure the strategy attempts to log debug messages too
            extraction_type="schema_from_instruction",
            instruction="Log this.",
        )
    except Exception:
        print("Ollama not available, skipping custom logger test.")
        return

    strategy.extract(url="http://dummy.com/custom_log", html_content="Content for custom logger.")

    log_contents = log_capture_string.getvalue()
    print("\n--- Captured Log Output (via custom Python logger) ---")
    print(log_contents)
    assert "MyCustomExtractorLogger" in log_contents  # Our logger's name appears in the output
    assert "[LLM_REQ]" in log_contents or "[LLM_RESP]" in log_contents  # Common strategy tags

if __name__ == "__main__":
    custom_logger_example()
```

---
### 3.10. Practical Extraction Scenarios

These examples use `AsyncWebCrawler` and may require internet access and, potentially, API keys for the LLMs. They are mocked here for consistency in testing, but the setup shows real-world usage.

#### 3.10.1. Example: Extracting product names, prices, and descriptions from an e-commerce page.

```python
import asyncio
import json
import os
from typing import List, Optional
from unittest.mock import patch, MagicMock

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ProductInfo(BaseModel):
    name: str = Field(..., description="The name of the product")
    price: Optional[float] = Field(None, description="The price of the product, as a float")
    description_snippet: Optional[str] = Field(None, description="A short snippet of the product description")

class ProductPageExtract(BaseModel):
    products: List[ProductInfo] = Field(description="List of products found on the page")

# Mock the LLM call
mock_ecommerce_response = MagicMock()
mock_ecommerce_response.choices = [MagicMock()]
mock_ecommerce_response.choices[0].message.content = json.dumps({
    "products": [
        {"name": "Super Widget X1000", "price": 99.99, "description_snippet": "The best widget ever."},
        {"name": "Basic Widget B50", "price": 19.99, "description_snippet": "A simple, reliable widget."}
    ]
})
mock_ecommerce_response.usage = MagicMock(completion_tokens=50, prompt_tokens=300, total_tokens=350)
mock_ecommerce_response.usage.completion_tokens_details = {}
mock_ecommerce_response.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_ecommerce_response)
async def extract_ecommerce_products(mock_perform_completion):
    # This URL is a placeholder; a real e-commerce page would be used.
    # For CI/testing we use example.com, which has no products; the point is the setup.
    ecommerce_url = "http://example.com"
    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_ecommerce"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mock will be used.")
        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=ProductPageExtract.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all product names, their prices, and a short description snippet from the page content.",
        )
    except Exception as e:
        print(f"LLM setup failed for e-commerce example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strat,
        # word_count_threshold=5  # Lower for example.com if testing live
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=ecommerce_url, config=run_config)
        print(f"--- Extraction from an e-commerce-like page ({ecommerce_url}) ---")
        if result.success and result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print(json.dumps(extracted_data, indent=2))
            # Validate with Pydantic
            page_data = ProductPageExtract(**extracted_data)
            for product in page_data.products:
                print(f"Product: {product.name}, Price: {product.price}")
        elif not result.success:
            print(f"Crawl failed: {result.error_message}")
        else:
            print("No structured data extracted or extraction failed.")
    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_ecommerce_products())
```
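Since `result.extracted_content` is a JSON string, the validation step can also run straight from the string: pydantic v2's `model_validate_json` parses and validates in one call, avoiding the intermediate `json.loads`. A standalone sketch (the models are redeclared here, with trimmed fields, so the snippet is self-contained):

```python
from typing import List, Optional

from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: Optional[float] = None
    description_snippet: Optional[str] = None

class ProductPageExtract(BaseModel):
    products: List[ProductInfo]

# A JSON string of the shape the mocked LLM above returns
raw = '{"products": [{"name": "Super Widget X1000", "price": 99.99}]}'

# Parse and validate in one step; raises pydantic.ValidationError on bad shapes
page = ProductPageExtract.model_validate_json(raw)
print(page.products[0].name, page.products[0].price)  # Super Widget X1000 99.99
```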
#### 3.10.2. Example: Extracting article headlines, authors, and publication dates from a news site.

```python
import asyncio
import json
import os
from typing import Optional
from unittest.mock import patch, MagicMock

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class NewsArticle(BaseModel):
    headline: str = Field(..., description="The main headline of the news article")
    author: Optional[str] = Field(None, description="The author(s) of the article")
    publication_date: Optional[str] = Field(None, description="The date the article was published (e.g., YYYY-MM-DD)")

# Mock the LLM call
mock_news_response = MagicMock()
mock_news_response.choices = [MagicMock()]
mock_news_response.choices[0].message.content = json.dumps({
    "headline": "AI Breakthrough Announced",
    "author": "Reporter Bot",
    "publication_date": "2024-05-24"
})
mock_news_response.usage = MagicMock(completion_tokens=30, prompt_tokens=250, total_tokens=280)
mock_news_response.usage.completion_tokens_details = {}
mock_news_response.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_news_response)
async def extract_news_article_details(mock_perform_completion):
    # Using Wikipedia for a stable, public, news-like article structure
    news_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_news"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mock will be used.")
        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=NewsArticle.model_json_schema(),
            extraction_type="schema",
            instruction="From the provided news article content, extract the main headline, the author(s), and the publication date.",
        )
    except Exception as e:
        print(f"LLM setup failed for news example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=news_url, config=run_config)
        print(f"--- Extraction from News Article ({news_url}) ---")
        if result.success and result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print(json.dumps(extracted_data, indent=2))
            article_data = NewsArticle(**extracted_data)
            print(f"Headline: {article_data.headline}")
        elif not result.success:
            print(f"Crawl failed: {result.error_message}")
        else:
            print("No structured data extracted or extraction failed.")
    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_news_article_details())
```

#### 3.10.3. Example: Extracting frequently asked questions (FAQs) and their answers from a support page.

```python
import asyncio
import json
import os
from typing import List
from unittest.mock import patch, MagicMock

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class FAQItem(BaseModel):
    question: str
    answer: str

class FAQPage(BaseModel):
    faqs: List[FAQItem]

# Mock the LLM call
mock_faq_response = MagicMock()
mock_faq_response.choices = [MagicMock()]
mock_faq_response.choices[0].message.content = json.dumps({
    "faqs": [
        {"question": "What is Crawl4ai?", "answer": "An awesome web crawler."},
        {"question": "How to install?", "answer": "pip install crawl4ai"}
    ]
})
mock_faq_response.usage = MagicMock(completion_tokens=60, prompt_tokens=300, total_tokens=360)
mock_faq_response.usage.completion_tokens_details = {}
mock_faq_response.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_faq_response)
async def extract_faqs(mock_perform_completion):
    # Placeholder URL - a real FAQ page would be used
    faq_url = "http://example.com/faq"
    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_faq"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mock will be used.")
        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=FAQPage.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all question and answer pairs from the FAQ section of this page.",
        )
    except Exception as e:
        print(f"LLM setup failed for FAQ example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=faq_url, config=run_config)
        print(f"--- Extraction from FAQ Page ({faq_url}) ---")
        if result.success and result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print(json.dumps(extracted_data, indent=2))
            faq_page_data = FAQPage(**extracted_data)
            for faq_item in faq_page_data.faqs:
                print(f"Q: {faq_item.question}\nA: {faq_item.answer}\n")
        elif not result.success:
            print(f"Crawl failed: {result.error_message}")
        else:
            print("No structured data extracted or extraction failed.")
    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_faqs())
```

#### 3.10.4. Example: Extracting contact information (email, phone, address) from a company's "Contact Us" page.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import Optional
from unittest.mock import patch, MagicMock
import json
import os

class ContactInfo(BaseModel):
    email: Optional[str] = Field(None, description="Company contact email address")
    phone: Optional[str] = Field(None, description="Company contact phone number")
    address: Optional[str] = Field(None, description="Company physical address")

# Mock the LLM call
mock_contact_response = MagicMock()
mock_contact_response.choices = [MagicMock()]
mock_contact_response.choices[0].message.content = json.dumps({
    "email": "support@example.com",
    "phone": "1-800-555-1234",
    "address": "123 Main St, Anytown, USA"
})
mock_contact_response.usage = MagicMock(completion_tokens=40, prompt_tokens=200, total_tokens=240)
mock_contact_response.usage.completion_tokens_details = {}
mock_contact_response.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_contact_response)
async def extract_contact_info(mock_perform_completion):
    contact_url = "http://example.com/contact"
    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_contact"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mock will be used.")
        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=ContactInfo.model_json_schema(),
            extraction_type="schema",
            instruction="Extract the primary email, phone number, and physical address from this contact page."
        )
    except Exception as e:
        print(f"LLM setup failed for contact info example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=contact_url, config=run_config)

        print(f"--- Extraction from Contact Page ({contact_url}) ---")
        if result.success and result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print(json.dumps(extracted_data, indent=2))
            contact_data = ContactInfo(**extracted_data)
            print(f"Email: {contact_data.email}, Phone: {contact_data.phone}")
        elif not result.success:
            print(f"Crawl failed: {result.error_message}")
        else:
            print("No structured data extracted or extraction failed.")
    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_contact_info())
```

#### 3.10.5. Example: Extracting key entities (people, organizations, locations) from a block of text using `extraction_type="block"` and a specific instruction.

This uses "block" extraction but with an instruction to guide the LLM to tag specific entities.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from unittest.mock import patch, MagicMock
import json
import os

# Mock LLM response - for block extraction with entity tagging
mock_entity_response = MagicMock()
mock_entity_response.choices = [MagicMock()]
mock_entity_response.choices[0].message.content = """This is the first important paragraph.
This is the second important paragraph after an ad.