crawl4ai/docs/md_v2/assets/llmtxt/crawl4ai_extraction_examples_content.llm.txt

Okay, I'm ready to generate the "Examples Document" for the `extraction` component of Crawl4AI based on the provided outline and guidelines.

Here's the Markdown content:

```markdown
# Examples for crawl4ai - `extraction` Component

**Target Document Type:** Examples Collection
**Target Output Filename Suggestion:** `llm_examples_extraction.md`
**Library Version Context:** 0.6.3
**Outline Generation Date:** 2024-05-24
---

This document provides a collection of runnable code examples demonstrating various features and configurations of the `extraction` component in the `crawl4ai` library.

## 1. Introduction to Extraction Strategies

### 1.1. Overview: Purpose of Extraction Strategies in Crawl4ai.

Extraction strategies in Crawl4ai are responsible for taking raw or processed content (like HTML or Markdown) and extracting structured data or specific blocks of information from it. This is crucial for transforming web content into a more usable format, often for feeding into Large Language Models (LLMs) or other data processing pipelines.

### 1.2. Example: Basic `CrawlerRunConfig` Setup with an `extraction_strategy`.

This example shows how to integrate an extraction strategy (here, `NoExtractionStrategy` for simplicity) into the `AsyncWebCrawler` workflow using `CrawlerRunConfig`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import NoExtractionStrategy

async def basic_config_with_extraction_strategy():
    # Initialize a simple extraction strategy
    no_extraction = NoExtractionStrategy()

    # Configure the crawler run to use this strategy
    run_config = CrawlerRunConfig(
        extraction_strategy=no_extraction
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="http://example.com",
            config=run_config
        )

        if result.success:
            print("Crawl successful.")
            # For NoExtractionStrategy, extracted_content will likely be None or empty
            print(f"Extracted Content: {result.extracted_content}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(basic_config_with_extraction_strategy())
```
---

## 2. `NoExtractionStrategy`: Baseline (No Extraction)

The `NoExtractionStrategy` is a pass-through strategy. It doesn't perform any actual data extraction, meaning `result.extracted_content` will typically be `None` or an empty representation. It's useful as a baseline or when you only need the raw/cleaned HTML or Markdown.

### 2.1. Example: Using `NoExtractionStrategy` to demonstrate no structured data is extracted.

#### 2.1.1. Scenario: `AsyncWebCrawler` with `NoExtractionStrategy`.

This example demonstrates how `AsyncWebCrawler` behaves when `NoExtractionStrategy` is employed.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import NoExtractionStrategy
from crawl4ai.utils import HEADERS

async def no_extraction_with_crawler():
    no_extraction_strat = NoExtractionStrategy()

    # Provide a basic user agent
    browser_config = {"headers": HEADERS}

    run_config = CrawlerRunConfig(
        extraction_strategy=no_extraction_strat
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="http://example.com",
            config=run_config
        )

        if result.success:
            print(f"Crawled URL: {result.url}")
            print(f"Markdown content (first 100 chars): {result.markdown.raw_markdown[:100]}...")
            # Extracted content should be None or an empty representation
            print(f"Extracted Content: {result.extracted_content}")
            assert result.extracted_content is None or len(result.extracted_content) == 0, \
                "Extracted content should be empty with NoExtractionStrategy"
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(no_extraction_with_crawler())
```

#### 2.1.2. Scenario: Direct call to `NoExtractionStrategy.extract()`.

You can also use extraction strategies directly if you have the content.

```python
from crawl4ai.extraction_strategy import NoExtractionStrategy

def direct_no_extraction():
    strategy = NoExtractionStrategy()
    sample_html = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"

    # The 'extract' method might expect certain parameters like url, even if not used by this strategy
    extracted_data = strategy.extract(url="http://dummy.com", html_content=sample_html)

    print(f"Direct call to NoExtractionStrategy.extract() returned: {extracted_data}")
    # Expected: A list containing a dictionary with the original content, or similar passthrough
    # For NoExtractionStrategy, the behavior is to return a list of one block with the original content
    # if it's a simple string input. The actual structure might vary slightly based on internal logic.
    # The key is that no "structured" extraction happens.
    # Based on current implementation, it returns [{'index': 0, 'content': sample_html}]
    assert isinstance(extracted_data, list)
    assert len(extracted_data) == 1
    assert extracted_data[0]['content'] == sample_html


if __name__ == "__main__":
    direct_no_extraction()
```
---

## 3. `LLMExtractionStrategy`: LLM-Powered Structured Data Extraction

This is the primary strategy for extracting structured data using Large Language Models (LLMs). It allows you to define schemas (using Pydantic models or dictionaries) or provide natural language instructions to guide the LLM in extracting the desired information.

*Note: For the following examples, actual LLM calls are often mocked for brevity and to avoid requiring API keys for every example. In a real application, you would configure your LLM provider and API key.*

### 3.1. Core Concepts and Basic Usage

#### 3.1.1. Example: Basic initialization of `LLMExtractionStrategy` with default parameters.
This example shows how to initialize `LLMExtractionStrategy`. By default, it might use OpenAI if `OPENAI_API_KEY` is set. For this example, we'll assume mocking or a local LLM setup if no API key is found.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os

# Basic initialization - defaults to OpenAI if OPENAI_API_KEY is set,
# or you can specify a provider like Ollama.
try:
    # Attempt to use OpenAI if key is available
    llm_config = LLMConfig(api_token=os.environ.get("OPENAI_API_KEY"))
    if not llm_config.api_token:
        raise ValueError("OpenAI API key not found, using Ollama for example.")
    strategy = LLMExtractionStrategy(llm_config=llm_config)
    print("Initialized LLMExtractionStrategy with default provider (likely OpenAI).")
except Exception as e:
    print(f"OpenAI init failed ({e}), trying Ollama (make sure Ollama is running with a model like 'llama3').")
    try:
        # Fallback to Ollama if OpenAI key is not set or fails
        # Ensure Ollama is running and has a model like 'llama3'
        ollama_config = LLMConfig(provider="ollama/llama3", api_token="ollama")
        strategy = LLMExtractionStrategy(llm_config=ollama_config)
        print("Initialized LLMExtractionStrategy with Ollama (llama3).")
    except Exception as e_ollama:
        print(f"Ollama init also failed: {e_ollama}")
        print("Please set up an LLM (OpenAI API key or local Ollama) for these examples.")
        strategy = None

if strategy:
    print(f"Strategy initialized. Provider: {strategy.llm_config.provider}")
    # You can now use this 'strategy' object for extraction.
    # For a basic initialization, we won't run an extraction here to keep it simple.
```

#### 3.1.2. Example: Direct usage of `LLMExtractionStrategy.extract()` with simple Markdown content.
This shows how to use the strategy directly with some Markdown text. We'll mock the LLM call.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mocking the LLM call
mock_llm_response_block = MagicMock()
mock_llm_response_block.choices = [MagicMock()]
mock_llm_response_block.choices[0].message.content = """
<blocks>
  <block>
    <content>This is the main title.</content>
    <tags><tag>title</tag></tags>
  </block>
  <block>
    <content>An introductory paragraph about the topic.</content>
    <tags><tag>introduction</tag></tags>
  </block>
</blocks>
"""
mock_llm_response_block.usage = MagicMock()
mock_llm_response_block.usage.completion_tokens = 20
mock_llm_response_block.usage.prompt_tokens = 50
mock_llm_response_block.usage.total_tokens = 70
mock_llm_response_block.usage.completion_tokens_details = {}
mock_llm_response_block.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_block)
def direct_markdown_extraction(mock_perform_completion):
    # For this example, we assume Ollama is running or an API key is set for another provider
    try:
        strategy = LLMExtractionStrategy(llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"))
    except:
        print("Ollama not available, skipping direct markdown extraction test. Ensure Ollama is running.")
        return

    sample_markdown = """
# Main Title
An introductory paragraph about the topic.
## Subheading
More details here.
"""
    # Default extraction_type is "block"
    extracted_data = strategy.extract(url="http://dummy.com/markdown", html_content=sample_markdown)

    print("Direct Markdown Extraction (mocked LLM):")
    print(json.dumps(extracted_data, indent=2))

    # Verify mock was called
    assert mock_perform_completion.called

if __name__ == "__main__":
    direct_markdown_extraction()
```

#### 3.1.3. Example: Direct usage of `LLMExtractionStrategy.extract()` with simple HTML content (`input_format="html"`).
This example demonstrates processing HTML content by specifying `input_format="html"`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mocking the LLM call (similar to above)
mock_llm_response_html_block = MagicMock()
mock_llm_response_html_block.choices = [MagicMock()]
mock_llm_response_html_block.choices[0].message.content = """
<blocks>
  <block>
    <content>HTML Title</content>
    <tags><tag>h1</tag><tag>title</tag></tags>
  </block>
  <block>
    <content>This is paragraph text from HTML.</content>
    <tags><tag>p</tag><tag>content</tag></tags>
  </block>
</blocks>
"""
mock_llm_response_html_block.usage = MagicMock() # Assuming same usage structure
mock_llm_response_html_block.usage.completion_tokens = 25
mock_llm_response_html_block.usage.prompt_tokens = 60
mock_llm_response_html_block.usage.total_tokens = 85
mock_llm_response_html_block.usage.completion_tokens_details = {}
mock_llm_response_html_block.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_html_block)
def direct_html_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html"  # Specify that the input is HTML
        )
    except:
        print("Ollama not available, skipping direct HTML extraction test.")
        return

    sample_html = "<html><body><h1>HTML Title</h1><p>This is paragraph text from HTML.</p><div><p>Another paragraph.</p></div></body></html>"

    extracted_data = strategy.extract(url="http://dummy.com/html", html_content=sample_html)

    print("Direct HTML Extraction (mocked LLM, input_format='html'):")
    print(json.dumps(extracted_data, indent=2))
    assert mock_perform_completion.called

if __name__ == "__main__":
    direct_html_extraction()
```
---

### 3.2. Schema Definition for Extraction

#### 3.2.1. **Using Pydantic Models for Schema:**

##### 3.2.1.1. Example: Defining a simple Pydantic model and extracting data matching it (`extraction_type="schema"`).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json

class SimpleItem(BaseModel):
    name: str
    description: str

# Mock LLM response to return JSON matching SimpleItem
mock_llm_response_simple_schema = MagicMock()
mock_llm_response_simple_schema.choices = [MagicMock()]
mock_llm_response_simple_schema.choices[0].message.content = json.dumps({
    "name": "My Item",
    "description": "A simple description."
})
mock_llm_response_simple_schema.usage = MagicMock() # Populate usage as needed
mock_llm_response_simple_schema.usage.completion_tokens = 15
mock_llm_response_simple_schema.usage.prompt_tokens = 70
mock_llm_response_simple_schema.usage.total_tokens = 85
mock_llm_response_simple_schema.usage.completion_tokens_details = {}
mock_llm_response_simple_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_simple_schema)
def pydantic_simple_schema_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=SimpleItem.model_json_schema(), # Pass the Pydantic schema
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic simple schema test.")
        return

    sample_content = "The item is called My Item. It has a simple description."
    # For schema extraction, html_content is passed as the context to the LLM
    extracted_json_string = strategy.extract(url="http://dummy.com/item", html_content=sample_content)

    print("Pydantic Simple Schema Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string) # The result is a JSON string
        print(json.dumps(extracted_data, indent=2))
        # Validate with Pydantic model
        item_instance = SimpleItem(**extracted_data)
        print(f"Validated Pydantic instance: {item_instance}")
    else:
        print("No data extracted.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    pydantic_simple_schema_extraction()
```

##### 3.2.1.2. Example: Pydantic model with various field types (str, int, bool, List).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from typing import List
from unittest.mock import patch, MagicMock
import json

class ComplexItem(BaseModel):
    name: str
    count: int
    is_active: bool
    tags: List[str]

# Mock LLM response
mock_llm_response_complex_schema = MagicMock()
mock_llm_response_complex_schema.choices = [MagicMock()]
mock_llm_response_complex_schema.choices[0].message.content = json.dumps({
    "name": "Complex Gadget",
    "count": 10,
    "is_active": True,
    "tags": ["tech", "gadget", "new"]
})
# ... (mock usage as before)
mock_llm_response_complex_schema.usage = MagicMock()
mock_llm_response_complex_schema.usage.completion_tokens = 30; mock_llm_response_complex_schema.usage.prompt_tokens = 100; mock_llm_response_complex_schema.usage.total_tokens = 130
mock_llm_response_complex_schema.usage.completion_tokens_details = {}; mock_llm_response_complex_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_complex_schema)
def pydantic_complex_schema_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=ComplexItem.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic complex schema test.")
        return

    sample_content = "Product: Complex Gadget. Stock: 10 units. Status: Active. Categories: tech, gadget, new."
    extracted_json_string = strategy.extract(url="http://dummy.com/gadget", html_content=sample_content)

    print("Pydantic Complex Schema Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        item_instance = ComplexItem(**extracted_data)
        print(f"Validated Pydantic instance: {item_instance}")
    else:
        print("No data extracted.")

if __name__ == "__main__":
    pydantic_complex_schema_extraction()
```

##### 3.2.1.3. Example: Pydantic model with `Optional` fields.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from typing import Optional
from unittest.mock import patch, MagicMock
import json

class OptionalItem(BaseModel):
    name: str
    description: Optional[str] = None # description is optional
    price: float

# Mock LLM response - sometimes description is present, sometimes not
mock_llm_response_optional_schema_1 = MagicMock()
mock_llm_response_optional_schema_1.choices = [MagicMock()]
mock_llm_response_optional_schema_1.choices[0].message.content = json.dumps({
    "name": "Basic Widget",
    "price": 9.99
    # description is omitted
})
# ... (mock usage)
mock_llm_response_optional_schema_1.usage = MagicMock()
mock_llm_response_optional_schema_1.usage.completion_tokens = 10; mock_llm_response_optional_schema_1.usage.prompt_tokens = 60; mock_llm_response_optional_schema_1.usage.total_tokens = 70
mock_llm_response_optional_schema_1.usage.completion_tokens_details = {}; mock_llm_response_optional_schema_1.usage.prompt_tokens_details = {}


mock_llm_response_optional_schema_2 = MagicMock()
mock_llm_response_optional_schema_2.choices = [MagicMock()]
mock_llm_response_optional_schema_2.choices[0].message.content = json.dumps({
    "name": "Advanced Widget",
    "description": "This one has all the bells and whistles.",
    "price": 29.99
})
# ... (mock usage)
mock_llm_response_optional_schema_2.usage = MagicMock()
mock_llm_response_optional_schema_2.usage.completion_tokens = 20; mock_llm_response_optional_schema_2.usage.prompt_tokens = 70; mock_llm_response_optional_schema_2.usage.total_tokens = 90
mock_llm_response_optional_schema_2.usage.completion_tokens_details = {}; mock_llm_response_optional_schema_2.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff')
def pydantic_optional_schema_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=OptionalItem.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic optional schema test.")
        return

    sample_content_1 = "Item: Basic Widget, Price: $9.99."
    sample_content_2 = "Item: Advanced Widget, Price: $29.99. Description: This one has all the bells and whistles."

    # Test case 1: Description missing
    mock_perform_completion.return_value = mock_llm_response_optional_schema_1
    extracted_json_string_1 = strategy.extract(url="http://dummy.com/widget1", html_content=sample_content_1)
    print("Pydantic Optional Schema (description missing, mocked LLM):")
    if extracted_json_string_1:
        extracted_data_1 = json.loads(extracted_json_string_1)
        print(json.dumps(extracted_data_1, indent=2))
        item_instance_1 = OptionalItem(**extracted_data_1)
        print(f"Validated Pydantic instance 1: {item_instance_1}")
        assert item_instance_1.description is None

    # Test case 2: Description present
    mock_perform_completion.return_value = mock_llm_response_optional_schema_2
    extracted_json_string_2 = strategy.extract(url="http://dummy.com/widget2", html_content=sample_content_2)
    print("\nPydantic Optional Schema (description present, mocked LLM):")
    if extracted_json_string_2:
        extracted_data_2 = json.loads(extracted_json_string_2)
        print(json.dumps(extracted_data_2, indent=2))
        item_instance_2 = OptionalItem(**extracted_data_2)
        print(f"Validated Pydantic instance 2: {item_instance_2}")
        assert item_instance_2.description is not None

if __name__ == "__main__":
    pydantic_optional_schema_extraction()
```

##### 3.2.1.4. Example: Pydantic model with default values for fields.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from typing import Optional
from unittest.mock import patch, MagicMock
import json

class ItemWithDefaults(BaseModel):
    name: str
    status: str = "available" # Default value
    notes: Optional[str] = None

# Mock LLM - status might be omitted by LLM, Pydantic should use default
mock_llm_response_default_schema = MagicMock()
mock_llm_response_default_schema.choices = [MagicMock()]
mock_llm_response_default_schema.choices[0].message.content = json.dumps({
    "name": "Standard Item"
    # status is omitted, notes is omitted
})
# ... (mock usage)
mock_llm_response_default_schema.usage = MagicMock()
mock_llm_response_default_schema.usage.completion_tokens = 5; mock_llm_response_default_schema.usage.prompt_tokens = 50; mock_llm_response_default_schema.usage.total_tokens = 55
mock_llm_response_default_schema.usage.completion_tokens_details = {}; mock_llm_response_default_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_default_schema)
def pydantic_default_value_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=ItemWithDefaults.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic default value test.")
        return

    sample_content = "Product Name: Standard Item. Available for immediate shipping."
    extracted_json_string = strategy.extract(url="http://dummy.com/standard", html_content=sample_content)

    print("Pydantic Default Value Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(f"Raw LLM output (JSON): {json.dumps(extracted_data, indent=2)}")

        # Pydantic applies defaults during model instantiation
        item_instance = ItemWithDefaults(**extracted_data)
        print(f"Validated Pydantic instance: {item_instance}")
        print(f"Instance status (should be default): {item_instance.status}")
        assert item_instance.status == "available"

if __name__ == "__main__":
    pydantic_default_value_extraction()
```

#### 3.2.2. **Using Dictionaries for Schema:**

##### 3.2.2.1. Example: Defining a schema as a Python dictionary and extracting data (`extraction_type="schema"`).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Define schema as a dictionary (JSON Schema format)
product_schema_dict = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string", "description": "Name of the product"},
        "price": {"type": "number", "description": "Price of the product"}
    },
    "required": ["product_name", "price"]
}

# Mock LLM response
mock_llm_response_dict_schema = MagicMock()
mock_llm_response_dict_schema.choices = [MagicMock()]
mock_llm_response_dict_schema.choices[0].message.content = json.dumps({
    "product_name": "Dictionary Product",
    "price": 49.95
})
# ... (mock usage)
mock_llm_response_dict_schema.usage = MagicMock()
mock_llm_response_dict_schema.usage.completion_tokens = 18; mock_llm_response_dict_schema.usage.prompt_tokens = 80; mock_llm_response_dict_schema.usage.total_tokens = 98
mock_llm_response_dict_schema.usage.completion_tokens_details = {}; mock_llm_response_dict_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_dict_schema)
def dictionary_schema_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=product_schema_dict, # Pass the dictionary schema
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping dictionary schema test.")
        return

    sample_content = "Check out the Dictionary Product, only $49.95!"
    extracted_json_string = strategy.extract(url="http://dummy.com/dictprod", html_content=sample_content)

    print("Dictionary Schema Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        assert "product_name" in extracted_data
        assert "price" in extracted_data
    else:
        print("No data extracted.")

if __name__ == "__main__":
    dictionary_schema_extraction()
```

#### 3.2.3. **Nested Schemas:**

##### 3.2.3.1. Example: Using a Pydantic model with nested Pydantic models as fields.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from typing import List
from unittest.mock import patch, MagicMock
import json

class Author(BaseModel):
    name: str
    email: Optional[str] = None

class Article(BaseModel):
    title: str
    author_details: Author # Nested Pydantic model
    tags: List[str]

# Mock LLM response
mock_llm_response_nested_schema = MagicMock()
mock_llm_response_nested_schema.choices = [MagicMock()]
mock_llm_response_nested_schema.choices[0].message.content = json.dumps({
    "title": "The Future of AI",
    "author_details": {"name": "Dr. AI Expert", "email": "ai@example.com"},
    "tags": ["AI", "ML", "Future"]
})
# ... (mock usage)
mock_llm_response_nested_schema.usage = MagicMock()
mock_llm_response_nested_schema.usage.completion_tokens = 40; mock_llm_response_nested_schema.usage.prompt_tokens = 120; mock_llm_response_nested_schema.usage.total_tokens = 160
mock_llm_response_nested_schema.usage.completion_tokens_details = {}; mock_llm_response_nested_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_nested_schema)
def pydantic_nested_schema_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=Article.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic nested schema test.")
        return

    sample_content = "Article: The Future of AI by Dr. AI Expert (ai@example.com). Tags: AI, ML, Future."
    extracted_json_string = strategy.extract(url="http://dummy.com/article", html_content=sample_content)

    print("Pydantic Nested Schema Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        article_instance = Article(**extracted_data)
        print(f"Validated Pydantic instance: {article_instance}")
        assert article_instance.author_details.name == "Dr. AI Expert"
    else:
        print("No data extracted.")

if __name__ == "__main__":
    pydantic_nested_schema_extraction()
```

##### 3.2.3.2. Example: Extracting a list of Pydantic model instances (e.g., list of products).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel, Field
from typing import List
from unittest.mock import patch, MagicMock
import json

class Product(BaseModel):
    name: str
    price: float

class ProductList(BaseModel):
    products: List[Product] = Field(description="A list of products found on the page")


# Mock LLM response
mock_llm_response_list_schema = MagicMock()
mock_llm_response_list_schema.choices = [MagicMock()]
mock_llm_response_list_schema.choices[0].message.content = json.dumps({
    "products": [
        {"name": "Laptop Pro", "price": 1200.00},
        {"name": "Wireless Mouse", "price": 25.00},
        {"name": "Keyboard", "price": 75.00}
    ]
})
# ... (mock usage)
mock_llm_response_list_schema.usage = MagicMock()
mock_llm_response_list_schema.usage.completion_tokens = 50; mock_llm_response_list_schema.usage.prompt_tokens = 150; mock_llm_response_list_schema.usage.total_tokens = 200
mock_llm_response_list_schema.usage.completion_tokens_details = {}; mock_llm_response_list_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_list_schema)
def pydantic_list_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=ProductList.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping Pydantic list extraction test.")
        return

    sample_content = """
    Available products:
    1. Laptop Pro - $1200.00
    2. Wireless Mouse - $25.00
    3. Mechanical Keyboard - $75.00
    """
    extracted_json_string = strategy.extract(url="http://dummy.com/products", html_content=sample_content)

    print("Pydantic List Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        product_list_instance = ProductList(**extracted_data)
        print(f"Validated Pydantic instance: {product_list_instance}")
        assert len(product_list_instance.products) == 3
        assert product_list_instance.products[0].name == "Laptop Pro"
    else:
        print("No data extracted.")

if __name__ == "__main__":
    pydantic_list_extraction()
```

#### 3.2.4. **Dynamic Schema Generation (Advanced):**

##### 3.2.4.1. Example: Programmatically generating a Pydantic model for the schema at runtime.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import create_model, BaseModel
from typing import List, Type
from unittest.mock import patch, MagicMock
import json

# Mock LLM response
mock_llm_response_dynamic_schema = MagicMock()
mock_llm_response_dynamic_schema.choices = [MagicMock()]
mock_llm_response_dynamic_schema.choices[0].message.content = json.dumps({
    "user_id": "user123",
    "username": "john_doe",
    "is_premium_member": True
})
# ... (mock usage)
mock_llm_response_dynamic_schema.usage = MagicMock()
mock_llm_response_dynamic_schema.usage.completion_tokens = 20; mock_llm_response_dynamic_schema.usage.prompt_tokens = 90; mock_llm_response_dynamic_schema.usage.total_tokens = 110
mock_llm_response_dynamic_schema.usage.completion_tokens_details = {}; mock_llm_response_dynamic_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_dynamic_schema)
def dynamic_schema_generation_extraction(mock_perform_completion):
    # Define fields dynamically
    fields_to_extract = {
        "user_id": (str, ...),  # '...' means required
        "username": (str, ...),
        "is_premium_member": (bool, False) # Optional with default
    }

    # Create Pydantic model dynamically
    DynamicUserModel: Type[BaseModel] = create_model(
        'DynamicUserModel',
        **fields_to_extract
    )

    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=DynamicUserModel.model_json_schema(),
            extraction_type="schema"
        )
    except:
        print("Ollama not available, skipping dynamic schema test.")
        return

    sample_content = "User ID: user123, Username: john_doe, Premium: Yes"
    extracted_json_string = strategy.extract(url="http://dummy.com/userprofile", html_content=sample_content)

    print("Dynamic Schema Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        user_instance = DynamicUserModel(**extracted_data)
        print(f"Validated Dynamic Pydantic instance: {user_instance}")
        assert user_instance.username == "john_doe"
    else:
        print("No data extracted.")

if __name__ == "__main__":
    dynamic_schema_generation_extraction()
```
---

### 3.3. Instruction Customization

#### 3.3.1. Example: Using `extraction_type="block"` with a custom `instruction` to guide block extraction (e.g., "Extract main paragraphs about AI").

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM response
mock_llm_response_block_instr = MagicMock()
mock_llm_response_block_instr.choices = [MagicMock()]
mock_llm_response_block_instr.choices[0].message.content = """
<blocks>
  <block>
    <content>Artificial intelligence is rapidly evolving.</content>
    <tags><tag>AI_paragraph</tag></tags>
  </block>
  <block>
    <content>It impacts various industries.</content>
    <tags><tag>AI_paragraph</tag></tags>
  </block>
</blocks>
"""
# ... (mock usage)
mock_llm_response_block_instr.usage = MagicMock()
mock_llm_response_block_instr.usage.completion_tokens = 30; mock_llm_response_block_instr.usage.prompt_tokens = 100; mock_llm_response_block_instr.usage.total_tokens = 130
mock_llm_response_block_instr.usage.completion_tokens_details = {}; mock_llm_response_block_instr.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_block_instr)
def block_extraction_with_instruction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="block",
            instruction="Extract only the main paragraphs discussing Artificial Intelligence. Ignore other sections."
        )
    except:
        print("Ollama not available, skipping block extraction with instruction test.")
        return

    sample_content = """
# The Future of Computing
This is an intro.
## Artificial Intelligence
Artificial intelligence is rapidly evolving. It impacts various industries.
## unrelated section
This is not about AI.
"""
    extracted_data = strategy.extract(url="http://dummy.com/ai_article", html_content=sample_content)

    print("Block Extraction with Custom Instruction (mocked LLM):")
    print(json.dumps(extracted_data, indent=2))
    if extracted_data:
        assert len(extracted_data) == 2
        assert "Artificial intelligence" in extracted_data[0]["content"]

if __name__ == "__main__":
    block_extraction_with_instruction()
```

#### 3.3.2. Example: Using `extraction_type="schema"` with a Pydantic schema and a guiding `instruction` (e.g., "Extract product details according to the schema. Focus on electronics.").

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json

class ProductSchema(BaseModel):
    productName: str
    category: str
    price: Optional[float] = None

# Mock LLM response
mock_llm_response_schema_instr = MagicMock()
mock_llm_response_schema_instr.choices = [MagicMock()]
mock_llm_response_schema_instr.choices[0].message.content = json.dumps({
    "productName": "Smart TV",
    "category": "Electronics",
    "price": 499.99
})
# ... (mock usage)
mock_llm_response_schema_instr.usage = MagicMock()
mock_llm_response_schema_instr.usage.completion_tokens = 25; mock_llm_response_schema_instr.usage.prompt_tokens = 110; mock_llm_response_schema_instr.usage.total_tokens = 135
mock_llm_response_schema_instr.usage.completion_tokens_details = {}; mock_llm_response_schema_instr.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_schema_instr)
def schema_extraction_with_instruction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=ProductSchema.model_json_schema(),
            extraction_type="schema",
            instruction="Extract product details. Prioritize items in the 'Electronics' category if multiple products are mentioned."
        )
    except:
        print("Ollama not available, skipping schema extraction with instruction test.")
        return

    sample_content = "We sell books, clothes, and a Smart TV for $499.99. The Smart TV is an electronic device."
    extracted_json_string = strategy.extract(url="http://dummy.com/store", html_content=sample_content)

    print("Schema Extraction with Instruction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        product_instance = ProductSchema(**extracted_data)
        assert product_instance.category == "Electronics"
    else:
        print("No data extracted.")

if __name__ == "__main__":
    schema_extraction_with_instruction()
```

#### 3.3.3. Example: Using `extraction_type="schema_from_instruction"` where the LLM infers the schema from a detailed `instruction` (e.g., "Extract the title, author, and publication date of the article.").
This powerful feature lets the LLM decide the schema based on your textual request.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM response - LLM invents the schema and fills it
mock_llm_response_infer_schema = MagicMock()
mock_llm_response_infer_schema.choices = [MagicMock()]
mock_llm_response_infer_schema.choices[0].message.content = json.dumps({
    "title": "Adventures in AI",
    "author": "Jane Coder",
    "publication_date": "2024-05-15"
})
# ... (mock usage)
mock_llm_response_infer_schema.usage = MagicMock()
mock_llm_response_infer_schema.usage.completion_tokens = 30; mock_llm_response_infer_schema.usage.prompt_tokens = 90; mock_llm_response_infer_schema.usage.total_tokens = 120
mock_llm_response_infer_schema.usage.completion_tokens_details = {}; mock_llm_response_infer_schema.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_infer_schema)
def schema_from_instruction_extraction(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="schema_from_instruction",
            instruction="Please extract the title, author, and publication date (YYYY-MM-DD) of this news article."
        )
    except:
        print("Ollama not available, skipping schema_from_instruction test.")
        return

    sample_content = """
    # Adventures in AI
    By Jane Coder, Published on May 15, 2024.
    This article explores the latest trends...
    """
    extracted_json_string = strategy.extract(url="http://dummy.com/news_article", html_content=sample_content)

    print("Schema from Instruction Extraction (mocked LLM):")
    if extracted_json_string:
        extracted_data = json.loads(extracted_json_string)
        print(json.dumps(extracted_data, indent=2))
        assert "title" in extracted_data and "author" in extracted_data and "publication_date" in extracted_data
    else:
        print("No data extracted.")

if __name__ == "__main__":
    schema_from_instruction_extraction()
```

#### 3.3.4. Example: Comparing outputs with and without a specific `instruction` when using `extraction_type="schema"`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json

class Event(BaseModel):
    eventName: str
    location: str
    date: str

# Mock LLM responses
mock_llm_no_instr = MagicMock()
mock_llm_no_instr.choices = [MagicMock()]
mock_llm_no_instr.choices[0].message.content = json.dumps({"eventName": "Tech Meetup", "location": "Online", "date": "2024-06-01"})
mock_llm_no_instr.usage = MagicMock(); mock_llm_no_instr.usage.completion_tokens = 20; mock_llm_no_instr.usage.prompt_tokens = 80; mock_llm_no_instr.usage.total_tokens = 100
mock_llm_no_instr.usage.completion_tokens_details = {}; mock_llm_no_instr.usage.prompt_tokens_details = {}


mock_llm_with_instr = MagicMock()
mock_llm_with_instr.choices = [MagicMock()]
mock_llm_with_instr.choices[0].message.content = json.dumps({"eventName": "AI Conference", "location": "San Francisco", "date": "2024-07-20"})
mock_llm_with_instr.usage = MagicMock(); mock_llm_with_instr.usage.completion_tokens = 22; mock_llm_with_instr.usage.prompt_tokens = 95; mock_llm_with_instr.usage.total_tokens = 117
mock_llm_with_instr.usage.completion_tokens_details = {}; mock_llm_with_instr.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff')
def schema_extraction_instruction_comparison(mock_perform_completion):
    try:
        llm_conf = LLMConfig(provider="ollama/llama3", api_token="ollama")
    except:
        print("Ollama not available, skipping schema instruction comparison test.")
        return

    sample_content = """
    Upcoming Events:
    - Tech Meetup, Online, June 1st, 2024
    - AI Conference, San Francisco, July 20th, 2024
    - Local Bake Sale, Town Hall, June 5th, 2024
    """

    # Case 1: No specific instruction
    mock_perform_completion.return_value = mock_llm_no_instr
    strategy_no_instr = LLMExtractionStrategy(
        llm_config=llm_conf,
        schema=Event.model_json_schema(),
        extraction_type="schema"
    )
    result_no_instr_json = strategy_no_instr.extract(url="http://dummy.com/events", html_content=sample_content)
    result_no_instr = json.loads(result_no_instr_json) if result_no_instr_json else {}
    print(f"Without specific instruction: {result_no_instr}")

    # Case 2: With instruction to focus
    mock_perform_completion.return_value = mock_llm_with_instr
    strategy_with_instr = LLMExtractionStrategy(
        llm_config=llm_conf,
        schema=Event.model_json_schema(),
        extraction_type="schema",
        instruction="Focus on extracting details for the 'AI Conference'."
    )
    result_with_instr_json = strategy_with_instr.extract(url="http://dummy.com/events", html_content=sample_content)
    result_with_instr = json.loads(result_with_instr_json) if result_with_instr_json else {}
    print(f"With instruction to focus on AI Conference: {result_with_instr}")

    assert result_no_instr.get("eventName") == "Tech Meetup" # Mock returns first one
    assert result_with_instr.get("eventName") == "AI Conference" # Mock returns AI conf due to instruction

if __name__ == "__main__":
    schema_extraction_instruction_comparison()
```
---

### 3.4. Controlling `extraction_type`

#### 3.4.1. Example: Demonstrating `extraction_type="block"` - output structure (list of blocks with content and tags).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM response for block extraction
mock_llm_block_output = MagicMock()
mock_llm_block_output.choices = [MagicMock()]
mock_llm_block_output.choices[0].message.content = """
<blocks>
  <block>
    <content>First important point.</content>
    <tags><tag>key_takeaway</tag></tags>
  </block>
  <block>
    <content>Second supporting detail.</content>
    <tags><tag>detail</tag><tag>supporting_info</tag></tags>
  </block>
</blocks>
"""
mock_llm_block_output.usage = MagicMock(); mock_llm_block_output.usage.completion_tokens=30; mock_llm_block_output.usage.prompt_tokens=70; mock_llm_block_output.usage.total_tokens=100
mock_llm_block_output.usage.completion_tokens_details = {}; mock_llm_block_output.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_block_output)
def demonstrate_block_extraction_type(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="block" # Explicitly set to block
        )
    except:
        print("Ollama not available, skipping block extraction type demo.")
        return

    sample_content = "Some text with a First important point and then a Second supporting detail."
    extracted_data = strategy.extract(url="http://dummy.com/blocks", html_content=sample_content)

    print("Block Extraction Output Structure (mocked LLM):")
    print(json.dumps(extracted_data, indent=2))

    # Expected output is a list of dictionaries (blocks)
    assert isinstance(extracted_data, list)
    if extracted_data:
        assert "content" in extracted_data[0]
        assert "tags" in extracted_data[0]
        assert isinstance(extracted_data[0]["tags"], list)

if __name__ == "__main__":
    demonstrate_block_extraction_type()
```

#### 3.4.2. Example: Demonstrating `extraction_type="schema"` - output structure (JSON string matching the schema).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json

class MyData(BaseModel):
    field1: str
    field2: int

# Mock LLM response
mock_llm_schema_output = MagicMock()
mock_llm_schema_output.choices = [MagicMock()]
mock_llm_schema_output.choices[0].message.content = json.dumps({"field1": "value1", "field2": 123})
mock_llm_schema_output.usage = MagicMock(); mock_llm_schema_output.usage.completion_tokens=15; mock_llm_schema_output.usage.prompt_tokens=60; mock_llm_schema_output.usage.total_tokens=75
mock_llm_schema_output.usage.completion_tokens_details = {}; mock_llm_schema_output.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_schema_output)
def demonstrate_schema_extraction_type(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            schema=MyData.model_json_schema(),
            extraction_type="schema" # Explicitly set to schema
        )
    except:
        print("Ollama not available, skipping schema extraction type demo.")
        return

    sample_content = "Field one is value1 and field two is 123."
    extracted_json_string = strategy.extract(url="http://dummy.com/schema_data", html_content=sample_content)

    print("Schema Extraction Output Structure (mocked LLM):")
    print(f"Raw JSON string from LLM: {extracted_json_string}")

    # Expected output is a JSON string that can be parsed into the schema
    if extracted_json_string:
        data = json.loads(extracted_json_string)
        print(f"Parsed data: {data}")
        instance = MyData(**data) # Validate with Pydantic
        assert instance.field1 == "value1"

if __name__ == "__main__":
    demonstrate_schema_extraction_type()
```

#### 3.4.3. Example: Demonstrating `extraction_type="schema_from_instruction"` - output structure (JSON string based on inferred schema).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM response - LLM infers schema and provides data
mock_llm_infer_schema_output = MagicMock()
mock_llm_infer_schema_output.choices = [MagicMock()]
mock_llm_infer_schema_output.choices[0].message.content = json.dumps({
    "book_title": "The LLM Handbook",
    "pages": 300
})
mock_llm_infer_schema_output.usage = MagicMock(); mock_llm_infer_schema_output.usage.completion_tokens=20; mock_llm_infer_schema_output.usage.prompt_tokens=70; mock_llm_infer_schema_output.usage.total_tokens=90
mock_llm_infer_schema_output.usage.completion_tokens_details = {}; mock_llm_infer_schema_output.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_infer_schema_output)
def demonstrate_schema_from_instruction_type(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="schema_from_instruction",
            instruction="Extract the book title and number of pages."
        )
    except:
        print("Ollama not available, skipping schema_from_instruction type demo.")
        return

    sample_content = "The book 'The LLM Handbook' contains 300 pages of valuable insights."
    extracted_json_string = strategy.extract(url="http://dummy.com/book_info", html_content=sample_content)

    print("Schema from Instruction Output Structure (mocked LLM):")
    print(f"Raw JSON string from LLM: {extracted_json_string}")

    if extracted_json_string:
        data = json.loads(extracted_json_string)
        print(f"Parsed data: {data}")
        assert "book_title" in data and "pages" in data

if __name__ == "__main__":
    demonstrate_schema_from_instruction_type()
```
---

### 3.5. LLM Configuration (`llm_config`)

#### 3.5.1. Example: Using the default LLM provider and model.
This assumes `OPENAI_API_KEY` is set in the environment, as OpenAI is often a default. If not, it will error or fallback if a global default is set elsewhere (less common for `LLMExtractionStrategy` without explicit config).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig # For explicit configuration if needed
import os

# Note: For this example to run successfully without explicit llm_config,
# the environment variable OPENAI_API_KEY should be set, or another
# global default LLM provider must be configured for LiteLLM.
# If neither is true, LLMExtractionStrategy() might raise an error.

try:
    # This will try to use the default provider (often OpenAI if key is set)
    # or another globally configured default for LiteLLM.
    strategy_default = LLMExtractionStrategy()
    print(f"Successfully initialized LLMExtractionStrategy with default provider: {strategy_default.llm_config.provider}")
    # To make this example runnable without a real API call:
    print("Note: Actual extraction would require a configured LLM and API key.")
except Exception as e:
    print(f"Failed to initialize with default LLM provider: {e}")
    print("This example requires a default LLM (e.g., OPENAI_API_KEY set) or LiteLLM global config.")
    print("Alternatively, provide an explicit LLMConfig to LLMExtractionStrategy.")

# To show an explicit (but still default-targeting) configuration:
# if os.getenv("OPENAI_API_KEY"):
#     llm_config_openai = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY"))
#     strategy_explicit_openai = LLMExtractionStrategy(llm_config=llm_config_openai)
#     print(f"Initialized explicitly with OpenAI: {strategy_explicit_openai.llm_config.provider}")
# else:
#     print("OPENAI_API_KEY not set, cannot show explicit OpenAI example.")
```

#### 3.5.2. Example: Configuring `LLMExtractionStrategy` with a specific OpenAI model via `LLMConfig` (e.g., `openai/gpt-4o-mini`).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os

# Ensure you have your OPENAI_API_KEY set in your environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    print("OPENAI_API_KEY not found in environment. Skipping OpenAI example.")
    print("To run this, set your OPENAI_API_KEY.")
else:
    llm_config_openai = LLMConfig(
        provider="openai/gpt-4o-mini", # Specify the OpenAI model
        api_token=openai_api_key
    )
    strategy_openai = LLMExtractionStrategy(llm_config=llm_config_openai)
    print(f"Initialized LLMExtractionStrategy with OpenAI provider: {strategy_openai.llm_config.provider}")
    # To test, you would call strategy_openai.extract(...)
    # For this example, we'll just show initialization.
    print("Strategy ready to use OpenAI gpt-4o-mini.")
```

#### 3.5.3. Example: Configuring `LLMExtractionStrategy` with a specific Ollama model via `LLMConfig` (e.g., `ollama/llama3`).
This requires Ollama to be running locally and the specified model (e.g., `llama3`) to be pulled.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig

# Assumes Ollama is running locally and 'llama3' model is available
# 'api_token' for Ollama is typically 'ollama' or can be omitted if not needed by your setup.
try:
    llm_config_ollama = LLMConfig(
        provider="ollama/llama3",
        api_token="ollama", # Often 'ollama' or can be None if not required by local setup
        base_url="http://localhost:11434" # Default Ollama API URL
    )
    strategy_ollama = LLMExtractionStrategy(llm_config=llm_config_ollama)
    print(f"Initialized LLMExtractionStrategy with Ollama provider: {strategy_ollama.llm_config.provider}")
    print("Strategy ready to use Ollama with llama3.")
    print("Note: For actual extraction, ensure Ollama server is running and has the 'llama3' model pulled.")
except Exception as e:
    print(f"Failed to initialize Ollama strategy: {e}")
    print("Ensure Ollama is running (e.g., `ollama serve`) and you have pulled the model (e.g., `ollama pull llama3`).")

# Example of a test call (would require Ollama to be active)
# from unittest.mock import patch, MagicMock
# import json
# mock_response = MagicMock()
# mock_response.choices = [MagicMock()]
# mock_response.choices[0].message.content = json.dumps({"info": "extracted by ollama"})
# mock_response.usage = MagicMock(completion_tokens=5, prompt_tokens=10, total_tokens=15)
# mock_response.usage.completion_tokens_details = {}; mock_response.usage.prompt_tokens_details = {}
# @patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_response)
# def _test_ollama_call(mock_call, strategy):
#     if strategy:
#         result = strategy.extract("url", "content", extraction_type="schema_from_instruction", instruction="get info")
#         print(f"Ollama mock call result: {result}")
# _test_ollama_call(None, strategy_ollama if 'strategy_ollama' in locals() else None)
```

#### 3.5.4. Example: Configuring `LLMExtractionStrategy` with a specific Gemini model via `LLMConfig`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os

gemini_api_key = os.getenv("GEMINI_API_KEY")

if not gemini_api_key:
    print("GEMINI_API_KEY not found in environment. Skipping Gemini example.")
    print("To run this, set your GEMINI_API_KEY.")
else:
    llm_config_gemini = LLMConfig(
        provider="gemini/gemini-1.5-pro-latest", # Or another Gemini model
        api_token=gemini_api_key
    )
    strategy_gemini = LLMExtractionStrategy(llm_config=llm_config_gemini)
    print(f"Initialized LLMExtractionStrategy with Gemini provider: {strategy_gemini.llm_config.provider}")
    print("Strategy ready to use Gemini.")
```

#### 3.5.5. Example: Configuring `LLMExtractionStrategy` with a specific Anthropic (Claude) model via `LLMConfig`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os

anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

if not anthropic_api_key:
    print("ANTHROPIC_API_KEY not found in environment. Skipping Anthropic Claude example.")
    print("To run this, set your ANTHROPIC_API_KEY.")
else:
    llm_config_claude = LLMConfig(
        provider="anthropic/claude-3-opus-20240229", # Or another Claude model
        api_token=anthropic_api_key
    )
    strategy_claude = LLMExtractionStrategy(llm_config=llm_config_claude)
    print(f"Initialized LLMExtractionStrategy with Anthropic provider: {strategy_claude.llm_config.provider}")
    print("Strategy ready to use Claude.")

```

#### 3.5.6. Example: Overriding LLM parameters like `temperature` and `max_tokens` using `LLMConfig`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os

openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    print("OPENAI_API_KEY not found. This example shows parameter override with OpenAI.")
else:
    llm_config_custom_params = LLMConfig(
        provider="openai/gpt-3.5-turbo", # Using a common model for this example
        api_token=openai_api_key,
        temperature=0.2,       # Lower temperature for more deterministic output
        max_tokens=150         # Limit the maximum number of tokens in the response
        # You can add other provider-specific parameters here in extra_args if needed
        # extra_args={"top_p": 0.9}
    )

    strategy_custom_params = LLMExtractionStrategy(
        llm_config=llm_config_custom_params,
        instruction="Extract the main point." # A simple instruction
    )

    print(f"Initialized LLMExtractionStrategy with custom LLM parameters for provider: {strategy_custom_params.llm_config.provider}")
    print(f"  Temperature: {strategy_custom_params.llm_config.temperature}")
    print(f"  Max Tokens: {strategy_custom_params.llm_config.max_tokens}")
    # print(f"  Extra Args: {strategy_custom_params.extra_args}") # Note: extra_args on LLMExtractionStrategy, not LLMConfig for this

    # A mock test to show parameters would be passed to perform_completion_with_backoff
    # from unittest.mock import patch, MagicMock
    # mock_response = MagicMock() # ... (setup mock_response) ...
    # @patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_response)
    # def _test_params(mock_call):
    #     strategy_custom_params.extract("url", "content")
    #     called_args = mock_call.call_args
    #     assert called_args is not None
    #     assert called_args.kwargs.get('temperature') == 0.2
    #     assert called_args.kwargs.get('max_tokens') == 150
    #     print("LLM call would have used temperature=0.2 and max_tokens=150.")
    # _test_params()
```

#### 3.5.7. Example: Using `LLMConfig` to specify a custom `base_url` for a self-hosted LLM.
This is useful for local LLMs like Ollama (already shown), vLLM, or other self-hosted OpenAI-compatible endpoints.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig

# Example for a generic OpenAI-compatible API running locally
# The provider name might need to be "openai/custom-model-name" or just "custom/model-name"
# depending on how LiteLLM handles it. For Ollama, it's typically "ollama/model-name".
custom_llm_config = LLMConfig(
    provider="custom/my-local-model", # Or "openai/my-local-model"
    api_token="no_key_needed_for_local", # Or your actual local key
    base_url="http://localhost:8000/v1" # Adjust to your local LLM API endpoint
)

try:
    strategy_custom_endpoint = LLMExtractionStrategy(llm_config=custom_llm_config)
    print(f"Initialized LLMExtractionStrategy with custom endpoint:")
    print(f"  Provider: {strategy_custom_endpoint.llm_config.provider}")
    print(f"  Base URL: {strategy_custom_endpoint.llm_config.base_url}")
    print("Note: This example assumes an OpenAI-compatible API is running at the specified base_url.")
except Exception as e:
    print(f"Failed to initialize custom endpoint strategy: {e}")

# Example for Ollama (already covered more specifically, but fits here too)
ollama_local_config = LLMConfig(
    provider="ollama/mistral", # Assuming mistral model is pulled
    base_url="http://localhost:11434", # Default Ollama
    api_token="ollama" # Usually 'ollama' or None
)
try:
    strategy_ollama_local = LLMExtractionStrategy(llm_config=ollama_local_config)
    print(f"\nInitialized LLMExtractionStrategy with local Ollama endpoint:")
    print(f"  Provider: {strategy_ollama_local.llm_config.provider}")
    print(f"  Base URL: {strategy_ollama_local.llm_config.base_url}")
except Exception as e:
    print(f"Failed to initialize local Ollama strategy via custom base_url example: {e}")
```
---

### 3.6. Chunking Configuration (`apply_chunking` and related parameters)

#### 3.6.1. Example: Default chunking behavior (`apply_chunking=True`).
When content is long, `LLMExtractionStrategy` automatically chunks it.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM to be called multiple times if chunking happens
mock_responses = []
for i in range(3): # Simulate 3 chunks
    mock_resp = MagicMock()
    mock_resp.choices = [MagicMock()]
    mock_resp.choices[0].message.content = json.dumps({"chunk_data": f"Data from chunk {i+1}"})
    mock_resp.usage = MagicMock(completion_tokens=10, prompt_tokens=50, total_tokens=60)
    mock_resp.usage.completion_tokens_details = {}; mock_resp.usage.prompt_tokens_details = {}
    mock_responses.append(mock_resp)

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', side_effect=mock_responses)
def default_chunking_behavior(mock_perform_completion):
    try:
        # Use a small chunk_token_threshold to force chunking for the example
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="schema_from_instruction",
            instruction="Extract data from this chunk.",
            chunk_token_threshold=10, # Very small to ensure chunking
            word_token_rate=0.75, # Approx tokens per word
            apply_chunking=True # Default
        )
    except:
        print("Ollama not available, skipping default chunking test.")
        return

    # Create content long enough to be chunked based on threshold and word_token_rate
    # (10 tokens / 0.75 tokens/word) approx 13 words for threshold.
    # Let's use 50 words.
    long_content = " ".join(["word"] * 50)

    extracted_data_json = strategy.extract(url="http://dummy.com/long_content", html_content=long_content)

    print("Default Chunking Behavior (mocked LLM):")
    # The strategy internally merges results from chunks if schema-based.
    # For "schema_from_instruction", it might return a list of JSON strings or a merged JSON.
    # Current LLMExtractionStrategy for schema type returns a single JSON string, implying merging.
    # If it's block extraction, it would be a list of blocks.
    # Let's assume for schema extraction, it returns a list of dicts before final JSON dump in the example

    # The mock setup implies multiple calls. LLMExtractionStrategy aggregates results.
    # If the LLM returns a list for each chunk, results would be concatenated.
    # If it returns a dict, they'd be in a list.
    # The current mock returns a dict per chunk, so we expect a list of dicts if not merged by strategy.
    # However, the `extract` method's return is a single JSON string if schema-based.
    # This means the strategy handles merging internally or expects LLM to handle it.
    # For this test, we'll check if the LLM was called multiple times.

    print(f"LLM called {mock_perform_completion.call_count} times.")
    assert mock_perform_completion.call_count > 1, "Chunking should have occurred, LLM expected to be called multiple times."
    print(f"Final extracted JSON string: {extracted_data_json}")
    # Final result depends on how LLM merges/formats if it gets multiple chunk results.
    # Assuming the mock's structure (list of chunk_data) would be presented as a list by the LLM.
    if extracted_data_json:
        final_data = json.loads(extracted_data_json)
        print(json.dumps(final_data, indent=2))
        # Depending on LLM's aggregation logic, this might be a list or a single dict.
        # For this example, if our mock returns individual dicts, and the LLM is asked to provide a final JSON,
        # it might wrap them in a list. Let's assume it does.
        # For this particular mock structure and schema_from_instruction, the LLM would typically be
        # instructed to return a list of items if the content implies multiple items.
        # Here, the mock returns individual dicts, and the strategy just returns the last one.
        # To properly test chunk aggregation, a more sophisticated mock or real LLM is needed.
        # For now, verifying multiple calls is the main goal.

if __name__ == "__main__":
    default_chunking_behavior()
```

#### 3.6.2. Example: Disabling chunking (`apply_chunking=False`) for short content.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

mock_llm_no_chunking = MagicMock()
mock_llm_no_chunking.choices = [MagicMock()]
mock_llm_no_chunking.choices[0].message.content = json.dumps({"summary": "This is short content."})
mock_llm_no_chunking.usage = MagicMock(completion_tokens=5, prompt_tokens=20, total_tokens=25)
mock_llm_no_chunking.usage.completion_tokens_details = {}; mock_llm_no_chunking.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_no_chunking)
def disable_chunking_behavior(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            apply_chunking=False, # Explicitly disable chunking
            extraction_type="schema_from_instruction",
            instruction="Summarize this."
        )
    except:
        print("Ollama not available, skipping disable chunking test.")
        return

    # Content is short enough that chunking wouldn't happen anyway, but this forces it.
    short_content = "This is a piece of short content."
    extracted_data_json = strategy.extract(url="http://dummy.com/short", html_content=short_content)

    print("Disabled Chunking Behavior (mocked LLM):")
    print(f"LLM called {mock_perform_completion.call_count} time(s).")
    assert mock_perform_completion.call_count == 1, "Chunking was disabled, LLM should be called once."
    if extracted_data_json:
        print(json.dumps(json.loads(extracted_data_json), indent=2))


if __name__ == "__main__":
    disable_chunking_behavior()
```

#### 3.6.3. Example: Customizing `chunk_token_threshold` for smaller/larger chunks.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock setup
mock_llm_response_chunk_size = MagicMock()
mock_llm_response_chunk_size.choices = [MagicMock()]
mock_llm_response_chunk_size.choices[0].message.content = json.dumps({"info": "some data"})
mock_llm_response_chunk_size.usage = MagicMock(completion_tokens=5, prompt_tokens=10, total_tokens=15)
mock_llm_response_chunk_size.usage.completion_tokens_details = {}; mock_llm_response_chunk_size.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_response_chunk_size)
def customize_chunk_token_threshold(mock_perform_completion):
    long_text = " ".join(["word_for_testing_chunk_size"] * 100) # Approx 100 words

    # Scenario 1: Small threshold, more chunks
    try:
        strategy_small_chunks = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            chunk_token_threshold=20, # Small threshold, e.g., ~26 words
            word_token_rate=0.75,
            extraction_type="schema_from_instruction", instruction="get info"
        )
    except:
        print("Ollama not available, cannot run small chunk test.")
        strategy_small_chunks = None

    if strategy_small_chunks:
        strategy_small_chunks.extract(url="http://dummy.com/small_chunks", html_content=long_text)
        print(f"Small chunk_token_threshold (20): LLM called {mock_perform_completion.call_count} times.")
        small_chunk_calls = mock_perform_completion.call_count
        mock_perform_completion.reset_mock() # Reset for next call
    else:
        small_chunk_calls = 0

    # Scenario 2: Larger threshold, fewer chunks
    try:
        strategy_large_chunks = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            chunk_token_threshold=80, # Larger threshold, e.g., ~106 words
            word_token_rate=0.75,
            extraction_type="schema_from_instruction", instruction="get info"
        )
    except:
        print("Ollama not available, cannot run large chunk test.")
        strategy_large_chunks = None

    if strategy_large_chunks:
        strategy_large_chunks.extract(url="http://dummy.com/large_chunks", html_content=long_text)
        print(f"Large chunk_token_threshold (80): LLM called {mock_perform_completion.call_count} times.")
        large_chunk_calls = mock_perform_completion.call_count
    else:
        large_chunk_calls = 0

    if strategy_small_chunks and strategy_large_chunks:
        assert small_chunk_calls > large_chunk_calls, "Smaller threshold should result in more LLM calls."

if __name__ == "__main__":
    customize_chunk_token_threshold()
```

#### 3.6.4. Example: Customizing `overlap_rate` between chunks.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock setup - we'll primarily observe the number of calls or potentially the prompt content
mock_llm_overlap = MagicMock()
mock_llm_overlap.choices = [MagicMock()]
mock_llm_overlap.choices[0].message.content = json.dumps({"data_point": "value"})
mock_llm_overlap.usage = MagicMock(completion_tokens=5, prompt_tokens=10, total_tokens=15)
mock_llm_overlap.usage.completion_tokens_details = {}; mock_llm_overlap.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.LLMExtractionStrategy._create_llm_query_tasks') # Patching internal method to inspect chunks
def customize_overlap_rate(mock_create_tasks):
    # The _create_llm_query_tasks method is a good place to see the generated chunks.
    # It returns a list of partials. We can inspect the 'content' arg of the partials.

    long_text = "This is a moderately long text to demonstrate the effect of chunk overlap. " * 5
    # word_token_rate (default 0.75) means ~1 token per 1.33 words.
    # chunk_token_threshold (default 2048) is large.
    # Let's use a smaller threshold for demonstration.
    # Threshold 30 tokens => ~40 words.
    # Overlap 0.1 => 3 token overlap => ~4 words.
    # Overlap 0.5 => 15 token overlap => ~20 words.

    # Scenario 1: Small overlap
    try:
        strategy_small_overlap = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            chunk_token_threshold=30,
            overlap_rate=0.1, # 10% overlap
            extraction_type="block" # Block type is easier to see chunk content
        )
    except:
        print("Ollama not available. Skipping overlap rate test.")
        return

    strategy_small_overlap.extract(url="http://dummy.com/overlap1", html_content=long_text)
    chunks_small_overlap = [task.args[1] for task in mock_create_tasks.call_args[0][0]] # (tasks_list, url)
    print(f"Small overlap_rate (0.1) generated {len(chunks_small_overlap)} chunks.")
    if len(chunks_small_overlap) > 1:
        print(f"  Chunk 1 (end): ...{chunks_small_overlap[0][-30:]}")
        print(f"  Chunk 2 (start): {chunks_small_overlap[1][:30]}...")
    mock_create_tasks.reset_mock()

    # Scenario 2: Larger overlap
    strategy_large_overlap = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
        chunk_token_threshold=30,
        overlap_rate=0.5, # 50% overlap
        extraction_type="block"
    )
    strategy_large_overlap.extract(url="http://dummy.com/overlap2", html_content=long_text)
    chunks_large_overlap = [task.args[1] for task in mock_create_tasks.call_args[0][0]]
    print(f"Large overlap_rate (0.5) generated {len(chunks_large_overlap)} chunks.")
    if len(chunks_large_overlap) > 1:
        print(f"  Chunk 1 (end): ...{chunks_large_overlap[0][-30:]}")
        print(f"  Chunk 2 (start): {chunks_large_overlap[1][:30]}...")

    # With more overlap, for the same content and threshold, you might get more chunks,
    # or similar number of chunks but with more redundant content.
    # This example primarily shows the parameter being used.
    # Actual number of chunks can be complex to predict without exact tokenization.

if __name__ == "__main__":
    customize_overlap_rate()
```

#### 3.6.5. Example: Demonstrating the effect of different `word_token_rate` values.
The `word_token_rate` helps estimate token count from word count for chunking.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock

# We'll patch the internal chunking function to see how many chunks are made.
# Or, more simply, observe the number of LLM calls if apply_chunking=True.

mock_llm_wtr = MagicMock() # Generic mock for counting calls
mock_llm_wtr.choices = [MagicMock(message=MagicMock(content="{}"))]
mock_llm_wtr.usage = MagicMock(completion_tokens=1, prompt_tokens=1, total_tokens=2)
mock_llm_wtr.usage.completion_tokens_details = {}; mock_llm_wtr.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_wtr)
def customize_word_token_rate(mock_perform_completion):
    # Approx 50 words.
    # word_token_rate helps estimate token length for chunking.
    # chunk_token_threshold default is large (2048), so let's use a smaller one.
    test_content = "This is a test sentence. It has ten words precisely. " * 5

    try:
        # Scenario 1: Lower word_token_rate (means more words per token, so fewer tokens for same text)
        # Should result in fewer chunks if text length is near threshold.
        strategy_low_wtr = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            chunk_token_threshold=30, # e.g. needs 30 tokens
            word_token_rate=0.5, # Estimates 0.5 tokens per word (i.e., 2 words/token)
                                 # So 50 words -> ~25 tokens. Should be 1 chunk.
            extraction_type="block"
        )
        strategy_low_wtr.extract("url", test_content)
        calls_low_wtr = mock_perform_completion.call_count
        print(f"word_token_rate=0.5 (estimates fewer tokens): LLM calls = {calls_low_wtr}")
        mock_perform_completion.reset_mock()

        # Scenario 2: Higher word_token_rate (means fewer words per token, so more tokens for same text)
        # Should result in more chunks if text length is near threshold.
        strategy_high_wtr = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            chunk_token_threshold=30,
            word_token_rate=1.0, # Estimates 1 token per word
                                 # So 50 words -> ~50 tokens. Should be >1 chunk.
            extraction_type="block"
        )
        strategy_high_wtr.extract("url", test_content)
        calls_high_wtr = mock_perform_completion.call_count
        print(f"word_token_rate=1.0 (estimates more tokens): LLM calls = {calls_high_wtr}")

        assert calls_high_wtr >= calls_low_wtr, \
            "Higher word_token_rate should lead to more or equal chunks for the same content and token threshold."

    except Exception as e:
        print(f"Ollama not available or other error, skipping word_token_rate test: {e}")


if __name__ == "__main__":
    customize_word_token_rate()
```

#### 3.6.6. Example: Processing very large content that requires multiple chunks.
This is similar to 3.6.1, emphasizing that the system handles large inputs by breaking them down.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Simulate multiple LLM calls for multiple chunks
mock_responses_large_content = []
for i in range(5): # Expecting around 5 chunks for this example
    mock_resp = MagicMock()
    mock_resp.choices = [MagicMock()]
    # Simulate LLM returning structured data for each chunk
    mock_resp.choices[0].message.content = json.dumps({"document_part": f"Content from part {i+1} of the large document."})
    mock_resp.usage = MagicMock(completion_tokens=15, prompt_tokens=100, total_tokens=115) # Example usage
    mock_resp.usage.completion_tokens_details = {}; mock_resp.usage.prompt_tokens_details = {}
    mock_responses_large_content.append(mock_resp)


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', side_effect=mock_responses_large_content)
def process_very_large_content(mock_perform_completion):
    # Content designed to be split into several chunks
    # Assuming chunk_token_threshold=50 and word_token_rate=0.75 (~66 words/chunk)
    # This content has 300 words. 300 / (50/0.75) = 300 / 66.6 = ~4.5 chunks
    very_large_content = ("This is a segment of a very large document that needs to be processed. " * 60) # 300 words

    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            extraction_type="schema_from_instruction",
            instruction="Extract key information from this segment of the document.",
            chunk_token_threshold=50, # Smaller threshold to ensure multiple chunks
            word_token_rate=0.75,      # Estimate tokens per word
            apply_chunking=True
        )
    except:
        print("Ollama not available, skipping large content chunking test.")
        return

    print("Processing very large content (mocked LLM calls per chunk)...")
    # The extract method will handle the chunking and aggregation if extraction_type is schema-based.
    # The final extracted_data_json should ideally be a merged/structured result.
    # For this mock, the LLM is assumed to return a list of objects if instruction implies it.
    # Here, we'll get the last chunk's result if schema_from_instruction unless LLM aggregates.
    # To truly show aggregation, a more complex mocking or real LLM is needed.
    # Focus here is on the multiple calls due to chunking.

    extracted_data_json_string = strategy.extract(url="http://dummy.com/very_large_doc", html_content=very_large_content)

    print(f"LLM was called {mock_perform_completion.call_count} times due to chunking.")
    assert mock_perform_completion.call_count > 1, "Expected multiple LLM calls for large content."

    print("Final Extracted Data (structure depends on LLM's handling of chunked results):")
    if extracted_data_json_string:
        print(json.dumps(json.loads(extracted_data_json_string), indent=2))
    else:
        print("No data extracted or an error occurred.")

if __name__ == "__main__":
    process_very_large_content()
```
---

### 3.7. Input Format Selection (`input_format`)

#### 3.7.1. Example: Extracting from Markdown content (`input_format="markdown"`, default).
This is the default behavior if `input_format` is not specified.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

mock_llm_md_input = MagicMock() # Setup mock as before
mock_llm_md_input.choices = [MagicMock(message=MagicMock(content=json.dumps({"title": "Markdown Test"})))]
mock_llm_md_input.usage = MagicMock(completion_tokens=5, prompt_tokens=30, total_tokens=35)
mock_llm_md_input.usage.completion_tokens_details = {}; mock_llm_md_input.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_md_input)
def extract_from_markdown(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="markdown", # Explicitly set, though it's the default
            extraction_type="schema_from_instruction",
            instruction="Extract the main title."
        )
    except:
        print("Ollama not available, skipping markdown input test.")
        return

    sample_markdown = "# Markdown Test\nThis is some **bold** text."
    extracted_json = strategy.extract(url="http://dummy.com/md_page", html_content=sample_markdown)

    print("Extraction from Markdown (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

if __name__ == "__main__":
    extract_from_markdown()
```

#### 3.7.2. Example: Extracting directly from raw HTML (`input_format="html"`).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

mock_llm_html_input = MagicMock() # Setup mock
mock_llm_html_input.choices = [MagicMock(message=MagicMock(content=json.dumps({"page_heading": "HTML Document"})))]
mock_llm_html_input.usage = MagicMock(completion_tokens=6, prompt_tokens=40, total_tokens=46)
mock_llm_html_input.usage.completion_tokens_details = {}; mock_llm_html_input.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_html_input)
def extract_from_raw_html(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html",
            extraction_type="schema_from_instruction",
            instruction="Extract the main heading (h1)."
        )
    except:
        print("Ollama not available, skipping raw HTML input test.")
        return

    sample_html = "<html><head><title>Test</title></head><body><h1>HTML Document</h1><p>Content</p></body></html>"
    extracted_json = strategy.extract(url="http://dummy.com/html_page", html_content=sample_html)

    print("Extraction from Raw HTML (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

if __name__ == "__main__":
    extract_from_raw_html()
```

#### 3.7.3. Example: Extracting from filtered HTML (`input_format="fit_html"`) after `MarkdownGenerator` with a `ContentFilterStrategy` has run.
This example shows a two-step process: first filtering HTML using `MarkdownGenerator` and a `ContentFilterStrategy`, then feeding its `fit_html` output to `LLMExtractionStrategy`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter # Example filter
from unittest.mock import patch, MagicMock
import json

# Mock for the LLMExtractionStrategy part
mock_llm_fit_html = MagicMock()
mock_llm_fit_html.choices = [MagicMock(message=MagicMock(content=json.dumps({"main_content_summary": "Summary of pruned content."})))]
mock_llm_fit_html.usage = MagicMock(completion_tokens=10, prompt_tokens=50, total_tokens=60)
mock_llm_fit_html.usage.completion_tokens_details = {}; mock_llm_fit_html.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_fit_html)
async def extract_from_fit_html(mock_perform_completion):
    # Step 1: Setup MarkdownGenerator with a content filter to produce fit_html
    # For this example, we'll use PruningContentFilter.
    # In a real scenario, you might need an LLM for more advanced filters.
    # We'll use a simple mock HTML for this part.

    sample_raw_html = """
    <html><body>
        <header>Site Navigation</header>
        <nav>Links...</nav>
        <main>
            <h1>Main Article Title</h1>
            <p>This is the core content we want to keep.</p>
            <p>Another paragraph of important stuff.</p>
        </main>
        <aside>Related links</aside>
        <footer>Copyright info</footer>
    </body></html>
    """

    # Simulate getting fit_html (normally from crawler.arun() and result.markdown.fit_html)
    # Here we manually instantiate and run the filter's logic conceptually
    # Note: MarkdownGenerator itself creates fit_html when a filter is active
    # For simplicity, let's assume PruningContentFilter directly gives us usable HTML for LLM

    # A more accurate simulation would involve creating a MarkdownGenerator
    # and getting its `fit_html`. PruningContentFilter directly manipulates soup.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(sample_raw_html, "lxml")
    pruning_filter = PruningContentFilter()
    # PruningContentFilter.filter_content modifies soup in-place and returns string list
    # We will simulate its effect by just taking the main content for this test
    main_content_element = soup.find("main")
    fit_html_content = str(main_content_element) if main_content_element else "<p>Filtered content.</p>"

    print(f"--- Simulated Fit HTML (for LLM input) ---\n{fit_html_content}\n--------------------------------------")

    # Step 2: Use LLMExtractionStrategy with input_format="fit_html" (or just "html" if it's valid HTML)
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html", # fit_html is still HTML, or use "fit_html" if specific handling is added
            extraction_type="schema_from_instruction",
            instruction="Summarize the main content provided."
        )
    except:
        print("Ollama not available, skipping fit_html extraction test.")
        return

    extracted_json = strategy.extract(url="http://dummy.com/filtered_page", html_content=fit_html_content)

    print("\nExtraction from Fit HTML (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_from_fit_html())
```

#### 3.7.4. Example: Extracting from plain text content (`input_format="text"`).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

mock_llm_text_input = MagicMock() # Setup mock
mock_llm_text_input.choices = [MagicMock(message=MagicMock(content=json.dumps({"sentiment": "positive"})))]
mock_llm_text_input.usage = MagicMock(completion_tokens=3, prompt_tokens=25, total_tokens=28)
mock_llm_text_input.usage.completion_tokens_details = {}; mock_llm_text_input.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_text_input)
def extract_from_plain_text(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="text",
            extraction_type="schema_from_instruction",
            instruction="Determine the sentiment of this text."
        )
    except:
        print("Ollama not available, skipping plain text input test.")
        return

    sample_text = "Crawl4ai is an amazing library for web scraping and data extraction!"
    extracted_json = strategy.extract(url="http://dummy.com/text_page", html_content=sample_text)
    # html_content parameter is used for any text-based input, despite its name

    print("Extraction from Plain Text (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

if __name__ == "__main__":
    extract_from_plain_text()
```
---

### 3.8. Forcing JSON Response (`force_json_response`)

#### 3.8.1. Example: Using `force_json_response=True` with `extraction_type="schema"` or `"schema_from_instruction"`.
This is particularly useful with LLMs that might not strictly adhere to JSON output, or when using providers that support JSON mode.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json

class UserProfile(BaseModel):
    username: str
    email: str

# Mock LLM: simulate it trying to return JSON but maybe with extra text
# if force_json_response was False. With True, it should ensure clean JSON.
mock_llm_force_json = MagicMock()
mock_llm_force_json.choices = [MagicMock()]
# LiteLLM's JSON mode (which force_json_response=True often enables)
# typically ensures the LLM's output is directly the JSON object string.
mock_llm_force_json.choices[0].message.content = json.dumps(
    {"username": "testuser", "email": "test@example.com"}
)
mock_llm_force_json.usage = MagicMock(completion_tokens=15, prompt_tokens=70, total_tokens=85)
mock_llm_force_json.usage.completion_tokens_details = {}; mock_llm_force_json.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_force_json)
def force_json_response_example(mock_perform_completion):
    try:
        # Note: Some providers/models have better native JSON mode support.
        # OpenAI models often benefit from this.
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY","mock_key")), # Using OpenAI example
            schema=UserProfile.model_json_schema(),
            extraction_type="schema",
            force_json_response=True # Enable JSON mode
        )
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set. Mocking will proceed, but real behavior might differ.")

    except:
        print("LLM provider not available, skipping force_json_response test.")
        return

    sample_content = "User: testuser, Email: test@example.com"
    extracted_json_string = strategy.extract(url="http://dummy.com/user", html_content=sample_content)

    print("Force JSON Response Example (mocked LLM):")
    if extracted_json_string:
        print(f"Raw output from LLM (should be clean JSON string): {extracted_json_string}")
        try:
            extracted_data = json.loads(extracted_json_string)
            print("Parsed data:", json.dumps(extracted_data, indent=2))
            UserProfile(**extracted_data) # Validate
            print("Successfully parsed and validated JSON.")
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON even with force_json_response: {e}")
            print("This might indicate an issue with the LLM's JSON mode or the mock setup.")
    else:
        print("No data extracted.")

    # Check if the 'response_format' was passed to litellm
    # This depends on the internal implementation detail of how force_json_response is passed.
    # Assuming it sets 'response_format': {'type': 'json_object'} in extra_args for litellm.
    # mock_perform_completion.assert_called_once()
    # call_kwargs = mock_perform_completion.call_args.kwargs
    # assert call_kwargs.get("extra_args", {}).get("response_format") == {"type": "json_object"}
    # print("LLM call included JSON response format.")


if __name__ == "__main__":
    force_json_response_example()
```

#### 3.8.2. Example: Comparing LLM output with and without `force_json_response=True` to show its effect on non-JSON-compliant LLMs.
This example requires an LLM that is known to sometimes produce non-JSON output or a more sophisticated mock.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json
import os

class SimpleData(BaseModel):
    key: str

# Mock 1: LLM returns non-JSON compliant string
mock_llm_non_json = MagicMock()
mock_llm_non_json.choices = [MagicMock()]
mock_llm_non_json.choices[0].message.content = "Here is the JSON you asked for: ```json\n{\"key\": \"value_one\"}\n``` Some extra text."
mock_llm_non_json.usage = MagicMock(completion_tokens=30, prompt_tokens=80, total_tokens=110)
mock_llm_non_json.usage.completion_tokens_details = {}; mock_llm_non_json.usage.prompt_tokens_details = {}


# Mock 2: LLM returns clean JSON (as if force_json_response worked)
mock_llm_forced_json = MagicMock()
mock_llm_forced_json.choices = [MagicMock()]
mock_llm_forced_json.choices[0].message.content = json.dumps({"key": "value_one"})
mock_llm_forced_json.usage = MagicMock(completion_tokens=10, prompt_tokens=80, total_tokens=90)
mock_llm_forced_json.usage.completion_tokens_details = {}; mock_llm_forced_json.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff')
def compare_force_json_response(mock_perform_completion):
    sample_content = "The key is value_one."
    schema_def = SimpleData.model_json_schema()

    try:
        # Using a provider that might benefit from force_json_response
        llm_config_for_comparison = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY","mock_key_compare"))
        if not os.getenv("OPENAI_API_KEY"):
            print("Warning: OPENAI_API_KEY not set for comparison. Mocking will show intended difference.")
    except:
        print("LLM provider not available, skipping force_json comparison test.")
        return

    # Case 1: force_json_response = False (default)
    mock_perform_completion.return_value = mock_llm_non_json
    strategy_no_force = LLMExtractionStrategy(
        llm_config=llm_config_for_comparison,
        schema=schema_def,
        extraction_type="schema",
        force_json_response=False
    )
    print("--- Without force_json_response ---")
    result_no_force_json_str = strategy_no_force.extract("url", sample_content)
    print(f"Raw output: {result_no_force_json_str}")
    try:
        data_no_force = json.loads(result_no_force_json_str) # This would likely fail with the mock
        print(f"Parsed data: {data_no_force}")
    except json.JSONDecodeError as e:
        print(f"Failed to parse as JSON (expected for this mock): {e}")

    # Case 2: force_json_response = True
    mock_perform_completion.return_value = mock_llm_forced_json
    strategy_with_force = LLMExtractionStrategy(
        llm_config=llm_config_for_comparison,
        schema=schema_def,
        extraction_type="schema",
        force_json_response=True
    )
    print("\n--- With force_json_response = True ---")
    result_with_force_json_str = strategy_with_force.extract("url", sample_content)
    print(f"Raw output: {result_with_force_json_str}")
    try:
        data_with_force = json.loads(result_with_force_json_str)
        SimpleData(**data_with_force) # Validate
        print(f"Parsed data: {data_with_force} (Successfully parsed and validated)")
    except json.JSONDecodeError as e:
        print(f"Failed to parse as JSON (unexpected with good JSON mode): {e}")

if __name__ == "__main__":
    compare_force_json_response()
```
---

### 3.9. Verbosity and Logging

#### 3.9.1. Example: Using `verbose=True` to see detailed LLM interaction logs.
Setting `verbose=True` in `LLMExtractionStrategy` enables detailed logging of prompts sent to and responses received from the LLM.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig, DefaultLogger
from unittest.mock import patch, MagicMock
import json
import io
import sys

# Mock LLM response
mock_llm_verbose = MagicMock()
mock_llm_verbose.choices = [MagicMock(message=MagicMock(content=json.dumps({"data": "verbose example"})))]
mock_llm_verbose.usage = MagicMock(completion_tokens=5, prompt_tokens=10, total_tokens=15)
mock_llm_verbose.usage.completion_tokens_details = {}; mock_llm_verbose.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_verbose)
def verbose_logging_example(mock_perform_completion):
    # Capture stdout to check for verbose logs
    old_stdout = sys.stdout
    sys.stdout = captured_output = io.StringIO()

    try:
        # Use a simple logger for this example that prints to stdout
        logger = DefaultLogger(verbose=True)

        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            verbose=True, # Enable verbose logging in the strategy
            logger=logger,  # Pass the logger
            extraction_type="schema_from_instruction",
            instruction="Extract something."
        )
    except: # Fallback if ollama/logger setup fails for some reason in test env
        sys.stdout = old_stdout
        print("Ollama/Logger not available, skipping verbose logging test.")
        return


    strategy.extract(url="http://dummy.com/verbose", html_content="Some sample content.")

    sys.stdout = old_stdout # Restore stdout
    output_log = captured_output.getvalue()

    print("\n--- Captured Verbose Log Output (should contain LLM prompt/response details) ---")
    print(output_log)

    # Check for typical verbose log messages (actual messages might vary)
    assert "LLM Request" in output_log or "Prompt for LLM" in output_log
    assert "LLM Response" in output_log or "Response from LLM" in output_log
    print("\nVerbose logging appeared to work.")

if __name__ == "__main__":
    verbose_logging_example()
```

#### 3.9.2. Example: Providing a custom `logger` instance to `LLMExtractionStrategy`.
You can integrate `LLMExtractionStrategy` with your existing logging setup.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig, DefaultLogger
import logging
import io

# Setup a custom Python logger
custom_logger = logging.getLogger("MyCustomExtractorLogger")
custom_logger.setLevel(logging.INFO)
log_capture_string = io.StringIO()
ch = logging.StreamHandler(log_capture_string)
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
custom_logger.addHandler(ch)
custom_logger.propagate = False # Prevent duplicate logs if root logger also has a handler

# For LLMExtractionStrategy, we need to wrap this in a Crawl4ai compatible logger
class CustomCrawl4aiLogger(DefaultLogger):
    def __init__(self, py_logger, verbose=False):
        super().__init__(verbose=verbose)
        self.py_logger = py_logger

    def _log(self, level_str, message, tag=None, params=None, colors=None):
        # You can customize how messages are formatted and logged here
        log_message = f"[{tag or 'C4AI'}] {message}"
        if params:
            log_message = log_message.format(**params)

        if level_str.lower() == "info":
            self.py_logger.info(log_message)
        elif level_str.lower() == "error":
            self.py_logger.error(log_message)
        elif level_str.lower() == "warning":
            self.py_logger.warning(log_message)
        elif self.verbose and level_str.lower() == "debug": # Only log debug if verbose
             self.py_logger.debug(log_message)


# Mock the LLM call for this example to focus on logging
from unittest.mock import patch, MagicMock
import json
mock_llm_custom_log = MagicMock()
mock_llm_custom_log.choices = [MagicMock(message=MagicMock(content=json.dumps({"info":"logged"})))]
mock_llm_custom_log.usage = MagicMock(completion_tokens=3, prompt_tokens=10, total_tokens=13)
mock_llm_custom_log.usage.completion_tokens_details = {}; mock_llm_custom_log.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_custom_log)
def custom_logger_example(mock_perform_completion):
    crawl4ai_custom_logger = CustomCrawl4aiLogger(custom_logger, verbose=True)

    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            logger=crawl4ai_custom_logger, # Pass the custom logger instance
            verbose=True, # Ensure strategy attempts to log debug messages too
            extraction_type="schema_from_instruction",
            instruction="Log this."
        )
    except:
        print("Ollama not available, skipping custom logger test.")
        return

    strategy.extract(url="http://dummy.com/custom_log", html_content="Content for custom logger.")

    log_contents = log_capture_string.getvalue()
    print("\n--- Captured Log Output (via custom Python logger) ---")
    print(log_contents)

    assert "MyCustomExtractorLogger" in log_contents # Check if our logger's name is in output
    assert "[LLM_REQ]" in log_contents or "[LLM_RESP]" in log_contents # Check for common strategy tags

if __name__ == "__main__":
    custom_logger_example()
```
---

### 3.10. Practical Extraction Scenarios
These examples use `AsyncWebCrawler` and might require actual internet access and potentially API keys for the LLMs. They will be mocked for consistency in testing, but the setup shows real-world usage.

#### 3.10.1. Example: Extracting product names, prices, and descriptions from an e-commerce page.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import List, Optional
from unittest.mock import patch, MagicMock
import json
import os

class ProductInfo(BaseModel):
    name: str = Field(..., description="The name of the product")
    price: Optional[float] = Field(None, description="The price of the product, as a float")
    description_snippet: Optional[str] = Field(None, description="A short snippet of the product description")

class ProductPageExtract(BaseModel):
    products: List[ProductInfo] = Field(description="List of products found on the page")

# Mock the LLM call
mock_ecommerce_response = MagicMock()
mock_ecommerce_response.choices = [MagicMock()]
mock_ecommerce_response.choices[0].message.content = json.dumps({
    "products": [
        {"name": "Super Widget X1000", "price": 99.99, "description_snippet": "The best widget ever."},
        {"name": "Basic Widget B50", "price": 19.99, "description_snippet": "A simple, reliable widget."}
    ]
})
mock_ecommerce_response.usage = MagicMock(completion_tokens=50, prompt_tokens=300, total_tokens=350)
mock_ecommerce_response.usage.completion_tokens_details = {}; mock_ecommerce_response.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_ecommerce_response)
async def extract_ecommerce_products(mock_perform_completion):
    # This URL is a placeholder; a real e-commerce page would be used.
    # For CI/testing, we use a simple example.com which won't have products.
    # The key is to show the setup.
    ecommerce_url = "http://example.com"

    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_ecommerce"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=ProductPageExtract.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all product names, their prices, and a short description snippet from the page content."
        )
    except Exception as e:
        print(f"LLM setup failed for e-commerce example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strat,
        # word_count_threshold=5 # Lower for example.com if testing live
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=ecommerce_url, config=run_config)

    print(f"--- Extraction from E-commerce like page ({ecommerce_url}) ---")
    if result.success and result.extracted_content:
        extracted_data = json.loads(result.extracted_content)
        print(json.dumps(extracted_data, indent=2))

        # Validate with Pydantic
        page_data = ProductPageExtract(**extracted_data)
        for product in page_data.products:
            print(f"Product: {product.name}, Price: {product.price}")
    elif not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        print("No structured data extracted or extraction failed.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_ecommerce_products())
```

#### 3.10.2. Example: Extracting article headlines, authors, and publication dates from a news site.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import Optional
from unittest.mock import patch, MagicMock
import json
import os

class NewsArticle(BaseModel):
    headline: str = Field(..., description="The main headline of the news article")
    author: Optional[str] = Field(None, description="The author(s) of the article")
    publication_date: Optional[str] = Field(None, description="The date the article was published (e.g., YYYY-MM-DD)")

# Mock the LLM call
mock_news_response = MagicMock()
mock_news_response.choices = [MagicMock()]
mock_news_response.choices[0].message.content = json.dumps({
    "headline": "AI Breakthrough Announced",
    "author": "Reporter Bot",
    "publication_date": "2024-05-24"
})
mock_news_response.usage = MagicMock(completion_tokens=30, prompt_tokens=250, total_tokens=280)
mock_news_response.usage.completion_tokens_details = {}; mock_news_response.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_news_response)
async def extract_news_article_details(mock_perform_completion):
    # Using Wikipedia for a stable, public news-like article structure
    news_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_news"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=NewsArticle.model_json_schema(),
            extraction_type="schema",
            instruction="From the provided news article content, extract the main headline, the author(s), and the publication date."
        )
    except Exception as e:
        print(f"LLM setup failed for news example: {e}. Skipping.")
        return


    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=news_url, config=run_config)

    print(f"--- Extraction from News Article ({news_url}) ---")
    if result.success and result.extracted_content:
        extracted_data = json.loads(result.extracted_content)
        print(json.dumps(extracted_data, indent=2))
        article_data = NewsArticle(**extracted_data)
        print(f"Headline: {article_data.headline}")
    elif not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        print("No structured data extracted or extraction failed.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_news_article_details())
```

#### 3.10.3. Example: Extracting frequently asked questions (FAQs) and their answers from a support page.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import List
from unittest.mock import patch, MagicMock
import json
import os

class FAQItem(BaseModel):
    question: str
    answer: str

class FAQPage(BaseModel):
    faqs: List[FAQItem]

# Mock the LLM call
mock_faq_response = MagicMock()
mock_faq_response.choices = [MagicMock()]
mock_faq_response.choices[0].message.content = json.dumps({
    "faqs": [
        {"question": "What is Crawl4ai?", "answer": "An awesome web crawler."},
        {"question": "How to install?", "answer": "pip install crawl4ai"}
    ]
})
mock_faq_response.usage = MagicMock(completion_tokens=60, prompt_tokens=300, total_tokens=360)
mock_faq_response.usage.completion_tokens_details = {}; mock_faq_response.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_faq_response)
async def extract_faqs(mock_perform_completion):
    # Placeholder URL - a real FAQ page would be used
    faq_url = "http://example.com/faq"

    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_faq"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=FAQPage.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all question and answer pairs from the FAQ section of this page."
        )
    except Exception as e:
        print(f"LLM setup failed for FAQ example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=faq_url, config=run_config)

    print(f"--- Extraction from FAQ Page ({faq_url}) ---")
    if result.success and result.extracted_content:
        extracted_data = json.loads(result.extracted_content)
        print(json.dumps(extracted_data, indent=2))
        faq_page_data = FAQPage(**extracted_data)
        for faq_item in faq_page_data.faqs:
            print(f"Q: {faq_item.question}\nA: {faq_item.answer}\n")
    elif not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        print("No structured data extracted or extraction failed.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_faqs())
```

#### 3.10.4. Example: Extracting contact information (email, phone, address) from a company's "Contact Us" page.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import Optional
from unittest.mock import patch, MagicMock
import json
import os

class ContactInfo(BaseModel):
    email: Optional[str] = Field(None, description="Company contact email address")
    phone: Optional[str] = Field(None, description="Company contact phone number")
    address: Optional[str] = Field(None, description="Company physical address")

# Mock the LLM call
mock_contact_response = MagicMock()
mock_contact_response.choices = [MagicMock()]
mock_contact_response.choices[0].message.content = json.dumps({
    "email": "support@example.com",
    "phone": "1-800-555-1234",
    "address": "123 Main St, Anytown, USA"
})
mock_contact_response.usage = MagicMock(completion_tokens=40, prompt_tokens=200, total_tokens=240)
mock_contact_response.usage.completion_tokens_details = {}; mock_contact_response.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_contact_response)
async def extract_contact_info(mock_perform_completion):
    contact_url = "http://example.com/contact"

    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_contact"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=ContactInfo.model_json_schema(),
            extraction_type="schema",
            instruction="Extract the primary email, phone number, and physical address from this contact page."
        )
    except Exception as e:
        print(f"LLM setup failed for contact info example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=contact_url, config=run_config)

    print(f"--- Extraction from Contact Page ({contact_url}) ---")
    if result.success and result.extracted_content:
        extracted_data = json.loads(result.extracted_content)
        print(json.dumps(extracted_data, indent=2))
        contact_data = ContactInfo(**extracted_data)
        print(f"Email: {contact_data.email}, Phone: {contact_data.phone}")
    elif not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        print("No structured data extracted or extraction failed.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_contact_info())
```

#### 3.10.5. Example: Extracting key entities (people, organizations, locations) from a block of text using `extraction_type="block"` and a specific instruction.
This uses "block" extraction but with an instruction to guide the LLM to tag specific entities.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from unittest.mock import patch, MagicMock
import json
import os

# Mock LLM response - for block extraction with entity tagging
mock_entity_response = MagicMock()
mock_entity_response.choices = [MagicMock()]
mock_entity_response.choices[0].message.content = """
<blocks>
  <block>
    <content>Apple Inc. is headquartered in Cupertino.</content>
    <tags><tag>sentence</tag><tag>ORG:Apple Inc.</tag><tag>LOC:Cupertino</tag></tags>
  </block>
  <block>
    <content>Tim Cook is the CEO.</content>
    <tags><tag>sentence</tag><tag>PER:Tim Cook</tag></tags>
  </block>
</blocks>
"""
mock_entity_response.usage = MagicMock(completion_tokens=50, prompt_tokens=150, total_tokens=200)
mock_entity_response.usage.completion_tokens_details = {}; mock_entity_response.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_entity_response)
async def extract_entities_with_block(mock_perform_completion):
    entity_url = "http://example.com/about-us" # Placeholder

    try:
        llm_conf = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY", "mock_key_entities"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            extraction_type="block",
            instruction="Extract each sentence as a block. For each block, identify and add tags for People (PER:Name), Organizations (ORG:Name), and Locations (LOC:Name) found within that sentence."
        )
    except Exception as e:
        print(f"LLM setup failed for entity extraction example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(extraction_strategy=extraction_strat)

    # For this demo, we'll use direct content instead of crawling a URL
    sample_text_for_entities = "Apple Inc. is headquartered in Cupertino. Tim Cook is the CEO."

    # Normally you'd use crawler.arun(url=..., config=...).
    # Here, we call the strategy directly for simplicity with local text.
    extracted_blocks = extraction_strat.extract(url=entity_url, html_content=sample_text_for_entities)


    print(f"--- Entity Extraction using extraction_type='block' ---")
    if extracted_blocks:
        print(json.dumps(extracted_blocks, indent=2))
        for block in extracted_blocks:
            print(f"Content: {block['content']}")
            print(f"  Tags: {block['tags']}")
    else:
        print("No blocks extracted or extraction failed.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(extract_entities_with_block())
```
---

## 4. Integration with `AsyncWebCrawler`

#### 4.1. Example: Basic `AsyncWebCrawler` run with `LLMExtractionStrategy` configured in `CrawlerRunConfig`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json
import os

class PageSummary(BaseModel):
    summary: str
    keywords: List[str]

mock_llm_crawler_run = MagicMock()
mock_llm_crawler_run.choices = [MagicMock()]
mock_llm_crawler_run.choices[0].message.content = json.dumps({
    "summary": "Example.com is a domain for use in illustrative examples.",
    "keywords": ["example", "domain", "documentation"]
})
mock_llm_crawler_run.usage = MagicMock(completion_tokens=30, prompt_tokens=100, total_tokens=130)
mock_llm_crawler_run.usage.completion_tokens_details = {}; mock_llm_crawler_run.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_crawler_run)
async def crawler_with_llm_extraction(mock_perform_completion):
    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_crawler"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            instruction="Provide a one-sentence summary of the page and list up to 3 main keywords."
        )
    except Exception as e:
        print(f"LLM setup failed for crawler integration example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strat,
        word_count_threshold=5 # Ensure example.com content is processed
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="http://example.com", config=run_config)

    print(f"--- Crawler run with LLMExtractionStrategy ---")
    if result.success and result.extracted_content:
        extracted_data = json.loads(result.extracted_content)
        print("Extracted Data:")
        print(json.dumps(extracted_data, indent=2))

        summary_instance = PageSummary(**extracted_data)
        print(f"\nValidated Summary: {summary_instance.summary}")
        print(f"Keywords: {summary_instance.keywords}")
    elif not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        print("No structured data extracted.")

    assert mock_perform_completion.called

if __name__ == "__main__":
    asyncio.run(crawler_with_llm_extraction())
```

#### 4.2. Example: `AsyncWebCrawler` processing multiple URLs, each with the same `LLMExtractionStrategy`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json
import os

class SiteInfo(BaseModel):
    site_name: str
    main_purpose: str

# Mock LLM to return different info based on URL (simplified by just returning same mock structure)
mock_llm_multi_url = MagicMock()
mock_llm_multi_url.choices = [MagicMock()]
# We'll have the mock return slightly different content for each call
responses_for_multi_url = [
    json.dumps({"site_name": "Example Domain", "main_purpose": "Illustrative examples"}),
    json.dumps({"site_name": "IANA", "main_purpose": "Managing global IP addressing"})
]
call_count_multi = 0
def side_effect_multi_url(*args, **kwargs):
    global call_count_multi
    mock_llm_multi_url.choices[0].message.content = responses_for_multi_url[call_count_multi % len(responses_for_multi_url)]
    call_count_multi +=1
    return mock_llm_multi_url

mock_llm_multi_url.usage = MagicMock(completion_tokens=20, prompt_tokens=80, total_tokens=100) # Generic usage
mock_llm_multi_url.usage.completion_tokens_details = {}; mock_llm_multi_url.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', side_effect=side_effect_multi_url)
async def crawler_many_with_llm_extraction(mock_perform_completion):
    urls_to_crawl = [
        "http://example.com",
        "https://www.iana.org/domains/reserved" # Another simple, stable page
    ]

    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_many"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=SiteInfo.model_json_schema(),
            extraction_type="schema",
            instruction="Identify the site name and its main purpose."
        )
    except Exception as e:
        print(f"LLM setup failed for arun_many example: {e}. Skipping.")
        return

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strat,
        word_count_threshold=10 # Adjust as needed for the test URLs
    )

    async with AsyncWebCrawler() as crawler:
        # arun_many returns an async generator if stream=True, or list if stream=False (default)
        results = await crawler.arun_many(urls=urls_to_crawl, config=run_config)

    print(f"--- Crawler arun_many with LLMExtractionStrategy ---")
    for result in results:
        if result.success and result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print(f"\nURL: {result.url}")
            print("Extracted Data:")
            print(json.dumps(extracted_data, indent=2))
        elif not result.success:
            print(f"\nCrawl for {result.url} failed: {result.error_message}")
        else:
            print(f"\nNo structured data extracted for {result.url}.")

    assert mock_perform_completion.call_count == len(urls_to_crawl)

if __name__ == "__main__":
    asyncio.run(crawler_many_with_llm_extraction())
```

#### 4.3. Example: `AsyncWebCrawler` where `CrawlerRunConfig` is dynamically changed per URL to use different extraction schemas or instructions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any
from unittest.mock import patch, MagicMock
import json
import os

class TechArticleInfo(BaseModel):
    title: str
    primary_topic: str

class CompanyInfo(BaseModel):
    company_name: str
    services_offered: List[str]

# Mocks for different schemas
mock_tech_article_response = MagicMock()
mock_tech_article_response.choices = [MagicMock(message=MagicMock(content=json.dumps({"title": "Intro to Crawling", "primary_topic": "Web Scraping"})))]
mock_tech_article_response.usage = MagicMock(completion_tokens=15, prompt_tokens=70, total_tokens=85)
mock_tech_article_response.usage.completion_tokens_details = {}; mock_tech_article_response.usage.prompt_tokens_details = {}


mock_company_response = MagicMock()
mock_company_response.choices = [MagicMock(message=MagicMock(content=json.dumps({"company_name": "Example Corp", "services_offered": ["Web Hosting", "Domain Registration"]})))]
mock_company_response.usage = MagicMock(completion_tokens=20, prompt_tokens=90, total_tokens=110)
mock_company_response.usage.completion_tokens_details = {}; mock_company_response.usage.prompt_tokens_details = {}


# Side effect function to return different mocks based on schema/instruction
def dynamic_llm_side_effect(*args, **kwargs):
    # Heuristic: check prompt content for clues about which schema is expected
    prompt_content = args[1] # The prompt string is the second argument to perform_completion_with_backoff
    if "TechArticleInfo" in prompt_content or "primary_topic" in prompt_content:
        return mock_tech_article_response
    elif "CompanyInfo" in prompt_content or "services_offered" in prompt_content:
        return mock_company_response
    return MagicMock() # Default generic mock if no match

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', side_effect=dynamic_llm_side_effect)
async def crawler_dynamic_configs(mock_perform_completion):
    urls_and_configs: List[Dict[str, Any]] = [
        {
            "url": "https://en.wikipedia.org/wiki/Web_scraping", # Tech article like
            "schema_model": TechArticleInfo,
            "instruction": "Extract title and primary topic of this technical article."
        },
        {
            "url": "http://example.com", # Generic company like
            "schema_model": CompanyInfo,
            "instruction": "Identify the company name and list its main services."
        }
    ]

    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_dynamic"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set. Mock will be used.")
    except Exception as e:
        print(f"LLM setup failed for dynamic config example: {e}. Skipping.")
        return

    all_results_data = []

    async with AsyncWebCrawler() as crawler:
        for item in urls_and_configs:
            current_url = item["url"]
            CurrentSchemaModel = item["schema_model"]
            current_instruction = item["instruction"]

            extraction_strat = LLMExtractionStrategy(
                llm_config=llm_conf,
                schema=CurrentSchemaModel.model_json_schema(),
                extraction_type="schema",
                instruction=current_instruction
            )
            run_config = CrawlerRunConfig(
                extraction_strategy=extraction_strat,
                word_count_threshold=10
            )

            print(f"\nCrawling {current_url} with schema {CurrentSchemaModel.__name__}...")
            result = await crawler.arun(url=current_url, config=run_config)

            if result.success and result.extracted_content:
                extracted_data = json.loads(result.extracted_content)
                print(f"Extracted for {current_url}:")
                print(json.dumps(extracted_data, indent=2))
                all_results_data.append(extracted_data)
            else:
                print(f"Failed or no extraction for {current_url}: {result.error_message}")
                all_results_data.append({"error": result.error_message, "url": current_url})

    assert mock_perform_completion.call_count == len(urls_and_configs)
    assert all_results_data[0].get("primary_topic") == "Web Scraping"
    assert "Web Hosting" in all_results_data[1].get("services_offered", [])


if __name__ == "__main__":
    asyncio.run(crawler_dynamic_configs())
```
---

## 5. Combining Extraction with Other Crawl4ai Features

### 5.1. **Extraction from PDF Content:**

#### 5.1.1. Example: Using `PDFCrawerStrategy` and `PDFContentScrapingStrategy` to get HTML/text from a PDF, then using `LLMExtractionStrategy` on that content.
This example requires `crawl4ai[pdf]` to be installed. We'll use a public PDF URL.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.content_scraping_strategy import PDFContentScrapingStrategy # For PDF processing
from crawl4ai.async_crawler_strategy import PDFCrawlerStrategy # To handle PDF URLs
from pydantic import BaseModel, Field
from typing import List, Optional
from unittest.mock import patch, MagicMock
import json
import os

# Note: This example requires PyPDF2. Install with: pip install crawl4ai[pdf]

class PDFSummary(BaseModel):
    title: Optional[str] = Field(None, description="The title of the PDF document, if discernible.")
    first_paragraph_summary: str = Field(..., description="A brief summary of the first main paragraph of text.")

# Mock for LLM
mock_llm_pdf_extract = MagicMock()
mock_llm_pdf_extract.choices = [MagicMock(message=MagicMock(content=json.dumps({
    "title": "Sample PDF Document",
    "first_paragraph_summary": "This PDF discusses important topics related to data."
})))]
mock_llm_pdf_extract.usage = MagicMock(completion_tokens=20, prompt_tokens=150, total_tokens=170)
mock_llm_pdf_extract.usage.completion_tokens_details = {}; mock_llm_pdf_extract.usage.prompt_tokens_details = {}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_pdf_extract)
async def extract_from_pdf_content(mock_perform_completion):
    # A public, simple PDF for testing. (e.g., a small, text-based PDF)
    # Using a known, stable PDF URL: an RFC document (plain text heavy)
    pdf_url = "https://www.rfc-editor.org/rfc/rfc2616.txt.pdf" # This is a PDF version of an RFC text file
    # pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf" # A very simple dummy PDF

    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_pdf"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set for PDF example. Mock will be used.")

        extraction_strat = LLMExtractionStrategy(
            llm_config=llm_conf,
            schema=PDFSummary.model_json_schema(),
            extraction_type="schema",
            instruction="From the provided PDF text, extract the document title if available, and summarize the first main paragraph.",
            input_format="text" # PDFContentScrapingStrategy will provide text
        )
    except Exception as e:
        print(f"LLM setup failed for PDF extraction example: {e}. Skipping.")
        return

    # BrowserConfig might be needed if the PDF is rendered in-browser and not a direct link
    browser_cfg = BrowserConfig()

    # CrawlerRunConfig for PDF processing and then LLM extraction
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strat,
        # The PDFContentScrapingStrategy will be used by PDFCrawlerStrategy
        # to convert PDF to text/HTML for the LLMExtractionStrategy.
        # LLMExtractionStrategy expects text if input_format="text".
        scraping_strategy=PDFContentScrapingStrategy(output_format="text") # Ensure text output for LLM
    )

    # Use PDFCrawlerStrategy to handle the PDF URL
    # Note: For PDFCrawlerStrategy, the 'browser_config' of AsyncWebCrawler is not directly used for PDF fetching.
    # PDFCrawlerStrategy uses requests library directly.
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy(), browser_config=browser_cfg) as crawler:
        print(f"Crawling PDF: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_config)

    print(f"--- Extraction from PDF Content ({pdf_url}) ---")
    if result.success:
        if result.extracted_content:
            extracted_data = json.loads(result.extracted_content)
            print("Extracted Data from PDF:")
            print(json.dumps(extracted_data, indent=2))
            pdf_summary_instance = PDFSummary(**extracted_data)
            print(f"\nValidated Title: {pdf_summary_instance.title}")
            print(f"Summary: {pdf_summary_instance.first_paragraph_summary}")
        else:
            print("PDF content processed, but no structured data extracted by LLM.")
            print(f"Raw Markdown/Text from PDF (first 300 chars): {result.markdown.raw_markdown[:300] if result.markdown else 'N/A'}...")
    else:
        print(f"Crawl/Processing of PDF failed: {result.error_message}")

    if extraction_strat.llm_config.api_token != "mock_key_pdf": # Only assert if not fully mocked
         assert mock_perform_completion.called

if __name__ == "__main__":
    # This example needs `pip install crawl4ai[pdf]`
    try:
        import PyPDF2 # Check if PyPDF2 is installed
        asyncio.run(extract_from_pdf_content())
    except ImportError:
        print("PyPDF2 not found. Please install it with `pip install crawl4ai[pdf]` to run this example.")
    except Exception as e:
        print(f"An error occurred: {e}")
```

### 5.2. **Extraction After Content Filtering (Illustrative):**

#### 5.2.1. Example: Manually running a `ContentFilterStrategy` on HTML, then passing the filtered HTML to `LLMExtractionStrategy.extract(input_format="html")`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter # Example filter
from crawl4ai.utils import LLMConfig
from bs4 import BeautifulSoup
from unittest.mock import patch, MagicMock
import json

# Mock for LLM
mock_llm_filtered_html = MagicMock()
mock_llm_filtered_html.choices = [MagicMock(message=MagicMock(content=json.dumps({"main_idea": "The core idea is about X."})))]
mock_llm_filtered_html.usage = MagicMock(completion_tokens=10, prompt_tokens=40, total_tokens=50)
mock_llm_filtered_html.usage.completion_tokens_details = {}; mock_llm_filtered_html.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_filtered_html)
def extract_after_manual_filter(mock_perform_completion):
    sample_raw_html = """
    <html><head><title>Test Page</title></head><body>
        <header>Site Navigation Links...</header>
        <main id='content'>
            <h1>Article Title</h1>
            <p>This is the first important paragraph.</p>
            <div class='ad-banner'>Advertisement here</div>
            <p>This is the second important paragraph after an ad.</p>
        </main>
        <footer>Copyright 2024. All rights reserved.</footer>
    </body></html>
    """

    # Step 1: Manually apply a content filter
    # PruningContentFilter modifies the soup in-place
    soup = BeautifulSoup(sample_raw_html, "lxml")

    # Example of direct filter usage (conceptual, PruningContentFilter works on soup)
    # PruningContentFilter might typically be used within MarkdownGenerator
    # For this example, let's simulate its effect by selecting main content
    # and removing a known ad-like element.

    # Simulate pruning effect:
    header = soup.find("header")
    if header: header.decompose()
    footer = soup.find("footer")
    if footer: footer.decompose()
    ad_banner = soup.find("div", class_="ad-banner")
    if ad_banner: ad_banner.decompose()

    filtered_html_content = str(soup.find("main")) # Get HTML of the main content after pruning

    print(f"--- Filtered HTML (simulated) ---\n{filtered_html_content}\n---------------------------------")

    # Step 2: Pass filtered HTML to LLMExtractionStrategy
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"),
            input_format="html", # The filtered content is still HTML
            extraction_type="schema_from_instruction",
            instruction="What is the main idea of this content?"
        )
    except:
        print("Ollama not available, skipping manual filter extraction test.")
        return

    extracted_json = strategy.extract(url="http://dummy.com/filtered", html_content=filtered_html_content)

    print("\nExtraction from Manually Filtered HTML (mocked LLM):")
    if extracted_json:
        print(json.dumps(json.loads(extracted_json), indent=2))

    assert mock_perform_completion.called

if __name__ == "__main__":
    extract_after_manual_filter()
```
---

## 6. Advanced Techniques and Edge Cases

#### 6.1. Example: Handling cases where the LLM fails to extract data or returns an unexpected format (and how `force_json_response` might help).

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json
import os

class SimpleSchema(BaseModel):
    name: str

# Mock 1: LLM returns malformed JSON (e.g., with extra text, missing comma)
mock_malformed_json_response = MagicMock()
mock_malformed_json_response.choices = [MagicMock(message=MagicMock(content='Sure, here is the JSON: {"name": "Test" // oops, a comment'))]
mock_malformed_json_response.usage = MagicMock(completion_tokens=10, prompt_tokens=50, total_tokens=60)
mock_malformed_json_response.usage.completion_tokens_details = {}; mock_malformed_json_response.usage.prompt_tokens_details = {}


# Mock 2: LLM returns clean JSON (simulating successful force_json_response)
mock_clean_json_response = MagicMock()
mock_clean_json_response.choices = [MagicMock(message=MagicMock(content=json.dumps({"name": "Test"})))]
mock_clean_json_response.usage = MagicMock(completion_tokens=8, prompt_tokens=50, total_tokens=58)
mock_clean_json_response.usage.completion_tokens_details = {}; mock_clean_json_response.usage.prompt_tokens_details = {}


@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff')
def handle_llm_failures(mock_perform_completion):
    sample_content = "The name is Test."
    schema_def = SimpleSchema.model_json_schema()

    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_failure"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set for failure handling example.")
    except:
        print("LLM provider setup failed. Skipping LLM failure handling test.")
        return

    # Scenario 1: LLM returns malformed JSON, force_json_response=False
    print("--- Scenario 1: Malformed JSON, force_json_response=False ---")
    mock_perform_completion.return_value = mock_malformed_json_response
    strategy_no_force = LLMExtractionStrategy(
        llm_config=llm_conf, schema=schema_def, extraction_type="schema", force_json_response=False
    )
    result_str_no_force = strategy_no_force.extract("url", sample_content)
    print(f"LLM raw output: {result_str_no_force}")
    try:
        # This should ideally fail or be handled by LiteLLM's built-in JSON parsing attempts
        parsed = json.loads(result_str_no_force)
        SimpleSchema(**parsed) # Validate
        print(f"Parsed (unexpectedly successful for this mock): {parsed}")
    except Exception as e:
        print(f"Expected parsing/validation error: {e}")

    # Scenario 2: LLM returns malformed JSON, force_json_response=True
    # The mock for this scenario simulates that force_json_response helped the LLM return clean JSON
    print("\n--- Scenario 2: Malformed JSON (conceptually), force_json_response=True ---")
    mock_perform_completion.return_value = mock_clean_json_response
    strategy_with_force = LLMExtractionStrategy(
        llm_config=llm_conf, schema=schema_def, extraction_type="schema", force_json_response=True
    )
    result_str_with_force = strategy_with_force.extract("url", sample_content)
    print(f"LLM raw output (should be clean JSON): {result_str_with_force}")
    try:
        parsed_forced = json.loads(result_str_with_force)
        SimpleSchema(**parsed_forced) # Validate
        print(f"Parsed successfully with force_json_response: {parsed_forced}")
    except Exception as e:
        print(f"Error parsing/validating even with force_json_response: {e}")

if __name__ == "__main__":
    handle_llm_failures()
```

#### 6.2. Example: Strategies for dealing with very long content that might exceed single LLM context windows even after chunking (e.g., iterative extraction or summarization prior to extraction).
This is a conceptual example. Real implementation would be more complex.
We'll show a simplified version where we first "summarize" chunks then extract from summaries.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from pydantic import BaseModel
from unittest.mock import patch, MagicMock
import json
import os

class DocumentExtract(BaseModel):
    overall_summary: str
    key_points: List[str]

# Mocks
# 1. For summarizing chunks
mock_summarize_chunk = MagicMock()
mock_summarize_chunk.choices = [MagicMock()]
mock_summarize_chunk.usage = MagicMock(completion_tokens=20, prompt_tokens=50, total_tokens=70)
mock_summarize_chunk.usage.completion_tokens_details = {}; mock_summarize_chunk.usage.prompt_tokens_details = {}


# 2. For final extraction from combined summaries
mock_final_extract = MagicMock()
mock_final_extract.choices = [MagicMock()]
mock_final_extract.choices[0].message.content = json.dumps({
    "overall_summary": "The document discusses AI advancements and their societal impact.",
    "key_points": ["AI is evolving fast.", "Ethics are important.", "Future is exciting."]
})
mock_final_extract.usage = MagicMock(completion_tokens=40, prompt_tokens=100, total_tokens=140)
mock_final_extract.usage.completion_tokens_details = {}; mock_final_extract.usage.prompt_tokens_details = {}


def mock_llm_router(*args, **kwargs):
    # Simplistic router based on instruction
    instruction = kwargs.get('instruction', '')
    if "summarize this chunk" in instruction.lower():
        # The content of the mock is less important here than the call itself
        mock_summarize_chunk.choices[0].message.content = json.dumps({"summary_of_chunk": "Chunk summary text."})
        return mock_summarize_chunk
    elif "overall document summary" in instruction.lower():
        return mock_final_extract
    return MagicMock() # Default

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', side_effect=mock_llm_router)
def iterative_extraction_for_long_content(mock_perform_completion):
    very_long_document_content = ("This is part of a very long document. " * 100) # Simulate long content

    try:
        llm_conf = LLMConfig(provider="openai/gpt-3.5-turbo", api_token=os.getenv("OPENAI_API_KEY", "mock_key_iterative"))
        if not os.getenv("OPENAI_API_KEY"): print("Warning: OPENAI_API_KEY not set for iterative example.")
    except:
        print("LLM setup failed. Skipping iterative extraction test.")
        return

    # Step 1: Create strategy for summarizing chunks (block extraction for simplicity)
    summarizer_strategy = LLMExtractionStrategy(
        llm_config=llm_conf,
        extraction_type="block", # Could be schema { "summary": "..." }
        instruction="Summarize this chunk of text concisely.",
        chunk_token_threshold=60, # Smaller chunks for summarization
        word_token_rate=0.75,
        apply_chunking=True
    )

    print("--- Step 1: Summarizing chunks of the long document ---")
    # LLMExtractionStrategy.extract returns list of blocks if extraction_type="block"
    chunk_summaries_blocks = summarizer_strategy.extract("url", very_long_document_content)

    # We'd expect chunk_summaries_blocks to be a list of dicts like [{'content': 'summary1', 'tags':[]}, ...]
    # For this mock, we'll just take the mocked 'content'
    chunk_summaries = [block.get("content", "") for block in chunk_summaries_blocks if isinstance(block, dict)]
    combined_summary_text = "\n".join(chunk_summaries)

    print(f"Summarized {len(chunk_summaries_blocks)} chunks into combined text of length {len(combined_summary_text)}.")
    print(f"Combined summary (first 100 chars): {combined_summary_text[:100]}...")

    # Step 2: Create strategy for final extraction from combined summaries
    final_extraction_strategy = LLMExtractionStrategy(
        llm_config=llm_conf,
        schema=DocumentExtract.model_json_schema(),
        extraction_type="schema",
        instruction="From the provided combined summaries, create an overall document summary and list key points."
    )

    print("\n--- Step 2: Final extraction from combined summaries ---")
    final_extracted_json_str = final_extraction_strategy.extract("url_summary", combined_summary_text)

    if final_extracted_json_str:
        final_data = json.loads(final_extracted_json_str)
        print(json.dumps(final_data, indent=2))
        doc_extract_instance = DocumentExtract(**final_data)
        assert "AI is evolving fast" in doc_extract_instance.key_points
    else:
        print("Final extraction failed.")

if __name__ == "__main__":
    iterative_extraction_for_long_content()
```

#### 6.3. Example: Showing how to access `TokenUsage` statistics after an LLM extraction.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
from unittest.mock import patch, MagicMock
import json

# Mock LLM response with usage data
mock_llm_usage = MagicMock()
mock_llm_usage.choices = [MagicMock(message=MagicMock(content=json.dumps({"extracted": "data"})))]
mock_llm_usage.usage = MagicMock()
mock_llm_usage.usage.completion_tokens = 50
mock_llm_usage.usage.prompt_tokens = 150
mock_llm_usage.usage.total_tokens = 200
# Example of detailed token usage (if provider supports it)
mock_llm_usage.usage.completion_tokens_details = {"gpt-3.5-turbo": 50}
mock_llm_usage.usage.prompt_tokens_details = {"gpt-3.5-turbo": 150}

@patch('crawl4ai.extraction_strategy.perform_completion_with_backoff', return_value=mock_llm_usage)
def show_token_usage(mock_perform_completion):
    try:
        strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(provider="ollama/llama3", api_token="ollama"), # Or any provider
            extraction_type="schema_from_instruction",
            instruction="Extract data."
        )
    except:
        print("Ollama not available, skipping token usage test.")
        return

    strategy.extract(url="http://dummy.com/usage_test", html_content="Some content to extract from.")

    print("--- Token Usage Statistics ---")

    # Total usage accumulated by the strategy instance across all its .extract() calls
    print(f"Total Accumulated Usage for this strategy instance:")
    print(f"  Completion Tokens: {strategy.total_usage.completion_tokens}")
    print(f"  Prompt Tokens: {strategy.total_usage.prompt_tokens}")
    print(f"  Total Tokens: {strategy.total_usage.total_tokens}")

    # Usage for the last .extract() call (or list of calls if chunking happened)
    if strategy.usages:
        print(f"\nUsage for the last .extract() operation (may include multiple LLM calls if chunked):")
        for i, usage_info in enumerate(strategy.usages[-mock_perform_completion.call_count:]): # Show for calls made in this run
            print(f"  LLM Call {i+1}:")
            print(f"    Completion Tokens: {usage_info.completion_tokens}")
            print(f"    Prompt Tokens: {usage_info.prompt_tokens}")
            print(f"    Total Tokens: {usage_info.total_tokens}")
            if usage_info.completion_tokens_details:
                 print(f"    Completion Details: {usage_info.completion_tokens_details}")
            if usage_info.prompt_tokens_details:
                 print(f"    Prompt Details: {usage_info.prompt_tokens_details}")
    else:
        print("No usage data recorded for the last operation (might be an issue or first run).")

    assert strategy.total_usage.total_tokens == 200 # Based on mock

if __name__ == "__main__":
    show_token_usage()
```
---

## 7. Deprecated Parameter Usage (for backward compatibility reference, if necessary)

#### 7.1. Example: Initializing `LLMExtractionStrategy` using deprecated `provider`, `api_token`, `base_url` and showing equivalence with `llm_config`. (Mark as deprecated in example comments).
This example shows the old way of passing LLM configuration directly to `LLMExtractionStrategy` and the new, recommended way using `LLMConfig`.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.utils import LLMConfig
import os
import warnings

# Suppress deprecation warnings for this specific example
warnings.filterwarnings("ignore", category=DeprecationWarning, module="crawl4ai.extraction_strategy")


print("--- Demonstrating LLM Configuration: Deprecated vs. LLMConfig ---")

# Parameters for the example
provider_name = "openai/gpt-3.5-turbo"
api_key = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY_PLACEHOLDER")
# base_api_url = "https://api.example.com/v1" # Example custom base URL

# --- Deprecated Way (for illustration) ---
# Note: This will raise DeprecationWarnings if not suppressed.
# It's included here to show users how to migrate.
print("\nAttempting initialization with DEPRECATED direct parameters:")
try:
    strategy_deprecated = LLMExtractionStrategy(
        provider=provider_name,
        api_token=api_key,
        # base_url=base_api_url, # If you had a custom base_url
        instruction="This is a test." # Need some instruction for it to init provider
    )
    print(f"  Deprecated way: Provider='{strategy_deprecated.llm_config.provider}', Token (first 5)='{strategy_deprecated.llm_config.api_token[:5] if strategy_deprecated.llm_config.api_token else None}', BaseURL='{strategy_deprecated.llm_config.base_url}'")
    assert strategy_deprecated.llm_config.provider == provider_name
except Exception as e:
    print(f"  Error with deprecated init (as expected if params fully removed or strict checks): {e}")


# --- New Recommended Way (using LLMConfig) ---
print("\nAttempting initialization with NEW LLMConfig object:")
llm_configuration = LLMConfig(
    provider=provider_name,
    api_token=api_key,
    # base_url=base_api_url # If you have a custom base_url
)
strategy_new = LLMExtractionStrategy(
    llm_config=llm_configuration,
    instruction="This is a test." # Need some instruction
)
print(f"  New way (LLMConfig): Provider='{strategy_new.llm_config.provider}', Token (first 5)='{strategy_new.llm_config.api_token[:5] if strategy_new.llm_config.api_token else None}', BaseURL='{strategy_new.llm_config.base_url}'")
assert strategy_new.llm_config.provider == provider_name

print("\nComparison:")
if 'strategy_deprecated' in locals() and strategy_deprecated.llm_config.provider == strategy_new.llm_config.provider:
    print("Both methods (deprecated and new) resulted in the same LLM provider configuration (provider name matches).")
else:
    print("There was a difference or the deprecated method failed as expected.")

# Restore default warning behavior if needed for other tests
warnings.resetwarnings()
```
---

**Note:** Many examples involving `LLMExtractionStrategy` use mocked LLM calls for simplicity and to avoid API key dependencies during automated testing or casual runs. When adapting these examples for real use, ensure you have a valid `LLMConfig` with appropriate provider details and API tokens, and remove or adapt the `@patch` decorators. For local testing, consider using Ollama with a model like `ollama/llama3`.

```