+## 7. Putting It All Together: Larger Example
+Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
+```python
+schema = {
+ "name": "Blog Posts",
+ "baseSelector": "a.blog-post-card",
+ "baseFields": [
+ {"name": "post_url", "type": "attribute", "attribute": "href"}
+ ],
+ "fields": [
+ {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
+ {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
+ {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
+ {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
+ ]
+}
+```
+Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
+## 8. Tips & Best Practices
+3. **Test** your schema on partial HTML or a test page before a big crawl.
+4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
+5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
+6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
+8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.
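As a rough illustration of what `RegexExtractionStrategy` automates, common data types can be pulled with plain regular expressions (the patterns below are simplified sketches, not the library's built-in patterns):

```python
import re

# Simplified patterns for two common data types (illustrative only;
# the library ships its own, more robust built-ins).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

text = "Contact sales@example.com or visit https://example.com/pricing for details."
emails = EMAIL_RE.findall(text)
urls = URL_RE.findall(text)
print(emails)  # ['sales@example.com']
print(urls)    # ['https://example.com/pricing']
```

The strategy packages this kind of matching into the same `extraction_strategy` interface as the CSS/XPath strategies, so results land in `result.extracted_content`.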
+## 9. Schema Generation Utility
+1. You're dealing with a new website structure and want a quick starting point
+2. You need to extract complex nested data structures
+3. You want to avoid the learning curve of CSS/XPath selector syntax
+### Using the Schema Generator
+The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4o and open-source models served via Ollama for schema generation:
+```python
+from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
+from crawl4ai import LLMConfig
+
+# Sample HTML with product information
+html = """
+
+
Gaming Laptop
+
$999.99
+
+
+"""
+
+# Option 1: Using OpenAI (requires API token)
+css_schema = JsonCssExtractionStrategy.generate_schema(
+ html,
+ schema_type="css",
+    llm_config = LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")
+)
+
+# Option 2: Using Ollama (open source, no token needed)
+xpath_schema = JsonXPathExtractionStrategy.generate_schema(
+ html,
+ schema_type="xpath",
+ llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None) # Not needed for Ollama
+)
+
+# Use the generated schema for fast, repeated extractions
+strategy = JsonCssExtractionStrategy(css_schema)
+```
+### LLM Provider Options
+1. **OpenAI GPT-4o (`openai/gpt-4o`)**
+ - Default provider
+ - Requires an API token
+ - Generally provides more accurate schemas
+ - Set via environment variable: `OPENAI_API_KEY`
+2. **Ollama (`ollama/llama3.3`)**
+ - Open source alternative
+ - No API token required
+ - Self-hosted option
+ - Good for development and testing
+### Benefits of Schema Generation
+### Best Practices
+6. **Choose Provider Wisely**:
+ - Use OpenAI for production-quality schemas
+ - Use Ollama for development, testing, or when you need a self-hosted solution
+## 10. Conclusion
+With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:
+- Scrape any consistent site for structured data.
+- Support nested objects, repeating lists, or pattern-based extraction.
+- Scale to thousands of pages quickly and reliably.
+- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
+- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns.
+
+
+# Extracting JSON (LLM)
+**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./no-llm-strategies.md) or [`JsonXPathExtractionStrategy`](./no-llm-strategies.md) first. But if you need AI to interpret or reorganize content, read on!
+## 1. Why Use an LLM?
+## 2. Provider-Agnostic via LiteLLM
+```python
+llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
+```
+Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
+- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
+- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
+- **`base_url`** (optional): If your provider has a custom endpoint.
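For instance, a few `LLMConfig` sketches covering the three fields (configuration fragments only; the custom-endpoint URL and model names past those shown above are placeholders):

```python
# Hosted provider: token required (here read from the environment)
LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY"))

# Local Ollama: no token needed
LLMConfig(provider="ollama/llama2", api_token=None)

# Custom OpenAI-compatible endpoint (placeholder URL)
LLMConfig(provider="openai/my-model", api_token="YOUR_TOKEN", base_url="https://llm.example.com/v1")
```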
+## 3. How LLM Extraction Works
+### 3.1 Flow
+1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it’s very long (based on `chunk_token_threshold`, overlap, etc.).
+2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
+3. **LLM Call**: Each chunk is sent to the model, which returns JSON or structured text.
+4. **Combining**: The results from each chunk are merged and parsed into JSON.
+### 3.2 `extraction_type`
+- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
+- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.
+For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
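For illustration, this is roughly the JSON Schema that `model_json_schema()` emits for a simple two-field model (a hand-written sketch; real Pydantic output includes additional metadata):

```python
# Roughly what a Pydantic model with 'name: str' and 'price: str'
# fields produces via model_json_schema() (simplified sketch).
product_schema = {
    "title": "Product",
    "type": "object",
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "price": {"title": "Price", "type": "string"},
    },
    "required": ["name", "price"],
}

# This dict is what gets passed as `schema=` to the extraction strategy.
print(sorted(product_schema["required"]))  # ['name', 'price']
```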
+## 4. Key Parameters
+Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
+1. **`llm_config`** (`LLMConfig`): The provider configuration, e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
+2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
+3. **`extraction_type`** (str): `"schema"` or `"block"`.
+4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
+5. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
+6. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
+7. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
+8. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
+ - `"markdown"`: The raw markdown (default).
+ - `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
+ - `"html"`: The cleaned or raw HTML.
+9. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
+10. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
+```python
+extraction_strategy = LLMExtractionStrategy(
+ llm_config = LLMConfig(provider="openai/gpt-4", api_token="YOUR_OPENAI_KEY"),
+ schema=MyModel.model_json_schema(),
+ extraction_type="schema",
+ instruction="Extract a list of items from the text with 'name' and 'price' fields.",
+ chunk_token_threshold=1200,
+ overlap_rate=0.1,
+ apply_chunking=True,
+ input_format="html",
+ extra_args={"temperature": 0.1, "max_tokens": 1000},
+ verbose=True
+)
+```
+## 5. Putting It in `CrawlerRunConfig`
+**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Here’s a full example:
+```python
+import os
+import asyncio
+import json
+from pydantic import BaseModel, Field
+from typing import List
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
+from crawl4ai import LLMExtractionStrategy
+
+class Product(BaseModel):
+ name: str
+ price: str
+
+async def main():
+ # 1. Define the LLM extraction strategy
+ llm_strategy = LLMExtractionStrategy(
+ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
+        schema=Product.model_json_schema(),  # JSON schema from the Pydantic model
+ extraction_type="schema",
+ instruction="Extract all product objects with 'name' and 'price' from the content.",
+ chunk_token_threshold=1000,
+ overlap_rate=0.0,
+ apply_chunking=True,
+ input_format="markdown", # or "html", "fit_markdown"
+ extra_args={"temperature": 0.0, "max_tokens": 800}
+ )
+
+ # 2. Build the crawler config
+ crawl_config = CrawlerRunConfig(
+ extraction_strategy=llm_strategy,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ # 3. Create a browser config if needed
+ browser_cfg = BrowserConfig(headless=True)
+
+ async with AsyncWebCrawler(config=browser_cfg) as crawler:
+ # 4. Let's say we want to crawl a single page
+ result = await crawler.arun(
+ url="https://example.com/products",
+ config=crawl_config
+ )
+
+ if result.success:
+ # 5. The extracted content is presumably JSON
+ data = json.loads(result.extracted_content)
+ print("Extracted items:", data)
+
+ # 6. Show usage stats
+ llm_strategy.show_usage() # prints token usage
+ else:
+ print("Error:", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+## 6. Chunking Details
+### 6.1 `chunk_token_threshold`
+If your page is large, you might exceed your LLM’s context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates word→token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
+### 6.2 `overlap_rate`
+To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
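The arithmetic can be sketched in plain Python (a simplified word-based approximation; the library's actual splitter may differ, and the code assumes `word_token_rate` means words per token):

```python
def chunk_words(words, chunk_token_threshold=1200, overlap_rate=0.1, word_token_rate=0.75):
    # How many words fit in one chunk (assumption: word_token_rate is
    # a words-per-token ratio), then step forward so that adjacent
    # chunks share an overlap_rate fraction of their content.
    words_per_chunk = int(chunk_token_threshold * word_token_rate)
    step = max(1, int(words_per_chunk * (1 - overlap_rate)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + words_per_chunk])
        if start + words_per_chunk >= len(words):
            break
    return chunks

words = ["tok"] * 4000
chunks = chunk_words(words)
print(len(chunks), len(chunks[0]))  # 5 900
```

With the defaults above, each chunk holds 900 words and consecutive chunks share 90 words (10%), so information straddling a boundary still appears whole in at least one chunk.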
+### 6.3 Performance & Parallelism
+## 7. Input Format
+By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawler’s final markdown** is fed to the LLM. You can change to:
+- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
+- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
+- **`markdown`**: Standard markdown output from the crawler’s `markdown_generator`.
+This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
+```python
+LLMExtractionStrategy(
+ # ...
+ input_format="html", # Instead of "markdown" or "fit_markdown"
+)
+```
+## 8. Token Usage & Show Usage
+- **`usages`** (list): token usage per chunk or call.
+- **`total_usage`**: sum of all chunk calls.
+- **`show_usage()`**: prints a usage report (if the provider returns usage data).
+```python
+llm_strategy = LLMExtractionStrategy(...)
+# ...
+llm_strategy.show_usage()
+# e.g. “Total usage: 1241 tokens across 2 chunk calls”
+```
+## 9. Example: Building a Knowledge Graph
+Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
+```python
+import os
+import json
+import asyncio
+from typing import List
+from pydantic import BaseModel, Field
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
+from crawl4ai import LLMExtractionStrategy
+
+class Entity(BaseModel):
+ name: str
+ description: str
+
+class Relationship(BaseModel):
+ entity1: Entity
+ entity2: Entity
+ description: str
+ relation_type: str
+
+class KnowledgeGraph(BaseModel):
+ entities: List[Entity]
+ relationships: List[Relationship]
+
+async def main():
+ # LLM extraction strategy
+ llm_strat = LLMExtractionStrategy(
+        llm_config = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
+ schema=KnowledgeGraph.model_json_schema(),
+ extraction_type="schema",
+ instruction="Extract entities and relationships from the content. Return valid JSON.",
+ chunk_token_threshold=1400,
+ apply_chunking=True,
+ input_format="html",
+ extra_args={"temperature": 0.1, "max_tokens": 1500}
+ )
+
+ crawl_config = CrawlerRunConfig(
+ extraction_strategy=llm_strat,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
+ # Example page
+ url = "https://www.nbcnews.com/business"
+ result = await crawler.arun(url=url, config=crawl_config)
+
+ print("--- LLM RAW RESPONSE ---")
+ print(result.extracted_content)
+ print("--- END LLM RAW RESPONSE ---")
+
+ if result.success:
+ with open("kb_result.json", "w", encoding="utf-8") as f:
+ f.write(result.extracted_content)
+ llm_strat.show_usage()
+ else:
+ print("Crawl failed:", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
+- **`input_format="html"`** means we feed HTML to the model.
+- **`instruction`** guides the model to output a structured knowledge graph.
+## 10. Best Practices & Caveats
+4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
+## 11. Conclusion
+- Put your LLM strategy **in `CrawlerRunConfig`**.
+- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
+- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
+- Monitor token usage with `show_usage()`.
+If your site’s data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./no-llm-strategies.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.
+1. **Experiment with Different Providers**
+ - Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
+ - Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
+2. **Performance Tuning**
+ - If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
+ - Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
+3. **Validate Outputs**
+ - If using `extraction_type="schema"`, parse the LLM’s JSON with a Pydantic model for a final validation step.
+4. **Explore Hooks & Automation**
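For the validation step (point 3 above), the docs recommend re-parsing with your Pydantic model; here is a dependency-free sketch of the same idea, with field names borrowed from the earlier `Product` example:

```python
import json

def validate_products(raw: str):
    # Minimal stdlib validation of the LLM's JSON output. A Pydantic
    # model gives richer type coercion and error messages; this just
    # shows the shape of the check.
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of products")
    for item in data:
        missing = [k for k in ("name", "price") if k not in item]
        if missing:
            raise ValueError(f"product missing fields: {missing}")
    return data

raw = '[{"name": "Laptop", "price": "$999.99"}]'
products = validate_products(raw)
print(products[0]["name"])  # Laptop
```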
+
+
+
+# Advanced Features
+
+# Session Management
+- **Performing JavaScript actions before and after crawling.**
+Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
+```python
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+
+async with AsyncWebCrawler() as crawler:
+ session_id = "my_session"
+
+ # Define configurations
+ config1 = CrawlerRunConfig(
+ url="https://example.com/page1", session_id=session_id
+ )
+ config2 = CrawlerRunConfig(
+ url="https://example.com/page2", session_id=session_id
+ )
+
+ # First request
+ result1 = await crawler.arun(config=config1)
+
+ # Subsequent request using the same session
+ result2 = await crawler.arun(config=config2)
+
+ # Clean up when done
+ await crawler.crawler_strategy.kill_session(session_id)
+```
+```python
+import json
+from crawl4ai import AsyncWebCrawler, JsonCssExtractionStrategy
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+async def crawl_dynamic_content():
+ url = "https://github.com/microsoft/TypeScript/commits/main"
+ session_id = "wait_for_session"
+ all_commits = []
+
+ js_next_page = """
+ const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
+ if (commits.length > 0) {
+ window.lastCommit = commits[0].textContent.trim();
+ }
+ const button = document.querySelector('a[data-testid="pagination-next-button"]');
+ if (button) {button.click(); console.log('button clicked') }
+ """
+
+ wait_for = """() => {
+ const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
+ if (commits.length === 0) return false;
+ const firstCommit = commits[0].textContent.trim();
+ return firstCommit !== window.lastCommit;
+ }"""
+
+ schema = {
+ "name": "Commit Extractor",
+ "baseSelector": "li[data-testid='commit-row-item']",
+ "fields": [
+ {
+ "name": "title",
+ "selector": "h4 a",
+ "type": "text",
+ "transform": "strip",
+ },
+ ],
+ }
+ extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+ browser_config = BrowserConfig(
+ verbose=True,
+ headless=False,
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ for page in range(3):
+ crawler_config = CrawlerRunConfig(
+ session_id=session_id,
+ css_selector="li[data-testid='commit-row-item']",
+ extraction_strategy=extraction_strategy,
+ js_code=js_next_page if page > 0 else None,
+ wait_for=wait_for if page > 0 else None,
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS,
+ capture_console_messages=True,
+ )
+
+ result = await crawler.arun(url=url, config=crawler_config)
+
+ if result.console_messages:
+ print(f"Page {page + 1} console messages:", result.console_messages)
+
+ if result.extracted_content:
+ # print(f"Page {page + 1} result:", result.extracted_content)
+ commits = json.loads(result.extracted_content)
+ all_commits.extend(commits)
+ print(f"Page {page + 1}: Found {len(commits)} commits")
+ else:
+ print(f"Page {page + 1}: No content extracted")
+
+ print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+ # Clean up session
+ await crawler.crawler_strategy.kill_session(session_id)
+```
+## Example 1: Basic Session-Based Crawling
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+async def basic_session_crawl():
+ async with AsyncWebCrawler() as crawler:
+ session_id = "dynamic_content_session"
+ url = "https://example.com/dynamic-content"
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
+ css_selector=".content-item",
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(basic_session_crawl())
+```
+1. Reusing the same `session_id` across multiple requests.
+2. Executing JavaScript to load more content dynamically.
+3. Properly closing the session to free resources.
+## Advanced Technique 1: Custom Execution Hooks
+```python
+async def advanced_session_crawl_with_hooks():
+ first_commit = ""
+
+ async def on_execution_started(page):
+ nonlocal first_commit
+ try:
+ while True:
+ await page.wait_for_selector("li.commit-item h4")
+ commit = await page.query_selector("li.commit-item h4")
+                commit = (await commit.evaluate("(element) => element.textContent")).strip()
+ if commit and commit != first_commit:
+ first_commit = commit
+ break
+ await asyncio.sleep(0.5)
+ except Exception as e:
+ print(f"Warning: New content didn't appear: {e}")
+
+ async with AsyncWebCrawler() as crawler:
+ session_id = "commit_session"
+ url = "https://github.com/example/repo/commits/main"
+ crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
+
+ js_next_page = """document.querySelector('a.pagination-next').click();"""
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code=js_next_page if page > 0 else None,
+ css_selector="li.commit-item",
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(advanced_session_crawl_with_hooks())
+```
+## Advanced Technique 2: Integrated JavaScript Execution and Waiting
+```python
+async def integrated_js_and_wait_crawl():
+ async with AsyncWebCrawler() as crawler:
+ session_id = "integrated_session"
+ url = "https://github.com/example/repo/commits/main"
+
+ js_next_page_and_wait = """
+ (async () => {
+ const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
+ const initialCommit = getCurrentCommit();
+ document.querySelector('a.pagination-next').click();
+ while (getCurrentCommit() === initialCommit) {
+ await new Promise(resolve => setTimeout(resolve, 100));
+ }
+ })();
+ """
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code=js_next_page_and_wait if page > 0 else None,
+ css_selector="li.commit-item",
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(integrated_js_and_wait_crawl())
+```
+1. **Authentication Flows**: Login and interact with secured pages.
+2. **Pagination Handling**: Navigate through multiple pages.
+3. **Form Submissions**: Fill forms, submit, and process results.
+4. **Multi-step Processes**: Complete workflows that span multiple actions.
+
+
+# Hooks & Auth in AsyncWebCrawler
+1. **`on_browser_created`** – After browser creation.
+2. **`on_page_context_created`** – After a new context & page are created.
+3. **`before_goto`** – Just before navigating to a page.
+4. **`after_goto`** – Right after navigation completes.
+5. **`on_user_agent_updated`** – Whenever the user agent changes.
+6. **`on_execution_started`** – Once custom JavaScript execution begins.
+7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.
+8. **`before_return_html`** – Right before returning the HTML content.
+**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
+> note "Important Hook Usage Warning"
+> **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
+> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
+## Example: Using Hooks in AsyncWebCrawler
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from playwright.async_api import Page, BrowserContext
+
+async def main():
+ print("🔗 Hooks Example: Demonstrating recommended usage")
+
+ # 1) Configure the browser
+ browser_config = BrowserConfig(
+ headless=True,
+ verbose=True
+ )
+
+ # 2) Configure the crawler run
+ crawler_run_config = CrawlerRunConfig(
+ js_code="window.scrollTo(0, document.body.scrollHeight);",
+ wait_for="body",
+ cache_mode=CacheMode.BYPASS
+ )
+
+ # 3) Create the crawler instance
+ crawler = AsyncWebCrawler(config=browser_config)
+
+ #
+ # Define Hook Functions
+ #
+
+ async def on_browser_created(browser, **kwargs):
+ # Called once the browser instance is created (but no pages or contexts yet)
+ print("[HOOK] on_browser_created - Browser created successfully!")
+ # Typically, do minimal setup here if needed
+ return browser
+
+ async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
+ # Called right after a new page + context are created (ideal for auth or route config).
+ print("[HOOK] on_page_context_created - Setting up page & context.")
+
+ # Example 1: Route filtering (e.g., block images)
+ async def route_filter(route):
+ if route.request.resource_type == "image":
+ print(f"[HOOK] Blocking image request: {route.request.url}")
+ await route.abort()
+ else:
+ await route.continue_()
+
+ await context.route("**", route_filter)
+
+ # Example 2: (Optional) Simulate a login scenario
+ # (We do NOT create or close pages here, just do quick steps if needed)
+ # e.g., await page.goto("https://example.com/login")
+ # e.g., await page.fill("input[name='username']", "testuser")
+ # e.g., await page.fill("input[name='password']", "password123")
+ # e.g., await page.click("button[type='submit']")
+ # e.g., await page.wait_for_selector("#welcome")
+ # e.g., await context.add_cookies([...])
+ # Then continue
+
+ # Example 3: Adjust the viewport
+ await page.set_viewport_size({"width": 1080, "height": 600})
+ return page
+
+ async def before_goto(
+ page: Page, context: BrowserContext, url: str, **kwargs
+ ):
+ # Called before navigating to each URL.
+ print(f"[HOOK] before_goto - About to navigate: {url}")
+ # e.g., inject custom headers
+ await page.set_extra_http_headers({
+ "Custom-Header": "my-value"
+ })
+ return page
+
+ async def after_goto(
+ page: Page, context: BrowserContext,
+ url: str, response, **kwargs
+ ):
+ # Called after navigation completes.
+ print(f"[HOOK] after_goto - Successfully loaded: {url}")
+ # e.g., wait for a certain element if we want to verify
+ try:
+ await page.wait_for_selector('.content', timeout=1000)
+ print("[HOOK] Found .content element!")
+        except Exception:
+ print("[HOOK] .content not found, continuing anyway.")
+ return page
+
+ async def on_user_agent_updated(
+ page: Page, context: BrowserContext,
+ user_agent: str, **kwargs
+ ):
+ # Called whenever the user agent updates.
+ print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
+ return page
+
+ async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
+ # Called after custom JavaScript execution begins.
+ print("[HOOK] on_execution_started - JS code is running!")
+ return page
+
+ async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
+ # Called before final HTML retrieval.
+ print("[HOOK] before_retrieve_html - We can do final actions")
+ # Example: Scroll again
+ await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
+ return page
+
+ async def before_return_html(
+ page: Page, context: BrowserContext, html: str, **kwargs
+ ):
+ # Called just before returning the HTML in the result.
+ print(f"[HOOK] before_return_html - HTML length: {len(html)}")
+ return page
+
+ #
+ # Attach Hooks
+ #
+
+ crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
+ crawler.crawler_strategy.set_hook(
+ "on_page_context_created", on_page_context_created
+ )
+ crawler.crawler_strategy.set_hook("before_goto", before_goto)
+ crawler.crawler_strategy.set_hook("after_goto", after_goto)
+ crawler.crawler_strategy.set_hook(
+ "on_user_agent_updated", on_user_agent_updated
+ )
+ crawler.crawler_strategy.set_hook(
+ "on_execution_started", on_execution_started
+ )
+ crawler.crawler_strategy.set_hook(
+ "before_retrieve_html", before_retrieve_html
+ )
+ crawler.crawler_strategy.set_hook(
+ "before_return_html", before_return_html
+ )
+
+ await crawler.start()
+
+ # 4) Run the crawler on an example page
+ url = "https://example.com"
+ result = await crawler.arun(url, config=crawler_run_config)
+
+ if result.success:
+ print("\nCrawled URL:", result.url)
+ print("HTML length:", len(result.html))
+ else:
+ print("Error:", result.error_message)
+
+ await crawler.close()
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+## Hook Lifecycle Summary
+1. **`on_browser_created`**:
+ - Browser is up, but **no** pages or contexts yet.
+ - Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).
+2. **`on_page_context_created`**:
+ - Perfect for advanced **auth** or route blocking.
+3. **`before_goto`**:
+4. **`after_goto`**:
+5. **`on_user_agent_updated`**:
+ - Whenever the user agent changes (for stealth or different UA modes).
+6. **`on_execution_started`**:
+ - If you set `js_code` or run custom scripts, this runs once your JS is about to start.
+7. **`before_retrieve_html`**:
+8. **`before_return_html`**:
+ - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
+## When to Handle Authentication
+**Recommended**: Use **`on_page_context_created`** if you need to:
+- Navigate to a login page or fill forms
+- Set cookies or localStorage tokens
+- Block resource routes to avoid ads
+This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
+## Additional Considerations
+- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
+- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
+## Conclusion
+- **Browser** creation (light tasks only)
+- **Page** and **context** creation (auth, route blocking)
+- **Navigation** phases
+- **Final HTML** retrieval
+- **Login** or advanced tasks in `on_page_context_created`
+- **Custom headers** or logs in `before_goto` / `after_goto`
+- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
+
+
+
+---
+
+# Quick Reference
+
+## Core Imports
+```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
+from crawl4ai.extraction_strategy import LLMExtractionStrategy, JsonCssExtractionStrategy, CosineStrategy
+```
+
+## Basic Pattern
+```python
+async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(url="https://example.com")
+ print(result.markdown)
+```
+
+## Advanced Pattern
+```python
browser_config = BrowserConfig(headless=True, viewport_width=1920)
strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4", api_token="YOUR_API_KEY"),
    instruction="Extract products with name and price"
)
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    wait_for="css:.content",
    screenshot=True,
    pdf=True,
    extraction_strategy=strategy
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )
+```
+
+## Multi-URL Pattern
+```python
+urls = ["https://example.com/1", "https://example.com/2"]
+results = await crawler.arun_many(urls, config=crawler_config)
+```
+
+---
+
+**End of Crawl4AI SDK Documentation**
diff --git a/mkdocs.yml b/mkdocs.yml
index efc948c3..475fe190 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -7,6 +7,7 @@ docs_dir: docs/md_v2
nav:
- Home: 'index.md'
+ - "📚 Complete SDK Reference": "complete-sdk-reference.md"
- "Ask AI": "core/ask-ai.md"
- "Quick Start": "core/quickstart.md"
- "Code Examples": "core/examples.md"