`.
## 7. Putting It All Together: Larger Example
Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
```python
schema = {
"name": "Blog Posts",
"baseSelector": "a.blog-post-card",
"baseFields": [
{"name": "post_url", "type": "attribute", "attribute": "href"}
],
"fields": [
{"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
{"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
{"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
{"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
]
}
```
Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
## 8. Tips & Best Practices
3. **Test** your schema on partial HTML or a test page before a big crawl.
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.
## 9. Schema Generation Utility
1. You're dealing with a new website structure and want a quick starting point
2. You need to extract complex nested data structures
3. You want to avoid the learning curve of CSS/XPath selector syntax
### Using the Schema Generator
The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4 or the open-source Ollama for schema generation:
```python
from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai import LLMConfig
# Sample HTML with product information
html = """
"""
# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
html,
schema_type="css",
llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-openai-token")
)
# Option 2: Using Ollama (open source, no token needed)
xpath_schema = JsonXPathExtractionStrategy.generate_schema(
html,
schema_type="xpath",
llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None) # Not needed for Ollama
)
# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema)
```
### LLM Provider Options
1. **OpenAI GPT-4 (`openai/gpt4o`)**
- Default provider
- Requires an API token
- Generally provides more accurate schemas
- Set via environment variable: `OPENAI_API_KEY`
2. **Ollama (`ollama/llama3.3`)**
- Open source alternative
- No API token required
- Self-hosted option
- Good for development and testing
### Benefits of Schema Generation
### Best Practices
6. **Choose Provider Wisely**:
- Use OpenAI for production-quality schemas
- Use Ollama for development, testing, or when you need a self-hosted solution
## 10. Conclusion
With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:
- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or pattern-based extraction.
- Scale to thousands of pages quickly and reliably.
- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns
# Extracting JSON (LLM)
**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./no-llm-strategies.md) or [`JsonXPathExtractionStrategy`](./no-llm-strategies.md) first. But if you need AI to interpret or reorganize content, read on!
## 1. Why Use an LLM?
## 2. Provider-Agnostic via LiteLLM
```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
- **`provider`**: The `
/` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`base_url`** (optional): If your provider has a custom endpoint.
## 3. How LLM Extraction Works
### 3.1 Flow
1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it’s very long (based on `chunk_token_threshold`, overlap, etc.).
2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
4. **Combining**: The results from each chunk are merged and parsed into JSON.
### 3.2 `extraction_type`
- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.
For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
## 4. Key Parameters
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`llmConfig`** (LlmConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
3. **`extraction_type`** (str): `"schema"` or `"block"`.
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
5. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
6. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
7. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
8. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
- `"markdown"`: The raw markdown (default).
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
- `"html"`: The cleaned or raw HTML.
9. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
10. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
```python
extraction_strategy = LLMExtractionStrategy(
llm_config = LLMConfig(provider="openai/gpt-4", api_token="YOUR_OPENAI_KEY"),
schema=MyModel.model_json_schema(),
extraction_type="schema",
instruction="Extract a list of items from the text with 'name' and 'price' fields.",
chunk_token_threshold=1200,
overlap_rate=0.1,
apply_chunking=True,
input_format="html",
extra_args={"temperature": 0.1, "max_tokens": 1000},
verbose=True
)
```
## 5. Putting It in `CrawlerRunConfig`
**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Here’s a full example:
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
class Product(BaseModel):
name: str
price: str
async def main():
# 1. Define the LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
schema=Product.schema_json(), # Or use model_json_schema()
extraction_type="schema",
instruction="Extract all product objects with 'name' and 'price' from the content.",
chunk_token_threshold=1000,
overlap_rate=0.0,
apply_chunking=True,
input_format="markdown", # or "html", "fit_markdown"
extra_args={"temperature": 0.0, "max_tokens": 800}
)
# 2. Build the crawler config
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS
)
# 3. Create a browser config if needed
browser_cfg = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
# 4. Let's say we want to crawl a single page
result = await crawler.arun(
url="https://example.com/products",
config=crawl_config
)
if result.success:
# 5. The extracted content is presumably JSON
data = json.loads(result.extracted_content)
print("Extracted items:", data)
# 6. Show usage stats
llm_strategy.show_usage() # prints token usage
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
## 6. Chunking Details
### 6.1 `chunk_token_threshold`
If your page is large, you might exceed your LLM’s context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates word→token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
### 6.2 `overlap_rate`
To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
### 6.3 Performance & Parallelism
## 7. Input Format
By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawler’s final markdown** is fed to the LLM. You can change to:
- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
- **`markdown`**: Standard markdown output from the crawler’s `markdown_generator`.
This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
```python
LLMExtractionStrategy(
# ...
input_format="html", # Instead of "markdown" or "fit_markdown"
)
```
## 8. Token Usage & Show Usage
- **`usages`** (list): token usage per chunk or call.
- **`total_usage`**: sum of all chunk calls.
- **`show_usage()`**: prints a usage report (if the provider returns usage data).
```python
llm_strategy = LLMExtractionStrategy(...)
# ...
llm_strategy.show_usage()
# e.g. “Total usage: 1241 tokens across 2 chunk calls”
```
## 9. Example: Building a Knowledge Graph
Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
```python
import os
import json
import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
class Entity(BaseModel):
name: str
description: str
class Relationship(BaseModel):
entity1: Entity
entity2: Entity
description: str
relation_type: str
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
async def main():
# LLM extraction strategy
llm_strat = LLMExtractionStrategy(
llmConfig = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="Extract entities and relationships from the content. Return valid JSON.",
chunk_token_threshold=1400,
apply_chunking=True,
input_format="html",
extra_args={"temperature": 0.1, "max_tokens": 1500}
)
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strat,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
# Example page
url = "https://www.nbcnews.com/business"
result = await crawler.arun(url=url, config=crawl_config)
print("--- LLM RAW RESPONSE ---")
print(result.extracted_content)
print("--- END LLM RAW RESPONSE ---")
if result.success:
with open("kb_result.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
llm_strat.show_usage()
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
- **`input_format="html"`** means we feed HTML to the model.
- **`instruction`** guides the model to output a structured knowledge graph.
## 10. Best Practices & Caveats
4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
## 11. Conclusion
- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
- Monitor token usage with `show_usage()`.
If your site’s data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./no-llm-strategies.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.
1. **Experiment with Different Providers**
- Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
- Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
2. **Performance Tuning**
- If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
- Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
3. **Validate Outputs**
- If using `extraction_type="schema"`, parse the LLM’s JSON with a Pydantic model for a final validation step.
4. **Explore Hooks & Automation**
# Advanced Features
# Session Management
- **Performing JavaScript actions before and after crawling.**
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async with AsyncWebCrawler() as crawler:
session_id = "my_session"
# Define configurations
config1 = CrawlerRunConfig(
url="https://example.com/page1", session_id=session_id
)
config2 = CrawlerRunConfig(
url="https://example.com/page2", session_id=session_id
)
# First request
result1 = await crawler.arun(config=config1)
# Subsequent request using the same session
result2 = await crawler.arun(config=config2)
# Clean up when done
await crawler.crawler_strategy.kill_session(session_id)
```
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "wait_for_session"
all_commits = []
js_next_page = """
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length > 0) {
window.lastCommit = commits[0].textContent.trim();
}
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) {button.click(); console.log('button clicked') }
"""
wait_for = """() => {
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length === 0) return false;
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.lastCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li[data-testid='commit-row-item']",
"fields": [
{
"name": "title",
"selector": "h4 a",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
verbose=True,
headless=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
for page in range(3):
crawler_config = CrawlerRunConfig(
session_id=session_id,
css_selector="li[data-testid='commit-row-item']",
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS,
capture_console_messages=True,
)
result = await crawler.arun(url=url, config=crawler_config)
if result.console_messages:
print(f"Page {page + 1} console messages:", result.console_messages)
if result.extracted_content:
# print(f"Page {page + 1} result:", result.extracted_content)
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
else:
print(f"Page {page + 1}: No content extracted")
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
```
## Example 1: Basic Session-Based Crawling
```python
import asyncio
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode
async def basic_session_crawl():
async with AsyncWebCrawler() as crawler:
session_id = "dynamic_content_session"
url = "https://example.com/dynamic-content"
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
css_selector=".content-item",
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(basic_session_crawl())
```
1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.
## Advanced Technique 1: Custom Execution Hooks
```python
async def advanced_session_crawl_with_hooks():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector("li.commit-item h4")
commit = await page.query_selector("li.commit-item h4")
commit = await commit.evaluate("(element) => element.textContent").strip()
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear: {e}")
async with AsyncWebCrawler() as crawler:
session_id = "commit_session"
url = "https://github.com/example/repo/commits/main"
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
js_next_page = """document.querySelector('a.pagination-next').click();"""
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
js_code=js_next_page if page > 0 else None,
css_selector="li.commit-item",
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(advanced_session_crawl_with_hooks())
```
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
```python
async def integrated_js_and_wait_crawl():
async with AsyncWebCrawler() as crawler:
session_id = "integrated_session"
url = "https://github.com/example/repo/commits/main"
js_next_page_and_wait = """
(async () => {
const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
const initialCommit = getCurrentCommit();
document.querySelector('a.pagination-next').click();
while (getCurrentCommit() === initialCommit) {
await new Promise(resolve => setTimeout(resolve, 100));
}
})();
"""
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
js_code=js_next_page_and_wait if page > 0 else None,
css_selector="li.commit-item",
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(integrated_js_and_wait_crawl())
```
1. **Authentication Flows**: Login and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results.
4. **Multi-step Processes**: Complete workflows that span multiple actions.
# Hooks & Auth in AsyncWebCrawler
1. **`on_browser_created`** – After browser creation.
2. **`on_page_context_created`** – After a new context & page are created.
3. **`before_goto`** – Just before navigating to a page.
4. **`after_goto`** – Right after navigation completes.
5. **`on_user_agent_updated`** – Whenever the user agent changes.
6. **`on_execution_started`** – Once custom JavaScript execution begins.
7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.
8. **`before_return_html`** – Right before returning the HTML content.
**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
> note "Important Hook Usage Warning"
**Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
## Example: Using Hooks in AsyncWebCrawler
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext
async def main():
print("🔗 Hooks Example: Demonstrating recommended usage")
# 1) Configure the browser
browser_config = BrowserConfig(
headless=True,
verbose=True
)
# 2) Configure the crawler run
crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="body",
cache_mode=CacheMode.BYPASS
)
# 3) Create the crawler instance
crawler = AsyncWebCrawler(config=browser_config)
#
# Define Hook Functions
#
async def on_browser_created(browser, **kwargs):
# Called once the browser instance is created (but no pages or contexts yet)
print("[HOOK] on_browser_created - Browser created successfully!")
# Typically, do minimal setup here if needed
return browser
async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
# Called right after a new page + context are created (ideal for auth or route config).
print("[HOOK] on_page_context_created - Setting up page & context.")
# Example 1: Route filtering (e.g., block images)
async def route_filter(route):
if route.request.resource_type == "image":
print(f"[HOOK] Blocking image request: {route.request.url}")
await route.abort()
else:
await route.continue_()
await context.route("**", route_filter)
# Example 2: (Optional) Simulate a login scenario
# (We do NOT create or close pages here, just do quick steps if needed)
# e.g., await page.goto("https://example.com/login")
# e.g., await page.fill("input[name='username']", "testuser")
# e.g., await page.fill("input[name='password']", "password123")
# e.g., await page.click("button[type='submit']")
# e.g., await page.wait_for_selector("#welcome")
# e.g., await context.add_cookies([...])
# Then continue
# Example 3: Adjust the viewport
await page.set_viewport_size({"width": 1080, "height": 600})
return page
async def before_goto(
page: Page, context: BrowserContext, url: str, **kwargs
):
# Called before navigating to each URL.
print(f"[HOOK] before_goto - About to navigate: {url}")
# e.g., inject custom headers
await page.set_extra_http_headers({
"Custom-Header": "my-value"
})
return page
async def after_goto(
page: Page, context: BrowserContext,
url: str, response, **kwargs
):
# Called after navigation completes.
print(f"[HOOK] after_goto - Successfully loaded: {url}")
# e.g., wait for a certain element if we want to verify
try:
await page.wait_for_selector('.content', timeout=1000)
print("[HOOK] Found .content element!")
except:
print("[HOOK] .content not found, continuing anyway.")
return page
async def on_user_agent_updated(
page: Page, context: BrowserContext,
user_agent: str, **kwargs
):
# Called whenever the user agent updates.
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
# Called after custom JavaScript execution begins.
print("[HOOK] on_execution_started - JS code is running!")
return page
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
# Called before final HTML retrieval.
print("[HOOK] before_retrieve_html - We can do final actions")
# Example: Scroll again
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page
async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
# Called just before returning the HTML in the result.
print(f"[HOOK] before_return_html - HTML length: {len(html)}")
return page
#
# Attach Hooks
#
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook(
"on_user_agent_updated", on_user_agent_updated
)
crawler.crawler_strategy.set_hook(
"on_execution_started", on_execution_started
)
crawler.crawler_strategy.set_hook(
"before_retrieve_html", before_retrieve_html
)
crawler.crawler_strategy.set_hook(
"before_return_html", before_return_html
)
await crawler.start()
# 4) Run the crawler on an example page
url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config)
if result.success:
print("\nCrawled URL:", result.url)
print("HTML length:", len(result.html))
else:
print("Error:", result.error_message)
await crawler.close()
if __name__ == "__main__":
asyncio.run(main())
```
## Hook Lifecycle Summary
1. **`on_browser_created`**:
- Browser is up, but **no** pages or contexts yet.
- Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).
2. **`on_page_context_created`**:
- Perfect for advanced **auth** or route blocking.
3. **`before_goto`**:
4. **`after_goto`**:
5. **`on_user_agent_updated`**:
- Whenever the user agent changes (for stealth or different UA modes).
6. **`on_execution_started`**:
- If you set `js_code` or run custom scripts, this runs once your JS is about to start.
7. **`before_retrieve_html`**:
8. **`before_return_html`**:
- The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
## When to Handle Authentication
**Recommended**: Use **`on_page_context_created`** if you need to:
- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads
This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
## Additional Considerations
- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
## Conclusion
- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval
- **Login** or advanced tasks in `on_page_context_created`
- **Custom headers** or logs in `before_goto` / `after_goto`
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
---
# Quick Reference
## Core Imports
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy, JsonCssExtractionStrategy, CosineStrategy
```
## Basic Pattern
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown)
```
## Advanced Pattern
```python
browser_config = BrowserConfig(headless=True, viewport_width=1920)
crawler_config = CrawlerConfig(
cache_mode=CacheMode.BYPASS,
wait_for="css:.content",
screenshot=True,
pdf=True
)
strategy = LLMExtractionStrategy(
provider="openai/gpt-4",
instruction="Extract products with name and price"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=crawler_config,
extraction_strategy=strategy
)
```
## Multi-URL Pattern
```python
urls = ["https://example.com/1", "https://example.com/2"]
results = await crawler.arun_many(urls, config=crawler_config)
```
---
**End of Crawl4AI SDK Documentation**