feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
This commit is contained in:
ntohidi
2025-10-13 12:34:08 +08:00
parent 216019f29a
commit a3f057e19f
6 changed files with 1198 additions and 44 deletions

View File

@@ -6,18 +6,6 @@
- [Option 1: Using Pre-built Docker Hub Images (Recommended)](#option-1-using-pre-built-docker-hub-images-recommended)
- [Option 2: Using Docker Compose](#option-2-using-docker-compose)
- [Option 3: Manual Local Build & Run](#option-3-manual-local-build--run)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [Library Context Endpoint](#library-context-endpoint)
- [MCP (Model Context Protocol) Support](#mcp-model-context-protocol-support)
- [What is MCP?](#what-is-mcp)
- [Connecting via MCP](#connecting-via-mcp)
@@ -25,9 +13,28 @@
- [Available MCP Tools](#available-mcp-tools)
- [Testing MCP Connections](#testing-mcp-connections)
- [MCP Schemas](#mcp-schemas)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [User-Provided Hooks API](#user-provided-hooks-api)
- [Hook Information Endpoint](#hook-information-endpoint)
- [Available Hook Points](#available-hook-points)
- [Using Hooks in Requests](#using-hooks-in-requests)
- [Hook Examples with Real URLs](#hook-examples-with-real-urls)
- [Security Best Practices](#security-best-practices)
- [Hook Response Information](#hook-response-information)
- [Error Handling](#error-handling)
- [Hooks Utility: Function-Based Approach (Python)](#hooks-utility-function-based-approach-python)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [LLM Configuration Examples](#llm-configuration-examples)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Server Configuration](#server-configuration)
- [Understanding config.yml](#understanding-configyml)
- [JWT Authentication](#jwt-authentication)
@@ -832,6 +839,275 @@ else:
> 💡 **Remember**: Always test your hooks on safe, known websites first before using them on production sites. Never crawl sites that you don't have permission to access or that might be malicious.
### Hooks Utility: Function-Based Approach (Python)
For Python developers, Crawl4AI provides a more convenient way to work with hooks using the `hooks_to_string()` utility function and Docker client integration.
#### Why Use Function-Based Hooks?
**String-Based Approach (shown above)**:
```python
hooks_code = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
"""
}
```
**Function-Based Approach (recommended for Python)**:
```python
from crawl4ai import Crawl4aiDockerClient
async def my_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
result = await client.crawl(
["https://example.com"],
hooks={"on_page_context_created": my_hook}
)
```
**Benefits**:
- ✅ Write hooks as regular Python functions
- ✅ Full IDE support (autocomplete, syntax highlighting, type checking)
- ✅ Easy to test and debug
- ✅ Reusable hook libraries
- ✅ Automatic conversion to API format
#### Using the Hooks Utility
The `hooks_to_string()` utility converts Python function objects to the string format required by the API:
```python
from crawl4ai import hooks_to_string
# Define your hooks as functions
async def setup_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.add_cookies([{
"name": "session",
"value": "token",
"domain": ".example.com"
}])
return page
async def scroll_hook(page, context, **kwargs):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
return page
# Convert to string format
hooks_dict = {
"on_page_context_created": setup_hook,
"before_retrieve_html": scroll_hook
}
hooks_string = hooks_to_string(hooks_dict)
# Now use with REST API or Docker client
# hooks_string contains the string representations
```
#### Docker Client with Automatic Conversion
The Docker client automatically detects and converts function objects:
```python
from crawl4ai import Crawl4aiDockerClient
async def auth_hook(page, context, **kwargs):
"""Add authentication cookies"""
await context.add_cookies([{
"name": "auth_token",
"value": "your_token",
"domain": ".example.com"
}])
return page
async def performance_hook(page, context, **kwargs):
"""Block unnecessary resources"""
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
await context.route("**/analytics/*", lambda r: r.abort())
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
# Pass functions directly - automatic conversion!
result = await client.crawl(
["https://example.com"],
hooks={
"on_page_context_created": performance_hook,
"before_goto": auth_hook
},
hooks_timeout=30 # Optional timeout in seconds (1-120)
)
print(f"Success: {result.success}")
print(f"HTML: {len(result.html)} chars")
```
#### Creating Reusable Hook Libraries
Build collections of reusable hooks:
```python
# hooks_library.py
class CrawlHooks:
"""Reusable hook collection for common crawling tasks"""
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda r: r.abort())
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics and tracking scripts"""
tracking_domains = [
"**/google-analytics.com/*",
"**/googletagmanager.com/*",
"**/facebook.com/tr/*",
"**/doubleclick.net/*"
]
for domain in tracking_domains:
await context.route(domain, lambda r: r.abort())
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll to load more content"""
previous_height = 0
for i in range(5): # Max 5 scrolls
current_height = await page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
previous_height = current_height
return page
@staticmethod
async def wait_for_dynamic_content(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
await page.wait_for_timeout(2000)
try:
# Click "Load More" if present
load_more = await page.query_selector('[class*="load-more"]')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
except:
pass
return page
# Use in your application
from hooks_library import CrawlHooks
from crawl4ai import Crawl4aiDockerClient
async def crawl_with_optimizations(url):
async with Crawl4aiDockerClient() as client:
result = await client.crawl(
[url],
hooks={
"on_page_context_created": CrawlHooks.block_images,
"before_retrieve_html": CrawlHooks.scroll_infinite
}
)
return result
```
#### Choosing the Right Approach
| Approach | Best For | IDE Support | Language |
|----------|----------|-------------|----------|
| **String-based** | Non-Python clients, REST APIs, other languages | ❌ None | Any |
| **Function-based** | Python applications, local development | ✅ Full | Python only |
| **Docker Client** | Python apps with automatic conversion | ✅ Full | Python only |
**Recommendation**:
- **Python applications**: Use Docker client with function objects (easiest)
- **Non-Python or REST API**: Use string-based hooks (most flexible)
- **Manual control**: Use `hooks_to_string()` utility (middle ground)
#### Complete Example with Function Hooks
```python
from crawl4ai import Crawl4aiDockerClient, BrowserConfig, CrawlerRunConfig, CacheMode
# Define hooks as regular Python functions
async def setup_environment(page, context, **kwargs):
"""Setup crawling environment"""
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Block resources for speed
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
# Add custom headers
await page.set_extra_http_headers({
"Accept-Language": "en-US",
"X-Custom-Header": "Crawl4AI"
})
print("[HOOK] Environment configured")
return page
async def extract_content(page, context, **kwargs):
"""Extract and prepare content"""
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
# Extract metadata
metadata = await page.evaluate('''() => ({
title: document.title,
links: document.links.length,
images: document.images.length
})''')
print(f"[HOOK] Page metadata: {metadata}")
return page
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
# Configure crawl
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# Crawl with hooks
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=browser_config,
crawler_config=crawler_config,
hooks={
"on_page_context_created": setup_environment,
"before_retrieve_html": extract_content
},
hooks_timeout=30
)
if result.success:
print(f"✅ Crawl successful!")
print(f" URL: {result.url}")
print(f" HTML: {len(result.html)} chars")
print(f" Markdown: {len(result.markdown)} chars")
else:
print(f"❌ Crawl failed: {result.error_message}")
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
#### Additional Resources
- **Comprehensive Examples**: See `/docs/examples/hooks_docker_client_example.py` for Python function-based examples
- **REST API Examples**: See `/docs/examples/hooks_rest_api_example.py` for string-based examples
- **Comparison Guide**: See `/docs/examples/README_HOOKS.md` for detailed comparison
- **Utility Documentation**: See `/docs/hooks-utility-guide.md` for complete guide
---
## Dockerfile Parameters
@@ -892,10 +1168,12 @@ This is the easiest way to translate Python configuration to JSON requests when
Install the SDK: `pip install crawl4ai`
The Python SDK provides a convenient way to interact with the Docker API, including **automatic hook conversion** when using function objects.
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Point to the correct server port
@@ -907,23 +1185,22 @@ async def main():
print("--- Running Non-Streaming Crawl ---")
results = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True), # Use library classes for config aid
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
if results: # client.crawl returns None on failure
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results: # Iterate through the CrawlResultContainer
print(f"URL: {result.url}, Success: {result.success}")
if results:
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
else:
print("Non-streaming crawl failed.")
# Example Streaming crawl
print("\n--- Running Streaming Crawl ---")
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
try:
async for result in await client.crawl( # client.crawl returns an async generator for streaming
async for result in await client.crawl(
["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
browser_config=BrowserConfig(headless=True),
crawler_config=stream_config
@@ -932,17 +1209,56 @@ async def main():
except Exception as e:
print(f"Streaming crawl failed: {e}")
# Example with hooks (Python function objects)
print("\n--- Crawl with Hooks ---")
async def my_hook(page, context, **kwargs):
"""Custom hook to optimize performance"""
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.route("**/*.{png,jpg}", lambda r: r.abort())
print("[HOOK] Page optimized")
return page
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
hooks={"on_page_context_created": my_hook}, # Pass function directly!
hooks_timeout=30
)
print(f"Crawl with hooks success: {result.success}")
# Example Get schema
print("\n--- Getting Schema ---")
schema = await client.get_schema()
print(f"Schema received: {bool(schema)}") # Print whether schema was received
print(f"Schema received: {bool(schema)}")
if __name__ == "__main__":
asyncio.run(main())
```
*(SDK parameters like timeout, verify_ssl etc. remain the same)*
#### SDK Parameters
The Docker client supports the following parameters:
**Client Initialization**:
- `base_url` (str): URL of the Docker server (default: `http://localhost:8000`)
- `timeout` (float): Request timeout in seconds (default: 30.0)
- `verify_ssl` (bool): Verify SSL certificates (default: True)
- `verbose` (bool): Enable verbose logging (default: True)
- `log_file` (Optional[str]): Path to log file (default: None)
**crawl() Method**:
- `urls` (List[str]): List of URLs to crawl
- `browser_config` (Optional[BrowserConfig]): Browser configuration
- `crawler_config` (Optional[CrawlerRunConfig]): Crawler configuration
- `hooks` (Optional[Dict]): Hook functions or strings - **automatically converts function objects!**
- `hooks_timeout` (int): Timeout for each hook execution in seconds (default: 30)
**Returns**:
- Single URL: `CrawlResult` object
- Multiple URLs: `List[CrawlResult]`
- Streaming: `AsyncGenerator[CrawlResult]`
### Second Approach: Direct API Calls
@@ -1352,19 +1668,40 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Using the Python SDK with **automatic hook conversion**
- **Working with hooks** - both string-based (REST API) and function-based (Python SDK)
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment
The new playground interface at `http://localhost:11235/playground` makes it much easier to test configurations and generate the corresponding JSON for API requests.
### Key Features
For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
**Hooks Support**: Crawl4AI offers two approaches for working with hooks:
- **String-based** (REST API): Works with any language, requires manual string formatting
- **Function-based** (Python SDK): Write hooks as regular Python functions with full IDE support and automatic conversion
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
**Playground Interface**: The built-in playground at `http://localhost:11235/playground` makes it easy to test configurations and generate corresponding JSON for API requests.
**MCP Integration**: For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
### Next Steps
1. **Explore Examples**: Check out the comprehensive examples in:
- `/docs/examples/hooks_docker_client_example.py` - Python function-based hooks
- `/docs/examples/hooks_rest_api_example.py` - REST API string-based hooks
- `/docs/examples/README_HOOKS.md` - Comparison and guide
2. **Read Documentation**:
- `/docs/hooks-utility-guide.md` - Complete hooks utility guide
- API documentation for detailed configuration options
3. **Join the Community**:
- GitHub: Report issues and contribute
- Discord: Get help and share your experiences
- Documentation: Comprehensive guides and tutorials
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀