Add hooks_to_string() utility function that converts Python function objects to string representations for the Docker API, enabling developers to write hooks as regular Python functions instead of strings. Core Changes: - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource() - Docker client now accepts both function objects and strings for hooks - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request() - New hooks and hooks_timeout parameters in client.crawl() method Documentation: - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py) - Updated main Docker deployment guide with comprehensive hooks section - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
10 KiB
🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
September 29, 2025 • 8 min read
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
🎯 What's New at a Glance
- Docker Hooks System: Custom Python functions at key pipeline points with function-based API
- Function-Based Hooks: New
hooks_to_string()utility with Docker client auto-conversion - Enhanced LLM Integration: Custom providers with temperature control
- HTTPS Preservation: Secure internal link handling
- Bug Fixes: Resolved multiple community-reported issues
- Improved Docker Error Handling: Better debugging and reliability
🔧 Docker Hooks System: Pipeline Customization
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
Real Example: Authentication & Performance
import requests
# Real working hooks for httpbin.org
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("Hook: Setting up page context")
# Block images to speed up crawling
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
print("Hook: Images blocked")
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print("Hook: Before retrieving HTML")
# Scroll to bottom to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
print("Hook: Scrolled to bottom")
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f"Hook: About to navigate to {url}")
# Add custom headers
await page.set_extra_http_headers({
'X-Test-Header': 'crawl4ai-hooks-test'
})
return page
"""
}
# Test with Docker API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {
"code": hooks_config,
"timeout": 30
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()
if result.get('success'):
print("✅ Hooks executed successfully!")
print(f"Content length: {len(result.get('markdown', ''))} characters")
Available Hook Points:
on_browser_created: Browser setupon_page_context_created: Page context configurationbefore_goto: Pre-navigation setupafter_goto: Post-navigation processingon_user_agent_updated: User agent changeson_execution_started: Crawl initializationbefore_retrieve_html: Pre-extraction processingbefore_return_html: Final HTML processing
Function-Based Hooks API
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
Option 1: Using the hooks_to_string() Utility
from crawl4ai import hooks_to_string
import requests
# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({
'X-Crawl4AI': 'v0.7.5',
'X-Custom-Header': 'my-value'
})
return page
# Convert functions to strings
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Use with REST API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
Option 2: Docker Client with Automatic Conversion (Recommended!)
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
async def before_retrieve_html(page, context, **kwargs):
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_retrieve_html": before_retrieve_html
},
hooks_timeout=30
)
if results and results.success:
print(f"✅ Hooks executed! HTML length: {len(results.html)}")
Benefits of Function-Based Hooks:
- ✅ Full IDE support (autocomplete, syntax highlighting)
- ✅ Type checking and linting
- ✅ Easier to test and debug
- ✅ Reusable across projects
- ✅ Automatic conversion in Docker client
- ✅ No breaking changes - string hooks still work!
🤖 Enhanced LLM Integration
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
Multi-Provider Support
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Test with different providers
async def test_llm_providers():
# OpenAI with custom temperature
openai_strategy = LLMExtractionStrategy(
provider="gemini/gemini-2.5-flash-lite",
api_token="your-api-token",
temperature=0.7, # New in v0.7.5
instruction="Summarize this page in one sentence"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=CrawlerRunConfig(extraction_strategy=openai_strategy)
)
if result.success:
print("✅ LLM extraction completed")
print(result.extracted_content)
# Docker API with enhanced LLM config
llm_payload = {
"url": "https://example.com",
"f": "llm",
"q": "Summarize this page in one sentence.",
"provider": "gemini/gemini-2.5-flash-lite",
"temperature": 0.7
}
response = requests.post("http://localhost:11235/md", json=llm_payload)
New Features:
- Custom
temperatureparameter for creativity control base_urlfor custom API endpoints- Multi-provider environment variable support
- Docker API integration
🔒 HTTPS Preservation
The Problem: Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
Solution: HTTPS preservation maintains secure protocols throughout crawling.
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
async def test_https_preservation():
# Enable HTTPS preservation
url_filter = URLPatternFilter(
patterns=["^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
)
config = CrawlerRunConfig(
exclude_external_links=True,
preserve_https_for_internal_links=True, # New in v0.7.5
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
filter_chain=FilterChain([url_filter])
)
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun(
url="https://quotes.toscrape.com",
config=config
):
# All internal links maintain HTTPS
internal_links = [link['href'] for link in result.links['internal']]
https_links = [link for link in internal_links if link.startswith('https://')]
print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
for link in https_links[:3]:
print(f" → {link}")
🛠️ Bug Fixes and Improvements
Major Fixes
- URL Processing: Fixed '+' sign preservation in query parameters (#1332)
- Proxy Configuration: Enhanced proxy string parsing (old
proxyparameter deprecated) - Docker Error Handling: Comprehensive error messages with status codes
- Memory Management: Fixed leaks in long-running sessions
- JWT Authentication: Fixed Docker JWT validation issues (#1442)
- Playwright Stealth: Fixed stealth features for Playwright integration (#1481)
- API Configuration: Fixed config handling to prevent overriding user-provided settings (#1505)
- Docker Filter Serialization: Resolved JSON encoding errors in deep crawl strategy (#1419)
- LLM Provider Support: Fixed custom LLM provider integration for adaptive crawler (#1291)
- Performance Issues: Resolved backoff strategy failures and timeout handling (#989)
Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization
Configuration Updates
# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")
# New enhanced proxy config
browser_config = BrowserConfig(
proxy_config={
"server": "http://proxy:8080",
"username": "optional-user",
"password": "optional-pass"
}
)
🔄 Breaking Changes
- Python 3.10+ Required: Upgrade from Python 3.9
- Proxy Parameter Deprecated: Use new
proxy_configstructure - New Dependency: Added
cssselectfor better CSS handling
🚀 Get Started
# Install latest version
pip install crawl4ai==0.7.5
# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
Try the Demo:
# Run working examples
python docs/releases_review/demo_v0.7.5.py
Resources:
- 📖 Documentation: docs.crawl4ai.com
- 🐙 GitHub: github.com/unclecode/crawl4ai
- 💬 Discord: discord.gg/crawl4ai
- 🐦 Twitter: @unclecode
Happy crawling! 🕷️