Compare commits
2 Commits
docker-reb
...
fix/market
| Author | SHA1 | Date |
|---|---|---|
| | 13e116610d | |
| | 97c92c4f62 | |
58
README.md
@@ -27,13 +27,11 @@

 Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

-[✨ Check out latest update v0.7.5](#-recent-updates)
+[✨ Check out latest update v0.7.4](#-recent-updates)

-✨ New in v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
+✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)

-✨ Recent v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
-
-✨ Previous v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
+✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -179,7 +177,7 @@ No rate-limited APIs. No lock-in. Build and own your data pipeline with direct g
 - 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
 - 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
 - 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
-- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
+- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
 - 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
 - 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
 - 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
@@ -546,54 +544,6 @@ async def test_news_crawl():

 ## ✨ Recent Updates

-<details>
-<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>
-
-- **🔧 Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
-- **✨ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
-```python
-from crawl4ai import hooks_to_string
-from crawl4ai.docker_client import Crawl4aiDockerClient
-
-# Define hooks as regular Python functions
-async def on_page_context_created(page, context, **kwargs):
-    """Block images to speed up crawling"""
-    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
-    await page.set_viewport_size({"width": 1920, "height": 1080})
-    return page
-
-async def before_goto(page, context, url, **kwargs):
-    """Add custom headers"""
-    await page.set_extra_http_headers({'X-Crawl4AI': 'v0.7.5'})
-    return page
-
-# Option 1: Use hooks_to_string() utility for REST API
-hooks_code = hooks_to_string({
-    "on_page_context_created": on_page_context_created,
-    "before_goto": before_goto
-})
-
-# Option 2: Docker client with automatic conversion (Recommended)
-client = Crawl4aiDockerClient(base_url="http://localhost:11235")
-results = await client.crawl(
-    urls=["https://httpbin.org/html"],
-    hooks={
-        "on_page_context_created": on_page_context_created,
-        "before_goto": before_goto
-    }
-)
-# ✓ Full IDE support, type checking, and reusability!
-```
-
-- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
-- **🔒 HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
-- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
-- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
-
-[Full v0.7.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
-
-</details>
-
 <details>
 <summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>

@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.7.5"
+__version__ = "0.7.4"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None
|
||||
@@ -10,6 +10,7 @@ Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Perform
|
||||
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
@@ -157,6 +158,40 @@ async with AsyncWebCrawler() as crawler:
|
||||
- **Monitoring Systems**: Faster health checks and status page monitoring
|
||||
- **Data Aggregation**: Improved performance for real-time data collection
|
||||
|
||||
## 🧹 Memory Management Refactor: Cleaner Architecture
|
||||
|
||||
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
|
||||
|
||||
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
|
||||
|
||||
### Improved Memory Handling
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler
# All memory utilities now consolidated
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor

# Enhanced memory monitoring
monitor = MemoryMonitor()
monitor.start_monitoring()

async with AsyncWebCrawler() as crawler:
    # Memory-efficient batch processing
    results = await crawler.arun_many(large_url_list)

    # Get accurate memory metrics
    memory_usage = get_true_memory_usage_percent()
    memory_report = monitor.get_report()

    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
    print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Production Stability**: More reliable memory tracking and management
|
||||
- **Code Maintainability**: Cleaner architecture for easier debugging
|
||||
- **Import Clarity**: Resolved potential conflicts and import issues
|
||||
- **Developer Experience**: Simpler API for memory monitoring
|
||||
|
||||
## 🔧 Critical Stability Fixes
|
||||
|
||||
### Browser Manager Race Condition Resolution
|
||||
|
||||
@@ -1,318 +0,0 @@
|
||||
# 🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
|
||||
|
||||
*September 29, 2025 • 8 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API
|
||||
- **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion
|
||||
- **Enhanced LLM Integration**: Custom providers with temperature control
|
||||
- **HTTPS Preservation**: Secure internal link handling
|
||||
- **Bug Fixes**: Resolved multiple community-reported issues
|
||||
- **Improved Docker Error Handling**: Better debugging and reliability
|
||||
|
||||
## 🔧 Docker Hooks System: Pipeline Customization
|
||||
|
||||
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
|
||||
|
||||
### Real Example: Authentication & Performance
|
||||
|
||||
```python
import requests

# Real working hooks for httpbin.org
hooks_config = {
    "on_page_context_created": """
async def hook(page, context, **kwargs):
    print("Hook: Setting up page context")
    # Block images to speed up crawling
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    print("Hook: Images blocked")
    return page
""",

    "before_retrieve_html": """
async def hook(page, context, **kwargs):
    print("Hook: Before retrieving HTML")
    # Scroll to bottom to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    print("Hook: Scrolled to bottom")
    return page
""",

    "before_goto": """
async def hook(page, context, url, **kwargs):
    print(f"Hook: About to navigate to {url}")
    # Add custom headers
    await page.set_extra_http_headers({
        'X-Test-Header': 'crawl4ai-hooks-test'
    })
    return page
"""
}

# Test with Docker API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {
        "code": hooks_config,
        "timeout": 30
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()

if result.get('success'):
    print("✅ Hooks executed successfully!")
    print(f"Content length: {len(result.get('markdown', ''))} characters")
```
|
||||
|
||||
**Available Hook Points:**
|
||||
- `on_browser_created`: Browser setup
|
||||
- `on_page_context_created`: Page context configuration
|
||||
- `before_goto`: Pre-navigation setup
|
||||
- `after_goto`: Post-navigation processing
|
||||
- `on_user_agent_updated`: User agent changes
|
||||
- `on_execution_started`: Crawl initialization
|
||||
- `before_retrieve_html`: Pre-extraction processing
|
||||
- `before_return_html`: Final HTML processing
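
A misspelled hook name is easy to miss when hooks are passed as dictionary keys. The sketch below checks a hooks dictionary against the eight hook points listed above before any request is sent. `HOOK_POINTS` and `validate_hook_names` are hypothetical helpers written for this illustration and are not part of Crawl4AI.

```python
# Hook points copied from the list above.
HOOK_POINTS = frozenset({
    "on_browser_created",
    "on_page_context_created",
    "before_goto",
    "after_goto",
    "on_user_agent_updated",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
})


def validate_hook_names(hooks: dict) -> list:
    """Return the sorted keys of `hooks` that are not known hook points."""
    return sorted(name for name in hooks if name not in HOOK_POINTS)


# Hypothetical input: one valid key and one typo.
unknown = validate_hook_names({"before_goto": "...", "before_gotoo": "..."})
print(unknown)
```

An empty list means every key matches a documented hook point.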
|
||||
|
||||
### Function-Based Hooks API
|
||||
|
||||
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
|
||||
|
||||
**Option 1: Using the `hooks_to_string()` Utility**
|
||||
|
||||
```python
from crawl4ai import hooks_to_string
import requests

# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
    """Block images to speed up crawling"""
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def before_goto(page, context, url, **kwargs):
    """Add custom headers"""
    await page.set_extra_http_headers({
        'X-Crawl4AI': 'v0.7.5',
        'X-Custom-Header': 'my-value'
    })
    return page

# Convert functions to strings
hooks_code = hooks_to_string({
    "on_page_context_created": on_page_context_created,
    "before_goto": before_goto
})

# Use with REST API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
|
||||
|
||||
**Option 2: Docker Client with Automatic Conversion (Recommended!)**
|
||||
|
||||
```python
from crawl4ai.docker_client import Crawl4aiDockerClient

# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    return page

async def before_retrieve_html(page, context, **kwargs):
    # Scroll to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    return page

# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")

results = await client.crawl(
    urls=["https://httpbin.org/html"],
    hooks={
        "on_page_context_created": on_page_context_created,
        "before_retrieve_html": before_retrieve_html
    },
    hooks_timeout=30
)

if results and results.success:
    print(f"✅ Hooks executed! HTML length: {len(results.html)}")
```
|
||||
|
||||
**Benefits of Function-Based Hooks:**
|
||||
- ✅ Full IDE support (autocomplete, syntax highlighting)
|
||||
- ✅ Type checking and linting
|
||||
- ✅ Easier to test and debug
|
||||
- ✅ Reusable across projects
|
||||
- ✅ Automatic conversion in Docker client
|
||||
- ✅ No breaking changes - string hooks still work!
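
To illustrate the "easier to test and debug" point: because a function-based hook is an ordinary coroutine, it can be exercised without a browser or a running container. The sketch below assumes a hook with the same signature as the examples above. `FakePage` is a hypothetical stand-in that only records headers; it is not a Playwright or Crawl4AI class.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page; records the headers a hook sets."""

    def __init__(self):
        self.headers = {}

    async def set_extra_http_headers(self, headers):
        self.headers.update(headers)


async def before_goto(page, context, url, **kwargs):
    """Add a custom header before navigation (same shape as the hooks above)."""
    await page.set_extra_http_headers({"X-Crawl4AI": "v0.7.5"})
    return page


def run_hook_with_fake_page():
    """Run the hook once against a FakePage and return (page, hook return value)."""
    page = FakePage()
    returned = asyncio.run(
        before_goto(page, context=None, url="https://httpbin.org/html")
    )
    return page, returned


page, returned = run_hook_with_fake_page()
print(page.headers)
```

A string-based hook would need to be parsed or executed before it could be checked this way.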
|
||||
|
||||
## 🤖 Enhanced LLM Integration
|
||||
|
||||
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
|
||||
|
||||
### Multi-Provider Support
|
||||
|
||||
```python
import requests

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Test with different providers
async def test_llm_providers():
    # Gemini with custom temperature
    gemini_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-2.5-flash-lite",
        api_token="your-api-token",
        temperature=0.7,  # New in v0.7.5
        instruction="Summarize this page in one sentence"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=gemini_strategy)
        )

    if result.success:
        print("✅ LLM extraction completed")
        print(result.extracted_content)

# Docker API with enhanced LLM config
llm_payload = {
    "url": "https://example.com",
    "f": "llm",
    "q": "Summarize this page in one sentence.",
    "provider": "gemini/gemini-2.5-flash-lite",
    "temperature": 0.7
}

response = requests.post("http://localhost:11235/md", json=llm_payload)
```
|
||||
|
||||
**New Features:**
|
||||
- Custom `temperature` parameter for creativity control
|
||||
- `base_url` for custom API endpoints
|
||||
- Multi-provider environment variable support
|
||||
- Docker API integration
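
The request body for the `/md` endpoint can be assembled by a small helper so that optional fields are only sent when set. This is a minimal sketch: the field names (`url`, `f`, `q`, `provider`, `temperature`) are taken from the example above, while `build_md_payload` and its non-negative temperature check are assumptions made for illustration, not Crawl4AI behavior.

```python
def build_md_payload(url, query, provider, temperature=None):
    """Build a JSON body for the /md endpoint using the fields shown above."""
    if temperature is not None and temperature < 0:
        # Assumption: a negative temperature is never meaningful.
        raise ValueError("temperature must be non-negative")
    payload = {"url": url, "f": "llm", "q": query, "provider": provider}
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


llm_payload = build_md_payload(
    "https://example.com",
    "Summarize this page in one sentence.",
    "gemini/gemini-2.5-flash-lite",
    temperature=0.7,
)
print(llm_payload)
```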
|
||||
|
||||
## 🔒 HTTPS Preservation
|
||||
|
||||
**The Problem:** Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
|
||||
|
||||
**Solution:** HTTPS preservation maintains secure protocols throughout crawling.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy

async def test_https_preservation():
    # Enable HTTPS preservation
    url_filter = URLPatternFilter(
        # Raw string so the regex backslashes are not treated as string escapes
        patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # New in v0.7.5
        stream=True,  # Stream results so the `async for` loop below can consume them
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        ):
            # All internal links maintain HTTPS
            internal_links = [link['href'] for link in result.links['internal']]
            https_links = [link for link in internal_links if link.startswith('https://')]

            print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
            for link in https_links[:3]:
                print(f"  → {link}")
```
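
The idea behind the option above can be illustrated with the standard library alone. The sketch below upgrades a link to HTTPS only when the page itself was served over HTTPS and the link points at the same host. `preserve_https` is a hypothetical function written for this illustration; it is not Crawl4AI's implementation and may differ from it in edge cases.

```python
from urllib.parse import urlsplit, urlunsplit


def preserve_https(link: str, page_url: str) -> str:
    """Upgrade `link` to HTTPS when it is an internal HTTP link of an HTTPS page."""
    page = urlsplit(page_url)
    target = urlsplit(link)
    same_host = target.hostname == page.hostname
    if page.scheme == "https" and target.scheme == "http" and same_host:
        return urlunsplit(target._replace(scheme="https"))
    return link


print(preserve_https("http://quotes.toscrape.com/page/2/", "https://quotes.toscrape.com"))
print(preserve_https("http://example.com/", "https://quotes.toscrape.com"))
```

External links are returned unchanged, since rewriting their scheme could point at a host that does not serve HTTPS.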
|
||||
|
||||
## 🛠️ Bug Fixes and Improvements
|
||||
|
||||
### Major Fixes
|
||||
- **URL Processing**: Fixed '+' sign preservation in query parameters (#1332)
|
||||
- **Proxy Configuration**: Enhanced proxy string parsing (old `proxy` parameter deprecated)
|
||||
- **Docker Error Handling**: Comprehensive error messages with status codes
|
||||
- **Memory Management**: Fixed leaks in long-running sessions
|
||||
- **JWT Authentication**: Fixed Docker JWT validation issues (#1442)
|
||||
- **Playwright Stealth**: Fixed stealth features for Playwright integration (#1481)
|
||||
- **API Configuration**: Fixed config handling to prevent overriding user-provided settings (#1505)
|
||||
- **Docker Filter Serialization**: Resolved JSON encoding errors in deep crawl strategy (#1419)
|
||||
- **LLM Provider Support**: Fixed custom LLM provider integration for adaptive crawler (#1291)
|
||||
- **Performance Issues**: Resolved backoff strategy failures and timeout handling (#989)
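
One common way a '+' gets lost in a query parameter is form-style decoding, which turns '+' into a space. The release notes do not say whether this is the exact cause of #1332, so the sketch below only demonstrates the general pitfall using the standard library.

```python
from urllib.parse import unquote, unquote_plus

raw_value = "a+b"

# Form-style decoding treats '+' as an encoded space.
form_decoded = unquote_plus(raw_value)

# Plain percent-decoding leaves a literal '+' untouched.
plain_decoded = unquote(raw_value)

print(repr(form_decoded), repr(plain_decoded))
```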
|
||||
|
||||
### Community-Reported Issues Fixed
|
||||
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
|
||||
- Fixed browser configuration reference errors
|
||||
- Resolved dependency conflicts with cssselect
|
||||
- Improved error messaging for failed authentications
|
||||
- Enhanced compatibility with various proxy configurations
|
||||
- Fixed edge cases in URL normalization
|
||||
|
||||
### Configuration Updates
|
||||
```python
from crawl4ai import BrowserConfig

# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# New enhanced proxy config
browser_config = BrowserConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "optional-user",
        "password": "optional-pass"
    }
)
```
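
When migrating existing settings, a proxy URL string can be split into the `server`, `username`, and `password` keys shown above. This is a minimal sketch using only the standard library. `proxy_string_to_config` is a hypothetical helper for illustration, not a Crawl4AI function.

```python
from urllib.parse import urlsplit


def proxy_string_to_config(proxy: str) -> dict:
    """Split 'scheme://user:pass@host:port' into the proxy_config keys shown above."""
    parts = urlsplit(proxy)
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    config = {"server": f"{parts.scheme}://{host}{port}"}
    # Credentials are optional, so only include them when present.
    if parts.username:
        config["username"] = parts.username
    if parts.password:
        config["password"] = parts.password
    return config


print(proxy_string_to_config("http://user:pass@proxy:8080"))
```

The resulting dictionary could then be passed as `proxy_config=` in place of the deprecated `proxy` string.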
|
||||
|
||||
## 🔄 Breaking Changes
|
||||
|
||||
1. **Python 3.10+ Required**: Upgrade from Python 3.9
|
||||
2. **Proxy Parameter Deprecated**: Use new `proxy_config` structure
|
||||
3. **New Dependency**: Added `cssselect` for better CSS handling
|
||||
|
||||
## 🚀 Get Started
|
||||
|
||||
```bash
|
||||
# Install latest version
|
||||
pip install crawl4ai==0.7.5
|
||||
|
||||
# Docker deployment
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
**Try the Demo:**
|
||||
```bash
|
||||
# Run working examples
|
||||
python docs/releases_review/demo_v0.7.5.py
|
||||
```
|
||||
|
||||
**Resources:**
|
||||
- 📖 Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com)
|
||||
- 🐙 GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- 💬 Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
|
||||
- 🐦 Twitter: [@unclecode](https://x.com/unclecode)
|
||||
|
||||
Happy crawling! 🕷️
|
||||
@@ -20,26 +20,17 @@ Ever wondered why your AI coding assistant struggles with your library despite c
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [Crawl4AI v0.7.5 – The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
|
||||
*September 29, 2025*
|
||||
|
||||
Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.
|
||||
|
||||
Key highlights:
|
||||
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
|
||||
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
|
||||
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
|
||||
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
|
||||
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.5.md)
|
||||
|
||||
## Recent Releases
|
||||
|
||||
### [Crawl4AI v0.7.4 – The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
|
||||
*August 17, 2025*
|
||||
|
||||
Revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes.
|
||||
Crawl4AI v0.7.4 introduces revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
|
||||
|
||||
Key highlights:
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing issue in arun_many for fast-completing tasks
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.4.md)
|
||||
|
||||
|
||||
@@ -1,318 +0,0 @@
|
||||
# 🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
|
||||
|
||||
*September 29, 2025 • 8 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API
|
||||
- **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion
|
||||
- **Enhanced LLM Integration**: Custom providers with temperature control
|
||||
- **HTTPS Preservation**: Secure internal link handling
|
||||
- **Bug Fixes**: Resolved multiple community-reported issues
|
||||
- **Improved Docker Error Handling**: Better debugging and reliability
|
||||
|
||||
## 🔧 Docker Hooks System: Pipeline Customization
|
||||
|
||||
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
|
||||
|
||||
### Real Example: Authentication & Performance
|
||||
|
||||
```python
import requests

# Real working hooks for httpbin.org
hooks_config = {
    "on_page_context_created": """
async def hook(page, context, **kwargs):
    print("Hook: Setting up page context")
    # Block images to speed up crawling
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    print("Hook: Images blocked")
    return page
""",

    "before_retrieve_html": """
async def hook(page, context, **kwargs):
    print("Hook: Before retrieving HTML")
    # Scroll to bottom to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    print("Hook: Scrolled to bottom")
    return page
""",

    "before_goto": """
async def hook(page, context, url, **kwargs):
    print(f"Hook: About to navigate to {url}")
    # Add custom headers
    await page.set_extra_http_headers({
        'X-Test-Header': 'crawl4ai-hooks-test'
    })
    return page
"""
}

# Test with Docker API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {
        "code": hooks_config,
        "timeout": 30
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()

if result.get('success'):
    print("✅ Hooks executed successfully!")
    print(f"Content length: {len(result.get('markdown', ''))} characters")
```
|
||||
|
||||
**Available Hook Points:**
|
||||
- `on_browser_created`: Browser setup
|
||||
- `on_page_context_created`: Page context configuration
|
||||
- `before_goto`: Pre-navigation setup
|
||||
- `after_goto`: Post-navigation processing
|
||||
- `on_user_agent_updated`: User agent changes
|
||||
- `on_execution_started`: Crawl initialization
|
||||
- `before_retrieve_html`: Pre-extraction processing
|
||||
- `before_return_html`: Final HTML processing
|
||||
|
||||
### Function-Based Hooks API
|
||||
|
||||
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
|
||||
|
||||
**Option 1: Using the `hooks_to_string()` Utility**
|
||||
|
||||
```python
from crawl4ai import hooks_to_string
import requests

# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
    """Block images to speed up crawling"""
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def before_goto(page, context, url, **kwargs):
    """Add custom headers"""
    await page.set_extra_http_headers({
        'X-Crawl4AI': 'v0.7.5',
        'X-Custom-Header': 'my-value'
    })
    return page

# Convert functions to strings
hooks_code = hooks_to_string({
    "on_page_context_created": on_page_context_created,
    "before_goto": before_goto
})

# Use with REST API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
|
||||
|
||||
**Option 2: Docker Client with Automatic Conversion (Recommended!)**
|
||||
|
||||
```python
from crawl4ai.docker_client import Crawl4aiDockerClient

# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    return page

async def before_retrieve_html(page, context, **kwargs):
    # Scroll to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    return page

# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")

results = await client.crawl(
    urls=["https://httpbin.org/html"],
    hooks={
        "on_page_context_created": on_page_context_created,
        "before_retrieve_html": before_retrieve_html
    },
    hooks_timeout=30
)

if results and results.success:
    print(f"✅ Hooks executed! HTML length: {len(results.html)}")
```
|
||||
|
||||
**Benefits of Function-Based Hooks:**
|
||||
- ✅ Full IDE support (autocomplete, syntax highlighting)
|
||||
- ✅ Type checking and linting
|
||||
- ✅ Easier to test and debug
|
||||
- ✅ Reusable across projects
|
||||
- ✅ Automatic conversion in Docker client
|
||||
- ✅ No breaking changes - string hooks still work!
|
||||
|
||||
## 🤖 Enhanced LLM Integration
|
||||
|
||||
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
|
||||
|
||||
### Multi-Provider Support
|
||||
|
||||
```python
import requests

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Test with different providers
async def test_llm_providers():
    # Gemini with custom temperature
    gemini_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-2.5-flash-lite",
        api_token="your-api-token",
        temperature=0.7,  # New in v0.7.5
        instruction="Summarize this page in one sentence"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=gemini_strategy)
        )

    if result.success:
        print("✅ LLM extraction completed")
        print(result.extracted_content)

# Docker API with enhanced LLM config
llm_payload = {
    "url": "https://example.com",
    "f": "llm",
    "q": "Summarize this page in one sentence.",
    "provider": "gemini/gemini-2.5-flash-lite",
    "temperature": 0.7
}

response = requests.post("http://localhost:11235/md", json=llm_payload)
```
|
||||
|
||||
**New Features:**
|
||||
- Custom `temperature` parameter for creativity control
|
||||
- `base_url` for custom API endpoints
|
||||
- Multi-provider environment variable support
|
||||
- Docker API integration
|
||||
|
||||
## 🔒 HTTPS Preservation
|
||||
|
||||
**The Problem:** Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
|
||||
|
||||
**Solution:** HTTPS preservation maintains secure protocols throughout crawling.
|
||||
|
||||

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy

async def test_https_preservation():
    # Restrict the deep crawl to quotes.toscrape.com
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        stream=True,  # Stream results so arun() can be consumed with `async for`
        preserve_https_for_internal_links=True,  # New in v0.7.5
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        ):
            # Count the internal links that kept the https:// scheme
            internal_links = [link['href'] for link in result.links['internal']]
            https_links = [link for link in internal_links if link.startswith('https://')]

            print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
            for link in https_links[:3]:
                print(f"  → {link}")
```
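
To make the idea concrete, here is a minimal standard-library sketch of what "preserving HTTPS for internal links" could look like. It is not Crawl4AI's implementation: the function name `preserve_https` is hypothetical, and the sketch assumes that "internal" means "same hostname as the page being crawled" and that links carry no explicit port.

```python
from urllib.parse import urlsplit


def preserve_https(base_url, link):
    """Upgrade an http:// link to https:// when it targets the same host
    as an HTTPS base URL; return every other link unchanged."""
    base = urlsplit(base_url)
    target = urlsplit(link)
    same_host = target.hostname == base.hostname
    if base.scheme == "https" and target.scheme == "http" and same_host:
        # SplitResult is a namedtuple, so _replace swaps only the scheme.
        return target._replace(scheme="https").geturl()
    return link


print(preserve_https("https://quotes.toscrape.com", "http://quotes.toscrape.com/page/2/"))
print(preserve_https("https://quotes.toscrape.com", "http://example.com/"))
```

External links and pages that were fetched over plain HTTP are left alone, so the upgrade only applies where the crawl already started on HTTPS.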

## 🛠️ Bug Fixes and Improvements

### Major Fixes
- **URL Processing**: Fixed '+' sign preservation in query parameters (#1332)
- **Proxy Configuration**: Enhanced proxy string parsing (old `proxy` parameter deprecated)
- **Docker Error Handling**: Comprehensive error messages with status codes
- **Memory Management**: Fixed leaks in long-running sessions
- **JWT Authentication**: Fixed Docker JWT validation issues (#1442)
- **Playwright Stealth**: Fixed stealth features for Playwright integration (#1481)
- **API Configuration**: Fixed config handling to prevent overriding user-provided settings (#1505)
- **Docker Filter Serialization**: Resolved JSON encoding errors in deep crawl strategy (#1419)
- **LLM Provider Support**: Fixed custom LLM provider integration for adaptive crawler (#1291)
- **Performance Issues**: Resolved backoff strategy failures and timeout handling (#989)
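
The '+' fix in the list above touches a common pitfall, which the short sketch below illustrates with the standard library. It shows the general behaviour of form-style versus plain percent-decoding; the actual change made for #1332 is not shown in this document, so treat this as background rather than a description of the patch.

```python
from urllib.parse import quote, unquote, unquote_plus

value = "1+1=2"

# Form-style decoding treats '+' as an encoded space, so the sign is lost.
form_decoded = unquote_plus(value)

# Plain percent-decoding leaves a literal '+' untouched.
plain_decoded = unquote(value)

# Encoding '+' as %2B lets it survive a later form-style decode.
encoded = quote(value, safe="")
round_tripped = unquote_plus(encoded)

print(repr(form_decoded), repr(plain_decoded), repr(encoded), repr(round_tripped))
```

A URL normalizer that decodes and re-encodes query strings has to pick one of these conventions consistently, otherwise a literal '+' can silently turn into a space.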

### Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization

### Configuration Updates
```python
from crawl4ai import BrowserConfig

# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# New enhanced proxy config
browser_config = BrowserConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "optional-user",
        "password": "optional-pass"
    }
)
```
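
When migrating existing settings, a proxy URL string can be split into the dictionary shape shown above. The sketch below is one way to do that with the standard library; `proxy_string_to_config` is a hypothetical helper, not a Crawl4AI function, and it assumes credentials are embedded in the URL as `user:pass@host` without percent-encoding.

```python
from urllib.parse import urlsplit


def proxy_string_to_config(proxy):
    """Split 'scheme://user:pass@host:port' into a proxy_config-style dict
    (hypothetical helper for illustration)."""
    parts = urlsplit(proxy)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a proxy URL: {proxy!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    config = {"server": server}
    # Only include credentials that are actually present in the URL.
    if parts.username:
        config["username"] = parts.username
    if parts.password:
        config["password"] = parts.password
    return config


print(proxy_string_to_config("http://optional-user:optional-pass@proxy:8080"))
print(proxy_string_to_config("http://proxy:8080"))
```

The returned dictionary has the same keys as the `proxy_config` example, so it could be passed as `BrowserConfig(proxy_config=...)`.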

## 🔄 Breaking Changes

1. **Python 3.10+ Required**: Python 3.9 is no longer supported; upgrade to 3.10 or later
2. **Proxy Parameter Deprecated**: Use the new `proxy_config` structure instead of `proxy`
3. **New Dependency**: Added `cssselect` for better CSS handling
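
A script that depends on this release may want to fail early on older interpreters. The following is a small sketch of such a guard; the function name `python_is_supported` and the constant `MIN_PYTHON` are illustrative choices, not part of Crawl4AI.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the breaking changes


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter meets the 3.10+ requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("Python version OK")
else:
    print(f"Python {sys.version.split()[0]} is too old; 3.10+ is required")
```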

## 🚀 Get Started

```bash
# Install latest version
pip install crawl4ai==0.7.5

# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
```

**Try the Demo:**
```bash
# Run working examples
python docs/releases_review/demo_v0.7.5.py
```

**Resources:**
- 📖 Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- 🐙 GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 💬 Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- 🐦 Twitter: [@unclecode](https://x.com/unclecode)

Happy crawling! 🕷️
@@ -529,8 +529,19 @@ class AdminDashboard {
|
||||
</label>
|
||||
</div>
|
||||
<div class="form-group full-width">
|
||||
<label>Integration Guide</label>
|
||||
<textarea id="form-integration" rows="10">${app?.integration_guide || ''}</textarea>
|
||||
<label>Long Description (Markdown - Overview tab)</label>
|
||||
<textarea id="form-long-description" rows="10" placeholder="Enter detailed description with markdown formatting...">${app?.long_description || ''}</textarea>
|
||||
<small>Markdown support: **bold**, *italic*, [links](url), # headers, code blocks, lists</small>
|
||||
</div>
|
||||
<div class="form-group full-width">
|
||||
<label>Integration Guide (Markdown - Integration tab)</label>
|
||||
<textarea id="form-integration" rows="20" placeholder="Enter integration guide with installation, examples, and code snippets using markdown...">${app?.integration_guide || ''}</textarea>
|
||||
<small>Single markdown field with installation, examples, and complete guide. Code blocks get auto copy buttons.</small>
|
||||
</div>
|
||||
<div class="form-group full-width">
|
||||
<label>Documentation (Markdown - Documentation tab)</label>
|
||||
<textarea id="form-documentation" rows="20" placeholder="Enter documentation with API reference, examples, and best practices using markdown...">${app?.documentation || ''}</textarea>
|
||||
<small>Full documentation with API reference, examples, best practices, etc.</small>
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
@@ -712,7 +723,9 @@ class AdminDashboard {
|
||||
data.contact_email = document.getElementById('form-email').value;
|
||||
data.featured = document.getElementById('form-featured').checked ? 1 : 0;
|
||||
data.sponsored = document.getElementById('form-sponsored').checked ? 1 : 0;
|
||||
data.long_description = document.getElementById('form-long-description').value;
|
||||
data.integration_guide = document.getElementById('form-integration').value;
|
||||
data.documentation = document.getElementById('form-documentation').value;
|
||||
} else if (type === 'articles') {
|
||||
data.title = document.getElementById('form-title').value;
|
||||
data.slug = this.generateSlug(data.title);
|
||||
|
||||
@@ -510,6 +510,31 @@
|
||||
line-height: 1.5;
|
||||
}
|
||||
|
||||
/* Markdown rendered code blocks */
|
||||
.integration-content pre,
|
||||
.docs-content pre {
|
||||
background: var(--bg-dark);
|
||||
border: 1px solid var(--border-color);
|
||||
margin: 1rem 0;
|
||||
padding: 1rem;
|
||||
padding-top: 2.5rem; /* Space for copy button */
|
||||
overflow-x: auto;
|
||||
position: relative;
|
||||
max-height: none; /* Remove any height restrictions */
|
||||
height: auto; /* Allow content to expand */
|
||||
}
|
||||
|
||||
.integration-content pre code,
|
||||
.docs-content pre code {
|
||||
background: transparent;
|
||||
padding: 0;
|
||||
color: var(--text-secondary);
|
||||
font-size: 0.875rem;
|
||||
line-height: 1.5;
|
||||
white-space: pre; /* Preserve whitespace and line breaks */
|
||||
display: block;
|
||||
}
|
||||
|
||||
/* Feature Grid */
|
||||
.feature-grid {
|
||||
display: grid;
|
||||
|
||||
@@ -80,20 +80,7 @@
|
||||
<section id="overview-tab" class="tab-content active">
|
||||
<div class="overview-columns">
|
||||
<div class="overview-main">
|
||||
<h2>Overview</h2>
|
||||
<div id="app-overview">Overview content goes here.</div>
|
||||
|
||||
<h3>Key Features</h3>
|
||||
<ul id="app-features" class="features-list">
|
||||
<li>Feature 1</li>
|
||||
<li>Feature 2</li>
|
||||
<li>Feature 3</li>
|
||||
</ul>
|
||||
|
||||
<h3>Use Cases</h3>
|
||||
<div id="app-use-cases" class="use-cases">
|
||||
<p>Describe how this app can help your workflow.</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<aside class="sidebar">
|
||||
@@ -142,33 +129,14 @@
|
||||
</section>
|
||||
|
||||
<section id="integration-tab" class="tab-content">
|
||||
<div class="integration-content">
|
||||
<h2>Integration Guide</h2>
|
||||
|
||||
<h3>Installation</h3>
|
||||
<div class="code-block">
|
||||
<pre><code id="install-code"># Installation instructions will appear here</code></pre>
|
||||
</div>
|
||||
|
||||
<h3>Basic Usage</h3>
|
||||
<div class="code-block">
|
||||
<pre><code id="usage-code"># Usage example will appear here</code></pre>
|
||||
</div>
|
||||
|
||||
<h3>Complete Integration Example</h3>
|
||||
<div class="code-block">
|
||||
<button class="copy-btn" id="copy-integration">Copy</button>
|
||||
<pre><code id="integration-code"># Complete integration guide will appear here</code></pre>
|
||||
</div>
|
||||
<div class="integration-content" id="app-integration">
|
||||
<!-- Integration guide markdown content will be rendered here -->
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="docs-tab" class="tab-content">
|
||||
<div class="docs-content">
|
||||
<h2>Documentation</h2>
|
||||
<div id="app-docs" class="doc-sections">
|
||||
<p>Documentation coming soon.</p>
|
||||
</div>
|
||||
<div class="docs-content" id="app-docs">
|
||||
<!-- Documentation markdown content will be rendered here -->
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
||||
@@ -123,144 +123,132 @@ class AppDetailPage {
|
||||
document.getElementById('sidebar-pricing').textContent = this.appData.pricing || 'Free';
|
||||
document.getElementById('sidebar-contact').textContent = this.appData.contact_email || 'contact@example.com';
|
||||
|
||||
// Integration guide
|
||||
this.renderIntegrationGuide();
|
||||
// Render tab contents from database fields
|
||||
this.renderTabContents();
|
||||
}
|
||||
|
||||
renderIntegrationGuide() {
|
||||
// Installation code
|
||||
const installCode = document.getElementById('install-code');
|
||||
if (installCode) {
|
||||
if (this.appData.type === 'Open Source' && this.appData.github_url) {
|
||||
installCode.textContent = `# Clone from GitHub
|
||||
git clone ${this.appData.github_url}
|
||||
|
||||
# Install dependencies
|
||||
pip install -r requirements.txt`;
|
||||
} else if (this.appData.name.toLowerCase().includes('api')) {
|
||||
installCode.textContent = `# Install via pip
|
||||
pip install ${this.appData.slug}
|
||||
|
||||
# Or install from source
|
||||
pip install git+${this.appData.github_url || 'https://github.com/example/repo'}`;
|
||||
renderTabContents() {
|
||||
// Overview tab - use long_description from database
|
||||
const overviewDiv = document.getElementById('app-overview');
|
||||
if (overviewDiv) {
|
||||
if (this.appData.long_description) {
|
||||
overviewDiv.innerHTML = this.renderMarkdown(this.appData.long_description);
|
||||
} else {
|
||||
overviewDiv.innerHTML = `<p>${this.escapeHtml(this.appData.description || 'No overview available.')}</p>`;
|
||||
}
|
||||
}
|
||||
|
||||
// Usage code - customize based on category
|
||||
const usageCode = document.getElementById('usage-code');
|
||||
if (usageCode) {
|
||||
if (this.appData.category === 'Browser Automation') {
|
||||
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
|
||||
from ${this.appData.slug.replace(/-/g, '_')} import ${this.appData.name.replace(/\s+/g, '')}
|
||||
|
||||
async def main():
|
||||
# Initialize ${this.appData.name}
|
||||
automation = ${this.appData.name.replace(/\s+/g, '')}()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
browser_config=automation.config,
|
||||
wait_for="css:body"
|
||||
)
|
||||
print(result.markdown)`;
|
||||
} else if (this.appData.category === 'Proxy Services') {
|
||||
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
|
||||
import ${this.appData.slug.replace(/-/g, '_')}
|
||||
|
||||
# Configure proxy
|
||||
proxy_config = {
|
||||
"server": "${this.appData.website_url || 'https://proxy.example.com'}",
|
||||
"username": "your_username",
|
||||
"password": "your_password"
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(proxy=proxy_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
bypass_cache=True
|
||||
)
|
||||
print(result.status_code)`;
|
||||
} else if (this.appData.category === 'LLM Integration') {
|
||||
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
# Configure LLM extraction
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="${this.appData.name.toLowerCase().includes('gpt') ? 'openai' : 'anthropic'}",
|
||||
api_key="your-api-key",
|
||||
model="${this.appData.name.toLowerCase().includes('gpt') ? 'gpt-4' : 'claude-3'}",
|
||||
instruction="Extract structured data"
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
print(result.extracted_content)`;
|
||||
// Integration tab - use integration_guide field from database
|
||||
const integrationDiv = document.getElementById('app-integration');
|
||||
if (integrationDiv) {
|
||||
if (this.appData.integration_guide) {
|
||||
integrationDiv.innerHTML = this.renderMarkdown(this.appData.integration_guide);
|
||||
// Add copy buttons to all code blocks
|
||||
this.addCopyButtonsToCodeBlocks(integrationDiv);
|
||||
} else {
|
||||
integrationDiv.innerHTML = '<p>Integration guide not yet available. Please check the official website for details.</p>';
|
||||
}
|
||||
}
|
||||
|
||||
// Integration example
|
||||
const integrationCode = document.getElementById('integration-code');
|
||||
if (integrationCode) {
|
||||
integrationCode.textContent = this.appData.integration_guide ||
|
||||
`# Complete ${this.appData.name} Integration Example
|
||||
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
import json
|
||||
|
||||
async def crawl_with_${this.appData.slug.replace(/-/g, '_')}():
|
||||
"""
|
||||
Complete example showing how to use ${this.appData.name}
|
||||
with Crawl4AI for production web scraping
|
||||
"""
|
||||
|
||||
# Define extraction schema
|
||||
schema = {
|
||||
"name": "ProductList",
|
||||
"baseSelector": "div.product",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"},
|
||||
{"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
|
||||
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
// Documentation tab - use documentation field from database
|
||||
const docsDiv = document.getElementById('app-docs');
|
||||
if (docsDiv) {
|
||||
if (this.appData.documentation) {
|
||||
docsDiv.innerHTML = this.renderMarkdown(this.appData.documentation);
|
||||
// Add copy buttons to all code blocks
|
||||
this.addCopyButtonsToCodeBlocks(docsDiv);
|
||||
} else {
|
||||
docsDiv.innerHTML = '<p>Documentation coming soon.</p>';
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Initialize crawler with ${this.appData.name}
|
||||
async with AsyncWebCrawler(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
verbose=True
|
||||
) as crawler:
|
||||
addCopyButtonsToCodeBlocks(container) {
|
||||
// Find all code blocks and add copy buttons
|
||||
const codeBlocks = container.querySelectorAll('pre code');
|
||||
codeBlocks.forEach(codeBlock => {
|
||||
const pre = codeBlock.parentElement;
|
||||
|
||||
# Crawl with extraction
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/products",
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||
cache_mode="bypass",
|
||||
wait_for="css:.product",
|
||||
screenshot=True
|
||||
)
|
||||
// Skip if already has a copy button
|
||||
if (pre.querySelector('.copy-btn')) return;
|
||||
|
||||
# Process results
|
||||
if result.success:
|
||||
products = json.loads(result.extracted_content)
|
||||
print(f"Found {len(products)} products")
|
||||
// Create copy button
|
||||
const copyBtn = document.createElement('button');
|
||||
copyBtn.className = 'copy-btn';
|
||||
copyBtn.textContent = 'Copy';
|
||||
copyBtn.onclick = () => {
|
||||
navigator.clipboard.writeText(codeBlock.textContent).then(() => {
|
||||
copyBtn.textContent = '✓ Copied!';
|
||||
setTimeout(() => {
|
||||
copyBtn.textContent = 'Copy';
|
||||
}, 2000);
|
||||
});
|
||||
};
|
||||
|
||||
for product in products[:5]:
|
||||
print(f"- {product['title']}: {product['price']}")
|
||||
// Add button to pre element
|
||||
pre.style.position = 'relative';
|
||||
pre.insertBefore(copyBtn, codeBlock);
|
||||
});
|
||||
}
|
||||
|
||||
return products
|
||||
renderMarkdown(text) {
|
||||
if (!text) return '';
|
||||
|
||||
# Run the crawler
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(crawl_with_${this.appData.slug.replace(/-/g, '_')}())`;
|
||||
}
|
||||
// Store code blocks temporarily to protect them from processing
|
||||
const codeBlocks = [];
|
||||
let processed = text.replace(/```(\w+)?\n([\s\S]*?)```/g, (match, lang, code) => {
|
||||
const placeholder = `___CODE_BLOCK_${codeBlocks.length}___`;
|
||||
codeBlocks.push(`<pre><code class="language-${lang || ''}">${this.escapeHtml(code)}</code></pre>`);
|
||||
return placeholder;
|
||||
});
|
||||
|
||||
// Store inline code temporarily
|
||||
const inlineCodes = [];
|
||||
processed = processed.replace(/`([^`]+)`/g, (match, code) => {
|
||||
const placeholder = `___INLINE_CODE_${inlineCodes.length}___`;
|
||||
inlineCodes.push(`<code>${this.escapeHtml(code)}</code>`);
|
||||
return placeholder;
|
||||
});
|
||||
|
||||
// Now process the rest of the markdown
|
||||
processed = processed
|
||||
// Headers
|
||||
.replace(/^### (.*$)/gim, '<h3>$1</h3>')
|
||||
.replace(/^## (.*$)/gim, '<h2>$1</h2>')
|
||||
.replace(/^# (.*$)/gim, '<h1>$1</h1>')
|
||||
// Bold
|
||||
.replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
|
||||
// Italic
|
||||
.replace(/\*(.*?)\*/g, '<em>$1</em>')
|
||||
// Links
|
||||
.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2" target="_blank">$1</a>')
|
||||
// Line breaks
|
||||
.replace(/\n\n/g, '</p><p>')
|
||||
.replace(/\n/g, '<br>')
|
||||
// Lists
|
||||
.replace(/^\* (.*)$/gim, '<li>$1</li>')
|
||||
.replace(/^- (.*)$/gim, '<li>$1</li>')
|
||||
// Wrap in paragraphs
|
||||
.replace(/^(?!<(?:h[1-6]|p|pre|ul|ol|li)\b)/gim, '<p>')
|
||||
.replace(/(?<![>])$/gim, '</p>');
|
||||
|
||||
// Restore inline code
|
||||
inlineCodes.forEach((code, i) => {
|
||||
processed = processed.replace(`___INLINE_CODE_${i}___`, code);
|
||||
});
|
||||
|
||||
// Restore code blocks
|
||||
codeBlocks.forEach((block, i) => {
|
||||
processed = processed.replace(`___CODE_BLOCK_${i}___`, block);
|
||||
});
|
||||
|
||||
return processed;
|
||||
}
|
||||
|
||||
escapeHtml(text) {
|
||||
const div = document.createElement('div');
|
||||
div.textContent = text;
|
||||
return div.innerHTML;
|
||||
}
|
||||
|
||||
formatNumber(num) {
|
||||
@@ -289,33 +277,6 @@ if __name__ == "__main__":
|
||||
document.getElementById(`${tabName}-tab`).classList.add('active');
|
||||
});
|
||||
});
|
||||
|
||||
// Copy integration code
|
||||
document.getElementById('copy-integration').addEventListener('click', () => {
|
||||
const code = document.getElementById('integration-code').textContent;
|
||||
navigator.clipboard.writeText(code).then(() => {
|
||||
const btn = document.getElementById('copy-integration');
|
||||
const originalText = btn.innerHTML;
|
||||
btn.innerHTML = '<span>✓</span> Copied!';
|
||||
setTimeout(() => {
|
||||
btn.innerHTML = originalText;
|
||||
}, 2000);
|
||||
});
|
||||
});
|
||||
|
||||
// Copy code buttons
|
||||
document.querySelectorAll('.copy-btn').forEach(btn => {
|
||||
btn.addEventListener('click', (e) => {
|
||||
const codeBlock = e.target.closest('.code-block');
|
||||
const code = codeBlock.querySelector('code').textContent;
|
||||
navigator.clipboard.writeText(code).then(() => {
|
||||
btn.textContent = 'Copied!';
|
||||
setTimeout(() => {
|
||||
btn.textContent = 'Copy';
|
||||
}, 2000);
|
||||
});
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
async loadRelatedApps() {
|
||||
|
||||
@@ -1,338 +0,0 @@
|
||||
"""
|
||||
🚀 Crawl4AI v0.7.5 Release Demo - Working Examples
|
||||
==================================================
|
||||
This demo showcases key features introduced in v0.7.5 with real, executable examples.
|
||||
|
||||
Featured Demos:
|
||||
1. ✅ Docker Hooks System - Real API calls with custom hooks (string & function-based)
|
||||
2. ✅ Enhanced LLM Integration - Working LLM configurations
|
||||
3. ✅ HTTPS Preservation - Live crawling with HTTPS maintenance
|
||||
|
||||
Requirements:
|
||||
- crawl4ai v0.7.5 installed
|
||||
- Docker running with crawl4ai image (optional for Docker demos)
|
||||
- Valid API keys for LLM demos (optional)
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import requests
|
||||
import time
|
||||
import sys
|
||||
|
||||
from crawl4ai import (AsyncWebCrawler, CrawlerRunConfig, BrowserConfig,
|
||||
CacheMode, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy,
|
||||
hooks_to_string)
|
||||
from crawl4ai.docker_client import Crawl4aiDockerClient
|
||||
|
||||
|
||||
def print_section(title: str, description: str = ""):
|
||||
"""Print a section header"""
|
||||
print(f"\n{'=' * 60}")
|
||||
print(f"{title}")
|
||||
if description:
|
||||
print(f"{description}")
|
||||
print(f"{'=' * 60}\n")
|
||||
|
||||
|
||||
async def demo_1_docker_hooks_system():
|
||||
"""Demo 1: Docker Hooks System - Real API calls with custom hooks"""
|
||||
print_section(
|
||||
"Demo 1: Docker Hooks System",
|
||||
"Testing both string-based and function-based hooks (NEW in v0.7.5!)"
|
||||
)
|
||||
|
||||
# Check Docker service availability
|
||||
def check_docker_service():
|
||||
try:
|
||||
response = requests.get("http://localhost:11235/", timeout=3)
|
||||
return response.status_code == 200
|
||||
except requests.RequestException:
|
||||
return False
|
||||
|
||||
print("Checking Docker service...")
|
||||
docker_running = check_docker_service()
|
||||
|
||||
if not docker_running:
|
||||
print("⚠️ Docker service not running on localhost:11235")
|
||||
print("To test Docker hooks:")
|
||||
print("1. Run: docker run -p 11235:11235 unclecode/crawl4ai:latest")
|
||||
print("2. Wait for service to start")
|
||||
print("3. Re-run this demo\n")
|
||||
return
|
||||
|
||||
print("✓ Docker service detected!")
|
||||
|
||||
# ============================================================================
|
||||
# PART 1: Traditional String-Based Hooks (Works with REST API)
|
||||
# ============================================================================
|
||||
print("\n" + "─" * 60)
|
||||
print("Part 1: String-Based Hooks (REST API)")
|
||||
print("─" * 60)
|
||||
|
||||
hooks_config_string = {
|
||||
"on_page_context_created": """
|
||||
async def hook(page, context, **kwargs):
|
||||
print("[String Hook] Setting up page context")
|
||||
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
|
||||
return page
|
||||
""",
|
||||
"before_retrieve_html": """
|
||||
async def hook(page, context, **kwargs):
|
||||
print("[String Hook] Before retrieving HTML")
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
await page.wait_for_timeout(1000)
|
||||
return page
|
||||
"""
|
||||
}
|
||||
|
||||
payload = {
|
||||
"urls": ["https://httpbin.org/html"],
|
||||
"hooks": {
|
||||
"code": hooks_config_string,
|
||||
"timeout": 30
|
||||
}
|
||||
}
|
||||
|
||||
print("🔧 Using string-based hooks for REST API...")
|
||||
try:
|
||||
start_time = time.time()
|
||||
response = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
print(f"✅ String-based hooks executed in {execution_time:.2f}s")
|
||||
if result.get('results') and result['results'][0].get('success'):
|
||||
html_length = len(result['results'][0].get('html', ''))
|
||||
print(f" 📄 HTML length: {html_length} characters")
|
||||
else:
|
||||
print(f"❌ Request failed: {response.status_code}")
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
|
||||
# ============================================================================
|
||||
# PART 2: NEW Function-Based Hooks with Docker Client (v0.7.5)
|
||||
# ============================================================================
|
||||
print("\n" + "─" * 60)
|
||||
print("Part 2: Function-Based Hooks with Docker Client (✨ NEW!)")
|
||||
print("─" * 60)
|
||||
|
||||
# Define hooks as regular Python functions
|
||||
async def on_page_context_created_func(page, context, **kwargs):
|
||||
"""Block images to speed up crawling"""
|
||||
print("[Function Hook] Setting up page context")
|
||||
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
|
||||
await page.set_viewport_size({"width": 1920, "height": 1080})
|
||||
return page
|
||||
|
||||
async def before_goto_func(page, context, url, **kwargs):
|
||||
"""Add custom headers before navigation"""
|
||||
print(f"[Function Hook] About to navigate to {url}")
|
||||
await page.set_extra_http_headers({
|
||||
'X-Crawl4AI': 'v0.7.5-function-hooks',
|
||||
'X-Test-Header': 'demo'
|
||||
})
|
||||
return page
|
||||
|
||||
async def before_retrieve_html_func(page, context, **kwargs):
|
||||
"""Scroll to load lazy content"""
|
||||
print("[Function Hook] Scrolling page for lazy-loaded content")
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
await page.wait_for_timeout(500)
|
||||
await page.evaluate("window.scrollTo(0, 0)")
|
||||
return page
|
||||
|
||||
# Use the hooks_to_string utility (can be used standalone)
|
||||
print("\n📦 Converting functions to strings with hooks_to_string()...")
|
||||
hooks_as_strings = hooks_to_string({
|
||||
"on_page_context_created": on_page_context_created_func,
|
||||
"before_goto": before_goto_func,
|
||||
"before_retrieve_html": before_retrieve_html_func
|
||||
})
|
||||
print(f" ✓ Converted {len(hooks_as_strings)} hooks to string format")
|
||||
|
||||
# OR use Docker Client which does conversion automatically!
|
||||
print("\n🐳 Using Docker Client with automatic conversion...")
|
||||
try:
|
||||
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
|
||||
|
||||
# Pass function objects directly - conversion happens automatically!
|
||||
results = await client.crawl(
|
||||
urls=["https://httpbin.org/html"],
|
||||
hooks={
|
||||
"on_page_context_created": on_page_context_created_func,
|
||||
"before_goto": before_goto_func,
|
||||
"before_retrieve_html": before_retrieve_html_func
|
||||
},
|
||||
hooks_timeout=30
|
||||
)
|
||||
|
||||
if results and results.success:
|
||||
print(f"✅ Function-based hooks executed successfully!")
|
||||
print(f" 📄 HTML length: {len(results.html)} characters")
|
||||
print(f" 🎯 URL: {results.url}")
|
||||
else:
|
||||
print("⚠️ Crawl completed but may have warnings")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Docker client error: {str(e)}")
|
||||
|
||||
# Show the benefits
|
||||
print("\n" + "=" * 60)
|
||||
print("✨ Benefits of Function-Based Hooks:")
|
||||
print("=" * 60)
|
||||
print("✓ Full IDE support (autocomplete, syntax highlighting)")
|
||||
print("✓ Type checking and linting")
|
||||
print("✓ Easier to test and debug")
|
||||
print("✓ Reusable across projects")
|
||||
print("✓ Automatic conversion in Docker client")
|
||||
print("=" * 60)
|
||||
|
||||
|
||||
async def demo_2_enhanced_llm_integration():
|
||||
"""Demo 2: Enhanced LLM Integration - Working LLM configurations"""
|
||||
print_section(
|
||||
"Demo 2: Enhanced LLM Integration",
|
||||
"Testing custom LLM providers and configurations"
|
||||
)
|
||||
|
||||
print("🤖 Testing Enhanced LLM Integration Features")
|
||||
|
||||
provider = "gemini/gemini-2.5-flash-lite"
|
||||
payload = {
|
||||
"url": "https://example.com",
|
||||
"f": "llm",
|
||||
"q": "Summarize this page in one sentence.",
|
||||
"provider": provider, # Explicitly set provider
|
||||
"temperature": 0.7
|
||||
}
|
||||
try:
|
||||
response = requests.post(
|
||||
"http://localhost:11235/md",
|
||||
json=payload,
|
||||
timeout=60
|
||||
)
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
print(f"✓ Request successful with provider: {provider}")
|
||||
print(f" - Response keys: {list(result.keys())}")
|
||||
print(f" - Content length: {len(result.get('markdown', ''))} characters")
|
||||
print(f" - Note: Actual LLM call may fail without valid API key")
|
||||
else:
|
||||
print(f"❌ Request failed: {response.status_code}")
|
||||
print(f" - Response: {response.text[:500]}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {e}")
|
||||
|
||||
|
||||
async def demo_3_https_preservation():
|
||||
"""Demo 3: HTTPS Preservation - Live crawling with HTTPS maintenance"""
|
||||
print_section(
|
||||
"Demo 3: HTTPS Preservation",
|
||||
"Testing HTTPS preservation for internal links"
|
||||
)
|
||||
|
||||
print("🔒 Testing HTTPS Preservation Feature")
|
||||
|
||||
# Test with HTTPS preservation enabled
|
||||
print("\nTest 1: HTTPS Preservation ENABLED")
|
||||
|
||||
url_filter = URLPatternFilter(
|
||||
patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
|
||||
)
|
||||
config = CrawlerRunConfig(
|
||||
exclude_external_links=True,
|
||||
stream=True,
|
||||
verbose=False,
|
||||
preserve_https_for_internal_links=True,
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=5,
|
||||
filter_chain=FilterChain([url_filter])
|
||||
)
|
||||
)
|
||||
|
||||
test_url = "https://quotes.toscrape.com"
|
||||
print(f"🎯 Testing URL: {test_url}")
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
async for result in await crawler.arun(url=test_url, config=config):
|
||||
print("✓ HTTPS Preservation Test Completed")
|
||||
internal_links = [i['href'] for i in result.links['internal']]
|
||||
for link in internal_links:
|
||||
print(f" → {link}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all demos"""
|
||||
print("\n" + "=" * 60)
|
||||
print("🚀 Crawl4AI v0.7.5 Working Demo")
|
||||
print("=" * 60)
|
||||
|
||||
# Check system requirements
|
||||
print("🔍 System Requirements Check:")
|
||||
print(f" - Python version: {sys.version.split()[0]} {'✓' if sys.version_info >= (3, 10) else '❌ (3.10+ required)'}")
|
||||
|
||||
try:
|
||||
import requests
|
||||
print(f" - Requests library: ✓")
|
||||
except ImportError:
|
||||
print(f" - Requests library: ❌")
|
||||
|
||||
print()
|
||||
|
||||
demos = [
|
||||
("Docker Hooks System", demo_1_docker_hooks_system),
|
||||
("Enhanced LLM Integration", demo_2_enhanced_llm_integration),
|
||||
("HTTPS Preservation", demo_3_https_preservation),
|
||||
]
|
||||
|
||||
for i, (name, demo_func) in enumerate(demos, 1):
|
||||
try:
|
||||
print(f"\n📍 Starting Demo {i}/{len(demos)}: {name}")
|
||||
await demo_func()
|
||||
|
||||
if i < len(demos):
|
||||
print(f"\n✨ Demo {i} complete! Press Enter for next demo...")
|
||||
input()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print(f"\n⏹️ Demo interrupted by user")
|
||||
break
|
||||
except Exception as e:
|
||||
print(f"❌ Demo {i} error: {str(e)}")
|
||||
print("Continuing to next demo...")
|
||||
continue
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("🎉 Demo Complete!")
|
||||
print("=" * 60)
|
||||
print("You've experienced the power of Crawl4AI v0.7.5!")
|
||||
print("")
|
||||
print("Key Features Demonstrated:")
|
||||
print("🔧 Docker Hooks - String-based & function-based (NEW!)")
|
||||
print(" • hooks_to_string() utility for function conversion")
|
||||
print(" • Docker client with automatic conversion")
|
||||
print(" • Full IDE support and type checking")
|
||||
print("🤖 Enhanced LLM - Better AI integration")
|
||||
print("🔒 HTTPS Preservation - Secure link handling")
|
||||
print("")
|
||||
print("Ready to build something amazing? 🚀")
|
||||
print("")
|
||||
print("📖 Docs: https://docs.crawl4ai.com/")
|
||||
print("🐙 GitHub: https://github.com/unclecode/crawl4ai")
|
||||
print("=" * 60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("🚀 Crawl4AI v0.7.5 Live Demo Starting...")
|
||||
print("Press Ctrl+C anytime to exit\n")
|
||||
|
||||
try:
|
||||
asyncio.run(main())
|
||||
except KeyboardInterrupt:
|
||||
print("\n👋 Demo stopped by user. Thanks for trying Crawl4AI v0.7.5!")
|
||||
except Exception as e:
|
||||
print(f"\n❌ Demo error: {str(e)}")
|
||||
print("Make sure you have the required dependencies installed.")
|
||||
@@ -1,655 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
🚀 Crawl4AI v0.7.5 - Docker Hooks System Complete Demonstration
|
||||
================================================================
|
||||
|
||||
This file demonstrates the NEW Docker Hooks System introduced in v0.7.5.
|
||||
|
||||
The Docker Hooks System is a completely NEW feature that provides pipeline
|
||||
customization through user-provided Python functions. It offers three approaches:
|
||||
|
||||
1. String-based hooks for REST API
|
||||
2. hooks_to_string() utility to convert functions
|
||||
3. Docker Client with automatic conversion (most convenient)
|
||||
|
||||
All three approaches are part of this NEW v0.7.5 feature!
|
||||
|
||||
Perfect for video recording and demonstration purposes.
|
||||
|
||||
Requirements:
|
||||
- Docker container running: docker run -p 11235:11235 unclecode/crawl4ai:latest
|
||||
- crawl4ai v0.7.5 installed: pip install crawl4ai==0.7.5
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import requests
|
||||
import json
|
||||
import time
|
||||
from typing import Dict, Any
|
||||
|
||||
# Import Crawl4AI components
|
||||
from crawl4ai import hooks_to_string
|
||||
from crawl4ai.docker_client import Crawl4aiDockerClient
|
||||
|
||||
# Configuration
|
||||
DOCKER_URL = "http://localhost:11235"
|
||||
# DOCKER_URL = "http://localhost:11234"
|
||||
TEST_URLS = [
|
||||
# "https://httpbin.org/html",
|
||||
"https://www.kidocode.com",
|
||||
"https://quotes.toscrape.com",
|
||||
]
|
||||
|
||||
|
||||
def print_section(title: str, description: str = ""):
|
||||
"""Print a formatted section header"""
|
||||
print("\n" + "=" * 70)
|
||||
print(f" {title}")
|
||||
if description:
|
||||
print(f" {description}")
|
||||
print("=" * 70 + "\n")
|
||||
|
||||
|
||||
def check_docker_service() -> bool:
|
||||
"""Check if Docker service is running"""
|
||||
try:
|
||||
response = requests.get(f"{DOCKER_URL}/health", timeout=3)
|
||||
return response.status_code == 200
|
||||
except requests.RequestException:
|
||||
return False
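`check_docker_service()` probes `/health` once, so it can report a false negative while the container is still starting. A minimal sketch of a retry wrapper follows. `wait_until_healthy`, its parameters, and the injected `probe` and `sleep` callables are hypothetical names introduced here, and the default retry count and delay are arbitrary.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)  # No sleep after the final failed attempt.
    return False
```

With the demo's own helper this could be called as `wait_until_healthy(check_docker_service)`; injecting `sleep` keeps the function easy to test without real waiting.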
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# REUSABLE HOOK LIBRARY (NEW in v0.7.5)
|
||||
# ============================================================================
|
||||
|
||||
async def performance_optimization_hook(page, context, **kwargs):
|
||||
"""
|
||||
Performance Hook: Block unnecessary resources to speed up crawling
|
||||
"""
|
||||
print(" [Hook] 🚀 Optimizing performance - blocking images and ads...")
|
||||
|
||||
# Block images
|
||||
await context.route(
|
||||
"**/*.{png,jpg,jpeg,gif,webp,svg,ico}",
|
||||
lambda route: route.abort()
|
||||
)
|
||||
|
||||
# Block ads and analytics
|
||||
await context.route("**/analytics/*", lambda route: route.abort())
|
||||
await context.route("**/ads/*", lambda route: route.abort())
|
||||
await context.route("**/google-analytics.com/*", lambda route: route.abort())
|
||||
|
||||
print(" [Hook] ✓ Performance optimization applied")
|
||||
return page
|
||||
|
||||
|
||||
async def viewport_setup_hook(page, context, **kwargs):
|
||||
"""
|
||||
Viewport Hook: Set consistent viewport size for rendering
|
||||
"""
|
||||
print(" [Hook] 🖥️ Setting viewport to 1920x1080...")
|
||||
await page.set_viewport_size({"width": 1920, "height": 1080})
|
||||
print(" [Hook] ✓ Viewport configured")
|
||||
return page
|
||||
|
||||
|
||||
async def authentication_headers_hook(page, context, url, **kwargs):
|
||||
"""
|
||||
Headers Hook: Add custom authentication and tracking headers
|
||||
"""
|
||||
print(f" [Hook] 🔐 Adding custom headers for {url[:50]}...")
|
||||
|
||||
await page.set_extra_http_headers({
|
||||
'X-Crawl4AI-Version': '0.7.5',
|
||||
'X-Custom-Hook': 'function-based-demo',
|
||||
'Accept-Language': 'en-US,en;q=0.9',
|
||||
'User-Agent': 'Crawl4AI/0.7.5 (Educational Demo)'
|
||||
})
|
||||
|
||||
print(" [Hook] ✓ Custom headers added")
|
||||
return page
|
||||
|
||||
|
||||
async def lazy_loading_handler_hook(page, context, **kwargs):
|
||||
"""
|
||||
Content Hook: Handle lazy-loaded content by scrolling
|
||||
"""
|
||||
print(" [Hook] 📜 Scrolling to load lazy content...")
|
||||
|
||||
# Scroll to bottom
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
await page.wait_for_timeout(1000)
|
||||
|
||||
# Scroll to middle
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 2)")
|
||||
await page.wait_for_timeout(500)
|
||||
|
||||
# Scroll back to top
|
||||
await page.evaluate("window.scrollTo(0, 0)")
|
||||
await page.wait_for_timeout(500)
|
||||
|
||||
print(" [Hook] ✓ Lazy content loaded")
|
||||
return page
|
||||
|
||||
|
||||
async def page_analytics_hook(page, context, **kwargs):
|
||||
"""
|
||||
Analytics Hook: Log page metrics before extraction
|
||||
"""
|
||||
print(" [Hook] 📊 Collecting page analytics...")
|
||||
|
||||
metrics = await page.evaluate('''
|
||||
() => ({
|
||||
title: document.title,
|
||||
images: document.images.length,
|
||||
links: document.links.length,
|
||||
scripts: document.scripts.length,
|
||||
headings: document.querySelectorAll('h1, h2, h3').length,
|
||||
paragraphs: document.querySelectorAll('p').length
|
||||
})
|
||||
''')
|
||||
|
||||
print(f" [Hook] 📈 Page: {metrics['title'][:50]}...")
|
||||
print(f" Links: {metrics['links']}, Images: {metrics['images']}, "
|
||||
f"Headings: {metrics['headings']}, Paragraphs: {metrics['paragraphs']}")
|
||||
|
||||
return page
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# DEMO 1: String-Based Hooks (NEW Docker Hooks System)
|
||||
# ============================================================================
|
||||
|
||||
def demo_1_string_based_hooks():
|
||||
"""
|
||||
Demonstrate string-based hooks with REST API (part of NEW Docker Hooks System)
|
||||
"""
|
||||
print_section(
|
||||
"DEMO 1: String-Based Hooks (REST API)",
|
||||
"Part of the NEW Docker Hooks System - hooks as strings"
|
||||
)
|
||||
|
||||
# Define hooks as strings
|
||||
hooks_config = {
|
||||
"on_page_context_created": """
|
||||
async def hook(page, context, **kwargs):
|
||||
print(" [String Hook] Setting up page context...")
|
||||
# Block images for performance
|
||||
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
|
||||
await page.set_viewport_size({"width": 1920, "height": 1080})
|
||||
return page
|
||||
""",
|
||||
|
||||
"before_goto": """
|
||||
async def hook(page, context, url, **kwargs):
|
||||
print(f" [String Hook] Navigating to {url[:50]}...")
|
||||
await page.set_extra_http_headers({
|
||||
'X-Crawl4AI': 'string-based-hooks',
|
||||
'X-Demo': 'v0.7.5'
|
||||
})
|
||||
return page
|
||||
""",
|
||||
|
||||
"before_retrieve_html": """
|
||||
async def hook(page, context, **kwargs):
|
||||
print(" [String Hook] Scrolling page...")
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
await page.wait_for_timeout(1000)
|
||||
return page
|
||||
"""
|
||||
}
|
||||
|
||||
# Prepare request payload
|
||||
payload = {
|
||||
"urls": [TEST_URLS[0]],
|
||||
"hooks": {
|
||||
"code": hooks_config,
|
||||
"timeout": 30
|
||||
},
|
||||
"crawler_config": {
|
||||
"cache_mode": "bypass"
|
||||
}
|
||||
}
|
||||
|
||||
print(f"🎯 Target URL: {TEST_URLS[0]}")
|
||||
print(f"🔧 Configured {len(hooks_config)} string-based hooks")
|
||||
print(f"📡 Sending request to Docker API...\n")
|
||||
|
||||
try:
|
||||
start_time = time.time()
|
||||
response = requests.post(f"{DOCKER_URL}/crawl", json=payload, timeout=60)
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
|
||||
print(f"\n✅ Request successful! (took {execution_time:.2f}s)")
|
||||
|
||||
# Display results
|
||||
if result.get('results') and result['results'][0].get('success'):
|
||||
crawl_result = result['results'][0]
|
||||
html_length = len(crawl_result.get('html', ''))
|
||||
markdown_length = len(crawl_result.get('markdown', ''))
|
||||
|
||||
print(f"\n📊 Results:")
|
||||
print(f" • HTML length: {html_length:,} characters")
|
||||
print(f" • Markdown length: {markdown_length:,} characters")
|
||||
print(f" • URL: {crawl_result.get('url')}")
|
||||
|
||||
# Check hooks execution
|
||||
if 'hooks' in result:
|
||||
hooks_info = result['hooks']
|
||||
print(f"\n🎣 Hooks Execution:")
|
||||
print(f" • Status: {hooks_info['status']['status']}")
|
||||
print(f" • Attached hooks: {len(hooks_info['status']['attached_hooks'])}")
|
||||
|
||||
if 'summary' in hooks_info:
|
||||
summary = hooks_info['summary']
|
||||
print(f" • Total executions: {summary['total_executions']}")
|
||||
print(f" • Successful: {summary['successful']}")
|
||||
print(f" • Success rate: {summary['success_rate']:.1f}%")
|
||||
else:
|
||||
print(f"⚠️ Crawl completed but no results")
|
||||
|
||||
else:
|
||||
print(f"❌ Request failed with status {response.status_code}")
|
||||
print(f" Error: {response.text[:200]}")
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
print("⏰ Request timed out after 60 seconds")
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
|
||||
print("\n" + "─" * 70)
|
||||
print("✓ String-based hooks demo complete\n")
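Demo 1 indexes the `hooks` section of the response directly (`hooks_info['status']['status']`), which raises `KeyError` if a field is absent. A minimal sketch of a tolerant reader is shown below. The field names are taken from the demo's own response handling; the actual server schema is not confirmed here, and `summarize_hooks` is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_hooks(result: Dict[str, Any]) -> Dict[str, Any]:
    """Extract hook execution info from a crawl response, tolerating missing keys."""
    hooks_info = result.get("hooks") or {}
    status = hooks_info.get("status") or {}
    summary = hooks_info.get("summary") or {}
    return {
        "status": status.get("status", "unknown"),
        "attached": len(status.get("attached_hooks") or []),
        "total_executions": summary.get("total_executions", 0),
        "successful": summary.get("successful", 0),
        "success_rate": float(summary.get("success_rate", 0.0)),
    }
```

Returning a flat dictionary with defaults keeps the printing code free of nested `if 'key' in ...` checks.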
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# DEMO 2: Function-Based Hooks with hooks_to_string() Utility
|
||||
# ============================================================================
|
||||
|
||||
def demo_2_hooks_to_string_utility():
|
||||
"""
|
||||
Demonstrate the new hooks_to_string() utility for converting functions
|
||||
"""
|
||||
print_section(
|
||||
"DEMO 2: hooks_to_string() Utility (NEW! ✨)",
|
||||
"Convert Python functions to strings for REST API"
|
||||
)
|
||||
|
||||
print("📦 Creating hook functions...")
|
||||
print(" • performance_optimization_hook")
|
||||
print(" • authentication_headers_hook")
|
||||
print(" • lazy_loading_handler_hook")
|
||||
|
||||
# Convert function objects to strings using the NEW utility
|
||||
print("\n🔄 Converting functions to strings with hooks_to_string()...")
|
||||
|
||||
hooks_dict = {
|
||||
"on_page_context_created": performance_optimization_hook,
|
||||
"before_goto": authentication_headers_hook,
|
||||
"before_retrieve_html": lazy_loading_handler_hook,
|
||||
}
|
||||
|
||||
hooks_as_strings = hooks_to_string(hooks_dict)
|
||||
|
||||
print(f"✅ Successfully converted {len(hooks_as_strings)} functions to strings")
|
||||
|
||||
# Show a preview
|
||||
print("\n📝 Sample converted hook (first 250 characters):")
|
||||
print("─" * 70)
|
||||
sample_hook = list(hooks_as_strings.values())[0]
|
||||
print(sample_hook[:250] + "...")
|
||||
print("─" * 70)
|
||||
|
||||
# Use the converted hooks with REST API
|
||||
print("\n📡 Using converted hooks with REST API...")
|
||||
|
||||
payload = {
|
||||
"urls": [TEST_URLS[0]],
|
||||
"hooks": {
|
||||
"code": hooks_as_strings,
|
||||
"timeout": 30
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
start_time = time.time()
|
||||
response = requests.post(f"{DOCKER_URL}/crawl", json=payload, timeout=60)
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
print(f"\n✅ Request successful! (took {execution_time:.2f}s)")
|
||||
|
||||
if result.get('results') and result['results'][0].get('success'):
|
||||
crawl_result = result['results'][0]
|
||||
print(f" • HTML length: {len(crawl_result.get('html', '')):,} characters")
|
||||
print(" • Crawl succeeded with converted hooks")
|
||||
else:
|
||||
print(f"❌ Request failed: {response.status_code}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
|
||||
print("\n💡 Benefits of hooks_to_string():")
|
||||
print(" ✓ Write hooks as regular Python functions")
|
||||
print(" ✓ Full IDE support (autocomplete, syntax highlighting)")
|
||||
print(" ✓ Type checking and linting")
|
||||
print(" ✓ Easy to test and debug")
|
||||
print(" ✓ Reusable across projects")
|
||||
print(" ✓ Works with any REST API client")
|
||||
|
||||
print("\n" + "─" * 70)
|
||||
print("✓ hooks_to_string() utility demo complete\n")
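Demos 1 and 2 build the same request body by hand: `urls`, a `hooks` object with `code` and `timeout`, and an optional `crawler_config`. A minimal sketch of a builder that also rejects unknown hook names follows. `build_crawl_payload` and `KNOWN_HOOK_POINTS` are hypothetical names; the eight hook points are the ones this script lists, and the payload shape is copied from the demos rather than from a confirmed API reference.

```python
from typing import Any, Dict, Mapping, Optional, Sequence

# The eight hook points named in this demo script.
KNOWN_HOOK_POINTS = frozenset({
    "on_browser_created",
    "on_page_context_created",
    "on_user_agent_updated",
    "before_goto",
    "after_goto",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
})


def build_crawl_payload(
    urls: Sequence[str],
    hooks_code: Mapping[str, str],
    hooks_timeout: int = 30,
    crawler_config: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the /crawl request body used by the string-based demos."""
    if not urls:
        raise ValueError("at least one URL is required")
    unknown = sorted(set(hooks_code) - KNOWN_HOOK_POINTS)
    if unknown:
        raise ValueError(f"unknown hook points: {', '.join(unknown)}")
    payload: Dict[str, Any] = {
        "urls": list(urls),
        "hooks": {"code": dict(hooks_code), "timeout": hooks_timeout},
    }
    if crawler_config:
        payload["crawler_config"] = dict(crawler_config)
    return payload
```

Validating hook names on the client side surfaces typos before a request is sent, at the cost of having to update the set if the server adds hook points.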
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# DEMO 3: Docker Client with Automatic Conversion (RECOMMENDED! 🌟)
|
||||
# ============================================================================
|
||||
|
||||
async def demo_3_docker_client_auto_conversion():
|
||||
"""
|
||||
Demonstrate Docker Client with automatic hook conversion (RECOMMENDED)
|
||||
"""
|
||||
print_section(
|
||||
"DEMO 3: Docker Client with Auto-Conversion (RECOMMENDED! 🌟)",
|
||||
"Pass function objects directly - conversion happens automatically!"
|
||||
)
|
||||
|
||||
print("🐳 Initializing Crawl4AI Docker Client...")
|
||||
client = Crawl4aiDockerClient(base_url=DOCKER_URL)
|
||||
|
||||
print("✅ Client ready!\n")
|
||||
|
||||
# Use our reusable hook library - just pass the function objects!
|
||||
print("📚 Using reusable hook library:")
|
||||
print(" • performance_optimization_hook")
|
||||
print(" • authentication_headers_hook")
|
||||
print(" • lazy_loading_handler_hook")
|
||||
print(" • page_analytics_hook")
|
||||
|
||||
print("\n🎯 Target URL: " + TEST_URLS[0])
|
||||
print("🚀 Starting crawl with automatic hook conversion...\n")
|
||||
|
||||
try:
|
||||
start_time = time.time()
|
||||
|
||||
# Pass function objects directly - NO manual conversion needed! ✨
|
||||
results = await client.crawl(
|
||||
urls=[TEST_URLS[0]],
|
||||
hooks={
|
||||
"on_page_context_created": performance_optimization_hook,
|
||||
"before_goto": authentication_headers_hook,
|
||||
"before_retrieve_html": lazy_loading_handler_hook,
|
||||
"before_return_html": page_analytics_hook,
|
||||
},
|
||||
hooks_timeout=30
|
||||
)
|
||||
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
print(f"\n✅ Crawl completed! (took {execution_time:.2f}s)\n")
|
||||
|
||||
# Display results
|
||||
if results and results.success:
|
||||
result = results
|
||||
print(f"📊 Results:")
|
||||
print(f" • URL: {result.url}")
|
||||
print(f" • Success: {result.success}")
|
||||
print(f" • HTML length: {len(result.html):,} characters")
|
||||
print(f" • Markdown length: {len(result.markdown):,} characters")
|
||||
|
||||
# Show metadata
|
||||
if result.metadata:
|
||||
print(f"\n📋 Metadata:")
|
||||
print(f" • Title: {result.metadata.get('title', 'N/A')}")
|
||||
print(f" • Description: {result.metadata.get('description', 'N/A')}")
|
||||
|
||||
# Show links
|
||||
if result.links:
|
||||
internal_count = len(result.links.get('internal', []))
|
||||
external_count = len(result.links.get('external', []))
|
||||
print(f"\n🔗 Links Found:")
|
||||
print(f" • Internal: {internal_count}")
|
||||
print(f" • External: {external_count}")
|
||||
else:
|
||||
print(f"⚠️ Crawl completed but no successful results")
|
||||
if results:
|
||||
print(f" Error: {results.error_message}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
|
||||
print("\n🌟 Why Docker Client is RECOMMENDED:")
|
||||
print(" ✓ Automatic function-to-string conversion")
|
||||
print(" ✓ No manual hooks_to_string() calls needed")
|
||||
print(" ✓ Cleaner, more Pythonic code")
|
||||
print(" ✓ Full type hints and IDE support")
|
||||
print(" ✓ Built-in error handling")
|
||||
print(" ✓ Async/await support")
|
||||
|
||||
print("\n" + "─" * 70)
|
||||
print("✓ Docker Client auto-conversion demo complete\n")
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# DEMO 4: Advanced Use Case - Complete Hook Pipeline
|
||||
# ============================================================================
|
||||
|
||||
async def demo_4_complete_hook_pipeline():
|
||||
"""
|
||||
Demonstrate a complete hook pipeline using all 8 hook points
|
||||
"""
|
||||
print_section(
|
||||
"DEMO 4: Complete Hook Pipeline",
|
||||
"Using all 8 available hook points for comprehensive control"
|
||||
)
|
||||
|
||||
# Define all 8 hooks
|
||||
async def on_browser_created_hook(browser, **kwargs):
|
||||
"""Hook 1: Called after browser is created"""
|
||||
print(" [Pipeline] 1/8 Browser created")
|
||||
return browser
|
||||
|
||||
async def on_page_context_created_hook(page, context, **kwargs):
|
||||
"""Hook 2: Called after page context is created"""
|
||||
print(" [Pipeline] 2/8 Page context created - setting up...")
|
||||
await page.set_viewport_size({"width": 1920, "height": 1080})
|
||||
return page
|
||||
|
||||
async def on_user_agent_updated_hook(page, context, user_agent, **kwargs):
|
||||
"""Hook 3: Called when user agent is updated"""
|
||||
print(f" [Pipeline] 3/8 User agent updated: {user_agent[:50]}...")
|
||||
return page
|
||||
|
||||
async def before_goto_hook(page, context, url, **kwargs):
|
||||
"""Hook 4: Called before navigating to URL"""
|
||||
print(f" [Pipeline] 4/8 Before navigation to: {url[:60]}...")
|
||||
return page
|
||||
|
||||
async def after_goto_hook(page, context, url, response, **kwargs):
|
||||
"""Hook 5: Called after navigation completes"""
|
||||
print(f" [Pipeline] 5/8 After navigation - Status: {response.status if response else 'N/A'}")
|
||||
await page.wait_for_timeout(1000)
|
||||
return page
|
||||
|
||||
async def on_execution_started_hook(page, context, **kwargs):
|
||||
"""Hook 6: Called when JavaScript execution starts"""
|
||||
print(" [Pipeline] 6/8 JavaScript execution started")
|
||||
return page
|
||||
|
||||
async def before_retrieve_html_hook(page, context, **kwargs):
|
||||
"""Hook 7: Called before retrieving HTML"""
|
||||
print(" [Pipeline] 7/8 Before HTML retrieval - scrolling...")
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
return page
|
||||
|
||||
async def before_return_html_hook(page, context, html, **kwargs):
|
||||
"""Hook 8: Called before returning HTML"""
|
||||
print(f" [Pipeline] 8/8 Before return - HTML length: {len(html):,} chars")
|
||||
return page
|
||||
|
||||
print("🎯 Target URL: " + TEST_URLS[0])
|
||||
print("🔧 Configured ALL 8 hook points for complete pipeline control\n")
|
||||
|
||||
client = Crawl4aiDockerClient(base_url=DOCKER_URL)
|
||||
|
||||
try:
|
||||
print("🚀 Starting complete pipeline crawl...\n")
|
||||
start_time = time.time()
|
||||
|
||||
results = await client.crawl(
|
||||
urls=[TEST_URLS[0]],
|
||||
hooks={
|
||||
"on_browser_created": on_browser_created_hook,
|
||||
"on_page_context_created": on_page_context_created_hook,
|
||||
"on_user_agent_updated": on_user_agent_updated_hook,
|
||||
"before_goto": before_goto_hook,
|
||||
"after_goto": after_goto_hook,
|
||||
"on_execution_started": on_execution_started_hook,
|
||||
"before_retrieve_html": before_retrieve_html_hook,
|
||||
"before_return_html": before_return_html_hook,
|
||||
},
|
||||
hooks_timeout=45
|
||||
)
|
||||
|
||||
execution_time = time.time() - start_time
|
||||
|
||||
if results and results.success:
|
||||
print(f"\n✅ Complete pipeline executed successfully! (took {execution_time:.2f}s)")
|
||||
print(f" • All 8 hooks executed in sequence")
|
||||
print(f" • HTML length: {len(results.html):,} characters")
|
||||
else:
|
||||
print("⚠️ Pipeline did not complete successfully")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
|
||||
print("\n📚 Available Hook Points:")
|
||||
print(" 1. on_browser_created - Browser initialization")
|
||||
print(" 2. on_page_context_created - Page context setup")
|
||||
print(" 3. on_user_agent_updated - User agent configuration")
|
||||
print(" 4. before_goto - Pre-navigation setup")
|
||||
print(" 5. after_goto - Post-navigation processing")
|
||||
print(" 6. on_execution_started - JavaScript execution start")
|
||||
print(" 7. before_retrieve_html - Pre-extraction processing")
|
||||
print(" 8. before_return_html - Final HTML processing")
|
||||
|
||||
print("\n" + "─" * 70)
|
||||
print("✓ Full hook pipeline demo finished\n")
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# MAIN EXECUTION
|
||||
# ============================================================================
|
||||
|
||||
async def main():
|
||||
"""
|
||||
Run all demonstrations
|
||||
"""
|
||||
print("\n" + "=" * 70)
|
||||
print(" 🚀 Crawl4AI v0.7.5 - Docker Hooks Complete Demonstration")
|
||||
print("=" * 70)
|
||||
|
||||
# Check Docker service
|
||||
print("\n🔍 Checking Docker service status...")
|
||||
if not check_docker_service():
|
||||
print("❌ Docker service is not running!")
|
||||
print("\n📋 To start the Docker service:")
|
||||
print(" docker run -p 11235:11235 unclecode/crawl4ai:latest")
|
||||
print("\nPlease start the service and run this demo again.")
|
||||
return
|
||||
|
||||
print("✅ Docker service is running!\n")
|
||||
|
||||
# Run all demos
|
||||
demos = [
|
||||
("String-Based Hooks (REST API)", demo_1_string_based_hooks, False),
|
||||
("hooks_to_string() Utility", demo_2_hooks_to_string_utility, False),
|
||||
("Docker Client Auto-Conversion", demo_3_docker_client_auto_conversion, True),
|
||||
# ("Complete Hook Pipeline", demo_4_complete_hook_pipeline, True),
|
||||
]
|
||||
|
||||
for i, (name, demo_func, is_async) in enumerate(demos, 1):
|
||||
print(f"\n{'🔷' * 35}")
|
||||
print(f"Starting Demo {i}/{len(demos)}: {name}")
|
||||
print(f"{'🔷' * 35}\n")
|
||||
|
||||
try:
|
||||
if is_async:
|
||||
await demo_func()
|
||||
else:
|
||||
demo_func()
|
||||
|
||||
print(f"✅ Demo {i} completed successfully!")
|
||||
|
||||
# Pause between demos (except the last one)
|
||||
if i < len(demos):
|
||||
print("\n⏸️ Press Enter to continue to next demo...")
|
||||
input()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print(f"\n⏹️ Demo interrupted by user")
|
||||
break
|
||||
except Exception as e:
|
||||
print(f"\n❌ Demo {i} failed: {str(e)}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
print("\nContinuing to next demo...\n")
|
||||
continue
|
||||
|
||||
# Final summary
|
||||
print("\n" + "=" * 70)
|
||||
print(" 🎉 All Demonstrations Complete!")
|
||||
print("=" * 70)
|
||||
|
||||
print("\n📊 Summary of v0.7.5 Docker Hooks System:")
|
||||
print("\n🆕 COMPLETELY NEW FEATURE in v0.7.5:")
|
||||
print(" The Docker Hooks System lets you customize the crawling pipeline")
|
||||
print(" with user-provided Python functions at 8 strategic points.")
|
||||
|
||||
print("\n✨ Three Ways to Use Docker Hooks (All NEW!):")
|
||||
print(" 1. String-based - Write hooks as strings for REST API")
|
||||
print(" 2. hooks_to_string() - Convert Python functions to strings")
|
||||
print(" 3. Docker Client - Automatic conversion (RECOMMENDED)")
|
||||
|
||||
print("\n💡 Key Benefits:")
|
||||
print(" ✓ Full IDE support (autocomplete, syntax highlighting)")
|
||||
print(" ✓ Type checking and linting")
|
||||
print(" ✓ Easy to test and debug")
|
||||
print(" ✓ Reusable across projects")
|
||||
print(" ✓ Complete pipeline control")
|
||||
|
||||
print("\n🎯 8 Hook Points Available:")
|
||||
print(" • on_browser_created, on_page_context_created")
|
||||
print(" • on_user_agent_updated, before_goto, after_goto")
|
||||
print(" • on_execution_started, before_retrieve_html, before_return_html")
|
||||
|
||||
print("\n📚 Resources:")
|
||||
print(" • Docs: https://docs.crawl4ai.com")
|
||||
print(" • GitHub: https://github.com/unclecode/crawl4ai")
|
||||
print(" • Discord: https://discord.gg/jP8KfhDhyN")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print(" Happy Crawling with v0.7.5! 🕷️")
|
||||
print("=" * 70 + "\n")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("\n🎬 Starting Crawl4AI v0.7.5 Docker Hooks Demonstration...")
|
||||
print("Press Ctrl+C anytime to exit\n")
|
||||
|
||||
try:
|
||||
asyncio.run(main())
|
||||
except KeyboardInterrupt:
|
||||
print("\n\n👋 Demo stopped by user. Thanks for exploring Crawl4AI v0.7.5!")
|
||||
except Exception as e:
|
||||
print(f"\n\n❌ Demo error: {str(e)}")
|
||||
import traceback
|
||||
traceback.print_exc()
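`main()` in this script keeps a hand-maintained `is_async` flag next to each demo. As a minimal sketch of an alternative that relies only on the standard library, the runner below detects coroutine functions with `inspect.iscoroutinefunction`. The names `run_demo`, `run_all`, `sync_demo`, and `async_demo` are hypothetical stand-ins and not part of Crawl4AI.

```python
import asyncio
import inspect
from typing import Any, Callable, List


async def run_demo(demo_func: Callable[[], Any]) -> Any:
    """Run a demo, awaiting it only if it is a coroutine function."""
    if inspect.iscoroutinefunction(demo_func):
        return await demo_func()
    # Plain functions (such as the requests-based demos) are called directly.
    return demo_func()


# Hypothetical stand-ins for the real demos.
def sync_demo() -> str:
    return "sync done"


async def async_demo() -> str:
    await asyncio.sleep(0)
    return "async done"


async def run_all() -> List[Any]:
    """Run both kinds of demo through the same entry point, in order."""
    return [await run_demo(demo) for demo in (sync_demo, async_demo)]


if __name__ == "__main__":
    print(asyncio.run(run_all()))
```

With this approach the `demos` list could hold `(name, demo_func)` pairs only, so a flag cannot drift out of sync with the function it describes.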
|
||||
File diff suppressed because it is too large