# Docker Deployment 🐳

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

## Docker Compose Setup 🐳

### Basic Usage

Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
    restart: unless-stopped
```

Run with:

```bash
docker-compose up -d
```

### Secure Mode with API Token

To enable API authentication, set the `CRAWL4AI_API_TOKEN` environment variable:

```bash
CRAWL4AI_API_TOKEN=your-secret-token docker-compose up -d
```

### Using Environment Variables

Create a `.env` file for your API tokens:

```env
# Crawl4AI API Security (optional)
CRAWL4AI_API_TOKEN=your-secret-token

# LLM Provider API Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
GEMINI_API_KEY=...
OLLAMA_API_KEY=...

# Additional Configuration
MAX_CONCURRENT_TASKS=5
```

Docker Compose automatically reads a `.env` file in the project directory and uses it for variable substitution in `docker-compose.yml`. For a value to reach the container, the service must also reference it under `environment:` or load the file with `env_file:`.

### Testing with API Token

```python
import os

import requests

# Initialize headers with token if using secure mode
headers = {}
if api_token := os.getenv('CRAWL4AI_API_TOKEN'):
    headers['Authorization'] = f'Bearer {api_token}'

# Test crawl with authentication
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]
```

### Security Best Practices 🔒

- Add `.env` to your `.gitignore`
- Use different API tokens for development and production
- Rotate API tokens periodically
- Use secure methods to pass tokens in production environments

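On the client side, one way to keep tokens out of source code is to read them from the environment or from a mounted secret file. The sketch below is illustrative only: `CRAWL4AI_API_TOKEN` is the variable used in this guide, while `CRAWL4AI_API_TOKEN_FILE` and both function names are hypothetical and not part of Crawl4AI.

```python
import os
from pathlib import Path


def load_api_token(env=os.environ):
    """Return the API token from the environment, or from a file it points to."""
    # CRAWL4AI_API_TOKEN is the variable used in this guide.
    token = env.get("CRAWL4AI_API_TOKEN")
    if token:
        return token.strip()
    # CRAWL4AI_API_TOKEN_FILE is a hypothetical variable holding the path
    # to a mounted secret file (for example, a Docker secret).
    token_file = env.get("CRAWL4AI_API_TOKEN_FILE")
    if token_file and Path(token_file).is_file():
        return Path(token_file).read_text(encoding="utf-8").strip()
    return None


def auth_headers(token):
    """Build request headers; empty when no token is configured."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```

The resulting dictionary can be passed as `headers=` to `requests.post` or `requests.get`, as in the authenticated example above.
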
## Usage Examples 📝

### Basic Crawling

```python
import requests

request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

### Structured Data Extraction

```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```

### Dynamic Content Handling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```

### AI-Powered Extraction (Full Version)

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```

## Platform-Specific Instructions 💻

### macOS

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

### Ubuntu

```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```

### Windows (PowerShell)

```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

## Testing 🧪

Save this as `test_docker.py`:

```python
import requests
import json
import time
import sys

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timeout")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()
```

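The polling loop above only exits on `"completed"` or on timeout. A variant that also stops early on failure might look like the sketch below. It assumes the task endpoint can report a `"failed"` status with an `error` field, which this guide does not confirm; `wait_for_task` and `fetch_status` are hypothetical names.

```python
import time


def wait_for_task(fetch_status, timeout=300.0, interval=2.0, sleep=time.sleep):
    """Poll fetch_status() until the task completes, fails, or times out.

    fetch_status is any callable returning the task's JSON as a dict, e.g.
    lambda: requests.get(f"{base_url}/task/{task_id}", headers=headers).json()
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        # "failed" and "error" are assumed names; adjust to what the server returns.
        if state == "failed":
            raise RuntimeError(f"Task failed: {status.get('error', 'unknown error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Task did not complete in time")
        sleep(interval)
```

Passing the fetch function in keeps the loop independent of `requests`, so the same helper works with or without an `Authorization` header.
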
## Advanced Configuration ⚙️

### Crawler Parameters

The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser Configuration
        "headless": True,                    # Run in headless mode
        "browser_type": "chromium",          # chromium/firefox/webkit
        "user_agent": "custom-agent",        # Custom user agent
        "proxy": "http://proxy:8080",        # Proxy configuration

        # Performance & Behavior
        "page_timeout": 30000,               # Page load timeout (ms)
        "verbose": True,                     # Enable detailed logging
        "semaphore_count": 5,                # Concurrent request limit

        # Anti-Detection Features
        "simulate_user": True,               # Simulate human behavior
        "magic": True,                       # Advanced anti-detection
        "override_navigator": True,          # Override navigator properties

        # Session Management
        "user_data_dir": "./browser-data",   # Browser profile location
        "use_managed_browser": True,         # Use persistent browser
    }
}
```

### Extra Parameters

The `extra` field allows passing additional parameters directly to the crawler's `arun` function:

```python
request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,          # Min words per block
        "only_text": True,                   # Extract only text
        "bypass_cache": True,                # Force fresh crawl
        "process_iframes": True,             # Include iframe content
    }
}
```

### Complete Examples

1. **Advanced News Crawling**

   ```python
   request = {
       "urls": "https://www.nbcnews.com/business",
       "crawler_params": {
           "headless": True,
           "page_timeout": 30000,
           "remove_overlay_elements": True      # Remove popups
       },
       "extra": {
           "word_count_threshold": 50,          # Longer content blocks
           "bypass_cache": True                 # Fresh content
       },
       "css_selector": ".article-body"
   }
   ```

2. **Anti-Detection Configuration**

   ```python
   request = {
       "urls": "https://example.com",
       "crawler_params": {
           "simulate_user": True,
           "magic": True,
           "override_navigator": True,
           "user_agent": "Mozilla/5.0 ...",
           "headers": {
               "Accept-Language": "en-US,en;q=0.9"
           }
       }
   }
   ```

3. **LLM Extraction with Custom Parameters**

   ```python
   # pricing_schema is a schema dict that you define before building the request
   request = {
       "urls": "https://openai.com/pricing",
       "extraction_config": {
           "type": "llm",
           "params": {
               "provider": "openai/gpt-4",
               "schema": pricing_schema
           }
       },
       "crawler_params": {
           "verbose": True,
           "page_timeout": 60000
       },
       "extra": {
           "word_count_threshold": 1,
           "only_text": True
       }
   }
   ```

4. **Session-Based Dynamic Content**

   ```python
   request = {
       "urls": "https://example.com",
       "crawler_params": {
           "session_id": "dynamic_session",
           "headless": False,
           "page_timeout": 60000
       },
       "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
       "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
       "extra": {
           "delay_before_return_html": 2.0
       }
   }
   ```

5. **Screenshot with Custom Timing**

   ```python
   request = {
       "urls": "https://example.com",
       "screenshot": True,
       "crawler_params": {
           "headless": True,
           "screenshot_wait_for": ".main-content"
       },
       "extra": {
           "delay_before_return_html": 3.0
       }
   }
   ```

### Parameter Reference Table

| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |

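The Type column above can be turned into a quick client-side check before a request is sent. The following is a minimal sketch, not part of Crawl4AI: `PARAM_TYPES` and `check_param_types` are hypothetical names, and the mapping simply mirrors the table.

```python
# Expected Python types, copied from the parameter reference table.
PARAM_TYPES = {
    "headless": bool, "browser_type": str, "user_agent": str,
    "proxy": str, "headers": dict, "page_timeout": int,
    "delay_before_return_html": float, "simulate_user": bool, "magic": bool,
    "session_id": str, "user_data_dir": str, "word_count_threshold": int,
    "only_text": bool, "process_iframes": bool, "verbose": bool,
    "log_console": bool,
}


def check_param_types(params):
    """Return a list of messages for values whose type differs from the table."""
    problems = []
    for name, value in params.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            continue  # not listed in the table; leave it to the server
        ok = isinstance(value, expected)
        if expected is float:
            ok = isinstance(value, (int, float))  # accept 2 as well as 2.0
        if isinstance(value, bool) and expected is not bool:
            ok = False  # bool is a subclass of int in Python, so reject it here
        if not ok:
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

For example, `check_param_types(request["crawler_params"])` and `check_param_types(request["extra"])` can be run on the dictionaries from the examples above; an empty list means no mismatch was found.
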
## Troubleshooting 🔍

### Common Issues

1. **Connection Refused**

   ```
   Error: Connection refused at localhost:11235
   ```

   Solution: Ensure the container is running and ports are properly mapped.

2. **Resource Limits**

   ```
   Error: No available slots
   ```

   Solution: Increase `MAX_CONCURRENT_TASKS` or container resources.

3. **GPU Access**

   ```
   Error: GPU not found
   ```

   Solution: Ensure proper NVIDIA drivers and use `--gpus all` flag.

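For the connection-refused case, a quick way to tell whether anything is listening on the mapped port is a plain TCP connect. This sketch uses only the Python standard library; `is_port_open` is a hypothetical helper name, and the defaults match the port used throughout this guide.

```python
import socket


def is_port_open(host="localhost", port=11235, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns `False` while the container is running, the port mapping (`-p 11235:11235`) is the first thing to check.
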
### Debug Mode

Access container for debugging:

```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```

View container logs:

```bash
docker logs [container_id]
```

## Best Practices 🌟

1. **Resource Management**
   - Set appropriate memory and CPU limits
   - Monitor resource usage via the `/health` endpoint
   - Use the basic version for simple crawling tasks

2. **Scaling**
   - Use multiple containers for high load
   - Implement proper load balancing
   - Monitor performance metrics

3. **Security**
   - Use environment variables for sensitive data
   - Implement proper network isolation
   - Apply security updates regularly

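As a rough illustration of the scaling advice, a client can spread submissions across several containers in rotation. This is a sketch under assumptions: `RoundRobinPool` is a hypothetical class, and the second port (`11236`) is only an example of a second container's mapping. A dedicated load balancer is likely the better choice in production.

```python
from itertools import cycle


class RoundRobinPool:
    """Hand out base URLs of several Crawl4AI containers in rotation."""

    def __init__(self, base_urls):
        if not base_urls:
            raise ValueError("at least one base URL is required")
        self._urls = cycle(list(base_urls))

    def next_url(self, path="/crawl"):
        # Strip a trailing slash so the joined URL has exactly one separator.
        return next(self._urls).rstrip("/") + path


# Example: two containers mapped to different host ports (assumed setup).
pool = RoundRobinPool(["http://localhost:11235", "http://localhost:11236"])
```

If the containers do not share task state, results should be polled from the same container that accepted the task.
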
## API Reference 📚

### Health Check

```http
GET /health
```

### Submit Crawl Task

```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```

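The request body above can be assembled and checked on the client before it is posted. The sketch below enforces only what the reference states (`urls` is a string or array, `priority` is 1 to 10); `build_crawl_request` is a hypothetical helper, not part of the Crawl4AI API.

```python
def build_crawl_request(urls, extraction_config=None, priority=None, ttl=None):
    """Assemble a /crawl payload using the fields listed in this reference."""
    if isinstance(urls, str):
        if not urls:
            raise ValueError("urls must not be empty")
    elif isinstance(urls, (list, tuple)) and urls and all(isinstance(u, str) for u in urls):
        urls = list(urls)
    else:
        raise ValueError("urls must be a string or a non-empty list of strings")

    payload = {"urls": urls}
    if extraction_config is not None:
        payload["extraction_config"] = extraction_config
    if priority is not None:
        # Reject bools explicitly, since bool is a subclass of int.
        if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 10:
            raise ValueError("priority must be an integer from 1 to 10")
        payload["priority"] = priority
    if ttl is not None:
        payload["ttl"] = ttl
    return payload
```

The returned dictionary can be passed as `json=` to `requests.post("http://localhost:11235/crawl", ...)`.
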
### Get Task Status

```http
GET /task/{task_id}
```

For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).