feat(core): Release v0.3.73 with Browser Takeover and Docker Support

Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Mockdown with tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability
to use users' own browsers while providing containerized deployment options.
Breaking changes include modified browser handling and API response structure.

See CHANGELOG.md for detailed migration guide.
This commit is contained in:
UncleCode
2024-11-05 20:04:18 +08:00
parent c4c6227962
commit 67a23c3182
18 changed files with 1066 additions and 61263 deletions

View File

@@ -1,5 +1,97 @@
# Changelog
# CHANGELOG
## [v0.3.73] - 2024-11-05
### Major Features
- **New Doctor Feature**
- Added comprehensive system diagnostics tool
- Available through package hub and CLI
- Provides automated troubleshooting and system health checks
- Includes detailed reporting of configuration issues
- **Dockerized API Server**
- Released complete Docker implementation for API server
- Added comprehensive documentation for Docker deployment
- Implemented container communication protocols
- Added environment configuration guides
- **Managed Browser Integration**
- Added support for user-controlled browser instances
- Implemented `ManagedBrowser` class for better browser lifecycle management
- Added ability to connect to existing Chrome DevTools Protocol (CDP) endpoints
- Introduced user data directory support for persistent browser profiles
- **Enhanced HTML Processing**
- Added HTML tag preservation feature during markdown conversion
- Introduced configurable tag preservation system
- Improved pre-tag and code block handling
- Added support for nested preserved tags with attribute retention
### Improvements
- **Browser Handling**
- Added flag to ignore body visibility for problematic pages
- Improved browser process cleanup and management
- Enhanced temporary directory handling for browser profiles
- Added configurable browser launch arguments
- **Database Management**
- Implemented connection pooling for better performance
- Added retry logic for database operations
- Improved error handling and logging
- Enhanced cleanup procedures for database connections
- **Resource Management**
- Added memory and CPU monitoring
- Implemented dynamic task slot allocation based on system resources
- Added configurable cleanup intervals
### Technical Improvements
- **Code Structure**
- Moved version management to dedicated _version.py file
- Improved error handling throughout the codebase
- Enhanced logging system with better error reporting
- Reorganized core components for better maintainability
### Bug Fixes
- Fixed issues with browser process termination
- Improved handling of connection timeouts
- Enhanced error recovery in database operations
- Fixed memory leaks in long-running processes
### Dependencies
- Updated Playwright to v1.47
- Updated core dependencies with more flexible version constraints
- Added new development dependencies for testing
### Breaking Changes
- Changed default browser handling behavior
- Modified database connection management approach
- Updated API response structure for better consistency
## Migration Guide
When upgrading to v0.3.73, be aware of the following changes:
1. Docker Deployment:
- Review Docker documentation for new deployment options
- Update environment configurations as needed
- Check container communication settings
2. If using custom browser management:
- Update browser initialization code to use new ManagedBrowser class
- Review browser cleanup procedures
3. For database operations:
- Check custom database queries for compatibility with new connection pooling
- Update error handling to work with new retry logic
4. Using the Doctor:
- Run doctor command for system diagnostics: `crawl4ai doctor`
- Review generated reports for potential issues
- Follow recommended fixes for any identified problems
## [2024-11-04 - 13:21:42] Comprehensive Update of Crawl4AI Features and Dependencies
This commit introduces several key enhancements, including improved error handling and robust database operations in `async_database.py`, which now features a connection pool and retry logic for better reliability. Updates to the README.md provide clearer instructions and a better user experience with links to documentation sections. The `.gitignore` file has been refined to include additional directories, while the async web crawler now utilizes a managed browser for more efficient crawling. Furthermore, multiple dependency updates and introduction of the `CustomHTML2Text` class enhance text extraction capabilities.

121
Dockerfile Normal file
View File

@@ -0,0 +1,121 @@
# syntax=docker/dockerfile:1.4
# Build arguments
ARG PYTHON_VERSION=3.10
# Base stage with system dependencies
FROM python:${PYTHON_VERSION}-slim as base
# Declare ARG variables again within the build stage
ARG INSTALL_TYPE=all
ARG ENABLE_GPU=false
# Platform-specific labels
LABEL maintainer="unclecode"
LABEL description="Crawl4AI - Advanced Web Crawler with AI capabilities"
LABEL version="1.0"
# Environment setup
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_DEFAULT_TIMEOUT=100 \
DEBIAN_FRONTEND=noninteractive
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
wget \
gnupg \
git \
cmake \
pkg-config \
python3-dev \
libjpeg-dev \
libpng-dev \
&& rm -rf /var/lib/apt/lists/*
# Playwright system dependencies for Linux
RUN apt-get update && apt-get install -y --no-install-recommends \
libglib2.0-0 \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libdbus-1-3 \
libxcb1 \
libxkbcommon0 \
libx11-6 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2 \
libatspi2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# GPU support if enabled
RUN if [ "$ENABLE_GPU" = "true" ] ; then \
apt-get update && apt-get install -y --no-install-recommends \
nvidia-cuda-toolkit \
&& rm -rf /var/lib/apt/lists/* ; \
fi
# Create and set working directory
WORKDIR /app
# Copy the entire project
COPY . .
# Install base requirements
RUN pip install --no-cache-dir -r requirements.txt
# Install required library for FastAPI
RUN pip install fastapi uvicorn psutil
# Install ML dependencies first for better layer caching
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install --no-cache-dir \
torch \
torchvision \
torchaudio \
scikit-learn \
nltk \
transformers \
tokenizers && \
python -m nltk.downloader punkt stopwords ; \
fi
# Install the package
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install -e ".[all]" && \
python -m crawl4ai.model_loader ; \
elif [ "$INSTALL_TYPE" = "torch" ] ; then \
pip install -e ".[torch]" ; \
elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
pip install -e ".[transformer]" && \
python -m crawl4ai.model_loader ; \
else \
pip install -e "." ; \
fi
# Install Playwright and browsers
RUN playwright install
# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# Expose port
EXPOSE 8000
# Start the FastAPI server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]

View File

@@ -116,9 +116,53 @@ pip install -e .
### Using Docker 🐳
We're in the process of creating Docker images and pushing them to Docker Hub. This will provide an easy way to run Crawl4AI in a containerized environment. Stay tuned for updates!
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
#### Option 1: Docker Hub (Recommended)
```bash
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
```
#### Option 2: Build from Repository
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
docker build -t crawl4ai:local \
--build-arg INSTALL_TYPE=basic \ # Options: basic, all
.
# Run your local build
docker run -p 11235:11235 crawl4ai:local
```
Quick test (works for both options):
```python
import requests
# Submit a crawl job
response = requests.post(
"http://localhost:11235/crawl",
json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]
# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
For more detailed installation instructions and options, please refer to our [Installation Guide](https://crawl4ai.com/mkdocs/basic/installation/).
## Quick Start 🚀
@@ -352,6 +396,7 @@ if __name__ == "__main__":
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.
</details>
## Speed Comparison 🚀

View File

@@ -172,7 +172,6 @@ class AsyncWebCrawler:
]
return await asyncio.gather(*tasks)
async def aprocess_html(
self,
url: str,
@@ -286,3 +285,5 @@ class AsyncWebCrawler:
async def aget_cache_size(self):
return await async_db_manager.aget_total_count()

View File

@@ -1,25 +0,0 @@
{
"_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.27.4",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}

Binary file not shown.

View File

@@ -1,7 +0,0 @@
{
"cls_token": "[CLS]",
"mask_token": "[MASK]",
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"unk_token": "[UNK]"
}

File diff suppressed because it is too large Load Diff

View File

@@ -1,15 +0,0 @@
{
"cls_token": "[CLS]",
"do_basic_tokenize": true,
"do_lower_case": true,
"mask_token": "[MASK]",
"model_max_length": 512,
"never_split": null,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
"strip_accents": null,
"tokenize_chinese_chars": true,
"tokenizer_class": "BertTokenizer",
"unk_token": "[UNK]"
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,35 @@
# Parameter Reference Table
| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|-----------|---------------|------------|----------------|-------------|
| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |

View File

@@ -0,0 +1,459 @@
# Docker Deployment
Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.
## Quick Start 🚀
Pull and run the basic version:
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
Test the deployment:
```python
import requests
# Test health endpoint
health = requests.get("http://localhost:11235/health")
print("Health check:", health.json())
# Test basic crawl
response = requests.post(
"http://localhost:11235/crawl",
json={
"urls": "https://www.nbcnews.com/business",
"priority": 10
}
)
task_id = response.json()["task_id"]
print("Task ID:", task_id)
```
## Available Images 🏷️
- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
- `unclecode/crawl4ai:all` - Full installation with all features
- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features
## Configuration Options 🔧
### Environment Variables
```bash
docker run -p 11235:11235 \
-e MAX_CONCURRENT_TASKS=5 \
-e OPENAI_API_KEY=your_key \
unclecode/crawl4ai:all
```
### Volume Mounting
Mount a directory for persistent data:
```bash
docker run -p 11235:11235 \
-v $(pwd)/data:/app/data \
unclecode/crawl4ai:all
```
### Resource Limits
Control container resources:
```bash
docker run -p 11235:11235 \
--memory=4g \
--cpus=2 \
unclecode/crawl4ai:all
```
## Usage Examples 📝
### Basic Crawling
```python
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10
}
response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]
# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
### Structured Data Extraction
```python
schema = {
"name": "Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
}
],
}
request = {
"urls": "https://www.coinbase.com/explore",
"extraction_config": {
"type": "json_css",
"params": {"schema": schema}
}
}
```
### Dynamic Content Handling
```python
request = {
"urls": "https://www.nbcnews.com/business",
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)"
}
```
### AI-Powered Extraction (Full Version)
```python
request = {
"urls": "https://www.nbcnews.com/business",
"extraction_config": {
"type": "cosine",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3
}
}
}
```
## Platform-Specific Instructions 💻
### macOS
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
### Ubuntu
```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```
### Windows (PowerShell)
```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
## Testing 🧪
Save this as `test_docker.py`:
```python
import requests
import json
import time
import sys
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235"):
self.base_url = base_url
def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
# Submit crawl job
response = requests.post(f"{self.base_url}/crawl", json=request_data)
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
# Poll for result
start_time = time.time()
while True:
if time.time() - start_time > timeout:
raise TimeoutError(f"Task {task_id} timeout")
result = requests.get(f"{self.base_url}/task/{task_id}")
status = result.json()
if status["status"] == "completed":
return status
time.sleep(2)
def test_deployment():
tester = Crawl4AiTester()
# Test basic crawl
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10
}
result = tester.submit_and_wait(request)
print("Basic crawl successful!")
print(f"Content length: {len(result['result']['markdown'])}")
if __name__ == "__main__":
test_deployment()
```
## Advanced Configuration ⚙️
### Crawler Parameters
The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:
```python
request = {
"urls": "https://example.com",
"crawler_params": {
# Browser Configuration
"headless": True, # Run in headless mode
"browser_type": "chromium", # chromium/firefox/webkit
"user_agent": "custom-agent", # Custom user agent
"proxy": "http://proxy:8080", # Proxy configuration
# Performance & Behavior
"page_timeout": 30000, # Page load timeout (ms)
"verbose": True, # Enable detailed logging
"semaphore_count": 5, # Concurrent request limit
# Anti-Detection Features
"simulate_user": True, # Simulate human behavior
"magic": True, # Advanced anti-detection
"override_navigator": True, # Override navigator properties
# Session Management
"user_data_dir": "./browser-data", # Browser profile location
"use_managed_browser": True, # Use persistent browser
}
}
```
### Extra Parameters
The `extra` field allows passing additional parameters directly to the crawler's `arun` function:
```python
request = {
"urls": "https://example.com",
"extra": {
"word_count_threshold": 10, # Min words per block
"only_text": True, # Extract only text
"bypass_cache": True, # Force fresh crawl
"process_iframes": True, # Include iframe content
}
}
```
### Complete Examples
1. **Advanced News Crawling**
```python
request = {
"urls": "https://www.nbcnews.com/business",
"crawler_params": {
"headless": True,
"page_timeout": 30000,
"remove_overlay_elements": True # Remove popups
},
"extra": {
"word_count_threshold": 50, # Longer content blocks
"bypass_cache": True # Fresh content
},
"css_selector": ".article-body"
}
```
2. **Anti-Detection Configuration**
```python
request = {
"urls": "https://example.com",
"crawler_params": {
"simulate_user": True,
"magic": True,
"override_navigator": True,
"user_agent": "Mozilla/5.0 ...",
"headers": {
"Accept-Language": "en-US,en;q=0.9"
}
}
}
```
3. **LLM Extraction with Custom Parameters**
```python
request = {
"urls": "https://openai.com/pricing",
"extraction_config": {
"type": "llm",
"params": {
"provider": "openai/gpt-4",
"schema": pricing_schema
}
},
"crawler_params": {
"verbose": True,
"page_timeout": 60000
},
"extra": {
"word_count_threshold": 1,
"only_text": True
}
}
```
4. **Session-Based Dynamic Content**
```python
request = {
"urls": "https://example.com",
"crawler_params": {
"session_id": "dynamic_session",
"headless": False,
"page_timeout": 60000
},
"js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
"wait_for": "js:() => document.querySelectorAll('.item').length > 10",
"extra": {
"delay_before_return_html": 2.0
}
}
```
5. **Screenshot with Custom Timing**
```python
request = {
"urls": "https://example.com",
"screenshot": True,
"crawler_params": {
"headless": True,
"screenshot_wait_for": ".main-content"
},
"extra": {
"delay_before_return_html": 3.0
}
}
```
### Parameter Reference Table
| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |
## Troubleshooting 🔍
### Common Issues
1. **Connection Refused**
```
Error: Connection refused at localhost:11235
```
Solution: Ensure the container is running and ports are properly mapped.
2. **Resource Limits**
```
Error: No available slots
```
Solution: Increase MAX_CONCURRENT_TASKS or container resources.
3. **GPU Access**
```
Error: GPU not found
```
Solution: Ensure proper NVIDIA drivers and use `--gpus all` flag.
### Debug Mode
Access container for debugging:
```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```
View container logs:
```bash
docker logs [container_id]
```
## Best Practices 🌟
1. **Resource Management**
- Set appropriate memory and CPU limits
- Monitor resource usage via health endpoint
- Use basic version for simple crawling tasks
2. **Scaling**
- Use multiple containers for high load
- Implement proper load balancing
- Monitor performance metrics
3. **Security**
- Use environment variables for sensitive data
- Implement proper network isolation
- Regular security updates
## API Reference 📚
### Health Check
```http
GET /health
```
### Submit Crawl Task
```http
POST /crawl
Content-Type: application/json
{
"urls": "string or array",
"extraction_config": {
"type": "basic|llm|cosine|json_css",
"params": {}
},
"priority": 1-10,
"ttl": 3600
}
```
### Get Task Status
```http
GET /task/{task_id}
```
For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).

View File

@@ -72,7 +72,7 @@ Our documentation is organized into several sections:
### Advanced Features
- [Magic Mode](advanced/magic-mode.md)
- [Session Management](advanced/session-management.md)
- [Hooks & Authentication](advanced/hooks.md)
- [Hooks & Authentication](advanced/hooks-auth.md)
- [Proxy & Security](advanced/proxy-security.md)
- [Content Processing](advanced/content-processing.md)

View File

@@ -269,6 +269,7 @@ class CrawlerService:
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
**request.extra,
)
else:
results = await crawler.arun(
@@ -279,6 +280,7 @@ class CrawlerService:
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
**request.extra,
)
await self.crawler_pool.release(crawler)
@@ -343,4 +345,4 @@ async def health_check():
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
uvicorn.run(app, host="0.0.0.0", port=11235)

View File

@@ -8,6 +8,7 @@ docs_dir: docs/md_v2
nav:
- Home: 'index.md'
- 'Installation': 'basic/installation.md'
- 'Docker Deplotment': 'basic/docker-deploymeny.md'
- 'Quick Start': 'basic/quickstart.md'
- Basic:
@@ -34,6 +35,7 @@ nav:
- 'Chunking': 'extraction/chunking.md'
- API Reference:
- 'Parameters Table': 'api/parameters.md'
- 'AsyncWebCrawler': 'api/async-webcrawler.md'
- 'AsyncWebCrawler.arun()': 'api/arun.md'
- 'CrawlResult': 'api/crawl-result.md'

View File

@@ -31,9 +31,11 @@ with open("crawl4ai/_version.py") as f:
# Define the requirements for different environments
default_requirements = requirements
torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
cosine_similarity_requirements = ["torch", "transformers", "nltk", "spacy"]
# torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
# transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
torch_requirements = ["torch", "nltk", "scikit-learn"]
transformer_requirements = ["transformers", "tokenizers"]
cosine_similarity_requirements = ["torch", "transformers", "nltk" ]
sync_requirements = ["selenium"]
def install_playwright():

299
tests/test_docker.py Normal file
View File

@@ -0,0 +1,299 @@
import requests
import json
import time
import sys
import base64
import os
from typing import Dict, Any
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:8000"):
self.base_url = base_url
def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
# Submit crawl job
response = requests.post(f"{self.base_url}/crawl", json=request_data)
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
# Poll for result
start_time = time.time()
while True:
if time.time() - start_time > timeout:
raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
result = requests.get(f"{self.base_url}/task/{task_id}")
status = result.json()
if status["status"] == "failed":
print("Task failed:", status.get("error"))
raise Exception(f"Task failed: {status.get('error')}")
if status["status"] == "completed":
return status
time.sleep(2)
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester()
print(f"Testing Crawl4AI Docker {version} version")
# Health check with timeout and retry
max_retries = 5
for i in range(max_retries):
try:
health = requests.get(f"{tester.base_url}/health", timeout=10)
print("Health check:", health.json())
break
except requests.exceptions.RequestException as e:
if i == max_retries - 1:
print(f"Failed to connect after {max_retries} attempts")
sys.exit(1)
print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
time.sleep(5)
# Test cases based on version
test_basic_crawl(tester)
if version in ["full", "transformer"]:
test_cosine_extraction(tester)
# test_js_execution(tester)
# test_css_selector(tester)
# test_structured_extraction(tester)
# test_llm_extraction(tester)
# test_llm_with_ollama(tester)
# test_screenshot(tester)
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10
}
result = tester.submit_and_wait(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)",
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 7,
"css_selector": ".wide-tease-item__description",
"crawler_params": {
"headless": True
},
"extra": {"word_count_threshold": 10}
}
result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===")
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
}
],
}
request = {
"urls": "https://www.coinbase.com/explore",
"priority": 9,
"extraction_config": {
"type": "json_css",
"params": {
"schema": schema
}
}
}
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} items")
print("Sample item:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
assert len(extracted) > 0
def test_llm_extraction(tester: Crawl4AiTester):
print("\n=== Testing LLM Extraction ===")
schema = {
"type": "object",
"properties": {
"model_name": {
"type": "string",
"description": "Name of the OpenAI model."
},
"input_fee": {
"type": "string",
"description": "Fee for input token for the OpenAI model."
},
"output_fee": {
"type": "string",
"description": "Fee for output token for the OpenAI model."
}
},
"required": ["model_name", "input_fee", "output_fee"]
}
request = {
"urls": "https://openai.com/api/pricing",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema,
"extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
}
},
"crawler_params": {"word_count_threshold": 1}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} model pricing entries")
print("Sample entry:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===")
schema = {
"type": "object",
"properties": {
"article_title": {
"type": "string",
"description": "The main title of the news article"
},
"summary": {
"type": "string",
"description": "A brief summary of the article content"
},
"main_topics": {
"type": "array",
"items": {"type": "string"},
"description": "Main topics or themes discussed in the article"
}
}
}
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "ollama/llama2",
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics."
}
},
"extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print("Extracted content:", json.dumps(extracted, indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"Ollama extraction test failed: {str(e)}")
def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "cosine",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} text clusters")
print("First cluster tags:", extracted[0]["tags"])
assert result["result"]["success"]
except Exception as e:
print(f"Cosine extraction test failed: {str(e)}")
def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 5,
"screenshot": True,
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print("Screenshot captured:", bool(result["result"]["screenshot"]))
if result["result"]["screenshot"]:
# Save screenshot
screenshot_data = base64.b64decode(result["result"]["screenshot"])
with open("test_screenshot.jpg", "wb") as f:
f.write(screenshot_data)
print("Screenshot saved as test_screenshot.jpg")
assert result["result"]["success"]
if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full"
test_docker_deployment(version)