feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:

- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Markdown generation with a tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for a detailed migration guide.
**CHANGELOG.md** (92 changed lines)

# CHANGELOG

## [v0.3.73] - 2024-11-05

### Major Features

- **New Doctor Feature**
  - Added comprehensive system diagnostics tool
  - Available through package hub and CLI
  - Provides automated troubleshooting and system health checks
  - Includes detailed reporting of configuration issues
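A diagnostics tool like this boils down to a list of environment probes collected into a report. A toy sketch of the idea (hypothetical helper; not the actual `crawl4ai doctor` implementation):

```python
import shutil
import sys

def run_doctor() -> list:
    """Collect a few environment health checks into a report.

    Toy illustration only; the real `crawl4ai doctor` runs far more
    extensive diagnostics (browser, network, configuration checks).
    """
    report = [f"python: {sys.version_info.major}.{sys.version_info.minor}"]
    # Probe for external tools a crawling stack commonly relies on.
    for tool in ("git", "node"):
        report.append(f"{tool}: {'found' if shutil.which(tool) else 'missing'}")
    return report

print("\n".join(run_doctor()))
```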
- **Dockerized API Server**
  - Released complete Docker implementation for API server
  - Added comprehensive documentation for Docker deployment
  - Implemented container communication protocols
  - Added environment configuration guides

- **Managed Browser Integration**
  - Added support for user-controlled browser instances
  - Implemented `ManagedBrowser` class for better browser lifecycle management
  - Added ability to connect to existing Chrome DevTools Protocol (CDP) endpoints
  - Introduced user data directory support for persistent browser profiles
- **Enhanced HTML Processing**
  - Added HTML tag preservation feature during markdown conversion
  - Introduced configurable tag preservation system
  - Improved pre-tag and code block handling
  - Added support for nested preserved tags with attribute retention
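Tag preservation during HTML-to-markdown conversion amounts to: emit plain text normally, but copy whitelisted elements (and their attributes) through verbatim. A simplified stdlib sketch of the technique, not the library's actual converter:

```python
from html.parser import HTMLParser

class PreservingTextExtractor(HTMLParser):
    """Extract plain text, passing whitelisted tags through verbatim."""

    def __init__(self, preserve=("pre", "code")):
        super().__init__(convert_charrefs=True)
        self.preserve = set(preserve)
        self.depth = 0   # > 0 while inside a preserved element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth or tag in self.preserve:
            # Attribute retention: re-emit attributes as-is.
            attr_str = "".join(
                f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs
            )
            self.out.append(f"<{tag}{attr_str}>")
        if tag in self.preserve:
            self.depth += 1   # supports nested preserved tags

    def handle_endtag(self, tag):
        if tag in self.preserve:
            self.depth -= 1
        if self.depth or tag in self.preserve:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

p = PreservingTextExtractor()
p.feed('<div>Intro <pre class="hl">x = 1</pre> outro</div>')
print("".join(p.out))  # Intro <pre class="hl">x = 1</pre> outro
```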
### Improvements

- **Browser Handling**
  - Added flag to ignore body visibility for problematic pages
  - Improved browser process cleanup and management
  - Enhanced temporary directory handling for browser profiles
  - Added configurable browser launch arguments
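Temporary-profile handling typically reduces to create-on-launch, remove-on-close, even when the browser exits uncleanly. A minimal sketch of the pattern (illustrative; not the library's code):

```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_profile_dir(prefix: str = "browser-profile-"):
    """Yield a throwaway user-data directory, removed on exit even if
    the crawl raises mid-way."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

with temp_profile_dir() as profile:
    # e.g. launch the browser with --user-data-dir=<profile>
    print("profile at:", profile)
```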
- **Database Management**
  - Implemented connection pooling for better performance
  - Added retry logic for database operations
  - Improved error handling and logging
  - Enhanced cleanup procedures for database connections
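Retry logic for async database operations usually means wrapping each call in exponential backoff. A generic sketch of the pattern (assumed shape; the project's actual `async_database.py` may differ):

```python
import asyncio

async def with_retry(op, retries=3, base_delay=0.01):
    """Run an async operation, backing off exponentially between failures."""
    for attempt in range(retries):
        try:
            return await op()
        except Exception:
            if attempt == retries - 1:
                raise   # out of attempts: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))

# Demo: an operation that fails twice, then succeeds on the third try.
calls = {"n": 0}

async def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "row"

print(asyncio.run(with_retry(flaky_query)))  # row
```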
- **Resource Management**
  - Added memory and CPU monitoring
  - Implemented dynamic task slot allocation based on system resources
  - Added configurable cleanup intervals
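Dynamic slot allocation is essentially "concurrency = whichever of memory or CPU runs out first". A toy version of the idea (thresholds and parameter names invented for illustration):

```python
def task_slots(available_mem_mb: float, cpu_count: int,
               mem_per_task_mb: float = 512.0, tasks_per_cpu: int = 2) -> int:
    """Bound concurrent crawl tasks by both memory and CPU headroom."""
    by_mem = int(available_mem_mb // mem_per_task_mb)
    by_cpu = cpu_count * tasks_per_cpu
    return max(1, min(by_mem, by_cpu))

# 2 GB free, 4 cores -> memory is the binding constraint.
print(task_slots(available_mem_mb=2048, cpu_count=4))  # 4
```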
### Technical Improvements

- **Code Structure**
  - Moved version management to a dedicated `_version.py` file
  - Improved error handling throughout the codebase
  - Enhanced logging system with better error reporting
  - Reorganized core components for better maintainability

### Bug Fixes

- Fixed issues with browser process termination
- Improved handling of connection timeouts
- Enhanced error recovery in database operations
- Fixed memory leaks in long-running processes
### Dependencies

- Updated Playwright to v1.47
- Updated core dependencies with more flexible version constraints
- Added new development dependencies for testing

### Breaking Changes

- Changed default browser handling behavior
- Modified database connection management approach
- Updated API response structure for better consistency

## Migration Guide

When upgrading to v0.3.73, be aware of the following changes:
1. Docker Deployment:
   - Review Docker documentation for new deployment options
   - Update environment configurations as needed
   - Check container communication settings

2. If using custom browser management:
   - Update browser initialization code to use the new `ManagedBrowser` class
   - Review browser cleanup procedures

3. For database operations:
   - Check custom database queries for compatibility with new connection pooling
   - Update error handling to work with new retry logic

4. Using the Doctor:
   - Run the doctor command for system diagnostics: `crawl4ai doctor`
   - Review generated reports for potential issues
   - Follow recommended fixes for any identified problems
## [2024-11-04 - 13:21:42] Comprehensive Update of Crawl4AI Features and Dependencies

This commit introduces several key enhancements, including improved error handling and robust database operations in `async_database.py`, which now features a connection pool and retry logic for better reliability. Updates to the README.md provide clearer instructions and a better user experience, with links to documentation sections. The `.gitignore` file has been refined to include additional directories, and the async web crawler now uses a managed browser for more efficient crawling. Multiple dependency updates and the introduction of the `CustomHTML2Text` class enhance text extraction capabilities.
**Dockerfile** (new file, 121 lines)

```dockerfile
# syntax=docker/dockerfile:1.4

# Build arguments
ARG PYTHON_VERSION=3.10

# Base stage with system dependencies
FROM python:${PYTHON_VERSION}-slim as base

# Declare ARG variables again within the build stage
ARG INSTALL_TYPE=all
ARG ENABLE_GPU=false

# Platform-specific labels
LABEL maintainer="unclecode"
LABEL description="Crawl4AI - Advanced Web Crawler with AI capabilities"
LABEL version="1.0"

# Environment setup
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_DEFAULT_TIMEOUT=100 \
    DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    wget \
    gnupg \
    git \
    cmake \
    pkg-config \
    python3-dev \
    libjpeg-dev \
    libpng-dev \
    && rm -rf /var/lib/apt/lists/*

# Playwright system dependencies for Linux
RUN apt-get update && apt-get install -y --no-install-recommends \
    libglib2.0-0 \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxcb1 \
    libxkbcommon0 \
    libx11-6 \
    libxcomposite1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    libatspi2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# GPU support if enabled
RUN if [ "$ENABLE_GPU" = "true" ] ; then \
    apt-get update && apt-get install -y --no-install-recommends \
    nvidia-cuda-toolkit \
    && rm -rf /var/lib/apt/lists/* ; \
    fi

# Create and set working directory
WORKDIR /app

# Copy the entire project
COPY . .

# Install base requirements
RUN pip install --no-cache-dir -r requirements.txt

# Install required libraries for the FastAPI server
RUN pip install --no-cache-dir fastapi uvicorn psutil

# Install ML dependencies first for better layer caching
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
    pip install --no-cache-dir \
    torch \
    torchvision \
    torchaudio \
    scikit-learn \
    nltk \
    transformers \
    tokenizers && \
    python -m nltk.downloader punkt stopwords ; \
    fi

# Install the package
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
    pip install -e ".[all]" && \
    python -m crawl4ai.model_loader ; \
    elif [ "$INSTALL_TYPE" = "torch" ] ; then \
    pip install -e ".[torch]" ; \
    elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
    pip install -e ".[transformer]" && \
    python -m crawl4ai.model_loader ; \
    else \
    pip install -e "." ; \
    fi

# Install Playwright and browsers
RUN playwright install

# Health check (the server listens on 11235, as set in CMD below)
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:11235/health || exit 1

# Expose port
EXPOSE 11235

# Start the FastAPI server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]
```
**README.md** (49 changed lines)

### Using Docker 🐳

Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.

#### Option 1: Docker Hub (Recommended)

```bash
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic  # Basic crawling features
docker pull unclecode/crawl4ai:all    # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu    # GPU-enabled version

# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic  # Replace 'basic' with your chosen version
```

#### Option 2: Build from Repository

```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build the image (INSTALL_TYPE options: basic, all)
docker build -t crawl4ai:local --build-arg INSTALL_TYPE=basic .

# Run your local build
docker run -p 11235:11235 crawl4ai:local
```

Quick test (works for both options):

```python
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).

For more detailed installation instructions and options, please refer to our [Installation Guide](https://crawl4ai.com/mkdocs/basic/installation/).

## Quick Start 🚀
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.

For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.

</details>

## Speed Comparison 🚀
**crawl4ai/async_webcrawler.py** (excerpts from `class AsyncWebCrawler`)

```python
        ]
        return await asyncio.gather(*tasks)

    async def aprocess_html(
        self,
        url: str,
```

```python
    async def aget_cache_size(self):
        return await async_db_manager.aget_total_count()
```
Deleted bundled model files for `sentence-transformers/all-MiniLM-L6-v2`:

`config.json`:

```json
{
  "_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```

A binary model file was also removed (not shown).

`special_tokens_map.json`:

```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```

A large tokenizer file was also removed (diff suppressed).

`tokenizer_config.json`:

```json
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
```

Another large file's diff was suppressed.
**docs/md_v2/api/parameters.md** (new file, 35 lines)

# Parameter Reference Table

| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|-----------|----------------|------------|----------------|-------------|
| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
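Every option above flows through `**kwargs` with a default, as the Code Usage column shows. A toy class illustrating that pattern (invented name; not the real strategy class):

```python
class MiniStrategy:
    """Mimics the kwargs.get(...) configuration style used in the table."""

    def __init__(self, **kwargs):
        self.headless = kwargs.get("headless", True)
        self.browser_type = kwargs.get("browser_type", "chromium")
        self.page_timeout = kwargs.get("page_timeout", 60000)
        self.semaphore_count = kwargs.get("semaphore_count", 5)

default = MiniStrategy()
custom = MiniStrategy(browser_type="firefox", page_timeout=30000)
print(default.browser_type, custom.browser_type)  # chromium firefox
```

Unknown keys are silently ignored with this style, so option typos don't raise; the table is the authoritative list of recognized names.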
**docs/md_v2/basic/docker-deploymeny.md** (new file, 459 lines)

# Docker Deployment

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

## Quick Start 🚀

Pull and run the basic version:

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

Test the deployment:

```python
import requests

# Test health endpoint
health = requests.get("http://localhost:11235/health")
print("Health check:", health.json())

# Test basic crawl
response = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]
print("Task ID:", task_id)
```
## Available Images 🏷️

- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
- `unclecode/crawl4ai:all` - Full installation with all features
- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features

## Configuration Options 🔧

### Environment Variables

```bash
docker run -p 11235:11235 \
    -e MAX_CONCURRENT_TASKS=5 \
    -e OPENAI_API_KEY=your_key \
    unclecode/crawl4ai:all
```

### Volume Mounting

Mount a directory for persistent data:

```bash
docker run -p 11235:11235 \
    -v $(pwd)/data:/app/data \
    unclecode/crawl4ai:all
```

### Resource Limits

Control container resources:

```bash
docker run -p 11235:11235 \
    --memory=4g \
    --cpus=2 \
    unclecode/crawl4ai:all
```
## Usage Examples 📝

### Basic Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

### Structured Data Extraction

```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```

### Dynamic Content Handling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```

### AI-Powered Extraction (Full Version)

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```
## Platform-Specific Instructions 💻

### macOS

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

### Ubuntu

```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```

### Windows (PowerShell)

```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
## Testing 🧪

Save this as `test_docker.py`:

```python
import requests
import time

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timed out")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()
```
## Advanced Configuration ⚙️
|
||||||
|
|
||||||
|
### Crawler Parameters
|
||||||
|
|
||||||
|
The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:
|
||||||
|
|
||||||
|
```python
|
||||||
|
request = {
|
||||||
|
"urls": "https://example.com",
|
||||||
|
"crawler_params": {
|
||||||
|
# Browser Configuration
|
||||||
|
"headless": True, # Run in headless mode
|
||||||
|
"browser_type": "chromium", # chromium/firefox/webkit
|
||||||
|
"user_agent": "custom-agent", # Custom user agent
|
||||||
|
"proxy": "http://proxy:8080", # Proxy configuration
|
||||||
|
|
||||||
|
# Performance & Behavior
|
||||||
|
"page_timeout": 30000, # Page load timeout (ms)
|
||||||
|
"verbose": True, # Enable detailed logging
|
||||||
|
"semaphore_count": 5, # Concurrent request limit
|
||||||
|
|
||||||
|
# Anti-Detection Features
|
||||||
|
"simulate_user": True, # Simulate human behavior
|
||||||
|
"magic": True, # Advanced anti-detection
|
||||||
|
"override_navigator": True, # Override navigator properties
|
||||||
|
|
||||||
|
# Session Management
|
||||||
|
"user_data_dir": "./browser-data", # Browser profile location
|
||||||
|
"use_managed_browser": True, # Use persistent browser
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Extra Parameters
|
||||||
|
|
||||||
|
The `extra` field allows passing additional parameters directly to the crawler's `arun` function:
|
||||||
|
|
||||||
|
```python
|
||||||
|
request = {
|
||||||
|
"urls": "https://example.com",
|
||||||
|
"extra": {
|
||||||
|
"word_count_threshold": 10, # Min words per block
|
||||||
|
"only_text": True, # Extract only text
|
||||||
|
"bypass_cache": True, # Force fresh crawl
|
||||||
|
"process_iframes": True, # Include iframe content
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Complete Examples

1. **Advanced News Crawling**

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True  # Remove popups
    },
    "extra": {
        "word_count_threshold": 50,  # Longer content blocks
        "bypass_cache": True         # Fresh content
    },
    "css_selector": ".article-body"
}
```

2. **Anti-Detection Configuration**

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,
        "magic": True,
        "override_navigator": True,
        "user_agent": "Mozilla/5.0 ...",
        "headers": {
            "Accept-Language": "en-US,en;q=0.9"
        }
    }
}
```

3. **LLM Extraction with Custom Parameters**

```python
request = {
    "urls": "https://openai.com/pricing",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "schema": pricing_schema
        }
    },
    "crawler_params": {
        "verbose": True,
        "page_timeout": 60000
    },
    "extra": {
        "word_count_threshold": 1,
        "only_text": True
    }
}
```

4. **Session-Based Dynamic Content**

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}
```

5. **Screenshot with Custom Timing**

```python
request = {
    "urls": "https://example.com",
    "screenshot": True,
    "crawler_params": {
        "headless": True,
        "screenshot_wait_for": ".main-content"
    },
    "extra": {
        "delay_before_return_html": 3.0
    }
}
```
### Parameter Reference Table

| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |
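In the examples above, browser, network, timing, session, and debug options travel under `crawler_params`, while content options (and per-call timing such as `delay_before_return_html`) travel under `extra`. A small routing helper can keep that split consistent; `build_request` is a hypothetical client-side convenience, not part of the library:

```python
# Hypothetical helper: routes flat options into the two request sections,
# following how the complete examples above place each parameter.
CRAWLER_PARAM_KEYS = {
    "headless", "browser_type", "user_agent", "proxy", "headers",
    "page_timeout", "simulate_user", "magic", "session_id",
    "user_data_dir", "verbose", "log_console",
}
EXTRA_KEYS = {
    "word_count_threshold", "only_text", "process_iframes",
    "delay_before_return_html",
}

def build_request(urls, **options):
    request = {"urls": urls, "crawler_params": {}, "extra": {}}
    for key, value in options.items():
        if key in CRAWLER_PARAM_KEYS:
            request["crawler_params"][key] = value
        elif key in EXTRA_KEYS:
            request["extra"][key] = value
        else:
            raise ValueError(f"Unknown parameter: {key}")
    return request

req = build_request("https://example.com", headless=True, only_text=True)
print(req["crawler_params"])  # {'headless': True}
print(req["extra"])           # {'only_text': True}
```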
## Troubleshooting 🔍

### Common Issues

1. **Connection Refused**
```
Error: Connection refused at localhost:11235
```
Solution: Ensure the container is running and ports are properly mapped.

2. **Resource Limits**
```
Error: No available slots
```
Solution: Increase MAX_CONCURRENT_TASKS or container resources.
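For example, both the concurrency limit and the container resources can be raised at startup. The flags below are standard Docker options; the tag and values are illustrative only:

```bash
# Illustrative: raise the task slot limit and give the container more resources
docker run -d -p 11235:11235 \
  -e MAX_CONCURRENT_TASKS=10 \
  --memory=4g --cpus=2 \
  unclecode/crawl4ai:basic
```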
3. **GPU Access**
```
Error: GPU not found
```
Solution: Ensure the proper NVIDIA drivers are installed and run with the `--gpus all` flag.

### Debug Mode

Access the container for debugging:
```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```

View container logs:
```bash
docker logs [container_id]
```
## Best Practices 🌟

1. **Resource Management**
   - Set appropriate memory and CPU limits
   - Monitor resource usage via the health endpoint
   - Use the basic version for simple crawling tasks

2. **Scaling**
   - Use multiple containers for high load
   - Implement proper load balancing
   - Monitor performance metrics

3. **Security**
   - Use environment variables for sensitive data
   - Implement proper network isolation
   - Apply security updates regularly
## API Reference 📚

### Health Check

```http
GET /health
```

### Submit Crawl Task

```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```
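A client can sanity-check a payload against this shape before submitting it. `validate_crawl_request` below is a hypothetical client-side helper (not part of the API); the field names and ranges mirror the reference above:

```python
VALID_EXTRACTION_TYPES = {"basic", "llm", "cosine", "json_css"}

def validate_crawl_request(payload):
    # Hypothetical client-side check mirroring the POST /crawl schema above.
    if not payload.get("urls"):
        raise ValueError("urls is required (string or array)")
    priority = payload.get("priority", 5)
    if not 1 <= priority <= 10:
        raise ValueError("priority must be between 1 and 10")
    config = payload.get("extraction_config")
    if config and config.get("type") not in VALID_EXTRACTION_TYPES:
        raise ValueError(f"unknown extraction type: {config.get('type')}")
    return payload

payload = validate_crawl_request({
    "urls": "https://example.com",
    "extraction_config": {"type": "basic", "params": {}},
    "priority": 1,
    "ttl": 3600,
})
print(payload["priority"])  # 1
```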
### Get Task Status

```http
GET /task/{task_id}
```

For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).

```diff
@@ -72,7 +72,7 @@ Our documentation is organized into several sections:
 ### Advanced Features
 - [Magic Mode](advanced/magic-mode.md)
 - [Session Management](advanced/session-management.md)
-- [Hooks & Authentication](advanced/hooks.md)
+- [Hooks & Authentication](advanced/hooks-auth.md)
 - [Proxy & Security](advanced/proxy-security.md)
 - [Content Processing](advanced/content-processing.md)
```
**main.py** (4 changed lines)

```diff
@@ -269,6 +269,7 @@ class CrawlerService:
                     css_selector=request.css_selector,
                     screenshot=request.screenshot,
                     magic=request.magic,
+                    **request.extra,
                 )
             else:
                 results = await crawler.arun(
@@ -279,6 +280,7 @@ class CrawlerService:
                     css_selector=request.css_selector,
                     screenshot=request.screenshot,
                     magic=request.magic,
+                    **request.extra,
                 )

         await self.crawler_pool.release(crawler)
@@ -343,4 +345,4 @@ async def health_check():

 if __name__ == "__main__":
     import uvicorn
-    uvicorn.run(app, host="0.0.0.0", port=8000)
+    uvicorn.run(app, host="0.0.0.0", port=11235)
```
**mkdocs.yml** (2 added lines)

```diff
@@ -8,6 +8,7 @@ docs_dir: docs/md_v2
 nav:
   - Home: 'index.md'
   - 'Installation': 'basic/installation.md'
+  - 'Docker Deplotment': 'basic/docker-deploymeny.md'
   - 'Quick Start': 'basic/quickstart.md'

   - Basic:
@@ -34,6 +35,7 @@ nav:
   - 'Chunking': 'extraction/chunking.md'

   - API Reference:
+    - 'Parameters Table': 'api/parameters.md'
     - 'AsyncWebCrawler': 'api/async-webcrawler.md'
     - 'AsyncWebCrawler.arun()': 'api/arun.md'
     - 'CrawlResult': 'api/crawl-result.md'
```
**setup.py** (8 changed lines)

```diff
@@ -31,9 +31,11 @@ with open("crawl4ai/_version.py") as f:

 # Define the requirements for different environments
 default_requirements = requirements
-torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
-transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
-cosine_similarity_requirements = ["torch", "transformers", "nltk", "spacy"]
+# torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
+# transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
+torch_requirements = ["torch", "nltk", "scikit-learn"]
+transformer_requirements = ["transformers", "tokenizers"]
+cosine_similarity_requirements = ["torch", "transformers", "nltk"]
 sync_requirements = ["selenium"]

 def install_playwright():
```
**tests/test_docker.py** (new file, 299 lines)

```python
import requests
import json
import time
import sys
import base64
import os
from typing import Dict, Any

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "failed":
                print("Task failed:", status.get("error"))
                raise Exception(f"Task failed: {status.get('error')}")

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_docker_deployment(version="basic"):
    tester = Crawl4AiTester()
    print(f"Testing Crawl4AI Docker {version} version")

    # Health check with timeout and retry
    max_retries = 5
    for i in range(max_retries):
        try:
            health = requests.get(f"{tester.base_url}/health", timeout=10)
            print("Health check:", health.json())
            break
        except requests.exceptions.RequestException:
            if i == max_retries - 1:
                print(f"Failed to connect after {max_retries} attempts")
                sys.exit(1)
            print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
            time.sleep(5)

    # Test cases based on version
    test_basic_crawl(tester)
    if version in ["full", "transformer"]:
        test_cosine_extraction(tester)

    # test_js_execution(tester)
    # test_css_selector(tester)
    # test_structured_extraction(tester)
    # test_llm_extraction(tester)
    # test_llm_with_ollama(tester)
    # test_screenshot(tester)

def test_basic_crawl(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]
    assert len(result["result"]["markdown"]) > 0

def test_js_execution(tester: Crawl4AiTester):
    print("\n=== Testing JS Execution ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "js_code": [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ],
        "wait_for": "article.tease-card:nth-child(10)",
        "crawler_params": {
            "headless": True
        }
    }

    result = tester.submit_and_wait(request)
    print(f"JS execution result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]

def test_css_selector(tester: Crawl4AiTester):
    print("\n=== Testing CSS Selector ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 7,
        "css_selector": ".wide-tease-item__description",
        "crawler_params": {
            "headless": True
        },
        "extra": {"word_count_threshold": 10}
    }

    result = tester.submit_and_wait(request)
    print(f"CSS selector result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]

def test_structured_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Structured Extraction ===")
    schema = {
        "name": "Coinbase Crypto Prices",
        "baseSelector": ".cds-tableRow-t45thuk",
        "fields": [
            {
                "name": "crypto",
                "selector": "td:nth-child(1) h2",
                "type": "text",
            },
            {
                "name": "symbol",
                "selector": "td:nth-child(1) p",
                "type": "text",
            },
            {
                "name": "price",
                "selector": "td:nth-child(2)",
                "type": "text",
            }
        ],
    }

    request = {
        "urls": "https://www.coinbase.com/explore",
        "priority": 9,
        "extraction_config": {
            "type": "json_css",
            "params": {
                "schema": schema
            }
        }
    }

    result = tester.submit_and_wait(request)
    extracted = json.loads(result["result"]["extracted_content"])
    print(f"Extracted {len(extracted)} items")
    print("Sample item:", json.dumps(extracted[0], indent=2))
    assert result["result"]["success"]
    assert len(extracted) > 0

def test_llm_extraction(tester: Crawl4AiTester):
    print("\n=== Testing LLM Extraction ===")
    schema = {
        "type": "object",
        "properties": {
            "model_name": {
                "type": "string",
                "description": "Name of the OpenAI model."
            },
            "input_fee": {
                "type": "string",
                "description": "Fee for input token for the OpenAI model."
            },
            "output_fee": {
                "type": "string",
                "description": "Fee for output token for the OpenAI model."
            }
        },
        "required": ["model_name", "input_fee", "output_fee"]
    }

    request = {
        "urls": "https://openai.com/api/pricing",
        "priority": 8,
        "extraction_config": {
            "type": "llm",
            "params": {
                "provider": "openai/gpt-4o-mini",
                "api_token": os.getenv("OPENAI_API_KEY"),
                "schema": schema,
                "extraction_type": "schema",
                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
            }
        },
        "crawler_params": {"word_count_threshold": 1}
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print(f"Extracted {len(extracted)} model pricing entries")
        print("Sample entry:", json.dumps(extracted[0], indent=2))
        assert result["result"]["success"]
    except Exception as e:
        print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")

def test_llm_with_ollama(tester: Crawl4AiTester):
    print("\n=== Testing LLM with Ollama ===")
    schema = {
        "type": "object",
        "properties": {
            "article_title": {
                "type": "string",
                "description": "The main title of the news article"
            },
            "summary": {
                "type": "string",
                "description": "A brief summary of the article content"
            },
            "main_topics": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Main topics or themes discussed in the article"
            }
        }
    }

    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "extraction_config": {
            "type": "llm",
            "params": {
                "provider": "ollama/llama2",
                "schema": schema,
                "extraction_type": "schema",
                "instruction": "Extract the main article information including title, summary, and main topics."
            }
        },
        "extra": {"word_count_threshold": 1},
        "crawler_params": {"verbose": True}
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print("Extracted content:", json.dumps(extracted, indent=2))
        assert result["result"]["success"]
    except Exception as e:
        print(f"Ollama extraction test failed: {str(e)}")

def test_cosine_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Cosine Extraction ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "extraction_config": {
            "type": "cosine",
            "params": {
                "semantic_filter": "business finance economy",
                "word_count_threshold": 10,
                "max_dist": 0.2,
                "top_k": 3
            }
        }
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print(f"Extracted {len(extracted)} text clusters")
        print("First cluster tags:", extracted[0]["tags"])
        assert result["result"]["success"]
    except Exception as e:
        print(f"Cosine extraction test failed: {str(e)}")

def test_screenshot(tester: Crawl4AiTester):
    print("\n=== Testing Screenshot ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 5,
        "screenshot": True,
        "crawler_params": {
            "headless": True
        }
    }

    result = tester.submit_and_wait(request)
    print("Screenshot captured:", bool(result["result"]["screenshot"]))

    if result["result"]["screenshot"]:
        # Save screenshot
        screenshot_data = base64.b64decode(result["result"]["screenshot"])
        with open("test_screenshot.jpg", "wb") as f:
            f.write(screenshot_data)
        print("Screenshot saved as test_screenshot.jpg")

    assert result["result"]["success"]

if __name__ == "__main__":
    version = sys.argv[1] if len(sys.argv) > 1 else "basic"
    # version = "full"
    test_docker_deployment(version)
```