feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:

- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Markdown generation with a tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for a detailed migration guide.
**CHANGELOG.md** (92 changed lines)

# CHANGELOG

## [v0.3.73] - 2024-11-05

### Major Features

- **New Doctor Feature**
  - Added comprehensive system diagnostics tool
  - Available through package hub and CLI
  - Provides automated troubleshooting and system health checks
  - Includes detailed reporting of configuration issues
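A diagnostics tool like this boils down to a list of environment probes collected into a report. A toy sketch of the idea (hypothetical helper; not the actual `crawl4ai doctor` implementation):

```python
import shutil
import sys

def run_doctor() -> list:
    """Collect a few environment health checks into a report.

    Toy illustration only; the real `crawl4ai doctor` runs far more
    extensive diagnostics (browser, network, configuration checks).
    """
    report = [f"python: {sys.version_info.major}.{sys.version_info.minor}"]
    # Probe for external tools a crawling stack commonly relies on.
    for tool in ("git", "node"):
        report.append(f"{tool}: {'found' if shutil.which(tool) else 'missing'}")
    return report

print("\n".join(run_doctor()))
```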
- **Dockerized API Server**
  - Released complete Docker implementation for API server
  - Added comprehensive documentation for Docker deployment
  - Implemented container communication protocols
  - Added environment configuration guides

- **Managed Browser Integration**
  - Added support for user-controlled browser instances
  - Implemented `ManagedBrowser` class for better browser lifecycle management
  - Added ability to connect to existing Chrome DevTools Protocol (CDP) endpoints
  - Introduced user data directory support for persistent browser profiles
- **Enhanced HTML Processing**
  - Added HTML tag preservation feature during markdown conversion
  - Introduced configurable tag preservation system
  - Improved pre-tag and code block handling
  - Added support for nested preserved tags with attribute retention
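Tag preservation during HTML-to-markdown conversion amounts to: emit plain text normally, but copy whitelisted elements (and their attributes) through verbatim. A simplified stdlib sketch of the technique, not the library's actual converter:

```python
from html.parser import HTMLParser

class PreservingTextExtractor(HTMLParser):
    """Extract plain text, passing whitelisted tags through verbatim."""

    def __init__(self, preserve=("pre", "code")):
        super().__init__(convert_charrefs=True)
        self.preserve = set(preserve)
        self.depth = 0   # > 0 while inside a preserved element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth or tag in self.preserve:
            # Attribute retention: re-emit attributes as-is.
            attr_str = "".join(
                f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs
            )
            self.out.append(f"<{tag}{attr_str}>")
        if tag in self.preserve:
            self.depth += 1   # supports nested preserved tags

    def handle_endtag(self, tag):
        if tag in self.preserve:
            self.depth -= 1
        if self.depth or tag in self.preserve:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

p = PreservingTextExtractor()
p.feed('<div>Intro <pre class="hl">x = 1</pre> outro</div>')
print("".join(p.out))  # Intro <pre class="hl">x = 1</pre> outro
```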
### Improvements

- **Browser Handling**
  - Added flag to ignore body visibility for problematic pages
  - Improved browser process cleanup and management
  - Enhanced temporary directory handling for browser profiles
  - Added configurable browser launch arguments
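Temporary-profile handling typically reduces to create-on-launch, remove-on-close, even when the browser exits uncleanly. A minimal sketch of the pattern (illustrative; not the library's code):

```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_profile_dir(prefix: str = "browser-profile-"):
    """Yield a throwaway user-data directory, removed on exit even if
    the crawl raises mid-way."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

with temp_profile_dir() as profile:
    # e.g. launch the browser with --user-data-dir=<profile>
    print("profile at:", profile)
```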
- **Database Management**
  - Implemented connection pooling for better performance
  - Added retry logic for database operations
  - Improved error handling and logging
  - Enhanced cleanup procedures for database connections
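Retry logic for async database operations usually means wrapping each call in exponential backoff. A generic sketch of the pattern (assumed shape; the project's actual `async_database.py` may differ):

```python
import asyncio

async def with_retry(op, retries=3, base_delay=0.01):
    """Run an async operation, backing off exponentially between failures."""
    for attempt in range(retries):
        try:
            return await op()
        except Exception:
            if attempt == retries - 1:
                raise   # out of attempts: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))

# Demo: an operation that fails twice, then succeeds on the third try.
calls = {"n": 0}

async def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "row"

print(asyncio.run(with_retry(flaky_query)))  # row
```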
- **Resource Management**
  - Added memory and CPU monitoring
  - Implemented dynamic task slot allocation based on system resources
  - Added configurable cleanup intervals
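Dynamic slot allocation is essentially "concurrency = whichever of memory or CPU runs out first". A toy version of the idea (thresholds and parameter names invented for illustration):

```python
def task_slots(available_mem_mb: float, cpu_count: int,
               mem_per_task_mb: float = 512.0, tasks_per_cpu: int = 2) -> int:
    """Bound concurrent crawl tasks by both memory and CPU headroom."""
    by_mem = int(available_mem_mb // mem_per_task_mb)
    by_cpu = cpu_count * tasks_per_cpu
    return max(1, min(by_mem, by_cpu))

# 2 GB free, 4 cores -> memory is the binding constraint.
print(task_slots(available_mem_mb=2048, cpu_count=4))  # 4
```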
### Technical Improvements

- **Code Structure**
  - Moved version management to a dedicated `_version.py` file
  - Improved error handling throughout the codebase
  - Enhanced logging system with better error reporting
  - Reorganized core components for better maintainability

### Bug Fixes

- Fixed issues with browser process termination
- Improved handling of connection timeouts
- Enhanced error recovery in database operations
- Fixed memory leaks in long-running processes
### Dependencies

- Updated Playwright to v1.47
- Updated core dependencies with more flexible version constraints
- Added new development dependencies for testing

### Breaking Changes

- Changed default browser handling behavior
- Modified database connection management approach
- Updated API response structure for better consistency

## Migration Guide

When upgrading to v0.3.73, be aware of the following changes:
1. Docker Deployment:
   - Review Docker documentation for new deployment options
   - Update environment configurations as needed
   - Check container communication settings

2. If using custom browser management:
   - Update browser initialization code to use the new `ManagedBrowser` class
   - Review browser cleanup procedures

3. For database operations:
   - Check custom database queries for compatibility with new connection pooling
   - Update error handling to work with new retry logic

4. Using the Doctor:
   - Run the doctor command for system diagnostics: `crawl4ai doctor`
   - Review generated reports for potential issues
   - Follow recommended fixes for any identified problems
## [2024-11-04 - 13:21:42] Comprehensive Update of Crawl4AI Features and Dependencies

This commit introduces several key enhancements, including improved error handling and robust database operations in `async_database.py`, which now features a connection pool and retry logic for better reliability. Updates to the README.md provide clearer instructions and a better user experience, with links to documentation sections. The `.gitignore` file has been refined to include additional directories, and the async web crawler now uses a managed browser for more efficient crawling. Multiple dependency updates and the introduction of the `CustomHTML2Text` class enhance text extraction capabilities.
**Dockerfile** (new file, 121 lines)

```dockerfile
# syntax=docker/dockerfile:1.4

# Build arguments
ARG PYTHON_VERSION=3.10

# Base stage with system dependencies
FROM python:${PYTHON_VERSION}-slim as base

# Declare ARG variables again within the build stage
ARG INSTALL_TYPE=all
ARG ENABLE_GPU=false

# Platform-specific labels
LABEL maintainer="unclecode"
LABEL description="Crawl4AI - Advanced Web Crawler with AI capabilities"
LABEL version="1.0"

# Environment setup
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_DEFAULT_TIMEOUT=100 \
    DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    wget \
    gnupg \
    git \
    cmake \
    pkg-config \
    python3-dev \
    libjpeg-dev \
    libpng-dev \
    && rm -rf /var/lib/apt/lists/*

# Playwright system dependencies for Linux
RUN apt-get update && apt-get install -y --no-install-recommends \
    libglib2.0-0 \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxcb1 \
    libxkbcommon0 \
    libx11-6 \
    libxcomposite1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    libatspi2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# GPU support if enabled
RUN if [ "$ENABLE_GPU" = "true" ] ; then \
    apt-get update && apt-get install -y --no-install-recommends \
    nvidia-cuda-toolkit \
    && rm -rf /var/lib/apt/lists/* ; \
    fi

# Create and set working directory
WORKDIR /app

# Copy the entire project
COPY . .

# Install base requirements
RUN pip install --no-cache-dir -r requirements.txt

# Install required libraries for the FastAPI server
RUN pip install --no-cache-dir fastapi uvicorn psutil

# Install ML dependencies first for better layer caching
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
    pip install --no-cache-dir \
    torch \
    torchvision \
    torchaudio \
    scikit-learn \
    nltk \
    transformers \
    tokenizers && \
    python -m nltk.downloader punkt stopwords ; \
    fi

# Install the package
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
    pip install -e ".[all]" && \
    python -m crawl4ai.model_loader ; \
    elif [ "$INSTALL_TYPE" = "torch" ] ; then \
    pip install -e ".[torch]" ; \
    elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
    pip install -e ".[transformer]" && \
    python -m crawl4ai.model_loader ; \
    else \
    pip install -e "." ; \
    fi

# Install Playwright and browsers
RUN playwright install

# Health check (the server listens on 11235, as set in CMD below)
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:11235/health || exit 1

# Expose port
EXPOSE 11235

# Start the FastAPI server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]
```
**README.md** (49 changed lines)

### Using Docker 🐳

Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.

#### Option 1: Docker Hub (Recommended)

```bash
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic  # Basic crawling features
docker pull unclecode/crawl4ai:all    # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu    # GPU-enabled version

# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic  # Replace 'basic' with your chosen version
```

#### Option 2: Build from Repository

```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build the image (INSTALL_TYPE options: basic, all)
docker build -t crawl4ai:local --build-arg INSTALL_TYPE=basic .

# Run your local build
docker run -p 11235:11235 crawl4ai:local
```

Quick test (works for both options):

```python
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).

For more detailed installation instructions and options, please refer to our [Installation Guide](https://crawl4ai.com/mkdocs/basic/installation/).

## Quick Start 🚀
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.

For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.

</details>

## Speed Comparison 🚀
**crawl4ai/async_webcrawler.py** (excerpts from `class AsyncWebCrawler`)

```python
        ]
        return await asyncio.gather(*tasks)

    async def aprocess_html(
        self,
        url: str,
```

```python
    async def aget_cache_size(self):
        return await async_db_manager.aget_total_count()
```
Deleted bundled model files for `sentence-transformers/all-MiniLM-L6-v2`:

`config.json`:

```json
{
  "_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```

A binary model file was also removed (not shown).

`special_tokens_map.json`:

```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```

A large tokenizer file was also removed (diff suppressed).

`tokenizer_config.json`:

```json
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
```

Another large file's diff was suppressed.
**docs/md_v2/api/parameters.md** (new file, 35 lines)

# Parameter Reference Table

| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|-----------|----------------|------------|----------------|-------------|
| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
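Every option above flows through `**kwargs` with a default, as the Code Usage column shows. A toy class illustrating that pattern (invented name; not the real strategy class):

```python
class MiniStrategy:
    """Mimics the kwargs.get(...) configuration style used in the table."""

    def __init__(self, **kwargs):
        self.headless = kwargs.get("headless", True)
        self.browser_type = kwargs.get("browser_type", "chromium")
        self.page_timeout = kwargs.get("page_timeout", 60000)
        self.semaphore_count = kwargs.get("semaphore_count", 5)

default = MiniStrategy()
custom = MiniStrategy(browser_type="firefox", page_timeout=30000)
print(default.browser_type, custom.browser_type)  # chromium firefox
```

Unknown keys are silently ignored with this style, so option typos don't raise; the table is the authoritative list of recognized names.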
**docs/md_v2/basic/docker-deploymeny.md** (new file, 459 lines)

# Docker Deployment

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

## Quick Start 🚀

Pull and run the basic version:

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

Test the deployment:

```python
import requests

# Test health endpoint
health = requests.get("http://localhost:11235/health")
print("Health check:", health.json())

# Test basic crawl
response = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]
print("Task ID:", task_id)
```
## Available Images 🏷️

- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
- `unclecode/crawl4ai:all` - Full installation with all features
- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features

## Configuration Options 🔧

### Environment Variables

```bash
docker run -p 11235:11235 \
    -e MAX_CONCURRENT_TASKS=5 \
    -e OPENAI_API_KEY=your_key \
    unclecode/crawl4ai:all
```

### Volume Mounting

Mount a directory for persistent data:

```bash
docker run -p 11235:11235 \
    -v $(pwd)/data:/app/data \
    unclecode/crawl4ai:all
```

### Resource Limits

Control container resources:

```bash
docker run -p 11235:11235 \
    --memory=4g \
    --cpus=2 \
    unclecode/crawl4ai:all
```
## Usage Examples 📝

### Basic Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

### Structured Data Extraction

```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```

### Dynamic Content Handling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```

### AI-Powered Extraction (Full Version)

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```
## Platform-Specific Instructions 💻

### macOS

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

### Ubuntu

```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```

### Windows (PowerShell)

```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
## Testing 🧪

Save this as `test_docker.py`:

```python
import requests
import time

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timed out")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()
```
## Advanced Configuration ⚙️
|
||||||
|
|
||||||
|
### Crawler Parameters
|
||||||
|
|
||||||
|
The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:
|
||||||
|
|
||||||
|
```python
|
||||||
|
request = {
|
||||||
|
"urls": "https://example.com",
|
||||||
|
"crawler_params": {
|
||||||
|
# Browser Configuration
|
||||||
|
"headless": True, # Run in headless mode
|
||||||
|
"browser_type": "chromium", # chromium/firefox/webkit
|
||||||
|
"user_agent": "custom-agent", # Custom user agent
|
||||||
|
"proxy": "http://proxy:8080", # Proxy configuration
|
||||||
|
|
||||||
|
# Performance & Behavior
|
||||||
|
"page_timeout": 30000, # Page load timeout (ms)
|
||||||
|
"verbose": True, # Enable detailed logging
|
||||||
|
"semaphore_count": 5, # Concurrent request limit
|
||||||
|
|
||||||
|
# Anti-Detection Features
|
||||||
|
"simulate_user": True, # Simulate human behavior
|
||||||
|
"magic": True, # Advanced anti-detection
|
||||||
|
"override_navigator": True, # Override navigator properties
|
||||||
|
|
||||||
|
# Session Management
|
||||||
|
"user_data_dir": "./browser-data", # Browser profile location
|
||||||
|
"use_managed_browser": True, # Use persistent browser
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Extra Parameters
|
||||||
|
|
||||||
|
The `extra` field allows passing additional parameters directly to the crawler's `arun` function:
|
||||||
|
|
||||||
|
```python
|
||||||
|
request = {
|
||||||
|
"urls": "https://example.com",
|
||||||
|
"extra": {
|
||||||
|
"word_count_threshold": 10, # Min words per block
|
||||||
|
"only_text": True, # Extract only text
|
||||||
|
"bypass_cache": True, # Force fresh crawl
|
||||||
|
"process_iframes": True, # Include iframe content
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Complete Examples

1. **Advanced News Crawling**

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True  # Remove popups
    },
    "extra": {
        "word_count_threshold": 50,  # Longer content blocks
        "bypass_cache": True         # Fresh content
    },
    "css_selector": ".article-body"
}
```

2. **Anti-Detection Configuration**

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,
        "magic": True,
        "override_navigator": True,
        "user_agent": "Mozilla/5.0 ...",
        "headers": {
            "Accept-Language": "en-US,en;q=0.9"
        }
    }
}
```

3. **LLM Extraction with Custom Parameters**

```python
request = {
    "urls": "https://openai.com/pricing",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "schema": pricing_schema
        }
    },
    "crawler_params": {
        "verbose": True,
        "page_timeout": 60000
    },
    "extra": {
        "word_count_threshold": 1,
        "only_text": True
    }
}
```

4. **Session-Based Dynamic Content**

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}
```

5. **Screenshot with Custom Timing**

```python
request = {
    "urls": "https://example.com",
    "screenshot": True,
    "crawler_params": {
        "headless": True,
        "screenshot_wait_for": ".main-content"
    },
    "extra": {
        "delay_before_return_html": 3.0
    }
}
```
### Parameter Reference Table

| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |
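In the examples above, browser, network, timing, session, and debug options travel under `crawler_params`, while content options (and per-call timing such as `delay_before_return_html`) travel under `extra`. A small routing helper can keep that split consistent; `build_request` is a hypothetical client-side convenience, not part of the library:

```python
# Hypothetical helper: routes flat options into the two request sections,
# following how the complete examples above place each parameter.
CRAWLER_PARAM_KEYS = {
    "headless", "browser_type", "user_agent", "proxy", "headers",
    "page_timeout", "simulate_user", "magic", "session_id",
    "user_data_dir", "verbose", "log_console",
}
EXTRA_KEYS = {
    "word_count_threshold", "only_text", "process_iframes",
    "delay_before_return_html",
}

def build_request(urls, **options):
    request = {"urls": urls, "crawler_params": {}, "extra": {}}
    for key, value in options.items():
        if key in CRAWLER_PARAM_KEYS:
            request["crawler_params"][key] = value
        elif key in EXTRA_KEYS:
            request["extra"][key] = value
        else:
            raise ValueError(f"Unknown parameter: {key}")
    return request

req = build_request("https://example.com", headless=True, only_text=True)
print(req["crawler_params"])  # {'headless': True}
print(req["extra"])           # {'only_text': True}
```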
## Troubleshooting 🔍

### Common Issues

1. **Connection Refused**
```
Error: Connection refused at localhost:11235
```
Solution: Ensure the container is running and ports are properly mapped.

2. **Resource Limits**
```
Error: No available slots
```
Solution: Increase MAX_CONCURRENT_TASKS or container resources.
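For example, both the concurrency limit and the container resources can be raised at startup. The flags below are standard Docker options; the tag and values are illustrative only:

```bash
# Illustrative: raise the task slot limit and give the container more resources
docker run -d -p 11235:11235 \
  -e MAX_CONCURRENT_TASKS=10 \
  --memory=4g --cpus=2 \
  unclecode/crawl4ai:basic
```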
3. **GPU Access**
```
Error: GPU not found
```
Solution: Ensure the proper NVIDIA drivers are installed and run with the `--gpus all` flag.

### Debug Mode

Access the container for debugging:
```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```

View container logs:
```bash
docker logs [container_id]
```
## Best Practices 🌟

1. **Resource Management**
   - Set appropriate memory and CPU limits
   - Monitor resource usage via the health endpoint
   - Use the basic version for simple crawling tasks

2. **Scaling**
   - Use multiple containers for high load
   - Implement proper load balancing
   - Monitor performance metrics

3. **Security**
   - Use environment variables for sensitive data
   - Implement proper network isolation
   - Apply security updates regularly
## API Reference 📚

### Health Check

```http
GET /health
```

### Submit Crawl Task

```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```
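A client can sanity-check a payload against this shape before submitting it. `validate_crawl_request` below is a hypothetical client-side helper (not part of the API); the field names and ranges mirror the reference above:

```python
VALID_EXTRACTION_TYPES = {"basic", "llm", "cosine", "json_css"}

def validate_crawl_request(payload):
    # Hypothetical client-side check mirroring the POST /crawl schema above.
    if not payload.get("urls"):
        raise ValueError("urls is required (string or array)")
    priority = payload.get("priority", 5)
    if not 1 <= priority <= 10:
        raise ValueError("priority must be between 1 and 10")
    config = payload.get("extraction_config")
    if config and config.get("type") not in VALID_EXTRACTION_TYPES:
        raise ValueError(f"unknown extraction type: {config.get('type')}")
    return payload

payload = validate_crawl_request({
    "urls": "https://example.com",
    "extraction_config": {"type": "basic", "params": {}},
    "priority": 1,
    "ttl": 3600,
})
print(payload["priority"])  # 1
```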
### Get Task Status

```http
GET /task/{task_id}
```

For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).

```diff
@@ -72,7 +72,7 @@ Our documentation is organized into several sections:
 ### Advanced Features
 - [Magic Mode](advanced/magic-mode.md)
 - [Session Management](advanced/session-management.md)
-- [Hooks & Authentication](advanced/hooks.md)
+- [Hooks & Authentication](advanced/hooks-auth.md)
 - [Proxy & Security](advanced/proxy-security.md)
 - [Content Processing](advanced/content-processing.md)
```
**main.py** (4 changed lines)

```diff
@@ -269,6 +269,7 @@ class CrawlerService:
                     css_selector=request.css_selector,
                     screenshot=request.screenshot,
                     magic=request.magic,
+                    **request.extra,
                 )
             else:
                 results = await crawler.arun(
@@ -279,6 +280,7 @@ class CrawlerService:
                     css_selector=request.css_selector,
                     screenshot=request.screenshot,
                     magic=request.magic,
+                    **request.extra,
                 )

         await self.crawler_pool.release(crawler)
@@ -343,4 +345,4 @@ async def health_check():

 if __name__ == "__main__":
     import uvicorn
-    uvicorn.run(app, host="0.0.0.0", port=8000)
+    uvicorn.run(app, host="0.0.0.0", port=11235)
```
**mkdocs.yml** (2 added lines)

```diff
@@ -8,6 +8,7 @@ docs_dir: docs/md_v2
 nav:
   - Home: 'index.md'
   - 'Installation': 'basic/installation.md'
+  - 'Docker Deplotment': 'basic/docker-deploymeny.md'
   - 'Quick Start': 'basic/quickstart.md'

   - Basic:
@@ -34,6 +35,7 @@ nav:
   - 'Chunking': 'extraction/chunking.md'

   - API Reference:
+    - 'Parameters Table': 'api/parameters.md'
     - 'AsyncWebCrawler': 'api/async-webcrawler.md'
     - 'AsyncWebCrawler.arun()': 'api/arun.md'
     - 'CrawlResult': 'api/crawl-result.md'
```
**setup.py** (8 changed lines)

```diff
@@ -31,9 +31,11 @@ with open("crawl4ai/_version.py") as f:

 # Define the requirements for different environments
 default_requirements = requirements
-torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
-transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
-cosine_similarity_requirements = ["torch", "transformers", "nltk", "spacy"]
+# torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
+# transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
+torch_requirements = ["torch", "nltk", "scikit-learn"]
+transformer_requirements = ["transformers", "tokenizers"]
+cosine_similarity_requirements = ["torch", "transformers", "nltk"]
 sync_requirements = ["selenium"]

 def install_playwright():
```
**tests/test_docker.py** (new file, 299 lines)

```python
import requests
import json
import time
import sys
import base64
import os
from typing import Dict, Any

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "failed":
                print("Task failed:", status.get("error"))
                raise Exception(f"Task failed: {status.get('error')}")

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_docker_deployment(version="basic"):
    tester = Crawl4AiTester()
    print(f"Testing Crawl4AI Docker {version} version")

    # Health check with timeout and retry
    max_retries = 5
    for i in range(max_retries):
        try:
            health = requests.get(f"{tester.base_url}/health", timeout=10)
            print("Health check:", health.json())
            break
        except requests.exceptions.RequestException:
            if i == max_retries - 1:
                print(f"Failed to connect after {max_retries} attempts")
                sys.exit(1)
            print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
            time.sleep(5)

    # Test cases based on version
    test_basic_crawl(tester)
    if version in ["full", "transformer"]:
        test_cosine_extraction(tester)

    # test_js_execution(tester)
    # test_css_selector(tester)
    # test_structured_extraction(tester)
    # test_llm_extraction(tester)
    # test_llm_with_ollama(tester)
    # test_screenshot(tester)

def test_basic_crawl(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]
    assert len(result["result"]["markdown"]) > 0

def test_js_execution(tester: Crawl4AiTester):
    print("\n=== Testing JS Execution ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "js_code": [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ],
        "wait_for": "article.tease-card:nth-child(10)",
        "crawler_params": {
            "headless": True
        }
    }

    result = tester.submit_and_wait(request)
    print(f"JS execution result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]

def test_css_selector(tester: Crawl4AiTester):
    print("\n=== Testing CSS Selector ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 7,
        "css_selector": ".wide-tease-item__description",
        "crawler_params": {
            "headless": True
        },
        "extra": {"word_count_threshold": 10}
    }

    result = tester.submit_and_wait(request)
    print(f"CSS selector result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]

def test_structured_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Structured Extraction ===")
    schema = {
        "name": "Coinbase Crypto Prices",
        "baseSelector": ".cds-tableRow-t45thuk",
        "fields": [
            {
                "name": "crypto",
                "selector": "td:nth-child(1) h2",
                "type": "text",
            },
            {
                "name": "symbol",
                "selector": "td:nth-child(1) p",
                "type": "text",
            },
            {
                "name": "price",
                "selector": "td:nth-child(2)",
                "type": "text",
            }
        ],
    }

    request = {
        "urls": "https://www.coinbase.com/explore",
        "priority": 9,
        "extraction_config": {
            "type": "json_css",
            "params": {
                "schema": schema
            }
        }
    }

    result = tester.submit_and_wait(request)
    extracted = json.loads(result["result"]["extracted_content"])
    print(f"Extracted {len(extracted)} items")
    print("Sample item:", json.dumps(extracted[0], indent=2))
    assert result["result"]["success"]
    assert len(extracted) > 0

def test_llm_extraction(tester: Crawl4AiTester):
    print("\n=== Testing LLM Extraction ===")
    schema = {
        "type": "object",
        "properties": {
            "model_name": {
                "type": "string",
                "description": "Name of the OpenAI model."
            },
            "input_fee": {
                "type": "string",
                "description": "Fee for input token for the OpenAI model."
            },
            "output_fee": {
                "type": "string",
                "description": "Fee for output token for the OpenAI model."
            }
        },
        "required": ["model_name", "input_fee", "output_fee"]
    }

    request = {
        "urls": "https://openai.com/api/pricing",
        "priority": 8,
        "extraction_config": {
            "type": "llm",
            "params": {
                "provider": "openai/gpt-4o-mini",
                "api_token": os.getenv("OPENAI_API_KEY"),
                "schema": schema,
                "extraction_type": "schema",
                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
            }
        },
        "crawler_params": {"word_count_threshold": 1}
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print(f"Extracted {len(extracted)} model pricing entries")
        print("Sample entry:", json.dumps(extracted[0], indent=2))
        assert result["result"]["success"]
    except Exception as e:
        print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")

def test_llm_with_ollama(tester: Crawl4AiTester):
    print("\n=== Testing LLM with Ollama ===")
    schema = {
        "type": "object",
        "properties": {
            "article_title": {
                "type": "string",
                "description": "The main title of the news article"
            },
            "summary": {
                "type": "string",
                "description": "A brief summary of the article content"
            },
            "main_topics": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Main topics or themes discussed in the article"
            }
        }
    }

    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "extraction_config": {
            "type": "llm",
            "params": {
                "provider": "ollama/llama2",
                "schema": schema,
                "extraction_type": "schema",
                "instruction": "Extract the main article information including title, summary, and main topics."
            }
        },
        "extra": {"word_count_threshold": 1},
        "crawler_params": {"verbose": True}
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print("Extracted content:", json.dumps(extracted, indent=2))
        assert result["result"]["success"]
    except Exception as e:
        print(f"Ollama extraction test failed: {str(e)}")

def test_cosine_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Cosine Extraction ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 8,
        "extraction_config": {
            "type": "cosine",
            "params": {
                "semantic_filter": "business finance economy",
                "word_count_threshold": 10,
                "max_dist": 0.2,
                "top_k": 3
            }
        }
    }

    try:
        result = tester.submit_and_wait(request)
        extracted = json.loads(result["result"]["extracted_content"])
        print(f"Extracted {len(extracted)} text clusters")
        print("First cluster tags:", extracted[0]["tags"])
        assert result["result"]["success"]
    except Exception as e:
        print(f"Cosine extraction test failed: {str(e)}")

def test_screenshot(tester: Crawl4AiTester):
    print("\n=== Testing Screenshot ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 5,
        "screenshot": True,
        "crawler_params": {
            "headless": True
        }
    }

    result = tester.submit_and_wait(request)
    print("Screenshot captured:", bool(result["result"]["screenshot"]))

    if result["result"]["screenshot"]:
        # Save screenshot
        screenshot_data = base64.b64decode(result["result"]["screenshot"])
        with open("test_screenshot.jpg", "wb") as f:
            f.write(screenshot_data)
        print("Screenshot saved as test_screenshot.jpg")

    assert result["result"]["success"]

if __name__ == "__main__":
    version = sys.argv[1] if len(sys.argv) > 1 else "basic"
    # version = "full"
    test_docker_deployment(version)
```