crawl4ai/docs/migration/v0.8.0-upgrade-guide.md

# Migration Guide: Upgrading to Crawl4AI v0.8.0

This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates.

## Quick Summary

| Change | Impact | Action Required |
|--------|--------|-----------------|
| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` |
| file:// URLs blocked | Docker API users reading local files | Use Python library directly |
| Security fixes | All Docker API users | Update immediately |

---

## Step 1: Update the Package

### PyPI Installation

```bash
pip install --upgrade crawl4ai
```

### Docker Installation

```bash
docker pull unclecode/crawl4ai:latest
# or
docker pull unclecode/crawl4ai:0.8.0
```

### From Source

```bash
git pull origin main
pip install -e .
```

---

## Step 2: Check for Breaking Changes

### Are You Affected?

**You ARE affected if you:**
- Use the Docker API deployment
- Use the `hooks` parameter in `/crawl` requests
- Use `file://` URLs via API endpoints

**You are NOT affected if you:**
- Only use Crawl4AI as a Python library
- Don't use hooks in your API calls
- Don't use `file://` URLs via the API

---

## Step 3: Migrate Hooks Usage

### Before v0.8.0

Hooks worked by default:

```bash
# This worked without any configuration
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "hooks": {
      "code": {
        "on_page_context_created": "async def hook(page, context, **kwargs):\n    await context.add_cookies([...])\n    return page"
      }
    }
  }'
```

### After v0.8.0

You must explicitly enable hooks:

**Option A: Environment Variable (Recommended)**
```bash
# In your Docker run command or docker-compose.yml
export CRAWL4AI_HOOKS_ENABLED=true
```

```yaml
# docker-compose.yml
services:
  crawl4ai:
    image: unclecode/crawl4ai:0.8.0
    environment:
      - CRAWL4AI_HOOKS_ENABLED=true
```

**Option B: For Kubernetes**
```yaml
env:
  - name: CRAWL4AI_HOOKS_ENABLED
    value: "true"
```

### Security Warning

Only enable hooks if:
- You trust all users who can access the API
- The API is not exposed to the public internet
- You have other authentication/authorization in place

---

## Step 4: Migrate file:// URL Usage

### Before v0.8.0

```bash
# This worked via API
curl -X POST http://localhost:11235/execute_js \
  -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}'
```

### After v0.8.0

**Option A: Use the Python Library Directly**

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def process_local_file():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="file:///var/data/page.html",
            config=CrawlerRunConfig(js_code=["document.title"])
        )
        return result
```

**Option B: Use raw: Protocol for HTML Content**

If you have the HTML content, you can still use the API:

```bash
# Read file content and send as raw:
HTML_CONTENT=$(cat /var/data/page.html)
curl -X POST http://localhost:11235/html \
  -H "Content-Type: application/json" \
  -d "{\"url\": \"raw:$HTML_CONTENT\"}"
```

**Option C: Create a Preprocessing Service**

```python
# preprocessing_service.py
from fastapi import FastAPI
from crawl4ai import AsyncWebCrawler

app = FastAPI()

@app.post("/process-local")
async def process_local(file_path: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"file://{file_path}")
        return result.model_dump()
```

---

## Step 5: Review Security Configuration

### Recommended Production Settings

```yaml
# config.yml
security:
  enabled: true
  jwt_enabled: true
  https_redirect: true  # If behind HTTPS proxy
  trusted_hosts:
    - "your-domain.com"
    - "api.your-domain.com"
```

### Environment Variables

```bash
# Required for JWT authentication
export SECRET_KEY="your-secure-random-key-minimum-32-characters"

# Only if you need hooks
export CRAWL4AI_HOOKS_ENABLED=true
```

### Generate a Secure Secret Key

```python
import secrets
print(secrets.token_urlsafe(32))
```

---

## Step 6: Test Your Integration

### Quick Validation Script

```python
import asyncio
import aiohttp

async def test_upgrade():
    base_url = "http://localhost:11235"

    # Test 1: Basic crawl should work
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={"urls": ["https://example.com"]}
        ) as resp:
            assert resp.status == 200, "Basic crawl failed"
            print("✓ Basic crawl works")

    # Test 2: Hooks should be blocked (unless enabled)
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={
                "urls": ["https://example.com"],
                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
            }
        ) as resp:
            if resp.status == 403:
                print("✓ Hooks correctly blocked (default)")
            elif resp.status == 200:
                print("! Hooks enabled - ensure this is intentional")

    # Test 3: file:// should be blocked
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/execute_js",
            json={"url": "file:///etc/passwd", "scripts": ["1"]}
        ) as resp:
            assert resp.status == 400, "file:// should be blocked"
            print("✓ file:// URLs correctly blocked")

asyncio.run(test_upgrade())
```

---

## Troubleshooting

### "Hooks are disabled" Error

**Symptom**: API returns 403 with "Hooks are disabled"

**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks

### "URL must start with http://, https://" Error

**Symptom**: API returns 400 when using `file://` URLs

**Solution**: Use Python library directly or `raw:` protocol

### Authentication Errors After Enabling JWT

**Symptom**: API returns 401 Unauthorized

**Solution**:
1. Get a token: `POST /token` with your email
2. Include token in requests: `Authorization: Bearer <token>`

---

## Rollback Plan

If you need to rollback:

```bash
# PyPI
pip install crawl4ai==0.7.6

# Docker
docker pull unclecode/crawl4ai:0.7.6
```

**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.

---

## Getting Help

- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)

---

## Changelog Reference

For complete list of changes, see:
- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
- [CHANGELOG.md](../../CHANGELOG.md)