crawl4ai/deploy/modal/guide.md

Deploying Crawl4ai with Modal: A Comprehensive Tutorial

Hey there! UncleCode here. I'm excited to show you how to deploy Crawl4ai using Modal - a fantastic serverless platform that makes deployment super simple and scalable.

In this tutorial, I'll walk you through deploying your own Crawl4ai instance on Modal's infrastructure. This will give you a powerful, scalable web crawling solution without having to worry about infrastructure management.

What is Modal?

Modal is a serverless platform that allows you to run Python functions in the cloud without managing servers. It's perfect for deploying Crawl4ai because:

  1. It handles all the infrastructure for you
  2. It scales automatically based on demand
  3. It makes deployment incredibly simple

Prerequisites

Before we get started, you'll need:

  • A Modal account (sign up at modal.com)
  • Python 3.10 or later installed on your local machine
  • Basic familiarity with Python and command-line operations

Step 1: Setting Up Your Modal Account

First, sign up for a Modal account at modal.com if you haven't already. Modal offers a generous free tier that's perfect for getting started.

After signing up, install the Modal CLI and authenticate:

pip install modal
modal token new

This will open a browser window where you can authenticate and generate a token for the CLI.

Step 2: Creating Your Crawl4ai Deployment

Now, let's create a Python file called crawl4ai_modal.py with our deployment code:

import modal
from typing import Optional, Dict, Any

# Create a custom image with Crawl4ai and its dependencies
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U git+https://github.com/unclecode/crawl4ai.git@next",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)

# Define the app
app = modal.App("crawl4ai", image=image)

# Define default configurations
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}

DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}

@app.function(timeout=300)  # 5 minute timeout
async def crawl(
    url: str,
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Crawl a given URL using Crawl4ai.
    
    Args:
        url: The URL to crawl
        browser_config: Optional browser configuration to override defaults
        crawler_config: Optional crawler configuration to override defaults
        
    Returns:
        A dictionary containing the crawl results
    """
    from crawl4ai import (
        AsyncWebCrawler,
        BrowserConfig,
        CrawlerRunConfig,
        CrawlResult
    )

    # Prepare browser config using the loader method
    if browser_config is None:
        browser_config = DEFAULT_BROWSER_CONFIG
    browser_config_obj = BrowserConfig.load(browser_config)
    
    # Prepare crawler config using the loader method
    if crawler_config is None:
        crawler_config = DEFAULT_CRAWLER_CONFIG
    crawler_config_obj = CrawlerRunConfig.load(crawler_config)    
    
    # Perform the crawl
    async with AsyncWebCrawler(config=browser_config_obj) as crawler:
        result: CrawlResult = await crawler.arun(url=url, config=crawler_config_obj)
        
        # Return serializable results
        try:
            # Try newer Pydantic v2 method
            return result.model_dump()
        except AttributeError:
            try:
                # Try older Pydantic v1 method
                return result.dict()
            except AttributeError:
                # Fallback: manual conversion, guarding every attribute
                # since CrawlResult's fields vary across versions
                markdown_v2 = getattr(result, "markdown_v2", None)
                links = getattr(result, "links", None) or []
                return {
                    "url": getattr(result, "url", None),
                    "title": getattr(result, "title", None),
                    "status": getattr(result, "status", None),
                    "content": str(result.content) if hasattr(result, "content") else None,
                    "links": [
                        {"url": getattr(link, "url", None), "text": getattr(link, "text", None)}
                        for link in links
                    ],
                    "markdown_v2": {
                        "raw_markdown": getattr(markdown_v2, "raw_markdown", None)
                    },
                }

@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Web endpoint that accepts POST requests with JSON data containing:
    - url: The URL to crawl
    - browser_config: Optional browser configuration
    - crawler_config: Optional crawler configuration
    
    Returns the crawl results.
    """
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}
    
    browser_config = data.get("browser_config")
    crawler_config = data.get("crawler_config")
    
    return crawl.remote(url, browser_config, crawler_config)

@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    """
    Command line entrypoint for local testing.
    """
    result = crawl.remote(url)
    print(result)

Step 3: Understanding the Code Components

Let's break down what's happening in this code:

1. Image Definition

image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U git+https://github.com/unclecode/crawl4ai.git@next",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)

This section defines the container image that Modal will use to run your code. It:

  • Starts with a Debian Slim base image with Python 3.10
  • Installs FastAPI
  • Updates the system packages
  • Installs Git and other dependencies
  • Installs Crawl4ai from the GitHub repository
  • Runs the Crawl4ai setup to install Playwright and download Chromium

2. Modal App Definition

app = modal.App("crawl4ai", image=image)

This creates a Modal application named "crawl4ai" that uses the image we defined above.

3. Default Configurations

DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}

DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}

These define the default configurations for the browser and crawler. You can customize these settings based on your specific needs.
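Because the crawler config is a deeply nested dictionary, callers who only want to tweak one leaf (say, the pruning threshold) shouldn't have to restate the whole tree. Here's a minimal sketch of a deep-merge helper — `merge_config` is a hypothetical utility, not part of Crawl4ai or Modal — that layers a partial override on top of the defaults:

```python
import copy

# Hypothetical helper (not part of Crawl4ai or Modal): deep-merge a partial
# override into the nested default config, so callers only spell out the
# leaves they want to change.
def merge_config(defaults: dict, overrides: dict) -> dict:
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {"threshold": 0.48, "threshold_type": "fixed"},
                    }
                },
            }
        },
    }
}

# Raise the pruning threshold to 0.6 without restating the whole tree
override = {
    "crawler_config": {
        "params": {
            "markdown_generator": {
                "params": {"content_filter": {"params": {"threshold": 0.6}}}
            }
        }
    }
}
merged = merge_config(DEFAULT_CRAWLER_CONFIG, override)
```

Since `merge_config` deep-copies the defaults first, the module-level `DEFAULT_CRAWLER_CONFIG` is never mutated between requests.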

4. The Crawl Function

@app.function(timeout=300)
async def crawl(url, browser_config, crawler_config):
    # Function implementation

This is the main function that performs the crawling. It:

  • Takes a URL and optional configurations
  • Sets up the browser and crawler with those configurations
  • Performs the crawl
  • Returns the results in a serializable format

The @app.function(timeout=300) decorator tells Modal to run this function in the cloud with a 5-minute timeout.

5. The Web Endpoint

@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    # Function implementation

This creates a web endpoint that accepts POST requests. It:

  • Extracts the URL and configurations from the request
  • Calls the crawl function with those parameters
  • Returns the results
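The endpoint's request handling is easy to exercise without a deployment if you factor it into a plain function. In this sketch, `run_crawl` is a hypothetical stand-in for `crawl.remote`, injected so the validation logic can be tested offline:

```python
from typing import Any, Callable, Dict

def handle_request(
    data: Dict[str, Any],
    run_crawl: Callable[..., Dict[str, Any]],
) -> Dict[str, Any]:
    # Mirrors crawl_endpoint: validate the payload, then delegate.
    # `run_crawl` stands in for crawl.remote so this runs anywhere.
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}
    return run_crawl(url, data.get("browser_config"), data.get("crawler_config"))
```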

6. Local Entrypoint

@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    # Function implementation

This provides a way to test the application from the command line.

Step 4: Testing Locally

Before deploying, let's test our application locally:

modal run crawl4ai_modal.py --url "https://example.com"

This command will:

  1. Upload your code to Modal
  2. Create the necessary containers
  3. Run the main function with the specified URL
  4. Return the results

Modal will handle all the infrastructure setup for you. You should see the crawling results printed to your console.

Step 5: Deploying Your Application

Once you're satisfied with the local testing, it's time to deploy:

modal deploy crawl4ai_modal.py

This will deploy your application to Modal's cloud. The deployment process will output URLs for your web endpoints.

You should see output similar to:

✓ Deployed crawl4ai.
  URLs:
    crawl_endpoint => https://your-username--crawl-endpoint.modal.run

Save this URL - you'll need it to make requests to your deployment.

Step 6: Using Your Deployment

Now that your application is deployed, you can use it by sending POST requests to the endpoint URL:

curl -X POST https://your-username--crawl-endpoint.modal.run \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Or in Python:

import requests

response = requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={"url": "https://example.com"}
)

result = response.json()
print(result)

You can also customize the browser and crawler configurations:

requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={
        "url": "https://example.com",
        "browser_config": {
            "headless": False,
            "verbose": True
        },
        "crawler_config": {
            "crawler_config": {
                "type": "CrawlerRunConfig",
                "params": {
                    "markdown_generator": {
                        "type": "DefaultMarkdownGenerator",
                        "params": {
                            "content_filter": {
                                "type": "PruningContentFilter",
                                "params": {
                                    "threshold": 0.6,  # Adjusted threshold
                                    "threshold_type": "fixed"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
)
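The JSON you get back mirrors the fields the crawl function serialized. Since `markdown_v2` may be missing or null depending on the crawler configuration, it's worth pulling the filtered markdown out defensively — a small sketch (field names assumed to match the crawl function's output):

```python
def extract_markdown(result: dict):
    # markdown_v2 may be absent or null depending on the crawler
    # configuration, so guard each level of the lookup.
    markdown = result.get("markdown_v2") or {}
    if isinstance(markdown, dict):
        return markdown.get("raw_markdown")
    return None

# Example response shape, matching the crawl function's serialization
sample = {"url": "https://example.com", "markdown_v2": {"raw_markdown": "# Example"}}
```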

Step 7: Calling Your Deployment from Another Python Script

You can also call your deployed function directly from another Python script:

import modal

# Get a reference to the deployed function
crawl_function = modal.Function.from_name("crawl4ai", "crawl")

# Call the function
result = crawl_function.remote("https://example.com")
print(result)

Understanding Modal's Execution Flow

To understand how Modal works, it's important to know:

  1. Local vs. Remote Execution: When you call a function with .remote(), it runs in Modal's cloud, not on your local machine.

  2. Container Lifecycle: Modal creates containers on-demand and destroys them when they're not needed.

  3. Caching: Modal caches your container images to speed up subsequent runs.

  4. Serverless Scaling: Modal automatically scales your application based on demand.

Customizing Your Deployment

You can customize your deployment in several ways:

Changing the Crawl4ai Version

To use a different version of Crawl4ai, update the installation command in the image definition:

"pip install -U git+https://github.com/unclecode/crawl4ai.git@main",  # Use main branch

Adjusting Resource Limits

You can change the resources allocated to your functions:

@app.function(timeout=600, cpu=2, memory=4096)  # 10 minute timeout, 2 CPUs, 4GB RAM
async def crawl(...):
    # Function implementation

Keeping Containers Warm

To reduce cold start times, you can keep containers warm:

@app.function(keep_warm=1)  # Keep 1 container warm
async def crawl(...):
    # Function implementation

Conclusion

That's it! You've successfully deployed Crawl4ai on Modal. You now have a scalable web crawling solution that can handle as many requests as you need without requiring any infrastructure management.

The beauty of this setup is its simplicity - Modal handles all the hard parts, letting you focus on using Crawl4ai to extract the data you need.

Feel free to reach out if you have any questions or need help with your deployment!

Happy crawling!

  • UncleCode

Additional Resources