# Deploying Crawl4ai with Modal: A Comprehensive Tutorial

Hey there! UncleCode here. I'm excited to show you how to deploy Crawl4ai using Modal - a fantastic serverless platform that makes deployment super simple and scalable.

In this tutorial, I'll walk you through deploying your own Crawl4ai instance on Modal's infrastructure. This will give you a powerful, scalable web crawling solution without having to worry about infrastructure management.

## What is Modal?

Modal is a serverless platform that allows you to run Python functions in the cloud without managing servers. It's perfect for deploying Crawl4ai because:

1. It handles all the infrastructure for you
2. It scales automatically based on demand
3. It makes deployment incredibly simple

## Prerequisites

Before we get started, you'll need:

- A Modal account (sign up at [modal.com](https://modal.com))
- Python 3.10 or later installed on your local machine
- Basic familiarity with Python and command-line operations

## Step 1: Setting Up Your Modal Account

First, sign up for a Modal account at [modal.com](https://modal.com) if you haven't already. Modal offers a generous free tier that's perfect for getting started.

After signing up, install the Modal CLI and authenticate:

```bash
pip install modal
modal token new
```

This will open a browser window where you can authenticate and generate a token for the CLI.

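Before moving on, you can sanity-check that authentication worked by listing your Modal apps:

```bash
modal app list
```

If this prints a table (even an empty one) rather than an authentication error, the token is set up correctly.
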
## Step 2: Creating Your Crawl4ai Deployment

Now, let's create a Python file called `crawl4ai_modal.py` with our deployment code:

```python
import modal
from typing import Optional, Dict, Any

# Create a custom image with Crawl4ai and its dependencies
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U crawl4ai",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)

# Define the app
app = modal.App("crawl4ai", image=image)

# Define default configurations
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}

DEFAULT_CRAWLER_CONFIG = {
    "type": "CrawlerRunConfig",
    "params": {
        "markdown_generator": {
            "type": "DefaultMarkdownGenerator",
            "params": {
                "content_filter": {
                    "type": "PruningContentFilter",
                    "params": {
                        "threshold": 0.48,
                        "threshold_type": "fixed"
                    }
                }
            }
        }
    }
}


@app.function(timeout=300)  # 5 minute timeout
async def crawl(
    url: str,
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Crawl a given URL using Crawl4ai.

    Args:
        url: The URL to crawl
        browser_config: Optional browser configuration to override defaults
        crawler_config: Optional crawler configuration to override defaults

    Returns:
        A dictionary containing the crawl results
    """
    from crawl4ai import (
        AsyncWebCrawler,
        BrowserConfig,
        CrawlerRunConfig,
        CrawlResult,
    )

    # Prepare browser config using the loader method
    if browser_config is None:
        browser_config = DEFAULT_BROWSER_CONFIG
    browser_config_obj = BrowserConfig.load(browser_config)

    # Prepare crawler config using the loader method
    if crawler_config is None:
        crawler_config = DEFAULT_CRAWLER_CONFIG
    crawler_config_obj = CrawlerRunConfig.load(crawler_config)

    # Perform the crawl
    async with AsyncWebCrawler(config=browser_config_obj) as crawler:
        result: CrawlResult = await crawler.arun(url=url, config=crawler_config_obj)

    # Return serializable results
    try:
        # Try newer Pydantic v2 method
        return result.model_dump()
    except AttributeError:
        try:
            # Try older Pydantic v1 method
            return result.dict()
        except AttributeError:
            # Fallback to manual conversion; getattr keeps missing
            # attributes from raising another AttributeError
            return {
                "url": result.url,
                "success": getattr(result, "success", None),
                "status_code": getattr(result, "status_code", None),
                "html": getattr(result, "html", None),
                "links": getattr(result, "links", {}),
                "markdown": str(getattr(result, "markdown", "") or "") or None,
            }


@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Web endpoint that accepts POST requests with JSON data containing:
    - url: The URL to crawl
    - browser_config: Optional browser configuration
    - crawler_config: Optional crawler configuration

    Returns the crawl results.
    """
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}

    browser_config = data.get("browser_config")
    crawler_config = data.get("crawler_config")

    return crawl.remote(url, browser_config, crawler_config)


@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    """
    Command line entrypoint for local testing.
    """
    result = crawl.remote(url)
    print(result)
```

## Step 3: Understanding the Code Components

Let's break down what's happening in this code:

### 1. Image Definition

```python
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U crawl4ai",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)
```

This section defines the container image that Modal will use to run your code. It:

- Starts with a Debian Slim base image with Python 3.10
- Installs FastAPI
- Updates the system packages
- Installs Git and other dependencies
- Installs the latest Crawl4ai release from PyPI (you can swap this for a Git branch; see "Changing the Crawl4ai Version" below)
- Runs the Crawl4ai setup to install Playwright and download Chromium

### 2. Modal App Definition

```python
app = modal.App("crawl4ai", image=image)
```

This creates a Modal application named "crawl4ai" that uses the image we defined above.

### 3. Default Configurations

```python
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}

DEFAULT_CRAWLER_CONFIG = {
    "type": "CrawlerRunConfig",
    "params": {
        "markdown_generator": {
            "type": "DefaultMarkdownGenerator",
            "params": {
                "content_filter": {
                    "type": "PruningContentFilter",
                    "params": {
                        "threshold": 0.48,
                        "threshold_type": "fixed"
                    }
                }
            }
        }
    }
}
```

These define the default configurations for the browser and crawler. You can customize these settings based on your specific needs, as in the sketch below.
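
For example, a browser override might enable verbose logging and set a custom user agent and viewport. This is a minimal sketch; the `user_agent`, `viewport_width`, and `viewport_height` keys assume the corresponding parameters of Crawl4ai's `BrowserConfig`:

```python
# A customized browser config; these keys mirror BrowserConfig
# parameters, so they load the same way as the defaults above.
CUSTOM_BROWSER_CONFIG = {
    "headless": True,                # keep headless in Modal's containers
    "verbose": True,                 # log browser activity while debugging
    "user_agent": "my-crawler/1.0",  # hypothetical user agent string
    "viewport_width": 1920,
    "viewport_height": 1080,
}
```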

### 4. The Crawl Function

```python
@app.function(timeout=300)
async def crawl(url, browser_config, crawler_config):
    # Function implementation
```

This is the main function that performs the crawling. It:

- Takes a URL and optional configurations
- Sets up the browser and crawler with those configurations
- Performs the crawl
- Returns the results in a serializable format

The `@app.function(timeout=300)` decorator tells Modal to run this function in the cloud with a 5-minute timeout.

### 5. The Web Endpoint

```python
@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    # Function implementation
```

This creates a web endpoint that accepts POST requests. It:

- Extracts the URL and configurations from the request
- Calls the crawl function with those parameters
- Returns the results

### 6. Local Entrypoint

```python
@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    # Function implementation
```

This provides a way to test the application from the command line.

## Step 4: Testing Locally

Before deploying, let's test our application locally:

```bash
modal run crawl4ai_modal.py --url "https://example.com"
```

This command will:

1. Upload your code to Modal
2. Create the necessary containers
3. Run the `main` function with the specified URL
4. Return the results

Modal will handle all the infrastructure setup for you. You should see the crawling results printed to your console.

## Step 5: Deploying Your Application

Once you're satisfied with the local testing, it's time to deploy:

```bash
modal deploy crawl4ai_modal.py
```

This will deploy your application to Modal's cloud. The deployment process will output URLs for your web endpoints.

You should see output similar to:

```
✓ Deployed crawl4ai.
URLs:
crawl_endpoint => https://your-username--crawl-endpoint.modal.run
```

Save this URL - you'll need it to make requests to your deployment.

## Step 6: Using Your Deployment

Now that your application is deployed, you can use it by sending POST requests to the endpoint URL:

```bash
curl -X POST https://your-username--crawl-endpoint.modal.run \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Or in Python:

```python
import requests

response = requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={"url": "https://example.com"}
)

result = response.json()
print(result)
```

You can also customize the browser and crawler configurations:

```python
requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={
        "url": "https://example.com",
        "browser_config": {
            "headless": True,  # keep headless; Modal containers have no display
            "verbose": True
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "markdown_generator": {
                    "type": "DefaultMarkdownGenerator",
                    "params": {
                        "content_filter": {
                            "type": "PruningContentFilter",
                            "params": {
                                "threshold": 0.6,  # Adjusted threshold
                                "threshold_type": "fixed"
                            }
                        }
                    }
                }
            }
        }
    }
)
```

## Step 7: Calling Your Deployment from Another Python Script

You can also call your deployed function directly from another Python script:

```python
import modal

# Get a reference to the deployed function
crawl_function = modal.Function.from_name("crawl4ai", "crawl")

# Call the function
result = crawl_function.remote("https://example.com")
print(result)
```

## Understanding Modal's Execution Flow

To understand how Modal works, it's important to know:

1. **Local vs. Remote Execution**: When you call a function with `.remote()`, it runs in Modal's cloud, not on your local machine.

2. **Container Lifecycle**: Modal creates containers on-demand and destroys them when they're not needed.

3. **Caching**: Modal caches your container images to speed up subsequent runs.

4. **Serverless Scaling**: Modal automatically scales your application based on demand, as the sketch below illustrates.

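For example, from a local entrypoint you can fan a list of URLs out across containers with `.map()`, and Modal spins up as many containers as the workload needs. A minimal sketch; the URLs are placeholders:

```python
@app.local_entrypoint()
def demo():
    # A single call: runs in one cloud container, not locally
    print(crawl.remote("https://example.com"))

    # .map() makes one cloud call per URL; Modal scales the
    # number of containers up and down to match the workload
    urls = ["https://example.com", "https://www.modal.com"]
    for result in crawl.map(urls):
        print(result["url"])
```

You would run this entrypoint with `modal run crawl4ai_modal.py::demo`.
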
## Customizing Your Deployment

You can customize your deployment in several ways:

### Changing the Crawl4ai Version

To use a different version of Crawl4ai, update the installation command in the image definition:

```python
"pip install -U git+https://github.com/unclecode/crawl4ai.git@main",  # Use main branch
```

### Adjusting Resource Limits

You can change the resources allocated to your functions:

```python
@app.function(timeout=600, cpu=2, memory=4096)  # 10 minute timeout, 2 CPUs, 4GB RAM
async def crawl(...):
    # Function implementation
```

### Keeping Containers Warm

To reduce cold start times, you can keep containers warm:

```python
@app.function(keep_warm=1)  # Keep 1 container warm
async def crawl(...):
    # Function implementation
```

## Conclusion

That's it! You've successfully deployed Crawl4ai on Modal. You now have a scalable web crawling solution that can handle as many requests as you need without requiring any infrastructure management.

The beauty of this setup is its simplicity - Modal handles all the hard parts, letting you focus on using Crawl4ai to extract the data you need.

Feel free to reach out if you have any questions or need help with your deployment!

Happy crawling!

- UncleCode

## Additional Resources

- [Modal Documentation](https://modal.com/docs)
- [Crawl4ai GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Crawl4ai Documentation](https://docs.crawl4ai.com)