UncleCode 3ea3c0520d Add all 5 deployments solution for testing

2025-03-10 18:57:14 +08:00

Deploying Crawl4ai on AWS Lambda

This guide walks you through deploying Crawl4ai as an AWS Lambda function with API Gateway integration. You'll learn how to set up, test, and clean up your deployment.

Prerequisites

Before you begin, ensure you have:

AWS CLI installed and configured (aws configure)
Docker installed and running
Python 3.8+ installed
Basic familiarity with AWS services

Project Files

Your project directory should contain:

Dockerfile: Container configuration for Lambda
lambda_function.py: Lambda handler code
deploy.py: Our deployment script

Step 1: Install Required Python Packages

Install the Python packages needed for our deployment script:

pip install typer rich

Step 2: Run the Deployment Script

Our Python script automates the entire deployment process:

python deploy.py

The script will guide you through:

Configuration setup (AWS region, function name, memory allocation)
Docker image building
ECR repository creation
Lambda function deployment
API Gateway setup
Provisioned concurrency configuration (optional)

Follow the prompts and confirm each step by pressing Enter.

Step 3: Manual Deployment (Alternative to the Script)

If you prefer to deploy manually, or want to understand what the script does, follow these steps:

Building and Pushing the Docker Image

# Build the Docker image
docker build -t crawl4ai-lambda .

# Create an ECR repository (if it doesn't exist)
aws ecr create-repository --repository-name crawl4ai-lambda

# Get an ECR login password and log in (the region must match the registry host)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com

# Tag the image
ECR_URI=$(aws ecr describe-repositories --repository-names crawl4ai-lambda --query 'repositories[0].repositoryUri' --output text)
docker tag crawl4ai-lambda:latest $ECR_URI:latest

# Push the image to ECR
docker push $ECR_URI:latest
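
The registry host and image URI used in the commands above follow a fixed pattern built from the account ID, region, and repository name. The following is a minimal sketch of that pattern; the helper names are hypothetical, and `123456789012` is a placeholder account ID.

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Return the registry host that `docker login` authenticates against."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return the full image URI that `docker tag` and `docker push` use."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


if __name__ == "__main__":
    # Placeholder account ID; substitute the value from `aws sts get-caller-identity`.
    print(ecr_image_uri("123456789012", "us-east-1", "crawl4ai-lambda"))
```

If the login region and the region in the registry host differ, `docker push` is likely to fail with an authentication error, so it helps to derive both from one value.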

Creating the Lambda Function

# Get IAM role ARN (create it if needed)
ROLE_ARN=$(aws iam get-role --role-name lambda-execution-role --query 'Role.Arn' --output text)

# Create Lambda function
aws lambda create-function \
    --function-name crawl4ai-function \
    --package-type Image \
    --code ImageUri=$ECR_URI:latest \
    --role $ROLE_ARN \
    --timeout 300 \
    --memory-size 4096 \
    --ephemeral-storage Size=10240 \
    --environment "Variables={CRAWL4_AI_BASE_DIRECTORY=/tmp/.crawl4ai,HOME=/tmp,PLAYWRIGHT_BROWSERS_PATH=/function/pw-browsers}"
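
The `lambda-execution-role` referenced above must already exist and must trust the Lambda service. This guide does not show how the role is created, so the following is only a sketch of a typical trust policy, written as a Python dict that can be dumped to JSON. The function name and the choice of managed policy are assumptions, not something taken from `deploy.py`.

```python
import json

# AWS-managed policy that grants permission to write CloudWatch Logs.
# Attaching it is an assumption about what this role needs at minimum.
BASIC_EXECUTION_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
)


def lambda_trust_policy() -> dict:
    """Return a trust policy that lets the Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Write this output to trust.json for use with the AWS CLI.
    print(json.dumps(lambda_trust_policy(), indent=2))
```

The resulting JSON can be passed to `aws iam create-role --role-name lambda-execution-role --assume-role-policy-document file://trust.json`, followed by `aws iam attach-role-policy` with the policy ARN above.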

If you're updating an existing function:

# Update function code
aws lambda update-function-code \
    --function-name crawl4ai-function \
    --image-uri $ECR_URI:latest

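# Wait for the code update to finish; otherwise the next call can fail with ResourceConflictException
aws lambda wait function-updated --function-name crawl4ai-function

# Update function configuration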
aws lambda update-function-configuration \
    --function-name crawl4ai-function \
    --timeout 300 \
    --memory-size 4096 \
    --ephemeral-storage Size=10240 \
    --environment "Variables={CRAWL4_AI_BASE_DIRECTORY=/tmp/.crawl4ai,HOME=/tmp,PLAYWRIGHT_BROWSERS_PATH=/function/pw-browsers}"

Setting Up API Gateway

# Create API Gateway
API_ID=$(aws apigateway create-rest-api --name crawl4ai-api --query 'id' --output text)

# Get root resource ID
PARENT_ID=$(aws apigateway get-resources --rest-api-id $API_ID --query 'items[?path==`/`].id' --output text)

# Create resource
RESOURCE_ID=$(aws apigateway create-resource --rest-api-id $API_ID --parent-id $PARENT_ID --path-part "crawl" --query 'id' --output text)

# Create POST method
aws apigateway put-method --rest-api-id $API_ID --resource-id $RESOURCE_ID --http-method POST --authorization-type NONE

# Get Lambda function ARN
LAMBDA_ARN=$(aws lambda get-function --function-name crawl4ai-function --query 'Configuration.FunctionArn' --output text)

# Set Lambda integration
aws apigateway put-integration \
    --rest-api-id $API_ID \
    --resource-id $RESOURCE_ID \
    --http-method POST \
    --type AWS_PROXY \
    --integration-http-method POST \
    --uri arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/$LAMBDA_ARN/invocations

# Deploy API
aws apigateway create-deployment --rest-api-id $API_ID --stage-name prod

# Set Lambda permission
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws lambda add-permission \
    --function-name crawl4ai-function \
    --statement-id apigateway \
    --action lambda:InvokeFunction \
    --principal apigateway.amazonaws.com \
    --source-arn "arn:aws:execute-api:us-east-1:$ACCOUNT_ID:$API_ID/*/POST/crawl"
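
Two ARNs in the commands above are easy to mistype: the integration URI passed to `put-integration` and the source ARN passed to `add-permission`. As an illustration, both can be assembled from their parts. The helper names below are hypothetical, and the account and API IDs in the example are placeholders.

```python
def integration_uri(region: str, lambda_arn: str) -> str:
    """Build the URI that API Gateway uses to invoke a Lambda function."""
    return (
        f"arn:aws:apigateway:{region}:lambda:path/2015-03-31/"
        f"functions/{lambda_arn}/invocations"
    )


def source_arn(region: str, account_id: str, api_id: str,
               method: str = "POST", path: str = "/crawl", stage: str = "*") -> str:
    """Build the execute-api ARN that scopes the Lambda invoke permission."""
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method}{path}"


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    fn_arn = "arn:aws:lambda:us-east-1:123456789012:function:crawl4ai-function"
    print(integration_uri("us-east-1", fn_arn))
    print(source_arn("us-east-1", "123456789012", "abc123"))
```

The `*` in the stage position lets the permission apply to every stage, which matches the source ARN used in the command above.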

Setting Up Provisioned Concurrency (Optional)

Provisioned concurrency keeps a fixed number of execution environments initialized, which reduces cold starts:

# Publish a version
VERSION=$(aws lambda publish-version --function-name crawl4ai-function --query 'Version' --output text)

# Create alias
aws lambda create-alias \
    --function-name crawl4ai-function \
    --name prod \
    --function-version $VERSION

# Configure provisioned concurrency
aws lambda put-provisioned-concurrency-config \
    --function-name crawl4ai-function \
    --qualifier prod \
    --provisioned-concurrent-executions 2

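# Allow API Gateway to invoke the alias (the earlier permission covers only the unqualified function)
aws lambda add-permission \
    --function-name crawl4ai-function \
    --qualifier prod \
    --statement-id apigateway-prod \
    --action lambda:InvokeFunction \
    --principal apigateway.amazonaws.com \
    --source-arn "arn:aws:execute-api:us-east-1:$ACCOUNT_ID:$API_ID/*/POST/crawl"

# Update API Gateway to use alias
LAMBDA_ALIAS_ARN="arn:aws:lambda:us-east-1:$ACCOUNT_ID:function:crawl4ai-function:prod"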
aws apigateway put-integration \
    --rest-api-id $API_ID \
    --resource-id $RESOURCE_ID \
    --http-method POST \
    --type AWS_PROXY \
    --integration-http-method POST \
    --uri arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/$LAMBDA_ALIAS_ARN/invocations

# Redeploy API Gateway
aws apigateway create-deployment --rest-api-id $API_ID --stage-name prod
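
Provisioned concurrency is not available the moment the configuration call returns; the allocation starts in an in-progress state and becomes ready later. The sketch below polls for readiness. It assumes a boto3 Lambda client is passed in (boto3 is not otherwise used in this guide), and the function name `wait_for_provisioned_concurrency` is hypothetical.

```python
import time


def wait_for_provisioned_concurrency(client, function_name: str, qualifier: str,
                                     poll_seconds: float = 5.0, max_attempts: int = 60) -> dict:
    """Poll until the provisioned concurrency config reports READY.

    `client` is assumed to be a boto3 Lambda client, for example
    boto3.client("lambda", region_name="us-east-1").
    """
    for _ in range(max_attempts):
        response = client.get_provisioned_concurrency_config(
            FunctionName=function_name, Qualifier=qualifier
        )
        status = response["Status"]
        if status == "READY":
            return response
        if status == "FAILED":
            raise RuntimeError(response.get("StatusReason", "provisioned concurrency failed"))
        time.sleep(poll_seconds)  # still IN_PROGRESS
    raise TimeoutError(f"{function_name}:{qualifier} was not READY after {max_attempts} checks")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub before pointing it at a real account.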

Step 4: Testing the Deployment

Once deployed, test your function with:

ENDPOINT_URL="https://$API_ID.execute-api.us-east-1.amazonaws.com/prod/crawl"

# Test with curl
curl -X POST $ENDPOINT_URL \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

Or using Python:

import requests
import json

url = "https://your-api-id.execute-api.us-east-1.amazonaws.com/prod/crawl"
payload = {
    "url": "https://example.com",
    "browser_config": {
        "headless": True,
        "verbose": False
    },
    "crawler_config": {
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "markdown_generator": {
                    "type": "DefaultMarkdownGenerator",
                    "params": {
                        "content_filter": {
                            "type": "PruningContentFilter",
                            "params": {
                                "threshold": 0.48,
                                "threshold_type": "fixed"
                            }
                        }
                    }
                }
            }
        }
    }
}

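# Crawls can be slow, so set an explicit timeout and fail loudly on HTTP errors
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
result = response.json()
print(json.dumps(result, indent=2))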
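
When a test request does not return 200, the status code usually narrows down the cause. The mapping below is a rough sketch of common causes for an API Gateway REST API with a Lambda proxy integration; it is not exhaustive, and the names `STATUS_HINTS` and `explain_status` are hypothetical. Note in particular that the REST API integration timeout defaults to 29 seconds, which is much shorter than the 300-second Lambda timeout configured earlier.

```python
# Common causes per HTTP status code. Treat these as starting points, not diagnoses.
STATUS_HINTS = {
    403: "API Gateway rejected the request; 'Missing Authentication Token' "
         "usually means the path or HTTP method does not match a deployed route.",
    500: "Internal error; check that API Gateway has permission to invoke the "
         "function or alias it is integrated with.",
    502: "The function raised an error or returned a response that is not in "
         "Lambda proxy format.",
    504: "The request exceeded the API Gateway integration timeout "
         "(29 seconds by default), even if the Lambda timeout is longer.",
}


def explain_status(status_code: int) -> str:
    """Return a short hint for a status code returned by the endpoint."""
    if 200 <= status_code < 300:
        return "OK"
    return STATUS_HINTS.get(
        status_code, "Unexpected status; check the CloudWatch logs for the function."
    )


if __name__ == "__main__":
    for code in (200, 403, 502, 504):
        print(code, "->", explain_status(code))
```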

Step 5: Cleaning Up Resources

To remove all AWS resources created for this deployment:

python deploy.py cleanup

Or manually:

# Delete API Gateway
aws apigateway delete-rest-api --rest-api-id $API_ID

# Remove provisioned concurrency (if configured)
aws lambda delete-provisioned-concurrency-config \
    --function-name crawl4ai-function \
    --qualifier prod

# Delete alias (if created)
aws lambda delete-alias \
    --function-name crawl4ai-function \
    --name prod

# Delete Lambda function
aws lambda delete-function --function-name crawl4ai-function

# Delete ECR repository
aws ecr delete-repository --repository-name crawl4ai-lambda --force

Troubleshooting

Cold Start Issues

If you experience long cold starts:

Enable provisioned concurrency
Increase memory allocation (4096 MB recommended)
Ensure the Lambda function has enough ephemeral storage

Permission Errors

If you encounter permission errors:

Check the IAM role has the necessary permissions
Ensure API Gateway has permission to invoke the Lambda function
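
One way to check the second point is to inspect the function's resource policy, which `aws lambda get-policy --function-name crawl4ai-function --query Policy --output text` returns as a JSON string. The following sketch parses that string and looks for a statement that lets API Gateway invoke the function. The helper name `allows_api_gateway` is hypothetical, and the sample policy in the example is illustrative.

```python
import json


def _as_list(value):
    """Normalize a policy field that may be a string, a list, or missing."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def allows_api_gateway(policy_json: str) -> bool:
    """Return True if any statement allows API Gateway to invoke the function."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal", {})
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        if (
            statement.get("Effect") == "Allow"
            and "apigateway.amazonaws.com" in services
            and "lambda:InvokeFunction" in _as_list(statement.get("Action"))
        ):
            return True
    return False


if __name__ == "__main__":
    sample = json.dumps({"Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "apigateway.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
    }]})
    print(allows_api_gateway(sample))
```

This sketch does not validate the `Condition` block, so a statement scoped to a different API or path would still pass; compare the source ARN by hand if requests are still rejected.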

Container Size Issues

If your container is too large:

Optimize the Dockerfile
Use multi-stage builds
Consider removing unnecessary dependencies

Performance Considerations

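Lambda allocates CPU in proportion to configured memory, so a higher memory setting generally speeds up CPU-bound work such as page rendering
Provisioned concurrency avoids cold starts for requests served by the pre-initialized environments, but it is billed for as long as it is configured, whether or not the function is invoked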
Optimize the Playwright setup for faster browser initialization
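
To put the memory and CPU relationship in rough numbers: AWS documents that a function has the equivalent of one vCPU at 1,769 MB and up to six vCPUs at the 10,240 MB maximum. The sketch below interpolates linearly between those points, which is an approximation rather than a documented formula; the function name is hypothetical.

```python
MB_PER_VCPU = 1769   # documented point: 1,769 MB is about one vCPU
MAX_VCPUS = 6        # documented ceiling at the 10,240 MB maximum


def approx_vcpus(memory_mb: int) -> float:
    """Estimate the vCPU share for a memory setting (linear approximation)."""
    return min(memory_mb / MB_PER_VCPU, MAX_VCPUS)


if __name__ == "__main__":
    for memory_mb in (1024, 2048, 4096, 10240):
        print(f"{memory_mb} MB -> about {approx_vcpus(memory_mb):.1f} vCPU")
```

By this estimate, the 4096 MB setting used in this guide gives a little over two vCPUs, which suits a headless browser that runs several processes at once.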

Security Best Practices

Use the principle of least privilege for IAM roles
Implement API Gateway authentication for production deployments
Consider storing sensitive configuration in AWS Secrets Manager or SSM Parameter Store, encrypted with AWS KMS

Useful AWS Console Links

Here are quick links to access important AWS console pages for monitoring and managing your deployment:

Resource	Console Link
Lambda Functions	AWS Lambda Console
Lambda Function Logs	CloudWatch Logs
API Gateway	API Gateway Console
ECR Repositories	ECR Console
IAM Roles	IAM Console
CloudWatch Metrics	CloudWatch Metrics

Monitoring Lambda Execution

To monitor your Lambda function:

Go to the Lambda function console
Select your function (crawl4ai-function)
Click the "Monitor" tab to see:
- Invocation metrics
- Success/failure rates
- Duration statistics

Viewing Lambda Logs

To see detailed execution logs:

Go to CloudWatch Logs
Find the log group named /aws/lambda/crawl4ai-function
Click to see the latest log streams
Each stream contains logs from one execution environment, which may have handled several invocations
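
The same logs can be searched from a script instead of the console. The sketch below filters recent events in the function's log group. It assumes a boto3 CloudWatch Logs client is passed in (boto3 is not otherwise used in this guide), and the helper names are hypothetical. It reads only the first page of results.

```python
import time


def log_group_for(function_name: str) -> str:
    """Return the default CloudWatch log group name for a Lambda function."""
    return f"/aws/lambda/{function_name}"


def recent_messages(logs_client, function_name: str,
                    minutes: int = 15, pattern: str = "ERROR") -> list:
    """Return messages matching `pattern` from the last `minutes` minutes.

    `logs_client` is assumed to be a boto3 client created with
    boto3.client("logs", region_name="us-east-1").
    """
    start_ms = int((time.time() - minutes * 60) * 1000)  # API expects epoch milliseconds
    response = logs_client.filter_log_events(
        logGroupName=log_group_for(function_name),
        startTime=start_ms,
        filterPattern=pattern,
    )
    # Pagination via nextToken is ignored in this sketch.
    return [event["message"].rstrip() for event in response.get("events", [])]
```

Because results span all streams in the group, this avoids having to guess which execution environment handled a given request.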

Checking API Gateway Traffic

To monitor API requests:

Go to the API Gateway console
Select your API (crawl4ai-api)
Click "Dashboard" to see:
- API calls
- Latency
- Error rates

Conclusion

You now have Crawl4ai running as a serverless function on AWS Lambda! This setup lets you crawl websites on demand without maintaining infrastructure, while paying only for the compute time you use.
