Proxy Rotation Strategy Documentation
Overview
The Crawl4AI FastAPI server supports proxy rotation, which distributes requests across multiple proxy servers according to a configurable rotation strategy. This helps prevent IP blocking, spreads load across the proxy pool, and provides redundancy for high-availability crawling.
Available Proxy Rotation Strategies
| Strategy | Description | Use Case | Performance |
|---|---|---|---|
| round_robin | Cycles through proxies sequentially | Even distribution, predictable pattern | ⭐⭐⭐⭐⭐ |
| random | Randomly selects from available proxies | Unpredictable traffic pattern | ⭐⭐⭐⭐ |
| least_used | Uses proxy with lowest usage count | Optimal load balancing | ⭐⭐⭐ |
| failure_aware | Avoids failed proxies with auto-recovery | High availability, fault tolerance | ⭐⭐⭐⭐ |
API Endpoints
POST /crawl
Standard crawling endpoint with proxy rotation support.
Request Body:
{
  "urls": ["https://example.com"],
  "proxy_rotation_strategy": "round_robin",
  "proxies": [
    {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
    {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
  ],
  "browser_config": {},
  "crawler_config": {}
}
POST /crawl/stream
Streaming crawling endpoint with proxy rotation support.
Request Body:
{
  "urls": ["https://example.com"],
  "proxy_rotation_strategy": "failure_aware",
  "proxy_failure_threshold": 3,
  "proxy_recovery_time": 300,
  "proxies": [
    {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
    {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
  ],
  "browser_config": {},
  "crawler_config": {
    "stream": true
  }
}
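A client for the streaming endpoint might look like the sketch below. It assumes the response is newline-delimited JSON (one result object per line); that framing, the helper names `parse_ndjson` and `stream_crawl`, and the `url` field used in the test data are assumptions for illustration, not something this document specifies.

```python
import json
import urllib.request


def parse_ndjson(lines):
    """Yield one decoded JSON object per non-empty line (assumed NDJSON framing)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_crawl(payload, url="http://localhost:11235/crawl/stream"):
    """POST the payload and yield results as the server sends them."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP response object is iterable line by line.
    with urllib.request.urlopen(request) as response:
        yield from parse_ndjson(response)
```

Calling `for result in stream_crawl(payload): ...` with the request body above would then process each result as soon as it arrives instead of waiting for the whole batch.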
Parameters
proxy_rotation_strategy (optional)
- Type: string
- Default: null (no proxy rotation)
- Options: "round_robin", "random", "least_used", "failure_aware"
- Description: Selects the proxy rotation strategy for distributing requests
proxies (optional)
- Type: array of objects
- Default: null
- Description: List of proxy configurations to rotate between
- Required when: proxy_rotation_strategy is specified
proxy_failure_threshold (optional)
- Type: integer
- Default: 3
- Range: 1-10
- Description: Number of failures before marking a proxy as unhealthy (failure_aware only)
proxy_recovery_time (optional)
- Type: integer
- Default: 300 (5 minutes)
- Range: 60-3600 seconds
- Description: Time to wait before attempting to use a failed proxy again (failure_aware only)
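The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch; `failure_aware_params` is a hypothetical helper, and the server performs its own validation regardless.

```python
def failure_aware_params(failure_threshold=3, recovery_time=300):
    """Return the failure_aware-only request fields after a range check."""
    if not 1 <= failure_threshold <= 10:
        raise ValueError("proxy_failure_threshold must be between 1 and 10")
    if not 60 <= recovery_time <= 3600:
        raise ValueError("proxy_recovery_time must be between 60 and 3600 seconds")
    return {
        "proxy_failure_threshold": failure_threshold,
        "proxy_recovery_time": recovery_time,
    }


# Merge the checked fields into a request body.
payload = {
    "urls": ["https://example.com"],
    "proxy_rotation_strategy": "failure_aware",
    "proxies": [{"server": "http://proxy1.com:8080"}],
    **failure_aware_params(failure_threshold=2, recovery_time=180),
}
```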
Proxy Configuration Format
Full Configuration
{
  "server": "http://proxy.example.com:8080",
  "username": "proxy_user",
  "password": "proxy_pass",
  "ip": "192.168.1.100"
}
Minimal Configuration
{
  "server": "http://192.168.1.100:8080"
}
SOCKS Proxy Support
{
  "server": "socks5://127.0.0.1:1080",
  "username": "socks_user",
  "password": "socks_pass"
}
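If proxies are stored as single URLs with embedded credentials, they can be converted into the object format shown above. This is a sketch using only the standard library; `proxy_from_url` is a hypothetical helper, not part of the API, and it leaves out the optional ip field.

```python
from urllib.parse import unquote, urlsplit


def proxy_from_url(url):
    """Split scheme://user:pass@host:port into server/username/password."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a proxy URL: {url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    config = {"server": server}
    # Credentials are optional; percent-encoded characters are decoded.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config
```

For example, `proxy_from_url("socks5://socks_user:socks_pass@127.0.0.1:1080")` yields the same three fields as the SOCKS configuration above.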
Usage Examples
1. Round Robin Strategy
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://httpbin.org/ip"],
    "proxy_rotation_strategy": "round_robin",
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
      {"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
    ]
  }'
2. Random Strategy with Minimal Config
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://httpbin.org/headers"],
    "proxy_rotation_strategy": "random",
    "proxies": [
      {"server": "http://192.168.1.100:8080"},
      {"server": "http://192.168.1.101:8080"},
      {"server": "http://192.168.1.102:8080"}
    ]
  }'
3. Least Used Strategy with Load Balancing
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com", "https://httpbin.org/html", "https://httpbin.org/json"],
    "proxy_rotation_strategy": "least_used",
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
    ],
    "crawler_config": {
      "cache_mode": "bypass"
    }
  }'
4. Failure-Aware Strategy with High Availability
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "proxy_rotation_strategy": "failure_aware",
    "proxy_failure_threshold": 2,
    "proxy_recovery_time": 180,
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
      {"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
    ],
    "headless": true
  }'
5. Streaming with Proxy Rotation
curl -X POST "http://localhost:11235/crawl/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com", "https://httpbin.org/html"],
    "proxy_rotation_strategy": "round_robin",
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
    ],
    "crawler_config": {
      "stream": true,
      "cache_mode": "bypass"
    }
  }'
Combining with Anti-Bot Strategies
You can combine proxy rotation with anti-bot strategies for maximum effectiveness:
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://protected-site.com"],
    "anti_bot_strategy": "stealth",
    "proxy_rotation_strategy": "failure_aware",
    "proxy_failure_threshold": 2,
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"}
    ],
    "headless": true,
    "browser_config": {
      "enable_stealth": true
    }
  }'
Strategy Details
Round Robin Strategy
- Algorithm: Sequential cycling through proxy list
- Pros: Predictable, even distribution, simple
- Cons: Predictable pattern may be detectable
- Best for: General use, development, testing
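Sequential cycling can be illustrated with a few lines of Python. This is a minimal sketch of the idea, not the server's implementation; `RoundRobinRotator` is a hypothetical name.

```python
from itertools import cycle


class RoundRobinRotator:
    """Hypothetical helper: hands out proxies in a fixed, repeating order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(list(proxies))

    def next_proxy(self):
        # Each call advances one position and wraps around at the end.
        return next(self._cycle)


rotator = RoundRobinRotator([
    {"server": "http://proxy1.com:8080"},
    {"server": "http://proxy2.com:8080"},
])
order = [rotator.next_proxy()["server"] for _ in range(3)]
# order: proxy1, proxy2, then proxy1 again
```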
Random Strategy
- Algorithm: Random selection from available proxies
- Pros: Unpredictable pattern, good for evasion
- Cons: Uneven distribution possible
- Best for: Anti-detection, varying traffic patterns
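Random selection is a one-liner in principle. The sketch below (with the hypothetical helper `pick_random_proxy`) also accepts a seeded generator so that runs can be reproduced in tests; over a small number of requests the per-proxy counts will usually differ, which is the uneven distribution mentioned above.

```python
import random
from collections import Counter


def pick_random_proxy(proxies, rng=random):
    """Hypothetical helper: return one proxy chosen uniformly at random."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return rng.choice(proxies)


proxies = [
    {"server": "http://192.168.1.100:8080"},
    {"server": "http://192.168.1.101:8080"},
    {"server": "http://192.168.1.102:8080"},
]
rng = random.Random(42)  # seeded only to make the demo repeatable
counts = Counter(pick_random_proxy(proxies, rng)["server"] for _ in range(10))
```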
Least Used Strategy
- Algorithm: Selects proxy with minimum usage count
- Pros: Optimal load balancing, prevents overloading
- Cons: Slightly more complex, tracking overhead
- Best for: High-volume crawling, load balancing
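The minimum-usage selection described above can be sketched as follows. `LeastUsedRotator` is a hypothetical class, and breaking ties by list order is an assumption of this sketch.

```python
class LeastUsedRotator:
    """Hypothetical helper: always returns the proxy used the fewest times."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self.usage = [0] * len(self._proxies)

    def next_proxy(self):
        # min() returns the first index on ties, so earlier proxies win ties.
        index = min(range(len(self._proxies)), key=self.usage.__getitem__)
        self.usage[index] += 1
        return self._proxies[index]


rotator = LeastUsedRotator([
    {"server": "http://proxy1.com:8080"},
    {"server": "http://proxy2.com:8080"},
])
picked = [rotator.next_proxy()["server"] for _ in range(4)]
```

The usage counters are the tracking overhead listed under Cons: one integer per proxy, updated on every selection.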
Failure-Aware Strategy
- Algorithm: Tracks proxy health, auto-recovery
- Pros: High availability, fault tolerance, automatic recovery
- Cons: Most complex, memory overhead for tracking
- Best for: Production environments, critical crawling
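One way to picture health tracking with auto-recovery is the sketch below. It is an illustration under stated assumptions, not the server's code: `FailureAwareRotator` is a hypothetical class, healthy proxies are visited in round-robin order, a success resets the failure count, and a proxy becomes eligible again once the recovery time has elapsed.

```python
import time


class FailureAwareRotator:
    """Hypothetical helper: skips proxies that reached the failure threshold."""

    def __init__(self, proxies, failure_threshold=3, recovery_time=300,
                 clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)
        self._threshold = failure_threshold
        self._recovery = recovery_time
        self._clock = clock  # injectable so tests need not sleep
        self._failures = [0] * len(self.proxies)
        self._down_since = [None] * len(self.proxies)
        self._next = 0

    def _is_healthy(self, i):
        down = self._down_since[i]
        if down is None:
            return True
        if self._clock() - down >= self._recovery:
            # Recovery window elapsed: give the proxy another chance.
            self._failures[i] = 0
            self._down_since[i] = None
            return True
        return False

    def next_index(self):
        """Return the index of the next healthy proxy."""
        n = len(self.proxies)
        for offset in range(n):
            i = (self._next + offset) % n
            if self._is_healthy(i):
                self._next = (i + 1) % n
                return i
        raise RuntimeError("no healthy proxies available")

    def report_failure(self, i):
        self._failures[i] += 1
        if self._failures[i] >= self._threshold and self._down_since[i] is None:
            self._down_since[i] = self._clock()

    def report_success(self, i):
        self._failures[i] = 0
```

The per-proxy failure counters and timestamps are the memory overhead listed under Cons; `failure_threshold` and `recovery_time` play the roles of proxy_failure_threshold and proxy_recovery_time.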
Error Handling
Common Errors
Invalid Proxy Configuration
{
  "error": "Invalid proxy configuration: Proxy configuration missing 'server' field: {'username': 'user1'}"
}
Unsupported Strategy
{
  "error": "Unsupported proxy rotation strategy: invalid_strategy. Available: round_robin, random, least_used, failure_aware"
}
Missing Proxies
When proxy_rotation_strategy is specified but proxies is empty:
{
  "error": "proxy_rotation_strategy specified but no proxies provided"
}
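The three cases above can also be caught on the client before a request is sent. The sketch below reproduces the documented messages; `validate_rotation` is a hypothetical helper, and the order in which the checks run is an assumption.

```python
STRATEGIES = ("round_robin", "random", "least_used", "failure_aware")


def validate_rotation(strategy, proxies):
    """Return an error message mirroring the documented ones, or None if valid."""
    if strategy not in STRATEGIES:
        return (
            f"Unsupported proxy rotation strategy: {strategy}. "
            f"Available: {', '.join(STRATEGIES)}"
        )
    if not proxies:
        return "proxy_rotation_strategy specified but no proxies provided"
    for proxy in proxies:
        if "server" not in proxy:
            return (
                "Invalid proxy configuration: "
                f"Proxy configuration missing 'server' field: {proxy}"
            )
    return None
```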
Environment Variable Support
You can also configure proxies using environment variables:
# Set proxy list (comma-separated)
export PROXIES="proxy1.com:8080:user1:pass1,proxy2.com:8080:user2:pass2"
# Set default strategy
export PROXY_ROTATION_STRATEGY="round_robin"
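To reuse the same variable in a client script, the comma-separated list can be parsed into proxy objects. This sketch assumes each entry is host:port or host:port:username:password and that the scheme is http; both the helper name `parse_proxies` and those assumptions are illustrative, and the server's own parsing may differ.

```python
import os


def parse_proxies(value):
    """Parse 'host:port:user:pass,host:port' into a list of proxy configs."""
    proxies = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split at most three times so a password may itself contain ':'.
        fields = entry.split(":", 3)
        if len(fields) not in (2, 4):
            raise ValueError(f"unrecognized proxy entry: {entry!r}")
        config = {"server": f"http://{fields[0]}:{fields[1]}"}  # http is assumed
        if len(fields) == 4:
            config["username"] = fields[2]
            config["password"] = fields[3]
        proxies.append(config)
    return proxies


proxies = parse_proxies(os.environ.get("PROXIES", ""))
strategy = os.environ.get("PROXY_ROTATION_STRATEGY", "round_robin")
```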
Performance Considerations
- Strategy Overhead:
  - Round Robin: Minimal overhead
  - Random: Low overhead
  - Least Used: Medium overhead (usage tracking)
  - Failure Aware: High overhead (health tracking)
- Memory Usage:
  - Round Robin: ~O(n) where n = number of proxies
  - Random: ~O(n)
  - Least Used: ~O(n) + usage counters
  - Failure Aware: ~O(n) + health tracking data
- Concurrent Safety:
  - All strategies are async-safe with proper locking
  - No race conditions in proxy selection
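Lock-protected selection could look like the sketch below. It is a generic asyncio pattern under the assumption that selection state is guarded by an `asyncio.Lock`; the class name `AsyncRoundRobin` is hypothetical and this is not taken from the server's source.

```python
import asyncio
from collections import Counter


class AsyncRoundRobin:
    """Hypothetical helper: round-robin selection guarded by an asyncio lock."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._index = 0
        self._lock = asyncio.Lock()

    async def next_proxy(self):
        # Only one coroutine at a time may read and advance the index.
        async with self._lock:
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            return proxy


async def demo():
    rotator = AsyncRoundRobin(["p1", "p2", "p3"])
    # Six concurrent selections over three proxies.
    return await asyncio.gather(*(rotator.next_proxy() for _ in range(6)))


picks = asyncio.run(demo())
usage = Counter(picks)
```

The lock becomes essential once the critical section itself awaits, for example if selection performed a health check; in this short sketch it mainly documents the intent.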
Best Practices
- Production Deployment:
  - Use failure_aware strategy for high availability
  - Set appropriate failure thresholds (2-3)
  - Use recovery times between 3-10 minutes
- Development/Testing:
  - Use round_robin for predictable behavior
  - Start with small proxy pools (2-3 proxies)
- Anti-Detection:
  - Combine with stealth or undetected anti-bot strategies
  - Use random strategy for unpredictable patterns
  - Vary proxy geographic locations
- Load Balancing:
  - Use least_used for even distribution
  - Monitor proxy performance and adjust pools accordingly
- Error Monitoring:
  - Monitor failure rates with failure_aware strategy
  - Set up alerts for proxy pool depletion
  - Implement fallback mechanisms
Integration Examples
Python Requests
import requests

# Two proxies, rotated in round-robin order.
payload = {
    "urls": ["https://example.com"],
    "proxy_rotation_strategy": "round_robin",
    "proxies": [
        {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
        {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
    ],
}

# Send the crawl request and print the decoded JSON response.
response = requests.post("http://localhost:11235/crawl", json=payload)
print(response.json())
JavaScript/Node.js
const axios = require('axios');

// failure_aware rotation: a proxy is marked unhealthy after 2 failures.
const payload = {
  urls: ["https://example.com"],
  proxy_rotation_strategy: "failure_aware",
  proxy_failure_threshold: 2,
  proxies: [
    {server: "http://proxy1.com:8080", username: "user1", password: "pass1"},
    {server: "http://proxy2.com:8080", username: "user2", password: "pass2"}
  ]
};

axios.post('http://localhost:11235/crawl', payload)
  .then(response => console.log(response.data))
  .catch(error => console.error(error));
cURL with Multiple URLs
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://httpbin.org/html",
      "https://httpbin.org/json",
      "https://httpbin.org/xml"
    ],
    "proxy_rotation_strategy": "least_used",
    "proxies": [
      {"server": "http://proxy1.com:8080", "username": "user1", "password": "pass1"},
      {"server": "http://proxy2.com:8080", "username": "user2", "password": "pass2"},
      {"server": "http://proxy3.com:8080", "username": "user3", "password": "pass3"}
    ],
    "crawler_config": {
      "cache_mode": "bypass",
      "wait_for_images": false
    }
  }'
Troubleshooting
Common Issues
- All proxies failing:
  - Check proxy connectivity
  - Verify authentication credentials
  - Ensure proxy servers support the target protocols
- Uneven distribution:
  - Use least_used strategy for better balancing
  - Monitor proxy usage patterns
- High memory usage:
  - Reduce proxy pool size
  - Consider using round_robin instead of failure_aware
- Slow performance:
  - Check proxy response times
  - Use geographically closer proxies
  - Reduce failure thresholds
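For the connectivity check, a plain TCP connection attempt is often enough to tell a dead proxy from a misconfigured one. The sketch below uses only the standard library; `is_reachable` is a hypothetical helper, and it verifies only that the port accepts connections, not that credentials or protocols are correct.

```python
import socket
from urllib.parse import urlsplit


def is_reachable(server, timeout=3.0):
    """Return True if a TCP connection to the proxy's host:port succeeds."""
    parts = urlsplit(server)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {server!r}")
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Running it over the pool, for example `[p["server"] for p in proxies if not is_reachable(p["server"])]`, lists the proxies worth removing before a crawl.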
Debug Information
Enable verbose logging to see proxy selection details:
{
  "urls": ["https://example.com"],
  "proxy_rotation_strategy": "failure_aware",
  "proxies": [...],
  "crawler_config": {
    "verbose": true
  }
}
This will log which proxy is selected for each request and any failure/recovery events.