Add comprehensive tests for anti-bot strategies and extended features

- Implemented `test_adapter_verification.py` to verify correct usage of browser adapters.
- Created `test_all_features.py` for a comprehensive suite covering URL seeding, adaptive crawling, browser adapters, proxy rotation, and dispatchers.
- Developed `test_anti_bot_strategy.py` to validate the functionality of various anti-bot strategies.
- Added `test_antibot_simple.py` for simple testing of anti-bot strategies using async web crawling.
- Introduced `test_bot_detection.py` to assess adapter performance against bot detection mechanisms.
- Compiled `test_final_summary.py` to provide a detailed summary of all tests and their results.
This commit is contained in:
AHMET YILMAZ
2025-10-07 18:51:13 +08:00
parent f00e8cbf35
commit 201843a204
23 changed files with 5265 additions and 96 deletions

View File

@@ -13,6 +13,7 @@
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Additional API Endpoints](#additional-api-endpoints)
- [Dispatcher Management](#dispatcher-management)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
@@ -34,6 +35,8 @@
- [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
- [Customizing Your Configuration](#customizing-your-configuration)
- [Configuration Recommendations](#configuration-recommendations)
- [Testing & Validation](#testing--validation)
- [Dispatcher Demo Test Suite](#dispatcher-demo-test-suite)
- [Getting Help](#getting-help)
- [Summary](#summary)
@@ -332,6 +335,134 @@ Access the MCP tool schemas at `http://localhost:11235/mcp/schema` for detailed
In addition to the core `/crawl` and `/crawl/stream` endpoints, the server provides several specialized endpoints:
### Dispatcher Management
The server supports multiple dispatcher strategies for managing concurrent crawling operations. Dispatchers control how many crawl jobs run simultaneously based on different rules like fixed concurrency limits or system memory availability.
#### Available Dispatchers
**Memory Adaptive Dispatcher** (Default)
- Dynamically adjusts concurrency based on system memory usage
- Monitors memory pressure and adapts crawl sessions accordingly
- Automatically requeues tasks under high memory conditions
- Implements fairness timeout for long-waiting URLs
**Semaphore Dispatcher**
- Fixed concurrency limit using semaphore-based control
- Simple and predictable resource usage
- Ideal for controlled crawling scenarios
#### Dispatcher Endpoints
**List Available Dispatchers**
```bash
GET /dispatchers
```
Returns information about all available dispatcher types, their configurations, and features.
```bash
curl http://localhost:11234/dispatchers | jq
```
**Get Default Dispatcher**
```bash
GET /dispatchers/default
```
Returns the current default dispatcher configuration.
```bash
curl http://localhost:11234/dispatchers/default | jq
```
**Get Dispatcher Statistics**
```bash
GET /dispatchers/{dispatcher_type}/stats
```
Returns real-time statistics for a specific dispatcher including active sessions, memory usage, and configuration.
```bash
# Get memory_adaptive dispatcher stats
curl http://localhost:11234/dispatchers/memory_adaptive/stats | jq
# Get semaphore dispatcher stats
curl http://localhost:11234/dispatchers/semaphore/stats | jq
```
#### Using Dispatchers in Crawl Requests
You can specify which dispatcher to use in your crawl requests by adding the `dispatcher` field:
**Using Default Dispatcher (memory_adaptive)**
```bash
curl -X POST http://localhost:11234/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {}
}'
```
**Using Semaphore Dispatcher**
```bash
curl -X POST http://localhost:11234/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://httpbin.org/html"],
"browser_config": {},
"crawler_config": {},
"dispatcher": "semaphore"
}'
```
**Python SDK Example**
```python
import requests
# Crawl with memory adaptive dispatcher (default)
response = requests.post(
"http://localhost:11234/crawl",
json={
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {}
}
)
# Crawl with semaphore dispatcher
response = requests.post(
"http://localhost:11234/crawl",
json={
"urls": ["https://example.com"],
"browser_config": {},
"crawler_config": {},
"dispatcher": "semaphore"
}
)
```
#### Dispatcher Configuration
Dispatchers are configured with sensible defaults that work well for most use cases:
**Memory Adaptive Dispatcher Defaults:**
- `memory_threshold_percent`: 70.0 - Start adjusting at 70% memory usage
- `critical_threshold_percent`: 85.0 - Critical memory pressure threshold
- `recovery_threshold_percent`: 65.0 - Resume normal operation below 65%
- `check_interval`: 1.0 - Check memory every second
- `max_session_permit`: 20 - Maximum concurrent sessions
- `fairness_timeout`: 600.0 - Prioritize URLs waiting > 10 minutes
- `memory_wait_timeout`: 600.0 - Fail if high memory persists > 10 minutes
**Semaphore Dispatcher Defaults:**
- `semaphore_count`: 5 - Maximum concurrent crawl operations
- `max_session_permit`: 10 - Maximum total sessions allowed
> 💡 **Tip**: Use `memory_adaptive` for dynamic workloads where memory availability varies. Use `semaphore` for predictable, controlled crawling with fixed concurrency limits.
### HTML Extraction Endpoint
```
@@ -813,6 +944,93 @@ You can override the default `config.yml`.
- Increase batch_process timeout for large content
- Adjust stream_init timeout based on initial response times
## Testing & Validation
We provide two comprehensive test suites to validate all Docker server functionality:
### 1. Extended Features Test Suite ✅ **100% Pass Rate**
Complete validation of all advanced features including URL seeding, adaptive crawling, browser adapters, proxy rotation, and dispatchers.
```bash
# Run all extended features tests
cd tests/docker/extended_features
./run_extended_tests.sh
# Custom server URL
./run_extended_tests.sh --server http://localhost:8080
```
**Test Coverage (12 tests):**
- ✅ **URL Seeding** (2 tests): Basic seeding + domain filters
- ✅ **Adaptive Crawling** (2 tests): Basic + custom thresholds
- ✅ **Browser Adapters** (3 tests): Default, Stealth, Undetected
- ✅ **Proxy Rotation** (2 tests): Round Robin, Random strategies
- ✅ **Dispatchers** (3 tests): Memory Adaptive, Semaphore, Management APIs
**Current Status:**
```
Total Tests: 12
Passed: 12
Failed: 0
Pass Rate: 100.0% ✅
Average Duration: ~8.8 seconds
```
Features:
- Rich formatted output with tables and panels
- Real-time progress indicators
- Detailed error diagnostics
- Category-based results grouping
- Server health checks
See [`tests/docker/extended_features/README_EXTENDED_TESTS.md`](../../tests/docker/extended_features/README_EXTENDED_TESTS.md) for full documentation and API response format reference.
### 2. Dispatcher Demo Test Suite
Focused tests for dispatcher functionality with performance comparisons:
```bash
# Run all tests
cd test_scripts
./run_dispatcher_tests.sh
# Run specific category
./run_dispatcher_tests.sh -c basic # Basic dispatcher usage
./run_dispatcher_tests.sh -c integration # Integration with other features
./run_dispatcher_tests.sh -c endpoints # Dispatcher management endpoints
./run_dispatcher_tests.sh -c performance # Performance comparison
./run_dispatcher_tests.sh -c error # Error handling
# Custom server URL
./run_dispatcher_tests.sh -s http://your-server:port
```
**Test Coverage (17 tests):**
- **Basic Usage Tests**: Single/multiple URL crawling with different dispatchers
- **Integration Tests**: Dispatchers combined with anti-bot strategies, browser configs, JS execution, screenshots
- **Endpoint Tests**: Dispatcher management API validation
- **Performance Tests**: Side-by-side comparison of memory_adaptive vs semaphore
- **Error Handling**: Edge cases and validation tests
Results are displayed with rich formatting, timing information, and success rates. See `test_scripts/README_DISPATCHER_TESTS.md` for full documentation.
### Quick Test Commands
```bash
# Test all features (recommended)
./tests/docker/extended_features/run_extended_tests.sh
# Test dispatchers only
./test_scripts/run_dispatcher_tests.sh
# Test server health
curl http://localhost:11235/health
# Test dispatcher endpoint
curl http://localhost:11235/dispatchers | jq
```
## Getting Help
We're here to help you succeed with Crawl4AI! Here's how to get support: