feat(tests): implement high volume stress testing framework

Add a comprehensive stress testing solution for the SDK using arun_many and the dispatcher system:
- Create test_stress_sdk.py for running high volume crawl tests
- Add run_benchmark.py for orchestrating tests with predefined configs
- Implement benchmark_report.py for generating performance reports
- Add memory tracking and local test site generation
- Support both streaming and batch processing modes
- Add detailed documentation in README.md

The framework enables testing SDK performance, concurrency handling,
and memory behavior under high-volume scenarios.
Author: UncleCode
Date: 2025-04-17 22:31:51 +08:00
Parent: 94d486579c
Commit: 921e0c46b6
7 changed files with 2161 additions and 1 deletion

tests/memory/README.md (new file, 315 lines)

# Crawl4AI Stress Testing and Benchmarking
This directory contains tools for stress testing Crawl4AI's `arun_many` method and dispatcher system with high volumes of URLs to evaluate performance, concurrency handling, and potentially detect memory issues. It also includes a benchmarking system to track performance over time.
## Quick Start
```bash
# Run a default stress test (small config) and generate a report
# (Assumes run_all.sh is updated to call run_benchmark.py)
./run_all.sh
```
*Note: `run_all.sh` might need to be updated if it directly called the old script.*
## Overview
The stress testing system works by:
1. Generating a local test site with heavy HTML pages (regenerated by default for each test).
2. Starting a local HTTP server to serve these pages.
3. Running Crawl4AI's `arun_many` method against this local site using the `MemoryAdaptiveDispatcher` with configurable concurrency (`max_sessions`).
4. Monitoring performance metrics via the `CrawlerMonitor` and optionally logging memory usage.
5. Optionally generating detailed benchmark reports with visualizations using `benchmark_report.py`.
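Steps 1–2 can be illustrated with a minimal, self-contained sketch using only the standard library (this is not the framework's actual generator; the `page_N.html` naming and filler size are assumptions):

```python
import tempfile
import threading
import urllib.request
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

def generate_test_site(root: Path, num_pages: int = 5, filler_kb: int = 64) -> list[str]:
    """Step 1: write heavy HTML pages and return their relative URLs."""
    root.mkdir(parents=True, exist_ok=True)
    filler = "<p>" + ("stress " * 200) + "</p>\n"
    urls = []
    for i in range(num_pages):
        body = filler * (filler_kb * 1024 // len(filler) + 1)
        (root / f"page_{i}.html").write_text(
            f"<html><head><title>Page {i}</title></head><body>{body}</body></html>"
        )
        urls.append(f"page_{i}.html")
    return urls

site = Path(tempfile.mkdtemp()) / "test_site"
pages = generate_test_site(site, num_pages=3)

# Step 2: serve the pages locally (port 0 = pick any free port).
handler = partial(SimpleHTTPRequestHandler, directory=str(site))
server = HTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

base = f"http://127.0.0.1:{server.server_address[1]}"
html = urllib.request.urlopen(f"{base}/{pages[0]}").read().decode()
server.shutdown()
print(html[:6])  # -> <html>
```

Steps 3–5 then point `arun_many` at `base` with a `MemoryAdaptiveDispatcher`, which the real scripts handle.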
## Available Tools
- `test_stress_sdk.py` - Main stress testing script utilizing `arun_many` and dispatchers.
- `benchmark_report.py` - Report generator for comparing test results (assumes compatibility with `test_stress_sdk.py` outputs).
- `run_benchmark.py` - Python script with predefined test configurations that orchestrates tests using `test_stress_sdk.py`.
- `run_all.sh` - Simple wrapper script (may need updating).
## Usage Guide
### Using Predefined Configurations (Recommended)
The `run_benchmark.py` script offers the easiest way to run standardized tests:
```bash
# Quick test (50 URLs, 4 max sessions)
python run_benchmark.py quick
# Medium test (500 URLs, 16 max sessions)
python run_benchmark.py medium
# Large test (1000 URLs, 32 max sessions)
python run_benchmark.py large
# Extreme test (2000 URLs, 64 max sessions)
python run_benchmark.py extreme
# Custom configuration
python run_benchmark.py custom --urls 300 --max-sessions 24 --chunk-size 50
# Run 'small' test in streaming mode
python run_benchmark.py small --stream
# Override max_sessions for the 'medium' config
python run_benchmark.py medium --max-sessions 20
# Skip benchmark report generation after the test
python run_benchmark.py small --no-report
# Clean up reports and site files before running
python run_benchmark.py medium --clean
```
#### `run_benchmark.py` Parameters
| Parameter | Default | Description |
| -------------------- | --------------- | --------------------------------------------------------------------------- |
| `config` | *required* | Test configuration: `quick`, `small`, `medium`, `large`, `extreme`, `custom`|
| `--urls` | config-specific | Number of URLs (required for `custom`) |
| `--max-sessions` | config-specific | Max concurrent sessions managed by dispatcher (required for `custom`) |
| `--chunk-size` | config-specific | URLs per batch for non-stream logging (required for `custom`) |
| `--stream` | False | Enable streaming results (disables batch logging) |
| `--monitor-mode` | DETAILED | `DETAILED` or `AGGREGATED` display for the live monitor |
| `--use-rate-limiter` | False | Enable basic rate limiter in the dispatcher |
| `--port` | 8000 | HTTP server port |
| `--no-report` | False | Skip generating comparison report via `benchmark_report.py` |
| `--clean` | False | Clean up reports and site files before running |
| `--keep-server-alive`| False | Keep local HTTP server running after test |
| `--use-existing-site`| False | Use existing site on specified port (no local server start/site gen) |
| `--skip-generation` | False | Use existing site files but start local server |
| `--keep-site` | False | Keep generated site files after test |
#### Predefined Configurations
| Configuration | URLs | Max Sessions | Chunk Size | Description |
| ------------- | ------ | ------------ | ---------- | -------------------------------- |
| `quick` | 50 | 4 | 10 | Quick test for basic validation |
| `small` | 100 | 8 | 20 | Small test for routine checks |
| `medium` | 500 | 16 | 50 | Medium test for thorough checks |
| `large` | 1000 | 32 | 100 | Large test for stress testing |
| `extreme` | 2000 | 64 | 200 | Extreme test for limit testing |
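Internally, `run_benchmark.py` presumably maps these names to parameter sets and then applies CLI overrides; a hypothetical sketch of that logic (the dict layout and function name are illustrative, not the script's actual code):

```python
# Values mirror the table above; "custom" starts empty and must supply
# --urls, --max-sessions, and --chunk-size explicitly.
CONFIGS = {
    "quick":   {"urls": 50,   "max_sessions": 4,  "chunk_size": 10},
    "small":   {"urls": 100,  "max_sessions": 8,  "chunk_size": 20},
    "medium":  {"urls": 500,  "max_sessions": 16, "chunk_size": 50},
    "large":   {"urls": 1000, "max_sessions": 32, "chunk_size": 100},
    "extreme": {"urls": 2000, "max_sessions": 64, "chunk_size": 200},
}

def resolve_config(name, **overrides):
    """Start from a predefined config and apply CLI overrides
    such as --max-sessions."""
    base = dict(CONFIGS.get(name, {}))
    base.update({k: v for k, v in overrides.items() if v is not None})
    missing = {"urls", "max_sessions", "chunk_size"} - base.keys()
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    return base

print(resolve_config("medium", max_sessions=20))
# -> {'urls': 500, 'max_sessions': 20, 'chunk_size': 50}
```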
### Direct Usage of `test_stress_sdk.py`
For fine-grained control or debugging, you can run the stress test script directly:
```bash
# Test with 200 URLs and 32 max concurrent sessions
python test_stress_sdk.py --urls 200 --max-sessions 32 --chunk-size 40
# Clean up previous test data first
python test_stress_sdk.py --clean-reports --clean-site --urls 100 --max-sessions 16 --chunk-size 20
# Change the HTTP server port and use aggregated monitor
python test_stress_sdk.py --port 8088 --urls 100 --max-sessions 16 --monitor-mode AGGREGATED
# Enable streaming mode and use rate limiting
python test_stress_sdk.py --urls 50 --max-sessions 8 --stream --use-rate-limiter
# Change report output location
python test_stress_sdk.py --report-path custom_reports --urls 100 --max-sessions 16
```
#### `test_stress_sdk.py` Parameters
| Parameter | Default | Description |
| -------------------- | ---------- | -------------------------------------------------------------------- |
| `--urls` | 100 | Number of URLs to test |
| `--max-sessions` | 16 | Maximum concurrent crawling sessions managed by the dispatcher |
| `--chunk-size` | 10 | Number of URLs per batch (relevant for non-stream logging) |
| `--stream` | False | Enable streaming results (disables batch logging) |
| `--monitor-mode` | DETAILED | `DETAILED` or `AGGREGATED` display for the live `CrawlerMonitor` |
| `--use-rate-limiter` | False | Enable a basic `RateLimiter` within the dispatcher |
| `--site-path` | "test_site"| Path to store/use the generated test site |
| `--port` | 8000 | Port for the local HTTP server |
| `--report-path` | "reports" | Path to save test result summary (JSON) and memory samples (CSV) |
| `--skip-generation` | False | Use existing test site files but still start local server |
| `--use-existing-site`| False | Use existing site on specified port (no local server/site gen) |
| `--keep-server-alive`| False | Keep local HTTP server running after test completion |
| `--keep-site` | False | Keep the generated test site files after test completion |
| `--clean-reports` | False | Clean up report directory before running |
| `--clean-site` | False | Clean up site directory before/after running (see script logic) |
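A CLI matching the table above could be wired with `argparse` roughly as follows (a sketch; the real script's argument definitions may differ):

```python
import argparse

def build_parser():
    """Mirror of the parameter table; defaults taken from that table."""
    p = argparse.ArgumentParser(description="Stress test arun_many")
    p.add_argument("--urls", type=int, default=100)
    p.add_argument("--max-sessions", type=int, default=16)
    p.add_argument("--chunk-size", type=int, default=10)
    p.add_argument("--stream", action="store_true")
    p.add_argument("--monitor-mode", choices=["DETAILED", "AGGREGATED"],
                   default="DETAILED")
    p.add_argument("--use-rate-limiter", action="store_true")
    p.add_argument("--site-path", default="test_site")
    p.add_argument("--port", type=int, default=8000)
    p.add_argument("--report-path", default="reports")
    p.add_argument("--skip-generation", action="store_true")
    p.add_argument("--use-existing-site", action="store_true")
    p.add_argument("--keep-server-alive", action="store_true")
    p.add_argument("--keep-site", action="store_true")
    p.add_argument("--clean-reports", action="store_true")
    p.add_argument("--clean-site", action="store_true")
    return p

args = build_parser().parse_args(["--urls", "50", "--stream"])
print(args.urls, args.stream, args.max_sessions)  # -> 50 True 16
```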
### Generating Reports Only
If you only want to generate a benchmark report from existing test results (assuming `benchmark_report.py` is compatible):
```bash
# Generate a report from existing test results in ./reports/
python benchmark_report.py
# Limit to the most recent 5 test results
python benchmark_report.py --limit 5
# Specify a custom source directory for test results
python benchmark_report.py --reports-dir alternate_results
```
#### `benchmark_report.py` Parameters (Assumed)
| Parameter | Default | Description |
| --------------- | -------------------- | ----------------------------------------------------------- |
| `--reports-dir` | "reports" | Directory containing `test_stress_sdk.py` result files |
| `--output-dir` | "benchmark_reports" | Directory to save generated HTML reports and charts |
| `--limit` | None (all results) | Limit comparison to N most recent test results |
| `--output-file` | Auto-generated | Custom output filename for the HTML report |
## Understanding the Test Output
### Real-time Progress Display (`CrawlerMonitor`)
When running `test_stress_sdk.py`, the `CrawlerMonitor` provides a live view of the crawling process managed by the dispatcher.
- **DETAILED Mode (Default):** Shows individual task status (Queued, Active, Completed, Failed), timings, memory usage per task (if `psutil` is available), overall queue statistics, and memory pressure status (if `psutil` available).
- **AGGREGATED Mode:** Shows summary counts (Queued, Active, Completed, Failed), overall progress percentage, estimated time remaining, average URLs/sec, and memory pressure status.
### Batch Log Output (Non-Streaming Mode Only)
If running `test_stress_sdk.py` **without** the `--stream` flag, you will *also* see per-batch summary lines printed to the console *after* the monitor display, once each chunk of URLs finishes processing:
```
Batch | Progress | Start Mem | End Mem | URLs/sec | Success/Fail | Time (s) | Status
───────────────────────────────────────────────────────────────────────────────────────────
1 | 10.0% | 50.1 MB | 55.3 MB | 23.8 | 10/0 | 0.42 | Success
2 | 20.0% | 55.3 MB | 60.1 MB | 24.1 | 10/0 | 0.41 | Success
...
```
This display provides chunk-specific metrics:
- **Batch**: The batch number being reported.
- **Progress**: Overall percentage of total URLs processed *after* this batch.
- **Start Mem / End Mem**: Memory usage before and after processing this batch (if tracked).
- **URLs/sec**: Processing speed *for this specific batch*.
- **Success/Fail**: Number of successful and failed URLs *in this batch*.
- **Time (s)**: Wall-clock time taken to process *this batch*.
- **Status**: Color-coded status for the batch outcome.
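The per-batch bookkeeping behind this display can be sketched as follows (crawling is replaced by a stub and the memory columns are omitted; the real script gets those numbers from `arun_many` and the memory tracker):

```python
import time
from itertools import islice

def chunked(seq, size):
    """Yield successive chunks of `size` items (the non-stream chunking)."""
    it = iter(seq)
    while chunk := list(islice(it, size)):
        yield chunk

def crawl_batch(urls):
    """Stub standing in for dispatching one chunk through arun_many;
    returns (successes, failures)."""
    time.sleep(0.01)  # pretend to crawl
    return len(urls), 0

urls = [f"http://127.0.0.1:8000/page_{i}.html" for i in range(30)]
chunk_size = 10
done = 0
for batch_no, batch in enumerate(chunked(urls, chunk_size), start=1):
    start = time.perf_counter()
    ok, fail = crawl_batch(batch)
    elapsed = time.perf_counter() - start
    done += len(batch)
    progress = 100.0 * done / len(urls)
    rate = len(batch) / elapsed if elapsed else 0.0
    status = "Success" if fail == 0 else "Partial"
    print(f"{batch_no:>5} | {progress:5.1f}% | {rate:8.1f} | "
          f"{ok}/{fail} | {elapsed:8.2f} | {status}")
```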
### Summary Output
After test completion, a final summary is displayed:
```
================================================================================
Test Completed
================================================================================
Test ID: 20250418_103015
Configuration: 100 URLs, 16 max sessions, Chunk: 10, Stream: False, Monitor: DETAILED
Results: 100 successful, 0 failed (100 processed, 100.0% success)
Performance: 5.85 seconds total, 17.09 URLs/second avg
Memory Usage: Start: 50.1 MB, End: 75.3 MB, Max: 78.1 MB, Growth: 25.2 MB
Results summary saved to reports/test_summary_20250418_103015.json
```
### HTML Report Structure (Generated by `benchmark_report.py`)
Assuming `benchmark_report.py` produces the same output as before, the benchmark report contains several sections:
1. **Summary**: Overview of the latest test results and trends
2. **Performance Comparison**: Charts showing throughput across tests
3. **Memory Usage**: Detailed memory usage graphs for each test
4. **Detailed Results**: Tabular data of all test metrics
5. **Conclusion**: Automated analysis of performance and memory patterns
### Memory Metrics
Memory growth is the key metric for detecting leaks: compare memory at the start and end of the run (the "Growth" value in the summary), and watch for usage that climbs steadily across batches without plateauing or being reclaimed.
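As a rough illustration of how the recorded samples might be screened for leaks (the thresholds and the split-halves heuristic below are illustrative, not part of the framework):

```python
def memory_growth_mb(samples):
    """Net growth across a run of memory samples (MB)."""
    return samples[-1] - samples[0] if samples else 0.0

def looks_like_leak(samples, min_growth_mb=20.0):
    """Flag steady growth: large net growth AND the second half of the run
    averaging noticeably higher than the first half (warm-up then plateau
    is normal and should not be flagged)."""
    if len(samples) < 4:
        return False
    half = len(samples) // 2
    first = sum(samples[:half]) / half
    second = sum(samples[half:]) / (len(samples) - half)
    return memory_growth_mb(samples) >= min_growth_mb and second > first * 1.1

steady = [50.0, 60.0, 72.0, 85.0, 95.0, 110.0]   # keeps climbing -> suspicious
plateau = [60.0, 74.0, 75.0, 74.5, 75.2, 75.0]   # warm-up then flat -> fine
print(looks_like_leak(steady), looks_like_leak(plateau))  # -> True False
```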
### Performance Metrics
Since the dispatcher manages concurrency itself, a per-worker URL count is less meaningful here; focus on overall URLs/sec.
Key performance indicators include:
- **URLs per Second**: Higher is better (throughput)
- **Success Rate**: Should be 100% in normal conditions
- **Total Processing Time**: Lower is better
- **Dispatcher Efficiency**: Observe queue lengths and wait times in the monitor (Detailed mode)
### Raw Data Files
Raw data is saved in the `--report-path` directory (default `./reports/`):
- **JSON files** (`test_summary_*.json`): Contains the final summary for each test run.
- **CSV files** (`memory_samples_*.csv`): Contains time-series memory samples taken during the test run.
Example of reading raw data:
```python
import json
import pandas as pd

# Load test summary
test_id = "20250418_103015"  # Example ID
with open(f'reports/test_summary_{test_id}.json', 'r') as f:
    results = json.load(f)

# Load memory samples
memory_df = pd.read_csv(f'reports/memory_samples_{test_id}.csv')

# Analyze memory_df (e.g., calculate growth, plot)
if not memory_df['memory_info_mb'].isnull().all():
    growth = memory_df['memory_info_mb'].iloc[-1] - memory_df['memory_info_mb'].iloc[0]
    print(f"Total Memory Growth: {growth:.1f} MB")
else:
    print("No valid memory samples found.")

print(f"Avg URLs/sec: {results['urls_processed'] / results['total_time_seconds']:.2f}")
```
## Visualization Dependencies
For full visualization capabilities in the HTML reports generated by `benchmark_report.py`, install its additional dependencies (for example `pandas`, which the raw-data example above already uses, and a plotting library such as `matplotlib`).
## Directory Structure
```
benchmarking/ # Or your top-level directory name
├── benchmark_reports/ # Generated HTML reports (by benchmark_report.py)
├── reports/ # Raw test result data (from test_stress_sdk.py)
├── test_site/ # Generated test content (temporary)
├── benchmark_report.py   # Report generator
├── run_benchmark.py      # Test runner with predefined configs
├── test_stress_sdk.py    # Main stress test implementation using arun_many
├── run_all.sh            # Simple wrapper script (may need updates)
└── requirements.txt      # Optional: visualization dependencies for benchmark_report.py
```
## Cleanup
To clean up after testing:
```bash
# Remove the test site content (if not using --keep-site)
rm -rf test_site
# Remove all raw reports and generated benchmark reports
rm -rf reports benchmark_reports
# Or use the --clean flag with run_benchmark.py
python run_benchmark.py medium --clean
```
## Use in CI/CD
These tests can be integrated into CI/CD pipelines:
```bash
# Example CI script
python run_benchmark.py medium --no-report # Run test without interactive report gen
# Check exit code
if [ $? -ne 0 ]; then echo "Stress test failed!"; exit 1; fi
# Optionally, run report generator and check its output/metrics
# python benchmark_report.py
# check_report_metrics.py reports/test_summary_*.json || exit 1
exit 0
```
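The `check_report_metrics.py` step above is commented out and hypothetical; a minimal sketch of such a gate might look like this (the JSON keys are assumptions borrowed from the raw-data example, not a documented schema):

```python
import json
import tempfile
from pathlib import Path

def check_summary(path, min_urls_per_sec=1.0):
    """Return a list of problems found in one test_summary_*.json file."""
    data = json.loads(Path(path).read_text())
    problems = []
    if data.get("urls_failed", 0) > 0:
        problems.append(f"{data['urls_failed']} URLs failed")
    rate = data["urls_processed"] / max(data["total_time_seconds"], 1e-9)
    if rate < min_urls_per_sec:
        problems.append(f"throughput {rate:.2f} URLs/sec below {min_urls_per_sec}")
    return problems

# Demo against a synthetic summary file; a CI wrapper would exit nonzero
# if any problems are returned.
demo = Path(tempfile.mkdtemp()) / "test_summary_demo.json"
demo.write_text(json.dumps(
    {"urls_processed": 100, "urls_failed": 0, "total_time_seconds": 5.85}))
print(check_summary(demo))  # -> []
```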
## Troubleshooting
- **HTTP Server Port Conflict**: Use `--port` with `run_benchmark.py` or `test_stress_sdk.py`.
- **Memory Tracking Issues**: The `SimpleMemoryTracker` uses platform commands (`ps`, `/proc`, `tasklist`). Ensure these are available and the script has permission. If it consistently fails, memory reporting will be limited.
- **Visualization Missing**: Related to `benchmark_report.py` and its dependencies.
- **Site Generation Issues**: Check permissions for creating `./test_site/`. Use `--skip-generation` if you want to manage the site manually.
- **Testing Against External Site**: Ensure the external site is running and use `--use-existing-site --port <correct_port>`.
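For reference, a best-effort RSS probe in the spirit of that tracker (not its actual implementation) could look like:

```python
import os
import subprocess

def rss_mb(pid=None):
    """Resident set size in MB via platform facilities, falling back
    gracefully; returns None if no method works."""
    pid = pid or os.getpid()
    try:  # Linux: /proc/<pid>/status reports VmRSS in kB
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0
    except OSError:
        pass
    try:  # macOS / other Unix: ps reports RSS in kB
        out = subprocess.check_output(
            ["ps", "-o", "rss=", "-p", str(pid)], text=True)
        return int(out.strip()) / 1024.0
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None  # Windows would need `tasklist` parsing instead

mem = rss_mb()
print("RSS unavailable" if mem is None else f"current RSS: {mem:.1f} MB")
```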