Merge next. Resolve conflicts. Fix some import errors and error handling in server.py
.gitignore (vendored) | 4 changes
@@ -258,3 +258,7 @@ continue_config.json
 CLAUDE_MONITOR.md
 CLAUDE.md
+
+tests/**/test_site
+tests/**/reports
+tests/**/benchmark_reports
CHANGELOG.md | 7 changes
@@ -5,6 +5,13 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+### [Added] 2025-04-17
+- Added content source selection feature for markdown generation
+- New `content_source` parameter allows choosing between `cleaned_html`, `raw_html`, and `fit_html`
+- Provides flexibility in how HTML content is processed before markdown conversion
+- Added examples and documentation for the new feature
+- Includes backward compatibility with default `cleaned_html` behavior
+
 ## Version 0.5.0.post5 (2025-03-14)
 
 ### Added
JOURNAL.md | 231 changes
@@ -2,6 +2,237 @@

This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.

## [2025-04-17] Added Content Source Selection for Markdown Generation

**Feature:** Configurable content source for markdown generation

**Changes Made:**
1. Added `content_source: str = "cleaned_html"` parameter to the `MarkdownGenerationStrategy` class
2. Updated `DefaultMarkdownGenerator` to accept and pass the content source parameter
3. Renamed the `cleaned_html` parameter to `input_html` in the `generate_markdown` method
4. Modified `AsyncWebCrawler.aprocess_html` to select the appropriate HTML source based on the generator's config
5. Added `preprocess_html_for_schema` import in `async_webcrawler.py`

**Implementation Details:**
- Added a new `content_source` parameter to specify which HTML input to use for markdown generation
- Options include: "cleaned_html" (default), "raw_html", and "fit_html"
- Used a dictionary dispatch pattern in `aprocess_html` to select the appropriate HTML source
- Added proper error handling with fallback to cleaned_html if content source selection fails
- Ensured backward compatibility by defaulting to the "cleaned_html" option
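The dispatch-and-fallback logic described above can be sketched standalone (a minimal sketch; the function name and stand-in arguments are illustrative, not the actual crawl4ai internals):

```python
def select_markdown_input(content_source, raw_html, cleaned_html, preprocess):
    """Pick the HTML input for markdown generation, falling back to cleaned_html."""
    selector = {
        "raw_html": lambda: raw_html,
        "cleaned_html": lambda: cleaned_html,
        "fit_html": lambda: preprocess(raw_html),
    }
    try:
        # Unknown keys and preprocessing errors both fall back to cleaned_html
        return selector.get(content_source, lambda: cleaned_html)()
    except Exception:
        return cleaned_html

print(select_markdown_input("raw_html", "<raw>", "<clean>", str))  # <raw>
print(select_markdown_input("bogus", "<raw>", "<clean>", str))     # <clean>
```

Both an unrecognized key and an exception from the preprocessor take the same fallback path, which mirrors the error handling noted above.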

**Files Modified:**
- `crawl4ai/markdown_generation_strategy.py`: Added content_source parameter and updated the method signature
- `crawl4ai/async_webcrawler.py`: Added HTML source selection logic and updated imports

**Examples:**
- Created `docs/examples/content_source_example.py` demonstrating how to use the new parameter

**Challenges:**
- Maintaining backward compatibility while reorganizing the parameter flow
- Ensuring proper error handling for all content source options
- Making the change with minimal code modifications

**Why This Feature:**
The content source selection feature allows users to choose which HTML content to use as input for markdown generation:
1. "cleaned_html" - Uses the post-processed HTML after the scraping strategy (original behavior)
2. "raw_html" - Uses the original raw HTML directly from the web page
3. "fit_html" - Uses the preprocessed HTML optimized for schema extraction

This feature provides greater flexibility in how users generate markdown, enabling them to:
- Capture more detailed content from the original HTML when needed
- Use schema-optimized HTML when working with structured data
- Choose the approach that best suits their specific use case
## [2025-04-17] Implemented High Volume Stress Testing Solution for SDK

**Feature:** Comprehensive stress testing framework using `arun_many` and the dispatcher system to evaluate performance, concurrency handling, and identify potential issues under high-volume crawling scenarios.

**Changes Made:**
1. Created a dedicated stress testing framework in the `benchmarking/` (or similar) directory.
2. Implemented local test site generation (`SiteGenerator`) with configurable heavy HTML pages.
3. Added basic memory usage tracking (`SimpleMemoryTracker`) using platform-specific commands (avoiding a `psutil` dependency for this specific test).
4. Utilized `CrawlerMonitor` from `crawl4ai` for a rich terminal UI and real-time monitoring of test progress and dispatcher activity.
5. Implemented detailed result summary saving (JSON) and memory sample logging (CSV).
6. Developed `run_benchmark.py` to orchestrate tests with predefined configurations.
7. Created `run_all.sh` as a simple wrapper for `run_benchmark.py`.

**Implementation Details:**
- Generates a local test site with configurable pages containing heavy text and image content.
- Uses Python's built-in `http.server` for local serving, minimizing network variance.
- Leverages `crawl4ai`'s `arun_many` method for processing URLs.
- Utilizes `MemoryAdaptiveDispatcher` to manage concurrency via the `max_sessions` parameter (note: memory adaptation features require `psutil`, which is not used by `SimpleMemoryTracker`).
- Tracks memory usage via `SimpleMemoryTracker`, recording samples throughout test execution to a CSV file.
- Uses `CrawlerMonitor` (which uses the `rich` library) for clear terminal visualization and progress reporting directly from the dispatcher.
- Stores detailed final metrics in a JSON summary file.
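A tracker of that shape can be sketched with only the standard library (a hypothetical reconstruction assuming a Unix-like `ps`, not the test's actual class):

```python
import csv
import os
import subprocess
import time

class SimpleMemoryTracker:
    """Sample this process's RSS via `ps` (no psutil) and log samples to CSV."""

    def __init__(self):
        self.samples = []  # list of (timestamp, rss_mb)

    def sample(self):
        try:
            out = subprocess.check_output(
                ["ps", "-o", "rss=", "-p", str(os.getpid())], text=True
            )
            rss_mb = int(out.strip()) / 1024.0  # ps reports RSS in KiB
        except (OSError, subprocess.CalledProcessError, ValueError):
            rss_mb = None  # e.g. on platforms without ps
        self.samples.append((time.time(), rss_mb))
        return rss_mb

    def write_csv(self, path):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "rss_mb"])
            writer.writerows(self.samples)
```

Sampling periodically during the run and writing the CSV at the end gives the basic memory trend data described above without pulling in `psutil`.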

**Files Created/Updated:**
- `stress_test_sdk.py`: Main stress testing implementation using `arun_many`.
- `benchmark_report.py`: (Assumed) report generator for comparing test results.
- `run_benchmark.py`: Test runner script with predefined configurations.
- `run_all.sh`: Simple bash script wrapper for `run_benchmark.py`.
- `USAGE.md`: Comprehensive documentation on usage and interpretation (updated).

**Testing Approach:**
- Creates a controlled, reproducible test environment with a local HTTP server.
- Processes URLs using `arun_many`, allowing the dispatcher to manage concurrency up to `max_sessions`.
- Optionally logs per-batch summaries (when not in streaming mode) after processing chunks.
- Supports different test sizes via `run_benchmark.py` configurations.
- Records memory samples via platform commands for basic trend analysis.
- Includes cleanup functionality for the test environment.

**Challenges:**
- Ensuring proper cleanup of HTTP server processes.
- Getting reliable memory tracking across platforms without adding heavy dependencies (`psutil`) to this specific test script.
- Designing `run_benchmark.py` to correctly pass arguments to `stress_test_sdk.py`.

**Why This Feature:**
The high volume stress testing solution addresses critical needs for ensuring Crawl4AI's `arun_many` reliability:
1. Provides a reproducible way to evaluate performance under concurrent load.
2. Allows testing the dispatcher's concurrency control (`max_session_permit`) and queue management.
3. Enables performance tuning by observing throughput (`URLs/sec`) under different `max_sessions` settings.
4. Creates a controlled environment for testing `arun_many` behavior.
5. Supports continuous integration by providing deterministic test conditions for `arun_many`.

**Design Decisions:**
- Chose local site generation for reproducibility and isolation from network issues.
- Utilized the built-in `CrawlerMonitor` for real-time feedback, leveraging its `rich` integration.
- Implemented optional per-batch logging in `stress_test_sdk.py` (when not streaming) to provide chunk-level summaries alongside the continuous monitor.
- Adopted `arun_many` with a `MemoryAdaptiveDispatcher` as the core mechanism for parallel execution, reflecting the intended SDK usage.
- Created `run_benchmark.py` to simplify running standard test configurations.
- Used `SimpleMemoryTracker` to provide basic memory insights without requiring `psutil` for this particular test runner.

**Future Enhancements to Consider:**
- Create a separate test variant that *does* use `psutil` to specifically stress the memory-adaptive features of the dispatcher.
- Add support for generated JavaScript content.
- Add support for Docker-based testing with explicit memory limits.
- Enhance `benchmark_report.py` to provide more sophisticated analysis of performance and memory trends from the generated JSON/CSV files.

---
## [2025-04-17] Refined Stress Testing System Parameters and Execution

**Changes Made:**
1. Corrected `run_benchmark.py` and `stress_test_sdk.py` to use `--max-sessions` instead of the incorrect `--workers` parameter, accurately reflecting dispatcher configuration.
2. Updated `run_benchmark.py` argument handling to correctly pass all relevant custom parameters (including `--stream`, `--monitor-mode`, etc.) to `stress_test_sdk.py`.
3. (Assuming changes in `benchmark_report.py`) Applied dark theme to benchmark reports for better readability.
4. (Assuming changes in `benchmark_report.py`) Improved visualization code to eliminate matplotlib warnings.
5. Updated `run_benchmark.py` to provide clickable `file://` links to generated reports in the terminal output.
6. Updated `USAGE.md` with comprehensive parameter descriptions reflecting the final script arguments.
7. Updated `run_all.sh` wrapper to correctly invoke `run_benchmark.py` with flexible arguments.

**Details of Changes:**

1. **Parameter Correction (`--max-sessions`)**:
   * Identified the fundamental misunderstanding where `--workers` was used incorrectly.
   * Refactored `stress_test_sdk.py` to accept `--max-sessions` and configure the `MemoryAdaptiveDispatcher`'s `max_session_permit` accordingly.
   * Updated `run_benchmark.py` argument parsing and command construction to use `--max-sessions`.
   * Updated `TEST_CONFIGS` in `run_benchmark.py` to use `max_sessions`.

2. **Argument Handling (`run_benchmark.py`)**:
   * Improved logic to collect all command-line arguments provided to `run_benchmark.py`.
   * Ensured all relevant arguments (like `--stream`, `--monitor-mode`, `--port`, `--use-rate-limiter`, etc.) are correctly forwarded when calling `stress_test_sdk.py` as a subprocess.
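The forwarding pattern can be sketched like this (a standalone sketch; the flag names mirror those listed above but the helper itself is hypothetical):

```python
import sys

def build_command(base_args, overrides):
    """Build a stress_test_sdk.py command line, forwarding only set options."""
    cmd = [sys.executable, "stress_test_sdk.py"]
    for name, value in {**base_args, **overrides}.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            if value:            # store_true flags such as --stream
                cmd.append(flag)
        elif value is not None:  # skip options the user did not set
            cmd.extend([flag, str(value)])
    return cmd

cmd = build_command({"urls": 1000, "max_sessions": 16}, {"stream": True})
# Run with: subprocess.run(cmd, check=True)
```

Merging overrides after the predefined config is what lets a custom `--urls` or `--stream` win over a `TEST_CONFIGS` default.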

3. **Dark Theme & Visualization Fixes (Assumed in `benchmark_report.py`)**:
   * (Describes changes assumed to be made in the separate reporting script.)

4. **Clickable Links (`run_benchmark.py`)**:
   * Added logic to find the latest HTML report and PNG chart in the `benchmark_reports` directory after `benchmark_report.py` runs.
   * Used `pathlib` to generate correct `file://` URLs for terminal output.
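The `pathlib` trick is small enough to show directly (a minimal sketch; the helper name is illustrative):

```python
from pathlib import Path

def latest_report_link(report_dir, pattern="*.html"):
    """Return a clickable file:// URL for the newest matching report, or None."""
    reports = sorted(Path(report_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return reports[-1].resolve().as_uri() if reports else None
```

`Path.as_uri()` handles percent-encoding and the platform-specific path-to-URL conversion, which is what makes the printed links clickable in terminals that detect URLs.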

5. **Documentation Improvements (`USAGE.md`)**:
   * Rewrote sections to explain `arun_many`, dispatchers, and `--max-sessions`.
   * Updated parameter tables for all scripts (`stress_test_sdk.py`, `run_benchmark.py`).
   * Clarified the difference between batch and streaming modes and their effect on logging.
   * Updated examples to use correct arguments.

**Files Modified:**
- `stress_test_sdk.py`: Changed `--workers` to `--max-sessions`, added new arguments, used `arun_many`.
- `run_benchmark.py`: Changed argument handling, updated configs, calls `stress_test_sdk.py`.
- `run_all.sh`: Updated to call `run_benchmark.py` correctly.
- `USAGE.md`: Updated documentation extensively.
- `benchmark_report.py`: (Assumed modifications for dark theme and viz fixes.)

**Testing:**
- Verified that `--max-sessions` correctly limits concurrency via the `CrawlerMonitor` output.
- Confirmed that custom arguments passed to `run_benchmark.py` are forwarded to `stress_test_sdk.py`.
- Validated that clickable links work in supporting terminals.
- Ensured documentation matches the final script parameters and behavior.

**Why These Changes:**
These refinements correct the fundamental approach of the stress test to align with `crawl4ai`'s actual architecture and intended usage:
1. Ensures the test evaluates the correct components (`arun_many`, `MemoryAdaptiveDispatcher`).
2. Makes test configurations more accurate and flexible.
3. Improves the usability of the testing framework through better argument handling and documentation.

**Future Enhancements to Consider:**
- Add support for generated JavaScript content to test JS rendering performance
- Implement more sophisticated memory analysis, such as generational garbage collection tracking
- Add support for Docker-based testing with memory limits to force OOM conditions
- Create visualization tools for analyzing memory usage patterns across test runs
- Add benchmark comparisons between different crawler versions or configurations
## [2025-04-17] Fixed Issues in Stress Testing System

**Changes Made:**
1. Fixed custom parameter handling in `run_benchmark.py`
2. Applied dark theme to benchmark reports for better readability
3. Improved visualization code to eliminate matplotlib warnings
4. Added clickable links to generated reports in terminal output
5. Enhanced documentation with comprehensive parameter descriptions

**Details of Changes:**

1. **Custom Parameter Handling Fix**
   - Identified a bug where the custom URL count was being ignored in `run_benchmark.py`
   - Rewrote argument handling to use a custom args dictionary
   - Properly passed parameters to the `test_simple_stress.py` command
   - Added better UI indication of the custom parameters in use

2. **Dark Theme Implementation**
   - Added a complete dark theme to HTML benchmark reports
   - Applied dark styling to all visualization components
   - Used a Nord-inspired color palette for charts and graphs
   - Improved contrast and readability for data visualization
   - Updated text colors and backgrounds for better eye comfort

3. **Matplotlib Warning Fixes**
   - Resolved warnings related to improper use of `set_xticklabels()`
   - Implemented correct x-axis positioning for bar charts
   - Ensured proper alignment of bar labels and data points
   - Updated plotting code to use modern matplotlib practices
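The usual fix for that warning is to set explicit tick positions before assigning labels (a minimal sketch assuming a categorical bar chart like the report's; the labels and values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as used for report generation
import matplotlib.pyplot as plt

labels = ["small", "medium", "large"]
throughput = [42.0, 38.5, 31.2]  # illustrative URLs/sec values

fig, ax = plt.subplots()
x = range(len(labels))
ax.bar(x, throughput)
# Calling set_xticks() first pins a FixedLocator, so the following
# set_xticklabels() call no longer triggers the matplotlib warning.
ax.set_xticks(list(x))
ax.set_xticklabels(labels)
fig.savefig("throughput.png")
```

Calling `set_xticklabels()` without first fixing the tick positions is what produces the `FixedFormatter`/`FixedLocator` warning in recent matplotlib versions.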

4. **Documentation Improvements**
   - Created comprehensive `USAGE.md` with detailed instructions
   - Added parameter documentation for all scripts
   - Included examples for all common use cases
   - Provided detailed explanations for interpreting results
   - Added a troubleshooting guide for common issues

**Files Modified:**
- `tests/memory/run_benchmark.py`: Fixed custom parameter handling
- `tests/memory/benchmark_report.py`: Added dark theme and fixed visualization warnings
- `tests/memory/run_all.sh`: Added clickable links to reports
- `tests/memory/USAGE.md`: Created comprehensive documentation

**Testing:**
- Verified that custom URL counts are now correctly used
- Confirmed the dark theme is properly applied to all report elements
- Checked that matplotlib warnings no longer appear
- Validated that clickable links to reports work in terminals that support them

**Why These Changes:**
These improvements address several usability issues with the stress testing system:
1. Better parameter handling ensures test configurations work as expected
2. The dark theme reduces eye strain during extended test review sessions
3. Fixing visualization warnings improves code quality and output clarity
4. Enhanced documentation makes the system more accessible for future use

**Future Enhancements:**
- Add additional visualization options for different types of analysis
- Implement a theme toggle to support both light and dark preferences
- Add export options for embedding reports in other documentation
- Create dedicated CI/CD integration templates for automated testing

## [2025-04-09] Added MHTML Capture Feature

**Feature:** MHTML snapshot capture of crawled pages
@@ -47,6 +47,7 @@ from .utils import (
     create_box_message,
     get_error_context,
     RobotsParser,
+    preprocess_html_for_schema,
 )
@@ -111,7 +112,8 @@ class AsyncWebCrawler:
         self,
         crawler_strategy: AsyncCrawlerStrategy = None,
         config: BrowserConfig = None,
-        base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
+        base_directory: str = str(
+            os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
         thread_safe: bool = False,
         logger: AsyncLoggerBase = None,
         **kwargs,
@@ -139,7 +141,8 @@ class AsyncWebCrawler:
         )

         # Initialize crawler strategy
-        params = {k: v for k, v in kwargs.items() if k in ["browser_config", "logger"]}
+        params = {k: v for k, v in kwargs.items() if k in [
+            "browser_config", "logger"]}
         self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
             browser_config=browser_config,
             logger=self.logger,
@@ -237,7 +240,8 @@ class AsyncWebCrawler:

         config = config or CrawlerRunConfig()
         if not isinstance(url, str) or not url:
-            raise ValueError("Invalid URL, make sure the URL is a non-empty string")
+            raise ValueError(
+                "Invalid URL, make sure the URL is a non-empty string")

         async with self._lock or self.nullcontext():
             try:
@@ -291,7 +295,7 @@ class AsyncWebCrawler:

                 # Update proxy configuration from rotation strategy if available
                 if config and config.proxy_rotation_strategy:
-                    next_proxy : ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
+                    next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
                     if next_proxy:
                         self.logger.info(
                             message="Switch proxy: {proxy}",
@@ -306,7 +310,8 @@ class AsyncWebCrawler:
                 t1 = time.perf_counter()

                 if config.user_agent:
-                    self.crawler_strategy.update_user_agent(config.user_agent)
+                    self.crawler_strategy.update_user_agent(
+                        config.user_agent)

                 # Check robots.txt if enabled
                 if config and config.check_robots_txt:
@@ -373,7 +378,8 @@ class AsyncWebCrawler:
                         crawl_result.console_messages = async_response.console_messages

                     crawl_result.success = bool(html)
-                    crawl_result.session_id = getattr(config, "session_id", None)
+                    crawl_result.session_id = getattr(
+                        config, "session_id", None)

                     self.logger.url_status(
                         url=cache_context.display_url,
@@ -396,7 +402,8 @@ class AsyncWebCrawler:
                         tag="COMPLETE"
                     )
                     cached_result.success = bool(html)
-                    cached_result.session_id = getattr(config, "session_id", None)
+                    cached_result.session_id = getattr(
+                        config, "session_id", None)
                     cached_result.redirected_url = cached_result.redirected_url or url
                     return CrawlResultContainer(cached_result)
@@ -463,12 +470,14 @@ class AsyncWebCrawler:
         params = config.__dict__.copy()
         params.pop("url", None)
         # add keys from kwargs to params that doesn't exist in params
-        params.update({k: v for k, v in kwargs.items() if k not in params.keys()})
+        params.update({k: v for k, v in kwargs.items()
+                       if k not in params.keys()})

         ################################
         # Scraping Strategy Execution #
         ################################
-        result: ScrapingResult = scraping_strategy.scrap(url, html, **params)
+        result: ScrapingResult = scraping_strategy.scrap(
+            url, html, **params)

         if result is None:
             raise ValueError(
@@ -484,7 +493,8 @@ class AsyncWebCrawler:

         # Extract results - handle both dict and ScrapingResult
         if isinstance(result, dict):
-            cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
+            cleaned_html = sanitize_input_encode(
+                result.get("cleaned_html", ""))
             media = result.get("media", {})
             links = result.get("links", {})
             metadata = result.get("metadata", {})
@@ -501,14 +511,49 @@ class AsyncWebCrawler:
             config.markdown_generator or DefaultMarkdownGenerator()
         )

+        # --- SELECT HTML SOURCE BASED ON CONTENT_SOURCE ---
+        # Get the desired source from the generator config, default to 'cleaned_html'
+        selected_html_source = getattr(markdown_generator, 'content_source', 'cleaned_html')
+
+        # Define the source selection logic using dict dispatch
+        html_source_selector = {
+            "raw_html": lambda: html,  # The original raw HTML
+            "cleaned_html": lambda: cleaned_html,  # The HTML after scraping strategy
+            "fit_html": lambda: preprocess_html_for_schema(html_content=html),  # Preprocessed raw HTML
+        }
+
+        markdown_input_html = cleaned_html  # Default to cleaned_html
+
+        try:
+            # Get the appropriate lambda function, default to returning cleaned_html if key not found
+            source_lambda = html_source_selector.get(selected_html_source, lambda: cleaned_html)
+            # Execute the lambda to get the selected HTML
+            markdown_input_html = source_lambda()
+
+            # Log which source is being used (optional, but helpful for debugging)
+            # if self.logger and verbose:
+            #     actual_source_used = selected_html_source if selected_html_source in html_source_selector else 'cleaned_html (default)'
+            #     self.logger.debug(f"Using '{actual_source_used}' as source for Markdown generation for {url}", tag="MARKDOWN_SRC")
+
+        except Exception as e:
+            # Handle potential errors, especially from preprocess_html_for_schema
+            if self.logger:
+                self.logger.warning(
+                    f"Error getting/processing '{selected_html_source}' for markdown source: {e}. Falling back to cleaned_html.",
+                    tag="MARKDOWN_SRC"
+                )
+            # Ensure markdown_input_html is still the default cleaned_html in case of error
+            markdown_input_html = cleaned_html
+        # --- END: HTML SOURCE SELECTION ---
+
         # Uncomment if by default we want to use PruningContentFilter
         # if not config.content_filter and not markdown_generator.content_filter:
         #     markdown_generator.content_filter = PruningContentFilter()

         markdown_result: MarkdownGenerationResult = (
             markdown_generator.generate_markdown(
-                cleaned_html=cleaned_html,
-                base_url=params.get("redirected_url", url),
+                input_html=markdown_input_html,
+                base_url=params.get("redirected_url", url)
                 # html2text_options=kwargs.get('html2text', {})
             )
         )
@@ -31,22 +31,24 @@ class MarkdownGenerationStrategy(ABC):
         content_filter: Optional[RelevantContentFilter] = None,
         options: Optional[Dict[str, Any]] = None,
         verbose: bool = False,
+        content_source: str = "cleaned_html",
     ):
         self.content_filter = content_filter
         self.options = options or {}
         self.verbose = verbose
+        self.content_source = content_source

     @abstractmethod
     def generate_markdown(
         self,
-        cleaned_html: str,
+        input_html: str,
         base_url: str = "",
         html2text_options: Optional[Dict[str, Any]] = None,
         content_filter: Optional[RelevantContentFilter] = None,
         citations: bool = True,
         **kwargs,
     ) -> MarkdownGenerationResult:
-        """Generate markdown from cleaned HTML."""
+        """Generate markdown from the selected input HTML."""
         pass
@@ -63,6 +65,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
     Args:
         content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
         options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
+        content_source (str): Source of content to generate markdown from. Options: "cleaned_html", "raw_html", "fit_html". Defaults to "cleaned_html".

     Returns:
         MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
@@ -72,8 +75,9 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         self,
         content_filter: Optional[RelevantContentFilter] = None,
         options: Optional[Dict[str, Any]] = None,
+        content_source: str = "cleaned_html",
     ):
-        super().__init__(content_filter, options)
+        super().__init__(content_filter, options, verbose=False, content_source=content_source)

     def convert_links_to_citations(
         self, markdown: str, base_url: str = ""
@@ -143,7 +147,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):

     def generate_markdown(
         self,
-        cleaned_html: str,
+        input_html: str,
         base_url: str = "",
         html2text_options: Optional[Dict[str, Any]] = None,
         options: Optional[Dict[str, Any]] = None,
@@ -152,16 +156,16 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         **kwargs,
     ) -> MarkdownGenerationResult:
         """
-        Generate markdown with citations from cleaned HTML.
+        Generate markdown with citations from the provided input HTML.

         How it works:
-        1. Generate raw markdown from cleaned HTML.
+        1. Generate raw markdown from the input HTML.
         2. Convert links to citations.
         3. Generate fit markdown if content filter is provided.
         4. Return MarkdownGenerationResult.

         Args:
-            cleaned_html (str): Cleaned HTML content.
+            input_html (str): The HTML content to process (selected based on content_source).
             base_url (str): Base URL for URL joins.
             html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
             options (Optional[Dict[str, Any]]): Additional options for markdown generation.
@@ -196,14 +200,14 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         h.update_params(**default_options)

         # Ensure we have valid input
-        if not cleaned_html:
-            cleaned_html = ""
-        elif not isinstance(cleaned_html, str):
-            cleaned_html = str(cleaned_html)
+        if not input_html:
+            input_html = ""
+        elif not isinstance(input_html, str):
+            input_html = str(input_html)

         # Generate raw markdown
         try:
-            raw_markdown = h.handle(cleaned_html)
+            raw_markdown = h.handle(input_html)
         except Exception as e:
             raw_markdown = f"Error converting HTML to markdown: {str(e)}"
@@ -228,7 +232,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         if content_filter or self.content_filter:
             try:
                 content_filter = content_filter or self.content_filter
-                filtered_html = content_filter.filter_content(cleaned_html)
+                filtered_html = content_filter.filter_content(input_html)
                 filtered_html = "\n".join(
                     "<div>{}</div>".format(s) for s in filtered_html
                 )
503
deploy/docker/api copy.py
Normal file
503
deploy/docker/api copy.py
Normal file
@@ -0,0 +1,503 @@
|
||||
import os
|
||||
import json
|
||||
import asyncio
|
||||
from typing import List, Tuple
|
||||
from functools import partial
|
||||
|
||||
import logging
|
||||
from typing import Optional, AsyncGenerator
|
||||
from urllib.parse import unquote
|
||||
from fastapi import HTTPException, Request, status
|
||||
from fastapi.background import BackgroundTasks
|
||||
from fastapi.responses import JSONResponse
|
||||
from redis import asyncio as aioredis
|
||||
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMExtractionStrategy,
|
||||
CacheMode,
|
||||
BrowserConfig,
|
||||
MemoryAdaptiveDispatcher,
|
||||
RateLimiter,
|
||||
LLMConfig
|
||||
)
|
||||
from crawl4ai.utils import perform_completion_with_backoff
|
||||
from crawl4ai.content_filter_strategy import (
|
||||
PruningContentFilter,
|
||||
BM25ContentFilter,
|
||||
LLMContentFilter
|
||||
)
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
|
||||
|
||||
from utils import (
|
||||
TaskStatus,
|
||||
FilterType,
|
||||
get_base_url,
|
||||
is_task_id,
|
||||
should_cleanup_task,
|
||||
decode_redis_hash
|
||||
)
|
||||
|
||||
import psutil, time
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# --- Helper to get memory ---
|
||||
def _get_memory_mb():
|
||||
try:
|
||||
return psutil.Process().memory_info().rss / (1024 * 1024)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not get memory info: {e}")
|
||||
return None
|
||||
|
||||
|
||||
async def handle_llm_qa(
|
||||
url: str,
|
||||
query: str,
|
||||
config: dict
|
||||
) -> str:
|
||||
"""Process QA using LLM with crawled content as context."""
|
||||
try:
|
||||
# Extract base URL by finding last '?q=' occurrence
|
||||
last_q_index = url.rfind('?q=')
|
||||
if last_q_index != -1:
|
||||
url = url[:last_q_index]
|
||||
|
||||
# Get markdown content
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url)
|
||||
if not result.success:
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=result.error_message
|
||||
)
|
||||
content = result.markdown.fit_markdown
|
||||
|
||||
# Create prompt and get LLM response
|
||||
prompt = f"""Use the following content as context to answer the question.
|
||||
Content:
|
||||
{content}
|
||||
|
||||
Question: {query}
|
||||
|
||||
Answer:"""
|
||||
|
||||
response = perform_completion_with_backoff(
|
||||
provider=config["llm"]["provider"],
|
||||
prompt_with_variables=prompt,
|
||||
api_token=os.environ.get(config["llm"].get("api_key_env", ""))
|
||||
)
|
||||
|
||||
return response.choices[0].message.content
|
||||
except Exception as e:
|
||||
logger.error(f"QA processing error: {str(e)}", exc_info=True)
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=str(e)
|
||||
)
|
||||
|
||||
async def process_llm_extraction(
    redis: aioredis.Redis,
    config: dict,
    task_id: str,
    url: str,
    instruction: str,
    schema: Optional[str] = None,
    cache: str = "0"
) -> None:
    """Process LLM extraction in background."""
    try:
        # If config['llm'] has api_key then ignore the api_key_env
        api_key = ""
        if "api_key" in config["llm"]:
            api_key = config["llm"]["api_key"]
        else:
            api_key = os.environ.get(config["llm"].get("api_key_env", None), "")
        llm_strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(
                provider=config["llm"]["provider"],
                api_token=api_key
            ),
            instruction=instruction,
            schema=json.loads(schema) if schema else None,
        )

        cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(
                    extraction_strategy=llm_strategy,
                    scraping_strategy=LXMLWebScrapingStrategy(),
                    cache_mode=cache_mode
                )
            )

            if not result.success:
                await redis.hset(f"task:{task_id}", mapping={
                    "status": TaskStatus.FAILED,
                    "error": result.error_message
                })
                return

            try:
                content = json.loads(result.extracted_content)
            except json.JSONDecodeError:
                content = result.extracted_content
            await redis.hset(f"task:{task_id}", mapping={
                "status": TaskStatus.COMPLETED,
                "result": json.dumps(content)
            })

    except Exception as e:
        logger.error(f"LLM extraction error: {str(e)}", exc_info=True)
        await redis.hset(f"task:{task_id}", mapping={
            "status": TaskStatus.FAILED,
            "error": str(e)
        })

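The extracted content is stored as parsed JSON when possible and as the raw string otherwise. A minimal standalone sketch of that decode-with-fallback pattern (not tied to the server's Redis schema):

```python
import json

def decode_extracted(payload: str):
    """Return parsed JSON if the payload is valid JSON, else the raw string."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return payload

structured = decode_extracted('{"title": "Example"}')
raw = decode_extracted('not json at all')
```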
async def handle_markdown_request(
    url: str,
    filter_type: FilterType,
    query: Optional[str] = None,
    cache: str = "0",
    config: Optional[dict] = None
) -> str:
    """Handle markdown generation requests."""
    try:
        decoded_url = unquote(url)
        if not decoded_url.startswith(('http://', 'https://')):
            decoded_url = 'https://' + decoded_url

        if filter_type == FilterType.RAW:
            md_generator = DefaultMarkdownGenerator()
        else:
            content_filter = {
                FilterType.FIT: PruningContentFilter(),
                FilterType.BM25: BM25ContentFilter(user_query=query or ""),
                FilterType.LLM: LLMContentFilter(
                    llm_config=LLMConfig(
                        provider=config["llm"]["provider"],
                        api_token=os.environ.get(config["llm"].get("api_key_env", None), ""),
                    ),
                    instruction=query or "Extract main content"
                )
            }[filter_type]
            md_generator = DefaultMarkdownGenerator(content_filter=content_filter)

        cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=decoded_url,
                config=CrawlerRunConfig(
                    markdown_generator=md_generator,
                    scraping_strategy=LXMLWebScrapingStrategy(),
                    cache_mode=cache_mode
                )
            )

            if not result.success:
                raise HTTPException(
                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                    detail=result.error_message
                )

            return (result.markdown.raw_markdown
                    if filter_type == FilterType.RAW
                    else result.markdown.fit_markdown)

    except Exception as e:
        logger.error(f"Markdown error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )

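The filter selection above is a dict-dispatch keyed on the filter type, built and indexed in one expression. The same pattern in isolation, with stand-in factories (the real code constructs `PruningContentFilter` / `BM25ContentFilter` objects, which need more context than a sketch should carry):

```python
from enum import Enum

class FilterType(str, Enum):
    # Stand-in enum; the server's FilterType also has LLM and RAW members
    FIT = "fit"
    BM25 = "bm25"

def make_filter(filter_type: FilterType, query: str = ""):
    # Each value is a factory; only the selected one is called
    return {
        FilterType.FIT: lambda: ("pruning", None),
        FilterType.BM25: lambda: ("bm25", query or ""),
    }[filter_type]()

kind, q = make_filter(FilterType.BM25, "crawl4ai")
```

Using lambdas as values keeps construction lazy, unlike the eager dict in the handler, where every filter is built before one is picked.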
async def handle_llm_request(
    redis: aioredis.Redis,
    background_tasks: BackgroundTasks,
    request: Request,
    input_path: str,
    query: Optional[str] = None,
    schema: Optional[str] = None,
    cache: str = "0",
    config: Optional[dict] = None
) -> JSONResponse:
    """Handle LLM extraction requests."""
    base_url = get_base_url(request)

    try:
        if is_task_id(input_path):
            return await handle_task_status(
                redis, input_path, base_url
            )

        if not query:
            return JSONResponse({
                "message": "Please provide an instruction",
                "_links": {
                    "example": {
                        "href": f"{base_url}/llm/{input_path}?q=Extract+main+content",
                        "title": "Try this example"
                    }
                }
            })

        return await create_new_task(
            redis,
            background_tasks,
            input_path,
            query,
            schema,
            cache,
            base_url,
            config
        )

    except Exception as e:
        logger.error(f"LLM endpoint error: {str(e)}", exc_info=True)
        return JSONResponse({
            "error": str(e),
            "_links": {
                "retry": {"href": str(request.url)}
            }
        }, status_code=status.HTTP_500_INTERNAL_SERVER_ERROR)

async def handle_task_status(
    redis: aioredis.Redis,
    task_id: str,
    base_url: str
) -> JSONResponse:
    """Handle task status check requests."""
    task = await redis.hgetall(f"task:{task_id}")
    if not task:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="Task not found"
        )

    task = decode_redis_hash(task)
    response = create_task_response(task, task_id, base_url)

    if task["status"] in [TaskStatus.COMPLETED, TaskStatus.FAILED]:
        if should_cleanup_task(task["created_at"]):
            await redis.delete(f"task:{task_id}")

    return JSONResponse(response)

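`should_cleanup_task` is imported from `utils` and not shown in this diff. A plausible sketch of such a TTL check is below; the timestamp format matches what `create_new_task` writes (`datetime.now().isoformat()`), but the one-hour TTL and the function body are assumptions, not the project's actual helper:

```python
from datetime import datetime, timedelta

def should_cleanup_task(created_at: str, ttl_seconds: int = 3600) -> bool:
    """True when the task is older than the TTL and may be deleted (assumed logic)."""
    created = datetime.fromisoformat(created_at)
    return datetime.now() - created > timedelta(seconds=ttl_seconds)

fresh = should_cleanup_task(datetime.now().isoformat())  # just created
stale = should_cleanup_task("2000-01-01T00:00:00")       # long past any TTL
```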
async def create_new_task(
    redis: aioredis.Redis,
    background_tasks: BackgroundTasks,
    input_path: str,
    query: str,
    schema: Optional[str],
    cache: str,
    base_url: str,
    config: dict
) -> JSONResponse:
    """Create and initialize a new task."""
    decoded_url = unquote(input_path)
    if not decoded_url.startswith(('http://', 'https://')):
        decoded_url = 'https://' + decoded_url

    from datetime import datetime
    task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"

    await redis.hset(f"task:{task_id}", mapping={
        "status": TaskStatus.PROCESSING,
        "created_at": datetime.now().isoformat(),
        "url": decoded_url
    })

    background_tasks.add_task(
        process_llm_extraction,
        redis,
        config,
        task_id,
        decoded_url,
        query,
        schema,
        cache
    )

    return JSONResponse({
        "task_id": task_id,
        "status": TaskStatus.PROCESSING,
        "url": decoded_url,
        "_links": {
            "self": {"href": f"{base_url}/llm/{task_id}"},
            "status": {"href": f"{base_url}/llm/{task_id}"}
        }
    })

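Task ids are built as `llm_<unix-timestamp>_<object-id>`, so the `is_task_id` predicate used by `handle_llm_request` (which lives in `utils`, not shown here) only needs to distinguish that shape from a URL path. A hypothetical reconstruction, not the real implementation:

```python
def is_task_id(value: str) -> bool:
    """Heuristic match for ids shaped like llm_<timestamp>_<id> (assumed shape)."""
    parts = value.split("_")
    return (
        len(parts) == 3
        and parts[0] == "llm"
        and parts[1].isdigit()
        and parts[2].isdigit()
    )

valid = is_task_id("llm_1713340800_140234")
not_id = is_task_id("example.com/page")
```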
def create_task_response(task: dict, task_id: str, base_url: str) -> dict:
    """Create response for task status check."""
    response = {
        "task_id": task_id,
        "status": task["status"],
        "created_at": task["created_at"],
        "url": task["url"],
        "_links": {
            "self": {"href": f"{base_url}/llm/{task_id}"},
            "refresh": {"href": f"{base_url}/llm/{task_id}"}
        }
    }

    if task["status"] == TaskStatus.COMPLETED:
        response["result"] = json.loads(task["result"])
    elif task["status"] == TaskStatus.FAILED:
        response["error"] = task["error"]

    return response

async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator) -> AsyncGenerator[bytes, None]:
    """Stream results with heartbeats and completion markers."""
    import json
    from utils import datetime_handler

    try:
        async for result in results_gen:
            try:
                server_memory_mb = _get_memory_mb()
                result_dict = result.model_dump()
                result_dict['server_memory_mb'] = server_memory_mb
                logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
                data = json.dumps(result_dict, default=datetime_handler) + "\n"
                yield data.encode('utf-8')
            except Exception as e:
                logger.error(f"Serialization error: {e}")
                error_response = {"error": str(e), "url": getattr(result, 'url', 'unknown')}
                yield (json.dumps(error_response) + "\n").encode('utf-8')

        yield json.dumps({"status": "completed"}).encode('utf-8')

    except asyncio.CancelledError:
        logger.warning("Client disconnected during streaming")
    finally:
        try:
            await crawler.close()
        except Exception as e:
            logger.error(f"Crawler cleanup error: {e}")

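Each result is emitted as one JSON object per line (NDJSON), followed by a final `{"status": "completed"}` marker. A client can therefore parse the stream incrementally; a minimal sketch against an in-memory list of byte chunks (a real client would iterate over the HTTP response body instead):

```python
import json

def parse_ndjson_stream(chunks):
    """Yield decoded objects from newline-delimited JSON byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # trailing object without a newline (the completion marker)
        yield json.loads(buffer)

stream = [b'{"url": "https://example.com", "success": true}\n',
          b'{"status": "completed"}']
objects = list(parse_ndjson_stream(stream))
```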
async def handle_crawl_request(
    urls: List[str],
    browser_config: dict,
    crawler_config: dict,
    config: dict
) -> dict:
    """Handle non-streaming crawl requests."""
    start_mem_mb = _get_memory_mb()  # <--- Get memory before
    start_time = time.time()
    mem_delta_mb = None
    peak_mem_mb = start_mem_mb

    try:
        browser_config = BrowserConfig.load(browser_config)
        crawler_config = CrawlerRunConfig.load(crawler_config)

        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
            rate_limiter=RateLimiter(
                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
            )
        )

        crawler: AsyncWebCrawler = AsyncWebCrawler(config=browser_config)
        await crawler.start()
        results = []
        func = getattr(crawler, "arun" if len(urls) == 1 else "arun_many")
        partial_func = partial(func,
                               urls[0] if len(urls) == 1 else urls,
                               config=crawler_config,
                               dispatcher=dispatcher)
        results = await partial_func()
        await crawler.close()

        end_mem_mb = _get_memory_mb()  # <--- Get memory after
        end_time = time.time()

        if start_mem_mb is not None and end_mem_mb is not None:
            mem_delta_mb = end_mem_mb - start_mem_mb  # <--- Calculate delta
            peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb)  # <--- Get peak memory
        logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")

        return {
            "success": True,
            "results": [result.model_dump() for result in results],
            "server_processing_time_s": end_time - start_time,
            "server_memory_delta_mb": mem_delta_mb,
            "server_peak_memory_mb": peak_mem_mb
        }

    except Exception as e:
        logger.error(f"Crawl error: {str(e)}", exc_info=True)
        if 'crawler' in locals() and crawler.ready:  # Check if crawler was initialized and started
            try:
                await crawler.close()
            except Exception as close_e:
                logger.error(f"Error closing crawler during exception handling: {close_e}")

        # Measure memory even on error if possible
        end_mem_mb_error = _get_memory_mb()
        if start_mem_mb is not None and end_mem_mb_error is not None:
            mem_delta_mb = end_mem_mb_error - start_mem_mb

        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=json.dumps({  # Send structured error
                "error": str(e),
                "server_memory_delta_mb": mem_delta_mb,
                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
            })
        )

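The single-vs-batch dispatch above selects `arun` or `arun_many` by name with `getattr` and freezes the arguments with `functools.partial`. The same pattern in isolation, using a dummy crawler with synchronous stand-ins (the real methods are coroutines and are awaited):

```python
from functools import partial

class DummyCrawler:
    """Stand-in for AsyncWebCrawler, just to show the dispatch."""
    def arun(self, url, config=None):
        return [f"crawled {url}"]

    def arun_many(self, urls, config=None):
        return [f"crawled {u}" for u in urls]

def build_call(crawler, urls):
    # One URL -> arun(url); several URLs -> arun_many(urls)
    func = getattr(crawler, "arun" if len(urls) == 1 else "arun_many")
    return partial(func, urls[0] if len(urls) == 1 else urls, config=None)

single = build_call(DummyCrawler(), ["https://a.com"])()
batch = build_call(DummyCrawler(), ["https://a.com", "https://b.com"])()
```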
async def handle_stream_crawl_request(
    urls: List[str],
    browser_config: dict,
    crawler_config: dict,
    config: dict
) -> Tuple[AsyncWebCrawler, AsyncGenerator]:
    """Handle streaming crawl requests."""
    try:
        browser_config = BrowserConfig.load(browser_config)
        # browser_config.verbose = True  # Set to False or remove for production stress testing
        browser_config.verbose = False
        crawler_config = CrawlerRunConfig.load(crawler_config)
        crawler_config.scraping_strategy = LXMLWebScrapingStrategy()
        crawler_config.stream = True

        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
            rate_limiter=RateLimiter(
                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
            )
        )

        crawler = AsyncWebCrawler(config=browser_config)
        await crawler.start()

        results_gen = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            dispatcher=dispatcher
        )

        return crawler, results_gen

    except Exception as e:
        # Make sure to close crawler if started during an error here
        if 'crawler' in locals() and crawler.ready:
            try:
                await crawler.close()
            except Exception as close_e:
                logger.error(f"Error closing crawler during stream setup exception: {close_e}")
        logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
        # Raising HTTPException here will prevent streaming response
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )

@@ -40,8 +40,19 @@ from utils import (
     decode_redis_hash
 )

+import psutil, time
+
 logger = logging.getLogger(__name__)

+# --- Helper to get memory ---
+def _get_memory_mb():
+    try:
+        return psutil.Process().memory_info().rss / (1024 * 1024)
+    except Exception as e:
+        logger.warning(f"Could not get memory info: {e}")
+        return None
+
+
 async def handle_llm_qa(
     url: str,
     query: str,
@@ -353,7 +364,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
     try:
         async for result in results_gen:
             try:
+                server_memory_mb = _get_memory_mb()
                 result_dict = result.model_dump()
+                result_dict['server_memory_mb'] = server_memory_mb
                 logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
                 data = json.dumps(result_dict, default=datetime_handler) + "\n"
                 yield data.encode('utf-8')
@@ -366,19 +379,25 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)

     except asyncio.CancelledError:
         logger.warning("Client disconnected during streaming")
-    finally:
-        try:
-            await crawler.close()
-        except Exception as e:
-            logger.error(f"Crawler cleanup error: {e}")
+    # finally:
+    #     try:
+    #         await crawler.close()
+    #     except Exception as e:
+    #         logger.error(f"Crawler cleanup error: {e}")

 async def handle_crawl_request(
+    crawler: AsyncWebCrawler,
     urls: List[str],
     browser_config: dict,
     crawler_config: dict,
     config: dict
 ) -> dict:
     """Handle non-streaming crawl requests."""
+    start_mem_mb = _get_memory_mb()  # <--- Get memory before
+    start_time = time.time()
+    mem_delta_mb = None
+    peak_mem_mb = start_mem_mb

     try:
+        urls = [('https://' + url) if not url.startswith(('http://', 'https://')) else url for url in urls]
         browser_config = BrowserConfig.load(browser_config)
@@ -391,31 +410,63 @@ async def handle_crawl_request(
             )
         )

-        crawler: AsyncWebCrawler = AsyncWebCrawler(config=browser_config)
-        await crawler.start()
+        # crawler: AsyncWebCrawler = AsyncWebCrawler(config=browser_config)
+        # await crawler.start()
         results = []
         func = getattr(crawler, "arun" if len(urls) == 1 else "arun_many")
         partial_func = partial(func,
                                urls[0] if len(urls) == 1 else urls,
                                config=crawler_config,
                                dispatcher=dispatcher)

+        # Simulate work being done by the crawler
+        # logger.debug(f"Request (URLs: {len(urls)}) starting simulated work...")  # Add log
+        # await asyncio.sleep(2)  # <--- ADD ARTIFICIAL DELAY (e.g., 0.5 seconds)
+        # logger.debug(f"Request (URLs: {len(urls)}) finished simulated work.")
+
         results = await partial_func()
-        await crawler.close()
+        # await crawler.close()

+        end_mem_mb = _get_memory_mb()  # <--- Get memory after
+        end_time = time.time()
+
+        if start_mem_mb is not None and end_mem_mb is not None:
+            mem_delta_mb = end_mem_mb - start_mem_mb  # <--- Calculate delta
+            peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb)  # <--- Get peak memory
+        logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")
+
         return {
             "success": True,
-            "results": [result.model_dump() for result in results]
+            "results": [result.model_dump() for result in results],
+            "server_processing_time_s": end_time - start_time,
+            "server_memory_delta_mb": mem_delta_mb,
+            "server_peak_memory_mb": peak_mem_mb
         }

     except Exception as e:
         logger.error(f"Crawl error: {str(e)}", exc_info=True)
-        if 'crawler' in locals():
-            await crawler.close()
+        # if 'crawler' in locals() and crawler.ready:  # Check if crawler was initialized and started
+        #     try:
+        #         await crawler.close()
+        #     except Exception as close_e:
+        #         logger.error(f"Error closing crawler during exception handling: {close_e}")
+
+        # Measure memory even on error if possible
+        end_mem_mb_error = _get_memory_mb()
+        if start_mem_mb is not None and end_mem_mb_error is not None:
+            mem_delta_mb = end_mem_mb_error - start_mem_mb

         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=str(e)
+            detail=json.dumps({  # Send structured error
+                "error": str(e),
+                "server_memory_delta_mb": mem_delta_mb,
+                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
+            })
         )

 async def handle_stream_crawl_request(
+    crawler: AsyncWebCrawler,
     urls: List[str],
     browser_config: dict,
     crawler_config: dict,
@@ -424,9 +475,11 @@ async def handle_stream_crawl_request(
     """Handle streaming crawl requests."""
     try:
         browser_config = BrowserConfig.load(browser_config)
-        browser_config.verbose = True
+        # browser_config.verbose = True  # Set to False or remove for production stress testing
+        browser_config.verbose = False
         crawler_config = CrawlerRunConfig.load(crawler_config)
         crawler_config.scraping_strategy = LXMLWebScrapingStrategy()
         crawler_config.stream = True

         dispatcher = MemoryAdaptiveDispatcher(
             memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
@@ -435,8 +488,8 @@ async def handle_stream_crawl_request(
             )
         )

-        crawler = AsyncWebCrawler(config=browser_config)
-        await crawler.start()
+        # crawler = AsyncWebCrawler(config=browser_config)
+        # await crawler.start()

         results_gen = await crawler.arun_many(
             urls=urls,
@@ -444,12 +497,19 @@ async def handle_stream_crawl_request(
             dispatcher=dispatcher
         )

+        # Return the *same* crawler instance and the generator
+        # The caller (server.py) manages the crawler lifecycle via the pool context
         return crawler, results_gen

     except Exception as e:
-        if 'crawler' in locals():
-            await crawler.close()
+        # Make sure to close crawler if started during an error here
+        # if 'crawler' in locals() and crawler.ready:
+        #     try:
+        #         await crawler.close()
+        #     except Exception as close_e:
+        #         logger.error(f"Error closing crawler during stream setup exception: {close_e}")
         logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
         # Raising HTTPException here will prevent streaming response
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
             detail=str(e)

@@ -48,6 +48,38 @@ security:
   content_security_policy: "default-src 'self'"
   strict_transport_security: "max-age=63072000; includeSubDomains"

+# Crawler Pool Configuration
+crawler_pool:
+  enabled: true  # Set to false to disable the pool
+
+  # --- Option 1: Auto-calculate size ---
+  auto_calculate_size: true
+  calculation_params:
+    mem_headroom_mb: 512   # Memory reserved for OS/other apps
+    avg_page_mem_mb: 150   # Estimated MB per concurrent "tab"/page in browsers
+    fd_per_page: 20        # Estimated file descriptors per page
+    core_multiplier: 4     # Max crawlers per CPU core
+    min_pool_size: 2       # Minimum number of primary crawlers
+    max_pool_size: 16      # Maximum number of primary crawlers
+
+  # --- Option 2: Manual size (ignored if auto_calculate_size is true) ---
+  # pool_size: 8
+
+  # --- Other Pool Settings ---
+  backup_pool_size: 1               # Number of backup crawlers
+  max_wait_time_s: 30.0             # Max seconds a request waits for a free crawler
+  throttle_threshold_percent: 70.0  # Start throttling delay above this % usage
+  throttle_delay_min_s: 0.1         # Min throttle delay
+  throttle_delay_max_s: 0.5         # Max throttle delay
+
+  # --- Browser Config for Pooled Crawlers ---
+  browser_config:
+    # No need for "type": "BrowserConfig" here, just params
+    headless: true
+    verbose: false  # Keep pool crawlers less verbose in production
+    # user_agent: "MyPooledCrawler/1.0"  # Example
+    # Add other BrowserConfig params as needed (e.g., proxy, viewport)
+
 # Crawler Configuration
 crawler:
   memory_threshold_percent: 95.0
@@ -61,6 +93,8 @@ crawler:
 logging:
   level: "INFO"
   format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
   file: "logs/app.log"
+  verbose: true

 # Observability Configuration
 observability:

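The `crawler_pool` block is consumed by the Pydantic models in `crawler_manager.py`, which fill in defaults for any keys the YAML omits. A stdlib-only illustration of that overlay-defaults idea (the `load_pool_settings` helper and the reduced key set are made up for the sketch; the server itself validates with `CrawlerManagerConfig`):

```python
# Defaults mirroring a few of the crawler_pool settings above
DEFAULTS = {
    "enabled": True,
    "backup_pool_size": 1,
    "max_wait_time_s": 30.0,
    "throttle_threshold_percent": 70.0,
}

def load_pool_settings(raw: dict) -> dict:
    """Overlay user-provided keys on defaults, silently dropping unknown keys."""
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in raw.items() if k in DEFAULTS})
    return settings

settings = load_pool_settings({"backup_pool_size": 2, "typo_key": 1})
```

Pydantic goes further than this sketch: it also type-coerces values and rejects out-of-range ones (e.g. `backup_pool_size` has a `ge=0` bound).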
deploy/docker/crawler_manager.py (new file, 556 lines)
@@ -0,0 +1,556 @@
# crawler_manager.py
import asyncio
import time
import uuid
import psutil
import os
import resource  # For FD limit
import random
import math
from typing import Optional, Tuple, Any, List, Dict, AsyncGenerator
from pydantic import BaseModel, Field, field_validator
from contextlib import asynccontextmanager
import logging

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, AsyncLogger
# Assuming api.py handlers are accessible or refactored slightly if needed
# We might need to import the specific handler functions if we call them directly
# from api import handle_crawl_request, handle_stream_crawl_request, _get_memory_mb, stream_results

# --- Custom Exceptions ---
class PoolTimeoutError(Exception):
    """Raised when waiting for a crawler resource times out."""
    pass

class PoolConfigurationError(Exception):
    """Raised for configuration issues."""
    pass

class NoHealthyCrawlerError(Exception):
    """Raised when no healthy crawler is available."""
    pass


# --- Configuration Models (Pydantic V2 syntax) ---
class CalculationParams(BaseModel):
    mem_headroom_mb: int = 512
    avg_page_mem_mb: int = 150
    fd_per_page: int = 20
    core_multiplier: int = 4
    min_pool_size: int = 1  # Min safe pages should be at least 1
    max_pool_size: int = 16

    # V2 validation for avg_page_mem_mb
    @field_validator('avg_page_mem_mb')
    @classmethod
    def check_avg_page_mem(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("avg_page_mem_mb must be positive")
        return v

    # V2 validation for fd_per_page
    @field_validator('fd_per_page')
    @classmethod
    def check_fd_per_page(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("fd_per_page must be positive")
        return v

class CrawlerManagerConfig(BaseModel):
    enabled: bool = True
    auto_calculate_size: bool = True
    calculation_params: CalculationParams = Field(default_factory=CalculationParams)  # Use Field for default_factory
    backup_pool_size: int = Field(1, ge=0)  # Allow 0 backups
    max_wait_time_s: float = 30.0
    throttle_threshold_percent: float = Field(70.0, ge=0, le=100)
    throttle_delay_min_s: float = 0.1
    throttle_delay_max_s: float = 0.5
    browser_config: Dict[str, Any] = Field(default_factory=lambda: {"headless": True, "verbose": False})  # Use Field for default_factory
    primary_reload_delay_s: float = 60.0


# --- Crawler Manager ---
class CrawlerManager:
    """Manages shared AsyncWebCrawler instances, concurrency, and failover."""

    def __init__(self, config: CrawlerManagerConfig, logger=None):
        # Set up the logger first so it is available on every path below
        if logger is None:
            self.logger = logging.getLogger(__name__)
            self.logger.setLevel(logging.INFO)
        else:
            self.logger = logger

        if not config.enabled:
            self.logger.warning("CrawlerManager is disabled by configuration.")
            # Set defaults to allow server to run, but manager won't function
            self.config = config
            self._initialized = False
            return

        self.config = config
        self._primary_crawler: Optional[AsyncWebCrawler] = None
        self._secondary_crawlers: List[AsyncWebCrawler] = []
        self._active_crawler_index: int = 0  # 0 for primary, 1+ for secondary index
        self._primary_healthy: bool = False
        self._secondary_healthy_flags: List[bool] = []

        self._safe_pages: int = 1  # Default, calculated in initialize
        self._semaphore: Optional[asyncio.Semaphore] = None
        self._state_lock = asyncio.Lock()  # Protects active_crawler, health flags
        self._reload_tasks: List[Optional[asyncio.Task]] = []  # Track reload background tasks

        self._initialized = False
        self._shutting_down = False

        self.logger.info("CrawlerManager initialized with config.")
        self.logger.debug(f"Config: {self.config.model_dump_json(indent=2)}")

    def is_enabled(self) -> bool:
        return self.config.enabled and self._initialized

    def _get_system_resources(self) -> Tuple[int, int, int]:
        """Gets RAM, CPU cores, and FD limit."""
        total_ram_mb = 0
        cpu_cores = 0
        try:
            mem_info = psutil.virtual_memory()
            total_ram_mb = mem_info.total // (1024 * 1024)
            cpu_cores = psutil.cpu_count(logical=False) or psutil.cpu_count(logical=True)  # Prefer physical cores
        except Exception as e:
            self.logger.warning(f"Could not get RAM/CPU info via psutil: {e}")
            total_ram_mb = 2048  # Default fallback
            cpu_cores = 2  # Default fallback

        fd_limit = 1024  # Default fallback
        try:
            soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
            fd_limit = soft_limit  # Use the soft limit
        except (ImportError, ValueError, OSError, AttributeError) as e:
            self.logger.warning(f"Could not get file descriptor limit (common on Windows): {e}. Using default: {fd_limit}")

        self.logger.info(f"System Resources: RAM={total_ram_mb}MB, Cores={cpu_cores}, FD Limit={fd_limit}")
        return total_ram_mb, cpu_cores, fd_limit

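The FD probe relies on `resource.getrlimit`, which is POSIX-only; that is why the broad `except` clause above calls out Windows. In isolation the call returns a `(soft, hard)` pair, and the manager sizes the pool against the soft limit:

```python
import resource

# (soft, hard) limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
```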
    def _calculate_safe_pages(self) -> int:
        """Calculates the safe number of concurrent pages based on resources."""
        if not self.config.auto_calculate_size:
            # If auto-calc is off, use max_pool_size as the hard limit.
            # This isn't ideal based on the prompt, but provides *some* manual override.
            # A dedicated `manual_safe_pages` might be better. Let's use max_pool_size for now.
            self.logger.warning("Auto-calculation disabled. Using max_pool_size as safe_pages limit.")
            return self.config.calculation_params.max_pool_size

        params = self.config.calculation_params
        total_ram_mb, cpu_cores, fd_limit = self._get_system_resources()

        available_ram_mb = total_ram_mb - params.mem_headroom_mb
        if available_ram_mb <= 0:
            self.logger.error(f"Not enough RAM ({total_ram_mb}MB) after headroom ({params.mem_headroom_mb}MB). Cannot calculate safe pages.")
            return params.min_pool_size  # Fallback to minimum

        try:
            # Calculate limits from each resource
            mem_limit = available_ram_mb // params.avg_page_mem_mb if params.avg_page_mem_mb > 0 else float('inf')
            fd_limit_pages = fd_limit // params.fd_per_page if params.fd_per_page > 0 else float('inf')
            cpu_limit = cpu_cores * params.core_multiplier if cpu_cores > 0 else float('inf')

            # Determine the most constraining limit
            calculated_limit = math.floor(min(mem_limit, fd_limit_pages, cpu_limit))

        except ZeroDivisionError:
            self.logger.error("Division by zero in safe_pages calculation (avg_page_mem_mb or fd_per_page is zero).")
            calculated_limit = params.min_pool_size  # Fallback

        # Clamp the result within min/max bounds
        safe_pages = max(params.min_pool_size, min(calculated_limit, params.max_pool_size))

        self.logger.info(f"Calculated safe pages: MemoryLimit={mem_limit}, FDLimit={fd_limit_pages}, CPULimit={cpu_limit} -> RawCalc={calculated_limit} -> Clamped={safe_pages}")
        return safe_pages

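The sizing rule reduces to "take the most constraining of the memory, FD, and CPU limits, then clamp". A pure-function sketch with a worked example, using the defaults from `CalculationParams`:

```python
import math

def safe_pages(total_ram_mb, fd_limit, cpu_cores,
               mem_headroom_mb=512, avg_page_mem_mb=150,
               fd_per_page=20, core_multiplier=4,
               min_pool=1, max_pool=16):
    """Most constraining of the memory, FD, and CPU limits, clamped to bounds."""
    mem_limit = (total_ram_mb - mem_headroom_mb) // avg_page_mem_mb
    fd_pages = fd_limit // fd_per_page
    cpu_limit = cpu_cores * core_multiplier
    raw = math.floor(min(mem_limit, fd_pages, cpu_limit))
    return max(min_pool, min(raw, max_pool))

# 8 GB RAM, 1024 FDs, 4 cores:
#   mem -> (8192 - 512) // 150 = 51, fd -> 1024 // 20 = 51, cpu -> 4 * 4 = 16
#   min is 16, already within [1, 16]
pages = safe_pages(total_ram_mb=8192, fd_limit=1024, cpu_cores=4)  # -> 16
```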
    async def _create_and_start_crawler(self, crawler_id: str) -> Optional[AsyncWebCrawler]:
        """Creates, starts, and returns a crawler instance."""
        try:
            # Create BrowserConfig from the dictionary in manager config
            browser_conf = BrowserConfig(**self.config.browser_config)
            crawler = AsyncWebCrawler(config=browser_conf)
            await crawler.start()
            self.logger.info(f"Successfully started crawler instance: {crawler_id}")
            return crawler
        except Exception as e:
            self.logger.error(f"Failed to start crawler instance {crawler_id}: {e}", exc_info=True)
            return None

    async def initialize(self):
        """Initializes crawlers and semaphore. Called at server startup."""
        if not self.config.enabled or self._initialized:
            return

        self.logger.info("Initializing CrawlerManager...")
        self._safe_pages = self._calculate_safe_pages()
        self._semaphore = asyncio.Semaphore(self._safe_pages)

        self._primary_crawler = await self._create_and_start_crawler("Primary")
        if self._primary_crawler:
            self._primary_healthy = True
        else:
            self._primary_healthy = False
            self.logger.critical("Primary crawler failed to initialize!")

        self._secondary_crawlers = []
        self._secondary_healthy_flags = []
        self._reload_tasks = [None] * (1 + self.config.backup_pool_size)  # For primary + backups

        for i in range(self.config.backup_pool_size):
            sec_id = f"Secondary-{i+1}"
            crawler = await self._create_and_start_crawler(sec_id)
            self._secondary_crawlers.append(crawler)  # Add even if None
            self._secondary_healthy_flags.append(crawler is not None)
            if crawler is None:
                self.logger.error(f"{sec_id} crawler failed to initialize!")

        # Set initial active crawler (prefer primary)
        if self._primary_healthy:
            self._active_crawler_index = 0
            self.logger.info("Primary crawler is active.")
        else:
            # Find the first healthy secondary
            found_healthy_backup = False
            for i, healthy in enumerate(self._secondary_healthy_flags):
                if healthy:
                    self._active_crawler_index = i + 1  # 1-based index for secondaries
                    self.logger.warning(f"Primary failed, Secondary-{i+1} is active.")
                    found_healthy_backup = True
                    break
            if not found_healthy_backup:
                self.logger.critical("FATAL: No healthy crawlers available after initialization!")
                # Server should probably refuse connections in this state

        self._initialized = True
        self.logger.info(f"CrawlerManager initialized. Safe Pages: {self._safe_pages}. Active Crawler Index: {self._active_crawler_index}")

    async def shutdown(self):
        """Shuts down all crawler instances. Called at server shutdown."""
        if not self._initialized or self._shutting_down:
            return

        self._shutting_down = True
        self.logger.info("Shutting down CrawlerManager...")

        # Cancel any ongoing reload tasks
        for i, task in enumerate(self._reload_tasks):
            if task and not task.done():
                try:
                    task.cancel()
                    await task  # Wait for cancellation
                    self.logger.info(f"Cancelled reload task for crawler index {i}.")
                except asyncio.CancelledError:
                    self.logger.info(f"Reload task for crawler index {i} was already cancelled.")
                except Exception as e:
                    self.logger.warning(f"Error cancelling reload task for crawler index {i}: {e}")
        self._reload_tasks = []

        # Close primary
        if self._primary_crawler:
            try:
                self.logger.info("Closing primary crawler...")
                await self._primary_crawler.close()
                self._primary_crawler = None
            except Exception as e:
                self.logger.error(f"Error closing primary crawler: {e}", exc_info=True)

        # Close secondaries
        for i, crawler in enumerate(self._secondary_crawlers):
            if crawler:
                try:
                    self.logger.info(f"Closing secondary crawler {i+1}...")
                    await crawler.close()
                except Exception as e:
                    self.logger.error(f"Error closing secondary crawler {i+1}: {e}", exc_info=True)
        self._secondary_crawlers = []

        self._initialized = False
        self.logger.info("CrawlerManager shut down complete.")
    @asynccontextmanager
    async def get_crawler(self) -> AsyncGenerator[AsyncWebCrawler, None]:
        """Acquires semaphore, yields active crawler, handles throttling & failover."""
        if not self.is_enabled():
            raise NoHealthyCrawlerError("CrawlerManager is disabled or not initialized.")

        if self._shutting_down:
            raise NoHealthyCrawlerError("CrawlerManager is shutting down.")

        active_crawler: Optional[AsyncWebCrawler] = None
        acquired = False
        request_id = uuid.uuid4()
        start_wait = time.time()

        # --- Throttling ---
        try:
            # Check semaphore value without acquiring
            current_usage = self._safe_pages - self._semaphore._value
            usage_percent = (current_usage / self._safe_pages) * 100 if self._safe_pages > 0 else 0

            if usage_percent >= self.config.throttle_threshold_percent:
                delay = random.uniform(self.config.throttle_delay_min_s, self.config.throttle_delay_max_s)
                self.logger.debug(f"Throttling: Usage {usage_percent:.1f}% >= {self.config.throttle_threshold_percent}%. Delaying {delay:.3f}s")
                await asyncio.sleep(delay)
        except Exception as e:
            self.logger.warning(f"Error during throttling check: {e}")  # Continue attempt even if throttle check fails

        # --- Acquire Semaphore ---
        try:
            # self.logger.debug(f"Attempting to acquire semaphore (Available: {self._semaphore._value}/{self._safe_pages}). Wait Timeout: {self.config.max_wait_time_s}s")

            # --- Logging Before Acquire ---
            sem_value = self._semaphore._value if self._semaphore else 'N/A'
            sem_waiters = len(self._semaphore._waiters) if self._semaphore and self._semaphore._waiters else 0
            self.logger.debug(f"Req {request_id}: Attempting acquire. Available={sem_value}/{self._safe_pages}, Waiters={sem_waiters}, Timeout={self.config.max_wait_time_s}s")

            await asyncio.wait_for(
                self._semaphore.acquire(), timeout=self.config.max_wait_time_s
            )
            acquired = True
            wait_duration = time.time() - start_wait
            if wait_duration > 1:
                self.logger.warning(f"Semaphore acquired after {wait_duration:.3f}s. (Available: {self._semaphore._value}/{self._safe_pages})")

            self.logger.debug(f"Semaphore acquired successfully after {wait_duration:.3f}s. (Available: {self._semaphore._value}/{self._safe_pages})")
            # --- Select Active Crawler (Critical Section) ---
            async with self._state_lock:
                current_active_index = self._active_crawler_index
                is_primary_active = (current_active_index == 0)

                if is_primary_active:
                    if self._primary_healthy and self._primary_crawler:
                        active_crawler = self._primary_crawler
                    else:
                        # Primary is supposed to be active but isn't healthy
                        self.logger.warning("Primary crawler unhealthy, attempting immediate failover...")
                        if not await self._try_failover_sync():  # Try to switch active crawler NOW
                            raise NoHealthyCrawlerError("Primary unhealthy and no healthy backup available.")
                        # If failover succeeded, active_crawler_index is updated
                        current_active_index = self._active_crawler_index
                        # Fall through to select the new active secondary

                # Check if we need to use a secondary (either initially or after failover)
                if current_active_index > 0:
                    secondary_idx = current_active_index - 1
                    if secondary_idx < len(self._secondary_crawlers) and \
                            self._secondary_healthy_flags[secondary_idx] and \
                            self._secondary_crawlers[secondary_idx]:
                        active_crawler = self._secondary_crawlers[secondary_idx]
                    else:
                        self.logger.error(f"Selected Secondary-{current_active_index} is unhealthy or missing.")
                        # Attempt failover to *another* secondary if possible? (Adds complexity)
                        # For now, raise error if the selected one isn't good.
                        raise NoHealthyCrawlerError(f"Selected Secondary-{current_active_index} is unavailable.")

            if active_crawler is None:
                # This shouldn't happen if logic above is correct, but safeguard
                raise NoHealthyCrawlerError("Failed to select a healthy active crawler.")

            # --- Yield Crawler ---
            try:
                yield active_crawler
            except Exception as crawl_error:
                self.logger.error(f"Error during crawl execution using {active_crawler}: {crawl_error}", exc_info=True)
                # Determine if this error warrants failover
                # For now, let's assume any exception triggers a health check/failover attempt
                await self._handle_crawler_failure(active_crawler)
                raise  # Re-raise the original error for the API handler

        except asyncio.TimeoutError:
            self.logger.warning(f"Timeout waiting for semaphore after {self.config.max_wait_time_s}s.")
            raise PoolTimeoutError(f"Timed out waiting for available crawler resource after {self.config.max_wait_time_s}s")
        except NoHealthyCrawlerError:
            # Logged within the selection logic
            raise  # Re-raise for API handler
        except Exception as e:
            self.logger.error(f"Unexpected error in get_crawler context manager: {e}", exc_info=True)
            raise  # Re-raise potentially unknown errors
        finally:
            if acquired:
                self._semaphore.release()
                self.logger.debug(f"Semaphore released. (Available: {self._semaphore._value}/{self._safe_pages})")
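The acquire step above reduces to a small asyncio pattern: `wait_for` wrapped around `Semaphore.acquire()`, with `TimeoutError` translated into a domain error. A standalone sketch (no Crawl4AI types; the bare semaphore and timeout stand in for `self._semaphore` and `max_wait_time_s`):

```python
import asyncio


async def acquire_with_timeout(sem: asyncio.Semaphore, timeout: float) -> bool:
    """Try to take a pool slot; False means the pool stayed busy past the deadline."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> list:
    sem = asyncio.Semaphore(1)
    results = [await acquire_with_timeout(sem, 0.05)]      # slot free -> succeeds
    results.append(await acquire_with_timeout(sem, 0.05))  # pool exhausted -> times out
    sem.release()
    return results


outcomes = asyncio.run(demo())
```

A cancelled `acquire()` does not consume a permit, so the timed-out path leaves the semaphore count intact, which is why the `finally: release()` above only fires when `acquired` was actually set.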
    async def _try_failover_sync(self) -> bool:
        """Synchronous part of failover logic (must be called under state_lock). Finds next healthy secondary."""
        if not self._primary_healthy:  # Only failover if primary is already marked down
            found_healthy_backup = False
            start_idx = (self._active_crawler_index % (self.config.backup_pool_size + 1))  # Start check after current
            for i in range(self.config.backup_pool_size):
                check_idx = (start_idx + i) % self.config.backup_pool_size  # Circular check
                if self._secondary_healthy_flags[check_idx] and self._secondary_crawlers[check_idx]:
                    self._active_crawler_index = check_idx + 1
                    self.logger.warning(f"Failover successful: Switched active crawler to Secondary-{self._active_crawler_index}")
                    found_healthy_backup = True
                    break  # Found one
            if not found_healthy_backup:
                # If primary is down AND no backups are healthy, mark primary as active index (0) but it's still unhealthy
                self._active_crawler_index = 0
                self.logger.error("Failover failed: No healthy secondary crawlers available.")
                return False
            return True
        return True  # Primary is healthy, no failover needed
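The circular scan in `_try_failover_sync` can be exercised on its own. One subtlety: `start_idx` is taken modulo `backup_pool_size + 1`, so it may equal `backup_pool_size` and the first probe then wraps to index 0. A sketch of just the index walk, with a plain list standing in for the health flags:

```python
from typing import Optional


def next_healthy_secondary(active_index: int, healthy_flags: list) -> Optional[int]:
    """Return the 1-based index of the next healthy backup, or None."""
    pool_size = len(healthy_flags)
    start_idx = active_index % (pool_size + 1)  # mirrors the original arithmetic
    for i in range(pool_size):
        check_idx = (start_idx + i) % pool_size  # circular walk over the backups
        if healthy_flags[check_idx]:
            return check_idx + 1  # 1-based; index 0 is reserved for the primary
    return None  # no healthy backup available
```

For example, with the active crawler at index 1 (Secondary-1) in a pool of two healthy backups, the scan starts at flag index 1 and returns 2.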
    async def _handle_crawler_failure(self, failed_crawler: AsyncWebCrawler):
        """Handles marking a crawler as unhealthy and initiating recovery."""
        if self._shutting_down: return  # Don't handle failures during shutdown

        async with self._state_lock:
            crawler_index = -1
            is_primary = False

            if failed_crawler is self._primary_crawler and self._primary_healthy:
                self.logger.warning("Primary crawler reported failure.")
                self._primary_healthy = False
                is_primary = True
                crawler_index = 0
                # Try immediate failover within the lock
                await self._try_failover_sync()
                # Start reload task if not already running for primary
                if self._reload_tasks[0] is None or self._reload_tasks[0].done():
                    self.logger.info("Initiating primary crawler reload task.")
                    self._reload_tasks[0] = asyncio.create_task(self._reload_crawler(0))

            else:
                # Check if it was one of the secondaries
                for i, crawler in enumerate(self._secondary_crawlers):
                    if failed_crawler is crawler and self._secondary_healthy_flags[i]:
                        self.logger.warning(f"Secondary-{i+1} crawler reported failure.")
                        self._secondary_healthy_flags[i] = False
                        is_primary = False
                        crawler_index = i + 1
                        # If this *was* the active crawler, trigger failover check
                        if self._active_crawler_index == crawler_index:
                            self.logger.warning(f"Active secondary {crawler_index} failed, attempting failover...")
                            await self._try_failover_sync()
                        # Start reload task for this secondary
                        if self._reload_tasks[crawler_index] is None or self._reload_tasks[crawler_index].done():
                            self.logger.info(f"Initiating Secondary-{i+1} crawler reload task.")
                            self._reload_tasks[crawler_index] = asyncio.create_task(self._reload_crawler(crawler_index))
                        break  # Found the failed secondary

            if crawler_index == -1:
                self.logger.debug("Failure reported by an unknown or already unhealthy crawler instance. Ignoring.")
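The `is None or .done()` guard in the failure handler is what prevents two concurrent reloads for the same slot. A minimal sketch of that deduplication pattern, with a stand-in coroutine instead of `_reload_crawler`:

```python
import asyncio


def ensure_reload_task(tasks: list, index: int, coro_factory) -> bool:
    # Spawn a new task only if the slot has no pending reload,
    # mirroring the `is None or .done()` guard above.
    if tasks[index] is None or tasks[index].done():
        tasks[index] = asyncio.create_task(coro_factory())
        return True
    return False


async def demo() -> list:
    tasks = [None]

    async def fake_reload():
        await asyncio.sleep(0)

    results = [ensure_reload_task(tasks, 0, fake_reload)]      # no task yet -> spawn
    results.append(ensure_reload_task(tasks, 0, fake_reload))  # still pending -> skip
    await tasks[0]
    results.append(ensure_reload_task(tasks, 0, fake_reload))  # finished -> respawn
    await tasks[0]
    return results


outcomes = asyncio.run(demo())
```

Because the check-and-spawn happens under `self._state_lock` in the real code, the guard is race-free even when several requests report the same failure.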
    async def _reload_crawler(self, crawler_index_to_reload: int):
        """Background task to close, recreate, and start a specific crawler."""
        is_primary = (crawler_index_to_reload == 0)
        crawler_id = "Primary" if is_primary else f"Secondary-{crawler_index_to_reload}"
        original_crawler = self._primary_crawler if is_primary else self._secondary_crawlers[crawler_index_to_reload - 1]

        self.logger.info(f"Starting reload process for {crawler_id}...")

        # 1. Delay before attempting reload (e.g., allow transient issues to clear)
        if not is_primary:  # Maybe shorter delay for backups?
            await asyncio.sleep(self.config.primary_reload_delay_s / 2)
        else:
            await asyncio.sleep(self.config.primary_reload_delay_s)

        # 2. Attempt to close the old instance cleanly
        if original_crawler:
            try:
                self.logger.info(f"Attempting to close existing {crawler_id} instance...")
                await original_crawler.close()
                self.logger.info(f"Successfully closed old {crawler_id} instance.")
            except Exception as e:
                self.logger.warning(f"Error closing old {crawler_id} instance during reload: {e}")

        # 3. Create and start a new instance
        self.logger.info(f"Attempting to start new {crawler_id} instance...")
        new_crawler = await self._create_and_start_crawler(crawler_id)

        # 4. Update state if successful
        async with self._state_lock:
            if new_crawler:
                self.logger.info(f"Successfully reloaded {crawler_id}. Marking as healthy.")
                if is_primary:
                    self._primary_crawler = new_crawler
                    self._primary_healthy = True
                    # Switch back to primary if no other failures occurred
                    # Check if ANY secondary is currently active
                    secondary_is_active = self._active_crawler_index > 0
                    if not secondary_is_active or not self._secondary_healthy_flags[self._active_crawler_index - 1]:
                        self.logger.info("Switching active crawler back to primary.")
                        self._active_crawler_index = 0
                else:  # Is secondary
                    secondary_idx = crawler_index_to_reload - 1
                    self._secondary_crawlers[secondary_idx] = new_crawler
                    self._secondary_healthy_flags[secondary_idx] = True
                    # Potentially switch back if primary is still down and this was needed?
                    if not self._primary_healthy and self._active_crawler_index == 0:
                        self.logger.info(f"Primary still down, activating reloaded Secondary-{crawler_index_to_reload}.")
                        self._active_crawler_index = crawler_index_to_reload

            else:
                self.logger.error(f"Failed to reload {crawler_id}. It remains unhealthy.")
                # Keep the crawler marked as unhealthy
                if is_primary:
                    self._primary_healthy = False  # Ensure it stays false
                else:
                    self._secondary_healthy_flags[crawler_index_to_reload - 1] = False

            # Clear the reload task reference for this index
            self._reload_tasks[crawler_index_to_reload] = None
    async def get_status(self) -> Dict:
        """Returns the current status of the manager."""
        if not self.is_enabled():
            return {"status": "disabled"}

        async with self._state_lock:
            active_id = "Primary" if self._active_crawler_index == 0 else f"Secondary-{self._active_crawler_index}"
            primary_status = "Healthy" if self._primary_healthy else "Unhealthy"
            secondary_statuses = [f"Secondary-{i+1}: {'Healthy' if healthy else 'Unhealthy'}"
                                  for i, healthy in enumerate(self._secondary_healthy_flags)]
            semaphore_available = self._semaphore._value if self._semaphore else 'N/A'
            semaphore_locked = len(self._semaphore._waiters) if self._semaphore and self._semaphore._waiters else 0

            return {
                "status": "enabled",
                "safe_pages": self._safe_pages,
                "semaphore_available": semaphore_available,
                "semaphore_waiters": semaphore_locked,
                "active_crawler": active_id,
                "primary_status": primary_status,
                "secondary_statuses": secondary_statuses,
                "reloading_tasks": [i for i, t in enumerate(self._reload_tasks) if t and not t.done()]
            }
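Both the throttle check and `get_status` read `asyncio.Semaphore._value` and `_waiters`, which are private CPython attributes rather than documented API. A hedged alternative is to keep an explicit counter alongside the semaphore; `TrackedSemaphore` below is a hypothetical sketch, not part of this codebase:

```python
import asyncio


class TrackedSemaphore:
    """Semaphore wrapper that exposes usage without touching private attributes."""

    def __init__(self, limit: int):
        self._limit = limit
        self._in_use = 0
        self._sem = asyncio.Semaphore(limit)

    async def acquire(self) -> None:
        await self._sem.acquire()
        self._in_use += 1

    def release(self) -> None:
        self._in_use -= 1
        self._sem.release()

    @property
    def usage_percent(self) -> float:
        return (self._in_use / self._limit) * 100 if self._limit else 0.0


async def demo() -> float:
    sem = TrackedSemaphore(4)
    await sem.acquire()
    await sem.acquire()
    return sem.usage_percent  # 2 of 4 slots in use


usage = asyncio.run(demo())
```

The counter updates only after a successful `acquire()`, so unlike `_value` it never double-counts a waiter that was cancelled by `wait_for`.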
@@ -1,8 +1,20 @@
# Import from auth.py
from auth import create_access_token, get_token_dependency, TokenRequest
from api import (
    handle_markdown_request,
    handle_llm_qa,
    handle_stream_crawl_request,
    handle_crawl_request,
    stream_results,
    _get_memory_mb
)
from utils import FilterType, load_config, setup_logging, verify_email_domain
import os
import sys
import time
from typing import List, Optional, Dict
from fastapi import FastAPI, HTTPException, Request, Query, Path, Depends
from typing import List, Optional, Dict, AsyncGenerator
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, Request, Query, Path, Depends, status
from fastapi.responses import StreamingResponse, RedirectResponse, PlainTextResponse, JSONResponse
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
from fastapi.middleware.trustedhost import TrustedHostMiddleware
@@ -11,28 +23,40 @@ from slowapi import Limiter
from slowapi.util import get_remote_address
from prometheus_fastapi_instrumentator import Instrumentator
from redis import asyncio as aioredis
from crawl4ai import (
    BrowserConfig,
    CrawlerRunConfig,
    AsyncLogger
)

from crawler_manager import (
    CrawlerManager,
    CrawlerManagerConfig,
    PoolTimeoutError,
    NoHealthyCrawlerError
)
import json


sys.path.append(os.path.dirname(os.path.realpath(__file__)))
from utils import FilterType, load_config, setup_logging, verify_email_domain
from api import (
    handle_markdown_request,
    handle_llm_qa,
    handle_stream_crawl_request,
    handle_crawl_request,
    stream_results
)
from auth import create_access_token, get_token_dependency, TokenRequest  # Import from auth.py

__version__ = "0.2.6"


class CrawlRequest(BaseModel):
    urls: List[str] = Field(min_length=1, max_length=100)
    browser_config: Optional[Dict] = Field(default_factory=dict)
    crawler_config: Optional[Dict] = Field(default_factory=dict)


# Load configuration and setup
config = load_config()
setup_logging(config)
logger = AsyncLogger(
    log_file=config["logging"].get("log_file", "app.log"),
    verbose=config["logging"].get("verbose", False),
    tag_width=10,
)

# Initialize Redis
redis = aioredis.from_url(config["redis"].get("uri", "redis://localhost"))
@@ -44,9 +68,43 @@ limiter = Limiter(
    storage_uri=config["rate_limiting"]["storage_uri"]
)

# --- Initialize Manager (will be done in lifespan) ---
# Load manager config from the main config
manager_config_dict = config.get("crawler_pool", {})
# Use Pydantic to parse and validate
manager_config = CrawlerManagerConfig(**manager_config_dict)
crawler_manager = CrawlerManager(config=manager_config, logger=logger)

# --- FastAPI App and Lifespan ---


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    logger.info("Starting up the server...")
    if manager_config.enabled:
        logger.info("Initializing Crawler Manager...")
        await crawler_manager.initialize()
        app.state.crawler_manager = crawler_manager  # Store manager in app state
        logger.info("Crawler Manager is enabled.")
    else:
        logger.warning("Crawler Manager is disabled.")
        app.state.crawler_manager = None  # Indicate disabled state

    yield  # Server runs here

    # Shutdown
    logger.info("Shutting down server...")
    if app.state.crawler_manager:
        logger.info("Shutting down Crawler Manager...")
        await app.state.crawler_manager.shutdown()
        logger.info("Crawler Manager shut down.")
    logger.info("Server shut down.")

app = FastAPI(
    title=config["app"]["title"],
    version=config["app"]["version"]
    version=config["app"]["version"],
    lifespan=lifespan,
)

# Configure middleware
@@ -56,7 +114,9 @@ def setup_security_middleware(app, config):
    if sec_config.get("https_redirect", False):
        app.add_middleware(HTTPSRedirectMiddleware)
    if sec_config.get("trusted_hosts", []) != ["*"]:
        app.add_middleware(TrustedHostMiddleware, allowed_hosts=sec_config["trusted_hosts"])
        app.add_middleware(TrustedHostMiddleware,
                           allowed_hosts=sec_config["trusted_hosts"])


setup_security_middleware(app, config)

@@ -68,6 +128,8 @@ if config["observability"]["prometheus"]["enabled"]:
token_dependency = get_token_dependency(config)

# Middleware for security headers


@app.middleware("http")
async def add_security_headers(request: Request, call_next):
    response = await call_next(request)
@@ -75,7 +137,24 @@ async def add_security_headers(request: Request, call_next):
    response.headers.update(config["security"]["headers"])
    return response
async def get_manager() -> CrawlerManager:
    # Ensure manager exists and is enabled before yielding
    if not hasattr(app.state, 'crawler_manager') or app.state.crawler_manager is None:
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Crawler service is disabled or not initialized"
        )
    if not app.state.crawler_manager.is_enabled():
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Crawler service is currently disabled"
        )
    return app.state.crawler_manager

# Token endpoint (always available, but usage depends on config)


@app.post("/token")
async def get_token(request_data: TokenRequest):
    if not verify_email_domain(request_data.email):
@@ -84,6 +163,8 @@ async def get_token(request_data: TokenRequest):
    return {"email": request_data.email, "access_token": token, "token_type": "bearer"}

# Endpoints with conditional auth


@app.get("/md/{url:path}")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def get_markdown(
@@ -97,6 +178,7 @@ async def get_markdown(
    result = await handle_markdown_request(url, f, q, c, config)
    return PlainTextResponse(result)


@app.get("/llm/{url:path}", description="URL should be without http/https prefix")
async def llm_endpoint(
    request: Request,
@@ -110,36 +192,89 @@ async def llm_endpoint(
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/schema")
async def get_schema():
    from crawl4ai import BrowserConfig, CrawlerRunConfig
    return {"browser": BrowserConfig().dump(), "crawler": CrawlerRunConfig().dump()}


@app.get(config["observability"]["health_check"]["endpoint"])
async def health():
    return {"status": "ok", "timestamp": time.time(), "version": __version__}


@app.get(config["observability"]["prometheus"]["endpoint"])
async def metrics():
    return RedirectResponse(url=config["observability"]["prometheus"]["endpoint"])
@app.get("/browswers")
# Optional dependency
async def health(manager: Optional[CrawlerManager] = Depends(get_manager, use_cache=False)):
    base_status = {"status": "ok", "timestamp": time.time(),
                   "version": __version__}
    if manager:
        try:
            manager_status = await manager.get_status()
            base_status["crawler_manager"] = manager_status
        except Exception as e:
            base_status["crawler_manager"] = {
                "status": "error", "detail": str(e)}
    else:
        base_status["crawler_manager"] = {"status": "disabled"}
    return base_status
@app.post("/crawl")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def crawl(
    request: Request,
    crawl_request: CrawlRequest,
    token_data: Optional[Dict] = Depends(token_dependency)
    manager: CrawlerManager = Depends(get_manager),  # Use dependency
    token_data: Optional[Dict] = Depends(token_dependency)  # Keep auth
):
    if not crawl_request.urls:
        raise HTTPException(status_code=400, detail="At least one URL required")
        results = await handle_crawl_request(
            urls=crawl_request.urls,
            browser_config=crawl_request.browser_config,
            crawler_config=crawl_request.crawler_config,
            config=config
        )
        raise HTTPException(
            status_code=400, detail="At least one URL required")

    return JSONResponse(results)
    try:
        # Use the manager's context to get a crawler instance
        async with manager.get_crawler() as active_crawler:
            # Call the actual handler from api.py, passing the acquired crawler
            results_dict = await handle_crawl_request(
                crawler=active_crawler,  # Pass the live crawler instance
                urls=crawl_request.urls,
                # Pass user-provided configs, these might override pool defaults if needed
                # Or the manager/handler could decide how to merge them
                browser_config=crawl_request.browser_config or {},  # Ensure dict
                crawler_config=crawl_request.crawler_config or {},  # Ensure dict
                config=config  # Pass the global server config
            )
            return JSONResponse(results_dict)

    except PoolTimeoutError as e:
        logger.warning(f"Request rejected due to pool timeout: {e}")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,  # Or 429
            detail=f"Crawler resources busy. Please try again later. Timeout: {e}"
        )
    except NoHealthyCrawlerError as e:
        logger.error(f"Request failed as no healthy crawler available: {e}")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail=f"Crawler service temporarily unavailable: {e}"
        )
    except HTTPException:  # Re-raise HTTP exceptions from handler
        raise
    except Exception as e:
        logger.error(
            f"Unexpected error during batch crawl processing: {e}", exc_info=True)
        # Return generic error, details might be logged by handle_crawl_request
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"An unexpected error occurred: {e}"
        )
@app.post("/crawl/stream")
@@ -147,23 +282,114 @@ async def crawl(
async def crawl_stream(
    request: Request,
    crawl_request: CrawlRequest,
    manager: CrawlerManager = Depends(get_manager),
    token_data: Optional[Dict] = Depends(token_dependency)
):
    if not crawl_request.urls:
        raise HTTPException(status_code=400, detail="At least one URL required")
        raise HTTPException(
            status_code=400, detail="At least one URL required")

    crawler, results_gen = await handle_stream_crawl_request(
        urls=crawl_request.urls,
        browser_config=crawl_request.browser_config,
        crawler_config=crawl_request.crawler_config,
        config=config
    )
    try:
        # THIS IS A BIT OF A WORK OF ART RATHER THAN ENGINEERING
        # Acquire the crawler context from the manager.
        # IMPORTANT: The context needs to be active for the *duration* of the stream.
        # This structure might be tricky with FastAPI's StreamingResponse, which consumes
        # the generator *after* the endpoint function returns.

        return StreamingResponse(
            stream_results(crawler, results_gen),
            media_type='application/x-ndjson',
            headers={'Cache-Control': 'no-cache', 'Connection': 'keep-alive', 'X-Stream-Status': 'active'}
        )
        # --- Option A: Acquire crawler, pass to handler, handler yields ---
        # (Requires handler NOT to be async generator itself, but return one)
        # async with manager.get_crawler() as active_crawler:
        #     # Handler returns the generator
        #     _, results_gen = await handle_stream_crawl_request(
        #         crawler=active_crawler,
        #         urls=crawl_request.urls,
        #         browser_config=crawl_request.browser_config or {},
        #         crawler_config=crawl_request.crawler_config or {},
        #         config=config
        #     )
        # # PROBLEM: `active_crawler` context exits before StreamingResponse uses results_gen
        # # This releases the semaphore too early.

        # --- Option B: Pass manager to handler, handler uses context internally ---
        # (Requires modifying handle_stream_crawl_request signature/logic)
        # This seems cleaner. Let's assume api.py is adapted for this.
        # We need a way for the generator yielded by stream_results to know when
        # to release the semaphore.

        # --- Option C: Create a wrapper generator that handles context ---
        async def stream_wrapper(manager: CrawlerManager, crawl_request: CrawlRequest, config: dict) -> AsyncGenerator[bytes, None]:
            active_crawler = None
            try:
                async with manager.get_crawler() as acquired_crawler:
                    active_crawler = acquired_crawler  # Keep reference for cleanup
                    # Call the handler which returns the raw result generator
                    _crawler_ref, results_gen = await handle_stream_crawl_request(
                        crawler=acquired_crawler,
                        urls=crawl_request.urls,
                        browser_config=crawl_request.browser_config or {},
                        crawler_config=crawl_request.crawler_config or {},
                        config=config
                    )
                    # Use the stream_results utility to format and yield
                    async for data_bytes in stream_results(_crawler_ref, results_gen):
                        yield data_bytes
            except (PoolTimeoutError, NoHealthyCrawlerError) as e:
                # Yield a final error message in the stream
                error_payload = {"status": "error", "detail": str(e)}
                yield (json.dumps(error_payload) + "\n").encode('utf-8')
                logger.warning(f"Stream request failed: {e}")
                # Re-raise might be better if StreamingResponse handles it? Test needed.
            except HTTPException as e:  # Catch HTTP exceptions from handler setup
                error_payload = {"status": "error",
                                 "detail": e.detail, "status_code": e.status_code}
                yield (json.dumps(error_payload) + "\n").encode('utf-8')
                logger.warning(
                    f"Stream request failed with HTTPException: {e.detail}")
            except Exception as e:
                error_payload = {"status": "error",
                                 "detail": f"Unexpected stream error: {e}"}
                yield (json.dumps(error_payload) + "\n").encode('utf-8')
                logger.error(
                    f"Unexpected error during stream processing: {e}", exc_info=True)
            # finally:
            #     Ensure crawler cleanup if stream_results doesn't handle it?
            #     stream_results *should* call crawler.close(), but only on the
            #     instance it received. If we pass the *manager* instead, this gets complex.
            #     Let's stick to passing the acquired_crawler and rely on stream_results.
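The core of Option C — keeping an async context manager open for the whole lifetime of a streaming generator — can be demonstrated without FastAPI. The `resource` context below stands in for `manager.get_crawler()`:

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def resource():
    # Stand-in for manager.get_crawler(): acquire on entry, release on exit.
    events.append("acquired")
    try:
        yield "conn"
    finally:
        events.append("released")


async def stream_with_context():
    # The context stays open while the consumer iterates, and is only
    # released once the generator is exhausted (or closed) -- the property
    # stream_wrapper relies on to hold the semaphore for the whole stream.
    async with resource() as conn:
        for i in range(2):
            yield f"{conn}:{i}"


async def consume() -> list:
    return [chunk async for chunk in stream_with_context()]


chunks = asyncio.run(consume())
```

Because the `async with` lives inside the generator, cleanup is tied to the generator's finalization rather than to the endpoint function's return, which is exactly why the wrapper fixes the early-release problem noted in Option A.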
# Create the generator using the wrapper
|
||||
streaming_generator = stream_wrapper(manager, crawl_request, config)
|
||||
|
||||
return StreamingResponse(
|
||||
streaming_generator, # Use the wrapper
|
||||
media_type='application/x-ndjson',
|
||||
headers={'Cache-Control': 'no-cache',
|
||||
'Connection': 'keep-alive', 'X-Stream-Status': 'active'}
|
||||
)
|
||||
|
||||
except (PoolTimeoutError, NoHealthyCrawlerError) as e:
|
||||
# These might occur if get_crawler fails *before* stream starts
|
||||
# Or if the wrapper re-raises them.
|
||||
logger.warning(f"Stream request rejected before starting: {e}")
|
||||
status_code = status.HTTP_503_SERVICE_UNAVAILABLE # Or 429 for timeout
|
||||
# Don't raise HTTPException here, let the wrapper yield the error message.
|
||||
# If we want to return a non-200 initial status, need more complex handling.
|
||||
# Return an *empty* stream with error headers? Or just let wrapper yield error.
|
||||
|
||||
async def _error_stream(e):
|
||||
error_payload = {"status": "error", "detail": str(e)}
|
||||
yield (json.dumps(error_payload) + "\n").encode('utf-8')
|
||||
return StreamingResponse(_error_stream(e), status_code=status_code, media_type='application/x-ndjson')
|
||||
|
||||
except HTTPException: # Re-raise HTTP exceptions from setup
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f"Unexpected error setting up stream crawl: {e}", exc_info=True)
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=f"An unexpected error occurred setting up the stream: {e}"
|
||||
)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import uvicorn
|
||||
|
||||
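The endpoint above streams newline-delimited JSON, with error payloads shaped like `{"status": "error", ...}`. A minimal client-side sketch for splitting such a stream back into result and error payloads (the helper and its name are illustrative, not part of the server code):

```python
import json

def split_ndjson_payloads(chunks):
    """Split streamed NDJSON chunks into (results, errors).

    Illustrative helper: error payloads are recognized by the
    {"status": "error", ...} shape the endpoint above yields.
    Chunks may arrive split mid-object, so we buffer until a newline.
    """
    results, errors = [], []
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8") if isinstance(chunk, bytes) else chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if not line.strip():
                continue
            payload = json.loads(line)
            (errors if payload.get("status") == "error" else results).append(payload)
    return results, errors
```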
docs/examples/docker/demo_docker_api.py (new file, 1019 lines; diff suppressed because it is too large)
docs/examples/markdown/content_source_example.py (new file, 64 lines)
@@ -0,0 +1,64 @@
"""
Example showing how to use the content_source parameter to control HTML input for markdown generation.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator


async def demo_content_source():
    """Demonstrates different content_source options for markdown generation."""
    url = "https://example.com"  # Simple demo site

    print("Crawling with different content_source options...")

    # --- Example 1: Default Behavior (cleaned_html) ---
    # Uses the HTML after it has been processed by the scraping strategy:
    # cleaned, simplified, and optimized for readability.
    default_generator = DefaultMarkdownGenerator()  # content_source="cleaned_html" is the default
    default_config = CrawlerRunConfig(markdown_generator=default_generator)

    # --- Example 2: Raw HTML ---
    # Uses the original HTML directly from the webpage.
    # Preserves more original content but may include navigation, ads, etc.
    raw_generator = DefaultMarkdownGenerator(content_source="raw_html")
    raw_config = CrawlerRunConfig(markdown_generator=raw_generator)

    # --- Example 3: Fit HTML ---
    # Uses preprocessed HTML optimized for schema extraction.
    # Better for structured data extraction but may lose some formatting.
    fit_generator = DefaultMarkdownGenerator(content_source="fit_html")
    fit_config = CrawlerRunConfig(markdown_generator=fit_generator)

    # Execute all three crawls in sequence
    async with AsyncWebCrawler() as crawler:
        # Default (cleaned_html)
        result_default = await crawler.arun(url=url, config=default_config)

        # Raw HTML
        result_raw = await crawler.arun(url=url, config=raw_config)

        # Fit HTML
        result_fit = await crawler.arun(url=url, config=fit_config)

    # Print a summary of the results
    print("\nMarkdown Generation Results:\n")

    print("1. Default (cleaned_html):")
    print(f"   Length: {len(result_default.markdown.raw_markdown)} chars")
    print(f"   First 80 chars: {result_default.markdown.raw_markdown[:80]}...\n")

    print("2. Raw HTML:")
    print(f"   Length: {len(result_raw.markdown.raw_markdown)} chars")
    print(f"   First 80 chars: {result_raw.markdown.raw_markdown[:80]}...\n")

    print("3. Fit HTML:")
    print(f"   Length: {len(result_fit.markdown.raw_markdown)} chars")
    print(f"   First 80 chars: {result_fit.markdown.raw_markdown[:80]}...\n")

    # Summarize the differences in output
    print("\nKey Takeaways:")
    print("- cleaned_html: Best for readable, focused content")
    print("- raw_html: Preserves more original content, but may include noise")
    print("- fit_html: Optimized for schema extraction and structured data")


if __name__ == "__main__":
    asyncio.run(demo_content_source())
docs/examples/markdown/content_source_short_example.py (new file, 42 lines)
@@ -0,0 +1,42 @@
"""
Example demonstrating how to use the content_source parameter in MarkdownGenerationStrategy.
"""

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator


async def demo_markdown_source_config():
    print("\n=== Demo: Configuring Markdown Source ===")

    # Example 1: Generate markdown from cleaned HTML (default behavior)
    cleaned_md_generator = DefaultMarkdownGenerator(content_source="cleaned_html")
    config_cleaned = CrawlerRunConfig(markdown_generator=cleaned_md_generator)

    async with AsyncWebCrawler() as crawler:
        result_cleaned = await crawler.arun(url="https://example.com", config=config_cleaned)
        print("Markdown from Cleaned HTML (default):")
        print(f"  Length: {len(result_cleaned.markdown.raw_markdown)}")
        print(f"  Start: {result_cleaned.markdown.raw_markdown[:100]}...")

    # Example 2: Generate markdown directly from raw HTML
    raw_md_generator = DefaultMarkdownGenerator(content_source="raw_html")
    config_raw = CrawlerRunConfig(markdown_generator=raw_md_generator)

    async with AsyncWebCrawler() as crawler:
        result_raw = await crawler.arun(url="https://example.com", config=config_raw)
        print("\nMarkdown from Raw HTML:")
        print(f"  Length: {len(result_raw.markdown.raw_markdown)}")
        print(f"  Start: {result_raw.markdown.raw_markdown[:100]}...")

    # Example 3: Generate markdown from preprocessed 'fit' HTML
    fit_md_generator = DefaultMarkdownGenerator(content_source="fit_html")
    config_fit = CrawlerRunConfig(markdown_generator=fit_md_generator)

    async with AsyncWebCrawler() as crawler:
        result_fit = await crawler.arun(url="https://example.com", config=config_fit)
        print("\nMarkdown from Fit HTML:")
        print(f"  Length: {len(result_fit.markdown.raw_markdown)}")
        print(f"  Start: {result_fit.markdown.raw_markdown[:100]}...")


if __name__ == "__main__":
    asyncio.run(demo_markdown_source_config())
@@ -69,9 +69,8 @@ We group them by category.
| **Parameter**              | **Type / Default**                   | **What It Does** |
|----------------------------|--------------------------------------|------------------|
| **`word_count_threshold`** | `int` (default: ~200)                | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`**  | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`**   | `MarkdownGenerationStrategy` (None)  | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as the `content_source` parameter to select the HTML input source (`cleaned_html`, `raw_html`, or `fit_html`). |
| **`css_selector`**         | `str` (None)                         | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`**      | `List[str]` (None)                   | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`**        | `list` (None)                        | Removes entire tags (e.g. `["script", "style"]`). |
@@ -111,13 +111,71 @@ Some commonly used `options`:
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.

## 4. Selecting the HTML Source for Markdown Generation

The `content_source` parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown.

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Option 1: Use the raw HTML directly from the webpage (before any processing)
    raw_md_generator = DefaultMarkdownGenerator(
        content_source="raw_html",
        options={"ignore_links": True}
    )

    # Option 2: Use the cleaned HTML (after scraping strategy processing - default)
    cleaned_md_generator = DefaultMarkdownGenerator(
        content_source="cleaned_html",  # This is the default
        options={"ignore_links": True}
    )

    # Option 3: Use preprocessed HTML optimized for schema extraction
    fit_md_generator = DefaultMarkdownGenerator(
        content_source="fit_html",
        options={"ignore_links": True}
    )

    # Use one of the generators in your crawler config
    config = CrawlerRunConfig(
        markdown_generator=raw_md_generator  # Try each of the generators
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success:
            print("Markdown:\n", result.markdown.raw_markdown[:500])
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```

### HTML Source Options

- **`"cleaned_html"`** (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed.

- **`"raw_html"`**: Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that might not be relevant to the main content.

- **`"fit_html"`**: Uses HTML preprocessed for schema extraction. This HTML is optimized for structured data extraction and may have certain elements simplified or removed.

### When to Use Each Option

- Use **`"cleaned_html"`** (default) for most cases where you want a balance of content preservation and noise removal.
- Use **`"raw_html"`** when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep.
- Use **`"fit_html"`** when working with structured data or when you need HTML that's optimized for schema extraction.

---
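Internally, the source selection can be pictured as a small dispatch table mapping each `content_source` value to the corresponding HTML string, with unknown values falling back to the default. A minimal standalone sketch (the helper name is illustrative; it mirrors the dispatch pattern exercised in the project's tests, not the crawler's exact internals):

```python
def select_markdown_input(content_source: str, raw_html: str,
                          cleaned_html: str, fit_html: str) -> str:
    """Pick the HTML string to feed the markdown generator.

    Illustrative sketch of the dispatch pattern: unknown values fall
    back to cleaned_html, mirroring the default behavior.
    """
    html_source_selector = {
        "raw_html": lambda: raw_html,
        "cleaned_html": lambda: cleaned_html,
        "fit_html": lambda: fit_html,
    }
    return html_source_selector.get(content_source, lambda: cleaned_html)()
```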
## 5. Content Filters

**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don’t want.

### 5.1 BM25ContentFilter

If you have a **search query**, BM25 is a good choice:
@@ -146,7 +204,7 @@ config = CrawlerRunConfig(markdown_generator=md_generator)
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with a low generic score. Realistically, you want to supply a query for best results.

### 5.2 PruningContentFilter

If you **don’t** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
@@ -170,7 +228,7 @@ prune_filter = PruningContentFilter(
- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.

### 5.3 LLMContentFilter

For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content’s meaning and structure:
@@ -247,7 +305,7 @@ filter = LLMContentFilter(
---

## 6. Using Fit Markdown

When a content filter is active, the library produces two forms of markdown inside `result.markdown`:
@@ -284,7 +342,7 @@ if __name__ == "__main__":
---

## 7. The `MarkdownGenerationResult` Object

If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, you’ll see fields such as:
@@ -315,7 +373,7 @@ Below is a **revised section** under “Combining Filters (BM25 + Pruning)” th
---

## 8. Combining Filters (BM25 + Pruning) in Two Passes

You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page twice. Instead:
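The two-pass idea itself is library-agnostic: first drop boilerplate-ish chunks, then score the survivors against the query. A self-contained sketch with crude stand-ins for the two filters (the helpers below are illustrative, not the `PruningContentFilter`/`BM25ContentFilter` APIs):

```python
import math
import re

def prune(chunks, min_words=5):
    """Pass 1: drop boilerplate-ish chunks (here: very short ones).
    Crude stand-in for PruningContentFilter's density heuristics."""
    return [c for c in chunks if len(c.split()) >= min_words]

def bm25_rank(chunks, query, k1=1.5, b=0.75):
    """Pass 2: rank remaining chunks against the query with BM25 scoring.
    Stand-in for BM25ContentFilter."""
    tokenized = [re.findall(r"\w+", c.lower()) for c in chunks]
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / N
    q_terms = re.findall(r"\w+", query.lower())

    def score(tokens):
        s = 0.0
        for term in q_terms:
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            tf = tokens.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avgdl))
        return s

    return sorted(chunks, key=lambda c: score(re.findall(r"\w+", c.lower())), reverse=True)
```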
@@ -407,7 +465,7 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---

## 9. Common Pitfalls & Tips

1. **No Markdown Output?**
   - Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
@@ -427,11 +485,12 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---

## 10. Summary & Next Steps

In this **Markdown Generation Basics** tutorial, you learned to:

- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Select different HTML sources using the `content_source` parameter.
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).
tests/general/test_content_source_parameter.py (new file, 106 lines)
@@ -0,0 +1,106 @@
"""
Tests for the content_source parameter in markdown generation.
"""
import unittest
import asyncio
from unittest.mock import patch, MagicMock

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator, MarkdownGenerationStrategy
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.models import MarkdownGenerationResult

HTML_SAMPLE = """
<html>
<head><title>Test Page</title></head>
<body>
    <h1>Test Content</h1>
    <p>This is a test paragraph.</p>
    <div class="container">
        <p>This is content within a container.</p>
    </div>
</body>
</html>
"""


class TestContentSourceParameter(unittest.TestCase):
    """Test cases for the content_source parameter in markdown generation."""

    def setUp(self):
        """Set up test fixtures."""
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)

    def tearDown(self):
        """Tear down test fixtures."""
        self.loop.close()

    def test_default_content_source(self):
        """Test that the default content_source is 'cleaned_html'."""
        # Can't directly instantiate the abstract class, so test DefaultMarkdownGenerator
        generator = DefaultMarkdownGenerator()
        self.assertEqual(generator.content_source, "cleaned_html")

    def test_custom_content_source(self):
        """Test that content_source can be customized."""
        generator = DefaultMarkdownGenerator(content_source="fit_html")
        self.assertEqual(generator.content_source, "fit_html")

    @patch('crawl4ai.markdown_generation_strategy.CustomHTML2Text')
    def test_html_processing_using_input_html(self, mock_html2text):
        """Test that generate_markdown uses the input_html parameter."""
        # Setup mock
        mock_instance = MagicMock()
        mock_instance.handle.return_value = "# Test Content\n\nThis is a test paragraph."
        mock_html2text.return_value = mock_instance

        # Create generator and call generate_markdown
        generator = DefaultMarkdownGenerator()
        result = generator.generate_markdown(input_html="<h1>Test Content</h1><p>This is a test paragraph.</p>")

        # Verify input_html was passed to the HTML2Text handler
        mock_instance.handle.assert_called_once()
        # Get the first positional argument
        args, _ = mock_instance.handle.call_args
        self.assertEqual(args[0], "<h1>Test Content</h1><p>This is a test paragraph.</p>")

        # Check the result
        self.assertIsInstance(result, MarkdownGenerationResult)
        self.assertEqual(result.raw_markdown, "# Test Content\n\nThis is a test paragraph.")

    def test_html_source_selection_logic(self):
        """Test that the HTML source selection logic works correctly."""
        # We test the dispatch pattern directly to avoid async complexities

        # Create test data
        raw_html = "<html><body><h1>Raw HTML</h1></body></html>"
        cleaned_html = "<html><body><h1>Cleaned HTML</h1></body></html>"
        fit_html = "<html><body><h1>Preprocessed HTML</h1></body></html>"

        # The dispatch pattern used to select the HTML source
        html_source_selector = {
            "raw_html": lambda: raw_html,
            "cleaned_html": lambda: cleaned_html,
            "fit_html": lambda: fit_html,
        }

        # Test Case 1: content_source="cleaned_html"
        source_lambda = html_source_selector.get("cleaned_html")
        self.assertEqual(source_lambda(), cleaned_html)

        # Test Case 2: content_source="raw_html"
        source_lambda = html_source_selector.get("raw_html")
        self.assertEqual(source_lambda(), raw_html)

        # Test Case 3: content_source="fit_html"
        source_lambda = html_source_selector.get("fit_html")
        self.assertEqual(source_lambda(), fit_html)

        # Test Case 4: An invalid content_source falls back to cleaned_html
        source_lambda = html_source_selector.get("invalid_source", lambda: cleaned_html)
        self.assertEqual(source_lambda(), cleaned_html)


if __name__ == '__main__':
    unittest.main()
tests/memory/README.md (new file, 315 lines)
@@ -0,0 +1,315 @@
# Crawl4AI Stress Testing and Benchmarking

This directory contains tools for stress testing Crawl4AI's `arun_many` method and dispatcher system with high volumes of URLs to evaluate performance, concurrency handling, and potentially detect memory issues. It also includes a benchmarking system to track performance over time.

## Quick Start

```bash
# Run a default stress test (small config) and generate a report
# (Assumes run_all.sh is updated to call run_benchmark.py)
./run_all.sh
```

*Note: `run_all.sh` might need to be updated if it directly called the old script.*

## Overview

The stress testing system works by:

1. Generating a local test site with heavy HTML pages (regenerated by default for each test).
2. Starting a local HTTP server to serve these pages.
3. Running Crawl4AI's `arun_many` method against this local site using the `MemoryAdaptiveDispatcher` with configurable concurrency (`max_sessions`).
4. Monitoring performance metrics via the `CrawlerMonitor` and optionally logging memory usage.
5. Optionally generating detailed benchmark reports with visualizations using `benchmark_report.py`.

## Available Tools

- `test_stress_sdk.py` - Main stress testing script utilizing `arun_many` and dispatchers.
- `benchmark_report.py` - Report generator for comparing test results (assumes compatibility with `test_stress_sdk.py` outputs).
- `run_benchmark.py` - Python script with predefined test configurations that orchestrates tests using `test_stress_sdk.py`.
- `run_all.sh` - Simple wrapper script (may need updating).

## Usage Guide

### Using Predefined Configurations (Recommended)

The `run_benchmark.py` script offers the easiest way to run standardized tests:

```bash
# Quick test (50 URLs, 4 max sessions)
python run_benchmark.py quick

# Medium test (500 URLs, 16 max sessions)
python run_benchmark.py medium

# Large test (1000 URLs, 32 max sessions)
python run_benchmark.py large

# Extreme test (2000 URLs, 64 max sessions)
python run_benchmark.py extreme

# Custom configuration
python run_benchmark.py custom --urls 300 --max-sessions 24 --chunk-size 50

# Run the 'small' test in streaming mode
python run_benchmark.py small --stream

# Override max_sessions for the 'medium' config
python run_benchmark.py medium --max-sessions 20

# Skip benchmark report generation after the test
python run_benchmark.py small --no-report

# Clean up reports and site files before running
python run_benchmark.py medium --clean
```
#### `run_benchmark.py` Parameters

| Parameter            | Default         | Description                                                                  |
| -------------------- | --------------- | ---------------------------------------------------------------------------- |
| `config`             | *required*      | Test configuration: `quick`, `small`, `medium`, `large`, `extreme`, `custom` |
| `--urls`             | config-specific | Number of URLs (required for `custom`)                                       |
| `--max-sessions`     | config-specific | Max concurrent sessions managed by the dispatcher (required for `custom`)    |
| `--chunk-size`       | config-specific | URLs per batch for non-stream logging (required for `custom`)                |
| `--stream`           | False           | Enable streaming results (disables batch logging)                            |
| `--monitor-mode`     | DETAILED        | `DETAILED` or `AGGREGATED` display for the live monitor                      |
| `--use-rate-limiter` | False           | Enable the basic rate limiter in the dispatcher                              |
| `--port`             | 8000            | HTTP server port                                                             |
| `--no-report`        | False           | Skip generating the comparison report via `benchmark_report.py`              |
| `--clean`            | False           | Clean up reports and site files before running                               |
| `--keep-server-alive`| False           | Keep the local HTTP server running after the test                            |
| `--use-existing-site`| False           | Use an existing site on the specified port (no local server start/site gen)  |
| `--skip-generation`  | False           | Use existing site files but start the local server                           |
| `--keep-site`        | False           | Keep generated site files after the test                                     |

#### Predefined Configurations

| Configuration | URLs | Max Sessions | Chunk Size | Description                     |
| ------------- | ---- | ------------ | ---------- | ------------------------------- |
| `quick`       | 50   | 4            | 10         | Quick test for basic validation |
| `small`       | 100  | 8            | 20         | Small test for routine checks   |
| `medium`      | 500  | 16           | 50         | Medium test for thorough checks |
| `large`       | 1000 | 32           | 100        | Large test for stress testing   |
| `extreme`     | 2000 | 64           | 200        | Extreme test for limit testing  |
### Direct Usage of `test_stress_sdk.py`

For fine-grained control or debugging, you can run the stress test script directly:

```bash
# Test with 200 URLs and 32 max concurrent sessions
python test_stress_sdk.py --urls 200 --max-sessions 32 --chunk-size 40

# Clean up previous test data first
python test_stress_sdk.py --clean-reports --clean-site --urls 100 --max-sessions 16 --chunk-size 20

# Change the HTTP server port and use the aggregated monitor
python test_stress_sdk.py --port 8088 --urls 100 --max-sessions 16 --monitor-mode AGGREGATED

# Enable streaming mode and use rate limiting
python test_stress_sdk.py --urls 50 --max-sessions 8 --stream --use-rate-limiter

# Change the report output location
python test_stress_sdk.py --report-path custom_reports --urls 100 --max-sessions 16
```

#### `test_stress_sdk.py` Parameters

| Parameter            | Default     | Description                                                           |
| -------------------- | ----------- | --------------------------------------------------------------------- |
| `--urls`             | 100         | Number of URLs to test                                                 |
| `--max-sessions`     | 16          | Maximum concurrent crawling sessions managed by the dispatcher         |
| `--chunk-size`       | 10          | Number of URLs per batch (relevant for non-stream logging)             |
| `--stream`           | False       | Enable streaming results (disables batch logging)                      |
| `--monitor-mode`     | DETAILED    | `DETAILED` or `AGGREGATED` display for the live `CrawlerMonitor`       |
| `--use-rate-limiter` | False       | Enable a basic `RateLimiter` within the dispatcher                     |
| `--site-path`        | "test_site" | Path to store/use the generated test site                              |
| `--port`             | 8000        | Port for the local HTTP server                                         |
| `--report-path`      | "reports"   | Path to save the test result summary (JSON) and memory samples (CSV)   |
| `--skip-generation`  | False       | Use existing test site files but still start the local server          |
| `--use-existing-site`| False       | Use an existing site on the specified port (no local server/site gen)  |
| `--keep-server-alive`| False       | Keep the local HTTP server running after test completion               |
| `--keep-site`        | False       | Keep the generated test site files after test completion               |
| `--clean-reports`    | False       | Clean up the report directory before running                           |
| `--clean-site`       | False       | Clean up the site directory before/after running (see script logic)    |
### Generating Reports Only

If you only want to generate a benchmark report from existing test results (assuming `benchmark_report.py` is compatible):

```bash
# Generate a report from existing test results in ./reports/
python benchmark_report.py

# Limit to the most recent 5 test results
python benchmark_report.py --limit 5

# Specify a custom source directory for test results
python benchmark_report.py --reports-dir alternate_results
```

#### `benchmark_report.py` Parameters (Assumed)

| Parameter       | Default             | Description                                            |
| --------------- | ------------------- | ------------------------------------------------------ |
| `--reports-dir` | "reports"           | Directory containing `test_stress_sdk.py` result files |
| `--output-dir`  | "benchmark_reports" | Directory to save generated HTML reports and charts    |
| `--limit`       | None (all results)  | Limit comparison to the N most recent test results     |
| `--output-file` | Auto-generated      | Custom output filename for the HTML report             |
## Understanding the Test Output

### Real-time Progress Display (`CrawlerMonitor`)

When running `test_stress_sdk.py`, the `CrawlerMonitor` provides a live view of the crawling process managed by the dispatcher.

- **DETAILED Mode (Default):** Shows individual task status (Queued, Active, Completed, Failed), timings, memory usage per task (if `psutil` is available), overall queue statistics, and memory pressure status.
- **AGGREGATED Mode:** Shows summary counts (Queued, Active, Completed, Failed), overall progress percentage, estimated time remaining, average URLs/sec, and memory pressure status.

### Batch Log Output (Non-Streaming Mode Only)

If running `test_stress_sdk.py` **without** the `--stream` flag, you will *also* see per-batch summary lines printed to the console *after* the monitor display, once each chunk of URLs finishes processing:

```
Batch | Progress | Start Mem | End Mem | URLs/sec | Success/Fail | Time (s) | Status
───────────────────────────────────────────────────────────────────────────────────
    1 |    10.0% |   50.1 MB |  55.3 MB |     23.8 |         10/0 |     0.42 | Success
    2 |    20.0% |   55.3 MB |  60.1 MB |     24.1 |         10/0 |     0.41 | Success
...
```

This display provides chunk-specific metrics:

- **Batch**: The batch number being reported.
- **Progress**: Overall percentage of total URLs processed *after* this batch.
- **Start Mem / End Mem**: Memory usage before and after processing this batch (if tracked).
- **URLs/sec**: Processing speed *for this specific batch*.
- **Success/Fail**: Number of successful and failed URLs *in this batch*.
- **Time (s)**: Wall-clock time taken to process *this batch*.
- **Status**: Color-coded status for the batch outcome.
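Because each batch line is plain pipe-separated text, it is easy to post-process with a few lines of Python. A hedged sketch (the helper and its field names mirror the columns above and are illustrative, not part of the test scripts):

```python
def parse_batch_line(line):
    """Parse one pipe-separated batch log line into a dict (illustrative helper)."""
    fields = ["batch", "progress", "start_mem", "end_mem",
              "urls_per_sec", "success_fail", "time_s", "status"]
    values = [part.strip() for part in line.split("|")]
    row = dict(zip(fields, values))
    # Coerce the numeric columns
    row["batch"] = int(row["batch"])
    row["urls_per_sec"] = float(row["urls_per_sec"])
    success, fail = row["success_fail"].split("/")
    row["success"], row["fail"] = int(success), int(fail)
    return row
```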
### Summary Output
|
||||
|
||||
After test completion, a final summary is displayed:
|
||||
|
||||
```
|
||||
================================================================================
|
||||
Test Completed
|
||||
================================================================================
|
||||
Test ID: 20250418_103015
|
||||
Configuration: 100 URLs, 16 max sessions, Chunk: 10, Stream: False, Monitor: DETAILED
|
||||
Results: 100 successful, 0 failed (100 processed, 100.0% success)
|
||||
Performance: 5.85 seconds total, 17.09 URLs/second avg
|
||||
Memory Usage: Start: 50.1 MB, End: 75.3 MB, Max: 78.1 MB, Growth: 25.2 MB
|
||||
Results summary saved to reports/test_summary_20250418_103015.json
|
||||
```
|
||||
|
||||
### HTML Report Structure (Generated by `benchmark_report.py`)
|
||||
|
||||
(This section remains the same, assuming `benchmark_report.py` generates these)
|
||||
The benchmark report contains several sections:
|
||||
1. **Summary**: Overview of the latest test results and trends
|
||||
2. **Performance Comparison**: Charts showing throughput across tests
|
||||
3. **Memory Usage**: Detailed memory usage graphs for each test
|
||||
4. **Detailed Results**: Tabular data of all test metrics
|
||||
5. **Conclusion**: Automated analysis of performance and memory patterns
|
||||
|
||||
### Memory Metrics
|
||||
|
||||
(This section remains conceptually the same)
|
||||
Memory growth is the key metric for detecting leaks...
### Performance Metrics

(This section remains conceptually the same, though "URLs per Worker" is less relevant - focus on overall URLs/sec)

Key performance indicators include:

- **URLs per Second**: Higher is better (throughput)
- **Success Rate**: Should be 100% in normal conditions
- **Total Processing Time**: Lower is better
- **Dispatcher Efficiency**: Observe queue lengths and wait times in the monitor (Detailed mode)
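The first two indicators can be derived directly from the summary JSON; a small sketch (field names are assumptions mirroring the summary output shown earlier):

```python
# Hypothetical summary dict mirroring the test_summary_*.json fields
summary = {"urls_processed": 100, "successful": 100, "failed": 0, "total_time_seconds": 5.85}

# Throughput: total URLs divided by wall-clock time
urls_per_sec = summary["urls_processed"] / summary["total_time_seconds"]
# Success rate as a percentage of processed URLs
success_rate = summary["successful"] / summary["urls_processed"] * 100

print(f"{urls_per_sec:.2f} URLs/sec, {success_rate:.1f}% success")
```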
### Raw Data Files

Raw data is saved in the `--report-path` directory (default `./reports/`):

- **JSON files** (`test_summary_*.json`): Contains the final summary for each test run.
- **CSV files** (`memory_samples_*.csv`): Contains time-series memory samples taken during the test run.

Example of reading raw data:

```python
import json
import pandas as pd

# Load test summary
test_id = "20250418_103015"  # Example ID
with open(f'reports/test_summary_{test_id}.json', 'r') as f:
    results = json.load(f)

# Load memory samples
memory_df = pd.read_csv(f'reports/memory_samples_{test_id}.csv')

# Analyze memory_df (e.g., calculate growth, plot)
if not memory_df['memory_info_mb'].isnull().all():
    growth = memory_df['memory_info_mb'].iloc[-1] - memory_df['memory_info_mb'].iloc[0]
    print(f"Total Memory Growth: {growth:.1f} MB")
else:
    print("No valid memory samples found.")

print(f"Avg URLs/sec: {results['urls_processed'] / results['total_time_seconds']:.2f}")
```
## Visualization Dependencies

(This section remains the same)

For full visualization capabilities in the HTML reports generated by `benchmark_report.py`, install additional dependencies...

## Directory Structure

```
benchmarking/            # Or your top-level directory name
├── benchmark_reports/   # Generated HTML reports (by benchmark_report.py)
├── reports/             # Raw test result data (from test_stress_sdk.py)
├── test_site/           # Generated test content (temporary)
├── benchmark_report.py  # Report generator
├── run_benchmark.py     # Test runner with predefined configs
├── test_stress_sdk.py   # Main stress test implementation using arun_many
└── run_all.sh           # Simple wrapper script (may need updates)
#└── requirements.txt    # Optional: Visualization dependencies for benchmark_report.py
```
## Cleanup

To clean up after testing:

```bash
# Remove the test site content (if not using --keep-site)
rm -rf test_site

# Remove all raw reports and generated benchmark reports
rm -rf reports benchmark_reports

# Or use the --clean flag with run_benchmark.py
python run_benchmark.py medium --clean
```
## Use in CI/CD

(This section remains conceptually the same, just update script names)

These tests can be integrated into CI/CD pipelines:

```bash
# Example CI script
python run_benchmark.py medium --no-report  # Run test without interactive report gen

# Check exit code
if [ $? -ne 0 ]; then echo "Stress test failed!"; exit 1; fi

# Optionally, run report generator and check its output/metrics
# python benchmark_report.py
# check_report_metrics.py reports/test_summary_*.json || exit 1

exit 0
```
## Troubleshooting

- **HTTP Server Port Conflict**: Use `--port` with `run_benchmark.py` or `test_stress_sdk.py`.
- **Memory Tracking Issues**: The `SimpleMemoryTracker` uses platform commands (`ps`, `/proc`, `tasklist`). Ensure these are available and the script has permission. If it consistently fails, memory reporting will be limited.
- **Visualization Missing**: Related to `benchmark_report.py` and its dependencies.
- **Site Generation Issues**: Check permissions for creating `./test_site/`. Use `--skip-generation` if you want to manage the site manually.
- **Testing Against External Site**: Ensure the external site is running and use `--use-existing-site --port <correct_port>`.

887 tests/memory/benchmark_report.py (Executable file)
@@ -0,0 +1,887 @@
#!/usr/bin/env python3
"""
Benchmark reporting tool for Crawl4AI stress tests.
Generates visual reports and comparisons between test runs.
"""

import os
import json
import glob
import argparse
import sys
from datetime import datetime
from pathlib import Path
from rich.console import Console
from rich.table import Table
from rich.panel import Panel

# Initialize rich console
console = Console()

# Try to import optional visualization dependencies
VISUALIZATION_AVAILABLE = True
try:
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import numpy as np
    import seaborn as sns
except ImportError:
    VISUALIZATION_AVAILABLE = False
    console.print("[yellow]Warning: Visualization dependencies not found. Install with:[/yellow]")
    console.print("[yellow]pip install pandas matplotlib seaborn[/yellow]")
    console.print("[yellow]Only text-based reports will be generated.[/yellow]")

# Configure plotting if available
if VISUALIZATION_AVAILABLE:
    # Set plot style for dark theme
    plt.style.use('dark_background')
    sns.set_theme(style="darkgrid")

    # Custom color palette based on Nord theme
    nord_palette = ["#88c0d0", "#81a1c1", "#a3be8c", "#ebcb8b", "#bf616a", "#b48ead", "#5e81ac"]
    sns.set_palette(nord_palette)


class BenchmarkReporter:
    """Generates visual reports and comparisons for Crawl4AI stress tests."""

    def __init__(self, reports_dir="reports", output_dir="benchmark_reports"):
        """Initialize the benchmark reporter.

        Args:
            reports_dir: Directory containing test result files
            output_dir: Directory to save generated reports
        """
        self.reports_dir = Path(reports_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # Configure matplotlib if available
        if VISUALIZATION_AVAILABLE:
            # Ensure the matplotlib backend works in headless environments
            mpl.use('Agg')

            # Set up styling for plots with dark theme
            mpl.rcParams['figure.figsize'] = (12, 8)
            mpl.rcParams['font.size'] = 12
            mpl.rcParams['axes.labelsize'] = 14
            mpl.rcParams['axes.titlesize'] = 16
            mpl.rcParams['xtick.labelsize'] = 12
            mpl.rcParams['ytick.labelsize'] = 12
            mpl.rcParams['legend.fontsize'] = 12
            mpl.rcParams['figure.facecolor'] = '#1e1e1e'
            mpl.rcParams['axes.facecolor'] = '#2e3440'
            mpl.rcParams['savefig.facecolor'] = '#1e1e1e'
            mpl.rcParams['text.color'] = '#e0e0e0'
            mpl.rcParams['axes.labelcolor'] = '#e0e0e0'
            mpl.rcParams['xtick.color'] = '#e0e0e0'
            mpl.rcParams['ytick.color'] = '#e0e0e0'
            mpl.rcParams['grid.color'] = '#444444'
            mpl.rcParams['figure.edgecolor'] = '#444444'
    def load_test_results(self, limit=None):
        """Load all test results from the reports directory.

        Args:
            limit: Optional limit on number of most recent tests to load

        Returns:
            Dictionary mapping test IDs to result data
        """
        result_files = glob.glob(str(self.reports_dir / "test_results_*.json"))

        # Sort files by modification time (newest first)
        result_files.sort(key=os.path.getmtime, reverse=True)

        if limit:
            result_files = result_files[:limit]

        results = {}
        for file_path in result_files:
            try:
                with open(file_path, 'r') as f:
                    data = json.load(f)
                test_id = data.get('test_id')
                if test_id:
                    results[test_id] = data

                    # Try to load the corresponding memory samples (requires pandas)
                    csv_path = self.reports_dir / f"memory_samples_{test_id}.csv"
                    if csv_path.exists() and VISUALIZATION_AVAILABLE:
                        try:
                            memory_df = pd.read_csv(csv_path)
                            results[test_id]['memory_samples'] = memory_df
                        except Exception as e:
                            console.print(f"[yellow]Warning: Could not load memory samples for {test_id}: {e}[/yellow]")
            except Exception as e:
                console.print(f"[red]Error loading {file_path}: {e}[/red]")

        console.print(f"Loaded {len(results)} test results")
        return results
    def generate_summary_table(self, results):
        """Generate a summary table of test results.

        Args:
            results: Dictionary mapping test IDs to result data

        Returns:
            Rich Table object
        """
        table = Table(title="Crawl4AI Stress Test Summary", show_header=True)

        # Define columns
        table.add_column("Test ID", style="cyan")
        table.add_column("Date", style="bright_green")
        table.add_column("URLs", justify="right")
        table.add_column("Workers", justify="right")
        table.add_column("Success %", justify="right")
        table.add_column("Time (s)", justify="right")
        table.add_column("Mem Growth", justify="right")
        table.add_column("URLs/sec", justify="right")

        # Add rows
        for test_id, data in sorted(results.items(), key=lambda x: x[0], reverse=True):
            # Parse timestamp from test_id
            try:
                date_str = datetime.strptime(test_id, "%Y%m%d_%H%M%S").strftime("%Y-%m-%d %H:%M")
            except ValueError:
                date_str = "Unknown"

            # Calculate success percentage
            total_urls = data.get('url_count', 0)
            successful = data.get('successful_urls', 0)
            success_pct = (successful / total_urls * 100) if total_urls > 0 else 0

            # Calculate memory growth if available
            mem_growth = "N/A"
            if 'memory_samples' in data:
                samples = data['memory_samples']
                if len(samples) >= 2:
                    # Try to extract numeric values from memory_info strings
                    try:
                        first_mem = float(samples.iloc[0]['memory_info'].split()[0])
                        last_mem = float(samples.iloc[-1]['memory_info'].split()[0])
                        mem_growth = f"{last_mem - first_mem:.1f} MB"
                    except Exception:
                        pass

            # Calculate URLs per second
            time_taken = data.get('total_time_seconds', 0)
            urls_per_sec = total_urls / time_taken if time_taken > 0 else 0

            table.add_row(
                test_id,
                date_str,
                str(total_urls),
                str(data.get('workers', 'N/A')),
                f"{success_pct:.1f}%",
                f"{data.get('total_time_seconds', 0):.2f}",
                mem_growth,
                f"{urls_per_sec:.1f}"
            )

        return table
    def generate_performance_chart(self, results, output_file=None):
        """Generate a performance comparison chart.

        Args:
            results: Dictionary mapping test IDs to result data
            output_file: File path to save the chart

        Returns:
            Path to the saved chart file or None if visualization is not available
        """
        if not VISUALIZATION_AVAILABLE:
            console.print("[yellow]Skipping performance chart - visualization dependencies not available[/yellow]")
            return None

        # Extract relevant data
        data = []
        for test_id, result in results.items():
            urls = result.get('url_count', 0)
            workers = result.get('workers', 0)
            time_taken = result.get('total_time_seconds', 0)
            urls_per_sec = urls / time_taken if time_taken > 0 else 0

            # Parse timestamp from test_id for sorting
            try:
                timestamp = datetime.strptime(test_id, "%Y%m%d_%H%M%S")
                data.append({
                    'test_id': test_id,
                    'timestamp': timestamp,
                    'urls': urls,
                    'workers': workers,
                    'time_seconds': time_taken,
                    'urls_per_sec': urls_per_sec
                })
            except ValueError:
                console.print(f"[yellow]Warning: Could not parse timestamp from {test_id}[/yellow]")

        if not data:
            console.print("[yellow]No valid data for performance chart[/yellow]")
            return None

        # Convert to DataFrame and sort by timestamp
        df = pd.DataFrame(data)
        df = df.sort_values('timestamp')

        # Create the plot
        fig, ax1 = plt.subplots(figsize=(12, 6))

        # Plot URLs per second as bars with properly set x-axis
        x_pos = range(len(df['test_id']))
        bars = ax1.bar(x_pos, df['urls_per_sec'], color='#88c0d0', alpha=0.8)
        ax1.set_ylabel('URLs per Second', color='#88c0d0')
        ax1.tick_params(axis='y', labelcolor='#88c0d0')

        # Properly set x-axis labels
        ax1.set_xticks(x_pos)
        ax1.set_xticklabels(df['test_id'].tolist(), rotation=45, ha='right')

        # Add worker count as text on each bar
        for i, bar in enumerate(bars):
            height = bar.get_height()
            workers = df.iloc[i]['workers']
            ax1.text(i, height + 0.1,
                     f'W: {workers}', ha='center', va='bottom', fontsize=9, color='#e0e0e0')

        # Add a second y-axis for total URLs
        ax2 = ax1.twinx()
        ax2.plot(x_pos, df['urls'], '-', color='#bf616a', alpha=0.8, markersize=6, marker='o')
        ax2.set_ylabel('Total URLs', color='#bf616a')
        ax2.tick_params(axis='y', labelcolor='#bf616a')

        # Set title and layout
        plt.title('Crawl4AI Performance Benchmarks')
        plt.tight_layout()

        # Save the figure
        if output_file is None:
            output_file = self.output_dir / "performance_comparison.png"
        plt.savefig(output_file, dpi=100, bbox_inches='tight')
        plt.close()

        return output_file
    def generate_memory_charts(self, results, output_prefix=None):
        """Generate memory usage charts for each test.

        Args:
            results: Dictionary mapping test IDs to result data
            output_prefix: Prefix for output file names

        Returns:
            List of paths to the saved chart files
        """
        if not VISUALIZATION_AVAILABLE:
            console.print("[yellow]Skipping memory charts - visualization dependencies not available[/yellow]")
            return []

        output_files = []

        for test_id, result in results.items():
            if 'memory_samples' not in result:
                continue

            memory_df = result['memory_samples']

            # Check if we have enough data points
            if len(memory_df) < 2:
                continue

            # Try to extract numeric values from memory_info strings
            try:
                memory_values = []
                for mem_str in memory_df['memory_info']:
                    # Extract the number from strings like "142.8 MB"
                    value = float(mem_str.split()[0])
                    memory_values.append(value)

                memory_df['memory_mb'] = memory_values
            except Exception as e:
                console.print(f"[yellow]Could not parse memory values for {test_id}: {e}[/yellow]")
                continue

            # Create the plot
            plt.figure(figsize=(10, 6))

            # Plot memory usage over time
            plt.plot(memory_df['elapsed_seconds'], memory_df['memory_mb'],
                     color='#88c0d0', marker='o', linewidth=2, markersize=4)

            # Add annotations for chunk processing
            chunk_size = result.get('chunk_size', 0)
            url_count = result.get('url_count', 0)
            if chunk_size > 0 and url_count > 0:
                # Estimate chunk processing times
                num_chunks = (url_count + chunk_size - 1) // chunk_size  # Ceiling division
                total_time = result.get('total_time_seconds', memory_df['elapsed_seconds'].max())
                chunk_times = np.linspace(0, total_time, num_chunks + 1)[1:]

                for i, time_point in enumerate(chunk_times):
                    if time_point <= memory_df['elapsed_seconds'].max():
                        plt.axvline(x=time_point, color='#4c566a', linestyle='--', alpha=0.6)
                        plt.text(time_point, memory_df['memory_mb'].min(), f'Chunk {i+1}',
                                 rotation=90, verticalalignment='bottom', fontsize=8, color='#e0e0e0')

            # Set labels and title
            plt.xlabel('Elapsed Time (seconds)', color='#e0e0e0')
            plt.ylabel('Memory Usage (MB)', color='#e0e0e0')
            plt.title(f'Memory Usage During Test {test_id}\n({url_count} URLs, {result.get("workers", "?")} Workers)',
                      color='#e0e0e0')

            # Add grid and set y-axis to start from zero
            plt.grid(True, alpha=0.3, color='#4c566a')

            # Add test metadata as text
            info_text = (
                f"URLs: {url_count}\n"
                f"Workers: {result.get('workers', 'N/A')}\n"
                f"Chunk Size: {result.get('chunk_size', 'N/A')}\n"
                f"Total Time: {result.get('total_time_seconds', 0):.2f}s\n"
            )

            # Calculate memory growth
            if len(memory_df) >= 2:
                first_mem = memory_df.iloc[0]['memory_mb']
                last_mem = memory_df.iloc[-1]['memory_mb']
                growth = last_mem - first_mem
                growth_rate = growth / result.get('total_time_seconds', 1)

                info_text += f"Memory Growth: {growth:.1f} MB\n"
                info_text += f"Growth Rate: {growth_rate:.2f} MB/s"

            plt.figtext(0.02, 0.02, info_text, fontsize=9, color='#e0e0e0',
                        bbox=dict(facecolor='#3b4252', alpha=0.8, edgecolor='#4c566a'))

            # Save the figure
            if output_prefix is None:
                output_file = self.output_dir / f"memory_chart_{test_id}.png"
            else:
                output_file = Path(f"{output_prefix}_memory_{test_id}.png")

            plt.tight_layout()
            plt.savefig(output_file, dpi=100, bbox_inches='tight')
            plt.close()

            output_files.append(output_file)

        return output_files
    def generate_comparison_report(self, results, title=None, output_file=None):
        """Generate a comprehensive comparison report of multiple test runs.

        Args:
            results: Dictionary mapping test IDs to result data
            title: Optional title for the report
            output_file: File path to save the report

        Returns:
            Path to the saved report file
        """
        if not results:
            console.print("[yellow]No results to generate comparison report[/yellow]")
            return None

        if output_file is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_file = self.output_dir / f"comparison_report_{timestamp}.html"

        # Create data for the report
        rows = []
        for test_id, data in results.items():
            # Calculate metrics
            urls = data.get('url_count', 0)
            workers = data.get('workers', 0)
            successful = data.get('successful_urls', 0)
            failed = data.get('failed_urls', 0)
            time_seconds = data.get('total_time_seconds', 0)

            # Calculate additional metrics
            success_rate = (successful / urls) * 100 if urls > 0 else 0
            urls_per_second = urls / time_seconds if time_seconds > 0 else 0
            urls_per_worker = urls / workers if workers > 0 else 0

            # Calculate memory growth if available
            mem_start = None
            mem_end = None
            mem_growth = None
            if 'memory_samples' in data:
                samples = data['memory_samples']
                if len(samples) >= 2:
                    try:
                        first_mem = float(samples.iloc[0]['memory_info'].split()[0])
                        last_mem = float(samples.iloc[-1]['memory_info'].split()[0])
                        mem_start = first_mem
                        mem_end = last_mem
                        mem_growth = last_mem - first_mem
                    except Exception:
                        pass

            # Parse timestamp from test_id
            try:
                timestamp = datetime.strptime(test_id, "%Y%m%d_%H%M%S")
            except ValueError:
                timestamp = None

            rows.append({
                'test_id': test_id,
                'timestamp': timestamp,
                'date': timestamp.strftime("%Y-%m-%d %H:%M:%S") if timestamp else "Unknown",
                'urls': urls,
                'workers': workers,
                'chunk_size': data.get('chunk_size', 0),
                'successful': successful,
                'failed': failed,
                'success_rate': success_rate,
                'time_seconds': time_seconds,
                'urls_per_second': urls_per_second,
                'urls_per_worker': urls_per_worker,
                'memory_start': mem_start,
                'memory_end': mem_end,
                'memory_growth': mem_growth
            })

        # Sort data by timestamp if possible
        if VISUALIZATION_AVAILABLE:
            # Convert to DataFrame and sort by timestamp
            df = pd.DataFrame(rows)
            if 'timestamp' in df.columns and not df['timestamp'].isna().all():
                df = df.sort_values('timestamp', ascending=False)
        else:
            # Simple sorting without pandas; fall back to datetime.min for
            # unparseable timestamps so None never reaches the comparison
            rows.sort(key=lambda x: x.get('timestamp') or datetime.min, reverse=True)
            df = None
        # Generate HTML report
        html = []
        html.append('<!DOCTYPE html>')
        html.append('<html lang="en">')
        html.append('<head>')
        html.append('<meta charset="UTF-8">')
        html.append('<meta name="viewport" content="width=device-width, initial-scale=1.0">')
        html.append(f'<title>{title or "Crawl4AI Benchmark Comparison"}</title>')
        html.append('<style>')
        html.append('''
            body {
                font-family: Arial, sans-serif;
                line-height: 1.6;
                padding: 20px;
                max-width: 1200px;
                margin: 0 auto;
                color: #e0e0e0;
                background-color: #1e1e1e;
            }
            h1, h2, h3 {
                color: #81a1c1;
            }
            table {
                border-collapse: collapse;
                width: 100%;
                margin-bottom: 20px;
            }
            th, td {
                text-align: left;
                padding: 12px;
                border-bottom: 1px solid #444;
            }
            th {
                background-color: #2e3440;
                font-weight: bold;
            }
            tr:hover {
                background-color: #2e3440;
            }
            a {
                color: #88c0d0;
                text-decoration: none;
            }
            a:hover {
                text-decoration: underline;
            }
            .chart-container {
                margin: 30px 0;
                text-align: center;
                background-color: #2e3440;
                padding: 20px;
                border-radius: 8px;
            }
            .chart-container img {
                max-width: 100%;
                height: auto;
                border: 1px solid #444;
                box-shadow: 0 0 10px rgba(0,0,0,0.3);
            }
            .card {
                border: 1px solid #444;
                border-radius: 8px;
                padding: 15px;
                margin-bottom: 20px;
                background-color: #2e3440;
                box-shadow: 0 0 10px rgba(0,0,0,0.2);
            }
            .highlight {
                background-color: #3b4252;
                font-weight: bold;
            }
            .status-good {
                color: #a3be8c;
            }
            .status-warning {
                color: #ebcb8b;
            }
            .status-bad {
                color: #bf616a;
            }
        ''')
        html.append('</style>')
        html.append('</head>')
        html.append('<body>')
        # Header
        html.append(f'<h1>{title or "Crawl4AI Benchmark Comparison"}</h1>')
        html.append(f'<p>Report generated on {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}</p>')

        # Summary section
        html.append('<div class="card">')
        html.append('<h2>Summary</h2>')
        html.append('<p>This report compares the performance of Crawl4AI across multiple test runs.</p>')

        # Summary metrics
        data_available = (VISUALIZATION_AVAILABLE and df is not None and not df.empty) or (not VISUALIZATION_AVAILABLE and len(rows) > 0)
        if data_available:
            # Get the latest test data
            if VISUALIZATION_AVAILABLE and df is not None and not df.empty:
                latest_test = df.iloc[0]
                latest_id = latest_test['test_id']
            else:
                latest_test = rows[0]  # First row (already sorted by timestamp)
                latest_id = latest_test['test_id']

            html.append('<h3>Latest Test Results</h3>')
            html.append('<ul>')
            html.append(f'<li><strong>Test ID:</strong> {latest_id}</li>')
            html.append(f'<li><strong>Date:</strong> {latest_test["date"]}</li>')
            html.append(f'<li><strong>URLs:</strong> {latest_test["urls"]}</li>')
            html.append(f'<li><strong>Workers:</strong> {latest_test["workers"]}</li>')
            html.append(f'<li><strong>Success Rate:</strong> {latest_test["success_rate"]:.1f}%</li>')
            html.append(f'<li><strong>Time:</strong> {latest_test["time_seconds"]:.2f} seconds</li>')
            html.append(f'<li><strong>Performance:</strong> {latest_test["urls_per_second"]:.1f} URLs/second</li>')

            # Check memory growth (handle both pandas and dict mode)
            memory_growth_available = False
            if VISUALIZATION_AVAILABLE and df is not None:
                if pd.notna(latest_test["memory_growth"]):
                    html.append(f'<li><strong>Memory Growth:</strong> {latest_test["memory_growth"]:.1f} MB</li>')
                    memory_growth_available = True
            else:
                if latest_test["memory_growth"] is not None:
                    html.append(f'<li><strong>Memory Growth:</strong> {latest_test["memory_growth"]:.1f} MB</li>')
                    memory_growth_available = True

            html.append('</ul>')

            # If we have more than one test, show trend
            if (VISUALIZATION_AVAILABLE and df is not None and len(df) > 1) or (not VISUALIZATION_AVAILABLE and len(rows) > 1):
                if VISUALIZATION_AVAILABLE and df is not None:
                    prev_test = df.iloc[1]
                else:
                    prev_test = rows[1]

                # Calculate performance change
                perf_change = ((latest_test["urls_per_second"] / prev_test["urls_per_second"]) - 1) * 100 if prev_test["urls_per_second"] > 0 else 0

                status_class = ""
                if perf_change > 5:
                    status_class = "status-good"
                elif perf_change < -5:
                    status_class = "status-bad"

                html.append('<h3>Performance Trend</h3>')
                html.append('<ul>')
                html.append(f'<li><strong>Performance Change:</strong> <span class="{status_class}">{perf_change:+.1f}%</span> compared to previous test</li>')

                # Memory trend if available
                memory_trend_available = False
                if VISUALIZATION_AVAILABLE and df is not None:
                    if pd.notna(latest_test["memory_growth"]) and pd.notna(prev_test["memory_growth"]):
                        mem_change = latest_test["memory_growth"] - prev_test["memory_growth"]
                        memory_trend_available = True
                else:
                    if latest_test["memory_growth"] is not None and prev_test["memory_growth"] is not None:
                        mem_change = latest_test["memory_growth"] - prev_test["memory_growth"]
                        memory_trend_available = True

                if memory_trend_available:
                    mem_status = ""
                    if mem_change < -1:  # Improved (less growth)
                        mem_status = "status-good"
                    elif mem_change > 1:  # Worse (more growth)
                        mem_status = "status-bad"

                    html.append(f'<li><strong>Memory Trend:</strong> <span class="{mem_status}">{mem_change:+.1f} MB</span> change in memory growth</li>')

                html.append('</ul>')

        html.append('</div>')
        # Generate performance chart if visualization is available
        if VISUALIZATION_AVAILABLE:
            perf_chart = self.generate_performance_chart(results)
            if perf_chart:
                html.append('<div class="chart-container">')
                html.append('<h2>Performance Comparison</h2>')
                html.append(f'<img src="{os.path.relpath(perf_chart, os.path.dirname(output_file))}" alt="Performance Comparison Chart">')
                html.append('</div>')
        else:
            html.append('<div class="chart-container">')
            html.append('<h2>Performance Comparison</h2>')
            html.append('<p>Charts not available - install visualization dependencies (pandas, matplotlib, seaborn) to enable.</p>')
            html.append('</div>')

        # Generate memory charts if visualization is available
        if VISUALIZATION_AVAILABLE:
            memory_charts = self.generate_memory_charts(results)
            if memory_charts:
                html.append('<div class="chart-container">')
                html.append('<h2>Memory Usage</h2>')

                for chart in memory_charts:
                    test_id = chart.stem.split('_')[-1]
                    html.append(f'<h3>Test {test_id}</h3>')
                    html.append(f'<img src="{os.path.relpath(chart, os.path.dirname(output_file))}" alt="Memory Chart for {test_id}">')

                html.append('</div>')
        else:
            html.append('<div class="chart-container">')
            html.append('<h2>Memory Usage</h2>')
            html.append('<p>Charts not available - install visualization dependencies (pandas, matplotlib, seaborn) to enable.</p>')
            html.append('</div>')

        # Detailed results table
        html.append('<h2>Detailed Results</h2>')

        # Add the results as an HTML table
        html.append('<table>')

        # Table headers
        html.append('<tr>')
        for col in ['Test ID', 'Date', 'URLs', 'Workers', 'Success %', 'Time (s)', 'URLs/sec', 'Mem Growth (MB)']:
            html.append(f'<th>{col}</th>')
        html.append('</tr>')

        # Table rows - handle both pandas DataFrame and list of dicts
        if VISUALIZATION_AVAILABLE and df is not None:
            # Using pandas DataFrame
            for _, row in df.iterrows():
                html.append('<tr>')
                html.append(f'<td>{row["test_id"]}</td>')
                html.append(f'<td>{row["date"]}</td>')
                html.append(f'<td>{row["urls"]}</td>')
                html.append(f'<td>{row["workers"]}</td>')
                html.append(f'<td>{row["success_rate"]:.1f}%</td>')
                html.append(f'<td>{row["time_seconds"]:.2f}</td>')
                html.append(f'<td>{row["urls_per_second"]:.1f}</td>')

                # Memory growth cell
                if pd.notna(row["memory_growth"]):
                    html.append(f'<td>{row["memory_growth"]:.1f}</td>')
                else:
                    html.append('<td>N/A</td>')

                html.append('</tr>')
        else:
            # Using list of dicts (when pandas is not available)
            for row in rows:
                html.append('<tr>')
                html.append(f'<td>{row["test_id"]}</td>')
                html.append(f'<td>{row["date"]}</td>')
                html.append(f'<td>{row["urls"]}</td>')
                html.append(f'<td>{row["workers"]}</td>')
                html.append(f'<td>{row["success_rate"]:.1f}%</td>')
                html.append(f'<td>{row["time_seconds"]:.2f}</td>')
                html.append(f'<td>{row["urls_per_second"]:.1f}</td>')

                # Memory growth cell
                if row["memory_growth"] is not None:
                    html.append(f'<td>{row["memory_growth"]:.1f}</td>')
                else:
                    html.append('<td>N/A</td>')

                html.append('</tr>')

        html.append('</table>')
        # Conclusion section
        html.append('<div class="card">')
        html.append('<h2>Conclusion</h2>')

        if VISUALIZATION_AVAILABLE and df is not None and not df.empty:
            # Using pandas for statistics (when available)
            # Calculate some overall statistics
            avg_urls_per_sec = df['urls_per_second'].mean()
            max_urls_per_sec = df['urls_per_second'].max()

            # Determine if we have a trend
            if len(df) > 1:
                trend_data = df.sort_values('timestamp')
                first_perf = trend_data.iloc[0]['urls_per_second']
                last_perf = trend_data.iloc[-1]['urls_per_second']

                perf_change = ((last_perf / first_perf) - 1) * 100 if first_perf > 0 else 0

                if perf_change > 10:
                    trend_desc = "significantly improved"
                    trend_class = "status-good"
                elif perf_change > 5:
                    trend_desc = "improved"
                    trend_class = "status-good"
                elif perf_change < -10:
                    trend_desc = "significantly decreased"
                    trend_class = "status-bad"
                elif perf_change < -5:
                    trend_desc = "decreased"
                    trend_class = "status-bad"
                else:
                    trend_desc = "remained stable"
                    trend_class = ""

                html.append(f'<p>Overall performance has <span class="{trend_class}">{trend_desc}</span> over the test period.</p>')

            html.append(f'<p>Average throughput: <strong>{avg_urls_per_sec:.1f}</strong> URLs/second</p>')
            html.append(f'<p>Maximum throughput: <strong>{max_urls_per_sec:.1f}</strong> URLs/second</p>')

            # Memory leak assessment
            if 'memory_growth' in df.columns and not df['memory_growth'].isna().all():
                avg_growth = df['memory_growth'].mean()
                max_growth = df['memory_growth'].max()

                if avg_growth < 5:
                    leak_assessment = "No significant memory leaks detected"
                    leak_class = "status-good"
                elif avg_growth < 10:
                    leak_assessment = "Minor memory growth observed"
                    leak_class = "status-warning"
                else:
                    leak_assessment = "Potential memory leak detected"
                    leak_class = "status-bad"

                html.append(f'<p><span class="{leak_class}">{leak_assessment}</span>. Average memory growth: <strong>{avg_growth:.1f} MB</strong> per test.</p>')
        else:
            # Manual calculations without pandas
            if rows:
                # Calculate average and max throughput
                total_urls_per_sec = sum(row['urls_per_second'] for row in rows)
                avg_urls_per_sec = total_urls_per_sec / len(rows)
                max_urls_per_sec = max(row['urls_per_second'] for row in rows)

                html.append(f'<p>Average throughput: <strong>{avg_urls_per_sec:.1f}</strong> URLs/second</p>')
                html.append(f'<p>Maximum throughput: <strong>{max_urls_per_sec:.1f}</strong> URLs/second</p>')

                # Memory assessment (simplified without pandas)
                growth_values = [row['memory_growth'] for row in rows if row['memory_growth'] is not None]
                if growth_values:
                    avg_growth = sum(growth_values) / len(growth_values)

                    if avg_growth < 5:
                        leak_assessment = "No significant memory leaks detected"
                        leak_class = "status-good"
                    elif avg_growth < 10:
                        leak_assessment = "Minor memory growth observed"
                        leak_class = "status-warning"
                    else:
                        leak_assessment = "Potential memory leak detected"
|
||||
leak_class = "status-bad"
|
||||
|
||||
html.append(f'<p><span class="{leak_class}">{leak_assessment}</span>. Average memory growth: <strong>{avg_growth:.1f} MB</strong> per test.</p>')
|
||||
else:
|
||||
html.append('<p>No test data available for analysis.</p>')
|
||||
|
||||
html.append('</div>')
|
||||
|
||||
# Footer
|
||||
html.append('<div style="margin-top: 30px; text-align: center; color: #777; font-size: 0.9em;">')
|
||||
html.append('<p>Generated by Crawl4AI Benchmark Reporter</p>')
|
||||
html.append('</div>')
|
||||
|
||||
html.append('</body>')
|
||||
html.append('</html>')
|
||||
|
||||
# Write the HTML file
|
||||
with open(output_file, 'w') as f:
|
||||
f.write('\n'.join(html))
|
||||
|
||||
# Print a clickable link for terminals that support it (iTerm, VS Code, etc.)
|
||||
file_url = f"file://{os.path.abspath(output_file)}"
|
||||
console.print(f"[green]Comparison report saved to: {output_file}[/green]")
|
||||
console.print(f"[blue underline]Click to open report: {file_url}[/blue underline]")
|
||||
return output_file
|
||||
|
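The `file://` link above is built by plain string concatenation, which breaks for paths containing spaces or other characters that need percent-encoding. A small sketch of the more robust alternative (the same `pathlib.Path.as_uri()` call the runner script uses for its report links); `report_link` is a hypothetical helper, not part of the script:

```python
import os
import pathlib

def report_link(output_file: str) -> str:
    """Build a clickable file:// URI for a report path.

    Unlike f"file://{os.path.abspath(path)}", Path.as_uri()
    percent-encodes spaces and other special characters.
    """
    return pathlib.Path(os.path.abspath(output_file)).as_uri()
```

For example, `report_link("bench report.html")` yields a URI in which the space is encoded as `%20`, so terminal emulators that linkify URIs do not cut the link short.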
    def run(self, limit=None, output_file=None):
        """Generate a full benchmark report.

        Args:
            limit: Optional limit on the number of most recent tests to include
            output_file: Optional output file path

        Returns:
            Path to the generated report file
        """
        # Load test results
        results = self.load_test_results(limit=limit)

        if not results:
            console.print("[yellow]No test results found. Run some tests first.[/yellow]")
            return None

        # Generate and display the summary table
        summary_table = self.generate_summary_table(results)
        console.print(summary_table)

        # Generate the comparison report
        title = f"Crawl4AI Benchmark Report ({len(results)} test runs)"
        report_file = self.generate_comparison_report(results, title=title, output_file=output_file)

        if report_file:
            console.print(f"[bold green]Report generated successfully: {report_file}[/bold green]")
            return report_file
        else:
            console.print("[bold red]Failed to generate report[/bold red]")
            return None

def main():
    """Main entry point for the benchmark reporter."""
    parser = argparse.ArgumentParser(description="Generate benchmark reports for Crawl4AI stress tests")

    parser.add_argument("--reports-dir", type=str, default="reports",
                        help="Directory containing test result files")
    parser.add_argument("--output-dir", type=str, default="benchmark_reports",
                        help="Directory to save generated reports")
    parser.add_argument("--limit", type=int, default=None,
                        help="Limit to the most recent N test results")
    parser.add_argument("--output-file", type=str, default=None,
                        help="Custom output file path for the report")

    args = parser.parse_args()

    # Create the benchmark reporter
    reporter = BenchmarkReporter(reports_dir=args.reports_dir, output_dir=args.output_dir)

    # Generate the report
    report_file = reporter.run(limit=args.limit, output_file=args.output_file)

    if report_file:
        print(f"Report generated at: {report_file}")
        return 0
    else:
        print("Failed to generate report")
        return 1


if __name__ == "__main__":
    import sys
    sys.exit(main())
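Both the pandas and fallback branches of the conclusion section above apply the same two growth thresholds (5 MB and 10 MB of average growth per test). Factored out, the mapping might look like this; `assess_memory_growth` is a hypothetical helper sketched here for illustration, not a function in the script:

```python
def assess_memory_growth(avg_growth_mb: float) -> tuple:
    """Map average per-test memory growth (MB) to an (assessment, CSS class) pair,
    using the same 5 MB / 10 MB thresholds as the report's conclusion section."""
    if avg_growth_mb < 5:
        return ("No significant memory leaks detected", "status-good")
    elif avg_growth_mb < 10:
        return ("Minor memory growth observed", "status-warning")
    return ("Potential memory leak detected", "status-bad")
```

Centralizing the thresholds this way would keep the two branches from drifting apart if the limits are ever tuned.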
4
tests/memory/requirements.txt
Normal file
@@ -0,0 +1,4 @@
pandas>=1.5.0
matplotlib>=3.5.0
seaborn>=0.12.0
rich>=12.0.0
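These packages are only needed for charting; the reporter degrades gracefully without them via its `VISUALIZATION_AVAILABLE` flag. A sketch of the guarded-import pattern that flag implies (the exact import block in the script may differ):

```python
# Optional visualization dependencies: the reporter still produces
# text/HTML output when these are absent.
try:
    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")  # headless backend, safe for report generation
    import matplotlib.pyplot as plt
    VISUALIZATION_AVAILABLE = True
except ImportError:
    pd = None
    VISUALIZATION_AVAILABLE = False
```

Code paths that use `df` then branch on `VISUALIZATION_AVAILABLE and df is not None`, as seen in the conclusion section of the report generator.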
259
tests/memory/run_benchmark.py
Executable file
@@ -0,0 +1,259 @@
#!/usr/bin/env python3
"""
Run a complete Crawl4AI benchmark test using test_stress_sdk.py and generate a report.
"""

import sys
import os
import glob
import pathlib  # Needed for Path.as_uri() when linking reports below
import argparse
import subprocess
import time
from datetime import datetime

from rich.console import Console
from rich.text import Text

console = Console()

# Updated TEST_CONFIGS to use max_sessions
TEST_CONFIGS = {
    "quick":   {"urls": 50,   "max_sessions": 4,  "chunk_size": 10,  "description": "Quick test (50 URLs, 4 sessions)"},
    "small":   {"urls": 100,  "max_sessions": 8,  "chunk_size": 20,  "description": "Small test (100 URLs, 8 sessions)"},
    "medium":  {"urls": 500,  "max_sessions": 16, "chunk_size": 50,  "description": "Medium test (500 URLs, 16 sessions)"},
    "large":   {"urls": 1000, "max_sessions": 32, "chunk_size": 100, "description": "Large test (1000 URLs, 32 sessions)"},
    "extreme": {"urls": 2000, "max_sessions": 64, "chunk_size": 200, "description": "Extreme test (2000 URLs, 64 sessions)"},
}

# Arguments to forward directly if present in custom_args
FORWARD_ARGS = {
    "urls": "--urls",
    "max_sessions": "--max-sessions",
    "chunk_size": "--chunk-size",
    "port": "--port",
    "monitor_mode": "--monitor-mode",
}
# Boolean flags to forward if True
FORWARD_FLAGS = {
    "stream": "--stream",
    "use_rate_limiter": "--use-rate-limiter",
    "keep_server_alive": "--keep-server-alive",
    "use_existing_site": "--use-existing-site",
    "skip_generation": "--skip-generation",
    "keep_site": "--keep-site",
    "clean_reports": "--clean-reports",  # Cleaning is driven by run_benchmark's --clean flag below
    "clean_site": "--clean-site",        # Cleaning is driven by run_benchmark's --clean flag below
}

def run_benchmark(config_name, custom_args=None, compare=True, clean=False):
    """Run the stress test and optionally the report generator."""
    if config_name not in TEST_CONFIGS and config_name != "custom":
        console.print(f"[bold red]Unknown configuration: {config_name}[/bold red]")
        return False

    # Print header
    title = "Crawl4AI SDK Benchmark Test"
    if config_name != "custom":
        title += f" - {TEST_CONFIGS[config_name]['description']}"
    else:
        # Safely get custom args for the title
        urls = custom_args.get('urls', '?') if custom_args else '?'
        sessions = custom_args.get('max_sessions', '?') if custom_args else '?'
        title += f" - Custom ({urls} URLs, {sessions} sessions)"

    console.print(f"\n[bold blue]{title}[/bold blue]")
    console.print("=" * (len(title) + 4))  # Adjust underline length

    console.print("\n[bold white]Preparing test...[/bold white]")

    # --- Command Construction ---
    # Use the new script name
    cmd = ["python", "test_stress_sdk.py"]

    # Apply preset config or custom args
    args_to_use = {}
    if config_name != "custom":
        args_to_use = TEST_CONFIGS[config_name].copy()
        # If custom args are provided (e.g., boolean flags), overlay them
        if custom_args:
            args_to_use.update(custom_args)
    elif custom_args:  # Custom config
        args_to_use = custom_args.copy()

    # Add arguments with values
    for key, arg_name in FORWARD_ARGS.items():
        if key in args_to_use:
            cmd.extend([arg_name, str(args_to_use[key])])

    # Add boolean flags
    for key, flag_name in FORWARD_FLAGS.items():
        if args_to_use.get(key, False):  # Check if the key exists and is True
            # Cleaning is handled by run_benchmark's own --clean flag below,
            # so the individual clean flags are not forwarded here; all other
            # flags are passed through to test_stress_sdk.py.
            if key not in ["clean_reports", "clean_site"]:
                cmd.append(flag_name)

    # Handle the top-level --clean flag for run_benchmark
    if clean:
        # Pass clean flags to the stress test script as well
        # (this assumes test_stress_sdk.py also accepts --clean-reports and --clean-site)
        cmd.append("--clean-reports")
        cmd.append("--clean-site")
        console.print("[yellow]Applying --clean: cleaning reports and site before the test.[/yellow]")

    console.print(f"\n[bold white]Running stress test:[/bold white] {' '.join(cmd)}")
    start = time.time()

    # Execute the stress test script, using Popen to stream its output line by line
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                text=True, encoding='utf-8', errors='replace')
        while True:
            line = proc.stdout.readline()
            if not line:
                break
            console.print(line.rstrip())  # Print line by line
        proc.wait()  # Wait for the process to complete
    except FileNotFoundError:
        console.print("[bold red]Error: Script 'test_stress_sdk.py' not found. Make sure it's in the correct directory.[/bold red]")
        return False
    except Exception as e:
        console.print(f"[bold red]Error running stress test subprocess: {e}[/bold red]")
        return False

    if proc.returncode != 0:
        console.print(f"[bold red]Stress test failed with exit code {proc.returncode}[/bold red]")
        return False

    duration = time.time() - start
    console.print(f"[bold green]Stress test completed in {duration:.1f} seconds[/bold green]")

    # --- Report Generation (Optional) ---
    if compare:
        # Assumes benchmark_report.py exists and works with the generated reports
        report_script = "benchmark_report.py"  # Keep configurable if needed
        report_cmd = ["python", report_script]
        console.print(f"\n[bold white]Generating benchmark report: {' '.join(report_cmd)}[/bold white]")

        # Run the report command and capture its output
        try:
            # check=False so a report failure doesn't raise; the return code is inspected below
            report_proc = subprocess.run(report_cmd, capture_output=True, text=True,
                                         check=False, encoding='utf-8', errors='replace')

            # Print the captured output from benchmark_report.py
            if report_proc.stdout:
                console.print("\n" + report_proc.stdout)
            if report_proc.stderr:
                console.print("[yellow]Report generator stderr:[/yellow]\n" + report_proc.stderr)

            if report_proc.returncode != 0:
                console.print(f"[bold yellow]Benchmark report generation script '{report_script}' failed with exit code {report_proc.returncode}[/bold yellow]")
                # Don't return False here; the test itself succeeded
            else:
                console.print(f"[bold green]Benchmark report script '{report_script}' completed.[/bold green]")

                # Find and print clickable links to the reports
                # (reports are saved in 'benchmark_reports' by benchmark_report.py)
                report_dir = "benchmark_reports"
                if os.path.isdir(report_dir):
                    report_files = glob.glob(os.path.join(report_dir, "comparison_report_*.html"))
                    if report_files:
                        try:
                            latest_report = max(report_files, key=os.path.getctime)
                            report_path = os.path.abspath(latest_report)
                            report_url = pathlib.Path(report_path).as_uri()  # Proper file:// URI
                            console.print(f"[bold cyan]Click to open report: [link={report_url}]{report_url}[/link][/bold cyan]")
                        except Exception as e:
                            console.print(f"[yellow]Could not determine latest report: {e}[/yellow]")

                    chart_files = glob.glob(os.path.join(report_dir, "memory_chart_*.png"))
                    if chart_files:
                        try:
                            latest_chart = max(chart_files, key=os.path.getctime)
                            chart_path = os.path.abspath(latest_chart)
                            chart_url = pathlib.Path(chart_path).as_uri()
                            console.print(f"[cyan]Memory chart: [link={chart_url}]{chart_url}[/link][/cyan]")
                        except Exception as e:
                            console.print(f"[yellow]Could not determine latest chart: {e}[/yellow]")
                else:
                    console.print(f"[yellow]Benchmark report directory '{report_dir}' not found. Cannot link reports.[/yellow]")

        except FileNotFoundError:
            console.print(f"[bold red]Error: Report script '{report_script}' not found.[/bold red]")
        except Exception as e:
            console.print(f"[bold red]Error running report generation subprocess: {e}[/bold red]")

    # Prompt to exit
    console.print("\n[bold green]Benchmark run finished. Press Enter to exit.[/bold green]")
    try:
        input()  # Wait for user input
    except EOFError:
        pass  # Input may be piped or unavailable

    return True

def main():
    parser = argparse.ArgumentParser(description="Run a Crawl4AI SDK benchmark test and generate a report")

    # --- Arguments ---
    parser.add_argument("config", choices=list(TEST_CONFIGS) + ["custom"],
                        help="Test configuration: quick, small, medium, large, extreme, or custom")

    # Arguments for the 'custom' config or to override presets
    parser.add_argument("--urls", type=int, help="Number of URLs")
    parser.add_argument("--max-sessions", type=int, help="Max concurrent sessions (replaces --workers)")
    parser.add_argument("--chunk-size", type=int, help="URLs per batch (for non-stream logging)")
    parser.add_argument("--port", type=int, help="HTTP server port")
    parser.add_argument("--monitor-mode", type=str, choices=["DETAILED", "AGGREGATED"], help="Monitor display mode")

    # Boolean flags / options
    parser.add_argument("--stream", action="store_true", help="Enable streaming results (disables batch logging)")
    parser.add_argument("--use-rate-limiter", action="store_true", help="Enable basic rate limiter")
    parser.add_argument("--no-report", action="store_true", help="Skip generating the comparison report")
    parser.add_argument("--clean", action="store_true", help="Clean up reports and site before running")
    parser.add_argument("--keep-server-alive", action="store_true", help="Keep the HTTP server running after the test")
    parser.add_argument("--use-existing-site", action="store_true", help="Use an existing site on the specified port")
    parser.add_argument("--skip-generation", action="store_true", help="Use existing site files without regenerating")
    parser.add_argument("--keep-site", action="store_true", help="Keep generated site files after the test")
    # url_level_logging was removed; it is implicitly handled by stream/batch mode now

    args = parser.parse_args()

    custom_args = {}

    # Populate custom_args from explicit command-line args
    if args.urls is not None: custom_args["urls"] = args.urls
    if args.max_sessions is not None: custom_args["max_sessions"] = args.max_sessions
    if args.chunk_size is not None: custom_args["chunk_size"] = args.chunk_size
    if args.port is not None: custom_args["port"] = args.port
    if args.monitor_mode is not None: custom_args["monitor_mode"] = args.monitor_mode
    if args.stream: custom_args["stream"] = True
    if args.use_rate_limiter: custom_args["use_rate_limiter"] = True
    if args.keep_server_alive: custom_args["keep_server_alive"] = True
    if args.use_existing_site: custom_args["use_existing_site"] = True
    if args.skip_generation: custom_args["skip_generation"] = True
    if args.keep_site: custom_args["keep_site"] = True
    # Clean flags are handled by the 'clean' argument passed to run_benchmark

    # Validate custom config requirements
    if args.config == "custom":
        required_custom = ["urls", "max_sessions", "chunk_size"]
        # Report the flag spelling (hyphens), not the internal key (underscores)
        missing = [f"--{arg.replace('_', '-')}" for arg in required_custom if arg not in custom_args]
        if missing:
            console.print(f"[bold red]Error: 'custom' config requires: {', '.join(missing)}[/bold red]")
            return 1

    success = run_benchmark(
        config_name=args.config,
        custom_args=custom_args,  # Pass all collected custom args
        compare=not args.no_report,
        clean=args.clean
    )
    return 0 if success else 1


if __name__ == "__main__":
    sys.exit(main())
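The stress-test scripts in this commit all follow the same fan-out shape: split the URL list into fixed-size chunks, then run one task per chunk under an `asyncio.Semaphore` so only `max_concurrent` requests are in flight at once. Stripped of HTTP details, the core pattern looks like this (a standard-library sketch, not code from the scripts themselves):

```python
import asyncio

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

async def run_limited(chunks, max_concurrent):
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(chunk):
        async with sem:             # at most max_concurrent workers inside
            await asyncio.sleep(0)  # stand-in for the real API call
            return len(chunk)

    # gather() preserves submission order regardless of completion order
    return await asyncio.gather(*(worker(c) for c in chunks))

results = asyncio.run(run_limited(chunked(list(range(25)), 10), max_concurrent=4))
# results == [10, 10, 5]
```

Because `gather` preserves order, per-chunk results can be matched back to their URL chunks for the batch-progress log lines without extra bookkeeping.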
516
tests/memory/test_stress_api.py
Normal file
@@ -0,0 +1,516 @@
#!/usr/bin/env python3
"""
Stress test for Crawl4AI's Docker API server (/crawl and /crawl/stream endpoints).

This version targets a running Crawl4AI API server, sending concurrent requests
to test its ability to handle multiple crawl jobs simultaneously.
It uses httpx for async HTTP requests and logs results per batch of requests,
including server-side memory usage reported by the API.
"""

import asyncio
import time
import uuid
import argparse
import json
import sys
import os
import shutil
from typing import List, Dict, Optional, Union, AsyncGenerator, Tuple
import httpx
import pathlib  # Import pathlib explicitly
from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax

# --- Constants ---
# DEFAULT_API_URL = "http://localhost:11235"  # Default port
DEFAULT_API_URL = "http://localhost:8020"  # Default port
DEFAULT_URL_COUNT = 1000
DEFAULT_MAX_CONCURRENT_REQUESTS = 5
DEFAULT_CHUNK_SIZE = 10
DEFAULT_REPORT_PATH = "reports_api"
DEFAULT_STREAM_MODE = False
REQUEST_TIMEOUT = 180.0

# Initialize Rich console
console = Console()

# --- API Health Check (unchanged) ---
async def check_server_health(client: httpx.AsyncClient, health_endpoint: str = "/health"):
    """Check if the API server is healthy."""
    console.print(f"[bold cyan]Checking API server health at {client.base_url}{health_endpoint}...[/]", end="")
    try:
        response = await client.get(health_endpoint, timeout=10.0)
        response.raise_for_status()
        health_data = response.json()
        version = health_data.get('version', 'N/A')
        console.print(f"[bold green] Server OK! Version: {version}[/]")
        return True
    except (httpx.RequestError, httpx.HTTPStatusError) as e:
        console.print("\n[bold red]Server health check FAILED:[/]")
        console.print(f"Error: {e}")
        console.print(f"Is the server running and accessible at {client.base_url}?")
        return False
    except Exception as e:
        console.print("\n[bold red]An unexpected error occurred during the health check:[/]")
        console.print(e)
        return False

# --- API Stress Test Class ---
class ApiStressTest:
    """Orchestrates the stress test by sending concurrent requests to the API."""

    def __init__(
        self,
        api_url: str,
        url_count: int,
        max_concurrent_requests: int,
        chunk_size: int,
        report_path: str,
        stream_mode: bool,
    ):
        self.api_base_url = api_url.rstrip('/')
        self.url_count = url_count
        self.max_concurrent_requests = max_concurrent_requests
        self.chunk_size = chunk_size
        self.report_path = pathlib.Path(report_path)
        self.report_path.mkdir(parents=True, exist_ok=True)
        self.stream_mode = stream_mode

        self.test_id = time.strftime("%Y%m%d_%H%M%S")
        self.results_summary = {
            "test_id": self.test_id, "api_url": api_url, "url_count": url_count,
            "max_concurrent_requests": max_concurrent_requests, "chunk_size": chunk_size,
            "stream_mode": stream_mode, "start_time": "", "end_time": "",
            "total_time_seconds": 0, "successful_requests": 0, "failed_requests": 0,
            "successful_urls": 0, "failed_urls": 0, "total_urls_processed": 0,
            "total_api_calls": 0,
            "server_memory_metrics": {  # Aggregated server memory info
                "batch_mode_avg_delta_mb": None,
                "batch_mode_max_delta_mb": None,
                "stream_mode_avg_max_snapshot_mb": None,
                "stream_mode_max_max_snapshot_mb": None,
                "samples": []  # Individual request memory results
            }
        }
        self.http_client = httpx.AsyncClient(
            base_url=self.api_base_url,
            timeout=REQUEST_TIMEOUT,
            limits=httpx.Limits(max_connections=max_concurrent_requests + 5,
                                max_keepalive_connections=max_concurrent_requests))

    async def close_client(self):
        """Close the httpx client."""
        await self.http_client.aclose()

    async def run(self) -> Dict:
        """Run the API stress test."""
        # No client-side memory tracker needed
        urls_to_process = [f"https://httpbin.org/anything/{uuid.uuid4()}" for _ in range(self.url_count)]
        url_chunks = [urls_to_process[i:i + self.chunk_size]
                      for i in range(0, len(urls_to_process), self.chunk_size)]

        self.results_summary["start_time"] = time.strftime("%Y-%m-%d %H:%M:%S")
        start_time = time.time()

        console.print(f"\n[bold cyan]Crawl4AI API Stress Test - {self.url_count} URLs, {self.max_concurrent_requests} concurrent requests[/bold cyan]")
        console.print(f"[bold cyan]Target API:[/bold cyan] {self.api_base_url}, [bold cyan]Mode:[/bold cyan] {'Streaming' if self.stream_mode else 'Batch'}, [bold cyan]URLs per Request:[/bold cyan] {self.chunk_size}")

        semaphore = asyncio.Semaphore(self.max_concurrent_requests)

        # Batch logging header (includes server memory peak and delta/max columns)
        console.print("\n[bold]API Request Batch Progress:[/bold]")
        console.print("[bold] Batch | Progress | SrvMem Peak / Δ|Max (MB) | Reqs/sec | S/F URLs | Time (s) | Status [/bold]")
        console.print("─" * 95)

        tasks = []
        total_api_calls = len(url_chunks)
        self.results_summary["total_api_calls"] = total_api_calls

        try:
            for i, chunk in enumerate(url_chunks):
                task = asyncio.create_task(self._make_api_request(
                    chunk=chunk,
                    batch_idx=i + 1,
                    total_batches=total_api_calls,
                    semaphore=semaphore
                ))
                tasks.append(task)

            api_results = await asyncio.gather(*tasks)

            # Process aggregated results, including server memory
            total_successful_requests = sum(1 for r in api_results if r['request_success'])
            total_failed_requests = total_api_calls - total_successful_requests
            total_successful_urls = sum(r['success_urls'] for r in api_results)
            total_failed_urls = sum(r['failed_urls'] for r in api_results)
            total_urls_processed = total_successful_urls + total_failed_urls

            # Aggregate server memory metrics (only results with valid memory data)
            valid_samples = [r for r in api_results if r.get('server_delta_or_max_mb') is not None]
            self.results_summary["server_memory_metrics"]["samples"] = valid_samples

            if valid_samples:
                delta_or_max_values = [r['server_delta_or_max_mb'] for r in valid_samples]
                if self.stream_mode:
                    # Stream mode: delta_or_max holds the max snapshot
                    self.results_summary["server_memory_metrics"]["stream_mode_avg_max_snapshot_mb"] = sum(delta_or_max_values) / len(delta_or_max_values)
                    self.results_summary["server_memory_metrics"]["stream_mode_max_max_snapshot_mb"] = max(delta_or_max_values)
                else:  # Batch mode: delta_or_max holds the delta
                    self.results_summary["server_memory_metrics"]["batch_mode_avg_delta_mb"] = sum(delta_or_max_values) / len(delta_or_max_values)
                    self.results_summary["server_memory_metrics"]["batch_mode_max_delta_mb"] = max(delta_or_max_values)

                    # Aggregate peak values for batch mode
                    peak_values = [r['server_peak_memory_mb'] for r in valid_samples if r.get('server_peak_memory_mb') is not None]
                    if peak_values:
                        self.results_summary["server_memory_metrics"]["batch_mode_avg_peak_mb"] = sum(peak_values) / len(peak_values)
                        self.results_summary["server_memory_metrics"]["batch_mode_max_peak_mb"] = max(peak_values)

            self.results_summary.update({
                "successful_requests": total_successful_requests,
                "failed_requests": total_failed_requests,
                "successful_urls": total_successful_urls,
                "failed_urls": total_failed_urls,
                "total_urls_processed": total_urls_processed,
            })

        except Exception as e:
            console.print(f"[bold red]An error occurred during task execution: {e}[/bold red]")
            import traceback
            traceback.print_exc()

        end_time = time.time()
        self.results_summary.update({
            "end_time": time.strftime("%Y-%m-%d %H:%M:%S"),
            "total_time_seconds": end_time - start_time,
        })
        self._save_results()
        return self.results_summary

async def _make_api_request(
|
||||
self,
|
||||
chunk: List[str],
|
||||
batch_idx: int,
|
||||
total_batches: int,
|
||||
semaphore: asyncio.Semaphore
|
||||
# No memory tracker
|
||||
) -> Dict:
|
||||
"""Makes a single API request for a chunk of URLs, handling concurrency and logging server memory."""
|
||||
request_success = False
|
||||
success_urls = 0
|
||||
failed_urls = 0
|
||||
status = "Pending"
|
||||
status_color = "grey"
|
||||
server_memory_metric = None # Store delta (batch) or max snapshot (stream)
|
||||
api_call_start_time = time.time()
|
||||
|
||||
async with semaphore:
|
||||
try:
|
||||
# No client memory sampling
|
||||
|
||||
endpoint = "/crawl/stream" if self.stream_mode else "/crawl"
|
||||
payload = {
|
||||
"urls": chunk,
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"cache_mode": "BYPASS", "stream": self.stream_mode}
|
||||
}
|
||||
}
|
||||
|
||||
if self.stream_mode:
|
||||
max_server_mem_snapshot = 0.0 # Track max memory seen in this stream
|
||||
async with self.http_client.stream("POST", endpoint, json=payload) as response:
|
||||
initial_status_code = response.status_code
|
||||
response.raise_for_status()
|
||||
|
||||
completed_marker_received = False
|
||||
async for line in response.aiter_lines():
|
||||
if line:
|
||||
try:
|
||||
data = json.loads(line)
|
||||
if data.get("status") == "completed":
|
||||
completed_marker_received = True
|
||||
break
|
||||
elif data.get("url"):
|
||||
if data.get("success"): success_urls += 1
|
||||
else: failed_urls += 1
|
||||
# Extract server memory snapshot per result
|
||||
mem_snapshot = data.get('server_memory_mb')
|
||||
if mem_snapshot is not None:
|
||||
max_server_mem_snapshot = max(max_server_mem_snapshot, float(mem_snapshot))
|
||||
except json.JSONDecodeError:
|
||||
console.print(f"[Batch {batch_idx}] [red]Stream decode error for line:[/red] {line}")
|
||||
failed_urls = len(chunk)
|
||||
break
|
||||
request_success = completed_marker_received
|
||||
if not request_success:
|
||||
failed_urls = len(chunk) - success_urls
|
||||
server_memory_metric = max_server_mem_snapshot # Use max snapshot for stream logging
|
||||
|
||||
else: # Batch mode
|
||||
response = await self.http_client.post(endpoint, json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
# Extract server memory delta from the response
|
||||
server_memory_metric = data.get('server_memory_delta_mb')
|
||||
server_peak_mem_mb = data.get('server_peak_memory_mb')
|
||||
|
||||
if data.get("success") and "results" in data:
|
||||
request_success = True
|
||||
results_list = data.get("results", [])
|
||||
for result_item in results_list:
|
||||
if result_item.get("success"): success_urls += 1
|
||||
else: failed_urls += 1
|
||||
if len(results_list) != len(chunk):
|
||||
console.print(f"[Batch {batch_idx}] [yellow]Warning: Result count ({len(results_list)}) doesn't match URL count ({len(chunk)})[/yellow]")
|
||||
failed_urls = len(chunk) - success_urls
|
||||
else:
|
||||
request_success = False
|
||||
failed_urls = len(chunk)
|
||||
# Try to get memory from error detail if available
|
||||
detail = data.get('detail')
|
||||
if isinstance(detail, str):
|
||||
try: detail_json = json.loads(detail)
|
||||
except: detail_json = {}
|
||||
elif isinstance(detail, dict):
|
||||
detail_json = detail
|
||||
else: detail_json = {}
|
||||
server_peak_mem_mb = detail_json.get('server_peak_memory_mb', None)
|
||||
server_memory_metric = detail_json.get('server_memory_delta_mb', None)
|
||||
console.print(f"[Batch {batch_idx}] [red]API request failed:[/red] {detail_json.get('error', 'No details')}")
|
||||
|
||||
|
||||
        except httpx.HTTPStatusError as e:
            request_success = False
            failed_urls = len(chunk)
            console.print(f"[Batch {batch_idx}] [bold red]HTTP Error {e.response.status_code}:[/] {e.request.url}")
            try:
                error_detail = e.response.json()
                # Attempt to extract memory info even from error responses
                detail_content = error_detail.get('detail', {})
                if isinstance(detail_content, str):  # Handle a stringified-JSON detail
                    try:
                        detail_content = json.loads(detail_content)
                    except json.JSONDecodeError:
                        detail_content = {}
                server_memory_metric = detail_content.get('server_memory_delta_mb')
                server_peak_mem_mb = detail_content.get('server_peak_memory_mb')
                console.print(f"Response: {error_detail}")
            except Exception:
                console.print(f"Response Text: {e.response.text[:200]}...")
        except httpx.RequestError as e:
            request_success = False
            failed_urls = len(chunk)
            console.print(f"[Batch {batch_idx}] [bold red]Request Error:[/bold red] {e.request.url} - {e}")
        except Exception as e:
            request_success = False
            failed_urls = len(chunk)
            console.print(f"[Batch {batch_idx}] [bold red]Unexpected Error:[/bold red] {e}")
            import traceback
            traceback.print_exc()
        finally:
            api_call_time = time.time() - api_call_start_time
            total_processed_urls = success_urls + failed_urls

            if request_success and failed_urls == 0:
                status_color, status = "green", "Success"
            elif request_success and success_urls > 0:
                status_color, status = "yellow", "Partial"
            else:
                status_color, status = "red", "Failed"

            current_total_urls = batch_idx * self.chunk_size
            progress_pct = min(100.0, (current_total_urls / self.url_count) * 100)
            reqs_per_sec = 1.0 / api_call_time if api_call_time > 0 else float('inf')

            # --- Memory formatting ---
            mem_display = " N/A "  # Default
            peak_mem_value = None
            delta_or_max_value = None

            if self.stream_mode:
                # In stream mode, server_memory_metric holds the max snapshot
                if server_memory_metric is not None:
                    mem_display = f"{server_memory_metric:.1f} (Max)"
                    delta_or_max_value = server_memory_metric  # Store for aggregation
            else:
                # Batch mode - expect both a peak and a delta from the API response
                peak_mem_value = server_peak_mem_mb
                delta_value = server_memory_metric  # Holds the delta in batch mode

                if peak_mem_value is not None and delta_value is not None:
                    mem_display = f"{peak_mem_value:.1f} / {delta_value:+.1f}"
                    delta_or_max_value = delta_value  # Store the delta for aggregation
                elif peak_mem_value is not None:
                    mem_display = f"{peak_mem_value:.1f} / N/A"
                elif delta_value is not None:
                    mem_display = f"N/A / {delta_value:+.1f}"
                    delta_or_max_value = delta_value  # Store the delta for aggregation

            # --- Batch status line (memory column widened to fit "peak / delta") ---
            console.print(
                f" {batch_idx:<5} | {progress_pct:6.1f}% | {mem_display:>24} | {reqs_per_sec:8.1f} | "
                f"{success_urls:^7}/{failed_urls:<6} | {api_call_time:8.2f} | [{status_color}]{status:<7}[/{status_color}] "
            )

        # Return both the peak (if available) and the delta/max metric
        return {
            "batch_idx": batch_idx,
            "request_success": request_success,
            "success_urls": success_urls,
            "failed_urls": failed_urls,
            "time": api_call_time,
            "server_peak_memory_mb": peak_mem_value,       # None in stream mode
            "server_delta_or_max_mb": delta_or_max_value,  # Delta for batch, max snapshot for stream
        }
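The memory column above combines a peak value and a signed delta in one fixed-width cell via Python's format-spec mini-language; a small standalone check of that formatting (the sample numbers are made up):

```python
peak, delta = 512.34, -7.89

# ":.1f" rounds to one decimal; ":+.1f" additionally forces a leading sign
cell = f"{peak:.1f} / {delta:+.1f}"
print(cell)  # 512.3 / -7.9

# ">24" right-aligns the cell in a 24-character column, as in the status line
padded = f"{cell:>24}"
print(len(padded))  # 24
```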
    # No _periodic_memory_sample needed

    def _save_results(self) -> None:
        """Saves the results summary to a JSON file."""
        results_path = self.report_path / f"api_test_summary_{self.test_id}.json"
        try:
            # No client memory path to convert
            with open(results_path, 'w', encoding='utf-8') as f:
                json.dump(self.results_summary, f, indent=2, default=str)
        except Exception as e:
            console.print(f"[bold red]Failed to save results summary: {e}[/bold red]")
# --- run_full_test Function ---
async def run_full_test(args):
    """Runs the full API stress test process."""
    async with httpx.AsyncClient(base_url=args.api_url, timeout=REQUEST_TIMEOUT) as client:
        if not await check_server_health(client):
            console.print("[bold red]Aborting test due to server health check failure.[/]")
            return

    test = ApiStressTest(
        api_url=args.api_url,
        url_count=args.urls,
        max_concurrent_requests=args.max_concurrent_requests,
        chunk_size=args.chunk_size,
        report_path=args.report_path,
        stream_mode=args.stream,
    )
    results = {}
    try:
        results = await test.run()
    finally:
        await test.close_client()

    if not results:
        console.print("[bold red]Test did not produce results.[/bold red]")
        return

    console.print("\n" + "=" * 80)
    console.print("[bold green]API Stress Test Completed[/bold green]")
    console.print("=" * 80)

    success_rate_reqs = results["successful_requests"] / results["total_api_calls"] * 100 if results["total_api_calls"] > 0 else 0
    success_rate_urls = results["successful_urls"] / results["url_count"] * 100 if results["url_count"] > 0 else 0
    urls_per_second = results["total_urls_processed"] / results["total_time_seconds"] if results["total_time_seconds"] > 0 else 0
    reqs_per_second = results["total_api_calls"] / results["total_time_seconds"] if results["total_time_seconds"] > 0 else 0

    console.print(f"[bold cyan]Test ID:[/bold cyan] {results['test_id']}")
    console.print(f"[bold cyan]Target API:[/bold cyan] {results['api_url']}")
    console.print(f"[bold cyan]Configuration:[/bold cyan] {results['url_count']} URLs, {results['max_concurrent_requests']} concurrent client requests, URLs/Req: {results['chunk_size']}, Stream: {results['stream_mode']}")
    console.print(f"[bold cyan]API Requests:[/bold cyan] {results['successful_requests']} successful, {results['failed_requests']} failed ({results['total_api_calls']} total, {success_rate_reqs:.1f}% success)")
    console.print(f"[bold cyan]URL Processing:[/bold cyan] {results['successful_urls']} successful, {results['failed_urls']} failed ({results['total_urls_processed']} processed, {success_rate_urls:.1f}% success)")
    console.print(f"[bold cyan]Performance:[/bold cyan] {results['total_time_seconds']:.2f}s total | Avg Reqs/sec: {reqs_per_second:.2f} | Avg URLs/sec: {urls_per_second:.2f}")
    # Report server memory
    mem_metrics = results.get("server_memory_metrics", {})
    mem_samples = mem_metrics.get("samples", [])
    if mem_samples:
        num_samples = len(mem_samples)
        if results['stream_mode']:
            avg_mem = mem_metrics.get("stream_mode_avg_max_snapshot_mb")
            max_mem = mem_metrics.get("stream_mode_max_max_snapshot_mb")
            avg_str = f"{avg_mem:.1f}" if avg_mem is not None else "N/A"
            max_str = f"{max_mem:.1f}" if max_mem is not None else "N/A"
            console.print(f"[bold cyan]Server Memory (Stream):[/bold cyan] Avg Max Snapshot: {avg_str} MB | Max Max Snapshot: {max_str} MB (across {num_samples} requests)")
        else:  # Batch mode
            avg_delta = mem_metrics.get("batch_mode_avg_delta_mb")
            max_delta = mem_metrics.get("batch_mode_max_delta_mb")
            avg_peak = mem_metrics.get("batch_mode_avg_peak_mb")
            max_peak = mem_metrics.get("batch_mode_max_peak_mb")

            avg_delta_str = f"{avg_delta:.1f}" if avg_delta is not None else "N/A"
            max_delta_str = f"{max_delta:.1f}" if max_delta is not None else "N/A"
            avg_peak_str = f"{avg_peak:.1f}" if avg_peak is not None else "N/A"
            max_peak_str = f"{max_peak:.1f}" if max_peak is not None else "N/A"

            console.print(f"[bold cyan]Server Memory (Batch):[/bold cyan] Avg Peak: {avg_peak_str} MB | Max Peak: {max_peak_str} MB | Avg Delta: {avg_delta_str} MB | Max Delta: {max_delta_str} MB (across {num_samples} requests)")
    else:
        console.print("[bold cyan]Server Memory:[/bold cyan] No memory data reported by server.")

    # No client memory report
    summary_path = pathlib.Path(args.report_path) / f"api_test_summary_{results['test_id']}.json"
    console.print(f"[bold green]Results summary saved to {summary_path}[/bold green]")

    if results["failed_requests"] > 0:
        console.print(f"\n[bold yellow]Warning: {results['failed_requests']} API requests failed ({100 - success_rate_reqs:.1f}% failure rate)[/bold yellow]")
    if results["failed_urls"] > 0:
        console.print(f"[bold yellow]Warning: {results['failed_urls']} URLs failed to process ({100 - success_rate_urls:.1f}% URL failure rate)[/bold yellow]")
    if results["total_urls_processed"] < results["url_count"]:
        console.print(f"\n[bold red]Error: Only {results['total_urls_processed']} out of {results['url_count']} target URLs were processed![/bold red]")
# --- main Function (argument parsing mostly unchanged) ---
def main():
    """Main entry point for the script."""
    parser = argparse.ArgumentParser(description="Crawl4AI API Server Stress Test")

    parser.add_argument("--api-url", type=str, default=DEFAULT_API_URL, help=f"Base URL of the Crawl4AI API server (default: {DEFAULT_API_URL})")
    parser.add_argument("--urls", type=int, default=DEFAULT_URL_COUNT, help=f"Total number of unique URLs to process via API calls (default: {DEFAULT_URL_COUNT})")
    parser.add_argument("--max-concurrent-requests", type=int, default=DEFAULT_MAX_CONCURRENT_REQUESTS, help=f"Maximum concurrent API requests from this client (default: {DEFAULT_MAX_CONCURRENT_REQUESTS})")
    parser.add_argument("--chunk-size", type=int, default=DEFAULT_CHUNK_SIZE, help=f"Number of URLs per API request payload (default: {DEFAULT_CHUNK_SIZE})")
    parser.add_argument("--stream", action="store_true", default=DEFAULT_STREAM_MODE, help=f"Use the /crawl/stream endpoint instead of /crawl (default: {DEFAULT_STREAM_MODE})")
    parser.add_argument("--report-path", type=str, default=DEFAULT_REPORT_PATH, help=f"Path to save reports and logs (default: {DEFAULT_REPORT_PATH})")
    parser.add_argument("--clean-reports", action="store_true", help="Clean up the report directory before running")

    args = parser.parse_args()

    console.print("[bold underline]Crawl4AI API Stress Test Configuration[/bold underline]")
    console.print(f"API URL: {args.api_url}")
    console.print(f"Total URLs: {args.urls}, Concurrent Client Requests: {args.max_concurrent_requests}, URLs per Request: {args.chunk_size}")
    console.print(f"Mode: {'Streaming' if args.stream else 'Batch'}")
    console.print(f"Report Path: {args.report_path}")
    console.print("-" * 40)
    if args.clean_reports:
        console.print("[cyan]Option: Clean reports before test[/cyan]")
    console.print("-" * 40)

    if args.clean_reports:
        report_dir = pathlib.Path(args.report_path)
        if report_dir.exists():
            console.print(f"[yellow]Cleaning up reports directory: {args.report_path}[/yellow]")
            shutil.rmtree(args.report_path)
        report_dir.mkdir(parents=True, exist_ok=True)

    try:
        asyncio.run(run_full_test(args))
    except KeyboardInterrupt:
        console.print("\n[bold yellow]Test interrupted by user.[/bold yellow]")
    except Exception as e:
        console.print(f"\n[bold red]An unexpected error occurred:[/bold red] {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    # No need to modify sys.path for SimpleMemoryTracker, as it has been removed
    main()
129
tests/memory/test_stress_docker_api.py
Normal file
@@ -0,0 +1,129 @@
"""
Crawl4AI Docker API stress tester.

Examples
--------
python test_stress_docker_api.py --urls 1000 --concurrency 32
python test_stress_docker_api.py --urls 1000 --concurrency 32 --stream
python test_stress_docker_api.py --base-url http://10.0.0.42:11235 --http2
"""

import argparse
import asyncio
import json
import secrets
import statistics
import time
from typing import List, Tuple

import httpx
from rich.console import Console
from rich.progress import Progress, BarColumn, TimeElapsedColumn, TimeRemainingColumn
from rich.table import Table

console = Console()


# ───────────────────────── helpers ─────────────────────────
def make_fake_urls(n: int) -> List[str]:
    base = "https://httpbin.org/anything/"
    return [f"{base}{secrets.token_hex(8)}" for _ in range(n)]
async def fire(
    client: httpx.AsyncClient, endpoint: str, payload: dict, sem: asyncio.Semaphore
) -> Tuple[bool, float]:
    async with sem:
        console.print(f"POST {endpoint} with {len(payload['urls'])} URLs")
        t0 = time.perf_counter()
        try:
            if endpoint.endswith("/stream"):
                async with client.stream("POST", endpoint, json=payload) as r:
                    r.raise_for_status()
                    async for _ in r.aiter_lines():
                        pass
            else:
                r = await client.post(endpoint, json=payload)
                r.raise_for_status()
            return True, time.perf_counter() - t0
        except Exception:
            return False, time.perf_counter() - t0


def pct(lat: List[float], p: float) -> str:
    """Return a percentile string even for tiny samples."""
    if not lat:
        return "-"
    if len(lat) == 1:
        return f"{lat[0]:.2f}s"
    lat_sorted = sorted(lat)
    k = (p / 100) * (len(lat_sorted) - 1)
    lo = int(k)
    hi = min(lo + 1, len(lat_sorted) - 1)
    frac = k - lo
    val = lat_sorted[lo] * (1 - frac) + lat_sorted[hi] * frac
    return f"{val:.2f}s"
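The `pct` helper computes percentiles by linear interpolation between the two closest ranks (the same scheme NumPy uses by default); a self-contained sketch of the formula with fabricated latency samples:

```python
def percentile_linear(samples, p):
    """Closest-ranks linear interpolation, as used by the pct() helper."""
    xs = sorted(samples)
    k = (p / 100) * (len(xs) - 1)  # fractional rank into the sorted list
    lo, frac = int(k), k - int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] * (1 - frac) + xs[hi] * frac

# The median of an odd-length list falls exactly on a rank:
print(percentile_linear([3.0, 1.0, 2.0], 50))  # 2.0

# p95 of four samples interpolates between the 3rd and 4th values:
print(round(percentile_linear([1.0, 2.0, 3.0, 4.0], 95), 2))  # 3.85
```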


# ───────────────────────── main ─────────────────────────
def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Stress test Crawl4AI Docker API")
    p.add_argument("--urls", type=int, default=100, help="number of URLs")
    p.add_argument("--concurrency", type=int, default=1, help="max POSTs in flight")
    p.add_argument("--chunk-size", type=int, default=50, help="URLs per request")
    p.add_argument("--base-url", default="http://localhost:11235", help="API root")
    p.add_argument("--stream", action="store_true", help="use /crawl/stream")
    p.add_argument("--http2", action="store_true", help="enable HTTP/2")
    p.add_argument("--headless", action="store_true", default=True)
    return p.parse_args()


async def main() -> None:
    args = parse_args()

    urls = make_fake_urls(args.urls)
    batches = [urls[i : i + args.chunk_size] for i in range(0, len(urls), args.chunk_size)]
    endpoint = "/crawl/stream" if args.stream else "/crawl"
    sem = asyncio.Semaphore(args.concurrency)

    async with httpx.AsyncClient(base_url=args.base_url, http2=args.http2, timeout=None) as client:
        with Progress(
            "[progress.description]{task.description}",
            BarColumn(),
            "[progress.percentage]{task.percentage:>3.0f}%",
            TimeElapsedColumn(),
            TimeRemainingColumn(),
        ) as progress:
            task_id = progress.add_task("[cyan]bombarding…", total=len(batches))
            tasks = []
            for chunk in batches:
                payload = {
                    "urls": chunk,
                    "browser_config": {"type": "BrowserConfig", "params": {"headless": args.headless}},
                    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "BYPASS", "stream": args.stream}},
                }
                tasks.append(asyncio.create_task(fire(client, endpoint, payload, sem)))
                progress.advance(task_id)

            results = await asyncio.gather(*tasks)

    ok_latencies = [dt for ok, dt in results if ok]
    err_count = sum(1 for ok, _ in results if not ok)

    table = Table(title="Docker API Stress‑Test Summary")
    table.add_column("total", justify="right")
    table.add_column("errors", justify="right")
    table.add_column("p50", justify="right")
    table.add_column("p95", justify="right")
    table.add_column("max", justify="right")

    table.add_row(
        str(len(results)),
        str(err_count),
        pct(ok_latencies, 50),
        pct(ok_latencies, 95),
        f"{max(ok_latencies):.2f}s" if ok_latencies else "-",
    )
    console.print(table)


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        console.print("\n[yellow]aborted by user[/]")
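Both stress scripts split the URL list into request-sized batches with the same slicing idiom (`urls[i : i + chunk_size]`); a standalone check of that chunker, with a hypothetical `chunked` helper name:

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]

# 7 URLs with chunk size 3 yield two full batches plus one remainder:
print(chunked(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```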
500
tests/memory/test_stress_sdk.py
Normal file
@@ -0,0 +1,500 @@
#!/usr/bin/env python3
"""
Stress test for Crawl4AI's arun_many and dispatcher system.
This version uses a local HTTP server and focuses on testing
the SDK's ability to handle multiple URLs concurrently, with per-batch logging.
"""

import argparse
import asyncio
import json
import os
import pathlib
import random
import secrets
import shutil
import signal
import subprocess
import sys
import time
from typing import AsyncGenerator, Dict, List, Optional, Union

from rich.console import Console

# Crawl4AI components
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    BrowserConfig,
    MemoryAdaptiveDispatcher,
    CrawlerMonitor,
    DisplayMode,
    CrawlResult,
    RateLimiter,
    CacheMode,
)

# Constants
DEFAULT_SITE_PATH = "test_site"
DEFAULT_PORT = 8000
DEFAULT_MAX_SESSIONS = 16
DEFAULT_URL_COUNT = 1
DEFAULT_CHUNK_SIZE = 1  # Chunk size used for per-batch logging
DEFAULT_REPORT_PATH = "reports"
DEFAULT_STREAM_MODE = False
DEFAULT_MONITOR_MODE = "DETAILED"

# Initialize Rich console
console = Console()
# --- SiteGenerator Class (Unchanged) ---
class SiteGenerator:
    """Generates a local test site with heavy pages for stress testing."""

    def __init__(self, site_path: str = DEFAULT_SITE_PATH, page_count: int = DEFAULT_URL_COUNT):
        self.site_path = pathlib.Path(site_path)
        self.page_count = page_count
        self.images_dir = self.site_path / "images"
        self.lorem_words = " ".join("lorem ipsum dolor sit amet " * 100).split()

        self.html_template = """<!doctype html>
<html>
<head>
    <title>Test Page {page_num}</title>
    <meta charset="utf-8">
</head>
<body>
    <h1>Test Page {page_num}</h1>
    {paragraphs}
    {images}
</body>
</html>
"""

    def generate_site(self) -> None:
        self.site_path.mkdir(parents=True, exist_ok=True)
        self.images_dir.mkdir(exist_ok=True)
        console.print(f"Generating {self.page_count} test pages...")
        for i in range(self.page_count):
            paragraphs = "\n".join(f"<p>{' '.join(random.choices(self.lorem_words, k=200))}</p>" for _ in range(5))
            images = "\n".join(f'<img src="https://picsum.photos/seed/{secrets.token_hex(8)}/300/200" loading="lazy" alt="Random image {j}"/>' for j in range(3))
            page_path = self.site_path / f"page_{i}.html"
            page_path.write_text(self.html_template.format(page_num=i, paragraphs=paragraphs, images=images), encoding="utf-8")
            if (i + 1) % (self.page_count // 10 or 1) == 0 or i == self.page_count - 1:
                console.print(f"Generated {i + 1}/{self.page_count} pages")
        self._create_index_page()
        console.print(f"[bold green]Successfully generated {self.page_count} test pages in [cyan]{self.site_path}[/cyan][/bold green]")

    def _create_index_page(self) -> None:
        index_content = """<!doctype html><html><head><title>Test Site Index</title><meta charset="utf-8"></head><body><h1>Test Site Index</h1><p>This is an automatically generated site for testing Crawl4AI.</p><div class="page-links">\n"""
        for i in range(self.page_count):
            index_content += f'    <a href="page_{i}.html">Test Page {i}</a><br>\n'
        index_content += """    </div></body></html>"""
        (self.site_path / "index.html").write_text(index_content, encoding="utf-8")
# --- LocalHttpServer Class (Unchanged) ---
class LocalHttpServer:
    """Manages a local HTTP server for serving test pages."""

    def __init__(self, site_path: str = DEFAULT_SITE_PATH, port: int = DEFAULT_PORT):
        self.site_path = pathlib.Path(site_path)
        self.port = port
        self.process = None

    def start(self) -> None:
        if not self.site_path.exists():
            raise FileNotFoundError(f"Site directory {self.site_path} does not exist")
        console.print(f"Attempting to start HTTP server in [cyan]{self.site_path}[/cyan] on port {self.port}...")
        try:
            cmd = ["python", "-m", "http.server", str(self.port)]
            creationflags = 0
            if sys.platform == 'win32':
                creationflags = subprocess.CREATE_NEW_PROCESS_GROUP
            self.process = subprocess.Popen(cmd, cwd=str(self.site_path), stdout=subprocess.PIPE, stderr=subprocess.PIPE, creationflags=creationflags)
            time.sleep(1.5)
            if self.is_running():
                console.print(f"[bold green]HTTP server started successfully (PID: {self.process.pid})[/bold green]")
            else:
                console.print("[bold red]Failed to start HTTP server. Checking logs...[/bold red]")
                stdout, stderr = self.process.communicate()
                print(stdout.decode(errors='ignore'))
                print(stderr.decode(errors='ignore'))
                self.stop()
                raise RuntimeError("HTTP server failed to start.")
        except Exception as e:
            console.print(f"[bold red]Error starting HTTP server: {str(e)}[/bold red]")
            self.stop()
            raise

    def stop(self) -> None:
        if self.process and self.is_running():
            console.print(f"Stopping HTTP server (PID: {self.process.pid})...")
            try:
                if sys.platform == 'win32':
                    self.process.send_signal(signal.CTRL_BREAK_EVENT)
                    time.sleep(0.5)
                self.process.terminate()
                try:
                    self.process.communicate(timeout=5)
                    console.print("[bold yellow]HTTP server stopped[/bold yellow]")
                except subprocess.TimeoutExpired:
                    console.print("[bold red]Server did not terminate gracefully, killing...[/bold red]")
                    self.process.kill()
                    self.process.communicate()
                    console.print("[bold yellow]HTTP server killed[/bold yellow]")
            except Exception as e:
                console.print(f"[bold red]Error stopping HTTP server: {str(e)}[/bold red]")
                self.process.kill()
            finally:
                self.process = None
        elif self.process:
            console.print("[dim]HTTP server process already stopped.[/dim]")
            self.process = None

    def is_running(self) -> bool:
        if not self.process:
            return False
        return self.process.poll() is None
# --- SimpleMemoryTracker Class (Unchanged) ---
class SimpleMemoryTracker:
    """Basic memory tracker that doesn't rely on psutil."""

    def __init__(self, report_path: str = DEFAULT_REPORT_PATH, test_id: Optional[str] = None):
        self.report_path = pathlib.Path(report_path)
        self.report_path.mkdir(parents=True, exist_ok=True)
        self.test_id = test_id or time.strftime("%Y%m%d_%H%M%S")
        self.start_time = time.time()
        self.memory_samples = []
        self.pid = os.getpid()
        self.csv_path = self.report_path / f"memory_samples_{self.test_id}.csv"
        with open(self.csv_path, 'w', encoding='utf-8') as f:
            f.write("timestamp,elapsed_seconds,memory_info_mb\n")

    def sample(self) -> Dict:
        try:
            memory_mb = self._get_memory_info_mb()
            memory_str = f"{memory_mb:.1f} MB" if memory_mb is not None else "Unknown"
            timestamp = time.time()
            elapsed = timestamp - self.start_time
            sample = {"timestamp": timestamp, "elapsed_seconds": elapsed, "memory_mb": memory_mb, "memory_str": memory_str}
            self.memory_samples.append(sample)
            with open(self.csv_path, 'a', encoding='utf-8') as f:
                f.write(f"{timestamp},{elapsed:.2f},{memory_mb if memory_mb is not None else ''}\n")
            return sample
        except Exception:
            return {"memory_mb": None, "memory_str": "Error"}

    def _get_memory_info_mb(self) -> Optional[float]:
        pid_str = str(self.pid)
        try:
            if sys.platform == 'darwin':
                # `ps -o rss=` reports the resident set size in KiB on macOS
                result = subprocess.run(["ps", "-o", "rss=", "-p", pid_str], capture_output=True, text=True, check=True, encoding='utf-8')
                return int(result.stdout.strip()) / 1024.0
            elif sys.platform == 'linux':
                with open(f"/proc/{pid_str}/status", encoding='utf-8') as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            return int(line.split()[1]) / 1024.0
                return None
            elif sys.platform == 'win32':
                result = subprocess.run(["tasklist", "/fi", f"PID eq {pid_str}", "/fo", "csv", "/nh"], capture_output=True, text=True, check=True, encoding='cp850', errors='ignore')
                parts = result.stdout.strip().split('","')
                if len(parts) >= 5:
                    return int(parts[4].strip().replace('"', '').replace(' K', '').replace(',', '')) / 1024.0
                return None
            else:
                return None
        except Exception:
            # Swallow all errors for robustness; a missing sample is acceptable
            return None

    def get_report(self) -> Dict:
        if not self.memory_samples:
            return {"error": "No memory samples collected"}
        total_time = time.time() - self.start_time
        valid_samples = [s['memory_mb'] for s in self.memory_samples if s['memory_mb'] is not None]
        start_mem = valid_samples[0] if valid_samples else None
        end_mem = valid_samples[-1] if valid_samples else None
        max_mem = max(valid_samples) if valid_samples else None
        avg_mem = sum(valid_samples) / len(valid_samples) if valid_samples else None
        growth = (end_mem - start_mem) if start_mem is not None and end_mem is not None else None
        return {
            "test_id": self.test_id,
            "total_time_seconds": total_time,
            "sample_count": len(self.memory_samples),
            "valid_sample_count": len(valid_samples),
            "csv_path": str(self.csv_path),
            "platform": sys.platform,
            "start_memory_mb": start_mem,
            "end_memory_mb": end_mem,
            "max_memory_mb": max_mem,
            "average_memory_mb": avg_mem,
            "memory_growth_mb": growth,
        }
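On Linux, `_get_memory_info_mb` reads the `VmRSS:` line from `/proc/<pid>/status`, which reports the resident set size in kB; the parsing and kB-to-MB conversion can be checked in isolation against a fabricated status snippet (the numbers below are made up):

```python
def parse_vmrss_mb(status_text: str):
    """Extract VmRSS from /proc/<pid>/status content and convert kB -> MB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Field 1 is the numeric value, field 2 the "kB" unit
            return int(line.split()[1]) / 1024.0
    return None  # Line absent, e.g. for a swapped-out or kernel process

fake_status = "Name:\tpython\nVmRSS:\t   204800 kB\nThreads:\t12\n"
print(parse_vmrss_mb(fake_status))  # 200.0
```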


# --- CrawlerStressTest Class (Refactored for Per-Batch Logging) ---
class CrawlerStressTest:
    """Orchestrates the stress test using arun_many per chunk and a dispatcher."""

    def __init__(
        self,
        url_count: int = DEFAULT_URL_COUNT,
        port: int = DEFAULT_PORT,
        max_sessions: int = DEFAULT_MAX_SESSIONS,
        chunk_size: int = DEFAULT_CHUNK_SIZE,  # URLs per arun_many call
        report_path: str = DEFAULT_REPORT_PATH,
        stream_mode: bool = DEFAULT_STREAM_MODE,
        monitor_mode: str = DEFAULT_MONITOR_MODE,
        use_rate_limiter: bool = False
    ):
        self.url_count = url_count
        self.server_port = port
        self.max_sessions = max_sessions
        self.chunk_size = chunk_size
        self.report_path = pathlib.Path(report_path)
        self.report_path.mkdir(parents=True, exist_ok=True)
        self.stream_mode = stream_mode
        self.monitor_mode = DisplayMode[monitor_mode.upper()]
        self.use_rate_limiter = use_rate_limiter

        self.test_id = time.strftime("%Y%m%d_%H%M%S")
        self.results_summary = {
            "test_id": self.test_id, "url_count": url_count, "max_sessions": max_sessions,
            "chunk_size": chunk_size, "stream_mode": stream_mode, "monitor_mode": monitor_mode,
            "rate_limiter_used": use_rate_limiter, "start_time": "", "end_time": "",
            "total_time_seconds": 0, "successful_urls": 0, "failed_urls": 0,
            "urls_processed": 0, "chunks_processed": 0
        }
    async def run(self) -> Dict:
        """Run the stress test and return results."""
        memory_tracker = SimpleMemoryTracker(report_path=self.report_path, test_id=self.test_id)
        urls = [f"http://localhost:{self.server_port}/page_{i}.html" for i in range(self.url_count)]
        # Split URLs into chunks based on self.chunk_size
        url_chunks = [urls[i:i + self.chunk_size] for i in range(0, len(urls), self.chunk_size)]

        self.results_summary["start_time"] = time.strftime("%Y-%m-%d %H:%M:%S")
        start_time = time.time()

        config = CrawlerRunConfig(
            wait_for_images=False,
            verbose=False,
            stream=self.stream_mode,  # Affects the arun_many return type
            cache_mode=CacheMode.BYPASS
        )

        total_successful_urls = 0
        total_failed_urls = 0
        total_urls_processed = 0
        start_memory_sample = memory_tracker.sample()
        start_memory_str = start_memory_sample.get("memory_str", "Unknown")

        # monitor = CrawlerMonitor(display_mode=self.monitor_mode, total_urls=self.url_count)
        monitor = None
        rate_limiter = RateLimiter(base_delay=(0.1, 0.3)) if self.use_rate_limiter else None
        dispatcher = MemoryAdaptiveDispatcher(max_session_permit=self.max_sessions, monitor=monitor, rate_limiter=rate_limiter)

        console.print(f"\n[bold cyan]Crawl4AI Stress Test - {self.url_count} URLs, {self.max_sessions} max sessions[/bold cyan]")
        console.print(f"[bold cyan]Mode:[/bold cyan] {'Streaming' if self.stream_mode else 'Batch'}, [bold cyan]Monitor:[/bold cyan] {self.monitor_mode.name}, [bold cyan]Chunk Size:[/bold cyan] {self.chunk_size}")
        console.print(f"[bold cyan]Initial Memory:[/bold cyan] {start_memory_str}")

        # Print the batch log header only when not streaming
        if not self.stream_mode:
            console.print("\n[bold]Batch Progress:[/bold] (Monitor below shows overall progress)")
            console.print("[bold] Batch | Progress | Start Mem | End Mem | URLs/sec | Success/Fail | Time (s) | Status [/bold]")
            console.print("─" * 90)

        monitor_task = asyncio.create_task(self._periodic_memory_sample(memory_tracker, 2.0))
        try:
            async with AsyncWebCrawler(
                config=BrowserConfig(verbose=False)
            ) as crawler:
                # Process URLs chunk by chunk
                for chunk_idx, url_chunk in enumerate(url_chunks):
                    batch_start_time = time.time()
                    chunk_success = 0
                    chunk_failed = 0

                    # Sample memory before the chunk
                    start_mem_sample = memory_tracker.sample()
                    start_mem_str = start_mem_sample.get("memory_str", "Unknown")

                    # --- Call arun_many for the current chunk ---
                    try:
                        # Note: the dispatcher/monitor persist across calls
                        results_gen_or_list: Union[AsyncGenerator[CrawlResult, None], List[CrawlResult]] = \
                            await crawler.arun_many(
                                urls=url_chunk,
                                config=config,
                                dispatcher=dispatcher  # Reuse the same dispatcher
                            )

                        if self.stream_mode:
                            # Consume the stream; per-batch logging is less meaningful here.
                            # Tracking completion per chunk asynchronously would be possible
                            # but adds complexity.
                            async for result in results_gen_or_list:
                                total_urls_processed += 1
                                if result.success:
                                    chunk_success += 1
                                else:
                                    chunk_failed += 1
                        else:  # Batch mode
                            # Process the list of results for this chunk
                            for result in results_gen_or_list:
                                total_urls_processed += 1
                                if result.success:
                                    chunk_success += 1
                                else:
                                    chunk_failed += 1

                    except Exception as e:
                        console.print(f"[bold red]Error processing chunk {chunk_idx + 1}: {e}[/bold red]")
                        chunk_failed = len(url_chunk)  # Assume every URL in the chunk failed
                        total_urls_processed += len(url_chunk)  # Count them as processed (failed)
                    # --- Log batch results (only if not streaming) ---
                    if not self.stream_mode:
                        batch_time = time.time() - batch_start_time
                        urls_per_sec = len(url_chunk) / batch_time if batch_time > 0 else 0
                        end_mem_sample = memory_tracker.sample()
                        end_mem_str = end_mem_sample.get("memory_str", "Unknown")

                        progress_pct = (total_urls_processed / self.url_count) * 100

                        if chunk_failed == 0:
                            status_color, status = "green", "Success"
                        elif chunk_success == 0:
                            status_color, status = "red", "Failed"
                        else:
                            status_color, status = "yellow", "Partial"

                        console.print(
                            f" {chunk_idx + 1:<5} | {progress_pct:6.1f}% | {start_mem_str:>9} | {end_mem_str:>9} | {urls_per_sec:8.1f} | "
                            f"{chunk_success:^7}/{chunk_failed:<6} | {batch_time:8.2f} | [{status_color}]{status:<7}[/{status_color}]"
                        )

                    # Accumulate totals
                    total_successful_urls += chunk_success
                    total_failed_urls += chunk_failed
                    self.results_summary["chunks_processed"] += 1

                    # Optional small delay between starting chunks if needed
                    # await asyncio.sleep(0.1)

        except Exception as e:
            console.print(f"[bold red]An error occurred during the main crawl loop: {e}[/bold red]")
        finally:
            if 'monitor_task' in locals() and not monitor_task.done():
                monitor_task.cancel()
                try:
                    await monitor_task
                except asyncio.CancelledError:
                    pass

        end_time = time.time()
        self.results_summary.update({
            "end_time": time.strftime("%Y-%m-%d %H:%M:%S"),
            "total_time_seconds": end_time - start_time,
            "successful_urls": total_successful_urls,
            "failed_urls": total_failed_urls,
            "urls_processed": total_urls_processed,
            "memory": memory_tracker.get_report(),
        })
        self._save_results()
        return self.results_summary

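The loop above iterates over `url_chunks`, which is built earlier in this file by splitting the full URL list into fixed-size batches. As an illustration of that splitting (the helper name `chunk_urls` is hypothetical, not part of Crawl4AI):

```python
from typing import List


def chunk_urls(urls: List[str], chunk_size: int) -> List[List[str]]:
    """Split a URL list into consecutive batches of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Slicing past the end of a list is safe, so the last batch may be shorter
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
```

With `chunk_size=2`, five URLs become three batches of sizes 2, 2, and 1.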
    async def _periodic_memory_sample(self, tracker: SimpleMemoryTracker, interval: float):
        """Background task to sample memory periodically."""
        while True:
            tracker.sample()
            try:
                await asyncio.sleep(interval)
            except asyncio.CancelledError:
                break  # Exit the loop on cancellation

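The tracker passed to the sampler above is the `SimpleMemoryTracker` defined earlier in this file. A minimal stand-in with the same sample/report shape, sketched with only the stdlib `resource` module (`MiniMemoryTracker` is hypothetical; the real tracker also writes samples to a CSV), could look like:

```python
import resource


class MiniMemoryTracker:
    """Hypothetical minimal sketch of a memory tracker with the keys the
    summary code reads (start/end/max/growth in MB)."""

    def __init__(self):
        self.samples = []

    def sample(self):
        # ru_maxrss is the peak RSS; reported in kilobytes on Linux
        rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        entry = {"memory_mb": rss_mb, "memory_str": f"{rss_mb:.1f} MB"}
        self.samples.append(entry)
        return entry

    def get_report(self):
        if not self.samples:
            return {"error": "no samples collected"}
        mbs = [s["memory_mb"] for s in self.samples]
        return {
            "start_memory_mb": mbs[0],
            "end_memory_mb": mbs[-1],
            "max_memory_mb": max(mbs),
            "memory_growth_mb": mbs[-1] - mbs[0],
        }
```

Note that peak RSS only ever grows, so a real tracker would sample current RSS (e.g. via psutil) to observe growth between batches.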
    def _save_results(self) -> None:
        results_path = self.report_path / f"test_summary_{self.test_id}.json"
        try:
            with open(results_path, 'w', encoding='utf-8') as f:
                json.dump(self.results_summary, f, indent=2, default=str)
            # The summary path itself is printed in run_full_test
        except Exception as e:
            console.print(f"[bold red]Failed to save results summary: {e}[/bold red]")


# --- run_full_test Function (Adjusted) ---
async def run_full_test(args):
    """Run the complete test process from site generation to crawling."""
    server = None
    site_generated = False

    # --- Site Generation ---
    if not args.use_existing_site and not args.skip_generation:
        if os.path.exists(args.site_path):
            console.print(f"[yellow]Removing existing site directory: {args.site_path}[/yellow]")
            shutil.rmtree(args.site_path)
        site_generator = SiteGenerator(site_path=args.site_path, page_count=args.urls)
        site_generator.generate_site()
        site_generated = True
    elif args.use_existing_site:
        console.print(f"[cyan]Using existing site assumed to be running on port {args.port}[/cyan]")
    elif args.skip_generation:
        console.print(f"[cyan]Skipping site generation, using existing directory: {args.site_path}[/cyan]")
        if not os.path.isdir(args.site_path):
            console.print(f"[bold red]Error: Site path '{args.site_path}' does not exist or is not a directory.[/bold red]")
            return

    # --- Start Local Server ---
    server_started = False
    if not args.use_existing_site:
        server = LocalHttpServer(site_path=args.site_path, port=args.port)
        try:
            server.start()
            server_started = True
        except Exception as e:
            console.print(f"[bold red]Failed to start local server: {e}. Aborting test.[/bold red]")
            if site_generated and not args.keep_site:
                console.print(f"[yellow]Cleaning up generated site: {args.site_path}[/yellow]")
                shutil.rmtree(args.site_path)
            return

    try:
        # --- Run the Stress Test ---
        test = CrawlerStressTest(
            url_count=args.urls,
            port=args.port,
            max_sessions=args.max_sessions,
            chunk_size=args.chunk_size,
            report_path=args.report_path,
            stream_mode=args.stream,
            monitor_mode=args.monitor_mode,
            use_rate_limiter=args.use_rate_limiter,
        )
        results = await test.run()  # run() now handles chunking internally

        # --- Print Summary ---
        console.print("\n" + "=" * 80)
        console.print("[bold green]Test Completed[/bold green]")
        console.print("=" * 80)

        success_rate = results["successful_urls"] / results["url_count"] * 100 if results["url_count"] > 0 else 0
        urls_per_second = results["urls_processed"] / results["total_time_seconds"] if results["total_time_seconds"] > 0 else 0

        console.print(f"[bold cyan]Test ID:[/bold cyan] {results['test_id']}")
        console.print(f"[bold cyan]Configuration:[/bold cyan] {results['url_count']} URLs, {results['max_sessions']} sessions, Chunk: {results['chunk_size']}, Stream: {results['stream_mode']}, Monitor: {results['monitor_mode']}")
        console.print(f"[bold cyan]Results:[/bold cyan] {results['successful_urls']} successful, {results['failed_urls']} failed ({results['urls_processed']} processed, {success_rate:.1f}% success)")
        console.print(f"[bold cyan]Performance:[/bold cyan] {results['total_time_seconds']:.2f} seconds total, {urls_per_second:.2f} URLs/second avg")

        mem_report = results.get("memory", {})
        mem_info_str = "Memory tracking data unavailable."
        csv_path = None
        if mem_report and not mem_report.get("error"):
            start_mb = mem_report.get('start_memory_mb')
            end_mb = mem_report.get('end_memory_mb')
            max_mb = mem_report.get('max_memory_mb')
            growth_mb = mem_report.get('memory_growth_mb')
            mem_parts = []
            if start_mb is not None: mem_parts.append(f"Start: {start_mb:.1f} MB")
            if end_mb is not None: mem_parts.append(f"End: {end_mb:.1f} MB")
            if max_mb is not None: mem_parts.append(f"Max: {max_mb:.1f} MB")
            if growth_mb is not None: mem_parts.append(f"Growth: {growth_mb:.1f} MB")
            if mem_parts:
                mem_info_str = ", ".join(mem_parts)
            csv_path = mem_report.get('csv_path')
            if csv_path:
                console.print(f"[dim]Memory samples saved to: {csv_path}[/dim]")

        console.print(f"[bold cyan]Memory Usage:[/bold cyan] {mem_info_str}")
        if csv_path:
            # Infer the summary path from the memory CSV path; guarding on
            # csv_path avoids a KeyError when memory tracking failed.
            summary_path = csv_path.replace('memory_samples', 'test_summary').replace('.csv', '.json')
            console.print(f"[bold green]Results summary saved to {summary_path}[/bold green]")

        if results["failed_urls"] > 0:
            console.print(f"\n[bold yellow]Warning: {results['failed_urls']} URLs failed to process ({100 - success_rate:.1f}% failure rate)[/bold yellow]")
        if results["urls_processed"] < results["url_count"]:
            console.print(f"\n[bold red]Error: Only {results['urls_processed']} out of {results['url_count']} URLs were processed![/bold red]")

    finally:
        # --- Stop Server / Cleanup ---
        if server_started and server and not args.keep_server_alive:
            server.stop()
        elif server_started and server and args.keep_server_alive:
            console.print(f"[bold cyan]Server is kept running on port {args.port}. Press Ctrl+C to stop it.[/bold cyan]")
            try:
                await asyncio.Future()  # Keep running indefinitely
            except (KeyboardInterrupt, asyncio.CancelledError):
                # asyncio.run() cancels the task on Ctrl+C, so the interrupt can
                # surface here as CancelledError; stop the server either way.
                console.print("\n[bold yellow]Stopping server due to user interrupt...[/bold yellow]")
                server.stop()

        if site_generated and not args.keep_site:
            console.print(f"[yellow]Cleaning up generated site: {args.site_path}[/yellow]")
            shutil.rmtree(args.site_path)
        elif args.clean_site and os.path.exists(args.site_path):
            console.print(f"[yellow]Cleaning up site directory as requested: {args.site_path}[/yellow]")
            shutil.rmtree(args.site_path)


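The summary path printed at the end of run_full_test is inferred by plain string substitution on the memory CSV path. A small illustration of that mapping (the file names below are hypothetical examples):

```python
def infer_summary_path(csv_path: str) -> str:
    """Mirror the substitution used to derive the JSON summary path
    from the memory-samples CSV path."""
    return csv_path.replace("memory_samples", "test_summary").replace(".csv", ".json")
```

For example, `reports/memory_samples_run1.csv` maps to `reports/test_summary_run1.json`, matching the `test_summary_{test_id}.json` name used by `_save_results`.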
# --- main Function (Added chunk_size argument) ---
def main():
    """Main entry point for the script."""
    parser = argparse.ArgumentParser(description="Crawl4AI SDK High Volume Stress Test using arun_many")

    # Test parameters
    parser.add_argument("--urls", type=int, default=DEFAULT_URL_COUNT, help=f"Number of URLs to test (default: {DEFAULT_URL_COUNT})")
    parser.add_argument("--max-sessions", type=int, default=DEFAULT_MAX_SESSIONS, help=f"Maximum concurrent crawling sessions (default: {DEFAULT_MAX_SESSIONS})")
    parser.add_argument("--chunk-size", type=int, default=DEFAULT_CHUNK_SIZE, help=f"Number of URLs per batch for logging (default: {DEFAULT_CHUNK_SIZE})")
    parser.add_argument("--stream", action="store_true", default=DEFAULT_STREAM_MODE, help=f"Enable streaming mode (disables batch logging) (default: {DEFAULT_STREAM_MODE})")
    parser.add_argument("--monitor-mode", type=str, default=DEFAULT_MONITOR_MODE, choices=["DETAILED", "AGGREGATED"], help=f"Display mode for the live monitor (default: {DEFAULT_MONITOR_MODE})")
    parser.add_argument("--use-rate-limiter", action="store_true", default=False, help="Enable a basic rate limiter (default: False)")

    # Environment parameters
    parser.add_argument("--site-path", type=str, default=DEFAULT_SITE_PATH, help=f"Path to generate/use the test site (default: {DEFAULT_SITE_PATH})")
    parser.add_argument("--port", type=int, default=DEFAULT_PORT, help=f"Port for the local HTTP server (default: {DEFAULT_PORT})")
    parser.add_argument("--report-path", type=str, default=DEFAULT_REPORT_PATH, help=f"Path to save reports and logs (default: {DEFAULT_REPORT_PATH})")

    # Site/Server management
    parser.add_argument("--skip-generation", action="store_true", help="Use existing test site folder without regenerating")
    parser.add_argument("--use-existing-site", action="store_true", help="Do not generate site or start local server; assume site exists on --port")
    parser.add_argument("--keep-server-alive", action="store_true", help="Keep the local HTTP server running after test")
    parser.add_argument("--keep-site", action="store_true", help="Keep the generated test site files after test")
    parser.add_argument("--clean-reports", action="store_true", help="Clean up report directory before running")
    parser.add_argument("--clean-site", action="store_true", help="Clean up site directory before running (if generating) or after")

    args = parser.parse_args()

    # Display config
    console.print("[bold underline]Crawl4AI SDK Stress Test Configuration[/bold underline]")
    console.print(f"URLs: {args.urls}, Max Sessions: {args.max_sessions}, Chunk Size: {args.chunk_size}")
    console.print(f"Mode: {'Streaming' if args.stream else 'Batch'}, Monitor: {args.monitor_mode}, Rate Limit: {args.use_rate_limiter}")
    console.print(f"Site Path: {args.site_path}, Port: {args.port}, Report Path: {args.report_path}")
    console.print("-" * 40)
    if args.use_existing_site:
        console.print("[cyan]Mode: Using existing external site/server[/cyan]")
    elif args.skip_generation:
        console.print("[cyan]Mode: Using existing site files, starting local server[/cyan]")
    else:
        console.print("[cyan]Mode: Generating site files, starting local server[/cyan]")
    if args.keep_server_alive:
        console.print("[cyan]Option: Keep server alive after test[/cyan]")
    if args.keep_site:
        console.print("[cyan]Option: Keep site files after test[/cyan]")
    if args.clean_reports:
        console.print("[cyan]Option: Clean reports before test[/cyan]")
    if args.clean_site:
        console.print("[cyan]Option: Clean site directory[/cyan]")
    console.print("-" * 40)

    if args.clean_reports:
        if os.path.exists(args.report_path):
            console.print(f"[yellow]Cleaning up reports directory: {args.report_path}[/yellow]")
            shutil.rmtree(args.report_path)
        os.makedirs(args.report_path, exist_ok=True)
    if args.clean_site and not args.use_existing_site:
        if os.path.exists(args.site_path):
            console.print(f"[yellow]Cleaning up site directory as requested: {args.site_path}[/yellow]")
            shutil.rmtree(args.site_path)

    # Run
    try:
        asyncio.run(run_full_test(args))
    except KeyboardInterrupt:
        console.print("\n[bold yellow]Test interrupted by user.[/bold yellow]")
    except Exception as e:
        console.print(f"\n[bold red]An unexpected error occurred:[/bold red] {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()