Compare commits
101 Commits
next-alpin
...
codex/find
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
9d8ead59b8 | ||
|
|
45f1652d98 | ||
|
|
897e017361 | ||
|
|
a3e9ef91ad | ||
|
|
76dd86d1b3 | ||
|
|
206a9dfabd | ||
|
|
aaf05910eb | ||
|
|
a0555d5fa6 | ||
|
|
38ebcbb304 | ||
|
|
9b5ccac76e | ||
|
|
87d4b0fff4 | ||
|
|
bd5a9ac632 | ||
|
|
6650b2f34a | ||
|
|
5cc58f9bb3 | ||
|
|
baf7f6a6f5 | ||
|
|
94e9959fe0 | ||
|
|
7c2fd5202e | ||
|
|
ee01b81f3e | ||
|
|
0e5d672763 | ||
|
|
cd2b490b40 | ||
|
|
50f0b83fcd | ||
|
|
9499164d3c | ||
|
|
2140d9aca4 | ||
|
|
ccec40ed17 | ||
|
|
ad4dfb21e1 | ||
|
|
7784b2468e | ||
|
|
146f9d415f | ||
|
|
37fd80e4b9 | ||
|
|
949a93982e | ||
|
|
c4f5651199 | ||
|
|
b0aa8bc9f7 | ||
|
|
c98ffe2130 | ||
|
|
4812f08a73 | ||
|
|
f3ebb38edf | ||
|
|
0007aea204 | ||
|
|
b5c25731e6 | ||
|
|
5297e362f3 | ||
|
|
a58c8000aa | ||
|
|
b27bb367e8 | ||
|
|
d2648eaa39 | ||
|
|
c2902fd200 | ||
|
|
16b2318242 | ||
|
|
907cba194f | ||
|
|
3bf78ff47a | ||
|
|
921e0c46b6 | ||
|
|
fd899f66aa | ||
|
|
30ec4f571f | ||
|
|
7db6b468d9 | ||
|
|
eed7f88f29 | ||
|
|
94d486579c | ||
|
|
5206c6f2d6 | ||
|
|
230f22da86 | ||
|
|
793668a413 | ||
|
|
82aa53aa59 | ||
|
|
dcc265458c | ||
|
|
7d8e81fb2e | ||
|
|
9fc5d315af | ||
|
|
d84508b4d5 | ||
|
|
022f5c9e25 | ||
|
|
b2f3cb0dfa | ||
|
|
18e8227dfb | ||
|
|
7c358a1aee | ||
|
|
6f7ab9c927 | ||
|
|
7155778eac | ||
|
|
4133e5460d | ||
|
|
73fda8a6ec | ||
|
|
9e16a4bb26 | ||
|
|
765f856ed4 | ||
|
|
757e3177ed | ||
|
|
d8357e80d2 | ||
|
|
ef1f0c4102 | ||
|
|
1119f2f5b5 | ||
|
|
d8cbeff386 | ||
|
|
57e0423b3a | ||
|
|
7be5427283 | ||
|
|
585e5e5973 | ||
|
|
e3111d0a32 | ||
|
|
2f0e217751 | ||
|
|
efa73257c5 | ||
|
|
e01d1e73e1 | ||
|
|
471d110c5e | ||
|
|
f89113377a | ||
|
|
6740e87b4d | ||
|
|
8b761f232b | ||
|
|
e0c2a7c284 | ||
|
|
ac2f9ae533 | ||
|
|
eedda1ae5c | ||
|
|
8cecbec7a7 | ||
|
|
4359b12003 | ||
|
|
529a79725e | ||
|
|
9109ecd8fc | ||
|
|
84883be513 | ||
|
|
c190ba816d | ||
|
|
a3954dd4c6 | ||
|
|
cbb8755972 | ||
|
|
341b7a5f2a | ||
|
|
504207faa6 | ||
|
|
f14e4a4b67 | ||
|
|
1e819cdb26 | ||
|
|
5edfea279d | ||
|
|
7c1705712d |
9
.gitignore
vendored
9
.gitignore
vendored
@@ -257,4 +257,11 @@ continue_config.json
|
||||
.private/
|
||||
|
||||
CLAUDE_MONITOR.md
|
||||
CLAUDE.md
|
||||
CLAUDE.md
|
||||
|
||||
tests/**/test_site
|
||||
tests/**/reports
|
||||
tests/**/benchmark_reports
|
||||
|
||||
docs/**/data
|
||||
.codecat/
|
||||
|
||||
106
CHANGELOG.md
106
CHANGELOG.md
@@ -5,6 +5,112 @@ All notable changes to Crawl4AI will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [0.6.2] - 2025-05-02
|
||||
|
||||
### Added
|
||||
- New `RegexExtractionStrategy` for fast pattern-based extraction without requiring LLM
|
||||
- Built-in patterns for emails, URLs, phone numbers, dates, and more
|
||||
- Support for custom regex patterns
|
||||
- `generate_pattern` utility for LLM-assisted pattern creation (one-time use)
|
||||
- Added `fit_html` as a top-level field in `CrawlResult` for optimized HTML extraction
|
||||
- Added support for network response body capture in network request tracking
|
||||
|
||||
### Changed
|
||||
- Updated documentation for no-LLM extraction strategies
|
||||
- Enhanced API reference to include RegexExtractionStrategy examples and usage
|
||||
- Improved HTML preprocessing with optimized performance for extraction strategies
|
||||
|
||||
## [0.6.1] - 2025-04-24
|
||||
|
||||
### Added
|
||||
- New dedicated `tables` field in `CrawlResult` model for better table extraction handling
|
||||
- Updated crypto_analysis_example.py to use the new tables field with backward compatibility
|
||||
|
||||
### Changed
|
||||
- Improved playground UI in Docker deployment with better endpoint handling and UI feedback
|
||||
|
||||
## [0.6.0] ‑ 2025‑04‑22
|
||||
|
||||
### Added
|
||||
- Browser pooling with page pre‑warming and fine‑grained **geolocation, locale, and timezone** controls
|
||||
- Crawler pool manager (SDK + Docker API) for smarter resource allocation
|
||||
- Network & console log capture plus MHTML snapshot export
|
||||
- **Table extractor**: turn HTML `<table>`s into DataFrames or CSV with one flag
|
||||
- High‑volume stress‑test framework in `tests/memory` and API load scripts
|
||||
- MCP protocol endpoints with socket & SSE support; playground UI scaffold
|
||||
- Docs v2 revamp: TOC, GitHub badge, copy‑code buttons, Docker API demo
|
||||
- “Ask AI” helper button *(work‑in‑progress, shipping soon)*
|
||||
- New examples: geo‑location usage, network/console capture, Docker API, markdown source selection, crypto analysis
|
||||
- Expanded automated test suites for browser, Docker, MCP and memory benchmarks
|
||||
|
||||
### Changed
|
||||
- Consolidated and renamed browser strategies; legacy docker strategy modules removed
|
||||
- `ProxyConfig` moved to `async_configs`
|
||||
- Server migrated to pool‑based crawler management
|
||||
- FastAPI validators replace custom query validation
|
||||
- Docker build now uses Chromium base image
|
||||
- Large‑scale repo tidy‑up (≈36 k insertions, ≈5 k deletions)
|
||||
|
||||
### Fixed
|
||||
- Async crawler session leak, duplicate‑visit handling, URL normalisation
|
||||
- Target‑element regressions in scraping strategies
|
||||
- Logged‑URL readability, encoded‑URL decoding, middle truncation for long URLs
|
||||
- Closed issues: #701, #733, #756, #774, #804, #822, #839, #841, #842, #843, #867, #902, #911
|
||||
|
||||
### Removed
|
||||
- Obsolete modules under `crawl4ai/browser/*` superseded by the new pooled browser layer
|
||||
|
||||
### Deprecated
|
||||
- Old markdown generator names now alias `DefaultMarkdownGenerator` and emit warnings
|
||||
|
||||
---
|
||||
|
||||
#### Upgrade notes
|
||||
1. Update any direct imports from `crawl4ai/browser/*` to the new pooled browser modules
|
||||
2. If you override `AsyncPlaywrightCrawlerStrategy.get_page`, adopt the new signature
|
||||
3. Rebuild Docker images to pull the new Chromium layer
|
||||
4. Switch to `DefaultMarkdownGenerator` (or silence the deprecation warning)
|
||||
|
||||
---
|
||||
|
||||
`121 files changed, ≈36 223 insertions, ≈4 975 deletions` :contentReference[oaicite:0]{index=0}​:contentReference[oaicite:1]{index=1}
|
||||
|
||||
|
||||
### [Feature] 2025-04-21
|
||||
- Implemented MCP protocol for machine-to-machine communication
|
||||
- Added WebSocket and SSE transport for MCP server
|
||||
- Exposed server endpoints via MCP protocol
|
||||
- Created tests for MCP socket and SSE communication
|
||||
- Enhanced Docker server with file handling and intelligent search
|
||||
- Added PDF and screenshot endpoints with file saving capability
|
||||
- Added JavaScript execution endpoint for page interaction
|
||||
- Implemented advanced context search with BM25 and code chunking
|
||||
- Added file path output support for generated assets
|
||||
- Improved server endpoints and API surface
|
||||
- Added intelligent context search with query filtering
|
||||
- Added syntax-aware code function chunking
|
||||
- Implemented efficient HTML processing pipeline
|
||||
- Added support for controlling browser geolocation via new GeolocationConfig class
|
||||
- Added locale and timezone configuration options to CrawlerRunConfig
|
||||
- Added example script demonstrating geolocation and locale usage
|
||||
- Added documentation for location-based identity features
|
||||
|
||||
### [Refactor] 2025-04-20
|
||||
- Replaced crawler_manager.py with simpler crawler_pool.py implementation
|
||||
- Added global page semaphore for hard concurrency cap
|
||||
- Implemented browser pool with idle cleanup
|
||||
- Added playground UI for testing and stress testing
|
||||
- Updated API handlers to use pooled crawlers
|
||||
- Enhanced logging levels and symbols
|
||||
- Added memory tests and stress test utilities
|
||||
|
||||
### [Added] 2025-04-17
|
||||
- Added content source selection feature for markdown generation
|
||||
- New `content_source` parameter allows choosing between `cleaned_html`, `raw_html`, and `fit_html`
|
||||
- Provides flexibility in how HTML content is processed before markdown conversion
|
||||
- Added examples and documentation for the new feature
|
||||
- Includes backward compatibility with default `cleaned_html` behavior
|
||||
|
||||
## Version 0.5.0.post5 (2025-03-14)
|
||||
|
||||
### Added
|
||||
|
||||
15
Dockerfile
15
Dockerfile
@@ -1,4 +1,9 @@
|
||||
FROM python:3.10-slim
|
||||
FROM python:3.12-slim-bookworm AS build
|
||||
|
||||
# C4ai version
|
||||
ARG C4AI_VER=0.6.0
|
||||
ENV C4AI_VERSION=$C4AI_VER
|
||||
LABEL c4ai.version=$C4AI_VER
|
||||
|
||||
# Set build arguments
|
||||
ARG APP_HOME=/app
|
||||
@@ -17,7 +22,7 @@ ENV PYTHONFAULTHANDLER=1 \
|
||||
REDIS_HOST=localhost \
|
||||
REDIS_PORT=6379
|
||||
|
||||
ARG PYTHON_VERSION=3.10
|
||||
ARG PYTHON_VERSION=3.12
|
||||
ARG INSTALL_TYPE=default
|
||||
ARG ENABLE_GPU=false
|
||||
ARG TARGETARCH
|
||||
@@ -66,6 +71,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
&& apt-get clean \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN apt-get update && apt-get dist-upgrade -y \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETARCH" = "amd64" ] ; then \
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
nvidia-cuda-toolkit \
|
||||
@@ -162,6 +170,9 @@ RUN crawl4ai-doctor
|
||||
# Copy application code
|
||||
COPY deploy/docker/* ${APP_HOME}/
|
||||
|
||||
# copy the playground + any future static assets
|
||||
COPY deploy/docker/static ${APP_HOME}/static
|
||||
|
||||
# Change ownership of the application directory to the non-root user
|
||||
RUN chown -R appuser:appuser ${APP_HOME}
|
||||
|
||||
|
||||
231
JOURNAL.md
231
JOURNAL.md
@@ -2,6 +2,237 @@
|
||||
|
||||
This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
|
||||
|
||||
## [2025-04-17] Added Content Source Selection for Markdown Generation
|
||||
|
||||
**Feature:** Configurable content source for markdown generation
|
||||
|
||||
**Changes Made:**
|
||||
1. Added `content_source: str = "cleaned_html"` parameter to `MarkdownGenerationStrategy` class
|
||||
2. Updated `DefaultMarkdownGenerator` to accept and pass the content source parameter
|
||||
3. Renamed the `cleaned_html` parameter to `input_html` in the `generate_markdown` method
|
||||
4. Modified `AsyncWebCrawler.aprocess_html` to select the appropriate HTML source based on the generator's config
|
||||
5. Added `preprocess_html_for_schema` import in `async_webcrawler.py`
|
||||
|
||||
**Implementation Details:**
|
||||
- Added a new `content_source` parameter to specify which HTML input to use for markdown generation
|
||||
- Options include: "cleaned_html" (default), "raw_html", and "fit_html"
|
||||
- Used a dictionary dispatch pattern in `aprocess_html` to select the appropriate HTML source
|
||||
- Added proper error handling with fallback to cleaned_html if content source selection fails
|
||||
- Ensured backward compatibility by defaulting to "cleaned_html" option
|
||||
|
||||
**Files Modified:**
|
||||
- `crawl4ai/markdown_generation_strategy.py`: Added content_source parameter and updated the method signature
|
||||
- `crawl4ai/async_webcrawler.py`: Added HTML source selection logic and updated imports
|
||||
|
||||
**Examples:**
|
||||
- Created `docs/examples/content_source_example.py` demonstrating how to use the new parameter
|
||||
|
||||
**Challenges:**
|
||||
- Maintaining backward compatibility while reorganizing the parameter flow
|
||||
- Ensuring proper error handling for all content source options
|
||||
- Making the change with minimal code modifications
|
||||
|
||||
**Why This Feature:**
|
||||
The content source selection feature allows users to choose which HTML content to use as input for markdown generation:
|
||||
1. "cleaned_html" - Uses the post-processed HTML after scraping strategy (original behavior)
|
||||
2. "raw_html" - Uses the original raw HTML directly from the web page
|
||||
3. "fit_html" - Uses the preprocessed HTML optimized for schema extraction
|
||||
|
||||
This feature provides greater flexibility in how users generate markdown, enabling them to:
|
||||
- Capture more detailed content from the original HTML when needed
|
||||
- Use schema-optimized HTML when working with structured data
|
||||
- Choose the approach that best suits their specific use case
|
||||
## [2025-04-17] Implemented High Volume Stress Testing Solution for SDK
|
||||
|
||||
**Feature:** Comprehensive stress testing framework using `arun_many` and the dispatcher system to evaluate performance, concurrency handling, and identify potential issues under high-volume crawling scenarios.
|
||||
|
||||
**Changes Made:**
|
||||
1. Created a dedicated stress testing framework in the `benchmarking/` (or similar) directory.
|
||||
2. Implemented local test site generation (`SiteGenerator`) with configurable heavy HTML pages.
|
||||
3. Added basic memory usage tracking (`SimpleMemoryTracker`) using platform-specific commands (avoiding `psutil` dependency for this specific test).
|
||||
4. Utilized `CrawlerMonitor` from `crawl4ai` for rich terminal UI and real-time monitoring of test progress and dispatcher activity.
|
||||
5. Implemented detailed result summary saving (JSON) and memory sample logging (CSV).
|
||||
6. Developed `run_benchmark.py` to orchestrate tests with predefined configurations.
|
||||
7. Created `run_all.sh` as a simple wrapper for `run_benchmark.py`.
|
||||
|
||||
**Implementation Details:**
|
||||
- Generates a local test site with configurable pages containing heavy text and image content.
|
||||
- Uses Python's built-in `http.server` for local serving, minimizing network variance.
|
||||
- Leverages `crawl4ai`'s `arun_many` method for processing URLs.
|
||||
- Utilizes `MemoryAdaptiveDispatcher` to manage concurrency via the `max_sessions` parameter (note: memory adaptation features require `psutil`, not used by `SimpleMemoryTracker`).
|
||||
- Tracks memory usage via `SimpleMemoryTracker`, recording samples throughout test execution to a CSV file.
|
||||
- Uses `CrawlerMonitor` (which uses the `rich` library) for clear terminal visualization and progress reporting directly from the dispatcher.
|
||||
- Stores detailed final metrics in a JSON summary file.
|
||||
|
||||
**Files Created/Updated:**
|
||||
- `stress_test_sdk.py`: Main stress testing implementation using `arun_many`.
|
||||
- `benchmark_report.py`: (Assumed) Report generator for comparing test results.
|
||||
- `run_benchmark.py`: Test runner script with predefined configurations.
|
||||
- `run_all.sh`: Simple bash script wrapper for `run_benchmark.py`.
|
||||
- `USAGE.md`: Comprehensive documentation on usage and interpretation (updated).
|
||||
|
||||
**Testing Approach:**
|
||||
- Creates a controlled, reproducible test environment with a local HTTP server.
|
||||
- Processes URLs using `arun_many`, allowing the dispatcher to manage concurrency up to `max_sessions`.
|
||||
- Optionally logs per-batch summaries (when not in streaming mode) after processing chunks.
|
||||
- Supports different test sizes via `run_benchmark.py` configurations.
|
||||
- Records memory samples via platform commands for basic trend analysis.
|
||||
- Includes cleanup functionality for the test environment.
|
||||
|
||||
**Challenges:**
|
||||
- Ensuring proper cleanup of HTTP server processes.
|
||||
- Getting reliable memory tracking across platforms without adding heavy dependencies (`psutil`) to this specific test script.
|
||||
- Designing `run_benchmark.py` to correctly pass arguments to `stress_test_sdk.py`.
|
||||
|
||||
**Why This Feature:**
|
||||
The high volume stress testing solution addresses critical needs for ensuring Crawl4AI's `arun_many` reliability:
|
||||
1. Provides a reproducible way to evaluate performance under concurrent load.
|
||||
2. Allows testing the dispatcher's concurrency control (`max_session_permit`) and queue management.
|
||||
3. Enables performance tuning by observing throughput (`URLs/sec`) under different `max_sessions` settings.
|
||||
4. Creates a controlled environment for testing `arun_many` behavior.
|
||||
5. Supports continuous integration by providing deterministic test conditions for `arun_many`.
|
||||
|
||||
**Design Decisions:**
|
||||
- Chose local site generation for reproducibility and isolation from network issues.
|
||||
- Utilized the built-in `CrawlerMonitor` for real-time feedback, leveraging its `rich` integration.
|
||||
- Implemented optional per-batch logging in `stress_test_sdk.py` (when not streaming) to provide chunk-level summaries alongside the continuous monitor.
|
||||
- Adopted `arun_many` with a `MemoryAdaptiveDispatcher` as the core mechanism for parallel execution, reflecting the intended SDK usage.
|
||||
- Created `run_benchmark.py` to simplify running standard test configurations.
|
||||
- Used `SimpleMemoryTracker` to provide basic memory insights without requiring `psutil` for this particular test runner.
|
||||
|
||||
**Future Enhancements to Consider:**
|
||||
- Create a separate test variant that *does* use `psutil` to specifically stress the memory-adaptive features of the dispatcher.
|
||||
- Add support for generated JavaScript content.
|
||||
- Add support for Docker-based testing with explicit memory limits.
|
||||
- Enhance `benchmark_report.py` to provide more sophisticated analysis of performance and memory trends from the generated JSON/CSV files.
|
||||
|
||||
---
|
||||
|
||||
## [2025-04-17] Refined Stress Testing System Parameters and Execution
|
||||
|
||||
**Changes Made:**
|
||||
1. Corrected `run_benchmark.py` and `stress_test_sdk.py` to use `--max-sessions` instead of the incorrect `--workers` parameter, accurately reflecting dispatcher configuration.
|
||||
2. Updated `run_benchmark.py` argument handling to correctly pass all relevant custom parameters (including `--stream`, `--monitor-mode`, etc.) to `stress_test_sdk.py`.
|
||||
3. (Assuming changes in `benchmark_report.py`) Applied dark theme to benchmark reports for better readability.
|
||||
4. (Assuming changes in `benchmark_report.py`) Improved visualization code to eliminate matplotlib warnings.
|
||||
5. Updated `run_benchmark.py` to provide clickable `file://` links to generated reports in the terminal output.
|
||||
6. Updated `USAGE.md` with comprehensive parameter descriptions reflecting the final script arguments.
|
||||
7. Updated `run_all.sh` wrapper to correctly invoke `run_benchmark.py` with flexible arguments.
|
||||
|
||||
**Details of Changes:**
|
||||
|
||||
1. **Parameter Correction (`--max-sessions`)**:
|
||||
* Identified the fundamental misunderstanding where `--workers` was used incorrectly.
|
||||
* Refactored `stress_test_sdk.py` to accept `--max-sessions` and configure the `MemoryAdaptiveDispatcher`'s `max_session_permit` accordingly.
|
||||
* Updated `run_benchmark.py` argument parsing and command construction to use `--max-sessions`.
|
||||
* Updated `TEST_CONFIGS` in `run_benchmark.py` to use `max_sessions`.
|
||||
|
||||
2. **Argument Handling (`run_benchmark.py`)**:
|
||||
* Improved logic to collect all command-line arguments provided to `run_benchmark.py`.
|
||||
* Ensured all relevant arguments (like `--stream`, `--monitor-mode`, `--port`, `--use-rate-limiter`, etc.) are correctly forwarded when calling `stress_test_sdk.py` as a subprocess.
|
||||
|
||||
3. **Dark Theme & Visualization Fixes (Assumed in `benchmark_report.py`)**:
|
||||
* (Describes changes assumed to be made in the separate reporting script).
|
||||
|
||||
4. **Clickable Links (`run_benchmark.py`)**:
|
||||
* Added logic to find the latest HTML report and PNG chart in the `benchmark_reports` directory after `benchmark_report.py` runs.
|
||||
* Used `pathlib` to generate correct `file://` URLs for terminal output.
|
||||
|
||||
5. **Documentation Improvements (`USAGE.md`)**:
|
||||
* Rewrote sections to explain `arun_many`, dispatchers, and `--max-sessions`.
|
||||
* Updated parameter tables for all scripts (`stress_test_sdk.py`, `run_benchmark.py`).
|
||||
* Clarified the difference between batch and streaming modes and their effect on logging.
|
||||
* Updated examples to use correct arguments.
|
||||
|
||||
**Files Modified:**
|
||||
- `stress_test_sdk.py`: Changed `--workers` to `--max-sessions`, added new arguments, used `arun_many`.
|
||||
- `run_benchmark.py`: Changed argument handling, updated configs, calls `stress_test_sdk.py`.
|
||||
- `run_all.sh`: Updated to call `run_benchmark.py` correctly.
|
||||
- `USAGE.md`: Updated documentation extensively.
|
||||
- `benchmark_report.py`: (Assumed modifications for dark theme and viz fixes).
|
||||
|
||||
**Testing:**
|
||||
- Verified that `--max-sessions` correctly limits concurrency via the `CrawlerMonitor` output.
|
||||
- Confirmed that custom arguments passed to `run_benchmark.py` are forwarded to `stress_test_sdk.py`.
|
||||
- Validated clickable links work in supporting terminals.
|
||||
- Ensured documentation matches the final script parameters and behavior.
|
||||
|
||||
**Why These Changes:**
|
||||
These refinements correct the fundamental approach of the stress test to align with `crawl4ai`'s actual architecture and intended usage:
|
||||
1. Ensures the test evaluates the correct components (`arun_many`, `MemoryAdaptiveDispatcher`).
|
||||
2. Makes test configurations more accurate and flexible.
|
||||
3. Improves the usability of the testing framework through better argument handling and documentation.
|
||||
|
||||
|
||||
**Future Enhancements to Consider:**
|
||||
- Add support for generated JavaScript content to test JS rendering performance
|
||||
- Implement more sophisticated memory analysis like generational garbage collection tracking
|
||||
- Add support for Docker-based testing with memory limits to force OOM conditions
|
||||
- Create visualization tools for analyzing memory usage patterns across test runs
|
||||
- Add benchmark comparisons between different crawler versions or configurations
|
||||
|
||||
## [2025-04-17] Fixed Issues in Stress Testing System
|
||||
|
||||
**Changes Made:**
|
||||
1. Fixed custom parameter handling in run_benchmark.py
|
||||
2. Applied dark theme to benchmark reports for better readability
|
||||
3. Improved visualization code to eliminate matplotlib warnings
|
||||
4. Added clickable links to generated reports in terminal output
|
||||
5. Enhanced documentation with comprehensive parameter descriptions
|
||||
|
||||
**Details of Changes:**
|
||||
|
||||
1. **Custom Parameter Handling Fix**
|
||||
- Identified bug where custom URL count was being ignored in run_benchmark.py
|
||||
- Rewrote argument handling to use a custom args dictionary
|
||||
- Properly passed parameters to the test_simple_stress.py command
|
||||
- Added better UI indication of custom parameters in use
|
||||
|
||||
2. **Dark Theme Implementation**
|
||||
- Added complete dark theme to HTML benchmark reports
|
||||
- Applied dark styling to all visualization components
|
||||
- Used Nord-inspired color palette for charts and graphs
|
||||
- Improved contrast and readability for data visualization
|
||||
- Updated text colors and backgrounds for better eye comfort
|
||||
|
||||
3. **Matplotlib Warning Fixes**
|
||||
- Resolved warnings related to improper use of set_xticklabels()
|
||||
- Implemented correct x-axis positioning for bar charts
|
||||
- Ensured proper alignment of bar labels and data points
|
||||
- Updated plotting code to use modern matplotlib practices
|
||||
|
||||
4. **Documentation Improvements**
|
||||
- Created comprehensive USAGE.md with detailed instructions
|
||||
- Added parameter documentation for all scripts
|
||||
- Included examples for all common use cases
|
||||
- Provided detailed explanations for interpreting results
|
||||
- Added troubleshooting guide for common issues
|
||||
|
||||
**Files Modified:**
|
||||
- `tests/memory/run_benchmark.py`: Fixed custom parameter handling
|
||||
- `tests/memory/benchmark_report.py`: Added dark theme and fixed visualization warnings
|
||||
- `tests/memory/run_all.sh`: Added clickable links to reports
|
||||
- `tests/memory/USAGE.md`: Created comprehensive documentation
|
||||
|
||||
**Testing:**
|
||||
- Verified that custom URL counts are now correctly used
|
||||
- Confirmed dark theme is properly applied to all report elements
|
||||
- Checked that matplotlib warnings are no longer appearing
|
||||
- Validated clickable links to reports work in terminals that support them
|
||||
|
||||
**Why These Changes:**
|
||||
These improvements address several usability issues with the stress testing system:
|
||||
1. Better parameter handling ensures test configurations work as expected
|
||||
2. Dark theme reduces eye strain during extended test review sessions
|
||||
3. Fixing visualization warnings improves code quality and output clarity
|
||||
4. Enhanced documentation makes the system more accessible for future use
|
||||
|
||||
**Future Enhancements:**
|
||||
- Add additional visualization options for different types of analysis
|
||||
- Implement theme toggle to support both light and dark preferences
|
||||
- Add export options for embedding reports in other documentation
|
||||
- Create dedicated CI/CD integration templates for automated testing
|
||||
|
||||
## [2025-04-09] Added MHTML Capture Feature
|
||||
|
||||
**Feature:** MHTML snapshot capture of crawled pages
|
||||
|
||||
138
README.md
138
README.md
@@ -21,9 +21,9 @@
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.5.0](#-recent-updates)
|
||||
[✨ Check out latest update v0.6.0](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
||||
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
||||
|
||||
<details>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
@@ -253,24 +253,29 @@ pip install -e ".[all]" # Install all optional features
|
||||
<details>
|
||||
<summary>🐳 <strong>Docker Deployment</strong></summary>
|
||||
|
||||
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
|
||||
> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
|
||||
|
||||
### Current Docker Support
|
||||
### New Docker Features
|
||||
|
||||
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
|
||||
The new Docker implementation includes:
|
||||
- **Browser pooling** with page pre-warming for faster response times
|
||||
- **Interactive playground** to test and generate request code
|
||||
- **MCP integration** for direct connection to AI tools like Claude Code
|
||||
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
|
||||
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
|
||||
- **Optimized resources** with improved memory management
|
||||
|
||||
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
|
||||
- ⚠️ Note: This setup will be replaced in the next major release
|
||||
### Getting Started
|
||||
|
||||
### What's Coming Next?
|
||||
```bash
|
||||
# Pull and run the latest release candidate
|
||||
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
|
||||
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
|
||||
|
||||
Our new Docker implementation will bring:
|
||||
- Improved performance and resource efficiency
|
||||
- Streamlined deployment process
|
||||
- Better integration with Crawl4AI features
|
||||
- Enhanced scalability options
|
||||
# Visit the playground at http://localhost:11235/playground
|
||||
```
|
||||
|
||||
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
|
||||
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
@@ -500,31 +505,92 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
### Version 0.5.0 Major Release Highlights
|
||||
### Version 0.6.0 Release Highlights
|
||||
|
||||
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
|
||||
- **BFS Strategy**: Breadth-first search explores websites level by level
|
||||
- **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
|
||||
- **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
|
||||
- **Page Limiting**: Control the maximum number of pages to crawl with `max_pages` parameter
|
||||
- **Score Thresholds**: Filter URLs based on relevance scores
|
||||
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
|
||||
- **🔄 Multiple Crawling Strategies**:
|
||||
- **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default)
|
||||
- **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
|
||||
- **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints
|
||||
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
|
||||
- **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
|
||||
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant to answer your question for Crawl4ai, and generate proper code for crawling.
|
||||
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
|
||||
- **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
|
||||
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
|
||||
```python
|
||||
crun_cfg = CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo", # test page that shows your location
|
||||
locale="en-US", # Accept-Language & UI locale
|
||||
timezone_id="America/Los_Angeles", # JS Date()/Intl timezone
|
||||
geolocation=GeolocationConfig( # override GPS coords
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
|
||||
```python
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
|
||||
try:
|
||||
# Set up scraping parameters
|
||||
crawl_config = CrawlerRunConfig(
|
||||
table_score_threshold=8, # Strict table detection
|
||||
)
|
||||
|
||||
# Execute market data extraction
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
)
|
||||
|
||||
# Process results
|
||||
raw_df = pd.DataFrame()
|
||||
for result in results:
|
||||
if result.success and result.media["tables"]:
|
||||
raw_df = pd.DataFrame(
|
||||
result.media["tables"][0]["rows"],
|
||||
columns=result.media["tables"][0]["headers"],
|
||||
)
|
||||
break
|
||||
print(raw_df.head())
|
||||
|
||||
finally:
|
||||
await crawler.stop()
|
||||
```
|
||||
|
||||
- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
|
||||
|
||||
- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
|
||||
```python
|
||||
crawler_config = CrawlerRunConfig(
|
||||
capture_network=True,
|
||||
capture_console=True,
|
||||
mhtml=True
|
||||
)
|
||||
```
|
||||
|
||||
- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
|
||||
```bash
|
||||
# Add Crawl4AI to Claude Code
|
||||
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
|
||||
```
|
||||
|
||||
- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `http://localhost:11235//playground`
|
||||
|
||||
- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
|
||||
|
||||
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
|
||||
|
||||
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
### Previous Version: 0.5.0 Major Release Highlights
|
||||
|
||||
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
|
||||
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
|
||||
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
|
||||
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
|
||||
- **👤 Browser Profiler**: Create and manage persistent browser profiles
|
||||
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
|
||||
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
|
||||
- **🌐 Proxy Rotation**: Built-in support for proxy switching
|
||||
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
|
||||
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
|
||||
- **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
|
||||
- **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
|
||||
- **🔍 robots.txt Compliance**: Respect website crawling rules
|
||||
|
||||
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
|
||||
|
||||
## Version Numbering in Crawl4AI
|
||||
|
||||
@@ -540,7 +606,7 @@ We use different suffixes to indicate development stages:
|
||||
- `dev` (0.4.3dev1): Development versions, unstable
|
||||
- `a` (0.4.3a1): Alpha releases, experimental features
|
||||
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
|
||||
- `rc` (0.4.3rc1): Release candidates, potential final version
|
||||
- `rc` (0.4.3): Release candidates, potential final version
|
||||
|
||||
#### Installation
|
||||
- Regular installation (stable version):
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
import warnings
|
||||
|
||||
from .async_webcrawler import AsyncWebCrawler, CacheMode
|
||||
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig
|
||||
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig
|
||||
|
||||
from .content_scraping_strategy import (
|
||||
ContentScrapingStrategy,
|
||||
@@ -23,7 +23,8 @@ from .extraction_strategy import (
|
||||
CosineStrategy,
|
||||
JsonCssExtractionStrategy,
|
||||
JsonXPathExtractionStrategy,
|
||||
JsonLxmlExtractionStrategy
|
||||
JsonLxmlExtractionStrategy,
|
||||
RegexExtractionStrategy
|
||||
)
|
||||
from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
@@ -71,6 +72,7 @@ __all__ = [
|
||||
"AsyncWebCrawler",
|
||||
"BrowserProfiler",
|
||||
"LLMConfig",
|
||||
"GeolocationConfig",
|
||||
"DeepCrawlStrategy",
|
||||
"BFSDeepCrawlStrategy",
|
||||
"BestFirstCrawlingStrategy",
|
||||
@@ -104,6 +106,7 @@ __all__ = [
|
||||
"JsonCssExtractionStrategy",
|
||||
"JsonXPathExtractionStrategy",
|
||||
"JsonLxmlExtractionStrategy",
|
||||
"RegexExtractionStrategy",
|
||||
"ChunkingStrategy",
|
||||
"RegexChunking",
|
||||
"DefaultMarkdownGenerator",
|
||||
@@ -121,6 +124,7 @@ __all__ = [
|
||||
"Crawl4aiDockerClient",
|
||||
"ProxyRotationStrategy",
|
||||
"RoundRobinProxyStrategy",
|
||||
"ProxyConfig"
|
||||
]
|
||||
|
||||
|
||||
|
||||
@@ -1,2 +1,3 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.5.0.post8"
|
||||
__version__ = "0.6.3"
|
||||
|
||||
|
||||
@@ -5,6 +5,7 @@ from .config import (
|
||||
MIN_WORD_THRESHOLD,
|
||||
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||
PROVIDER_MODELS,
|
||||
PROVIDER_MODELS_PREFIXES,
|
||||
SCREENSHOT_HEIGHT_TRESHOLD,
|
||||
PAGE_TIMEOUT,
|
||||
IMAGE_SCORE_THRESHOLD,
|
||||
@@ -27,11 +28,8 @@ import inspect
|
||||
from typing import Any, Dict, Optional
|
||||
from enum import Enum
|
||||
|
||||
from .proxy_strategy import ProxyConfig
|
||||
try:
|
||||
from .browser.models import DockerConfig
|
||||
except ImportError:
|
||||
DockerConfig = None
|
||||
# from .proxy_strategy import ProxyConfig
|
||||
|
||||
|
||||
|
||||
def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
|
||||
@@ -161,6 +159,166 @@ def is_empty_value(value: Any) -> bool:
|
||||
return True
|
||||
return False
|
||||
|
||||
class GeolocationConfig:
|
||||
def __init__(
|
||||
self,
|
||||
latitude: float,
|
||||
longitude: float,
|
||||
accuracy: Optional[float] = 0.0
|
||||
):
|
||||
"""Configuration class for geolocation settings.
|
||||
|
||||
Args:
|
||||
latitude: Latitude coordinate (e.g., 37.7749)
|
||||
longitude: Longitude coordinate (e.g., -122.4194)
|
||||
accuracy: Accuracy in meters. Default: 0.0
|
||||
"""
|
||||
self.latitude = latitude
|
||||
self.longitude = longitude
|
||||
self.accuracy = accuracy
|
||||
|
||||
@staticmethod
|
||||
def from_dict(geo_dict: Dict) -> "GeolocationConfig":
|
||||
"""Create a GeolocationConfig from a dictionary."""
|
||||
return GeolocationConfig(
|
||||
latitude=geo_dict.get("latitude"),
|
||||
longitude=geo_dict.get("longitude"),
|
||||
accuracy=geo_dict.get("accuracy", 0.0)
|
||||
)
|
||||
|
||||
def to_dict(self) -> Dict:
|
||||
"""Convert to dictionary representation."""
|
||||
return {
|
||||
"latitude": self.latitude,
|
||||
"longitude": self.longitude,
|
||||
"accuracy": self.accuracy
|
||||
}
|
||||
|
||||
def clone(self, **kwargs) -> "GeolocationConfig":
|
||||
"""Create a copy of this configuration with updated values.
|
||||
|
||||
Args:
|
||||
**kwargs: Key-value pairs of configuration options to update
|
||||
|
||||
Returns:
|
||||
GeolocationConfig: A new instance with the specified updates
|
||||
"""
|
||||
config_dict = self.to_dict()
|
||||
config_dict.update(kwargs)
|
||||
return GeolocationConfig.from_dict(config_dict)
|
||||
|
||||
|
||||
class ProxyConfig:
|
||||
def __init__(
|
||||
self,
|
||||
server: str,
|
||||
username: Optional[str] = None,
|
||||
password: Optional[str] = None,
|
||||
ip: Optional[str] = None,
|
||||
):
|
||||
"""Configuration class for a single proxy.
|
||||
|
||||
Args:
|
||||
server: Proxy server URL (e.g., "http://127.0.0.1:8080")
|
||||
username: Optional username for proxy authentication
|
||||
password: Optional password for proxy authentication
|
||||
ip: Optional IP address for verification purposes
|
||||
"""
|
||||
self.server = server
|
||||
self.username = username
|
||||
self.password = password
|
||||
|
||||
# Extract IP from server if not explicitly provided
|
||||
self.ip = ip or self._extract_ip_from_server()
|
||||
|
||||
def _extract_ip_from_server(self) -> Optional[str]:
|
||||
"""Extract IP address from server URL."""
|
||||
try:
|
||||
# Simple extraction assuming http://ip:port format
|
||||
if "://" in self.server:
|
||||
parts = self.server.split("://")[1].split(":")
|
||||
return parts[0]
|
||||
else:
|
||||
parts = self.server.split(":")
|
||||
return parts[0]
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def from_string(proxy_str: str) -> "ProxyConfig":
|
||||
"""Create a ProxyConfig from a string in the format 'ip:port:username:password'."""
|
||||
parts = proxy_str.split(":")
|
||||
if len(parts) == 4: # ip:port:username:password
|
||||
ip, port, username, password = parts
|
||||
return ProxyConfig(
|
||||
server=f"http://{ip}:{port}",
|
||||
username=username,
|
||||
password=password,
|
||||
ip=ip
|
||||
)
|
||||
elif len(parts) == 2: # ip:port only
|
||||
ip, port = parts
|
||||
return ProxyConfig(
|
||||
server=f"http://{ip}:{port}",
|
||||
ip=ip
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Invalid proxy string format: {proxy_str}")
|
||||
|
||||
@staticmethod
|
||||
def from_dict(proxy_dict: Dict) -> "ProxyConfig":
|
||||
"""Create a ProxyConfig from a dictionary."""
|
||||
return ProxyConfig(
|
||||
server=proxy_dict.get("server"),
|
||||
username=proxy_dict.get("username"),
|
||||
password=proxy_dict.get("password"),
|
||||
ip=proxy_dict.get("ip")
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def from_env(env_var: str = "PROXIES") -> List["ProxyConfig"]:
|
||||
"""Load proxies from environment variable.
|
||||
|
||||
Args:
|
||||
env_var: Name of environment variable containing comma-separated proxy strings
|
||||
|
||||
Returns:
|
||||
List of ProxyConfig objects
|
||||
"""
|
||||
proxies = []
|
||||
try:
|
||||
proxy_list = os.getenv(env_var, "").split(",")
|
||||
for proxy in proxy_list:
|
||||
if not proxy:
|
||||
continue
|
||||
proxies.append(ProxyConfig.from_string(proxy))
|
||||
except Exception as e:
|
||||
print(f"Error loading proxies from environment: {e}")
|
||||
return proxies
|
||||
|
||||
def to_dict(self) -> Dict:
|
||||
"""Convert to dictionary representation."""
|
||||
return {
|
||||
"server": self.server,
|
||||
"username": self.username,
|
||||
"password": self.password,
|
||||
"ip": self.ip
|
||||
}
|
||||
|
||||
def clone(self, **kwargs) -> "ProxyConfig":
|
||||
"""Create a copy of this configuration with updated values.
|
||||
|
||||
Args:
|
||||
**kwargs: Key-value pairs of configuration options to update
|
||||
|
||||
Returns:
|
||||
ProxyConfig: A new instance with the specified updates
|
||||
"""
|
||||
config_dict = self.to_dict()
|
||||
config_dict.update(kwargs)
|
||||
return ProxyConfig.from_dict(config_dict)
|
||||
|
||||
|
||||
|
||||
class BrowserConfig:
|
||||
"""
|
||||
@@ -197,8 +355,6 @@ class BrowserConfig:
|
||||
Default: None.
|
||||
proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
|
||||
If None, no additional proxy config. Default: None.
|
||||
docker_config (DockerConfig or dict or None): Configuration for Docker-based browser automation.
|
||||
Contains settings for Docker container operation. Default: None.
|
||||
viewport_width (int): Default viewport width for pages. Default: 1080.
|
||||
viewport_height (int): Default viewport height for pages. Default: 600.
|
||||
viewport (dict): Default viewport dimensions for pages. If set, overrides viewport_width and viewport_height.
|
||||
@@ -244,7 +400,6 @@ class BrowserConfig:
|
||||
channel: str = "chromium",
|
||||
proxy: str = None,
|
||||
proxy_config: Union[ProxyConfig, dict, None] = None,
|
||||
docker_config: Union[DockerConfig, dict, None] = None,
|
||||
viewport_width: int = 1080,
|
||||
viewport_height: int = 600,
|
||||
viewport: dict = None,
|
||||
@@ -272,7 +427,7 @@ class BrowserConfig:
|
||||
host: str = "localhost",
|
||||
):
|
||||
self.browser_type = browser_type
|
||||
self.headless = headless or True
|
||||
self.headless = headless
|
||||
self.browser_mode = browser_mode
|
||||
self.use_managed_browser = use_managed_browser
|
||||
self.cdp_url = cdp_url
|
||||
@@ -285,15 +440,7 @@ class BrowserConfig:
|
||||
self.chrome_channel = ""
|
||||
self.proxy = proxy
|
||||
self.proxy_config = proxy_config
|
||||
|
||||
# Handle docker configuration
|
||||
if isinstance(docker_config, dict) and DockerConfig is not None:
|
||||
self.docker_config = DockerConfig.from_kwargs(docker_config)
|
||||
else:
|
||||
self.docker_config = docker_config
|
||||
|
||||
if self.docker_config:
|
||||
self.user_data_dir = self.docker_config.user_data_dir
|
||||
|
||||
self.viewport_width = viewport_width
|
||||
self.viewport_height = viewport_height
|
||||
@@ -364,7 +511,6 @@ class BrowserConfig:
|
||||
channel=kwargs.get("channel", "chromium"),
|
||||
proxy=kwargs.get("proxy"),
|
||||
proxy_config=kwargs.get("proxy_config", None),
|
||||
docker_config=kwargs.get("docker_config", None),
|
||||
viewport_width=kwargs.get("viewport_width", 1080),
|
||||
viewport_height=kwargs.get("viewport_height", 600),
|
||||
accept_downloads=kwargs.get("accept_downloads", False),
|
||||
@@ -421,13 +567,7 @@ class BrowserConfig:
|
||||
"debugging_port": self.debugging_port,
|
||||
"host": self.host,
|
||||
}
|
||||
|
||||
# Include docker_config if it exists
|
||||
if hasattr(self, "docker_config") and self.docker_config is not None:
|
||||
if hasattr(self.docker_config, "to_dict"):
|
||||
result["docker_config"] = self.docker_config.to_dict()
|
||||
else:
|
||||
result["docker_config"] = self.docker_config
|
||||
|
||||
|
||||
return result
|
||||
|
||||
@@ -589,6 +729,14 @@ class CrawlerRunConfig():
|
||||
proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
|
||||
If None, no additional proxy config. Default: None.
|
||||
|
||||
# Browser Location and Identity Parameters
|
||||
locale (str or None): Locale to use for the browser context (e.g., "en-US").
|
||||
Default: None.
|
||||
timezone_id (str or None): Timezone identifier to use for the browser context (e.g., "America/New_York").
|
||||
Default: None.
|
||||
geolocation (GeolocationConfig or None): Geolocation configuration for the browser.
|
||||
Default: None.
|
||||
|
||||
# SSL Parameters
|
||||
fetch_ssl_certificate: bool = False,
|
||||
# Caching Parameters
|
||||
@@ -738,6 +886,10 @@ class CrawlerRunConfig():
|
||||
scraping_strategy: ContentScrapingStrategy = None,
|
||||
proxy_config: Union[ProxyConfig, dict, None] = None,
|
||||
proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
|
||||
# Browser Location and Identity Parameters
|
||||
locale: Optional[str] = None,
|
||||
timezone_id: Optional[str] = None,
|
||||
geolocation: Optional[GeolocationConfig] = None,
|
||||
# SSL Parameters
|
||||
fetch_ssl_certificate: bool = False,
|
||||
# Caching Parameters
|
||||
@@ -826,6 +978,11 @@ class CrawlerRunConfig():
|
||||
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
|
||||
self.proxy_config = proxy_config
|
||||
self.proxy_rotation_strategy = proxy_rotation_strategy
|
||||
|
||||
# Browser Location and Identity Parameters
|
||||
self.locale = locale
|
||||
self.timezone_id = timezone_id
|
||||
self.geolocation = geolocation
|
||||
|
||||
# SSL Parameters
|
||||
self.fetch_ssl_certificate = fetch_ssl_certificate
|
||||
@@ -966,6 +1123,10 @@ class CrawlerRunConfig():
|
||||
scraping_strategy=kwargs.get("scraping_strategy"),
|
||||
proxy_config=kwargs.get("proxy_config"),
|
||||
proxy_rotation_strategy=kwargs.get("proxy_rotation_strategy"),
|
||||
# Browser Location and Identity Parameters
|
||||
locale=kwargs.get("locale", None),
|
||||
timezone_id=kwargs.get("timezone_id", None),
|
||||
geolocation=kwargs.get("geolocation", None),
|
||||
# SSL Parameters
|
||||
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
|
||||
# Caching Parameters
|
||||
@@ -1075,6 +1236,9 @@ class CrawlerRunConfig():
|
||||
"scraping_strategy": self.scraping_strategy,
|
||||
"proxy_config": self.proxy_config,
|
||||
"proxy_rotation_strategy": self.proxy_rotation_strategy,
|
||||
"locale": self.locale,
|
||||
"timezone_id": self.timezone_id,
|
||||
"geolocation": self.geolocation,
|
||||
"fetch_ssl_certificate": self.fetch_ssl_certificate,
|
||||
"cache_mode": self.cache_mode,
|
||||
"session_id": self.session_id,
|
||||
@@ -1180,9 +1344,18 @@ class LLMConfig:
|
||||
elif api_token and api_token.startswith("env:"):
|
||||
self.api_token = os.getenv(api_token[4:])
|
||||
else:
|
||||
self.api_token = PROVIDER_MODELS.get(provider, "no-token") or os.getenv(
|
||||
DEFAULT_PROVIDER_API_KEY
|
||||
)
|
||||
# Check if given provider starts with any of key in PROVIDER_MODELS_PREFIXES
|
||||
# If not, check if it is in PROVIDER_MODELS
|
||||
prefixes = PROVIDER_MODELS_PREFIXES.keys()
|
||||
if any(provider.startswith(prefix) for prefix in prefixes):
|
||||
selected_prefix = next(
|
||||
(prefix for prefix in prefixes if provider.startswith(prefix)),
|
||||
None,
|
||||
)
|
||||
self.api_token = PROVIDER_MODELS_PREFIXES.get(selected_prefix)
|
||||
else:
|
||||
self.provider = DEFAULT_PROVIDER
|
||||
self.api_token = os.getenv(DEFAULT_PROVIDER_API_KEY)
|
||||
self.base_url = base_url
|
||||
self.temprature = temprature
|
||||
self.max_tokens = max_tokens
|
||||
|
||||
@@ -24,7 +24,7 @@ from .browser_manager import BrowserManager
|
||||
|
||||
import aiofiles
|
||||
import aiohttp
|
||||
import cchardet
|
||||
import chardet
|
||||
from aiohttp.client import ClientTimeout
|
||||
from urllib.parse import urlparse
|
||||
from types import MappingProxyType
|
||||
@@ -130,6 +130,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
Close the browser and clean up resources.
|
||||
"""
|
||||
await self.browser_manager.close()
|
||||
# Explicitly reset the static Playwright instance
|
||||
BrowserManager._playwright_instance = None
|
||||
|
||||
async def kill_session(self, session_id: str):
|
||||
"""
|
||||
@@ -439,7 +441,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
status_code = 200 # Default for local/raw HTML
|
||||
screenshot_data = None
|
||||
|
||||
if url.startswith(("http://", "https://")):
|
||||
if url.startswith(("http://", "https://", "view-source:")):
|
||||
return await self._crawl_web(url, config)
|
||||
|
||||
elif url.startswith("file://"):
|
||||
@@ -569,6 +571,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
async def handle_response_capture(response):
|
||||
try:
|
||||
try:
|
||||
# body = await response.body()
|
||||
# json_body = await response.json()
|
||||
text_body = await response.text()
|
||||
except Exception as e:
|
||||
body = None
|
||||
# json_body = None
|
||||
# text_body = None
|
||||
captured_requests.append({
|
||||
"event_type": "response",
|
||||
"url": response.url,
|
||||
@@ -577,7 +587,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"headers": dict(response.headers), # Convert Header dict
|
||||
"from_service_worker": response.from_service_worker,
|
||||
"request_timing": response.request.timing, # Detailed timing info
|
||||
"timestamp": time.time()
|
||||
"timestamp": time.time(),
|
||||
"body" : {
|
||||
# "raw": body,
|
||||
# "json": json_body,
|
||||
"text": text_body
|
||||
}
|
||||
})
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
@@ -679,14 +694,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if console_log_type == "error":
|
||||
self.logger.error(
|
||||
message=f"Console error: {msg}", # Use f-string for variable interpolation
|
||||
tag="CONSOLE",
|
||||
params={"msg": msg.text},
|
||||
tag="CONSOLE"
|
||||
)
|
||||
elif console_log_type == "debug":
|
||||
self.logger.debug(
|
||||
message=f"Console: {msg}", # Use f-string for variable interpolation
|
||||
tag="CONSOLE",
|
||||
params={"msg": msg.text},
|
||||
tag="CONSOLE"
|
||||
)
|
||||
|
||||
page.on("console", log_consol)
|
||||
@@ -771,7 +784,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error:
|
||||
visibility_info = await self.check_visibility(page)
|
||||
|
||||
if self.config.verbose:
|
||||
if self.browser_config.config.verbose:
|
||||
self.logger.debug(
|
||||
message="Body visibility info: {info}",
|
||||
tag="DEBUG",
|
||||
@@ -967,7 +980,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
for selector in selectors:
|
||||
try:
|
||||
content = await page.evaluate(f"document.querySelector('{selector}')?.outerHTML || ''")
|
||||
content = await page.evaluate(
|
||||
f"""Array.from(document.querySelectorAll("{selector}"))
|
||||
.map(el => el.outerHTML)
|
||||
.join('')"""
|
||||
)
|
||||
html_parts.append(content)
|
||||
except Error as e:
|
||||
print(f"Warning: Could not get content for selector '{selector}': {str(e)}")
|
||||
@@ -1450,8 +1467,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode("utf-8")
|
||||
finally:
|
||||
await page.close()
|
||||
# finally:
|
||||
# await page.close()
|
||||
|
||||
async def take_screenshot_naive(self, page: Page) -> str:
|
||||
"""
|
||||
@@ -1484,8 +1501,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode("utf-8")
|
||||
finally:
|
||||
await page.close()
|
||||
# finally:
|
||||
# await page.close()
|
||||
|
||||
async def export_storage_state(self, path: str = None) -> dict:
|
||||
"""
|
||||
@@ -1975,7 +1992,7 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
await self.start()
|
||||
yield self._session
|
||||
finally:
|
||||
await self.close()
|
||||
pass
|
||||
|
||||
def set_hook(self, hook_type: str, hook_func: Callable) -> None:
|
||||
if hook_type in self.hooks:
|
||||
@@ -2091,7 +2108,7 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
encoding = response.charset
|
||||
if not encoding:
|
||||
encoding = cchardet.detect(content.tobytes())['encoding'] or 'utf-8'
|
||||
encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
|
||||
|
||||
result = AsyncCrawlResponse(
|
||||
html=content.tobytes().decode(encoding, errors='replace'),
|
||||
|
||||
@@ -171,7 +171,10 @@ class AsyncDatabaseManager:
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message=create_box_message(error_message, type="error"),
|
||||
message="{error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(error_message)},
|
||||
boxes=["error"],
|
||||
)
|
||||
|
||||
raise
|
||||
@@ -189,7 +192,10 @@ class AsyncDatabaseManager:
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message=create_box_message(error_message, type="error"),
|
||||
message="{error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(error_message)},
|
||||
boxes=["error"],
|
||||
)
|
||||
raise
|
||||
finally:
|
||||
|
||||
@@ -1,18 +1,48 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from enum import Enum
|
||||
from typing import Optional, Dict, Any
|
||||
from colorama import Fore, Style, init
|
||||
from typing import Optional, Dict, Any, List
|
||||
import os
|
||||
from datetime import datetime
|
||||
from urllib.parse import unquote
|
||||
from rich.console import Console
|
||||
from rich.text import Text
|
||||
from .utils import create_box_message
|
||||
|
||||
|
||||
class LogLevel(Enum):
|
||||
DEFAULT = 0
|
||||
DEBUG = 1
|
||||
INFO = 2
|
||||
SUCCESS = 3
|
||||
WARNING = 4
|
||||
ERROR = 5
|
||||
CRITICAL = 6
|
||||
ALERT = 7
|
||||
NOTICE = 8
|
||||
EXCEPTION = 9
|
||||
FATAL = 10
|
||||
|
||||
|
||||
def __str__(self):
|
||||
return self.name.lower()
|
||||
|
||||
class LogColor(str, Enum):
|
||||
"""Enum for log colors."""
|
||||
|
||||
DEBUG = "lightblack"
|
||||
INFO = "cyan"
|
||||
SUCCESS = "green"
|
||||
WARNING = "yellow"
|
||||
ERROR = "red"
|
||||
CYAN = "cyan"
|
||||
GREEN = "green"
|
||||
YELLOW = "yellow"
|
||||
MAGENTA = "magenta"
|
||||
DIM_MAGENTA = "dim magenta"
|
||||
|
||||
def __str__(self):
|
||||
"""Automatically convert rich color to string."""
|
||||
return self.value
|
||||
|
||||
|
||||
class AsyncLoggerBase(ABC):
|
||||
@@ -37,13 +67,14 @@ class AsyncLoggerBase(ABC):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def url_status(self, url: str, success: bool, timing: float, tag: str = "FETCH", url_length: int = 50):
|
||||
def url_status(self, url: str, success: bool, timing: float, tag: str = "FETCH", url_length: int = 100):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 50):
|
||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
|
||||
pass
|
||||
|
||||
|
||||
class AsyncLogger(AsyncLoggerBase):
|
||||
"""
|
||||
Asynchronous logger with support for colored console output and file logging.
|
||||
@@ -61,14 +92,21 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
"DEBUG": "⋯",
|
||||
"INFO": "ℹ",
|
||||
"WARNING": "⚠",
|
||||
"SUCCESS": "✔",
|
||||
"CRITICAL": "‼",
|
||||
"ALERT": "⚡",
|
||||
"NOTICE": "ℹ",
|
||||
"EXCEPTION": "❗",
|
||||
"FATAL": "☠",
|
||||
"DEFAULT": "•",
|
||||
}
|
||||
|
||||
DEFAULT_COLORS = {
|
||||
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
|
||||
LogLevel.INFO: Fore.CYAN,
|
||||
LogLevel.SUCCESS: Fore.GREEN,
|
||||
LogLevel.WARNING: Fore.YELLOW,
|
||||
LogLevel.ERROR: Fore.RED,
|
||||
LogLevel.DEBUG: LogColor.DEBUG,
|
||||
LogLevel.INFO: LogColor.INFO,
|
||||
LogLevel.SUCCESS: LogColor.SUCCESS,
|
||||
LogLevel.WARNING: LogColor.WARNING,
|
||||
LogLevel.ERROR: LogColor.ERROR,
|
||||
}
|
||||
|
||||
def __init__(
|
||||
@@ -77,7 +115,7 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
log_level: LogLevel = LogLevel.DEBUG,
|
||||
tag_width: int = 10,
|
||||
icons: Optional[Dict[str, str]] = None,
|
||||
colors: Optional[Dict[LogLevel, str]] = None,
|
||||
colors: Optional[Dict[LogLevel, LogColor]] = None,
|
||||
verbose: bool = True,
|
||||
):
|
||||
"""
|
||||
@@ -91,13 +129,13 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
colors: Custom colors for different log levels
|
||||
verbose: Whether to output to console
|
||||
"""
|
||||
init() # Initialize colorama
|
||||
self.log_file = log_file
|
||||
self.log_level = log_level
|
||||
self.tag_width = tag_width
|
||||
self.icons = icons or self.DEFAULT_ICONS
|
||||
self.colors = colors or self.DEFAULT_COLORS
|
||||
self.verbose = verbose
|
||||
self.console = Console()
|
||||
|
||||
# Create log file directory if needed
|
||||
if log_file:
|
||||
@@ -110,20 +148,23 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
def _get_icon(self, tag: str) -> str:
|
||||
"""Get the icon for a tag, defaulting to info icon if not found."""
|
||||
return self.icons.get(tag, self.icons["INFO"])
|
||||
|
||||
def _shorten(self, text, length, placeholder="..."):
|
||||
"""Truncate text in the middle if longer than length, or pad if shorter."""
|
||||
if len(text) <= length:
|
||||
return text.ljust(length) # Pad with spaces to reach desired length
|
||||
half = (length - len(placeholder)) // 2
|
||||
shortened = text[:half] + placeholder + text[-half:]
|
||||
return shortened.ljust(length) # Also pad shortened text to consistent length
|
||||
|
||||
def _write_to_file(self, message: str):
|
||||
"""Write a message to the log file if configured."""
|
||||
if self.log_file:
|
||||
text = Text.from_markup(message)
|
||||
plain_text = text.plain
|
||||
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
|
||||
with open(self.log_file, "a", encoding="utf-8") as f:
|
||||
# Strip ANSI color codes for file output
|
||||
clean_message = message.replace(Fore.RESET, "").replace(
|
||||
Style.RESET_ALL, ""
|
||||
)
|
||||
for color in vars(Fore).values():
|
||||
if isinstance(color, str):
|
||||
clean_message = clean_message.replace(color, "")
|
||||
f.write(f"[{timestamp}] {clean_message}\n")
|
||||
f.write(f"[{timestamp}] {plain_text}\n")
|
||||
|
||||
def _log(
|
||||
self,
|
||||
@@ -131,8 +172,9 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
message: str,
|
||||
tag: str,
|
||||
params: Optional[Dict[str, Any]] = None,
|
||||
colors: Optional[Dict[str, str]] = None,
|
||||
base_color: Optional[str] = None,
|
||||
colors: Optional[Dict[str, LogColor]] = None,
|
||||
boxes: Optional[List[str]] = None,
|
||||
base_color: Optional[LogColor] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
@@ -144,55 +186,44 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
tag: Tag for the message
|
||||
params: Parameters to format into the message
|
||||
colors: Color overrides for specific parameters
|
||||
boxes: Box overrides for specific parameters
|
||||
base_color: Base color for the entire message
|
||||
"""
|
||||
if level.value < self.log_level.value:
|
||||
return
|
||||
|
||||
# Format the message with parameters if provided
|
||||
# avoid conflict with rich formatting
|
||||
parsed_message = message.replace("[", "[[").replace("]", "]]")
|
||||
if params:
|
||||
try:
|
||||
# First format the message with raw parameters
|
||||
formatted_message = message.format(**params)
|
||||
# FIXME: If there are formatting strings in floating point format,
|
||||
# this may result in colors and boxes not being applied properly.
|
||||
# such as {value:.2f}, the value is 0.23333 format it to 0.23,
|
||||
# but we replace("0.23333", "[color]0.23333[/color]")
|
||||
formatted_message = parsed_message.format(**params)
|
||||
for key, value in params.items():
|
||||
# value_str may discard `[` and `]`, so we need to replace it.
|
||||
value_str = str(value).replace("[", "[[").replace("]", "]]")
|
||||
# check is need apply color
|
||||
if colors and key in colors:
|
||||
color_str = f"[{colors[key]}]{value_str}[/{colors[key]}]"
|
||||
formatted_message = formatted_message.replace(value_str, color_str)
|
||||
value_str = color_str
|
||||
|
||||
# Then apply colors if specified
|
||||
color_map = {
|
||||
"green": Fore.GREEN,
|
||||
"red": Fore.RED,
|
||||
"yellow": Fore.YELLOW,
|
||||
"blue": Fore.BLUE,
|
||||
"cyan": Fore.CYAN,
|
||||
"magenta": Fore.MAGENTA,
|
||||
"white": Fore.WHITE,
|
||||
"black": Fore.BLACK,
|
||||
"reset": Style.RESET_ALL,
|
||||
}
|
||||
if colors:
|
||||
for key, color in colors.items():
|
||||
# Find the formatted value in the message and wrap it with color
|
||||
if color in color_map:
|
||||
color = color_map[color]
|
||||
if key in params:
|
||||
value_str = str(params[key])
|
||||
formatted_message = formatted_message.replace(
|
||||
value_str, f"{color}{value_str}{Style.RESET_ALL}"
|
||||
)
|
||||
# check is need apply box
|
||||
if boxes and key in boxes:
|
||||
formatted_message = formatted_message.replace(value_str,
|
||||
create_box_message(value_str, type=str(level)))
|
||||
|
||||
except KeyError as e:
|
||||
formatted_message = (
|
||||
f"LOGGING ERROR: Missing parameter {e} in message template"
|
||||
)
|
||||
level = LogLevel.ERROR
|
||||
else:
|
||||
formatted_message = message
|
||||
formatted_message = parsed_message
|
||||
|
||||
# Construct the full log line
|
||||
color = base_color or self.colors[level]
|
||||
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
|
||||
color: LogColor = base_color or self.colors[level]
|
||||
log_line = f"[{color}]{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message} [/{color}]"
|
||||
|
||||
# Output to console if verbose
|
||||
if self.verbose or kwargs.get("force_verbose", False):
|
||||
print(log_line)
|
||||
self.console.print(log_line)
|
||||
|
||||
# Write to file if configured
|
||||
self._write_to_file(log_line)
|
||||
@@ -212,6 +243,22 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
def warning(self, message: str, tag: str = "WARNING", **kwargs):
|
||||
"""Log a warning message."""
|
||||
self._log(LogLevel.WARNING, message, tag, **kwargs)
|
||||
|
||||
def critical(self, message: str, tag: str = "CRITICAL", **kwargs):
|
||||
"""Log a critical message."""
|
||||
self._log(LogLevel.ERROR, message, tag, **kwargs)
|
||||
def exception(self, message: str, tag: str = "EXCEPTION", **kwargs):
|
||||
"""Log an exception message."""
|
||||
self._log(LogLevel.ERROR, message, tag, **kwargs)
|
||||
def fatal(self, message: str, tag: str = "FATAL", **kwargs):
|
||||
"""Log a fatal message."""
|
||||
self._log(LogLevel.ERROR, message, tag, **kwargs)
|
||||
def alert(self, message: str, tag: str = "ALERT", **kwargs):
|
||||
"""Log an alert message."""
|
||||
self._log(LogLevel.ERROR, message, tag, **kwargs)
|
||||
def notice(self, message: str, tag: str = "NOTICE", **kwargs):
|
||||
"""Log a notice message."""
|
||||
self._log(LogLevel.INFO, message, tag, **kwargs)
|
||||
|
||||
def error(self, message: str, tag: str = "ERROR", **kwargs):
|
||||
"""Log an error message."""
|
||||
@@ -223,7 +270,7 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
success: bool,
|
||||
timing: float,
|
||||
tag: str = "FETCH",
|
||||
url_length: int = 50,
|
||||
url_length: int = 100,
|
||||
):
|
||||
"""
|
||||
Convenience method for logging URL fetch status.
|
||||
@@ -235,19 +282,20 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
tag: Tag for the message
|
||||
url_length: Maximum length for URL in log
|
||||
"""
|
||||
decoded_url = unquote(url)
|
||||
readable_url = self._shorten(decoded_url, url_length)
|
||||
self._log(
|
||||
level=LogLevel.SUCCESS if success else LogLevel.ERROR,
|
||||
message="{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s",
|
||||
message="{url} | {status} | ⏱: {timing:.2f}s",
|
||||
tag=tag,
|
||||
params={
|
||||
"url": url,
|
||||
"url_length": url_length,
|
||||
"status": success,
|
||||
"url": readable_url,
|
||||
"status": "✓" if success else "✗",
|
||||
"timing": timing,
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if success else Fore.RED,
|
||||
"timing": Fore.YELLOW,
|
||||
"status": LogColor.SUCCESS if success else LogColor.ERROR,
|
||||
"timing": LogColor.WARNING,
|
||||
},
|
||||
)
|
||||
|
||||
@@ -263,11 +311,13 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
tag: Tag for the message
|
||||
url_length: Maximum length for URL in log
|
||||
"""
|
||||
decoded_url = unquote(url)
|
||||
readable_url = self._shorten(decoded_url, url_length)
|
||||
self._log(
|
||||
level=LogLevel.ERROR,
|
||||
message="{url:.{url_length}}... | Error: {error}",
|
||||
message="{url} | Error: {error}",
|
||||
tag=tag,
|
||||
params={"url": url, "url_length": url_length, "error": error},
|
||||
params={"url": readable_url, "error": error},
|
||||
)
|
||||
|
||||
class AsyncFileLogger(AsyncLoggerBase):
|
||||
@@ -311,13 +361,13 @@ class AsyncFileLogger(AsyncLoggerBase):
|
||||
"""Log an error message to file."""
|
||||
self._write_to_file("ERROR", message, tag)
|
||||
|
||||
def url_status(self, url: str, success: bool, timing: float, tag: str = "FETCH", url_length: int = 50):
|
||||
def url_status(self, url: str, success: bool, timing: float, tag: str = "FETCH", url_length: int = 100):
|
||||
"""Log URL fetch status to file."""
|
||||
status = "SUCCESS" if success else "FAILED"
|
||||
message = f"{url[:url_length]}... | Status: {status} | Time: {timing:.2f}s"
|
||||
self._write_to_file("URL_STATUS", message, tag)
|
||||
|
||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 50):
|
||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
|
||||
"""Log error status to file."""
|
||||
message = f"{url[:url_length]}... | Error: {error}"
|
||||
self._write_to_file("ERROR", message, tag)
|
||||
|
||||
@@ -2,7 +2,6 @@ from .__version__ import __version__ as crawl4ai_version
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from colorama import Fore
|
||||
from pathlib import Path
|
||||
from typing import Optional, List
|
||||
import json
|
||||
@@ -36,7 +35,7 @@ from .markdown_generation_strategy import (
|
||||
)
|
||||
from .deep_crawling import DeepCrawlDecorator
|
||||
from .async_logger import AsyncLogger, AsyncLoggerBase
|
||||
from .async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from .async_configs import BrowserConfig, CrawlerRunConfig, ProxyConfig
|
||||
from .async_dispatcher import * # noqa: F403
|
||||
from .async_dispatcher import BaseDispatcher, MemoryAdaptiveDispatcher, RateLimiter
|
||||
|
||||
@@ -44,9 +43,9 @@ from .utils import (
|
||||
sanitize_input_encode,
|
||||
InvalidCSSSelectorError,
|
||||
fast_format_html,
|
||||
create_box_message,
|
||||
get_error_context,
|
||||
RobotsParser,
|
||||
preprocess_html_for_schema,
|
||||
)
|
||||
|
||||
|
||||
@@ -111,7 +110,8 @@ class AsyncWebCrawler:
|
||||
self,
|
||||
crawler_strategy: AsyncCrawlerStrategy = None,
|
||||
config: BrowserConfig = None,
|
||||
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
|
||||
base_directory: str = str(
|
||||
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
|
||||
thread_safe: bool = False,
|
||||
logger: AsyncLoggerBase = None,
|
||||
**kwargs,
|
||||
@@ -139,7 +139,8 @@ class AsyncWebCrawler:
|
||||
)
|
||||
|
||||
# Initialize crawler strategy
|
||||
params = {k: v for k, v in kwargs.items() if k in ["browser_config", "logger"]}
|
||||
params = {k: v for k, v in kwargs.items() if k in [
|
||||
"browser_config", "logger"]}
|
||||
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
logger=self.logger,
|
||||
@@ -237,7 +238,8 @@ class AsyncWebCrawler:
|
||||
|
||||
config = config or CrawlerRunConfig()
|
||||
if not isinstance(url, str) or not url:
|
||||
raise ValueError("Invalid URL, make sure the URL is a non-empty string")
|
||||
raise ValueError(
|
||||
"Invalid URL, make sure the URL is a non-empty string")
|
||||
|
||||
async with self._lock or self.nullcontext():
|
||||
try:
|
||||
@@ -291,12 +293,12 @@ class AsyncWebCrawler:
|
||||
|
||||
# Update proxy configuration from rotation strategy if available
|
||||
if config and config.proxy_rotation_strategy:
|
||||
next_proxy = await config.proxy_rotation_strategy.get_next_proxy()
|
||||
next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
|
||||
if next_proxy:
|
||||
self.logger.info(
|
||||
message="Switch proxy: {proxy}",
|
||||
tag="PROXY",
|
||||
params={"proxy": next_proxy.server},
|
||||
params={"proxy": next_proxy.server}
|
||||
)
|
||||
config.proxy_config = next_proxy
|
||||
# config = config.clone(proxy_config=next_proxy)
|
||||
@@ -306,7 +308,8 @@ class AsyncWebCrawler:
|
||||
t1 = time.perf_counter()
|
||||
|
||||
if config.user_agent:
|
||||
self.crawler_strategy.update_user_agent(config.user_agent)
|
||||
self.crawler_strategy.update_user_agent(
|
||||
config.user_agent)
|
||||
|
||||
# Check robots.txt if enabled
|
||||
if config and config.check_robots_txt:
|
||||
@@ -353,10 +356,11 @@ class AsyncWebCrawler:
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
config=config, # Pass the config object instead of individual parameters
|
||||
screenshot=screenshot_data,
|
||||
screenshot_data=screenshot_data,
|
||||
pdf_data=pdf_data,
|
||||
verbose=config.verbose,
|
||||
is_raw_html=True if url.startswith("raw:") else False,
|
||||
redirected_url=async_response.redirected_url,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@@ -372,20 +376,14 @@ class AsyncWebCrawler:
|
||||
crawl_result.console_messages = async_response.console_messages
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = getattr(config, "session_id", None)
|
||||
crawl_result.session_id = getattr(
|
||||
config, "session_id", None)
|
||||
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=crawl_result.success,
|
||||
timing=time.perf_counter() - start_time,
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": crawl_result.success,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s",
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if crawl_result.success else Fore.RED,
|
||||
"timing": Fore.YELLOW,
|
||||
},
|
||||
)
|
||||
|
||||
# Update cache if appropriate
|
||||
@@ -395,19 +393,15 @@ class AsyncWebCrawler:
|
||||
return CrawlResultContainer(crawl_result)
|
||||
|
||||
else:
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": True,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s",
|
||||
},
|
||||
colors={"status": Fore.GREEN, "timing": Fore.YELLOW},
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=True,
|
||||
timing=time.perf_counter() - start_time,
|
||||
tag="COMPLETE"
|
||||
)
|
||||
|
||||
cached_result.success = bool(html)
|
||||
cached_result.session_id = getattr(config, "session_id", None)
|
||||
cached_result.session_id = getattr(
|
||||
config, "session_id", None)
|
||||
cached_result.redirected_url = cached_result.redirected_url or url
|
||||
return CrawlResultContainer(cached_result)
|
||||
|
||||
@@ -423,7 +417,7 @@ class AsyncWebCrawler:
|
||||
|
||||
self.logger.error_status(
|
||||
url=url,
|
||||
error=create_box_message(error_message, type="error"),
|
||||
error=error_message,
|
||||
tag="ERROR",
|
||||
)
|
||||
|
||||
@@ -439,7 +433,7 @@ class AsyncWebCrawler:
|
||||
html: str,
|
||||
extracted_content: str,
|
||||
config: CrawlerRunConfig,
|
||||
screenshot: str,
|
||||
screenshot_data: str,
|
||||
pdf_data: str,
|
||||
verbose: bool,
|
||||
**kwargs,
|
||||
@@ -452,7 +446,7 @@ class AsyncWebCrawler:
|
||||
html: Raw HTML content
|
||||
extracted_content: Previously extracted content (if any)
|
||||
config: Configuration object controlling processing behavior
|
||||
screenshot: Screenshot data (if any)
|
||||
screenshot_data: Screenshot data (if any)
|
||||
pdf_data: PDF data (if any)
|
||||
verbose: Whether to enable verbose logging
|
||||
**kwargs: Additional parameters for backwards compatibility
|
||||
@@ -474,12 +468,14 @@ class AsyncWebCrawler:
|
||||
params = config.__dict__.copy()
|
||||
params.pop("url", None)
|
||||
# add keys from kwargs to params that doesn't exist in params
|
||||
params.update({k: v for k, v in kwargs.items() if k not in params.keys()})
|
||||
params.update({k: v for k, v in kwargs.items()
|
||||
if k not in params.keys()})
|
||||
|
||||
################################
|
||||
# Scraping Strategy Execution #
|
||||
################################
|
||||
result: ScrapingResult = scraping_strategy.scrap(url, html, **params)
|
||||
result: ScrapingResult = scraping_strategy.scrap(
|
||||
url, html, **params)
|
||||
|
||||
if result is None:
|
||||
raise ValueError(
|
||||
@@ -495,15 +491,20 @@ class AsyncWebCrawler:
|
||||
|
||||
# Extract results - handle both dict and ScrapingResult
|
||||
if isinstance(result, dict):
|
||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||
cleaned_html = sanitize_input_encode(
|
||||
result.get("cleaned_html", ""))
|
||||
media = result.get("media", {})
|
||||
tables = media.pop("tables", []) if isinstance(media, dict) else []
|
||||
links = result.get("links", {})
|
||||
metadata = result.get("metadata", {})
|
||||
else:
|
||||
cleaned_html = sanitize_input_encode(result.cleaned_html)
|
||||
media = result.media.model_dump()
|
||||
tables = media.pop("tables", [])
|
||||
links = result.links.model_dump()
|
||||
metadata = result.metadata
|
||||
|
||||
fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)
|
||||
|
||||
################################
|
||||
# Generate Markdown #
|
||||
@@ -512,27 +513,65 @@ class AsyncWebCrawler:
|
||||
config.markdown_generator or DefaultMarkdownGenerator()
|
||||
)
|
||||
|
||||
# --- SELECT HTML SOURCE BASED ON CONTENT_SOURCE ---
|
||||
# Get the desired source from the generator config, default to 'cleaned_html'
|
||||
selected_html_source = getattr(markdown_generator, 'content_source', 'cleaned_html')
|
||||
|
||||
# Define the source selection logic using dict dispatch
|
||||
html_source_selector = {
|
||||
"raw_html": lambda: html, # The original raw HTML
|
||||
"cleaned_html": lambda: cleaned_html, # The HTML after scraping strategy
|
||||
"fit_html": lambda: fit_html, # The HTML after preprocessing for schema
|
||||
}
|
||||
|
||||
markdown_input_html = cleaned_html # Default to cleaned_html
|
||||
|
||||
try:
|
||||
# Get the appropriate lambda function, default to returning cleaned_html if key not found
|
||||
source_lambda = html_source_selector.get(selected_html_source, lambda: cleaned_html)
|
||||
# Execute the lambda to get the selected HTML
|
||||
markdown_input_html = source_lambda()
|
||||
|
||||
# Log which source is being used (optional, but helpful for debugging)
|
||||
# if self.logger and verbose:
|
||||
# actual_source_used = selected_html_source if selected_html_source in html_source_selector else 'cleaned_html (default)'
|
||||
# self.logger.debug(f"Using '{actual_source_used}' as source for Markdown generation for {url}", tag="MARKDOWN_SRC")
|
||||
|
||||
except Exception as e:
|
||||
# Handle potential errors, especially from preprocess_html_for_schema
|
||||
if self.logger:
|
||||
self.logger.warning(
|
||||
f"Error getting/processing '{selected_html_source}' for markdown source: {e}. Falling back to cleaned_html.",
|
||||
tag="MARKDOWN_SRC"
|
||||
)
|
||||
# Ensure markdown_input_html is still the default cleaned_html in case of error
|
||||
markdown_input_html = cleaned_html
|
||||
# --- END: HTML SOURCE SELECTION ---
|
||||
|
||||
# Uncomment if by default we want to use PruningContentFilter
|
||||
# if not config.content_filter and not markdown_generator.content_filter:
|
||||
# markdown_generator.content_filter = PruningContentFilter()
|
||||
|
||||
markdown_result: MarkdownGenerationResult = (
|
||||
markdown_generator.generate_markdown(
|
||||
cleaned_html=cleaned_html,
|
||||
base_url=url,
|
||||
input_html=markdown_input_html,
|
||||
base_url=params.get("redirected_url", url)
|
||||
# html2text_options=kwargs.get('html2text', {})
|
||||
)
|
||||
)
|
||||
|
||||
# Log processing completion
|
||||
self.logger.info(
|
||||
message="{url:.50}... | Time: {timing}s",
|
||||
tag="SCRAPE",
|
||||
params={
|
||||
"url": _url,
|
||||
"timing": int((time.perf_counter() - t1) * 1000) / 1000,
|
||||
},
|
||||
self.logger.url_status(
|
||||
url=_url,
|
||||
success=True,
|
||||
timing=int((time.perf_counter() - t1) * 1000) / 1000,
|
||||
tag="SCRAPE"
|
||||
)
|
||||
# self.logger.info(
|
||||
# message="{url:.50}... | Time: {timing}s",
|
||||
# tag="SCRAPE",
|
||||
# params={"url": _url, "timing": int((time.perf_counter() - t1) * 1000) / 1000},
|
||||
# )
|
||||
|
||||
################################
|
||||
# Structured Content Extraction #
|
||||
@@ -556,6 +595,7 @@ class AsyncWebCrawler:
|
||||
content = {
|
||||
"markdown": markdown_result.raw_markdown,
|
||||
"html": html,
|
||||
"fit_html": fit_html,
|
||||
"cleaned_html": cleaned_html,
|
||||
"fit_markdown": markdown_result.fit_markdown,
|
||||
}.get(content_format, markdown_result.raw_markdown)
|
||||
@@ -563,7 +603,7 @@ class AsyncWebCrawler:
|
||||
# Use IdentityChunking for HTML input, otherwise use provided chunking strategy
|
||||
chunking = (
|
||||
IdentityChunking()
|
||||
if content_format in ["html", "cleaned_html"]
|
||||
if content_format in ["html", "cleaned_html", "fit_html"]
|
||||
else config.chunking_strategy
|
||||
)
|
||||
sections = chunking.chunk(content)
|
||||
@@ -579,10 +619,6 @@ class AsyncWebCrawler:
|
||||
params={"url": _url, "timing": time.perf_counter() - t1},
|
||||
)
|
||||
|
||||
# Handle screenshot and PDF data
|
||||
screenshot_data = None if not screenshot else screenshot
|
||||
pdf_data = None if not pdf_data else pdf_data
|
||||
|
||||
# Apply HTML formatting if requested
|
||||
if config.prettiify:
|
||||
cleaned_html = fast_format_html(cleaned_html)
|
||||
@@ -591,9 +627,11 @@ class AsyncWebCrawler:
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html=html,
|
||||
fit_html=fit_html,
|
||||
cleaned_html=cleaned_html,
|
||||
markdown=markdown_result,
|
||||
media=media,
|
||||
tables=tables, # NEW
|
||||
links=links,
|
||||
metadata=metadata,
|
||||
screenshot=screenshot_data,
|
||||
|
||||
@@ -5,7 +5,10 @@ import os
|
||||
import sys
|
||||
import shutil
|
||||
import tempfile
|
||||
import psutil
|
||||
import signal
|
||||
import subprocess
|
||||
import shlex
|
||||
from playwright.async_api import BrowserContext
|
||||
import hashlib
|
||||
from .js_snippet import load_js_script
|
||||
@@ -76,6 +79,51 @@ class ManagedBrowser:
|
||||
_cleanup(): Terminates the browser process and removes the temporary directory.
|
||||
create_profile(): Static method to create a user profile by launching a browser for user interaction.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def build_browser_flags(config: BrowserConfig) -> List[str]:
|
||||
"""Common CLI flags for launching Chromium"""
|
||||
flags = [
|
||||
"--disable-gpu",
|
||||
"--disable-gpu-compositing",
|
||||
"--disable-software-rasterizer",
|
||||
"--no-sandbox",
|
||||
"--disable-dev-shm-usage",
|
||||
"--no-first-run",
|
||||
"--no-default-browser-check",
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--window-position=400,0",
|
||||
"--disable-renderer-backgrounding",
|
||||
"--disable-ipc-flooding-protection",
|
||||
"--force-color-profile=srgb",
|
||||
"--mute-audio",
|
||||
"--disable-background-timer-throttling",
|
||||
]
|
||||
if config.light_mode:
|
||||
flags.extend(BROWSER_DISABLE_OPTIONS)
|
||||
if config.text_mode:
|
||||
flags.extend([
|
||||
"--blink-settings=imagesEnabled=false",
|
||||
"--disable-remote-fonts",
|
||||
"--disable-images",
|
||||
"--disable-javascript",
|
||||
"--disable-software-rasterizer",
|
||||
"--disable-dev-shm-usage",
|
||||
])
|
||||
# proxy support
|
||||
if config.proxy:
|
||||
flags.append(f"--proxy-server={config.proxy}")
|
||||
elif config.proxy_config:
|
||||
creds = ""
|
||||
if config.proxy_config.username and config.proxy_config.password:
|
||||
creds = f"{config.proxy_config.username}:{config.proxy_config.password}@"
|
||||
flags.append(f"--proxy-server={creds}{config.proxy_config.server}")
|
||||
# dedupe
|
||||
return list(dict.fromkeys(flags))
|
||||
|
||||
browser_type: str
|
||||
user_data_dir: str
|
||||
@@ -94,6 +142,7 @@ class ManagedBrowser:
|
||||
host: str = "localhost",
|
||||
debugging_port: int = 9222,
|
||||
cdp_url: Optional[str] = None,
|
||||
browser_config: Optional[BrowserConfig] = None,
|
||||
):
|
||||
"""
|
||||
Initialize the ManagedBrowser instance.
|
||||
@@ -109,17 +158,19 @@ class ManagedBrowser:
|
||||
host (str): Host for debugging the browser. Default: "localhost".
|
||||
debugging_port (int): Port for debugging the browser. Default: 9222.
|
||||
cdp_url (str or None): CDP URL to connect to the browser. Default: None.
|
||||
browser_config (BrowserConfig): Configuration object containing all browser settings. Default: None.
|
||||
"""
|
||||
self.browser_type = browser_type
|
||||
self.user_data_dir = user_data_dir
|
||||
self.headless = headless
|
||||
self.browser_type = browser_config.browser_type
|
||||
self.user_data_dir = browser_config.user_data_dir
|
||||
self.headless = browser_config.headless
|
||||
self.browser_process = None
|
||||
self.temp_dir = None
|
||||
self.debugging_port = debugging_port
|
||||
self.host = host
|
||||
self.debugging_port = browser_config.debugging_port
|
||||
self.host = browser_config.host
|
||||
self.logger = logger
|
||||
self.shutting_down = False
|
||||
self.cdp_url = cdp_url
|
||||
self.cdp_url = browser_config.cdp_url
|
||||
self.browser_config = browser_config
|
||||
|
||||
async def start(self) -> str:
|
||||
"""
|
||||
@@ -142,6 +193,48 @@ class ManagedBrowser:
|
||||
# Get browser path and args based on OS and browser type
|
||||
# browser_path = self._get_browser_path()
|
||||
args = await self._get_browser_args()
|
||||
|
||||
if self.browser_config.extra_args:
|
||||
args.extend(self.browser_config.extra_args)
|
||||
|
||||
|
||||
# ── make sure no old Chromium instance is owning the same port/profile ──
|
||||
try:
|
||||
if sys.platform == "win32":
|
||||
if psutil is None:
|
||||
raise RuntimeError("psutil not available, cannot clean old browser")
|
||||
for p in psutil.process_iter(["pid", "name", "cmdline"]):
|
||||
cl = " ".join(p.info.get("cmdline") or [])
|
||||
if (
|
||||
f"--remote-debugging-port={self.debugging_port}" in cl
|
||||
and f"--user-data-dir={self.user_data_dir}" in cl
|
||||
):
|
||||
p.kill()
|
||||
p.wait(timeout=5)
|
||||
else: # macOS / Linux
|
||||
# kill any process listening on the same debugging port
|
||||
pids = (
|
||||
subprocess.check_output(shlex.split(f"lsof -t -i:{self.debugging_port}"))
|
||||
.decode()
|
||||
.strip()
|
||||
.splitlines()
|
||||
)
|
||||
for pid in pids:
|
||||
try:
|
||||
os.kill(int(pid), signal.SIGTERM)
|
||||
except ProcessLookupError:
|
||||
pass
|
||||
|
||||
# remove Chromium singleton locks, or new launch exits with
|
||||
# “Opening in existing browser session.”
|
||||
for f in ("SingletonLock", "SingletonSocket", "SingletonCookie"):
|
||||
fp = os.path.join(self.user_data_dir, f)
|
||||
if os.path.exists(fp):
|
||||
os.remove(fp)
|
||||
except Exception as _e:
|
||||
# non-fatal — we'll try to start anyway, but log what happened
|
||||
self.logger.warning(f"pre-launch cleanup failed: {_e}", tag="BROWSER")
|
||||
|
||||
|
||||
# Start browser process
|
||||
try:
|
||||
@@ -274,29 +367,29 @@ class ManagedBrowser:
|
||||
return browser_path
|
||||
|
||||
async def _get_browser_args(self) -> List[str]:
|
||||
"""Returns browser-specific command line arguments"""
|
||||
base_args = [await self._get_browser_path()]
|
||||
|
||||
"""Returns full CLI args for launching the browser"""
|
||||
base = [await self._get_browser_path()]
|
||||
if self.browser_type == "chromium":
|
||||
args = [
|
||||
flags = [
|
||||
f"--remote-debugging-port={self.debugging_port}",
|
||||
f"--user-data-dir={self.user_data_dir}",
|
||||
]
|
||||
if self.headless:
|
||||
args.append("--headless=new")
|
||||
flags.append("--headless=new")
|
||||
# merge common launch flags
|
||||
flags.extend(self.build_browser_flags(self.browser_config))
|
||||
elif self.browser_type == "firefox":
|
||||
args = [
|
||||
flags = [
|
||||
"--remote-debugging-port",
|
||||
str(self.debugging_port),
|
||||
"--profile",
|
||||
self.user_data_dir,
|
||||
]
|
||||
if self.headless:
|
||||
args.append("--headless")
|
||||
flags.append("--headless")
|
||||
else:
|
||||
raise NotImplementedError(f"Browser type {self.browser_type} not supported")
|
||||
|
||||
return base_args + args
|
||||
return base + flags
|
||||
|
||||
async def cleanup(self):
|
||||
"""Cleanup browser process and temporary directory"""
|
||||
@@ -477,6 +570,7 @@ class BrowserManager:
|
||||
logger=self.logger,
|
||||
debugging_port=self.config.debugging_port,
|
||||
cdp_url=self.config.cdp_url,
|
||||
browser_config=self.config,
|
||||
)
|
||||
|
||||
async def start(self):
|
||||
@@ -565,6 +659,9 @@ class BrowserManager:
|
||||
if self.config.extra_args:
|
||||
args.extend(self.config.extra_args)
|
||||
|
||||
# Deduplicate args
|
||||
args = list(dict.fromkeys(args))
|
||||
|
||||
browser_args = {"headless": self.config.headless, "args": args}
|
||||
|
||||
if self.config.chrome_channel:
|
||||
@@ -779,6 +876,23 @@ class BrowserManager:
|
||||
# Update context settings with text mode settings
|
||||
context_settings.update(text_mode_settings)
|
||||
|
||||
# inject locale / tz / geo if user provided them
|
||||
if crawlerRunConfig:
|
||||
if crawlerRunConfig.locale:
|
||||
context_settings["locale"] = crawlerRunConfig.locale
|
||||
if crawlerRunConfig.timezone_id:
|
||||
context_settings["timezone_id"] = crawlerRunConfig.timezone_id
|
||||
if crawlerRunConfig.geolocation:
|
||||
context_settings["geolocation"] = {
|
||||
"latitude": crawlerRunConfig.geolocation.latitude,
|
||||
"longitude": crawlerRunConfig.geolocation.longitude,
|
||||
"accuracy": crawlerRunConfig.geolocation.accuracy,
|
||||
}
|
||||
# ensure geolocation permission
|
||||
perms = context_settings.get("permissions", [])
|
||||
perms.append("geolocation")
|
||||
context_settings["permissions"] = perms
|
||||
|
||||
# Create and return the context with all settings
|
||||
context = await self.browser.new_context(**context_settings)
|
||||
|
||||
@@ -811,6 +925,10 @@ class BrowserManager:
|
||||
"semaphore_count",
|
||||
"url"
|
||||
]
|
||||
|
||||
# Do NOT exclude locale, timezone_id, or geolocation as these DO affect browser context
|
||||
# and should cause a new context to be created if they change
|
||||
|
||||
for key in ephemeral_keys:
|
||||
if key in config_dict:
|
||||
del config_dict[key]
|
||||
@@ -846,7 +964,7 @@ class BrowserManager:
|
||||
pages = context.pages
|
||||
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
||||
if not page:
|
||||
page = await context.new_page()
|
||||
page = context.pages[0] # await context.new_page()
|
||||
else:
|
||||
# Otherwise, check if we have an existing context for this config
|
||||
config_signature = self._make_config_signature(crawlerRunConfig)
|
||||
|
||||
@@ -15,12 +15,12 @@ import shutil
|
||||
import json
|
||||
import subprocess
|
||||
import time
|
||||
from typing import List, Dict, Optional, Any, Tuple
|
||||
from colorama import Fore, Style, init
|
||||
from typing import List, Dict, Optional, Any
|
||||
from rich.console import Console
|
||||
|
||||
from .async_configs import BrowserConfig
|
||||
from .browser_manager import ManagedBrowser
|
||||
from .async_logger import AsyncLogger, AsyncLoggerBase
|
||||
from .async_logger import AsyncLogger, AsyncLoggerBase, LogColor
|
||||
from .utils import get_home_folder
|
||||
|
||||
|
||||
@@ -45,8 +45,8 @@ class BrowserProfiler:
|
||||
logger (AsyncLoggerBase, optional): Logger for outputting messages.
|
||||
If None, a default AsyncLogger will be created.
|
||||
"""
|
||||
# Initialize colorama for colorful terminal output
|
||||
init()
|
||||
# Initialize rich console for colorful input prompts
|
||||
self.console = Console()
|
||||
|
||||
# Create a logger if not provided
|
||||
if logger is None:
|
||||
@@ -127,26 +127,30 @@ class BrowserProfiler:
|
||||
profile_path = os.path.join(self.profiles_dir, profile_name)
|
||||
os.makedirs(profile_path, exist_ok=True)
|
||||
|
||||
# Print instructions for the user with colorama formatting
|
||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
||||
self.logger.info(f"\n{border}", tag="PROFILE")
|
||||
self.logger.info(f"Creating browser profile: {Fore.GREEN}{profile_name}{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info(f"Profile directory: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
||||
# Print instructions for the user with rich formatting
|
||||
border = f"{'='*80}"
|
||||
self.logger.info("{border}", tag="PROFILE", params={"border": f"\n{border}"}, colors={"border": LogColor.CYAN})
|
||||
self.logger.info("Creating browser profile: {profile_name}", tag="PROFILE", params={"profile_name": profile_name}, colors={"profile_name": LogColor.GREEN})
|
||||
self.logger.info("Profile directory: {profile_path}", tag="PROFILE", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||
|
||||
self.logger.info("\nInstructions:", tag="PROFILE")
|
||||
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
|
||||
self.logger.info(f"2. {Fore.CYAN}Log in to websites{Style.RESET_ALL}, configure settings, etc. as needed.", tag="PROFILE")
|
||||
self.logger.info(f"3. When you're done, {Fore.YELLOW}press 'q' in this terminal{Style.RESET_ALL} to close the browser.", tag="PROFILE")
|
||||
self.logger.info("{segment}, configure settings, etc. as needed.", tag="PROFILE", params={"segment": "2. Log in to websites"}, colors={"segment": LogColor.CYAN})
|
||||
self.logger.info("3. When you're done, {segment} to close the browser.", tag="PROFILE", params={"segment": "press 'q' in this terminal"}, colors={"segment": LogColor.YELLOW})
|
||||
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
|
||||
self.logger.info(f"{border}\n", tag="PROFILE")
|
||||
self.logger.info("{border}", tag="PROFILE", params={"border": f"{border}\n"}, colors={"border": LogColor.CYAN})
|
||||
|
||||
browser_config.headless = False
|
||||
browser_config.user_data_dir = profile_path
|
||||
|
||||
|
||||
# Create managed browser instance
|
||||
managed_browser = ManagedBrowser(
|
||||
browser_type=browser_config.browser_type,
|
||||
user_data_dir=profile_path,
|
||||
headless=False, # Must be visible
|
||||
browser_config=browser_config,
|
||||
# user_data_dir=profile_path,
|
||||
# headless=False, # Must be visible
|
||||
logger=self.logger,
|
||||
debugging_port=browser_config.debugging_port
|
||||
# debugging_port=browser_config.debugging_port
|
||||
)
|
||||
|
||||
# Set up signal handlers to ensure cleanup on interrupt
|
||||
@@ -181,7 +185,7 @@ class BrowserProfiler:
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' when you've finished using the browser...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
@@ -197,7 +201,7 @@ class BrowserProfiler:
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info(f"{Fore.GREEN}Closing browser and saving profile...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
@@ -223,7 +227,7 @@ class BrowserProfiler:
|
||||
self.logger.error("Failed to start browser process.", tag="PROFILE")
|
||||
return None
|
||||
|
||||
self.logger.info(f"Browser launched. {Fore.CYAN}Waiting for you to finish...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Browser launched. Waiting for you to finish...", tag="PROFILE")
|
||||
|
||||
# Start listening for keyboard input
|
||||
listener_task = asyncio.create_task(listen_for_quit_command())
|
||||
@@ -245,10 +249,10 @@ class BrowserProfiler:
|
||||
self.logger.info("Terminating browser process...", tag="PROFILE")
|
||||
await managed_browser.cleanup()
|
||||
|
||||
self.logger.success(f"Browser closed. Profile saved at: {Fore.GREEN}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.success(f"Browser closed. Profile saved at: {profile_path}", tag="PROFILE")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error creating profile: {str(e)}", tag="PROFILE")
|
||||
self.logger.error(f"Error creating profile: {e!s}", tag="PROFILE")
|
||||
await managed_browser.cleanup()
|
||||
return None
|
||||
finally:
|
||||
@@ -440,25 +444,27 @@ class BrowserProfiler:
|
||||
```
|
||||
"""
|
||||
while True:
|
||||
self.logger.info(f"\n{Fore.CYAN}Profile Management Options:{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"1. {Fore.GREEN}Create a new profile{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"2. {Fore.YELLOW}List available profiles{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"3. {Fore.RED}Delete a profile{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("\nProfile Management Options:", tag="MENU")
|
||||
self.logger.info("1. Create a new profile", tag="MENU", base_color=LogColor.GREEN)
|
||||
self.logger.info("2. List available profiles", tag="MENU", base_color=LogColor.YELLOW)
|
||||
self.logger.info("3. Delete a profile", tag="MENU", base_color=LogColor.RED)
|
||||
|
||||
# Only show crawl option if callback provided
|
||||
if crawl_callback:
|
||||
self.logger.info(f"4. {Fore.CYAN}Use a profile to crawl a website{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"5. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("4. Use a profile to crawl a website", tag="MENU", base_color=LogColor.CYAN)
|
||||
self.logger.info("5. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||
exit_option = "5"
|
||||
else:
|
||||
self.logger.info(f"4. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||
exit_option = "4"
|
||||
|
||||
choice = input(f"\n{Fore.CYAN}Enter your choice (1-{exit_option}): {Style.RESET_ALL}")
|
||||
self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
|
||||
choice = input()
|
||||
|
||||
if choice == "1":
|
||||
# Create new profile
|
||||
name = input(f"{Fore.GREEN}Enter a name for the new profile (or press Enter for auto-generated name): {Style.RESET_ALL}")
|
||||
self.console.print("[green]Enter a name for the new profile (or press Enter for auto-generated name): [/green]", end="")
|
||||
name = input()
|
||||
await self.create_profile(name or None)
|
||||
|
||||
elif choice == "2":
|
||||
@@ -469,11 +475,11 @@ class BrowserProfiler:
|
||||
self.logger.warning(" No profiles found. Create one first with option 1.", tag="PROFILES")
|
||||
continue
|
||||
|
||||
# Print profile information with colorama formatting
|
||||
# Print profile information
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES")
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {Fore.CYAN}{profile['name']}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f" Path: {Fore.YELLOW}{profile['path']}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
self.logger.info(f" Path: {profile['path']}", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
|
||||
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
|
||||
self.logger.info("", tag="PROFILES") # Empty line for spacing
|
||||
@@ -486,12 +492,13 @@ class BrowserProfiler:
|
||||
continue
|
||||
|
||||
# Display numbered list
|
||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
|
||||
# Get profile to delete
|
||||
profile_idx = input(f"{Fore.RED}Enter the number of the profile to delete (or 'c' to cancel): {Style.RESET_ALL}")
|
||||
self.console.print("[red]Enter the number of the profile to delete (or 'c' to cancel): [/red]", end="")
|
||||
profile_idx = input()
|
||||
if profile_idx.lower() == 'c':
|
||||
continue
|
||||
|
||||
@@ -499,17 +506,18 @@ class BrowserProfiler:
|
||||
idx = int(profile_idx) - 1
|
||||
if 0 <= idx < len(profiles):
|
||||
profile_name = profiles[idx]["name"]
|
||||
self.logger.info(f"Deleting profile: {Fore.YELLOW}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f"Deleting profile: [yellow]{profile_name}[/yellow]", tag="PROFILES")
|
||||
|
||||
# Confirm deletion
|
||||
confirm = input(f"{Fore.RED}Are you sure you want to delete this profile? (y/n): {Style.RESET_ALL}")
|
||||
self.console.print("[red]Are you sure you want to delete this profile? (y/n): [/red]", end="")
|
||||
confirm = input()
|
||||
if confirm.lower() == 'y':
|
||||
success = self.delete_profile(profiles[idx]["path"])
|
||||
|
||||
if success:
|
||||
self.logger.success(f"Profile {Fore.GREEN}{profile_name}{Style.RESET_ALL} deleted successfully", tag="PROFILES")
|
||||
self.logger.success(f"Profile {profile_name} deleted successfully", tag="PROFILES")
|
||||
else:
|
||||
self.logger.error(f"Failed to delete profile {Fore.RED}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.error(f"Failed to delete profile {profile_name}", tag="PROFILES")
|
||||
else:
|
||||
self.logger.error("Invalid profile number", tag="PROFILES")
|
||||
except ValueError:
|
||||
@@ -523,12 +531,13 @@ class BrowserProfiler:
|
||||
continue
|
||||
|
||||
# Display numbered list
|
||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
|
||||
# Get profile to use
|
||||
profile_idx = input(f"{Fore.CYAN}Enter the number of the profile to use (or 'c' to cancel): {Style.RESET_ALL}")
|
||||
self.console.print("[cyan]Enter the number of the profile to use (or 'c' to cancel): [/cyan]", end="")
|
||||
profile_idx = input()
|
||||
if profile_idx.lower() == 'c':
|
||||
continue
|
||||
|
||||
@@ -536,7 +545,8 @@ class BrowserProfiler:
|
||||
idx = int(profile_idx) - 1
|
||||
if 0 <= idx < len(profiles):
|
||||
profile_path = profiles[idx]["path"]
|
||||
url = input(f"{Fore.CYAN}Enter the URL to crawl: {Style.RESET_ALL}")
|
||||
self.console.print("[cyan]Enter the URL to crawl: [/cyan]", end="")
|
||||
url = input()
|
||||
if url:
|
||||
# Call the provided crawl callback
|
||||
await crawl_callback(profile_path, url)
|
||||
@@ -597,13 +607,13 @@ class BrowserProfiler:
|
||||
os.makedirs(profile_path, exist_ok=True)
|
||||
|
||||
# Print initial information
|
||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
||||
self.logger.info(f"\n{border}", tag="CDP")
|
||||
self.logger.info(f"Launching standalone browser with CDP debugging", tag="CDP")
|
||||
self.logger.info(f"Browser type: {Fore.GREEN}{browser_type}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Profile path: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Debugging port: {Fore.CYAN}{debugging_port}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Headless mode: {Fore.CYAN}{headless}{Style.RESET_ALL}", tag="CDP")
|
||||
border = f"{'='*80}"
|
||||
self.logger.info("{border}", tag="CDP", params={"border": border}, colors={"border": LogColor.CYAN})
|
||||
self.logger.info("Launching standalone browser with CDP debugging", tag="CDP")
|
||||
self.logger.info("Browser type: {browser_type}", tag="CDP", params={"browser_type": browser_type}, colors={"browser_type": LogColor.CYAN})
|
||||
self.logger.info("Profile path: {profile_path}", tag="CDP", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||
self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
|
||||
self.logger.info(f"Headless mode: {headless}", tag="CDP")
|
||||
|
||||
# Create managed browser instance
|
||||
managed_browser = ManagedBrowser(
|
||||
@@ -646,7 +656,7 @@ class BrowserProfiler:
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' to stop the browser and exit...{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
@@ -662,7 +672,7 @@ class BrowserProfiler:
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info(f"{Fore.GREEN}Closing browser...{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("Closing browser...", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
@@ -716,20 +726,20 @@ class BrowserProfiler:
|
||||
self.logger.error("Failed to start browser process.", tag="CDP")
|
||||
return None
|
||||
|
||||
self.logger.info(f"Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
||||
self.logger.info("Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
||||
|
||||
# Get CDP URL and JSON config
|
||||
cdp_url, config_json = await get_cdp_json(debugging_port)
|
||||
|
||||
if cdp_url:
|
||||
self.logger.success(f"CDP URL: {Fore.GREEN}{cdp_url}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.success(f"CDP URL: {cdp_url}", tag="CDP")
|
||||
|
||||
if config_json:
|
||||
# Display relevant CDP information
|
||||
self.logger.info(f"Browser: {Fore.CYAN}{config_json.get('Browser', 'Unknown')}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP")
|
||||
self.logger.info(f"Browser: {config_json.get('Browser', 'Unknown')}", tag="CDP", colors={"Browser": LogColor.CYAN})
|
||||
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP", colors={"Protocol-Version": LogColor.CYAN})
|
||||
if 'webSocketDebuggerUrl' in config_json:
|
||||
self.logger.info(f"WebSocket URL: {Fore.GREEN}{config_json['webSocketDebuggerUrl']}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("WebSocket URL: {webSocketDebuggerUrl}", tag="CDP", params={"webSocketDebuggerUrl": config_json['webSocketDebuggerUrl']}, colors={"webSocketDebuggerUrl": LogColor.GREEN})
|
||||
else:
|
||||
self.logger.warning("Could not retrieve CDP configuration JSON", tag="CDP")
|
||||
else:
|
||||
@@ -757,7 +767,7 @@ class BrowserProfiler:
|
||||
self.logger.info("Terminating browser process...", tag="CDP")
|
||||
await managed_browser.cleanup()
|
||||
|
||||
self.logger.success(f"Browser closed.", tag="CDP")
|
||||
self.logger.success("Browser closed.", tag="CDP")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error launching standalone browser: {str(e)}", tag="CDP")
|
||||
@@ -972,3 +982,30 @@ class BrowserProfiler:
|
||||
'info': browser_info
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example usage
|
||||
profiler = BrowserProfiler()
|
||||
|
||||
# Create a new profile
|
||||
import os
|
||||
from pathlib import Path
|
||||
home_dir = Path.home()
|
||||
profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
|
||||
|
||||
|
||||
|
||||
# Launch a standalone browser
|
||||
asyncio.run(profiler.launch_standalone_browser())
|
||||
|
||||
# List profiles
|
||||
profiles = profiler.list_profiles()
|
||||
for profile in profiles:
|
||||
print(f"Profile: {profile['name']}, Path: {profile['path']}")
|
||||
|
||||
# Delete a profile
|
||||
success = profiler.delete_profile("my-profile")
|
||||
if success:
|
||||
print("Profile deleted successfully")
|
||||
else:
|
||||
print("Failed to delete profile")
|
||||
@@ -29,6 +29,14 @@ PROVIDER_MODELS = {
|
||||
'gemini/gemini-2.0-flash-lite-preview-02-05': os.getenv("GEMINI_API_KEY"),
|
||||
"deepseek/deepseek-chat": os.getenv("DEEPSEEK_API_KEY"),
|
||||
}
|
||||
PROVIDER_MODELS_PREFIXES = {
|
||||
"ollama": "no-token-needed", # Any model from Ollama no need for API token
|
||||
"groq": os.getenv("GROQ_API_KEY"),
|
||||
"openai": os.getenv("OPENAI_API_KEY"),
|
||||
"anthropic": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"gemini": os.getenv("GEMINI_API_KEY"),
|
||||
"deepseek": os.getenv("DEEPSEEK_API_KEY"),
|
||||
}
|
||||
|
||||
# Chunk token threshold
|
||||
CHUNK_TOKEN_THRESHOLD = 2**11 # 2048 tokens
|
||||
|
||||
@@ -27,8 +27,7 @@ import json
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .async_logger import AsyncLogger, LogLevel
|
||||
from colorama import Fore, Style
|
||||
from .async_logger import AsyncLogger, LogLevel, LogColor
|
||||
|
||||
|
||||
class RelevantContentFilter(ABC):
|
||||
@@ -846,8 +845,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
},
|
||||
colors={
|
||||
**AsyncLogger.DEFAULT_COLORS,
|
||||
LogLevel.INFO: Fore.MAGENTA
|
||||
+ Style.DIM, # Dimmed purple for LLM ops
|
||||
LogLevel.INFO: LogColor.DIM_MAGENTA # Dimmed purple for LLM ops
|
||||
},
|
||||
)
|
||||
else:
|
||||
@@ -892,7 +890,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"Starting LLM markdown content filtering process",
|
||||
tag="LLM",
|
||||
params={"provider": self.llm_config.provider},
|
||||
colors={"provider": Fore.CYAN},
|
||||
colors={"provider": LogColor.CYAN},
|
||||
)
|
||||
|
||||
# Cache handling
|
||||
@@ -929,7 +927,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"LLM markdown: Split content into {chunk_count} chunks",
|
||||
tag="CHUNK",
|
||||
params={"chunk_count": len(html_chunks)},
|
||||
colors={"chunk_count": Fore.YELLOW},
|
||||
colors={"chunk_count": LogColor.YELLOW},
|
||||
)
|
||||
|
||||
start_time = time.time()
|
||||
@@ -1038,7 +1036,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"LLM markdown: Completed processing in {time:.2f}s",
|
||||
tag="LLM",
|
||||
params={"time": end_time - start_time},
|
||||
colors={"time": Fore.YELLOW},
|
||||
colors={"time": LogColor.YELLOW},
|
||||
)
|
||||
|
||||
result = ordered_results if ordered_results else []
|
||||
|
||||
@@ -28,6 +28,7 @@ from lxml import etree
|
||||
from lxml import html as lhtml
|
||||
from typing import List
|
||||
from .models import ScrapingResult, MediaItem, Link, Media, Links
|
||||
import copy
|
||||
|
||||
# Pre-compile regular expressions for Open Graph and Twitter metadata
|
||||
OG_REGEX = re.compile(r"^og:")
|
||||
@@ -48,7 +49,7 @@ def parse_srcset(s: str) -> List[Dict]:
|
||||
if len(parts) >= 1:
|
||||
url = parts[0]
|
||||
width = (
|
||||
parts[1].rstrip("w")
|
||||
parts[1].rstrip("w").split('.')[0]
|
||||
if len(parts) > 1 and parts[1].endswith("w")
|
||||
else None
|
||||
)
|
||||
@@ -128,7 +129,8 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
Returns:
|
||||
ScrapingResult: A structured result containing the scraped content.
|
||||
"""
|
||||
raw_result = self._scrap(url, html, is_async=False, **kwargs)
|
||||
actual_url = kwargs.get("redirected_url", url)
|
||||
raw_result = self._scrap(actual_url, html, is_async=False, **kwargs)
|
||||
if raw_result is None:
|
||||
return ScrapingResult(
|
||||
cleaned_html="",
|
||||
@@ -619,6 +621,9 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
# Special case for table elements - always preserve structure
|
||||
if element.name in ["tr", "td", "th"]:
|
||||
keep_element = True
|
||||
|
||||
exclude_domains = kwargs.get("exclude_domains", [])
|
||||
# exclude_social_media_domains = kwargs.get('exclude_social_media_domains', set(SOCIAL_MEDIA_DOMAINS))
|
||||
@@ -859,6 +864,8 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
parser_type = kwargs.get("parser", "lxml")
|
||||
soup = BeautifulSoup(html, parser_type)
|
||||
body = soup.body
|
||||
if body is None:
|
||||
raise Exception("'<body>' tag is not found in fetched html. Consider adding wait_for=\"css:body\" to wait for body tag to be loaded into DOM.")
|
||||
base_domain = get_base_domain(url)
|
||||
|
||||
# Early removal of all images if exclude_all_images is set
|
||||
@@ -897,23 +904,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
for element in body.select(excluded_selector):
|
||||
element.extract()
|
||||
|
||||
# if False and css_selector:
|
||||
# selected_elements = body.select(css_selector)
|
||||
# if not selected_elements:
|
||||
# return {
|
||||
# "markdown": "",
|
||||
# "cleaned_html": "",
|
||||
# "success": True,
|
||||
# "media": {"images": [], "videos": [], "audios": []},
|
||||
# "links": {"internal": [], "external": []},
|
||||
# "metadata": {},
|
||||
# "message": f"No elements found for CSS selector: {css_selector}",
|
||||
# }
|
||||
# # raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
|
||||
# body = soup.new_tag("div")
|
||||
# for el in selected_elements:
|
||||
# body.append(el)
|
||||
|
||||
content_element = None
|
||||
if target_elements:
|
||||
try:
|
||||
@@ -922,12 +912,12 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
for_content_targeted_element.extend(body.select(target_element))
|
||||
content_element = soup.new_tag("div")
|
||||
for el in for_content_targeted_element:
|
||||
content_element.append(el)
|
||||
content_element.append(copy.deepcopy(el))
|
||||
except Exception as e:
|
||||
self._log("error", f"Error with target element detection: {str(e)}", "SCRAPE")
|
||||
return None
|
||||
else:
|
||||
content_element = body
|
||||
content_element = body
|
||||
|
||||
kwargs["exclude_social_media_domains"] = set(
|
||||
kwargs.get("exclude_social_media_domains", []) + SOCIAL_MEDIA_DOMAINS
|
||||
@@ -1308,6 +1298,9 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
|
||||
"source",
|
||||
"track",
|
||||
"wbr",
|
||||
"tr",
|
||||
"td",
|
||||
"th",
|
||||
}
|
||||
|
||||
for el in reversed(list(root.iterdescendants())):
|
||||
@@ -1540,26 +1533,6 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
|
||||
self._log("error", f"Error extracting metadata: {str(e)}", "SCRAPE")
|
||||
meta = {}
|
||||
|
||||
# Handle CSS selector targeting
|
||||
# if css_selector:
|
||||
# try:
|
||||
# selected_elements = body.cssselect(css_selector)
|
||||
# if not selected_elements:
|
||||
# return {
|
||||
# "markdown": "",
|
||||
# "cleaned_html": "",
|
||||
# "success": True,
|
||||
# "media": {"images": [], "videos": [], "audios": []},
|
||||
# "links": {"internal": [], "external": []},
|
||||
# "metadata": meta,
|
||||
# "message": f"No elements found for CSS selector: {css_selector}",
|
||||
# }
|
||||
# body = lhtml.Element("div")
|
||||
# body.extend(selected_elements)
|
||||
# except Exception as e:
|
||||
# self._log("error", f"Error with CSS selector: {str(e)}", "SCRAPE")
|
||||
# return None
|
||||
|
||||
content_element = None
|
||||
if target_elements:
|
||||
try:
|
||||
@@ -1567,7 +1540,7 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
|
||||
for target_element in target_elements:
|
||||
for_content_targeted_element.extend(body.cssselect(target_element))
|
||||
content_element = lhtml.Element("div")
|
||||
content_element.extend(for_content_targeted_element)
|
||||
content_element.extend(copy.deepcopy(for_content_targeted_element))
|
||||
except Exception as e:
|
||||
self._log("error", f"Error with target element detection: {str(e)}", "SCRAPE")
|
||||
return None
|
||||
@@ -1636,7 +1609,7 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
|
||||
# Remove empty elements
|
||||
self.remove_empty_elements_fast(body, 1)
|
||||
|
||||
# Remvoe unneeded attributes
|
||||
# Remove unneeded attributes
|
||||
self.remove_unwanted_attributes_fast(
|
||||
body, keep_data_attributes=kwargs.get("keep_data_attributes", False)
|
||||
)
|
||||
|
||||
@@ -11,6 +11,7 @@ from .scorers import URLScorer
|
||||
from . import DeepCrawlStrategy
|
||||
|
||||
from ..types import AsyncWebCrawler, CrawlerRunConfig, CrawlResult, RunManyReturn
|
||||
from ..utils import normalize_url_for_deep_crawl
|
||||
|
||||
from math import inf as infinity
|
||||
|
||||
@@ -106,13 +107,14 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
valid_links = []
|
||||
for link in links:
|
||||
url = link.get("href")
|
||||
if url in visited:
|
||||
base_url = normalize_url_for_deep_crawl(url, source_url)
|
||||
if base_url in visited:
|
||||
continue
|
||||
if not await self.can_process_url(url, new_depth):
|
||||
self.stats.urls_skipped += 1
|
||||
continue
|
||||
|
||||
valid_links.append(url)
|
||||
valid_links.append(base_url)
|
||||
|
||||
# If we have more valid links than capacity, limit them
|
||||
if len(valid_links) > remaining_capacity:
|
||||
|
||||
@@ -117,7 +117,8 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
|
||||
self.logger.debug(f"URL {url} skipped: score {score} below threshold {self.score_threshold}")
|
||||
self.stats.urls_skipped += 1
|
||||
continue
|
||||
|
||||
|
||||
visited.add(base_url)
|
||||
valid_links.append((base_url, score))
|
||||
|
||||
# If we have more valid links than capacity, sort by score and take the top ones
|
||||
@@ -158,7 +159,6 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
|
||||
while current_level and not self._cancel_event.is_set():
|
||||
next_level: List[Tuple[str, Optional[str]]] = []
|
||||
urls = [url for url, _ in current_level]
|
||||
visited.update(urls)
|
||||
|
||||
# Clone the config to disable deep crawling recursion and enforce batch mode.
|
||||
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
|
||||
|
||||
@@ -1,9 +1,10 @@
|
||||
from abc import ABC, abstractmethod
|
||||
import inspect
|
||||
from typing import Any, List, Dict, Optional
|
||||
from typing import Any, List, Dict, Optional, Tuple, Pattern, Union
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
import json
|
||||
import time
|
||||
from enum import IntFlag, auto
|
||||
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION, PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION, JSON_SCHEMA_BUILDER_XPATH, PROMPT_EXTRACT_INFERRED_SCHEMA
|
||||
from .config import (
|
||||
@@ -1668,3 +1669,303 @@ class JsonXPathExtractionStrategy(JsonElementExtractionStrategy):
|
||||
def _get_element_attribute(self, element, attribute: str):
|
||||
return element.get(attribute)
|
||||
|
||||
"""
|
||||
RegexExtractionStrategy
|
||||
Fast, zero-LLM extraction of common entities via regular expressions.
|
||||
"""
|
||||
|
||||
_CTRL = {c: rf"\x{ord(c):02x}" for c in map(chr, range(32)) if c not in "\t\n\r"}
|
||||
|
||||
_WB_FIX = re.compile(r"\x08") # stray back-space → word-boundary
|
||||
_NEEDS_ESCAPE = re.compile(r"(?<!\\)\\(?![\\u])") # lone backslash
|
||||
|
||||
def _sanitize_schema(schema: Dict[str, str]) -> Dict[str, str]:
|
||||
"""Fix common JSON-escape goofs coming from LLMs or manual edits."""
|
||||
safe = {}
|
||||
for label, pat in schema.items():
|
||||
# 1️⃣ replace accidental control chars (inc. the infamous back-space)
|
||||
pat = _WB_FIX.sub(r"\\b", pat).translate(_CTRL)
|
||||
|
||||
# 2️⃣ double any single backslash that JSON kept single
|
||||
pat = _NEEDS_ESCAPE.sub(r"\\\\", pat)
|
||||
|
||||
# 3️⃣ quick sanity compile
|
||||
try:
|
||||
re.compile(pat)
|
||||
except re.error as e:
|
||||
raise ValueError(f"Regex for '{label}' won’t compile after fix: {e}") from None
|
||||
|
||||
safe[label] = pat
|
||||
return safe
|
||||
|
||||
|
||||
class RegexExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
A lean strategy that finds e-mails, phones, URLs, dates, money, etc.,
|
||||
using nothing but pre-compiled regular expressions.
|
||||
|
||||
Extraction returns::
|
||||
|
||||
{
|
||||
"url": "<page-url>",
|
||||
"label": "<pattern-label>",
|
||||
"value": "<matched-string>",
|
||||
"span": [start, end]
|
||||
}
|
||||
|
||||
Only `generate_schema()` touches an LLM, extraction itself is pure Python.
|
||||
"""
|
||||
|
||||
# -------------------------------------------------------------- #
|
||||
# Built-in patterns exposed as IntFlag so callers can bit-OR them
|
||||
# -------------------------------------------------------------- #
|
||||
class _B(IntFlag):
|
||||
EMAIL = auto()
|
||||
PHONE_INTL = auto()
|
||||
PHONE_US = auto()
|
||||
URL = auto()
|
||||
IPV4 = auto()
|
||||
IPV6 = auto()
|
||||
UUID = auto()
|
||||
CURRENCY = auto()
|
||||
PERCENTAGE = auto()
|
||||
NUMBER = auto()
|
||||
DATE_ISO = auto()
|
||||
DATE_US = auto()
|
||||
TIME_24H = auto()
|
||||
POSTAL_US = auto()
|
||||
POSTAL_UK = auto()
|
||||
HTML_COLOR_HEX = auto()
|
||||
TWITTER_HANDLE = auto()
|
||||
HASHTAG = auto()
|
||||
MAC_ADDR = auto()
|
||||
IBAN = auto()
|
||||
CREDIT_CARD = auto()
|
||||
NOTHING = auto()
|
||||
ALL = (
|
||||
EMAIL | PHONE_INTL | PHONE_US | URL | IPV4 | IPV6 | UUID
|
||||
| CURRENCY | PERCENTAGE | NUMBER | DATE_ISO | DATE_US | TIME_24H
|
||||
| POSTAL_US | POSTAL_UK | HTML_COLOR_HEX | TWITTER_HANDLE
|
||||
| HASHTAG | MAC_ADDR | IBAN | CREDIT_CARD
|
||||
)
|
||||
|
||||
# user-friendly aliases (RegexExtractionStrategy.Email, .IPv4, …)
|
||||
Email = _B.EMAIL
|
||||
PhoneIntl = _B.PHONE_INTL
|
||||
PhoneUS = _B.PHONE_US
|
||||
Url = _B.URL
|
||||
IPv4 = _B.IPV4
|
||||
IPv6 = _B.IPV6
|
||||
Uuid = _B.UUID
|
||||
Currency = _B.CURRENCY
|
||||
Percentage = _B.PERCENTAGE
|
||||
Number = _B.NUMBER
|
||||
DateIso = _B.DATE_ISO
|
||||
DateUS = _B.DATE_US
|
||||
Time24h = _B.TIME_24H
|
||||
PostalUS = _B.POSTAL_US
|
||||
PostalUK = _B.POSTAL_UK
|
||||
HexColor = _B.HTML_COLOR_HEX
|
||||
TwitterHandle = _B.TWITTER_HANDLE
|
||||
Hashtag = _B.HASHTAG
|
||||
MacAddr = _B.MAC_ADDR
|
||||
Iban = _B.IBAN
|
||||
CreditCard = _B.CREDIT_CARD
|
||||
All = _B.ALL
|
||||
Nothing = _B(0) # no patterns
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Built-in pattern catalog
|
||||
# ------------------------------------------------------------------ #
|
||||
DEFAULT_PATTERNS: Dict[str, str] = {
|
||||
# Communication
|
||||
"email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
|
||||
"phone_intl": r"\+?\d[\d .()-]{7,}\d",
|
||||
"phone_us": r"\(?\d{3}\)?[ -. ]?\d{3}[ -. ]?\d{4}",
|
||||
# Web
|
||||
"url": r"https?://[^\s\"'<>]+",
|
||||
"ipv4": r"(?:\d{1,3}\.){3}\d{1,3}",
|
||||
"ipv6": r"[A-F0-9]{1,4}(?::[A-F0-9]{1,4}){7}",
|
||||
# IDs
|
||||
"uuid": r"[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
|
||||
# Money / numbers
|
||||
"currency": r"(?:USD|EUR|RM|\$|€|£)\s?\d+(?:[.,]\d{2})?",
|
||||
"percentage": r"\d+(?:\.\d+)?%",
|
||||
"number": r"\b\d{1,3}(?:[,.\s]\d{3})*(?:\.\d+)?\b",
|
||||
# Dates / Times
|
||||
"date_iso": r"\d{4}-\d{2}-\d{2}",
|
||||
"date_us": r"\d{1,2}/\d{1,2}/\d{2,4}",
|
||||
"time_24h": r"\b(?:[01]?\d|2[0-3]):[0-5]\d(?:[:.][0-5]\d)?\b",
|
||||
# Misc
|
||||
"postal_us": r"\b\d{5}(?:-\d{4})?\b",
|
||||
"postal_uk": r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b",
|
||||
"html_color_hex": r"#[0-9A-Fa-f]{6}\b",
|
||||
"twitter_handle": r"@[\w]{1,15}",
|
||||
"hashtag": r"#[\w-]+",
|
||||
"mac_addr": r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}",
|
||||
"iban": r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}",
|
||||
"credit_card": r"\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|6(?:011|5\d{2})\d{12})\b",
|
||||
}
|
||||
|
||||
_FLAGS = re.IGNORECASE | re.MULTILINE
|
||||
_UNWANTED_PROPS = {
|
||||
"provider": "Use llm_config instead",
|
||||
"api_token": "Use llm_config instead",
|
||||
}
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Construction
|
||||
# ------------------------------------------------------------------ #
|
||||
def __init__(
|
||||
self,
|
||||
pattern: "_B" = _B.NOTHING,
|
||||
*,
|
||||
custom: Optional[Union[Dict[str, str], List[Tuple[str, str]]]] = None,
|
||||
input_format: str = "fit_html",
|
||||
**kwargs,
|
||||
) -> None:
|
||||
"""
|
||||
Args:
|
||||
patterns: Custom patterns overriding or extending defaults.
|
||||
Dict[label, regex] or list[tuple(label, regex)].
|
||||
input_format: "html", "markdown" or "text".
|
||||
**kwargs: Forwarded to ExtractionStrategy.
|
||||
"""
|
||||
super().__init__(input_format=input_format, **kwargs)
|
||||
|
||||
# 1️⃣ take only the requested built-ins
|
||||
merged: Dict[str, str] = {
|
||||
key: rx
|
||||
for key, rx in self.DEFAULT_PATTERNS.items()
|
||||
if getattr(self._B, key.upper()).value & pattern
|
||||
}
|
||||
|
||||
# 2️⃣ apply user overrides / additions
|
||||
if custom:
|
||||
if isinstance(custom, dict):
|
||||
merged.update(custom)
|
||||
else: # iterable of (label, regex)
|
||||
merged.update({lbl: rx for lbl, rx in custom})
|
||||
|
||||
self._compiled: Dict[str, Pattern] = {
|
||||
lbl: re.compile(rx, self._FLAGS) for lbl, rx in merged.items()
|
||||
}
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Extraction
|
||||
# ------------------------------------------------------------------ #
|
||||
def extract(self, url: str, content: str, *q, **kw) -> List[Dict[str, Any]]:
|
||||
# text = self._plain_text(html)
|
||||
out: List[Dict[str, Any]] = []
|
||||
|
||||
for label, cre in self._compiled.items():
|
||||
for m in cre.finditer(content):
|
||||
out.append(
|
||||
{
|
||||
"url": url,
|
||||
"label": label,
|
||||
"value": m.group(0),
|
||||
"span": [m.start(), m.end()],
|
||||
}
|
||||
)
|
||||
return out
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Helpers
|
||||
# ------------------------------------------------------------------ #
|
||||
def _plain_text(self, content: str) -> str:
|
||||
if self.input_format == "text":
|
||||
return content
|
||||
return BeautifulSoup(content, "lxml").get_text(" ", strip=True)
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# LLM-assisted pattern generator
|
||||
# ------------------------------------------------------------------ #
|
||||
# ------------------------------------------------------------------ #
|
||||
# LLM-assisted one-off pattern builder
|
||||
# ------------------------------------------------------------------ #
|
||||
@staticmethod
|
||||
def generate_pattern(
|
||||
label: str,
|
||||
html: str,
|
||||
*,
|
||||
query: Optional[str] = None,
|
||||
examples: Optional[List[str]] = None,
|
||||
llm_config: Optional[LLMConfig] = None,
|
||||
**kwargs,
|
||||
) -> Dict[str, str]:
|
||||
"""
|
||||
Ask an LLM for a single page-specific regex and return
|
||||
{label: pattern} ── ready for RegexExtractionStrategy(custom=…)
|
||||
"""
|
||||
|
||||
# ── guard deprecated kwargs
|
||||
for k in RegexExtractionStrategy._UNWANTED_PROPS:
|
||||
if k in kwargs:
|
||||
raise AttributeError(
|
||||
f"{k} is deprecated, {RegexExtractionStrategy._UNWANTED_PROPS[k]}"
|
||||
)
|
||||
|
||||
# ── default LLM config
|
||||
if llm_config is None:
|
||||
llm_config = create_llm_config()
|
||||
|
||||
# ── system prompt – hardened
|
||||
system_msg = (
|
||||
"You are an expert Python-regex engineer.\n"
|
||||
f"Return **one** JSON object whose single key is exactly \"{label}\", "
|
||||
"and whose value is a raw-string regex pattern that works with "
|
||||
"the standard `re` module in Python.\n\n"
|
||||
"Strict rules (obey every bullet):\n"
|
||||
"• If a *user query* is supplied, treat it as the precise semantic target and optimise the "
|
||||
" pattern to capture ONLY text that answers that query. If the query conflicts with the "
|
||||
" sample HTML, the HTML wins.\n"
|
||||
"• Tailor the pattern to the *sample HTML* – reproduce its exact punctuation, spacing, "
|
||||
" symbols, capitalisation, etc. Do **NOT** invent a generic form.\n"
|
||||
"• Keep it minimal and fast: avoid unnecessary capturing, prefer non-capturing `(?: … )`, "
|
||||
" and guard against catastrophic backtracking.\n"
|
||||
"• Anchor with `^`, `$`, or `\\b` only when it genuinely improves precision.\n"
|
||||
"• Use inline flags like `(?i)` when needed; no verbose flag comments.\n"
|
||||
"• Output must be valid JSON – no markdown, code fences, comments, or extra keys.\n"
|
||||
"• The regex value must be a Python string literal: **double every backslash** "
|
||||
"(e.g. `\\\\b`, `\\\\d`, `\\\\\\\\`).\n\n"
|
||||
"Example valid output:\n"
|
||||
f"{{\"{label}\": \"(?:RM|rm)\\\\s?\\\\d{{1,3}}(?:,\\\\d{{3}})*(?:\\\\.\\\\d{{2}})?\"}}"
|
||||
)
|
||||
|
||||
# ── user message: cropped HTML + optional hints
|
||||
user_parts = ["```html", html[:5000], "```"] # protect token budget
|
||||
if query:
|
||||
user_parts.append(f"\n\n## Query\n{query.strip()}")
|
||||
if examples:
|
||||
user_parts.append("## Examples\n" + "\n".join(examples[:20]))
|
||||
user_msg = "\n\n".join(user_parts)
|
||||
|
||||
# ── LLM call (with retry/backoff)
|
||||
resp = perform_completion_with_backoff(
|
||||
provider=llm_config.provider,
|
||||
prompt_with_variables="\n\n".join([system_msg, user_msg]),
|
||||
json_response=True,
|
||||
api_token=llm_config.api_token,
|
||||
base_url=llm_config.base_url,
|
||||
extra_args=kwargs,
|
||||
)
|
||||
|
||||
# ── clean & load JSON (fix common escape mistakes *before* json.loads)
|
||||
raw = resp.choices[0].message.content
|
||||
raw = raw.replace("\x08", "\\b") # stray back-space → \b
|
||||
raw = re.sub(r'(?<!\\)\\(?![\\u"])', r"\\\\", raw) # lone \ → \\
|
||||
|
||||
try:
|
||||
pattern_dict = json.loads(raw)
|
||||
except Exception as exc:
|
||||
raise ValueError(f"LLM did not return valid JSON: {raw}") from exc
|
||||
|
||||
# quick sanity-compile
|
||||
for lbl, pat in pattern_dict.items():
|
||||
try:
|
||||
re.compile(pat)
|
||||
except re.error as e:
|
||||
raise ValueError(f"Invalid regex for '{lbl}': {e}") from None
|
||||
|
||||
return pattern_dict
|
||||
|
||||
@@ -115,5 +115,6 @@ async () => {
|
||||
document.body.style.overflow = "auto";
|
||||
|
||||
// Wait a bit for any animations to complete
|
||||
await new Promise((resolve) => setTimeout(resolve, 100));
|
||||
document.body.scrollIntoView(false);
|
||||
await new Promise((resolve) => setTimeout(resolve, 50));
|
||||
};
|
||||
|
||||
@@ -31,22 +31,24 @@ class MarkdownGenerationStrategy(ABC):
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
verbose: bool = False,
|
||||
content_source: str = "cleaned_html",
|
||||
):
|
||||
self.content_filter = content_filter
|
||||
self.options = options or {}
|
||||
self.verbose = verbose
|
||||
self.content_source = content_source
|
||||
|
||||
@abstractmethod
|
||||
def generate_markdown(
|
||||
self,
|
||||
cleaned_html: str,
|
||||
input_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
citations: bool = True,
|
||||
**kwargs,
|
||||
) -> MarkdownGenerationResult:
|
||||
"""Generate markdown from cleaned HTML."""
|
||||
"""Generate markdown from the selected input HTML."""
|
||||
pass
|
||||
|
||||
|
||||
@@ -63,6 +65,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
Args:
|
||||
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
|
||||
options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
|
||||
content_source (str): Source of content to generate markdown from. Options: "cleaned_html", "raw_html", "fit_html". Defaults to "cleaned_html".
|
||||
|
||||
Returns:
|
||||
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
|
||||
@@ -72,8 +75,9 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
self,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
content_source: str = "cleaned_html",
|
||||
):
|
||||
super().__init__(content_filter, options)
|
||||
super().__init__(content_filter, options, verbose=False, content_source=content_source)
|
||||
|
||||
def convert_links_to_citations(
|
||||
self, markdown: str, base_url: str = ""
|
||||
@@ -143,7 +147,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
|
||||
def generate_markdown(
|
||||
self,
|
||||
cleaned_html: str,
|
||||
input_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
@@ -152,16 +156,16 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
**kwargs,
|
||||
) -> MarkdownGenerationResult:
|
||||
"""
|
||||
Generate markdown with citations from cleaned HTML.
|
||||
Generate markdown with citations from the provided input HTML.
|
||||
|
||||
How it works:
|
||||
1. Generate raw markdown from cleaned HTML.
|
||||
1. Generate raw markdown from the input HTML.
|
||||
2. Convert links to citations.
|
||||
3. Generate fit markdown if content filter is provided.
|
||||
4. Return MarkdownGenerationResult.
|
||||
|
||||
Args:
|
||||
cleaned_html (str): Cleaned HTML content.
|
||||
input_html (str): The HTML content to process (selected based on content_source).
|
||||
base_url (str): Base URL for URL joins.
|
||||
html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
|
||||
options (Optional[Dict[str, Any]]): Additional options for markdown generation.
|
||||
@@ -196,14 +200,14 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
h.update_params(**default_options)
|
||||
|
||||
# Ensure we have valid input
|
||||
if not cleaned_html:
|
||||
cleaned_html = ""
|
||||
elif not isinstance(cleaned_html, str):
|
||||
cleaned_html = str(cleaned_html)
|
||||
if not input_html:
|
||||
input_html = ""
|
||||
elif not isinstance(input_html, str):
|
||||
input_html = str(input_html)
|
||||
|
||||
# Generate raw markdown
|
||||
try:
|
||||
raw_markdown = h.handle(cleaned_html)
|
||||
raw_markdown = h.handle(input_html)
|
||||
except Exception as e:
|
||||
raw_markdown = f"Error converting HTML to markdown: {str(e)}"
|
||||
|
||||
@@ -228,7 +232,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
if content_filter or self.content_filter:
|
||||
try:
|
||||
content_filter = content_filter or self.content_filter
|
||||
filtered_html = content_filter.filter_content(cleaned_html)
|
||||
filtered_html = content_filter.filter_content(input_html)
|
||||
filtered_html = "\n".join(
|
||||
"<div>{}</div>".format(s) for s in filtered_html
|
||||
)
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from pydantic import BaseModel, HttpUrl, PrivateAttr
|
||||
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
|
||||
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
||||
from typing import AsyncGenerator
|
||||
from typing import Generic, TypeVar
|
||||
@@ -129,6 +129,7 @@ class MarkdownGenerationResult(BaseModel):
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
fit_html: Optional[str] = None
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
@@ -150,6 +151,7 @@ class CrawlResult(BaseModel):
|
||||
redirected_url: Optional[str] = None
|
||||
network_requests: Optional[List[Dict[str, Any]]] = None
|
||||
console_messages: Optional[List[Dict[str, Any]]] = None
|
||||
tables: List[Dict] = Field(default_factory=list) # NEW – [{headers,rows,caption,summary}]
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
@@ -4,6 +4,9 @@ from itertools import cycle
|
||||
import os
|
||||
|
||||
|
||||
########### ATTENTION PEOPLE OF EARTH ###########
|
||||
# I have moved this config to async_configs.py, kept it here, in case someone still importing it, however
|
||||
# be a dear and follow `from crawl4ai import ProxyConfig` instead :)
|
||||
class ProxyConfig:
|
||||
def __init__(
|
||||
self,
|
||||
@@ -119,12 +122,12 @@ class ProxyRotationStrategy(ABC):
|
||||
"""Base abstract class for proxy rotation strategies"""
|
||||
|
||||
@abstractmethod
|
||||
async def get_next_proxy(self) -> Optional[Dict]:
|
||||
async def get_next_proxy(self) -> Optional[ProxyConfig]:
|
||||
"""Get next proxy configuration from the strategy"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def add_proxies(self, proxies: List[Dict]):
|
||||
def add_proxies(self, proxies: List[ProxyConfig]):
|
||||
"""Add proxy configurations to the strategy"""
|
||||
pass
|
||||
|
||||
|
||||
@@ -9,83 +9,44 @@ from urllib.parse import urlparse
|
||||
import OpenSSL.crypto
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class SSLCertificate:
|
||||
# === Inherit from dict ===
|
||||
class SSLCertificate(dict):
|
||||
"""
|
||||
A class representing an SSL certificate with methods to export in various formats.
|
||||
A class representing an SSL certificate, behaving like a dictionary
|
||||
for direct JSON serialization. It stores the certificate information internally
|
||||
and provides methods for export and property access.
|
||||
|
||||
Attributes:
|
||||
cert_info (Dict[str, Any]): The certificate information.
|
||||
|
||||
Methods:
|
||||
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
|
||||
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
|
||||
from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
|
||||
export_as_pem() -> str: Export the certificate as PEM format.
|
||||
export_as_der() -> bytes: Export the certificate as DER format.
|
||||
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
|
||||
export_as_text() -> str: Export the certificate as text format.
|
||||
Inherits from dict, so instances are directly JSON serializable.
|
||||
"""
|
||||
|
||||
# Use __slots__ for potential memory optimization if desired, though less common when inheriting dict
|
||||
# __slots__ = ("_cert_info",) # If using slots, be careful with dict inheritance interaction
|
||||
|
||||
def __init__(self, cert_info: Dict[str, Any]):
|
||||
self._cert_info = self._decode_cert_data(cert_info)
|
||||
|
||||
@staticmethod
|
||||
def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
|
||||
"""
|
||||
Create SSLCertificate instance from a URL.
|
||||
Initializes the SSLCertificate object.
|
||||
|
||||
Args:
|
||||
url (str): URL of the website.
|
||||
timeout (int): Timeout for the connection (default: 10).
|
||||
|
||||
Returns:
|
||||
Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
|
||||
cert_info (Dict[str, Any]): The raw certificate dictionary.
|
||||
"""
|
||||
try:
|
||||
hostname = urlparse(url).netloc
|
||||
if ":" in hostname:
|
||||
hostname = hostname.split(":")[0]
|
||||
# 1. Decode the data (handle bytes -> str)
|
||||
decoded_info = self._decode_cert_data(cert_info)
|
||||
|
||||
context = ssl.create_default_context()
|
||||
with socket.create_connection((hostname, 443), timeout=timeout) as sock:
|
||||
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
|
||||
cert_binary = ssock.getpeercert(binary_form=True)
|
||||
x509 = OpenSSL.crypto.load_certificate(
|
||||
OpenSSL.crypto.FILETYPE_ASN1, cert_binary
|
||||
)
|
||||
# 2. Store the decoded info internally (optional but good practice)
|
||||
# self._cert_info = decoded_info # You can keep this if methods rely on it
|
||||
|
||||
cert_info = {
|
||||
"subject": dict(x509.get_subject().get_components()),
|
||||
"issuer": dict(x509.get_issuer().get_components()),
|
||||
"version": x509.get_version(),
|
||||
"serial_number": hex(x509.get_serial_number()),
|
||||
"not_before": x509.get_notBefore(),
|
||||
"not_after": x509.get_notAfter(),
|
||||
"fingerprint": x509.digest("sha256").hex(),
|
||||
"signature_algorithm": x509.get_signature_algorithm(),
|
||||
"raw_cert": base64.b64encode(cert_binary),
|
||||
}
|
||||
|
||||
# Add extensions
|
||||
extensions = []
|
||||
for i in range(x509.get_extension_count()):
|
||||
ext = x509.get_extension(i)
|
||||
extensions.append(
|
||||
{"name": ext.get_short_name(), "value": str(ext)}
|
||||
)
|
||||
cert_info["extensions"] = extensions
|
||||
|
||||
return SSLCertificate(cert_info)
|
||||
|
||||
except Exception:
|
||||
return None
|
||||
# 3. Initialize the dictionary part of the object with the decoded data
|
||||
super().__init__(decoded_info)
|
||||
|
||||
@staticmethod
|
||||
def _decode_cert_data(data: Any) -> Any:
|
||||
"""Helper method to decode bytes in certificate data."""
|
||||
if isinstance(data, bytes):
|
||||
return data.decode("utf-8")
|
||||
try:
|
||||
# Try UTF-8 first, fallback to latin-1 for arbitrary bytes
|
||||
return data.decode("utf-8")
|
||||
except UnicodeDecodeError:
|
||||
return data.decode("latin-1") # Or handle as needed, maybe hex representation
|
||||
elif isinstance(data, dict):
|
||||
return {
|
||||
(
|
||||
@@ -97,36 +58,119 @@ class SSLCertificate:
|
||||
return [SSLCertificate._decode_cert_data(item) for item in data]
|
||||
return data
|
||||
|
||||
@staticmethod
|
||||
def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
|
||||
"""
|
||||
Create SSLCertificate instance from a URL. Fetches cert info and initializes.
|
||||
(Fetching logic remains the same)
|
||||
"""
|
||||
cert_info_raw = None # Variable to hold the fetched dict
|
||||
try:
|
||||
hostname = urlparse(url).netloc
|
||||
if ":" in hostname:
|
||||
hostname = hostname.split(":")[0]
|
||||
|
||||
context = ssl.create_default_context()
|
||||
# Set check_hostname to False and verify_mode to CERT_NONE temporarily
|
||||
# for potentially problematic certificates during fetch, but parse the result regardless.
|
||||
# context.check_hostname = False
|
||||
# context.verify_mode = ssl.CERT_NONE
|
||||
|
||||
with socket.create_connection((hostname, 443), timeout=timeout) as sock:
|
||||
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
|
||||
cert_binary = ssock.getpeercert(binary_form=True)
|
||||
if not cert_binary:
|
||||
print(f"Warning: No certificate returned for {hostname}")
|
||||
return None
|
||||
|
||||
x509 = OpenSSL.crypto.load_certificate(
|
||||
OpenSSL.crypto.FILETYPE_ASN1, cert_binary
|
||||
)
|
||||
|
||||
# Create the dictionary directly
|
||||
cert_info_raw = {
|
||||
"subject": dict(x509.get_subject().get_components()),
|
||||
"issuer": dict(x509.get_issuer().get_components()),
|
||||
"version": x509.get_version(),
|
||||
"serial_number": hex(x509.get_serial_number()),
|
||||
"not_before": x509.get_notBefore(), # Keep as bytes initially, _decode handles it
|
||||
"not_after": x509.get_notAfter(), # Keep as bytes initially
|
||||
"fingerprint": x509.digest("sha256").hex(), # hex() is already string
|
||||
"signature_algorithm": x509.get_signature_algorithm(), # Keep as bytes
|
||||
"raw_cert": base64.b64encode(cert_binary), # Base64 is bytes, _decode handles it
|
||||
}
|
||||
|
||||
# Add extensions
|
||||
extensions = []
|
||||
for i in range(x509.get_extension_count()):
|
||||
ext = x509.get_extension(i)
|
||||
# get_short_name() returns bytes, str(ext) handles value conversion
|
||||
extensions.append(
|
||||
{"name": ext.get_short_name(), "value": str(ext)}
|
||||
)
|
||||
cert_info_raw["extensions"] = extensions
|
||||
|
||||
except ssl.SSLCertVerificationError as e:
|
||||
print(f"SSL Verification Error for {url}: {e}")
|
||||
# Decide if you want to proceed or return None based on your needs
|
||||
# You might try fetching without verification here if needed, but be cautious.
|
||||
return None
|
||||
except socket.gaierror:
|
||||
print(f"Could not resolve hostname: {hostname}")
|
||||
return None
|
||||
except socket.timeout:
|
||||
print(f"Connection timed out for {url}")
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f"Error fetching/processing certificate for {url}: {e}")
|
||||
# Log the full error details if needed: logging.exception("Cert fetch error")
|
||||
return None
|
||||
|
||||
# If successful, create the SSLCertificate instance from the dictionary
|
||||
if cert_info_raw:
|
||||
return SSLCertificate(cert_info_raw)
|
||||
else:
|
||||
return None
|
||||
|
||||
|
||||
# --- Properties now access the dictionary items directly via self[] ---
|
||||
@property
|
||||
def issuer(self) -> Dict[str, str]:
|
||||
return self.get("issuer", {}) # Use self.get for safety
|
||||
|
||||
@property
|
||||
def subject(self) -> Dict[str, str]:
|
||||
return self.get("subject", {})
|
||||
|
||||
@property
|
||||
def valid_from(self) -> str:
|
||||
return self.get("not_before", "")
|
||||
|
||||
@property
|
||||
def valid_until(self) -> str:
|
||||
return self.get("not_after", "")
|
||||
|
||||
@property
|
||||
def fingerprint(self) -> str:
|
||||
return self.get("fingerprint", "")
|
||||
|
||||
# --- Export methods can use `self` directly as it is the dict ---
|
||||
def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
|
||||
"""
|
||||
Export certificate as JSON.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the JSON file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[str]: JSON string if successful, None otherwise.
|
||||
"""
|
||||
json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
|
||||
"""Export certificate as JSON."""
|
||||
# `self` is already the dictionary we want to serialize
|
||||
json_str = json.dumps(self, indent=2, ensure_ascii=False)
|
||||
if filepath:
|
||||
Path(filepath).write_text(json_str, encoding="utf-8")
|
||||
return None
|
||||
return json_str
|
||||
|
||||
def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
|
||||
"""
|
||||
Export certificate as PEM.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the PEM file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[str]: PEM string if successful, None otherwise.
|
||||
"""
|
||||
"""Export certificate as PEM."""
|
||||
try:
|
||||
# Decode the raw_cert (which should be string due to _decode)
|
||||
raw_cert_bytes = base64.b64decode(self.get("raw_cert", ""))
|
||||
x509 = OpenSSL.crypto.load_certificate(
|
||||
OpenSSL.crypto.FILETYPE_ASN1,
|
||||
base64.b64decode(self._cert_info["raw_cert"]),
|
||||
OpenSSL.crypto.FILETYPE_ASN1, raw_cert_bytes
|
||||
)
|
||||
pem_data = OpenSSL.crypto.dump_certificate(
|
||||
OpenSSL.crypto.FILETYPE_PEM, x509
|
||||
@@ -136,49 +180,25 @@ class SSLCertificate:
|
||||
Path(filepath).write_text(pem_data, encoding="utf-8")
|
||||
return None
|
||||
return pem_data
|
||||
except Exception:
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f"Error converting to PEM: {e}")
|
||||
return None
|
||||
|
||||
def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
|
||||
"""
|
||||
Export certificate as DER.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the DER file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[bytes]: DER bytes if successful, None otherwise.
|
||||
"""
|
||||
"""Export certificate as DER."""
|
||||
try:
|
||||
der_data = base64.b64decode(self._cert_info["raw_cert"])
|
||||
# Decode the raw_cert (which should be string due to _decode)
|
||||
der_data = base64.b64decode(self.get("raw_cert", ""))
|
||||
if filepath:
|
||||
Path(filepath).write_bytes(der_data)
|
||||
return None
|
||||
return der_data
|
||||
except Exception:
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f"Error converting to DER: {e}")
|
||||
return None
|
||||
|
||||
@property
|
||||
def issuer(self) -> Dict[str, str]:
|
||||
"""Get certificate issuer information."""
|
||||
return self._cert_info.get("issuer", {})
|
||||
|
||||
@property
|
||||
def subject(self) -> Dict[str, str]:
|
||||
"""Get certificate subject information."""
|
||||
return self._cert_info.get("subject", {})
|
||||
|
||||
@property
|
||||
def valid_from(self) -> str:
|
||||
"""Get certificate validity start date."""
|
||||
return self._cert_info.get("not_before", "")
|
||||
|
||||
@property
|
||||
def valid_until(self) -> str:
|
||||
"""Get certificate validity end date."""
|
||||
return self._cert_info.get("not_after", "")
|
||||
|
||||
@property
|
||||
def fingerprint(self) -> str:
|
||||
"""Get certificate fingerprint."""
|
||||
return self._cert_info.get("fingerprint", "")
|
||||
# Optional: Add __repr__ for better debugging
|
||||
def __repr__(self) -> str:
|
||||
subject_cn = self.subject.get('CN', 'N/A')
|
||||
issuer_cn = self.issuer.get('CN', 'N/A')
|
||||
return f"<SSLCertificate Subject='{subject_cn}' Issuer='{issuer_cn}'>"
|
||||
@@ -20,7 +20,6 @@ from urllib.parse import urljoin
|
||||
import requests
|
||||
from requests.exceptions import InvalidSchema
|
||||
import xxhash
|
||||
from colorama import Fore, Style, init
|
||||
import textwrap
|
||||
import cProfile
|
||||
import pstats
|
||||
@@ -136,13 +135,20 @@ def merge_chunks(
|
||||
word_token_ratio: float = 1.0,
|
||||
splitter: Callable = None
|
||||
) -> List[str]:
|
||||
"""Merges documents into chunks of specified token size.
|
||||
"""
|
||||
Merges a sequence of documents into chunks based on a target token count, with optional overlap.
|
||||
|
||||
Each document is split into tokens using the provided splitter function (defaults to str.split). Tokens are distributed into chunks aiming for the specified target size, with optional overlapping tokens between consecutive chunks. Returns a list of non-empty merged chunks as strings.
|
||||
|
||||
Args:
|
||||
docs: Input documents
|
||||
target_size: Desired token count per chunk
|
||||
overlap: Number of tokens to overlap between chunks
|
||||
word_token_ratio: Multiplier for word->token conversion
|
||||
docs: Sequence of input document strings to be merged.
|
||||
target_size: Target number of tokens per chunk.
|
||||
overlap: Number of tokens to overlap between consecutive chunks.
|
||||
word_token_ratio: Multiplier to estimate token count from word count.
|
||||
splitter: Callable used to split each document into tokens.
|
||||
|
||||
Returns:
|
||||
List of merged document chunks as strings, each not exceeding the target token size.
|
||||
"""
|
||||
# Pre-tokenize all docs and store token counts
|
||||
splitter = splitter or str.split
|
||||
@@ -151,7 +157,7 @@ def merge_chunks(
|
||||
total_tokens = 0
|
||||
|
||||
for doc in docs:
|
||||
tokens = doc.split()
|
||||
tokens = splitter(doc)
|
||||
count = int(len(tokens) * word_token_ratio)
|
||||
if count: # Skip empty docs
|
||||
token_counts.append(count)
|
||||
@@ -441,14 +447,13 @@ def create_box_message(
|
||||
str: A formatted string containing the styled message box.
|
||||
"""
|
||||
|
||||
init()
|
||||
|
||||
# Define border and text colors for different types
|
||||
styles = {
|
||||
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, "⚠"),
|
||||
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, "ℹ"),
|
||||
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, "✓"),
|
||||
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
|
||||
"warning": ("yellow", "bright_yellow", "⚠"),
|
||||
"info": ("blue", "bright_blue", "ℹ"),
|
||||
"debug": ("lightblack", "bright_black", "⋯"),
|
||||
"success": ("green", "bright_green", "✓"),
|
||||
"error": ("red", "bright_red", "×"),
|
||||
}
|
||||
|
||||
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
|
||||
@@ -480,12 +485,12 @@ def create_box_message(
|
||||
# Create the box with colored borders and lighter text
|
||||
horizontal_line = h_line * (width - 1)
|
||||
box = [
|
||||
f"{border_color}{tl}{horizontal_line}{tr}",
|
||||
f"[{border_color}]{tl}{horizontal_line}{tr}[/{border_color}]",
|
||||
*[
|
||||
f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}"
|
||||
f"[{border_color}]{v_line}[{text_color}] {line:<{width-2}}[/{text_color}][{border_color}]{v_line}[/{border_color}]"
|
||||
for line in formatted_lines
|
||||
],
|
||||
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}",
|
||||
f"[{border_color}]{bl}{horizontal_line}{br}[/{border_color}]",
|
||||
]
|
||||
|
||||
result = "\n".join(box)
|
||||
@@ -1111,6 +1116,23 @@ def get_content_of_website_optimized(
|
||||
css_selector: str = None,
|
||||
**kwargs,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Extracts and cleans content from website HTML, optimizing for useful media and contextual information.
|
||||
|
||||
Parses the provided HTML to extract internal and external links, filters and scores images for usefulness, gathers contextual descriptions for media, removes unwanted or low-value elements, and converts the cleaned HTML to Markdown. Also extracts metadata and returns all structured content in a dictionary.
|
||||
|
||||
Args:
|
||||
url: The URL of the website being processed.
|
||||
html: The raw HTML content to extract from.
|
||||
word_count_threshold: Minimum word count for elements to be retained.
|
||||
css_selector: Optional CSS selector to restrict extraction to specific elements.
|
||||
|
||||
Returns:
|
||||
A dictionary containing Markdown content, cleaned HTML, extraction success status, media and link lists, and metadata.
|
||||
|
||||
Raises:
|
||||
InvalidCSSSelectorError: If a provided CSS selector does not match any elements.
|
||||
"""
|
||||
if not html:
|
||||
return None
|
||||
|
||||
@@ -1153,6 +1175,20 @@ def get_content_of_website_optimized(
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
# Check if an image has valid display and inside undesired html elements
|
||||
"""
|
||||
Processes an HTML image element to determine its relevance and extract metadata.
|
||||
|
||||
Evaluates an image's visibility, context, and usefulness based on its attributes and parent elements. If the image passes validation and exceeds a usefulness score threshold, returns a dictionary with its source, alt text, contextual description, score, and type. Otherwise, returns None.
|
||||
|
||||
Args:
|
||||
img: The BeautifulSoup image tag to process.
|
||||
url: The base URL of the page containing the image.
|
||||
index: The index of the image in the list of images on the page.
|
||||
total_images: The total number of images on the page.
|
||||
|
||||
Returns:
|
||||
A dictionary with image metadata if the image is considered useful, or None otherwise.
|
||||
"""
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get("style", "")
|
||||
src = img.get("src", "")
|
||||
@@ -1174,6 +1210,20 @@ def get_content_of_website_optimized(
|
||||
# Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
# Function to parse image height/width value and units
|
||||
"""
|
||||
Scores an HTML image element for usefulness based on size, format, attributes, and position.
|
||||
|
||||
The function evaluates an image's dimensions, file format, alt text, and its position among all images on the page to assign a usefulness score. Higher scores indicate images that are likely more relevant or informative for content extraction or summarization.
|
||||
|
||||
Args:
|
||||
img: The HTML image element to score.
|
||||
base_url: The base URL used to resolve relative image sources.
|
||||
index: The position of the image in the list of images on the page (zero-based).
|
||||
images_count: The total number of images on the page.
|
||||
|
||||
Returns:
|
||||
An integer usefulness score for the image.
|
||||
"""
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
match = re.match(r"(\d+)(\D*)", dimension)
|
||||
@@ -1188,6 +1238,16 @@ def get_content_of_website_optimized(
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
# If src is relative path construct full URL, if not it may be CDN URL
|
||||
"""
|
||||
Fetches the file size of an image by sending a HEAD request to its URL.
|
||||
|
||||
Args:
|
||||
img: A BeautifulSoup tag representing the image element.
|
||||
base_url: The base URL to resolve relative image sources.
|
||||
|
||||
Returns:
|
||||
The value of the "Content-Length" header as a string if available, otherwise None.
|
||||
"""
|
||||
img_url = urljoin(base_url, img.get("src"))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
@@ -1198,8 +1258,6 @@ def get_content_of_website_optimized(
|
||||
return None
|
||||
except InvalidSchema:
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
|
||||
image_height = img.get("height")
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
@@ -2003,6 +2061,10 @@ def normalize_url(href, base_url):
|
||||
if not parsed_base.scheme or not parsed_base.netloc:
|
||||
raise ValueError(f"Invalid base URL format: {base_url}")
|
||||
|
||||
# Ensure base_url ends with a trailing slash if it's a directory path
|
||||
if not base_url.endswith('/'):
|
||||
base_url = base_url + '/'
|
||||
|
||||
# Use urljoin to handle all cases
|
||||
normalized = urljoin(base_url, href.strip())
|
||||
return normalized
|
||||
@@ -2047,7 +2109,7 @@ def normalize_url_for_deep_crawl(href, base_url):
|
||||
normalized = urlunparse((
|
||||
parsed.scheme,
|
||||
netloc,
|
||||
parsed.path.rstrip('/') or '/', # Normalize trailing slash
|
||||
parsed.path.rstrip('/'), # Normalize trailing slash
|
||||
parsed.params,
|
||||
query,
|
||||
fragment
|
||||
@@ -2075,7 +2137,7 @@ def efficient_normalize_url_for_deep_crawl(href, base_url):
|
||||
normalized = urlunparse((
|
||||
parsed.scheme,
|
||||
parsed.netloc.lower(),
|
||||
parsed.path,
|
||||
parsed.path.rstrip('/'),
|
||||
parsed.params,
|
||||
parsed.query,
|
||||
'' # Remove fragment
|
||||
@@ -2733,33 +2795,67 @@ def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_thre
|
||||
# Also truncate tail text if present
|
||||
if element.tail and len(element.tail.strip()) > text_threshold:
|
||||
element.tail = element.tail.strip()[:text_threshold] + '...'
|
||||
|
||||
# 4. Find repeated patterns and keep only a few examples
|
||||
# This is a simplistic approach - more sophisticated pattern detection could be implemented
|
||||
pattern_elements = {}
|
||||
for element in tree.xpath('//*[contains(@class, "")]'):
|
||||
parent = element.getparent()
|
||||
|
||||
# 4. Detect duplicates and drop them in a single pass
|
||||
seen: dict[tuple, None] = {}
|
||||
for el in list(tree.xpath('//*[@class]')): # snapshot once, XPath is fast
|
||||
parent = el.getparent()
|
||||
if parent is None:
|
||||
continue
|
||||
|
||||
# Create a signature based on tag and classes
|
||||
classes = element.get('class', '')
|
||||
if not classes:
|
||||
|
||||
cls = el.get('class')
|
||||
if not cls:
|
||||
continue
|
||||
signature = f"{element.tag}.{classes}"
|
||||
|
||||
if signature in pattern_elements:
|
||||
pattern_elements[signature].append(element)
|
||||
|
||||
# ── build signature ───────────────────────────────────────────
|
||||
h = xxhash.xxh64() # stream, no big join()
|
||||
for txt in el.itertext():
|
||||
h.update(txt)
|
||||
sig = (el.tag, cls, h.intdigest()) # tuple cheaper & hashable
|
||||
|
||||
# ── first seen? keep – else drop ─────────────
|
||||
if sig in seen and parent is not None:
|
||||
parent.remove(el) # duplicate
|
||||
else:
|
||||
pattern_elements[signature] = [element]
|
||||
seen[sig] = None
|
||||
|
||||
# Keep only 3 examples of each repeating pattern
|
||||
for signature, elements in pattern_elements.items():
|
||||
if len(elements) > 3:
|
||||
# Keep the first 2 and last elements
|
||||
for element in elements[2:-1]:
|
||||
if element.getparent() is not None:
|
||||
element.getparent().remove(element)
|
||||
# # 4. Find repeated patterns and keep only a few examples
|
||||
# # This is a simplistic approach - more sophisticated pattern detection could be implemented
|
||||
# pattern_elements = {}
|
||||
# for element in tree.xpath('//*[contains(@class, "")]'):
|
||||
# parent = element.getparent()
|
||||
# if parent is None:
|
||||
# continue
|
||||
|
||||
# # Create a signature based on tag and classes
|
||||
# classes = element.get('class', '')
|
||||
# if not classes:
|
||||
# continue
|
||||
# innert_text = ''.join(element.xpath('.//text()'))
|
||||
# innert_text_hash = xxhash.xxh64(innert_text.encode()).hexdigest()
|
||||
# signature = f"{element.tag}.{classes}.{innert_text_hash}"
|
||||
|
||||
# if signature in pattern_elements:
|
||||
# pattern_elements[signature].append(element)
|
||||
# else:
|
||||
# pattern_elements[signature] = [element]
|
||||
|
||||
# # Keep only first examples of each repeating pattern
|
||||
# for signature, elements in pattern_elements.items():
|
||||
# if len(elements) > 1:
|
||||
# # Keep the first element and remove the rest
|
||||
# for element in elements[1:]:
|
||||
# if element.getparent() is not None:
|
||||
# element.getparent().remove(element)
|
||||
|
||||
|
||||
# # Keep only 3 examples of each repeating pattern
|
||||
# for signature, elements in pattern_elements.items():
|
||||
# if len(elements) > 3:
|
||||
# # Keep the first 2 and last elements
|
||||
# for element in elements[2:-1]:
|
||||
# if element.getparent() is not None:
|
||||
# element.getparent().remove(element)
|
||||
|
||||
# 5. Convert back to string
|
||||
result = etree.tostring(tree, encoding='unicode', method='html')
|
||||
@@ -2774,4 +2870,3 @@ def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_thre
|
||||
# Fallback for parsing errors
|
||||
return html_content[:max_size] if len(html_content) > max_size else html_content
|
||||
|
||||
|
||||
|
||||
@@ -1,644 +0,0 @@
|
||||
# Crawl4AI Docker Guide 🐳
|
||||
|
||||
## Table of Contents
|
||||
- [Prerequisites](#prerequisites)
|
||||
- [Installation](#installation)
|
||||
- [Option 1: Using Docker Compose (Recommended)](#option-1-using-docker-compose-recommended)
|
||||
- [Option 2: Manual Local Build & Run](#option-2-manual-local-build--run)
|
||||
- [Option 3: Using Pre-built Docker Hub Images](#option-3-using-pre-built-docker-hub-images)
|
||||
- [Dockerfile Parameters](#dockerfile-parameters)
|
||||
- [Using the API](#using-the-api)
|
||||
- [Understanding Request Schema](#understanding-request-schema)
|
||||
- [REST API Examples](#rest-api-examples)
|
||||
- [Python SDK](#python-sdk)
|
||||
- [Metrics & Monitoring](#metrics--monitoring)
|
||||
- [Deployment Scenarios](#deployment-scenarios)
|
||||
- [Complete Examples](#complete-examples)
|
||||
- [Server Configuration](#server-configuration)
|
||||
- [Understanding config.yml](#understanding-configyml)
|
||||
- [JWT Authentication](#jwt-authentication)
|
||||
- [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
|
||||
- [Customizing Your Configuration](#customizing-your-configuration)
|
||||
- [Configuration Recommendations](#configuration-recommendations)
|
||||
- [Getting Help](#getting-help)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before we dive in, make sure you have:
|
||||
- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop).
|
||||
- `git` for cloning the repository.
|
||||
- At least 4GB of RAM available for the container (more recommended for heavy use).
|
||||
- Python 3.10+ (if using the Python SDK).
|
||||
- Node.js 16+ (if using the Node.js examples).
|
||||
|
||||
> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
|
||||
|
||||
## Installation
|
||||
|
||||
We offer several ways to get the Crawl4AI server running. Docker Compose is the easiest way to manage local builds and runs.
|
||||
|
||||
### Option 1: Using Docker Compose (Recommended)
|
||||
|
||||
Docker Compose simplifies building and running the service, especially for local development and testing across different platforms.
|
||||
|
||||
#### 1. Clone Repository
|
||||
|
||||
```bash
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
```
|
||||
|
||||
#### 2. Environment Setup (API Keys)
|
||||
|
||||
If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**.
|
||||
|
||||
```bash
|
||||
# Make sure you are in the 'crawl4ai' root directory
|
||||
cp deploy/docker/.llm.env.example .llm.env
|
||||
|
||||
# Now edit .llm.env and add your API keys
|
||||
# Example content:
|
||||
# OPENAI_API_KEY=sk-your-key
|
||||
# ANTHROPIC_API_KEY=your-anthropic-key
|
||||
# ...
|
||||
```
|
||||
> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.
|
||||
|
||||
#### 3. Build and Run with Compose
|
||||
|
||||
The `docker-compose.yml` file in the project root defines services for different scenarios using **profiles**.
|
||||
|
||||
* **Build and Run Locally (AMD64):**
|
||||
```bash
|
||||
# Builds the image locally using Dockerfile and runs it
|
||||
docker compose --profile local-amd64 up --build -d
|
||||
```
|
||||
|
||||
* **Build and Run Locally (ARM64):**
|
||||
```bash
|
||||
# Builds the image locally using Dockerfile and runs it
|
||||
docker compose --profile local-arm64 up --build -d
|
||||
```
|
||||
|
||||
* **Run Pre-built Image from Docker Hub (AMD64):**
|
||||
```bash
|
||||
# Pulls and runs the specified AMD64 image from Docker Hub
|
||||
# (Set VERSION env var for specific tags, e.g., VERSION=0.5.1-d1)
|
||||
docker compose --profile hub-amd64 up -d
|
||||
```
|
||||
|
||||
* **Run Pre-built Image from Docker Hub (ARM64):**
|
||||
```bash
|
||||
# Pulls and runs the specified ARM64 image from Docker Hub
|
||||
docker compose --profile hub-arm64 up -d
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
|
||||
|
||||
#### 4. Stopping Compose Services
|
||||
|
||||
```bash
|
||||
# Stop the service(s) associated with a profile (e.g., local-amd64)
|
||||
docker compose --profile local-amd64 down
|
||||
```
|
||||
|
||||
### Option 2: Manual Local Build & Run
|
||||
|
||||
If you prefer not to use Docker Compose for local builds.
|
||||
|
||||
#### 1. Clone Repository & Setup Environment
|
||||
|
||||
Follow steps 1 and 2 from the Docker Compose section above (clone repo, `cd crawl4ai`, create `.llm.env` in the root).
|
||||
|
||||
#### 2. Build the Image (Multi-Arch)
|
||||
|
||||
Use `docker buildx` to build the image. This example builds for multiple platforms and loads the image matching your host architecture into the local Docker daemon.
|
||||
|
||||
```bash
|
||||
# Make sure you are in the 'crawl4ai' root directory
|
||||
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load .
|
||||
```
|
||||
|
||||
#### 3. Run the Container
|
||||
|
||||
* **Basic run (no LLM support):**
|
||||
```bash
|
||||
# Replace --platform if your host is ARM64
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-standalone \
|
||||
--shm-size=1g \
|
||||
--platform linux/amd64 \
|
||||
crawl4ai-local:latest
|
||||
```
|
||||
|
||||
* **With LLM support:**
|
||||
```bash
|
||||
# Make sure .llm.env is in the current directory (project root)
|
||||
# Replace --platform if your host is ARM64
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-standalone \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
--platform linux/amd64 \
|
||||
crawl4ai-local:latest
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
|
||||
|
||||
#### 4. Stopping the Manual Container
|
||||
|
||||
```bash
|
||||
docker stop crawl4ai-standalone && docker rm crawl4ai-standalone
|
||||
```
|
||||
|
||||
### Option 3: Using Pre-built Docker Hub Images
|
||||
|
||||
Pull and run images directly from Docker Hub without building locally.
|
||||
|
||||
#### 1. Pull the Image
|
||||
|
||||
We use a versioning scheme like `LIBRARY_VERSION-dREVISION` (e.g., `0.5.1-d1`). The `latest` tag points to the most recent stable release. Images are built with multi-arch manifests, so Docker usually pulls the correct version for your system automatically.
|
||||
|
||||
```bash
|
||||
# Pull a specific version (recommended for stability)
|
||||
docker pull unclecode/crawl4ai:0.5.1-d1
|
||||
|
||||
# Or pull the latest stable version
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
#### 2. Setup Environment (API Keys)
|
||||
|
||||
If using LLMs, create the `.llm.env` file in a directory of your choice, similar to Step 2 in the Compose section.
|
||||
|
||||
#### 3. Run the Container
|
||||
|
||||
* **Basic run:**
|
||||
```bash
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-hub \
|
||||
--shm-size=1g \
|
||||
unclecode/crawl4ai:0.5.1-d1 # Or use :latest
|
||||
```
|
||||
|
||||
* **With LLM support:**
|
||||
```bash
|
||||
# Make sure .llm.env is in the current directory you are running docker from
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-hub \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
unclecode/crawl4ai:0.5.1-d1 # Or use :latest
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
|
||||
|
||||
#### 4. Stopping the Hub Container
|
||||
|
||||
```bash
|
||||
docker stop crawl4ai-hub && docker rm crawl4ai-hub
|
||||
```
|
||||
|
||||
#### Docker Hub Versioning Explained
|
||||
|
||||
* **Image Name:** `unclecode/crawl4ai`
|
||||
* **Tag Format:** `LIBRARY_VERSION-dREVISION`
|
||||
* `LIBRARY_VERSION`: The Semantic Version of the core `crawl4ai` Python library included (e.g., `0.5.1`).
|
||||
* `dREVISION`: An incrementing number (starting at `d1`) for Docker build changes made *without* changing the library version (e.g., base image updates, dependency fixes). Resets to `d1` for each new `LIBRARY_VERSION`.
|
||||
* **Example:** `unclecode/crawl4ai:0.5.1-d1`
|
||||
* **`latest` Tag:** Points to the most recent stable `LIBRARY_VERSION-dREVISION`.
|
||||
* **Multi-Arch:** Images support `linux/amd64` and `linux/arm64`. Docker automatically selects the correct architecture.
|
||||
|
||||
---
|
||||
|
||||
*(Rest of the document remains largely the same, but with key updates below)*
|
||||
|
||||
---
|
||||
|
||||
## Dockerfile Parameters
|
||||
|
||||
You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file.
|
||||
|
||||
```bash
|
||||
# Example: Build with 'all' features using buildx
|
||||
docker buildx build \
|
||||
--platform linux/amd64,linux/arm64 \
|
||||
--build-arg INSTALL_TYPE=all \
|
||||
-t yourname/crawl4ai-all:latest \
|
||||
--load \
|
||||
. # Build from root context
|
||||
```
|
||||
|
||||
### Build Arguments Explained
|
||||
|
||||
| Argument | Description | Default | Options |
|
||||
| :----------- | :--------------------------------------- | :-------- | :--------------------------------- |
|
||||
| INSTALL_TYPE | Feature set | `default` | `default`, `all`, `torch`, `transformer` |
|
||||
| ENABLE_GPU | GPU support (CUDA for AMD64) | `false` | `true`, `false` |
|
||||
| APP_HOME | Install path inside container (advanced) | `/app` | any valid path |
|
||||
| USE_LOCAL | Install library from local source | `true` | `true`, `false` |
|
||||
| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | *(see Dockerfile)* | any git URL |
|
||||
| GITHUB_BRANCH| Git branch to clone if USE_LOCAL=false | `main` | any branch name |
|
||||
|
||||
*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile)*
|
||||
|
||||
### Build Best Practices
|
||||
|
||||
1. **Choose the Right Install Type**
|
||||
* `default`: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
|
||||
* `all`: Full features including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras.
|
||||
2. **Platform Considerations**
|
||||
* Use `buildx` for building multi-architecture images, especially for pushing to registries.
|
||||
* Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds.
|
||||
3. **Performance Optimization**
|
||||
* The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).
|
||||
|
||||
---
|
||||
|
||||
## Using the API
|
||||
|
||||
Communicate with the running Docker server via its REST API (defaulting to `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests.
|
||||
|
||||
### Python SDK
|
||||
|
||||
Install the SDK: `pip install crawl4ai`
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai.docker_client import Crawl4aiDockerClient
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed
|
||||
|
||||
async def main():
|
||||
# Point to the correct server port
|
||||
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
|
||||
# If JWT is enabled on the server, authenticate first:
|
||||
# await client.authenticate("user@example.com") # See Server Configuration section
|
||||
|
||||
# Example Non-streaming crawl
|
||||
print("--- Running Non-Streaming Crawl ---")
|
||||
results = await client.crawl(
|
||||
["https://httpbin.org/html"],
|
||||
browser_config=BrowserConfig(headless=True), # Use library classes for config aid
|
||||
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
)
|
||||
if results: # client.crawl returns None on failure
|
||||
print(f"Non-streaming results success: {results.success}")
|
||||
if results.success:
|
||||
for result in results: # Iterate through the CrawlResultContainer
|
||||
print(f"URL: {result.url}, Success: {result.success}")
|
||||
else:
|
||||
print("Non-streaming crawl failed.")
|
||||
|
||||
|
||||
# Example Streaming crawl
|
||||
print("\n--- Running Streaming Crawl ---")
|
||||
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
|
||||
try:
|
||||
async for result in await client.crawl( # client.crawl returns an async generator for streaming
|
||||
["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
|
||||
browser_config=BrowserConfig(headless=True),
|
||||
crawler_config=stream_config
|
||||
):
|
||||
print(f"Streamed result: URL: {result.url}, Success: {result.success}")
|
||||
except Exception as e:
|
||||
print(f"Streaming crawl failed: {e}")
|
||||
|
||||
|
||||
# Example Get schema
|
||||
print("\n--- Getting Schema ---")
|
||||
schema = await client.get_schema()
|
||||
print(f"Schema received: {bool(schema)}") # Print whether schema was received
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
*(SDK parameters like timeout, verify_ssl etc. remain the same)*
|
||||
|
||||
### Second Approach: Direct API Calls
|
||||
|
||||
Crucially, when sending configurations directly via JSON, they **must** follow the `{"type": "ClassName", "params": {...}}` structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as `{"type": "dict", "value": {...}}`.
|
||||
|
||||
*(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)*
|
||||
|
||||
#### More Examples *(Ensure Schema example uses type/value wrapper)*
|
||||
|
||||
**Advanced Crawler Configuration**
|
||||
*(Keep example, ensure cache_mode uses valid enum value like "bypass")*
|
||||
|
||||
**Extraction Strategy**
|
||||
```json
|
||||
{
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {
|
||||
"schema": {
|
||||
"type": "dict",
|
||||
"value": {
|
||||
"baseSelector": "article.post",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1", "type": "text"},
|
||||
{"name": "content", "selector": ".content", "type": "html"}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**LLM Extraction Strategy** *(Keep example, ensure schema uses type/value wrapper)*
|
||||
*(Keep Deep Crawler Example)*
|
||||
|
||||
### REST API Examples
|
||||
|
||||
Update URLs to use port `11235`.
|
||||
|
||||
#### Simple Crawl
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
# Configuration objects converted to the required JSON structure
|
||||
browser_config_payload = {
|
||||
"type": "BrowserConfig",
|
||||
"params": {"headless": True}
|
||||
}
|
||||
crawler_config_payload = {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum
|
||||
}
|
||||
|
||||
crawl_payload = {
|
||||
"urls": ["https://httpbin.org/html"],
|
||||
"browser_config": browser_config_payload,
|
||||
"crawler_config": crawler_config_payload
|
||||
}
|
||||
response = requests.post(
|
||||
"http://localhost:11235/crawl", # Updated port
|
||||
# headers={"Authorization": f"Bearer {token}"}, # If JWT is enabled
|
||||
json=crawl_payload
|
||||
)
|
||||
print(f"Status Code: {response.status_code}")
|
||||
if response.ok:
|
||||
print(response.json())
|
||||
else:
|
||||
print(f"Error: {response.text}")
|
||||
|
||||
```
|
||||
|
||||
#### Streaming Results
|
||||
|
||||
```python
|
||||
import json
|
||||
import httpx # Use httpx for async streaming example
|
||||
|
||||
async def test_stream_crawl(token: str = None): # Made token optional
|
||||
"""Test the /crawl/stream endpoint with multiple URLs."""
|
||||
url = "http://localhost:11235/crawl/stream" # Updated port
|
||||
payload = {
|
||||
"urls": [
|
||||
"https://httpbin.org/html",
|
||||
"https://httpbin.org/links/5/0",
|
||||
],
|
||||
"browser_config": {
|
||||
"type": "BrowserConfig",
|
||||
"params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict
|
||||
},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"stream": True, "cache_mode": "bypass"}
|
||||
}
|
||||
}
|
||||
|
||||
headers = {}
|
||||
# if token:
|
||||
# headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled
|
||||
|
||||
try:
|
||||
async with httpx.AsyncClient() as client:
|
||||
async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
|
||||
print(f"Status: {response.status_code} (Expected: 200)")
|
||||
response.raise_for_status() # Raise exception for bad status codes
|
||||
|
||||
# Read streaming response line-by-line (NDJSON)
|
||||
async for line in response.aiter_lines():
|
||||
if line:
|
||||
try:
|
||||
data = json.loads(line)
|
||||
# Check for completion marker
|
||||
if data.get("status") == "completed":
|
||||
print("Stream completed.")
|
||||
break
|
||||
print(f"Streamed Result: {json.dumps(data, indent=2)}")
|
||||
except json.JSONDecodeError:
|
||||
print(f"Warning: Could not decode JSON line: {line}")
|
||||
|
||||
except httpx.HTTPStatusError as e:
|
||||
print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
|
||||
except Exception as e:
|
||||
print(f"Error in streaming crawl test: {str(e)}")
|
||||
|
||||
# To run this example:
|
||||
# import asyncio
|
||||
# asyncio.run(test_stream_crawl())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Metrics & Monitoring
|
||||
|
||||
Keep an eye on your crawler with these endpoints:
|
||||
|
||||
- `/health` - Quick health check
|
||||
- `/metrics` - Detailed Prometheus metrics
|
||||
- `/schema` - Full API schema
|
||||
|
||||
Example health check:
|
||||
```bash
|
||||
curl http://localhost:11235/health
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)*
|
||||
|
||||
---
|
||||
|
||||
## Server Configuration
|
||||
|
||||
The server's behavior can be customized through the `config.yml` file.
|
||||
|
||||
### Understanding config.yml
|
||||
|
||||
The configuration file is loaded from `/app/config.yml` inside the container. By default, the file from `deploy/docker/config.yml` in the repository is copied there during the build.
|
||||
|
||||
Here's a detailed breakdown of the configuration options (using defaults from `deploy/docker/config.yml`):
|
||||
|
||||
```yaml
|
||||
# Application Configuration
|
||||
app:
|
||||
title: "Crawl4AI API"
|
||||
version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1"
|
||||
host: "0.0.0.0"
|
||||
port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
|
||||
reload: False # Default set to False - suitable for production
|
||||
timeout_keep_alive: 300
|
||||
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini"
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
|
||||
# Redis Configuration (Used by internal Redis server managed by supervisord)
|
||||
redis:
|
||||
host: "localhost"
|
||||
port: 6379
|
||||
db: 0
|
||||
password: ""
|
||||
# ... other redis options ...
|
||||
|
||||
# Rate Limiting Configuration
|
||||
rate_limiting:
|
||||
enabled: True
|
||||
default_limit: "1000/minute"
|
||||
trusted_proxies: []
|
||||
storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits
|
||||
|
||||
# Security Configuration
|
||||
security:
|
||||
enabled: false # Master toggle for security features
|
||||
jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
|
||||
https_redirect: false # Force HTTPS (requires security.enabled=true)
|
||||
trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
|
||||
headers: # Security headers (applied if security.enabled=true)
|
||||
x_content_type_options: "nosniff"
|
||||
x_frame_options: "DENY"
|
||||
content_security_policy: "default-src 'self'"
|
||||
strict_transport_security: "max-age=63072000; includeSubDomains"
|
||||
|
||||
# Crawler Configuration
|
||||
crawler:
|
||||
memory_threshold_percent: 95.0
|
||||
rate_limiter:
|
||||
base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher
|
||||
timeouts:
|
||||
stream_init: 30.0 # Timeout for stream initialization
|
||||
batch_process: 300.0 # Timeout for non-streaming /crawl processing
|
||||
|
||||
# Logging Configuration
|
||||
logging:
|
||||
level: "INFO"
|
||||
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
||||
|
||||
# Observability Configuration
|
||||
observability:
|
||||
prometheus:
|
||||
enabled: True
|
||||
endpoint: "/metrics"
|
||||
health_check:
|
||||
endpoint: "/health"
|
||||
```
|
||||
|
||||
*(JWT Authentication section remains the same, just note the default port is now 11235 for requests)*
|
||||
|
||||
*(Configuration Tips and Best Practices remain the same)*
|
||||
|
||||
### Customizing Your Configuration
|
||||
|
||||
You can override the default `config.yml`.
|
||||
|
||||
#### Method 1: Modify Before Build
|
||||
|
||||
1. Edit the `deploy/docker/config.yml` file in your local repository clone.
|
||||
2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image.
|
||||
|
||||
#### Method 2: Runtime Mount (Recommended for Custom Deploys)
|
||||
|
||||
1. Create your custom configuration file, e.g., `my-custom-config.yml` locally. Ensure it contains all necessary sections.
|
||||
2. Mount it when running the container:
|
||||
|
||||
* **Using `docker run`:**
|
||||
```bash
|
||||
# Assumes my-custom-config.yml is in the current directory
|
||||
docker run -d -p 11235:11235 \
|
||||
--name crawl4ai-custom-config \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
-v $(pwd)/my-custom-config.yml:/app/config.yml \
|
||||
unclecode/crawl4ai:latest # Or your specific tag
|
||||
```
|
||||
|
||||
* **Using `docker-compose.yml`:** Add a `volumes` section to the service definition:
|
||||
```yaml
|
||||
services:
|
||||
crawl4ai-hub-amd64: # Or your chosen service
|
||||
image: unclecode/crawl4ai:latest
|
||||
profiles: ["hub-amd64"]
|
||||
<<: *base-config
|
||||
volumes:
|
||||
# Mount local custom config over the default one in the container
|
||||
- ./my-custom-config.yml:/app/config.yml
|
||||
# Keep the shared memory volume from base-config
|
||||
- /dev/shm:/dev/shm
|
||||
```
|
||||
*(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`)*
|
||||
|
||||
> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
|
||||
|
||||
### Configuration Recommendations
|
||||
|
||||
1. **Security First** 🔒
|
||||
- Always enable security in production
|
||||
- Use specific trusted_hosts instead of wildcards
|
||||
- Set up proper rate limiting to protect your server
|
||||
- Consider your environment before enabling HTTPS redirect
|
||||
|
||||
2. **Resource Management** 💻
|
||||
- Adjust memory_threshold_percent based on available RAM
|
||||
- Set timeouts according to your content size and network conditions
|
||||
- Use Redis for rate limiting in multi-container setups
|
||||
|
||||
3. **Monitoring** 📊
|
||||
- Enable Prometheus if you need metrics
|
||||
- Set DEBUG logging in development, INFO in production
|
||||
- Regular health check monitoring is crucial
|
||||
|
||||
4. **Performance Tuning** ⚡
|
||||
- Start with conservative rate limiter delays
|
||||
- Increase batch_process timeout for large content
|
||||
- Adjust stream_init timeout based on initial response times
|
||||
|
||||
## Getting Help
|
||||
|
||||
We're here to help you succeed with Crawl4AI! Here's how to get support:
|
||||
|
||||
- 📖 Check our [full documentation](https://docs.crawl4ai.com)
|
||||
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
|
||||
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
|
||||
- ⭐ Star us on GitHub to show support!
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
|
||||
- Building and running the Docker container
|
||||
- Configuring the environment
|
||||
- Making API requests with proper typing
|
||||
- Using the Python SDK
|
||||
- Monitoring your deployment
|
||||
|
||||
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
|
||||
|
||||
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
|
||||
|
||||
Happy crawling! 🕷️
|
||||
File diff suppressed because it is too large
Load Diff
@@ -1,8 +1,10 @@
|
||||
import os
|
||||
import json
|
||||
import asyncio
|
||||
from typing import List, Tuple
|
||||
from typing import List, Tuple, Dict
|
||||
from functools import partial
|
||||
from uuid import uuid4
|
||||
from datetime import datetime
|
||||
|
||||
import logging
|
||||
from typing import Optional, AsyncGenerator
|
||||
@@ -40,8 +42,19 @@ from utils import (
|
||||
decode_redis_hash
|
||||
)
|
||||
|
||||
import psutil, time
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# --- Helper to get memory ---
|
||||
def _get_memory_mb():
|
||||
try:
|
||||
return psutil.Process().memory_info().rss / (1024 * 1024)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not get memory info: {e}")
|
||||
return None
|
||||
|
||||
|
||||
async def handle_llm_qa(
|
||||
url: str,
|
||||
query: str,
|
||||
@@ -49,6 +62,8 @@ async def handle_llm_qa(
|
||||
) -> str:
|
||||
"""Process QA using LLM with crawled content as context."""
|
||||
try:
|
||||
if not url.startswith(('http://', 'https://')):
|
||||
url = 'https://' + url
|
||||
# Extract base URL by finding last '?q=' occurrence
|
||||
last_q_index = url.rfind('?q=')
|
||||
if last_q_index != -1:
|
||||
@@ -62,7 +77,7 @@ async def handle_llm_qa(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=result.error_message
|
||||
)
|
||||
content = result.markdown.fit_markdown
|
||||
content = result.markdown.fit_markdown or result.markdown.raw_markdown
|
||||
|
||||
# Create prompt and get LLM response
|
||||
prompt = f"""Use the following content as context to answer the question.
|
||||
@@ -259,7 +274,9 @@ async def handle_llm_request(
|
||||
async def handle_task_status(
|
||||
redis: aioredis.Redis,
|
||||
task_id: str,
|
||||
base_url: str
|
||||
base_url: str,
|
||||
*,
|
||||
keep: bool = False
|
||||
) -> JSONResponse:
|
||||
"""Handle task status check requests."""
|
||||
task = await redis.hgetall(f"task:{task_id}")
|
||||
@@ -273,7 +290,7 @@ async def handle_task_status(
|
||||
response = create_task_response(task, task_id, base_url)
|
||||
|
||||
if task["status"] in [TaskStatus.COMPLETED, TaskStatus.FAILED]:
|
||||
if should_cleanup_task(task["created_at"]):
|
||||
if not keep and should_cleanup_task(task["created_at"]):
|
||||
await redis.delete(f"task:{task_id}")
|
||||
|
||||
return JSONResponse(response)
|
||||
@@ -351,7 +368,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
|
||||
try:
|
||||
async for result in results_gen:
|
||||
try:
|
||||
server_memory_mb = _get_memory_mb()
|
||||
result_dict = result.model_dump()
|
||||
result_dict['server_memory_mb'] = server_memory_mb
|
||||
logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
|
||||
data = json.dumps(result_dict, default=datetime_handler) + "\n"
|
||||
yield data.encode('utf-8')
|
||||
@@ -365,10 +384,11 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
|
||||
except asyncio.CancelledError:
|
||||
logger.warning("Client disconnected during streaming")
|
||||
finally:
|
||||
try:
|
||||
await crawler.close()
|
||||
except Exception as e:
|
||||
logger.error(f"Crawler cleanup error: {e}")
|
||||
# try:
|
||||
# await crawler.close()
|
||||
# except Exception as e:
|
||||
# logger.error(f"Crawler cleanup error: {e}")
|
||||
pass
|
||||
|
||||
async def handle_crawl_request(
|
||||
urls: List[str],
|
||||
@@ -377,7 +397,13 @@ async def handle_crawl_request(
|
||||
config: dict
|
||||
) -> dict:
|
||||
"""Handle non-streaming crawl requests."""
|
||||
start_mem_mb = _get_memory_mb() # <--- Get memory before
|
||||
start_time = time.time()
|
||||
mem_delta_mb = None
|
||||
peak_mem_mb = start_mem_mb
|
||||
|
||||
try:
|
||||
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) else url for url in urls]
|
||||
browser_config = BrowserConfig.load(browser_config)
|
||||
crawler_config = CrawlerRunConfig.load(crawler_config)
|
||||
|
||||
@@ -385,11 +411,21 @@ async def handle_crawl_request(
|
||||
memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
|
||||
rate_limiter=RateLimiter(
|
||||
base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
|
||||
)
|
||||
) if config["crawler"]["rate_limiter"]["enabled"] else None
|
||||
)
|
||||
|
||||
from crawler_pool import get_crawler
|
||||
crawler = await get_crawler(browser_config)
|
||||
|
||||
# crawler: AsyncWebCrawler = AsyncWebCrawler(config=browser_config)
|
||||
# await crawler.start()
|
||||
|
||||
base_config = config["crawler"]["base_config"]
|
||||
# Iterate on key-value pairs in global_config then use haseattr to set them
|
||||
for key, value in base_config.items():
|
||||
if hasattr(crawler_config, key):
|
||||
setattr(crawler_config, key, value)
|
||||
|
||||
crawler: AsyncWebCrawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
results = []
|
||||
func = getattr(crawler, "arun" if len(urls) == 1 else "arun_many")
|
||||
partial_func = partial(func,
|
||||
@@ -397,19 +433,46 @@ async def handle_crawl_request(
|
||||
config=crawler_config,
|
||||
dispatcher=dispatcher)
|
||||
results = await partial_func()
|
||||
await crawler.close()
|
||||
|
||||
# await crawler.close()
|
||||
|
||||
end_mem_mb = _get_memory_mb() # <--- Get memory after
|
||||
end_time = time.time()
|
||||
|
||||
if start_mem_mb is not None and end_mem_mb is not None:
|
||||
mem_delta_mb = end_mem_mb - start_mem_mb # <--- Calculate delta
|
||||
peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb) # <--- Get peak memory
|
||||
logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"results": [result.model_dump() for result in results]
|
||||
"results": [result.model_dump() for result in results],
|
||||
"server_processing_time_s": end_time - start_time,
|
||||
"server_memory_delta_mb": mem_delta_mb,
|
||||
"server_peak_memory_mb": peak_mem_mb
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Crawl error: {str(e)}", exc_info=True)
|
||||
if 'crawler' in locals():
|
||||
await crawler.close()
|
||||
if 'crawler' in locals() and crawler.ready: # Check if crawler was initialized and started
|
||||
# try:
|
||||
# await crawler.close()
|
||||
# except Exception as close_e:
|
||||
# logger.error(f"Error closing crawler during exception handling: {close_e}")
|
||||
logger.error(f"Error closing crawler during exception handling: {close_e}")
|
||||
|
||||
# Measure memory even on error if possible
|
||||
end_mem_mb_error = _get_memory_mb()
|
||||
if start_mem_mb is not None and end_mem_mb_error is not None:
|
||||
mem_delta_mb = end_mem_mb_error - start_mem_mb
|
||||
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=str(e)
|
||||
detail=json.dumps({ # Send structured error
|
||||
"error": str(e),
|
||||
"server_memory_delta_mb": mem_delta_mb,
|
||||
"server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
|
||||
})
|
||||
)
|
||||
|
||||
async def handle_stream_crawl_request(
|
||||
@@ -421,9 +484,11 @@ async def handle_stream_crawl_request(
|
||||
"""Handle streaming crawl requests."""
|
||||
try:
|
||||
browser_config = BrowserConfig.load(browser_config)
|
||||
browser_config.verbose = True
|
||||
# browser_config.verbose = True # Set to False or remove for production stress testing
|
||||
browser_config.verbose = False
|
||||
crawler_config = CrawlerRunConfig.load(crawler_config)
|
||||
crawler_config.scraping_strategy = LXMLWebScrapingStrategy()
|
||||
crawler_config.stream = True
|
||||
|
||||
dispatcher = MemoryAdaptiveDispatcher(
|
||||
memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
|
||||
@@ -432,8 +497,11 @@ async def handle_stream_crawl_request(
|
||||
)
|
||||
)
|
||||
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
from crawler_pool import get_crawler
|
||||
crawler = await get_crawler(browser_config)
|
||||
|
||||
# crawler = AsyncWebCrawler(config=browser_config)
|
||||
# await crawler.start()
|
||||
|
||||
results_gen = await crawler.arun_many(
|
||||
urls=urls,
|
||||
@@ -444,10 +512,60 @@ async def handle_stream_crawl_request(
|
||||
return crawler, results_gen
|
||||
|
||||
except Exception as e:
|
||||
if 'crawler' in locals():
|
||||
await crawler.close()
|
||||
# Make sure to close crawler if started during an error here
|
||||
if 'crawler' in locals() and crawler.ready:
|
||||
# try:
|
||||
# await crawler.close()
|
||||
# except Exception as close_e:
|
||||
# logger.error(f"Error closing crawler during stream setup exception: {close_e}")
|
||||
logger.error(f"Error closing crawler during stream setup exception: {close_e}")
|
||||
logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
|
||||
# Raising HTTPException here will prevent streaming response
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=str(e)
|
||||
)
|
||||
)
|
||||
|
||||
async def handle_crawl_job(
|
||||
redis,
|
||||
background_tasks: BackgroundTasks,
|
||||
urls: List[str],
|
||||
browser_config: Dict,
|
||||
crawler_config: Dict,
|
||||
config: Dict,
|
||||
) -> Dict:
|
||||
"""
|
||||
Fire-and-forget version of handle_crawl_request.
|
||||
Creates a task in Redis, runs the heavy work in a background task,
|
||||
lets /crawl/job/{task_id} polling fetch the result.
|
||||
"""
|
||||
task_id = f"crawl_{uuid4().hex[:8]}"
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.PROCESSING, # <-- keep enum values consistent
|
||||
"created_at": datetime.utcnow().isoformat(),
|
||||
"url": json.dumps(urls), # store list as JSON string
|
||||
"result": "",
|
||||
"error": "",
|
||||
})
|
||||
|
||||
async def _runner():
|
||||
try:
|
||||
result = await handle_crawl_request(
|
||||
urls=urls,
|
||||
browser_config=browser_config,
|
||||
crawler_config=crawler_config,
|
||||
config=config,
|
||||
)
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.COMPLETED,
|
||||
"result": json.dumps(result),
|
||||
})
|
||||
await asyncio.sleep(5) # Give Redis time to process the update
|
||||
except Exception as exc:
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.FAILED,
|
||||
"error": str(exc),
|
||||
})
|
||||
|
||||
background_tasks.add_task(_runner)
|
||||
return {"task_id": task_id}
|
||||
11631
deploy/docker/c4ai-code-context.md
Normal file
11631
deploy/docker/c4ai-code-context.md
Normal file
File diff suppressed because it is too large
Load Diff
8899
deploy/docker/c4ai-doc-context.md
Normal file
8899
deploy/docker/c4ai-doc-context.md
Normal file
File diff suppressed because it is too large
Load Diff
@@ -3,8 +3,9 @@ app:
|
||||
title: "Crawl4AI API"
|
||||
version: "1.0.0"
|
||||
host: "0.0.0.0"
|
||||
port: 8020
|
||||
port: 11234
|
||||
reload: False
|
||||
workers: 1
|
||||
timeout_keep_alive: 300
|
||||
|
||||
# Default LLM Configuration
|
||||
@@ -50,12 +51,31 @@ security:
|
||||
|
||||
# Crawler Configuration
|
||||
crawler:
|
||||
base_config:
|
||||
simulate_user: true
|
||||
memory_threshold_percent: 95.0
|
||||
rate_limiter:
|
||||
enabled: true
|
||||
base_delay: [1.0, 2.0]
|
||||
timeouts:
|
||||
stream_init: 30.0 # Timeout for stream initialization
|
||||
batch_process: 300.0 # Timeout for batch processing
|
||||
pool:
|
||||
max_pages: 40 # ← GLOBAL_SEM permits
|
||||
idle_ttl_sec: 1800 # ← 30 min janitor cutoff
|
||||
browser:
|
||||
kwargs:
|
||||
headless: true
|
||||
text_mode: true
|
||||
extra_args:
|
||||
# - "--single-process"
|
||||
- "--no-sandbox"
|
||||
- "--disable-dev-shm-usage"
|
||||
- "--disable-gpu"
|
||||
- "--disable-software-rasterizer"
|
||||
- "--disable-web-security"
|
||||
- "--allow-insecure-localhost"
|
||||
- "--ignore-certificate-errors"
|
||||
|
||||
# Logging Configuration
|
||||
logging:
|
||||
@@ -68,4 +88,4 @@ observability:
|
||||
enabled: True
|
||||
endpoint: "/metrics"
|
||||
health_check:
|
||||
endpoint: "/health"
|
||||
endpoint: "/health"
|
||||
60
deploy/docker/crawler_pool.py
Normal file
60
deploy/docker/crawler_pool.py
Normal file
@@ -0,0 +1,60 @@
|
||||
# crawler_pool.py (new file)
|
||||
import asyncio, json, hashlib, time, psutil
|
||||
from contextlib import suppress
|
||||
from typing import Dict
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
from typing import Dict
|
||||
from utils import load_config
|
||||
|
||||
CONFIG = load_config()
|
||||
|
||||
POOL: Dict[str, AsyncWebCrawler] = {}
|
||||
LAST_USED: Dict[str, float] = {}
|
||||
LOCK = asyncio.Lock()
|
||||
|
||||
MEM_LIMIT = CONFIG.get("crawler", {}).get("memory_threshold_percent", 95.0) # % RAM – refuse new browsers above this
|
||||
IDLE_TTL = CONFIG.get("crawler", {}).get("pool", {}).get("idle_ttl_sec", 1800) # close if unused for 30 min
|
||||
|
||||
def _sig(cfg: BrowserConfig) -> str:
|
||||
payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
|
||||
return hashlib.sha1(payload.encode()).hexdigest()
|
||||
|
||||
async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
|
||||
try:
|
||||
sig = _sig(cfg)
|
||||
async with LOCK:
|
||||
if sig in POOL:
|
||||
LAST_USED[sig] = time.time();
|
||||
return POOL[sig]
|
||||
if psutil.virtual_memory().percent >= MEM_LIMIT:
|
||||
raise MemoryError("RAM pressure – new browser denied")
|
||||
crawler = AsyncWebCrawler(config=cfg, thread_safe=False)
|
||||
await crawler.start()
|
||||
POOL[sig] = crawler; LAST_USED[sig] = time.time()
|
||||
return crawler
|
||||
except MemoryError as e:
|
||||
raise MemoryError(f"RAM pressure – new browser denied: {e}")
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Failed to start browser: {e}")
|
||||
finally:
|
||||
if sig in POOL:
|
||||
LAST_USED[sig] = time.time()
|
||||
else:
|
||||
# If we failed to start the browser, we should remove it from the pool
|
||||
POOL.pop(sig, None)
|
||||
LAST_USED.pop(sig, None)
|
||||
# If we failed to start the browser, we should remove it from the pool
|
||||
async def close_all():
|
||||
async with LOCK:
|
||||
await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
|
||||
POOL.clear(); LAST_USED.clear()
|
||||
|
||||
async def janitor():
|
||||
while True:
|
||||
await asyncio.sleep(60)
|
||||
now = time.time()
|
||||
async with LOCK:
|
||||
for sig, crawler in list(POOL.items()):
|
||||
if now - LAST_USED[sig] > IDLE_TTL:
|
||||
with suppress(Exception): await crawler.close()
|
||||
POOL.pop(sig, None); LAST_USED.pop(sig, None)
|
||||
99
deploy/docker/job.py
Normal file
99
deploy/docker/job.py
Normal file
@@ -0,0 +1,99 @@
|
||||
"""
|
||||
Job endpoints (enqueue + poll) for long-running LLM extraction and raw crawl.
|
||||
Relies on the existing Redis task helpers in api.py
|
||||
"""
|
||||
|
||||
from typing import Dict, Optional, Callable
|
||||
from fastapi import APIRouter, BackgroundTasks, Depends, Request
|
||||
from pydantic import BaseModel, HttpUrl
|
||||
|
||||
from api import (
|
||||
handle_llm_request,
|
||||
handle_crawl_job,
|
||||
handle_task_status,
|
||||
)
|
||||
|
||||
# ------------- dependency placeholders -------------
|
||||
_redis = None # will be injected from server.py
|
||||
_config = None
|
||||
_token_dep: Callable = lambda: None # dummy until injected
|
||||
|
||||
# public router
|
||||
router = APIRouter()
|
||||
|
||||
|
||||
# === init hook called by server.py =========================================
|
||||
def init_job_router(redis, config, token_dep) -> APIRouter:
|
||||
"""Inject shared singletons and return the router for mounting."""
|
||||
global _redis, _config, _token_dep
|
||||
_redis, _config, _token_dep = redis, config, token_dep
|
||||
return router
|
||||
|
||||
|
||||
# ---------- payload models --------------------------------------------------
|
||||
class LlmJobPayload(BaseModel):
|
||||
url: HttpUrl
|
||||
q: str
|
||||
schema: Optional[str] = None
|
||||
cache: bool = False
|
||||
|
||||
|
||||
class CrawlJobPayload(BaseModel):
|
||||
urls: list[HttpUrl]
|
||||
browser_config: Dict = {}
|
||||
crawler_config: Dict = {}
|
||||
|
||||
|
||||
# ---------- LLM job ---------------------------------------------------------
|
||||
@router.post("/llm/job", status_code=202)
|
||||
async def llm_job_enqueue(
|
||||
payload: LlmJobPayload,
|
||||
background_tasks: BackgroundTasks,
|
||||
request: Request,
|
||||
_td: Dict = Depends(lambda: _token_dep()), # late-bound dep
|
||||
):
|
||||
return await handle_llm_request(
|
||||
_redis,
|
||||
background_tasks,
|
||||
request,
|
||||
str(payload.url),
|
||||
query=payload.q,
|
||||
schema=payload.schema,
|
||||
cache=payload.cache,
|
||||
config=_config,
|
||||
)
|
||||
|
||||
|
||||
@router.get("/llm/job/{task_id}")
|
||||
async def llm_job_status(
|
||||
request: Request,
|
||||
task_id: str,
|
||||
_td: Dict = Depends(lambda: _token_dep())
|
||||
):
|
||||
return await handle_task_status(_redis, task_id)
|
||||
|
||||
|
||||
# ---------- CRAWL job -------------------------------------------------------
|
||||
@router.post("/crawl/job", status_code=202)
|
||||
async def crawl_job_enqueue(
|
||||
payload: CrawlJobPayload,
|
||||
background_tasks: BackgroundTasks,
|
||||
_td: Dict = Depends(lambda: _token_dep()),
|
||||
):
|
||||
return await handle_crawl_job(
|
||||
_redis,
|
||||
background_tasks,
|
||||
[str(u) for u in payload.urls],
|
||||
payload.browser_config,
|
||||
payload.crawler_config,
|
||||
config=_config,
|
||||
)
|
||||
|
||||
|
||||
@router.get("/crawl/job/{task_id}")
|
||||
async def crawl_job_status(
|
||||
request: Request,
|
||||
task_id: str,
|
||||
_td: Dict = Depends(lambda: _token_dep())
|
||||
):
|
||||
return await handle_task_status(_redis, task_id, base_url=str(request.base_url))
|
||||
252
deploy/docker/mcp_bridge.py
Normal file
252
deploy/docker/mcp_bridge.py
Normal file
@@ -0,0 +1,252 @@
|
||||
# deploy/docker/mcp_bridge.py
|
||||
|
||||
from __future__ import annotations
|
||||
import inspect, json, re, anyio
|
||||
from contextlib import suppress
|
||||
from typing import Any, Callable, Dict, List, Tuple
|
||||
import httpx
|
||||
|
||||
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
|
||||
from fastapi.responses import JSONResponse
|
||||
from fastapi import Request
|
||||
from sse_starlette.sse import EventSourceResponse
|
||||
from pydantic import BaseModel
|
||||
from mcp.server.sse import SseServerTransport
|
||||
|
||||
import mcp.types as t
|
||||
from mcp.server.lowlevel.server import Server, NotificationOptions
|
||||
from mcp.server.models import InitializationOptions
|
||||
|
||||
# ── opt‑in decorators ───────────────────────────────────────────
|
||||
def mcp_resource(name: str | None = None):
|
||||
def deco(fn):
|
||||
fn.__mcp_kind__, fn.__mcp_name__ = "resource", name
|
||||
return fn
|
||||
return deco
|
||||
|
||||
def mcp_template(name: str | None = None):
|
||||
def deco(fn):
|
||||
fn.__mcp_kind__, fn.__mcp_name__ = "template", name
|
||||
return fn
|
||||
return deco
|
||||
|
||||
def mcp_tool(name: str | None = None):
|
||||
def deco(fn):
|
||||
fn.__mcp_kind__, fn.__mcp_name__ = "tool", name
|
||||
return fn
|
||||
return deco
|
||||
|
||||
# ── HTTP‑proxy helper for FastAPI endpoints ─────────────────────
|
||||
def _make_http_proxy(base_url: str, route):
|
||||
method = list(route.methods - {"HEAD", "OPTIONS"})[0]
|
||||
async def proxy(**kwargs):
|
||||
# replace `/items/{id}` style params first
|
||||
path = route.path
|
||||
for k, v in list(kwargs.items()):
|
||||
placeholder = "{" + k + "}"
|
||||
if placeholder in path:
|
||||
path = path.replace(placeholder, str(v))
|
||||
kwargs.pop(k)
|
||||
url = base_url.rstrip("/") + path
|
||||
|
||||
async with httpx.AsyncClient() as client:
|
||||
try:
|
||||
r = (
|
||||
await client.get(url, params=kwargs)
|
||||
if method == "GET"
|
||||
else await client.request(method, url, json=kwargs)
|
||||
)
|
||||
r.raise_for_status()
|
||||
return r.text if method == "GET" else r.json()
|
||||
except httpx.HTTPStatusError as e:
|
||||
# surface FastAPI error details instead of plain 500
|
||||
raise HTTPException(e.response.status_code, e.response.text)
|
||||
return proxy
|
||||
|
||||
# ── main entry point ────────────────────────────────────────────
|
||||
def attach_mcp(
|
||||
app: FastAPI,
|
||||
*, # keyword‑only
|
||||
base: str = "/mcp",
|
||||
name: str | None = None,
|
||||
base_url: str, # eg. "http://127.0.0.1:8020"
|
||||
) -> None:
|
||||
"""Call once after all routes are declared to expose WS+SSE MCP endpoints."""
|
||||
server_name = name or app.title or "FastAPI-MCP"
|
||||
mcp = Server(server_name)
|
||||
|
||||
# tools: Dict[str, Callable] = {}
|
||||
tools: Dict[str, Tuple[Callable, Callable]] = {}
|
||||
resources: Dict[str, Callable] = {}
|
||||
templates: Dict[str, Callable] = {}
|
||||
|
||||
# register decorated FastAPI routes
|
||||
for route in app.routes:
|
||||
fn = getattr(route, "endpoint", None)
|
||||
kind = getattr(fn, "__mcp_kind__", None)
|
||||
if not kind:
|
||||
continue
|
||||
|
||||
key = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
|
||||
|
||||
# if kind == "tool":
|
||||
# tools[key] = _make_http_proxy(base_url, route)
|
||||
if kind == "tool":
|
||||
proxy = _make_http_proxy(base_url, route)
|
||||
tools[key] = (proxy, fn)
|
||||
continue
|
||||
if kind == "resource":
|
||||
resources[key] = fn
|
||||
if kind == "template":
|
||||
templates[key] = fn
|
||||
|
||||
# helpers for JSON‑Schema
|
||||
def _schema(model: type[BaseModel] | None) -> dict:
|
||||
return {"type": "object"} if model is None else model.model_json_schema()
|
||||
|
||||
def _body_model(fn: Callable) -> type[BaseModel] | None:
|
||||
for p in inspect.signature(fn).parameters.values():
|
||||
a = p.annotation
|
||||
if inspect.isclass(a) and issubclass(a, BaseModel):
|
||||
return a
|
||||
return None
|
||||
|
||||
# MCP handlers
|
||||
@mcp.list_tools()
|
||||
async def _list_tools() -> List[t.Tool]:
|
||||
out = []
|
||||
for k, (proxy, orig_fn) in tools.items():
|
||||
desc = getattr(orig_fn, "__mcp_description__", None) or inspect.getdoc(orig_fn) or ""
|
||||
schema = getattr(orig_fn, "__mcp_schema__", None) or _schema(_body_model(orig_fn))
|
||||
out.append(
|
||||
t.Tool(name=k, description=desc, inputSchema=schema)
|
||||
)
|
||||
return out
|
||||
|
||||
|
||||
@mcp.call_tool()
|
||||
async def _call_tool(name: str, arguments: Dict | None) -> List[t.TextContent]:
|
||||
if name not in tools:
|
||||
raise HTTPException(404, "tool not found")
|
||||
|
||||
proxy, _ = tools[name]
|
||||
try:
|
||||
res = await proxy(**(arguments or {}))
|
||||
except HTTPException as exc:
|
||||
# map server‑side errors into MCP "text/error" payloads
|
||||
err = {"error": exc.status_code, "detail": exc.detail}
|
||||
return [t.TextContent(type = "text", text=json.dumps(err))]
|
||||
return [t.TextContent(type = "text", text=json.dumps(res, default=str))]
|
||||
|
||||
@mcp.list_resources()
|
||||
async def _list_resources() -> List[t.Resource]:
|
||||
return [
|
||||
t.Resource(name=k, description=inspect.getdoc(f) or "", mime_type="application/json")
|
||||
for k, f in resources.items()
|
||||
]
|
||||
|
||||
@mcp.read_resource()
|
||||
async def _read_resource(name: str) -> List[t.TextContent]:
|
||||
if name not in resources:
|
||||
raise HTTPException(404, "resource not found")
|
||||
res = resources[name]()
|
||||
return [t.TextContent(type = "text", text=json.dumps(res, default=str))]
|
||||
|
||||
@mcp.list_resource_templates()
|
||||
async def _list_templates() -> List[t.ResourceTemplate]:
|
||||
return [
|
||||
t.ResourceTemplate(
|
||||
name=k,
|
||||
description=inspect.getdoc(f) or "",
|
||||
parameters={
|
||||
p: {"type": "string"} for p in _path_params(app, f)
|
||||
},
|
||||
)
|
||||
for k, f in templates.items()
|
||||
]
|
||||
|
||||
init_opts = InitializationOptions(
|
||||
server_name=server_name,
|
||||
server_version="0.1.0",
|
||||
capabilities=mcp.get_capabilities(
|
||||
notification_options=NotificationOptions(),
|
||||
experimental_capabilities={},
|
||||
),
|
||||
)
|
||||
|
||||
# ── WebSocket transport ────────────────────────────────────
|
||||
@app.websocket_route(f"{base}/ws")
|
||||
async def _ws(ws: WebSocket):
|
||||
await ws.accept()
|
||||
c2s_send, c2s_recv = anyio.create_memory_object_stream(100)
|
||||
s2c_send, s2c_recv = anyio.create_memory_object_stream(100)
|
||||
|
||||
from pydantic import TypeAdapter
|
||||
from mcp.types import JSONRPCMessage
|
||||
adapter = TypeAdapter(JSONRPCMessage)
|
||||
|
||||
init_done = anyio.Event()
|
||||
|
||||
async def srv_to_ws():
|
||||
first = True
|
||||
try:
|
||||
async for msg in s2c_recv:
|
||||
await ws.send_json(msg.model_dump())
|
||||
if first:
|
||||
init_done.set()
|
||||
first = False
|
||||
finally:
|
||||
# make sure cleanup survives TaskGroup cancellation
|
||||
with anyio.CancelScope(shield=True):
|
||||
with suppress(RuntimeError): # idempotent close
|
||||
await ws.close()
|
||||
|
||||
async def ws_to_srv():
|
||||
try:
|
||||
# 1st frame is always "initialize"
|
||||
first = adapter.validate_python(await ws.receive_json())
|
||||
await c2s_send.send(first)
|
||||
await init_done.wait() # block until server ready
|
||||
while True:
|
||||
data = await ws.receive_json()
|
||||
await c2s_send.send(adapter.validate_python(data))
|
||||
except WebSocketDisconnect:
|
||||
await c2s_send.aclose()
|
||||
|
||||
async with anyio.create_task_group() as tg:
|
||||
tg.start_soon(mcp.run, c2s_recv, s2c_send, init_opts)
|
||||
tg.start_soon(ws_to_srv)
|
||||
tg.start_soon(srv_to_ws)
|
||||
|
||||
# ── SSE transport (official) ─────────────────────────────
|
||||
sse = SseServerTransport(f"{base}/messages/")
|
||||
|
||||
@app.get(f"{base}/sse")
|
||||
async def _mcp_sse(request: Request):
|
||||
async with sse.connect_sse(
|
||||
request.scope, request.receive, request._send # starlette ASGI primitives
|
||||
) as (read_stream, write_stream):
|
||||
await mcp.run(read_stream, write_stream, init_opts)
|
||||
|
||||
# client → server frames are POSTed here
|
||||
app.mount(f"{base}/messages", app=sse.handle_post_message)
|
||||
|
||||
# ── schema endpoint ───────────────────────────────────────
|
||||
@app.get(f"{base}/schema")
|
||||
async def _schema_endpoint():
|
||||
return JSONResponse({
|
||||
"tools": [x.model_dump() for x in await _list_tools()],
|
||||
"resources": [x.model_dump() for x in await _list_resources()],
|
||||
"resource_templates": [x.model_dump() for x in await _list_templates()],
|
||||
})
|
||||
|
||||
|
||||
# ── helpers ────────────────────────────────────────────────────
|
||||
def _route_name(path: str) -> str:
|
||||
return re.sub(r"[/{}}]", "_", path).strip("_")
|
||||
|
||||
def _path_params(app: FastAPI, fn: Callable) -> List[str]:
|
||||
for r in app.routes:
|
||||
if r.endpoint is fn:
|
||||
return list(r.param_convertors.keys())
|
||||
return []
|
||||
@@ -1,9 +1,16 @@
|
||||
fastapi
|
||||
uvicorn
|
||||
fastapi>=0.115.12
|
||||
uvicorn>=0.34.2
|
||||
gunicorn>=23.0.0
|
||||
slowapi>=0.1.9
|
||||
prometheus-fastapi-instrumentator>=7.0.2
|
||||
slowapi==0.1.9
|
||||
prometheus-fastapi-instrumentator>=7.1.0
|
||||
redis>=5.2.1
|
||||
jwt>=1.3.1
|
||||
dnspython>=2.7.0
|
||||
email-validator>=2.2.0
|
||||
email-validator==2.2.0
|
||||
sse-starlette==2.2.1
|
||||
pydantic>=2.11
|
||||
rank-bm25==0.2.2
|
||||
anyio==4.9.0
|
||||
PyJWT==2.10.1
|
||||
mcp>=1.6.0
|
||||
websockets>=15.0.1
|
||||
|
||||
42
deploy/docker/schemas.py
Normal file
42
deploy/docker/schemas.py
Normal file
@@ -0,0 +1,42 @@
|
||||
from typing import List, Optional, Dict
|
||||
from enum import Enum
|
||||
from pydantic import BaseModel, Field
|
||||
from utils import FilterType
|
||||
|
||||
|
||||
class CrawlRequest(BaseModel):
|
||||
urls: List[str] = Field(min_length=1, max_length=100)
|
||||
browser_config: Optional[Dict] = Field(default_factory=dict)
|
||||
crawler_config: Optional[Dict] = Field(default_factory=dict)
|
||||
|
||||
class MarkdownRequest(BaseModel):
|
||||
"""Request body for the /md endpoint."""
|
||||
url: str = Field(..., description="Absolute http/https URL to fetch")
|
||||
f: FilterType = Field(FilterType.FIT,
|
||||
description="Content‑filter strategy: FIT, RAW, BM25, or LLM")
|
||||
q: Optional[str] = Field(None, description="Query string used by BM25/LLM filters")
|
||||
c: Optional[str] = Field("0", description="Cache‑bust / revision counter")
|
||||
|
||||
|
||||
class RawCode(BaseModel):
|
||||
code: str
|
||||
|
||||
class HTMLRequest(BaseModel):
|
||||
url: str
|
||||
|
||||
class ScreenshotRequest(BaseModel):
|
||||
url: str
|
||||
screenshot_wait_for: Optional[float] = 2
|
||||
output_path: Optional[str] = None
|
||||
|
||||
class PDFRequest(BaseModel):
|
||||
url: str
|
||||
output_path: Optional[str] = None
|
||||
|
||||
|
||||
class JSEndpointRequest(BaseModel):
|
||||
url: str
|
||||
scripts: List[str] = Field(
|
||||
...,
|
||||
description="List of separated JavaScript snippets to execute"
|
||||
)
|
||||
@@ -1,150 +1,449 @@
|
||||
# ───────────────────────── server.py ─────────────────────────
|
||||
"""
|
||||
Crawl4AI FastAPI entry‑point
|
||||
• Browser pool + global page cap
|
||||
• Rate‑limiting, security, metrics
|
||||
• /crawl, /crawl/stream, /md, /llm endpoints
|
||||
"""
|
||||
|
||||
# ── stdlib & 3rd‑party imports ───────────────────────────────
|
||||
from crawler_pool import get_crawler, close_all, janitor
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from auth import create_access_token, get_token_dependency, TokenRequest
|
||||
from pydantic import BaseModel
|
||||
from typing import Optional, List, Dict
|
||||
from fastapi import Request, Depends
|
||||
from fastapi.responses import FileResponse
|
||||
import base64
|
||||
import re
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from api import (
|
||||
handle_markdown_request, handle_llm_qa,
|
||||
handle_stream_crawl_request, handle_crawl_request,
|
||||
stream_results
|
||||
)
|
||||
from schemas import (
|
||||
CrawlRequest,
|
||||
MarkdownRequest,
|
||||
RawCode,
|
||||
HTMLRequest,
|
||||
ScreenshotRequest,
|
||||
PDFRequest,
|
||||
JSEndpointRequest,
|
||||
)
|
||||
|
||||
from utils import (
|
||||
FilterType, load_config, setup_logging, verify_email_domain
|
||||
)
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from typing import List, Optional, Dict
|
||||
from fastapi import FastAPI, HTTPException, Request, Query, Path, Depends
|
||||
from fastapi.responses import StreamingResponse, RedirectResponse, PlainTextResponse, JSONResponse
|
||||
import asyncio
|
||||
from typing import List
|
||||
from contextlib import asynccontextmanager
|
||||
import pathlib
|
||||
|
||||
from fastapi import (
|
||||
FastAPI, HTTPException, Request, Path, Query, Depends
|
||||
)
|
||||
from rank_bm25 import BM25Okapi
|
||||
from fastapi.responses import (
|
||||
StreamingResponse, RedirectResponse, PlainTextResponse, JSONResponse
|
||||
)
|
||||
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
|
||||
from fastapi.middleware.trustedhost import TrustedHostMiddleware
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from job import init_job_router
|
||||
|
||||
from mcp_bridge import attach_mcp, mcp_resource, mcp_template, mcp_tool
|
||||
|
||||
import ast
|
||||
import crawl4ai as _c4
|
||||
from pydantic import BaseModel, Field
|
||||
from slowapi import Limiter
|
||||
from slowapi.util import get_remote_address
|
||||
from prometheus_fastapi_instrumentator import Instrumentator
|
||||
from redis import asyncio as aioredis
|
||||
|
||||
# ── internal imports (after sys.path append) ─────────────────
|
||||
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
|
||||
from utils import FilterType, load_config, setup_logging, verify_email_domain
|
||||
from api import (
|
||||
handle_markdown_request,
|
||||
handle_llm_qa,
|
||||
handle_stream_crawl_request,
|
||||
handle_crawl_request,
|
||||
stream_results
|
||||
)
|
||||
from auth import create_access_token, get_token_dependency, TokenRequest # Import from auth.py
|
||||
|
||||
__version__ = "0.2.6"
|
||||
|
||||
class CrawlRequest(BaseModel):
|
||||
urls: List[str] = Field(min_length=1, max_length=100)
|
||||
browser_config: Optional[Dict] = Field(default_factory=dict)
|
||||
crawler_config: Optional[Dict] = Field(default_factory=dict)
|
||||
|
||||
# Load configuration and setup
|
||||
# ────────────────── configuration / logging ──────────────────
|
||||
config = load_config()
|
||||
setup_logging(config)
|
||||
|
||||
# Initialize Redis
|
||||
__version__ = "0.5.1-d1"
|
||||
|
||||
# ── global page semaphore (hard cap) ─────────────────────────
|
||||
MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
|
||||
GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)
|
||||
|
||||
# import logging
|
||||
# page_log = logging.getLogger("page_cap")
|
||||
# orig_arun = AsyncWebCrawler.arun
|
||||
# async def capped_arun(self, *a, **kw):
|
||||
# await GLOBAL_SEM.acquire() # ← take slot
|
||||
# try:
|
||||
# in_flight = MAX_PAGES - GLOBAL_SEM._value # used permits
|
||||
# page_log.info("🕸️ pages_in_flight=%s / %s", in_flight, MAX_PAGES)
|
||||
# return await orig_arun(self, *a, **kw)
|
||||
# finally:
|
||||
# GLOBAL_SEM.release() # ← free slot
|
||||
|
||||
orig_arun = AsyncWebCrawler.arun
|
||||
|
||||
|
||||
async def capped_arun(self, *a, **kw):
|
||||
async with GLOBAL_SEM:
|
||||
return await orig_arun(self, *a, **kw)
|
||||
AsyncWebCrawler.arun = capped_arun
|
||||
|
||||
# ───────────────────── FastAPI lifespan ──────────────────────
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(_: FastAPI):
|
||||
await get_crawler(BrowserConfig(
|
||||
extra_args=config["crawler"]["browser"].get("extra_args", []),
|
||||
**config["crawler"]["browser"].get("kwargs", {}),
|
||||
)) # warm‑up
|
||||
app.state.janitor = asyncio.create_task(janitor()) # idle GC
|
||||
yield
|
||||
app.state.janitor.cancel()
|
||||
await close_all()
|
||||
|
||||
# ───────────────────── FastAPI instance ──────────────────────
|
||||
app = FastAPI(
|
||||
title=config["app"]["title"],
|
||||
version=config["app"]["version"],
|
||||
lifespan=lifespan,
|
||||
)
|
||||
|
||||
# ── static playground ──────────────────────────────────────
|
||||
STATIC_DIR = pathlib.Path(__file__).parent / "static" / "playground"
|
||||
if not STATIC_DIR.exists():
|
||||
raise RuntimeError(f"Playground assets not found at {STATIC_DIR}")
|
||||
app.mount(
|
||||
"/playground",
|
||||
StaticFiles(directory=STATIC_DIR, html=True),
|
||||
name="play",
|
||||
)
|
||||
|
||||
|
||||
@app.get("/")
|
||||
async def root():
|
||||
return RedirectResponse("/playground")
|
||||
|
||||
# ─────────────────── infra / middleware ─────────────────────
|
||||
redis = aioredis.from_url(config["redis"].get("uri", "redis://localhost"))
|
||||
|
||||
# Initialize rate limiter
|
||||
limiter = Limiter(
|
||||
key_func=get_remote_address,
|
||||
default_limits=[config["rate_limiting"]["default_limit"]],
|
||||
storage_uri=config["rate_limiting"]["storage_uri"]
|
||||
storage_uri=config["rate_limiting"]["storage_uri"],
|
||||
)
|
||||
|
||||
app = FastAPI(
|
||||
title=config["app"]["title"],
|
||||
version=config["app"]["version"]
|
||||
)
|
||||
|
||||
# Configure middleware
|
||||
def setup_security_middleware(app, config):
|
||||
sec_config = config.get("security", {})
|
||||
if sec_config.get("enabled", False):
|
||||
if sec_config.get("https_redirect", False):
|
||||
app.add_middleware(HTTPSRedirectMiddleware)
|
||||
if sec_config.get("trusted_hosts", []) != ["*"]:
|
||||
app.add_middleware(TrustedHostMiddleware, allowed_hosts=sec_config["trusted_hosts"])
|
||||
def _setup_security(app_: FastAPI):
|
||||
sec = config["security"]
|
||||
if not sec["enabled"]:
|
||||
return
|
||||
if sec.get("https_redirect"):
|
||||
app_.add_middleware(HTTPSRedirectMiddleware)
|
||||
if sec.get("trusted_hosts", []) != ["*"]:
|
||||
app_.add_middleware(
|
||||
TrustedHostMiddleware, allowed_hosts=sec["trusted_hosts"]
|
||||
)
|
||||
|
||||
setup_security_middleware(app, config)
|
||||
|
||||
# Prometheus instrumentation
|
||||
_setup_security(app)
|
||||
|
||||
if config["observability"]["prometheus"]["enabled"]:
|
||||
Instrumentator().instrument(app).expose(app)
|
||||
|
||||
# Get token dependency based on config
|
||||
token_dependency = get_token_dependency(config)
|
||||
token_dep = get_token_dependency(config)
|
||||
|
||||
|
||||
# Middleware for security headers
|
||||
@app.middleware("http")
|
||||
async def add_security_headers(request: Request, call_next):
|
||||
response = await call_next(request)
|
||||
resp = await call_next(request)
|
||||
if config["security"]["enabled"]:
|
||||
response.headers.update(config["security"]["headers"])
|
||||
return response
|
||||
resp.headers.update(config["security"]["headers"])
|
||||
return resp
|
||||
|
||||
# Token endpoint (always available, but usage depends on config)
|
||||
# ───────────────── safe config‑dump helper ─────────────────
|
||||
ALLOWED_TYPES = {
|
||||
"CrawlerRunConfig": CrawlerRunConfig,
|
||||
"BrowserConfig": BrowserConfig,
|
||||
}
|
||||
|
||||
|
||||
def _safe_eval_config(expr: str) -> dict:
|
||||
"""
|
||||
Accept exactly one top‑level call to CrawlerRunConfig(...) or BrowserConfig(...).
|
||||
Whatever is inside the parentheses is fine *except* further function calls
|
||||
(so no __import__('os') stuff). All public names from crawl4ai are available
|
||||
when we eval.
|
||||
"""
|
||||
tree = ast.parse(expr, mode="eval")
|
||||
|
||||
# must be a single call
|
||||
if not isinstance(tree.body, ast.Call):
|
||||
raise ValueError("Expression must be a single constructor call")
|
||||
|
||||
call = tree.body
|
||||
if not (isinstance(call.func, ast.Name) and call.func.id in {"CrawlerRunConfig", "BrowserConfig"}):
|
||||
raise ValueError(
|
||||
"Only CrawlerRunConfig(...) or BrowserConfig(...) are allowed")
|
||||
|
||||
# forbid nested calls to keep the surface tiny
|
||||
for node in ast.walk(call):
|
||||
if isinstance(node, ast.Call) and node is not call:
|
||||
raise ValueError("Nested function calls are not permitted")
|
||||
|
||||
# expose everything that crawl4ai exports, nothing else
|
||||
safe_env = {name: getattr(_c4, name)
|
||||
for name in dir(_c4) if not name.startswith("_")}
|
||||
obj = eval(compile(tree, "<config>", "eval"),
|
||||
{"__builtins__": {}}, safe_env)
|
||||
return obj.dump()
|
||||
|
||||
|
||||
# ── job router ──────────────────────────────────────────────
|
||||
app.include_router(init_job_router(redis, config, token_dep))
|
||||
|
||||
# ──────────────────────── Endpoints ──────────────────────────
|
||||
@app.post("/token")
|
||||
async def get_token(request_data: TokenRequest):
|
||||
if not verify_email_domain(request_data.email):
|
||||
raise HTTPException(status_code=400, detail="Invalid email domain")
|
||||
token = create_access_token({"sub": request_data.email})
|
||||
return {"email": request_data.email, "access_token": token, "token_type": "bearer"}
|
||||
async def get_token(req: TokenRequest):
|
||||
if not verify_email_domain(req.email):
|
||||
raise HTTPException(400, "Invalid email domain")
|
||||
token = create_access_token({"sub": req.email})
|
||||
return {"email": req.email, "access_token": token, "token_type": "bearer"}
|
||||
|
||||
# Endpoints with conditional auth
|
||||
@app.get("/md/{url:path}")
|
||||
|
||||
@app.post("/config/dump")
|
||||
async def config_dump(raw: RawCode):
|
||||
try:
|
||||
return JSONResponse(_safe_eval_config(raw.code.strip()))
|
||||
except Exception as e:
|
||||
raise HTTPException(400, str(e))
|
||||
|
||||
|
||||
@app.post("/md")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("md")
|
||||
async def get_markdown(
|
||||
request: Request,
|
||||
url: str,
|
||||
f: FilterType = FilterType.FIT,
|
||||
q: Optional[str] = None,
|
||||
c: Optional[str] = "0",
|
||||
token_data: Optional[Dict] = Depends(token_dependency)
|
||||
body: MarkdownRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
result = await handle_markdown_request(url, f, q, c, config)
|
||||
return PlainTextResponse(result)
|
||||
if not body.url.startswith(("http://", "https://")):
|
||||
raise HTTPException(
|
||||
400, "URL must be absolute and start with http/https")
|
||||
markdown = await handle_markdown_request(
|
||||
body.url, body.f, body.q, body.c, config
|
||||
)
|
||||
return JSONResponse({
|
||||
"url": body.url,
|
||||
"filter": body.f,
|
||||
"query": body.q,
|
||||
"cache": body.c,
|
||||
"markdown": markdown,
|
||||
"success": True
|
||||
})
|
||||
|
||||
@app.get("/llm/{url:path}", description="URL should be without http/https prefix")
|
||||
|
||||
@app.post("/html")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("html")
|
||||
async def generate_html(
|
||||
request: Request,
|
||||
body: HTMLRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
"""
|
||||
Crawls the URL, preprocesses the raw HTML for schema extraction, and returns the processed HTML.
|
||||
Use when you need sanitized HTML structures for building schemas or further processing.
|
||||
"""
|
||||
cfg = CrawlerRunConfig()
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
raw_html = results[0].html
|
||||
from crawl4ai.utils import preprocess_html_for_schema
|
||||
processed_html = preprocess_html_for_schema(raw_html)
|
||||
return JSONResponse({"html": processed_html, "url": body.url, "success": True})
|
||||
|
||||
# Screenshot endpoint
|
||||
|
||||
|
||||
@app.post("/screenshot")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("screenshot")
|
||||
async def generate_screenshot(
|
||||
request: Request,
|
||||
body: ScreenshotRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
"""
|
||||
Capture a full-page PNG screenshot of the specified URL, waiting an optional delay before capture,
|
||||
Use when you need an image snapshot of the rendered page. Its recommened to provide an output path to save the screenshot.
|
||||
Then in result instead of the screenshot you will get a path to the saved file.
|
||||
"""
|
||||
cfg = CrawlerRunConfig(
|
||||
screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
screenshot_data = results[0].screenshot
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(base64.b64decode(screenshot_data))
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "screenshot": screenshot_data}
|
||||
|
||||
# PDF endpoint
|
||||
|
||||
|
||||
@app.post("/pdf")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("pdf")
|
||||
async def generate_pdf(
|
||||
request: Request,
|
||||
body: PDFRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
"""
|
||||
Generate a PDF document of the specified URL,
|
||||
Use when you need a printable or archivable snapshot of the page. It is recommended to provide an output path to save the PDF.
|
||||
Then in result instead of the PDF you will get a path to the saved file.
|
||||
"""
|
||||
cfg = CrawlerRunConfig(pdf=True)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
pdf_data = results[0].pdf
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(pdf_data)
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "pdf": base64.b64encode(pdf_data).decode()}
|
||||
|
||||
|
||||
@app.post("/execute_js")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("execute_js")
|
||||
async def execute_js(
|
||||
request: Request,
|
||||
body: JSEndpointRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
"""
|
||||
Execute a sequence of JavaScript snippets on the specified URL.
|
||||
Return the full CrawlResult JSON (first result).
|
||||
Use this when you need to interact with dynamic pages using JS.
|
||||
REMEMBER: Scripts accept a list of separated JS snippets to execute and execute them in order.
|
||||
IMPORTANT: Each script should be an expression that returns a value. It can be an IIFE or an async function. You can think of it as such.
|
||||
Your script will replace '{script}' and execute in the browser context. So provide either an IIFE or a sync/async function that returns a value.
|
||||
Return Format:
|
||||
- The return result is an instance of CrawlResult, so you have access to markdown, links, and other stuff. If this is enough, you don't need to call again for other endpoints.
|
||||
|
||||
```python
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
downloaded_files: Optional[List[str]] = None
|
||||
js_execution_result: Optional[Dict[str, Any]] = None
|
||||
screenshot: Optional[str] = None
|
||||
pdf: Optional[bytes] = None
|
||||
mhtml: Optional[str] = None
|
||||
_markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
|
||||
extracted_content: Optional[str] = None
|
||||
metadata: Optional[dict] = None
|
||||
error_message: Optional[str] = None
|
||||
session_id: Optional[str] = None
|
||||
response_headers: Optional[dict] = None
|
||||
status_code: Optional[int] = None
|
||||
ssl_certificate: Optional[SSLCertificate] = None
|
||||
dispatch_result: Optional[DispatchResult] = None
|
||||
redirected_url: Optional[str] = None
|
||||
network_requests: Optional[List[Dict[str, Any]]] = None
|
||||
console_messages: Optional[List[Dict[str, Any]]] = None
|
||||
|
||||
class MarkdownGenerationResult(BaseModel):
|
||||
raw_markdown: str
|
||||
markdown_with_citations: str
|
||||
references_markdown: str
|
||||
fit_markdown: Optional[str] = None
|
||||
fit_html: Optional[str] = None
|
||||
```
|
||||
|
||||
"""
|
||||
cfg = CrawlerRunConfig(js_code=body.scripts)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
# Return JSON-serializable dict of the first CrawlResult
|
||||
data = results[0].model_dump()
|
||||
return JSONResponse(data)
|
||||
|
||||
|
||||
@app.get("/llm/{url:path}")
|
||||
async def llm_endpoint(
|
||||
request: Request,
|
||||
url: str = Path(...),
|
||||
q: Optional[str] = Query(None),
|
||||
token_data: Optional[Dict] = Depends(token_dependency)
|
||||
q: str = Query(...),
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
if not q:
|
||||
raise HTTPException(status_code=400, detail="Query parameter 'q' is required")
|
||||
if not url.startswith(('http://', 'https://')):
|
||||
url = 'https://' + url
|
||||
try:
|
||||
answer = await handle_llm_qa(url, q, config)
|
||||
return JSONResponse({"answer": answer})
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
raise HTTPException(400, "Query parameter 'q' is required")
|
||||
if not url.startswith(("http://", "https://")):
|
||||
url = "https://" + url
|
||||
answer = await handle_llm_qa(url, q, config)
|
||||
return JSONResponse({"answer": answer})
|
||||
|
||||
|
||||
@app.get("/schema")
|
||||
async def get_schema():
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig
|
||||
return {"browser": BrowserConfig().dump(), "crawler": CrawlerRunConfig().dump()}
|
||||
return {"browser": BrowserConfig().dump(),
|
||||
"crawler": CrawlerRunConfig().dump()}
|
||||
|
||||
|
||||
@app.get(config["observability"]["health_check"]["endpoint"])
|
||||
async def health():
|
||||
return {"status": "ok", "timestamp": time.time(), "version": __version__}
|
||||
|
||||
|
||||
@app.get(config["observability"]["prometheus"]["endpoint"])
|
||||
async def metrics():
|
||||
return RedirectResponse(url=config["observability"]["prometheus"]["endpoint"])
|
||||
return RedirectResponse(config["observability"]["prometheus"]["endpoint"])
|
||||
|
||||
|
||||
@app.post("/crawl")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("crawl")
|
||||
async def crawl(
|
||||
request: Request,
|
||||
crawl_request: CrawlRequest,
|
||||
token_data: Optional[Dict] = Depends(token_dependency)
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
"""
|
||||
Crawl a list of URLs and return the results as JSON.
|
||||
"""
|
||||
if not crawl_request.urls:
|
||||
raise HTTPException(status_code=400, detail="At least one URL required")
|
||||
|
||||
results = await handle_crawl_request(
|
||||
raise HTTPException(400, "At least one URL required")
|
||||
res = await handle_crawl_request(
|
||||
urls=crawl_request.urls,
|
||||
browser_config=crawl_request.browser_config,
|
||||
crawler_config=crawl_request.crawler_config,
|
||||
config=config
|
||||
config=config,
|
||||
)
|
||||
|
||||
return JSONResponse(results)
|
||||
return JSONResponse(res)
|
||||
|
||||
|
||||
@app.post("/crawl/stream")
|
||||
@@ -152,24 +451,161 @@ async def crawl(
|
||||
async def crawl_stream(
|
||||
request: Request,
|
||||
crawl_request: CrawlRequest,
|
||||
token_data: Optional[Dict] = Depends(token_dependency)
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
if not crawl_request.urls:
|
||||
raise HTTPException(status_code=400, detail="At least one URL required")
|
||||
|
||||
crawler, results_gen = await handle_stream_crawl_request(
|
||||
raise HTTPException(400, "At least one URL required")
|
||||
crawler, gen = await handle_stream_crawl_request(
|
||||
urls=crawl_request.urls,
|
||||
browser_config=crawl_request.browser_config,
|
||||
crawler_config=crawl_request.crawler_config,
|
||||
config=config
|
||||
config=config,
|
||||
)
|
||||
|
||||
return StreamingResponse(
|
||||
stream_results(crawler, results_gen),
|
||||
media_type='application/x-ndjson',
|
||||
headers={'Cache-Control': 'no-cache', 'Connection': 'keep-alive', 'X-Stream-Status': 'active'}
|
||||
stream_results(crawler, gen),
|
||||
media_type="application/x-ndjson",
|
||||
headers={
|
||||
"Cache-Control": "no-cache",
|
||||
"Connection": "keep-alive",
|
||||
"X-Stream-Status": "active",
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
def chunk_code_functions(code_md: str) -> List[str]:
|
||||
"""Extract each function/class from markdown code blocks per file."""
|
||||
pattern = re.compile(
|
||||
# match "## File: <path>" then a ```py fence, then capture until the closing ```
|
||||
r'##\s*File:\s*(?P<path>.+?)\s*?\r?\n' # file header
|
||||
r'```py\s*?\r?\n' # opening fence
|
||||
r'(?P<code>.*?)(?=\r?\n```)', # code block
|
||||
re.DOTALL
|
||||
)
|
||||
chunks: List[str] = []
|
||||
for m in pattern.finditer(code_md):
|
||||
file_path = m.group("path").strip()
|
||||
code_blk = m.group("code")
|
||||
tree = ast.parse(code_blk)
|
||||
lines = code_blk.splitlines()
|
||||
for node in tree.body:
|
||||
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
|
||||
start = node.lineno - 1
|
||||
end = getattr(node, "end_lineno", start + 1)
|
||||
snippet = "\n".join(lines[start:end])
|
||||
chunks.append(f"# File: {file_path}\n{snippet}")
|
||||
return chunks
|
||||
|
||||
|
||||
def chunk_doc_sections(doc: str) -> List[str]:
|
||||
lines = doc.splitlines(keepends=True)
|
||||
sections = []
|
||||
current: List[str] = []
|
||||
for line in lines:
|
||||
if re.match(r"^#{1,6}\s", line):
|
||||
if current:
|
||||
sections.append("".join(current))
|
||||
current = [line]
|
||||
else:
|
||||
current.append(line)
|
||||
if current:
|
||||
sections.append("".join(current))
|
||||
return sections
|
||||
|
||||
|
||||
@app.get("/ask")
|
||||
@limiter.limit(config["rate_limiting"]["default_limit"])
|
||||
@mcp_tool("ask")
|
||||
async def get_context(
|
||||
request: Request,
|
||||
_td: Dict = Depends(token_dep),
|
||||
context_type: str = Query("all", regex="^(code|doc|all)$"),
|
||||
query: Optional[str] = Query(
|
||||
None, description="search query to filter chunks"),
|
||||
score_ratio: float = Query(
|
||||
0.5, ge=0.0, le=1.0, description="min score as fraction of max_score"),
|
||||
max_results: int = Query(
|
||||
20, ge=1, description="absolute cap on returned chunks"),
|
||||
):
|
||||
"""
|
||||
This end point is design for any questions about Crawl4ai library. It returns a plain text markdown with extensive information about Crawl4ai.
|
||||
You can use this as a context for any AI assistant. Use this endpoint for AI assistants to retrieve library context for decision making or code generation tasks.
|
||||
Alway is BEST practice you provide a query to filter the context. Otherwise the lenght of the response will be very long.
|
||||
|
||||
Parameters:
|
||||
- context_type: Specify "code" for code context, "doc" for documentation context, or "all" for both.
|
||||
- query: RECOMMENDED search query to filter paragraphs using BM25. You can leave this empty to get all the context.
|
||||
- score_ratio: Minimum score as a fraction of the maximum score for filtering results.
|
||||
- max_results: Maximum number of results to return. Default is 20.
|
||||
|
||||
Returns:
|
||||
- JSON response with the requested context.
|
||||
- If "code" is specified, returns the code context.
|
||||
- If "doc" is specified, returns the documentation context.
|
||||
- If "all" is specified, returns both code and documentation contexts.
|
||||
"""
|
||||
# load contexts
|
||||
base = os.path.dirname(__file__)
|
||||
code_path = os.path.join(base, "c4ai-code-context.md")
|
||||
doc_path = os.path.join(base, "c4ai-doc-context.md")
|
||||
if not os.path.exists(code_path) or not os.path.exists(doc_path):
|
||||
raise HTTPException(404, "Context files not found")
|
||||
|
||||
with open(code_path, "r") as f:
|
||||
code_content = f.read()
|
||||
with open(doc_path, "r") as f:
|
||||
doc_content = f.read()
|
||||
|
||||
# if no query, just return raw contexts
|
||||
if not query:
|
||||
if context_type == "code":
|
||||
return JSONResponse({"code_context": code_content})
|
||||
if context_type == "doc":
|
||||
return JSONResponse({"doc_context": doc_content})
|
||||
return JSONResponse({
|
||||
"code_context": code_content,
|
||||
"doc_context": doc_content,
|
||||
})
|
||||
|
||||
tokens = query.split()
|
||||
results: Dict[str, List[Dict[str, float]]] = {}
|
||||
|
||||
# code BM25 over functions/classes
|
||||
if context_type in ("code", "all"):
|
||||
code_chunks = chunk_code_functions(code_content)
|
||||
bm25 = BM25Okapi([c.split() for c in code_chunks])
|
||||
scores = bm25.get_scores(tokens)
|
||||
max_sc = float(scores.max()) if scores.size > 0 else 0.0
|
||||
cutoff = max_sc * score_ratio
|
||||
picked = [(c, s) for c, s in zip(code_chunks, scores) if s >= cutoff]
|
||||
picked = sorted(picked, key=lambda x: x[1], reverse=True)[:max_results]
|
||||
results["code_results"] = [{"text": c, "score": s} for c, s in picked]
|
||||
|
||||
# doc BM25 over markdown sections
|
||||
if context_type in ("doc", "all"):
|
||||
sections = chunk_doc_sections(doc_content)
|
||||
bm25d = BM25Okapi([sec.split() for sec in sections])
|
||||
scores_d = bm25d.get_scores(tokens)
|
||||
max_sd = float(scores_d.max()) if scores_d.size > 0 else 0.0
|
||||
cutoff_d = max_sd * score_ratio
|
||||
idxs = [i for i, s in enumerate(scores_d) if s >= cutoff_d]
|
||||
neighbors = set(i for idx in idxs for i in (idx-1, idx, idx+1))
|
||||
valid = [i for i in sorted(neighbors) if 0 <= i < len(sections)]
|
||||
valid = valid[:max_results]
|
||||
results["doc_results"] = [
|
||||
{"text": sections[i], "score": scores_d[i]} for i in valid
|
||||
]
|
||||
|
||||
return JSONResponse(results)
|
||||
|
||||
|
||||
# attach MCP layer (adds /mcp/ws, /mcp/sse, /mcp/schema)
|
||||
print(f"MCP server running on {config['app']['host']}:{config['app']['port']}")
|
||||
attach_mcp(
|
||||
app,
|
||||
base_url=f"http://{config['app']['host']}:{config['app']['port']}"
|
||||
)
|
||||
|
||||
# ────────────────────────── cli ──────────────────────────────
|
||||
if __name__ == "__main__":
|
||||
import uvicorn
|
||||
uvicorn.run(
|
||||
@@ -177,5 +613,6 @@ if __name__ == "__main__":
|
||||
host=config["app"]["host"],
|
||||
port=config["app"]["port"],
|
||||
reload=config["app"]["reload"],
|
||||
timeout_keep_alive=config["app"]["timeout_keep_alive"]
|
||||
)
|
||||
timeout_keep_alive=config["app"]["timeout_keep_alive"],
|
||||
)
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
955
deploy/docker/static/playground/index.html
Normal file
955
deploy/docker/static/playground/index.html
Normal file
@@ -0,0 +1,955 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Crawl4AI Playground</title>
|
||||
<script src="https://cdn.tailwindcss.com"></script>
|
||||
<script>
|
||||
tailwind.config = {
|
||||
theme: {
|
||||
extend: {
|
||||
colors: {
|
||||
primary: '#4EFFFF',
|
||||
primarydim: '#09b5a5',
|
||||
accent: '#F380F5',
|
||||
dark: '#070708',
|
||||
light: '#E8E9ED',
|
||||
secondary: '#D5CEBF',
|
||||
codebg: '#1E1E1E',
|
||||
surface: '#202020',
|
||||
border: '#3F3F44',
|
||||
},
|
||||
fontFamily: {
|
||||
mono: ['Fira Code', 'monospace'],
|
||||
},
|
||||
}
|
||||
}
|
||||
}
|
||||
</script>
|
||||
<link href="https://fonts.googleapis.com/css2?family=Fira+Code:wght@400;500&display=swap" rel="stylesheet">
|
||||
<!-- Highlight.js -->
|
||||
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/github-dark.min.css">
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js"></script>
|
||||
<!-- CodeMirror (python mode) -->
|
||||
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/codemirror.min.css">
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/codemirror.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/mode/python/python.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/addon/edit/matchbrackets.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/addon/selection/active-line.min.js"></script>
|
||||
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.16/theme/darcula.min.css">
|
||||
<!-- <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/python.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/bash.min.js"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/json.min.js"></script> -->
|
||||
<style>
|
||||
/* Custom CodeMirror styling to match theme */
|
||||
.CodeMirror {
|
||||
background-color: #1E1E1E !important;
|
||||
color: #E8E9ED !important;
|
||||
border-radius: 4px;
|
||||
font-family: 'Fira Code', monospace;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.CodeMirror-gutters {
|
||||
background-color: #1E1E1E !important;
|
||||
border-right: 1px solid #3F3F44 !important;
|
||||
}
|
||||
|
||||
.CodeMirror-linenumber {
|
||||
color: #3F3F44 !important;
|
||||
}
|
||||
|
||||
.cm-s-darcula .cm-keyword {
|
||||
color: #4EFFFF !important;
|
||||
}
|
||||
|
||||
.cm-s-darcula .cm-string {
|
||||
color: #F380F5 !important;
|
||||
}
|
||||
|
||||
.cm-s-darcula .cm-number {
|
||||
color: #D5CEBF !important;
|
||||
}
|
||||
|
||||
/* Add to your <style> section or Tailwind config */
|
||||
.hljs {
|
||||
background: #1E1E1E !important;
|
||||
border-radius: 4px;
|
||||
padding: 1rem !important;
|
||||
}
|
||||
|
||||
pre code.hljs {
|
||||
display: block;
|
||||
overflow-x: auto;
|
||||
}
|
||||
|
||||
/* Language-specific colors */
|
||||
.hljs-attr {
|
||||
color: #4EFFFF;
|
||||
}
|
||||
|
||||
/* JSON keys */
|
||||
.hljs-string {
|
||||
color: #F380F5;
|
||||
}
|
||||
|
||||
/* Strings */
|
||||
.hljs-number {
|
||||
color: #D5CEBF;
|
||||
}
|
||||
|
||||
/* Numbers */
|
||||
.hljs-keyword {
|
||||
color: #4EFFFF;
|
||||
}
|
||||
|
||||
pre code {
|
||||
white-space: pre-wrap;
|
||||
word-break: break-word;
|
||||
}
|
||||
|
||||
.copy-btn {
|
||||
transition: all 0.2s ease;
|
||||
opacity: 0.7;
|
||||
}
|
||||
|
||||
.copy-btn:hover {
|
||||
opacity: 1;
|
||||
}
|
||||
|
||||
.tab-content:hover .copy-btn {
|
||||
opacity: 0.7;
|
||||
}
|
||||
|
||||
.tab-content:hover .copy-btn:hover {
|
||||
opacity: 1;
|
||||
}
|
||||
|
||||
/* copid text highlighted */
|
||||
.highlighted {
|
||||
background-color: rgba(78, 255, 255, 0.2) !important;
|
||||
transition: background-color 0.5s ease;
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
|
||||
<body class="bg-dark text-light font-mono min-h-screen flex flex-col" style="font-feature-settings: 'calt' 0;">
|
||||
<!-- Header -->
|
||||
<header class="border-b border-border px-4 py-2 flex items-center">
|
||||
<h1 class="text-lg font-medium flex items-center space-x-4">
|
||||
<span>🚀🤖 <span class="text-primary">Crawl4AI</span> Playground</span>
|
||||
|
||||
<!-- GitHub badges -->
|
||||
<a href="https://github.com/unclecode/crawl4ai" target="_blank" class="flex space-x-1">
|
||||
<img src="https://img.shields.io/github/stars/unclecode/crawl4ai?style=social"
|
||||
alt="GitHub stars" class="h-5">
|
||||
<img src="https://img.shields.io/github/forks/unclecode/crawl4ai?style=social"
|
||||
alt="GitHub forks" class="h-5">
|
||||
</a>
|
||||
|
||||
<!-- Docs -->
|
||||
<a href="https://docs.crawl4ai.com" target="_blank"
|
||||
class="text-xs text-secondary hover:text-primary underline flex items-center">
|
||||
Docs
|
||||
</a>
|
||||
|
||||
<!-- X (Twitter) follow -->
|
||||
<a href="https://x.com/unclecode" target="_blank"
|
||||
class="hover:text-primary flex items-center" title="Follow @unclecode on X">
|
||||
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"
|
||||
class="w-4 h-4 fill-current mr-1">
|
||||
<path d="M22.46 6c-.77.35-1.6.58-2.46.69a4.27 4.27 0 001.88-2.35 8.53 8.53 0 01-2.71 1.04 4.24 4.24 0 00-7.23 3.87A12.05 12.05 0 013 4.62a4.24 4.24 0 001.31 5.65 4.2 4.2 0 01-1.92-.53v.05a4.24 4.24 0 003.4 4.16 4.31 4.31 0 01-1.91.07 4.25 4.25 0 003.96 2.95A8.5 8.5 0 012 19.55a12.04 12.04 0 006.53 1.92c7.84 0 12.13-6.49 12.13-12.13 0-.18-.01-.36-.02-.54A8.63 8.63 0 0024 5.1a8.45 8.45 0 01-2.54.7z"/>
|
||||
</svg>
|
||||
<span class="text-xs">@unclecode</span>
|
||||
</a>
|
||||
</h1>
|
||||
|
||||
<div class="ml-auto flex space-x-2">
|
||||
<button id="play-tab"
|
||||
class="px-3 py-1 rounded-t bg-surface border border-b-0 border-border text-primary">Playground</button>
|
||||
<button id="stress-tab" class="px-3 py-1 rounded-t border border-border hover:bg-surface">Stress
|
||||
Test</button>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<!-- Main Playground -->
|
||||
<main id="playground" class="flex-1 flex flex-col p-4 space-y-4 max-w-5xl w-full mx-auto">
|
||||
<!-- Request Builder -->
|
||||
<section class="bg-surface rounded-lg border border-border overflow-hidden">
|
||||
<div class="px-4 py-2 border-b border-border flex items-center">
|
||||
<h2 class="font-medium">Request Builder</h2>
|
||||
<select id="endpoint" class="ml-auto bg-dark border border-border rounded px-2 py-1 text-sm">
|
||||
<option value="crawl">/crawl (batch)</option>
|
||||
<option value="crawl_stream">/crawl/stream</option>
|
||||
<option value="md">/md</option>
|
||||
<option value="llm">/llm</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="p-4">
|
||||
<label class="block mb-2 text-sm">URL(s) - one per line</label>
|
||||
<textarea id="urls" class="w-full bg-dark border border-border rounded p-2 h-32 text-sm mb-4"
|
||||
spellcheck="false">https://example.com</textarea>
|
||||
|
||||
<!-- Specific options for /md endpoint -->
|
||||
<details id="md-options" class="mb-4 hidden">
|
||||
<summary class="text-sm text-secondary cursor-pointer">/md Options</summary>
|
||||
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||
<div>
|
||||
<label for="md-filter" class="block text-xs text-secondary mb-1">Filter Type</label>
|
||||
<select id="md-filter" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
<option value="fit">fit - Adaptive content filtering</option>
|
||||
<option value="raw">raw - No filtering</option>
|
||||
<option value="bm25">bm25 - BM25 keyword relevance</option>
|
||||
<option value="llm">llm - LLM-based filtering</option>
|
||||
</select>
|
||||
</div>
|
||||
<div>
|
||||
<label for="md-query" class="block text-xs text-secondary mb-1">Query (for BM25/LLM filters)</label>
|
||||
<input id="md-query" type="text" placeholder="Enter search terms or instructions"
|
||||
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
</div>
|
||||
<div>
|
||||
<label for="md-cache" class="block text-xs text-secondary mb-1">Cache Mode</label>
|
||||
<select id="md-cache" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
<option value="0">Write-Only (0)</option>
|
||||
<option value="1">Enabled (1)</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<!-- Specific options for /llm endpoint -->
|
||||
<details id="llm-options" class="mb-4 hidden">
|
||||
<summary class="text-sm text-secondary cursor-pointer">/llm Options</summary>
|
||||
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||
<div>
|
||||
<label for="llm-question" class="block text-xs text-secondary mb-1">Question</label>
|
||||
<input id="llm-question" type="text" value="What is this page about?"
|
||||
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
</div>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<!-- Advanced config for /crawl endpoints -->
|
||||
<details id="adv-config" class="mb-4">
|
||||
<summary class="text-sm text-secondary cursor-pointer">Advanced Config <span
|
||||
class="text-xs text-primary">(Python → auto‑JSON)</span></summary>
|
||||
|
||||
<!-- Toolbar -->
|
||||
<div class="flex items-center justify-end space-x-3 mt-2">
|
||||
<label for="cfg-type" class="text-xs text-secondary">Type:</label>
|
||||
<select id="cfg-type"
|
||||
class="bg-dark border border-border rounded px-1 py-0.5 text-xs">
|
||||
<option value="CrawlerRunConfig">CrawlerRunConfig</option>
|
||||
<option value="BrowserConfig">BrowserConfig</option>
|
||||
</select>
|
||||
|
||||
<!-- help link -->
|
||||
<a href="https://docs.crawl4ai.com/api/parameters/"
|
||||
target="_blank"
|
||||
class="text-xs text-primary hover:underline flex items-center space-x-1"
|
||||
title="Open parameter reference in new tab">
|
||||
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"
|
||||
class="w-4 h-4 fill-current">
|
||||
<path d="M13 3h8v8h-2V6.41l-9.29 9.3-1.42-1.42 9.3-9.29H13V3z"/>
|
||||
<path d="M5 5h4V3H3v6h2V5zm0 14v-4H3v6h6v-2H5z"/>
|
||||
</svg>
|
||||
<span>Docs</span>
|
||||
</a>
|
||||
|
||||
<span id="cfg-status" class="text-xs text-secondary ml-2"></span>
|
||||
</div>
|
||||
|
||||
<!-- CodeMirror host -->
|
||||
<div id="adv-editor" class="mt-2 border border-border rounded overflow-hidden h-40"></div>
|
||||
</details>
|
||||
|
||||
<div class="flex space-x-2">
|
||||
<button id="run-btn" class="bg-primary text-dark px-4 py-2 rounded hover:bg-primarydim font-medium">
|
||||
Run (⌘/Ctrl+Enter)
|
||||
</button>
|
||||
<button id="export-btn" class="border border-border px-4 py-2 rounded hover:bg-surface hidden">
|
||||
Export Python Code
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Execution Status -->
|
||||
<section id="execution-status" class="hidden bg-surface rounded-lg border border-border p-3 text-sm">
|
||||
<div class="flex space-x-4">
|
||||
<div id="status-badge" class="flex items-center">
|
||||
<span class="w-3 h-3 rounded-full mr-2"></span>
|
||||
<span>Ready</span>
|
||||
</div>
|
||||
<div>
|
||||
<span class="text-secondary">Time:</span>
|
||||
<span id="exec-time" class="text-light">-</span>
|
||||
</div>
|
||||
<div>
|
||||
<span class="text-secondary">Memory:</span>
|
||||
<span id="exec-mem" class="text-light">-</span>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Response Viewer -->
|
||||
<!-- Update the Response Viewer section -->
|
||||
<section class="bg-surface rounded-lg border border-border overflow-hidden flex-1 flex flex-col">
|
||||
<div class="border-b border-border flex">
|
||||
<button data-tab="response" class="tab-btn active px-4 py-2 border-r border-border">Response</button>
|
||||
<button data-tab="python" class="tab-btn px-4 py-2 border-r border-border">Python</button>
|
||||
<button data-tab="curl" class="tab-btn px-4 py-2">cURL</button>
|
||||
</div>
|
||||
<div class="flex-1 overflow-auto relative">
|
||||
<!-- Response Tab -->
|
||||
<div class="tab-content active h-full">
|
||||
<div class="absolute right-2 top-2">
|
||||
<button class="copy-btn bg-surface border border-border rounded px-2 py-1 text-xs hover:bg-dark"
|
||||
data-target="#response-content code">
|
||||
Copy
|
||||
</button>
|
||||
</div>
|
||||
<pre id="response-content" class="p-4 text-sm h-full"><code class="json hljs">{}</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Python Tab -->
|
||||
<div class="tab-content hidden h-full">
|
||||
<div class="absolute right-2 top-2">
|
||||
<button class="copy-btn bg-surface border border-border rounded px-2 py-1 text-xs hover:bg-dark"
|
||||
data-target="#python-content code">
|
||||
Copy
|
||||
</button>
|
||||
</div>
|
||||
<pre id="python-content" class="p-4 text-sm h-full"><code class="python hljs"></code></pre>
|
||||
</div>
|
||||
|
||||
<!-- cURL Tab -->
|
||||
<div class="tab-content hidden h-full">
|
||||
<div class="absolute right-2 top-2">
|
||||
<button class="copy-btn bg-surface border border-border rounded px-2 py-1 text-xs hover:bg-dark"
|
||||
data-target="#curl-content code">
|
||||
Copy
|
||||
</button>
|
||||
</div>
|
||||
<pre id="curl-content" class="p-4 text-sm h-full"><code class="bash hljs"></code></pre>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
</main>
|
||||
|
||||
<!-- Stress Test Modal -->
|
||||
<div id="stress-modal"
|
||||
class="hidden fixed inset-0 bg-black bg-opacity-70 z-50 flex items-center justify-center p-4">
|
||||
<div class="bg-surface rounded-lg border border-accent w-full max-w-3xl max-h-[90vh] flex flex-col">
|
||||
<div class="px-4 py-2 border-b border-border flex items-center">
|
||||
<h2 class="font-medium text-accent">🔥 Stress Test</h2>
|
||||
<button id="close-stress" class="ml-auto text-secondary hover:text-light">×</button>
|
||||
</div>
|
||||
|
||||
<div class="p-4 space-y-4 flex-1 overflow-auto">
|
||||
<div class="grid grid-cols-3 gap-4">
|
||||
<div>
|
||||
<label class="block text-sm mb-1">Total URLs</label>
|
||||
<input id="st-total" type="number" value="20"
|
||||
class="w-full bg-dark border border-border rounded px-3 py-1">
|
||||
</div>
|
||||
<div>
|
||||
<label class="block text-sm mb-1">Chunk Size</label>
|
||||
<input id="st-chunk" type="number" value="5"
|
||||
class="w-full bg-dark border border-border rounded px-3 py-1">
|
||||
</div>
|
||||
<div>
|
||||
<label class="block text-sm mb-1">Concurrency</label>
|
||||
<input id="st-conc" type="number" value="2"
|
||||
class="w-full bg-dark border border-border rounded px-3 py-1">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="flex items-center">
|
||||
<input id="st-stream" type="checkbox" class="mr-2">
|
||||
<label for="st-stream" class="text-sm">Use /crawl/stream</label>
|
||||
<button id="st-run"
|
||||
class="ml-auto bg-accent text-dark px-4 py-2 rounded hover:bg-opacity-90 font-medium">
|
||||
Run Stress Test
|
||||
</button>
|
||||
</div>
|
||||
|
||||
<div class="mt-4">
|
||||
<div class="bg-dark rounded border border-border p-3 h-64 overflow-auto text-sm whitespace-break-spaces"
|
||||
id="stress-log"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="px-4 py-2 border-t border-border text-sm text-secondary">
|
||||
<div class="flex justify-between">
|
||||
<span>Completed: <span id="stress-completed">0</span>/<span id="stress-total">0</span></span>
|
||||
<span>Avg. Time: <span id="stress-avg-time">0</span>ms</span>
|
||||
<span>Peak Memory: <span id="stress-peak-mem">0</span>MB</span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script>
|
||||
// Tab switching
|
||||
document.querySelectorAll('.tab-btn').forEach(btn => {
|
||||
btn.addEventListener('click', () => {
|
||||
document.querySelectorAll('.tab-btn').forEach(b => b.classList.remove('active'));
|
||||
document.querySelectorAll('.tab-content').forEach(c => c.classList.add('hidden'));
|
||||
|
||||
btn.classList.add('active');
|
||||
const tabName = btn.dataset.tab;
|
||||
document.querySelector(`#${tabName}-content`).parentElement.classList.remove('hidden');
|
||||
|
||||
// Re-highlight content when switching tabs
|
||||
const activeCode = document.querySelector(`#${tabName}-content code`);
|
||||
if (activeCode) {
|
||||
forceHighlightElement(activeCode);
|
||||
}
|
||||
});
|
||||
});
|
||||
|
||||
// View switching
|
||||
document.getElementById('play-tab').addEventListener('click', () => {
|
||||
document.getElementById('playground').classList.remove('hidden');
|
||||
document.getElementById('stress-modal').classList.add('hidden');
|
||||
document.getElementById('play-tab').classList.add('bg-surface', 'border-b-0');
|
||||
document.getElementById('stress-tab').classList.remove('bg-surface', 'border-b-0');
|
||||
});
|
||||
|
||||
document.getElementById('stress-tab').addEventListener('click', () => {
|
||||
document.getElementById('stress-modal').classList.remove('hidden');
|
||||
document.getElementById('stress-tab').classList.add('bg-surface', 'border-b-0');
|
||||
document.getElementById('play-tab').classList.remove('bg-surface', 'border-b-0');
|
||||
});
|
||||
|
||||
document.getElementById('close-stress').addEventListener('click', () => {
|
||||
document.getElementById('stress-modal').classList.add('hidden');
|
||||
document.getElementById('play-tab').classList.add('bg-surface', 'border-b-0');
|
||||
document.getElementById('stress-tab').classList.remove('bg-surface', 'border-b-0');
|
||||
});
|
||||
|
||||
// Initialize clipboard and highlight.js
|
||||
new ClipboardJS('#export-btn');
|
||||
hljs.highlightAll();
|
||||
|
||||
// Keyboard shortcut
|
||||
window.addEventListener('keydown', e => {
|
||||
if ((e.ctrlKey || e.metaKey) && e.key === 'Enter') {
|
||||
document.getElementById('run-btn').click();
|
||||
}
|
||||
});
|
||||
|
||||
// ================ ADVANCED CONFIG EDITOR ================
|
||||
const cm = CodeMirror(document.getElementById('adv-editor'), {
|
||||
value: `CrawlerRunConfig(
|
||||
stream=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)`,
|
||||
mode: 'python',
|
||||
lineNumbers: true,
|
||||
theme: 'darcula',
|
||||
tabSize: 4,
|
||||
styleActiveLine: true,
|
||||
matchBrackets: true,
|
||||
gutters: ["CodeMirror-linenumbers"],
|
||||
lineWrapping: true,
|
||||
});
|
||||
|
||||
const TEMPLATES = {
|
||||
CrawlerRunConfig: `CrawlerRunConfig(
|
||||
stream=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)`,
|
||||
BrowserConfig: `BrowserConfig(
|
||||
headless=True,
|
||||
extra_args=[
|
||||
"--no-sandbox",
|
||||
"--disable-gpu",
|
||||
],
|
||||
)`,
|
||||
};
|
||||
|
||||
document.getElementById('cfg-type').addEventListener('change', (e) => {
|
||||
cm.setValue(TEMPLATES[e.target.value]);
|
||||
document.getElementById('cfg-status').textContent = '';
|
||||
});
|
||||
|
||||
// Handle endpoint selection change to show appropriate options
|
||||
document.getElementById('endpoint').addEventListener('change', function(e) {
|
||||
const endpoint = e.target.value;
|
||||
const mdOptions = document.getElementById('md-options');
|
||||
const llmOptions = document.getElementById('llm-options');
|
||||
const advConfig = document.getElementById('adv-config');
|
||||
|
||||
// Hide all option sections first
|
||||
mdOptions.classList.add('hidden');
|
||||
llmOptions.classList.add('hidden');
|
||||
advConfig.classList.add('hidden');
|
||||
|
||||
// Show the appropriate section based on endpoint
|
||||
if (endpoint === 'md') {
|
||||
mdOptions.classList.remove('hidden');
|
||||
// Auto-open the /md options
|
||||
mdOptions.setAttribute('open', '');
|
||||
} else if (endpoint === 'llm') {
|
||||
llmOptions.classList.remove('hidden');
|
||||
// Auto-open the /llm options
|
||||
llmOptions.setAttribute('open', '');
|
||||
} else {
|
||||
// For /crawl endpoints, show the advanced config
|
||||
advConfig.classList.remove('hidden');
|
||||
}
|
||||
});
|
||||
|
||||
async function pyConfigToJson() {
|
||||
const code = cm.getValue().trim();
|
||||
if (!code) return {};
|
||||
|
||||
const res = await fetch('/config/dump', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ code }),
|
||||
});
|
||||
|
||||
const statusEl = document.getElementById('cfg-status');
|
||||
if (!res.ok) {
|
||||
const msg = await res.text();
|
||||
statusEl.textContent = '✖ config error';
|
||||
statusEl.className = 'text-xs text-red-400';
|
||||
throw new Error(msg || 'Invalid config');
|
||||
}
|
||||
|
||||
statusEl.textContent = '✓ parsed';
|
||||
statusEl.className = 'text-xs text-green-400';
|
||||
|
||||
return await res.json();
|
||||
}
|
||||
|
||||
// ================ SERVER COMMUNICATION ================
|
||||
|
||||
// Update status UI
|
||||
function updateStatus(status, time, memory, peakMemory) {
|
||||
const statusEl = document.getElementById('execution-status');
|
||||
const badgeEl = document.querySelector('#status-badge span:first-child');
|
||||
const textEl = document.querySelector('#status-badge span:last-child');
|
||||
|
||||
statusEl.classList.remove('hidden');
|
||||
badgeEl.className = 'w-3 h-3 rounded-full mr-2';
|
||||
|
||||
if (status === 'success') {
|
||||
badgeEl.classList.add('bg-green-500');
|
||||
textEl.textContent = 'Success';
|
||||
} else if (status === 'error') {
|
||||
badgeEl.classList.add('bg-red-500');
|
||||
textEl.textContent = 'Error';
|
||||
} else {
|
||||
badgeEl.classList.add('bg-yellow-500');
|
||||
textEl.textContent = 'Processing...';
|
||||
}
|
||||
|
||||
if (time) {
|
||||
document.getElementById('exec-time').textContent = `${time}ms`;
|
||||
}
|
||||
|
||||
if (memory !== undefined && peakMemory !== undefined) {
|
||||
document.getElementById('exec-mem').textContent = `Δ${memory >= 0 ? '+' : ''}${memory}MB (Peak: ${peakMemory}MB)`;
|
||||
}
|
||||
}
|
||||
|
||||
// Generate code snippets
|
||||
function generateSnippets(api, payload, method = 'POST') {
|
||||
// Python snippet
|
||||
const pyCodeEl = document.querySelector('#python-content code');
|
||||
let pySnippet;
|
||||
|
||||
if (method === 'GET') {
|
||||
// GET request (for /llm endpoint)
|
||||
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.get(\n "${window.location.origin}${api}"\n )\n return response.json()`;
|
||||
} else {
|
||||
// POST request (for /crawl and /md endpoints)
|
||||
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
|
||||
}
|
||||
|
||||
pyCodeEl.textContent = pySnippet;
|
||||
pyCodeEl.className = 'python hljs'; // Reset classes
|
||||
forceHighlightElement(pyCodeEl);
|
||||
|
||||
// cURL snippet
|
||||
const curlCodeEl = document.querySelector('#curl-content code');
|
||||
let curlSnippet;
|
||||
|
||||
if (method === 'GET') {
|
||||
// GET request (for /llm endpoint)
|
||||
curlSnippet = `curl -X GET "${window.location.origin}${api}"`;
|
||||
} else {
|
||||
// POST request (for /crawl and /md endpoints)
|
||||
curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
|
||||
}
|
||||
|
||||
curlCodeEl.textContent = curlSnippet;
|
||||
curlCodeEl.className = 'bash hljs'; // Reset classes
|
||||
forceHighlightElement(curlCodeEl);
|
||||
}
|
||||
|
||||
// Main run function
|
||||
async function runCrawl() {
|
||||
const endpoint = document.getElementById('endpoint').value;
|
||||
const urls = document.getElementById('urls').value.trim().split(/\n/).filter(u => u);
|
||||
// 1) grab python from CodeMirror, validate via /config/dump
|
||||
let advConfig = {};
|
||||
try {
|
||||
const cfgJson = await pyConfigToJson(); // may throw
|
||||
if (Object.keys(cfgJson).length) {
|
||||
const cfgType = document.getElementById('cfg-type').value;
|
||||
advConfig = cfgType === 'CrawlerRunConfig'
|
||||
? { crawler_config: cfgJson }
|
||||
: { browser_config: cfgJson };
|
||||
}
|
||||
} catch (err) {
|
||||
updateStatus('error');
|
||||
document.querySelector('#response-content code').textContent =
|
||||
JSON.stringify({ error: err.message }, null, 2);
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
return; // stop run
|
||||
}
|
||||
|
||||
const endpointMap = {
|
||||
crawl: '/crawl',
|
||||
// crawl_stream: '/crawl/stream',
|
||||
md: '/md',
|
||||
llm: '/llm'
|
||||
};
|
||||
|
||||
const api = endpointMap[endpoint];
|
||||
let payload;
|
||||
|
||||
// Create appropriate payload based on endpoint type
|
||||
if (endpoint === 'md') {
|
||||
// Get values from the /md specific inputs
|
||||
const filterType = document.getElementById('md-filter').value;
|
||||
const query = document.getElementById('md-query').value.trim();
|
||||
const cache = document.getElementById('md-cache').value;
|
||||
|
||||
// MD endpoint expects: { url, f, q, c }
|
||||
payload = {
|
||||
url: urls[0], // Take first URL
|
||||
f: filterType, // Lowercase filter type as required by server
|
||||
q: query || null, // Use the query if provided, otherwise null
|
||||
c: cache
|
||||
};
|
||||
} else if (endpoint === 'llm') {
|
||||
// LLM endpoint has a different URL pattern and uses query params
|
||||
// This will be handled directly in the fetch below
|
||||
payload = null;
|
||||
} else {
|
||||
// Default payload for /crawl and /crawl/stream
|
||||
payload = {
|
||||
urls,
|
||||
...advConfig
|
||||
};
|
||||
}
|
||||
|
||||
updateStatus('processing');
|
||||
|
||||
try {
|
||||
const startTime = performance.now();
|
||||
let response, responseData;
|
||||
|
||||
if (endpoint === 'llm') {
|
||||
// Special handling for LLM endpoint which uses URL pattern: /llm/{encoded_url}?q={query}
|
||||
const url = urls[0];
|
||||
const encodedUrl = encodeURIComponent(url);
|
||||
// Get the question from the LLM-specific input
|
||||
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||
|
||||
response = await fetch(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, {
|
||||
method: 'GET',
|
||||
headers: { 'Accept': 'application/json' }
|
||||
});
|
||||
} else if (endpoint === 'crawl_stream') {
|
||||
// Stream processing
|
||||
response = await fetch(api, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(payload)
|
||||
});
|
||||
|
||||
const reader = response.body.getReader();
|
||||
let text = '';
|
||||
let maxMemory = 0;
|
||||
|
||||
while (true) {
|
||||
const { value, done } = await reader.read();
|
||||
if (done) break;
|
||||
|
||||
const chunk = new TextDecoder().decode(value);
|
||||
text += chunk;
|
||||
|
||||
// Process each line for memory updates
|
||||
chunk.trim().split('\n').forEach(line => {
|
||||
if (!line) return;
|
||||
try {
|
||||
const obj = JSON.parse(line);
|
||||
if (obj.server_memory_mb) {
|
||||
maxMemory = Math.max(maxMemory, obj.server_memory_mb);
|
||||
}
|
||||
} catch (e) {
|
||||
console.error('Error parsing stream line:', e);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
responseData = { stream: text };
|
||||
const time = Math.round(performance.now() - startTime);
|
||||
updateStatus('success', time, null, maxMemory);
|
||||
document.querySelector('#response-content code').textContent = text;
|
||||
document.querySelector('#response-content code').className = 'json hljs'; // Reset classes
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
} else {
|
||||
// Regular request (handles /crawl and /md)
|
||||
response = await fetch(api, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(payload)
|
||||
});
|
||||
|
||||
responseData = await response.json();
|
||||
const time = Math.round(performance.now() - startTime);
|
||||
|
||||
if (!response.ok) {
|
||||
updateStatus('error', time);
|
||||
throw new Error(responseData.error || 'Request failed');
|
||||
}
|
||||
|
||||
updateStatus(
|
||||
'success',
|
||||
time,
|
||||
responseData.server_memory_delta_mb,
|
||||
responseData.server_peak_memory_mb
|
||||
);
|
||||
|
||||
document.querySelector('#response-content code').textContent = JSON.stringify(responseData, null, 2);
|
||||
document.querySelector('#response-content code').className = 'json hljs'; // Ensure class is set
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
}
|
||||
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
|
||||
// For generateSnippets, handle the LLM case specially
|
||||
if (endpoint === 'llm') {
|
||||
const url = urls[0];
|
||||
const encodedUrl = encodeURIComponent(url);
|
||||
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||
generateSnippets(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, null, 'GET');
|
||||
} else {
|
||||
generateSnippets(api, payload);
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Error:', error);
|
||||
updateStatus('error');
|
||||
document.querySelector('#response-content code').textContent = JSON.stringify(
|
||||
{ error: error.message },
|
||||
null,
|
||||
2
|
||||
);
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
}
|
||||
}
|
||||
|
||||
// Stress test function
|
||||
async function runStressTest() {
|
||||
const total = parseInt(document.getElementById('st-total').value);
|
||||
const chunkSize = parseInt(document.getElementById('st-chunk').value);
|
||||
const concurrency = parseInt(document.getElementById('st-conc').value);
|
||||
const useStream = document.getElementById('st-stream').checked;
|
||||
|
||||
const logEl = document.getElementById('stress-log');
|
||||
logEl.textContent = '';
|
||||
|
||||
document.getElementById('stress-completed').textContent = '0';
|
||||
document.getElementById('stress-total').textContent = total;
|
||||
document.getElementById('stress-avg-time').textContent = '0';
|
||||
document.getElementById('stress-peak-mem').textContent = '0';
|
||||
|
||||
const api = useStream ? '/crawl/stream' : '/crawl';
|
||||
const urls = Array.from({ length: total }, (_, i) => `https://httpbin.org/anything/stress-${i}-${Date.now()}`);
|
||||
const chunks = [];
|
||||
|
||||
for (let i = 0; i < urls.length; i += chunkSize) {
|
||||
chunks.push(urls.slice(i, i + chunkSize));
|
||||
}
|
||||
|
||||
let completed = 0;
|
||||
let totalTime = 0;
|
||||
let peakMemory = 0;
|
||||
|
||||
const processBatch = async (batch, index) => {
|
||||
const payload = {
|
||||
urls: batch,
|
||||
browser_config: {},
|
||||
crawler_config: { cache_mode: 'BYPASS', stream: useStream }
|
||||
};
|
||||
|
||||
const start = performance.now();
|
||||
let time, memory;
|
||||
|
||||
try {
|
||||
if (useStream) {
|
||||
const response = await fetch(api, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(payload)
|
||||
});
|
||||
|
||||
const reader = response.body.getReader();
|
||||
let maxMem = 0;
|
||||
while (true) {
|
||||
const { value, done } = await reader.read();
|
||||
if (done) break;
|
||||
const text = new TextDecoder().decode(value);
|
||||
text.split('\n').forEach(line => {
|
||||
try {
|
||||
const obj = JSON.parse(line);
|
||||
if (obj.server_memory_mb) {
|
||||
maxMem = Math.max(maxMem, obj.server_memory_mb);
|
||||
}
|
||||
} catch { }
|
||||
});
|
||||
}
|
||||
|
||||
memory = maxMem;
|
||||
} else {
|
||||
const response = await fetch(api, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(payload)
|
||||
});
|
||||
|
||||
const data = await response.json();
|
||||
memory = data.server_peak_memory_mb;
|
||||
}
|
||||
|
||||
time = Math.round(performance.now() - start);
|
||||
peakMemory = Math.max(peakMemory, memory || 0);
|
||||
totalTime += time;
|
||||
|
||||
logEl.textContent += `[${index + 1}/${chunks.length}] ✔ ${time}ms | Peak ${memory}MB\n`;
|
||||
} catch (error) {
|
||||
time = Math.round(performance.now() - start);
|
||||
logEl.textContent += `[${index + 1}/${chunks.length}] ✖ ${time}ms | ${error.message}\n`;
|
||||
}
|
||||
|
||||
completed += batch.length;
|
||||
document.getElementById('stress-completed').textContent = completed;
|
||||
document.getElementById('stress-peak-mem').textContent = peakMemory;
|
||||
document.getElementById('stress-avg-time').textContent = Math.round(totalTime / (index + 1));
|
||||
|
||||
logEl.scrollTop = logEl.scrollHeight;
|
||||
};
|
||||
|
||||
// Run with concurrency control
|
||||
let active = 0;
|
||||
let index = 0;
|
||||
|
||||
return new Promise(resolve => {
|
||||
const runNext = () => {
|
||||
while (active < concurrency && index < chunks.length) {
|
||||
processBatch(chunks[index], index)
|
||||
.finally(() => {
|
||||
active--;
|
||||
runNext();
|
||||
});
|
||||
active++;
|
||||
index++;
|
||||
}
|
||||
|
||||
if (active === 0 && index >= chunks.length) {
|
||||
logEl.textContent += '\n✅ Stress test completed\n';
|
||||
resolve();
|
||||
}
|
||||
};
|
||||
|
||||
runNext();
|
||||
});
|
||||
}
|
||||
|
||||
// Event listeners
|
||||
document.getElementById('run-btn').addEventListener('click', runCrawl);
|
||||
document.getElementById('st-run').addEventListener('click', runStressTest);
|
||||
|
||||
function forceHighlightElement(element) {
|
||||
if (!element) return;
|
||||
|
||||
// Save current scroll position (important for large code blocks)
|
||||
const scrollTop = element.parentElement.scrollTop;
|
||||
|
||||
// Reset the element
|
||||
const text = element.textContent;
|
||||
element.innerHTML = text;
|
||||
element.removeAttribute('data-highlighted');
|
||||
|
||||
// Reapply highlighting
|
||||
hljs.highlightElement(element);
|
||||
|
||||
// Restore scroll position
|
||||
element.parentElement.scrollTop = scrollTop;
|
||||
}
|
||||
|
||||
// Initialize clipboard for all copy buttons
|
||||
function initCopyButtons() {
|
||||
document.querySelectorAll('.copy-btn').forEach(btn => {
|
||||
new ClipboardJS(btn, {
|
||||
text: () => {
|
||||
const target = document.querySelector(btn.dataset.target);
|
||||
return target ? target.textContent : '';
|
||||
}
|
||||
}).on('success', e => {
|
||||
e.clearSelection();
|
||||
// make button text "copied" for 1 second
|
||||
const originalText = e.trigger.textContent;
|
||||
e.trigger.textContent = 'Copied!';
|
||||
setTimeout(() => {
|
||||
e.trigger.textContent = originalText;
|
||||
}, 1000);
|
||||
// Highlight the copied code
|
||||
const target = document.querySelector(btn.dataset.target);
|
||||
if (target) {
|
||||
target.classList.add('highlighted');
|
||||
setTimeout(() => {
|
||||
target.classList.remove('highlighted');
|
||||
}, 1000);
|
||||
}
|
||||
|
||||
}).on('error', e => {
|
||||
console.error('Error copying:', e);
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
// Function to initialize UI based on selected endpoint
|
||||
function initUI() {
|
||||
// Trigger the endpoint change handler to set initial UI state
|
||||
const endpointSelect = document.getElementById('endpoint');
|
||||
const event = new Event('change');
|
||||
endpointSelect.dispatchEvent(event);
|
||||
|
||||
// Initialize copy buttons
|
||||
initCopyButtons();
|
||||
}
|
||||
|
||||
// Initialize on page load
|
||||
document.addEventListener('DOMContentLoaded', initUI);
|
||||
// Also call it immediately in case the script runs after DOM is already loaded
|
||||
if (document.readyState !== 'loading') {
|
||||
initUI();
|
||||
}
|
||||
|
||||
</script>
|
||||
</body>
|
||||
|
||||
</html>
|
||||
@@ -14,7 +14,7 @@ stderr_logfile=/dev/stderr ; Redirect redis stderr to container stderr
|
||||
stderr_logfile_maxbytes=0
|
||||
|
||||
[program:gunicorn]
|
||||
command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 2 --threads 2 --timeout 120 --graceful-timeout 30 --keep-alive 60 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
|
||||
command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 1 --threads 4 --timeout 1800 --graceful-timeout 30 --keep-alive 300 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
|
||||
directory=/app ; Working directory for the app
|
||||
user=appuser ; Run gunicorn as our non-root user
|
||||
autorestart=true
|
||||
|
||||
@@ -45,10 +45,10 @@ def datetime_handler(obj: any) -> Optional[str]:
|
||||
return obj.isoformat()
|
||||
raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
|
||||
|
||||
def should_cleanup_task(created_at: str) -> bool:
|
||||
def should_cleanup_task(created_at: str, ttl_seconds: int = 3600) -> bool:
|
||||
"""Check if task should be cleaned up based on creation time."""
|
||||
created = datetime.fromisoformat(created_at)
|
||||
return (datetime.now() - created).total_seconds() > 3600
|
||||
return (datetime.now() - created).total_seconds() > ttl_seconds
|
||||
|
||||
def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:
|
||||
"""Decode Redis hash data from bytes to strings."""
|
||||
|
||||
@@ -1,19 +1,11 @@
|
||||
# docker-compose.yml
|
||||
version: '3.8'
|
||||
|
||||
# Base configuration anchor for reusability
|
||||
# Shared configuration for all environments
|
||||
x-base-config: &base-config
|
||||
ports:
|
||||
# Map host port 11235 to container port 11235 (where Gunicorn will listen)
|
||||
- "11235:11235"
|
||||
# - "8080:8080" # Uncomment if needed
|
||||
|
||||
# Load API keys primarily from .llm.env file
|
||||
# Create .llm.env in the root directory .llm.env.example
|
||||
- "11235:11235" # Gunicorn port
|
||||
env_file:
|
||||
- .llm.env
|
||||
|
||||
# Define environment variables, allowing overrides from host environment
|
||||
# Syntax ${VAR:-} uses host env var 'VAR' if set, otherwise uses value from .llm.env
|
||||
- .llm.env # API keys (create from .llm.env.example)
|
||||
environment:
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
|
||||
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
|
||||
@@ -22,10 +14,8 @@ x-base-config: &base-config
|
||||
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
|
||||
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
|
||||
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
|
||||
|
||||
volumes:
|
||||
# Mount /dev/shm for Chromium/Playwright performance
|
||||
- /dev/shm:/dev/shm
|
||||
- /dev/shm:/dev/shm # Chromium performance
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
@@ -34,47 +24,26 @@ x-base-config: &base-config
|
||||
memory: 1G
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
# IMPORTANT: Ensure Gunicorn binds to 11235 in supervisord.conf
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s # Give the server time to start
|
||||
# Run the container as the non-root user defined in the Dockerfile
|
||||
start_period: 40s
|
||||
user: "appuser"
|
||||
|
||||
services:
|
||||
# --- Local Build Services ---
|
||||
crawl4ai-local-amd64:
|
||||
crawl4ai:
|
||||
# 1. Default: Pull multi-platform test image from Docker Hub
|
||||
# 2. Override with local image via: IMAGE=local-test docker compose up
|
||||
image: ${IMAGE:-unclecode/crawl4ai:${TAG:-latest}}
|
||||
|
||||
# Local build config (used with --build)
|
||||
build:
|
||||
context: . # Build context is the root directory
|
||||
dockerfile: Dockerfile # Dockerfile is in the root directory
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-default}
|
||||
ENABLE_GPU: ${ENABLE_GPU:-false}
|
||||
# PYTHON_VERSION arg is omitted as it's fixed by 'FROM python:3.10-slim' in Dockerfile
|
||||
platform: linux/amd64
|
||||
profiles: ["local-amd64"]
|
||||
<<: *base-config # Inherit base configuration
|
||||
|
||||
crawl4ai-local-arm64:
|
||||
build:
|
||||
context: . # Build context is the root directory
|
||||
dockerfile: Dockerfile # Dockerfile is in the root directory
|
||||
args:
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-default}
|
||||
ENABLE_GPU: ${ENABLE_GPU:-false}
|
||||
platform: linux/arm64
|
||||
profiles: ["local-arm64"]
|
||||
<<: *base-config
|
||||
|
||||
# --- Docker Hub Image Services ---
|
||||
crawl4ai-hub-amd64:
|
||||
image: unclecode/crawl4ai:${VERSION:-latest}-amd64
|
||||
profiles: ["hub-amd64"]
|
||||
<<: *base-config
|
||||
|
||||
crawl4ai-hub-arm64:
|
||||
image: unclecode/crawl4ai:${VERSION:-latest}-arm64
|
||||
profiles: ["hub-arm64"]
|
||||
|
||||
# Inherit shared config
|
||||
<<: *base-config
|
||||
127
docs/apps/linkdin/README.md
Normal file
127
docs/apps/linkdin/README.md
Normal file
@@ -0,0 +1,127 @@
|
||||
# Crawl4AI Prospect‑Wizard – step‑by‑step guide
|
||||
|
||||
A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.
|
||||
|
||||
```
|
||||
prospect‑wizard/
|
||||
├─ c4ai_discover.py # Stage 1 – scrape companies + people
|
||||
├─ c4ai_insights.py # Stage 2 – embeddings, org‑charts, scores
|
||||
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
|
||||
└─ data/ # output lands here (*.jsonl / *.json)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1 Install & boot a LinkedIn profile (one‑time)
|
||||
|
||||
### 1.1 Install dependencies
|
||||
```bash
|
||||
pip install crawl4ai litellm sentence-transformers pandas rich
|
||||
```
|
||||
|
||||
### 1.2 Create / warm a LinkedIn browser profile
|
||||
```bash
|
||||
crwl profiles
|
||||
```
|
||||
1. The interactive shell shows **New profile** – hit **enter**.
|
||||
2. Choose a name, e.g. `profile_linkedin_uc`.
|
||||
3. A Chromium window opens – log in to LinkedIn, solve whatever CAPTCHA, then close.
|
||||
|
||||
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
|
||||
|
||||
---
|
||||
|
||||
## 2 Discovery – scrape companies & people
|
||||
|
||||
```bash
|
||||
python c4ai_discover.py full \
|
||||
--query "health insurance management" \
|
||||
--geo 102713980 \ # Malaysia geoUrn
|
||||
--title-filters "" \ # or "Product,Engineering"
|
||||
--max-companies 10 \ # default set small for workshops
|
||||
--max-people 20 \ # \^ same
|
||||
--profile-name profile_linkedin_uc \
|
||||
--outdir ./data \
|
||||
--concurrency 2 \
|
||||
--log-level debug
|
||||
```
|
||||
**Outputs** in `./data/`:
|
||||
* `companies.jsonl` – one JSON per company
|
||||
* `people.jsonl` – one JSON per employee
|
||||
|
||||
🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
|
||||
|
||||
### Handy geoUrn cheatsheet
|
||||
| Location | geoUrn |
|
||||
|----------|--------|
|
||||
| Singapore | **103644278** |
|
||||
| Malaysia | **102713980** |
|
||||
| United States | **103644922** |
|
||||
| United Kingdom | **102221843** |
|
||||
| Australia | **101452733** |
|
||||
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
|
||||
|
||||
---
|
||||
|
||||
## 3 Insights – embeddings, org‑charts, decision makers
|
||||
|
||||
```bash
|
||||
python c4ai_insights.py \
|
||||
--in ./data \
|
||||
--out ./data \
|
||||
--embed-model all-MiniLM-L6-v2 \
|
||||
--llm-provider gemini/gemini-2.0-flash \
|
||||
--llm-api-key "" \
|
||||
--top-k 10 \
|
||||
--max-llm-tokens 8024 \
|
||||
--llm-temperature 1.0 \
|
||||
--workers 4
|
||||
```
|
||||
Emits next to the Stage‑1 files:
|
||||
* `company_graph.json` – inter‑company similarity graph
|
||||
* `org_chart_<handle>.json` – one per company
|
||||
* `decision_makers.csv` – hand‑picked ‘who to pitch’ list
|
||||
|
||||
Flags reference (straight from `build_arg_parser()`):
|
||||
| Flag | Default | Purpose |
|
||||
|------|---------|---------|
|
||||
| `--in` | `.` | Stage‑1 output dir |
|
||||
| `--out` | `.` | Destination dir |
|
||||
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
|
||||
| `--top_k` | `10` | Neighbours per company in graph |
|
||||
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
|
||||
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
|
||||
| `--llm_temperature` | `1.0` | Creativity knob |
|
||||
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
|
||||
| `--workers` | `4` | Parallel LLM workers |
|
||||
|
||||
---
|
||||
|
||||
## 4 Visualise – interactive graph
|
||||
|
||||
After Stage 2 completes, simply open the HTML viewer from the project root:
|
||||
```bash
|
||||
open graph_view_template.html # or Live Server / Python -http
|
||||
```
|
||||
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
|
||||
|
||||
* Left pane → list of companies (clans).
|
||||
* Click a node to load its org‑chart on the right.
|
||||
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
|
||||
|
||||
---
|
||||
|
||||
## 5 Common snags
|
||||
|
||||
| Symptom | Fix |
|
||||
|---------|-----|
|
||||
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
|
||||
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
|
||||
| Blank graph | Check JSON paths, clear `localStorage` in browser |
|
||||
|
||||
---
|
||||
|
||||
### TL;DR
|
||||
`crwl profiles` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
|
||||
Live long and `import crawl4ai`.
|
||||
|
||||
437
docs/apps/linkdin/c4ai_discover.py
Normal file
437
docs/apps/linkdin/c4ai_discover.py
Normal file
@@ -0,0 +1,437 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
c4ai-discover — Stage‑1 Discovery CLI
|
||||
|
||||
Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
|
||||
JSON files: companies.jsonl and people.jsonl.
|
||||
|
||||
Key design rules
|
||||
----------------
|
||||
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
|
||||
* JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
|
||||
from sample HTML provided by user and then cached under ./schemas/.
|
||||
* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
|
||||
* If executed as a console script (argv > 1), CLI flags win.
|
||||
* Lightweight deps: argparse + Crawl4AI stack.
|
||||
|
||||
Author: Tom @ Kidocode 2025‑04‑26
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import warnings, re
|
||||
warnings.filterwarnings(
|
||||
"ignore",
|
||||
message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
|
||||
category=FutureWarning,
|
||||
module=r"soupsieve"
|
||||
)
|
||||
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Imports
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
import argparse
|
||||
import random
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import pathlib
|
||||
import sys
|
||||
# 3rd-party rich for pretty logging
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
|
||||
from datetime import datetime, UTC
|
||||
from textwrap import dedent
|
||||
from types import SimpleNamespace
|
||||
from typing import Dict, List, Optional
|
||||
from urllib.parse import quote
|
||||
from pathlib import Path
|
||||
from glob import glob
|
||||
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CacheMode,
|
||||
CrawlerRunConfig,
|
||||
JsonCssExtractionStrategy,
|
||||
BrowserProfiler,
|
||||
LLMConfig,
|
||||
)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Constants / paths
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
SCHEMA_DIR = BASE_DIR / "schemas"
|
||||
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
|
||||
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
|
||||
|
||||
# ---------- deterministic target JSON examples ----------
|
||||
_COMPANY_SCHEMA_EXAMPLE = {
|
||||
"handle": "/company/posify/",
|
||||
"profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
|
||||
"name": "Management Research Services, Inc. (MRS, Inc)",
|
||||
"descriptor": "Insurance • Milwaukee, Wisconsin",
|
||||
"about": "Insurance • Milwaukee, Wisconsin",
|
||||
"followers": 1000
|
||||
}
|
||||
|
||||
_PEOPLE_SCHEMA_EXAMPLE = {
|
||||
"profile_url": "https://www.linkedin.com/in/lily-ng/",
|
||||
"name": "Lily Ng",
|
||||
"headline": "VP Product @ Posify",
|
||||
"followers": 890,
|
||||
"connection_degree": "2nd",
|
||||
"avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
|
||||
}
|
||||
|
||||
# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
|
||||
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
|
||||
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
|
||||
|
||||
# --------- tighter schema prompts ----------
|
||||
_COMPANY_SCHEMA_QUERY = dedent(
|
||||
"""
|
||||
Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
|
||||
for every card, outputs *exactly* the keys shown in the example JSON below.
|
||||
JSON spec:
|
||||
• handle – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
|
||||
• profile_image – absolute URL of the <img> inside that link
|
||||
• name – text of the <a> inside the <span class*='t-16'>
|
||||
• descriptor – text line with industry • location
|
||||
• about – text of the <div class*='t-normal'> below the name (industry + geo)
|
||||
• followers – integer parsed from the <div> containing 'followers'
|
||||
|
||||
IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
|
||||
The main div parent contains these li element is "div.search-results-container" you can use this.
|
||||
The <ul> parent has "role" equal to "list". Using these two should be enough to target the <li> elements."
|
||||
"""
|
||||
)
|
||||
|
||||
_PEOPLE_SCHEMA_QUERY = dedent(
|
||||
"""
|
||||
Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
|
||||
outputs exactly the keys in the example JSON below.
|
||||
Fields:
|
||||
• profile_url – href of the outermost profile link
|
||||
• name – text inside artdeco-entity-lockup__title
|
||||
• headline – inner text of artdeco-entity-lockup__subtitle
|
||||
• followers – integer parsed from the span inside lt-line-clamp--multi-line
|
||||
• connection_degree – '1st', '2nd', etc. from artdeco-entity-lockup__badge
|
||||
• avatar_url – src of the <img> within artdeco-entity-lockup__image
|
||||
|
||||
IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
|
||||
The main div parent contains these li element is a "div" has these classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
|
||||
"""
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Utility helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _load_or_build_schema(
|
||||
path: pathlib.Path,
|
||||
sample_html: str,
|
||||
query: str,
|
||||
example_json: Dict,
|
||||
force = False
|
||||
) -> Dict:
|
||||
"""Load schema from path, else call generate_schema once and persist."""
|
||||
if path.exists() and not force:
|
||||
return json.loads(path.read_text())
|
||||
|
||||
logging.info("[SCHEMA] Generating schema %s", path.name)
|
||||
schema = JsonCssExtractionStrategy.generate_schema(
|
||||
html=sample_html,
|
||||
llm_config=LLMConfig(
|
||||
provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
|
||||
api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
|
||||
),
|
||||
query=query,
|
||||
target_json_example=json.dumps(example_json, indent=2),
|
||||
)
|
||||
path.write_text(json.dumps(schema, indent=2))
|
||||
return schema
|
||||
|
||||
|
||||
def _openai_friendly_number(text: str) -> Optional[int]:
|
||||
"""Extract first int from text like '1K followers' (returns 1000)."""
|
||||
import re
|
||||
|
||||
m = re.search(r"(\d[\d,]*)", text.replace(",", ""))
|
||||
if not m:
|
||||
return None
|
||||
val = int(m.group(1))
|
||||
if "k" in text.lower():
|
||||
val *= 1000
|
||||
if "m" in text.lower():
|
||||
val *= 1_000_000
|
||||
return val
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core async workers
|
||||
# ---------------------------------------------------------------------------
|
||||
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
|
||||
"""Paginate 10-item company search pages until `limit` reached."""
|
||||
extraction = JsonCssExtractionStrategy(schema)
|
||||
cfg = CrawlerRunConfig(
|
||||
extraction_strategy=extraction,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
wait_for = ".search-marvel-srp",
|
||||
session_id="company_search",
|
||||
delay_before_return_html=1,
|
||||
magic = True,
|
||||
verbose= False,
|
||||
)
|
||||
companies, page = [], 1
|
||||
while len(companies) < max(limit, 10):
|
||||
paged_url = f"{url}&page={page}"
|
||||
res = await crawler.arun(paged_url, config=cfg)
|
||||
batch = json.loads(res[0].extracted_content)
|
||||
if not batch:
|
||||
break
|
||||
for item in batch:
|
||||
name = item.get("name", "").strip()
|
||||
handle = item.get("handle", "").strip()
|
||||
if not handle or not name:
|
||||
continue
|
||||
descriptor = item.get("descriptor")
|
||||
about = item.get("about")
|
||||
followers = _openai_friendly_number(str(item.get("followers", "")))
|
||||
companies.append(
|
||||
{
|
||||
"handle": handle,
|
||||
"name": name,
|
||||
"descriptor": descriptor,
|
||||
"about": about,
|
||||
"followers": followers,
|
||||
"people_url": f"{handle}people/",
|
||||
"captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
|
||||
}
|
||||
)
|
||||
page += 1
|
||||
logging.info(
|
||||
f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
|
||||
)
|
||||
|
||||
return companies[:max(limit, 10)]
|
||||
|
||||
|
||||
async def crawl_people_page(
|
||||
crawler: AsyncWebCrawler,
|
||||
people_url: str,
|
||||
schema: Dict,
|
||||
limit: int,
|
||||
title_kw: str,
|
||||
) -> List[Dict]:
|
||||
people_u = f"{people_url}?keywords={quote(title_kw)}"
|
||||
extraction = JsonCssExtractionStrategy(schema)
|
||||
cfg = CrawlerRunConfig(
|
||||
extraction_strategy=extraction,
|
||||
# scan_full_page=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
magic=True,
|
||||
wait_for=".org-people-profile-card__card-spacing",
|
||||
delay_before_return_html=1,
|
||||
session_id="people_search",
|
||||
)
|
||||
res = await crawler.arun(people_u, config=cfg)
|
||||
if not res[0].success:
|
||||
return []
|
||||
raw = json.loads(res[0].extracted_content)
|
||||
people = []
|
||||
for p in raw[:limit]:
|
||||
followers = _openai_friendly_number(str(p.get("followers", "")))
|
||||
people.append(
|
||||
{
|
||||
"profile_url": p.get("profile_url"),
|
||||
"name": p.get("name"),
|
||||
"headline": p.get("headline"),
|
||||
"followers": followers,
|
||||
"connection_degree": p.get("connection_degree"),
|
||||
"avatar_url": p.get("avatar_url"),
|
||||
}
|
||||
)
|
||||
return people
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI + main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def build_arg_parser() -> argparse.ArgumentParser:
|
||||
ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
|
||||
sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
|
||||
|
||||
def add_flags(parser: argparse.ArgumentParser):
|
||||
parser.add_argument("--query", required=False, help="query keyword(s)")
|
||||
parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
|
||||
parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
|
||||
parser.add_argument("--max-companies", type=int, default=1000)
|
||||
parser.add_argument("--max-people", type=int, default=500)
|
||||
parser.add_argument("--profile-name", default=str(pathlib.Path.home() / ".crawl4ai/profiles/profile_linkedin_uc"))
|
||||
parser.add_argument("--outdir", default="./output")
|
||||
parser.add_argument("--concurrency", type=int, default=4)
|
||||
parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
|
||||
|
||||
add_flags(sub.add_parser("full"))
|
||||
add_flags(sub.add_parser("companies"))
|
||||
add_flags(sub.add_parser("people"))
|
||||
|
||||
# global flags
|
||||
ap.add_argument(
|
||||
"--debug",
|
||||
action="store_true",
|
||||
help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
|
||||
)
|
||||
return ap
|
||||
|
||||
|
||||
def detect_debug_defaults(force = False) -> SimpleNamespace:
|
||||
if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
|
||||
return SimpleNamespace()
|
||||
# ----- debug‑friendly defaults -----
|
||||
return SimpleNamespace(
|
||||
cmd="full",
|
||||
query="health insurance management",
|
||||
geo=102713980,
|
||||
# title_filters="Product,Engineering",
|
||||
title_filters="",
|
||||
max_companies=10,
|
||||
max_people=5,
|
||||
profile_name="profile_linkedin_uc",
|
||||
outdir="./debug_out",
|
||||
concurrency=2,
|
||||
log_level="debug",
|
||||
)
|
||||
|
||||
|
||||
async def async_main(opts):
|
||||
# ─────────── logging setup ───────────
|
||||
console = Console()
|
||||
logging.basicConfig(
|
||||
level=opts.log_level.upper(),
|
||||
format="%(message)s",
|
||||
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||
)
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Load or build schemas (one‑time LLM call each)
|
||||
# -------------------------------------------------------------------
|
||||
company_schema = _load_or_build_schema(
|
||||
COMPANY_SCHEMA_PATH,
|
||||
_SAMPLE_COMPANY_HTML,
|
||||
_COMPANY_SCHEMA_QUERY,
|
||||
_COMPANY_SCHEMA_EXAMPLE,
|
||||
# True
|
||||
)
|
||||
people_schema = _load_or_build_schema(
|
||||
PEOPLE_SCHEMA_PATH,
|
||||
_SAMPLE_PEOPLE_HTML,
|
||||
_PEOPLE_SCHEMA_QUERY,
|
||||
_PEOPLE_SCHEMA_EXAMPLE,
|
||||
# True
|
||||
)
|
||||
|
||||
outdir = BASE_DIR / pathlib.Path(opts.outdir)
|
||||
outdir.mkdir(parents=True, exist_ok=True)
|
||||
f_companies = (BASE_DIR / outdir / "companies.jsonl").open("a", encoding="utf-8")
|
||||
f_people = (BASE_DIR / outdir / "people.jsonl").open("a", encoding="utf-8")
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Prepare crawler with cookie pool rotation
|
||||
# -------------------------------------------------------------------
|
||||
profiler = BrowserProfiler()
|
||||
path = profiler.get_profile_path(opts.profile_name)
|
||||
bc = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=False,
|
||||
user_data_dir=path,
|
||||
use_managed_browser=True,
|
||||
user_agent_mode = "random",
|
||||
user_agent_generator_config= {
|
||||
"platforms": "mobile",
|
||||
"os": "Android"
|
||||
}
|
||||
)
|
||||
crawler = AsyncWebCrawler(config=bc)
|
||||
|
||||
await crawler.start()
|
||||
|
||||
# Single worker for simplicity; concurrency can be scaled by arun_many if needed.
|
||||
# crawler = await next_crawler().start()
|
||||
try:
|
||||
# Build LinkedIn search URL
|
||||
search_url = f'https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&companyHqGeo="{opts.geo}"'
|
||||
logging.info("Seed URL => %s", search_url)
|
||||
|
||||
companies: List[Dict] = []
|
||||
if opts.cmd in ("companies", "full"):
|
||||
companies = await crawl_company_search(
|
||||
crawler, search_url, company_schema, opts.max_companies
|
||||
)
|
||||
for c in companies:
|
||||
f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
|
||||
logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
|
||||
|
||||
if opts.cmd in ("people", "full"):
|
||||
if not companies:
|
||||
# load from previous run
|
||||
src = outdir / "companies.jsonl"
|
||||
if not src.exists():
|
||||
logging.error("companies.jsonl missing — run companies/full first")
|
||||
return 10
|
||||
companies = [json.loads(l) for l in src.read_text().splitlines()]
|
||||
total_people = 0
|
||||
title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
|
||||
for comp in companies:
|
||||
people = await crawl_people_page(
|
||||
crawler,
|
||||
comp["people_url"],
|
||||
people_schema,
|
||||
opts.max_people,
|
||||
title_kw,
|
||||
)
|
||||
for p in people:
|
||||
rec = p | {
|
||||
"company_handle": comp["handle"],
|
||||
# "captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
|
||||
"captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
|
||||
}
|
||||
f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
|
||||
total_people += len(people)
|
||||
logging.info(
|
||||
f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
|
||||
)
|
||||
await asyncio.sleep(random.uniform(0.5, 1))
|
||||
logging.info("Total people scraped: %d", total_people)
|
||||
finally:
|
||||
await crawler.close()
|
||||
f_companies.close()
|
||||
f_people.close()
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def main():
|
||||
parser = build_arg_parser()
|
||||
cli_opts = parser.parse_args()
|
||||
|
||||
# decide on debug defaults
|
||||
if cli_opts.debug:
|
||||
opts = detect_debug_defaults(force=True)
|
||||
else:
|
||||
env_defaults = detect_debug_defaults()
|
||||
opts = env_defaults if env_defaults else cli_opts
|
||||
|
||||
if not getattr(opts, "cmd", None):
|
||||
opts.cmd = "full"
|
||||
|
||||
exit_code = asyncio.run(async_main(cli_opts))
|
||||
sys.exit(exit_code)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
377
docs/apps/linkdin/c4ai_insights.py
Normal file
377
docs/apps/linkdin/c4ai_insights.py
Normal file
@@ -0,0 +1,377 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Stage-2 Insights builder
|
||||
------------------------
|
||||
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
|
||||
• company_graph.json
|
||||
• org_chart_<handle>.json (one per company)
|
||||
• decision_makers.csv
|
||||
• graph_view.html (interactive visualisation)
|
||||
|
||||
Run:
|
||||
python c4ai_insights.py --in ./stage1_out --out ./stage2_out
|
||||
|
||||
Author : Tom @ Kidocode, 2025-04-28
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Imports & Third-party
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
import argparse, asyncio, json, pathlib, random
|
||||
from datetime import datetime, UTC
|
||||
from types import SimpleNamespace
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any
|
||||
# Pretty CLI UX
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
|
||||
import logging
|
||||
|
||||
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# 3rd-party deps
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
import numpy as np
|
||||
# from sentence_transformers import SentenceTransformer
|
||||
# from sklearn.metrics.pairwise import cosine_similarity
|
||||
import pandas as pd
|
||||
import hashlib
|
||||
|
||||
from litellm import completion #Support any LLM Provider
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Utils
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
|
||||
with open(path, "r", encoding="utf-8") as f:
|
||||
return [json.loads(l) for l in f]
|
||||
|
||||
def dump_json(obj, path: Path):
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
json.dump(obj, f, ensure_ascii=False, indent=2)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Constants
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Debug defaults (mirrors Stage-1 trick)
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def dev_defaults() -> SimpleNamespace:
|
||||
return SimpleNamespace(
|
||||
in_dir="./debug_out",
|
||||
out_dir="./insights_debug",
|
||||
embed_model="all-MiniLM-L6-v2",
|
||||
top_k=10,
|
||||
llm_provider="openai/gpt-4.1",
|
||||
llm_api_key=None,
|
||||
max_llm_tokens=8000,
|
||||
llm_temperature=1.0,
|
||||
workers=4
|
||||
)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Graph builders
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
logging.debug(f"Using embedding model: {model_name}")
|
||||
cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
|
||||
cache = {}
|
||||
if cache_path.exists():
|
||||
with open(cache_path) as f:
|
||||
cache = json.load(f)
|
||||
# flush cache if model differs
|
||||
if cache.get("_model") != model_name:
|
||||
cache = {}
|
||||
|
||||
model = SentenceTransformer(model_name)
|
||||
new_texts, new_indices = [], []
|
||||
vectors = np.zeros((len(companies), 384), dtype=np.float32)
|
||||
|
||||
for idx, comp in enumerate(companies):
|
||||
text = comp.get("about") or comp.get("descriptor","")
|
||||
h = hashlib.sha1(text.encode("utf-8")).hexdigest()
|
||||
cached = cache.get(comp["handle"])
|
||||
if cached and cached["hash"] == h:
|
||||
vectors[idx] = np.array(cached["vector"], dtype=np.float32)
|
||||
else:
|
||||
new_texts.append(text)
|
||||
new_indices.append((idx, comp["handle"], h))
|
||||
|
||||
if new_texts:
|
||||
embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
|
||||
for vec, (idx, handle, h) in zip(embeds, new_indices):
|
||||
vectors[idx] = vec
|
||||
cache[handle] = {"hash": h, "vector": vec.tolist()}
|
||||
cache["_model"] = model_name
|
||||
with open(cache_path, "w") as f:
|
||||
json.dump(cache, f)
|
||||
|
||||
return vectors
|
||||
|
||||
def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
sims = cosine_similarity(embeds)
|
||||
nodes, edges = [], []
|
||||
idx_of = {c["handle"]: i for i,c in enumerate(companies)}
|
||||
for i,c in enumerate(companies):
|
||||
node = dict(
|
||||
id=c["handle"].strip("/"),
|
||||
name=c["name"],
|
||||
handle=c["handle"],
|
||||
about=c.get("about",""),
|
||||
people_url=c.get("people_url",""),
|
||||
industry=c.get("descriptor","").split("•")[0].strip(),
|
||||
geoUrn=c.get("geoUrn"),
|
||||
followers=c.get("followers",0),
|
||||
# desc_embed=embeds[i].tolist(),
|
||||
desc_embed=[],
|
||||
)
|
||||
nodes.append(node)
|
||||
# pick top-k most similar except itself
|
||||
top_idx = np.argsort(sims[i])[::-1][1:top_k+1]
|
||||
for j in top_idx:
|
||||
tgt = companies[j]
|
||||
weight = float(sims[i,j])
|
||||
if node["industry"] == tgt.get("descriptor","").split("•")[0].strip():
|
||||
weight += 0.10
|
||||
if node["geoUrn"] == tgt.get("geoUrn"):
|
||||
weight += 0.05
|
||||
tgt['followers'] = tgt.get("followers", None) or 1
|
||||
node["followers"] = node.get("followers", None) or 1
|
||||
follower_ratio = min(node["followers"], tgt.get("followers",1)) / max(node["followers"] or 1, tgt.get("followers",1))
|
||||
weight += 0.05 * follower_ratio
|
||||
edges.append(dict(
|
||||
source=node["id"],
|
||||
target=tgt["handle"].strip("/"),
|
||||
weight=round(weight,4),
|
||||
drivers=dict(
|
||||
embed_sim=round(float(sims[i,j]),4),
|
||||
industry_match=0.10 if node["industry"] == tgt.get("descriptor","").split("•")[0].strip() else 0,
|
||||
geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
|
||||
)
|
||||
))
|
||||
# return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
|
||||
return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Org-chart via LLM
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
async def infer_org_chart_llm(company, people, llm_provider:str, api_key:str, max_tokens:int, temperature:float, stub:bool=False, base_url:str=None):
|
||||
if stub:
|
||||
# Tiny fake org-chart when debugging offline
|
||||
chief = random.choice(people)
|
||||
nodes = [{
|
||||
"id": chief["profile_url"],
|
||||
"name": chief["name"],
|
||||
"title": chief["headline"],
|
||||
"dept": chief["headline"].split()[:1][0],
|
||||
"yoe_total": 8,
|
||||
"yoe_current": 2,
|
||||
"seniority_score": 0.8,
|
||||
"decision_score": 0.9,
|
||||
"avatar_url": chief.get("avatar_url")
|
||||
}]
|
||||
return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
|
||||
|
||||
prompt = [
|
||||
{"role":"system","content":"You are an expert B2B org-chart reasoner."},
|
||||
{"role":"user","content":f"""Here is the company description:
|
||||
|
||||
<company>
|
||||
{json.dumps(company, ensure_ascii=False)}
|
||||
</company>
|
||||
|
||||
Here is a JSON list of employees:
|
||||
<employees>
|
||||
{json.dumps(people, ensure_ascii=False)}
|
||||
</employees>
|
||||
|
||||
1) Build a reporting tree (manager -> direct reports)
|
||||
2) For each person output a decision_score 0-1 for buying new software
|
||||
|
||||
Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
|
||||
"""}
|
||||
]
|
||||
resp = completion(
|
||||
model=llm_provider,
|
||||
messages=prompt,
|
||||
max_tokens=max_tokens,
|
||||
temperature=temperature,
|
||||
response_format={"type":"json_object"},
|
||||
api_key=api_key,
|
||||
base_url=base_url
|
||||
)
|
||||
chart = json.loads(resp.choices[0].message.content)
|
||||
chart["meta"] = dict(
|
||||
model=llm_provider,
|
||||
generated_at=datetime.now(UTC).isoformat()
|
||||
)
|
||||
return chart
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# CSV flatten
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
|
||||
rows=[]
|
||||
for p in charts_dir.glob("org_chart_*.json"):
|
||||
data=json.loads(p.read_text())
|
||||
comp = p.stem.split("org_chart_")[1]
|
||||
for n in data.get("nodes",[]):
|
||||
if n.get("decision_score",0)>=threshold:
|
||||
rows.append(dict(
|
||||
company=comp,
|
||||
person=n["name"],
|
||||
title=n["title"],
|
||||
decision_score=n["decision_score"],
|
||||
profile_url=n["id"]
|
||||
))
|
||||
pd.DataFrame(rows).to_csv(csv_path,index=False)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# HTML rendering
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def render_html(out:Path, template_dir:Path):
|
||||
# From template folder cp graph_view.html and ai.js in out folder
|
||||
import shutil
|
||||
shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
|
||||
shutil.copy(template_dir/"ai.js", out)
|
||||
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Main async pipeline
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
async def run(opts):
|
||||
# ── silence SDK noise ──────────────────────────────────────────────────────
|
||||
for noisy in ("openai", "httpx", "httpcore"):
|
||||
lg = logging.getLogger(noisy)
|
||||
lg.setLevel(logging.WARNING) # or ERROR if you want total silence
|
||||
lg.propagate = False # optional: stop them reaching root
|
||||
|
||||
# ────────────── logging bootstrap ──────────────
|
||||
console = Console()
|
||||
logging.basicConfig(
|
||||
level="INFO",
|
||||
format="%(message)s",
|
||||
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||
)
|
||||
|
||||
in_dir = BASE_DIR / Path(opts.in_dir)
|
||||
out_dir = BASE_DIR / Path(opts.out_dir)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
companies = load_jsonl(in_dir/"companies.jsonl")
|
||||
people = load_jsonl(in_dir/"people.jsonl")
|
||||
|
||||
logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
|
||||
|
||||
logging.info("[bold]⇢[/] Embedding company descriptions…")
|
||||
embeds = embed_descriptions(companies, opts.embed_model, opts)
|
||||
|
||||
logging.info("[bold]⇢[/] Building similarity graph")
|
||||
company_graph = build_company_graph(companies, embeds, opts.top_k)
|
||||
dump_json(company_graph, out_dir/"company_graph.json")
|
||||
|
||||
# Filter companies that need processing
|
||||
to_process = []
|
||||
for comp in companies:
|
||||
handle = comp["handle"].strip("/").replace("/","_")
|
||||
out_file = out_dir/f"org_chart_{handle}.json"
|
||||
if out_file.exists() and False:
|
||||
logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
|
||||
continue
|
||||
to_process.append(comp)
|
||||
|
||||
|
||||
if not to_process:
|
||||
logging.info("[yellow]All companies already processed[/]")
|
||||
else:
|
||||
workers = getattr(opts, 'workers', 1)
|
||||
parallel = workers > 1
|
||||
|
||||
logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
|
||||
|
||||
with Progress(
|
||||
SpinnerColumn(),
|
||||
BarColumn(),
|
||||
TextColumn("[progress.description]{task.description}"),
|
||||
TimeElapsedColumn(),
|
||||
console=console,
|
||||
) as progress:
|
||||
task = progress.add_task("Org charts", total=len(to_process))
|
||||
|
||||
async def process_one(comp):
|
||||
handle = comp["handle"].strip("/").replace("/","_")
|
||||
persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
|
||||
chart = await infer_org_chart_llm(
|
||||
comp, persons,
|
||||
llm_provider=opts.llm_provider,
|
||||
api_key=opts.llm_api_key or None,
|
||||
max_tokens=opts.max_llm_tokens,
|
||||
temperature=opts.llm_temperature,
|
||||
stub=opts.stub or False,
|
||||
base_url=opts.llm_base_url or None
|
||||
)
|
||||
chart["meta"]["company"] = comp["name"]
|
||||
|
||||
# Save the result immediately
|
||||
dump_json(chart, out_dir/f"org_chart_{handle}.json")
|
||||
|
||||
progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
|
||||
|
||||
# Create tasks for all companies
|
||||
tasks = [process_one(comp) for comp in to_process]
|
||||
|
||||
# Process in batches based on worker count
|
||||
semaphore = asyncio.Semaphore(workers)
|
||||
|
||||
async def bounded_process(coro):
|
||||
async with semaphore:
|
||||
return await coro
|
||||
|
||||
# Run with concurrency control
|
||||
await asyncio.gather(*(bounded_process(task) for task in tasks))
|
||||
|
||||
logging.info("[bold]⇢[/] Flattening decision-makers CSV")
|
||||
export_decision_makers(out_dir, out_dir/"decision_makers.csv")
|
||||
|
||||
render_html(out_dir, template_dir=BASE_DIR/"templates")
|
||||
logging.success = lambda msg, **k: console.print(f"[bold green]✓[/] {msg}", **k)
|
||||
logging.success(f"Stage-2 artefacts written to {out_dir}")
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# CLI
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def build_arg_parser():
|
||||
p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
|
||||
p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
|
||||
p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
|
||||
p.add_argument("--embed-model", default="all-MiniLM-L6-v2")
|
||||
p.add_argument("--top-k", type=int, default=10, help="Top-k neighbours per company")
|
||||
p.add_argument("--llm-provider", default="openai/gpt-4.1",
|
||||
help="LLM model to use in format 'provider/model_name' (e.g., 'anthropic/claude-3')")
|
||||
p.add_argument("--llm-api-key", help="API key for LLM provider (defaults to env vars)")
|
||||
p.add_argument("--llm-base-url", help="Base URL for LLM API endpoint")
|
||||
p.add_argument("--max-llm-tokens", type=int, default=8024)
|
||||
p.add_argument("--llm-temperature", type=float, default=1.0)
|
||||
p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
|
||||
p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
|
||||
return p
|
||||
|
||||
def main():
|
||||
dbg = dev_defaults()
|
||||
# opts = dbg if True else build_arg_parser().parse_args()
|
||||
opts = build_arg_parser().parse_args()
|
||||
asyncio.run(run(opts))
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
39
docs/apps/linkdin/schemas/company_card.json
Normal file
39
docs/apps/linkdin/schemas/company_card.json
Normal file
@@ -0,0 +1,39 @@
|
||||
{
|
||||
"name": "LinkedIn Company Card",
|
||||
"baseSelector": "div.search-results-container ul[role='list'] > li",
|
||||
"fields": [
|
||||
{
|
||||
"name": "handle",
|
||||
"selector": "a[href*='/company/']",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "profile_image",
|
||||
"selector": "a[href*='/company/'] img",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "name",
|
||||
"selector": "span[class*='t-16'] a",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "descriptor",
|
||||
"selector": "div[class*='t-black t-normal']",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "about",
|
||||
"selector": "p[class*='entity-result__summary--2-lines']",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "followers",
|
||||
"selector": "div:contains('followers')",
|
||||
"type": "regex",
|
||||
"pattern": "(\\d+)\\s*followers"
|
||||
}
|
||||
]
|
||||
}
|
||||
38
docs/apps/linkdin/schemas/people_card.json
Normal file
38
docs/apps/linkdin/schemas/people_card.json
Normal file
@@ -0,0 +1,38 @@
|
||||
{
|
||||
"name": "LinkedIn People Card",
|
||||
"baseSelector": "li.org-people-profile-card__profile-card-spacing",
|
||||
"fields": [
|
||||
{
|
||||
"name": "profile_url",
|
||||
"selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "name",
|
||||
"selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "headline",
|
||||
"selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "followers",
|
||||
"selector": ".lt-line-clamp--multi-line.t-12",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "connection_degree",
|
||||
"selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "avatar_url",
|
||||
"selector": ".artdeco-entity-lockup__image img",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
]
|
||||
}
|
||||
143
docs/apps/linkdin/snippets/company.html
Normal file
143
docs/apps/linkdin/snippets/company.html
Normal file
@@ -0,0 +1,143 @@
|
||||
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
|
||||
<!----><!---->
|
||||
|
||||
|
||||
|
||||
<div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
|
||||
data-view-name="search-entity-result-universal-template">
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="linked-area flex-1
|
||||
cursor-pointer">
|
||||
|
||||
<div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
|
||||
<div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
|
||||
<div class="display-flex align-items-center">
|
||||
<!---->
|
||||
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down " aria-hidden="true"
|
||||
tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||
data-test-app-aware-link="">
|
||||
|
||||
<div class="ivm-image-view-model ">
|
||||
|
||||
<div class="ivm-view-attr__img-wrapper
|
||||
|
||||
">
|
||||
<!---->
|
||||
<!----> <img width="48"
|
||||
src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&v=beta&t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
|
||||
loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
|
||||
id="ember28"
|
||||
class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<div
|
||||
class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
|
||||
<div class="mb1">
|
||||
|
||||
<div class="t-roman t-sans">
|
||||
|
||||
|
||||
|
||||
<div class="display-flex">
|
||||
<span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
|
||||
<span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
|
||||
t-16">
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||
href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||
data-test-app-aware-link="">
|
||||
<!---->Management Research Services, Inc. (MRS, Inc)<!---->
|
||||
<!----> </a>
|
||||
<!----> </span>
|
||||
</span>
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
<div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
|
||||
t-14 t-black t-normal">
|
||||
<!---->Insurance • Milwaukee, Wisconsin<!---->
|
||||
</div>
|
||||
|
||||
<div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
|
||||
t-14 t-normal">
|
||||
<!---->1K followers<!---->
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
<!---->
|
||||
<p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
|
||||
entity-result__summary--2-lines
|
||||
t-12 t-black--light
|
||||
">
|
||||
<!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
|
||||
</span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
|
||||
Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
|
||||
class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
|
||||
</span><strong><!---->insurance<!----></strong><span class="white-space-pre">
|
||||
</span>underwriting solutions that efficiently support clients’ workflows. Supported by the
|
||||
Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
|
||||
optimize requirements...<!---->
|
||||
</p>
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
<div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
|
||||
<!---->
|
||||
|
||||
|
||||
<div>
|
||||
|
||||
|
||||
|
||||
|
||||
<button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
|
||||
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
|
||||
type="button"><!---->
|
||||
<span class="artdeco-button__text">
|
||||
Follow
|
||||
</span></button>
|
||||
|
||||
|
||||
|
||||
<!---->
|
||||
<!---->
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</li>
|
||||
94
docs/apps/linkdin/snippets/people.html
Normal file
94
docs/apps/linkdin/snippets/people.html
Normal file
@@ -0,0 +1,94 @@
|
||||
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
|
||||
<div>
|
||||
|
||||
|
||||
<section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
|
||||
<!---->
|
||||
|
||||
<img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
|
||||
ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
|
||||
class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
|
||||
|
||||
<div class="org-people-profile-card__profile-info">
|
||||
<div id="ember97"
|
||||
class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
|
||||
<div id="ember98"
|
||||
class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
|
||||
type="circle">
|
||||
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||
id="org-people-profile-card__profile-image-0"
|
||||
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||
data-test-app-aware-link="">
|
||||
<img width="104"
|
||||
src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&v=beta&t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
|
||||
loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
|
||||
class="evi-image lazy-image ember-view">
|
||||
</a>
|
||||
|
||||
|
||||
</div>
|
||||
<div id="ember100" class="artdeco-entity-lockup__content ember-view">
|
||||
<div id="ember101" class="artdeco-entity-lockup__title ember-view">
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
|
||||
aria-label="View Dr. Rayna S.’s profile"
|
||||
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||
data-test-app-aware-link="">
|
||||
<div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
|
||||
t-black" style="">
|
||||
Dr. Rayna S.
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
</a>
|
||||
|
||||
</div>
|
||||
<div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
|
||||
degree connection</span>
|
||||
<span class="artdeco-entity-lockup__degree" aria-hidden="true">
|
||||
· 3rd
|
||||
</span>
|
||||
<!----><!---->
|
||||
</div>
|
||||
<div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
|
||||
<div class="t-14 t-black--light t-normal">
|
||||
<div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
|
||||
style="-webkit-line-clamp: 2">
|
||||
Leadership and Talent Development Consultant and Professional Speaker
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
<span class="text-align-center">
|
||||
<span id="ember110"
|
||||
class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
|
||||
style="-webkit-line-clamp: 3">
|
||||
727 followers
|
||||
|
||||
<!----> </span>
|
||||
|
||||
</span>
|
||||
</div>
|
||||
|
||||
<footer class="ph3 pb3">
|
||||
<button aria-label="Follow Dr. Rayna S." id="ember111"
|
||||
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
|
||||
type="button"><!---->
|
||||
<span class="artdeco-button__text">
|
||||
Follow
|
||||
</span></button>
|
||||
</footer>
|
||||
|
||||
</section>
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
</li>
|
||||
50
docs/apps/linkdin/templates/ai.js
Normal file
50
docs/apps/linkdin/templates/ai.js
Normal file
@@ -0,0 +1,50 @@
|
||||
// ==== File: ai.js ====
|
||||
|
||||
class ApiHandler {
|
||||
constructor(apiKey = null) {
|
||||
this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
|
||||
console.log("ApiHandler ready");
|
||||
}
|
||||
|
||||
setApiKey(k) {
|
||||
this.apiKey = k.trim();
|
||||
if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
|
||||
}
|
||||
|
||||
async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
|
||||
if (!this.apiKey) throw new Error("OpenAI API key missing");
|
||||
const payload = {model, messages, stream: true, max_tokens: 1024};
|
||||
const controller = new AbortController();
|
||||
|
||||
const res = await fetch("https://api.openai.com/v1/chat/completions", {
|
||||
method: "POST",
|
||||
headers: {
|
||||
"Content-Type": "application/json",
|
||||
Authorization: `Bearer ${this.apiKey}`,
|
||||
},
|
||||
body: JSON.stringify(payload),
|
||||
signal: controller.signal,
|
||||
});
|
||||
if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
|
||||
const reader = res.body.getReader();
|
||||
const dec = new TextDecoder();
|
||||
|
||||
let buf = "";
|
||||
while (true) {
|
||||
const {done, value} = await reader.read();
|
||||
if (done) break;
|
||||
buf += dec.decode(value, {stream: true});
|
||||
for (const line of buf.split("\n")) {
|
||||
if (!line.startsWith("data: ")) continue;
|
||||
if (line.includes("[DONE]")) return;
|
||||
const json = JSON.parse(line.slice(6));
|
||||
const delta = json.choices?.[0]?.delta?.content;
|
||||
if (delta) yield delta;
|
||||
}
|
||||
buf = buf.endsWith("\n") ? "" : buf; // keep partial line
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
window.API = new ApiHandler();
|
||||
|
||||
1171
docs/apps/linkdin/templates/graph_view_template.html
Normal file
1171
docs/apps/linkdin/templates/graph_view_template.html
Normal file
File diff suppressed because it is too large
Load Diff
51
docs/codebase/browser.md
Normal file
51
docs/codebase/browser.md
Normal file
@@ -0,0 +1,51 @@
|
||||
### browser_manager.py
|
||||
|
||||
| Function | What it does |
|
||||
|---|---|
|
||||
| `ManagedBrowser.build_browser_flags` | Returns baseline Chromium CLI flags, disables GPU and sandbox, plugs locale, timezone, stealth tweaks, and any extras from `BrowserConfig`. |
|
||||
| `ManagedBrowser.__init__` | Stores config and logger, creates temp dir, preps internal state. |
|
||||
| `ManagedBrowser.start` | Spawns or connects to the Chromium process, returns its CDP endpoint plus the `subprocess.Popen` handle. |
|
||||
| `ManagedBrowser._initial_startup_check` | Pings the CDP endpoint once to be sure the browser is alive, raises if not. |
|
||||
| `ManagedBrowser._monitor_browser_process` | Async-loops on the subprocess, logs exits or crashes, restarts if policy allows. |
|
||||
| `ManagedBrowser._get_browser_path_WIP` | Old helper that maps OS + browser type to an executable path. |
|
||||
| `ManagedBrowser._get_browser_path` | Current helper, checks env vars, Playwright cache, and OS defaults for the real executable. |
|
||||
| `ManagedBrowser._get_browser_args` | Builds the final CLI arg list by merging user flags, stealth flags, and defaults. |
|
||||
| `ManagedBrowser.cleanup` | Terminates the browser, stops monitors, deletes the temp dir. |
|
||||
| `ManagedBrowser.create_profile` | Opens a visible browser so a human can log in, then zips the resulting user-data-dir to `~/.crawl4ai/profiles/<name>`. |
|
||||
| `ManagedBrowser.list_profiles` | Thin wrapper, now forwarded to `BrowserProfiler.list_profiles()`. |
|
||||
| `ManagedBrowser.delete_profile` | Thin wrapper, now forwarded to `BrowserProfiler.delete_profile()`. |
|
||||
| `BrowserManager.__init__` | Holds the global Playwright instance, browser handle, config signature cache, session map, and logger. |
|
||||
| `BrowserManager.start` | Boots the underlying `ManagedBrowser`, then spins up the default Playwright browser context with stealth patches. |
|
||||
| `BrowserManager._build_browser_args` | Translates `CrawlerRunConfig` (proxy, UA, timezone, headless flag, etc.) into Playwright `launch_args`. |
|
||||
| `BrowserManager.setup_context` | Applies locale, geolocation, permissions, cookies, and UA overrides on a fresh context. |
|
||||
| `BrowserManager.create_browser_context` | Internal helper that actually calls `browser.new_context(**options)` after running `setup_context`. |
|
||||
| `BrowserManager._make_config_signature` | Hashes the non-ephemeral parts of `CrawlerRunConfig` so contexts can be reused safely. |
|
||||
| `BrowserManager.get_page` | Returns a ready `Page` for a given session id, reusing an existing one or creating a new context/page, injects helper scripts, updates `last_used`. |
|
||||
| `BrowserManager.kill_session` | Force-closes a context/page for a session and removes it from the session map. |
|
||||
| `BrowserManager._cleanup_expired_sessions` | Periodic sweep that drops sessions idle longer than `ttl_seconds`. |
|
||||
| `BrowserManager.close` | Gracefully shuts down all contexts, the browser, Playwright, and background tasks. |
|
||||
|
||||
---
|
||||
|
||||
### browser_profiler.py
|
||||
|
||||
| Function | What it does |
|
||||
|---|---|
|
||||
| `BrowserProfiler.__init__` | Sets up profile folder paths, async logger, and signal handlers. |
|
||||
| `BrowserProfiler.create_profile` | Launches a visible browser with a new user-data-dir for manual login, on exit compresses and stores it as a named profile. |
|
||||
| `BrowserProfiler.cleanup_handler` | General SIGTERM/SIGINT cleanup wrapper that kills child processes. |
|
||||
| `BrowserProfiler.sigint_handler` | Handles Ctrl-C during an interactive session, makes sure the browser shuts down cleanly. |
|
||||
| `BrowserProfiler.listen_for_quit_command` | Async REPL that exits when the user types `q`. |
|
||||
| `BrowserProfiler.list_profiles` | Enumerates `~/.crawl4ai/profiles`, prints profile name, browser type, size, and last modified. |
|
||||
| `BrowserProfiler.get_profile_path` | Returns the absolute path of a profile given its name, or `None` if missing. |
|
||||
| `BrowserProfiler.delete_profile` | Removes a profile folder or a direct path from disk, with optional confirmation prompt. |
|
||||
| `BrowserProfiler.interactive_manager` | Text UI loop for listing, creating, deleting, or launching profiles. |
|
||||
| `BrowserProfiler.launch_standalone_browser` | Starts a non-headless Chromium with remote debugging enabled and keeps it alive for manual tests. |
|
||||
| `BrowserProfiler.get_cdp_json` | Pulls `/json/version` from a CDP endpoint and returns the parsed JSON. |
|
||||
| `BrowserProfiler.launch_builtin_browser` | Spawns a headless Chromium in the background, saves `{wsEndpoint, pid, started_at}` to `~/.crawl4ai/builtin_browser.json`. |
|
||||
| `BrowserProfiler.get_builtin_browser_info` | Reads that JSON file, verifies the PID, and returns browser status info. |
|
||||
| `BrowserProfiler._is_browser_running` | Cross-platform helper that checks if a PID is still alive. |
|
||||
| `BrowserProfiler.kill_builtin_browser` | Terminates the background builtin browser and removes its status file. |
|
||||
| `BrowserProfiler.get_builtin_browser_status` | Returns `{running: bool, wsEndpoint, pid, started_at}` for quick health checks. |
|
||||
|
||||
Let me know what you want to tweak or dive into next.
|
||||
40
docs/codebase/cli.md
Normal file
40
docs/codebase/cli.md
Normal file
@@ -0,0 +1,40 @@
|
||||
### `cli.py` command surface
|
||||
|
||||
| Command | Inputs / flags | What it does |
|
||||
|---|---|---|
|
||||
| **profiles** | *(none)* | Opens the interactive profile manager, lets you list, create, delete saved browser profiles that live in `~/.crawl4ai/profiles`. |
|
||||
| **browser status** | – | Prints whether the always-on *builtin* browser is running, shows its CDP URL, PID, start time. |
|
||||
| **browser stop** | – | Kills the builtin browser and deletes its status file. |
|
||||
| **browser view** | `--url, -u` URL *(optional)* | Pops a visible window of the builtin browser, navigates to `URL` or `about:blank`. |
|
||||
| **config list** | – | Dumps every global setting, showing current value, default, and description. |
|
||||
| **config get** | `key` | Prints the value of a single setting, falls back to default if unset. |
|
||||
| **config set** | `key value` | Persists a new value in the global config (stored under `~/.crawl4ai/config.yml`). |
|
||||
| **examples** | – | Just spits out real-world CLI usage samples. |
|
||||
| **crawl** | `url` *(positional)*<br>`--browser-config,-B` path<br>`--crawler-config,-C` path<br>`--filter-config,-f` path<br>`--extraction-config,-e` path<br>`--json-extract,-j` [desc]\*<br>`--schema,-s` path<br>`--browser,-b` k=v list<br>`--crawler,-c` k=v list<br>`--output,-o` all,json,markdown,md,markdown-fit,md-fit *(default all)*<br>`--output-file,-O` path<br>`--bypass-cache,-b` *(flag, default true — note flag reuse)*<br>`--question,-q` str<br>`--verbose,-v` *(flag)*<br>`--profile,-p` profile-name | One-shot crawl + extraction. Builds `BrowserConfig` and `CrawlerRunConfig` from inline flags or separate YAML/JSON files, runs `AsyncWebCrawler.run()`, can route through a named saved profile and pipe the result to stdout or a file. |
|
||||
| **(default)** | Same flags as **crawl**, plus `--example` | Shortcut so you can type just `crwl https://site.com`. When first arg is not a known sub-command, it falls through to *crawl*. |
|
||||
|
||||
\* `--json-extract/-j` with no value turns on LLM-based JSON extraction using an auto schema, supplying a string lets you prompt-engineer the field descriptions.
|
||||
|
||||
> Quick mental model
|
||||
> `profiles` = manage identities,
|
||||
> `browser ...` = control long-running headless Chrome that all crawls can piggy-back on,
|
||||
> `crawl` = do the actual work,
|
||||
> `config` = tweak global defaults,
|
||||
> everything else is sugar.
|
||||
|
||||
### Quick-fire “profile” usage cheatsheet
|
||||
|
||||
| Scenario | Command (copy-paste ready) | Notes |
|
||||
|---|---|---|
|
||||
| **Launch interactive Profile Manager UI** | `crwl profiles` | Opens TUI with options: 1 List, 2 Create, 3 Delete, 4 Use-to-crawl, 5 Exit. |
|
||||
| **Create a fresh profile** | `crwl profiles` → choose **2** → name it → browser opens → log in → press **q** in terminal | Saves to `~/.crawl4ai/profiles/<name>`. |
|
||||
| **List saved profiles** | `crwl profiles` → choose **1** | Shows name, browser type, size, last-modified. |
|
||||
| **Delete a profile** | `crwl profiles` → choose **3** → pick the profile index → confirm | Removes the folder. |
|
||||
| **Crawl with a profile (default alias)** | `crwl https://site.com/dashboard -p my-profile` | Keeps login cookies, sets `use_managed_browser=true` under the hood. |
|
||||
| **Crawl + verbose JSON output** | `crwl https://site.com -p my-profile -o json -v` | Any other `crawl` flags work the same. |
|
||||
| **Crawl with extra browser tweaks** | `crwl https://site.com -p my-profile -b "headless=true,viewport_width=1680"` | CLI overrides go on top of the profile. |
|
||||
| **Same but via explicit sub-command** | `crwl crawl https://site.com -p my-profile` | Identical to default alias. |
|
||||
| **Use profile from inside Profile Manager** | `crwl profiles` → choose **4** → pick profile → enter URL → follow prompts | Handy when demo-ing to non-CLI folks. |
|
||||
| **One-off crawl with a profile folder path (no name lookup)** | `crwl https://site.com -b "user_data_dir=$HOME/.crawl4ai/profiles/my-profile,use_managed_browser=true"` | Bypasses registry, useful for CI scripts. |
|
||||
| **Launch a dev browser on CDP port with the same identity** | `crwl cdp -d $HOME/.crawl4ai/profiles/my-profile -P 9223` | Lets Puppeteer/Playwright attach for debugging. |
|
||||
|
||||
@@ -383,29 +383,31 @@ async def main():
|
||||
scroll_delay=0.2,
|
||||
)
|
||||
|
||||
# # Execute market data extraction
|
||||
# results: List[CrawlResult] = await crawler.arun(
|
||||
# url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
# )
|
||||
# Execute market data extraction
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
)
|
||||
|
||||
# # Process results
|
||||
# raw_df = pd.DataFrame()
|
||||
# for result in results:
|
||||
# if result.success and result.media["tables"]:
|
||||
# # Extract primary market table
|
||||
# # DataFrame
|
||||
# raw_df = pd.DataFrame(
|
||||
# result.media["tables"][0]["rows"],
|
||||
# columns=result.media["tables"][0]["headers"],
|
||||
# )
|
||||
# break
|
||||
# Process results
|
||||
raw_df = pd.DataFrame()
|
||||
for result in results:
|
||||
# Use the new tables field, falling back to media["tables"] for backward compatibility
|
||||
tables = result.tables if hasattr(result, "tables") and result.tables else result.media.get("tables", [])
|
||||
if result.success and tables:
|
||||
# Extract primary market table
|
||||
# DataFrame
|
||||
raw_df = pd.DataFrame(
|
||||
tables[0]["rows"],
|
||||
columns=tables[0]["headers"],
|
||||
)
|
||||
break
|
||||
|
||||
|
||||
# This is for debugging only
|
||||
# ////// Remove this in production from here..
|
||||
# Save raw data for debugging
|
||||
# raw_df.to_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv", index=False)
|
||||
# print("🔍 Raw data saved to 'raw_crypto_data.csv'")
|
||||
raw_df.to_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv", index=False)
|
||||
print("🔍 Raw data saved to 'raw_crypto_data.csv'")
|
||||
|
||||
# Read from file for debugging
|
||||
raw_df = pd.read_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv")
|
||||
|
||||
1317
docs/examples/docker/demo_docker_api.py
Normal file
1317
docs/examples/docker/demo_docker_api.py
Normal file
File diff suppressed because it is too large
Load Diff
149
docs/examples/docker/demo_docker_polling.py
Normal file
149
docs/examples/docker/demo_docker_polling.py
Normal file
@@ -0,0 +1,149 @@
|
||||
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
demo_docker_polling.py
|
||||
Quick sanity-check for the asynchronous crawl job endpoints:
|
||||
|
||||
• POST /crawl/job – enqueue work, get task_id
|
||||
• GET /crawl/job/{id} – poll status / fetch result
|
||||
|
||||
The style matches demo_docker_api.py (console.rule banners, helper
|
||||
functions, coloured status lines). Adjust BASE_URL as needed.
|
||||
|
||||
Run: python demo_docker_polling.py
|
||||
"""
|
||||
|
||||
import asyncio, json, os, time, urllib.parse
|
||||
from typing import Dict, List
|
||||
|
||||
import httpx
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
from rich.syntax import Syntax
|
||||
|
||||
console = Console()
|
||||
BASE_URL = os.getenv("BASE_URL", "http://localhost:11234")
|
||||
SIMPLE_URL = "https://example.org"
|
||||
LINKS_URL = "https://httpbin.org/links/10/1"
|
||||
|
||||
# --- helpers --------------------------------------------------------------
|
||||
|
||||
|
||||
def print_payload(payload: Dict):
|
||||
console.print(Panel(Syntax(json.dumps(payload, indent=2),
|
||||
"json", theme="monokai", line_numbers=False),
|
||||
title="Payload", border_style="cyan", expand=False))
|
||||
|
||||
|
||||
async def check_server_health(client: httpx.AsyncClient) -> bool:
|
||||
try:
|
||||
resp = await client.get("/health")
|
||||
if resp.is_success:
|
||||
console.print("[green]Server healthy[/]")
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
console.print("[bold red]Server is not responding on /health[/]")
|
||||
return False
|
||||
|
||||
|
||||
async def poll_for_result(client: httpx.AsyncClient, task_id: str,
|
||||
poll_interval: float = 1.5, timeout: float = 90.0):
|
||||
"""Hit /crawl/job/{id} until COMPLETED/FAILED or timeout."""
|
||||
start = time.time()
|
||||
while True:
|
||||
resp = await client.get(f"/crawl/job/{task_id}")
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
status = data.get("status")
|
||||
if status.upper() in ("COMPLETED", "FAILED"):
|
||||
return data
|
||||
if time.time() - start > timeout:
|
||||
raise TimeoutError(f"Task {task_id} did not finish in {timeout}s")
|
||||
await asyncio.sleep(poll_interval)
|
||||
|
||||
|
||||
# --- demo functions -------------------------------------------------------
|
||||
|
||||
|
||||
async def demo_poll_single_url(client: httpx.AsyncClient):
|
||||
payload = {
|
||||
"urls": [SIMPLE_URL],
|
||||
"browser_config": {"type": "BrowserConfig",
|
||||
"params": {"headless": True}},
|
||||
"crawler_config": {"type": "CrawlerRunConfig",
|
||||
"params": {"cache_mode": "BYPASS"}}
|
||||
}
|
||||
|
||||
console.rule("[bold blue]Demo A: /crawl/job Single URL[/]", style="blue")
|
||||
print_payload(payload)
|
||||
|
||||
# enqueue
|
||||
resp = await client.post("/crawl/job", json=payload)
|
||||
console.print(f"Enqueue status: [bold]{resp.status_code}[/]")
|
||||
resp.raise_for_status()
|
||||
task_id = resp.json()["task_id"]
|
||||
console.print(f"Task ID: [yellow]{task_id}[/]")
|
||||
|
||||
# poll
|
||||
console.print("Polling…")
|
||||
result = await poll_for_result(client, task_id)
|
||||
console.print(Panel(Syntax(json.dumps(result, indent=2),
|
||||
"json", theme="fruity"),
|
||||
title="Final result", border_style="green"))
|
||||
if result["status"] == "COMPLETED":
|
||||
console.print("[green]✅ Crawl succeeded[/]")
|
||||
else:
|
||||
console.print("[red]❌ Crawl failed[/]")
|
||||
|
||||
|
||||
async def demo_poll_multi_url(client: httpx.AsyncClient):
|
||||
payload = {
|
||||
"urls": [SIMPLE_URL, LINKS_URL],
|
||||
"browser_config": {"type": "BrowserConfig",
|
||||
"params": {"headless": True}},
|
||||
"crawler_config": {"type": "CrawlerRunConfig",
|
||||
"params": {"cache_mode": "BYPASS"}}
|
||||
}
|
||||
|
||||
console.rule("[bold magenta]Demo B: /crawl/job Multi-URL[/]",
|
||||
style="magenta")
|
||||
print_payload(payload)
|
||||
|
||||
resp = await client.post("/crawl/job", json=payload)
|
||||
console.print(f"Enqueue status: [bold]{resp.status_code}[/]")
|
||||
resp.raise_for_status()
|
||||
task_id = resp.json()["task_id"]
|
||||
console.print(f"Task ID: [yellow]{task_id}[/]")
|
||||
|
||||
console.print("Polling…")
|
||||
result = await poll_for_result(client, task_id)
|
||||
console.print(Panel(Syntax(json.dumps(result, indent=2),
|
||||
"json", theme="fruity"),
|
||||
title="Final result", border_style="green"))
|
||||
if result["status"] == "COMPLETED":
|
||||
console.print(
|
||||
f"[green]✅ {len(json.loads(result['result'])['results'])} URLs crawled[/]")
|
||||
else:
|
||||
console.print("[red]❌ Crawl failed[/]")
|
||||
|
||||
|
||||
# --- main runner ----------------------------------------------------------
|
||||
|
||||
|
||||
async def main_demo():
|
||||
async with httpx.AsyncClient(base_url=BASE_URL, timeout=300.0) as client:
|
||||
if not await check_server_health(client):
|
||||
return
|
||||
await demo_poll_single_url(client)
|
||||
await demo_poll_multi_url(client)
|
||||
console.rule("[bold green]Polling demos complete[/]", style="green")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
asyncio.run(main_demo())
|
||||
except KeyboardInterrupt:
|
||||
console.print("\n[yellow]Interrupted by user[/]")
|
||||
except Exception:
|
||||
console.print_exception(show_locals=False)
|
||||
@@ -12,9 +12,10 @@ We’ve introduced a new feature that effortlessly handles even the biggest page
|
||||
|
||||
**Simple Example:**
|
||||
```python
|
||||
import os, sys
|
||||
import os
|
||||
import sys
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
|
||||
|
||||
# Adjust paths as needed
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
@@ -26,9 +27,11 @@ async def main():
|
||||
# Request both PDF and screenshot
|
||||
result = await crawler.arun(
|
||||
url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
pdf=True,
|
||||
screenshot=True
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
pdf=True,
|
||||
screenshot=True
|
||||
)
|
||||
)
|
||||
|
||||
if result.success:
|
||||
@@ -40,9 +43,8 @@ async def main():
|
||||
|
||||
# Save PDF
|
||||
if result.pdf:
|
||||
pdf_bytes = b64decode(result.pdf)
|
||||
with open(os.path.join(__location__, "page.pdf"), "wb") as f:
|
||||
f.write(pdf_bytes)
|
||||
f.write(result.pdf)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
|
||||
@@ -3,45 +3,24 @@ from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
CacheMode,
|
||||
DefaultMarkdownGenerator,
|
||||
PruningContentFilter,
|
||||
CrawlResult
|
||||
)
|
||||
|
||||
async def example_cdp():
|
||||
browser_conf = BrowserConfig(
|
||||
headless=False,
|
||||
cdp_url="http://localhost:9223"
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
session_id="test",
|
||||
js_code = """(() => { return {"result": "Hello World!"} })()""",
|
||||
js_only=True
|
||||
)
|
||||
async with AsyncWebCrawler(
|
||||
config=browser_conf,
|
||||
verbose=True,
|
||||
) as crawler:
|
||||
result : CrawlResult = await crawler.arun(
|
||||
url="https://www.helloworld.org",
|
||||
config=crawler_config,
|
||||
)
|
||||
print(result.js_execution_result)
|
||||
|
||||
|
||||
async def main():
|
||||
browser_config = BrowserConfig(headless=True, verbose=True)
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
)
|
||||
content_filter=PruningContentFilter()
|
||||
),
|
||||
)
|
||||
result : CrawlResult = await crawler.arun(
|
||||
result: CrawlResult = await crawler.arun(
|
||||
url="https://www.helloworld.org", config=crawler_config
|
||||
)
|
||||
print(result.markdown.raw_markdown[:500])
|
||||
|
||||
64
docs/examples/markdown/content_source_example.py
Normal file
64
docs/examples/markdown/content_source_example.py
Normal file
@@ -0,0 +1,64 @@
|
||||
"""
|
||||
Example showing how to use the content_source parameter to control HTML input for markdown generation.
|
||||
"""
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
|
||||
|
||||
async def demo_content_source():
|
||||
"""Demonstrates different content_source options for markdown generation."""
|
||||
url = "https://example.com" # Simple demo site
|
||||
|
||||
print("Crawling with different content_source options...")
|
||||
|
||||
# --- Example 1: Default Behavior (cleaned_html) ---
|
||||
# This uses the HTML after it has been processed by the scraping strategy
|
||||
# The HTML is cleaned, simplified, and optimized for readability
|
||||
default_generator = DefaultMarkdownGenerator() # content_source="cleaned_html" is default
|
||||
default_config = CrawlerRunConfig(markdown_generator=default_generator)
|
||||
|
||||
# --- Example 2: Raw HTML ---
|
||||
# This uses the original HTML directly from the webpage
|
||||
# Preserves more original content but may include navigation, ads, etc.
|
||||
raw_generator = DefaultMarkdownGenerator(content_source="raw_html")
|
||||
raw_config = CrawlerRunConfig(markdown_generator=raw_generator)
|
||||
|
||||
# --- Example 3: Fit HTML ---
|
||||
# This uses preprocessed HTML optimized for schema extraction
|
||||
# Better for structured data extraction but may lose some formatting
|
||||
fit_generator = DefaultMarkdownGenerator(content_source="fit_html")
|
||||
fit_config = CrawlerRunConfig(markdown_generator=fit_generator)
|
||||
|
||||
# Execute all three crawlers in sequence
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Default (cleaned_html)
|
||||
result_default = await crawler.arun(url=url, config=default_config)
|
||||
|
||||
# Raw HTML
|
||||
result_raw = await crawler.arun(url=url, config=raw_config)
|
||||
|
||||
# Fit HTML
|
||||
result_fit = await crawler.arun(url=url, config=fit_config)
|
||||
|
||||
# Print a summary of the results
|
||||
print("\nMarkdown Generation Results:\n")
|
||||
|
||||
print("1. Default (cleaned_html):")
|
||||
print(f" Length: {len(result_default.markdown.raw_markdown)} chars")
|
||||
print(f" First 80 chars: {result_default.markdown.raw_markdown[:80]}...\n")
|
||||
|
||||
print("2. Raw HTML:")
|
||||
print(f" Length: {len(result_raw.markdown.raw_markdown)} chars")
|
||||
print(f" First 80 chars: {result_raw.markdown.raw_markdown[:80]}...\n")
|
||||
|
||||
print("3. Fit HTML:")
|
||||
print(f" Length: {len(result_fit.markdown.raw_markdown)} chars")
|
||||
print(f" First 80 chars: {result_fit.markdown.raw_markdown[:80]}...\n")
|
||||
|
||||
# Demonstrate differences in output
|
||||
print("\nKey Takeaways:")
|
||||
print("- cleaned_html: Best for readable, focused content")
|
||||
print("- raw_html: Preserves more original content, but may include noise")
|
||||
print("- fit_html: Optimized for schema extraction and structured data")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(demo_content_source())
|
||||
42
docs/examples/markdown/content_source_short_example.py
Normal file
42
docs/examples/markdown/content_source_short_example.py
Normal file
@@ -0,0 +1,42 @@
|
||||
"""
|
||||
Example demonstrating how to use the content_source parameter in MarkdownGenerationStrategy
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
|
||||
|
||||
async def demo_markdown_source_config():
|
||||
print("\n=== Demo: Configuring Markdown Source ===")
|
||||
|
||||
# Example 1: Generate markdown from cleaned HTML (default behavior)
|
||||
cleaned_md_generator = DefaultMarkdownGenerator(content_source="cleaned_html")
|
||||
config_cleaned = CrawlerRunConfig(markdown_generator=cleaned_md_generator)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result_cleaned = await crawler.arun(url="https://example.com", config=config_cleaned)
|
||||
print("Markdown from Cleaned HTML (default):")
|
||||
print(f" Length: {len(result_cleaned.markdown.raw_markdown)}")
|
||||
print(f" Start: {result_cleaned.markdown.raw_markdown[:100]}...")
|
||||
|
||||
# Example 2: Generate markdown directly from raw HTML
|
||||
raw_md_generator = DefaultMarkdownGenerator(content_source="raw_html")
|
||||
config_raw = CrawlerRunConfig(markdown_generator=raw_md_generator)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result_raw = await crawler.arun(url="https://example.com", config=config_raw)
|
||||
print("\nMarkdown from Raw HTML:")
|
||||
print(f" Length: {len(result_raw.markdown.raw_markdown)}")
|
||||
print(f" Start: {result_raw.markdown.raw_markdown[:100]}...")
|
||||
|
||||
# Example 3: Generate markdown from preprocessed 'fit' HTML
|
||||
fit_md_generator = DefaultMarkdownGenerator(content_source="fit_html")
|
||||
config_fit = CrawlerRunConfig(markdown_generator=fit_md_generator)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result_fit = await crawler.arun(url="https://example.com", config=config_fit)
|
||||
print("\nMarkdown from Fit HTML:")
|
||||
print(f" Length: {len(result_fit.markdown.raw_markdown)}")
|
||||
print(f" Start: {result_fit.markdown.raw_markdown[:100]}...")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(demo_markdown_source_config())
|
||||
@@ -4,7 +4,7 @@ import json
|
||||
import base64
|
||||
from pathlib import Path
|
||||
from typing import List
|
||||
from crawl4ai.proxy_strategy import ProxyConfig
|
||||
from crawl4ai import ProxyConfig
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, CrawlResult
|
||||
from crawl4ai import RoundRobinProxyStrategy
|
||||
|
||||
143
docs/examples/regex_extraction_quickstart.py
Normal file
143
docs/examples/regex_extraction_quickstart.py
Normal file
@@ -0,0 +1,143 @@
|
||||
# == File: regex_extraction_quickstart.py ==
|
||||
"""
|
||||
Mini–quick-start for RegexExtractionStrategy
|
||||
────────────────────────────────────────────
|
||||
3 bite-sized demos that parallel the style of *quickstart_examples_set_1.py*:
|
||||
|
||||
1. **Default catalog** – scrape a page and pull out e-mails / phones / URLs, etc.
|
||||
2. **Custom pattern** – add your own regex at instantiation time.
|
||||
3. **LLM-assisted schema** – ask the model to write a pattern, cache it, then
|
||||
run extraction _without_ further LLM calls.
|
||||
|
||||
Run the whole thing with::
|
||||
|
||||
python regex_extraction_quickstart.py
|
||||
"""
|
||||
|
||||
import os, json, asyncio
|
||||
from pathlib import Path
|
||||
from typing import List
|
||||
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
CrawlResult,
|
||||
RegexExtractionStrategy,
|
||||
LLMConfig,
|
||||
)
|
||||
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
# 1. Default-catalog extraction
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
async def demo_regex_default() -> None:
|
||||
print("\n=== 1. Regex extraction – default patterns ===")
|
||||
|
||||
url = "https://www.iana.org/domains/example" # has e-mail + URLs
|
||||
strategy = RegexExtractionStrategy(
|
||||
pattern = RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
|
||||
)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result: CrawlResult = await crawler.arun(url, config=config)
|
||||
|
||||
print(f"Fetched {url} - success={result.success}")
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for d in data[:10]:
|
||||
print(f" {d['label']:<12} {d['value']}")
|
||||
print(f"... total matches: {len(data)}")
|
||||
else:
|
||||
print(" !!! crawl failed")
|
||||
|
||||
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
# 2. Custom pattern override / extension
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
async def demo_regex_custom() -> None:
|
||||
print("\n=== 2. Regex extraction – custom price pattern ===")
|
||||
|
||||
url = "https://www.apple.com/shop/buy-mac/macbook-pro"
|
||||
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
|
||||
|
||||
strategy = RegexExtractionStrategy(custom = price_pattern)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result: CrawlResult = await crawler.arun(url, config=config)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for d in data:
|
||||
print(f" {d['value']}")
|
||||
if not data:
|
||||
print(" (No prices found - page layout may have changed)")
|
||||
else:
|
||||
print(" !!! crawl failed")
|
||||
|
||||
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
# 3. One-shot LLM pattern generation, then fast extraction
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
async def demo_regex_generate_pattern() -> None:
|
||||
print("\n=== 3. generate_pattern → regex extraction ===")
|
||||
|
||||
cache_dir = Path(__file__).parent / "tmp"
|
||||
cache_dir.mkdir(exist_ok=True)
|
||||
pattern_file = cache_dir / "price_pattern.json"
|
||||
|
||||
url = "https://www.lazada.sg/tag/smartphone/"
|
||||
|
||||
# ── 3-A. build or load the cached pattern
|
||||
if pattern_file.exists():
|
||||
pattern = json.load(pattern_file.open(encoding="utf-8"))
|
||||
print("Loaded cached pattern:", pattern)
|
||||
else:
|
||||
print("Generating pattern via LLM…")
|
||||
|
||||
llm_cfg = LLMConfig(
|
||||
provider="openai/gpt-4o-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
)
|
||||
|
||||
# pull one sample page as HTML context
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
html = (await crawler.arun(url)).fit_html
|
||||
|
||||
pattern = RegexExtractionStrategy.generate_pattern(
|
||||
label="price",
|
||||
html=html,
|
||||
query="Prices in Malaysian Ringgit (e.g. RM1,299.00 or RM200)",
|
||||
llm_config=llm_cfg,
|
||||
)
|
||||
|
||||
json.dump(pattern, pattern_file.open("w", encoding="utf-8"), indent=2)
|
||||
print("Saved pattern:", pattern_file)
|
||||
|
||||
# ── 3-B. extraction pass – zero LLM calls
|
||||
strategy = RegexExtractionStrategy(custom=pattern)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy, delay_before_return_html=3)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result: CrawlResult = await crawler.arun(url, config=config)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for d in data[:15]:
|
||||
print(f" {d['value']}")
|
||||
print(f"... total matches: {len(data)}")
|
||||
else:
|
||||
print(" !!! crawl failed")
|
||||
|
||||
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
# Entrypoint
|
||||
# ────────────────────────────────────────────────────────────────────────────
|
||||
async def main() -> None:
|
||||
# await demo_regex_default()
|
||||
# await demo_regex_custom()
|
||||
await demo_regex_generate_pattern()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
38
docs/examples/session_id_example.py
Normal file
38
docs/examples/session_id_example.py
Normal file
@@ -0,0 +1,38 @@
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
DefaultMarkdownGenerator,
|
||||
PruningContentFilter,
|
||||
CrawlResult
|
||||
)
|
||||
|
||||
|
||||
|
||||
async def main():
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler_config = CrawlerRunConfig(
|
||||
session_id= "hello_world", # This help us to use the same page
|
||||
)
|
||||
result : CrawlResult = await crawler.arun(
|
||||
url="https://www.helloworld.org", config=crawler_config
|
||||
)
|
||||
# Add a breakpoint here, then you will the page is open and browser is not closed
|
||||
print(result.markdown.raw_markdown[:500])
|
||||
|
||||
new_config = crawler_config.clone(js_code=["(() => ({'data':'hello'}))()"], js_only=True)
|
||||
result : CrawlResult = await crawler.arun( # This time there is no fetch and this only executes JS in the same opened page
|
||||
url="https://www.helloworld.org", config= new_config
|
||||
)
|
||||
print(result.js_execution_result) # You should see {'data':'hello'} in the console
|
||||
|
||||
# Get direct access to Playwright paege object. This works only if you use the same session_id and pass same config
|
||||
page, context = crawler.crawler_strategy.get_page(new_config)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -13,7 +13,7 @@ from crawl4ai.deep_crawling import (
|
||||
)
|
||||
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
|
||||
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
|
||||
from crawl4ai.proxy_strategy import ProxyConfig
|
||||
from crawl4ai import ProxyConfig
|
||||
from crawl4ai import RoundRobinProxyStrategy
|
||||
from crawl4ai.content_filter_strategy import LLMContentFilter
|
||||
from crawl4ai import DefaultMarkdownGenerator
|
||||
|
||||
70
docs/examples/use_geo_location.py
Normal file
70
docs/examples/use_geo_location.py
Normal file
@@ -0,0 +1,70 @@
|
||||
# use_geo_location.py
|
||||
"""
|
||||
Example: override locale, timezone, and geolocation using Crawl4ai patterns.
|
||||
|
||||
This demo uses `AsyncWebCrawler.arun()` to fetch a page with
|
||||
browser context primed for specific locale, timezone, and GPS,
|
||||
and saves a screenshot for visual verification.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import base64
|
||||
from pathlib import Path
|
||||
from typing import List
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
BrowserConfig,
|
||||
GeolocationConfig,
|
||||
CrawlResult,
|
||||
)
|
||||
|
||||
async def demo_geo_override():
|
||||
"""Demo: Crawl a geolocation-test page with overrides and screenshot."""
|
||||
print("\n=== Geo-Override Crawl ===")
|
||||
|
||||
# 1) Browser setup: use Playwright-managed contexts
|
||||
browser_cfg = BrowserConfig(
|
||||
headless=False,
|
||||
viewport_width=1280,
|
||||
viewport_height=720,
|
||||
use_managed_browser=False,
|
||||
)
|
||||
|
||||
# 2) Run config: include locale, timezone_id, geolocation, and screenshot
|
||||
run_cfg = CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo", # test page that shows your location
|
||||
locale="en-US", # Accept-Language & UI locale
|
||||
timezone_id="America/Los_Angeles", # JS Date()/Intl timezone
|
||||
geolocation=GeolocationConfig( # override GPS coords
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0,
|
||||
),
|
||||
screenshot=True, # capture screenshot after load
|
||||
session_id="geo_test", # reuse context if rerunning
|
||||
delay_before_return_html=5
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
# 3) Run crawl (returns list even for single URL)
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url=run_cfg.url,
|
||||
config=run_cfg,
|
||||
)
|
||||
result = results[0]
|
||||
|
||||
# 4) Save screenshot and report path
|
||||
if result.screenshot:
|
||||
__current_dir = Path(__file__).parent
|
||||
out_dir = __current_dir / "tmp"
|
||||
out_dir.mkdir(exist_ok=True)
|
||||
shot_path = out_dir / "geo_test.png"
|
||||
with open(shot_path, "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print(f"Saved screenshot to {shot_path}")
|
||||
else:
|
||||
print("No screenshot captured, check configuration.")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(demo_geo_override())
|
||||
@@ -263,7 +263,102 @@ See the full example in `docs/examples/identity_based_browsing.py` for a complet
|
||||
|
||||
---
|
||||
|
||||
## 7. Summary
|
||||
## 7. Locale, Timezone, and Geolocation Control
|
||||
|
||||
In addition to using persistent profiles, Crawl4AI supports customizing your browser's locale, timezone, and geolocation settings. These features enhance your identity-based browsing experience by allowing you to control how websites perceive your location and regional settings.
|
||||
|
||||
### Setting Locale and Timezone
|
||||
|
||||
You can set the browser's locale and timezone through `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(
|
||||
# Set browser locale (language and region formatting)
|
||||
locale="fr-FR", # French (France)
|
||||
|
||||
# Set browser timezone
|
||||
timezone_id="Europe/Paris",
|
||||
|
||||
# Other normal options...
|
||||
magic=True,
|
||||
page_timeout=60000
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
**How it works:**
|
||||
- `locale` affects language preferences, date formats, number formats, etc.
|
||||
- `timezone_id` affects JavaScript's Date object and time-related functionality
|
||||
- These settings are applied when creating the browser context and maintained throughout the session
|
||||
|
||||
### Configuring Geolocation
|
||||
|
||||
Control the GPS coordinates reported by the browser's geolocation API:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://maps.google.com", # Or any location-aware site
|
||||
config=CrawlerRunConfig(
|
||||
# Configure precise GPS coordinates
|
||||
geolocation=GeolocationConfig(
|
||||
latitude=48.8566, # Paris coordinates
|
||||
longitude=2.3522,
|
||||
accuracy=100 # Accuracy in meters (optional)
|
||||
),
|
||||
|
||||
# This site will see you as being in Paris
|
||||
page_timeout=60000
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
**Important notes:**
|
||||
- When `geolocation` is specified, the browser is automatically granted permission to access location
|
||||
- Websites using the Geolocation API will receive the exact coordinates you specify
|
||||
- This affects map services, store locators, delivery services, etc.
|
||||
- Combined with the appropriate `locale` and `timezone_id`, you can create a fully consistent location profile
|
||||
|
||||
### Combining with Managed Browsers
|
||||
|
||||
These settings work perfectly with managed browsers for a complete identity solution:
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig,
|
||||
GeolocationConfig
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
use_managed_browser=True,
|
||||
user_data_dir="/path/to/my-profile",
|
||||
browser_type="chromium"
|
||||
)
|
||||
|
||||
crawl_config = CrawlerRunConfig(
|
||||
# Location settings
|
||||
locale="es-MX", # Spanish (Mexico)
|
||||
timezone_id="America/Mexico_City",
|
||||
geolocation=GeolocationConfig(
|
||||
latitude=19.4326, # Mexico City
|
||||
longitude=-99.1332
|
||||
)
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=crawl_config)
|
||||
```
|
||||
|
||||
Combining persistent profiles with precise geolocation and region settings gives you complete control over your digital identity.
|
||||
|
||||
## 8. Summary
|
||||
|
||||
- **Create** your user-data directory either:
|
||||
- By launching Chrome/Chromium externally with `--user-data-dir=/some/path`
|
||||
@@ -271,6 +366,7 @@ See the full example in `docs/examples/identity_based_browsing.py` for a complet
|
||||
- Or through the interactive interface with `profiler.interactive_manager()`
|
||||
- **Log in** or configure sites as needed, then close the browser
|
||||
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`
|
||||
- **Customize** identity aspects with `locale`, `timezone_id`, and `geolocation`
|
||||
- **List and reuse** profiles with `BrowserProfiler.list_profiles()`
|
||||
- **Manage** your profiles with the dedicated `BrowserProfiler` class
|
||||
- Enjoy **persistent** sessions that reflect your real identity
|
||||
|
||||
@@ -10,6 +10,7 @@ class CrawlResult(BaseModel):
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
fit_html: Optional[str] = None # Preprocessed HTML optimized for extraction
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
downloaded_files: Optional[List[str]] = None
|
||||
@@ -50,7 +51,7 @@ if not result.success:
|
||||
```
|
||||
|
||||
### 1.3 **`status_code`** *(Optional[int])*
|
||||
**What**: The page’s HTTP status code (e.g., 200, 404).
|
||||
**What**: The page's HTTP status code (e.g., 200, 404).
|
||||
**Usage**:
|
||||
```python
|
||||
if result.status_code == 404:
|
||||
@@ -82,7 +83,7 @@ if result.response_headers:
|
||||
```
|
||||
|
||||
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
|
||||
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
|
||||
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
|
||||
`subject`, `valid_from`, `valid_until`, etc.
|
||||
**Usage**:
|
||||
```python
|
||||
@@ -109,14 +110,6 @@ print(len(result.html))
|
||||
print(result.cleaned_html[:500]) # Show a snippet
|
||||
```
|
||||
|
||||
### 2.3 **`fit_html`** *(Optional[str])*
|
||||
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
|
||||
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
|
||||
**Usage**:
|
||||
```python
|
||||
if result.markdown.fit_html:
|
||||
print("High-value HTML content:", result.markdown.fit_html[:300])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
@@ -135,7 +128,7 @@ Crawl4AI can convert HTML→Markdown, optionally including:
|
||||
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
|
||||
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
|
||||
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
|
||||
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
|
||||
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
|
||||
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
|
||||
|
||||
**Usage**:
|
||||
@@ -157,7 +150,7 @@ print(result.markdown.raw_markdown[:200])
|
||||
print(result.markdown.fit_markdown)
|
||||
print(result.markdown.fit_html)
|
||||
```
|
||||
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
|
||||
**Important**: "Fit" content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
|
||||
|
||||
---
|
||||
|
||||
@@ -169,7 +162,7 @@ print(result.markdown.fit_html)
|
||||
|
||||
- `src` *(str)*: Media URL
|
||||
- `alt` or `title` *(str)*: Descriptive text
|
||||
- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”
|
||||
- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"
|
||||
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
|
||||
|
||||
**Usage**:
|
||||
@@ -263,7 +256,7 @@ A `DispatchResult` object providing additional concurrency and resource usage in
|
||||
|
||||
- **`task_id`**: A unique identifier for the parallel task.
|
||||
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
|
||||
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task’s execution.
|
||||
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
|
||||
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
|
||||
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
|
||||
|
||||
@@ -358,7 +351,7 @@ async def handle_result(result: CrawlResult):
|
||||
# HTML
|
||||
print("Original HTML size:", len(result.html))
|
||||
print("Cleaned HTML size:", len(result.cleaned_html or ""))
|
||||
|
||||
|
||||
# Markdown output
|
||||
if result.markdown:
|
||||
print("Raw Markdown:", result.markdown.raw_markdown[:300])
|
||||
|
||||
@@ -70,7 +70,7 @@ We group them by category.
|
||||
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
|
||||
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
|
||||
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
|
||||
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
|
||||
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). |
|
||||
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
|
||||
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
|
||||
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
|
||||
@@ -232,6 +232,7 @@ async def main():
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
## 2.4 Compliance & Ethics
|
||||
|
||||
|
||||
@@ -36,6 +36,45 @@ LLMExtractionStrategy(
|
||||
)
|
||||
```
|
||||
|
||||
### RegexExtractionStrategy
|
||||
|
||||
Used for fast pattern-based extraction of common entities using regular expressions.
|
||||
|
||||
```python
|
||||
RegexExtractionStrategy(
|
||||
# Pattern Configuration
|
||||
pattern: IntFlag = RegexExtractionStrategy.Nothing, # Bit flags of built-in patterns to use
|
||||
custom: Optional[Dict[str, str]] = None, # Custom pattern dictionary {label: regex}
|
||||
|
||||
# Input Format
|
||||
input_format: str = "fit_html", # "html", "markdown", "text" or "fit_html"
|
||||
)
|
||||
|
||||
# Built-in Patterns as Bit Flags
|
||||
RegexExtractionStrategy.Email # Email addresses
|
||||
RegexExtractionStrategy.PhoneIntl # International phone numbers
|
||||
RegexExtractionStrategy.PhoneUS # US-format phone numbers
|
||||
RegexExtractionStrategy.Url # HTTP/HTTPS URLs
|
||||
RegexExtractionStrategy.IPv4 # IPv4 addresses
|
||||
RegexExtractionStrategy.IPv6 # IPv6 addresses
|
||||
RegexExtractionStrategy.Uuid # UUIDs
|
||||
RegexExtractionStrategy.Currency # Currency values (USD, EUR, etc)
|
||||
RegexExtractionStrategy.Percentage # Percentage values
|
||||
RegexExtractionStrategy.Number # Numeric values
|
||||
RegexExtractionStrategy.DateIso # ISO format dates
|
||||
RegexExtractionStrategy.DateUS # US format dates
|
||||
RegexExtractionStrategy.Time24h # 24-hour format times
|
||||
RegexExtractionStrategy.PostalUS # US postal codes
|
||||
RegexExtractionStrategy.PostalUK # UK postal codes
|
||||
RegexExtractionStrategy.HexColor # HTML hex color codes
|
||||
RegexExtractionStrategy.TwitterHandle # Twitter handles
|
||||
RegexExtractionStrategy.Hashtag # Hashtags
|
||||
RegexExtractionStrategy.MacAddr # MAC addresses
|
||||
RegexExtractionStrategy.Iban # International bank account numbers
|
||||
RegexExtractionStrategy.CreditCard # Credit card numbers
|
||||
RegexExtractionStrategy.All # All available patterns
|
||||
```
|
||||
|
||||
### CosineStrategy
|
||||
|
||||
Used for content similarity-based extraction and clustering.
|
||||
@@ -156,6 +195,55 @@ result = await crawler.arun(
|
||||
data = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
### Regex Extraction
|
||||
|
||||
```python
|
||||
import json
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy
|
||||
|
||||
# Method 1: Use built-in patterns
|
||||
strategy = RegexExtractionStrategy(
|
||||
pattern = RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
|
||||
)
|
||||
|
||||
# Method 2: Use custom patterns
|
||||
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
|
||||
strategy = RegexExtractionStrategy(custom=price_pattern)
|
||||
|
||||
# Method 3: Generate pattern with LLM assistance (one-time)
|
||||
from crawl4ai import LLMConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Get sample HTML first
|
||||
sample_result = await crawler.arun("https://example.com/products")
|
||||
html = sample_result.fit_html
|
||||
|
||||
# Generate regex pattern once
|
||||
pattern = RegexExtractionStrategy.generate_pattern(
|
||||
label="price",
|
||||
html=html,
|
||||
query="Product prices in USD format",
|
||||
llm_config=LLMConfig(provider="openai/gpt-4o-mini")
|
||||
)
|
||||
|
||||
# Save pattern for reuse
|
||||
import json
|
||||
with open("price_pattern.json", "w") as f:
|
||||
json.dump(pattern, f)
|
||||
|
||||
# Use pattern for extraction (no LLM calls)
|
||||
strategy = RegexExtractionStrategy(custom=pattern)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/products",
|
||||
config=CrawlerRunConfig(extraction_strategy=strategy)
|
||||
)
|
||||
|
||||
# Process results
|
||||
data = json.loads(result.extracted_content)
|
||||
for item in data:
|
||||
print(f"{item['label']}: {item['value']}")
|
||||
```
|
||||
|
||||
### CSS Extraction
|
||||
|
||||
```python
|
||||
@@ -220,12 +308,28 @@ result = await crawler.arun(
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose the Right Strategy**
|
||||
- Use `LLMExtractionStrategy` for complex, unstructured content
|
||||
- Use `JsonCssExtractionStrategy` for well-structured HTML
|
||||
1. **Choose the Right Strategy**
|
||||
- Use `RegexExtractionStrategy` for common data types like emails, phones, URLs, dates
|
||||
- Use `JsonCssExtractionStrategy` for well-structured HTML with consistent patterns
|
||||
- Use `LLMExtractionStrategy` for complex, unstructured content requiring reasoning
|
||||
- Use `CosineStrategy` for content similarity and clustering
|
||||
|
||||
2. **Optimize Chunking**
|
||||
2. **Strategy Selection Guide**
|
||||
```
|
||||
Is the target data a common type (email/phone/date/URL)?
|
||||
→ RegexExtractionStrategy
|
||||
|
||||
Does the page have consistent HTML structure?
|
||||
→ JsonCssExtractionStrategy or JsonXPathExtractionStrategy
|
||||
|
||||
Is the data semantically complex or unstructured?
|
||||
→ LLMExtractionStrategy
|
||||
|
||||
Need to find content similar to a specific topic?
|
||||
→ CosineStrategy
|
||||
```
|
||||
|
||||
3. **Optimize Chunking**
|
||||
```python
|
||||
# For long documents
|
||||
strategy = LLMExtractionStrategy(
|
||||
@@ -234,7 +338,26 @@ result = await crawler.arun(
|
||||
)
|
||||
```
|
||||
|
||||
3. **Handle Errors**
|
||||
4. **Combine Strategies for Best Performance**
|
||||
```python
|
||||
# First pass: Extract structure with CSS
|
||||
css_strategy = JsonCssExtractionStrategy(product_schema)
|
||||
css_result = await crawler.arun(url, config=CrawlerRunConfig(extraction_strategy=css_strategy))
|
||||
product_data = json.loads(css_result.extracted_content)
|
||||
|
||||
# Second pass: Extract specific fields with regex
|
||||
descriptions = [product["description"] for product in product_data]
|
||||
regex_strategy = RegexExtractionStrategy(
|
||||
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS,
|
||||
custom={"dimension": r"\d+x\d+x\d+ (?:cm|in)"}
|
||||
)
|
||||
|
||||
# Process descriptions with regex
|
||||
for text in descriptions:
|
||||
matches = regex_strategy.extract("", text) # Direct extraction
|
||||
```
|
||||
|
||||
5. **Handle Errors**
|
||||
```python
|
||||
try:
|
||||
result = await crawler.arun(
|
||||
@@ -247,11 +370,31 @@ result = await crawler.arun(
|
||||
print(f"Extraction failed: {e}")
|
||||
```
|
||||
|
||||
4. **Monitor Performance**
|
||||
6. **Monitor Performance**
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
verbose=True, # Enable logging
|
||||
word_count_threshold=20, # Filter short content
|
||||
top_k=5 # Limit results
|
||||
)
|
||||
```
|
||||
|
||||
7. **Cache Generated Patterns**
|
||||
```python
|
||||
# For RegexExtractionStrategy pattern generation
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
cache_dir = Path("./pattern_cache")
|
||||
cache_dir.mkdir(exist_ok=True)
|
||||
pattern_file = cache_dir / "product_pattern.json"
|
||||
|
||||
if pattern_file.exists():
|
||||
with open(pattern_file) as f:
|
||||
pattern = json.load(f)
|
||||
else:
|
||||
# Generate once with LLM
|
||||
pattern = RegexExtractionStrategy.generate_pattern(...)
|
||||
with open(pattern_file, "w") as f:
|
||||
json.dump(pattern, f)
|
||||
```
|
||||
@@ -361,8 +361,10 @@ A code snippet: \`crawler.run()\`. Check the [quickstart](/core/quickstart).`;
|
||||
chatMessages.innerHTML = ""; // Start with clean slate for query
|
||||
if (!isFromQuery) {
|
||||
// Show welcome only if manually started
|
||||
// chatMessages.innerHTML =
|
||||
// '<div class="message ai-message welcome-message">Started a new chat! Ask me anything about Crawl4AI.</div>';
|
||||
chatMessages.innerHTML =
|
||||
'<div class="message ai-message welcome-message">Started a new chat! Ask me anything about Crawl4AI.</div>';
|
||||
'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
|
||||
}
|
||||
addCitations([]); // Clear citations
|
||||
updateCitationsDisplay(); // Clear UI
|
||||
@@ -504,8 +506,10 @@ A code snippet: \`crawler.run()\`. Check the [quickstart](/core/quickstart).`;
|
||||
addMessageToChat(message, false);
|
||||
});
|
||||
if (messages.length === 0) {
|
||||
// chatMessages.innerHTML =
|
||||
// '<div class="message ai-message welcome-message">Chat history loaded. Ask a question!</div>';
|
||||
chatMessages.innerHTML =
|
||||
'<div class="message ai-message welcome-message">Chat history loaded. Ask a question!</div>';
|
||||
'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
|
||||
}
|
||||
// Scroll to bottom after loading messages
|
||||
scrollToBottom();
|
||||
|
||||
@@ -36,7 +36,7 @@
|
||||
<div id="chat-input-area">
|
||||
<!-- Loading indicator for general waiting (optional) -->
|
||||
<!-- <div class="loading-indicator" style="display: none;">Thinking...</div> -->
|
||||
<textarea id="chat-input" placeholder="Ask about Crawl4AI..." rows="2"></textarea>
|
||||
<textarea id="chat-input" placeholder="We will roll out this feature very soon." rows="2" disabled></textarea>
|
||||
<button id="send-button">Send</button>
|
||||
</div>
|
||||
</main>
|
||||
|
||||
@@ -64,7 +64,7 @@ body {
|
||||
/* Apply side padding within the centered block */
|
||||
padding-left: calc(var(--global-space) * 2);
|
||||
padding-right: calc(var(--global-space) * 2);
|
||||
/* Add margin-left to clear the fixed sidebar */
|
||||
/* Add margin-left to clear the fixed sidebar - ONLY ON DESKTOP */
|
||||
margin-left: var(--sidebar-width);
|
||||
}
|
||||
|
||||
@@ -81,7 +81,7 @@ body {
|
||||
z-index: 900;
|
||||
padding: 1em calc(var(--global-space) * 2);
|
||||
padding-bottom: 2em;
|
||||
/* transition: left var(--layout-transition-speed) ease-in-out; */
|
||||
transition: left var(--layout-transition-speed) ease-in-out;
|
||||
}
|
||||
|
||||
/* --- 2. Main Content Area (Within Centered Grid) --- */
|
||||
@@ -188,21 +188,133 @@ footer {
|
||||
}
|
||||
}
|
||||
|
||||
/* --- Mobile Menu Styles --- */
|
||||
.mobile-menu-toggle {
|
||||
display: none; /* Hidden by default, shown in mobile */
|
||||
background: none;
|
||||
border: none;
|
||||
padding: 10px;
|
||||
cursor: pointer;
|
||||
z-index: 1200;
|
||||
margin-right: 10px;
|
||||
position: absolute;
|
||||
left: 10px;
|
||||
top: 50%;
|
||||
transform: translateY(-50%);
|
||||
/* Make sure it doesn't get moved */
|
||||
min-width: 30px;
|
||||
min-height: 30px;
|
||||
}
|
||||
|
||||
.hamburger-line {
|
||||
display: block;
|
||||
width: 22px;
|
||||
height: 2px;
|
||||
margin: 5px 0;
|
||||
background-color: var(--font-color);
|
||||
transition: transform 0.3s, opacity 0.3s;
|
||||
}
|
||||
|
||||
/* Hamburger animation */
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(1) {
|
||||
transform: translateY(7px) rotate(45deg);
|
||||
}
|
||||
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(2) {
|
||||
opacity: 0;
|
||||
}
|
||||
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(3) {
|
||||
transform: translateY(-7px) rotate(-45deg);
|
||||
}
|
||||
|
||||
.mobile-menu-close {
|
||||
display: none; /* Hidden by default, shown in mobile */
|
||||
position: absolute;
|
||||
top: 10px;
|
||||
right: 10px;
|
||||
background: none;
|
||||
border: none;
|
||||
color: var(--font-color);
|
||||
font-size: 24px;
|
||||
cursor: pointer;
|
||||
z-index: 1200;
|
||||
padding: 5px 10px;
|
||||
}
|
||||
|
||||
.mobile-menu-backdrop {
|
||||
position: fixed;
|
||||
top: 0;
|
||||
left: 0;
|
||||
right: 0;
|
||||
bottom: 0;
|
||||
background-color: rgba(0, 0, 0, 0.7);
|
||||
z-index: 1050;
|
||||
}
|
||||
|
||||
/* --- Small screens: Hide left sidebar, full width content & footer --- */
|
||||
@media screen and (max-width: 768px) {
|
||||
/* Hide the terminal-menu from theme */
|
||||
.terminal-menu {
|
||||
display: none !important;
|
||||
}
|
||||
|
||||
/* Add padding to site name to prevent hamburger overlap */
|
||||
.terminal-mkdocs-site-name,
|
||||
.terminal-logo a,
|
||||
.terminal-nav .logo {
|
||||
padding-left: 40px !important;
|
||||
white-space: nowrap;
|
||||
overflow: hidden;
|
||||
text-overflow: ellipsis;
|
||||
}
|
||||
|
||||
/* Show mobile menu toggle button */
|
||||
.mobile-menu-toggle {
|
||||
display: block;
|
||||
}
|
||||
|
||||
/* Show mobile menu close button */
|
||||
.mobile-menu-close {
|
||||
display: block;
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel {
|
||||
left: calc(-1 * var(--sidebar-width));
|
||||
left: -100%; /* Hide completely off-screen */
|
||||
z-index: 1100;
|
||||
box-shadow: 2px 0 10px rgba(0,0,0,0.3);
|
||||
top: 0; /* Start from top edge */
|
||||
height: 100%; /* Full height */
|
||||
transition: left 0.3s ease-in-out;
|
||||
padding-top: 50px; /* Space for close button */
|
||||
overflow-y: auto;
|
||||
width: 85%; /* Wider on mobile */
|
||||
max-width: 320px; /* Maximum width */
|
||||
background-color: var(--background-color); /* Ensure solid background */
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel.sidebar-visible {
|
||||
left: 0;
|
||||
}
|
||||
|
||||
/* Make navigation links more touch-friendly */
|
||||
#terminal-mkdocs-side-panel a {
|
||||
padding: 6px 15px;
|
||||
display: block;
|
||||
/* No border as requested */
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel ul {
|
||||
padding-left: 0;
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel ul ul a {
|
||||
padding-left: 10px;
|
||||
}
|
||||
|
||||
.terminal-mkdocs-main-grid {
|
||||
/* Grid now takes full width (minus body padding) */
|
||||
margin-left: 0; /* Override sidebar margin */
|
||||
margin-left: 0 !important; /* Override sidebar margin with !important */
|
||||
margin-right: 0; /* Override auto margin */
|
||||
max-width: 100%; /* Allow full width */
|
||||
padding-left: var(--global-space); /* Reduce padding */
|
||||
@@ -224,7 +336,6 @@ footer {
|
||||
text-align: center;
|
||||
gap: 0.5em;
|
||||
}
|
||||
/* Remember JS for toggle button & overlay */
|
||||
}
|
||||
|
||||
|
||||
@@ -301,17 +412,41 @@ footer {
|
||||
background-color: var(--primary-dimmed-color, #09b5a5);
|
||||
color: var(--background-color, #070708);
|
||||
border: none;
|
||||
padding: 4px 8px;
|
||||
padding: 6px 10px;
|
||||
font-size: 0.8em;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
|
||||
transition: background-color 0.2s ease;
|
||||
box-shadow: 0 3px 8px rgba(0, 0, 0, 0.3);
|
||||
transition: background-color 0.2s ease, transform 0.15s ease;
|
||||
white-space: nowrap;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
font-weight: 500;
|
||||
animation: askAiButtonAppear 0.2s ease-out;
|
||||
}
|
||||
|
||||
@keyframes askAiButtonAppear {
|
||||
from {
|
||||
opacity: 0;
|
||||
transform: scale(0.9);
|
||||
}
|
||||
to {
|
||||
opacity: 1;
|
||||
transform: scale(1);
|
||||
}
|
||||
}
|
||||
|
||||
.ask-ai-selection-button:hover {
|
||||
background-color: var(--primary-color, #50ffff);
|
||||
transform: scale(1.05);
|
||||
}
|
||||
|
||||
/* Mobile styles for Ask AI button */
|
||||
@media screen and (max-width: 768px) {
|
||||
.ask-ai-selection-button {
|
||||
padding: 8px 12px; /* Larger touch target on mobile */
|
||||
font-size: 0.9em; /* Slightly larger text */
|
||||
}
|
||||
}
|
||||
|
||||
/* ==== File: docs/assets/layout.css (Additions) ==== */
|
||||
|
||||
106
docs/md_v2/assets/mobile_menu.js
Normal file
106
docs/md_v2/assets/mobile_menu.js
Normal file
@@ -0,0 +1,106 @@
|
||||
// mobile_menu.js - Hamburger menu for mobile view
|
||||
document.addEventListener('DOMContentLoaded', () => {
|
||||
// Get references to key elements
|
||||
const sidePanel = document.getElementById('terminal-mkdocs-side-panel');
|
||||
const mainHeader = document.querySelector('.terminal .container:first-child');
|
||||
|
||||
if (!sidePanel || !mainHeader) {
|
||||
console.warn('Mobile menu: Required elements not found');
|
||||
return;
|
||||
}
|
||||
|
||||
// Force hide sidebar on mobile
|
||||
const checkMobile = () => {
|
||||
if (window.innerWidth <= 768) {
|
||||
// Force with !important-like priority
|
||||
sidePanel.style.setProperty('left', '-100%', 'important');
|
||||
// Also hide terminal-menu from the theme
|
||||
const terminalMenu = document.querySelector('.terminal-menu');
|
||||
if (terminalMenu) {
|
||||
terminalMenu.style.setProperty('display', 'none', 'important');
|
||||
}
|
||||
} else {
|
||||
sidePanel.style.removeProperty('left');
|
||||
// Restore terminal-menu if it exists
|
||||
const terminalMenu = document.querySelector('.terminal-menu');
|
||||
if (terminalMenu) {
|
||||
terminalMenu.style.removeProperty('display');
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
// Run on initial load
|
||||
checkMobile();
|
||||
|
||||
// Also run on resize
|
||||
window.addEventListener('resize', checkMobile);
|
||||
|
||||
// Create hamburger button
|
||||
const hamburgerBtn = document.createElement('button');
|
||||
hamburgerBtn.className = 'mobile-menu-toggle';
|
||||
hamburgerBtn.setAttribute('aria-label', 'Toggle navigation menu');
|
||||
hamburgerBtn.innerHTML = `
|
||||
<span class="hamburger-line"></span>
|
||||
<span class="hamburger-line"></span>
|
||||
<span class="hamburger-line"></span>
|
||||
`;
|
||||
|
||||
// Create backdrop overlay
|
||||
const menuBackdrop = document.createElement('div');
|
||||
menuBackdrop.className = 'mobile-menu-backdrop';
|
||||
menuBackdrop.style.display = 'none';
|
||||
document.body.appendChild(menuBackdrop);
|
||||
|
||||
// Make sure it's properly hidden on page load
|
||||
if (window.innerWidth <= 768) {
|
||||
menuBackdrop.style.display = 'none';
|
||||
}
|
||||
|
||||
// Insert hamburger button into header
|
||||
mainHeader.insertBefore(hamburgerBtn, mainHeader.firstChild);
|
||||
|
||||
// Add menu close button to side panel
|
||||
const closeBtn = document.createElement('button');
|
||||
closeBtn.className = 'mobile-menu-close';
|
||||
closeBtn.setAttribute('aria-label', 'Close navigation menu');
|
||||
closeBtn.innerHTML = `×`;
|
||||
sidePanel.insertBefore(closeBtn, sidePanel.firstChild);
|
||||
|
||||
// Toggle function
|
||||
function toggleMobileMenu() {
|
||||
const isOpen = sidePanel.classList.toggle('sidebar-visible');
|
||||
|
||||
// Toggle backdrop
|
||||
menuBackdrop.style.display = isOpen ? 'block' : 'none';
|
||||
|
||||
// Toggle aria-expanded
|
||||
hamburgerBtn.setAttribute('aria-expanded', isOpen ? 'true' : 'false');
|
||||
|
||||
// Toggle hamburger animation class
|
||||
hamburgerBtn.classList.toggle('is-active');
|
||||
|
||||
// Force sidebar visibility setting
|
||||
if (isOpen) {
|
||||
sidePanel.style.setProperty('left', '0', 'important');
|
||||
} else {
|
||||
sidePanel.style.setProperty('left', '-100%', 'important');
|
||||
}
|
||||
|
||||
// Prevent body scrolling when menu is open
|
||||
document.body.style.overflow = isOpen ? 'hidden' : '';
|
||||
}
|
||||
|
||||
// Event listeners
|
||||
hamburgerBtn.addEventListener('click', toggleMobileMenu);
|
||||
closeBtn.addEventListener('click', toggleMobileMenu);
|
||||
menuBackdrop.addEventListener('click', toggleMobileMenu);
|
||||
|
||||
// Close menu on window resize to desktop
|
||||
window.addEventListener('resize', () => {
|
||||
if (window.innerWidth > 768 && sidePanel.classList.contains('sidebar-visible')) {
|
||||
toggleMobileMenu();
|
||||
}
|
||||
});
|
||||
|
||||
console.log('Mobile menu initialized');
|
||||
});
|
||||
@@ -8,12 +8,32 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
const button = document.createElement('button');
|
||||
button.id = 'ask-ai-selection-btn';
|
||||
button.className = 'ask-ai-selection-button';
|
||||
button.textContent = 'Ask AI'; // Or use an icon
|
||||
|
||||
// Add icon and text for better visibility
|
||||
button.innerHTML = `
|
||||
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="12" height="12" fill="currentColor" style="margin-right: 4px; vertical-align: middle;">
|
||||
<path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2z"/>
|
||||
</svg>
|
||||
<span>Ask AI</span>
|
||||
`;
|
||||
|
||||
// Common styles
|
||||
button.style.display = 'none'; // Initially hidden
|
||||
button.style.position = 'absolute';
|
||||
button.style.zIndex = '1500'; // Ensure it's on top
|
||||
button.style.boxShadow = '0 3px 8px rgba(0, 0, 0, 0.4)'; // More pronounced shadow
|
||||
button.style.transition = 'transform 0.15s ease, background-color 0.2s ease'; // Smooth hover effect
|
||||
|
||||
// Add transform on hover
|
||||
button.addEventListener('mouseover', () => {
|
||||
button.style.transform = 'scale(1.05)';
|
||||
});
|
||||
|
||||
button.addEventListener('mouseout', () => {
|
||||
button.style.transform = 'scale(1)';
|
||||
});
|
||||
|
||||
document.body.appendChild(button);
|
||||
|
||||
button.addEventListener('click', handleAskAiClick);
|
||||
return button;
|
||||
}
|
||||
@@ -43,11 +63,38 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
const range = selection.getRangeAt(0);
|
||||
const rect = range.getBoundingClientRect();
|
||||
|
||||
// Calculate position: top-right of the selection
|
||||
// Get viewport dimensions
|
||||
const viewportWidth = window.innerWidth;
|
||||
const viewportHeight = window.innerHeight;
|
||||
|
||||
// Calculate position based on selection
|
||||
const scrollX = window.scrollX;
|
||||
const scrollY = window.scrollY;
|
||||
const buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
|
||||
const buttonLeft = rect.right + scrollX + 5; // 5px to the right
|
||||
|
||||
// Default position (top-right of selection)
|
||||
let buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
|
||||
let buttonLeft = rect.right + scrollX + 5; // 5px to the right
|
||||
|
||||
// Check if we're on mobile (which we define as less than 768px)
|
||||
const isMobile = viewportWidth <= 768;
|
||||
|
||||
if (isMobile) {
|
||||
// On mobile, position centered above selection to avoid edge issues
|
||||
buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 10; // 10px above on mobile
|
||||
buttonLeft = rect.left + scrollX + (rect.width / 2) - (askAiButton.offsetWidth / 2); // Centered
|
||||
} else {
|
||||
// For desktop, ensure the button doesn't go off screen
|
||||
// Check right edge
|
||||
if (buttonLeft + askAiButton.offsetWidth > scrollX + viewportWidth) {
|
||||
buttonLeft = scrollX + viewportWidth - askAiButton.offsetWidth - 10; // 10px from right edge
|
||||
}
|
||||
}
|
||||
|
||||
// Check top edge (for all devices)
|
||||
if (buttonTop < scrollY) {
|
||||
// If would go above viewport, position below selection instead
|
||||
buttonTop = rect.bottom + scrollY + 5; // 5px below
|
||||
}
|
||||
|
||||
askAiButton.style.top = `${buttonTop}px`;
|
||||
askAiButton.style.left = `${buttonLeft}px`;
|
||||
@@ -77,8 +124,8 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
|
||||
// --- Event Listeners ---
|
||||
|
||||
// Show button on mouse up after selection
|
||||
document.addEventListener('mouseup', (event) => {
|
||||
// Function to handle selection events (both mouse and touch)
|
||||
function handleSelectionEvent(event) {
|
||||
// Slight delay to ensure selection is registered
|
||||
setTimeout(() => {
|
||||
const selectedText = getSafeSelectedText();
|
||||
@@ -86,7 +133,7 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
if (!askAiButton) {
|
||||
askAiButton = createAskAiButton();
|
||||
}
|
||||
// Don't position if the click was ON the button itself
|
||||
// Don't position if the event was ON the button itself
|
||||
if (event.target !== askAiButton) {
|
||||
positionButton(event);
|
||||
}
|
||||
@@ -94,16 +141,46 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
hideButton();
|
||||
}
|
||||
}, 10); // Small delay
|
||||
}
|
||||
|
||||
// Mouse selection events (desktop)
|
||||
document.addEventListener('mouseup', handleSelectionEvent);
|
||||
|
||||
// Touch selection events (mobile)
|
||||
document.addEventListener('touchend', handleSelectionEvent);
|
||||
document.addEventListener('selectionchange', () => {
|
||||
// This helps with mobile selection which can happen without mouseup/touchend
|
||||
setTimeout(() => {
|
||||
const selectedText = getSafeSelectedText();
|
||||
if (selectedText && askAiButton) {
|
||||
positionButton();
|
||||
}
|
||||
}, 300); // Longer delay for selection change
|
||||
});
|
||||
|
||||
// Hide button on scroll or click elsewhere
|
||||
// Hide button on various events
|
||||
document.addEventListener('mousedown', (event) => {
|
||||
// Hide if clicking anywhere EXCEPT the button itself
|
||||
if (askAiButton && event.target !== askAiButton) {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
document.addEventListener('touchstart', (event) => {
|
||||
// Same for touch events, but only hide if not on the button
|
||||
if (askAiButton && event.target !== askAiButton) {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
document.addEventListener('scroll', hideButton, true); // Capture scroll events
|
||||
|
||||
// Also hide when pressing Escape key
|
||||
document.addEventListener('keydown', (event) => {
|
||||
if (event.key === 'Escape') {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
console.log("Selection Ask AI script loaded.");
|
||||
});
|
||||
@@ -268,3 +268,6 @@ div.badges a > img {
|
||||
}
|
||||
|
||||
|
||||
table td, table th {
|
||||
border: 1px solid var(--code-bg-color) !important;
|
||||
}
|
||||
@@ -4,6 +4,32 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical
|
||||
|
||||
## Latest Release
|
||||
|
||||
Here’s the blog index entry for **v0.6.0**, written to match the exact tone and structure of your previous entries:
|
||||
|
||||
---
|
||||
|
||||
### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
|
||||
*April 23, 2025*
|
||||
|
||||
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
|
||||
|
||||
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
|
||||
|
||||
Other key changes:
|
||||
|
||||
* Native support for `result.media["tables"]` to export DataFrames
|
||||
* Full network + console logs and MHTML snapshot per crawl
|
||||
* Browser pooling and pre-warming for faster cold starts
|
||||
* New streaming endpoints via MCP API and Playground
|
||||
* Robots.txt support, proxy rotation, and improved session handling
|
||||
* Deprecated old markdown names, legacy modules cleaned up
|
||||
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
|
||||
|
||||
[Read full release notes →](releases/0.6.0.md)
|
||||
|
||||
---
|
||||
|
||||
Let me know if you want me to auto-update the actual file or just paste this into the markdown.
|
||||
|
||||
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
|
||||
|
||||
|
||||
@@ -251,7 +251,7 @@ from crawl4ai import (
|
||||
RoundRobinProxyStrategy,
|
||||
)
|
||||
import asyncio
|
||||
from crawl4ai.proxy_strategy import ProxyConfig
|
||||
from crawl4ai import ProxyConfig
|
||||
async def main():
|
||||
# Load proxies and create rotation strategy
|
||||
proxies = ProxyConfig.from_env()
|
||||
|
||||
143
docs/md_v2/blog/releases/0.6.0.md
Normal file
143
docs/md_v2/blog/releases/0.6.0.md
Normal file
@@ -0,0 +1,143 @@
|
||||
# Crawl4AI v0.6.0 Release Notes
|
||||
|
||||
We're excited to announce the release of **Crawl4AI v0.6.0**, our biggest and most feature-rich update yet. This version introduces major architectural upgrades, brand-new capabilities for geo-aware crawling, high-efficiency scraping, and real-time streaming support for scalable deployments.
|
||||
|
||||
---
|
||||
|
||||
## Highlights
|
||||
|
||||
### 1. **World-Aware Crawlers**
|
||||
Crawl as if you’re anywhere in the world. With v0.6.0, each crawl can simulate:
|
||||
- Specific GPS coordinates
|
||||
- Browser locale
|
||||
- Timezone
|
||||
|
||||
Example:
|
||||
```python
|
||||
CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo",
|
||||
locale="en-US",
|
||||
timezone_id="America/Los_Angeles",
|
||||
geolocation=GeolocationConfig(
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0
|
||||
)
|
||||
)
|
||||
```
|
||||
Great for accessing region-specific content or testing global behavior.
|
||||
|
||||
---
|
||||
|
||||
### 2. **Native Table Extraction**
|
||||
Extract HTML tables directly into usable formats like Pandas DataFrames or CSV with zero parsing hassle. All table data is available under `result.media["tables"]`.
|
||||
|
||||
Example:
|
||||
```python
|
||||
raw_df = pd.DataFrame(
|
||||
result.media["tables"][0]["rows"],
|
||||
columns=result.media["tables"][0]["headers"]
|
||||
)
|
||||
```
|
||||
This makes it ideal for scraping financial data, pricing pages, or anything tabular.
|
||||
|
||||
---
|
||||
|
||||
### 3. **Browser Pooling & Pre-Warming**
|
||||
We've overhauled browser management. Now, multiple browser instances can be pooled and pages pre-warmed for ultra-fast launches:
|
||||
- Reduces cold-start latency
|
||||
- Lowers memory spikes
|
||||
- Enhances parallel crawling stability
|
||||
|
||||
This powers the new **Docker Playground** experience and streamlines heavy-load crawling.
|
||||
|
||||
---
|
||||
|
||||
### 4. **Traffic & Snapshot Capture**
|
||||
Need full visibility? You can now capture:
|
||||
- Full network traffic logs
|
||||
- Console output
|
||||
- MHTML page snapshots for post-crawl audits and debugging
|
||||
|
||||
No more guesswork on what happened during your crawl.
|
||||
|
||||
---
|
||||
|
||||
### 5. **MCP API and Streaming Support**
|
||||
We’re exposing **MCP socket and SSE endpoints**, allowing:
|
||||
- Live streaming of crawl results
|
||||
- Real-time integration with agents or frontends
|
||||
- A new Playground UI for interactive crawling
|
||||
|
||||
This is a major step towards making Crawl4AI real-time ready.
|
||||
|
||||
---
|
||||
|
||||
### 6. **Stress-Test Framework**
|
||||
Want to test performance under heavy load? v0.6.0 includes a new memory stress-test suite that supports 1,000+ URL workloads. Ideal for:
|
||||
- Load testing
|
||||
- Performance benchmarking
|
||||
- Validating memory efficiency
|
||||
|
||||
---
|
||||
|
||||
## Core Improvements
|
||||
- Robots.txt compliance
|
||||
- Proxy rotation support
|
||||
- Improved URL normalization and session reuse
|
||||
- Shared data across crawler hooks
|
||||
- New page routing logic
|
||||
|
||||
---
|
||||
|
||||
## Breaking Changes & Deprecations
|
||||
- Legacy `crawl4ai/browser/*` modules are removed. Update imports accordingly.
|
||||
- `AsyncPlaywrightCrawlerStrategy.get_page` now uses a new function signature.
|
||||
- Deprecated markdown generator aliases now point to `DefaultMarkdownGenerator` with warning.
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous Updates
|
||||
- FastAPI validators replaced custom validation logic
|
||||
- Docker build now based on a Chromium layer
|
||||
- Repo-wide cleanup: ~36,000 insertions, ~5,000 deletions
|
||||
|
||||
---
|
||||
|
||||
## New Examples Included
|
||||
- Geo-location crawling
|
||||
- Network + console log capture
|
||||
- Docker MCP API usage
|
||||
- Markdown selector usage
|
||||
- Crypto project data extraction
|
||||
|
||||
---
|
||||
|
||||
## Watch the Release Video
|
||||
Want a visual walkthrough of all these updates? Watch the video:
|
||||
🔗 https://youtu.be/9x7nVcjOZks
|
||||
|
||||
If you're new to Crawl4AI, start here:
|
||||
🔗 https://www.youtube.com/watch?v=xo3qK6Hg9AA&t=15s
|
||||
|
||||
---
|
||||
|
||||
## Join the Community
|
||||
We’ve just opened up our **Discord** for the public. Join us to:
|
||||
- Ask questions
|
||||
- Share your projects
|
||||
- Get help or contribute
|
||||
|
||||
💬 https://discord.gg/wpYFACrHR4
|
||||
|
||||
---
|
||||
|
||||
## Install or Upgrade
|
||||
```bash
|
||||
pip install -U crawl4ai
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
Live long and import crawl4ai. 🖖
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
# Browser, Crawler & LLM Configuration (Quick Overview)
|
||||
|
||||
Crawl4AI’s flexibility stems from two key classes:
|
||||
Crawl4AI's flexibility stems from two key classes:
|
||||
|
||||
1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
|
||||
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
|
||||
1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
|
||||
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
|
||||
3. **`LLMConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)
|
||||
|
||||
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
|
||||
@@ -36,18 +36,16 @@ class BrowserConfig:
|
||||
|
||||
### Key Fields to Note
|
||||
|
||||
|
||||
|
||||
1. **`browser_type`**
|
||||
1. **`browser_type`**
|
||||
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
|
||||
- Defaults to `"chromium"`.
|
||||
- If you need a different engine, specify it here.
|
||||
|
||||
2. **`headless`**
|
||||
2. **`headless`**
|
||||
- `True`: Runs the browser in headless mode (invisible browser).
|
||||
- `False`: Runs the browser in visible mode, which helps with debugging.
|
||||
|
||||
3. **`proxy_config`**
|
||||
3. **`proxy_config`**
|
||||
- A dictionary with fields like:
|
||||
```json
|
||||
{
|
||||
@@ -58,31 +56,31 @@ class BrowserConfig:
|
||||
```
|
||||
- Leave as `None` if a proxy is not required.
|
||||
|
||||
4. **`viewport_width` & `viewport_height`**:
|
||||
4. **`viewport_width` & `viewport_height`**:
|
||||
- The initial window size.
|
||||
- Some sites behave differently with smaller or bigger viewports.
|
||||
|
||||
5. **`verbose`**:
|
||||
5. **`verbose`**:
|
||||
- If `True`, prints extra logs.
|
||||
- Handy for debugging.
|
||||
|
||||
6. **`use_persistent_context`**:
|
||||
6. **`use_persistent_context`**:
|
||||
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
|
||||
- Typically also set `user_data_dir` to point to a folder.
|
||||
|
||||
7. **`cookies`** & **`headers`**:
|
||||
7. **`cookies`** & **`headers`**:
|
||||
- If you want to start with specific cookies or add universal HTTP headers, set them here.
|
||||
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
|
||||
|
||||
8. **`user_agent`**:
|
||||
8. **`user_agent`**:
|
||||
- Custom User-Agent string. If `None`, a default is used.
|
||||
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
|
||||
|
||||
9. **`text_mode`** & **`light_mode`**:
|
||||
9. **`text_mode`** & **`light_mode`**:
|
||||
- `text_mode=True` disables images, possibly speeding up text-only crawls.
|
||||
- `light_mode=True` turns off certain background features for performance.
|
||||
|
||||
10. **`extra_args`**:
|
||||
10. **`extra_args`**:
|
||||
- Additional flags for the underlying browser.
|
||||
- E.g. `["--disable-extensions"]`.
|
||||
|
||||
@@ -137,6 +135,11 @@ class CrawlerRunConfig:
|
||||
screenshot=False,
|
||||
pdf=False,
|
||||
capture_mhtml=False,
|
||||
# Location and Identity Parameters
|
||||
locale=None, # e.g. "en-US", "fr-FR"
|
||||
timezone_id=None, # e.g. "America/New_York"
|
||||
geolocation=None, # GeolocationConfig object
|
||||
# Resource Management
|
||||
enable_rate_limiting=False,
|
||||
rate_limit_config=None,
|
||||
memory_threshold_percent=70.0,
|
||||
@@ -152,57 +155,65 @@ class CrawlerRunConfig:
|
||||
|
||||
### Key Fields to Note
|
||||
|
||||
1. **`word_count_threshold`**:
|
||||
1. **`word_count_threshold`**:
|
||||
- The minimum word count before a block is considered.
|
||||
- If your site has lots of short paragraphs or items, you can lower it.
|
||||
|
||||
2. **`extraction_strategy`**:
|
||||
2. **`extraction_strategy`**:
|
||||
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
|
||||
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
|
||||
|
||||
3. **`markdown_generator`**:
|
||||
3. **`markdown_generator`**:
|
||||
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
|
||||
- If `None`, a default approach is used.
|
||||
|
||||
4. **`cache_mode`**:
|
||||
4. **`cache_mode`**:
|
||||
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
|
||||
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
|
||||
|
||||
5. **`js_code`**:
|
||||
5. **`js_code`**:
|
||||
- A string or list of JS strings to execute.
|
||||
- Great for “Load More” buttons or user interactions.
|
||||
- Great for "Load More" buttons or user interactions.
|
||||
|
||||
6. **`wait_for`**:
|
||||
6. **`wait_for`**:
|
||||
- A CSS or JS expression to wait for before extracting content.
|
||||
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
|
||||
|
||||
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
|
||||
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
|
||||
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
|
||||
8. **`verbose`**:
|
||||
- Logs additional runtime details.
|
||||
- Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.
|
||||
|
||||
9. **`enable_rate_limiting`**:
|
||||
8. **Location Parameters**:
|
||||
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
|
||||
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
|
||||
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
|
||||
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
|
||||
|
||||
9. **`verbose`**:
|
||||
- Logs additional runtime details.
|
||||
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
|
||||
|
||||
10. **`enable_rate_limiting`**:
|
||||
- If `True`, enables rate limiting for batch processing.
|
||||
- Requires `rate_limit_config` to be set.
|
||||
|
||||
10. **`memory_threshold_percent`**:
|
||||
11. **`memory_threshold_percent`**:
|
||||
- The memory threshold (as a percentage) to monitor.
|
||||
- If exceeded, the crawler will pause or slow down.
|
||||
|
||||
11. **`check_interval`**:
|
||||
12. **`check_interval`**:
|
||||
- The interval (in seconds) to check system resources.
|
||||
- Affects how often memory and CPU usage are monitored.
|
||||
|
||||
12. **`max_session_permit`**:
|
||||
13. **`max_session_permit`**:
|
||||
- The maximum number of concurrent crawl sessions.
|
||||
- Helps prevent overwhelming the system.
|
||||
|
||||
13. **`display_mode`**:
|
||||
14. **`display_mode`**:
|
||||
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
|
||||
- Affects how much information is printed during the crawl.
|
||||
|
||||
|
||||
### Helper Methods
|
||||
|
||||
The `clone()` method is particularly useful for creating variations of your crawler configuration:
|
||||
@@ -236,23 +247,20 @@ The `clone()` method:
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## 3. LLMConfig Essentials
|
||||
|
||||
### Key fields to note
|
||||
|
||||
1. **`provider`**:
|
||||
1. **`provider`**:
|
||||
- Which LLM provoder to use.
|
||||
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
|
||||
|
||||
2. **`api_token`**:
|
||||
2. **`api_token`**:
|
||||
- Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
|
||||
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
|
||||
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
|
||||
|
||||
3. **`base_url`**:
|
||||
3. **`base_url`**:
|
||||
- If your provider has a custom endpoint
|
||||
|
||||
```python
|
||||
@@ -261,7 +269,7 @@ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENA
|
||||
|
||||
## 4. Putting It All Together
|
||||
|
||||
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call’s needs:
|
||||
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
115
docs/md_v2/core/examples.md
Normal file
115
docs/md_v2/core/examples.md
Normal file
@@ -0,0 +1,115 @@
|
||||
# Code Examples
|
||||
|
||||
This page provides a comprehensive list of example scripts that demonstrate various features and capabilities of Crawl4AI. Each example is designed to showcase specific functionality, making it easier for you to understand how to implement these features in your own projects.
|
||||
|
||||
## Getting Started Examples
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Hello World | A simple introductory example demonstrating basic usage of AsyncWebCrawler with JavaScript execution and content filtering. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world.py) |
|
||||
| Quickstart | A comprehensive collection of examples showcasing various features including basic crawling, content cleaning, link analysis, JavaScript execution, CSS selectors, media handling, custom hooks, proxy configuration, screenshots, and multiple extraction strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py) |
|
||||
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
|
||||
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
|
||||
|
||||
## Browser & Crawling Features
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Built-in Browser | Demonstrates how to use the built-in browser capabilities. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/builtin_browser_example.py) |
|
||||
| Browser Optimization | Focuses on browser performance optimization techniques. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/browser_optimization_example.py) |
|
||||
| arun vs arun_many | Compares the `arun` and `arun_many` methods for single vs. multiple URL crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/arun_vs_arun_many.py) |
|
||||
| Multiple URLs | Shows how to crawl multiple URLs asynchronously. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/async_webcrawler_multiple_urls_example.py) |
|
||||
| Page Interaction | Guide on interacting with dynamic elements through clicks. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/tutorial_dynamic_clicks.md) |
|
||||
| Crawler Monitor | Shows how to monitor the crawler's activities and status. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawler_monitor_example.py) |
|
||||
| Full Page Screenshot & PDF | Guide on capturing full-page screenshots and PDFs from massive webpages. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/full_page_screenshot_and_pdf_export.md) |
|
||||
|
||||
## Advanced Crawling & Deep Crawling
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
|
||||
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
|
||||
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
|
||||
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
|
||||
|
||||
## Extraction Strategies
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Extraction Strategies | Demonstrates different extraction strategies with various input formats (markdown, HTML, fit_markdown) and JSON-based extractors (CSS and XPath). | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/extraction_strategies_examples.py) |
|
||||
| Scraping Strategies | Compares the performance of different scraping strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/scraping_strategies_performance.py) |
|
||||
| LLM Extraction | Demonstrates LLM-based extraction specifically for OpenAI pricing data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_extraction_openai_pricing.py) |
|
||||
| LLM Markdown | Shows how to use LLMs to generate markdown from crawled content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_markdown_generator.py) |
|
||||
| Summarize Page | Shows how to summarize web page content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/summarize_page.py) |
|
||||
|
||||
## E-commerce & Specialized Crawling
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Amazon Product Extraction | Demonstrates how to extract structured product data from Amazon search results using CSS selectors. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_direct_url.py) |
|
||||
| Amazon with Hooks | Shows how to use hooks with Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_hooks.py) |
|
||||
| Amazon with JavaScript | Demonstrates using custom JavaScript for Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_use_javascript.py) |
|
||||
| Crypto Analysis | Demonstrates how to crawl and analyze cryptocurrency data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crypto_analysis_example.py) |
|
||||
| SERP API | Demonstrates using Crawl4AI with search engine result pages. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/serp_api_project_11_feb.py) |
|
||||
|
||||
## Customization & Security
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Hooks | Illustrates how to use hooks at different stages of the crawling process for advanced customization. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hooks_example.py) |
|
||||
| Identity-Based Browsing | Illustrates identity-based browsing configurations for authentic browsing experiences. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/identity_based_browsing.py) |
|
||||
| Proxy Rotation | Shows how to use proxy rotation for web scraping and avoiding IP blocks. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/proxy_rotation_demo.py) |
|
||||
| SSL Certificate | Illustrates SSL certificate handling and verification. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/ssl_example.py) |
|
||||
| Language Support | Shows how to handle different languages during crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/language_support_example.py) |
|
||||
| Geolocation | Demonstrates how to use geolocation features. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/use_geo_location.py) |
|
||||
|
||||
## Docker & Deployment
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Docker Config | Demonstrates how to create and use Docker configuration objects. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_config_obj.py) |
|
||||
| Docker Basic | A test suite for Docker deployment, showcasing various functionalities through the Docker API. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py) |
|
||||
| Docker REST API | Shows how to interact with Crawl4AI Docker using REST API calls. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api.py) |
|
||||
| Docker SDK | Demonstrates using the Python SDK for Crawl4AI Docker. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk.py) |
|
||||
|
||||
## Application Examples
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Research Assistant | Demonstrates how to build a research assistant using Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/research_assistant.py) |
|
||||
| REST Call | Shows how to make REST API calls with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/rest_call.py) |
|
||||
| Chainlit Integration | Shows how to integrate Crawl4AI with Chainlit. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/chainlit.md) |
|
||||
| Crawl4AI vs FireCrawl | Compares Crawl4AI with the FireCrawl library. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawlai_vs_firecrawl.py) |
|
||||
|
||||
## Content Generation & Markdown
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Content Source | Demonstrates how to work with different content sources in markdown generation. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_example.py) |
|
||||
| Content Source (Short) | A simplified version of content source usage. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_short_example.py) |
|
||||
| Built-in Browser Guide | Guide for using the built-in browser capabilities. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/README_BUILTIN_BROWSER.md) |
|
||||
|
||||
## Running the Examples
|
||||
|
||||
To run any of these examples, you'll need to have Crawl4AI installed:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
```
|
||||
|
||||
Then, you can run an example script like this:
|
||||
|
||||
```bash
|
||||
python -m docs.examples.hello_world
|
||||
```
|
||||
|
||||
For examples that require additional dependencies or environment variables, refer to the comments at the top of each file.
|
||||
|
||||
Some examples may require:
|
||||
- API keys (for LLM-based examples)
|
||||
- Docker setup (for Docker-related examples)
|
||||
- Additional dependencies (specified in the example files)
|
||||
|
||||
## Contributing New Examples
|
||||
|
||||
If you've created an interesting example that demonstrates a unique use case or feature of Crawl4AI, we encourage you to contribute it to our examples collection. Please see our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
|
||||
@@ -111,13 +111,71 @@ Some commonly used `options`:
|
||||
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
|
||||
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.
|
||||
|
||||
## 4. Selecting the HTML Source for Markdown Generation
|
||||
|
||||
The `content_source` parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown.
|
||||
|
||||
```python
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
# Option 1: Use the raw HTML directly from the webpage (before any processing)
|
||||
raw_md_generator = DefaultMarkdownGenerator(
|
||||
content_source="raw_html",
|
||||
options={"ignore_links": True}
|
||||
)
|
||||
|
||||
# Option 2: Use the cleaned HTML (after scraping strategy processing - default)
|
||||
cleaned_md_generator = DefaultMarkdownGenerator(
|
||||
content_source="cleaned_html", # This is the default
|
||||
options={"ignore_links": True}
|
||||
)
|
||||
|
||||
# Option 3: Use preprocessed HTML optimized for schema extraction
|
||||
fit_md_generator = DefaultMarkdownGenerator(
|
||||
content_source="fit_html",
|
||||
options={"ignore_links": True}
|
||||
)
|
||||
|
||||
# Use one of the generators in your crawler config
|
||||
config = CrawlerRunConfig(
|
||||
markdown_generator=raw_md_generator # Try each of the generators
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com", config=config)
|
||||
if result.success:
|
||||
print("Markdown:\n", result.markdown.raw_markdown[:500])
|
||||
else:
|
||||
print("Crawl failed:", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### HTML Source Options
|
||||
|
||||
- **`"cleaned_html"`** (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed.
|
||||
|
||||
- **`"raw_html"`**: Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that might not be relevant to the main content.
|
||||
|
||||
- **`"fit_html"`**: Uses HTML preprocessed for schema extraction. This HTML is optimized for structured data extraction and may have certain elements simplified or removed.
|
||||
|
||||
### When to Use Each Option
|
||||
|
||||
- Use **`"cleaned_html"`** (default) for most cases where you want a balance of content preservation and noise removal.
|
||||
- Use **`"raw_html"`** when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep.
|
||||
- Use **`"fit_html"`** when working with structured data or when you need HTML that's optimized for schema extraction.
|
||||
|
||||
---
|
||||
|
||||
## 4. Content Filters
|
||||
## 5. Content Filters
|
||||
|
||||
**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don’t want.
|
||||
|
||||
### 4.1 BM25ContentFilter
|
||||
### 5.1 BM25ContentFilter
|
||||
|
||||
If you have a **search query**, BM25 is a good choice:
|
||||
|
||||
@@ -146,7 +204,7 @@ config = CrawlerRunConfig(markdown_generator=md_generator)
|
||||
|
||||
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.
|
||||
|
||||
### 4.2 PruningContentFilter
|
||||
### 5.2 PruningContentFilter
|
||||
|
||||
If you **don’t** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
|
||||
|
||||
@@ -170,7 +228,7 @@ prune_filter = PruningContentFilter(
|
||||
- You want a broad cleanup without a user query.
|
||||
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
|
||||
|
||||
### 4.3 LLMContentFilter
|
||||
### 5.3 LLMContentFilter
|
||||
|
||||
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
|
||||
|
||||
@@ -247,7 +305,7 @@ filter = LLMContentFilter(
|
||||
|
||||
---
|
||||
|
||||
## 5. Using Fit Markdown
|
||||
## 6. Using Fit Markdown
|
||||
|
||||
When a content filter is active, the library produces two forms of markdown inside `result.markdown`:
|
||||
|
||||
@@ -284,7 +342,7 @@ if __name__ == "__main__":
|
||||
|
||||
---
|
||||
|
||||
## 6. The `MarkdownGenerationResult` Object
|
||||
## 7. The `MarkdownGenerationResult` Object
|
||||
|
||||
If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, you’ll see fields such as:
|
||||
|
||||
@@ -315,7 +373,7 @@ Below is a **revised section** under “Combining Filters (BM25 + Pruning)” th
|
||||
|
||||
---
|
||||
|
||||
## 7. Combining Filters (BM25 + Pruning) in Two Passes
|
||||
## 8. Combining Filters (BM25 + Pruning) in Two Passes
|
||||
|
||||
You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page twice. Instead:
|
||||
|
||||
@@ -407,7 +465,7 @@ If your codebase or pipeline design allows applying multiple filters in one pass
|
||||
|
||||
---
|
||||
|
||||
## 8. Common Pitfalls & Tips
|
||||
## 9. Common Pitfalls & Tips
|
||||
|
||||
1. **No Markdown Output?**
|
||||
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
|
||||
@@ -427,11 +485,12 @@ If your codebase or pipeline design allows applying multiple filters in one pass
|
||||
|
||||
---
|
||||
|
||||
## 9. Summary & Next Steps
|
||||
## 10. Summary & Next Steps
|
||||
|
||||
In this **Markdown Generation Basics** tutorial, you learned to:
|
||||
|
||||
- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
|
||||
- Select different HTML sources using the `content_source` parameter.
|
||||
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
|
||||
- Distinguish between raw and filtered markdown (`fit_markdown`).
|
||||
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
|
||||
|
||||
1. Works with **any** large language model supported by [LightLLM](https://github.com/LightLLM) (Ollama, OpenAI, Claude, and more).
|
||||
1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
|
||||
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
|
||||
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.
|
||||
|
||||
@@ -18,13 +18,19 @@ In some cases, you need to extract **complex or unstructured** information from
|
||||
|
||||
---
|
||||
|
||||
## 2. Provider-Agnostic via LightLLM
|
||||
## 2. Provider-Agnostic via LiteLLM
|
||||
|
||||
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
|
||||
You can use LlmConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LlmConfig [here](/api/parameters).
|
||||
|
||||
```python
|
||||
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
|
||||
```
|
||||
|
||||
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
|
||||
|
||||
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
|
||||
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
|
||||
- **`api_base`** (optional): If your provider has a custom endpoint.
|
||||
- **`base_url`** (optional): If your provider has a custom endpoint.
|
||||
|
||||
This means you **aren’t locked** into a single LLM vendor. Switch or experiment easily.
|
||||
|
||||
@@ -52,20 +58,19 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
|
||||
|
||||
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
|
||||
|
||||
1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
|
||||
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
|
||||
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
|
||||
4. **`extraction_type`** (str): `"schema"` or `"block"`.
|
||||
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
|
||||
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
|
||||
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
|
||||
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
|
||||
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
|
||||
1. **`llmConfig`** (LlmConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
|
||||
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
|
||||
3. **`extraction_type`** (str): `"schema"` or `"block"`.
|
||||
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
|
||||
5. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
|
||||
6. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
|
||||
7. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
|
||||
8. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
|
||||
- `"markdown"`: The raw markdown (default).
|
||||
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
|
||||
- `"html"`: The cleaned or raw HTML.
|
||||
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
|
||||
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
|
||||
9. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
|
||||
10. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
|
||||
|
||||
**Example**:
|
||||
|
||||
@@ -233,8 +238,7 @@ class KnowledgeGraph(BaseModel):
|
||||
async def main():
|
||||
# LLM extraction strategy
|
||||
llm_strat = LLMExtractionStrategy(
|
||||
provider="openai/gpt-4",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
llmConfig = LlmConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
|
||||
schema=KnowledgeGraph.schema_json(),
|
||||
extraction_type="schema",
|
||||
instruction="Extract entities and relationships from the content. Return valid JSON.",
|
||||
@@ -286,7 +290,7 @@ if __name__ == "__main__":
|
||||
|
||||
## 11. Conclusion
|
||||
|
||||
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LightLLM. It’s perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it’s **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
|
||||
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. It’s perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it’s **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
|
||||
|
||||
- Put your LLM strategy **in `CrawlerRunConfig`**.
|
||||
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
|
||||
@@ -317,4 +321,4 @@ If your site’s data is consistent or repetitive, consider [`JsonCssExtractionS
|
||||
|
||||
---
|
||||
|
||||
That’s it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!
|
||||
That’s it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!
|
||||
|
||||
@@ -1,15 +1,20 @@
|
||||
# Extracting JSON (No LLM)
|
||||
|
||||
One of Crawl4AI’s **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. By defining a **schema** with CSS or XPath selectors, you can extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
|
||||
One of Crawl4AI's **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. Crawl4AI offers several strategies for LLM-free extraction:
|
||||
|
||||
1. **Schema-based extraction** with CSS or XPath selectors via `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
|
||||
2. **Regular expression extraction** with `RegexExtractionStrategy` for fast pattern matching
|
||||
|
||||
These approaches let you extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
|
||||
|
||||
**Why avoid LLM for basic extractions?**
|
||||
|
||||
1. **Faster & Cheaper**: No API calls or GPU overhead.
|
||||
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. A well-defined schema is practically carbon-free.
|
||||
3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
|
||||
4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.
|
||||
1. **Faster & Cheaper**: No API calls or GPU overhead.
|
||||
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. Pattern-based extraction is practically carbon-free.
|
||||
3. **Precise & Repeatable**: CSS/XPath selectors and regex patterns do exactly what you specify. LLM outputs can vary or hallucinate.
|
||||
4. **Scales Readily**: For thousands of pages, pattern-based extraction runs quickly and in parallel.
|
||||
|
||||
Below, we’ll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We’ll also highlight advanced features like **nested fields** and **base element attributes**.
|
||||
Below, we'll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We'll also highlight advanced features like **nested fields** and **base element attributes**.
|
||||
|
||||
---
|
||||
|
||||
@@ -17,17 +22,17 @@ Below, we’ll explore how to craft these schemas and use them with **JsonCssExt
|
||||
|
||||
A schema defines:
|
||||
|
||||
1. A **base selector** that identifies each “container” element on the page (e.g., a product row, a blog post card).
|
||||
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
|
||||
3. **Nested** or **list** types for repeated or hierarchical structures.
|
||||
1. A **base selector** that identifies each "container" element on the page (e.g., a product row, a blog post card).
|
||||
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
|
||||
3. **Nested** or **list** types for repeated or hierarchical structures.
|
||||
|
||||
For example, if you have a list of products, each one might have a name, price, reviews, and “related products.” This approach is faster and more reliable than an LLM for consistent, structured pages.
|
||||
For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.
|
||||
|
||||
---
|
||||
|
||||
## 2. Simple Example: Crypto Prices
|
||||
|
||||
Let’s begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don’t** call any LLM:
|
||||
Let's begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don't** call any LLM:
|
||||
|
||||
```python
|
||||
import json
|
||||
@@ -87,7 +92,7 @@ asyncio.run(extract_crypto_prices())
|
||||
|
||||
**Highlights**:
|
||||
|
||||
- **`baseSelector`**: Tells us where each “item” (crypto row) is.
|
||||
- **`baseSelector`**: Tells us where each "item" (crypto row) is.
|
||||
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
|
||||
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).
|
||||
|
||||
@@ -97,7 +102,7 @@ No LLM is needed, and the performance is **near-instant** for hundreds or thousa
|
||||
|
||||
### **XPath Example with `raw://` HTML**
|
||||
|
||||
Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We’ll pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.
|
||||
Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We'll pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.
|
||||
|
||||
```python
|
||||
import json
|
||||
@@ -168,12 +173,12 @@ asyncio.run(extract_crypto_prices_xpath())
|
||||
|
||||
**Key Points**:
|
||||
|
||||
1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
|
||||
2. **`baseSelector`** and each field’s `"selector"` use **XPath** instead of CSS.
|
||||
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
|
||||
1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
|
||||
2. **`baseSelector`** and each field's `"selector"` use **XPath** instead of CSS.
|
||||
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
|
||||
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.
|
||||
|
||||
That’s how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
|
||||
That's how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
|
||||
|
||||
---
|
||||
|
||||
@@ -187,7 +192,7 @@ We have a **sample e-commerce** HTML file on GitHub (example):
|
||||
```
|
||||
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
|
||||
```
|
||||
This snippet includes categories, products, features, reviews, and related items. Let’s see how to define a schema that fully captures that structure **without LLM**.
|
||||
This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure **without LLM**.
|
||||
|
||||
```python
|
||||
schema = {
|
||||
@@ -333,24 +338,253 @@ async def extract_ecommerce_data():
|
||||
asyncio.run(extract_ecommerce_data())
|
||||
```
|
||||
|
||||
If all goes well, you get a **structured** JSON array with each “category,” containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.
|
||||
If all goes well, you get a **structured** JSON array with each "category," containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.
|
||||
|
||||
---
|
||||
|
||||
## 4. Why “No LLM” Is Often Better
|
||||
## 4. RegexExtractionStrategy - Fast Pattern-Based Extraction
|
||||
|
||||
1. **Zero Hallucination**: Schema-based extraction doesn’t guess text. It either finds it or not.
|
||||
2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
|
||||
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
|
||||
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.
|
||||
Crawl4AI now offers a powerful new zero-LLM extraction strategy: `RegexExtractionStrategy`. This strategy provides lightning-fast extraction of common data types like emails, phone numbers, URLs, dates, and more using pre-compiled regular expressions.
|
||||
|
||||
**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema approach first for repeated or consistent data patterns.
|
||||
### Key Features
|
||||
|
||||
- **Zero LLM Dependency**: Extracts data without any AI model calls
|
||||
- **Blazing Fast**: Uses pre-compiled regex patterns for maximum performance
|
||||
- **Built-in Patterns**: Includes ready-to-use patterns for common data types
|
||||
- **Custom Patterns**: Add your own regex patterns for domain-specific extraction
|
||||
- **LLM-Assisted Pattern Generation**: Optionally use an LLM once to generate optimized patterns, then reuse them without further LLM calls
|
||||
|
||||
### Simple Example: Extracting Common Entities
|
||||
|
||||
The easiest way to start is by using the built-in pattern catalog:
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
RegexExtractionStrategy
|
||||
)
|
||||
|
||||
async def extract_with_regex():
|
||||
# Create a strategy using built-in patterns for URLs and currencies
|
||||
strategy = RegexExtractionStrategy(
|
||||
pattern = RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for item in data[:5]: # Show first 5 matches
|
||||
print(f"{item['label']}: {item['value']}")
|
||||
print(f"Total matches: {len(data)}")
|
||||
|
||||
asyncio.run(extract_with_regex())
|
||||
```
|
||||
|
||||
### Available Built-in Patterns
|
||||
|
||||
`RegexExtractionStrategy` provides these common patterns as IntFlag attributes for easy combining:
|
||||
|
||||
```python
|
||||
# Use individual patterns
|
||||
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)
|
||||
|
||||
# Combine multiple patterns
|
||||
strategy = RegexExtractionStrategy(
|
||||
pattern = (
|
||||
RegexExtractionStrategy.Email |
|
||||
RegexExtractionStrategy.PhoneUS |
|
||||
RegexExtractionStrategy.Url
|
||||
)
|
||||
)
|
||||
|
||||
# Use all available patterns
|
||||
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
|
||||
```
|
||||
|
||||
Available patterns include:
|
||||
- `Email` - Email addresses
|
||||
- `PhoneIntl` - International phone numbers
|
||||
- `PhoneUS` - US-format phone numbers
|
||||
- `Url` - HTTP/HTTPS URLs
|
||||
- `IPv4` - IPv4 addresses
|
||||
- `IPv6` - IPv6 addresses
|
||||
- `Uuid` - UUIDs
|
||||
- `Currency` - Currency values (USD, EUR, etc.)
|
||||
- `Percentage` - Percentage values
|
||||
- `Number` - Numeric values
|
||||
- `DateIso` - ISO format dates
|
||||
- `DateUS` - US format dates
|
||||
- `Time24h` - 24-hour format times
|
||||
- `PostalUS` - US postal codes
|
||||
- `PostalUK` - UK postal codes
|
||||
- `HexColor` - HTML hex color codes
|
||||
- `TwitterHandle` - Twitter handles
|
||||
- `Hashtag` - Hashtags
|
||||
- `MacAddr` - MAC addresses
|
||||
- `Iban` - International bank account numbers
|
||||
- `CreditCard` - Credit card numbers
|
||||
|
||||
### Custom Pattern Example
|
||||
|
||||
For more targeted extraction, you can provide custom patterns:
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
RegexExtractionStrategy
|
||||
)
|
||||
|
||||
async def extract_prices():
|
||||
# Define a custom pattern for US Dollar prices
|
||||
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
|
||||
|
||||
# Create strategy with custom pattern
|
||||
strategy = RegexExtractionStrategy(custom=price_pattern)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.example.com/products",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for item in data:
|
||||
print(f"Found price: {item['value']}")
|
||||
|
||||
asyncio.run(extract_prices())
|
||||
```
|
||||
|
||||
### LLM-Assisted Pattern Generation
|
||||
|
||||
For complex or site-specific patterns, you can use an LLM once to generate an optimized pattern, then save and reuse it without further LLM calls:
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from pathlib import Path
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
RegexExtractionStrategy,
|
||||
LLMConfig
|
||||
)
|
||||
|
||||
async def extract_with_generated_pattern():
|
||||
cache_dir = Path("./pattern_cache")
|
||||
cache_dir.mkdir(exist_ok=True)
|
||||
pattern_file = cache_dir / "price_pattern.json"
|
||||
|
||||
# 1. Generate or load pattern
|
||||
if pattern_file.exists():
|
||||
pattern = json.load(pattern_file.open())
|
||||
print(f"Using cached pattern: {pattern}")
|
||||
else:
|
||||
print("Generating pattern via LLM...")
|
||||
|
||||
# Configure LLM
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4o-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
)
|
||||
|
||||
# Get sample HTML for context
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com/products")
|
||||
html = result.fit_html
|
||||
|
||||
# Generate pattern (one-time LLM usage)
|
||||
pattern = RegexExtractionStrategy.generate_pattern(
|
||||
label="price",
|
||||
html=html,
|
||||
query="Product prices in USD format",
|
||||
llm_config=llm_config,
|
||||
)
|
||||
|
||||
# Cache pattern for future use
|
||||
json.dump(pattern, pattern_file.open("w"), indent=2)
|
||||
|
||||
# 2. Use pattern for extraction (no LLM calls)
|
||||
strategy = RegexExtractionStrategy(custom=pattern)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/products",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
for item in data[:10]:
|
||||
print(f"Extracted: {item['value']}")
|
||||
print(f"Total matches: {len(data)}")
|
||||
|
||||
asyncio.run(extract_with_generated_pattern())
|
||||
```
|
||||
|
||||
This pattern allows you to:
|
||||
1. Use an LLM once to generate a highly optimized regex for your specific site
|
||||
2. Save the pattern to disk for reuse
|
||||
3. Extract data using only regex (no further LLM calls) in production
|
||||
|
||||
### Extraction Results Format
|
||||
|
||||
The `RegexExtractionStrategy` returns results in a consistent format:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"label": "email",
|
||||
"value": "contact@example.com",
|
||||
"span": [145, 163]
|
||||
},
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"label": "url",
|
||||
"value": "https://support.example.com",
|
||||
"span": [210, 235]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Each match includes:
|
||||
- `url`: The source URL
|
||||
- `label`: The pattern name that matched (e.g., "email", "phone_us")
|
||||
- `value`: The extracted text
|
||||
- `span`: The start and end positions in the source content
|
||||
|
||||
---
|
||||
|
||||
## 5. Base Element Attributes & Additional Fields
|
||||
## 5. Why "No LLM" Is Often Better
|
||||
|
||||
It’s easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:
|
||||
1. **Zero Hallucination**: Pattern-based extraction doesn't guess text. It either finds it or not.
|
||||
2. **Guaranteed Structure**: The same schema or regex yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
|
||||
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
|
||||
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema or regex, not re-tuning a model.
|
||||
|
||||
**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema or regex approach first for repeated or consistent data patterns.
|
||||
|
||||
---
|
||||
|
||||
## 6. Base Element Attributes & Additional Fields
|
||||
|
||||
It's easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -361,11 +595,11 @@ It’s easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from y
|
||||
}
|
||||
```
|
||||
|
||||
You can define them in **`baseFields`** (extracted from the main container element) or in each field’s sub-lists. This is especially helpful if you need an item’s link or ID stored in the parent `<div>`.
|
||||
You can define them in **`baseFields`** (extracted from the main container element) or in each field's sub-lists. This is especially helpful if you need an item's link or ID stored in the parent `<div>`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Putting It All Together: Larger Example
|
||||
## 7. Putting It All Together: Larger Example
|
||||
|
||||
Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
|
||||
|
||||
@@ -389,19 +623,20 @@ Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post o
|
||||
|
||||
---
|
||||
|
||||
## 7. Tips & Best Practices
|
||||
## 8. Tips & Best Practices
|
||||
|
||||
1. **Inspect the DOM** in Chrome DevTools or Firefox’s Inspector to find stable selectors.
|
||||
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
|
||||
3. **Test** your schema on partial HTML or a test page before a big crawl.
|
||||
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
|
||||
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it’ll often show warnings.
|
||||
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the “parent” item.
|
||||
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
|
||||
1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
|
||||
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
|
||||
3. **Test** your schema on partial HTML or a test page before a big crawl.
|
||||
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
|
||||
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
|
||||
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
|
||||
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
|
||||
8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.
|
||||
|
||||
---
|
||||
|
||||
## 8. Schema Generation Utility
|
||||
## 9. Schema Generation Utility
|
||||
|
||||
While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using LLM. This is particularly useful when:
|
||||
|
||||
@@ -481,27 +716,26 @@ strategy = JsonCssExtractionStrategy(css_schema)
|
||||
- Use OpenAI for production-quality schemas
|
||||
- Use Ollama for development, testing, or when you need a self-hosted solution
|
||||
|
||||
That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!
|
||||
|
||||
---
|
||||
|
||||
## 9. Conclusion
|
||||
## 10. Conclusion
|
||||
|
||||
With **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy**), you can build powerful, **LLM-free** pipelines that:
|
||||
With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:
|
||||
|
||||
- Scrape any consistent site for structured data.
|
||||
- Support nested objects, repeating lists, or advanced transformations.
|
||||
- Support nested objects, repeating lists, or pattern-based extraction.
|
||||
- Scale to thousands of pages quickly and reliably.
|
||||
|
||||
**Next Steps**:
|
||||
**Choosing the Right Strategy**:
|
||||
|
||||
- Combine your extracted JSON with advanced filtering or summarization in a second pass if needed.
|
||||
- For dynamic pages, combine strategies with `js_code` or infinite scroll hooking to ensure all content is loaded.
|
||||
- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
|
||||
- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns
|
||||
- If you need both: first extract structured data with JSON strategies, then use regex on specific fields
|
||||
|
||||
**Remember**: For repeated, structured data, you don’t need to pay for or wait on an LLM. A well-crafted schema plus CSS or XPath gets you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
|
||||
**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
|
||||
|
||||
**Last Updated**: 2025-01-01
|
||||
**Last Updated**: 2025-05-02
|
||||
|
||||
---
|
||||
|
||||
That’s it for **Extracting JSON (No LLM)**! You’ve seen how schema-based approaches (either CSS or XPath) can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!
|
||||
That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!
|
||||
@@ -72,6 +72,14 @@ asyncio.run(main())
|
||||
|
||||
---
|
||||
|
||||
## Video Tutorial
|
||||
|
||||
<div align="center">
|
||||
<iframe width="560" height="315" src="https://www.youtube.com/embed/xo3qK6Hg9AA?start=15" title="Crawl4AI Tutorial" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## What Does Crawl4AI Do?
|
||||
|
||||
Crawl4AI is a feature-rich crawler and scraper that aims to:
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
site_name: Crawl4AI Documentation (v0.5.x)
|
||||
site_name: Crawl4AI Documentation (v0.6.x)
|
||||
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
|
||||
site_url: https://docs.crawl4ai.com
|
||||
repo_url: https://github.com/unclecode/crawl4ai
|
||||
@@ -9,6 +9,7 @@ nav:
|
||||
- Home: 'index.md'
|
||||
- "Ask AI": "core/ask-ai.md"
|
||||
- "Quick Start": "core/quickstart.md"
|
||||
- "Code Examples": "core/examples.md"
|
||||
- Setup & Installation:
|
||||
- "Installation": "core/installation.md"
|
||||
- "Docker Deployment": "core/docker-deployment.md"
|
||||
@@ -90,4 +91,5 @@ extra_javascript:
|
||||
- assets/github_stats.js
|
||||
- assets/selection_ask_ai.js
|
||||
- assets/copy_code.js
|
||||
- assets/floating_ask_ai_button.js
|
||||
- assets/floating_ask_ai_button.js
|
||||
- assets/mobile_menu.js
|
||||
@@ -1,20 +0,0 @@
|
||||
The file /docs/md_v2/api/parameters.md should be updated to include the new network and console capturing parameters.
|
||||
|
||||
Here's what needs to be updated:
|
||||
|
||||
1. Change section title from:
|
||||
```
|
||||
### G) **Debug & Logging**
|
||||
```
|
||||
to:
|
||||
```
|
||||
### G) **Debug, Logging & Capturing**
|
||||
```
|
||||
|
||||
2. Add new parameters to the table:
|
||||
```
|
||||
| **`capture_network_requests`** | `bool` (False) | Captures all network requests, responses, and failures during the crawl. Available in `result.network_requests`. |
|
||||
| **`capture_console_messages`** | `bool` (False) | Captures all browser console messages (logs, warnings, errors) during the crawl. Available in `result.console_messages`. |
|
||||
```
|
||||
|
||||
These changes demonstrate how to use the new network and console capturing features in the CrawlerRunConfig.
|
||||
@@ -8,7 +8,7 @@ dynamic = ["version"]
|
||||
description = "🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.9"
|
||||
license = {text = "MIT"}
|
||||
license = "Apache-2.0"
|
||||
authors = [
|
||||
{name = "Unclecode", email = "unclecode@kidocode.com"}
|
||||
]
|
||||
@@ -40,14 +40,14 @@ dependencies = [
|
||||
"fake-useragent>=2.0.3",
|
||||
"click>=8.1.7",
|
||||
"pyperclip>=1.8.2",
|
||||
"faust-cchardet>=2.1.19",
|
||||
"chardet>=5.2.0",
|
||||
"aiohttp>=3.11.11",
|
||||
"brotli>=1.1.0",
|
||||
"humanize>=4.10.0",
|
||||
]
|
||||
classifiers = [
|
||||
"Development Status :: 4 - Beta",
|
||||
"Intended Audience :: Developers",
|
||||
"License :: OSI Approved :: Apache Software License",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
|
||||
@@ -21,4 +21,5 @@ psutil>=6.1.1
|
||||
nltk>=3.9.1
|
||||
rich>=13.9.4
|
||||
cssselect>=1.2.0
|
||||
faust-cchardet>=2.1.19
|
||||
chardet>=5.2.0
|
||||
brotli>=1.1.0
|
||||
3
setup.py
3
setup.py
@@ -49,13 +49,12 @@ setup(
|
||||
url="https://github.com/unclecode/crawl4ai",
|
||||
author="Unclecode",
|
||||
author_email="unclecode@kidocode.com",
|
||||
license="MIT",
|
||||
license="Apache-2.0",
|
||||
packages=find_packages(),
|
||||
package_data={"crawl4ai": ["js_snippet/*.js"]},
|
||||
classifiers=[
|
||||
"Development Status :: 3 - Alpha",
|
||||
"Intended Audience :: Developers",
|
||||
"License :: OSI Approved :: Apache Software License",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
|
||||
596
tests/docker/test_rest_api_deep_crawl.py
Normal file
596
tests/docker/test_rest_api_deep_crawl.py
Normal file
@@ -0,0 +1,596 @@
|
||||
# ==== File: test_rest_api_deep_crawl.py ====
|
||||
|
||||
import pytest
|
||||
import pytest_asyncio
|
||||
import httpx
|
||||
import json
|
||||
import asyncio
|
||||
import os
|
||||
from typing import List, Dict, Any, AsyncGenerator
|
||||
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv() # Load environment variables from .env file if present
|
||||
|
||||
# --- Test Configuration ---
|
||||
BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235") # If server is running in Docker, use the host's IP
|
||||
BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020") # If server is running in dev debug mode
|
||||
DEEP_CRAWL_BASE_URL = "https://docs.crawl4ai.com/samples/deepcrawl/"
|
||||
DEEP_CRAWL_DOMAIN = "docs.crawl4ai.com" # Used for domain filter
|
||||
|
||||
# --- Helper Functions ---
|
||||
def load_proxies_from_env() -> List[Dict]:
|
||||
"""Load proxies from PROXIES environment variable"""
|
||||
proxies = []
|
||||
proxies_str = os.getenv("PROXIES", "")
|
||||
if not proxies_str:
|
||||
print("PROXIES environment variable not set or empty.")
|
||||
return proxies
|
||||
try:
|
||||
proxy_list = proxies_str.split(",")
|
||||
for proxy in proxy_list:
|
||||
proxy = proxy.strip()
|
||||
if not proxy:
|
||||
continue
|
||||
parts = proxy.split(":")
|
||||
if len(parts) == 4:
|
||||
ip, port, username, password = parts
|
||||
proxies.append({
|
||||
"server": f"http://{ip}:{port}", # Assuming http, adjust if needed
|
||||
"username": username,
|
||||
"password": password,
|
||||
"ip": ip # Store original IP if available
|
||||
})
|
||||
elif len(parts) == 2: # ip:port only
|
||||
ip, port = parts
|
||||
proxies.append({
|
||||
"server": f"http://{ip}:{port}",
|
||||
"ip": ip
|
||||
})
|
||||
else:
|
||||
print(f"Skipping invalid proxy string format: {proxy}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error loading proxies from environment: {e}")
|
||||
return proxies
|
||||
|
||||
|
||||
async def check_server_health(client: httpx.AsyncClient):
|
||||
"""Check if the server is healthy before running tests."""
|
||||
try:
|
||||
response = await client.get("/health")
|
||||
response.raise_for_status()
|
||||
print(f"\nServer healthy: {response.json()}")
|
||||
return True
|
||||
except (httpx.RequestError, httpx.HTTPStatusError) as e:
|
||||
pytest.fail(f"Server health check failed: {e}. Is the server running at {BASE_URL}?", pytrace=False)
|
||||
|
||||
async def assert_crawl_result_structure(result: Dict[str, Any], check_ssl=False):
|
||||
"""Asserts the basic structure of a single crawl result."""
|
||||
assert isinstance(result, dict)
|
||||
assert "url" in result
|
||||
assert "success" in result
|
||||
assert "html" in result # Basic crawls should return HTML
|
||||
assert "metadata" in result
|
||||
assert isinstance(result["metadata"], dict)
|
||||
assert "depth" in result["metadata"] # Deep crawls add depth
|
||||
|
||||
if check_ssl:
|
||||
assert "ssl_certificate" in result # Check if SSL info is present
|
||||
assert isinstance(result["ssl_certificate"], dict) or result["ssl_certificate"] is None
|
||||
|
||||
|
||||
async def process_streaming_response(response: httpx.Response) -> List[Dict[str, Any]]:
|
||||
"""Processes an NDJSON streaming response."""
|
||||
results = []
|
||||
completed = False
|
||||
async for line in response.aiter_lines():
|
||||
if line:
|
||||
try:
|
||||
data = json.loads(line)
|
||||
if data.get("status") == "completed":
|
||||
completed = True
|
||||
break # Stop processing after completion marker
|
||||
elif data.get("url"): # Ensure it looks like a result object
|
||||
results.append(data)
|
||||
else:
|
||||
print(f"Received non-result JSON line: {data}") # Log other status messages if needed
|
||||
except json.JSONDecodeError:
|
||||
pytest.fail(f"Failed to decode JSON line: {line}")
|
||||
assert completed, "Streaming response did not end with a completion marker."
|
||||
return results
|
||||
|
||||
|
||||
# --- Pytest Fixtures ---
|
||||
@pytest_asyncio.fixture(scope="function")
|
||||
async def async_client() -> AsyncGenerator[httpx.AsyncClient, None]:
|
||||
"""Provides an async HTTP client"""
|
||||
# Increased timeout for potentially longer deep crawls
|
||||
async with httpx.AsyncClient(base_url=BASE_URL, timeout=300.0) as client:
|
||||
yield client
|
||||
# No explicit close needed with 'async with'
|
||||
|
||||
# --- Test Class ---
|
||||
@pytest.mark.asyncio
|
||||
class TestDeepCrawlEndpoints:
|
||||
|
||||
@pytest_asyncio.fixture(autouse=True)
|
||||
async def check_health_before_tests(self, async_client: httpx.AsyncClient):
|
||||
"""Fixture to ensure server is healthy before each test in the class."""
|
||||
await check_server_health(async_client)
|
||||
|
||||
# 1. Basic Deep Crawl
|
||||
async def test_deep_crawl_basic_bfs(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl with limited depth and pages."""
|
||||
max_depth = 1
|
||||
max_pages = 3 # start_url + 2 more
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS", # Use string value for CacheMode
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
# Minimal filters for basic test
|
||||
"filter_chain": {
|
||||
"type": "FilterChain",
|
||||
"params": {
|
||||
"filters": [
|
||||
{
|
||||
"type": "DomainFilter",
|
||||
"params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
assert data["success"] is True
|
||||
assert isinstance(data["results"], list)
|
||||
assert len(data["results"]) > 1 # Should be more than just the start URL
|
||||
assert len(data["results"]) <= max_pages # Respect max_pages
|
||||
|
||||
found_depth_0 = False
|
||||
found_depth_1 = False
|
||||
for result in data["results"]:
|
||||
await assert_crawl_result_structure(result)
|
||||
assert result["success"] is True
|
||||
assert DEEP_CRAWL_DOMAIN in result["url"]
|
||||
depth = result["metadata"]["depth"]
|
||||
assert depth <= max_depth
|
||||
if depth == 0: found_depth_0 = True
|
||||
if depth == 1: found_depth_1 = True
|
||||
|
||||
assert found_depth_0
|
||||
assert found_depth_1
|
||||
|
||||
# 2. Deep Crawl with Filtering
|
||||
async def test_deep_crawl_with_filters(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl with content type and domain filters."""
|
||||
max_depth = 1
|
||||
max_pages = 5
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
"filter_chain": {
|
||||
"type": "FilterChain",
|
||||
"params": {
|
||||
"filters": [
|
||||
{
|
||||
"type": "DomainFilter",
|
||||
"params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}
|
||||
},
|
||||
{
|
||||
"type": "ContentTypeFilter",
|
||||
"params": {"allowed_types": ["text/html"]}
|
||||
},
|
||||
# Example: Exclude specific paths using regex
|
||||
{
|
||||
"type": "URLPatternFilter",
|
||||
"params": {
|
||||
"patterns": ["*/category-3/*"], # Block category 3
|
||||
"reverse": True # Block if match
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) > 0
|
||||
assert len(data["results"]) <= max_pages
|
||||
|
||||
for result in data["results"]:
|
||||
await assert_crawl_result_structure(result)
|
||||
assert result["success"] is True
|
||||
assert DEEP_CRAWL_DOMAIN in result["url"]
|
||||
assert "category-3" not in result["url"] # Check if filter worked
|
||||
assert result["metadata"]["depth"] <= max_depth
|
||||
|
||||
# 3. Deep Crawl with Scoring
|
||||
async def test_deep_crawl_with_scoring(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl with URL scoring."""
|
||||
max_depth = 1
|
||||
max_pages = 4
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
"filter_chain": { # Keep basic domain filter
|
||||
"type": "FilterChain",
|
||||
"params": { "filters": [{"type": "DomainFilter", "params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}}]}
|
||||
},
|
||||
"url_scorer": { # Add scorer
|
||||
"type": "CompositeScorer",
|
||||
"params": {
|
||||
"scorers": [
|
||||
{ # Favor pages with 'product' in the URL
|
||||
"type": "KeywordRelevanceScorer",
|
||||
"params": {"keywords": ["product"], "weight": 1.0}
|
||||
},
|
||||
{ # Penalize deep paths slightly
|
||||
"type": "PathDepthScorer",
|
||||
"params": {"optimal_depth": 2, "weight": -0.2}
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
# Set a threshold if needed: "score_threshold": 0.1
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) > 0
|
||||
assert len(data["results"]) <= max_pages
|
||||
|
||||
# Check if results seem biased towards products (harder to assert strictly without knowing exact scores)
|
||||
product_urls_found = any("product_" in result["url"] for result in data["results"] if result["metadata"]["depth"] > 0)
|
||||
print(f"Product URLs found among depth > 0 results: {product_urls_found}")
|
||||
# We expect scoring to prioritize product pages if available within limits
|
||||
# assert product_urls_found # This might be too strict depending on site structure and limits
|
||||
|
||||
for result in data["results"]:
|
||||
await assert_crawl_result_structure(result)
|
||||
assert result["success"] is True
|
||||
assert result["metadata"]["depth"] <= max_depth
|
||||
|
||||
# 4. Deep Crawl with CSS Extraction
|
||||
async def test_deep_crawl_with_css_extraction(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl combined with JsonCssExtractionStrategy."""
|
||||
max_depth = 6 # Go deep enough to reach product pages
|
||||
max_pages = 20
|
||||
# Schema to extract product details
|
||||
product_schema = {
|
||||
"name": "ProductDetails",
|
||||
"baseSelector": "div.container", # Base for product page
|
||||
"fields": [
|
||||
{"name": "product_title", "selector": "h1", "type": "text"},
|
||||
{"name": "price", "selector": ".product-price", "type": "text"},
|
||||
{"name": "description", "selector": ".product-description p", "type": "text"},
|
||||
{"name": "specs", "selector": ".product-specs li", "type": "list", "fields":[
|
||||
{"name": "spec_name", "selector": ".spec-name", "type": "text"},
|
||||
{"name": "spec_value", "selector": ".spec-value", "type": "text"}
|
||||
]}
|
||||
]
|
||||
}
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"extraction_strategy": { # Apply extraction to ALL crawled pages
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {"schema": {"type": "dict", "value": product_schema}}
|
||||
},
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
"filter_chain": { # Only crawl HTML on our domain
|
||||
"type": "FilterChain",
|
||||
"params": {
|
||||
"filters": [
|
||||
{"type": "DomainFilter", "params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}},
|
||||
{"type": "ContentTypeFilter", "params": {"allowed_types": ["text/html"]}}
|
||||
]
|
||||
}
|
||||
}
|
||||
# Optional: Add scoring to prioritize product pages for extraction
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) > 0
|
||||
# assert len(data["results"]) <= max_pages
|
||||
|
||||
found_extracted_product = False
|
||||
for result in data["results"]:
|
||||
await assert_crawl_result_structure(result)
|
||||
assert result["success"] is True
|
||||
assert "extracted_content" in result
|
||||
if "product_" in result["url"]: # Check product pages specifically
|
||||
assert result["extracted_content"] is not None
|
||||
try:
|
||||
extracted = json.loads(result["extracted_content"])
|
||||
# Schema returns list even if one base match
|
||||
assert isinstance(extracted, list)
|
||||
if extracted:
|
||||
item = extracted[0]
|
||||
assert "product_title" in item and item["product_title"]
|
||||
assert "price" in item and item["price"]
|
||||
# Specs might be empty list if not found
|
||||
assert "specs" in item and isinstance(item["specs"], list)
|
||||
found_extracted_product = True
|
||||
print(f"Extracted product: {item.get('product_title')}")
|
||||
except (json.JSONDecodeError, AssertionError, IndexError) as e:
|
||||
pytest.fail(f"Extraction validation failed for {result['url']}: {e}\nContent: {result['extracted_content']}")
|
||||
# else:
|
||||
# # Non-product pages might have None or empty list depending on schema match
|
||||
# assert result["extracted_content"] is None or json.loads(result["extracted_content"]) == []
|
||||
|
||||
assert found_extracted_product, "Did not find any pages where product data was successfully extracted."
|
||||
|
||||
# 5. Deep Crawl with LLM Extraction (Requires Server LLM Setup)
|
||||
async def test_deep_crawl_with_llm_extraction(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl combined with LLMExtractionStrategy."""
|
||||
max_depth = 1 # Limit depth to keep LLM calls manageable
|
||||
max_pages = 3
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"extraction_strategy": { # Apply LLM extraction to crawled pages
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"instruction": "Extract the main H1 title and the text content of the first paragraph.",
|
||||
"llm_config": { # Example override, rely on server default if possible
|
||||
"type": "LLMConfig",
|
||||
"params": {"provider": "openai/gpt-4.1-mini"} # Use a cheaper model for testing
|
||||
},
|
||||
"schema": { # Expected JSON output
|
||||
"type": "dict",
|
||||
"value": {
|
||||
"title": "PageContent", "type": "object",
|
||||
"properties": {
|
||||
"h1_title": {"type": "string"},
|
||||
"first_paragraph": {"type": "string"}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
"filter_chain": {
|
||||
"type": "FilterChain",
|
||||
"params": {
|
||||
"filters": [
|
||||
{"type": "DomainFilter", "params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}},
|
||||
{"type": "ContentTypeFilter", "params": {"allowed_types": ["text/html"]}}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
except httpx.HTTPStatusError as e:
|
||||
pytest.fail(f"Deep Crawl + LLM extraction request failed: {e}. Response: {e.response.text}. Check server logs and LLM API key setup.")
|
||||
except httpx.RequestError as e:
|
||||
pytest.fail(f"Deep Crawl + LLM extraction request failed: {e}.")
|
||||
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) > 0
|
||||
assert len(data["results"]) <= max_pages
|
||||
|
||||
found_llm_extraction = False
|
||||
for result in data["results"]:
|
||||
await assert_crawl_result_structure(result)
|
||||
assert result["success"] is True
|
||||
assert "extracted_content" in result
|
||||
assert result["extracted_content"] is not None
|
||||
try:
|
||||
extracted = json.loads(result["extracted_content"])
|
||||
if isinstance(extracted, list): extracted = extracted[0] # Handle list output
|
||||
assert isinstance(extracted, dict)
|
||||
assert "h1_title" in extracted # Check keys based on schema
|
||||
assert "first_paragraph" in extracted
|
||||
found_llm_extraction = True
|
||||
print(f"LLM extracted from {result['url']}: Title='{extracted.get('h1_title')}'")
|
||||
except (json.JSONDecodeError, AssertionError, IndexError, TypeError) as e:
|
||||
pytest.fail(f"LLM extraction validation failed for {result['url']}: {e}\nContent: {result['extracted_content']}")
|
||||
|
||||
assert found_llm_extraction, "LLM extraction did not yield expected data on any crawled page."
|
||||
|
||||
|
||||
# 6. Deep Crawl with SSL Certificate Fetching
|
||||
async def test_deep_crawl_with_ssl(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl with fetch_ssl_certificate enabled."""
|
||||
max_depth = 0 # Only fetch for start URL to keep test fast
|
||||
max_pages = 1
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"fetch_ssl_certificate": True, # <-- Enable SSL fetching
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) == 1
|
||||
result = data["results"][0]
|
||||
|
||||
await assert_crawl_result_structure(result, check_ssl=True) # <-- Tell helper to check SSL field
|
||||
assert result["success"] is True
|
||||
# Check if SSL info was actually retrieved
|
||||
if result["ssl_certificate"]:
|
||||
# Assert directly using dictionary keys
|
||||
assert isinstance(result["ssl_certificate"], dict) # Verify it's a dict
|
||||
assert "issuer" in result["ssl_certificate"]
|
||||
assert "subject" in result["ssl_certificate"]
|
||||
# --- MODIFIED ASSERTIONS ---
|
||||
assert "not_before" in result["ssl_certificate"] # Check for the actual key
|
||||
assert "not_after" in result["ssl_certificate"] # Check for the actual key
|
||||
# --- END MODIFICATIONS ---
|
||||
assert "fingerprint" in result["ssl_certificate"] # Check another key
|
||||
|
||||
# This print statement using .get() already works correctly with dictionaries
|
||||
print(f"SSL Issuer Org: {result['ssl_certificate'].get('issuer', {}).get('O', 'N/A')}")
|
||||
print(f"SSL Valid From: {result['ssl_certificate'].get('not_before', 'N/A')}")
|
||||
else:
|
||||
# This part remains the same
|
||||
print("SSL Certificate was null in the result.")
|
||||
|
||||
|
||||
# 7. Deep Crawl with Proxy Rotation (Requires PROXIES env var)
|
||||
async def test_deep_crawl_with_proxies(self, async_client: httpx.AsyncClient):
|
||||
"""Test BFS deep crawl using proxy rotation."""
|
||||
proxies = load_proxies_from_env()
|
||||
if not proxies:
|
||||
pytest.skip("Skipping proxy test: PROXIES environment variable not set or empty.")
|
||||
|
||||
print(f"\nTesting with {len(proxies)} proxies loaded from environment.")
|
||||
|
||||
max_depth = 1
|
||||
max_pages = 3
|
||||
payload = {
|
||||
"urls": [DEEP_CRAWL_BASE_URL], # Use the dummy site
|
||||
# Use a BrowserConfig that *might* pick up proxy if set, but rely on CrawlerRunConfig
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"stream": False,
|
||||
"cache_mode": "BYPASS",
|
||||
"proxy_rotation_strategy": { # <-- Define the strategy
|
||||
"type": "RoundRobinProxyStrategy",
|
||||
"params": {
|
||||
# Convert ProxyConfig dicts back to the serialized format expected by server
|
||||
"proxies": [{"type": "ProxyConfig", "params": p} for p in proxies]
|
||||
}
|
||||
},
|
||||
"deep_crawl_strategy": {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": max_depth,
|
||||
"max_pages": max_pages,
|
||||
"filter_chain": {
|
||||
"type": "FilterChain",
|
||||
"params": { "filters": [{"type": "DomainFilter", "params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}}]}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
try:
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
except httpx.HTTPStatusError as e:
|
||||
# Proxies often cause connection errors, catch them
|
||||
pytest.fail(f"Proxy deep crawl failed: {e}. Response: {e.response.text}. Are proxies valid and accessible by the server?")
|
||||
except httpx.RequestError as e:
|
||||
pytest.fail(f"Proxy deep crawl request failed: {e}. Are proxies valid and accessible?")
|
||||
|
||||
assert data["success"] is True
|
||||
assert len(data["results"]) > 0
|
||||
assert len(data["results"]) <= max_pages
|
||||
# Primary assertion is that the crawl succeeded *with* proxy config
|
||||
print(f"Proxy deep crawl completed successfully for {len(data['results'])} pages.")
|
||||
|
||||
# Verifying specific proxy usage requires server logs or custom headers/responses
|
||||
|
||||
|
||||
# --- Main Execution Block (for running script directly) ---
|
||||
if __name__ == "__main__":
|
||||
pytest_args = ["-v", "-s", __file__]
|
||||
# Example: Run only proxy test
|
||||
# pytest_args.append("-k test_deep_crawl_with_proxies")
|
||||
print(f"Running pytest with args: {pytest_args}")
|
||||
exit_code = pytest.main(pytest_args)
|
||||
print(f"Pytest finished with exit code: {exit_code}")
|
||||
335
tests/general/generate_dummy_site.py
Normal file
335
tests/general/generate_dummy_site.py
Normal file
@@ -0,0 +1,335 @@
|
||||
# ==== File: build_dummy_site.py ====
|
||||
|
||||
import os
|
||||
import random
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from urllib.parse import quote
|
||||
|
||||
# --- Configuration ---
|
||||
NUM_CATEGORIES = 3
|
||||
NUM_SUBCATEGORIES_PER_CAT = 2 # Results in NUM_CATEGORIES * NUM_SUBCATEGORIES_PER_CAT total L2 categories
|
||||
NUM_PRODUCTS_PER_SUBCAT = 5 # Products listed on L3 pages
|
||||
MAX_DEPTH_TARGET = 5 # Explicitly set target depth
|
||||
|
||||
# --- Helper Functions ---
|
||||
|
||||
def generate_lorem(words=20):
|
||||
"""Generates simple placeholder text."""
|
||||
lorem_words = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
|
||||
"adipiscing", "elit", "sed", "do", "eiusmod", "tempor",
|
||||
"incididunt", "ut", "labore", "et", "dolore", "magna", "aliqua"]
|
||||
return " ".join(random.choice(lorem_words) for _ in range(words)).capitalize() + "."
|
||||
|
||||
def create_html_page(filepath: Path, title: str, body_content: str, breadcrumbs: list = [], head_extras: str = ""):
|
||||
"""Creates an HTML file with basic structure and inline CSS."""
|
||||
os.makedirs(filepath.parent, exist_ok=True)
|
||||
|
||||
# Generate breadcrumb HTML using the 'link' provided in the breadcrumbs list
|
||||
breadcrumb_html = ""
|
||||
if breadcrumbs:
|
||||
links_html = " » ".join(f'<a href="{bc["link"]}">{bc["name"]}</a>' for bc in breadcrumbs)
|
||||
breadcrumb_html = f"<nav class='breadcrumbs'>{links_html} » {title}</nav>"
|
||||
|
||||
# Basic CSS for structure identification (kept the same)
|
||||
css = """
|
||||
<style>
|
||||
body {
|
||||
font-family: sans-serif;
|
||||
padding: 20px;
|
||||
background-color: #1e1e1e;
|
||||
color: #d1d1d1;
|
||||
}
|
||||
|
||||
.container {
|
||||
max-width: 960px;
|
||||
margin: auto;
|
||||
background: #2c2c2c;
|
||||
padding: 20px;
|
||||
border-radius: 5px;
|
||||
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.5);
|
||||
}
|
||||
|
||||
h1, h2 {
|
||||
color: #ccc;
|
||||
}
|
||||
|
||||
a {
|
||||
color: #9bcdff;
|
||||
text-decoration: none;
|
||||
}
|
||||
|
||||
a:hover {
|
||||
text-decoration: underline;
|
||||
}
|
||||
|
||||
ul {
|
||||
list-style: none;
|
||||
padding-left: 0;
|
||||
}
|
||||
|
||||
li {
|
||||
margin-bottom: 10px;
|
||||
}
|
||||
|
||||
.category-link,
|
||||
.subcategory-link,
|
||||
.product-link,
|
||||
.details-link,
|
||||
.reviews-link {
|
||||
display: block;
|
||||
padding: 8px;
|
||||
background-color: #3a3a3a;
|
||||
border-radius: 3px;
|
||||
}
|
||||
|
||||
.product-preview {
|
||||
border: 1px solid #444;
|
||||
padding: 10px;
|
||||
margin-bottom: 10px;
|
||||
border-radius: 4px;
|
||||
background-color: #2a2a2a;
|
||||
}
|
||||
|
||||
.product-title {
|
||||
color: #d1d1d1;
|
||||
}
|
||||
|
||||
.product-price {
|
||||
font-weight: bold;
|
||||
color: #85e085;
|
||||
}
|
||||
|
||||
.product-description,
|
||||
.product-specs,
|
||||
.product-reviews {
|
||||
margin-top: 15px;
|
||||
line-height: 1.6;
|
||||
}
|
||||
|
||||
.product-specs li {
|
||||
margin-bottom: 5px;
|
||||
font-size: 0.9em;
|
||||
}
|
||||
|
||||
.spec-name {
|
||||
font-weight: bold;
|
||||
}
|
||||
|
||||
.breadcrumbs {
|
||||
margin-bottom: 20px;
|
||||
font-size: 0.9em;
|
||||
color: #888;
|
||||
}
|
||||
|
||||
.breadcrumbs a {
|
||||
color: #9bcdff;
|
||||
}
|
||||
</style>
|
||||
"""
|
||||
html_content = f"""<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>{title} - FakeShop</title>
|
||||
{head_extras}
|
||||
{css}
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
{breadcrumb_html}
|
||||
<h1>{title}</h1>
|
||||
{body_content}
|
||||
</div>
|
||||
</body>
|
||||
</html>"""
|
||||
with open(filepath, "w", encoding="utf-8") as f:
|
||||
f.write(html_content)
|
||||
# Keep print statement concise for clarity
|
||||
# print(f"Created: {filepath}")
|
||||
|
||||
def generate_site(base_dir: Path, site_name: str = "FakeShop", base_path: str = ""):
|
||||
"""Generates the dummy website structure."""
|
||||
base_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# --- Clean and prepare the base path for URL construction ---
|
||||
# Ensure it starts with '/' if not empty, and remove any trailing '/'
|
||||
if base_path:
|
||||
full_base_path = "/" + base_path.strip('/')
|
||||
else:
|
||||
full_base_path = "" # Represents the root
|
||||
|
||||
print(f"Using base path for links: '{full_base_path}'")
|
||||
|
||||
# --- Level 0: Homepage ---
|
||||
home_body = "<h2>Welcome to FakeShop!</h2><p>Your one-stop shop for imaginary items.</p><h3>Categories:</h3>\n<ul>"
|
||||
# Define the *actual* link path for the homepage breadcrumb
|
||||
home_link_path = f"{full_base_path}/index.html"
|
||||
breadcrumbs_home = [{"name": "Home", "link": home_link_path}] # Base breadcrumb
|
||||
|
||||
# Links *within* the page content should remain relative
|
||||
for i in range(NUM_CATEGORIES):
|
||||
cat_name = f"Category-{i+1}"
|
||||
cat_folder_name = quote(cat_name.lower().replace(" ", "-"))
|
||||
# This path is relative to the current directory (index.html)
|
||||
cat_relative_page_path = f"{cat_folder_name}/index.html"
|
||||
home_body += f'<li><a class="category-link" href="{cat_relative_page_path}">{cat_name}</a> - {generate_lorem(10)}</li>'
|
||||
home_body += "</ul>"
|
||||
create_html_page(base_dir / "index.html", "Homepage", home_body, []) # No breadcrumbs *on* the homepage itself
|
||||
|
||||
# --- Levels 1-5 ---
|
||||
for i in range(NUM_CATEGORIES):
|
||||
cat_name = f"Category-{i+1}"
|
||||
cat_folder_name = quote(cat_name.lower().replace(" ", "-"))
|
||||
cat_dir = base_dir / cat_folder_name
|
||||
# This is the *absolute* path for the breadcrumb link
|
||||
cat_link_path = f"{full_base_path}/{cat_folder_name}/index.html"
|
||||
# Update breadcrumbs list for this level
|
||||
breadcrumbs_cat = breadcrumbs_home + [{"name": cat_name, "link": cat_link_path}]
|
||||
|
||||
# --- Level 1: Category Page ---
|
||||
cat_body = f"<p>{generate_lorem(15)} for {cat_name}.</p><h3>Sub-Categories:</h3>\n<ul>"
|
||||
for j in range(NUM_SUBCATEGORIES_PER_CAT):
|
||||
subcat_name = f"{cat_name}-Sub-{j+1}"
|
||||
subcat_folder_name = quote(subcat_name.lower().replace(" ", "-"))
|
||||
# Path relative to the category page
|
||||
subcat_relative_page_path = f"{subcat_folder_name}/index.html"
|
||||
cat_body += f'<li><a class="subcategory-link" href="{subcat_relative_page_path}">{subcat_name}</a> - {generate_lorem(8)}</li>'
|
||||
cat_body += "</ul>"
|
||||
# Pass the updated breadcrumbs list
|
||||
create_html_page(cat_dir / "index.html", cat_name, cat_body, breadcrumbs_home) # Parent breadcrumb needed here
|
||||
|
||||
for j in range(NUM_SUBCATEGORIES_PER_CAT):
|
||||
subcat_name = f"{cat_name}-Sub-{j+1}"
|
||||
subcat_folder_name = quote(subcat_name.lower().replace(" ", "-"))
|
||||
subcat_dir = cat_dir / subcat_folder_name
|
||||
# Absolute path for the breadcrumb link
|
||||
subcat_link_path = f"{full_base_path}/{cat_folder_name}/{subcat_folder_name}/index.html"
|
||||
# Update breadcrumbs list for this level
|
||||
breadcrumbs_subcat = breadcrumbs_cat + [{"name": subcat_name, "link": subcat_link_path}]
|
||||
|
||||
# --- Level 2: Sub-Category Page (Product List) ---
|
||||
subcat_body = f"<p>Explore products in {subcat_name}. {generate_lorem(12)}</p><h3>Products:</h3>\n<ul class='product-list'>"
|
||||
for k in range(NUM_PRODUCTS_PER_SUBCAT):
|
||||
prod_id = f"P{i+1}{j+1}{k+1:03d}" # e.g., P11001
|
||||
prod_name = f"{subcat_name} Product {k+1} ({prod_id})"
|
||||
# Filename relative to the subcategory page
|
||||
prod_filename = f"product_{prod_id}.html"
|
||||
# Absolute path for the breadcrumb link
|
||||
prod_link_path = f"{full_base_path}/{cat_folder_name}/{subcat_folder_name}/{prod_filename}"
|
||||
|
||||
# Preview on list page (link remains relative)
|
||||
subcat_body += f"""
|
||||
<li>
|
||||
<div class="product-preview">
|
||||
<a class="product-link" href="{prod_filename}"><strong>{prod_name}</strong></a>
|
||||
<p>{generate_lorem(10)}</p>
|
||||
<span class="product-price">£{random.uniform(10, 500):.2f}</span>
|
||||
</div>
|
||||
</li>"""
|
||||
|
||||
# --- Level 3: Product Page ---
|
||||
prod_price = random.uniform(10, 500)
|
||||
prod_desc = generate_lorem(40)
|
||||
prod_specs = {f"Spec {s+1}": generate_lorem(3) for s in range(random.randint(3,6))}
|
||||
prod_reviews_count = random.randint(0, 150)
|
||||
# Relative filenames for links on this page
|
||||
details_filename_relative = f"product_{prod_id}_details.html"
|
||||
reviews_filename_relative = f"product_{prod_id}_reviews.html"
|
||||
|
||||
prod_body = f"""
|
||||
<p class="product-price">Price: £{prod_price:.2f}</p>
|
||||
<div class="product-description">
|
||||
<h2>Description</h2>
|
||||
<p>{prod_desc}</p>
|
||||
</div>
|
||||
<div class="product-specs">
|
||||
<h2>Specifications</h2>
|
||||
<ul>
|
||||
{''.join(f'<li><span class="spec-name">{name}</span>: <span class="spec-value">{value}</span></li>' for name, value in prod_specs.items())}
|
||||
</ul>
|
||||
</div>
|
||||
<div class="product-reviews">
|
||||
<h2>Reviews</h2>
|
||||
<p>Total Reviews: <span class="review-count">{prod_reviews_count}</span></p>
|
||||
</div>
|
||||
<hr>
|
||||
<p>
|
||||
<a class="details-link" href="{details_filename_relative}">View More Details</a> |
|
||||
<a class="reviews-link" href="{reviews_filename_relative}">See All Reviews</a>
|
||||
</p>
|
||||
"""
|
||||
# Update breadcrumbs list for this level
|
||||
breadcrumbs_prod = breadcrumbs_subcat + [{"name": prod_name, "link": prod_link_path}]
|
||||
# Pass the updated breadcrumbs list
|
||||
create_html_page(subcat_dir / prod_filename, prod_name, prod_body, breadcrumbs_subcat) # Parent breadcrumb needed here
|
||||
|
||||
# --- Level 4: Product Details Page ---
|
||||
details_filename = f"product_{prod_id}_details.html" # Actual filename
|
||||
# Absolute path for the breadcrumb link
|
||||
details_link_path = f"{full_base_path}/{cat_folder_name}/{subcat_folder_name}/{details_filename}"
|
||||
details_body = f"<p>This page contains extremely detailed information about {prod_name}.</p>{generate_lorem(100)}"
|
||||
# Update breadcrumbs list for this level
|
||||
breadcrumbs_details = breadcrumbs_prod + [{"name": "Details", "link": details_link_path}]
|
||||
# Pass the updated breadcrumbs list
|
||||
create_html_page(subcat_dir / details_filename, f"{prod_name} - Details", details_body, breadcrumbs_prod) # Parent breadcrumb needed here
|
||||
|
||||
# --- Level 5: Product Reviews Page ---
|
||||
reviews_filename = f"product_{prod_id}_reviews.html" # Actual filename
|
||||
# Absolute path for the breadcrumb link
|
||||
reviews_link_path = f"{full_base_path}/{cat_folder_name}/{subcat_folder_name}/{reviews_filename}"
|
||||
reviews_body = f"<p>All {prod_reviews_count} reviews for {prod_name} are listed here.</p><ul>"
|
||||
for r in range(prod_reviews_count):
|
||||
reviews_body += f"<li>Review {r+1}: {generate_lorem(random.randint(15, 50))}</li>"
|
||||
reviews_body += "</ul>"
|
||||
# Update breadcrumbs list for this level
|
||||
breadcrumbs_reviews = breadcrumbs_prod + [{"name": "Reviews", "link": reviews_link_path}]
|
||||
# Pass the updated breadcrumbs list
|
||||
create_html_page(subcat_dir / reviews_filename, f"{prod_name} - Reviews", reviews_body, breadcrumbs_prod) # Parent breadcrumb needed here
|
||||
|
||||
|
||||
subcat_body += "</ul>" # Close product-list ul
|
||||
# Pass the correct breadcrumbs list for the subcategory index page
|
||||
create_html_page(subcat_dir / "index.html", subcat_name, subcat_body, breadcrumbs_cat) # Parent breadcrumb needed here
|
||||
|
||||
|
||||
# --- Main Execution ---
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Generate a dummy multi-level retail website.")
|
||||
parser.add_argument(
|
||||
"-o", "--output-dir",
|
||||
type=str,
|
||||
default="dummy_retail_site",
|
||||
help="Directory to generate the website in."
|
||||
)
|
||||
parser.add_argument(
|
||||
"-n", "--site-name",
|
||||
type=str,
|
||||
default="FakeShop",
|
||||
help="Name of the fake shop."
|
||||
)
|
||||
parser.add_argument(
|
||||
"-b", "--base-path",
|
||||
type=str,
|
||||
default="",
|
||||
help="Base path for hosting the site (e.g., 'samples/deepcrawl'). Leave empty if hosted at the root."
|
||||
)
|
||||
# Optional: Add more args to configure counts if needed
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
output_directory = Path(args.output_dir)
|
||||
site_name = args.site_name
|
||||
base_path = args.base_path
|
||||
|
||||
print(f"Generating dummy site '{site_name}' in '{output_directory}'...")
|
||||
# Pass the base_path to the generation function
|
||||
generate_site(output_directory, site_name, base_path)
|
||||
print(f"\nCreated {sum(1 for _ in output_directory.rglob('*.html'))} HTML pages.")
|
||||
print("Dummy site generation complete.")
|
||||
print(f"To serve locally (example): python -m http.server --directory {output_directory} 8000")
|
||||
if base_path:
|
||||
print(f"Access the site at: http://localhost:8000/{base_path.strip('/')}/index.html")
|
||||
else:
|
||||
print(f"Access the site at: http://localhost:8000/index.html")
|
||||
106
tests/general/test_content_source_parameter.py
Normal file
106
tests/general/test_content_source_parameter.py
Normal file
@@ -0,0 +1,106 @@
|
||||
"""
|
||||
Tests for the content_source parameter in markdown generation.
|
||||
"""
|
||||
import unittest
|
||||
import asyncio
|
||||
from unittest.mock import patch, MagicMock
|
||||
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator, MarkdownGenerationStrategy
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from crawl4ai.models import MarkdownGenerationResult
|
||||
|
||||
HTML_SAMPLE = """
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<h1>Test Content</h1>
|
||||
<p>This is a test paragraph.</p>
|
||||
<div class="container">
|
||||
<p>This is content within a container.</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
class TestContentSourceParameter(unittest.TestCase):
|
||||
"""Test cases for the content_source parameter in markdown generation."""
|
||||
|
||||
def setUp(self):
|
||||
"""Set up test fixtures."""
|
||||
self.loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(self.loop)
|
||||
|
||||
def tearDown(self):
|
||||
"""Tear down test fixtures."""
|
||||
self.loop.close()
|
||||
|
||||
def test_default_content_source(self):
|
||||
"""Test that the default content_source is 'cleaned_html'."""
|
||||
# Can't directly instantiate abstract class, so just test DefaultMarkdownGenerator
|
||||
generator = DefaultMarkdownGenerator()
|
||||
self.assertEqual(generator.content_source, "cleaned_html")
|
||||
|
||||
def test_custom_content_source(self):
|
||||
"""Test that content_source can be customized."""
|
||||
generator = DefaultMarkdownGenerator(content_source="fit_html")
|
||||
self.assertEqual(generator.content_source, "fit_html")
|
||||
|
||||
@patch('crawl4ai.markdown_generation_strategy.CustomHTML2Text')
|
||||
def test_html_processing_using_input_html(self, mock_html2text):
|
||||
"""Test that generate_markdown uses input_html parameter."""
|
||||
# Setup mock
|
||||
mock_instance = MagicMock()
|
||||
mock_instance.handle.return_value = "# Test Content\n\nThis is a test paragraph."
|
||||
mock_html2text.return_value = mock_instance
|
||||
|
||||
# Create generator and call generate_markdown
|
||||
generator = DefaultMarkdownGenerator()
|
||||
result = generator.generate_markdown(input_html="<h1>Test Content</h1><p>This is a test paragraph.</p>")
|
||||
|
||||
# Verify input_html was passed to HTML2Text handler
|
||||
mock_instance.handle.assert_called_once()
|
||||
# Get the first positional argument
|
||||
args, _ = mock_instance.handle.call_args
|
||||
self.assertEqual(args[0], "<h1>Test Content</h1><p>This is a test paragraph.</p>")
|
||||
|
||||
# Check result
|
||||
self.assertIsInstance(result, MarkdownGenerationResult)
|
||||
self.assertEqual(result.raw_markdown, "# Test Content\n\nThis is a test paragraph.")
|
||||
|
||||
def test_html_source_selection_logic(self):
|
||||
"""Test that the HTML source selection logic works correctly."""
|
||||
# We'll test the dispatch pattern directly to avoid async complexities
|
||||
|
||||
# Create test data
|
||||
raw_html = "<html><body><h1>Raw HTML</h1></body></html>"
|
||||
cleaned_html = "<html><body><h1>Cleaned HTML</h1></body></html>"
|
||||
fit_html = "<html><body><h1>Preprocessed HTML</h1></body></html>"
|
||||
|
||||
# Test the dispatch pattern
|
||||
html_source_selector = {
|
||||
"raw_html": lambda: raw_html,
|
||||
"cleaned_html": lambda: cleaned_html,
|
||||
"fit_html": lambda: fit_html,
|
||||
}
|
||||
|
||||
# Test Case 1: content_source="cleaned_html"
|
||||
source_lambda = html_source_selector.get("cleaned_html")
|
||||
self.assertEqual(source_lambda(), cleaned_html)
|
||||
|
||||
# Test Case 2: content_source="raw_html"
|
||||
source_lambda = html_source_selector.get("raw_html")
|
||||
self.assertEqual(source_lambda(), raw_html)
|
||||
|
||||
# Test Case 3: content_source="fit_html"
|
||||
source_lambda = html_source_selector.get("fit_html")
|
||||
self.assertEqual(source_lambda(), fit_html)
|
||||
|
||||
# Test Case 4: Invalid content_source falls back to cleaned_html
|
||||
source_lambda = html_source_selector.get("invalid_source", lambda: cleaned_html)
|
||||
self.assertEqual(source_lambda(), cleaned_html)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
119
tests/mcp/test_mcp_socket.py
Normal file
119
tests/mcp/test_mcp_socket.py
Normal file
@@ -0,0 +1,119 @@
|
||||
# pip install "mcp-sdk[ws]" anyio
|
||||
import anyio, json
|
||||
from mcp.client.websocket import websocket_client
|
||||
from mcp.client.session import ClientSession
|
||||
|
||||
async def test_list():
|
||||
async with websocket_client("ws://localhost:8020/mcp/ws") as (r, w):
|
||||
async with ClientSession(r, w) as s:
|
||||
await s.initialize()
|
||||
|
||||
print("tools :", [t.name for t in (await s.list_tools()).tools])
|
||||
print("resources :", [r.name for r in (await s.list_resources()).resources])
|
||||
print("templates :", [t.name for t in (await s.list_resource_templates()).resource_templates])
|
||||
|
||||
|
||||
async def test_crawl(s: ClientSession) -> None:
|
||||
"""Hit the @mcp_tool('crawl') endpoint."""
|
||||
res = await s.call_tool(
|
||||
"crawl",
|
||||
{
|
||||
"urls": ["https://example.com"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {},
|
||||
},
|
||||
)
|
||||
print("crawl →", json.loads(res.content[0].text))
|
||||
|
||||
|
||||
async def test_md(s: ClientSession) -> None:
|
||||
"""Hit the @mcp_tool('md') endpoint."""
|
||||
res = await s.call_tool(
|
||||
"md",
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"f": "fit", # or RAW, BM25, LLM
|
||||
"q": None,
|
||||
"c": "0",
|
||||
},
|
||||
)
|
||||
result = json.loads(res.content[0].text)
|
||||
print("md →", result['markdown'][:100], "...")
|
||||
|
||||
async def test_screenshot(s: ClientSession):
|
||||
res = await s.call_tool(
|
||||
"screenshot",
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"screenshot_wait_for": 1.0,
|
||||
},
|
||||
)
|
||||
png_b64 = json.loads(res.content[0].text)["screenshot"]
|
||||
print("screenshot →", png_b64[:60], "… (base64)")
|
||||
|
||||
|
||||
async def test_pdf(s: ClientSession):
|
||||
res = await s.call_tool(
|
||||
"pdf",
|
||||
{
|
||||
"url": "https://example.com",
|
||||
},
|
||||
)
|
||||
pdf_b64 = json.loads(res.content[0].text)["pdf"]
|
||||
print("pdf →", pdf_b64[:60], "… (base64)")
|
||||
|
||||
async def test_execute_js(s: ClientSession):
|
||||
# click the “More” link on Hacker News front page and wait 1 s
|
||||
res = await s.call_tool(
|
||||
"execute_js",
|
||||
{
|
||||
"url": "https://news.ycombinator.com/news",
|
||||
"js_code": [
|
||||
"await page.click('a.morelink')",
|
||||
"await page.waitForTimeout(1000)",
|
||||
],
|
||||
},
|
||||
)
|
||||
crawl_result = json.loads(res.content[0].text)
|
||||
print("execute_js → status", crawl_result["success"], "| html len:", len(crawl_result["html"]))
|
||||
|
||||
async def test_html(s: ClientSession):
|
||||
# click the “More” link on Hacker News front page and wait 1 s
|
||||
res = await s.call_tool(
|
||||
"html",
|
||||
{
|
||||
"url": "https://news.ycombinator.com/news",
|
||||
},
|
||||
)
|
||||
crawl_result = json.loads(res.content[0].text)
|
||||
print("execute_js → status", crawl_result["success"], "| html len:", len(crawl_result["html"]))
|
||||
|
||||
async def test_context(s: ClientSession):
|
||||
# click the “More” link on Hacker News front page and wait 1 s
|
||||
res = await s.call_tool(
|
||||
"ask",
|
||||
{
|
||||
"query": "I hv a question about Crawl4ai library, how to extract internal links when crawling a page?"
|
||||
},
|
||||
)
|
||||
crawl_result = json.loads(res.content[0].text)
|
||||
print("execute_js → status", crawl_result["success"], "| html len:", len(crawl_result["html"]))
|
||||
|
||||
|
||||
async def main() -> None:
|
||||
async with websocket_client("ws://localhost:11235/mcp/ws") as (r, w):
|
||||
async with ClientSession(r, w) as s:
|
||||
await s.initialize() # handshake
|
||||
tools = (await s.list_tools()).tools
|
||||
print("tools:", [t.name for t in tools])
|
||||
|
||||
# await test_list()
|
||||
await test_crawl(s)
|
||||
await test_md(s)
|
||||
await test_screenshot(s)
|
||||
await test_pdf(s)
|
||||
await test_execute_js(s)
|
||||
await test_html(s)
|
||||
await test_context(s)
|
||||
|
||||
anyio.run(main)
|
||||
11
tests/mcp/test_mcp_sse.py
Normal file
11
tests/mcp/test_mcp_sse.py
Normal file
@@ -0,0 +1,11 @@
|
||||
from mcp.client.sse import sse_client
|
||||
from mcp.client.session import ClientSession
|
||||
|
||||
async def main():
|
||||
async with sse_client("http://127.0.0.1:8020/mcp") as (r, w):
|
||||
async with ClientSession(r, w) as sess:
|
||||
print(await sess.list_tools()) # now works
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(main())
|
||||
315
tests/memory/README.md
Normal file
315
tests/memory/README.md
Normal file
@@ -0,0 +1,315 @@
|
||||
# Crawl4AI Stress Testing and Benchmarking
|
||||
|
||||
This directory contains tools for stress testing Crawl4AI's `arun_many` method and dispatcher system with high volumes of URLs to evaluate performance, concurrency handling, and potentially detect memory issues. It also includes a benchmarking system to track performance over time.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Run a default stress test (small config) and generate a report
|
||||
# (Assumes run_all.sh is updated to call run_benchmark.py)
|
||||
./run_all.sh
|
||||
```
|
||||
*Note: `run_all.sh` might need to be updated if it directly called the old script.*
|
||||
|
||||
## Overview
|
||||
|
||||
The stress testing system works by:
|
||||
|
||||
1. Generating a local test site with heavy HTML pages (regenerated by default for each test).
|
||||
2. Starting a local HTTP server to serve these pages.
|
||||
3. Running Crawl4AI's `arun_many` method against this local site using the `MemoryAdaptiveDispatcher` with configurable concurrency (`max_sessions`).
|
||||
4. Monitoring performance metrics via the `CrawlerMonitor` and optionally logging memory usage.
|
||||
5. Optionally generating detailed benchmark reports with visualizations using `benchmark_report.py`.
|
||||
|
||||
## Available Tools
|
||||
|
||||
- `test_stress_sdk.py` - Main stress testing script utilizing `arun_many` and dispatchers.
|
||||
- `benchmark_report.py` - Report generator for comparing test results (assumes compatibility with `test_stress_sdk.py` outputs).
|
||||
- `run_benchmark.py` - Python script with predefined test configurations that orchestrates tests using `test_stress_sdk.py`.
|
||||
- `run_all.sh` - Simple wrapper script (may need updating).
|
||||
|
||||
## Usage Guide
|
||||
|
||||
### Using Predefined Configurations (Recommended)
|
||||
|
||||
The `run_benchmark.py` script offers the easiest way to run standardized tests:
|
||||
|
||||
```bash
|
||||
# Quick test (50 URLs, 4 max sessions)
|
||||
python run_benchmark.py quick
|
||||
|
||||
# Medium test (500 URLs, 16 max sessions)
|
||||
python run_benchmark.py medium
|
||||
|
||||
# Large test (1000 URLs, 32 max sessions)
|
||||
python run_benchmark.py large
|
||||
|
||||
# Extreme test (2000 URLs, 64 max sessions)
|
||||
python run_benchmark.py extreme
|
||||
|
||||
# Custom configuration
|
||||
python run_benchmark.py custom --urls 300 --max-sessions 24 --chunk-size 50
|
||||
|
||||
# Run 'small' test in streaming mode
|
||||
python run_benchmark.py small --stream
|
||||
|
||||
# Override max_sessions for the 'medium' config
|
||||
python run_benchmark.py medium --max-sessions 20
|
||||
|
||||
# Skip benchmark report generation after the test
|
||||
python run_benchmark.py small --no-report
|
||||
|
||||
# Clean up reports and site files before running
|
||||
python run_benchmark.py medium --clean
|
||||
```
|
||||
|
||||
#### `run_benchmark.py` Parameters
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| -------------------- | --------------- | --------------------------------------------------------------------------- |
|
||||
| `config` | *required* | Test configuration: `quick`, `small`, `medium`, `large`, `extreme`, `custom`|
|
||||
| `--urls` | config-specific | Number of URLs (required for `custom`) |
|
||||
| `--max-sessions` | config-specific | Max concurrent sessions managed by dispatcher (required for `custom`) |
|
||||
| `--chunk-size` | config-specific | URLs per batch for non-stream logging (required for `custom`) |
|
||||
| `--stream` | False | Enable streaming results (disables batch logging) |
|
||||
| `--monitor-mode` | DETAILED | `DETAILED` or `AGGREGATED` display for the live monitor |
|
||||
| `--use-rate-limiter` | False | Enable basic rate limiter in the dispatcher |
|
||||
| `--port` | 8000 | HTTP server port |
|
||||
| `--no-report` | False | Skip generating comparison report via `benchmark_report.py` |
|
||||
| `--clean` | False | Clean up reports and site files before running |
|
||||
| `--keep-server-alive`| False | Keep local HTTP server running after test |
|
||||
| `--use-existing-site`| False | Use existing site on specified port (no local server start/site gen) |
|
||||
| `--skip-generation` | False | Use existing site files but start local server |
|
||||
| `--keep-site` | False | Keep generated site files after test |
|
||||
|
||||
#### Predefined Configurations
|
||||
|
||||
| Configuration | URLs | Max Sessions | Chunk Size | Description |
|
||||
| ------------- | ------ | ------------ | ---------- | -------------------------------- |
|
||||
| `quick` | 50 | 4 | 10 | Quick test for basic validation |
|
||||
| `small` | 100 | 8 | 20 | Small test for routine checks |
|
||||
| `medium` | 500 | 16 | 50 | Medium test for thorough checks |
|
||||
| `large` | 1000 | 32 | 100 | Large test for stress testing |
|
||||
| `extreme` | 2000 | 64 | 200 | Extreme test for limit testing |
|
||||
|
||||
### Direct Usage of `test_stress_sdk.py`
|
||||
|
||||
For fine-grained control or debugging, you can run the stress test script directly:
|
||||
|
||||
```bash
|
||||
# Test with 200 URLs and 32 max concurrent sessions
|
||||
python test_stress_sdk.py --urls 200 --max-sessions 32 --chunk-size 40
|
||||
|
||||
# Clean up previous test data first
|
||||
python test_stress_sdk.py --clean-reports --clean-site --urls 100 --max-sessions 16 --chunk-size 20
|
||||
|
||||
# Change the HTTP server port and use aggregated monitor
|
||||
python test_stress_sdk.py --port 8088 --urls 100 --max-sessions 16 --monitor-mode AGGREGATED
|
||||
|
||||
# Enable streaming mode and use rate limiting
|
||||
python test_stress_sdk.py --urls 50 --max-sessions 8 --stream --use-rate-limiter
|
||||
|
||||
# Change report output location
|
||||
python test_stress_sdk.py --report-path custom_reports --urls 100 --max-sessions 16
|
||||
```
|
||||
|
||||
#### `test_stress_sdk.py` Parameters
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| -------------------- | ---------- | -------------------------------------------------------------------- |
|
||||
| `--urls` | 100 | Number of URLs to test |
|
||||
| `--max-sessions` | 16 | Maximum concurrent crawling sessions managed by the dispatcher |
|
||||
| `--chunk-size` | 10 | Number of URLs per batch (relevant for non-stream logging) |
|
||||
| `--stream` | False | Enable streaming results (disables batch logging) |
|
||||
| `--monitor-mode` | DETAILED | `DETAILED` or `AGGREGATED` display for the live `CrawlerMonitor` |
|
||||
| `--use-rate-limiter` | False | Enable a basic `RateLimiter` within the dispatcher |
|
||||
| `--site-path` | "test_site"| Path to store/use the generated test site |
|
||||
| `--port` | 8000 | Port for the local HTTP server |
|
||||
| `--report-path` | "reports" | Path to save test result summary (JSON) and memory samples (CSV) |
|
||||
| `--skip-generation` | False | Use existing test site files but still start local server |
|
||||
| `--use-existing-site`| False | Use existing site on specified port (no local server/site gen) |
|
||||
| `--keep-server-alive`| False | Keep local HTTP server running after test completion |
|
||||
| `--keep-site` | False | Keep the generated test site files after test completion |
|
||||
| `--clean-reports` | False | Clean up report directory before running |
|
||||
| `--clean-site` | False | Clean up site directory before/after running (see script logic) |
|
||||
|
||||
### Generating Reports Only
|
||||
|
||||
If you only want to generate a benchmark report from existing test results (assuming `benchmark_report.py` is compatible):
|
||||
|
||||
```bash
|
||||
# Generate a report from existing test results in ./reports/
|
||||
python benchmark_report.py
|
||||
|
||||
# Limit to the most recent 5 test results
|
||||
python benchmark_report.py --limit 5
|
||||
|
||||
# Specify a custom source directory for test results
|
||||
python benchmark_report.py --reports-dir alternate_results
|
||||
```
|
||||
|
||||
#### `benchmark_report.py` Parameters (Assumed)
|
||||
|
||||
| Parameter | Default | Description |
|
||||
| --------------- | -------------------- | ----------------------------------------------------------- |
|
||||
| `--reports-dir` | "reports" | Directory containing `test_stress_sdk.py` result files |
|
||||
| `--output-dir` | "benchmark_reports" | Directory to save generated HTML reports and charts |
|
||||
| `--limit` | None (all results) | Limit comparison to N most recent test results |
|
||||
| `--output-file` | Auto-generated | Custom output filename for the HTML report |
|
||||
|
||||
## Understanding the Test Output
|
||||
|
||||
### Real-time Progress Display (`CrawlerMonitor`)
|
||||
|
||||
When running `test_stress_sdk.py`, the `CrawlerMonitor` provides a live view of the crawling process managed by the dispatcher.
|
||||
|
||||
- **DETAILED Mode (Default):** Shows individual task status (Queued, Active, Completed, Failed), timings, memory usage per task (if `psutil` is available), overall queue statistics, and memory pressure status (if `psutil` available).
|
||||
- **AGGREGATED Mode:** Shows summary counts (Queued, Active, Completed, Failed), overall progress percentage, estimated time remaining, average URLs/sec, and memory pressure status.
|
||||
|
||||
### Batch Log Output (Non-Streaming Mode Only)
|
||||
|
||||
If running `test_stress_sdk.py` **without** the `--stream` flag, you will *also* see per-batch summary lines printed to the console *after* the monitor display, once each chunk of URLs finishes processing:
|
||||
|
||||
```
|
||||
Batch | Progress | Start Mem | End Mem | URLs/sec | Success/Fail | Time (s) | Status
|
||||
───────────────────────────────────────────────────────────────────────────────────────────
|
||||
1 | 10.0% | 50.1 MB | 55.3 MB | 23.8 | 10/0 | 0.42 | Success
|
||||
2 | 20.0% | 55.3 MB | 60.1 MB | 24.1 | 10/0 | 0.41 | Success
|
||||
...
|
||||
```
|
||||
|
||||
This display provides chunk-specific metrics:
|
||||
- **Batch**: The batch number being reported.
|
||||
- **Progress**: Overall percentage of total URLs processed *after* this batch.
|
||||
- **Start Mem / End Mem**: Memory usage before and after processing this batch (if tracked).
|
||||
- **URLs/sec**: Processing speed *for this specific batch*.
|
||||
- **Success/Fail**: Number of successful and failed URLs *in this batch*.
|
||||
- **Time (s)**: Wall-clock time taken to process *this batch*.
|
||||
- **Status**: Color-coded status for the batch outcome.
|
||||
|
||||
### Summary Output
|
||||
|
||||
After test completion, a final summary is displayed:
|
||||
|
||||
```
|
||||
================================================================================
|
||||
Test Completed
|
||||
================================================================================
|
||||
Test ID: 20250418_103015
|
||||
Configuration: 100 URLs, 16 max sessions, Chunk: 10, Stream: False, Monitor: DETAILED
|
||||
Results: 100 successful, 0 failed (100 processed, 100.0% success)
|
||||
Performance: 5.85 seconds total, 17.09 URLs/second avg
|
||||
Memory Usage: Start: 50.1 MB, End: 75.3 MB, Max: 78.1 MB, Growth: 25.2 MB
|
||||
Results summary saved to reports/test_summary_20250418_103015.json
|
||||
```
|
||||
|
||||
### HTML Report Structure (Generated by `benchmark_report.py`)
|
||||
|
||||
(This section remains the same, assuming `benchmark_report.py` generates these)
|
||||
The benchmark report contains several sections:
|
||||
1. **Summary**: Overview of the latest test results and trends
|
||||
2. **Performance Comparison**: Charts showing throughput across tests
|
||||
3. **Memory Usage**: Detailed memory usage graphs for each test
|
||||
4. **Detailed Results**: Tabular data of all test metrics
|
||||
5. **Conclusion**: Automated analysis of performance and memory patterns
|
||||
|
||||
### Memory Metrics
|
||||
|
||||
(This section remains conceptually the same)
|
||||
Memory growth is the key metric for detecting leaks...
|
||||
|
||||
### Performance Metrics
|
||||
|
||||
(This section remains conceptually the same, though "URLs per Worker" is less relevant - focus on overall URLs/sec)
|
||||
Key performance indicators include:
|
||||
- **URLs per Second**: Higher is better (throughput)
|
||||
- **Success Rate**: Should be 100% in normal conditions
|
||||
- **Total Processing Time**: Lower is better
|
||||
- **Dispatcher Efficiency**: Observe queue lengths and wait times in the monitor (Detailed mode)
|
||||
|
||||
### Raw Data Files
|
||||
|
||||
Raw data is saved in the `--report-path` directory (default `./reports/`):
|
||||
|
||||
- **JSON files** (`test_summary_*.json`): Contains the final summary for each test run.
|
||||
- **CSV files** (`memory_samples_*.csv`): Contains time-series memory samples taken during the test run.
|
||||
|
||||
Example of reading raw data:
|
||||
```python
|
||||
import json
|
||||
import pandas as pd
|
||||
|
||||
# Load test summary
|
||||
test_id = "20250418_103015" # Example ID
|
||||
with open(f'reports/test_summary_{test_id}.json', 'r') as f:
|
||||
results = json.load(f)
|
||||
|
||||
# Load memory samples
|
||||
memory_df = pd.read_csv(f'reports/memory_samples_{test_id}.csv')
|
||||
|
||||
# Analyze memory_df (e.g., calculate growth, plot)
|
||||
if not memory_df['memory_info_mb'].isnull().all():
|
||||
growth = memory_df['memory_info_mb'].iloc[-1] - memory_df['memory_info_mb'].iloc[0]
|
||||
print(f"Total Memory Growth: {growth:.1f} MB")
|
||||
else:
|
||||
print("No valid memory samples found.")
|
||||
|
||||
print(f"Avg URLs/sec: {results['urls_processed'] / results['total_time_seconds']:.2f}")
|
||||
```
|
||||
|
||||
## Visualization Dependencies
|
||||
|
||||
(This section remains the same)
|
||||
For full visualization capabilities in the HTML reports generated by `benchmark_report.py`, install additional dependencies...
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
benchmarking/ # Or your top-level directory name
|
||||
├── benchmark_reports/ # Generated HTML reports (by benchmark_report.py)
|
||||
├── reports/ # Raw test result data (from test_stress_sdk.py)
|
||||
├── test_site/ # Generated test content (temporary)
|
||||
├── benchmark_report.py# Report generator
|
||||
├── run_benchmark.py # Test runner with predefined configs
|
||||
├── test_stress_sdk.py # Main stress test implementation using arun_many
|
||||
└── run_all.sh # Simple wrapper script (may need updates)
|
||||
#└── requirements.txt # Optional: Visualization dependencies for benchmark_report.py
|
||||
```
|
||||
|
||||
## Cleanup
|
||||
|
||||
To clean up after testing:
|
||||
|
||||
```bash
|
||||
# Remove the test site content (if not using --keep-site)
|
||||
rm -rf test_site
|
||||
|
||||
# Remove all raw reports and generated benchmark reports
|
||||
rm -rf reports benchmark_reports
|
||||
|
||||
# Or use the --clean flag with run_benchmark.py
|
||||
python run_benchmark.py medium --clean
|
||||
```
|
||||
|
||||
## Use in CI/CD
|
||||
|
||||
(This section remains conceptually the same, just update script names)
|
||||
These tests can be integrated into CI/CD pipelines:
|
||||
```bash
|
||||
# Example CI script
|
||||
python run_benchmark.py medium --no-report # Run test without interactive report gen
|
||||
# Check exit code
|
||||
if [ $? -ne 0 ]; then echo "Stress test failed!"; exit 1; fi
|
||||
# Optionally, run report generator and check its output/metrics
|
||||
# python benchmark_report.py
|
||||
# check_report_metrics.py reports/test_summary_*.json || exit 1
|
||||
exit 0
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **HTTP Server Port Conflict**: Use `--port` with `run_benchmark.py` or `test_stress_sdk.py`.
|
||||
- **Memory Tracking Issues**: The `SimpleMemoryTracker` uses platform commands (`ps`, `/proc`, `tasklist`). Ensure these are available and the script has permission. If it consistently fails, memory reporting will be limited.
|
||||
- **Visualization Missing**: Related to `benchmark_report.py` and its dependencies.
|
||||
- **Site Generation Issues**: Check permissions for creating `./test_site/`. Use `--skip-generation` if you want to manage the site manually.
|
||||
- **Testing Against External Site**: Ensure the external site is running and use `--use-existing-site --port <correct_port>`.
|
||||
887
tests/memory/benchmark_report.py
Executable file
887
tests/memory/benchmark_report.py
Executable file
@@ -0,0 +1,887 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Benchmark reporting tool for Crawl4AI stress tests.
|
||||
Generates visual reports and comparisons between test runs.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import glob
|
||||
import argparse
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from rich.console import Console
|
||||
from rich.table import Table
|
||||
from rich.panel import Panel
|
||||
|
||||
# Initialize rich console
|
||||
console = Console()
|
||||
|
||||
# Try to import optional visualization dependencies
|
||||
VISUALIZATION_AVAILABLE = True
|
||||
try:
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
import matplotlib as mpl
|
||||
import numpy as np
|
||||
import seaborn as sns
|
||||
except ImportError:
|
||||
VISUALIZATION_AVAILABLE = False
|
||||
console.print("[yellow]Warning: Visualization dependencies not found. Install with:[/yellow]")
|
||||
console.print("[yellow]pip install pandas matplotlib seaborn[/yellow]")
|
||||
console.print("[yellow]Only text-based reports will be generated.[/yellow]")
|
||||
|
||||
# Configure plotting if available
|
||||
if VISUALIZATION_AVAILABLE:
|
||||
# Set plot style for dark theme
|
||||
plt.style.use('dark_background')
|
||||
sns.set_theme(style="darkgrid")
|
||||
|
||||
# Custom color palette based on Nord theme
|
||||
nord_palette = ["#88c0d0", "#81a1c1", "#a3be8c", "#ebcb8b", "#bf616a", "#b48ead", "#5e81ac"]
|
||||
sns.set_palette(nord_palette)
|
||||
|
||||
class BenchmarkReporter:
|
||||
"""Generates visual reports and comparisons for Crawl4AI stress tests."""
|
||||
|
||||
def __init__(self, reports_dir="reports", output_dir="benchmark_reports"):
|
||||
"""Initialize the benchmark reporter.
|
||||
|
||||
Args:
|
||||
reports_dir: Directory containing test result files
|
||||
output_dir: Directory to save generated reports
|
||||
"""
|
||||
self.reports_dir = Path(reports_dir)
|
||||
self.output_dir = Path(output_dir)
|
||||
self.output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Configure matplotlib if available
|
||||
if VISUALIZATION_AVAILABLE:
|
||||
# Ensure the matplotlib backend works in headless environments
|
||||
mpl.use('Agg')
|
||||
|
||||
# Set up styling for plots with dark theme
|
||||
mpl.rcParams['figure.figsize'] = (12, 8)
|
||||
mpl.rcParams['font.size'] = 12
|
||||
mpl.rcParams['axes.labelsize'] = 14
|
||||
mpl.rcParams['axes.titlesize'] = 16
|
||||
mpl.rcParams['xtick.labelsize'] = 12
|
||||
mpl.rcParams['ytick.labelsize'] = 12
|
||||
mpl.rcParams['legend.fontsize'] = 12
|
||||
mpl.rcParams['figure.facecolor'] = '#1e1e1e'
|
||||
mpl.rcParams['axes.facecolor'] = '#2e3440'
|
||||
mpl.rcParams['savefig.facecolor'] = '#1e1e1e'
|
||||
mpl.rcParams['text.color'] = '#e0e0e0'
|
||||
mpl.rcParams['axes.labelcolor'] = '#e0e0e0'
|
||||
mpl.rcParams['xtick.color'] = '#e0e0e0'
|
||||
mpl.rcParams['ytick.color'] = '#e0e0e0'
|
||||
mpl.rcParams['grid.color'] = '#444444'
|
||||
mpl.rcParams['figure.edgecolor'] = '#444444'
|
||||
|
||||
def load_test_results(self, limit=None):
|
||||
"""Load all test results from the reports directory.
|
||||
|
||||
Args:
|
||||
limit: Optional limit on number of most recent tests to load
|
||||
|
||||
Returns:
|
||||
Dictionary mapping test IDs to result data
|
||||
"""
|
||||
result_files = glob.glob(str(self.reports_dir / "test_results_*.json"))
|
||||
|
||||
# Sort files by modification time (newest first)
|
||||
result_files.sort(key=os.path.getmtime, reverse=True)
|
||||
|
||||
if limit:
|
||||
result_files = result_files[:limit]
|
||||
|
||||
results = {}
|
||||
for file_path in result_files:
|
||||
try:
|
||||
with open(file_path, 'r') as f:
|
||||
data = json.load(f)
|
||||
test_id = data.get('test_id')
|
||||
if test_id:
|
||||
results[test_id] = data
|
||||
|
||||
# Try to load the corresponding memory samples
|
||||
csv_path = self.reports_dir / f"memory_samples_{test_id}.csv"
|
||||
if csv_path.exists():
|
||||
try:
|
||||
memory_df = pd.read_csv(csv_path)
|
||||
results[test_id]['memory_samples'] = memory_df
|
||||
except Exception as e:
|
||||
console.print(f"[yellow]Warning: Could not load memory samples for {test_id}: {e}[/yellow]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error loading {file_path}: {e}[/red]")
|
||||
|
||||
console.print(f"Loaded {len(results)} test results")
|
||||
return results
|
||||
|
||||
def generate_summary_table(self, results):
|
||||
"""Generate a summary table of test results.
|
||||
|
||||
Args:
|
||||
results: Dictionary mapping test IDs to result data
|
||||
|
||||
Returns:
|
||||
Rich Table object
|
||||
"""
|
||||
table = Table(title="Crawl4AI Stress Test Summary", show_header=True)
|
||||
|
||||
# Define columns
|
||||
table.add_column("Test ID", style="cyan")
|
||||
table.add_column("Date", style="bright_green")
|
||||
table.add_column("URLs", justify="right")
|
||||
table.add_column("Workers", justify="right")
|
||||
table.add_column("Success %", justify="right")
|
||||
table.add_column("Time (s)", justify="right")
|
||||
table.add_column("Mem Growth", justify="right")
|
||||
table.add_column("URLs/sec", justify="right")
|
||||
|
||||
# Add rows
|
||||
for test_id, data in sorted(results.items(), key=lambda x: x[0], reverse=True):
|
||||
# Parse timestamp from test_id
|
||||
try:
|
||||
date_str = datetime.strptime(test_id, "%Y%m%d_%H%M%S").strftime("%Y-%m-%d %H:%M")
|
||||
except:
|
||||
date_str = "Unknown"
|
||||
|
||||
# Calculate success percentage
|
||||
total_urls = data.get('url_count', 0)
|
||||
successful = data.get('successful_urls', 0)
|
||||
success_pct = (successful / total_urls * 100) if total_urls > 0 else 0
|
||||
|
||||
# Calculate memory growth if available
|
||||
mem_growth = "N/A"
|
||||
if 'memory_samples' in data:
|
||||
samples = data['memory_samples']
|
||||
if len(samples) >= 2:
|
||||
# Try to extract numeric values from memory_info strings
|
||||
try:
|
||||
first_mem = float(samples.iloc[0]['memory_info'].split()[0])
|
||||
last_mem = float(samples.iloc[-1]['memory_info'].split()[0])
|
||||
mem_growth = f"{last_mem - first_mem:.1f} MB"
|
||||
except:
|
||||
pass
|
||||
|
||||
# Calculate URLs per second
|
||||
time_taken = data.get('total_time_seconds', 0)
|
||||
urls_per_sec = total_urls / time_taken if time_taken > 0 else 0
|
||||
|
||||
table.add_row(
|
||||
test_id,
|
||||
date_str,
|
||||
str(total_urls),
|
||||
str(data.get('workers', 'N/A')),
|
||||
f"{success_pct:.1f}%",
|
||||
f"{data.get('total_time_seconds', 0):.2f}",
|
||||
mem_growth,
|
||||
f"{urls_per_sec:.1f}"
|
||||
)
|
||||
|
||||
return table
|
||||
|
||||
def generate_performance_chart(self, results, output_file=None):
|
||||
"""Generate a performance comparison chart.
|
||||
|
||||
Args:
|
||||
results: Dictionary mapping test IDs to result data
|
||||
output_file: File path to save the chart
|
||||
|
||||
Returns:
|
||||
Path to the saved chart file or None if visualization is not available
|
||||
"""
|
||||
if not VISUALIZATION_AVAILABLE:
|
||||
console.print("[yellow]Skipping performance chart - visualization dependencies not available[/yellow]")
|
||||
return None
|
||||
|
||||
# Extract relevant data
|
||||
data = []
|
||||
for test_id, result in results.items():
|
||||
urls = result.get('url_count', 0)
|
||||
workers = result.get('workers', 0)
|
||||
time_taken = result.get('total_time_seconds', 0)
|
||||
urls_per_sec = urls / time_taken if time_taken > 0 else 0
|
||||
|
||||
# Parse timestamp from test_id for sorting
|
||||
try:
|
||||
timestamp = datetime.strptime(test_id, "%Y%m%d_%H%M%S")
|
||||
data.append({
|
||||
'test_id': test_id,
|
||||
'timestamp': timestamp,
|
||||
'urls': urls,
|
||||
'workers': workers,
|
||||
'time_seconds': time_taken,
|
||||
'urls_per_sec': urls_per_sec
|
||||
})
|
||||
except:
|
||||
console.print(f"[yellow]Warning: Could not parse timestamp from {test_id}[/yellow]")
|
||||
|
||||
if not data:
|
||||
console.print("[yellow]No valid data for performance chart[/yellow]")
|
||||
return None
|
||||
|
||||
# Convert to DataFrame and sort by timestamp
|
||||
df = pd.DataFrame(data)
|
||||
df = df.sort_values('timestamp')
|
||||
|
||||
# Create the plot
|
||||
fig, ax1 = plt.subplots(figsize=(12, 6))
|
||||
|
||||
# Plot URLs per second as bars with properly set x-axis
|
||||
x_pos = range(len(df['test_id']))
|
||||
bars = ax1.bar(x_pos, df['urls_per_sec'], color='#88c0d0', alpha=0.8)
|
||||
ax1.set_ylabel('URLs per Second', color='#88c0d0')
|
||||
ax1.tick_params(axis='y', labelcolor='#88c0d0')
|
||||
|
||||
# Properly set x-axis labels
|
||||
ax1.set_xticks(x_pos)
|
||||
ax1.set_xticklabels(df['test_id'].tolist(), rotation=45, ha='right')
|
||||
|
||||
# Add worker count as text on each bar
|
||||
for i, bar in enumerate(bars):
|
||||
height = bar.get_height()
|
||||
workers = df.iloc[i]['workers']
|
||||
ax1.text(i, height + 0.1,
|
||||
f'W: {workers}', ha='center', va='bottom', fontsize=9, color='#e0e0e0')
|
||||
|
||||
# Add a second y-axis for total URLs
|
||||
ax2 = ax1.twinx()
|
||||
ax2.plot(x_pos, df['urls'], '-', color='#bf616a', alpha=0.8, markersize=6, marker='o')
|
||||
ax2.set_ylabel('Total URLs', color='#bf616a')
|
||||
ax2.tick_params(axis='y', labelcolor='#bf616a')
|
||||
|
||||
# Set title and layout
|
||||
plt.title('Crawl4AI Performance Benchmarks')
|
||||
plt.tight_layout()
|
||||
|
||||
# Save the figure
|
||||
if output_file is None:
|
||||
output_file = self.output_dir / "performance_comparison.png"
|
||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
||||
plt.close()
|
||||
|
||||
return output_file
|
||||
|
||||
def generate_memory_charts(self, results, output_prefix=None):
|
||||
"""Generate memory usage charts for each test.
|
||||
|
||||
Args:
|
||||
results: Dictionary mapping test IDs to result data
|
||||
output_prefix: Prefix for output file names
|
||||
|
||||
Returns:
|
||||
List of paths to the saved chart files
|
||||
"""
|
||||
if not VISUALIZATION_AVAILABLE:
|
||||
console.print("[yellow]Skipping memory charts - visualization dependencies not available[/yellow]")
|
||||
return []
|
||||
|
||||
output_files = []
|
||||
|
||||
for test_id, result in results.items():
|
||||
if 'memory_samples' not in result:
|
||||
continue
|
||||
|
||||
memory_df = result['memory_samples']
|
||||
|
||||
# Check if we have enough data points
|
||||
if len(memory_df) < 2:
|
||||
continue
|
||||
|
||||
# Try to extract numeric values from memory_info strings
|
||||
try:
|
||||
memory_values = []
|
||||
for mem_str in memory_df['memory_info']:
|
||||
# Extract the number from strings like "142.8 MB"
|
||||
value = float(mem_str.split()[0])
|
||||
memory_values.append(value)
|
||||
|
||||
memory_df['memory_mb'] = memory_values
|
||||
except Exception as e:
|
||||
console.print(f"[yellow]Could not parse memory values for {test_id}: {e}[/yellow]")
|
||||
continue
|
||||
|
||||
# Create the plot
|
||||
plt.figure(figsize=(10, 6))
|
||||
|
||||
# Plot memory usage over time
|
||||
plt.plot(memory_df['elapsed_seconds'], memory_df['memory_mb'],
|
||||
color='#88c0d0', marker='o', linewidth=2, markersize=4)
|
||||
|
||||
# Add annotations for chunk processing
|
||||
chunk_size = result.get('chunk_size', 0)
|
||||
url_count = result.get('url_count', 0)
|
||||
if chunk_size > 0 and url_count > 0:
|
||||
# Estimate chunk processing times
|
||||
num_chunks = (url_count + chunk_size - 1) // chunk_size # Ceiling division
|
||||
total_time = result.get('total_time_seconds', memory_df['elapsed_seconds'].max())
|
||||
chunk_times = np.linspace(0, total_time, num_chunks + 1)[1:]
|
||||
|
||||
for i, time_point in enumerate(chunk_times):
|
||||
if time_point <= memory_df['elapsed_seconds'].max():
|
||||
plt.axvline(x=time_point, color='#4c566a', linestyle='--', alpha=0.6)
|
||||
plt.text(time_point, memory_df['memory_mb'].min(), f'Chunk {i+1}',
|
||||
rotation=90, verticalalignment='bottom', fontsize=8, color='#e0e0e0')
|
||||
|
||||
# Set labels and title
|
||||
plt.xlabel('Elapsed Time (seconds)', color='#e0e0e0')
|
||||
plt.ylabel('Memory Usage (MB)', color='#e0e0e0')
|
||||
plt.title(f'Memory Usage During Test {test_id}\n({url_count} URLs, {result.get("workers", "?")} Workers)',
|
||||
color='#e0e0e0')
|
||||
|
||||
# Add grid and set y-axis to start from zero
|
||||
plt.grid(True, alpha=0.3, color='#4c566a')
|
||||
|
||||
# Add test metadata as text
|
||||
info_text = (
|
||||
f"URLs: {url_count}\n"
|
||||
f"Workers: {result.get('workers', 'N/A')}\n"
|
||||
f"Chunk Size: {result.get('chunk_size', 'N/A')}\n"
|
||||
f"Total Time: {result.get('total_time_seconds', 0):.2f}s\n"
|
||||
)
|
||||
|
||||
# Calculate memory growth
|
||||
if len(memory_df) >= 2:
|
||||
first_mem = memory_df.iloc[0]['memory_mb']
|
||||
last_mem = memory_df.iloc[-1]['memory_mb']
|
||||
growth = last_mem - first_mem
|
||||
growth_rate = growth / result.get('total_time_seconds', 1)
|
||||
|
||||
info_text += f"Memory Growth: {growth:.1f} MB\n"
|
||||
info_text += f"Growth Rate: {growth_rate:.2f} MB/s"
|
||||
|
||||
plt.figtext(0.02, 0.02, info_text, fontsize=9, color='#e0e0e0',
|
||||
bbox=dict(facecolor='#3b4252', alpha=0.8, edgecolor='#4c566a'))
|
||||
|
||||
# Save the figure
|
||||
if output_prefix is None:
|
||||
output_file = self.output_dir / f"memory_chart_{test_id}.png"
|
||||
else:
|
||||
output_file = Path(f"{output_prefix}_memory_{test_id}.png")
|
||||
|
||||
plt.tight_layout()
|
||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
||||
plt.close()
|
||||
|
||||
output_files.append(output_file)
|
||||
|
||||
return output_files
|
||||
|
||||
def generate_comparison_report(self, results, title=None, output_file=None):
|
||||
"""Generate a comprehensive comparison report of multiple test runs.
|
||||
|
||||
Args:
|
||||
results: Dictionary mapping test IDs to result data
|
||||
title: Optional title for the report
|
||||
output_file: File path to save the report
|
||||
|
||||
Returns:
|
||||
Path to the saved report file
|
||||
"""
|
||||
if not results:
|
||||
console.print("[yellow]No results to generate comparison report[/yellow]")
|
||||
return None
|
||||
|
||||
if output_file is None:
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
output_file = self.output_dir / f"comparison_report_{timestamp}.html"
|
||||
|
||||
# Create data for the report
|
||||
rows = []
|
||||
for test_id, data in results.items():
|
||||
# Calculate metrics
|
||||
urls = data.get('url_count', 0)
|
||||
workers = data.get('workers', 0)
|
||||
successful = data.get('successful_urls', 0)
|
||||
failed = data.get('failed_urls', 0)
|
||||
time_seconds = data.get('total_time_seconds', 0)
|
||||
|
||||
# Calculate additional metrics
|
||||
success_rate = (successful / urls) * 100 if urls > 0 else 0
|
||||
urls_per_second = urls / time_seconds if time_seconds > 0 else 0
|
||||
urls_per_worker = urls / workers if workers > 0 else 0
|
||||
|
||||
# Calculate memory growth if available
|
||||
mem_start = None
|
||||
mem_end = None
|
||||
mem_growth = None
|
||||
if 'memory_samples' in data:
|
||||
samples = data['memory_samples']
|
||||
if len(samples) >= 2:
|
||||
try:
|
||||
first_mem = float(samples.iloc[0]['memory_info'].split()[0])
|
||||
last_mem = float(samples.iloc[-1]['memory_info'].split()[0])
|
||||
mem_start = first_mem
|
||||
mem_end = last_mem
|
||||
mem_growth = last_mem - first_mem
|
||||
except:
|
||||
pass
|
||||
|
||||
# Parse timestamp from test_id
|
||||
try:
|
||||
timestamp = datetime.strptime(test_id, "%Y%m%d_%H%M%S")
|
||||
except:
|
||||
timestamp = None
|
||||
|
||||
rows.append({
|
||||
'test_id': test_id,
|
||||
'timestamp': timestamp,
|
||||
'date': timestamp.strftime("%Y-%m-%d %H:%M:%S") if timestamp else "Unknown",
|
||||
'urls': urls,
|
||||
'workers': workers,
|
||||
'chunk_size': data.get('chunk_size', 0),
|
||||
'successful': successful,
|
||||
'failed': failed,
|
||||
'success_rate': success_rate,
|
||||
'time_seconds': time_seconds,
|
||||
'urls_per_second': urls_per_second,
|
||||
'urls_per_worker': urls_per_worker,
|
||||
'memory_start': mem_start,
|
||||
'memory_end': mem_end,
|
||||
'memory_growth': mem_growth
|
||||
})
|
||||
|
||||
# Sort data by timestamp if possible
|
||||
if VISUALIZATION_AVAILABLE:
|
||||
# Convert to DataFrame and sort by timestamp
|
||||
df = pd.DataFrame(rows)
|
||||
if 'timestamp' in df.columns and not df['timestamp'].isna().all():
|
||||
df = df.sort_values('timestamp', ascending=False)
|
||||
else:
|
||||
# Simple sorting without pandas
|
||||
rows.sort(key=lambda x: x.get('timestamp', datetime.now()), reverse=True)
|
||||
df = None
|
||||
|
||||
# Generate HTML report
|
||||
html = []
|
||||
html.append('<!DOCTYPE html>')
|
||||
html.append('<html lang="en">')
|
||||
html.append('<head>')
|
||||
html.append('<meta charset="UTF-8">')
|
||||
html.append('<meta name="viewport" content="width=device-width, initial-scale=1.0">')
|
||||
html.append(f'<title>{title or "Crawl4AI Benchmark Comparison"}</title>')
|
||||
html.append('<style>')
|
||||
html.append('''
|
||||
body {
|
||||
font-family: Arial, sans-serif;
|
||||
line-height: 1.6;
|
||||
margin: 0;
|
||||
padding: 20px;
|
||||
max-width: 1200px;
|
||||
margin: 0 auto;
|
||||
color: #e0e0e0;
|
||||
background-color: #1e1e1e;
|
||||
}
|
||||
h1, h2, h3 {
|
||||
color: #81a1c1;
|
||||
}
|
||||
table {
|
||||
border-collapse: collapse;
|
||||
width: 100%;
|
||||
margin-bottom: 20px;
|
||||
}
|
||||
th, td {
|
||||
text-align: left;
|
||||
padding: 12px;
|
||||
border-bottom: 1px solid #444;
|
||||
}
|
||||
th {
|
||||
background-color: #2e3440;
|
||||
font-weight: bold;
|
||||
}
|
||||
tr:hover {
|
||||
background-color: #2e3440;
|
||||
}
|
||||
a {
|
||||
color: #88c0d0;
|
||||
text-decoration: none;
|
||||
}
|
||||
a:hover {
|
||||
text-decoration: underline;
|
||||
}
|
||||
.chart-container {
|
||||
margin: 30px 0;
|
||||
text-align: center;
|
||||
background-color: #2e3440;
|
||||
padding: 20px;
|
||||
border-radius: 8px;
|
||||
}
|
||||
.chart-container img {
|
||||
max-width: 100%;
|
||||
height: auto;
|
||||
border: 1px solid #444;
|
||||
box-shadow: 0 0 10px rgba(0,0,0,0.3);
|
||||
}
|
||||
.card {
|
||||
border: 1px solid #444;
|
||||
border-radius: 8px;
|
||||
padding: 15px;
|
||||
margin-bottom: 20px;
|
||||
background-color: #2e3440;
|
||||
box-shadow: 0 0 10px rgba(0,0,0,0.2);
|
||||
}
|
||||
.highlight {
|
||||
background-color: #3b4252;
|
||||
font-weight: bold;
|
||||
}
|
||||
.status-good {
|
||||
color: #a3be8c;
|
||||
}
|
||||
.status-warning {
|
||||
color: #ebcb8b;
|
||||
}
|
||||
.status-bad {
|
||||
color: #bf616a;
|
||||
}
|
||||
''')
|
||||
html.append('</style>')
|
||||
html.append('</head>')
|
||||
html.append('<body>')
|
||||
|
||||
# Header
|
||||
html.append(f'<h1>{title or "Crawl4AI Benchmark Comparison"}</h1>')
|
||||
html.append(f'<p>Report generated on {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}</p>')
|
||||
|
||||
# Summary section
|
||||
html.append('<div class="card">')
|
||||
html.append('<h2>Summary</h2>')
|
||||
html.append('<p>This report compares the performance of Crawl4AI across multiple test runs.</p>')
|
||||
|
||||
# Summary metrics
|
||||
data_available = (VISUALIZATION_AVAILABLE and df is not None and not df.empty) or (not VISUALIZATION_AVAILABLE and len(rows) > 0)
|
||||
if data_available:
|
||||
# Get the latest test data
|
||||
if VISUALIZATION_AVAILABLE and df is not None and not df.empty:
|
||||
latest_test = df.iloc[0]
|
||||
latest_id = latest_test['test_id']
|
||||
else:
|
||||
latest_test = rows[0] # First row (already sorted by timestamp)
|
||||
latest_id = latest_test['test_id']
|
||||
|
||||
html.append('<h3>Latest Test Results</h3>')
|
||||
html.append('<ul>')
|
||||
html.append(f'<li><strong>Test ID:</strong> {latest_id}</li>')
|
||||
html.append(f'<li><strong>Date:</strong> {latest_test["date"]}</li>')
|
||||
html.append(f'<li><strong>URLs:</strong> {latest_test["urls"]}</li>')
|
||||
html.append(f'<li><strong>Workers:</strong> {latest_test["workers"]}</li>')
|
||||
html.append(f'<li><strong>Success Rate:</strong> {latest_test["success_rate"]:.1f}%</li>')
|
||||
html.append(f'<li><strong>Time:</strong> {latest_test["time_seconds"]:.2f} seconds</li>')
|
||||
html.append(f'<li><strong>Performance:</strong> {latest_test["urls_per_second"]:.1f} URLs/second</li>')
|
||||
|
||||
# Check memory growth (handle both pandas and dict mode)
|
||||
memory_growth_available = False
|
||||
if VISUALIZATION_AVAILABLE and df is not None:
|
||||
if pd.notna(latest_test["memory_growth"]):
|
||||
html.append(f'<li><strong>Memory Growth:</strong> {latest_test["memory_growth"]:.1f} MB</li>')
|
||||
memory_growth_available = True
|
||||
else:
|
||||
if latest_test["memory_growth"] is not None:
|
||||
html.append(f'<li><strong>Memory Growth:</strong> {latest_test["memory_growth"]:.1f} MB</li>')
|
||||
memory_growth_available = True
|
||||
|
||||
html.append('</ul>')
|
||||
|
||||
# If we have more than one test, show trend
|
||||
if (VISUALIZATION_AVAILABLE and df is not None and len(df) > 1) or (not VISUALIZATION_AVAILABLE and len(rows) > 1):
|
||||
if VISUALIZATION_AVAILABLE and df is not None:
|
||||
prev_test = df.iloc[1]
|
||||
else:
|
||||
prev_test = rows[1]
|
||||
|
||||
# Calculate performance change
|
||||
perf_change = ((latest_test["urls_per_second"] / prev_test["urls_per_second"]) - 1) * 100 if prev_test["urls_per_second"] > 0 else 0
|
||||
|
||||
status_class = ""
|
||||
if perf_change > 5:
|
||||
status_class = "status-good"
|
||||
elif perf_change < -5:
|
||||
status_class = "status-bad"
|
||||
|
||||
html.append('<h3>Performance Trend</h3>')
|
||||
html.append('<ul>')
|
||||
html.append(f'<li><strong>Performance Change:</strong> <span class="{status_class}">{perf_change:+.1f}%</span> compared to previous test</li>')
|
||||
|
||||
# Memory trend if available
|
||||
memory_trend_available = False
|
||||
if VISUALIZATION_AVAILABLE and df is not None:
|
||||
if pd.notna(latest_test["memory_growth"]) and pd.notna(prev_test["memory_growth"]):
|
||||
mem_change = latest_test["memory_growth"] - prev_test["memory_growth"]
|
||||
memory_trend_available = True
|
||||
else:
|
||||
if latest_test["memory_growth"] is not None and prev_test["memory_growth"] is not None:
|
||||
mem_change = latest_test["memory_growth"] - prev_test["memory_growth"]
|
||||
memory_trend_available = True
|
||||
|
||||
if memory_trend_available:
|
||||
mem_status = ""
|
||||
if mem_change < -1: # Improved (less growth)
|
||||
mem_status = "status-good"
|
||||
elif mem_change > 1: # Worse (more growth)
|
||||
mem_status = "status-bad"
|
||||
|
||||
html.append(f'<li><strong>Memory Trend:</strong> <span class="{mem_status}">{mem_change:+.1f} MB</span> change in memory growth</li>')
|
||||
|
||||
html.append('</ul>')
|
||||
|
||||
html.append('</div>')
|
||||
|
||||
# Generate performance chart if visualization is available
|
||||
if VISUALIZATION_AVAILABLE:
|
||||
perf_chart = self.generate_performance_chart(results)
|
||||
if perf_chart:
|
||||
html.append('<div class="chart-container">')
|
||||
html.append('<h2>Performance Comparison</h2>')
|
||||
html.append(f'<img src="{os.path.relpath(perf_chart, os.path.dirname(output_file))}" alt="Performance Comparison Chart">')
|
||||
html.append('</div>')
|
||||
else:
|
||||
html.append('<div class="chart-container">')
|
||||
html.append('<h2>Performance Comparison</h2>')
|
||||
html.append('<p>Charts not available - install visualization dependencies (pandas, matplotlib, seaborn) to enable.</p>')
|
||||
html.append('</div>')
|
||||
|
||||
# Generate memory charts if visualization is available
|
||||
if VISUALIZATION_AVAILABLE:
|
||||
memory_charts = self.generate_memory_charts(results)
|
||||
if memory_charts:
|
||||
html.append('<div class="chart-container">')
|
||||
html.append('<h2>Memory Usage</h2>')
|
||||
|
||||
for chart in memory_charts:
|
||||
test_id = chart.stem.split('_')[-1]
|
||||
html.append(f'<h3>Test {test_id}</h3>')
|
||||
html.append(f'<img src="{os.path.relpath(chart, os.path.dirname(output_file))}" alt="Memory Chart for {test_id}">')
|
||||
|
||||
html.append('</div>')
|
||||
else:
|
||||
html.append('<div class="chart-container">')
|
||||
html.append('<h2>Memory Usage</h2>')
|
||||
html.append('<p>Charts not available - install visualization dependencies (pandas, matplotlib, seaborn) to enable.</p>')
|
||||
html.append('</div>')
|
||||
|
||||
# Detailed results table
|
||||
html.append('<h2>Detailed Results</h2>')
|
||||
|
||||
# Add the results as an HTML table
|
||||
html.append('<table>')
|
||||
|
||||
# Table headers
|
||||
html.append('<tr>')
|
||||
for col in ['Test ID', 'Date', 'URLs', 'Workers', 'Success %', 'Time (s)', 'URLs/sec', 'Mem Growth (MB)']:
|
||||
html.append(f'<th>{col}</th>')
|
||||
html.append('</tr>')
|
||||
|
||||
# Table rows - handle both pandas DataFrame and list of dicts
|
||||
if VISUALIZATION_AVAILABLE and df is not None:
|
||||
# Using pandas DataFrame
|
||||
for _, row in df.iterrows():
|
||||
html.append('<tr>')
|
||||
html.append(f'<td>{row["test_id"]}</td>')
|
||||
html.append(f'<td>{row["date"]}</td>')
|
||||
html.append(f'<td>{row["urls"]}</td>')
|
||||
html.append(f'<td>{row["workers"]}</td>')
|
||||
html.append(f'<td>{row["success_rate"]:.1f}%</td>')
|
||||
html.append(f'<td>{row["time_seconds"]:.2f}</td>')
|
||||
html.append(f'<td>{row["urls_per_second"]:.1f}</td>')
|
||||
|
||||
# Memory growth cell
|
||||
if pd.notna(row["memory_growth"]):
|
||||
html.append(f'<td>{row["memory_growth"]:.1f}</td>')
|
||||
else:
|
||||
html.append('<td>N/A</td>')
|
||||
|
||||
html.append('</tr>')
|
||||
else:
|
||||
# Using list of dicts (when pandas is not available)
|
||||
for row in rows:
|
||||
html.append('<tr>')
|
||||
html.append(f'<td>{row["test_id"]}</td>')
|
||||
html.append(f'<td>{row["date"]}</td>')
|
||||
html.append(f'<td>{row["urls"]}</td>')
|
||||
html.append(f'<td>{row["workers"]}</td>')
|
||||
html.append(f'<td>{row["success_rate"]:.1f}%</td>')
|
||||
html.append(f'<td>{row["time_seconds"]:.2f}</td>')
|
||||
html.append(f'<td>{row["urls_per_second"]:.1f}</td>')
|
||||
|
||||
# Memory growth cell
|
||||
if row["memory_growth"] is not None:
|
||||
html.append(f'<td>{row["memory_growth"]:.1f}</td>')
|
||||
else:
|
||||
html.append('<td>N/A</td>')
|
||||
|
||||
html.append('</tr>')
|
||||
|
||||
html.append('</table>')
|
||||
|
||||
# Conclusion section
|
||||
html.append('<div class="card">')
|
||||
html.append('<h2>Conclusion</h2>')
|
||||
|
||||
if VISUALIZATION_AVAILABLE and df is not None and not df.empty:
|
||||
# Using pandas for statistics (when available)
|
||||
# Calculate some overall statistics
|
||||
avg_urls_per_sec = df['urls_per_second'].mean()
|
||||
max_urls_per_sec = df['urls_per_second'].max()
|
||||
|
||||
# Determine if we have a trend
|
||||
if len(df) > 1:
|
||||
trend_data = df.sort_values('timestamp')
|
||||
first_perf = trend_data.iloc[0]['urls_per_second']
|
||||
last_perf = trend_data.iloc[-1]['urls_per_second']
|
||||
|
||||
perf_change = ((last_perf / first_perf) - 1) * 100 if first_perf > 0 else 0
|
||||
|
||||
if perf_change > 10:
|
||||
trend_desc = "significantly improved"
|
||||
trend_class = "status-good"
|
||||
elif perf_change > 5:
|
||||
trend_desc = "improved"
|
||||
trend_class = "status-good"
|
||||
elif perf_change < -10:
|
||||
trend_desc = "significantly decreased"
|
||||
trend_class = "status-bad"
|
||||
elif perf_change < -5:
|
||||
trend_desc = "decreased"
|
||||
trend_class = "status-bad"
|
||||
else:
|
||||
trend_desc = "remained stable"
|
||||
trend_class = ""
|
||||
|
||||
html.append(f'<p>Overall performance has <span class="{trend_class}">{trend_desc}</span> over the test period.</p>')
|
||||
|
||||
html.append(f'<p>Average throughput: <strong>{avg_urls_per_sec:.1f}</strong> URLs/second</p>')
|
||||
html.append(f'<p>Maximum throughput: <strong>{max_urls_per_sec:.1f}</strong> URLs/second</p>')
|
||||
|
||||
# Memory leak assessment
|
||||
if 'memory_growth' in df.columns and not df['memory_growth'].isna().all():
|
||||
avg_growth = df['memory_growth'].mean()
|
||||
max_growth = df['memory_growth'].max()
|
||||
|
||||
if avg_growth < 5:
|
||||
leak_assessment = "No significant memory leaks detected"
|
||||
leak_class = "status-good"
|
||||
elif avg_growth < 10:
|
||||
leak_assessment = "Minor memory growth observed"
|
||||
leak_class = "status-warning"
|
||||
else:
|
||||
leak_assessment = "Potential memory leak detected"
|
||||
leak_class = "status-bad"
|
||||
|
||||
html.append(f'<p><span class="{leak_class}">{leak_assessment}</span>. Average memory growth: <strong>{avg_growth:.1f} MB</strong> per test.</p>')
|
||||
else:
|
||||
# Manual calculations without pandas
|
||||
if rows:
|
||||
# Calculate average and max throughput
|
||||
total_urls_per_sec = sum(row['urls_per_second'] for row in rows)
|
||||
avg_urls_per_sec = total_urls_per_sec / len(rows)
|
||||
max_urls_per_sec = max(row['urls_per_second'] for row in rows)
|
||||
|
||||
html.append(f'<p>Average throughput: <strong>{avg_urls_per_sec:.1f}</strong> URLs/second</p>')
|
||||
html.append(f'<p>Maximum throughput: <strong>{max_urls_per_sec:.1f}</strong> URLs/second</p>')
|
||||
|
||||
# Memory assessment (simplified without pandas)
|
||||
growth_values = [row['memory_growth'] for row in rows if row['memory_growth'] is not None]
|
||||
if growth_values:
|
||||
avg_growth = sum(growth_values) / len(growth_values)
|
||||
|
||||
if avg_growth < 5:
|
||||
leak_assessment = "No significant memory leaks detected"
|
||||
leak_class = "status-good"
|
||||
elif avg_growth < 10:
|
||||
leak_assessment = "Minor memory growth observed"
|
||||
leak_class = "status-warning"
|
||||
else:
|
||||
leak_assessment = "Potential memory leak detected"
|
||||
leak_class = "status-bad"
|
||||
|
||||
html.append(f'<p><span class="{leak_class}">{leak_assessment}</span>. Average memory growth: <strong>{avg_growth:.1f} MB</strong> per test.</p>')
|
||||
else:
|
||||
html.append('<p>No test data available for analysis.</p>')
|
||||
|
||||
html.append('</div>')
|
||||
|
||||
# Footer
|
||||
html.append('<div style="margin-top: 30px; text-align: center; color: #777; font-size: 0.9em;">')
|
||||
html.append('<p>Generated by Crawl4AI Benchmark Reporter</p>')
|
||||
html.append('</div>')
|
||||
|
||||
html.append('</body>')
|
||||
html.append('</html>')
|
||||
|
||||
# Write the HTML file
|
||||
with open(output_file, 'w') as f:
|
||||
f.write('\n'.join(html))
|
||||
|
||||
# Print a clickable link for terminals that support it (iTerm, VS Code, etc.)
|
||||
file_url = f"file://{os.path.abspath(output_file)}"
|
||||
console.print(f"[green]Comparison report saved to: {output_file}[/green]")
|
||||
console.print(f"[blue underline]Click to open report: {file_url}[/blue underline]")
|
||||
return output_file
|
||||
|
||||
def run(self, limit=None, output_file=None):
|
||||
"""Generate a full benchmark report.
|
||||
|
||||
Args:
|
||||
limit: Optional limit on number of most recent tests to include
|
||||
output_file: Optional output file path
|
||||
|
||||
Returns:
|
||||
Path to the generated report file
|
||||
"""
|
||||
# Load test results
|
||||
results = self.load_test_results(limit=limit)
|
||||
|
||||
if not results:
|
||||
console.print("[yellow]No test results found. Run some tests first.[/yellow]")
|
||||
return None
|
||||
|
||||
# Generate and display summary table
|
||||
summary_table = self.generate_summary_table(results)
|
||||
console.print(summary_table)
|
||||
|
||||
# Generate comparison report
|
||||
title = f"Crawl4AI Benchmark Report ({len(results)} test runs)"
|
||||
report_file = self.generate_comparison_report(results, title=title, output_file=output_file)
|
||||
|
||||
if report_file:
|
||||
console.print(f"[bold green]Report generated successfully: {report_file}[/bold green]")
|
||||
return report_file
|
||||
else:
|
||||
console.print("[bold red]Failed to generate report[/bold red]")
|
||||
return None
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point for the benchmark reporter."""
|
||||
parser = argparse.ArgumentParser(description="Generate benchmark reports for Crawl4AI stress tests")
|
||||
|
||||
parser.add_argument("--reports-dir", type=str, default="reports",
|
||||
help="Directory containing test result files")
|
||||
parser.add_argument("--output-dir", type=str, default="benchmark_reports",
|
||||
help="Directory to save generated reports")
|
||||
parser.add_argument("--limit", type=int, default=None,
|
||||
help="Limit to most recent N test results")
|
||||
parser.add_argument("--output-file", type=str, default=None,
|
||||
help="Custom output file path for the report")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Create the benchmark reporter
|
||||
reporter = BenchmarkReporter(reports_dir=args.reports_dir, output_dir=args.output_dir)
|
||||
|
||||
# Generate the report
|
||||
report_file = reporter.run(limit=args.limit, output_file=args.output_file)
|
||||
|
||||
if report_file:
|
||||
print(f"Report generated at: {report_file}")
|
||||
return 0
|
||||
else:
|
||||
print("Failed to generate report")
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
sys.exit(main())
|
||||
34
tests/memory/cap_test.py
Normal file
34
tests/memory/cap_test.py
Normal file
@@ -0,0 +1,34 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Hammer /crawl with many concurrent requests to prove GLOBAL_SEM works.
|
||||
"""
|
||||
|
||||
import asyncio, httpx, json, uuid, argparse
|
||||
|
||||
API = "http://localhost:8020/crawl"
|
||||
URLS_PER_CALL = 1 # keep it minimal so each arun() == 1 page
|
||||
CONCURRENT_CALLS = 20 # way above your cap
|
||||
|
||||
payload_template = {
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"cache_mode": "BYPASS", "verbose": False},
|
||||
}
|
||||
}
|
||||
|
||||
async def one_call(client):
|
||||
payload = payload_template.copy()
|
||||
payload["urls"] = [f"https://httpbin.org/anything/{uuid.uuid4()}"]
|
||||
r = await client.post(API, json=payload)
|
||||
r.raise_for_status()
|
||||
return r.json()["server_peak_memory_mb"]
|
||||
|
||||
async def main():
|
||||
async with httpx.AsyncClient(timeout=60) as client:
|
||||
tasks = [asyncio.create_task(one_call(client)) for _ in range(CONCURRENT_CALLS)]
|
||||
mem_usages = await asyncio.gather(*tasks)
|
||||
print("Calls finished OK, server peaks reported:", mem_usages)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
4
tests/memory/requirements.txt
Normal file
4
tests/memory/requirements.txt
Normal file
@@ -0,0 +1,4 @@
|
||||
pandas>=1.5.0
|
||||
matplotlib>=3.5.0
|
||||
seaborn>=0.12.0
|
||||
rich>=12.0.0
|
||||
259
tests/memory/run_benchmark.py
Executable file
259
tests/memory/run_benchmark.py
Executable file
@@ -0,0 +1,259 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Run a complete Crawl4AI benchmark test using test_stress_sdk.py and generate a report.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import glob
|
||||
import argparse
|
||||
import subprocess
|
||||
import time
|
||||
from datetime import datetime
|
||||
|
||||
from rich.console import Console
|
||||
from rich.text import Text
|
||||
|
||||
console = Console()
|
||||
|
||||
# Updated TEST_CONFIGS to use max_sessions
|
||||
TEST_CONFIGS = {
|
||||
"quick": {"urls": 50, "max_sessions": 4, "chunk_size": 10, "description": "Quick test (50 URLs, 4 sessions)"},
|
||||
"small": {"urls": 100, "max_sessions": 8, "chunk_size": 20, "description": "Small test (100 URLs, 8 sessions)"},
|
||||
"medium": {"urls": 500, "max_sessions": 16, "chunk_size": 50, "description": "Medium test (500 URLs, 16 sessions)"},
|
||||
"large": {"urls": 1000, "max_sessions": 32, "chunk_size": 100,"description": "Large test (1000 URLs, 32 sessions)"},
|
||||
"extreme": {"urls": 2000, "max_sessions": 64, "chunk_size": 200,"description": "Extreme test (2000 URLs, 64 sessions)"},
|
||||
}
|
||||
|
||||
# Arguments to forward directly if present in custom_args
|
||||
FORWARD_ARGS = {
|
||||
"urls": "--urls",
|
||||
"max_sessions": "--max-sessions",
|
||||
"chunk_size": "--chunk-size",
|
||||
"port": "--port",
|
||||
"monitor_mode": "--monitor-mode",
|
||||
}
|
||||
# Boolean flags to forward if True
|
||||
FORWARD_FLAGS = {
|
||||
"stream": "--stream",
|
||||
"use_rate_limiter": "--use-rate-limiter",
|
||||
"keep_server_alive": "--keep-server-alive",
|
||||
"use_existing_site": "--use-existing-site",
|
||||
"skip_generation": "--skip-generation",
|
||||
"keep_site": "--keep-site",
|
||||
"clean_reports": "--clean-reports", # Note: clean behavior is handled here, but pass flag if needed
|
||||
"clean_site": "--clean-site", # Note: clean behavior is handled here, but pass flag if needed
|
||||
}
|
||||
|
||||
def run_benchmark(config_name, custom_args=None, compare=True, clean=False):
|
||||
"""Runs the stress test and optionally the report generator."""
|
||||
if config_name not in TEST_CONFIGS and config_name != "custom":
|
||||
console.print(f"[bold red]Unknown configuration: {config_name}[/bold red]")
|
||||
return False
|
||||
|
||||
# Print header
|
||||
title = "Crawl4AI SDK Benchmark Test"
|
||||
if config_name != "custom":
|
||||
title += f" - {TEST_CONFIGS[config_name]['description']}"
|
||||
else:
|
||||
# Safely get custom args for title
|
||||
urls = custom_args.get('urls', '?') if custom_args else '?'
|
||||
sessions = custom_args.get('max_sessions', '?') if custom_args else '?'
|
||||
title += f" - Custom ({urls} URLs, {sessions} sessions)"
|
||||
|
||||
console.print(f"\n[bold blue]{title}[/bold blue]")
|
||||
console.print("=" * (len(title) + 4)) # Adjust underline length
|
||||
|
||||
console.print("\n[bold white]Preparing test...[/bold white]")
|
||||
|
||||
# --- Command Construction ---
|
||||
# Use the new script name
|
||||
cmd = ["python", "test_stress_sdk.py"]
|
||||
|
||||
# Apply config or custom args
|
||||
args_to_use = {}
|
||||
if config_name != "custom":
|
||||
args_to_use = TEST_CONFIGS[config_name].copy()
|
||||
# If custom args are provided (e.g., boolean flags), overlay them
|
||||
if custom_args:
|
||||
args_to_use.update(custom_args)
|
||||
elif custom_args: # Custom config
|
||||
args_to_use = custom_args.copy()
|
||||
|
||||
# Add arguments with values
|
||||
for key, arg_name in FORWARD_ARGS.items():
|
||||
if key in args_to_use:
|
||||
cmd.extend([arg_name, str(args_to_use[key])])
|
||||
|
||||
# Add boolean flags
|
||||
for key, flag_name in FORWARD_FLAGS.items():
|
||||
if args_to_use.get(key, False): # Check if key exists and is True
|
||||
# Special handling for clean flags - apply locally, don't forward?
|
||||
# Decide if test_stress_sdk.py also needs --clean flags or if run_benchmark handles it.
|
||||
# For now, let's assume run_benchmark handles cleaning based on its own --clean flag.
|
||||
# We'll forward other flags.
|
||||
if key not in ["clean_reports", "clean_site"]:
|
||||
cmd.append(flag_name)
|
||||
|
||||
# Handle the top-level --clean flag for run_benchmark
|
||||
if clean:
|
||||
# Pass clean flags to the stress test script as well, if needed
|
||||
# This assumes test_stress_sdk.py also uses --clean-reports and --clean-site
|
||||
cmd.append("--clean-reports")
|
||||
cmd.append("--clean-site")
|
||||
console.print("[yellow]Applying --clean: Cleaning reports and site before test.[/yellow]")
|
||||
# Actual cleaning logic might reside here or be delegated entirely
|
||||
|
||||
console.print(f"\n[bold white]Running stress test:[/bold white] {' '.join(cmd)}")
|
||||
start = time.time()
|
||||
|
||||
# Execute the stress test script
|
||||
# Use Popen to stream output
|
||||
try:
|
||||
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, encoding='utf-8', errors='replace')
|
||||
while True:
|
||||
line = proc.stdout.readline()
|
||||
if not line:
|
||||
break
|
||||
console.print(line.rstrip()) # Print line by line
|
||||
proc.wait() # Wait for the process to complete
|
||||
except FileNotFoundError:
|
||||
console.print(f"[bold red]Error: Script 'test_stress_sdk.py' not found. Make sure it's in the correct directory.[/bold red]")
|
||||
return False
|
||||
except Exception as e:
|
||||
console.print(f"[bold red]Error running stress test subprocess: {e}[/bold red]")
|
||||
return False
|
||||
|
||||
|
||||
if proc.returncode != 0:
|
||||
console.print(f"[bold red]Stress test failed with exit code {proc.returncode}[/bold red]")
|
||||
return False
|
||||
|
||||
duration = time.time() - start
|
||||
console.print(f"[bold green]Stress test completed in {duration:.1f} seconds[/bold green]")
|
||||
|
||||
# --- Report Generation (Optional) ---
|
||||
if compare:
|
||||
# Assuming benchmark_report.py exists and works with the generated reports
|
||||
report_script = "benchmark_report.py" # Keep configurable if needed
|
||||
report_cmd = ["python", report_script]
|
||||
console.print(f"\n[bold white]Generating benchmark report: {' '.join(report_cmd)}[/bold white]")
|
||||
|
||||
# Run the report command and capture output
|
||||
try:
|
||||
report_proc = subprocess.run(report_cmd, capture_output=True, text=True, check=False, encoding='utf-8', errors='replace') # Use check=False to handle potential errors
|
||||
|
||||
# Print the captured output from benchmark_report.py
|
||||
if report_proc.stdout:
|
||||
console.print("\n" + report_proc.stdout)
|
||||
if report_proc.stderr:
|
||||
console.print("[yellow]Report generator stderr:[/yellow]\n" + report_proc.stderr)
|
||||
|
||||
if report_proc.returncode != 0:
|
||||
console.print(f"[bold yellow]Benchmark report generation script '{report_script}' failed with exit code {report_proc.returncode}[/bold yellow]")
|
||||
# Don't return False here, test itself succeeded
|
||||
else:
|
||||
console.print(f"[bold green]Benchmark report script '{report_script}' completed.[/bold green]")
|
||||
|
||||
# Find and print clickable links to the reports
|
||||
# Assuming reports are saved in 'benchmark_reports' by benchmark_report.py
|
||||
report_dir = "benchmark_reports"
|
||||
if os.path.isdir(report_dir):
|
||||
report_files = glob.glob(os.path.join(report_dir, "comparison_report_*.html"))
|
||||
if report_files:
|
||||
try:
|
||||
latest_report = max(report_files, key=os.path.getctime)
|
||||
report_path = os.path.abspath(latest_report)
|
||||
report_url = pathlib.Path(report_path).as_uri() # Better way to create file URI
|
||||
console.print(f"[bold cyan]Click to open report: [link={report_url}]{report_url}[/link][/bold cyan]")
|
||||
except Exception as e:
|
||||
console.print(f"[yellow]Could not determine latest report: {e}[/yellow]")
|
||||
|
||||
chart_files = glob.glob(os.path.join(report_dir, "memory_chart_*.png"))
|
||||
if chart_files:
|
||||
try:
|
||||
latest_chart = max(chart_files, key=os.path.getctime)
|
||||
chart_path = os.path.abspath(latest_chart)
|
||||
chart_url = pathlib.Path(chart_path).as_uri()
|
||||
console.print(f"[cyan]Memory chart: [link={chart_url}]{chart_url}[/link][/cyan]")
|
||||
except Exception as e:
|
||||
console.print(f"[yellow]Could not determine latest chart: {e}[/yellow]")
|
||||
else:
|
||||
console.print(f"[yellow]Benchmark report directory '{report_dir}' not found. Cannot link reports.[/yellow]")
|
||||
|
||||
except FileNotFoundError:
|
||||
console.print(f"[bold red]Error: Report script '{report_script}' not found.[/bold red]")
|
||||
except Exception as e:
|
||||
console.print(f"[bold red]Error running report generation subprocess: {e}[/bold red]")
|
||||
|
||||
|
||||
# Prompt to exit
|
||||
console.print("\n[bold green]Benchmark run finished. Press Enter to exit.[/bold green]")
|
||||
try:
|
||||
input() # Wait for user input
|
||||
except EOFError:
|
||||
pass # Handle case where input is piped or unavailable
|
||||
|
||||
return True
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Run a Crawl4AI SDK benchmark test and generate a report")
|
||||
|
||||
# --- Arguments ---
|
||||
parser.add_argument("config", choices=list(TEST_CONFIGS) + ["custom"],
|
||||
help="Test configuration: quick, small, medium, large, extreme, or custom")
|
||||
|
||||
# Arguments for 'custom' config or to override presets
|
||||
parser.add_argument("--urls", type=int, help="Number of URLs")
|
||||
parser.add_argument("--max-sessions", type=int, help="Max concurrent sessions (replaces --workers)")
|
||||
parser.add_argument("--chunk-size", type=int, help="URLs per batch (for non-stream logging)")
|
||||
parser.add_argument("--port", type=int, help="HTTP server port")
|
||||
parser.add_argument("--monitor-mode", type=str, choices=["DETAILED", "AGGREGATED"], help="Monitor display mode")
|
||||
|
||||
# Boolean flags / options
|
||||
parser.add_argument("--stream", action="store_true", help="Enable streaming results (disables batch logging)")
|
||||
parser.add_argument("--use-rate-limiter", action="store_true", help="Enable basic rate limiter")
|
||||
parser.add_argument("--no-report", action="store_true", help="Skip generating comparison report")
|
||||
parser.add_argument("--clean", action="store_true", help="Clean up reports and site before running")
|
||||
parser.add_argument("--keep-server-alive", action="store_true", help="Keep HTTP server running after test")
|
||||
parser.add_argument("--use-existing-site", action="store_true", help="Use existing site on specified port")
|
||||
parser.add_argument("--skip-generation", action="store_true", help="Use existing site files without regenerating")
|
||||
parser.add_argument("--keep-site", action="store_true", help="Keep generated site files after test")
|
||||
# Removed url_level_logging as it's implicitly handled by stream/batch mode now
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
custom_args = {}
|
||||
|
||||
# Populate custom_args from explicit command-line args
|
||||
if args.urls is not None: custom_args["urls"] = args.urls
|
||||
if args.max_sessions is not None: custom_args["max_sessions"] = args.max_sessions
|
||||
if args.chunk_size is not None: custom_args["chunk_size"] = args.chunk_size
|
||||
if args.port is not None: custom_args["port"] = args.port
|
||||
if args.monitor_mode is not None: custom_args["monitor_mode"] = args.monitor_mode
|
||||
if args.stream: custom_args["stream"] = True
|
||||
if args.use_rate_limiter: custom_args["use_rate_limiter"] = True
|
||||
if args.keep_server_alive: custom_args["keep_server_alive"] = True
|
||||
if args.use_existing_site: custom_args["use_existing_site"] = True
|
||||
if args.skip_generation: custom_args["skip_generation"] = True
|
||||
if args.keep_site: custom_args["keep_site"] = True
|
||||
# Clean flags are handled by the 'clean' argument passed to run_benchmark
|
||||
|
||||
# Validate custom config requirements
|
||||
if args.config == "custom":
|
||||
required_custom = ["urls", "max_sessions", "chunk_size"]
|
||||
missing = [f"--{arg}" for arg in required_custom if arg not in custom_args]
|
||||
if missing:
|
||||
console.print(f"[bold red]Error: 'custom' config requires: {', '.join(missing)}[/bold red]")
|
||||
return 1
|
||||
|
||||
success = run_benchmark(
|
||||
config_name=args.config,
|
||||
custom_args=custom_args, # Pass all collected custom args
|
||||
compare=not args.no_report,
|
||||
clean=args.clean
|
||||
)
|
||||
return 0 if success else 1
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user