Compare commits
42 Commits
release/v0
...
feature/te
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
d48d382d18 | ||
|
|
7f360577d9 | ||
|
|
9447054a65 | ||
|
|
f4a432829e | ||
|
|
e651e045c4 | ||
|
|
5398acc7d2 | ||
|
|
22c7932ba3 | ||
|
|
2ab0bf27c2 | ||
|
|
d30dc9fdc1 | ||
|
|
e6044e6053 | ||
|
|
a50e47adad | ||
|
|
ada7441bd1 | ||
|
|
9f7fee91a9 | ||
|
|
7f48655cf1 | ||
|
|
1417a67e90 | ||
|
|
19398d33ef | ||
|
|
263d362daa | ||
|
|
bac92a47e4 | ||
|
|
a51545c883 | ||
|
|
11b310edef | ||
|
|
926e41aab8 | ||
|
|
489981e670 | ||
|
|
b92be4ef66 | ||
|
|
7c0edaf266 | ||
|
|
dfcfd8ae57 | ||
|
|
955110a8b0 | ||
|
|
f30811b524 | ||
|
|
8146d477e9 | ||
|
|
96c4b0de67 | ||
|
|
57c14db7cb | ||
|
|
cd2dd68e4c | ||
|
|
f0ce7b2710 | ||
|
|
21f79fe166 | ||
|
|
18ad3ef159 | ||
|
|
0541b61405 | ||
|
|
b61b2ee676 | ||
|
|
89cf5aba2b | ||
|
|
6735c68288 | ||
|
|
64f37792a7 | ||
|
|
c4d625fb3c | ||
|
|
ef722766f0 | ||
|
|
4bcb7171a3 |
70
CHANGELOG.md
70
CHANGELOG.md
@@ -5,6 +5,76 @@ All notable changes to Crawl4AI will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [0.7.3] - 2025-08-09
|
||||
|
||||
### Added
|
||||
- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
|
||||
- `browser_adapter.py` with undetected Chrome integration
|
||||
- Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
|
||||
- Support for headless stealth mode with anti-detection techniques
|
||||
- Human-like behavior simulation with random mouse movements and scrolling
|
||||
- Comprehensive examples for anti-bot strategies and stealth crawling
|
||||
- Full documentation guide for undetected browser usage
|
||||
|
||||
- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
|
||||
- Different crawling strategies for different URL patterns in a single batch
|
||||
- Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
|
||||
- Lambda function matchers for complex URL logic
|
||||
- Mixed matchers combining strings and functions with AND/OR logic
|
||||
- Fallback configuration support when no patterns match
|
||||
- First-match-wins configuration selection with optional fallback
|
||||
|
||||
- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
|
||||
- New `memory_utils.py` module for memory monitoring and optimization
|
||||
- Real-time memory usage tracking during crawl sessions
|
||||
- Memory leak detection and reporting
|
||||
- Performance optimization recommendations
|
||||
- Peak memory usage analysis and efficiency metrics
|
||||
- Automatic cleanup suggestions for memory-intensive operations
|
||||
|
||||
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
|
||||
- Direct `result.tables` interface replacing generic `result.media` approach
|
||||
- Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
|
||||
- Enhanced table detection algorithms for better accuracy
|
||||
- Table metadata including source XPath and headers
|
||||
- Improved table structure preservation during extraction
|
||||
|
||||
- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
|
||||
- Supporter ($5/month): Community support + early feature previews
|
||||
- Professional ($25/month): Priority support + beta access
|
||||
- Business ($100/month): Direct consultation + custom integrations
|
||||
- Enterprise ($500/month): Dedicated support + feature development
|
||||
- Custom arrangement options for larger organizations
|
||||
|
||||
- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
|
||||
- `LLM_PROVIDER` environment variable support for dynamic provider switching
|
||||
- `.llm.env` file support for secure configuration management
|
||||
- Per-request provider override capabilities in API endpoints
|
||||
- Support for OpenAI, Groq, and other providers without rebuilding images
|
||||
- Enhanced Docker documentation with deployment examples
|
||||
|
||||
### Fixed
|
||||
- **URL Matcher Fallback**: Resolved edge cases in URL pattern matching logic
|
||||
- **Memory Management**: Fixed memory leaks in long-running crawl sessions
|
||||
- **Sitemap Processing**: Improved redirect handling in sitemap fetching
|
||||
- **Table Extraction**: Enhanced table detection and extraction accuracy
|
||||
- **Error Handling**: Better error messages and recovery from network failures
|
||||
|
||||
### Changed
|
||||
- **Architecture Refactoring**: Major cleanup and optimization
|
||||
- Moved 2,450+ lines from main `async_crawler_strategy.py` to backup
|
||||
- Cleaner separation of concerns in crawler architecture
|
||||
- Better maintainability and code organization
|
||||
- Preserved backward compatibility while improving performance
|
||||
|
||||
### Documentation
|
||||
- **Comprehensive Examples**: Added real-world URLs and practical use cases
|
||||
- **API Documentation**: Complete CrawlResult field documentation with all available fields
|
||||
- **Migration Guides**: Updated table extraction patterns from `result.media` to `result.tables`
|
||||
- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
|
||||
- **Multi-Config Examples**: Detailed examples for URL-specific configurations
|
||||
- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration
|
||||
|
||||
## [0.7.x] - 2025-06-29
|
||||
|
||||
### Added
|
||||
|
||||
136
Makefile.telemetry
Normal file
136
Makefile.telemetry
Normal file
@@ -0,0 +1,136 @@
|
||||
# Makefile for Crawl4AI Telemetry Testing
|
||||
# Usage: make test-telemetry, make test-unit, make test-integration, etc.
|
||||
|
||||
.PHONY: help test-all test-telemetry test-unit test-integration test-privacy test-performance test-slow test-coverage test-verbose clean
|
||||
|
||||
# Default Python executable
|
||||
PYTHON := .venv/bin/python
|
||||
PYTEST := $(PYTHON) -m pytest
|
||||
|
||||
help:
|
||||
@echo "Crawl4AI Telemetry Testing Commands:"
|
||||
@echo ""
|
||||
@echo " test-all Run all telemetry tests"
|
||||
@echo " test-telemetry Run all telemetry tests (same as test-all)"
|
||||
@echo " test-unit Run unit tests only"
|
||||
@echo " test-integration Run integration tests only"
|
||||
@echo " test-privacy Run privacy compliance tests only"
|
||||
@echo " test-performance Run performance tests only"
|
||||
@echo " test-slow Run slow tests only"
|
||||
@echo " test-coverage Run tests with coverage report"
|
||||
@echo " test-verbose Run tests with verbose output"
|
||||
@echo " test-specific TEST= Run specific test (e.g., make test-specific TEST=test_telemetry.py::TestTelemetryConfig)"
|
||||
@echo " clean Clean test artifacts"
|
||||
@echo ""
|
||||
@echo "Environment Variables:"
|
||||
@echo " CRAWL4AI_TELEMETRY_TEST_REAL=1 Enable real telemetry during tests"
|
||||
@echo " PYTEST_ARGS Additional pytest arguments"
|
||||
|
||||
# Run all telemetry tests
|
||||
test-all test-telemetry:
|
||||
$(PYTEST) tests/telemetry/ -v
|
||||
|
||||
# Run unit tests only
|
||||
test-unit:
|
||||
$(PYTEST) tests/telemetry/ -m "unit" -v
|
||||
|
||||
# Run integration tests only
|
||||
test-integration:
|
||||
$(PYTEST) tests/telemetry/ -m "integration" -v
|
||||
|
||||
# Run privacy compliance tests only
|
||||
test-privacy:
|
||||
$(PYTEST) tests/telemetry/ -m "privacy" -v
|
||||
|
||||
# Run performance tests only
|
||||
test-performance:
|
||||
$(PYTEST) tests/telemetry/ -m "performance" -v
|
||||
|
||||
# Run slow tests only
|
||||
test-slow:
|
||||
$(PYTEST) tests/telemetry/ -m "slow" -v
|
||||
|
||||
# Run tests with coverage
|
||||
test-coverage:
|
||||
$(PYTEST) tests/telemetry/ --cov=crawl4ai.telemetry --cov-report=html --cov-report=term-missing -v
|
||||
|
||||
# Run tests with verbose output
|
||||
test-verbose:
|
||||
$(PYTEST) tests/telemetry/ -vvv --tb=long
|
||||
|
||||
# Run specific test
|
||||
test-specific:
|
||||
$(PYTEST) tests/telemetry/$(TEST) -v
|
||||
|
||||
# Run tests excluding slow ones
|
||||
test-fast:
|
||||
$(PYTEST) tests/telemetry/ -m "not slow" -v
|
||||
|
||||
# Run tests in parallel
|
||||
test-parallel:
|
||||
$(PYTEST) tests/telemetry/ -n auto -v
|
||||
|
||||
# Clean test artifacts
|
||||
clean:
|
||||
rm -rf .pytest_cache/
|
||||
rm -rf htmlcov/
|
||||
rm -rf .coverage
|
||||
find tests/ -name "*.pyc" -delete
|
||||
find tests/ -name "__pycache__" -type d -exec rm -rf {} +
|
||||
rm -rf tests/telemetry/__pycache__/
|
||||
|
||||
# Lint test files
|
||||
lint-tests:
|
||||
$(PYTHON) -m flake8 tests/telemetry/
|
||||
$(PYTHON) -m pylint tests/telemetry/
|
||||
|
||||
# Type check test files
|
||||
typecheck-tests:
|
||||
$(PYTHON) -m mypy tests/telemetry/
|
||||
|
||||
# Run all quality checks
|
||||
check-tests: lint-tests typecheck-tests test-unit
|
||||
|
||||
# Install test dependencies
|
||||
install-test-deps:
|
||||
$(PYTHON) -m pip install pytest pytest-asyncio pytest-mock pytest-cov pytest-xdist
|
||||
|
||||
# Setup development environment for testing
|
||||
setup-dev:
|
||||
$(PYTHON) -m pip install -e .
|
||||
$(MAKE) install-test-deps
|
||||
|
||||
# Generate test report
|
||||
test-report:
|
||||
$(PYTEST) tests/telemetry/ --html=test-report.html --self-contained-html -v
|
||||
|
||||
# Run performance benchmarks
|
||||
benchmark:
|
||||
$(PYTEST) tests/telemetry/test_privacy_performance.py::TestTelemetryPerformance -v --benchmark-only
|
||||
|
||||
# Test different environments
|
||||
test-docker-env:
|
||||
CRAWL4AI_DOCKER=true $(PYTEST) tests/telemetry/ -k "docker" -v
|
||||
|
||||
test-cli-env:
|
||||
$(PYTEST) tests/telemetry/ -k "cli" -v
|
||||
|
||||
# Validate telemetry implementation
|
||||
validate:
|
||||
@echo "Running telemetry validation suite..."
|
||||
$(MAKE) test-unit
|
||||
$(MAKE) test-privacy
|
||||
$(MAKE) test-performance
|
||||
@echo "Validation complete!"
|
||||
|
||||
# Debug failing tests
|
||||
debug:
|
||||
$(PYTEST) tests/telemetry/ --pdb -x -v
|
||||
|
||||
# Show test markers
|
||||
show-markers:
|
||||
$(PYTEST) --markers
|
||||
|
||||
# Show test collection (dry run)
|
||||
show-tests:
|
||||
$(PYTEST) tests/telemetry/ --collect-only -q
|
||||
132
README.md
132
README.md
@@ -27,9 +27,11 @@
|
||||
|
||||
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
|
||||
|
||||
[✨ Check out latest update v0.7.0](#-recent-updates)
|
||||
[✨ Check out latest update v0.7.4](#-recent-updates)
|
||||
|
||||
✨ New in v0.7.0, Adaptive Crawling, Virtual Scroll, Link Preview scoring, Async URL Seeder, big performance gains. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
|
||||
✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
|
||||
|
||||
✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
|
||||
|
||||
<details>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
@@ -302,9 +304,9 @@ The new Docker implementation includes:
|
||||
### Getting Started
|
||||
|
||||
```bash
|
||||
# Pull and run the latest release candidate
|
||||
docker pull unclecode/crawl4ai:0.7.0
|
||||
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
|
||||
# Pull and run the latest release
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
|
||||
|
||||
# Visit the playground at http://localhost:11235/playground
|
||||
```
|
||||
@@ -542,7 +544,123 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
|
||||
<details>
|
||||
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>
|
||||
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables:
|
||||
```python
|
||||
from crawl4ai import LLMTableExtraction, LLMConfig
|
||||
|
||||
# Configure intelligent table extraction
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=LLMConfig(provider="openai/gpt-4.1-mini"),
|
||||
enable_chunking=True, # Handle massive tables
|
||||
chunk_token_threshold=5000, # Smart chunking threshold
|
||||
overlap_threshold=100, # Maintain context between chunks
|
||||
extraction_type="structured" # Get structured data output
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(table_extraction_strategy=table_strategy)
|
||||
result = await crawler.arun("https://complex-tables-site.com", config=config)
|
||||
|
||||
# Tables are automatically chunked, processed, and merged
|
||||
for table in result.tables:
|
||||
print(f"Extracted table: {len(table['data'])} rows")
|
||||
```
|
||||
|
||||
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks
|
||||
- **🧹 Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution
|
||||
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats
|
||||
|
||||
[Full v0.7.4 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>
|
||||
|
||||
- **🕵️ Undetected Browser Support**: Bypass sophisticated bot detection systems:
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="undetected", # Use undetected Chrome
|
||||
headless=True, # Can run headless with stealth
|
||||
extra_args=[
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-web-security"
|
||||
]
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://protected-site.com")
|
||||
# Successfully bypass Cloudflare, Akamai, and custom bot detection
|
||||
```
|
||||
|
||||
- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig, MatchMode
|
||||
|
||||
configs = [
|
||||
# Documentation sites - aggressive caching
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*docs*", "*documentation*"],
|
||||
cache_mode="write",
|
||||
markdown_generator_options={"include_links": True}
|
||||
),
|
||||
|
||||
# News/blog sites - fresh content
|
||||
CrawlerRunConfig(
|
||||
url_matcher=lambda url: 'blog' in url or 'news' in url,
|
||||
cache_mode="bypass"
|
||||
),
|
||||
|
||||
# Fallback for everything else
|
||||
CrawlerRunConfig()
|
||||
]
|
||||
|
||||
results = await crawler.arun_many(urls, config=configs)
|
||||
# Each URL gets the perfect configuration automatically
|
||||
```
|
||||
|
||||
- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
|
||||
```python
|
||||
from crawl4ai.memory_utils import MemoryMonitor
|
||||
|
||||
monitor = MemoryMonitor()
|
||||
monitor.start_monitoring()
|
||||
|
||||
results = await crawler.arun_many(large_url_list)
|
||||
|
||||
report = monitor.get_report()
|
||||
print(f"Peak memory: {report['peak_mb']:.1f} MB")
|
||||
print(f"Efficiency: {report['efficiency']:.1f}%")
|
||||
# Get optimization recommendations
|
||||
```
|
||||
|
||||
- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
|
||||
```python
|
||||
result = await crawler.arun("https://site-with-tables.com")
|
||||
|
||||
# New way - direct table access
|
||||
if result.tables:
|
||||
import pandas as pd
|
||||
for table in result.tables:
|
||||
df = pd.DataFrame(table['data'])
|
||||
print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
|
||||
```
|
||||
|
||||
- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
|
||||
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables
|
||||
|
||||
[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>
|
||||
|
||||
- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
|
||||
```python
|
||||
@@ -607,6 +725,8 @@ async def test_news_crawl():
|
||||
|
||||
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
</details>
|
||||
|
||||
## Version Numbering in Crawl4AI
|
||||
|
||||
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
|
||||
|
||||
190
TELEMETRY_TESTING_IMPLEMENTATION.md
Normal file
190
TELEMETRY_TESTING_IMPLEMENTATION.md
Normal file
@@ -0,0 +1,190 @@
|
||||
# Crawl4AI Telemetry Testing Implementation
|
||||
|
||||
## Overview
|
||||
|
||||
This document summarizes the comprehensive testing strategy implementation for Crawl4AI's opt-in telemetry system. The implementation provides thorough test coverage across unit tests, integration tests, privacy compliance tests, and performance tests.
|
||||
|
||||
## Implementation Summary
|
||||
|
||||
### 📊 Test Statistics
|
||||
- **Total Tests**: 40 tests
|
||||
- **Success Rate**: 100% (40/40 passing)
|
||||
- **Test Categories**: 4 categories (Unit, Integration, Privacy, Performance)
|
||||
- **Code Coverage**: 51% (625 statements, 308 missing)
|
||||
|
||||
### 🗂️ Test Structure
|
||||
|
||||
#### 1. **Unit Tests** (`tests/telemetry/test_telemetry.py`)
|
||||
- `TestTelemetryConfig`: Configuration management and persistence
|
||||
- `TestEnvironmentDetection`: CLI, Docker, API server environment detection
|
||||
- `TestTelemetryManager`: Singleton pattern and exception capture
|
||||
- `TestConsentManager`: Docker default behavior and environment overrides
|
||||
- `TestPublicAPI`: Public enable/disable/status functions
|
||||
- `TestIntegration`: Crawler exception capture integration
|
||||
|
||||
#### 2. **Integration Tests** (`tests/telemetry/test_integration.py`)
|
||||
- `TestTelemetryCLI`: CLI command testing (status, enable, disable)
|
||||
- `TestAsyncWebCrawlerIntegration`: Real crawler integration with decorators
|
||||
- `TestDockerIntegration`: Docker environment-specific behavior
|
||||
- `TestTelemetryProviderIntegration`: Sentry provider initialization and fallbacks
|
||||
|
||||
#### 3. **Privacy & Performance Tests** (`tests/telemetry/test_privacy_performance.py`)
|
||||
- `TestTelemetryPrivacy`: Data sanitization and PII protection
|
||||
- `TestTelemetryPerformance`: Decorator overhead measurement
|
||||
- `TestTelemetryScalability`: Multiple and concurrent exception handling
|
||||
|
||||
#### 4. **Hello World Test** (`tests/telemetry/test_hello_world_telemetry.py`)
|
||||
- Basic telemetry functionality validation
|
||||
|
||||
### 🔧 Testing Infrastructure
|
||||
|
||||
#### **Pytest Configuration** (`pytest.ini`)
|
||||
```ini
|
||||
[pytest]
|
||||
testpaths = tests/telemetry
|
||||
markers =
|
||||
unit: Unit tests
|
||||
integration: Integration tests
|
||||
privacy: Privacy compliance tests
|
||||
performance: Performance tests
|
||||
asyncio_mode = auto
|
||||
```
|
||||
|
||||
#### **Test Fixtures** (`tests/conftest.py`)
|
||||
- `temp_config_dir`: Temporary configuration directory
|
||||
- `enabled_telemetry_config`: Pre-configured enabled telemetry
|
||||
- `disabled_telemetry_config`: Pre-configured disabled telemetry
|
||||
- `mock_sentry_provider`: Mocked Sentry provider for testing
|
||||
|
||||
#### **Makefile Targets** (`Makefile.telemetry`)
|
||||
```makefile
|
||||
test-all: Run all telemetry tests
|
||||
test-unit: Run unit tests only
|
||||
test-integration: Run integration tests only
|
||||
test-privacy: Run privacy tests only
|
||||
test-performance: Run performance tests only
|
||||
test-coverage: Run tests with coverage report
|
||||
test-watch: Run tests in watch mode
|
||||
test-parallel: Run tests in parallel
|
||||
```
|
||||
|
||||
## 🎯 Key Features Tested
|
||||
|
||||
### Privacy Compliance
|
||||
- ✅ No URLs captured in telemetry data
|
||||
- ✅ No content captured in telemetry data
|
||||
- ✅ No PII (personally identifiable information) captured
|
||||
- ✅ Sanitized context only (error types, stack traces without content)
|
||||
|
||||
### Performance Impact
|
||||
- ✅ Telemetry decorator overhead < 1ms
|
||||
- ✅ Async decorator overhead < 1ms
|
||||
- ✅ Disabled telemetry has minimal performance impact
|
||||
- ✅ Configuration loading performance acceptable
|
||||
- ✅ Multiple exception capture scalability
|
||||
- ✅ Concurrent exception capture handling
|
||||
|
||||
### Integration Points
|
||||
- ✅ CLI command integration (status, enable, disable)
|
||||
- ✅ AsyncWebCrawler decorator integration
|
||||
- ✅ Docker environment auto-detection
|
||||
- ✅ Sentry provider initialization
|
||||
- ✅ Graceful degradation without Sentry
|
||||
- ✅ Environment variable overrides
|
||||
|
||||
### Core Functionality
|
||||
- ✅ Configuration persistence and loading
|
||||
- ✅ Consent management (Docker defaults, user prompts)
|
||||
- ✅ Environment detection (CLI, Docker, Jupyter, etc.)
|
||||
- ✅ Singleton pattern for TelemetryManager
|
||||
- ✅ Exception capture and forwarding
|
||||
- ✅ Provider abstraction (Sentry, Null)
|
||||
|
||||
## 🚀 Usage Examples
|
||||
|
||||
### Run All Tests
|
||||
```bash
|
||||
make -f Makefile.telemetry test-all
|
||||
```
|
||||
|
||||
### Run Specific Test Categories
|
||||
```bash
|
||||
# Unit tests only
|
||||
make -f Makefile.telemetry test-unit
|
||||
|
||||
# Integration tests only
|
||||
make -f Makefile.telemetry test-integration
|
||||
|
||||
# Privacy tests only
|
||||
make -f Makefile.telemetry test-privacy
|
||||
|
||||
# Performance tests only
|
||||
make -f Makefile.telemetry test-performance
|
||||
```
|
||||
|
||||
### Coverage Report
|
||||
```bash
|
||||
make -f Makefile.telemetry test-coverage
|
||||
```
|
||||
|
||||
### Parallel Execution
|
||||
```bash
|
||||
make -f Makefile.telemetry test-parallel
|
||||
```
|
||||
|
||||
## 📁 File Structure
|
||||
|
||||
```
|
||||
tests/
|
||||
├── conftest.py # Shared pytest fixtures
|
||||
└── telemetry/
|
||||
├── test_hello_world_telemetry.py # Basic functionality test
|
||||
├── test_telemetry.py # Unit tests
|
||||
├── test_integration.py # Integration tests
|
||||
└── test_privacy_performance.py # Privacy & performance tests
|
||||
|
||||
# Configuration
|
||||
pytest.ini # Pytest configuration with markers
|
||||
Makefile.telemetry # Convenient test execution targets
|
||||
```
|
||||
|
||||
## 🔍 Test Isolation & Mocking
|
||||
|
||||
### Environment Isolation
|
||||
- Tests run in isolated temporary directories
|
||||
- Environment variables are properly mocked/isolated
|
||||
- No interference between test runs
|
||||
- Clean state for each test
|
||||
|
||||
### Mock Strategies
|
||||
- `unittest.mock` for external dependencies
|
||||
- Temporary file systems for configuration testing
|
||||
- Subprocess mocking for CLI command testing
|
||||
- Time measurement for performance testing
|
||||
|
||||
## 📈 Coverage Analysis
|
||||
|
||||
Current test coverage: **51%** (625 statements)
|
||||
|
||||
### Well-Covered Areas:
|
||||
- Core configuration management (78%)
|
||||
- Telemetry initialization (69%)
|
||||
- Environment detection (64%)
|
||||
|
||||
### Areas for Future Enhancement:
|
||||
- Consent management UI (20% - interactive prompts)
|
||||
- Sentry provider implementation (25% - network calls)
|
||||
- Base provider abstractions (49% - error handling paths)
|
||||
|
||||
## 🎉 Implementation Success
|
||||
|
||||
The comprehensive testing strategy has been **successfully implemented** with:
|
||||
|
||||
- ✅ **100% test pass rate** (40/40 tests passing)
|
||||
- ✅ **Complete test infrastructure** (fixtures, configuration, targets)
|
||||
- ✅ **Privacy compliance verification** (no PII, URLs, or content captured)
|
||||
- ✅ **Performance validation** (minimal overhead confirmed)
|
||||
- ✅ **Integration testing** (CLI, Docker, AsyncWebCrawler)
|
||||
- ✅ **CI/CD ready** (Makefile targets for automation)
|
||||
|
||||
The telemetry system now has robust test coverage ensuring reliability, privacy compliance, and performance characteristics while maintaining comprehensive validation of all core functionality.
|
||||
@@ -29,6 +29,12 @@ from .extraction_strategy import (
|
||||
)
|
||||
from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from .table_extraction import (
|
||||
TableExtractionStrategy,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
LLMTableExtraction,
|
||||
)
|
||||
from .content_filter_strategy import (
|
||||
PruningContentFilter,
|
||||
BM25ContentFilter,
|
||||
@@ -156,6 +162,9 @@ __all__ = [
|
||||
"ChunkingStrategy",
|
||||
"RegexChunking",
|
||||
"DefaultMarkdownGenerator",
|
||||
"TableExtractionStrategy",
|
||||
"DefaultTableExtraction",
|
||||
"NoTableExtraction",
|
||||
"RelevantContentFilter",
|
||||
"PruningContentFilter",
|
||||
"BM25ContentFilter",
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# crawl4ai/__version__.py
|
||||
|
||||
# This is the version that will be used for stable releases
|
||||
__version__ = "0.7.3"
|
||||
__version__ = "0.7.4"
|
||||
|
||||
# For nightly builds, this gets set during build process
|
||||
__nightly_version__ = None
|
||||
|
||||
@@ -20,6 +20,7 @@ from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
|
||||
from .content_scraping_strategy import ContentScrapingStrategy, LXMLWebScrapingStrategy
|
||||
from .deep_crawling import DeepCrawlStrategy
|
||||
from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
|
||||
|
||||
from .cache_context import CacheMode
|
||||
from .proxy_strategy import ProxyRotationStrategy
|
||||
@@ -448,6 +449,10 @@ class BrowserConfig:
|
||||
self.chrome_channel = ""
|
||||
self.proxy = proxy
|
||||
self.proxy_config = proxy_config
|
||||
if isinstance(self.proxy_config, dict):
|
||||
self.proxy_config = ProxyConfig.from_dict(self.proxy_config)
|
||||
if isinstance(self.proxy_config, str):
|
||||
self.proxy_config = ProxyConfig.from_string(self.proxy_config)
|
||||
|
||||
|
||||
self.viewport_width = viewport_width
|
||||
@@ -978,6 +983,8 @@ class CrawlerRunConfig():
|
||||
Default: False.
|
||||
table_score_threshold (int): Minimum score threshold for processing a table.
|
||||
Default: 7.
|
||||
table_extraction (TableExtractionStrategy): Strategy to use for table extraction.
|
||||
Default: DefaultTableExtraction with table_score_threshold.
|
||||
|
||||
# Virtual Scroll Parameters
|
||||
virtual_scroll_config (VirtualScrollConfig or dict or None): Configuration for handling virtual scroll containers.
|
||||
@@ -1104,6 +1111,7 @@ class CrawlerRunConfig():
|
||||
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
|
||||
table_score_threshold: int = 7,
|
||||
table_extraction: TableExtractionStrategy = None,
|
||||
exclude_external_images: bool = False,
|
||||
exclude_all_images: bool = False,
|
||||
# Link and Domain Handling Parameters
|
||||
@@ -1159,6 +1167,11 @@ class CrawlerRunConfig():
|
||||
self.parser_type = parser_type
|
||||
self.scraping_strategy = scraping_strategy or LXMLWebScrapingStrategy()
|
||||
self.proxy_config = proxy_config
|
||||
if isinstance(proxy_config, dict):
|
||||
self.proxy_config = ProxyConfig.from_dict(proxy_config)
|
||||
if isinstance(proxy_config, str):
|
||||
self.proxy_config = ProxyConfig.from_string(proxy_config)
|
||||
|
||||
self.proxy_rotation_strategy = proxy_rotation_strategy
|
||||
|
||||
# Browser Location and Identity Parameters
|
||||
@@ -1215,6 +1228,12 @@ class CrawlerRunConfig():
|
||||
self.exclude_external_images = exclude_external_images
|
||||
self.exclude_all_images = exclude_all_images
|
||||
self.table_score_threshold = table_score_threshold
|
||||
|
||||
# Table extraction strategy (default to DefaultTableExtraction if not specified)
|
||||
if table_extraction is None:
|
||||
self.table_extraction = DefaultTableExtraction(table_score_threshold=table_score_threshold)
|
||||
else:
|
||||
self.table_extraction = table_extraction
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
self.exclude_social_media_domains = (
|
||||
@@ -1486,6 +1505,7 @@ class CrawlerRunConfig():
|
||||
"image_score_threshold", IMAGE_SCORE_THRESHOLD
|
||||
),
|
||||
table_score_threshold=kwargs.get("table_score_threshold", 7),
|
||||
table_extraction=kwargs.get("table_extraction", None),
|
||||
exclude_all_images=kwargs.get("exclude_all_images", False),
|
||||
exclude_external_images=kwargs.get("exclude_external_images", False),
|
||||
# Link and Domain Handling Parameters
|
||||
@@ -1594,6 +1614,7 @@ class CrawlerRunConfig():
|
||||
"image_description_min_word_threshold": self.image_description_min_word_threshold,
|
||||
"image_score_threshold": self.image_score_threshold,
|
||||
"table_score_threshold": self.table_score_threshold,
|
||||
"table_extraction": self.table_extraction,
|
||||
"exclude_all_images": self.exclude_all_images,
|
||||
"exclude_external_images": self.exclude_external_images,
|
||||
"exclude_social_media_domains": self.exclude_social_media_domains,
|
||||
|
||||
@@ -824,7 +824,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error:
|
||||
visibility_info = await self.check_visibility(page)
|
||||
|
||||
if self.browser_config.config.verbose:
|
||||
if self.browser_config.verbose:
|
||||
self.logger.debug(
|
||||
message="Body visibility info: {info}",
|
||||
tag="DEBUG",
|
||||
|
||||
@@ -2129,3 +2129,265 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
return True # Default to scrolling if check fails
|
||||
|
||||
|
||||
####################################################################################################
|
||||
# HTTP Crawler Strategy
|
||||
####################################################################################################
|
||||
|
||||
class HTTPCrawlerError(Exception):
|
||||
"""Base error class for HTTP crawler specific exceptions"""
|
||||
pass
|
||||
|
||||
|
||||
class ConnectionTimeoutError(HTTPCrawlerError):
|
||||
"""Raised when connection timeout occurs"""
|
||||
pass
|
||||
|
||||
|
||||
class HTTPStatusError(HTTPCrawlerError):
|
||||
"""Raised for unexpected status codes"""
|
||||
def __init__(self, status_code: int, message: str):
|
||||
self.status_code = status_code
|
||||
super().__init__(f"HTTP {status_code}: {message}")
|
||||
|
||||
|
||||
class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
Fast, lightweight HTTP-only crawler strategy optimized for memory efficiency.
|
||||
"""
|
||||
|
||||
__slots__ = ('logger', 'max_connections', 'dns_cache_ttl', 'chunk_size', '_session', 'hooks', 'browser_config')
|
||||
|
||||
DEFAULT_TIMEOUT: Final[int] = 30
|
||||
DEFAULT_CHUNK_SIZE: Final[int] = 64 * 1024
|
||||
DEFAULT_MAX_CONNECTIONS: Final[int] = min(32, (os.cpu_count() or 1) * 4)
|
||||
DEFAULT_DNS_CACHE_TTL: Final[int] = 300
|
||||
VALID_SCHEMES: Final = frozenset({'http', 'https', 'file', 'raw'})
|
||||
|
||||
_BASE_HEADERS: Final = MappingProxyType({
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
||||
'Accept-Language': 'en-US,en;q=0.5',
|
||||
'Accept-Encoding': 'gzip, deflate, br',
|
||||
'Connection': 'keep-alive',
|
||||
'Upgrade-Insecure-Requests': '1',
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
||||
})
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
browser_config: Optional[HTTPCrawlerConfig] = None,
|
||||
logger: Optional[AsyncLogger] = None,
|
||||
max_connections: int = DEFAULT_MAX_CONNECTIONS,
|
||||
dns_cache_ttl: int = DEFAULT_DNS_CACHE_TTL,
|
||||
chunk_size: int = DEFAULT_CHUNK_SIZE
|
||||
):
|
||||
"""Initialize the HTTP crawler with config"""
|
||||
self.browser_config = browser_config or HTTPCrawlerConfig()
|
||||
self.logger = logger
|
||||
self.max_connections = max_connections
|
||||
self.dns_cache_ttl = dns_cache_ttl
|
||||
self.chunk_size = chunk_size
|
||||
self._session: Optional[aiohttp.ClientSession] = None
|
||||
|
||||
self.hooks = {
|
||||
k: partial(self._execute_hook, k)
|
||||
for k in ('before_request', 'after_request', 'on_error')
|
||||
}
|
||||
|
||||
# Set default hooks
|
||||
self.set_hook('before_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('after_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('on_error', lambda *args, **kwargs: None)
|
||||
|
||||
|
||||
async def __aenter__(self) -> AsyncHTTPCrawlerStrategy:
|
||||
await self.start()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
|
||||
await self.close()
|
||||
|
||||
@contextlib.asynccontextmanager
|
||||
async def _session_context(self):
|
||||
try:
|
||||
if not self._session:
|
||||
await self.start()
|
||||
yield self._session
|
||||
finally:
|
||||
pass
|
||||
|
||||
def set_hook(self, hook_type: str, hook_func: Callable) -> None:
|
||||
if hook_type in self.hooks:
|
||||
self.hooks[hook_type] = partial(self._execute_hook, hook_type, hook_func)
|
||||
else:
|
||||
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||
|
||||
async def _execute_hook(
|
||||
self,
|
||||
hook_type: str,
|
||||
hook_func: Callable,
|
||||
*args: Any,
|
||||
**kwargs: Any
|
||||
) -> Any:
|
||||
if asyncio.iscoroutinefunction(hook_func):
|
||||
return await hook_func(*args, **kwargs)
|
||||
return hook_func(*args, **kwargs)
|
||||
|
||||
async def start(self) -> None:
|
||||
if not self._session:
|
||||
connector = aiohttp.TCPConnector(
|
||||
limit=self.max_connections,
|
||||
ttl_dns_cache=self.dns_cache_ttl,
|
||||
use_dns_cache=True,
|
||||
force_close=False
|
||||
)
|
||||
self._session = aiohttp.ClientSession(
|
||||
headers=dict(self._BASE_HEADERS),
|
||||
connector=connector,
|
||||
timeout=ClientTimeout(total=self.DEFAULT_TIMEOUT)
|
||||
)
|
||||
|
||||
async def close(self) -> None:
|
||||
if self._session and not self._session.closed:
|
||||
try:
|
||||
await asyncio.wait_for(self._session.close(), timeout=5.0)
|
||||
except asyncio.TimeoutError:
|
||||
if self.logger:
|
||||
self.logger.warning(
|
||||
message="Session cleanup timed out",
|
||||
tag="CLEANUP"
|
||||
)
|
||||
finally:
|
||||
self._session = None
|
||||
|
||||
async def _stream_file(self, path: str) -> AsyncGenerator[memoryview, None]:
|
||||
async with aiofiles.open(path, mode='rb') as f:
|
||||
while chunk := await f.read(self.chunk_size):
|
||||
yield memoryview(chunk)
|
||||
|
||||
async def _handle_file(self, path: str) -> AsyncCrawlResponse:
|
||||
if not os.path.exists(path):
|
||||
raise FileNotFoundError(f"Local file not found: {path}")
|
||||
|
||||
chunks = []
|
||||
async for chunk in self._stream_file(path):
|
||||
chunks.append(chunk.tobytes().decode('utf-8', errors='replace'))
|
||||
|
||||
return AsyncCrawlResponse(
|
||||
html=''.join(chunks),
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
async def _handle_raw(self, content: str) -> AsyncCrawlResponse:
|
||||
return AsyncCrawlResponse(
|
||||
html=content,
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
|
||||
async def _handle_http(
|
||||
self,
|
||||
url: str,
|
||||
config: CrawlerRunConfig
|
||||
) -> AsyncCrawlResponse:
|
||||
async with self._session_context() as session:
|
||||
timeout = ClientTimeout(
|
||||
total=config.page_timeout or self.DEFAULT_TIMEOUT,
|
||||
connect=10,
|
||||
sock_read=30
|
||||
)
|
||||
|
||||
headers = dict(self._BASE_HEADERS)
|
||||
if self.browser_config.headers:
|
||||
headers.update(self.browser_config.headers)
|
||||
|
||||
request_kwargs = {
|
||||
'timeout': timeout,
|
||||
'allow_redirects': self.browser_config.follow_redirects,
|
||||
'ssl': self.browser_config.verify_ssl,
|
||||
'headers': headers
|
||||
}
|
||||
|
||||
if self.browser_config.method == "POST":
|
||||
if self.browser_config.data:
|
||||
request_kwargs['data'] = self.browser_config.data
|
||||
if self.browser_config.json:
|
||||
request_kwargs['json'] = self.browser_config.json
|
||||
|
||||
await self.hooks['before_request'](url, request_kwargs)
|
||||
|
||||
try:
|
||||
async with session.request(self.browser_config.method, url, **request_kwargs) as response:
|
||||
content = memoryview(await response.read())
|
||||
|
||||
if not (200 <= response.status < 300):
|
||||
raise HTTPStatusError(
|
||||
response.status,
|
||||
f"Unexpected status code for {url}"
|
||||
)
|
||||
|
||||
encoding = response.charset
|
||||
if not encoding:
|
||||
encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
|
||||
|
||||
result = AsyncCrawlResponse(
|
||||
html=content.tobytes().decode(encoding, errors='replace'),
|
||||
response_headers=dict(response.headers),
|
||||
status_code=response.status,
|
||||
redirected_url=str(response.url)
|
||||
)
|
||||
|
||||
await self.hooks['after_request'](result)
|
||||
return result
|
||||
|
||||
except aiohttp.ServerTimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except aiohttp.ClientConnectorError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionError(f"Connection failed: {str(e)}")
|
||||
|
||||
except aiohttp.ClientError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP client error: {str(e)}")
|
||||
|
||||
except asyncio.exceptions.TimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP request failed: {str(e)}")
|
||||
|
||||
async def crawl(
|
||||
self,
|
||||
url: str,
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
**kwargs
|
||||
) -> AsyncCrawlResponse:
|
||||
config = config or CrawlerRunConfig.from_kwargs(kwargs)
|
||||
|
||||
parsed = urlparse(url)
|
||||
scheme = parsed.scheme.rstrip('/')
|
||||
|
||||
if scheme not in self.VALID_SCHEMES:
|
||||
raise ValueError(f"Unsupported URL scheme: {scheme}")
|
||||
|
||||
try:
|
||||
if scheme == 'file':
|
||||
return await self._handle_file(parsed.path)
|
||||
elif scheme == 'raw':
|
||||
return await self._handle_raw(parsed.path)
|
||||
else: # http or https
|
||||
return await self._handle_http(url, config)
|
||||
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.error(
|
||||
message="Crawl failed: {error}",
|
||||
tag="CRAWL",
|
||||
params={"error": str(e), "url": url}
|
||||
)
|
||||
raise
|
||||
|
||||
@@ -22,7 +22,7 @@ from urllib.parse import urlparse
|
||||
import random
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from .memory_utils import get_true_memory_usage_percent
|
||||
from .utils import get_true_memory_usage_percent
|
||||
|
||||
|
||||
class RateLimiter:
|
||||
@@ -407,32 +407,34 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
|
||||
t.cancel()
|
||||
raise exc
|
||||
|
||||
# If memory pressure is low, start new tasks
|
||||
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
|
||||
try:
|
||||
# Try to get a task with timeout to avoid blocking indefinitely
|
||||
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
|
||||
self.task_queue.get(), timeout=0.1
|
||||
)
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
# If memory pressure is low, greedily fill all available slots
|
||||
if not self.memory_pressure_mode:
|
||||
slots = self.max_session_permit - len(active_tasks)
|
||||
while slots > 0:
|
||||
try:
|
||||
# Use get_nowait() to immediately get tasks without blocking
|
||||
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
|
||||
|
||||
except asyncio.TimeoutError:
|
||||
# No tasks in queue, that's fine
|
||||
pass
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
|
||||
slots -= 1
|
||||
|
||||
except asyncio.QueueEmpty:
|
||||
# No more tasks in queue, exit the loop
|
||||
break
|
||||
|
||||
# Wait for completion even if queue is starved
|
||||
if active_tasks:
|
||||
@@ -559,32 +561,34 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
|
||||
for t in active_tasks:
|
||||
t.cancel()
|
||||
raise exc
|
||||
# If memory pressure is low, start new tasks
|
||||
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
|
||||
try:
|
||||
# Try to get a task with timeout
|
||||
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
|
||||
self.task_queue.get(), timeout=0.1
|
||||
)
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
# If memory pressure is low, greedily fill all available slots
|
||||
if not self.memory_pressure_mode:
|
||||
slots = self.max_session_permit - len(active_tasks)
|
||||
while slots > 0:
|
||||
try:
|
||||
# Use get_nowait() to immediately get tasks without blocking
|
||||
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
|
||||
|
||||
except asyncio.TimeoutError:
|
||||
# No tasks in queue, that's fine
|
||||
pass
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
|
||||
slots -= 1
|
||||
|
||||
except asyncio.QueueEmpty:
|
||||
# No more tasks in queue, exit the loop
|
||||
break
|
||||
|
||||
# Process completed tasks and yield results
|
||||
if active_tasks:
|
||||
|
||||
@@ -49,6 +49,9 @@ from .utils import (
|
||||
preprocess_html_for_schema,
|
||||
)
|
||||
|
||||
# Import telemetry
|
||||
from .telemetry import capture_exception, telemetry_decorator, async_telemetry_decorator
|
||||
|
||||
|
||||
class AsyncWebCrawler:
|
||||
"""
|
||||
@@ -201,6 +204,7 @@ class AsyncWebCrawler:
|
||||
"""异步空上下文管理器"""
|
||||
yield
|
||||
|
||||
@async_telemetry_decorator
|
||||
async def arun(
|
||||
self,
|
||||
url: str,
|
||||
@@ -430,6 +434,7 @@ class AsyncWebCrawler:
|
||||
)
|
||||
)
|
||||
|
||||
@async_telemetry_decorator
|
||||
async def aprocess_html(
|
||||
self,
|
||||
url: str,
|
||||
|
||||
@@ -608,6 +608,11 @@ class BrowserManager:
|
||||
self.contexts_by_config = {}
|
||||
self._contexts_lock = asyncio.Lock()
|
||||
|
||||
# Serialize context.new_page() across concurrent tasks to avoid races
|
||||
# when using a shared persistent context (context.pages may be empty
|
||||
# for all racers). Prevents 'Target page/context closed' errors.
|
||||
self._page_lock = asyncio.Lock()
|
||||
|
||||
# Stealth-related attributes
|
||||
self._stealth_instance = None
|
||||
self._stealth_cm = None
|
||||
@@ -1027,13 +1032,26 @@ class BrowserManager:
|
||||
context = await self.create_browser_context(crawlerRunConfig)
|
||||
ctx = self.default_context # default context, one window only
|
||||
ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
|
||||
page = await ctx.new_page()
|
||||
# Avoid concurrent new_page on shared persistent context
|
||||
# See GH-1198: context.pages can be empty under races
|
||||
async with self._page_lock:
|
||||
page = await ctx.new_page()
|
||||
else:
|
||||
context = self.default_context
|
||||
pages = context.pages
|
||||
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
||||
if not page:
|
||||
page = context.pages[0] # await context.new_page()
|
||||
if pages:
|
||||
page = pages[0]
|
||||
else:
|
||||
# Double-check under lock to avoid TOCTOU and ensure only
|
||||
# one task calls new_page when pages=[] concurrently
|
||||
async with self._page_lock:
|
||||
pages = context.pages
|
||||
if pages:
|
||||
page = pages[0]
|
||||
else:
|
||||
page = await context.new_page()
|
||||
else:
|
||||
# Otherwise, check if we have an existing context for this config
|
||||
config_signature = self._make_config_signature(crawlerRunConfig)
|
||||
|
||||
@@ -65,6 +65,213 @@ class BrowserProfiler:
|
||||
self.builtin_config_file = os.path.join(self.builtin_browser_dir, "browser_config.json")
|
||||
os.makedirs(self.builtin_browser_dir, exist_ok=True)
|
||||
|
||||
def _is_windows(self) -> bool:
|
||||
"""Check if running on Windows platform."""
|
||||
return sys.platform.startswith('win') or sys.platform == 'cygwin'
|
||||
|
||||
def _is_macos(self) -> bool:
|
||||
"""Check if running on macOS platform."""
|
||||
return sys.platform == 'darwin'
|
||||
|
||||
def _is_linux(self) -> bool:
|
||||
"""Check if running on Linux platform."""
|
||||
return sys.platform.startswith('linux')
|
||||
|
||||
def _get_quit_message(self, tag: str) -> str:
|
||||
"""Get appropriate quit message based on context."""
|
||||
if tag == "PROFILE":
|
||||
return "Closing browser and saving profile..."
|
||||
elif tag == "CDP":
|
||||
return "Closing browser..."
|
||||
else:
|
||||
return "Closing browser..."
|
||||
|
||||
async def _listen_windows(self, user_done_event, check_browser_process, tag: str):
|
||||
"""Windows-specific keyboard listener using msvcrt."""
|
||||
try:
|
||||
import msvcrt
|
||||
except ImportError:
|
||||
raise ImportError("msvcrt module not available on this platform")
|
||||
|
||||
while True:
|
||||
try:
|
||||
# Check for keyboard input
|
||||
if msvcrt.kbhit():
|
||||
raw = msvcrt.getch()
|
||||
|
||||
# Handle Unicode decoding more robustly
|
||||
key = None
|
||||
try:
|
||||
key = raw.decode("utf-8")
|
||||
except UnicodeDecodeError:
|
||||
try:
|
||||
# Try different encodings
|
||||
key = raw.decode("latin1")
|
||||
except UnicodeDecodeError:
|
||||
# Skip if we can't decode
|
||||
continue
|
||||
|
||||
# Validate key
|
||||
if not key or len(key) != 1:
|
||||
continue
|
||||
|
||||
# Check for printable characters only
|
||||
if not key.isprintable():
|
||||
continue
|
||||
|
||||
# Check for quit command
|
||||
if key.lower() == "q":
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay to prevent busy waiting
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in Windows keyboard listener: {e}", tag=tag)
|
||||
# Continue trying instead of failing completely
|
||||
await asyncio.sleep(0.1)
|
||||
continue
|
||||
|
||||
async def _listen_unix(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
|
||||
"""Unix/Linux/macOS keyboard listener using termios and select."""
|
||||
try:
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
except ImportError:
|
||||
raise ImportError("termios/tty/select modules not available on this platform")
|
||||
|
||||
# Get stdin file descriptor
|
||||
try:
|
||||
fd = sys.stdin.fileno()
|
||||
except (AttributeError, OSError):
|
||||
raise ImportError("stdin is not a terminal")
|
||||
|
||||
# Save original terminal settings
|
||||
old_settings = None
|
||||
try:
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
except termios.error as e:
|
||||
raise ImportError(f"Cannot get terminal attributes: {e}")
|
||||
|
||||
try:
|
||||
# Switch to non-canonical mode (cbreak mode)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
try:
|
||||
# Use select to check if input is available (non-blocking)
|
||||
# Timeout of 0.5 seconds to periodically check browser process
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
|
||||
if readable:
|
||||
# Read one character
|
||||
key = sys.stdin.read(1)
|
||||
|
||||
if key and key.lower() == "q":
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay to prevent busy waiting
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
# Handle Ctrl+C or EOF gracefully
|
||||
self.logger.info("Keyboard interrupt received", tag=tag)
|
||||
user_done_event.set()
|
||||
return
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in Unix keyboard listener: {e}", tag=tag)
|
||||
await asyncio.sleep(0.1)
|
||||
continue
|
||||
|
||||
finally:
|
||||
# Always restore terminal settings
|
||||
if old_settings is not None:
|
||||
try:
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
except Exception as e:
|
||||
self.logger.error(f"Failed to restore terminal settings: {e}", tag=tag)
|
||||
|
||||
async def _listen_fallback(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
|
||||
"""Fallback keyboard listener using simple input() method."""
|
||||
self.logger.info("Using fallback input mode. Type 'q' and press Enter to quit.", tag=tag)
|
||||
|
||||
# Run input in a separate thread to avoid blocking
|
||||
import threading
|
||||
import queue
|
||||
|
||||
input_queue = queue.Queue()
|
||||
|
||||
def input_thread():
|
||||
"""Thread function to handle input."""
|
||||
try:
|
||||
while not user_done_event.is_set():
|
||||
try:
|
||||
# Use input() with a prompt
|
||||
user_input = input("Press 'q' + Enter to quit: ").strip().lower()
|
||||
input_queue.put(user_input)
|
||||
if user_input == 'q':
|
||||
break
|
||||
except (EOFError, KeyboardInterrupt):
|
||||
input_queue.put('q')
|
||||
break
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in input thread: {e}", tag=tag)
|
||||
break
|
||||
except Exception as e:
|
||||
self.logger.error(f"Input thread failed: {e}", tag=tag)
|
||||
|
||||
# Start input thread
|
||||
thread = threading.Thread(target=input_thread, daemon=True)
|
||||
thread.start()
|
||||
|
||||
try:
|
||||
while not user_done_event.is_set():
|
||||
# Check for user input
|
||||
try:
|
||||
user_input = input_queue.get_nowait()
|
||||
if user_input == 'q':
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
except queue.Empty:
|
||||
pass
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay
|
||||
await asyncio.sleep(0.5)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Fallback listener failed: {e}", tag=tag)
|
||||
user_done_event.set()
|
||||
|
||||
async def create_profile(self,
|
||||
profile_name: Optional[str] = None,
|
||||
browser_config: Optional[BrowserConfig] = None) -> Optional[str]:
|
||||
@@ -180,42 +387,38 @@ class BrowserProfiler:
|
||||
|
||||
# Run keyboard input loop in a separate task
|
||||
async def listen_for_quit_command():
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
|
||||
"""Cross-platform keyboard listener that waits for 'q' key press."""
|
||||
# First output the prompt
|
||||
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
|
||||
self.logger.info(
|
||||
"Press {segment} when you've finished using the browser...",
|
||||
tag="PROFILE",
|
||||
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
|
||||
base_color=LogColor.CYAN
|
||||
)
|
||||
|
||||
async def check_browser_process():
|
||||
"""Check if browser process is still running."""
|
||||
if (
|
||||
managed_browser.browser_process
|
||||
and managed_browser.browser_process.poll() is not None
|
||||
):
|
||||
self.logger.info(
|
||||
"Browser already closed. Ending input listener.", tag="PROFILE"
|
||||
)
|
||||
user_done_event.set()
|
||||
return True
|
||||
return False
|
||||
|
||||
# Try platform-specific implementations with fallback
|
||||
try:
|
||||
# Switch to non-canonical mode (no line buffering)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
# Check if input is available (non-blocking)
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if the browser process has already exited
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="PROFILE")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
finally:
|
||||
# Restore terminal settings
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
if self._is_windows():
|
||||
await self._listen_windows(user_done_event, check_browser_process, "PROFILE")
|
||||
else:
|
||||
await self._listen_unix(user_done_event, check_browser_process, "PROFILE")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="PROFILE")
|
||||
self.logger.info("Falling back to simple input mode...", tag="PROFILE")
|
||||
await self._listen_fallback(user_done_event, check_browser_process, "PROFILE")
|
||||
|
||||
try:
|
||||
from playwright.async_api import async_playwright
|
||||
@@ -682,42 +885,33 @@ class BrowserProfiler:
|
||||
|
||||
# Run keyboard input loop in a separate task
|
||||
async def listen_for_quit_command():
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
|
||||
"""Cross-platform keyboard listener that waits for 'q' key press."""
|
||||
# First output the prompt
|
||||
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
|
||||
self.logger.info(
|
||||
"Press {segment} to stop the browser and exit...",
|
||||
tag="CDP",
|
||||
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
|
||||
base_color=LogColor.CYAN
|
||||
)
|
||||
|
||||
async def check_browser_process():
|
||||
"""Check if browser process is still running."""
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
|
||||
user_done_event.set()
|
||||
return True
|
||||
return False
|
||||
|
||||
# Try platform-specific implementations with fallback
|
||||
try:
|
||||
# Switch to non-canonical mode (no line buffering)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
# Check if input is available (non-blocking)
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info("Closing browser...", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if the browser process has already exited
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
finally:
|
||||
# Restore terminal settings
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
if self._is_windows():
|
||||
await self._listen_windows(user_done_event, check_browser_process, "CDP")
|
||||
else:
|
||||
await self._listen_unix(user_done_event, check_browser_process, "CDP")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="CDP")
|
||||
self.logger.info("Falling back to simple input mode...", tag="CDP")
|
||||
await self._listen_fallback(user_done_event, check_browser_process, "CDP")
|
||||
|
||||
# Function to retrieve and display CDP JSON config
|
||||
async def get_cdp_json(port):
|
||||
|
||||
@@ -1385,6 +1385,97 @@ def profiles_cmd():
|
||||
# Run interactive profile manager
|
||||
anyio.run(manage_profiles)
|
||||
|
||||
@cli.group("telemetry")
|
||||
def telemetry_cmd():
|
||||
"""Manage telemetry settings for Crawl4AI
|
||||
|
||||
Telemetry helps improve Crawl4AI by sending anonymous crash reports.
|
||||
No personal data or crawled content is ever collected.
|
||||
"""
|
||||
pass
|
||||
|
||||
@telemetry_cmd.command("enable")
|
||||
@click.option("--email", "-e", help="Optional email for follow-up on critical issues")
|
||||
@click.option("--always/--once", default=True, help="Always send errors (default) or just once")
|
||||
def telemetry_enable_cmd(email: Optional[str], always: bool):
|
||||
"""Enable telemetry to help improve Crawl4AI
|
||||
|
||||
Examples:
|
||||
crwl telemetry enable # Enable telemetry
|
||||
crwl telemetry enable --email me@ex.com # Enable with email
|
||||
crwl telemetry enable --once # Send only next error
|
||||
"""
|
||||
from crawl4ai.telemetry import enable
|
||||
|
||||
try:
|
||||
enable(email=email, always=always, once=not always)
|
||||
console.print("[green]✅ Telemetry enabled successfully[/green]")
|
||||
|
||||
if email:
|
||||
console.print(f" Email: {email}")
|
||||
console.print(f" Mode: {'Always send errors' if always else 'Send next error only'}")
|
||||
|
||||
except Exception as e:
|
||||
console.print(f"[red]❌ Failed to enable telemetry: {e}[/red]")
|
||||
sys.exit(1)
|
||||
|
||||
@telemetry_cmd.command("disable")
|
||||
def telemetry_disable_cmd():
|
||||
"""Disable telemetry
|
||||
|
||||
Stop sending anonymous crash reports to help improve Crawl4AI.
|
||||
"""
|
||||
from crawl4ai.telemetry import disable
|
||||
|
||||
try:
|
||||
disable()
|
||||
console.print("[green]✅ Telemetry disabled successfully[/green]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]❌ Failed to disable telemetry: {e}[/red]")
|
||||
sys.exit(1)
|
||||
|
||||
@telemetry_cmd.command("status")
|
||||
def telemetry_status_cmd():
|
||||
"""Show current telemetry status
|
||||
|
||||
Display whether telemetry is enabled and current settings.
|
||||
"""
|
||||
from crawl4ai.telemetry import status
|
||||
|
||||
try:
|
||||
info = status()
|
||||
|
||||
# Create status table
|
||||
table = Table(title="Telemetry Status", show_header=False)
|
||||
table.add_column("Setting", style="cyan")
|
||||
table.add_column("Value")
|
||||
|
||||
# Status emoji
|
||||
status_icon = "✅" if info['enabled'] else "❌"
|
||||
|
||||
table.add_row("Status", f"{status_icon} {'Enabled' if info['enabled'] else 'Disabled'}")
|
||||
table.add_row("Consent", info['consent'].replace('_', ' ').title())
|
||||
|
||||
if info['email']:
|
||||
table.add_row("Email", info['email'])
|
||||
|
||||
table.add_row("Environment", info['environment'])
|
||||
table.add_row("Provider", info['provider'])
|
||||
|
||||
if info['errors_sent'] > 0:
|
||||
table.add_row("Errors Sent", str(info['errors_sent']))
|
||||
|
||||
console.print(table)
|
||||
|
||||
# Add helpful messages
|
||||
if not info['enabled']:
|
||||
console.print("\n[yellow]ℹ️ Telemetry is disabled. Enable it to help improve Crawl4AI:[/yellow]")
|
||||
console.print(" [dim]crwl telemetry enable[/dim]")
|
||||
|
||||
except Exception as e:
|
||||
console.print(f"[red]❌ Failed to get telemetry status: {e}[/red]")
|
||||
sys.exit(1)
|
||||
|
||||
@cli.command(name="")
|
||||
@click.argument("url", required=False)
|
||||
@click.option("--example", is_flag=True, help="Show usage examples")
|
||||
|
||||
@@ -242,6 +242,16 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
exclude_domains = set(kwargs.get("exclude_domains", []))
|
||||
|
||||
# Process links
|
||||
try:
|
||||
base_element = element.xpath("//head/base[@href]")
|
||||
if base_element:
|
||||
base_href = base_element[0].get("href", "").strip()
|
||||
if base_href:
|
||||
url = base_href
|
||||
except Exception as e:
|
||||
self._log("error", f"Error extracting base URL: {str(e)}", "SCRAPE")
|
||||
pass
|
||||
|
||||
for link in element.xpath(".//a[@href]"):
|
||||
href = link.get("href", "").strip()
|
||||
if not href:
|
||||
@@ -576,117 +586,6 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
|
||||
return root
|
||||
|
||||
def is_data_table(self, table: etree.Element, **kwargs) -> bool:
|
||||
score = 0
|
||||
# Check for thead and tbody
|
||||
has_thead = len(table.xpath(".//thead")) > 0
|
||||
has_tbody = len(table.xpath(".//tbody")) > 0
|
||||
if has_thead:
|
||||
score += 2
|
||||
if has_tbody:
|
||||
score += 1
|
||||
|
||||
# Check for th elements
|
||||
th_count = len(table.xpath(".//th"))
|
||||
if th_count > 0:
|
||||
score += 2
|
||||
if has_thead or table.xpath(".//tr[1]/th"):
|
||||
score += 1
|
||||
|
||||
# Check for nested tables
|
||||
if len(table.xpath(".//table")) > 0:
|
||||
score -= 3
|
||||
|
||||
# Role attribute check
|
||||
role = table.get("role", "").lower()
|
||||
if role in {"presentation", "none"}:
|
||||
score -= 3
|
||||
|
||||
# Column consistency
|
||||
rows = table.xpath(".//tr")
|
||||
if not rows:
|
||||
return False
|
||||
col_counts = [len(row.xpath(".//td|.//th")) for row in rows]
|
||||
avg_cols = sum(col_counts) / len(col_counts)
|
||||
variance = sum((c - avg_cols)**2 for c in col_counts) / len(col_counts)
|
||||
if variance < 1:
|
||||
score += 2
|
||||
|
||||
# Caption and summary
|
||||
if table.xpath(".//caption"):
|
||||
score += 2
|
||||
if table.get("summary"):
|
||||
score += 1
|
||||
|
||||
# Text density
|
||||
total_text = sum(len(''.join(cell.itertext()).strip()) for row in rows for cell in row.xpath(".//td|.//th"))
|
||||
total_tags = sum(1 for _ in table.iterdescendants())
|
||||
text_ratio = total_text / (total_tags + 1e-5)
|
||||
if text_ratio > 20:
|
||||
score += 3
|
||||
elif text_ratio > 10:
|
||||
score += 2
|
||||
|
||||
# Data attributes
|
||||
data_attrs = sum(1 for attr in table.attrib if attr.startswith('data-'))
|
||||
score += data_attrs * 0.5
|
||||
|
||||
# Size check
|
||||
if avg_cols >= 2 and len(rows) >= 2:
|
||||
score += 2
|
||||
|
||||
threshold = kwargs.get("table_score_threshold", 7)
|
||||
return score >= threshold
|
||||
|
||||
def extract_table_data(self, table: etree.Element) -> dict:
|
||||
caption = table.xpath(".//caption/text()")
|
||||
caption = caption[0].strip() if caption else ""
|
||||
summary = table.get("summary", "").strip()
|
||||
|
||||
# Extract headers with colspan handling
|
||||
headers = []
|
||||
thead_rows = table.xpath(".//thead/tr")
|
||||
if thead_rows:
|
||||
header_cells = thead_rows[0].xpath(".//th")
|
||||
for cell in header_cells:
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
headers.extend([text] * colspan)
|
||||
else:
|
||||
first_row = table.xpath(".//tr[1]")
|
||||
if first_row:
|
||||
for cell in first_row[0].xpath(".//th|.//td"):
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
headers.extend([text] * colspan)
|
||||
|
||||
# Extract rows with colspan handling
|
||||
rows = []
|
||||
for row in table.xpath(".//tr[not(ancestor::thead)]"):
|
||||
row_data = []
|
||||
for cell in row.xpath(".//td"):
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
row_data.extend([text] * colspan)
|
||||
if row_data:
|
||||
rows.append(row_data)
|
||||
|
||||
# Align rows with headers
|
||||
max_columns = len(headers) if headers else (max(len(row) for row in rows) if rows else 0)
|
||||
aligned_rows = []
|
||||
for row in rows:
|
||||
aligned = row[:max_columns] + [''] * (max_columns - len(row))
|
||||
aligned_rows.append(aligned)
|
||||
|
||||
if not headers:
|
||||
headers = [f"Column {i+1}" for i in range(max_columns)]
|
||||
|
||||
return {
|
||||
"headers": headers,
|
||||
"rows": aligned_rows,
|
||||
"caption": caption,
|
||||
"summary": summary,
|
||||
}
|
||||
|
||||
def _scrap(
|
||||
self,
|
||||
@@ -829,12 +728,16 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
# Extract tables using the table extraction strategy if provided
|
||||
if 'table' not in excluded_tags:
|
||||
tables = body.xpath(".//table")
|
||||
for table in tables:
|
||||
if self.is_data_table(table, **kwargs):
|
||||
table_data = self.extract_table_data(table)
|
||||
media["tables"].append(table_data)
|
||||
table_extraction = kwargs.get('table_extraction')
|
||||
if table_extraction:
|
||||
# Pass logger to the strategy if it doesn't have one
|
||||
if not table_extraction.logger:
|
||||
table_extraction.logger = self.logger
|
||||
# Extract tables using the strategy
|
||||
extracted_tables = table_extraction.extract_tables(body, **kwargs)
|
||||
media["tables"].extend(extracted_tables)
|
||||
|
||||
# Handle only_text option
|
||||
if kwargs.get("only_text", False):
|
||||
|
||||
@@ -1,79 +0,0 @@
|
||||
import psutil
|
||||
import platform
|
||||
import subprocess
|
||||
from typing import Tuple
|
||||
|
||||
|
||||
def get_true_available_memory_gb() -> float:
|
||||
"""Get truly available memory including inactive pages (cross-platform)"""
|
||||
vm = psutil.virtual_memory()
|
||||
|
||||
if platform.system() == 'Darwin': # macOS
|
||||
# On macOS, we need to include inactive memory too
|
||||
try:
|
||||
# Use vm_stat to get accurate values
|
||||
result = subprocess.run(['vm_stat'], capture_output=True, text=True)
|
||||
lines = result.stdout.split('\n')
|
||||
|
||||
page_size = 16384 # macOS page size
|
||||
pages = {}
|
||||
|
||||
for line in lines:
|
||||
if 'Pages free:' in line:
|
||||
pages['free'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages inactive:' in line:
|
||||
pages['inactive'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages speculative:' in line:
|
||||
pages['speculative'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages purgeable:' in line:
|
||||
pages['purgeable'] = int(line.split()[-1].rstrip('.'))
|
||||
|
||||
# Calculate total available (free + inactive + speculative + purgeable)
|
||||
total_available_pages = (
|
||||
pages.get('free', 0) +
|
||||
pages.get('inactive', 0) +
|
||||
pages.get('speculative', 0) +
|
||||
pages.get('purgeable', 0)
|
||||
)
|
||||
available_gb = (total_available_pages * page_size) / (1024**3)
|
||||
|
||||
return available_gb
|
||||
except:
|
||||
# Fallback to psutil
|
||||
return vm.available / (1024**3)
|
||||
else:
|
||||
# For Windows and Linux, psutil.available is accurate
|
||||
return vm.available / (1024**3)
|
||||
|
||||
|
||||
def get_true_memory_usage_percent() -> float:
|
||||
"""
|
||||
Get memory usage percentage that accounts for platform differences.
|
||||
|
||||
Returns:
|
||||
float: Memory usage percentage (0-100)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
|
||||
# Calculate used percentage based on truly available memory
|
||||
used_percent = 100.0 * (total_gb - available_gb) / total_gb
|
||||
|
||||
# Ensure it's within valid range
|
||||
return max(0.0, min(100.0, used_percent))
|
||||
|
||||
|
||||
def get_memory_stats() -> Tuple[float, float, float]:
|
||||
"""
|
||||
Get comprehensive memory statistics.
|
||||
|
||||
Returns:
|
||||
Tuple[float, float, float]: (used_percent, available_gb, total_gb)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
used_percent = get_true_memory_usage_percent()
|
||||
|
||||
return used_percent, available_gb, total_gb
|
||||
1396
crawl4ai/table_extraction.py
Normal file
1396
crawl4ai/table_extraction.py
Normal file
File diff suppressed because it is too large
Load Diff
440
crawl4ai/telemetry/__init__.py
Normal file
440
crawl4ai/telemetry/__init__.py
Normal file
@@ -0,0 +1,440 @@
|
||||
"""
|
||||
Crawl4AI Telemetry Module.
|
||||
Provides opt-in error tracking to improve stability.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import functools
|
||||
import traceback
|
||||
from typing import Optional, Any, Dict, Callable, Type
|
||||
from contextlib import contextmanager, asynccontextmanager
|
||||
|
||||
from .base import TelemetryProvider, NullProvider
|
||||
from .config import TelemetryConfig, TelemetryConsent
|
||||
from .consent import ConsentManager
|
||||
from .environment import Environment, EnvironmentDetector
|
||||
|
||||
|
||||
class TelemetryManager:
|
||||
"""
|
||||
Main telemetry manager for Crawl4AI.
|
||||
Coordinates provider, config, and consent management.
|
||||
"""
|
||||
|
||||
_instance: Optional['TelemetryManager'] = None
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize telemetry manager."""
|
||||
self.config = TelemetryConfig()
|
||||
self.consent_manager = ConsentManager(self.config)
|
||||
self.environment = EnvironmentDetector.detect()
|
||||
self._provider: Optional[TelemetryProvider] = None
|
||||
self._initialized = False
|
||||
self._error_count = 0
|
||||
self._max_errors = 100 # Prevent telemetry spam
|
||||
|
||||
# Load provider based on config
|
||||
self._setup_provider()
|
||||
|
||||
@classmethod
|
||||
def get_instance(cls) -> 'TelemetryManager':
|
||||
"""
|
||||
Get singleton instance of telemetry manager.
|
||||
|
||||
Returns:
|
||||
TelemetryManager instance
|
||||
"""
|
||||
if cls._instance is None:
|
||||
cls._instance = cls()
|
||||
return cls._instance
|
||||
|
||||
def _setup_provider(self) -> None:
|
||||
"""Setup telemetry provider based on configuration."""
|
||||
# Update config from environment
|
||||
self.config.update_from_env()
|
||||
|
||||
# Check if telemetry is enabled
|
||||
if not self.config.is_enabled():
|
||||
self._provider = NullProvider()
|
||||
return
|
||||
|
||||
# Try to load Sentry provider
|
||||
try:
|
||||
from .providers.sentry import SentryProvider
|
||||
|
||||
# Get Crawl4AI version for release tracking
|
||||
try:
|
||||
from crawl4ai import __version__
|
||||
release = f"crawl4ai@{__version__}"
|
||||
except ImportError:
|
||||
release = "crawl4ai@unknown"
|
||||
|
||||
self._provider = SentryProvider(
|
||||
environment=self.environment.value,
|
||||
release=release
|
||||
)
|
||||
|
||||
# Initialize provider
|
||||
if not self._provider.initialize():
|
||||
# Fallback to null provider if init fails
|
||||
self._provider = NullProvider()
|
||||
|
||||
except ImportError:
|
||||
# Sentry not installed - use null provider
|
||||
self._provider = NullProvider()
|
||||
|
||||
self._initialized = True
|
||||
|
||||
def capture_exception(
|
||||
self,
|
||||
exception: Exception,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Capture and send an exception.
|
||||
|
||||
Args:
|
||||
exception: The exception to capture
|
||||
context: Optional additional context
|
||||
|
||||
Returns:
|
||||
True if exception was sent
|
||||
"""
|
||||
# Check error count limit
|
||||
if self._error_count >= self._max_errors:
|
||||
return False
|
||||
|
||||
# Check consent on first error
|
||||
if self._error_count == 0:
|
||||
consent = self.consent_manager.check_and_prompt()
|
||||
|
||||
# Update provider if consent changed
|
||||
if consent == TelemetryConsent.DENIED:
|
||||
self._provider = NullProvider()
|
||||
return False
|
||||
elif consent in [TelemetryConsent.ONCE, TelemetryConsent.ALWAYS]:
|
||||
if isinstance(self._provider, NullProvider):
|
||||
self._setup_provider()
|
||||
|
||||
# Check if we should send this error
|
||||
if not self.config.should_send_current():
|
||||
return False
|
||||
|
||||
# Prepare context
|
||||
full_context = EnvironmentDetector.get_environment_context()
|
||||
if context:
|
||||
full_context.update(context)
|
||||
|
||||
# Add user email if available
|
||||
email = self.config.get_email()
|
||||
if email:
|
||||
full_context['email'] = email
|
||||
|
||||
# Add source info
|
||||
full_context['source'] = 'crawl4ai'
|
||||
|
||||
# Send exception
|
||||
try:
|
||||
if self._provider:
|
||||
success = self._provider.send_exception(exception, full_context)
|
||||
if success:
|
||||
self._error_count += 1
|
||||
return success
|
||||
except Exception:
|
||||
# Telemetry itself failed - ignore
|
||||
pass
|
||||
|
||||
return False
|
||||
|
||||
def capture_message(
|
||||
self,
|
||||
message: str,
|
||||
level: str = 'info',
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Capture a message event.
|
||||
|
||||
Args:
|
||||
message: Message to send
|
||||
level: Message level (info, warning, error)
|
||||
context: Optional context
|
||||
|
||||
Returns:
|
||||
True if message was sent
|
||||
"""
|
||||
if not self.config.is_enabled():
|
||||
return False
|
||||
|
||||
payload = {
|
||||
'level': level,
|
||||
'message': message
|
||||
}
|
||||
if context:
|
||||
payload.update(context)
|
||||
|
||||
try:
|
||||
if self._provider:
|
||||
return self._provider.send_event(message, payload)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return False
|
||||
|
||||
def enable(
|
||||
self,
|
||||
email: Optional[str] = None,
|
||||
always: bool = True,
|
||||
once: bool = False
|
||||
) -> None:
|
||||
"""
|
||||
Enable telemetry.
|
||||
|
||||
Args:
|
||||
email: Optional email for follow-up
|
||||
always: If True, always send errors
|
||||
once: If True, send only next error
|
||||
"""
|
||||
if once:
|
||||
consent = TelemetryConsent.ONCE
|
||||
elif always:
|
||||
consent = TelemetryConsent.ALWAYS
|
||||
else:
|
||||
consent = TelemetryConsent.ALWAYS
|
||||
|
||||
self.config.set_consent(consent, email)
|
||||
self._setup_provider()
|
||||
|
||||
print("✅ Telemetry enabled")
|
||||
if email:
|
||||
print(f" Email: {email}")
|
||||
print(f" Mode: {'once' if once else 'always'}")
|
||||
|
||||
def disable(self) -> None:
|
||||
"""Disable telemetry."""
|
||||
self.config.set_consent(TelemetryConsent.DENIED)
|
||||
self._provider = NullProvider()
|
||||
print("✅ Telemetry disabled")
|
||||
|
||||
def status(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Get telemetry status.
|
||||
|
||||
Returns:
|
||||
Dictionary with status information
|
||||
"""
|
||||
return {
|
||||
'enabled': self.config.is_enabled(),
|
||||
'consent': self.config.get_consent().value,
|
||||
'email': self.config.get_email(),
|
||||
'environment': self.environment.value,
|
||||
'provider': type(self._provider).__name__ if self._provider else 'None',
|
||||
'errors_sent': self._error_count
|
||||
}
|
||||
|
||||
def flush(self) -> None:
|
||||
"""Flush any pending telemetry data."""
|
||||
if self._provider:
|
||||
self._provider.flush()
|
||||
|
||||
def shutdown(self) -> None:
|
||||
"""Shutdown telemetry."""
|
||||
if self._provider:
|
||||
self._provider.shutdown()
|
||||
|
||||
|
||||
# Global instance
|
||||
_telemetry_manager: Optional[TelemetryManager] = None
|
||||
|
||||
|
||||
def get_telemetry() -> TelemetryManager:
|
||||
"""
|
||||
Get global telemetry manager instance.
|
||||
|
||||
Returns:
|
||||
TelemetryManager instance
|
||||
"""
|
||||
global _telemetry_manager
|
||||
if _telemetry_manager is None:
|
||||
_telemetry_manager = TelemetryManager.get_instance()
|
||||
return _telemetry_manager
|
||||
|
||||
|
||||
def capture_exception(
|
||||
exception: Exception,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Capture an exception for telemetry.
|
||||
|
||||
Args:
|
||||
exception: Exception to capture
|
||||
context: Optional context
|
||||
|
||||
Returns:
|
||||
True if sent successfully
|
||||
"""
|
||||
try:
|
||||
return get_telemetry().capture_exception(exception, context)
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def telemetry_decorator(func: Callable) -> Callable:
|
||||
"""
|
||||
Decorator to capture exceptions from a function.
|
||||
|
||||
Args:
|
||||
func: Function to wrap
|
||||
|
||||
Returns:
|
||||
Wrapped function
|
||||
"""
|
||||
@functools.wraps(func)
|
||||
def wrapper(*args, **kwargs):
|
||||
try:
|
||||
return func(*args, **kwargs)
|
||||
except Exception as e:
|
||||
# Capture exception
|
||||
capture_exception(e, {
|
||||
'function': func.__name__,
|
||||
'module': func.__module__
|
||||
})
|
||||
# Re-raise the exception
|
||||
raise
|
||||
|
||||
return wrapper
|
||||
|
||||
|
||||
def async_telemetry_decorator(func: Callable) -> Callable:
|
||||
"""
|
||||
Decorator to capture exceptions from an async function.
|
||||
|
||||
Args:
|
||||
func: Async function to wrap
|
||||
|
||||
Returns:
|
||||
Wrapped async function
|
||||
"""
|
||||
@functools.wraps(func)
|
||||
async def wrapper(*args, **kwargs):
|
||||
try:
|
||||
return await func(*args, **kwargs)
|
||||
except Exception as e:
|
||||
# Capture exception
|
||||
capture_exception(e, {
|
||||
'function': func.__name__,
|
||||
'module': func.__module__
|
||||
})
|
||||
# Re-raise the exception
|
||||
raise
|
||||
|
||||
return wrapper
|
||||
|
||||
|
||||
@contextmanager
|
||||
def telemetry_context(operation: str):
|
||||
"""
|
||||
Context manager for capturing exceptions.
|
||||
|
||||
Args:
|
||||
operation: Name of the operation
|
||||
|
||||
Example:
|
||||
with telemetry_context("web_crawl"):
|
||||
# Your code here
|
||||
pass
|
||||
"""
|
||||
try:
|
||||
yield
|
||||
except Exception as e:
|
||||
capture_exception(e, {'operation': operation})
|
||||
raise
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def async_telemetry_context(operation: str):
|
||||
"""
|
||||
Async context manager for capturing exceptions in async code.
|
||||
|
||||
Args:
|
||||
operation: Name of the operation
|
||||
|
||||
Example:
|
||||
async with async_telemetry_context("async_crawl"):
|
||||
# Your async code here
|
||||
await something()
|
||||
"""
|
||||
try:
|
||||
yield
|
||||
except Exception as e:
|
||||
capture_exception(e, {'operation': operation})
|
||||
raise
|
||||
|
||||
|
||||
def install_exception_handler():
|
||||
"""Install global exception handler for uncaught exceptions."""
|
||||
original_hook = sys.excepthook
|
||||
|
||||
def telemetry_exception_hook(exc_type, exc_value, exc_traceback):
|
||||
"""Custom exception hook with telemetry."""
|
||||
# Don't capture KeyboardInterrupt
|
||||
if not issubclass(exc_type, KeyboardInterrupt):
|
||||
capture_exception(exc_value, {
|
||||
'uncaught': True,
|
||||
'type': exc_type.__name__
|
||||
})
|
||||
|
||||
# Call original hook
|
||||
original_hook(exc_type, exc_value, exc_traceback)
|
||||
|
||||
sys.excepthook = telemetry_exception_hook
|
||||
|
||||
|
||||
# Public API
|
||||
def enable(email: Optional[str] = None, always: bool = True, once: bool = False) -> None:
|
||||
"""
|
||||
Enable telemetry.
|
||||
|
||||
Args:
|
||||
email: Optional email for follow-up
|
||||
always: If True, always send errors (default)
|
||||
once: If True, send only the next error
|
||||
"""
|
||||
get_telemetry().enable(email=email, always=always, once=once)
|
||||
|
||||
|
||||
def disable() -> None:
|
||||
"""Disable telemetry."""
|
||||
get_telemetry().disable()
|
||||
|
||||
|
||||
def status() -> Dict[str, Any]:
|
||||
"""
|
||||
Get telemetry status.
|
||||
|
||||
Returns:
|
||||
Dictionary with status information
|
||||
"""
|
||||
return get_telemetry().status()
|
||||
|
||||
|
||||
# Auto-install exception handler on import
|
||||
# (Only for main library usage, not for Docker/API)
|
||||
if EnvironmentDetector.detect() not in [Environment.DOCKER, Environment.API_SERVER]:
|
||||
install_exception_handler()
|
||||
|
||||
|
||||
__all__ = [
|
||||
'TelemetryManager',
|
||||
'get_telemetry',
|
||||
'capture_exception',
|
||||
'telemetry_decorator',
|
||||
'async_telemetry_decorator',
|
||||
'telemetry_context',
|
||||
'async_telemetry_context',
|
||||
'enable',
|
||||
'disable',
|
||||
'status',
|
||||
]
|
||||
140
crawl4ai/telemetry/base.py
Normal file
140
crawl4ai/telemetry/base.py
Normal file
@@ -0,0 +1,140 @@
|
||||
"""
|
||||
Base telemetry provider interface for Crawl4AI.
|
||||
Provides abstraction for different telemetry backends.
|
||||
"""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any, Optional, Union
|
||||
import traceback
|
||||
|
||||
|
||||
class TelemetryProvider(ABC):
|
||||
"""Abstract base class for telemetry providers."""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
"""Initialize the provider with optional configuration."""
|
||||
self.config = kwargs
|
||||
self._initialized = False
|
||||
|
||||
@abstractmethod
|
||||
def initialize(self) -> bool:
|
||||
"""
|
||||
Initialize the telemetry provider.
|
||||
Returns True if initialization successful, False otherwise.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def send_exception(
|
||||
self,
|
||||
exc: Exception,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send an exception to the telemetry backend.
|
||||
|
||||
Args:
|
||||
exc: The exception to report
|
||||
context: Optional context data (email, environment, etc.)
|
||||
|
||||
Returns:
|
||||
True if sent successfully, False otherwise
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def send_event(
|
||||
self,
|
||||
event_name: str,
|
||||
payload: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send a generic telemetry event.
|
||||
|
||||
Args:
|
||||
event_name: Name of the event
|
||||
payload: Optional event data
|
||||
|
||||
Returns:
|
||||
True if sent successfully, False otherwise
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def flush(self) -> None:
|
||||
"""Flush any pending telemetry data."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def shutdown(self) -> None:
|
||||
"""Clean shutdown of the provider."""
|
||||
pass
|
||||
|
||||
def sanitize_data(self, data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Remove sensitive information from telemetry data.
|
||||
Override in subclasses for custom sanitization.
|
||||
|
||||
Args:
|
||||
data: Raw data dictionary
|
||||
|
||||
Returns:
|
||||
Sanitized data dictionary
|
||||
"""
|
||||
# Default implementation - remove common sensitive fields
|
||||
sensitive_keys = {
|
||||
'password', 'token', 'api_key', 'secret', 'credential',
|
||||
'auth', 'authorization', 'cookie', 'session'
|
||||
}
|
||||
|
||||
def _sanitize_dict(d: Dict) -> Dict:
|
||||
sanitized = {}
|
||||
for key, value in d.items():
|
||||
key_lower = key.lower()
|
||||
if any(sensitive in key_lower for sensitive in sensitive_keys):
|
||||
sanitized[key] = '[REDACTED]'
|
||||
elif isinstance(value, dict):
|
||||
sanitized[key] = _sanitize_dict(value)
|
||||
elif isinstance(value, list):
|
||||
sanitized[key] = [
|
||||
_sanitize_dict(item) if isinstance(item, dict) else item
|
||||
for item in value
|
||||
]
|
||||
else:
|
||||
sanitized[key] = value
|
||||
return sanitized
|
||||
|
||||
return _sanitize_dict(data) if isinstance(data, dict) else data
|
||||
|
||||
|
||||
class NullProvider(TelemetryProvider):
|
||||
"""No-op provider for when telemetry is disabled."""
|
||||
|
||||
def initialize(self) -> bool:
|
||||
"""No initialization needed for null provider."""
|
||||
self._initialized = True
|
||||
return True
|
||||
|
||||
def send_exception(
|
||||
self,
|
||||
exc: Exception,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""No-op exception sending."""
|
||||
return True
|
||||
|
||||
def send_event(
|
||||
self,
|
||||
event_name: str,
|
||||
payload: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""No-op event sending."""
|
||||
return True
|
||||
|
||||
def flush(self) -> None:
|
||||
"""No-op flush."""
|
||||
pass
|
||||
|
||||
def shutdown(self) -> None:
|
||||
"""No-op shutdown."""
|
||||
pass
|
||||
196
crawl4ai/telemetry/config.py
Normal file
196
crawl4ai/telemetry/config.py
Normal file
@@ -0,0 +1,196 @@
|
||||
"""
|
||||
Configuration management for Crawl4AI telemetry.
|
||||
Handles user preferences and persistence.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, Optional
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class TelemetryConsent(Enum):
|
||||
"""Telemetry consent levels."""
|
||||
NOT_SET = "not_set"
|
||||
DENIED = "denied"
|
||||
ONCE = "once" # Send current error only
|
||||
ALWAYS = "always" # Send all errors
|
||||
|
||||
|
||||
class TelemetryConfig:
|
||||
"""Manages telemetry configuration and persistence."""
|
||||
|
||||
def __init__(self, config_dir: Optional[Path] = None):
|
||||
"""
|
||||
Initialize configuration manager.
|
||||
|
||||
Args:
|
||||
config_dir: Optional custom config directory
|
||||
"""
|
||||
if config_dir:
|
||||
self.config_dir = config_dir
|
||||
else:
|
||||
# Default to ~/.crawl4ai/
|
||||
self.config_dir = Path.home() / '.crawl4ai'
|
||||
|
||||
self.config_file = self.config_dir / 'config.json'
|
||||
self._config: Dict[str, Any] = {}
|
||||
self._load_config()
|
||||
|
||||
def _ensure_config_dir(self) -> None:
|
||||
"""Ensure configuration directory exists."""
|
||||
self.config_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def _load_config(self) -> None:
|
||||
"""Load configuration from disk."""
|
||||
if self.config_file.exists():
|
||||
try:
|
||||
with open(self.config_file, 'r') as f:
|
||||
self._config = json.load(f)
|
||||
except (json.JSONDecodeError, IOError):
|
||||
# Corrupted or inaccessible config - start fresh
|
||||
self._config = {}
|
||||
else:
|
||||
self._config = {}
|
||||
|
||||
def _save_config(self) -> bool:
|
||||
"""
|
||||
Save configuration to disk.
|
||||
|
||||
Returns:
|
||||
True if saved successfully
|
||||
"""
|
||||
try:
|
||||
self._ensure_config_dir()
|
||||
|
||||
# Write to temporary file first
|
||||
temp_file = self.config_file.with_suffix('.tmp')
|
||||
with open(temp_file, 'w') as f:
|
||||
json.dump(self._config, f, indent=2)
|
||||
|
||||
# Atomic rename
|
||||
temp_file.replace(self.config_file)
|
||||
return True
|
||||
|
||||
except (IOError, OSError):
|
||||
return False
|
||||
|
||||
def get_telemetry_settings(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Get current telemetry settings.
|
||||
|
||||
Returns:
|
||||
Dictionary with telemetry settings
|
||||
"""
|
||||
return self._config.get('telemetry', {
|
||||
'consent': TelemetryConsent.NOT_SET.value,
|
||||
'email': None
|
||||
})
|
||||
|
||||
def get_consent(self) -> TelemetryConsent:
|
||||
"""
|
||||
Get current consent status.
|
||||
|
||||
Returns:
|
||||
TelemetryConsent enum value
|
||||
"""
|
||||
settings = self.get_telemetry_settings()
|
||||
consent_value = settings.get('consent', TelemetryConsent.NOT_SET.value)
|
||||
|
||||
# Handle legacy boolean values
|
||||
if isinstance(consent_value, bool):
|
||||
consent_value = TelemetryConsent.ALWAYS.value if consent_value else TelemetryConsent.DENIED.value
|
||||
|
||||
try:
|
||||
return TelemetryConsent(consent_value)
|
||||
except ValueError:
|
||||
return TelemetryConsent.NOT_SET
|
||||
|
||||
def set_consent(
|
||||
self,
|
||||
consent: TelemetryConsent,
|
||||
email: Optional[str] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Set telemetry consent and optional email.
|
||||
|
||||
Args:
|
||||
consent: Consent level
|
||||
email: Optional email for follow-up
|
||||
|
||||
Returns:
|
||||
True if saved successfully
|
||||
"""
|
||||
if 'telemetry' not in self._config:
|
||||
self._config['telemetry'] = {}
|
||||
|
||||
self._config['telemetry']['consent'] = consent.value
|
||||
|
||||
# Only update email if provided
|
||||
if email is not None:
|
||||
self._config['telemetry']['email'] = email
|
||||
|
||||
return self._save_config()
|
||||
|
||||
def get_email(self) -> Optional[str]:
|
||||
"""
|
||||
Get stored email if any.
|
||||
|
||||
Returns:
|
||||
Email address or None
|
||||
"""
|
||||
settings = self.get_telemetry_settings()
|
||||
return settings.get('email')
|
||||
|
||||
def is_enabled(self) -> bool:
|
||||
"""
|
||||
Check if telemetry is enabled.
|
||||
|
||||
Returns:
|
||||
True if telemetry should send data
|
||||
"""
|
||||
consent = self.get_consent()
|
||||
return consent in [TelemetryConsent.ONCE, TelemetryConsent.ALWAYS]
|
||||
|
||||
def should_send_current(self) -> bool:
|
||||
"""
|
||||
Check if current error should be sent.
|
||||
Used for one-time consent.
|
||||
|
||||
Returns:
|
||||
True if current error should be sent
|
||||
"""
|
||||
consent = self.get_consent()
|
||||
if consent == TelemetryConsent.ONCE:
|
||||
# After sending once, reset to NOT_SET
|
||||
self.set_consent(TelemetryConsent.NOT_SET)
|
||||
return True
|
||||
return consent == TelemetryConsent.ALWAYS
|
||||
|
||||
def clear(self) -> bool:
|
||||
"""
|
||||
Clear all telemetry settings.
|
||||
|
||||
Returns:
|
||||
True if cleared successfully
|
||||
"""
|
||||
if 'telemetry' in self._config:
|
||||
del self._config['telemetry']
|
||||
return self._save_config()
|
||||
return True
|
||||
|
||||
def update_from_env(self) -> None:
|
||||
"""Update configuration from environment variables."""
|
||||
# Check for telemetry disable flag
|
||||
if os.environ.get('CRAWL4AI_TELEMETRY') == '0':
|
||||
self.set_consent(TelemetryConsent.DENIED)
|
||||
|
||||
# Check for email override
|
||||
env_email = os.environ.get('CRAWL4AI_TELEMETRY_EMAIL')
|
||||
if env_email and self.is_enabled():
|
||||
current_settings = self.get_telemetry_settings()
|
||||
self.set_consent(
|
||||
TelemetryConsent(current_settings['consent']),
|
||||
email=env_email
|
||||
)
|
||||
314
crawl4ai/telemetry/consent.py
Normal file
314
crawl4ai/telemetry/consent.py
Normal file
@@ -0,0 +1,314 @@
|
||||
"""
|
||||
User consent handling for Crawl4AI telemetry.
|
||||
Provides interactive prompts for different environments.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from typing import Optional, Tuple
|
||||
from .config import TelemetryConsent, TelemetryConfig
|
||||
from .environment import Environment, EnvironmentDetector
|
||||
|
||||
|
||||
class ConsentManager:
|
||||
"""Manages user consent for telemetry."""
|
||||
|
||||
def __init__(self, config: Optional[TelemetryConfig] = None):
|
||||
"""
|
||||
Initialize consent manager.
|
||||
|
||||
Args:
|
||||
config: Optional TelemetryConfig instance
|
||||
"""
|
||||
self.config = config or TelemetryConfig()
|
||||
self.environment = EnvironmentDetector.detect()
|
||||
|
||||
def check_and_prompt(self) -> TelemetryConsent:
|
||||
"""
|
||||
Check consent status and prompt if needed.
|
||||
|
||||
Returns:
|
||||
Current consent status
|
||||
"""
|
||||
current_consent = self.config.get_consent()
|
||||
|
||||
# If already set, return current value
|
||||
if current_consent != TelemetryConsent.NOT_SET:
|
||||
return current_consent
|
||||
|
||||
# Docker/API server: default enabled (check env var)
|
||||
if self.environment in [Environment.DOCKER, Environment.API_SERVER]:
|
||||
return self._handle_docker_consent()
|
||||
|
||||
# Interactive environments: prompt user
|
||||
if EnvironmentDetector.is_interactive():
|
||||
return self._prompt_for_consent()
|
||||
|
||||
# Non-interactive: default disabled
|
||||
return TelemetryConsent.DENIED
|
||||
|
||||
def _handle_docker_consent(self) -> TelemetryConsent:
|
||||
"""
|
||||
Handle consent in Docker environment.
|
||||
Default enabled unless disabled via env var.
|
||||
"""
|
||||
import os
|
||||
|
||||
if os.environ.get('CRAWL4AI_TELEMETRY') == '0':
|
||||
self.config.set_consent(TelemetryConsent.DENIED)
|
||||
return TelemetryConsent.DENIED
|
||||
|
||||
# Default enabled for Docker
|
||||
self.config.set_consent(TelemetryConsent.ALWAYS)
|
||||
return TelemetryConsent.ALWAYS
|
||||
|
||||
def _prompt_for_consent(self) -> TelemetryConsent:
|
||||
"""
|
||||
Prompt user for consent based on environment.
|
||||
|
||||
Returns:
|
||||
User's consent choice
|
||||
"""
|
||||
if self.environment == Environment.CLI:
|
||||
return self._cli_prompt()
|
||||
elif self.environment in [Environment.JUPYTER, Environment.COLAB]:
|
||||
return self._notebook_prompt()
|
||||
else:
|
||||
return TelemetryConsent.DENIED
|
||||
|
||||
def _cli_prompt(self) -> TelemetryConsent:
|
||||
"""
|
||||
Show CLI prompt for consent.
|
||||
|
||||
Returns:
|
||||
User's consent choice
|
||||
"""
|
||||
print("\n" + "="*60)
|
||||
print("🚨 Crawl4AI Error Detection")
|
||||
print("="*60)
|
||||
print("\nWe noticed an error occurred. Help improve Crawl4AI by")
|
||||
print("sending anonymous crash reports?")
|
||||
print("\n[1] Yes, send this error only")
|
||||
print("[2] Yes, always send errors")
|
||||
print("[3] No, don't send")
|
||||
print("\n" + "-"*60)
|
||||
|
||||
# Get choice
|
||||
while True:
|
||||
try:
|
||||
choice = input("Your choice (1/2/3): ").strip()
|
||||
if choice == '1':
|
||||
consent = TelemetryConsent.ONCE
|
||||
break
|
||||
elif choice == '2':
|
||||
consent = TelemetryConsent.ALWAYS
|
||||
break
|
||||
elif choice == '3':
|
||||
consent = TelemetryConsent.DENIED
|
||||
break
|
||||
else:
|
||||
print("Please enter 1, 2, or 3")
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
# User cancelled - treat as denial
|
||||
consent = TelemetryConsent.DENIED
|
||||
break
|
||||
|
||||
# Optional email
|
||||
email = None
|
||||
if consent != TelemetryConsent.DENIED:
|
||||
print("\nOptional: Enter email for follow-up (or press Enter to skip):")
|
||||
try:
|
||||
email_input = input("Email: ").strip()
|
||||
if email_input and '@' in email_input:
|
||||
email = email_input
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
pass
|
||||
|
||||
# Save choice
|
||||
self.config.set_consent(consent, email)
|
||||
|
||||
if consent != TelemetryConsent.DENIED:
|
||||
print("\n✅ Thank you for helping improve Crawl4AI!")
|
||||
else:
|
||||
print("\n✅ Telemetry disabled. You can enable it anytime with:")
|
||||
print(" crawl4ai telemetry enable")
|
||||
|
||||
print("="*60 + "\n")
|
||||
|
||||
return consent
|
||||
|
||||
def _notebook_prompt(self) -> TelemetryConsent:
|
||||
"""
|
||||
Show notebook prompt for consent.
|
||||
Uses widgets if available, falls back to print + code.
|
||||
|
||||
Returns:
|
||||
User's consent choice
|
||||
"""
|
||||
if EnvironmentDetector.supports_widgets():
|
||||
return self._widget_prompt()
|
||||
else:
|
||||
return self._notebook_fallback_prompt()
|
||||
|
||||
def _widget_prompt(self) -> TelemetryConsent:
|
||||
"""
|
||||
Show interactive widget prompt in Jupyter/Colab.
|
||||
|
||||
Returns:
|
||||
User's consent choice
|
||||
"""
|
||||
try:
|
||||
import ipywidgets as widgets
|
||||
from IPython.display import display, HTML
|
||||
|
||||
# Create styled HTML
|
||||
html = HTML("""
|
||||
<div style="padding: 15px; border: 2px solid #ff6b6b; border-radius: 8px; background: #fff5f5;">
|
||||
<h3 style="color: #c92a2a; margin-top: 0;">🚨 Crawl4AI Error Detected</h3>
|
||||
<p style="color: #495057;">Help us improve by sending anonymous crash reports?</p>
|
||||
</div>
|
||||
""")
|
||||
display(html)
|
||||
|
||||
# Create buttons
|
||||
btn_once = widgets.Button(
|
||||
description='Send this error',
|
||||
button_style='info',
|
||||
icon='check'
|
||||
)
|
||||
btn_always = widgets.Button(
|
||||
description='Always send',
|
||||
button_style='success',
|
||||
icon='check-circle'
|
||||
)
|
||||
btn_never = widgets.Button(
|
||||
description='Don\'t send',
|
||||
button_style='danger',
|
||||
icon='times'
|
||||
)
|
||||
|
||||
# Email input
|
||||
email_input = widgets.Text(
|
||||
placeholder='Optional: your@email.com',
|
||||
description='Email:',
|
||||
style={'description_width': 'initial'}
|
||||
)
|
||||
|
||||
# Output area for feedback
|
||||
output = widgets.Output()
|
||||
|
||||
# Container
|
||||
button_box = widgets.HBox([btn_once, btn_always, btn_never])
|
||||
container = widgets.VBox([button_box, email_input, output])
|
||||
|
||||
# Variable to store choice
|
||||
consent_choice = {'value': None}
|
||||
|
||||
def on_button_click(btn):
|
||||
"""Handle button click."""
|
||||
with output:
|
||||
output.clear_output()
|
||||
|
||||
if btn == btn_once:
|
||||
consent_choice['value'] = TelemetryConsent.ONCE
|
||||
print("✅ Sending this error only")
|
||||
elif btn == btn_always:
|
||||
consent_choice['value'] = TelemetryConsent.ALWAYS
|
||||
print("✅ Always sending errors")
|
||||
else:
|
||||
consent_choice['value'] = TelemetryConsent.DENIED
|
||||
print("✅ Telemetry disabled")
|
||||
|
||||
# Save with email if provided
|
||||
email = email_input.value.strip() if email_input.value else None
|
||||
self.config.set_consent(consent_choice['value'], email)
|
||||
|
||||
# Disable buttons after choice
|
||||
btn_once.disabled = True
|
||||
btn_always.disabled = True
|
||||
btn_never.disabled = True
|
||||
email_input.disabled = True
|
||||
|
||||
# Attach handlers
|
||||
btn_once.on_click(on_button_click)
|
||||
btn_always.on_click(on_button_click)
|
||||
btn_never.on_click(on_button_click)
|
||||
|
||||
# Display widget
|
||||
display(container)
|
||||
|
||||
# Wait for user choice (in notebook, this is non-blocking)
|
||||
# Return NOT_SET for now, actual choice will be saved via callback
|
||||
return consent_choice.get('value', TelemetryConsent.NOT_SET)
|
||||
|
||||
except Exception:
|
||||
# Fallback if widgets fail
|
||||
return self._notebook_fallback_prompt()
|
||||
|
||||
def _notebook_fallback_prompt(self) -> TelemetryConsent:
|
||||
"""
|
||||
Fallback prompt for notebooks without widget support.
|
||||
|
||||
Returns:
|
||||
User's consent choice (defaults to DENIED)
|
||||
"""
|
||||
try:
|
||||
from IPython.display import display, Markdown
|
||||
|
||||
markdown_content = """
|
||||
### 🚨 Crawl4AI Error Detected
|
||||
|
||||
Help us improve by sending anonymous crash reports.
|
||||
|
||||
**Telemetry is currently OFF.** To enable, run:
|
||||
|
||||
```python
|
||||
import crawl4ai
|
||||
crawl4ai.telemetry.enable(email="your@email.com", always=True)
|
||||
```
|
||||
|
||||
To send just this error:
|
||||
```python
|
||||
crawl4ai.telemetry.enable(once=True)
|
||||
```
|
||||
|
||||
To keep telemetry disabled:
|
||||
```python
|
||||
crawl4ai.telemetry.disable()
|
||||
```
|
||||
"""
|
||||
|
||||
display(Markdown(markdown_content))
|
||||
|
||||
except ImportError:
|
||||
# Pure print fallback
|
||||
print("\n" + "="*60)
|
||||
print("🚨 Crawl4AI Error Detected")
|
||||
print("="*60)
|
||||
print("\nTelemetry is OFF. To enable, run:")
|
||||
print("\nimport crawl4ai")
|
||||
print('crawl4ai.telemetry.enable(email="you@example.com", always=True)')
|
||||
print("\n" + "="*60)
|
||||
|
||||
# Default to disabled in fallback mode
|
||||
return TelemetryConsent.DENIED
|
||||
|
||||
def force_prompt(self) -> Tuple[TelemetryConsent, Optional[str]]:
|
||||
"""
|
||||
Force a consent prompt regardless of current settings.
|
||||
Used for manual telemetry configuration.
|
||||
|
||||
Returns:
|
||||
Tuple of (consent choice, optional email)
|
||||
"""
|
||||
# Temporarily reset consent to force prompt
|
||||
original_consent = self.config.get_consent()
|
||||
self.config.set_consent(TelemetryConsent.NOT_SET)
|
||||
|
||||
try:
|
||||
new_consent = self._prompt_for_consent()
|
||||
email = self.config.get_email()
|
||||
return new_consent, email
|
||||
except Exception:
|
||||
# Restore original on error
|
||||
self.config.set_consent(original_consent)
|
||||
raise
|
||||
199
crawl4ai/telemetry/environment.py
Normal file
199
crawl4ai/telemetry/environment.py
Normal file
@@ -0,0 +1,199 @@
|
||||
"""
|
||||
Environment detection for Crawl4AI telemetry.
|
||||
Detects whether we're running in CLI, Docker, Jupyter, etc.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
from enum import Enum
|
||||
from typing import Optional
|
||||
|
||||
|
||||
class Environment(Enum):
|
||||
"""Detected runtime environment."""
|
||||
CLI = "cli"
|
||||
DOCKER = "docker"
|
||||
JUPYTER = "jupyter"
|
||||
COLAB = "colab"
|
||||
API_SERVER = "api_server"
|
||||
UNKNOWN = "unknown"
|
||||
|
||||
|
||||
class EnvironmentDetector:
|
||||
"""Detects the current runtime environment."""
|
||||
|
||||
@staticmethod
|
||||
def detect() -> Environment:
|
||||
"""
|
||||
Detect current runtime environment.
|
||||
|
||||
Returns:
|
||||
Environment enum value
|
||||
"""
|
||||
# Check for Docker
|
||||
if EnvironmentDetector._is_docker():
|
||||
# Further check if it's API server
|
||||
if EnvironmentDetector._is_api_server():
|
||||
return Environment.API_SERVER
|
||||
return Environment.DOCKER
|
||||
|
||||
# Check for Google Colab
|
||||
if EnvironmentDetector._is_colab():
|
||||
return Environment.COLAB
|
||||
|
||||
# Check for Jupyter
|
||||
if EnvironmentDetector._is_jupyter():
|
||||
return Environment.JUPYTER
|
||||
|
||||
# Check for CLI
|
||||
if EnvironmentDetector._is_cli():
|
||||
return Environment.CLI
|
||||
|
||||
return Environment.UNKNOWN
|
||||
|
||||
@staticmethod
|
||||
def _is_docker() -> bool:
|
||||
"""Check if running inside Docker container."""
|
||||
# Check for Docker-specific files
|
||||
if os.path.exists('/.dockerenv'):
|
||||
return True
|
||||
|
||||
# Check cgroup for docker signature
|
||||
try:
|
||||
with open('/proc/1/cgroup', 'r') as f:
|
||||
return 'docker' in f.read()
|
||||
except (IOError, OSError):
|
||||
pass
|
||||
|
||||
# Check environment variable (if set in Dockerfile)
|
||||
return os.environ.get('CRAWL4AI_DOCKER', '').lower() == 'true'
|
||||
|
||||
@staticmethod
|
||||
def _is_api_server() -> bool:
|
||||
"""Check if running as API server."""
|
||||
# Check for API server indicators
|
||||
return (
|
||||
os.environ.get('CRAWL4AI_API_SERVER', '').lower() == 'true' or
|
||||
'deploy/docker/server.py' in ' '.join(sys.argv) or
|
||||
'deploy/docker/api.py' in ' '.join(sys.argv)
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def _is_jupyter() -> bool:
|
||||
"""Check if running in Jupyter notebook."""
|
||||
try:
|
||||
# Check for IPython
|
||||
from IPython import get_ipython
|
||||
ipython = get_ipython()
|
||||
|
||||
if ipython is None:
|
||||
return False
|
||||
|
||||
# Check for notebook kernel
|
||||
if 'IPKernelApp' in ipython.config:
|
||||
return True
|
||||
|
||||
# Check for Jupyter-specific attributes
|
||||
if hasattr(ipython, 'kernel'):
|
||||
return True
|
||||
|
||||
except (ImportError, AttributeError):
|
||||
pass
|
||||
|
||||
return False
|
||||
|
||||
@staticmethod
|
||||
def _is_colab() -> bool:
|
||||
"""Check if running in Google Colab."""
|
||||
try:
|
||||
import google.colab
|
||||
return True
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
# Alternative check
|
||||
return 'COLAB_GPU' in os.environ or 'COLAB_TPU_ADDR' in os.environ
|
||||
|
||||
@staticmethod
|
||||
def _is_cli() -> bool:
|
||||
"""Check if running from command line."""
|
||||
# Check if we have a terminal
|
||||
return (
|
||||
hasattr(sys, 'ps1') or
|
||||
sys.stdin.isatty() or
|
||||
bool(os.environ.get('TERM'))
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def is_interactive() -> bool:
|
||||
"""
|
||||
Check if environment supports interactive prompts.
|
||||
|
||||
Returns:
|
||||
True if interactive prompts are supported
|
||||
"""
|
||||
env = EnvironmentDetector.detect()
|
||||
|
||||
# Docker/API server are non-interactive
|
||||
if env in [Environment.DOCKER, Environment.API_SERVER]:
|
||||
return False
|
||||
|
||||
# CLI with TTY is interactive
|
||||
if env == Environment.CLI:
|
||||
return sys.stdin.isatty()
|
||||
|
||||
# Jupyter/Colab can be interactive with widgets
|
||||
if env in [Environment.JUPYTER, Environment.COLAB]:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
@staticmethod
|
||||
def supports_widgets() -> bool:
|
||||
"""
|
||||
Check if environment supports IPython widgets.
|
||||
|
||||
Returns:
|
||||
True if widgets are supported
|
||||
"""
|
||||
env = EnvironmentDetector.detect()
|
||||
|
||||
if env not in [Environment.JUPYTER, Environment.COLAB]:
|
||||
return False
|
||||
|
||||
try:
|
||||
import ipywidgets
|
||||
from IPython.display import display
|
||||
return True
|
||||
except ImportError:
|
||||
return False
|
||||
|
||||
@staticmethod
|
||||
def get_environment_context() -> dict:
|
||||
"""
|
||||
Get environment context for telemetry.
|
||||
|
||||
Returns:
|
||||
Dictionary with environment information
|
||||
"""
|
||||
env = EnvironmentDetector.detect()
|
||||
|
||||
context = {
|
||||
'environment_type': env.value,
|
||||
'python_version': f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}",
|
||||
'platform': sys.platform,
|
||||
}
|
||||
|
||||
# Add environment-specific context
|
||||
if env == Environment.DOCKER:
|
||||
context['docker'] = True
|
||||
context['container_id'] = os.environ.get('HOSTNAME', 'unknown')
|
||||
|
||||
elif env == Environment.COLAB:
|
||||
context['colab'] = True
|
||||
context['gpu'] = bool(os.environ.get('COLAB_GPU'))
|
||||
|
||||
elif env == Environment.JUPYTER:
|
||||
context['jupyter'] = True
|
||||
|
||||
return context
|
||||
15
crawl4ai/telemetry/providers/__init__.py
Normal file
15
crawl4ai/telemetry/providers/__init__.py
Normal file
@@ -0,0 +1,15 @@
|
||||
"""
|
||||
Telemetry providers for Crawl4AI.
|
||||
"""
|
||||
|
||||
from ..base import TelemetryProvider, NullProvider
|
||||
|
||||
__all__ = ['TelemetryProvider', 'NullProvider']
|
||||
|
||||
# Try to import Sentry provider if available
|
||||
try:
|
||||
from .sentry import SentryProvider
|
||||
__all__.append('SentryProvider')
|
||||
except ImportError:
|
||||
# Sentry SDK not installed
|
||||
pass
|
||||
234
crawl4ai/telemetry/providers/sentry.py
Normal file
234
crawl4ai/telemetry/providers/sentry.py
Normal file
@@ -0,0 +1,234 @@
|
||||
"""
|
||||
Sentry telemetry provider for Crawl4AI.
|
||||
"""
|
||||
|
||||
import os
|
||||
from typing import Dict, Any, Optional
|
||||
from ..base import TelemetryProvider
|
||||
|
||||
# Hardcoded DSN for Crawl4AI project
|
||||
# This is safe to embed as it's the public part of the DSN
|
||||
# TODO: Replace with actual Crawl4AI Sentry project DSN before release
|
||||
# Format: "https://<public_key>@<organization>.ingest.sentry.io/<project_id>"
|
||||
DEFAULT_SENTRY_DSN = "https://your-public-key@sentry.io/your-project-id"
|
||||
|
||||
|
||||
class SentryProvider(TelemetryProvider):
|
||||
"""Sentry implementation of telemetry provider."""
|
||||
|
||||
def __init__(self, dsn: Optional[str] = None, **kwargs):
|
||||
"""
|
||||
Initialize Sentry provider.
|
||||
|
||||
Args:
|
||||
dsn: Optional DSN override (for testing/development)
|
||||
**kwargs: Additional Sentry configuration
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
|
||||
# Allow DSN override via environment variable or parameter
|
||||
self.dsn = (
|
||||
dsn or
|
||||
os.environ.get('CRAWL4AI_SENTRY_DSN') or
|
||||
DEFAULT_SENTRY_DSN
|
||||
)
|
||||
|
||||
self._sentry_sdk = None
|
||||
self.environment = kwargs.get('environment', 'production')
|
||||
self.release = kwargs.get('release', None)
|
||||
|
||||
def initialize(self) -> bool:
|
||||
"""Initialize Sentry SDK."""
|
||||
try:
|
||||
import sentry_sdk
|
||||
from sentry_sdk.integrations.stdlib import StdlibIntegration
|
||||
from sentry_sdk.integrations.excepthook import ExcepthookIntegration
|
||||
|
||||
# Initialize Sentry with minimal integrations
|
||||
sentry_sdk.init(
|
||||
dsn=self.dsn,
|
||||
|
||||
environment=self.environment,
|
||||
release=self.release,
|
||||
|
||||
# Performance monitoring disabled by default
|
||||
traces_sample_rate=0.0,
|
||||
|
||||
# Only capture errors, not transactions
|
||||
# profiles_sample_rate=0.0,
|
||||
|
||||
# Minimal integrations
|
||||
integrations=[
|
||||
StdlibIntegration(),
|
||||
ExcepthookIntegration(always_run=False),
|
||||
],
|
||||
|
||||
# Privacy settings
|
||||
send_default_pii=False,
|
||||
attach_stacktrace=True,
|
||||
|
||||
# Before send hook for additional sanitization
|
||||
before_send=self._before_send,
|
||||
|
||||
# Disable automatic breadcrumbs
|
||||
max_breadcrumbs=0,
|
||||
|
||||
# Disable request data collection
|
||||
# request_bodies='never',
|
||||
|
||||
# # Custom transport options
|
||||
# transport_options={
|
||||
# 'keepalive': True,
|
||||
# },
|
||||
)
|
||||
|
||||
self._sentry_sdk = sentry_sdk
|
||||
self._initialized = True
|
||||
return True
|
||||
|
||||
except ImportError:
|
||||
# Sentry SDK not installed
|
||||
return False
|
||||
except Exception:
|
||||
# Initialization failed silently
|
||||
return False
|
||||
|
||||
def _before_send(self, event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Process event before sending to Sentry.
|
||||
Provides additional privacy protection.
|
||||
"""
|
||||
# Remove sensitive data
|
||||
if 'request' in event:
|
||||
event['request'] = self._sanitize_request(event['request'])
|
||||
|
||||
# Remove local variables that might contain sensitive data
|
||||
if 'exception' in event and 'values' in event['exception']:
|
||||
for exc in event['exception']['values']:
|
||||
if 'stacktrace' in exc and 'frames' in exc['stacktrace']:
|
||||
for frame in exc['stacktrace']['frames']:
|
||||
# Remove local variables from frames
|
||||
frame.pop('vars', None)
|
||||
|
||||
# Apply general sanitization
|
||||
event = self.sanitize_data(event)
|
||||
|
||||
return event
|
||||
|
||||
def _sanitize_request(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Sanitize request data to remove sensitive information."""
|
||||
sanitized = request_data.copy()
|
||||
|
||||
# Remove sensitive fields
|
||||
sensitive_fields = ['cookies', 'headers', 'data', 'query_string', 'env']
|
||||
for field in sensitive_fields:
|
||||
if field in sanitized:
|
||||
sanitized[field] = '[REDACTED]'
|
||||
|
||||
# Keep only safe fields
|
||||
safe_fields = ['method', 'url']
|
||||
return {k: v for k, v in sanitized.items() if k in safe_fields}
|
||||
|
||||
def send_exception(
|
||||
self,
|
||||
exc: Exception,
|
||||
context: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send exception to Sentry.
|
||||
|
||||
Args:
|
||||
exc: Exception to report
|
||||
context: Optional context (email, environment info)
|
||||
|
||||
Returns:
|
||||
True if sent successfully
|
||||
"""
|
||||
if not self._initialized:
|
||||
if not self.initialize():
|
||||
return False
|
||||
|
||||
try:
|
||||
if self._sentry_sdk:
|
||||
with self._sentry_sdk.push_scope() as scope:
|
||||
# Add user context if email provided
|
||||
if context and 'email' in context:
|
||||
scope.set_user({'email': context['email']})
|
||||
|
||||
# Add additional context
|
||||
if context:
|
||||
for key, value in context.items():
|
||||
if key != 'email':
|
||||
scope.set_context(key, value)
|
||||
|
||||
# Add tags for filtering
|
||||
scope.set_tag('source', context.get('source', 'unknown'))
|
||||
scope.set_tag('environment_type', context.get('environment_type', 'unknown'))
|
||||
|
||||
# Capture the exception
|
||||
self._sentry_sdk.capture_exception(exc)
|
||||
|
||||
return True
|
||||
|
||||
except Exception:
|
||||
# Silently fail - telemetry should never crash the app
|
||||
return False
|
||||
|
||||
return False
|
||||
|
||||
def send_event(
|
||||
self,
|
||||
event_name: str,
|
||||
payload: Optional[Dict[str, Any]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send custom event to Sentry.
|
||||
|
||||
Args:
|
||||
event_name: Name of the event
|
||||
payload: Event data
|
||||
|
||||
Returns:
|
||||
True if sent successfully
|
||||
"""
|
||||
if not self._initialized:
|
||||
if not self.initialize():
|
||||
return False
|
||||
|
||||
try:
|
||||
if self._sentry_sdk:
|
||||
# Sanitize payload
|
||||
safe_payload = self.sanitize_data(payload) if payload else {}
|
||||
|
||||
# Send as a message with extra data
|
||||
self._sentry_sdk.capture_message(
|
||||
event_name,
|
||||
level='info',
|
||||
extras=safe_payload
|
||||
)
|
||||
return True
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
return False
|
||||
|
||||
def flush(self) -> None:
|
||||
"""Flush pending events to Sentry."""
|
||||
if self._initialized and self._sentry_sdk:
|
||||
try:
|
||||
self._sentry_sdk.flush(timeout=2.0)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def shutdown(self) -> None:
|
||||
"""Shutdown Sentry client."""
|
||||
if self._initialized and self._sentry_sdk:
|
||||
try:
|
||||
self._sentry_sdk.flush(timeout=2.0)
|
||||
# Note: sentry_sdk doesn't have a shutdown method
|
||||
# Flush is sufficient for cleanup
|
||||
except Exception:
|
||||
pass
|
||||
finally:
|
||||
self._initialized = False
|
||||
@@ -16,7 +16,7 @@ from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, IM
|
||||
import httpx
|
||||
from socket import gaierror
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
from typing import Dict, Any, List, Optional, Callable, Generator, Tuple, Iterable
|
||||
from urllib.parse import urljoin
|
||||
import requests
|
||||
from requests.exceptions import InvalidSchema
|
||||
@@ -40,8 +40,7 @@ from typing import Sequence
|
||||
|
||||
from itertools import chain
|
||||
from collections import deque
|
||||
from typing import Generator, Iterable
|
||||
|
||||
import psutil
|
||||
import numpy as np
|
||||
|
||||
from urllib.parse import (
|
||||
@@ -3414,3 +3413,79 @@ def cosine_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
|
||||
"""Calculate cosine distance (1 - similarity) between two vectors"""
|
||||
return 1 - cosine_similarity(vec1, vec2)
|
||||
|
||||
|
||||
# Memory utilities
|
||||
|
||||
def get_true_available_memory_gb() -> float:
|
||||
"""Get truly available memory including inactive pages (cross-platform)"""
|
||||
vm = psutil.virtual_memory()
|
||||
|
||||
if platform.system() == 'Darwin': # macOS
|
||||
# On macOS, we need to include inactive memory too
|
||||
try:
|
||||
# Use vm_stat to get accurate values
|
||||
result = subprocess.run(['vm_stat'], capture_output=True, text=True)
|
||||
lines = result.stdout.split('\n')
|
||||
|
||||
page_size = 16384 # macOS page size
|
||||
pages = {}
|
||||
|
||||
for line in lines:
|
||||
if 'Pages free:' in line:
|
||||
pages['free'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages inactive:' in line:
|
||||
pages['inactive'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages speculative:' in line:
|
||||
pages['speculative'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages purgeable:' in line:
|
||||
pages['purgeable'] = int(line.split()[-1].rstrip('.'))
|
||||
|
||||
# Calculate total available (free + inactive + speculative + purgeable)
|
||||
total_available_pages = (
|
||||
pages.get('free', 0) +
|
||||
pages.get('inactive', 0) +
|
||||
pages.get('speculative', 0) +
|
||||
pages.get('purgeable', 0)
|
||||
)
|
||||
available_gb = (total_available_pages * page_size) / (1024**3)
|
||||
|
||||
return available_gb
|
||||
except:
|
||||
# Fallback to psutil
|
||||
return vm.available / (1024**3)
|
||||
else:
|
||||
# For Windows and Linux, psutil.available is accurate
|
||||
return vm.available / (1024**3)
|
||||
|
||||
|
||||
def get_true_memory_usage_percent() -> float:
|
||||
"""
|
||||
Get memory usage percentage that accounts for platform differences.
|
||||
|
||||
Returns:
|
||||
float: Memory usage percentage (0-100)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
|
||||
# Calculate used percentage based on truly available memory
|
||||
used_percent = 100.0 * (total_gb - available_gb) / total_gb
|
||||
|
||||
# Ensure it's within valid range
|
||||
return max(0.0, min(100.0, used_percent))
|
||||
|
||||
|
||||
def get_memory_stats() -> Tuple[float, float, float]:
|
||||
"""
|
||||
Get comprehensive memory statistics.
|
||||
|
||||
Returns:
|
||||
Tuple[float, float, float]: (used_percent, available_gb, total_gb)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
used_percent = get_true_memory_usage_percent()
|
||||
|
||||
return used_percent, available_gb, total_gb
|
||||
@@ -65,7 +65,7 @@ async def handle_llm_qa(
|
||||
) -> str:
|
||||
"""Process QA using LLM with crawled content as context."""
|
||||
try:
|
||||
if not url.startswith(('http://', 'https://')):
|
||||
if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")):
|
||||
url = 'https://' + url
|
||||
# Extract base URL by finding last '?q=' occurrence
|
||||
last_q_index = url.rfind('?q=')
|
||||
@@ -191,7 +191,7 @@ async def handle_markdown_request(
|
||||
detail=error_msg
|
||||
)
|
||||
decoded_url = unquote(url)
|
||||
if not decoded_url.startswith(('http://', 'https://')):
|
||||
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
|
||||
decoded_url = 'https://' + decoded_url
|
||||
|
||||
if filter_type == FilterType.RAW:
|
||||
@@ -328,7 +328,7 @@ async def create_new_task(
|
||||
) -> JSONResponse:
|
||||
"""Create and initialize a new task."""
|
||||
decoded_url = unquote(input_path)
|
||||
if not decoded_url.startswith(('http://', 'https://')):
|
||||
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
|
||||
decoded_url = 'https://' + decoded_url
|
||||
|
||||
from datetime import datetime
|
||||
@@ -428,7 +428,7 @@ async def handle_crawl_request(
|
||||
peak_mem_mb = start_mem_mb
|
||||
|
||||
try:
|
||||
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) else url for url in urls]
|
||||
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
|
||||
browser_config = BrowserConfig.load(browser_config)
|
||||
crawler_config = CrawlerRunConfig.load(crawler_config)
|
||||
|
||||
|
||||
@@ -15,3 +15,4 @@ PyJWT==2.10.1
|
||||
mcp>=1.6.0
|
||||
websockets>=15.0.1
|
||||
httpx[http2]>=0.27.2
|
||||
sentry-sdk>=2.0.0
|
||||
|
||||
@@ -74,6 +74,32 @@ setup_logging(config)
|
||||
|
||||
__version__ = "0.5.1-d1"
|
||||
|
||||
# ───────────────────── telemetry setup ────────────────────────
|
||||
# Docker/API server telemetry: enabled by default unless CRAWL4AI_TELEMETRY=0
|
||||
import os as _os
|
||||
if _os.environ.get('CRAWL4AI_TELEMETRY') != '0':
|
||||
# Set environment variable to indicate we're in API server mode
|
||||
_os.environ['CRAWL4AI_API_SERVER'] = 'true'
|
||||
|
||||
# Import and enable telemetry for Docker/API environment
|
||||
from crawl4ai.telemetry import enable as enable_telemetry
|
||||
from crawl4ai.telemetry import capture_exception
|
||||
|
||||
# Enable telemetry automatically in Docker mode
|
||||
enable_telemetry(always=True)
|
||||
|
||||
import logging
|
||||
telemetry_logger = logging.getLogger("telemetry")
|
||||
telemetry_logger.info("✅ Telemetry enabled for Docker/API server")
|
||||
else:
|
||||
# Define no-op for capture_exception if telemetry is disabled
|
||||
def capture_exception(exc, context=None):
|
||||
pass
|
||||
|
||||
import logging
|
||||
telemetry_logger = logging.getLogger("telemetry")
|
||||
telemetry_logger.info("❌ Telemetry disabled via CRAWL4AI_TELEMETRY=0")
|
||||
|
||||
# ── global page semaphore (hard cap) ─────────────────────────
|
||||
MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
|
||||
GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)
|
||||
@@ -237,9 +263,9 @@ async def get_markdown(
|
||||
body: MarkdownRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
if not body.url.startswith(("http://", "https://")):
|
||||
if not body.url.startswith(("http://", "https://")) and not body.url.startswith(("raw:", "raw://")):
|
||||
raise HTTPException(
|
||||
400, "URL must be absolute and start with http/https")
|
||||
400, "Invalid URL format. Must start with http://, https://, or for raw HTML (raw:, raw://)")
|
||||
markdown = await handle_markdown_request(
|
||||
body.url, body.f, body.q, body.c, config, body.provider
|
||||
)
|
||||
@@ -401,7 +427,7 @@ async def llm_endpoint(
|
||||
):
|
||||
if not q:
|
||||
raise HTTPException(400, "Query parameter 'q' is required")
|
||||
if not url.startswith(("http://", "https://")):
|
||||
if not url.startswith(("http://", "https://")) and not url.startswith(("raw:", "raw://")):
|
||||
url = "https://" + url
|
||||
answer = await handle_llm_qa(url, q, config)
|
||||
return JSONResponse({"answer": answer})
|
||||
|
||||
@@ -8,10 +8,14 @@ Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
|
||||
- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
|
||||
- **Bug Fixes**: Resolved several critical issues for better stability
|
||||
- **Documentation Updates**: Clearer examples and improved API documentation
|
||||
- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
|
||||
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
|
||||
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
|
||||
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
|
||||
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
|
||||
- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
|
||||
- **🔧 Bug Fixes**: Resolved several critical issues for better stability
|
||||
- **📚 Documentation Updates**: Clearer examples and improved API documentation
|
||||
|
||||
## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
|
||||
|
||||
@@ -78,6 +82,182 @@ async with AsyncWebCrawler() as crawler:
|
||||
- **Reduced Complexity**: No more if/else forests in your extraction code
|
||||
- **Better Performance**: Each URL gets exactly the processing it needs
|
||||
|
||||
## 🕵️ Undetected Browser Support: Stealth Mode Activated
|
||||
|
||||
**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
|
||||
|
||||
**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
# Enable undetected mode for stealth crawling
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="undetected", # Use undetected Chrome
|
||||
headless=True, # Can run headless with stealth
|
||||
extra_args=[
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-web-security",
|
||||
"--disable-features=VizDisplayCompositor"
|
||||
]
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# This will bypass most bot detection systems
|
||||
result = await crawler.arun("https://protected-site.com")
|
||||
|
||||
if result.success:
|
||||
print("✅ Successfully bypassed bot detection!")
|
||||
print(f"Content length: {len(result.markdown)}")
|
||||
```
|
||||
|
||||
**Advanced Anti-Bot Strategies:**
|
||||
|
||||
```python
|
||||
# Combine multiple stealth techniques
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
# Random user agents and headers
|
||||
headers={
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
"Accept-Encoding": "gzip, deflate, br",
|
||||
"DNT": "1"
|
||||
},
|
||||
|
||||
# Human-like behavior simulation
|
||||
js_code="""
|
||||
// Random mouse movements
|
||||
const simulateHuman = () => {
|
||||
const event = new MouseEvent('mousemove', {
|
||||
clientX: Math.random() * window.innerWidth,
|
||||
clientY: Math.random() * window.innerHeight
|
||||
});
|
||||
document.dispatchEvent(event);
|
||||
};
|
||||
setInterval(simulateHuman, 100 + Math.random() * 200);
|
||||
|
||||
// Random scrolling
|
||||
const randomScroll = () => {
|
||||
const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
|
||||
window.scrollTo(0, scrollY);
|
||||
};
|
||||
setTimeout(randomScroll, 500 + Math.random() * 1000);
|
||||
""",
|
||||
|
||||
# Delay to appear more human
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
|
||||
result = await crawler.arun("https://bot-protected-site.com", config=config)
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Enterprise Scraping**: Access previously blocked corporate sites and databases
|
||||
- **Market Research**: Gather data from competitor sites with protection
|
||||
- **Price Monitoring**: Track e-commerce sites that block automated access
|
||||
- **Content Aggregation**: Collect news and social media despite anti-bot measures
|
||||
- **Compliance Testing**: Verify your own site's bot protection effectiveness
|
||||
|
||||
## 🧠 Memory Monitoring & Optimization
|
||||
|
||||
**The Problem:** Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites.
|
||||
|
||||
**My Solution:** Built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
|
||||
|
||||
### Memory Tracking Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
|
||||
|
||||
# Monitor memory during crawling
|
||||
monitor = MemoryMonitor()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Start monitoring
|
||||
monitor.start_monitoring()
|
||||
|
||||
# Perform memory-intensive operations
|
||||
results = await crawler.arun_many([
|
||||
"https://heavy-js-site.com",
|
||||
"https://large-images-site.com",
|
||||
"https://dynamic-content-site.com"
|
||||
])
|
||||
|
||||
# Get detailed memory report
|
||||
memory_report = monitor.get_report()
|
||||
print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
|
||||
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
|
||||
|
||||
# Automatic cleanup suggestions
|
||||
if memory_report['peak_mb'] > 1000: # > 1GB
|
||||
print("💡 Consider batch size optimization")
|
||||
print("💡 Enable aggressive garbage collection")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Production Stability**: Prevent memory-related crashes in long-running services
|
||||
- **Cost Optimization**: Right-size server resources based on actual usage
|
||||
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
|
||||
- **Scalability Planning**: Understand memory patterns for horizontal scaling
|
||||
|
||||
## 📊 Enhanced Table Extraction
|
||||
|
||||
**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.
|
||||
|
||||
**My Solution:** Dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.
|
||||
|
||||
### New Table Access Pattern
|
||||
|
||||
```python
|
||||
# Old way (deprecated)
|
||||
# tables_data = result.media.get('tables', [])
|
||||
|
||||
# New way (v0.7.3+)
|
||||
result = await crawler.arun("https://site-with-tables.com")
|
||||
|
||||
# Direct table access
|
||||
if result.tables:
|
||||
print(f"Found {len(result.tables)} tables")
|
||||
|
||||
# Convert to pandas DataFrame instantly
|
||||
import pandas as pd
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
df = pd.DataFrame(table['data'])
|
||||
print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
|
||||
print(df.head())
|
||||
|
||||
# Table metadata
|
||||
print(f"Source: {table.get('source_xpath', 'Unknown')}")
|
||||
print(f"Headers: {table.get('headers', [])}")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
|
||||
- **ETL Pipelines**: Cleaner integration with data processing workflows
|
||||
- **Reporting**: Simplified table extraction for automated reporting systems
|
||||
|
||||
## 💰 Community Support: GitHub Sponsors
|
||||
|
||||
I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.
|
||||
|
||||
**Sponsorship Tiers:**
|
||||
- **🌱 Supporter ($5/month)**: Community support + early feature previews
|
||||
- **🚀 Professional ($25/month)**: Priority support + beta access
|
||||
- **🏢 Business ($100/month)**: Direct consultation + custom integrations
|
||||
- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development
|
||||
|
||||
**Why Sponsor?**
|
||||
- Ensure continuous development and maintenance
|
||||
- Get priority support and feature requests
|
||||
- Access to premium documentation and examples
|
||||
- Direct line to the development team
|
||||
|
||||
[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
|
||||
|
||||
## 🐳 Docker: Flexible LLM Provider Configuration
|
||||
|
||||
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
|
||||
|
||||
305
docs/blog/release-v0.7.4.md
Normal file
305
docs/blog/release-v0.7.4.md
Normal file
@@ -0,0 +1,305 @@
|
||||
# 🚀 Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update
|
||||
|
||||
*August 17, 2025 • 6 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Performance Update. This release introduces revolutionary LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration with dict and string formats
|
||||
- **🐳 Docker Improvements**: Better API handling and raw HTML support
|
||||
|
||||
## 🚀 LLMTableExtraction: Revolutionary Table Processing
|
||||
|
||||
**The Problem:** Complex tables with rowspan, colspan, nested structures, or massive datasets that traditional HTML parsing can't handle effectively. Large tables that exceed token limits crash extraction processes.
|
||||
|
||||
**My Solution:** I developed LLMTableExtraction—an intelligent table extraction strategy that uses Large Language Models with automatic chunking to handle tables of any size and complexity.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
LLMTableExtraction,
|
||||
CacheMode
|
||||
)
|
||||
|
||||
# Configure LLM for table extraction
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
temperature=0.1, # Low temperature for consistency
|
||||
max_tokens=32000
|
||||
)
|
||||
|
||||
# Create intelligent table extraction strategy
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
verbose=True,
|
||||
max_tries=2,
|
||||
enable_chunking=True, # Handle massive tables
|
||||
chunk_token_threshold=5000, # Smart chunking threshold
|
||||
overlap_threshold=100, # Maintain context between chunks
|
||||
extraction_type="structured" # Get structured data output
|
||||
)
|
||||
|
||||
# Apply to crawler configuration
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction_strategy=table_strategy,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Extract complex tables with intelligence
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Access extracted tables directly
|
||||
for i, table in enumerate(result.tables):
|
||||
print(f"Table {i}: {len(table['data'])} rows × {len(table['headers'])} columns")
|
||||
|
||||
# Convert to pandas DataFrame instantly
|
||||
import pandas as pd
|
||||
df = pd.DataFrame(table['data'], columns=table['headers'])
|
||||
print(df.head())
|
||||
```
|
||||
|
||||
**Intelligent Chunking for Massive Tables:**
|
||||
|
||||
```python
|
||||
# Handle tables that exceed token limits
|
||||
large_table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=3000, # Conservative threshold
|
||||
overlap_threshold=150, # Preserve context
|
||||
max_concurrent_chunks=3, # Parallel processing
|
||||
merge_strategy="intelligent" # Smart chunk merging
|
||||
)
|
||||
|
||||
# Process Wikipedia comparison tables, financial reports, etc.
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction_strategy=large_table_strategy,
|
||||
# Target specific table containers
|
||||
css_selector="div.wikitable, table.sortable",
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/Comparison_of_operating_systems",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Tables are automatically chunked, processed, and merged
|
||||
print(f"Extracted {len(result.tables)} complex tables")
|
||||
for table in result.tables:
|
||||
print(f"Merged table: {len(table['data'])} total rows")
|
||||
```
|
||||
|
||||
**Advanced Features:**
|
||||
|
||||
- **Intelligent Chunking**: Automatically splits massive tables while preserving structure
|
||||
- **Context Preservation**: Overlapping chunks maintain column relationships
|
||||
- **Parallel Processing**: Concurrent chunk processing for speed
|
||||
- **Smart Merging**: Reconstructs complete tables from processed chunks
|
||||
- **Complex Structure Support**: Handles rowspan, colspan, nested tables
|
||||
- **Metadata Extraction**: Captures table context, captions, and relationships
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Financial Analysis**: Extract complex earnings tables and financial statements
|
||||
- **Research & Academia**: Process large datasets from Wikipedia, research papers
|
||||
- **E-commerce**: Handle product comparison tables with complex layouts
|
||||
- **Government Data**: Extract census data, statistical tables from official sources
|
||||
- **Competitive Intelligence**: Process competitor pricing and feature tables
|
||||
|
||||
## ⚡ Enhanced Concurrency: True Performance Gains
|
||||
|
||||
**The Problem:** The `arun_many()` method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations.
|
||||
|
||||
**My Solution:** I implemented true concurrency improvements in the dispatcher that enable genuine parallel processing for fast-completing tasks.
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
```python
|
||||
# Before v0.7.4: Sequential-like behavior for fast tasks
|
||||
# After v0.7.4: True concurrency
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# These will now run with true concurrency
|
||||
urls = [
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1"
|
||||
]
|
||||
|
||||
# Processes in truly parallel fashion
|
||||
results = await crawler.arun_many(urls)
|
||||
|
||||
# Performance improvement: ~4x faster for fast-completing tasks
|
||||
print(f"Processed {len(results)} URLs with true concurrency")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **API Crawling**: 3-4x faster processing of REST endpoints and API documentation
|
||||
- **Batch URL Processing**: Significant speedup for large URL lists
|
||||
- **Monitoring Systems**: Faster health checks and status page monitoring
|
||||
- **Data Aggregation**: Improved performance for real-time data collection
|
||||
|
||||
## 🧹 Memory Management Refactor: Cleaner Architecture
|
||||
|
||||
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
|
||||
|
||||
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
|
||||
|
||||
### Improved Memory Handling
|
||||
|
||||
```python
|
||||
# All memory utilities now consolidated
|
||||
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor
|
||||
|
||||
# Enhanced memory monitoring
|
||||
monitor = MemoryMonitor()
|
||||
monitor.start_monitoring()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Memory-efficient batch processing
|
||||
results = await crawler.arun_many(large_url_list)
|
||||
|
||||
# Get accurate memory metrics
|
||||
memory_usage = get_true_memory_usage_percent()
|
||||
memory_report = monitor.get_report()
|
||||
|
||||
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
|
||||
print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Production Stability**: More reliable memory tracking and management
|
||||
- **Code Maintainability**: Cleaner architecture for easier debugging
|
||||
- **Import Clarity**: Resolved potential conflicts and import issues
|
||||
- **Developer Experience**: Simpler API for memory monitoring
|
||||
|
||||
## 🔧 Critical Stability Fixes
|
||||
|
||||
### Browser Manager Race Condition Resolution
|
||||
|
||||
**The Problem:** Concurrent page creation in persistent browser contexts caused "Target page/context closed" errors during high-concurrency operations.
|
||||
|
||||
**My Solution:** Implemented thread-safe page creation with proper locking mechanisms.
|
||||
|
||||
```python
|
||||
# Fixed: Safe concurrent page creation
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
use_persistent_context=True, # Now thread-safe
|
||||
max_concurrent_sessions=10 # Safely handle concurrent requests
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# These concurrent operations are now stable
|
||||
tasks = [crawler.arun(url) for url in url_list]
|
||||
results = await asyncio.gather(*tasks) # No more race conditions
|
||||
```
|
||||
|
||||
### Enhanced Browser Profiler
|
||||
|
||||
**The Problem:** Inconsistent keyboard handling across platforms and unreliable quit mechanisms.
|
||||
|
||||
**My Solution:** Cross-platform keyboard listeners with improved quit handling.
|
||||
|
||||
### Advanced URL Processing
|
||||
|
||||
**The Problem:** Raw URL formats (`raw://` and `raw:`) weren't properly handled, and base tag link resolution was incomplete.
|
||||
|
||||
**My Solution:** Enhanced URL preprocessing and base tag support.
|
||||
|
||||
```python
|
||||
# Now properly handles all URL formats
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"raw://static-html-content",
|
||||
"raw:file://local-file.html"
|
||||
]
|
||||
|
||||
# Base tag links are now correctly resolved
|
||||
config = CrawlerRunConfig(
|
||||
include_links=True, # Links properly resolved with base tags
|
||||
resolve_absolute_urls=True
|
||||
)
|
||||
```
|
||||
|
||||
## 🛡️ Enhanced Proxy Configuration
|
||||
|
||||
**The Problem:** Proxy configuration only accepted specific formats, limiting flexibility.
|
||||
|
||||
**My Solution:** Enhanced ProxyConfig to support both dictionary and string formats.
|
||||
|
||||
```python
|
||||
# Multiple proxy configuration formats now supported
|
||||
from crawl4ai import BrowserConfig, ProxyConfig
|
||||
|
||||
# String format
|
||||
proxy_config = ProxyConfig("http://proxy.example.com:8080")
|
||||
|
||||
# Dictionary format
|
||||
proxy_config = ProxyConfig({
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
})
|
||||
|
||||
# Use with crawler
|
||||
browser_config = BrowserConfig(proxy_config=proxy_config)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://httpbin.org/ip")
|
||||
```
|
||||
|
||||
## 🐳 Docker & Infrastructure Improvements
|
||||
|
||||
This release includes several Docker and infrastructure improvements:
|
||||
|
||||
- **Better API Token Handling**: Improved Docker example scripts with correct endpoints
|
||||
- **Raw HTML Support**: Enhanced Docker API to handle raw HTML content properly
|
||||
- **Documentation Updates**: Comprehensive Docker deployment examples
|
||||
- **Test Coverage**: Expanded test suite with better coverage
|
||||
|
||||
## 📚 Documentation & Examples
|
||||
|
||||
Enhanced documentation includes:
|
||||
|
||||
- **LLM Table Extraction Guide**: Comprehensive examples and best practices
|
||||
- **Migration Documentation**: Updated patterns for new table extraction methods
|
||||
- **Docker Deployment**: Complete deployment guide with examples
|
||||
- **Performance Optimization**: Guidelines for concurrent crawling
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and community for feedback, bug reports, and feature requests that made this release possible.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [LLM Table Extraction Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_table_extraction_example.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impossible to process, while concurrency improvements make batch operations 3-4x faster. Try the intelligent table extraction—it's a game changer for data extraction workflows!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -8,26 +8,20 @@ from typing import Dict, Any
|
||||
|
||||
|
||||
class Crawl4AiTester:
|
||||
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
|
||||
def __init__(self, base_url: str = "http://localhost:11235"):
|
||||
self.base_url = base_url
|
||||
self.api_token = (
|
||||
api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
|
||||
) # Check environment variable as fallback
|
||||
self.headers = (
|
||||
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
|
||||
)
|
||||
|
||||
def submit_and_wait(
|
||||
self, request_data: Dict[str, Any], timeout: int = 300
|
||||
) -> Dict[str, Any]:
|
||||
# Submit crawl job
|
||||
# Submit crawl job using async endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl", json=request_data, headers=self.headers
|
||||
f"{self.base_url}/crawl/job", json=request_data
|
||||
)
|
||||
if response.status_code == 403:
|
||||
raise Exception("API token is invalid or missing")
|
||||
task_id = response.json()["task_id"]
|
||||
print(f"Task ID: {task_id}")
|
||||
response.raise_for_status()
|
||||
job_response = response.json()
|
||||
task_id = job_response["task_id"]
|
||||
print(f"Submitted job with task_id: {task_id}")
|
||||
|
||||
# Poll for result
|
||||
start_time = time.time()
|
||||
@@ -38,8 +32,9 @@ class Crawl4AiTester:
|
||||
)
|
||||
|
||||
result = requests.get(
|
||||
f"{self.base_url}/task/{task_id}", headers=self.headers
|
||||
f"{self.base_url}/crawl/job/{task_id}"
|
||||
)
|
||||
result.raise_for_status()
|
||||
status = result.json()
|
||||
|
||||
if status["status"] == "failed":
|
||||
@@ -52,10 +47,10 @@ class Crawl4AiTester:
|
||||
time.sleep(2)
|
||||
|
||||
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
# Use synchronous crawl endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl_sync",
|
||||
f"{self.base_url}/crawl",
|
||||
json=request_data,
|
||||
headers=self.headers,
|
||||
timeout=60,
|
||||
)
|
||||
if response.status_code == 408:
|
||||
@@ -63,20 +58,9 @@ class Crawl4AiTester:
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Directly crawl without using task queue"""
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
|
||||
)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
|
||||
def test_docker_deployment(version="basic"):
|
||||
tester = Crawl4AiTester(
|
||||
base_url="http://localhost:11235",
|
||||
# base_url="https://api.crawl4ai.com" # just for example
|
||||
# api_token="test" # just for example
|
||||
)
|
||||
print(f"Testing Crawl4AI Docker {version} version")
|
||||
|
||||
@@ -95,11 +79,8 @@ def test_docker_deployment(version="basic"):
|
||||
time.sleep(5)
|
||||
|
||||
# Test cases based on version
|
||||
test_basic_crawl_direct(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl_sync(tester)
|
||||
|
||||
if version in ["full", "transformer"]:
|
||||
test_cosine_extraction(tester)
|
||||
|
||||
@@ -112,115 +93,129 @@ def test_docker_deployment(version="basic"):
|
||||
|
||||
|
||||
def test_basic_crawl(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl ===")
|
||||
print("\n=== Testing Basic Crawl (Async) ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
print(f"Basic crawl result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
assert len(result["result"]["results"]) > 0
|
||||
assert len(result["result"]["results"][0]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_sync(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Sync) ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
}
|
||||
|
||||
result = tester.submit_sync(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["status"] == "completed"
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_direct(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Direct) ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
# "session_id": "test"
|
||||
"cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
|
||||
}
|
||||
|
||||
result = tester.crawl_direct(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
print(f"Basic crawl result count: {len(result['results'])}")
|
||||
assert result["success"]
|
||||
assert len(result["results"]) > 0
|
||||
assert len(result["results"][0]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_js_execution(tester: Crawl4AiTester):
|
||||
print("\n=== Testing JS Execution ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "article.tease-card:nth-child(10)",
|
||||
"crawler_params": {"headless": True},
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if(loadMoreButton) loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "wide-tease-item__wrapper df flex-column flex-row-m flex-nowrap-m enable-new-sports-feed-mobile-design(10)"
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"JS execution result length: {len(result['result']['markdown'])}")
|
||||
print(f"JS execution result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_css_selector(tester: Crawl4AiTester):
|
||||
print("\n=== Testing CSS Selector ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 7,
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"crawler_params": {"headless": True},
|
||||
"extra": {"word_count_threshold": 10},
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"word_count_threshold": 10
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"CSS selector result length: {len(result['result']['markdown'])}")
|
||||
print(f"CSS selector result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_structured_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Structured Extraction ===")
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"name": "Cryptocurrency Prices",
|
||||
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
"name": "asset_name",
|
||||
"selector": "td:nth-child(2) p.cds-headline-h4steop",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
"name": "asset_symbol",
|
||||
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_image_url",
|
||||
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "asset_url",
|
||||
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
},
|
||||
],
|
||||
{
|
||||
"name": "change",
|
||||
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": "https://www.coinbase.com/explore",
|
||||
"priority": 9,
|
||||
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
|
||||
"urls": ["https://www.coinbase.com/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {"schema": schema}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} items")
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
if extracted:
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
assert len(extracted) > 0
|
||||
|
||||
@@ -230,43 +225,54 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"model_name": {
|
||||
"asset_name": {
|
||||
"type": "string",
|
||||
"description": "Name of the OpenAI model.",
|
||||
"description": "Name of the asset.",
|
||||
},
|
||||
"input_fee": {
|
||||
"price": {
|
||||
"type": "string",
|
||||
"description": "Fee for input token for the OpenAI model.",
|
||||
"description": "Price of the asset.",
|
||||
},
|
||||
"output_fee": {
|
||||
"change": {
|
||||
"type": "string",
|
||||
"description": "Fee for output token for the OpenAI model.",
|
||||
"description": "Change in price of the asset.",
|
||||
},
|
||||
},
|
||||
"required": ["model_name", "input_fee", "output_fee"],
|
||||
"required": ["asset_name", "price", "change"],
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": "https://openai.com/api/pricing",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"urls": ["https://www.coinbase.com/en-in/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"provider": "openai/gpt-4o-mini",
|
||||
"api_token": os.getenv("OPENAI_API_KEY"),
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
|
||||
},
|
||||
},
|
||||
"crawler_params": {"word_count_threshold": 1},
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "gemini/gemini-2.0-flash-exp",
|
||||
"api_token": os.getenv("GEMINI_API_KEY")
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "From the crawled content, extract asset names along with their prices and change in price.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} model pricing entries")
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} asset pricing entries")
|
||||
if extracted:
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
|
||||
@@ -274,6 +280,16 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
|
||||
def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
print("\n=== Testing LLM with Ollama ===")
|
||||
|
||||
# Check if Ollama is accessible first
|
||||
try:
|
||||
ollama_response = requests.get("http://localhost:11434/api/tags", timeout=5)
|
||||
ollama_response.raise_for_status()
|
||||
print("Ollama is accessible")
|
||||
except:
|
||||
print("Ollama is not accessible, skipping test")
|
||||
return
|
||||
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
@@ -294,24 +310,33 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"verbose": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama2",
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
},
|
||||
},
|
||||
"extra": {"word_count_threshold": 1},
|
||||
"crawler_params": {"verbose": True},
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama3.2:latest",
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print("Extracted content:", json.dumps(extracted, indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
@@ -321,24 +346,30 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Cosine Extraction ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "cosine",
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
},
|
||||
},
|
||||
"extraction_strategy": {
|
||||
"type": "CosineStrategy",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} text clusters")
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
if extracted:
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"Cosine extraction test failed: {str(e)}")
|
||||
@@ -347,20 +378,25 @@ def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
def test_screenshot(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Screenshot ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 5,
|
||||
"screenshot": True,
|
||||
"crawler_params": {"headless": True},
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"screenshot": True
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print("Screenshot captured:", bool(result["result"]["screenshot"]))
|
||||
screenshot_data = result["result"]["results"][0]["screenshot"]
|
||||
print("Screenshot captured:", bool(screenshot_data))
|
||||
|
||||
if result["result"]["screenshot"]:
|
||||
if screenshot_data:
|
||||
# Save screenshot
|
||||
screenshot_data = base64.b64decode(result["result"]["screenshot"])
|
||||
screenshot_bytes = base64.b64decode(screenshot_data)
|
||||
with open("test_screenshot.jpg", "wb") as f:
|
||||
f.write(screenshot_data)
|
||||
f.write(screenshot_bytes)
|
||||
print("Screenshot saved as test_screenshot.jpg")
|
||||
|
||||
assert result["result"]["success"]
|
||||
@@ -368,5 +404,4 @@ def test_screenshot(tester: Crawl4AiTester):
|
||||
|
||||
if __name__ == "__main__":
|
||||
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
|
||||
# version = "full"
|
||||
test_docker_deployment(version)
|
||||
|
||||
356
docs/examples/llm_table_extraction_example.py
Normal file
356
docs/examples/llm_table_extraction_example.py
Normal file
@@ -0,0 +1,356 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Example demonstrating LLM-based table extraction in Crawl4AI.
|
||||
|
||||
This example shows how to use the LLMTableExtraction strategy to extract
|
||||
complex tables from web pages, including handling rowspan, colspan, and nested tables.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
# Get the grandparent directory
|
||||
grandparent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
sys.path.append(grandparent_dir)
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
LLMTableExtraction,
|
||||
CacheMode
|
||||
)
|
||||
import pandas as pd
|
||||
|
||||
|
||||
# Example 1: Basic LLM Table Extraction
|
||||
async def basic_llm_extraction():
|
||||
"""Extract tables using LLM with default settings."""
|
||||
print("\n=== Example 1: Basic LLM Table Extraction ===")
|
||||
|
||||
# Configure LLM (using OpenAI GPT-4o-mini for cost efficiency)
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY", # Uses environment variable
|
||||
temperature=0.1, # Low temperature for consistency
|
||||
max_tokens=32000
|
||||
)
|
||||
|
||||
# Create LLM table extraction strategy
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
verbose=True,
|
||||
# css_selector="div.mw-content-ltr",
|
||||
max_tries=2,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=5000, # Lower threshold to force chunking
|
||||
min_rows_per_chunk=10,
|
||||
max_parallel_chunks=3
|
||||
)
|
||||
|
||||
# Configure crawler with the strategy
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Extract tables from a Wikipedia page
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/List_of_chemical_elements",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"✓ Found {len(result.tables)} tables")
|
||||
|
||||
# Display first table
|
||||
if result.tables:
|
||||
first_table = result.tables[0]
|
||||
print(f"\nFirst table:")
|
||||
print(f" Headers: {first_table['headers'][:5]}...")
|
||||
print(f" Rows: {len(first_table['rows'])}")
|
||||
|
||||
# Convert to pandas DataFrame
|
||||
df = pd.DataFrame(
|
||||
first_table['rows'],
|
||||
columns=first_table['headers']
|
||||
)
|
||||
print(f"\nDataFrame shape: {df.shape}")
|
||||
print(df.head())
|
||||
else:
|
||||
print(f"✗ Extraction failed: {result.error}")
|
||||
|
||||
|
||||
# Example 2: Focused Extraction with CSS Selector
|
||||
async def focused_extraction():
|
||||
"""Extract tables from specific page sections using CSS selectors."""
|
||||
print("\n=== Example 2: Focused Extraction with CSS Selector ===")
|
||||
|
||||
# HTML with multiple tables
|
||||
test_html = """
|
||||
<html>
|
||||
<body>
|
||||
<div class="sidebar">
|
||||
<table role="presentation">
|
||||
<tr><td>Navigation</td></tr>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
<div class="main-content">
|
||||
<table id="data-table">
|
||||
<caption>Quarterly Sales Report</caption>
|
||||
<thead>
|
||||
<tr>
|
||||
<th rowspan="2">Product</th>
|
||||
<th colspan="3">Q1 2024</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Jan</th>
|
||||
<th>Feb</th>
|
||||
<th>Mar</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Widget A</td>
|
||||
<td>100</td>
|
||||
<td>120</td>
|
||||
<td>140</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Widget B</td>
|
||||
<td>200</td>
|
||||
<td>180</td>
|
||||
<td>220</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY"
|
||||
)
|
||||
|
||||
# Focus only on main content area
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
css_selector=".main-content", # Only extract from main content
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url=f"raw:{test_html}",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.tables:
|
||||
table = result.tables[0]
|
||||
print(f"✓ Extracted table: {table.get('caption', 'No caption')}")
|
||||
print(f" Headers: {table['headers']}")
|
||||
print(f" Metadata: {table['metadata']}")
|
||||
|
||||
# The LLM should have handled the rowspan/colspan correctly
|
||||
print("\nProcessed data (rowspan/colspan handled):")
|
||||
for i, row in enumerate(table['rows']):
|
||||
print(f" Row {i+1}: {row}")
|
||||
|
||||
|
||||
# Example 3: Comparing with Default Extraction
|
||||
async def compare_strategies():
|
||||
"""Compare LLM extraction with default extraction on complex tables."""
|
||||
print("\n=== Example 3: Comparing LLM vs Default Extraction ===")
|
||||
|
||||
# Complex table with nested structure
|
||||
complex_html = """
|
||||
<html>
|
||||
<body>
|
||||
<table>
|
||||
<tr>
|
||||
<th rowspan="3">Category</th>
|
||||
<th colspan="2">2023</th>
|
||||
<th colspan="2">2024</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>H1</th>
|
||||
<th>H2</th>
|
||||
<th>H1</th>
|
||||
<th>H2</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="4">All values in millions</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Revenue</td>
|
||||
<td>100</td>
|
||||
<td>120</td>
|
||||
<td>130</td>
|
||||
<td>145</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Profit</td>
|
||||
<td>20</td>
|
||||
<td>25</td>
|
||||
<td>28</td>
|
||||
<td>32</td>
|
||||
</tr>
|
||||
</table>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Test with default extraction
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
|
||||
default_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=3,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config_default = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=default_strategy
|
||||
)
|
||||
|
||||
result_default = await crawler.arun(
|
||||
url=f"raw:{complex_html}",
|
||||
config=config_default
|
||||
)
|
||||
|
||||
# Test with LLM extraction
|
||||
llm_strategy = LLMTableExtraction(
|
||||
llm_config=LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY"
|
||||
),
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config_llm = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=llm_strategy
|
||||
)
|
||||
|
||||
result_llm = await crawler.arun(
|
||||
url=f"raw:{complex_html}",
|
||||
config=config_llm
|
||||
)
|
||||
|
||||
# Compare results
|
||||
print("\nDefault Extraction:")
|
||||
if result_default.tables:
|
||||
table = result_default.tables[0]
|
||||
print(f" Headers: {table.get('headers', [])}")
|
||||
print(f" Rows: {len(table.get('rows', []))}")
|
||||
for i, row in enumerate(table.get('rows', [])[:3]):
|
||||
print(f" Row {i+1}: {row}")
|
||||
|
||||
print("\nLLM Extraction (handles complex structure better):")
|
||||
if result_llm.tables:
|
||||
table = result_llm.tables[0]
|
||||
print(f" Headers: {table.get('headers', [])}")
|
||||
print(f" Rows: {len(table.get('rows', []))}")
|
||||
for i, row in enumerate(table.get('rows', [])):
|
||||
print(f" Row {i+1}: {row}")
|
||||
print(f" Metadata: {table.get('metadata', {})}")
|
||||
|
||||
# Example 4: Batch Processing Multiple Pages
|
||||
async def batch_extraction():
|
||||
"""Extract tables from multiple pages efficiently."""
|
||||
print("\n=== Example 4: Batch Table Extraction ===")
|
||||
|
||||
urls = [
|
||||
"https://www.worldometers.info/geography/alphabetical-list-of-countries/",
|
||||
# "https://en.wikipedia.org/wiki/List_of_chemical_elements",
|
||||
]
|
||||
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
temperature=0.1,
|
||||
max_tokens=1500
|
||||
)
|
||||
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
css_selector="div.datatable-container", # Wikipedia data tables
|
||||
verbose=False,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=5000, # Lower threshold to force chunking
|
||||
min_rows_per_chunk=10,
|
||||
max_parallel_chunks=3
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
all_tables = []
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
for url in urls:
|
||||
print(f"\nProcessing: {url.split('/')[-1][:50]}...")
|
||||
result = await crawler.arun(url=url, config=config)
|
||||
|
||||
if result.success and result.tables:
|
||||
print(f" ✓ Found {len(result.tables)} tables")
|
||||
# Store first table from each page
|
||||
if result.tables:
|
||||
all_tables.append({
|
||||
'url': url,
|
||||
'table': result.tables[0]
|
||||
})
|
||||
|
||||
# Summary
|
||||
print(f"\n=== Summary ===")
|
||||
print(f"Extracted {len(all_tables)} tables from {len(urls)} pages")
|
||||
for item in all_tables:
|
||||
table = item['table']
|
||||
print(f"\nFrom {item['url'].split('/')[-1][:30]}:")
|
||||
print(f" Columns: {len(table['headers'])}")
|
||||
print(f" Rows: {len(table['rows'])}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples."""
|
||||
print("=" * 60)
|
||||
print("LLM TABLE EXTRACTION EXAMPLES")
|
||||
print("=" * 60)
|
||||
|
||||
# Run examples (comment out ones you don't want to run)
|
||||
|
||||
# Basic extraction
|
||||
await basic_llm_extraction()
|
||||
|
||||
# # Focused extraction with CSS
|
||||
# await focused_extraction()
|
||||
|
||||
# # Compare strategies
|
||||
# await compare_strategies()
|
||||
|
||||
# # Batch processing
|
||||
# await batch_extraction()
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("ALL EXAMPLES COMPLETED")
|
||||
print("=" * 60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
276
docs/examples/table_extraction_example.py
Normal file
276
docs/examples/table_extraction_example.py
Normal file
@@ -0,0 +1,276 @@
|
||||
"""
|
||||
Example: Using Table Extraction Strategies in Crawl4AI
|
||||
|
||||
This example demonstrates how to use different table extraction strategies
|
||||
to extract tables from web pages.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import pandas as pd
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
CacheMode,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
TableExtractionStrategy
|
||||
)
|
||||
from typing import Dict, List, Any
|
||||
|
||||
|
||||
async def example_default_extraction():
|
||||
"""Example 1: Using default table extraction (automatic)."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 1: Default Table Extraction")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# No need to specify table_extraction - uses DefaultTableExtraction automatically
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_score_threshold=7 # Adjust sensitivity (default: 7)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.tables:
|
||||
print(f"Found {len(result.tables)} tables")
|
||||
|
||||
# Convert first table to pandas DataFrame
|
||||
if result.tables:
|
||||
first_table = result.tables[0]
|
||||
df = pd.DataFrame(
|
||||
first_table['rows'],
|
||||
columns=first_table['headers'] if first_table['headers'] else None
|
||||
)
|
||||
print(f"\nFirst table preview:")
|
||||
print(df.head())
|
||||
print(f"Shape: {df.shape}")
|
||||
|
||||
|
||||
async def example_custom_configuration():
|
||||
"""Example 2: Custom table extraction configuration."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 2: Custom Table Configuration")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Create custom extraction strategy with specific settings
|
||||
table_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5, # Lower threshold for more permissive detection
|
||||
min_rows=3, # Only extract tables with at least 3 rows
|
||||
min_cols=2, # Only extract tables with at least 2 columns
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy,
|
||||
# Target specific tables using CSS selector
|
||||
css_selector="div.main-content"
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://example.com/data",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Found {len(result.tables)} tables matching criteria")
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
print(f"\nTable {i+1}:")
|
||||
print(f" Caption: {table.get('caption', 'No caption')}")
|
||||
print(f" Size: {table['metadata']['row_count']} rows × {table['metadata']['column_count']} columns")
|
||||
print(f" Has headers: {table['metadata']['has_headers']}")
|
||||
|
||||
|
||||
async def example_disable_extraction():
|
||||
"""Example 3: Disable table extraction when not needed."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 3: Disable Table Extraction")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Use NoTableExtraction to skip table processing entirely
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=NoTableExtraction() # No tables will be extracted
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://example.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Tables extracted: {len(result.tables)} (should be 0)")
|
||||
print("Table extraction disabled - better performance for non-table content")
|
||||
|
||||
|
||||
class FinancialTableExtraction(TableExtractionStrategy):
|
||||
"""
|
||||
Custom strategy for extracting financial tables with specific requirements.
|
||||
"""
|
||||
|
||||
def __init__(self, currency_symbols=None, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
"""Extract only tables that appear to contain financial data."""
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check if table contains currency symbols
|
||||
table_text = ''.join(table.itertext())
|
||||
has_currency = any(symbol in table_text for symbol in self.currency_symbols)
|
||||
|
||||
if not has_currency:
|
||||
continue
|
||||
|
||||
# Extract using base logic (could reuse DefaultTableExtraction logic)
|
||||
headers = []
|
||||
rows = []
|
||||
|
||||
# Extract headers
|
||||
for th in table.xpath(".//thead//th | .//tr[1]//th"):
|
||||
headers.append(th.text_content().strip())
|
||||
|
||||
# Extract rows
|
||||
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
|
||||
row = []
|
||||
for td in tr.xpath(".//td"):
|
||||
cell_text = td.text_content().strip()
|
||||
# Clean currency values
|
||||
for symbol in self.currency_symbols:
|
||||
cell_text = cell_text.replace(symbol, '')
|
||||
row.append(cell_text)
|
||||
if row:
|
||||
rows.append(row)
|
||||
|
||||
if headers or rows:
|
||||
tables_data.append({
|
||||
"headers": headers,
|
||||
"rows": rows,
|
||||
"caption": table.xpath(".//caption/text()")[0] if table.xpath(".//caption") else "",
|
||||
"summary": table.get("summary", ""),
|
||||
"metadata": {
|
||||
"type": "financial",
|
||||
"has_currency": True,
|
||||
"row_count": len(rows),
|
||||
"column_count": len(headers) if headers else len(rows[0]) if rows else 0
|
||||
}
|
||||
})
|
||||
|
||||
return tables_data
|
||||
|
||||
|
||||
async def example_custom_strategy():
|
||||
"""Example 4: Custom table extraction strategy."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 4: Custom Financial Table Strategy")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Use custom strategy for financial tables
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=FinancialTableExtraction(
|
||||
currency_symbols=['$', '€'],
|
||||
verbose=True
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://finance.yahoo.com/",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Found {len(result.tables)} financial tables")
|
||||
|
||||
for table in result.tables:
|
||||
if table['metadata'].get('type') == 'financial':
|
||||
print(f" ✓ Financial table with {table['metadata']['row_count']} rows")
|
||||
|
||||
|
||||
async def example_combined_extraction():
|
||||
"""Example 5: Combine table extraction with other strategies."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 5: Combined Extraction Strategies")
|
||||
print("="*50)
|
||||
|
||||
from crawl4ai import LLMExtractionStrategy, LLMConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Define schema for structured extraction
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"page_title": {"type": "string"},
|
||||
"main_topic": {"type": "string"},
|
||||
"key_figures": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
# Table extraction
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
min_rows=2
|
||||
),
|
||||
# LLM extraction for structured data
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
llm_config=LLMConfig(provider="openai"),
|
||||
schema=schema
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/Economy_of_the_United_States",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Tables found: {len(result.tables)}")
|
||||
|
||||
# Tables are in result.tables
|
||||
if result.tables:
|
||||
print(f"First table has {len(result.tables[0]['rows'])} rows")
|
||||
|
||||
# Structured data is in result.extracted_content
|
||||
if result.extracted_content:
|
||||
import json
|
||||
structured_data = json.loads(result.extracted_content)
|
||||
print(f"Page title: {structured_data.get('page_title', 'N/A')}")
|
||||
print(f"Main topic: {structured_data.get('main_topic', 'N/A')}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples."""
|
||||
print("\n" + "="*60)
|
||||
print("CRAWL4AI TABLE EXTRACTION EXAMPLES")
|
||||
print("="*60)
|
||||
|
||||
# Run examples
|
||||
await example_default_extraction()
|
||||
await example_custom_configuration()
|
||||
await example_disable_extraction()
|
||||
await example_custom_strategy()
|
||||
# await example_combined_extraction() # Requires OpenAI API key
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLES COMPLETED")
|
||||
print("="*60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -20,136 +20,22 @@ Ever wondered why your AI coding assistant struggles with your library despite c
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [Crawl4AI v0.7.3 – The Multi-Config Intelligence Update](releases/0.7.3.md)
|
||||
*August 6, 2025*
|
||||
### [Crawl4AI v0.7.4 – The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
|
||||
*August 17, 2025*
|
||||
|
||||
Crawl4AI v0.7.3 brings smarter URL-specific configurations, flexible Docker deployments, and critical stability improvements. Configure different crawling strategies for different URL patterns in a single batch—perfect for mixed content sites with docs, blogs, and APIs.
|
||||
Crawl4AI v0.7.4 introduces revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
|
||||
|
||||
Key highlights:
|
||||
- **Multi-URL Configurations**: Different strategies for different URL patterns in one crawl
|
||||
- **Flexible Docker LLM Providers**: Configure providers via environment variables
|
||||
- **Bug Fixes**: Critical stability improvements for production deployments
|
||||
- **Documentation Updates**: Clearer examples and improved API documentation
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing issue in arun_many for fast-completing tasks
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
|
||||
[Read full release notes →](releases/0.7.3.md)
|
||||
[Read full release notes →](../blog/release-v0.7.4.md)
|
||||
|
||||
---
|
||||
|
||||
## Previous Releases
|
||||
|
||||
### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md)
|
||||
*January 28, 2025*
|
||||
|
||||
Introduced groundbreaking intelligence features including Adaptive Crawling, Virtual Scroll support, intelligent Link Preview, and the Async URL Seeder for massive URL discovery.
|
||||
|
||||
[Read release notes →](releases/0.7.0.md)
|
||||
|
||||
### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
|
||||
*December 23, 2024*
|
||||
|
||||
Crawl4AI v0.6.0 brought major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
|
||||
|
||||
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
|
||||
|
||||
Other key changes:
|
||||
|
||||
* Native support for `result.media["tables"]` to export DataFrames
|
||||
* Full network + console logs and MHTML snapshot per crawl
|
||||
* Browser pooling and pre-warming for faster cold starts
|
||||
* New streaming endpoints via MCP API and Playground
|
||||
* Robots.txt support, proxy rotation, and improved session handling
|
||||
* Deprecated old markdown names, legacy modules cleaned up
|
||||
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
|
||||
|
||||
[Read full release notes →](releases/0.6.0.md)
|
||||
|
||||
---
|
||||
|
||||
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
|
||||
|
||||
My dear friends and crawlers, there you go, this is the release of Crawl4AI v0.5.0! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:
|
||||
|
||||
**Major New Features:**
|
||||
|
||||
* **Deep Crawling:** Explore entire websites with configurable strategies (BFS, DFS, Best-First). Define custom filters and URL scoring for targeted crawls.
|
||||
* **Memory-Adaptive Dispatcher:** Handle large-scale crawls with ease! Our new dispatcher dynamically adjusts concurrency based on available memory and includes built-in rate limiting.
|
||||
* **Multiple Crawler Strategies:** Choose between the full-featured Playwright browser-based crawler or a new, *much* faster HTTP-only crawler for simpler tasks.
|
||||
* **Docker Deployment:** Deploy Crawl4AI as a scalable, self-contained service with built-in API endpoints and optional JWT authentication.
|
||||
* **Command-Line Interface (CLI):** Interact with Crawl4AI directly from your terminal. Crawl, configure, and extract data with simple commands.
|
||||
* **LLM Configuration (`LLMConfig`):** A new, unified way to configure LLM providers (OpenAI, Anthropic, Ollama, etc.) for extraction, filtering, and schema generation. Simplifies API key management and switching between models.
|
||||
|
||||
**Minor Updates & Improvements:**
|
||||
|
||||
* **LXML Scraping Mode:** Faster HTML parsing with `LXMLWebScrapingStrategy`.
|
||||
* **Proxy Rotation:** Added `ProxyRotationStrategy` with a `RoundRobinProxyStrategy` implementation.
|
||||
* **PDF Processing:** Extract text, images, and metadata from PDF files.
|
||||
* **URL Redirection Tracking:** Automatically follows and records redirects.
|
||||
* **Robots.txt Compliance:** Optionally respect website crawling rules.
|
||||
* **LLM-Powered Schema Generation:** Automatically create extraction schemas using an LLM.
|
||||
* **`LLMContentFilter`:** Generate high-quality, focused markdown using an LLM.
|
||||
* **Improved Error Handling & Stability:** Numerous bug fixes and performance enhancements.
|
||||
* **Enhanced Documentation:** Updated guides and examples.
|
||||
|
||||
**Breaking Changes & Migration:**
|
||||
|
||||
This release includes several breaking changes to improve the library's structure and consistency. Here's what you need to know:
|
||||
|
||||
* **`arun_many()` Behavior:** Now uses the `MemoryAdaptiveDispatcher` by default. The return type depends on the `stream` parameter in `CrawlerRunConfig`. Adjust code that relied on unbounded concurrency.
|
||||
* **`max_depth` Location:** Moved to `CrawlerRunConfig` and now controls *crawl depth*.
|
||||
* **Deep Crawling Imports:** Import `DeepCrawlStrategy` and related classes from `crawl4ai.deep_crawling`.
|
||||
* **`BrowserContext` API:** Updated; the old `get_context` method is deprecated.
|
||||
* **Optional Model Fields:** Many data model fields are now optional. Handle potential `None` values.
|
||||
* **`ScrapingMode` Enum:** Replaced with strategy pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
|
||||
* **`content_filter` Parameter:** Removed from `CrawlerRunConfig`. Use extraction strategies or markdown generators with filters.
|
||||
* **Removed Functionality:** The synchronous `WebCrawler`, the old CLI, and docs management tools have been removed.
|
||||
* **Docker:** Significant changes to deployment. See the [Docker documentation](../deploy/docker/README.md).
|
||||
* **`ssl_certificate.json`:** This file has been removed.
|
||||
* **Config**: FastFilterChain has been replaced with FilterChain
|
||||
* **Deep-Crawl**: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
|
||||
* **Proxy**: Removed synchronous WebCrawler support and related rate limiting configurations
|
||||
* **LLM Parameters:** Use the new `LLMConfig` object instead of passing `provider`, `api_token`, `base_url`, and `api_base` directly to `LLMExtractionStrategy` and `LLMContentFilter`.
|
||||
|
||||
**In short:** Update imports, adjust `arun_many()` usage, check for optional fields, and review the Docker deployment guide.
|
||||
|
||||
## License Change
|
||||
|
||||
Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution. See the updated `LICENSE` file for the full legal text and specific requirements.
|
||||
|
||||
**Get Started:**
|
||||
|
||||
* **Installation:** `pip install "crawl4ai[all]"` (or use the Docker image)
|
||||
* **Documentation:** [https://docs.crawl4ai.com](https://docs.crawl4ai.com)
|
||||
* **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
|
||||
I'm very excited to see what you build with Crawl4AI v0.5.0!
|
||||
|
||||
---
|
||||
|
||||
### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
|
||||
*December 12, 2024*
|
||||
|
||||
The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
|
||||
|
||||
[Read full release notes →](releases/0.4.2.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
|
||||
*December 8, 2024*
|
||||
|
||||
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
|
||||
|
||||
[Read full release notes →](releases/0.4.1.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
|
||||
*December 1, 2024*
|
||||
|
||||
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
|
||||
[Read full release notes →](releases/0.4.0.md)
|
||||
|
||||
## Project History
|
||||
|
||||
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
|
||||
|
||||
807
docs/md_v2/core/table_extraction.md
Normal file
807
docs/md_v2/core/table_extraction.md
Normal file
@@ -0,0 +1,807 @@
|
||||
# Table Extraction Strategies
|
||||
|
||||
## Overview
|
||||
|
||||
**New in v0.7.3+**: Table extraction now follows the **Strategy Design Pattern**, providing unprecedented flexibility and power for handling different table structures. Don't worry - **your existing code still works!** We maintain full backward compatibility while offering new capabilities.
|
||||
|
||||
### What's Changed?
|
||||
- **Architecture**: Table extraction now uses pluggable strategies
|
||||
- **Backward Compatible**: Your existing code with `table_score_threshold` continues to work
|
||||
- **More Power**: Choose from multiple strategies or create your own
|
||||
- **Same Default Behavior**: By default, uses `DefaultTableExtraction` (same as before)
|
||||
|
||||
### Key Points
|
||||
✅ **Old code still works** - No breaking changes
|
||||
✅ **Same default behavior** - Uses the proven extraction algorithm
|
||||
✅ **New capabilities** - Add LLM extraction or custom strategies when needed
|
||||
✅ **Strategy pattern** - Clean, extensible architecture
|
||||
|
||||
## Quick Start
|
||||
|
||||
### The Simplest Way (Works Like Before)
|
||||
|
||||
If you're already using Crawl4AI, nothing changes:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async def extract_tables():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# This works exactly like before - uses DefaultTableExtraction internally
|
||||
result = await crawler.arun("https://example.com/data")
|
||||
|
||||
# Tables are automatically extracted and available in result.tables
|
||||
for table in result.tables:
|
||||
print(f"Table with {len(table['rows'])} rows and {len(table['headers'])} columns")
|
||||
print(f"Headers: {table['headers']}")
|
||||
print(f"First row: {table['rows'][0] if table['rows'] else 'No data'}")
|
||||
|
||||
asyncio.run(extract_tables())
|
||||
```
|
||||
|
||||
### Using the Old Configuration (Still Supported)
|
||||
|
||||
Your existing code with `table_score_threshold` continues to work:
|
||||
|
||||
```python
|
||||
# This old approach STILL WORKS - we maintain backward compatibility
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7 # Internally creates DefaultTableExtraction(table_score_threshold=7)
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
```
|
||||
|
||||
## Table Extraction Strategies
|
||||
|
||||
### Understanding the Strategy Pattern
|
||||
|
||||
The strategy pattern allows you to choose different table extraction algorithms at runtime. Think of it as having different tools in a toolbox - you pick the right one for the job:
|
||||
|
||||
- **No explicit strategy?** → Uses `DefaultTableExtraction` automatically (same as v0.7.2 and earlier)
|
||||
- **Need complex table handling?** → Choose `LLMTableExtraction` (costs money, use sparingly)
|
||||
- **Want to disable tables?** → Use `NoTableExtraction`
|
||||
- **Have special requirements?** → Create a custom strategy
|
||||
|
||||
### Available Strategies
|
||||
|
||||
| Strategy | Description | Use Case | Cost | When to Use |
|
||||
|----------|-------------|----------|------|-------------|
|
||||
| `DefaultTableExtraction` | **RECOMMENDED**: Same algorithm as before v0.7.3 | General purpose (default) | Free | **Use this first - handles 95% of cases** |
|
||||
| `LLMTableExtraction` | AI-powered extraction for complex tables | Tables with complex rowspan/colspan | **$$$ Per API call** | Only when DefaultTableExtraction fails |
|
||||
| `NoTableExtraction` | Disables table extraction | When tables aren't needed | Free | For text-only extraction |
|
||||
| Custom strategies | User-defined extraction logic | Specialized requirements | Free | Domain-specific needs |
|
||||
|
||||
> **⚠️ CRITICAL COST WARNING for LLMTableExtraction**:
|
||||
>
|
||||
> **DO NOT USE `LLMTableExtraction` UNLESS ABSOLUTELY NECESSARY!**
|
||||
>
|
||||
> - **Always try `DefaultTableExtraction` first** - It's free and handles most tables perfectly
|
||||
> - LLM extraction **costs money** with every API call
|
||||
> - For large tables (100+ rows), LLM extraction can be **very slow**
|
||||
> - **For large tables**: If you must use LLM, choose fast providers:
|
||||
> - ✅ **Groq** (fastest inference)
|
||||
> - ✅ **Cerebras** (optimized for speed)
|
||||
> - ⚠️ Avoid: OpenAI, Anthropic for large tables (slower)
|
||||
>
|
||||
> **🚧 WORK IN PROGRESS**:
|
||||
> We are actively developing an **advanced non-LLM algorithm** that will handle complex table structures (rowspan, colspan, nested tables) for **FREE**. This will replace the need for costly LLM extraction in most cases. Coming soon!
|
||||
|
||||
### DefaultTableExtraction
|
||||
|
||||
The default strategy uses a sophisticated scoring system to identify data tables:
|
||||
|
||||
```python
|
||||
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
|
||||
|
||||
# Customize the default extraction
|
||||
table_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7, # Scoring threshold (default: 7)
|
||||
min_rows=2, # Minimum rows required
|
||||
min_cols=2, # Minimum columns required
|
||||
verbose=True # Enable detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
```
|
||||
|
||||
#### Scoring System
|
||||
|
||||
The scoring system evaluates multiple factors:
|
||||
|
||||
| Factor | Score Impact | Description |
|
||||
|--------|--------------|-------------|
|
||||
| Has `<thead>` | +2 | Semantic table structure |
|
||||
| Has `<tbody>` | +1 | Organized table body |
|
||||
| Has `<th>` elements | +2 | Header cells present |
|
||||
| Headers in correct position | +1 | Proper semantic structure |
|
||||
| Consistent column count | +2 | Regular data structure |
|
||||
| Has caption | +2 | Descriptive caption |
|
||||
| Has summary | +1 | Summary attribute |
|
||||
| High text density | +2 to +3 | Content-rich cells |
|
||||
| Data attributes | +0.5 each | Data-* attributes |
|
||||
| Nested tables | -3 | Often indicates layout |
|
||||
| Role="presentation" | -3 | Explicitly non-data |
|
||||
| Too few rows | -2 | Insufficient data |
|
||||
|
||||
### LLMTableExtraction (Use Sparingly!)
|
||||
|
||||
**⚠️ WARNING**: Only use this when `DefaultTableExtraction` fails with complex tables!
|
||||
|
||||
LLMTableExtraction uses AI to understand complex table structures that traditional parsers struggle with. It automatically handles large tables through intelligent chunking and parallel processing:
|
||||
|
||||
```python
|
||||
from crawl4ai import LLMTableExtraction, LLMConfig, CrawlerRunConfig
|
||||
|
||||
# Configure LLM (costs money per call!)
|
||||
llm_config = LLMConfig(
|
||||
provider="groq/llama-3.3-70b-versatile", # Fast provider for large tables
|
||||
api_token="your_api_key",
|
||||
temperature=0.1
|
||||
)
|
||||
|
||||
# Create LLM extraction strategy with smart chunking
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
max_tries=3, # Retry up to 3 times if extraction fails
|
||||
css_selector="table", # Optional: focus on specific tables
|
||||
enable_chunking=True, # Automatically chunk large tables (default: True)
|
||||
chunk_token_threshold=3000, # Split tables larger than this (default: 3000 tokens)
|
||||
min_rows_per_chunk=10, # Minimum rows per chunk (default: 10)
|
||||
max_parallel_chunks=5, # Process up to 5 chunks in parallel (default: 5)
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
```
|
||||
|
||||
#### When to Use LLMTableExtraction
|
||||
|
||||
✅ **Use ONLY when**:
|
||||
- Tables have complex merged cells (rowspan/colspan) that break DefaultTableExtraction
|
||||
- Nested tables that need semantic understanding
|
||||
- Tables with irregular structures
|
||||
- You've tried DefaultTableExtraction and it failed
|
||||
|
||||
❌ **Never use when**:
|
||||
- DefaultTableExtraction works (99% of cases)
|
||||
- Tables are simple or well-structured
|
||||
- You're processing many pages (costs add up!)
|
||||
- Tables have 100+ rows (very slow)
|
||||
|
||||
#### How Smart Chunking Works
|
||||
|
||||
LLMTableExtraction automatically handles large tables through intelligent chunking:
|
||||
|
||||
1. **Automatic Detection**: Tables exceeding the token threshold are automatically split
|
||||
2. **Smart Splitting**: Chunks are created at row boundaries, preserving table structure
|
||||
3. **Header Preservation**: Each chunk includes the original headers for context
|
||||
4. **Parallel Processing**: Multiple chunks are processed simultaneously for speed
|
||||
5. **Intelligent Merging**: Results are merged back into a single, complete table
|
||||
|
||||
**Chunking Parameters**:
|
||||
- `enable_chunking` (default: `True`): Automatically handle large tables
|
||||
- `chunk_token_threshold` (default: `3000`): When to split tables
|
||||
- `min_rows_per_chunk` (default: `10`): Ensures meaningful chunk sizes
|
||||
- `max_parallel_chunks` (default: `5`): Concurrent processing for speed
|
||||
|
||||
The chunking is completely transparent - you get the same output format whether the table was processed in one piece or multiple chunks.
|
||||
|
||||
#### Performance Optimization for LLMTableExtraction
|
||||
|
||||
**Provider Recommendations by Table Size**:
|
||||
|
||||
| Table Size | Recommended Providers | Why |
|
||||
|------------|----------------------|-----|
|
||||
| Small (<50 rows) | Any provider | Fast enough |
|
||||
| Medium (50-200 rows) | Groq, Cerebras | Optimized inference |
|
||||
| Large (200+ rows) | **Groq** (best), Cerebras | Fastest inference + automatic chunking |
|
||||
| Very Large (500+ rows) | Groq with chunking | Parallel processing keeps it fast |
|
||||
|
||||
### NoTableExtraction
|
||||
|
||||
Disable table extraction for better performance when tables aren't needed:
|
||||
|
||||
```python
|
||||
from crawl4ai import NoTableExtraction, CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
|
||||
# Tables won't be extracted, improving performance
|
||||
result = await crawler.arun(url, config)
|
||||
assert len(result.tables) == 0
|
||||
```
|
||||
|
||||
## Extracted Table Structure
|
||||
|
||||
Each extracted table contains:
|
||||
|
||||
```python
|
||||
{
|
||||
"headers": ["Column 1", "Column 2", ...], # Column headers
|
||||
"rows": [ # Data rows
|
||||
["Row 1 Col 1", "Row 1 Col 2", ...],
|
||||
["Row 2 Col 1", "Row 2 Col 2", ...],
|
||||
],
|
||||
"caption": "Table Caption", # If present
|
||||
"summary": "Table Summary", # If present
|
||||
"metadata": {
|
||||
"row_count": 10, # Number of rows
|
||||
"column_count": 3, # Number of columns
|
||||
"has_headers": True, # Headers detected
|
||||
"has_caption": True, # Caption exists
|
||||
"has_summary": False, # Summary exists
|
||||
"id": "data-table-1", # Table ID if present
|
||||
"class": "financial-data" # Table class if present
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Configuration Options
|
||||
|
||||
### Basic Configuration
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
# Table extraction settings
|
||||
table_score_threshold=7, # Default threshold (backward compatible)
|
||||
table_extraction=strategy, # Optional: custom strategy
|
||||
|
||||
# Filter what to process
|
||||
css_selector="main", # Focus on specific area
|
||||
excluded_tags=["nav", "aside"] # Exclude page sections
|
||||
)
|
||||
```
|
||||
|
||||
### Advanced Configuration
|
||||
|
||||
```python
|
||||
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
|
||||
|
||||
# Fine-tuned extraction
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5, # Lower = more permissive
|
||||
min_rows=3, # Require at least 3 rows
|
||||
min_cols=2, # Require at least 2 columns
|
||||
verbose=True # Detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy,
|
||||
css_selector="article.content", # Target specific content
|
||||
exclude_domains=["ads.com"], # Exclude ad domains
|
||||
cache_mode=CacheMode.BYPASS # Fresh extraction
|
||||
)
|
||||
```
|
||||
|
||||
## Working with Extracted Tables
|
||||
|
||||
### Convert to Pandas DataFrame
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
async def tables_to_dataframes(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url)
|
||||
|
||||
dataframes = []
|
||||
for table_data in result.tables:
|
||||
# Create DataFrame
|
||||
if table_data['headers']:
|
||||
df = pd.DataFrame(
|
||||
table_data['rows'],
|
||||
columns=table_data['headers']
|
||||
)
|
||||
else:
|
||||
df = pd.DataFrame(table_data['rows'])
|
||||
|
||||
# Add metadata as DataFrame attributes
|
||||
df.attrs['caption'] = table_data.get('caption', '')
|
||||
df.attrs['metadata'] = table_data.get('metadata', {})
|
||||
|
||||
dataframes.append(df)
|
||||
|
||||
return dataframes
|
||||
```
|
||||
|
||||
### Filter Tables by Criteria
|
||||
|
||||
```python
|
||||
async def extract_large_tables(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Configure minimum size requirements
|
||||
strategy = DefaultTableExtraction(
|
||||
min_rows=10,
|
||||
min_cols=3,
|
||||
table_score_threshold=6
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Further filter results
|
||||
large_tables = [
|
||||
table for table in result.tables
|
||||
if table['metadata']['row_count'] > 10
|
||||
and table['metadata']['column_count'] > 3
|
||||
]
|
||||
|
||||
return large_tables
|
||||
```
|
||||
|
||||
### Export Tables to Different Formats
|
||||
|
||||
```python
|
||||
import json
|
||||
import csv
|
||||
|
||||
async def export_tables(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url)
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
# Export as JSON
|
||||
with open(f'table_{i}.json', 'w') as f:
|
||||
json.dump(table, f, indent=2)
|
||||
|
||||
# Export as CSV
|
||||
with open(f'table_{i}.csv', 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
if table['headers']:
|
||||
writer.writerow(table['headers'])
|
||||
writer.writerows(table['rows'])
|
||||
|
||||
# Export as Markdown
|
||||
with open(f'table_{i}.md', 'w') as f:
|
||||
# Write headers
|
||||
if table['headers']:
|
||||
f.write('| ' + ' | '.join(table['headers']) + ' |\n')
|
||||
f.write('|' + '---|' * len(table['headers']) + '\n')
|
||||
|
||||
# Write rows
|
||||
for row in table['rows']:
|
||||
f.write('| ' + ' | '.join(str(cell) for cell in row) + ' |\n')
|
||||
```
|
||||
|
||||
## Creating Custom Strategies
|
||||
|
||||
Extend `TableExtractionStrategy` to create custom extraction logic:
|
||||
|
||||
### Example: Financial Table Extractor
|
||||
|
||||
```python
|
||||
from crawl4ai import TableExtractionStrategy
|
||||
from typing import List, Dict, Any
|
||||
import re
|
||||
|
||||
class FinancialTableExtractor(TableExtractionStrategy):
|
||||
"""Extract tables containing financial data."""
|
||||
|
||||
def __init__(self, currency_symbols=None, require_numbers=True, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
|
||||
self.require_numbers = require_numbers
|
||||
self.number_pattern = re.compile(r'\d+[,.]?\d*')
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check if table contains financial indicators
|
||||
table_text = ''.join(table.itertext())
|
||||
|
||||
# Must contain currency symbols
|
||||
has_currency = any(sym in table_text for sym in self.currency_symbols)
|
||||
if not has_currency:
|
||||
continue
|
||||
|
||||
# Must contain numbers if required
|
||||
if self.require_numbers:
|
||||
numbers = self.number_pattern.findall(table_text)
|
||||
if len(numbers) < 3: # Arbitrary minimum
|
||||
continue
|
||||
|
||||
# Extract the table data
|
||||
table_data = self._extract_financial_data(table)
|
||||
if table_data:
|
||||
tables_data.append(table_data)
|
||||
|
||||
return tables_data
|
||||
|
||||
def _extract_financial_data(self, table):
|
||||
"""Extract and clean financial data from table."""
|
||||
headers = []
|
||||
rows = []
|
||||
|
||||
# Extract headers
|
||||
for th in table.xpath(".//thead//th | .//tr[1]//th"):
|
||||
headers.append(th.text_content().strip())
|
||||
|
||||
# Extract and clean rows
|
||||
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
|
||||
row = []
|
||||
for td in tr.xpath(".//td"):
|
||||
text = td.text_content().strip()
|
||||
# Clean currency formatting
|
||||
text = re.sub(r'[$€£¥,]', '', text)
|
||||
row.append(text)
|
||||
if row:
|
||||
rows.append(row)
|
||||
|
||||
return {
|
||||
"headers": headers,
|
||||
"rows": rows,
|
||||
"caption": self._get_caption(table),
|
||||
"summary": table.get("summary", ""),
|
||||
"metadata": {
|
||||
"type": "financial",
|
||||
"row_count": len(rows),
|
||||
"column_count": len(headers) or len(rows[0]) if rows else 0
|
||||
}
|
||||
}
|
||||
|
||||
def _get_caption(self, table):
|
||||
caption = table.xpath(".//caption/text()")
|
||||
return caption[0].strip() if caption else ""
|
||||
|
||||
# Usage
|
||||
strategy = FinancialTableExtractor(
|
||||
currency_symbols=['$', 'EUR'],
|
||||
require_numbers=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Example: Specific Table Extractor
|
||||
|
||||
```python
|
||||
class SpecificTableExtractor(TableExtractionStrategy):
|
||||
"""Extract only tables matching specific criteria."""
|
||||
|
||||
def __init__(self,
|
||||
required_headers=None,
|
||||
id_pattern=None,
|
||||
class_pattern=None,
|
||||
**kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.required_headers = required_headers or []
|
||||
self.id_pattern = id_pattern
|
||||
self.class_pattern = class_pattern
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check ID pattern
|
||||
if self.id_pattern:
|
||||
table_id = table.get('id', '')
|
||||
if not re.match(self.id_pattern, table_id):
|
||||
continue
|
||||
|
||||
# Check class pattern
|
||||
if self.class_pattern:
|
||||
table_class = table.get('class', '')
|
||||
if not re.match(self.class_pattern, table_class):
|
||||
continue
|
||||
|
||||
# Extract headers to check requirements
|
||||
headers = self._extract_headers(table)
|
||||
|
||||
# Check if required headers are present
|
||||
if self.required_headers:
|
||||
if not all(req in headers for req in self.required_headers):
|
||||
continue
|
||||
|
||||
# Extract full table data
|
||||
table_data = self._extract_table_data(table, headers)
|
||||
tables_data.append(table_data)
|
||||
|
||||
return tables_data
|
||||
```
|
||||
|
||||
## Combining with Other Strategies
|
||||
|
||||
Table extraction works seamlessly with other Crawl4AI strategies:
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
DefaultTableExtraction,
|
||||
LLMExtractionStrategy,
|
||||
JsonCssExtractionStrategy
|
||||
)
|
||||
|
||||
async def combined_extraction(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
# Table extraction
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
min_rows=2
|
||||
),
|
||||
|
||||
# CSS-based extraction for specific elements
|
||||
extraction_strategy=JsonCssExtractionStrategy({
|
||||
"title": "h1",
|
||||
"summary": "p.summary",
|
||||
"date": "time"
|
||||
}),
|
||||
|
||||
# Focus on main content
|
||||
css_selector="main.content"
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Access different extraction results
|
||||
tables = result.tables # Table data
|
||||
structured = json.loads(result.extracted_content) # CSS extraction
|
||||
|
||||
return {
|
||||
"tables": tables,
|
||||
"structured_data": structured,
|
||||
"markdown": result.markdown
|
||||
}
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Optimization Tips
|
||||
|
||||
1. **Disable when not needed**: Use `NoTableExtraction` if tables aren't required
|
||||
2. **Target specific areas**: Use `css_selector` to limit processing scope
|
||||
3. **Set minimum thresholds**: Filter out small/irrelevant tables early
|
||||
4. **Cache results**: Use appropriate cache modes for repeated extractions
|
||||
|
||||
```python
|
||||
# Optimized configuration for large pages
|
||||
config = CrawlerRunConfig(
|
||||
# Only process main content area
|
||||
css_selector="article.main-content",
|
||||
|
||||
# Exclude navigation and sidebars
|
||||
excluded_tags=["nav", "aside", "footer"],
|
||||
|
||||
# Higher threshold for stricter filtering
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=8,
|
||||
min_rows=5,
|
||||
min_cols=3
|
||||
),
|
||||
|
||||
# Enable caching for repeated access
|
||||
cache_mode=CacheMode.ENABLED
|
||||
)
|
||||
```
|
||||
|
||||
## Migration Guide
|
||||
|
||||
### Important: Your Code Still Works!
|
||||
|
||||
**No changes required!** The transition to the strategy pattern is **fully backward compatible**.
|
||||
|
||||
### How It Works Internally
|
||||
|
||||
#### v0.7.2 and Earlier
|
||||
```python
|
||||
# Old way - directly passing table_score_threshold
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
# Internally: No strategy pattern, direct implementation
|
||||
```
|
||||
|
||||
#### v0.7.3+ (Current)
|
||||
```python
|
||||
# Old way STILL WORKS - we handle it internally
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
# Internally: Automatically creates DefaultTableExtraction(table_score_threshold=7)
|
||||
```
|
||||
|
||||
### Taking Advantage of New Features
|
||||
|
||||
While your old code works, you can now use the strategy pattern for more control:
|
||||
|
||||
```python
|
||||
# Option 1: Keep using the old way (perfectly fine!)
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7 # Still supported
|
||||
)
|
||||
|
||||
# Option 2: Use the new strategy pattern (more flexibility)
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
min_rows=2, # New capability!
|
||||
min_cols=2 # New capability!
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
|
||||
# Option 3: Use advanced strategies when needed
|
||||
from crawl4ai import LLMTableExtraction, LLMConfig
|
||||
|
||||
# Only for complex tables that DefaultTableExtraction can't handle
|
||||
# Automatically handles large tables with smart chunking
|
||||
llm_strategy = LLMTableExtraction(
|
||||
llm_config=LLMConfig(
|
||||
provider="groq/llama-3.3-70b-versatile",
|
||||
api_token="your_key"
|
||||
),
|
||||
max_tries=3,
|
||||
enable_chunking=True, # Automatically chunk large tables
|
||||
chunk_token_threshold=3000, # Chunk when exceeding 3000 tokens
|
||||
max_parallel_chunks=5 # Process up to 5 chunks in parallel
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=llm_strategy # Advanced extraction with automatic chunking
|
||||
)
|
||||
```
|
||||
|
||||
### Summary
|
||||
|
||||
- ✅ **No breaking changes** - Old code works as-is
|
||||
- ✅ **Same defaults** - DefaultTableExtraction is automatically used
|
||||
- ✅ **Gradual adoption** - Use new features when you need them
|
||||
- ✅ **Full compatibility** - result.tables structure unchanged
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Choose the Right Strategy (Cost-Conscious Approach)
|
||||
|
||||
**Decision Flow**:
|
||||
```
|
||||
1. Do you need tables?
|
||||
→ No: Use NoTableExtraction
|
||||
→ Yes: Continue to #2
|
||||
|
||||
2. Try DefaultTableExtraction first (FREE)
|
||||
→ Works? Done! ✅
|
||||
→ Fails? Continue to #3
|
||||
|
||||
3. Is the table critical and complex?
|
||||
→ No: Accept DefaultTableExtraction results
|
||||
→ Yes: Continue to #4
|
||||
|
||||
4. Use LLMTableExtraction (COSTS MONEY)
|
||||
→ Small table (<50 rows): Any LLM provider
|
||||
→ Large table (50+ rows): Use Groq or Cerebras
|
||||
→ Very large (500+ rows): Reconsider - maybe chunk the page
|
||||
```
|
||||
|
||||
**Strategy Selection Guide**:
|
||||
- **DefaultTableExtraction**: Use for 99% of cases - it's free and effective
|
||||
- **LLMTableExtraction**: Only for complex tables with merged cells that break DefaultTableExtraction
|
||||
- **NoTableExtraction**: When you only need text/markdown content
|
||||
- **Custom Strategy**: For specialized requirements (financial, scientific, etc.)
|
||||
|
||||
### 2. Validate Extracted Data
|
||||
|
||||
```python
|
||||
def validate_table(table):
|
||||
"""Validate table data quality."""
|
||||
# Check structure
|
||||
if not table.get('rows'):
|
||||
return False
|
||||
|
||||
# Check consistency
|
||||
if table.get('headers'):
|
||||
expected_cols = len(table['headers'])
|
||||
for row in table['rows']:
|
||||
if len(row) != expected_cols:
|
||||
return False
|
||||
|
||||
# Check minimum content
|
||||
total_cells = sum(len(row) for row in table['rows'])
|
||||
non_empty = sum(1 for row in table['rows']
|
||||
for cell in row if cell.strip())
|
||||
|
||||
if non_empty / total_cells < 0.5: # Less than 50% non-empty
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
# Filter valid tables
|
||||
valid_tables = [t for t in result.tables if validate_table(t)]
|
||||
```
|
||||
|
||||
### 3. Handle Edge Cases
|
||||
|
||||
```python
|
||||
async def robust_table_extraction(url):
|
||||
"""Extract tables with error handling."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
try:
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
verbose=True
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
if not result.success:
|
||||
print(f"Crawl failed: {result.error}")
|
||||
return []
|
||||
|
||||
# Process tables safely
|
||||
processed_tables = []
|
||||
for table in result.tables:
|
||||
try:
|
||||
# Validate and process
|
||||
if validate_table(table):
|
||||
processed_tables.append(table)
|
||||
except Exception as e:
|
||||
print(f"Error processing table: {e}")
|
||||
continue
|
||||
|
||||
return processed_tables
|
||||
|
||||
except Exception as e:
|
||||
print(f"Extraction error: {e}")
|
||||
return []
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues and Solutions
|
||||
|
||||
| Issue | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| No tables extracted | Score too high | Lower `table_score_threshold` |
|
||||
| Layout tables included | Score too low | Increase `table_score_threshold` |
|
||||
| Missing tables | CSS selector too specific | Broaden or remove `css_selector` |
|
||||
| Incomplete data | Complex table structure | Create custom strategy |
|
||||
| Performance issues | Processing entire page | Use `css_selector` to limit scope |
|
||||
|
||||
### Debug Logging
|
||||
|
||||
Enable verbose logging to understand extraction decisions:
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(level=logging.DEBUG)
|
||||
|
||||
# Enable verbose mode in strategy
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
verbose=True # Detailed extraction logs
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy,
|
||||
verbose=True # General crawler logs
|
||||
)
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [Extraction Strategies](extraction-strategies.md) - Overview of all extraction strategies
|
||||
- [Content Selection](content-selection.md) - Using CSS selectors and filters
|
||||
- [Performance Optimization](../optimization/performance-tuning.md) - Speed up extraction
|
||||
- [Examples](../examples/table_extraction_example.py) - Complete working examples
|
||||
242
docs/md_v2/core/telemetry.md
Normal file
242
docs/md_v2/core/telemetry.md
Normal file
@@ -0,0 +1,242 @@
|
||||
# Telemetry
|
||||
|
||||
Crawl4AI includes **opt-in telemetry** to help improve stability by capturing anonymous crash reports. No personal data or crawled content is ever collected.
|
||||
|
||||
!!! info "Privacy First"
|
||||
Telemetry is completely optional and respects your privacy. Only exception information is collected - no URLs, no personal data, no crawled content.
|
||||
|
||||
## Overview
|
||||
|
||||
- **Privacy-first**: Only exceptions and crashes are reported
|
||||
- **Opt-in by default**: You control when telemetry is enabled (except in Docker where it's on by default)
|
||||
- **No PII**: No URLs, request data, or personal information is collected
|
||||
- **Provider-agnostic**: Currently uses Sentry, but designed to support multiple backends
|
||||
|
||||
## Installation
|
||||
|
||||
Telemetry requires the optional Sentry SDK:
|
||||
|
||||
```bash
|
||||
# Install with telemetry support
|
||||
pip install crawl4ai[telemetry]
|
||||
|
||||
# Or install Sentry SDK separately
|
||||
pip install sentry-sdk>=2.0.0
|
||||
```
|
||||
|
||||
## Environments
|
||||
|
||||
### 1. Python Library & CLI
|
||||
|
||||
On first exception, you'll see an interactive prompt:
|
||||
|
||||
```
|
||||
🚨 Crawl4AI Error Detection
|
||||
==============================================================
|
||||
We noticed an error occurred. Help improve Crawl4AI by
|
||||
sending anonymous crash reports?
|
||||
|
||||
[1] Yes, send this error only
|
||||
[2] Yes, always send errors
|
||||
[3] No, don't send
|
||||
|
||||
Your choice (1/2/3):
|
||||
```
|
||||
|
||||
Control via CLI:
|
||||
```bash
|
||||
# Enable telemetry
|
||||
crwl telemetry enable
|
||||
crwl telemetry enable --email you@example.com
|
||||
|
||||
# Disable telemetry
|
||||
crwl telemetry disable
|
||||
|
||||
# Check status
|
||||
crwl telemetry status
|
||||
```
|
||||
|
||||
### 2. Docker / API Server
|
||||
|
||||
!!! warning "Default Enabled in Docker"
|
||||
Telemetry is **enabled by default** in Docker environments to help identify container-specific issues. This is different from the CLI where it's opt-in.
|
||||
|
||||
To disable:
|
||||
```bash
|
||||
# Via environment variable
|
||||
docker run -e CRAWL4AI_TELEMETRY=0 ...
|
||||
|
||||
# In docker-compose.yml
|
||||
environment:
|
||||
- CRAWL4AI_TELEMETRY=0
|
||||
```
|
||||
|
||||
### 3. Jupyter / Google Colab
|
||||
|
||||
In notebooks, you'll see an interactive widget (if available) or a code snippet:
|
||||
|
||||
```python
|
||||
import crawl4ai
|
||||
|
||||
# Enable telemetry
|
||||
crawl4ai.telemetry.enable(email="you@example.com", always=True)
|
||||
|
||||
# Send only next error
|
||||
crawl4ai.telemetry.enable(once=True)
|
||||
|
||||
# Disable telemetry
|
||||
crawl4ai.telemetry.disable()
|
||||
|
||||
# Check status
|
||||
crawl4ai.telemetry.status()
|
||||
```
|
||||
|
||||
## Python API
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import telemetry
|
||||
|
||||
# Enable/disable telemetry
|
||||
telemetry.enable(email="optional@email.com", always=True)
|
||||
telemetry.disable()
|
||||
|
||||
# Check current status
|
||||
status = telemetry.status()
|
||||
print(f"Telemetry enabled: {status['enabled']}")
|
||||
print(f"Consent: {status['consent']}")
|
||||
```
|
||||
|
||||
### Manual Exception Capture
|
||||
|
||||
```python
|
||||
from crawl4ai.telemetry import capture_exception
|
||||
|
||||
try:
|
||||
# Your code here
|
||||
risky_operation()
|
||||
except Exception as e:
|
||||
# Manually capture exception with context
|
||||
capture_exception(e, {
|
||||
'operation': 'custom_crawler',
|
||||
'url': 'https://example.com' # Will be sanitized
|
||||
})
|
||||
raise
|
||||
```
|
||||
|
||||
### Decorator Pattern
|
||||
|
||||
```python
|
||||
from crawl4ai.telemetry import telemetry_decorator
|
||||
|
||||
@telemetry_decorator
|
||||
def my_crawler_function():
|
||||
# Exceptions will be automatically captured
|
||||
pass
|
||||
```
|
||||
|
||||
### Context Manager
|
||||
|
||||
```python
|
||||
from crawl4ai.telemetry import telemetry_context
|
||||
|
||||
with telemetry_context("data_extraction"):
|
||||
# Any exceptions in this block will be captured
|
||||
result = extract_data(html)
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
Settings are stored in `~/.crawl4ai/config.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"telemetry": {
|
||||
"consent": "always",
|
||||
"email": "user@example.com"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Consent levels:
|
||||
- `"not_set"` - No decision made yet
|
||||
- `"denied"` - Telemetry disabled
|
||||
- `"once"` - Send current error only
|
||||
- `"always"` - Always send errors
|
||||
|
||||
## Environment Variables
|
||||
|
||||
- `CRAWL4AI_TELEMETRY=0` - Disable telemetry (overrides config)
|
||||
- `CRAWL4AI_TELEMETRY_EMAIL=email@example.com` - Set email for follow-up
|
||||
- `CRAWL4AI_SENTRY_DSN=https://...` - Override default DSN (for maintainers)
|
||||
|
||||
## What's Collected
|
||||
|
||||
### Collected ✅
|
||||
- Exception type and traceback
|
||||
- Crawl4AI version
|
||||
- Python version
|
||||
- Operating system
|
||||
- Environment type (CLI, Docker, Jupyter)
|
||||
- Optional email (if provided)
|
||||
|
||||
### NOT Collected ❌
|
||||
- URLs being crawled
|
||||
- HTML content
|
||||
- Request/response data
|
||||
- Cookies or authentication tokens
|
||||
- IP addresses
|
||||
- Any personally identifiable information
|
||||
|
||||
## Provider Architecture
|
||||
|
||||
Telemetry is designed to be provider-agnostic:
|
||||
|
||||
```python
|
||||
from crawl4ai.telemetry.base import TelemetryProvider
|
||||
|
||||
class CustomProvider(TelemetryProvider):
|
||||
def send_exception(self, exc, context=None):
|
||||
# Your implementation
|
||||
pass
|
||||
```
|
||||
|
||||
## FAQ
|
||||
|
||||
### Q: Can I completely disable telemetry?
|
||||
A: Yes! Use `crwl telemetry disable` or set `CRAWL4AI_TELEMETRY=0`
|
||||
|
||||
### Q: Is telemetry required?
|
||||
A: No, it's completely optional (except enabled by default in Docker)
|
||||
|
||||
### Q: What if I don't install sentry-sdk?
|
||||
A: Telemetry will gracefully degrade to a no-op state
|
||||
|
||||
### Q: Can I see what's being sent?
|
||||
A: Yes, check the source code in `crawl4ai/telemetry/`
|
||||
|
||||
### Q: How do I remove my email?
|
||||
A: Delete `~/.crawl4ai/config.json` or edit it to remove the email field
|
||||
|
||||
## Privacy Commitment
|
||||
|
||||
1. **Transparency**: All telemetry code is open source
|
||||
2. **Control**: You can enable/disable at any time
|
||||
3. **Minimal**: Only crash data, no user content
|
||||
4. **Secure**: Data transmitted over HTTPS to Sentry
|
||||
5. **Anonymous**: No tracking or user identification
|
||||
|
||||
## Contributing
|
||||
|
||||
Help improve telemetry:
|
||||
- Report issues with telemetry itself
|
||||
- Suggest privacy improvements
|
||||
- Add new provider backends
|
||||
|
||||
## Support
|
||||
|
||||
If you have concerns about telemetry:
|
||||
- Open an issue on GitHub
|
||||
- Email the maintainers
|
||||
- Review the code in `crawl4ai/telemetry/`
|
||||
376
docs/md_v2/migration/table_extraction_v073.md
Normal file
376
docs/md_v2/migration/table_extraction_v073.md
Normal file
@@ -0,0 +1,376 @@
|
||||
# Migration Guide: Table Extraction v0.7.3
|
||||
|
||||
## Overview
|
||||
|
||||
Version 0.7.3 introduces the **Table Extraction Strategy Pattern**, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.
|
||||
|
||||
## What's New
|
||||
|
||||
### Strategy Pattern Implementation
|
||||
|
||||
Table extraction now follows the same strategy pattern used throughout Crawl4AI:
|
||||
|
||||
- **Consistent Architecture**: Aligns with extraction, chunking, and markdown strategies
|
||||
- **Extensibility**: Easy to create custom table extraction strategies
|
||||
- **Better Separation**: Table logic moved from content scraping to dedicated module
|
||||
- **Full Control**: Fine-grained control over table detection and extraction
|
||||
|
||||
### New Classes
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
TableExtractionStrategy, # Abstract base class
|
||||
DefaultTableExtraction, # Current implementation (default)
|
||||
NoTableExtraction # Explicitly disable extraction
|
||||
)
|
||||
```
|
||||
|
||||
## Backward Compatibility
|
||||
|
||||
**✅ All existing code continues to work without changes.**
|
||||
|
||||
### No Changes Required
|
||||
|
||||
If your code looks like this, it will continue to work:
|
||||
|
||||
```python
|
||||
# This still works exactly the same
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
tables = result.tables # Same structure, same data
|
||||
```
|
||||
|
||||
### What Happens Behind the Scenes
|
||||
|
||||
When you don't specify a `table_extraction` strategy:
|
||||
|
||||
1. `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
|
||||
2. It uses your `table_score_threshold` parameter
|
||||
3. Tables are extracted exactly as before
|
||||
4. Results appear in `result.tables` with the same structure
|
||||
|
||||
## New Capabilities
|
||||
|
||||
### 1. Explicit Strategy Configuration
|
||||
|
||||
You can now explicitly configure table extraction:
|
||||
|
||||
```python
|
||||
# New: Explicit control
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
min_rows=2, # New: minimum row filter
|
||||
min_cols=2, # New: minimum column filter
|
||||
verbose=True # New: detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Disable Table Extraction
|
||||
|
||||
Improve performance when tables aren't needed:
|
||||
|
||||
```python
|
||||
# New: Skip table extraction entirely
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
# No CPU cycles spent on table detection/extraction
|
||||
```
|
||||
|
||||
### 3. Custom Extraction Strategies
|
||||
|
||||
Create specialized extractors:
|
||||
|
||||
```python
|
||||
class MyTableExtractor(TableExtractionStrategy):
|
||||
def extract_tables(self, element, **kwargs):
|
||||
# Custom extraction logic
|
||||
return custom_tables
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=MyTableExtractor()
|
||||
)
|
||||
```
|
||||
|
||||
## Migration Scenarios
|
||||
|
||||
### Scenario 1: Basic Usage (No Changes Needed)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url, config)
|
||||
for table in result.tables:
|
||||
print(table['headers'])
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Exactly the same - no changes required
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url, config)
|
||||
for table in result.tables:
|
||||
print(table['headers'])
|
||||
```
|
||||
|
||||
### Scenario 2: Custom Threshold (No Changes Needed)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Still works the same
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
|
||||
# Or use new explicit approach for more control
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5,
|
||||
min_rows=2 # Additional filtering
|
||||
)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Scenario 3: Advanced Filtering (New Feature)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
# Had to filter after extraction
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Manual filtering
|
||||
large_tables = [
|
||||
t for t in result.tables
|
||||
if len(t['rows']) >= 5 and len(t['headers']) >= 3
|
||||
]
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Filter during extraction (more efficient)
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5,
|
||||
min_rows=5,
|
||||
min_cols=3
|
||||
)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
# result.tables already filtered
|
||||
```
|
||||
|
||||
## Code Organization Changes
|
||||
|
||||
### Module Structure
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```
|
||||
crawl4ai/
|
||||
content_scraping_strategy.py
|
||||
- LXMLWebScrapingStrategy
|
||||
- is_data_table() # Table detection
|
||||
- extract_table_data() # Table extraction
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```
|
||||
crawl4ai/
|
||||
content_scraping_strategy.py
|
||||
- LXMLWebScrapingStrategy
|
||||
# Table methods removed, uses strategy
|
||||
|
||||
table_extraction.py (NEW)
|
||||
- TableExtractionStrategy # Base class
|
||||
- DefaultTableExtraction # Moved logic here
|
||||
- NoTableExtraction # New option
|
||||
```
|
||||
|
||||
### Import Changes
|
||||
|
||||
**New imports available (optional):**
|
||||
```python
|
||||
# These are now available but not required for existing code
|
||||
from crawl4ai import (
|
||||
TableExtractionStrategy,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Implications
|
||||
|
||||
### No Performance Impact
|
||||
|
||||
For existing code, performance remains identical:
|
||||
- Same extraction logic
|
||||
- Same scoring algorithm
|
||||
- Same processing time
|
||||
|
||||
### Performance Improvements Available
|
||||
|
||||
New options for better performance:
|
||||
|
||||
```python
|
||||
# Skip tables entirely (faster)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
|
||||
# Process only specific areas (faster)
|
||||
config = CrawlerRunConfig(
|
||||
css_selector="main.content",
|
||||
table_extraction=DefaultTableExtraction(
|
||||
min_rows=5, # Skip small tables
|
||||
min_cols=3
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Testing Your Migration
|
||||
|
||||
### Verification Script
|
||||
|
||||
Run this to verify your extraction still works:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async def verify_extraction():
|
||||
url = "your_url_here"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Test 1: Old approach
|
||||
config_old = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
result_old = await crawler.arun(url, config_old)
|
||||
|
||||
# Test 2: New explicit approach
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
config_new = CrawlerRunConfig(
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=7
|
||||
)
|
||||
)
|
||||
result_new = await crawler.arun(url, config_new)
|
||||
|
||||
# Compare results
|
||||
assert len(result_old.tables) == len(result_new.tables)
|
||||
print(f"✓ Both approaches extracted {len(result_old.tables)} tables")
|
||||
|
||||
# Verify structure
|
||||
for old, new in zip(result_old.tables, result_new.tables):
|
||||
assert old['headers'] == new['headers']
|
||||
assert old['rows'] == new['rows']
|
||||
|
||||
print("✓ Table content identical")
|
||||
|
||||
asyncio.run(verify_extraction())
|
||||
```
|
||||
|
||||
## Deprecation Notes
|
||||
|
||||
### No Deprecations
|
||||
|
||||
- All existing parameters continue to work
|
||||
- `table_score_threshold` in `CrawlerRunConfig` is still supported
|
||||
- No breaking changes
|
||||
|
||||
### Internal Changes (Transparent to Users)
|
||||
|
||||
- `LXMLWebScrapingStrategy.is_data_table()` - Moved to `DefaultTableExtraction`
|
||||
- `LXMLWebScrapingStrategy.extract_table_data()` - Moved to `DefaultTableExtraction`
|
||||
|
||||
These methods were internal and not part of the public API.
|
||||
|
||||
## Benefits of Upgrading
|
||||
|
||||
While not required, using the new pattern provides:
|
||||
|
||||
1. **Better Control**: Filter tables during extraction, not after
|
||||
2. **Performance Options**: Skip extraction when not needed
|
||||
3. **Extensibility**: Create custom extractors for specific needs
|
||||
4. **Consistency**: Same pattern as other Crawl4AI strategies
|
||||
5. **Future-Proof**: Ready for upcoming advanced strategies
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Different Number of Tables
|
||||
|
||||
**Cause**: Threshold or filtering differences
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Ensure same threshold
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7, # Match your old setting
|
||||
min_rows=0, # No filtering (default)
|
||||
min_cols=0 # No filtering (default)
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Import Errors
|
||||
|
||||
**Cause**: Using new classes without importing
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Add imports if using new features
|
||||
from crawl4ai import (
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
TableExtractionStrategy
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Custom Strategy Not Working
|
||||
|
||||
**Cause**: Incorrect method signature
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
class CustomExtractor(TableExtractionStrategy):
|
||||
def extract_tables(self, element, **kwargs): # Correct signature
|
||||
# Not: extract_tables(self, html)
|
||||
# Not: extract(self, element)
|
||||
return tables_list
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
If you encounter issues:
|
||||
|
||||
1. Check your `table_score_threshold` matches previous settings
|
||||
2. Verify imports if using new classes
|
||||
3. Enable verbose logging: `DefaultTableExtraction(verbose=True)`
|
||||
4. Review the [Table Extraction Documentation](../core/table_extraction.md)
|
||||
5. Check [examples](../examples/table_extraction_example.py)
|
||||
|
||||
## Summary
|
||||
|
||||
- ✅ **Full backward compatibility** - No code changes required
|
||||
- ✅ **Same results** - Identical extraction behavior by default
|
||||
- ✅ **New options** - Additional control when needed
|
||||
- ✅ **Better architecture** - Consistent with Crawl4AI patterns
|
||||
- ✅ **Ready for future** - Foundation for advanced strategies
|
||||
|
||||
The migration to v0.7.3 is seamless with no required changes while providing new capabilities for those who need them.
|
||||
@@ -35,6 +35,7 @@ nav:
|
||||
- "Page Interaction": "core/page-interaction.md"
|
||||
- "Content Selection": "core/content-selection.md"
|
||||
- "Cache Modes": "core/cache-modes.md"
|
||||
- "Telemetry": "core/telemetry.md"
|
||||
- "Local Files & Raw HTML": "core/local-files.md"
|
||||
- "Link & Media": "core/link-media.md"
|
||||
- Advanced:
|
||||
|
||||
@@ -64,6 +64,7 @@ torch = ["torch", "nltk", "scikit-learn"]
|
||||
transformer = ["transformers", "tokenizers", "sentence-transformers"]
|
||||
cosine = ["torch", "transformers", "nltk", "sentence-transformers"]
|
||||
sync = ["selenium"]
|
||||
telemetry = ["sentry-sdk>=2.0.0", "ipywidgets>=8.0.0"]
|
||||
all = [
|
||||
"PyPDF2",
|
||||
"torch",
|
||||
@@ -72,7 +73,9 @@ all = [
|
||||
"transformers",
|
||||
"tokenizers",
|
||||
"sentence-transformers",
|
||||
"selenium"
|
||||
"selenium",
|
||||
"sentry-sdk>=2.0.0",
|
||||
"ipywidgets>=8.0.0"
|
||||
]
|
||||
|
||||
[project.scripts]
|
||||
|
||||
16
pytest.ini
Normal file
16
pytest.ini
Normal file
@@ -0,0 +1,16 @@
|
||||
[pytest]
|
||||
testpaths = tests
|
||||
python_paths = .
|
||||
addopts = --maxfail=1 --disable-warnings -q --tb=short -v
|
||||
asyncio_mode = auto
|
||||
markers =
|
||||
slow: marks tests as slow (deselect with '-m "not slow"')
|
||||
integration: marks tests as integration tests
|
||||
unit: marks tests as unit tests
|
||||
privacy: marks tests related to privacy compliance
|
||||
performance: marks tests related to performance
|
||||
filterwarnings =
|
||||
ignore::DeprecationWarning
|
||||
ignore::PendingDeprecationWarning
|
||||
env =
|
||||
CRAWL4AI_TEST_MODE=1
|
||||
@@ -91,6 +91,17 @@ async def test_css_selector_extraction():
|
||||
assert result.markdown
|
||||
assert all(heading in result.markdown for heading in ["#", "##", "###"])
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_base_tag_link_extraction():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://sohamkukreti.github.io/portfolio"
|
||||
result = await crawler.arun(url=url)
|
||||
assert result.success
|
||||
assert result.links
|
||||
assert isinstance(result.links, dict)
|
||||
assert "internal" in result.links
|
||||
assert "external" in result.links
|
||||
assert any("github.com" in x["href"] for x in result.links["external"])
|
||||
|
||||
# Entry point for debugging
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -10,11 +10,13 @@ import sys
|
||||
import uuid
|
||||
import shutil
|
||||
|
||||
from crawl4ai import BrowserProfiler
|
||||
from crawl4ai.browser_manager import BrowserManager
|
||||
|
||||
# Add the project root to Python path if running directly
|
||||
if __name__ == "__main__":
|
||||
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
|
||||
|
||||
from crawl4ai.browser import BrowserManager, BrowserProfileManager
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
@@ -25,7 +27,7 @@ async def test_profile_creation():
|
||||
"""Test creating and managing browser profiles."""
|
||||
logger.info("Testing profile creation and management", tag="TEST")
|
||||
|
||||
profile_manager = BrowserProfileManager(logger=logger)
|
||||
profile_manager = BrowserProfiler(logger=logger)
|
||||
|
||||
try:
|
||||
# List existing profiles
|
||||
@@ -83,7 +85,7 @@ async def test_profile_with_browser():
|
||||
"""Test using a profile with a browser."""
|
||||
logger.info("Testing using a profile with a browser", tag="TEST")
|
||||
|
||||
profile_manager = BrowserProfileManager(logger=logger)
|
||||
profile_manager = BrowserProfiler(logger=logger)
|
||||
test_profile_name = f"test-browser-profile-{uuid.uuid4().hex[:8]}"
|
||||
profile_path = None
|
||||
|
||||
@@ -101,6 +103,8 @@ async def test_profile_with_browser():
|
||||
# Now use this profile with a browser
|
||||
browser_config = BrowserConfig(
|
||||
user_data_dir=profile_path,
|
||||
use_managed_browser=True,
|
||||
use_persistent_context=True,
|
||||
headless=True
|
||||
)
|
||||
|
||||
|
||||
151
tests/conftest.py
Normal file
151
tests/conftest.py
Normal file
@@ -0,0 +1,151 @@
|
||||
"""
|
||||
Shared pytest fixtures for Crawl4AI tests.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import tempfile
|
||||
import os
|
||||
from pathlib import Path
|
||||
from unittest.mock import Mock, patch
|
||||
from crawl4ai.telemetry.config import TelemetryConfig, TelemetryConsent
|
||||
from crawl4ai.telemetry.environment import Environment
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def temp_config_dir():
|
||||
"""Provide a temporary directory for telemetry config testing."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
yield Path(tmpdir)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_telemetry_config(temp_config_dir):
|
||||
"""Provide a mocked telemetry config for testing."""
|
||||
config = TelemetryConfig(config_dir=temp_config_dir)
|
||||
yield config
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def clean_environment():
|
||||
"""Clean environment variables before and after test."""
|
||||
# Store original environment
|
||||
original_env = os.environ.copy()
|
||||
|
||||
# Clean telemetry-related env vars
|
||||
telemetry_vars = [
|
||||
'CRAWL4AI_TELEMETRY',
|
||||
'CRAWL4AI_DOCKER',
|
||||
'CRAWL4AI_API_SERVER',
|
||||
'CRAWL4AI_TEST_MODE'
|
||||
]
|
||||
|
||||
for var in telemetry_vars:
|
||||
if var in os.environ:
|
||||
del os.environ[var]
|
||||
|
||||
# Set test mode
|
||||
os.environ['CRAWL4AI_TEST_MODE'] = '1'
|
||||
|
||||
yield
|
||||
|
||||
# Restore original environment
|
||||
os.environ.clear()
|
||||
os.environ.update(original_env)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_sentry_provider():
|
||||
"""Provide a mocked Sentry provider for testing."""
|
||||
with patch('crawl4ai.telemetry.providers.sentry.SentryProvider') as mock:
|
||||
provider_instance = Mock()
|
||||
provider_instance.initialize.return_value = True
|
||||
provider_instance.send_exception.return_value = True
|
||||
provider_instance.is_initialized = True
|
||||
mock.return_value = provider_instance
|
||||
yield provider_instance
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def enabled_telemetry_config(temp_config_dir): # noqa: F811
|
||||
"""Provide a telemetry config with telemetry enabled."""
|
||||
config = Mock()
|
||||
config.get_consent.return_value = TelemetryConsent.ALWAYS
|
||||
config.is_enabled.return_value = True
|
||||
config.should_send_current.return_value = True
|
||||
config.get_email.return_value = "test@example.com"
|
||||
config.update_from_env.return_value = None
|
||||
yield config
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def disabled_telemetry_config(temp_config_dir): # noqa: F811
|
||||
"""Provide a telemetry config with telemetry disabled."""
|
||||
config = Mock()
|
||||
config.get_consent.return_value = TelemetryConsent.DENIED
|
||||
config.is_enabled.return_value = False
|
||||
config.should_send_current.return_value = False
|
||||
config.update_from_env.return_value = None
|
||||
yield config
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def docker_environment():
|
||||
"""Mock Docker environment detection."""
|
||||
with patch('crawl4ai.telemetry.environment.EnvironmentDetector.detect', return_value=Environment.DOCKER):
|
||||
yield
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def cli_environment():
|
||||
"""Mock CLI environment detection."""
|
||||
with patch('crawl4ai.telemetry.environment.EnvironmentDetector.detect', return_value=Environment.CLI):
|
||||
with patch('sys.stdin.isatty', return_value=True):
|
||||
yield
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def jupyter_environment():
|
||||
"""Mock Jupyter environment detection."""
|
||||
with patch('crawl4ai.telemetry.environment.EnvironmentDetector.detect', return_value=Environment.JUPYTER):
|
||||
yield
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def reset_telemetry_singleton():
|
||||
"""Reset telemetry singleton between tests."""
|
||||
from crawl4ai.telemetry import TelemetryManager
|
||||
# Reset the singleton instance
|
||||
if hasattr(TelemetryManager, '_instance'):
|
||||
TelemetryManager._instance = None # noqa: SLF001
|
||||
yield
|
||||
# Clean up after test
|
||||
if hasattr(TelemetryManager, '_instance'):
|
||||
TelemetryManager._instance = None # noqa: SLF001
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_exception():
|
||||
"""Provide a sample exception for testing."""
|
||||
try:
|
||||
raise ValueError("Test exception for telemetry")
|
||||
except ValueError as e:
|
||||
return e
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def privacy_test_data():
|
||||
"""Provide test data that should NOT be captured by telemetry."""
|
||||
return {
|
||||
'url': 'https://example.com/private-page',
|
||||
'content': 'This is private content that should not be sent',
|
||||
'user_data': {
|
||||
'email': 'user@private.com',
|
||||
'password': 'secret123',
|
||||
'api_key': 'sk-1234567890abcdef'
|
||||
},
|
||||
'pii': {
|
||||
'ssn': '123-45-6789',
|
||||
'phone': '+1-555-123-4567',
|
||||
'address': '123 Main St, Anytown, USA'
|
||||
}
|
||||
}
|
||||
@@ -168,7 +168,7 @@ class SimpleApiTester:
|
||||
print("\n=== CORE APIs ===")
|
||||
|
||||
test_url = "https://example.com"
|
||||
|
||||
test_raw_html_url = "raw://<html><body><h1>Hello, World!</h1></body></html>"
|
||||
# Test markdown endpoint
|
||||
md_payload = {
|
||||
"url": test_url,
|
||||
@@ -180,6 +180,17 @@ class SimpleApiTester:
|
||||
# print(result['data'].get('markdown', ''))
|
||||
self.print_result(result)
|
||||
|
||||
# Test markdown endpoint with raw HTML
|
||||
raw_md_payload = {
|
||||
"url": test_raw_html_url,
|
||||
"f": "fit",
|
||||
"q": "test query",
|
||||
"c": "0"
|
||||
}
|
||||
result = self.test_post_endpoint("/md", raw_md_payload)
|
||||
self.print_result(result)
|
||||
|
||||
|
||||
# Test HTML endpoint
|
||||
html_payload = {"url": test_url}
|
||||
result = self.test_post_endpoint("/html", html_payload)
|
||||
@@ -215,6 +226,15 @@ class SimpleApiTester:
|
||||
result = self.test_post_endpoint("/crawl", crawl_payload)
|
||||
self.print_result(result)
|
||||
|
||||
# Test crawl endpoint with raw HTML
|
||||
crawl_payload = {
|
||||
"urls": [test_raw_html_url],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
}
|
||||
result = self.test_post_endpoint("/crawl", crawl_payload)
|
||||
self.print_result(result)
|
||||
|
||||
# Test config dump
|
||||
config_payload = {"code": "CrawlerRunConfig()"}
|
||||
result = self.test_post_endpoint("/config/dump", config_payload)
|
||||
|
||||
@@ -74,7 +74,7 @@ async def test_direct_api():
|
||||
# Make direct API call
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:8000/crawl",
|
||||
"http://localhost:11235/crawl",
|
||||
json=request_data,
|
||||
timeout=300
|
||||
)
|
||||
@@ -100,13 +100,24 @@ async def test_direct_api():
|
||||
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:8000/crawl",
|
||||
"http://localhost:11235/crawl",
|
||||
json=request_data
|
||||
)
|
||||
assert response.status_code == 200
|
||||
result = response.json()
|
||||
print("Structured extraction result:", result["success"])
|
||||
|
||||
# Test 3: Raw HTML
|
||||
request_data["urls"] = ["raw://<html><body><h1>Hello, World!</h1><a href='https://example.com'>Example</a></body></html>"]
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json=request_data
|
||||
)
|
||||
assert response.status_code == 200
|
||||
result = response.json()
|
||||
print("Raw HTML result:", result["success"])
|
||||
|
||||
# Test 3: Get schema
|
||||
# async with httpx.AsyncClient() as client:
|
||||
# response = await client.get("http://localhost:8000/schema")
|
||||
@@ -118,7 +129,7 @@ async def test_with_client():
|
||||
"""Test using the Crawl4AI Docker client SDK"""
|
||||
print("\n=== Testing Client SDK ===")
|
||||
|
||||
async with Crawl4aiDockerClient(verbose=True) as client:
|
||||
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
|
||||
# Test 1: Basic crawl
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
|
||||
@@ -6,28 +6,22 @@ import base64
|
||||
import os
|
||||
from typing import Dict, Any
|
||||
|
||||
|
||||
class Crawl4AiTester:
|
||||
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
|
||||
def __init__(self, base_url: str = "http://localhost:11235"):
|
||||
self.base_url = base_url
|
||||
self.api_token = api_token or os.getenv(
|
||||
"CRAWL4AI_API_TOKEN"
|
||||
) # Check environment variable as fallback
|
||||
self.headers = (
|
||||
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
|
||||
)
|
||||
|
||||
|
||||
def submit_and_wait(
|
||||
self, request_data: Dict[str, Any], timeout: int = 300
|
||||
) -> Dict[str, Any]:
|
||||
# Submit crawl job
|
||||
# Submit crawl job using async endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl", json=request_data, headers=self.headers
|
||||
f"{self.base_url}/crawl/job", json=request_data
|
||||
)
|
||||
if response.status_code == 403:
|
||||
raise Exception("API token is invalid or missing")
|
||||
task_id = response.json()["task_id"]
|
||||
print(f"Task ID: {task_id}")
|
||||
response.raise_for_status()
|
||||
job_response = response.json()
|
||||
task_id = job_response["task_id"]
|
||||
print(f"Submitted job with task_id: {task_id}")
|
||||
|
||||
# Poll for result
|
||||
start_time = time.time()
|
||||
@@ -38,8 +32,9 @@ class Crawl4AiTester:
|
||||
)
|
||||
|
||||
result = requests.get(
|
||||
f"{self.base_url}/task/{task_id}", headers=self.headers
|
||||
f"{self.base_url}/crawl/job/{task_id}"
|
||||
)
|
||||
result.raise_for_status()
|
||||
status = result.json()
|
||||
|
||||
if status["status"] == "failed":
|
||||
@@ -52,10 +47,10 @@ class Crawl4AiTester:
|
||||
time.sleep(2)
|
||||
|
||||
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
# Use synchronous crawl endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl_sync",
|
||||
f"{self.base_url}/crawl",
|
||||
json=request_data,
|
||||
headers=self.headers,
|
||||
timeout=60,
|
||||
)
|
||||
if response.status_code == 408:
|
||||
@@ -66,9 +61,8 @@ class Crawl4AiTester:
|
||||
|
||||
def test_docker_deployment(version="basic"):
|
||||
tester = Crawl4AiTester(
|
||||
# base_url="http://localhost:11235" ,
|
||||
base_url="https://crawl4ai-sby74.ondigitalocean.app",
|
||||
api_token="test",
|
||||
base_url="http://localhost:11235",
|
||||
#base_url="https://crawl4ai-sby74.ondigitalocean.app",
|
||||
)
|
||||
print(f"Testing Crawl4AI Docker {version} version")
|
||||
|
||||
@@ -88,63 +82,60 @@ def test_docker_deployment(version="basic"):
|
||||
|
||||
# Test cases based on version
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl_sync(tester)
|
||||
|
||||
# if version in ["full", "transformer"]:
|
||||
# test_cosine_extraction(tester)
|
||||
if version in ["full", "transformer"]:
|
||||
test_cosine_extraction(tester)
|
||||
|
||||
# test_js_execution(tester)
|
||||
# test_css_selector(tester)
|
||||
# test_structured_extraction(tester)
|
||||
# test_llm_extraction(tester)
|
||||
# test_llm_with_ollama(tester)
|
||||
# test_screenshot(tester)
|
||||
test_js_execution(tester)
|
||||
test_css_selector(tester)
|
||||
test_structured_extraction(tester)
|
||||
test_llm_extraction(tester)
|
||||
test_llm_with_ollama(tester)
|
||||
test_screenshot(tester)
|
||||
|
||||
|
||||
def test_basic_crawl(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl ===")
|
||||
print("\n=== Testing Basic Crawl (Async) ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
print(f"Basic crawl result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
assert len(result["result"]["results"]) > 0
|
||||
assert len(result["result"]["results"][0]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_sync(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Sync) ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_sync(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["status"] == "completed"
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
print(f"Basic crawl result count: {len(result['results'])}")
|
||||
assert result["success"]
|
||||
assert len(result["results"]) > 0
|
||||
assert len(result["results"][0]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_js_execution(tester: Crawl4AiTester):
|
||||
print("\n=== Testing JS Execution ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 8,
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "article.tease-card:nth-child(10)",
|
||||
"crawler_params": {"headless": True},
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if(loadMoreButton) loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "wide-tease-item__wrapper df flex-column flex-row-m flex-nowrap-m enable-new-sports-feed-mobile-design(10)"
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"JS execution result length: {len(result['result']['markdown'])}")
|
||||
print(f"JS execution result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
@@ -152,51 +143,78 @@ def test_css_selector(tester: Crawl4AiTester):
|
||||
print("\n=== Testing CSS Selector ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 7,
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"crawler_params": {"headless": True},
|
||||
"extra": {"word_count_threshold": 10},
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"word_count_threshold": 10
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"CSS selector result length: {len(result['result']['markdown'])}")
|
||||
print(f"CSS selector result count: {len(result['result']['results'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_structured_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Structured Extraction ===")
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
},
|
||||
],
|
||||
"name": "Cryptocurrency Prices",
|
||||
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asset_name",
|
||||
"selector": "td:nth-child(2) p.cds-headline-h4steop",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_symbol",
|
||||
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_image_url",
|
||||
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "asset_url",
|
||||
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "change",
|
||||
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.coinbase.com/explore"],
|
||||
"priority": 9,
|
||||
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {"schema": schema}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} items")
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
if extracted:
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
assert len(extracted) > 0
|
||||
|
||||
@@ -206,43 +224,54 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"model_name": {
|
||||
"asset_name": {
|
||||
"type": "string",
|
||||
"description": "Name of the OpenAI model.",
|
||||
"description": "Name of the asset.",
|
||||
},
|
||||
"input_fee": {
|
||||
"price": {
|
||||
"type": "string",
|
||||
"description": "Fee for input token for the OpenAI model.",
|
||||
"description": "Price of the asset.",
|
||||
},
|
||||
"output_fee": {
|
||||
"change": {
|
||||
"type": "string",
|
||||
"description": "Fee for output token for the OpenAI model.",
|
||||
"description": "Change in price of the asset.",
|
||||
},
|
||||
},
|
||||
"required": ["model_name", "input_fee", "output_fee"],
|
||||
"required": ["asset_name", "price", "change"],
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": ["https://openai.com/api/pricing"],
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"urls": ["https://www.coinbase.com/en-in/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"provider": "openai/gpt-4o-mini",
|
||||
"api_token": os.getenv("OPENAI_API_KEY"),
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
|
||||
},
|
||||
},
|
||||
"crawler_params": {"word_count_threshold": 1},
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "gemini/gemini-2.5-flash",
|
||||
"api_token": os.getenv("GEMINI_API_KEY")
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "From the crawled content tioned asset names along with their prices and change in price.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} model pricing entries")
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
if extracted:
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
|
||||
@@ -271,23 +300,32 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"browser_config": {"verbose": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama2",
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
},
|
||||
},
|
||||
"extra": {"word_count_threshold": 1},
|
||||
"crawler_params": {"verbose": True},
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama3.2:latest",
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print("Extracted content:", json.dumps(extracted, indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
@@ -298,23 +336,29 @@ def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Cosine Extraction ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "cosine",
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
},
|
||||
},
|
||||
"extraction_strategy": {
|
||||
"type": "CosineStrategy",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} text clusters")
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
if extracted:
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"Cosine extraction test failed: {str(e)}")
|
||||
@@ -324,19 +368,24 @@ def test_screenshot(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Screenshot ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 5,
|
||||
"screenshot": True,
|
||||
"crawler_params": {"headless": True},
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"screenshot": True
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print("Screenshot captured:", bool(result["result"]["screenshot"]))
|
||||
screenshot_data = result["result"]["results"][0]["screenshot"]
|
||||
print("Screenshot captured:", bool(screenshot_data))
|
||||
|
||||
if result["result"]["screenshot"]:
|
||||
if screenshot_data:
|
||||
# Save screenshot
|
||||
screenshot_data = base64.b64decode(result["result"]["screenshot"])
|
||||
screenshot_bytes = base64.b64decode(screenshot_data)
|
||||
with open("test_screenshot.jpg", "wb") as f:
|
||||
f.write(screenshot_data)
|
||||
f.write(screenshot_bytes)
|
||||
print("Screenshot saved as test_screenshot.jpg")
|
||||
|
||||
assert result["result"]["success"]
|
||||
|
||||
43
tests/general/test_persistent_context.py
Normal file
43
tests/general/test_persistent_context.py
Normal file
@@ -0,0 +1,43 @@
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
# Simple concurrency test for persistent context page creation
|
||||
# Usage: python scripts/test_persistent_context.py
|
||||
|
||||
URLS = [
|
||||
# "https://example.com",
|
||||
"https://httpbin.org/html",
|
||||
"https://www.python.org/",
|
||||
"https://www.rust-lang.org/",
|
||||
]
|
||||
|
||||
async def main():
|
||||
profile_dir = os.path.join(os.path.expanduser("~"), ".crawl4ai", "profiles", "test-persistent-profile")
|
||||
os.makedirs(profile_dir, exist_ok=True)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
use_persistent_context=True,
|
||||
user_data_dir=profile_dir,
|
||||
use_managed_browser=True,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
run_cfg = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
stream=False,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
results = await crawler.arun_many(URLS, config=run_cfg)
|
||||
for r in results:
|
||||
print(r.url, r.success, len(r.markdown.raw_markdown) if r.markdown else 0)
|
||||
# r = await crawler.arun(url=URLS[0], config=run_cfg)
|
||||
# print(r.url, r.success, len(r.markdown.raw_markdown) if r.markdown else 0)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
55
tests/profiler/test_keyboard_handle.py
Normal file
55
tests/profiler/test_keyboard_handle.py
Normal file
@@ -0,0 +1,55 @@
|
||||
import sys
|
||||
import pytest
|
||||
import asyncio
|
||||
from unittest.mock import patch, MagicMock
|
||||
from crawl4ai.browser_profiler import BrowserProfiler
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.skipif(sys.platform != "win32", reason="Windows-specific msvcrt test")
|
||||
async def test_keyboard_input_handling():
|
||||
# Mock sequence of keystrokes: arrow key followed by 'q'
|
||||
mock_keys = [b'\x00K', b'q']
|
||||
mock_kbhit = MagicMock(side_effect=[True, True, False])
|
||||
mock_getch = MagicMock(side_effect=mock_keys)
|
||||
|
||||
with patch('msvcrt.kbhit', mock_kbhit), patch('msvcrt.getch', mock_getch):
|
||||
# profiler = BrowserProfiler()
|
||||
user_done_event = asyncio.Event()
|
||||
|
||||
# Create a local async function to simulate the keyboard input handling
|
||||
async def test_listen_for_quit_command():
|
||||
if sys.platform == "win32":
|
||||
while True:
|
||||
try:
|
||||
if mock_kbhit():
|
||||
raw = mock_getch()
|
||||
try:
|
||||
key = raw.decode("utf-8")
|
||||
except UnicodeDecodeError:
|
||||
continue
|
||||
|
||||
if len(key) != 1 or not key.isprintable():
|
||||
continue
|
||||
|
||||
if key.lower() == "q":
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
await asyncio.sleep(0.1)
|
||||
except Exception as e:
|
||||
continue
|
||||
|
||||
# Run the listener
|
||||
listener_task = asyncio.create_task(test_listen_for_quit_command())
|
||||
|
||||
# Wait for the event to be set
|
||||
try:
|
||||
await asyncio.wait_for(user_done_event.wait(), timeout=1.0)
|
||||
assert user_done_event.is_set()
|
||||
finally:
|
||||
if not listener_task.done():
|
||||
listener_task.cancel()
|
||||
try:
|
||||
await listener_task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
582
tests/proxy/test_proxy_config.py
Normal file
582
tests/proxy/test_proxy_config.py
Normal file
@@ -0,0 +1,582 @@
|
||||
"""
|
||||
Comprehensive test suite for ProxyConfig in different forms:
|
||||
1. String form (ip:port:username:password)
|
||||
2. Dict form (dictionary with keys)
|
||||
3. Object form (ProxyConfig instance)
|
||||
4. Environment variable form (from env vars)
|
||||
|
||||
Tests cover all possible scenarios and edge cases using pytest.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import pytest
|
||||
import tempfile
|
||||
from unittest.mock import patch
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
|
||||
class TestProxyConfig:
|
||||
"""Comprehensive test suite for ProxyConfig functionality."""
|
||||
|
||||
# Test data for different scenarios
|
||||
# get free proxy server from from webshare.io https://www.webshare.io/?referral_code=3sqog0y1fvsl
|
||||
TEST_PROXY_DATA = {
|
||||
"server": "",
|
||||
"username": "",
|
||||
"password": "",
|
||||
"ip": ""
|
||||
}
|
||||
|
||||
def setup_method(self):
|
||||
"""Setup for each test method."""
|
||||
self.test_url = "https://httpbin.org/ip" # Use httpbin for testing
|
||||
|
||||
# ==================== OBJECT FORM TESTS ====================
|
||||
|
||||
def test_proxy_config_object_creation_basic(self):
|
||||
"""Test basic ProxyConfig object creation."""
|
||||
proxy = ProxyConfig(server="127.0.0.1:8080")
|
||||
assert proxy.server == "127.0.0.1:8080"
|
||||
assert proxy.username is None
|
||||
assert proxy.password is None
|
||||
assert proxy.ip == "127.0.0.1" # Should auto-extract IP
|
||||
|
||||
def test_proxy_config_object_creation_full(self):
|
||||
"""Test ProxyConfig object creation with all parameters."""
|
||||
proxy = ProxyConfig(
|
||||
server=f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
username=self.TEST_PROXY_DATA['username'],
|
||||
password=self.TEST_PROXY_DATA['password'],
|
||||
ip=self.TEST_PROXY_DATA['ip']
|
||||
)
|
||||
assert proxy.server == f"http://{self.TEST_PROXY_DATA['server']}"
|
||||
assert proxy.username == self.TEST_PROXY_DATA['username']
|
||||
assert proxy.password == self.TEST_PROXY_DATA['password']
|
||||
assert proxy.ip == self.TEST_PROXY_DATA['ip']
|
||||
|
||||
def test_proxy_config_object_ip_extraction(self):
|
||||
"""Test automatic IP extraction from server URL."""
|
||||
test_cases = [
|
||||
("http://192.168.1.1:8080", "192.168.1.1"),
|
||||
("https://10.0.0.1:3128", "10.0.0.1"),
|
||||
("192.168.1.100:8080", "192.168.1.100"),
|
||||
("proxy.example.com:8080", "proxy.example.com"),
|
||||
]
|
||||
|
||||
for server, expected_ip in test_cases:
|
||||
proxy = ProxyConfig(server=server)
|
||||
assert proxy.ip == expected_ip, f"Failed for server: {server}"
|
||||
|
||||
def test_proxy_config_object_invalid_server(self):
|
||||
"""Test ProxyConfig with invalid server formats."""
|
||||
# Should not raise exception but may not extract IP properly
|
||||
proxy = ProxyConfig(server="invalid-format")
|
||||
assert proxy.server == "invalid-format"
|
||||
# IP extraction might fail but object should still be created
|
||||
|
||||
# ==================== DICT FORM TESTS ====================
|
||||
|
||||
def test_proxy_config_from_dict_basic(self):
|
||||
"""Test creating ProxyConfig from basic dictionary."""
|
||||
proxy_dict = {"server": "127.0.0.1:8080"}
|
||||
proxy = ProxyConfig.from_dict(proxy_dict)
|
||||
assert proxy.server == "127.0.0.1:8080"
|
||||
assert proxy.username is None
|
||||
assert proxy.password is None
|
||||
|
||||
def test_proxy_config_from_dict_full(self):
|
||||
"""Test creating ProxyConfig from complete dictionary."""
|
||||
proxy_dict = {
|
||||
"server": f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
"username": self.TEST_PROXY_DATA['username'],
|
||||
"password": self.TEST_PROXY_DATA['password'],
|
||||
"ip": self.TEST_PROXY_DATA['ip']
|
||||
}
|
||||
proxy = ProxyConfig.from_dict(proxy_dict)
|
||||
assert proxy.server == proxy_dict["server"]
|
||||
assert proxy.username == proxy_dict["username"]
|
||||
assert proxy.password == proxy_dict["password"]
|
||||
assert proxy.ip == proxy_dict["ip"]
|
||||
|
||||
def test_proxy_config_from_dict_missing_keys(self):
|
||||
"""Test creating ProxyConfig from dictionary with missing keys."""
|
||||
proxy_dict = {"server": "127.0.0.1:8080", "username": "user"}
|
||||
proxy = ProxyConfig.from_dict(proxy_dict)
|
||||
assert proxy.server == "127.0.0.1:8080"
|
||||
assert proxy.username == "user"
|
||||
assert proxy.password is None
|
||||
assert proxy.ip == "127.0.0.1" # Should auto-extract
|
||||
|
||||
def test_proxy_config_from_dict_empty(self):
|
||||
"""Test creating ProxyConfig from empty dictionary."""
|
||||
proxy_dict = {}
|
||||
proxy = ProxyConfig.from_dict(proxy_dict)
|
||||
assert proxy.server is None
|
||||
assert proxy.username is None
|
||||
assert proxy.password is None
|
||||
assert proxy.ip is None
|
||||
|
||||
def test_proxy_config_from_dict_none_values(self):
|
||||
"""Test creating ProxyConfig from dictionary with None values."""
|
||||
proxy_dict = {
|
||||
"server": "127.0.0.1:8080",
|
||||
"username": None,
|
||||
"password": None,
|
||||
"ip": None
|
||||
}
|
||||
proxy = ProxyConfig.from_dict(proxy_dict)
|
||||
assert proxy.server == "127.0.0.1:8080"
|
||||
assert proxy.username is None
|
||||
assert proxy.password is None
|
||||
assert proxy.ip == "127.0.0.1" # Should auto-extract despite None
|
||||
|
||||
# ==================== STRING FORM TESTS ====================
|
||||
|
||||
def test_proxy_config_from_string_full_format(self):
|
||||
"""Test creating ProxyConfig from full string format (ip:port:username:password)."""
|
||||
proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"
|
||||
proxy = ProxyConfig.from_string(proxy_str)
|
||||
assert proxy.server == f"http://{self.TEST_PROXY_DATA['ip']}:6114"
|
||||
assert proxy.username == self.TEST_PROXY_DATA['username']
|
||||
assert proxy.password == self.TEST_PROXY_DATA['password']
|
||||
assert proxy.ip == self.TEST_PROXY_DATA['ip']
|
||||
|
||||
def test_proxy_config_from_string_ip_port_only(self):
|
||||
"""Test creating ProxyConfig from string with only ip:port."""
|
||||
proxy_str = "192.168.1.1:8080"
|
||||
proxy = ProxyConfig.from_string(proxy_str)
|
||||
assert proxy.server == "http://192.168.1.1:8080"
|
||||
assert proxy.username is None
|
||||
assert proxy.password is None
|
||||
assert proxy.ip == "192.168.1.1"
|
||||
|
||||
def test_proxy_config_from_string_invalid_format(self):
|
||||
"""Test creating ProxyConfig from invalid string formats."""
|
||||
invalid_formats = [
|
||||
"invalid",
|
||||
"ip:port:user", # Missing password (3 parts)
|
||||
"ip:port:user:pass:extra", # Too many parts (5 parts)
|
||||
"",
|
||||
"::", # Empty parts but 3 total (invalid)
|
||||
"::::", # Empty parts but 5 total (invalid)
|
||||
]
|
||||
|
||||
for proxy_str in invalid_formats:
|
||||
with pytest.raises(ValueError, match="Invalid proxy string format"):
|
||||
ProxyConfig.from_string(proxy_str)
|
||||
|
||||
def test_proxy_config_from_string_edge_cases_that_work(self):
|
||||
"""Test string formats that should work but might be edge cases."""
|
||||
# These cases actually work as valid formats
|
||||
edge_cases = [
|
||||
(":", "http://:", ""), # ip:port format with empty values
|
||||
(":::", "http://:", ""), # ip:port:user:pass format with empty values
|
||||
]
|
||||
|
||||
for proxy_str, expected_server, expected_ip in edge_cases:
|
||||
proxy = ProxyConfig.from_string(proxy_str)
|
||||
assert proxy.server == expected_server
|
||||
assert proxy.ip == expected_ip
|
||||
|
||||
def test_proxy_config_from_string_edge_cases(self):
|
||||
"""Test string parsing edge cases."""
|
||||
# Test with different port numbers
|
||||
proxy_str = "10.0.0.1:3128:user:pass"
|
||||
proxy = ProxyConfig.from_string(proxy_str)
|
||||
assert proxy.server == "http://10.0.0.1:3128"
|
||||
|
||||
# Test with special characters in credentials
|
||||
proxy_str = "10.0.0.1:8080:user@domain:pass:word"
|
||||
with pytest.raises(ValueError): # Should fail due to extra colon in password
|
||||
ProxyConfig.from_string(proxy_str)
|
||||
|
||||
# ==================== ENVIRONMENT VARIABLE TESTS ====================
|
||||
|
||||
def test_proxy_config_from_env_single_proxy(self):
|
||||
"""Test loading single proxy from environment variable."""
|
||||
proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"
|
||||
|
||||
with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
assert len(proxies) == 1
|
||||
proxy = proxies[0]
|
||||
assert proxy.ip == self.TEST_PROXY_DATA['ip']
|
||||
assert proxy.username == self.TEST_PROXY_DATA['username']
|
||||
assert proxy.password == self.TEST_PROXY_DATA['password']
|
||||
|
||||
def test_proxy_config_from_env_multiple_proxies(self):
|
||||
"""Test loading multiple proxies from environment variable."""
|
||||
proxy_list = [
|
||||
"192.168.1.1:8080:user1:pass1",
|
||||
"192.168.1.2:8080:user2:pass2",
|
||||
"10.0.0.1:3128" # No auth
|
||||
]
|
||||
proxy_str = ",".join(proxy_list)
|
||||
|
||||
with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
assert len(proxies) == 3
|
||||
|
||||
# Check first proxy
|
||||
assert proxies[0].ip == "192.168.1.1"
|
||||
assert proxies[0].username == "user1"
|
||||
assert proxies[0].password == "pass1"
|
||||
|
||||
# Check second proxy
|
||||
assert proxies[1].ip == "192.168.1.2"
|
||||
assert proxies[1].username == "user2"
|
||||
assert proxies[1].password == "pass2"
|
||||
|
||||
# Check third proxy (no auth)
|
||||
assert proxies[2].ip == "10.0.0.1"
|
||||
assert proxies[2].username is None
|
||||
assert proxies[2].password is None
|
||||
|
||||
def test_proxy_config_from_env_empty_var(self):
|
||||
"""Test loading from empty environment variable."""
|
||||
with patch.dict(os.environ, {'TEST_PROXIES': ''}):
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
assert len(proxies) == 0
|
||||
|
||||
def test_proxy_config_from_env_missing_var(self):
|
||||
"""Test loading from missing environment variable."""
|
||||
# Ensure the env var doesn't exist
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
proxies = ProxyConfig.from_env('NON_EXISTENT_VAR')
|
||||
assert len(proxies) == 0
|
||||
|
||||
def test_proxy_config_from_env_with_empty_entries(self):
|
||||
"""Test loading proxies with empty entries in the list."""
|
||||
proxy_str = "192.168.1.1:8080:user:pass,,10.0.0.1:3128,"
|
||||
|
||||
with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
assert len(proxies) == 2 # Empty entries should be skipped
|
||||
assert proxies[0].ip == "192.168.1.1"
|
||||
assert proxies[1].ip == "10.0.0.1"
|
||||
|
||||
def test_proxy_config_from_env_with_invalid_entries(self):
|
||||
"""Test loading proxies with some invalid entries."""
|
||||
proxy_str = "192.168.1.1:8080:user:pass,invalid_proxy,10.0.0.1:3128"
|
||||
|
||||
with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
|
||||
# Should handle errors gracefully and return valid proxies
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
# Depending on implementation, might return partial list or empty
|
||||
# This tests error handling
|
||||
assert isinstance(proxies, list)
|
||||
|
||||
# ==================== SERIALIZATION TESTS ====================
|
||||
|
||||
def test_proxy_config_to_dict(self):
|
||||
"""Test converting ProxyConfig to dictionary."""
|
||||
proxy = ProxyConfig(
|
||||
server=f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
username=self.TEST_PROXY_DATA['username'],
|
||||
password=self.TEST_PROXY_DATA['password'],
|
||||
ip=self.TEST_PROXY_DATA['ip']
|
||||
)
|
||||
|
||||
result_dict = proxy.to_dict()
|
||||
expected = {
|
||||
"server": f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
"username": self.TEST_PROXY_DATA['username'],
|
||||
"password": self.TEST_PROXY_DATA['password'],
|
||||
"ip": self.TEST_PROXY_DATA['ip']
|
||||
}
|
||||
assert result_dict == expected
|
||||
|
||||
def test_proxy_config_clone(self):
|
||||
"""Test cloning ProxyConfig with modifications."""
|
||||
original = ProxyConfig(
|
||||
server="http://127.0.0.1:8080",
|
||||
username="user",
|
||||
password="pass"
|
||||
)
|
||||
|
||||
# Clone with modifications
|
||||
cloned = original.clone(username="new_user", password="new_pass")
|
||||
|
||||
# Original should be unchanged
|
||||
assert original.username == "user"
|
||||
assert original.password == "pass"
|
||||
|
||||
# Clone should have new values
|
||||
assert cloned.username == "new_user"
|
||||
assert cloned.password == "new_pass"
|
||||
assert cloned.server == original.server # Unchanged value
|
||||
|
||||
def test_proxy_config_roundtrip_serialization(self):
|
||||
"""Test that ProxyConfig can be serialized and deserialized without loss."""
|
||||
original = ProxyConfig(
|
||||
server=f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
username=self.TEST_PROXY_DATA['username'],
|
||||
password=self.TEST_PROXY_DATA['password'],
|
||||
ip=self.TEST_PROXY_DATA['ip']
|
||||
)
|
||||
|
||||
# Serialize to dict and back
|
||||
serialized = original.to_dict()
|
||||
deserialized = ProxyConfig.from_dict(serialized)
|
||||
|
||||
assert deserialized.server == original.server
|
||||
assert deserialized.username == original.username
|
||||
assert deserialized.password == original.password
|
||||
assert deserialized.ip == original.ip
|
||||
|
||||
# ==================== INTEGRATION TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_with_proxy_config_object(self):
|
||||
"""Test AsyncWebCrawler with ProxyConfig object."""
|
||||
proxy_config = ProxyConfig(
|
||||
server=f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
username=self.TEST_PROXY_DATA['username'],
|
||||
password=self.TEST_PROXY_DATA['password']
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
# Test that the crawler accepts the ProxyConfig object without errors
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
# Note: This might fail due to actual proxy connection, but should not fail due to config issues
|
||||
result = await crawler.arun(
|
||||
url=self.test_url,
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config,
|
||||
page_timeout=10000 # Short timeout for testing
|
||||
)
|
||||
)
|
||||
# If we get here, proxy config was accepted
|
||||
assert result is not None
|
||||
except Exception as e:
|
||||
# We expect connection errors with test proxies, but not config errors
|
||||
error_msg = str(e).lower()
|
||||
assert "attribute" not in error_msg, f"Config error: {e}"
|
||||
assert "proxy_config" not in error_msg, f"Proxy config error: {e}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_with_proxy_config_dict(self):
|
||||
"""Test AsyncWebCrawler with ProxyConfig from dictionary."""
|
||||
proxy_dict = {
|
||||
"server": f"http://{self.TEST_PROXY_DATA['server']}",
|
||||
"username": self.TEST_PROXY_DATA['username'],
|
||||
"password": self.TEST_PROXY_DATA['password']
|
||||
}
|
||||
proxy_config = ProxyConfig.from_dict(proxy_dict)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
result = await crawler.arun(
|
||||
url=self.test_url,
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config,
|
||||
page_timeout=10000
|
||||
)
|
||||
)
|
||||
assert result is not None
|
||||
except Exception as e:
|
||||
error_msg = str(e).lower()
|
||||
assert "attribute" not in error_msg, f"Config error: {e}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_with_proxy_config_from_string(self):
|
||||
"""Test AsyncWebCrawler with ProxyConfig from string."""
|
||||
proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"
|
||||
proxy_config = ProxyConfig.from_string(proxy_str)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
result = await crawler.arun(
|
||||
url=self.test_url,
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config,
|
||||
page_timeout=10000
|
||||
)
|
||||
)
|
||||
assert result is not None
|
||||
except Exception as e:
|
||||
error_msg = str(e).lower()
|
||||
assert "attribute" not in error_msg, f"Config error: {e}"
|
||||
|
||||
# ==================== EDGE CASES AND ERROR HANDLING ====================
|
||||
|
||||
def test_proxy_config_with_none_server(self):
|
||||
"""Test ProxyConfig behavior with None server."""
|
||||
proxy = ProxyConfig(server=None)
|
||||
assert proxy.server is None
|
||||
assert proxy.ip is None # Should not crash
|
||||
|
||||
def test_proxy_config_with_empty_string_server(self):
|
||||
"""Test ProxyConfig behavior with empty string server."""
|
||||
proxy = ProxyConfig(server="")
|
||||
assert proxy.server == ""
|
||||
assert proxy.ip is None or proxy.ip == ""
|
||||
|
||||
def test_proxy_config_special_characters_in_credentials(self):
|
||||
"""Test ProxyConfig with special characters in username/password."""
|
||||
special_chars_tests = [
|
||||
("user@domain.com", "pass!@#$%"),
|
||||
("user_123", "p@ssw0rd"),
|
||||
("user-test", "pass-word"),
|
||||
]
|
||||
|
||||
for username, password in special_chars_tests:
|
||||
proxy = ProxyConfig(
|
||||
server="http://127.0.0.1:8080",
|
||||
username=username,
|
||||
password=password
|
||||
)
|
||||
assert proxy.username == username
|
||||
assert proxy.password == password
|
||||
|
||||
def test_proxy_config_unicode_handling(self):
|
||||
"""Test ProxyConfig with unicode characters."""
|
||||
proxy = ProxyConfig(
|
||||
server="http://127.0.0.1:8080",
|
||||
username="ユーザー", # Japanese characters
|
||||
password="пароль" # Cyrillic characters
|
||||
)
|
||||
assert proxy.username == "ユーザー"
|
||||
assert proxy.password == "пароль"
|
||||
|
||||
# ==================== PERFORMANCE TESTS ====================
|
||||
|
||||
def test_proxy_config_creation_performance(self):
|
||||
"""Test that ProxyConfig creation is reasonably fast."""
|
||||
import time
|
||||
|
||||
start_time = time.time()
|
||||
for i in range(1000):
|
||||
proxy = ProxyConfig(
|
||||
server=f"http://192.168.1.{i % 255}:8080",
|
||||
username=f"user{i}",
|
||||
password=f"pass{i}"
|
||||
)
|
||||
end_time = time.time()
|
||||
|
||||
# Should be able to create 1000 configs in less than 1 second
|
||||
assert (end_time - start_time) < 1.0
|
||||
|
||||
def test_proxy_config_from_env_performance(self):
|
||||
"""Test that loading many proxies from env is reasonably fast."""
|
||||
import time
|
||||
|
||||
# Create a large list of proxy strings
|
||||
proxy_list = [f"192.168.1.{i}:8080:user{i}:pass{i}" for i in range(100)]
|
||||
proxy_str = ",".join(proxy_list)
|
||||
|
||||
with patch.dict(os.environ, {'PERF_TEST_PROXIES': proxy_str}):
|
||||
start_time = time.time()
|
||||
proxies = ProxyConfig.from_env('PERF_TEST_PROXIES')
|
||||
end_time = time.time()
|
||||
|
||||
assert len(proxies) == 100
|
||||
# Should be able to parse 100 proxies in less than 1 second
|
||||
assert (end_time - start_time) < 1.0
|
||||
|
||||
|
||||
# ==================== STANDALONE TEST FUNCTIONS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_dict_proxy():
|
||||
"""Original test function for dict proxy - kept for backward compatibility."""
|
||||
proxy_config = {
|
||||
"server": "23.95.150.145:6114",
|
||||
"username": "cfyswbwn",
|
||||
"password": "1gs266hoqysi"
|
||||
}
|
||||
proxy_config_obj = ProxyConfig.from_dict(proxy_config)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
|
||||
stream=False,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config_obj,
|
||||
page_timeout=10000
|
||||
))
|
||||
print("Dict proxy test passed!")
|
||||
print(result.markdown[:200] if result and result.markdown else "No result")
|
||||
except Exception as e:
|
||||
print(f"Dict proxy test error (expected): {e}")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_string_proxy():
|
||||
"""Test function for string proxy format."""
|
||||
proxy_str = "23.95.150.145:6114:cfyswbwn:1gs266hoqysi"
|
||||
proxy_config_obj = ProxyConfig.from_string(proxy_str)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
|
||||
stream=False,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config_obj,
|
||||
page_timeout=10000
|
||||
))
|
||||
print("String proxy test passed!")
|
||||
print(result.markdown[:200] if result and result.markdown else "No result")
|
||||
except Exception as e:
|
||||
print(f"String proxy test error (expected): {e}")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_env_proxy():
|
||||
"""Test function for environment variable proxy."""
|
||||
# Set environment variable
|
||||
os.environ['TEST_PROXIES'] = "23.95.150.145:6114:cfyswbwn:1gs266hoqysi"
|
||||
|
||||
proxies = ProxyConfig.from_env('TEST_PROXIES')
|
||||
if proxies:
|
||||
proxy_config_obj = proxies[0] # Use first proxy
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
try:
|
||||
result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
|
||||
stream=False,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_config=proxy_config_obj,
|
||||
page_timeout=10000
|
||||
))
|
||||
print("Environment proxy test passed!")
|
||||
print(result.markdown[:200] if result and result.markdown else "No result")
|
||||
except Exception as e:
|
||||
print(f"Environment proxy test error (expected): {e}")
|
||||
else:
|
||||
print("No proxies loaded from environment")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Running comprehensive ProxyConfig tests...")
|
||||
print("=" * 50)
|
||||
|
||||
# Run the standalone test functions
|
||||
print("\n1. Testing dict proxy format...")
|
||||
asyncio.run(test_dict_proxy())
|
||||
|
||||
print("\n2. Testing string proxy format...")
|
||||
asyncio.run(test_string_proxy())
|
||||
|
||||
print("\n3. Testing environment variable proxy format...")
|
||||
asyncio.run(test_env_proxy())
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print("To run the full pytest suite, use: pytest " + __file__)
|
||||
print("=" * 50)
|
||||
64
tests/telemetry/conftest.py
Normal file
64
tests/telemetry/conftest.py
Normal file
@@ -0,0 +1,64 @@
|
||||
"""
|
||||
Test configuration and utilities for telemetry testing.
|
||||
"""
|
||||
|
||||
import os
|
||||
import pytest
|
||||
|
||||
|
||||
def pytest_configure(config): # noqa: ARG001
|
||||
"""Configure pytest for telemetry tests."""
|
||||
# Add custom markers
|
||||
config.addinivalue_line("markers", "unit: Unit tests")
|
||||
config.addinivalue_line("markers", "integration: Integration tests")
|
||||
config.addinivalue_line("markers", "privacy: Privacy compliance tests")
|
||||
config.addinivalue_line("markers", "performance: Performance tests")
|
||||
config.addinivalue_line("markers", "slow: Slow running tests")
|
||||
|
||||
|
||||
def pytest_collection_modifyitems(config, items): # noqa: ARG001
|
||||
"""Modify test collection to add markers automatically."""
|
||||
for item in items:
|
||||
# Add markers based on test location and name
|
||||
if "telemetry" in str(item.fspath):
|
||||
if "integration" in item.name or "test_integration" in str(item.fspath):
|
||||
item.add_marker(pytest.mark.integration)
|
||||
elif "privacy" in item.name or "performance" in item.name:
|
||||
if "privacy" in item.name:
|
||||
item.add_marker(pytest.mark.privacy)
|
||||
if "performance" in item.name:
|
||||
item.add_marker(pytest.mark.performance)
|
||||
else:
|
||||
item.add_marker(pytest.mark.unit)
|
||||
|
||||
# Mark slow tests
|
||||
if "slow" in item.name or any(mark.name == "slow" for mark in item.iter_markers()):
|
||||
item.add_marker(pytest.mark.slow)
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def setup_test_environment():
|
||||
"""Set up test environment variables."""
|
||||
# Ensure we're in test mode
|
||||
os.environ['CRAWL4AI_TEST_MODE'] = '1'
|
||||
|
||||
# Disable actual telemetry during tests unless explicitly enabled
|
||||
if 'CRAWL4AI_TELEMETRY_TEST_REAL' not in os.environ:
|
||||
os.environ['CRAWL4AI_TELEMETRY'] = '0'
|
||||
|
||||
yield
|
||||
|
||||
# Clean up after tests
|
||||
test_vars = ['CRAWL4AI_TEST_MODE', 'CRAWL4AI_TELEMETRY_TEST_REAL']
|
||||
for var in test_vars:
|
||||
if var in os.environ:
|
||||
del os.environ[var]
|
||||
|
||||
|
||||
def pytest_report_header(config): # noqa: ARG001
|
||||
"""Add information to pytest header."""
|
||||
return [
|
||||
"Crawl4AI Telemetry Tests",
|
||||
f"Test mode: {'ENABLED' if os.environ.get('CRAWL4AI_TEST_MODE') else 'DISABLED'}",
|
||||
f"Real telemetry: {'ENABLED' if os.environ.get('CRAWL4AI_TELEMETRY_TEST_REAL') else 'DISABLED'}"
|
||||
]
|
||||
216
tests/telemetry/test_integration.py
Normal file
216
tests/telemetry/test_integration.py
Normal file
@@ -0,0 +1,216 @@
|
||||
"""
|
||||
Integration tests for telemetry CLI commands.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import subprocess
|
||||
import sys
|
||||
import os
|
||||
from unittest.mock import patch, Mock
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
class TestTelemetryCLI:
|
||||
"""Test telemetry CLI commands integration."""
|
||||
|
||||
def test_telemetry_status_command(self, clean_environment, temp_config_dir):
|
||||
"""Test the telemetry status CLI command."""
|
||||
# Import with mocked config
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig') as MockConfig:
|
||||
mock_config = Mock()
|
||||
mock_config.get_consent.return_value = 'not_set'
|
||||
mock_config.is_enabled.return_value = False
|
||||
MockConfig.return_value = mock_config
|
||||
|
||||
from crawl4ai.cli import main
|
||||
|
||||
# Test status command
|
||||
with patch('sys.argv', ['crawl4ai', 'telemetry', 'status']):
|
||||
try:
|
||||
main()
|
||||
except SystemExit:
|
||||
pass # CLI commands often call sys.exit()
|
||||
|
||||
def test_telemetry_enable_command(self, clean_environment, temp_config_dir):
|
||||
"""Test the telemetry enable CLI command."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig') as MockConfig:
|
||||
mock_config = Mock()
|
||||
MockConfig.return_value = mock_config
|
||||
|
||||
from crawl4ai.cli import main
|
||||
|
||||
# Test enable command
|
||||
with patch('sys.argv', ['crawl4ai', 'telemetry', 'enable', '--email', 'test@example.com']):
|
||||
try:
|
||||
main()
|
||||
except SystemExit:
|
||||
pass
|
||||
|
||||
def test_telemetry_disable_command(self, clean_environment, temp_config_dir):
|
||||
"""Test the telemetry disable CLI command."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig') as MockConfig:
|
||||
mock_config = Mock()
|
||||
MockConfig.return_value = mock_config
|
||||
|
||||
from crawl4ai.cli import main
|
||||
|
||||
# Test disable command
|
||||
with patch('sys.argv', ['crawl4ai', 'telemetry', 'disable']):
|
||||
try:
|
||||
main()
|
||||
except SystemExit:
|
||||
pass
|
||||
|
||||
@pytest.mark.slow
|
||||
def test_cli_subprocess_integration(self, temp_config_dir):
|
||||
"""Test CLI commands as subprocess calls."""
|
||||
env = os.environ.copy()
|
||||
env['CRAWL4AI_CONFIG_DIR'] = str(temp_config_dir)
|
||||
|
||||
# Test status command via subprocess
|
||||
try:
|
||||
result = subprocess.run(
|
||||
[sys.executable, '-m', 'crawl4ai.cli', 'telemetry', 'status'],
|
||||
env=env,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=10
|
||||
)
|
||||
# Should not crash, regardless of exit code
|
||||
assert result.returncode in [0, 1] # May return 1 if not configured
|
||||
except subprocess.TimeoutExpired:
|
||||
pytest.skip("CLI command timed out")
|
||||
except FileNotFoundError:
|
||||
pytest.skip("CLI module not found")
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
class TestAsyncWebCrawlerIntegration:
|
||||
"""Test AsyncWebCrawler telemetry integration."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_telemetry_decorator(self, enabled_telemetry_config, mock_sentry_provider):
|
||||
"""Test that AsyncWebCrawler methods are decorated with telemetry."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
# Check if the arun method has telemetry decoration
|
||||
crawler = AsyncWebCrawler()
|
||||
assert hasattr(crawler.arun, '__wrapped__') or callable(crawler.arun)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_exception_capture_integration(self, enabled_telemetry_config, mock_sentry_provider):
|
||||
"""Test that exceptions in AsyncWebCrawler are captured."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
with patch('crawl4ai.telemetry.capture_exception') as _mock_capture:
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
try:
|
||||
# This should cause an exception
|
||||
await crawler.arun(url="invalid://url")
|
||||
except Exception:
|
||||
pass # We expect this to fail
|
||||
|
||||
# The decorator should have attempted to capture the exception
|
||||
# Note: This might not always be called depending on where the exception occurs
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_with_disabled_telemetry(self, disabled_telemetry_config):
|
||||
"""Test that AsyncWebCrawler works normally with disabled telemetry."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=disabled_telemetry_config):
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
# Should work normally even with telemetry disabled
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
assert crawler is not None
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
class TestDockerIntegration:
|
||||
"""Test Docker environment telemetry integration."""
|
||||
|
||||
def test_docker_environment_detection(self, docker_environment, temp_config_dir):
|
||||
"""Test that Docker environment is detected correctly."""
|
||||
from crawl4ai.telemetry.environment import EnvironmentDetector
|
||||
|
||||
env = EnvironmentDetector.detect()
|
||||
from crawl4ai.telemetry.environment import Environment
|
||||
assert env == Environment.DOCKER
|
||||
|
||||
def test_docker_default_telemetry_enabled(self, temp_config_dir):
|
||||
"""Test that telemetry is enabled by default in Docker."""
|
||||
from crawl4ai.telemetry.environment import Environment
|
||||
|
||||
# Clear any existing environment variables that might interfere
|
||||
with patch.dict(os.environ, {}, clear=True):
|
||||
# Set only the Docker environment variable
|
||||
os.environ['CRAWL4AI_DOCKER'] = 'true'
|
||||
|
||||
with patch('crawl4ai.telemetry.environment.EnvironmentDetector.detect', return_value=Environment.DOCKER):
|
||||
from crawl4ai.telemetry.consent import ConsentManager
|
||||
from crawl4ai.telemetry.config import TelemetryConfig, TelemetryConsent
|
||||
|
||||
config = TelemetryConfig(config_dir=temp_config_dir)
|
||||
consent_manager = ConsentManager(config)
|
||||
|
||||
# Should set consent to ALWAYS for Docker
|
||||
consent_manager.check_and_prompt()
|
||||
assert config.get_consent() == TelemetryConsent.ALWAYS
|
||||
|
||||
def test_docker_telemetry_can_be_disabled(self, temp_config_dir):
|
||||
"""Test that Docker telemetry can be disabled via environment variable."""
|
||||
from crawl4ai.telemetry.environment import Environment
|
||||
|
||||
with patch.dict(os.environ, {'CRAWL4AI_TELEMETRY': '0', 'CRAWL4AI_DOCKER': 'true'}):
|
||||
with patch('crawl4ai.telemetry.environment.EnvironmentDetector.detect', return_value=Environment.DOCKER):
|
||||
from crawl4ai.telemetry.consent import ConsentManager
|
||||
from crawl4ai.telemetry.config import TelemetryConfig, TelemetryConsent
|
||||
|
||||
config = TelemetryConfig(config_dir=temp_config_dir)
|
||||
consent_manager = ConsentManager(config)
|
||||
|
||||
# Should set consent to DENIED when env var is 0
|
||||
consent_manager.check_and_prompt()
|
||||
assert config.get_consent() == TelemetryConsent.DENIED
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
class TestTelemetryProviderIntegration:
|
||||
"""Test telemetry provider integration."""
|
||||
|
||||
def test_sentry_provider_initialization(self, enabled_telemetry_config):
|
||||
"""Test that Sentry provider initializes correctly."""
|
||||
try:
|
||||
from crawl4ai.telemetry.providers.sentry import SentryProvider
|
||||
|
||||
provider = SentryProvider()
|
||||
# Should not crash during initialization
|
||||
assert provider is not None
|
||||
|
||||
except ImportError:
|
||||
pytest.skip("Sentry provider not available")
|
||||
|
||||
def test_null_provider_fallback(self, disabled_telemetry_config):
|
||||
"""Test that NullProvider is used when telemetry is disabled."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=disabled_telemetry_config):
|
||||
from crawl4ai.telemetry import TelemetryManager
|
||||
from crawl4ai.telemetry.base import NullProvider
|
||||
|
||||
manager = TelemetryManager()
|
||||
assert isinstance(manager._provider, NullProvider) # noqa: SLF001
|
||||
|
||||
def test_graceful_degradation_without_sentry(self, enabled_telemetry_config):
|
||||
"""Test graceful degradation when sentry-sdk is not available."""
|
||||
with patch.dict('sys.modules', {'sentry_sdk': None}):
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
from crawl4ai.telemetry import TelemetryManager
|
||||
from crawl4ai.telemetry.base import NullProvider
|
||||
|
||||
# Should fall back to NullProvider when Sentry is not available
|
||||
manager = TelemetryManager()
|
||||
assert isinstance(manager._provider, NullProvider) # noqa: SLF001
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
pytest.main([__file__, "-v"])
|
||||
283
tests/telemetry/test_privacy_performance.py
Normal file
283
tests/telemetry/test_privacy_performance.py
Normal file
@@ -0,0 +1,283 @@
|
||||
"""
|
||||
Privacy and performance tests for telemetry system.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import time
|
||||
import asyncio
|
||||
from unittest.mock import patch
|
||||
from crawl4ai.telemetry import telemetry_decorator, async_telemetry_decorator, TelemetryManager
|
||||
|
||||
|
||||
@pytest.mark.privacy
|
||||
class TestTelemetryPrivacy:
|
||||
"""Test privacy compliance of telemetry system."""
|
||||
|
||||
def test_no_url_captured(self, enabled_telemetry_config, mock_sentry_provider, privacy_test_data):
|
||||
"""Test that URLs are not captured in telemetry data."""
|
||||
# Ensure config is properly set for sending
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
# Mock the provider directly in the manager
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
# Create exception with URL in context
|
||||
exception = ValueError("Test error")
|
||||
context = {'url': privacy_test_data['url']}
|
||||
|
||||
manager.capture_exception(exception, context)
|
||||
|
||||
# Verify that the provider was called
|
||||
mock_sentry_provider.send_exception.assert_called_once()
|
||||
call_args = mock_sentry_provider.send_exception.call_args
|
||||
|
||||
# Verify that context was passed to the provider (filtering happens in provider)
|
||||
assert len(call_args) >= 2
|
||||
|
||||
def test_no_content_captured(self, enabled_telemetry_config, mock_sentry_provider, privacy_test_data):
|
||||
"""Test that crawled content is not captured."""
|
||||
# Ensure config is properly set
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
exception = ValueError("Test error")
|
||||
context = {
|
||||
'content': privacy_test_data['content'],
|
||||
'html': '<html><body>Private content</body></html>',
|
||||
'text': 'Extracted private text'
|
||||
}
|
||||
|
||||
manager.capture_exception(exception, context)
|
||||
|
||||
mock_sentry_provider.send_exception.assert_called_once()
|
||||
call_args = mock_sentry_provider.send_exception.call_args
|
||||
|
||||
# Verify that the provider was called (actual filtering would happen in provider)
|
||||
assert len(call_args) >= 2
|
||||
|
||||
def test_no_pii_captured(self, enabled_telemetry_config, mock_sentry_provider, privacy_test_data):
|
||||
"""Test that PII is not captured in telemetry."""
|
||||
# Ensure config is properly set
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
exception = ValueError("Test error")
|
||||
context = privacy_test_data['user_data'].copy()
|
||||
context.update(privacy_test_data['pii'])
|
||||
|
||||
manager.capture_exception(exception, context)
|
||||
|
||||
mock_sentry_provider.send_exception.assert_called_once()
|
||||
call_args = mock_sentry_provider.send_exception.call_args
|
||||
|
||||
# Verify that the provider was called (actual filtering would happen in provider)
|
||||
assert len(call_args) >= 2
|
||||
|
||||
def test_sanitized_context_captured(self, enabled_telemetry_config, mock_sentry_provider):
|
||||
"""Test that only safe context is captured."""
|
||||
# Ensure config is properly set
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
exception = ValueError("Test error")
|
||||
context = {
|
||||
'operation': 'crawl', # Safe to capture
|
||||
'status_code': 404, # Safe to capture
|
||||
'retry_count': 3, # Safe to capture
|
||||
'user_email': 'secret@example.com', # Should be in context (not filtered at this level)
|
||||
'content': 'private content' # Should be in context (not filtered at this level)
|
||||
}
|
||||
|
||||
manager.capture_exception(exception, context)
|
||||
|
||||
mock_sentry_provider.send_exception.assert_called_once()
|
||||
call_args = mock_sentry_provider.send_exception.call_args
|
||||
|
||||
# Get the actual arguments passed to the mock
|
||||
args, kwargs = call_args
|
||||
assert len(args) >= 2, f"Expected at least 2 args, got {len(args)}"
|
||||
|
||||
# The second argument should be the context
|
||||
captured_context = args[1]
|
||||
|
||||
# The basic context should be present (this tests the manager, not the provider filtering)
|
||||
assert 'operation' in captured_context, f"operation not found in {captured_context}"
|
||||
assert captured_context.get('operation') == 'crawl'
|
||||
assert captured_context.get('status_code') == 404
|
||||
assert captured_context.get('retry_count') == 3
|
||||
|
||||
|
||||
@pytest.mark.performance
|
||||
class TestTelemetryPerformance:
|
||||
"""Test performance impact of telemetry system."""
|
||||
|
||||
def test_decorator_overhead_sync(self, enabled_telemetry_config, mock_sentry_provider): # noqa: ARG002
|
||||
"""Test performance overhead of sync telemetry decorator."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
|
||||
@telemetry_decorator
|
||||
def test_function():
|
||||
"""Test function with telemetry decorator."""
|
||||
time.sleep(0.001) # Simulate small amount of work
|
||||
return "success"
|
||||
|
||||
# Measure time with telemetry
|
||||
start_time = time.time()
|
||||
for _ in range(100):
|
||||
test_function()
|
||||
telemetry_time = time.time() - start_time
|
||||
|
||||
# Telemetry should add minimal overhead
|
||||
assert telemetry_time < 1.0 # Should complete 100 calls in under 1 second
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_decorator_overhead_async(self, enabled_telemetry_config, mock_sentry_provider): # noqa: ARG002
|
||||
"""Test performance overhead of async telemetry decorator."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
|
||||
@async_telemetry_decorator
|
||||
async def test_async_function():
|
||||
"""Test async function with telemetry decorator."""
|
||||
await asyncio.sleep(0.001) # Simulate small amount of async work
|
||||
return "success"
|
||||
|
||||
# Measure time with telemetry
|
||||
start_time = time.time()
|
||||
tasks = [test_async_function() for _ in range(100)]
|
||||
await asyncio.gather(*tasks)
|
||||
telemetry_time = time.time() - start_time
|
||||
|
||||
# Telemetry should add minimal overhead to async operations
|
||||
assert telemetry_time < 2.0 # Should complete 100 async calls in under 2 seconds
|
||||
|
||||
def test_disabled_telemetry_performance(self, disabled_telemetry_config):
|
||||
"""Test that disabled telemetry has zero overhead."""
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=disabled_telemetry_config):
|
||||
|
||||
@telemetry_decorator
|
||||
def test_function():
|
||||
"""Test function with disabled telemetry."""
|
||||
time.sleep(0.001)
|
||||
return "success"
|
||||
|
||||
# Measure time with disabled telemetry
|
||||
start_time = time.time()
|
||||
for _ in range(100):
|
||||
test_function()
|
||||
disabled_time = time.time() - start_time
|
||||
|
||||
# Should be very fast when disabled
|
||||
assert disabled_time < 0.5 # Should be faster than enabled telemetry
|
||||
|
||||
def test_telemetry_manager_initialization_performance(self):
|
||||
"""Test that TelemetryManager initializes quickly."""
|
||||
start_time = time.time()
|
||||
|
||||
# Initialize multiple managers (should use singleton)
|
||||
for _ in range(10):
|
||||
TelemetryManager.get_instance()
|
||||
|
||||
init_time = time.time() - start_time
|
||||
|
||||
# Initialization should be fast
|
||||
assert init_time < 0.1 # Should initialize in under 100ms
|
||||
|
||||
def test_config_loading_performance(self, temp_config_dir):
|
||||
"""Test that config loading is fast."""
|
||||
from crawl4ai.telemetry.config import TelemetryConfig
|
||||
|
||||
# Create config with some data
|
||||
config = TelemetryConfig(config_dir=temp_config_dir)
|
||||
from crawl4ai.telemetry.config import TelemetryConsent
|
||||
config.set_consent(TelemetryConsent.ALWAYS, email="test@example.com")
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
# Load config multiple times
|
||||
for _ in range(100):
|
||||
new_config = TelemetryConfig(config_dir=temp_config_dir)
|
||||
new_config.get_consent()
|
||||
|
||||
load_time = time.time() - start_time
|
||||
|
||||
# Config loading should be fast
|
||||
assert load_time < 0.5 # Should load 100 times in under 500ms
|
||||
|
||||
|
||||
@pytest.mark.performance
|
||||
class TestTelemetryScalability:
|
||||
"""Test telemetry system scalability."""
|
||||
|
||||
def test_multiple_exception_capture(self, enabled_telemetry_config, mock_sentry_provider):
|
||||
"""Test capturing multiple exceptions in sequence."""
|
||||
# Ensure config is properly set
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
# Capture many exceptions
|
||||
for i in range(50):
|
||||
exception = ValueError(f"Test error {i}")
|
||||
manager.capture_exception(exception, {'operation': f'test_{i}'})
|
||||
|
||||
capture_time = time.time() - start_time
|
||||
|
||||
# Should handle multiple exceptions efficiently
|
||||
assert capture_time < 1.0 # Should capture 50 exceptions in under 1 second
|
||||
assert mock_sentry_provider.send_exception.call_count <= 50 # May be less due to consent checks
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_exception_capture(self, enabled_telemetry_config, mock_sentry_provider): # noqa: ARG002
|
||||
"""Test concurrent exception capture performance."""
|
||||
# Ensure config is properly set
|
||||
enabled_telemetry_config.is_enabled.return_value = True
|
||||
enabled_telemetry_config.should_send_current.return_value = True
|
||||
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig', return_value=enabled_telemetry_config):
|
||||
manager = TelemetryManager()
|
||||
manager._provider = mock_sentry_provider # noqa: SLF001
|
||||
manager._initialized = True # noqa: SLF001
|
||||
|
||||
async def capture_exception_async(i):
|
||||
exception = ValueError(f"Concurrent error {i}")
|
||||
return manager.capture_exception(exception, {'operation': f'concurrent_{i}'})
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
# Capture exceptions concurrently
|
||||
tasks = [capture_exception_async(i) for i in range(20)]
|
||||
await asyncio.gather(*tasks)
|
||||
|
||||
capture_time = time.time() - start_time
|
||||
|
||||
# Should handle concurrent exceptions efficiently
|
||||
assert capture_time < 1.0 # Should capture 20 concurrent exceptions in under 1 second
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
pytest.main([__file__, "-v"])
|
||||
241
tests/telemetry/test_telemetry.py
Normal file
241
tests/telemetry/test_telemetry.py
Normal file
@@ -0,0 +1,241 @@
|
||||
"""
|
||||
Tests for Crawl4AI telemetry functionality.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import os
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
import json
|
||||
from unittest.mock import Mock, patch, MagicMock
|
||||
|
||||
from crawl4ai.telemetry import (
|
||||
TelemetryManager,
|
||||
capture_exception,
|
||||
enable,
|
||||
disable,
|
||||
status
|
||||
)
|
||||
from crawl4ai.telemetry.config import TelemetryConfig, TelemetryConsent
|
||||
from crawl4ai.telemetry.environment import Environment, EnvironmentDetector
|
||||
from crawl4ai.telemetry.base import NullProvider
|
||||
from crawl4ai.telemetry.consent import ConsentManager
|
||||
|
||||
|
||||
class TestTelemetryConfig:
|
||||
"""Test telemetry configuration management."""
|
||||
|
||||
def test_config_initialization(self):
|
||||
"""Test config initialization with custom directory."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
config = TelemetryConfig(config_dir=Path(tmpdir))
|
||||
assert config.config_dir == Path(tmpdir)
|
||||
assert config.get_consent() == TelemetryConsent.NOT_SET
|
||||
|
||||
def test_consent_persistence(self):
|
||||
"""Test that consent is saved and loaded correctly."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
config = TelemetryConfig(config_dir=Path(tmpdir))
|
||||
|
||||
# Set consent
|
||||
config.set_consent(TelemetryConsent.ALWAYS, email="test@example.com")
|
||||
|
||||
# Create new config instance to test persistence
|
||||
config2 = TelemetryConfig(config_dir=Path(tmpdir))
|
||||
assert config2.get_consent() == TelemetryConsent.ALWAYS
|
||||
assert config2.get_email() == "test@example.com"
|
||||
|
||||
def test_environment_variable_override(self):
|
||||
"""Test that environment variables override config."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
config = TelemetryConfig(config_dir=Path(tmpdir))
|
||||
config.set_consent(TelemetryConsent.ALWAYS)
|
||||
|
||||
# Set environment variable to disable
|
||||
os.environ['CRAWL4AI_TELEMETRY'] = '0'
|
||||
try:
|
||||
config.update_from_env()
|
||||
assert config.get_consent() == TelemetryConsent.DENIED
|
||||
finally:
|
||||
del os.environ['CRAWL4AI_TELEMETRY']
|
||||
|
||||
|
||||
class TestEnvironmentDetection:
|
||||
"""Test environment detection functionality."""
|
||||
|
||||
def test_cli_detection(self):
|
||||
"""Test CLI environment detection."""
|
||||
# Mock sys.stdin.isatty
|
||||
with patch('sys.stdin.isatty', return_value=True):
|
||||
env = EnvironmentDetector.detect()
|
||||
# Should detect as CLI in most test environments
|
||||
assert env in [Environment.CLI, Environment.UNKNOWN]
|
||||
|
||||
def test_docker_detection(self):
|
||||
"""Test Docker environment detection."""
|
||||
# Mock Docker environment
|
||||
with patch.dict(os.environ, {'CRAWL4AI_DOCKER': 'true'}):
|
||||
env = EnvironmentDetector.detect()
|
||||
assert env == Environment.DOCKER
|
||||
|
||||
def test_api_server_detection(self):
|
||||
"""Test API server detection."""
|
||||
with patch.dict(os.environ, {'CRAWL4AI_API_SERVER': 'true', 'CRAWL4AI_DOCKER': 'true'}):
|
||||
env = EnvironmentDetector.detect()
|
||||
assert env == Environment.API_SERVER
|
||||
|
||||
|
||||
class TestTelemetryManager:
|
||||
"""Test the main telemetry manager."""
|
||||
|
||||
def test_singleton_pattern(self):
|
||||
"""Test that TelemetryManager is a singleton."""
|
||||
manager1 = TelemetryManager.get_instance()
|
||||
manager2 = TelemetryManager.get_instance()
|
||||
assert manager1 is manager2
|
||||
|
||||
def test_exception_capture(self):
|
||||
"""Test exception capture functionality."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
# Create manager with custom config dir
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig') as MockConfig:
|
||||
mock_config = Mock()
|
||||
mock_config.get_consent.return_value = TelemetryConsent.ALWAYS
|
||||
mock_config.is_enabled.return_value = True
|
||||
mock_config.should_send_current.return_value = True
|
||||
mock_config.get_email.return_value = "test@example.com"
|
||||
mock_config.update_from_env.return_value = None
|
||||
MockConfig.return_value = mock_config
|
||||
|
||||
# Mock the provider setup
|
||||
with patch('crawl4ai.telemetry.providers.sentry.SentryProvider') as MockSentryProvider:
|
||||
mock_provider = Mock()
|
||||
mock_provider.initialize.return_value = True
|
||||
mock_provider.send_exception.return_value = True
|
||||
MockSentryProvider.return_value = mock_provider
|
||||
|
||||
manager = TelemetryManager()
|
||||
|
||||
# Test exception capture
|
||||
test_exception = ValueError("Test error")
|
||||
result = manager.capture_exception(test_exception, {'test': 'context'})
|
||||
|
||||
# Verify the exception was processed
|
||||
assert mock_config.should_send_current.called
|
||||
|
||||
def test_null_provider_when_disabled(self):
|
||||
"""Test that NullProvider is used when telemetry is disabled."""
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
with patch('crawl4ai.telemetry.TelemetryConfig') as MockConfig:
|
||||
mock_config = Mock()
|
||||
mock_config.get_consent.return_value = TelemetryConsent.DENIED
|
||||
mock_config.is_enabled.return_value = False
|
||||
MockConfig.return_value = mock_config
|
||||
|
||||
manager = TelemetryManager()
|
||||
assert isinstance(manager._provider, NullProvider)
|
||||
|
||||
|
||||
class TestConsentManager:
|
||||
"""Test consent management functionality."""
|
||||
|
||||
def test_docker_default_enabled(self):
|
||||
"""Test that Docker environment has telemetry enabled by default."""
|
||||
with patch('crawl4ai.telemetry.consent.EnvironmentDetector.detect', return_value=Environment.DOCKER):
|
||||
with patch('os.environ.get') as mock_env_get:
|
||||
# Mock os.environ.get to return None for CRAWL4AI_TELEMETRY
|
||||
mock_env_get.return_value = None
|
||||
|
||||
config = Mock()
|
||||
config.get_consent.return_value = TelemetryConsent.NOT_SET
|
||||
|
||||
consent_manager = ConsentManager(config)
|
||||
consent_manager.check_and_prompt()
|
||||
|
||||
# Should be enabled by default in Docker
|
||||
assert config.set_consent.called
|
||||
assert config.set_consent.call_args[0][0] == TelemetryConsent.ALWAYS
|
||||
|
||||
def test_docker_disabled_by_env(self):
|
||||
"""Test that Docker telemetry can be disabled via environment variable."""
|
||||
with patch('crawl4ai.telemetry.consent.EnvironmentDetector.detect', return_value=Environment.DOCKER):
|
||||
with patch.dict(os.environ, {'CRAWL4AI_TELEMETRY': '0'}):
|
||||
config = Mock()
|
||||
config.get_consent.return_value = TelemetryConsent.NOT_SET
|
||||
|
||||
consent_manager = ConsentManager(config)
|
||||
consent = consent_manager.check_and_prompt()
|
||||
|
||||
# Should be disabled
|
||||
assert config.set_consent.called
|
||||
assert config.set_consent.call_args[0][0] == TelemetryConsent.DENIED
|
||||
|
||||
|
||||
class TestPublicAPI:
|
||||
"""Test the public API functions."""
|
||||
|
||||
@patch('crawl4ai.telemetry.get_telemetry')
|
||||
def test_enable_function(self, mock_get_telemetry):
|
||||
"""Test the enable() function."""
|
||||
mock_manager = Mock()
|
||||
mock_get_telemetry.return_value = mock_manager
|
||||
|
||||
enable(email="test@example.com", always=True)
|
||||
|
||||
mock_manager.enable.assert_called_once_with(
|
||||
email="test@example.com",
|
||||
always=True,
|
||||
once=False
|
||||
)
|
||||
|
||||
@patch('crawl4ai.telemetry.get_telemetry')
|
||||
def test_disable_function(self, mock_get_telemetry):
|
||||
"""Test the disable() function."""
|
||||
mock_manager = Mock()
|
||||
mock_get_telemetry.return_value = mock_manager
|
||||
|
||||
disable()
|
||||
|
||||
mock_manager.disable.assert_called_once()
|
||||
|
||||
@patch('crawl4ai.telemetry.get_telemetry')
|
||||
def test_status_function(self, mock_get_telemetry):
|
||||
"""Test the status() function."""
|
||||
mock_manager = Mock()
|
||||
mock_manager.status.return_value = {
|
||||
'enabled': True,
|
||||
'consent': 'always',
|
||||
'email': 'test@example.com'
|
||||
}
|
||||
mock_get_telemetry.return_value = mock_manager
|
||||
|
||||
result = status()
|
||||
|
||||
assert result['enabled'] is True
|
||||
assert result['consent'] == 'always'
|
||||
assert result['email'] == 'test@example.com'
|
||||
|
||||
|
||||
class TestIntegration:
|
||||
"""Integration tests for telemetry with AsyncWebCrawler."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_exception_capture(self):
|
||||
"""Test that AsyncWebCrawler captures exceptions."""
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
with patch('crawl4ai.telemetry.capture_exception') as mock_capture:
|
||||
# This should trigger an exception for invalid URL
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
try:
|
||||
# Use an invalid URL that will cause an error
|
||||
result = await crawler.arun(url="not-a-valid-url")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Check if exception was captured (may not be called if error is handled)
|
||||
# This is more of a smoke test to ensure the integration doesn't break
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
pytest.main([__file__, "-v"])
|
||||
170
tests/test_llm_simple_url.py
Normal file
170
tests/test_llm_simple_url.py
Normal file
@@ -0,0 +1,170 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test LLMTableExtraction with controlled HTML
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
LLMTableExtraction,
|
||||
DefaultTableExtraction,
|
||||
CacheMode
|
||||
)
|
||||
|
||||
async def test_controlled_html():
|
||||
"""Test with controlled HTML content."""
|
||||
print("\n" + "=" * 60)
|
||||
print("LLM TABLE EXTRACTION TEST")
|
||||
print("=" * 60)
|
||||
|
||||
url = "https://en.wikipedia.org/wiki/List_of_chemical_elements"
|
||||
# url = "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_India"
|
||||
|
||||
# Configure LLM
|
||||
llm_config = LLMConfig(
|
||||
# provider="openai/gpt-4.1-mini",
|
||||
# api_token=os.getenv("OPENAI_API_KEY"),
|
||||
provider="groq/llama-3.3-70b-versatile",
|
||||
api_token="GROQ_API_TOKEN",
|
||||
temperature=0.1,
|
||||
max_tokens=32000
|
||||
)
|
||||
|
||||
print("\n1. Testing LLMTableExtraction:")
|
||||
|
||||
# Create LLM extraction strategy
|
||||
llm_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
verbose=True,
|
||||
# css_selector="div.w3-example"
|
||||
css_selector="div.mw-content-ltr",
|
||||
# css_selector="table.wikitable",
|
||||
max_tries=2,
|
||||
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=5000, # Lower threshold to force chunking
|
||||
min_rows_per_chunk=10,
|
||||
max_parallel_chunks=3
|
||||
)
|
||||
|
||||
config_llm = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=llm_strategy
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Test with LLM extraction
|
||||
result_llm = await crawler.arun(
|
||||
# url=f"raw:{test_html}",
|
||||
url=url,
|
||||
config=config_llm
|
||||
)
|
||||
|
||||
if result_llm.success:
|
||||
print(f"\n ✓ LLM Extraction: Found {len(result_llm.tables)} table(s)")
|
||||
|
||||
for i, table in enumerate(result_llm.tables, 1):
|
||||
print(f"\n Table {i}:")
|
||||
print(f" - Caption: {table.get('caption', 'No caption')}")
|
||||
print(f" - Headers: {table['headers']}")
|
||||
print(f" - Rows: {len(table['rows'])}")
|
||||
|
||||
# Show how colspan/rowspan were handled
|
||||
print(f" - Sample rows:")
|
||||
for j, row in enumerate(table['rows'][:2], 1):
|
||||
print(f" Row {j}: {row}")
|
||||
|
||||
metadata = table.get('metadata', {})
|
||||
print(f" - Metadata:")
|
||||
print(f" • Has merged cells: {metadata.get('has_merged_cells', False)}")
|
||||
print(f" • Table type: {metadata.get('table_type', 'unknown')}")
|
||||
|
||||
# # Compare with default extraction
|
||||
# print("\n2. Comparing with DefaultTableExtraction:")
|
||||
|
||||
# default_strategy = DefaultTableExtraction(
|
||||
# table_score_threshold=3,
|
||||
# verbose=False
|
||||
# )
|
||||
|
||||
# config_default = CrawlerRunConfig(
|
||||
# cache_mode=CacheMode.BYPASS,
|
||||
# table_extraction=default_strategy
|
||||
# )
|
||||
|
||||
# result_default = await crawler.arun(
|
||||
# # url=f"raw:{test_html}",
|
||||
# url=url,
|
||||
# config=config_default
|
||||
# )
|
||||
|
||||
# if result_default.success:
|
||||
# print(f" ✓ Default Extraction: Found {len(result_default.tables)} table(s)")
|
||||
|
||||
# # Compare handling of complex structures
|
||||
# print("\n3. Comparison Summary:")
|
||||
# print(f" LLM found: {len(result_llm.tables)} tables")
|
||||
# print(f" Default found: {len(result_default.tables)} tables")
|
||||
|
||||
# if result_llm.tables and result_default.tables:
|
||||
# llm_first = result_llm.tables[0]
|
||||
# default_first = result_default.tables[0]
|
||||
|
||||
# print(f"\n First table comparison:")
|
||||
# print(f" LLM headers: {len(llm_first['headers'])} columns")
|
||||
# print(f" Default headers: {len(default_first['headers'])} columns")
|
||||
|
||||
# # Check if LLM better handled the complex structure
|
||||
# if llm_first.get('metadata', {}).get('has_merged_cells'):
|
||||
# print(" ✓ LLM correctly identified merged cells")
|
||||
|
||||
# # Test pandas compatibility
|
||||
# try:
|
||||
# import pandas as pd
|
||||
|
||||
# print("\n4. Testing Pandas compatibility:")
|
||||
|
||||
# # Create DataFrame from LLM extraction
|
||||
# df_llm = pd.DataFrame(
|
||||
# llm_first['rows'],
|
||||
# columns=llm_first['headers']
|
||||
# )
|
||||
# print(f" ✓ LLM table -> DataFrame: Shape {df_llm.shape}")
|
||||
|
||||
# # Create DataFrame from default extraction
|
||||
# df_default = pd.DataFrame(
|
||||
# default_first['rows'],
|
||||
# columns=default_first['headers']
|
||||
# )
|
||||
# print(f" ✓ Default table -> DataFrame: Shape {df_default.shape}")
|
||||
|
||||
# print("\n LLM DataFrame preview:")
|
||||
# print(df_llm.head(2).to_string())
|
||||
|
||||
# except ImportError:
|
||||
# print("\n4. Pandas not installed, skipping DataFrame test")
|
||||
|
||||
print("\n✅ Test completed successfully!")
|
||||
|
||||
async def main():
|
||||
"""Run the test."""
|
||||
|
||||
# Check for API key
|
||||
if not os.getenv("OPENAI_API_KEY"):
|
||||
print("⚠️ OPENAI_API_KEY not set. Please set it to test LLM extraction.")
|
||||
print(" You can set it with: export OPENAI_API_KEY='your-key-here'")
|
||||
return
|
||||
|
||||
await test_controlled_html()
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
|
||||
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
import psutil
|
||||
import platform
|
||||
import time
|
||||
from crawl4ai.memory_utils import get_true_memory_usage_percent, get_memory_stats, get_true_available_memory_gb
|
||||
from crawl4ai.utils import get_true_memory_usage_percent, get_memory_stats, get_true_available_memory_gb
|
||||
|
||||
|
||||
def test_memory_calculation():
|
||||
|
||||
Reference in New Issue
Block a user