# Telemetry
Crawl4AI includes opt-in telemetry to help improve stability by capturing anonymous crash reports. No personal data or crawled content is ever collected.
!!! info "Privacy First"
    Telemetry is completely optional and respects your privacy. Only exception information is collected - no URLs, no personal data, no crawled content.
## Overview
- **Privacy-first:** Only exceptions and crashes are reported
- **Opt-in by default:** You control when telemetry is enabled (except in Docker, where it's on by default)
- **No PII:** No URLs, request data, or personal information is collected
- **Provider-agnostic:** Currently uses Sentry, but designed to support multiple backends
## Installation
Telemetry requires the optional Sentry SDK:
```bash
# Install with telemetry support
pip install "crawl4ai[telemetry]"

# Or install the Sentry SDK separately
pip install "sentry-sdk>=2.0.0"
```
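After installing, you can confirm the current state from Python using the `status()` call documented later on this page (the exact dict contents shown here are illustrative):

```python
from crawl4ai import telemetry

# With no consent recorded yet, telemetry stays off
status = telemetry.status()
print(status)  # e.g. {'enabled': False, 'consent': 'not_set'}
```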
## Environments
### 1. Python Library & CLI
On first exception, you'll see an interactive prompt:
```
🚨 Crawl4AI Error Detection
==============================================================
We noticed an error occurred. Help improve Crawl4AI by
sending anonymous crash reports?

[1] Yes, send this error only
[2] Yes, always send errors
[3] No, don't send

Your choice (1/2/3):
```
Control via CLI:
```bash
# Enable telemetry
crwl telemetry enable
crwl telemetry enable --email you@example.com

# Disable telemetry
crwl telemetry disable

# Check status
crwl telemetry status
```
### 2. Docker / API Server
!!! warning "Default Enabled in Docker"
    Telemetry is enabled by default in Docker environments to help identify container-specific issues. This is different from the CLI, where it's opt-in.
To disable:
```bash
# Via environment variable
docker run -e CRAWL4AI_TELEMETRY=0 ...
```

```yaml
# In docker-compose.yml
environment:
  - CRAWL4AI_TELEMETRY=0
```
### 3. Jupyter / Google Colab
In notebooks, you'll see an interactive widget (if available) or a code snippet:
```python
import crawl4ai

# Enable telemetry
crawl4ai.telemetry.enable(email="you@example.com", always=True)

# Send only the next error
crawl4ai.telemetry.enable(once=True)

# Disable telemetry
crawl4ai.telemetry.disable()

# Check status
crawl4ai.telemetry.status()
```
## Python API
### Basic Usage
```python
from crawl4ai import telemetry

# Enable/disable telemetry
telemetry.enable(email="optional@email.com", always=True)
telemetry.disable()

# Check current status
status = telemetry.status()
print(f"Telemetry enabled: {status['enabled']}")
print(f"Consent: {status['consent']}")
```
### Manual Exception Capture
```python
from crawl4ai.telemetry import capture_exception

try:
    # Your code here
    risky_operation()
except Exception as e:
    # Manually capture the exception with extra context
    capture_exception(e, {
        'operation': 'custom_crawler',
        'url': 'https://example.com'  # Will be sanitized before sending
    })
    raise
```
### Decorator Pattern
```python
from crawl4ai.telemetry import telemetry_decorator

@telemetry_decorator
def my_crawler_function():
    # Exceptions raised here will be automatically captured
    ...
```
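An async counterpart, `async_telemetry_decorator`, is also provided for coroutines; assuming it mirrors the sync decorator, usage looks like this:

```python
from crawl4ai.telemetry import async_telemetry_decorator

@async_telemetry_decorator
async def my_async_crawler_function():
    # Exceptions raised inside the coroutine are captured the same way
    ...
```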
### Context Manager
```python
from crawl4ai.telemetry import telemetry_context

with telemetry_context("data_extraction"):
    # Any exceptions in this block will be captured
    result = extract_data(html)
```
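For async code there is also `async_telemetry_context()`; a minimal sketch, assuming it mirrors the sync version (`async_extract_data` is a hypothetical helper standing in for your own coroutine):

```python
from crawl4ai.telemetry import async_telemetry_context

async def extract_page(html):
    async with async_telemetry_context("data_extraction"):
        # Exceptions raised in this block are captured
        return await async_extract_data(html)  # hypothetical async helper
```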
## Configuration
Settings are stored in `~/.crawl4ai/config.json`:
```json
{
  "telemetry": {
    "consent": "always",
    "email": "user@example.com"
  }
}
```
Consent levels:
"not_set"- No decision made yet"denied"- Telemetry disabled"once"- Send current error only"always"- Always send errors
### Environment Variables
- `CRAWL4AI_TELEMETRY=0` - Disable telemetry (overrides config)
- `CRAWL4AI_TELEMETRY_EMAIL=email@example.com` - Set email for follow-up
- `CRAWL4AI_SENTRY_DSN=https://...` - Override default DSN (for maintainers)
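Because the environment variable overrides the config file, a portable way to hard-disable telemetry in a script is to set it before importing the library (setting it this early is an assumption about when the value is read; setting it in the shell always works):

```python
import os

# Set before importing crawl4ai so the override is seen at initialization
os.environ["CRAWL4AI_TELEMETRY"] = "0"

import crawl4ai  # telemetry is disabled regardless of config.json
```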
## What's Collected
### Collected ✅
- Exception type and traceback
- Crawl4AI version
- Python version
- Operating system
- Environment type (CLI, Docker, Jupyter)
- Optional email (if provided)
### NOT Collected ❌
- URLs being crawled
- HTML content
- Request/response data
- Cookies or authentication tokens
- IP addresses
- Any personally identifiable information
## Provider Architecture
Telemetry is designed to be provider-agnostic:
```python
from crawl4ai.telemetry.base import TelemetryProvider

class CustomProvider(TelemetryProvider):
    def send_exception(self, exc, context=None):
        # Your implementation
        pass
```
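As a concrete, illustrative example, a backend that keeps crash reports on the local machine could look like this, assuming the `send_exception` hook shown above is the interface to implement:

```python
import logging
from crawl4ai.telemetry.base import TelemetryProvider

class LoggingProvider(TelemetryProvider):
    """Illustrative provider that logs crashes locally instead of sending them."""

    def send_exception(self, exc, context=None):
        # Record the exception type, message, and optional context locally
        logging.getLogger("crawl4ai.telemetry").error(
            "Captured %s: %s (context=%r)", type(exc).__name__, exc, context
        )
```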
## FAQ
**Q: Can I completely disable telemetry?**

A: Yes! Use `crwl telemetry disable` or set `CRAWL4AI_TELEMETRY=0`.
**Q: Is telemetry required?**

A: No, it's completely optional (except that it's enabled by default in Docker).
**Q: What if I don't install sentry-sdk?**

A: Telemetry will gracefully degrade to a no-op state.
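This kind of graceful degradation typically amounts to an import guard; an illustrative sketch of the pattern (not necessarily the exact implementation in `crawl4ai/telemetry/`):

```python
try:
    import sentry_sdk  # optional dependency
    SENTRY_AVAILABLE = True
except ImportError:
    SENTRY_AVAILABLE = False

def send_exception(exc, context=None):
    # Without the SDK installed, capturing silently becomes a no-op
    if not SENTRY_AVAILABLE:
        return
    sentry_sdk.capture_exception(exc)
```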
**Q: Can I see what's being sent?**

A: Yes, check the source code in `crawl4ai/telemetry/`.
**Q: How do I remove my email?**

A: Delete `~/.crawl4ai/config.json` or edit it to remove the email field.
## Privacy Commitment
- **Transparency:** All telemetry code is open source
- **Control:** You can enable/disable at any time
- **Minimal:** Only crash data, no user content
- **Secure:** Data transmitted over HTTPS to Sentry
- **Anonymous:** No tracking or user identification
## Contributing
Help improve telemetry:
- Report issues with telemetry itself
- Suggest privacy improvements
- Add new provider backends
## Support
If you have concerns about telemetry:
- Open an issue on GitHub
- Email the maintainers
- Review the code in `crawl4ai/telemetry/`