Compare commits

..

175 Commits

Author SHA1 Message Date
AHMET YILMAZ
d48d382d18 feat(tests): Implement comprehensive testing framework for telemetry system 2025-09-22 19:06:20 +08:00
ntohidi
7f360577d9 feat(telemetry): Add opt-in telemetry system for error tracking and stability improvement
Implement a privacy-first, provider-agnostic telemetry system to help improve Crawl4AI stability
through anonymous crash reporting. The system is designed with user privacy as the top priority,
collecting only exception information without any PII, URLs, or crawled content.

Architecture & Design:
- Provider-agnostic architecture with base TelemetryProvider interface
- Sentry as the initial provider implementation with easy extensibility
- Separate handling for sync and async code paths
- Environment-aware behavior (CLI, Docker, Jupyter/Colab)

Key Features:
- Opt-in by default for CLI/library usage with interactive consent prompt
- Opt-out by default for Docker/API server (enabled unless CRAWL4AI_TELEMETRY=0)
- Jupyter/Colab support with widget-based consent (fallback to code snippets)
- Persistent consent storage in ~/.crawl4ai/config.json
- Optional email collection for critical issue follow-up

CLI Integration:
- `crwl telemetry enable [--email <email>] [--once]` - Enable telemetry
- `crwl telemetry disable` - Disable telemetry
- `crwl telemetry status` - Check current status

Python API:
- Decorators: @telemetry_decorator, @async_telemetry_decorator
- Context managers: telemetry_context(), async_telemetry_context()
- Manual capture: capture_exception(exc, context)
- Control: telemetry.enable(), telemetry.disable(), telemetry.status()

Privacy Safeguards:
- No URL collection
- No request/response data
- No authentication tokens or cookies
- No crawled content
- Automatic sanitization of sensitive fields
- Local consent storage only

Testing:
- Comprehensive test suite with 15 test cases
- Coverage for all environments and consent flows
- Mock providers for testing without external dependencies

Documentation:
- Detailed documentation in docs/md_v2/core/telemetry.md
- Added to mkdocs navigation under Core section
- Privacy commitment and FAQ included
- Examples for all usage patterns

Installation:
- Optional dependency: pip install crawl4ai[telemetry]
- Graceful degradation if sentry-sdk not installed
- Added to pyproject.toml optional dependencies
- Docker requirements updated

Integration Points:
- AsyncWebCrawler: Automatic exception capture in arun() and aprocess_html()
- Docker server: Automatic initialization with environment control
- Global exception handler for uncaught exceptions (CLI only)

This implementation provides valuable error insights to improve Crawl4AI while maintaining
complete transparency and user control over data collection.
2025-08-20 16:49:44 +08:00
ntohidi
9447054a65 docs: update Docker instructions to use the latest release tag 2025-08-18 14:20:05 +08:00
ntohidi
f4a432829e fix(crawler): Removed the incorrect reference in browser_config variable #1310 2025-08-18 10:59:14 +08:00
UncleCode
e651e045c4 Release v0.7.4: Merge release branch
- Merge release/v0.7.4 into main
- Version: 0.7.4
- Ready for tag and publication
2025-08-17 19:46:48 +08:00
UncleCode
5398acc7d2 docs: add v0.7.4 release blog post and update documentation
- Add comprehensive v0.7.4 release blog post with LLMTableExtraction feature highlight
- Update blog index to feature v0.7.4 as latest release
- Update README.md to showcase v0.7.4 features alongside v0.7.3
- Accurately describe dispatcher fix as bug fix rather than major enhancement
- Include practical code examples for new LLMTableExtraction capabilities
2025-08-17 19:45:23 +08:00
UncleCode
22c7932ba3 chore(version): update version to 0.7.4 2025-08-17 19:22:23 +08:00
UncleCode
2ab0bf27c2 refactor(utils): move memory utilities to utils and update imports 2025-08-17 19:14:55 +08:00
ntohidi
d30dc9fdc1 fix(http-crawler): bring back HTTP crawler strategy 2025-08-16 09:27:23 +08:00
ntohidi
e6044e6053 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-15 19:44:06 +08:00
ntohidi
a50e47adad Merge branch 'feature/table-extraction-strategies' into develop 2025-08-15 19:41:37 +08:00
ntohidi
ada7441bd1 refactor: Update LLMTableExtraction examples and tests 2025-08-15 19:11:26 +08:00
ntohidi
9f7fee91a9 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-15 19:11:26 +08:00
AHMET YILMAZ
7f48655cf1 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-15 19:11:26 +08:00
prokopis3
1417a67e90 chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py )
- Rename file to correct spelling
- No content changes
2025-08-15 19:11:26 +08:00
prokopis3
19398d33ef fix(browser_profiler): improve keyboard input handling
- fix handling of special keys in Windows msvcrt implementation
- Guard against UnicodeDecodeError from multi-byte key sequences
- Filter out non-printable characters and control sequences
- Add error handling to prevent coroutine crashes
- Add unit test to verify keyboard input handling

Key changes:
- Safe UTF-8 decoding with try/except for special keys
- Skip non-printable and multi-byte character sequences
- Add broad exception handling in keyboard listener

Test runs on Windows only due to msvcrt dependency.
2025-08-15 19:11:26 +08:00
prokopis3
263d362daa fix(browser_profiler): cross-platform 'q' to quit
This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so.

- Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline.
- Retained `termios`, `tty`, and `select` for Unix-like systems.
- Added a check for browser process termination to gracefully exit the input listener.
- Updated logger messages to use colored output for better user experience.
2025-08-15 19:11:26 +08:00
ntohidi
bac92a47e4 refactor: Update LLMTableExtraction examples and tests 2025-08-15 18:47:31 +08:00
ntohidi
a51545c883 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-14 18:21:24 +08:00
Nasrin
11b310edef Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
926e41aab8 Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
489981e670 Merge pull request #1390 from unclecode/fix/docker-raw-html
Check for raw: and raw:// URLs before auto-appending https:// prefix
2025-08-13 13:56:33 +08:00
Nasrin
b92be4ef66 Merge pull request #1371 from unclecode/bug/proxy_config
#1057 : enhance ProxyConfig initialization to support dict and string…
2025-08-12 16:55:52 +08:00
Nasrin
7c0edaf266 Merge pull request #1384 from unclecode/fix/update_docker_examples
docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
2025-08-12 16:53:42 +08:00
ntohidi
dfcfd8ae57 fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560
The MemoryAdaptiveDispatcher was processing tasks sequentially despite
  max_session_permit > 1 due to fetching only one task per event loop iteration.
  This particularly affected raw:// URLs which complete in microseconds.

  Changes:
  - Replace single task fetch with greedy slot filling using get_nowait()
  - Fill all available slots (up to max_session_permit) immediately
  - Break on empty queue instead of waiting with timeout

  This ensures proper parallelization for all task types, especially
  ultra-fast operations like raw HTML processing.
2025-08-12 16:51:22 +08:00
ntohidi
955110a8b0 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-12 12:22:25 +08:00
Soham Kukreti
f30811b524 fix: Check for raw: and raw:// URLs before auto-appending https:// prefix
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
2025-08-11 22:10:53 +05:30
ntohidi
8146d477e9 Merge branch 'main' into develop 2025-08-11 18:56:15 +08:00
ntohidi
96c4b0de67 fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198
- Add _page_lock and guarded creation; handle empty context.pages safely
  - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many
2025-08-11 18:55:43 +08:00
Nasrin
57c14db7cb Merge pull request #1381 from unclecode/fix/base-tag-link-resolution
fix: Implement base tag support in link extraction (#1147)
2025-08-11 18:32:32 +08:00
Soham Kukreti
cd2dd68e4c docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
- Remove deprecated API token authentication from all Docker examples
- Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling
- Fix sync endpoint: /crawl_sync -> /crawl (synchronous)
- Remove non-existent /crawl_direct endpoint
- Update request format to use new structure with browser_config and crawler_config
- Fix response handling for both async and sync calls
- Update extraction strategy format to use proper nested structure
- Add Ollama connectivity check before running tests
- Update test schemas and selectors for current website structures

This makes the Docker examples work out-of-the-box with the current API structure.
2025-08-09 19:37:22 +05:30
UncleCode
f0ce7b2710 feat: add v0.7.3 release notes, changelog updates, and documentation for new features 2025-08-09 21:04:18 +08:00
UncleCode
21f79fe166 Release v0.7.3: Merge release branch
- Merge release/v0.7.3 into main
- Version: 0.7.3
- Ready for tag and publication
2025-08-09 20:11:35 +08:00
unclecode
a9a2d798b4 feat: update sponsorship tier details and add custom arrangements note 2025-08-09 20:10:32 +08:00
unclecode
612270fcb0 feat: add scheduling link to contact information in SPONSORS.md 2025-08-09 20:05:59 +08:00
unclecode
bc099fdd76 Merge branch 'main' into release/v0.7.3 2025-08-09 19:30:46 +08:00
unclecode
18504d782e Add Founding Sponsors section and update README with detailed project information
- Introduced a new section in SPONSORS.md to recognize the first 50 sponsors as Founding Sponsors.
- Updated README-first.md to include comprehensive project details, features, installation instructions, and advanced usage examples.
- Highlighted the recent version 0.7.0 release with new features and improvements.
- Added a sponsorship program with tiered benefits and a mission statement to promote data democratization.
2025-08-09 19:11:32 +08:00
unclecode
ad547607b9 feat: add GitHub Sponsors support with 4 tiers
- Add FUNDING.yml to enable sponsor button
- Add sponsor section to README with tier overview
- Create SPONSORS.md for sponsor recognition
- Set up 4 tiers: Believer, Builder, Growing Team, Data Infrastructure Partner
2025-08-09 17:57:47 +08:00
Soham Kukreti
18ad3ef159 fix: Implement base tag support in link extraction (#1147)
- Extract base href from <head><base> tag using XPath in _process_element method
- Use base URL as the primary URL for link normalization when present
- Add error handling with logging for malformed or problematic base tags
- Maintain backward compatibility when no base tag is present
- Add test to verify the functionality of the base tag extraction.
2025-08-08 20:11:57 +05:30
AHMET YILMAZ
0541b61405 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-08 11:18:34 +08:00
AHMET YILMAZ
b61b2ee676 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-08 11:18:34 +08:00
AHMET YILMAZ
89cf5aba2b #1057 : enhance ProxyConfig initialization to support dict and string formats 2025-08-06 18:34:58 +08:00
ntohidi
6b0b5301ba Release v0.7.3:
- Updated version to 0.7.3
- Added release notes
- Updated documentation
2025-08-06 17:52:01 +08:00
Nasrin
6735c68288 Merge pull request #1170 from prokopis3/fix/create-profile
fix(browser_profiler): cross-platform 'q' to quit - create profile
2025-08-06 16:29:14 +08:00
Nasrin
64f37792a7 Merge pull request #1170 from prokopis3/fix/create-profile
fix(browser_profiler): cross-platform 'q' to quit - create profile
2025-08-06 16:29:14 +08:00
ntohidi
a5bcac4c9d feat(docs): enhance table data access example with a real url 2025-08-06 15:19:37 +08:00
Nasrin
45d8327d23 Merge pull request #1366 from unclecode/fix/update-tables-documentation
docs: Update README.md and modify Media and Tables Documentation.(#1271)
2025-08-06 15:15:24 +08:00
ntohidi
437395e490 Merge branch 'feat/undetected-browser' into develop-future 2025-08-06 15:03:30 +08:00
Soham Kukreti
fddae303fb docs: Update README.md and modify Media and Tables Documentation.(#1271)
- Update Table-to-DataFrame Extraction example in README.md
- Replace old method of accessing tables via result.media directly with result.tables in the documentation
- Remove tables section from links & media page.
- Add tables section to crawler result page.
2025-08-05 23:29:19 +05:30
ntohidi
ff6ea41ac3 feat(docker): add flexible LLM provider configuration
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options

Closes the need to hardcode providers in config.yml
2025-08-05 14:09:54 +08:00
ntohidi
31a435fb0e Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:12:19 +08:00
Nasrin
5de6a28055 Merge pull request #1361 from unclecode/fix/crawler-result-docs
Update CrawlResult documentation with missing fields
2025-08-04 19:12:09 +08:00
ntohidi
de1561ad14 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:04:50 +08:00
Nasrin
337b588732 Merge pull request #1358 from shonenada/patch-1
Fix typos in examples.md
2025-08-04 19:04:42 +08:00
ntohidi
7a6ad547f0 Squashed commit of the following:
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date:   Mon Aug 4 18:59:10 2025 +0800

    refactor: consolidate WebScrapingStrategy to use LXML implementation only

    BREAKING CHANGE: None - full backward compatibility maintained

    This commit simplifies the content scraping architecture by removing the
    redundant BeautifulSoup-based WebScrapingStrategy implementation and making
    it an alias for LXMLWebScrapingStrategy.

    Changes:
    - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
    - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
    - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
    - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
    - Maintain 100% backward compatibility - existing code continues to work

    Code changes:
    - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
    - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
    - crawl4ai/__init__.py: Update imports to show alias relationship
    - crawl4ai/types.py: Update type definitions
    - crawl4ai/legacy/web_crawler.py: Update import to use alias
    - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
    - docs/examples/scraping_strategies_performance.py: Update to use single strategy

    Documentation updates:
    - docs/md_v2/core/content-selection.md: Update scraping modes section
    - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
    - CHANGELOG.md: Document the refactoring under [Unreleased]

    Benefits:
    - 10-20x faster HTML parsing for large documents
    - Reduced memory usage and simplified codebase
    - Consistent parsing behavior
    - No migration required for existing users

    All existing code using WebScrapingStrategy continues to work without
    modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
Soham Kukreti
e6692b987d docs: Update CrawlResult documentation with missing fields.
- Add missing fields: fit_html, js_execution_result, redirected_url, network_requests, console_messages, tables
2025-08-04 15:43:40 +05:30
ntohidi
307fe28b32 fix: Correct URL matcher fallback behavior and improve memory monitoring
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
2025-08-03 16:50:54 +08:00
Yaoda Liu
438a103b17 Fix typos in examples.md 2025-08-03 14:33:10 +08:00
ntohidi
a03e68fa2f feat: Add URL-specific crawler configurations for multi-URL crawling
Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00
Nasrin
864d87afb2 Merge pull request #1339 from charlaie/fix-sitemap-redirect
Fix: URL Seeder sitemap redirect
2025-07-31 15:21:03 +08:00
Charlie C
508b6fc233 fix: Enable following redirects in sitemap fetching for seeder 2025-07-31 12:06:10 +08:00
UncleCode
e3281935bc fix: Add write permissions for GitHub release creation 2025-07-25 18:22:45 +08:00
UncleCode
48647300b4 chore: Bump version to 0.7.2 2025-07-25 17:42:48 +08:00
UncleCode
9f9ea3bb3b chore: Clean up test artifacts and disable test workflow 2025-07-25 17:31:52 +08:00
UncleCode
d58b93c207 fix: Re-enable multi-platform Docker builds for ARM64 support 2025-07-25 16:38:11 +08:00
UncleCode
e2b4705010 fix: Use hardcoded Docker repository name to avoid masking issues 2025-07-25 15:52:26 +08:00
UncleCode
4a1abd5086 fix: Handle existing version on Test PyPI gracefully 2025-07-25 15:41:16 +08:00
UncleCode
04258cd4f2 fix: Speed up Docker test builds by using single platform and caching 2025-07-25 15:37:44 +08:00
UncleCode
84e462d9f8 Merge remote-tracking branch 'origin/develop' 2025-07-25 15:35:53 +08:00
UncleCode
9546773a07 fix: Move sentence-transformers to optional dependencies
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
UncleCode
66a979ad11 fix: Install dependencies before version check in workflows 2025-07-24 21:01:36 +08:00
UncleCode
0c31e91b53 feat: Add CI/CD workflows for automated PyPI and Docker releases 2025-07-24 20:58:43 +08:00
ntohidi
1b6a31f88f fix: encode PDF results to base64 in /crawl endpoint. ref #1301 2025-07-23 13:52:18 +02:00
Nasrin
b8c261780f Merge pull request #1319 from volumetric/fix_for_bug_#1310
Removed the incorrect reference in browser_config variable
2025-07-23 12:45:12 +02:00
ntohidi
db6ad7a79d fix: update links in README and C4A-Script documentation for accuracy 2025-07-23 09:47:18 +02:00
Nasrin
004d514f33 Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl
Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874
2025-07-23 09:40:33 +02:00
Vinit Agrawal
3a9e2c716e Remvoed the incorrect reference in browser_config variable 2025-07-18 10:01:00 +05:30
unclecode
0163bd797c Merge branch 'release/v0.7.1' 2025-07-17 17:42:04 +08:00
ntohidi
26bad799e4 chore: update version to 0.7.1 2025-07-17 11:37:41 +02:00
ntohidi
cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
unclecode
805c498adf docs: add simple anti-bot examples
- Add simple_anti_bot_examples.py with minimal code examples
- Demonstrates stealth mode, undetected browser, and combined usage
- Clean examples without logging for easy reference

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 17:05:35 +08:00
unclecode
6a728cbe5b feat: add stealth mode and enhance undetected browser support
- Add playwright-stealth integration with enable_stealth parameter in BrowserConfig
- Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern
- Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching
- Update install.py to install both playwright and patchright browsers automatically
- Add comprehensive documentation for anti-bot features (stealth mode + undetected browser)
- Create examples demonstrating stealth mode usage and comparison tests
- Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies
- Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium)
- Add dependency checker tool in tests/check_dependencies.py

Breaking changes: None - all existing functionality preserved

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 16:59:10 +08:00
ntohidi
ccbe3c105c refactor: improve link scoring output format in release notes 2025-07-17 09:13:20 +02:00
Nasrin
761c19d54b Merge pull request #1307 from unclecode/fix/json-infinity-serialization
fix: Handle infinity values in JSON serialization for API  responses
2025-07-16 13:34:25 +02:00
Nasrin
14b0ecb137 Merge pull request #1305 from unclecode/fix/release-notes-demo-code
Fix: Update release notes and demo code
2025-07-16 13:33:53 +02:00
ntohidi
0eaa9f9895 fix: handle infinity values in JSON serialization for API responses
- Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings
- Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf
- Fix /crawl endpoint batch responses with infinity values
- Fix /crawl/stream endpoint streaming responses with infinity values
- Fix /crawl/job endpoint background job responses with infinity values

The sanitize_json_data() function recursively processes response data:
- float('inf') → \"Infinity\"
- float('-inf') → \"-Infinity\"
- float('nan') → \"NaN\"

This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON.

Fixes: API endpoints crashing with infinity JSON serialization errors
Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints
2025-07-15 13:49:07 +02:00
ntohidi
1d1970ae69 docs: Update release notes and docs for v0.7.0 with teh correct parameters and explanations 2025-07-15 11:32:04 +02:00
ntohidi
205df1e330 docs: Fix virtual scroll configuration 2025-07-15 10:29:47 +02:00
ntohidi
2640dc73a5 docs: Enhance session management example for dynamic content crawling with improved JavaScript handling and extraction schema. ref #226 2025-07-15 10:19:29 +02:00
ntohidi
58024755c5 docs: Update adaptive crawling parameters and examples in README and release notes 2025-07-15 10:15:05 +02:00
unclecode
5c33cbcca2 feat: add undetected browser support with adapter pattern 2025-07-14 17:29:50 +08:00
UncleCode
dd5ee752cf docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:58:26 +08:00
UncleCode
bde1bba6a2 docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:56:33 +08:00
UncleCode
7b80eb6b99 docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:55:35 +08:00
UncleCode
14f690d751 docs: Update documentation for v0.7.0 release
- Update mkdocs.yml site name to v0.7.x
- Add v0.7.0 to blog index as latest release
- Move v0.6.0 to Previous Releases section
- Copy release notes to proper location in docs/md_v2/blog/releases/
2025-07-12 19:08:17 +08:00
UncleCode
7b9ba3015f Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update 2025-07-12 18:54:20 +08:00
UncleCode
0c8bb742b7 Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
2025-07-12 18:51:13 +08:00
UncleCode
ba2ed53ff1 test(releases): Add test cases for release 0.7.0 2025-07-11 22:27:18 +08:00
UncleCode
a93efcb650 Merge PR #1285: 2025 APR, MAY, and JUN bug fixes 2025-07-11 21:22:34 +08:00
UncleCode
8794852a26 Merge PR #1285: 2025 APR, MAY, and JUN bug fixes 2025-07-11 21:22:03 +08:00
UncleCode
fb25a4a769 docs(examples): update crawl4ai showcase script
The crawl4ai showcase script has been significantly expanded to include more detailed examples and demonstrations. This includes live code examples, more detailed explanations, and a new real-world example. A new file, uv.lock, has also been added.
2025-07-11 20:55:37 +08:00
ntohidi
afe852935e fix: show /llm API response in playground. ref #1288 2025-07-09 16:59:17 +02:00
ntohidi
0ebce590f8 Merge branch '2025-JUN-1' into next-MAY 2025-07-09 09:41:03 +02:00
ntohidi
026e96a2df feat: Add social media and community links to README and index documentation 2025-07-08 15:48:40 +02:00
ntohidi
36429a63de fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105 2025-07-08 12:54:33 +02:00
ntohidi
a3d41c7951 fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086 2025-07-08 12:24:33 +02:00
ntohidi
fee4c5c783 fix: Consolidate import statements in local-files.md for clarity 2025-07-08 11:46:24 +02:00
ntohidi
0f210f6e02 Merge branch '2025-MAY-2' into next-MAY 2025-07-08 11:46:13 +02:00
UncleCode
1a73fb60db feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
2025-07-04 15:16:53 +08:00
UncleCode
74705c1f67 Move release scripts to private .scripts folder
- Remove release-agent.py, build-nightly.py from public repo
- Add .scripts/ to .gitignore for private tools
- Maintain clean public repository while keeping internal tools
2025-07-04 15:02:25 +08:00
UncleCode
048d9b0f5b feat: Implement nightly build script and update version handling 2025-07-03 20:53:03 +08:00
Aravind
02f3127ded Track Stargazers (#1249)
* Webhook for when repo is starred

* Send star data to google sheets to be saved

* change event name to watch

* Change message displayed on Discord

* Update .github/workflows/main.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: UncleCode <unclecode@kidocode.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-06-25 22:26:19 +08:00
ntohidi
414f16e975 fix: Update pdf and screenshot usage documentation. ref #1230 2025-06-18 19:05:44 +02:00
ntohidi
b7a6e02236 fix: Update pdf and screenshot usage documentation. ref #1230 2025-06-18 19:04:32 +02:00
AHMET YILMAZ
9332326457 feat: Add PDF parsing documentation and navigation entry 2025-06-16 18:18:32 +08:00
ntohidi
6cd34b3157 Merge branch '2025-MAY-2' of https://github.com/unclecode/crawl4ai into 2025-MAY-2 2025-06-13 11:26:17 +02:00
ntohidi
871d4f1158 fix(extraction_strategy): rename response variable to content for clarity in LLMExtractionStrategy. ref #1146 2025-06-13 11:26:05 +02:00
prokopis3
c4d625fb3c chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py )
- Rename file to correct spelling
- No content changes
2025-06-12 14:38:32 +03:00
prokopis3
ef722766f0 fix(browser_profiler): improve keyboard input handling
- fix handling of special keys in Windows msvcrt implementation
- Guard against UnicodeDecodeError from multi-byte key sequences
- Filter out non-printable characters and control sequences
- Add error handling to prevent coroutine crashes
- Add unit test to verify keyboard input handling

Key changes:
- Safe UTF-8 decoding with try/except for special keys
- Skip non-printable and multi-byte character sequences
- Add broad exception handling in keyboard listener

Test runs on Windows only due to msvcrt dependency.
2025-06-12 14:33:12 +03:00
ntohidi
dc85481180 refactor: Update LLM extraction example with the updated structure 2025-06-12 12:23:03 +02:00
ntohidi
5d9213a0e9 fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215 2025-06-12 12:21:40 +02:00
ntohidi
4679ee023d fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003 2025-06-10 11:19:18 +02:00
Nasrin
f9b7090084 Merge pull request #1186 from zimmski/fix-typo-provoder
fix, Typo
2025-06-10 10:26:45 +02:00
AHMET YILMAZ
9442597f81 #1127: Improve URL handling and normalization in scraping strategies 2025-06-10 11:57:06 +08:00
AHMET YILMAZ
74b06d4b80 #1167 Add PHP MIME types to ContentTypeFilter for better file handling 2025-06-09 11:49:33 +08:00
UncleCode
b4bb0ccea0 Update simple-crawling.md
Fixing wrong documentation about th fit_markdown to assume its a direct parameter of CrawlerRunConfig, while it is NOT.
2025-06-08 11:33:28 +08:00
ntohidi
5ac19a61d7 feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168 2025-06-05 16:40:34 +02:00
Markus Zimmermann
022cc2d92a fix, Typo 2025-06-05 15:30:38 +02:00
ntohidi
fcc2abe4db (fix): Update document about LLM extraction strategy to use LLMConfig. REF #1146 2025-06-03 12:53:59 +02:00
ntohidi
cc95d3abd4 Fix raw URL parsing logic to correctly handle "raw://" and "raw:" prefixes. REF #1118 2025-06-03 11:19:08 +02:00
Nasrin
5ce3e682f3 Merge pull request #752 from jl-martins/fix-raw-url-parsing
Fix `raw://` URL parsing logic. issue ref #1118
2025-06-03 11:10:29 +02:00
ntohidi
28125c1980 Merge branch 'next' into 2025-MAY-2 2025-06-02 20:26:40 +02:00
ntohidi
773ed7b281 Merge branch '2025-APR-1' into 2025-MAY-2 2025-06-02 20:25:58 +02:00
João Martins
58c1e17170 Merge branch 'main' into fix-raw-url-parsing 2025-05-30 13:03:25 +01:00
prokopis3
4bcb7171a3 fix(browser_profiler): cross-platform 'q' to quit
This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so.

- Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline.
- Retained `termios`, `tty`, and `select` for Unix-like systems.
- Added a check for browser process termination to gracefully exit the input listener.
- Updated logger messages to use colored output for better user experience.
2025-05-30 14:43:18 +03:00
ntohidi
b55e27d2ef fix: chanegd error variable name handle_crawl_request, docker api 2025-05-26 11:08:23 +02:00
Aravind Karnam
3d46d89759 docs: fix https://github.com/unclecode/crawl4ai/issues/1109 2025-05-22 17:21:42 +05:30
ntohidi
da8f0dbb93 fix(browser_profiler): change logger print to info for consistent logging in interactive manager 2025-05-22 11:25:51 +02:00
ntohidi
33a0c7a17a fix(logger): add RED color to LogColor enum for enhanced logging options 2025-05-22 11:17:28 +02:00
Ahmed-Tawfik94
984524ca1c fix(auth): add token authorization header in request preparation to ensure authenticated requests are made 2025-05-21 13:27:17 +08:00
ntohidi
cb8d581e47 fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125 2025-05-19 18:03:05 +02:00
Ahmed-Tawfik94
a55c2b3f88 refactor(logging): update extraction logging to use url_status method 2025-05-19 16:32:22 +08:00
Ahmed Tawfik
ce09648af1 Merge pull request #1054 from Sacristaan/feature/readme_example
Fix: README.md urls list
2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94
a97654270b #1086 fix(markdown): update BM25 filter to use language parameter for stemming 2025-05-19 14:11:46 +08:00
Ahmed-Tawfik94
b4fc60a555 #1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes 2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94
137ac014fb #1105 :fix(metadata): optimize article metadata extraction using XPath for improved performance 2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94
faa98eefbc #1105 got fixed (metadata now matches with meta property article:* 2025-05-19 11:35:13 +08:00
ntohidi
22725ca87b fix(crawler): initialize captured_console to prevent unbound local error for local HTML files. REF: #1072
Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False`
(default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.
2025-05-15 11:29:36 +02:00
ntohidi
e0fbd2b0a0 fix(schema): update f parameter description to use lowercase enum values. REF: #1070
Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values
(`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents
LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which
caused 422 validation errors due to strict case-sensitive matching.
2025-05-15 10:45:23 +02:00
ntohidi
32966bea11 fix(extraction): resolve 'str' object has no attribute 'choices' error in LLMExtractionStrategy. Refs: #979
This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition
of the `response` variable, which caused downstream exceptions during error handling.
2025-05-15 10:09:19 +02:00
Ahmed-Tawfik94
a3b0cab52a #1088 is sloved flag -bc now if for --byPass-cache 2025-05-15 11:25:06 +08:00
medo94my
137556b3dc fix the EXTRACT to match the styling of the other methods 2025-05-14 16:01:10 +08:00
ntohidi
260e2dc347 fix(browser): create browser config before launching managed browser instance. REF: https://discord.com/channels/1278297938551902308/1278298697540567132/1371683009459392716 2025-05-13 14:03:20 +02:00
ntohidi
25d97d56e4 fix(dependencies): remove duplicated aiofiles from project dependencies. REF #1045 2025-05-13 13:56:12 +02:00
Aravind Karnam
98a56e6e01 Merge next branch 2025-05-13 17:12:11 +05:30
ntohidi
1af3d1c2e0 Merge branch '2025-APR-1' of https://github.com/unclecode/crawl4ai into 2025-APR-1 2025-05-08 11:11:32 +02:00
Aravind Karnam
c1041b9bbe fix: exclude_external_images flag simply discards elements ref:https://github.com/unclecode/crawl4ai/issues/345 2025-05-07 18:43:29 +05:30
Aravind Karnam
f6e25e2a6b fix: check_robots_txt to support wildcard rules ref: #699 2025-05-07 17:53:30 +05:30
ntohidi
ee93acbd06 fix(async_playwright_crawler): use config directly instead of self.config for verbosity check 2025-05-07 12:32:38 +02:00
Aravind Karnam
2b17f234f8 docs: update direct passing of content_filter to CrawlerRunConfig and instead pass it via MarkdownGenerator. Ref: #603 2025-05-07 15:20:36 +05:30
ntohidi
eebb8c84f0 fix(requirements): add PyPDF2 dependency for PDF processing 2025-05-07 11:18:44 +02:00
ntohidi
12783fabda fix(dependencies): update pillow version constraint to allow newer releases. ref #709 2025-05-07 11:18:13 +02:00
Aravind Karnam
39e3b792a1 Merge branch 'next' into 2025-APR-1 2025-05-07 10:25:25 +05:30
ntohidi
e0cd3e10de fix(crawler): initialize captured_console variable for local file processing 2025-05-02 10:35:35 +02:00
ntohidi
1d6a2b9979 fix(crawler): surface real redirect status codes and keep redirect chain. the 30x response instead of always returning 200. Refs #660 2025-04-30 12:29:17 +02:00
ntohidi
039be1b1ce feat: add pdf2image dependency to requirements 2025-04-30 11:41:35 +02:00
Marc Sacristán
53245e4e0e Fix: README.md urls list 2025-04-29 16:26:35 +02:00
Aravind Karnam
094201ab2a Merge next + resolve conflicts 2025-04-23 19:44:50 +05:30
ntohidi
14a31456ef fix(docs): update browser-crawler-config example to include LLMContentFilter and DefaultMarkdownGenerator, fix syntax errors 2025-04-21 13:59:49 +02:00
ntohidi
0886153d6a fix(async_playwright_crawler): improve segment handling and viewport adjustments during screenshot capture (Fixed bug: Capturing Screenshot Twice and Increasing Image Size) 2025-04-17 12:48:11 +02:00
ntohidi
0ec3c4a788 fix(crawler): handle navigation aborts during file downloads in AsyncPlaywrightCrawlerStrategy 2025-04-17 12:11:12 +02:00
ntohidi
05085b6e3d fix(requirements): add fake-useragent to requirements 2025-04-15 13:05:19 +02:00
ntohidi
1f3b1251d0 docs(cli): add Crawl4AI CLI installation instructions to the CLI guide 2025-04-14 12:16:31 +02:00
ntohidi
7b9aabc64a fix(crawler): ensure max_pages limit is respected during batch processing in crawling strategies 2025-04-14 12:11:22 +02:00
João Martins
27af4cc27b Fix "raw://" URL parsing logic
Closes https://github.com/unclecode/crawl4ai/issues/686
2025-02-15 15:34:59 +00:00
178 changed files with 35174 additions and 2274 deletions

7
.github/FUNDING.yml vendored Normal file
View File

@@ -0,0 +1,7 @@
# These are supported funding model platforms
# GitHub Sponsors
github: unclecode
# Custom links for enterprise inquiries (uncomment when ready)
# custom: ["https://crawl4ai.com/enterprise"]

View File

@@ -9,16 +9,26 @@ on:
types: [opened]
discussion:
types: [created]
watch:
types: [started]
jobs:
notify-discord:
runs-on: ubuntu-latest
steps:
- name: Send to Google Apps Script (Stars only)
if: github.event_name == 'watch'
run: |
curl -fSs -X POST "${{ secrets.GOOGLE_SCRIPT_ENDPOINT }}" \
-H 'Content-Type: application/json' \
-d '{"url":"${{ github.event.sender.html_url }}"}'
- name: Set webhook based on event type
id: set-webhook
run: |
if [ "${{ github.event_name }}" == "discussion" ]; then
echo "webhook=${{ secrets.DISCORD_DISCUSSIONS_WEBHOOK }}" >> $GITHUB_OUTPUT
elif [ "${{ github.event_name }}" == "watch" ]; then
echo "webhook=${{ secrets.DISCORD_STAR_GAZERS }}" >> $GITHUB_OUTPUT
else
echo "webhook=${{ secrets.DISCORD_WEBHOOK }}" >> $GITHUB_OUTPUT
fi
@@ -31,5 +41,6 @@ jobs:
args: |
${{ github.event_name == 'issues' && format('📣 New issue created: **{0}** by {1} - {2}', github.event.issue.title, github.event.issue.user.login, github.event.issue.html_url) ||
github.event_name == 'issue_comment' && format('💬 New comment on issue **{0}** by {1} - {2}', github.event.issue.title, github.event.comment.user.login, github.event.comment.html_url) ||
github.event_name == 'pull_request' && format('🔄 New PR opened: **{0}** by {1} - {2}', github.event.pull_request.title, github.event.pull_request.user.login, github.event.pull_request.html_url) ||
github.event_name == 'pull_request' && format('🔄 New PR opened: **{0}** by {1} - {2}', github.event.pull_request.title, github.event.pull_request.user.login, github.event.pull_request.html_url) ||
github.event_name == 'watch' && format('⭐ {0} starred Crawl4AI 🥳! Check out their profile: {1}', github.event.sender.login, github.event.sender.html_url) ||
format('💬 New discussion started: **{0}** by {1} - {2}', github.event.discussion.title, github.event.discussion.user.login, github.event.discussion.html_url) }}

142
.github/workflows/release.yml vendored Normal file
View File

@@ -0,0 +1,142 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

View File

@@ -0,0 +1,116 @@
name: Test Release Pipeline
on:
push:
tags:
- 'test-v*'
jobs:
test-release:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/test-v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Testing with version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to Test PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN }}
run: |
echo "📦 Uploading to Test PyPI..."
twine upload --repository testpypi dist/* || {
if [ $? -eq 1 ]; then
echo "⚠️ Upload failed - likely version already exists on Test PyPI"
echo "Continuing anyway for test purposes..."
else
exit 1
fi
}
echo "✅ Test PyPI step complete"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker test images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:test-latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🎉 Test Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 Test PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://test.pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install -i https://test.pypi.org/simple/ crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Test Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🧹 Cleanup Commands" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "# Remove test tag" >> $GITHUB_STEP_SUMMARY
echo "git tag -d test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "git push origin :test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "# Remove Docker test images" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-latest" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

3
.gitignore vendored
View File

@@ -1,3 +1,6 @@
# Scripts folder (private tools)
.scripts/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

View File

@@ -5,6 +5,76 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.7.3] - 2025-08-09
### Added
- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
- `browser_adapter.py` with undetected Chrome integration
- Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
- Support for headless stealth mode with anti-detection techniques
- Human-like behavior simulation with random mouse movements and scrolling
- Comprehensive examples for anti-bot strategies and stealth crawling
- Full documentation guide for undetected browser usage
- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
- Different crawling strategies for different URL patterns in a single batch
- Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
- Lambda function matchers for complex URL logic
- Mixed matchers combining strings and functions with AND/OR logic
- Fallback configuration support when no patterns match
- First-match-wins configuration selection with optional fallback
- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
- New `memory_utils.py` module for memory monitoring and optimization
- Real-time memory usage tracking during crawl sessions
- Memory leak detection and reporting
- Performance optimization recommendations
- Peak memory usage analysis and efficiency metrics
- Automatic cleanup suggestions for memory-intensive operations
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
- Direct `result.tables` interface replacing generic `result.media` approach
- Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
- Enhanced table detection algorithms for better accuracy
- Table metadata including source XPath and headers
- Improved table structure preservation during extraction
- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
- Supporter ($5/month): Community support + early feature previews
- Professional ($25/month): Priority support + beta access
- Business ($100/month): Direct consultation + custom integrations
- Enterprise ($500/month): Dedicated support + feature development
- Custom arrangement options for larger organizations
- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
- `LLM_PROVIDER` environment variable support for dynamic provider switching
- `.llm.env` file support for secure configuration management
- Per-request provider override capabilities in API endpoints
- Support for OpenAI, Groq, and other providers without rebuilding images
- Enhanced Docker documentation with deployment examples
### Fixed
- **URL Matcher Fallback**: Resolved edge cases in URL pattern matching logic
- **Memory Management**: Fixed memory leaks in long-running crawl sessions
- **Sitemap Processing**: Improved redirect handling in sitemap fetching
- **Table Extraction**: Enhanced table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures
### Changed
- **Architecture Refactoring**: Major cleanup and optimization
- Moved 2,450+ lines from main `async_crawler_strategy.py` to backup
- Cleaner separation of concerns in crawler architecture
- Better maintainability and code organization
- Preserved backward compatibility while improving performance
### Documentation
- **Comprehensive Examples**: Added real-world URLs and practical use cases
- **API Documentation**: Complete CrawlResult field documentation with all available fields
- **Migration Guides**: Updated table extraction patterns from `result.media` to `result.tables`
- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
- **Multi-Config Examples**: Detailed examples for URL-specific configurations
- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration
## [0.7.x] - 2025-06-29
### Added
@@ -21,6 +91,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added
- **Flexible LLM Provider Configuration** (Docker):
- Support for `LLM_PROVIDER` environment variable to override default provider
- Per-request provider override via optional `provider` parameter in API endpoints
- Automatic provider validation with clear error messages
- Updated Docker documentation and examples
### Changed
- **WebScrapingStrategy Refactoring**: Simplified content scraping architecture
- `WebScrapingStrategy` is now an alias for `LXMLWebScrapingStrategy` for backward compatibility
- Removed redundant BeautifulSoup-based implementation (~1000 lines of code)
- `LXMLWebScrapingStrategy` now inherits directly from `ContentScrapingStrategy`
- All existing code using `WebScrapingStrategy` continues to work without modification
- Default scraping strategy remains `LXMLWebScrapingStrategy` for optimal performance
### Added
- **AsyncUrlSeeder**: High-performance URL discovery system for intelligent crawling at scale
- Discover URLs from sitemaps and Common Crawl index

View File

@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build
# C4ai version
ARG C4AI_VER=0.6.0
ARG C4AI_VER=0.7.0-r1
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER

136
Makefile.telemetry Normal file
View File

@@ -0,0 +1,136 @@
# Makefile for Crawl4AI Telemetry Testing
# Usage: make test-telemetry, make test-unit, make test-integration, etc.
.PHONY: help test-all test-telemetry test-unit test-integration test-privacy test-performance test-slow test-coverage test-verbose clean
# Default Python executable
PYTHON := .venv/bin/python
PYTEST := $(PYTHON) -m pytest
help:
@echo "Crawl4AI Telemetry Testing Commands:"
@echo ""
@echo " test-all Run all telemetry tests"
@echo " test-telemetry Run all telemetry tests (same as test-all)"
@echo " test-unit Run unit tests only"
@echo " test-integration Run integration tests only"
@echo " test-privacy Run privacy compliance tests only"
@echo " test-performance Run performance tests only"
@echo " test-slow Run slow tests only"
@echo " test-coverage Run tests with coverage report"
@echo " test-verbose Run tests with verbose output"
@echo " test-specific TEST= Run specific test (e.g., make test-specific TEST=test_telemetry.py::TestTelemetryConfig)"
@echo " clean Clean test artifacts"
@echo ""
@echo "Environment Variables:"
@echo " CRAWL4AI_TELEMETRY_TEST_REAL=1 Enable real telemetry during tests"
@echo " PYTEST_ARGS Additional pytest arguments"
# Run all telemetry tests
test-all test-telemetry:
$(PYTEST) tests/telemetry/ -v
# Run unit tests only
test-unit:
$(PYTEST) tests/telemetry/ -m "unit" -v
# Run integration tests only
test-integration:
$(PYTEST) tests/telemetry/ -m "integration" -v
# Run privacy compliance tests only
test-privacy:
$(PYTEST) tests/telemetry/ -m "privacy" -v
# Run performance tests only
test-performance:
$(PYTEST) tests/telemetry/ -m "performance" -v
# Run slow tests only
test-slow:
$(PYTEST) tests/telemetry/ -m "slow" -v
# Run tests with coverage
test-coverage:
$(PYTEST) tests/telemetry/ --cov=crawl4ai.telemetry --cov-report=html --cov-report=term-missing -v
# Run tests with verbose output
test-verbose:
$(PYTEST) tests/telemetry/ -vvv --tb=long
# Run specific test
test-specific:
$(PYTEST) tests/telemetry/$(TEST) -v
# Run tests excluding slow ones
test-fast:
$(PYTEST) tests/telemetry/ -m "not slow" -v
# Run tests in parallel
test-parallel:
$(PYTEST) tests/telemetry/ -n auto -v
# Clean test artifacts
clean:
rm -rf .pytest_cache/
rm -rf htmlcov/
rm -rf .coverage
find tests/ -name "*.pyc" -delete
find tests/ -name "__pycache__" -type d -exec rm -rf {} +
rm -rf tests/telemetry/__pycache__/
# Lint test files
lint-tests:
$(PYTHON) -m flake8 tests/telemetry/
$(PYTHON) -m pylint tests/telemetry/
# Type check test files
typecheck-tests:
$(PYTHON) -m mypy tests/telemetry/
# Run all quality checks
check-tests: lint-tests typecheck-tests test-unit
# Install test dependencies
install-test-deps:
$(PYTHON) -m pip install pytest pytest-asyncio pytest-mock pytest-cov pytest-xdist
# Setup development environment for testing
setup-dev:
$(PYTHON) -m pip install -e .
$(MAKE) install-test-deps
# Generate test report
test-report:
$(PYTEST) tests/telemetry/ --html=test-report.html --self-contained-html -v
# Run performance benchmarks
benchmark:
$(PYTEST) tests/telemetry/test_privacy_performance.py::TestTelemetryPerformance -v --benchmark-only
# Test different environments
test-docker-env:
CRAWL4AI_DOCKER=true $(PYTEST) tests/telemetry/ -k "docker" -v
test-cli-env:
$(PYTEST) tests/telemetry/ -k "cli" -v
# Validate telemetry implementation
validate:
@echo "Running telemetry validation suite..."
$(MAKE) test-unit
$(MAKE) test-privacy
$(MAKE) test-performance
@echo "Validation complete!"
# Debug failing tests
debug:
$(PYTEST) tests/telemetry/ --pdb -x -v
# Show test markers
show-markers:
$(PYTEST) --markers
# Show test collection (dry run)
show-tests:
$(PYTEST) tests/telemetry/ --collect-only -q

320
PROGRESSIVE_CRAWLING.md Normal file
View File

@@ -0,0 +1,320 @@
# Progressive Web Crawling with Adaptive Information Foraging
## Abstract
This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets.
## 1. Introduction
Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: *"When do we have enough information to answer a query and similar queries in its domain?"*
We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow.
## 2. Problem Formulation
### 2.1 Definitions
Let:
- **K** = {d₁, d₂, ..., dₙ} be the current knowledge base (crawled documents)
- **Q** be the user query
- **L** = {l₁, l₂, ..., lₘ} be available links with preview metadata
- **θ** be the confidence threshold for information sufficiency
### 2.2 Objectives
1. **Minimize** |K| (number of crawled pages)
2. **Maximize** P(answers(Q) | K) (probability of answering Q given K)
3. **Ensure** coverage of Q's domain (similar queries)
## 3. Mathematical Framework
### 3.1 Information Sufficiency Metric
We define Information Sufficiency as:
```
IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q)
```
### 3.2 Coverage Score
Coverage measures how well current knowledge covers query terms and related concepts:
```
Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q|
```
Where:
- df(t, K) = document frequency of term t in knowledge base K
- idf(t) = inverse document frequency weight
### 3.3 Consistency Score
Consistency measures information coherence across documents:
```
Consistency(K, Q) = 1 - Var(answers from random subsets of K)
```
This captures the principle that sufficient knowledge should provide stable answers regardless of document subset.
### 3.4 Saturation Score
Saturation detects diminishing returns:
```
Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁))
```
Where ΔInfo represents marginal information gain from the nth crawl.
### 3.5 Link Value Prediction
Expected information gain from uncrawled links:
```
ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l)
```
Components:
- **Relevance**: BM25(preview_text, Q)
- **Novelty**: 1 - max_similarity(preview, K)
- **Authority**: f(url_structure, domain_metrics)
## 4. Algorithmic Approach
### 4.1 Progressive Crawling Algorithm
```
Algorithm: ProgressiveCrawl(start_url, query, θ)
K ← ∅
crawled ← {start_url}
pending ← extract_links(crawl(start_url))
while IS(K, Q) < θ and |crawled| < max_pages:
candidates ← rank_by_expected_gain(pending, Q, K)
if max(ExpectedGain(candidates)) < min_gain:
break // Diminishing returns
to_crawl ← top_k(candidates)
new_docs ← parallel_crawl(to_crawl)
K ← K new_docs
crawled ← crawled to_crawl
pending ← extract_new_links(new_docs) - crawled
return K
```
### 4.2 Stopping Criteria
Crawling terminates when:
1. IS(K, Q) ≥ θ (sufficient information)
2. d(IS)/d(crawls) < ε (plateau reached)
3. |crawled| ≥ max_pages (resource limit)
4. max(ExpectedGain) < min_gain (no promising links)
## 5. Multi-Strategy Architecture
### 5.1 Strategy Pattern Design
```
AbstractStrategy
├── StatisticalStrategy (no LLM, no embeddings)
├── EmbeddingStrategy (with semantic similarity)
└── LLMStrategy (with language model assistance)
```
### 5.2 Statistical Strategy
Pure statistical approach using:
- BM25 for relevance scoring
- Term frequency analysis for coverage
- Graph structure for authority
- No external models required
**Advantages**: Fast, no API costs, works offline
**Best for**: Technical documentation, specific terminology
### 5.3 Embedding Strategy (Implemented)
Semantic understanding through embeddings:
- Query expansion into semantic variations
- Coverage mapping in embedding space
- Gap-driven link selection
- Validation-based stopping criteria
**Mathematical Framework**:
```
Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded)
Gap(q) = 1 - max_similarity(q, K)
LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K))
```
**Key Parameters**:
- `embedding_k_exp`: Exponential decay factor for distance-to-score mapping
- `embedding_coverage_radius`: Distance threshold for query coverage
- `embedding_min_confidence_threshold`: Minimum relevance threshold
**Advantages**: Semantic understanding, handles ambiguity, detects irrelevance
**Best for**: Research queries, conceptual topics, diverse content
### 5.4 Progressive Enhancement Path
1. **Level 0**: Statistical only (implemented)
2. **Level 1**: + Embeddings for semantic similarity (implemented)
3. **Level 2**: + LLM for query understanding (future)
## 6. Evaluation Methodology
### 6.1 Synthetic Dataset Generation
Using LLM to create evaluation data:
```python
def generate_synthetic_dataset(domain_url):
# 1. Fully crawl domain
full_knowledge = exhaustive_crawl(domain_url)
# 2. Generate answerable queries
queries = llm_generate_queries(full_knowledge)
# 3. Create query variations
for q in queries:
variations = generate_variations(q) # synonyms, sub/super queries
return queries, variations, full_knowledge
```
### 6.2 Evaluation Metrics
1. **Efficiency**: Information gained / Pages crawled
2. **Completeness**: Answerable queries / Total queries
3. **Redundancy**: 1 - (Unique information / Total information)
4. **Convergence Rate**: Pages to 95% completeness
### 6.3 Ablation Studies
- Impact of each score component (coverage, consistency, saturation)
- Sensitivity to threshold parameters
- Performance across different domain types
## 7. Theoretical Properties
### 7.1 Convergence Guarantee
**Theorem**: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages.
**Proof sketch**: IS(K, Q) is monotonically non-decreasing with each crawl, bounded above by 1.
### 7.2 Optimality
Under certain assumptions about link preview accuracy:
- Expected crawls ≤ 2 × optimal_crawls
- Approximation ratio improves with preview quality
## 8. Implementation Design
### 8.1 Core Components
1. **CrawlState**: Maintains crawl history and metrics
2. **AdaptiveConfig**: Configuration parameters
3. **CrawlStrategy**: Pluggable strategy interface
4. **AdaptiveCrawler**: Main orchestrator
### 8.2 Integration with Crawl4AI
- Wraps existing AsyncWebCrawler
- Leverages link preview functionality
- Maintains backward compatibility
### 8.3 Persistence
Knowledge base serialization for:
- Resumable crawls
- Knowledge sharing
- Offline analysis
## 9. Future Directions
### 9.1 Advanced Scoring
- Temporal information value
- Multi-query optimization
- Active learning from user feedback
### 9.2 Distributed Crawling
- Collaborative knowledge building
- Federated information sufficiency
### 9.3 Domain Adaptation
- Transfer learning across domains
- Meta-learning for threshold selection
## 10. Conclusion
Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources.
## References
1. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
3. Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675.
4. Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems.
## Appendix A: Implementation Pseudocode
```python
class StatisticalStrategy:
def calculate_confidence(self, state):
coverage = self.calculate_coverage(state)
consistency = self.calculate_consistency(state)
saturation = self.calculate_saturation(state)
return min(coverage, consistency, saturation)
def calculate_coverage(self, state):
# BM25-based term coverage
term_scores = []
for term in state.query.split():
df = state.document_frequencies.get(term, 0)
idf = self.idf_cache.get(term, 1.0)
term_scores.append(log(df + 1) * idf)
return mean(term_scores) / max_possible_score
def rank_links(self, state):
scored_links = []
for link in state.pending_links:
relevance = self.bm25_score(link.preview_text, state.query)
novelty = self.calculate_novelty(link, state.knowledge_base)
authority = self.url_authority(link.href)
score = relevance * novelty * authority
scored_links.append((link, score))
return sorted(scored_links, key=lambda x: x[1], reverse=True)
```
## Appendix B: Evaluation Protocol
1. **Dataset Creation**:
- Select diverse domains (documentation, blogs, e-commerce)
- Generate 100 queries per domain using LLM
- Create query variations (5-10 per query)
2. **Baseline Comparisons**:
- BFS crawler (depth-limited)
- DFS crawler (depth-limited)
- Random crawler
- Oracle (knows relevant pages)
3. **Metrics Collection**:
- Pages crawled vs query answerability
- Time to sufficient confidence
- False positive/negative rates
4. **Statistical Analysis**:
- ANOVA for strategy comparison
- Regression for parameter sensitivity
- Bootstrap for confidence intervals

809
README-first.md Normal file
View File

@@ -0,0 +1,809 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
<div align="center">
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)
<p align="center">
<a href="https://x.com/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
</a>
<a href="https://www.linkedin.com/company/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
</a>
<a href="https://discord.gg/jP8KfhDhyN">
<img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
</a>
</p>
</div>
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.7.0](#-recent-updates)
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didnt meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open-source for two reasons. First, its my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>
## 🧐 Why Crawl4AI?
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
## 🚀 Quick Start
1. Install Crawl4AI:
```bash
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
```
If you encounter any browser-related issues, you can install them manually:
```bash
python -m playwright install --with-deps chromium
```
2. Run a simple web crawl with Python:
```python
import asyncio
from crawl4ai import *
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
3. Or use the new command-line interface:
```bash
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown
# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```
## ✨ Features
<details>
<summary>📝 <strong>Markdown Generation</strong></summary>
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>
<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
</details>
<details>
<summary>🌐 <strong>Browser Integration</strong></summary>
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
</details>
<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
</details>
<details>
<summary>🚀 <strong>Deployment</strong></summary>
- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
</details>
<details>
<summary>🎯 <strong>Additional Features</strong></summary>
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
</details>
## Try it Now!
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
## Installation 🛠️
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
<details>
<summary>🐍 <strong>Using pip</strong></summary>
Choose the installation option that best fits your needs:
### Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
```bash
playwright install
```
2. If the above doesn't work, try this more specific command:
```bash
python -m playwright install chromium
```
This second method has proven to be more reliable in some cases.
---
### Installation with Synchronous Version
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
```bash
pip install crawl4ai[sync]
```
---
### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e . # Basic installation in editable mode
```
Install optional features:
```bash
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
```
</details>
<details>
<summary>🐳 <strong>Docker Deployment</strong></summary>
> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
### New Docker Features
The new Docker implementation includes:
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
- **Optimized resources** with improved memory management
### Getting Started
```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.7.0
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
# Visit the playground at http://localhost:11235/playground
```
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
</details>
---
### Quick Test
Run a quick test (works for both Docker options):
```python
import requests
# Submit a crawl job
response = requests.post(
"http://localhost:11235/crawl",
json={"urls": ["https://example.com"], "priority": 10}
)
if response.status_code == 200:
print("Crawl job submitted successfully.")
if "results" in response.json():
results = response.json()["results"]
print("Crawl job completed. Results:")
for result in results:
print(result)
else:
task_id = response.json()["task_id"]
print(f"Crawl job submitted. Task ID:: {task_id}")
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
</details>
## 🔬 Advanced Usage Examples 🔬
You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
browser_config = BrowserConfig(
headless=True,
verbose=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
config=run_config
)
print(len(result.markdown.raw_markdown))
print(len(result.markdown.fit_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
</details>
<details>
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json
async def main():
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
}
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
headless=False,
verbose=True
)
run_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
config=run_config
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
if __name__ == "__main__":
asyncio.run(main())
```
</details>
<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def main():
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
llm_config = LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url='https://openai.com/api/pricing/',
config=run_config
)
print(result.extracted_content)
if __name__ == "__main__":
asyncio.run(main())
```
</details>
<details>
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
```python
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
browser_config = BrowserConfig(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
config=run_config,
magic=True,
)
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown)}")
```
</details>
## ✨ Recent Updates
### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
```python
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
max_depth=5, # Maximum crawl depth
max_pages=20, # Maximum number of pages to crawl
strategy="statistical"
)
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config)
state = await adaptive_crawler.digest(
start_url="https://news.example.com",
query="latest news content"
)
# Crawler learns patterns and improves extraction over time
```
- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
```python
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='feed']",
scroll_count=20,
scroll_by="container_height",
wait_after_scroll=1.0
)
result = await crawler.arun(url, config=CrawlerRunConfig(
virtual_scroll_config=scroll_config
))
```
- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
```python
link_config = LinkPreviewConfig(
query="machine learning tutorials",
score_threshold=0.3,
concurrent_requests=10
)
result = await crawler.arun(url, config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True
))
# Links ranked by relevance and quality
```
- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
```python
seeder = AsyncUrlSeeder(SeedingConfig(
source="sitemap+cc",
pattern="*/blog/*",
query="python tutorials",
score_threshold=0.4
))
urls = await seeder.discover("https://example.com")
```
- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
## Version Numbering in Crawl4AI
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
### Version Numbers Explained
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
#### Pre-release Versions
We use different suffixes to indicate development stages:
- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3): Release candidates, potential final version
#### Installation
- Regular installation (stable version):
```bash
pip install -U crawl4ai
```
- Install pre-release versions:
```bash
pip install crawl4ai --pre
```
- Install specific version:
```bash
pip install crawl4ai==0.4.3b1
```
#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
## 📖 Documentation & Roadmap
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
<details>
<summary>📈 <strong>Development TODOs</strong></summary>
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
</details>
## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
I'll help modify the license section with badges. For the halftone effect, here's a version with it:
Here's the updated license section:
## 📄 License & Attribution
This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
### Attribution Requirements
When using Crawl4AI, you must include one of the following attribution methods:
#### 1. Badge Attribution (Recommended)
Add one of these badges to your README, documentation, or website:
| Theme | Badge |
|-------|-------|
| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |
HTML code for adding the badges:
```html
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
```
#### 2. Text Attribution
Add this line to your documentation:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
## 📚 Citation
If you use Crawl4AI in your research or project, please cite:
```bibtex
@software{crawl4ai2024,
author = {UncleCode},
title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
year = {2024},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/unclecode/crawl4ai}},
commit = {Please use the commit hash you're working with}
}
```
Text citation format:
```
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
```
## 📧 Contact
For questions, suggestions, or feedback, feel free to reach out:
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Happy Crawling! 🕸️🚀
## 💖 Support Crawl4AI
> 🎉 **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame!
Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your support ensures we stay independent, innovative, and free forever.
<div align="center">
[![Become a Sponsor](https://img.shields.io/badge/Become%20a%20Sponsor-pink?style=for-the-badge&logo=github-sponsors&logoColor=white)](https://github.com/sponsors/unclecode)
[![Current Sponsors](https://img.shields.io/github/sponsors/unclecode?style=for-the-badge&logo=github&label=Current%20Sponsors&color=green)](https://github.com/sponsors/unclecode)
</div>
### 🤝 Sponsorship Tiers
- **🌱 Believer ($5/mo)**: Join the movement for data democratization
- **🚀 Builder ($50/mo)**: Get priority support and early feature access
- **💼 Growing Team ($500/mo)**: Bi-weekly syncs and optimization help
- **🏢 Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support
**Why sponsor?** Every tier includes real benefits. No more rate-limited APIs. Own your data pipeline. Build data sovereignty together.
[View All Tiers & Benefits →](https://github.com/sponsors/unclecode)
### 🏆 Our Sponsors
#### 👑 Founding Sponsors (First 50)
*Be part of history - [Become a Founding Sponsor](https://github.com/sponsors/unclecode)*
<!-- Founding sponsors will be permanently recognized here -->
#### Current Sponsors
Thank you to all our sponsors who make this project possible!
<!-- Sponsors will be automatically added here -->
## 🗾 Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
</details>
<details>
<summary>🚀 <strong>Development Pathway</strong></summary>
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
For more details, see our [full mission statement](./MISSION.md).
</details>
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

391
README.md
View File

@@ -10,41 +10,50 @@
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)
<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
<p align="center">
<a href="https://x.com/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
</a>
<a href="https://www.linkedin.com/company/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
</a>
<a href="https://discord.gg/jP8KfhDhyN">
<img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
</a>
</p>
</div>
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
[✨ Check out latest update v0.6.0](#-recent-updates)
[✨ Check out latest update v0.7.4](#-recent-updates)
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
<summary>🤓 <strong>My Personal Story</strong></summary>
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. Thats where I learned how much extraction matters.
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didnt meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
In 2023, I needed web-to-Markdown. The “open source” option wanted an account, API token, and $16, and still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now its the most-starred crawler on GitHub.
I made Crawl4AI open-source for two reasons. First, its my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
I made it open source for **availability**, anyone can use it without a gate. Now Im building the platform for **affordability**, anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.
</details>
## 🧐 Why Crawl4AI?
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
<details>
<summary>Why developers pick Crawl4AI</summary>
- **LLM ready output**, smart Markdown with headings, tables, code, citation hints
- **Fast in practice**, async browser pool, caching, minimal hops
- **Full control**, sessions, proxies, cookies, user scripts, hooks
- **Adaptive intelligence**, learns site patterns, explores only what matters
- **Deploy anywhere**, zero keys, CLI and Docker, cloud friendly
</details>
## 🚀 Quick Start
@@ -96,6 +105,33 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
```
## 💖 Support Crawl4AI
> 🎉 **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.
Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community — while giving you direct access to premium benefits.
<div align="">
[![Become a Sponsor](https://img.shields.io/badge/Become%20a%20Sponsor-pink?style=for-the-badge&logo=github-sponsors&logoColor=white)](https://github.com/sponsors/unclecode)
[![Current Sponsors](https://img.shields.io/github/sponsors/unclecode?style=for-the-badge&logo=github&label=Current%20Sponsors&color=green)](https://github.com/sponsors/unclecode)
</div>
### 🤝 Sponsorship Tiers
- **🌱 Believer ($5/mo)** — Join the movement for data democratization
- **🚀 Builder ($50/mo)** — Priority support & early access to features
- **💼 Growing Team ($500/mo)** — Bi-weekly syncs & optimization help
- **🏢 Data Infrastructure Partner ($2000/mo)** — Full partnership with dedicated support
*Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact*
**Why sponsor?**
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.
[See All Tiers & Benefits →](https://github.com/sponsors/unclecode)
## ✨ Features
<details>
@@ -268,19 +304,13 @@ The new Docker implementation includes:
### Getting Started
```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
# Pull and run the latest release
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Visit the playground at http://localhost:11235/playground
```
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
</details>
---
### Quick Test
Run a quick test (works for both Docker options):
@@ -291,22 +321,31 @@ import requests
# Submit a crawl job
response = requests.post(
"http://localhost:11235/crawl",
json={"urls": "https://example.com", "priority": 10}
json={"urls": ["https://example.com"], "priority": 10}
)
task_id = response.json()["task_id"]
# Continue polling until the task is complete (status="completed")
result = requests.get(f"http://localhost:11235/task/{task_id}")
if response.status_code == 200:
print("Crawl job submitted successfully.")
if "results" in response.json():
results = response.json()["results"]
print("Crawl job completed. Results:")
for result in results:
print(result)
else:
task_id = response.json()["task_id"]
print(f"Crawl job submitted. Task ID:: {task_id}")
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
</details>
---
## 🔬 Advanced Usage Examples 🔬
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
@@ -465,7 +504,7 @@ if __name__ == "__main__":
</details>
<details>
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
```python
import os, sys
@@ -505,98 +544,195 @@ async def test_news_crawl():
## ✨ Recent Updates
### Version 0.6.0 Release Highlights
<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables:
```python
crun_cfg = CrawlerRunConfig(
url="https://browserleaks.com/geo", # test page that shows your location
locale="en-US", # Accept-Language & UI locale
timezone_id="America/Los_Angeles", # JS Date()/Intl timezone
geolocation=GeolocationConfig( # override GPS coords
latitude=34.0522,
longitude=-118.2437,
accuracy=10.0,
)
)
```
- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
```python
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
# Set up scraping parameters
crawl_config = CrawlerRunConfig(
table_score_threshold=8, # Strict table detection
)
# Execute market data extraction
results: List[CrawlResult] = await crawler.arun(
url="https://coinmarketcap.com/?page=1", config=crawl_config
)
# Process results
raw_df = pd.DataFrame()
for result in results:
if result.success and result.media["tables"]:
raw_df = pd.DataFrame(
result.media["tables"][0]["rows"],
columns=result.media["tables"][0]["headers"],
)
break
print(raw_df.head())
finally:
await crawler.stop()
```
- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
```python
crawler_config = CrawlerRunConfig(
capture_network=True,
capture_console=True,
mhtml=True
from crawl4ai import LLMTableExtraction, LLMConfig
# Configure intelligent table extraction
table_strategy = LLMTableExtraction(
llm_config=LLMConfig(provider="openai/gpt-4.1-mini"),
enable_chunking=True, # Handle massive tables
chunk_token_threshold=5000, # Smart chunking threshold
overlap_threshold=100, # Maintain context between chunks
extraction_type="structured" # Get structured data output
)
config = CrawlerRunConfig(table_extraction_strategy=table_strategy)
result = await crawler.arun("https://complex-tables-site.com", config=config)
# Tables are automatically chunked, processed, and merged
for table in result.tables:
print(f"Extracted table: {len(table['data'])} rows")
```
- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
```bash
# Add Crawl4AI to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks
- **🧹 Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
- **🔗 Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats
[Full v0.7.4 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
</details>
<details>
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>
- **🕵️ Undetected Browser Support**: Bypass sophisticated bot detection systems:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(
browser_type="undetected", # Use undetected Chrome
headless=True, # Can run headless with stealth
extra_args=[
"--disable-blink-features=AutomationControlled",
"--disable-web-security"
]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://protected-site.com")
# Successfully bypass Cloudflare, Akamai, and custom bot detection
```
- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `http://localhost:11235//playground`
- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
configs = [
# Documentation sites - aggressive caching
CrawlerRunConfig(
url_matcher=["*docs*", "*documentation*"],
cache_mode="write",
markdown_generator_options={"include_links": True}
),
# News/blog sites - fresh content
CrawlerRunConfig(
url_matcher=lambda url: 'blog' in url or 'news' in url,
cache_mode="bypass"
),
# Fallback for everything else
CrawlerRunConfig()
]
results = await crawler.arun_many(urls, config=configs)
# Each URL gets the perfect configuration automatically
```
- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
```python
from crawl4ai.memory_utils import MemoryMonitor
monitor = MemoryMonitor()
monitor.start_monitoring()
results = await crawler.arun_many(large_url_list)
report = monitor.get_report()
print(f"Peak memory: {report['peak_mb']:.1f} MB")
print(f"Efficiency: {report['efficiency']:.1f}%")
# Get optimization recommendations
```
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
```python
result = await crawler.arun("https://site-with-tables.com")
# New way - direct table access
if result.tables:
import pandas as pd
for table in result.tables:
df = pd.DataFrame(table['data'])
print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
```
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables
### Previous Version: 0.5.0 Major Release Highlights
[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
- **👤 Browser Profiler**: Create and manage persistent browser profiles
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
- **🌐 Proxy Rotation**: Built-in support for proxy switching
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
</details>
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
<details>
<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>
- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
```python
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
max_depth=5, # Maximum crawl depth
max_pages=20, # Maximum number of pages to crawl
strategy="statistical"
)
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config)
state = await adaptive_crawler.digest(
start_url="https://news.example.com",
query="latest news content"
)
# Crawler learns patterns and improves extraction over time
```
- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
```python
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='feed']",
scroll_count=20,
scroll_by="container_height",
wait_after_scroll=1.0
)
result = await crawler.arun(url, config=CrawlerRunConfig(
virtual_scroll_config=scroll_config
))
```
- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
```python
link_config = LinkPreviewConfig(
query="machine learning tutorials",
score_threshold=0.3,
concurrent_requests=10
)
result = await crawler.arun(url, config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True
))
# Links ranked by relevance and quality
```
- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
```python
seeder = AsyncUrlSeeder(SeedingConfig(
source="sitemap+cc",
pattern="*/blog/*",
query="python tutorials",
score_threshold=0.4
))
urls = await seeder.discover("https://example.com")
```
- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
</details>
## Version Numbering in Crawl4AI
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
### Version Numbers Explained
<details>
<summary>📈 <strong>Version Numbers Explained</strong></summary>
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
@@ -633,6 +769,8 @@ We use pre-releases to:
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
</details>
## 📖 Documentation & Roadmap
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
@@ -645,16 +783,16 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
<summary>📈 <strong>Development TODOs</strong></summary>
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [x] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [x] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [x] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [x] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [x] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [x] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [x] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [x] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
</details>
@@ -669,12 +807,13 @@ Here's the updated license section:
## 📄 License & Attribution
This project is licensed under the Apache License 2.0 with a required attribution clause. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
### Attribution Requirements
When using Crawl4AI, you must include one of the following attribution methods:
#### 1. Badge Attribution (Recommended)
<details>
<summary>📈 <strong>1. Badge Attribution (Recommended)</strong></summary>
Add one of these badges to your README, documentation, or website:
| Theme | Badge |
@@ -713,11 +852,15 @@ HTML code for adding the badges:
</a>
```
#### 2. Text Attribution
</details>
<details>
<summary>📖 <strong>2. Text Attribution</strong></summary>
Add this line to your documentation:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
</details>
## 📚 Citation

65
SPONSORS.md Normal file
View File

@@ -0,0 +1,65 @@
# 💖 Sponsors & Supporters
Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this project open-source and actively maintained.
## 👑 Founding Sponsors
*The first 50 sponsors who believed in our vision - permanently recognized*
<!-- Founding sponsors will be listed here with special recognition -->
🎉 **Become a Founding Sponsor!** Only [X/50] spots remaining! [Join now →](https://github.com/sponsors/unclecode)
---
## 🏢 Data Infrastructure Partners ($2000/month)
*These organizations are building their data sovereignty with Crawl4AI at the core*
<!-- Data Infrastructure Partners will be listed here -->
*Be the first Data Infrastructure Partner! [Join us →](https://github.com/sponsors/unclecode)*
---
## 💼 Growing Teams ($500/month)
*Teams scaling their data extraction with Crawl4AI*
<!-- Growing Teams will be listed here -->
*Your team could be here! [Become a sponsor →](https://github.com/sponsors/unclecode)*
---
## 🚀 Builders ($50/month)
*Developers and entrepreneurs building with Crawl4AI*
<!-- Builders will be listed here -->
*Join the builders! [Start sponsoring →](https://github.com/sponsors/unclecode)*
---
## 🌱 Believers ($5/month)
*The community supporting data democratization*
<!-- Believers will be listed here -->
*Thank you to all our community believers!*
---
## 🤝 Want to Sponsor?
Crawl4AI is the #1 trending open-source web crawler. We're building the future of data extraction - where organizations own their data pipelines instead of relying on rate-limited APIs.
### Available Sponsorship Tiers:
- **🌱 Believer** ($5/mo) - Support the movement
- **🚀 Builder** ($50/mo) - Priority support & early access
- **💼 Growing Team** ($500/mo) - Bi-weekly syncs & optimization
- **🏢 Data Infrastructure Partner** ($2000/mo) - Full partnership & dedicated support
[View all tiers and benefits →](https://github.com/sponsors/unclecode)
### Enterprise & Custom Partnerships
Building data extraction at scale? Need dedicated support or infrastructure? Let's talk about a custom partnership.
📧 Contact: [hello@crawl4ai.com](mailto:hello@crawl4ai.com) | 📅 [Schedule a call](https://calendar.app.google/rEpvi2UBgUQjWHfJ9)
---
*This list is updated regularly. Sponsors at $50+ tiers can submit their logos via [hello@crawl4ai.com](mailto:hello@crawl4ai.com)*

View File

@@ -0,0 +1,190 @@
# Crawl4AI Telemetry Testing Implementation
## Overview
This document summarizes the comprehensive testing strategy implementation for Crawl4AI's opt-in telemetry system. The implementation provides thorough test coverage across unit tests, integration tests, privacy compliance tests, and performance tests.
## Implementation Summary
### 📊 Test Statistics
- **Total Tests**: 40 tests
- **Success Rate**: 100% (40/40 passing)
- **Test Categories**: 4 categories (Unit, Integration, Privacy, Performance)
- **Code Coverage**: 51% (625 statements, 308 missing)
### 🗂️ Test Structure
#### 1. **Unit Tests** (`tests/telemetry/test_telemetry.py`)
- `TestTelemetryConfig`: Configuration management and persistence
- `TestEnvironmentDetection`: CLI, Docker, API server environment detection
- `TestTelemetryManager`: Singleton pattern and exception capture
- `TestConsentManager`: Docker default behavior and environment overrides
- `TestPublicAPI`: Public enable/disable/status functions
- `TestIntegration`: Crawler exception capture integration
#### 2. **Integration Tests** (`tests/telemetry/test_integration.py`)
- `TestTelemetryCLI`: CLI command testing (status, enable, disable)
- `TestAsyncWebCrawlerIntegration`: Real crawler integration with decorators
- `TestDockerIntegration`: Docker environment-specific behavior
- `TestTelemetryProviderIntegration`: Sentry provider initialization and fallbacks
#### 3. **Privacy & Performance Tests** (`tests/telemetry/test_privacy_performance.py`)
- `TestTelemetryPrivacy`: Data sanitization and PII protection
- `TestTelemetryPerformance`: Decorator overhead measurement
- `TestTelemetryScalability`: Multiple and concurrent exception handling
#### 4. **Hello World Test** (`tests/telemetry/test_hello_world_telemetry.py`)
- Basic telemetry functionality validation
### 🔧 Testing Infrastructure
#### **Pytest Configuration** (`pytest.ini`)
```ini
[pytest]
testpaths = tests/telemetry
markers =
unit: Unit tests
integration: Integration tests
privacy: Privacy compliance tests
performance: Performance tests
asyncio_mode = auto
```
#### **Test Fixtures** (`tests/conftest.py`)
- `temp_config_dir`: Temporary configuration directory
- `enabled_telemetry_config`: Pre-configured enabled telemetry
- `disabled_telemetry_config`: Pre-configured disabled telemetry
- `mock_sentry_provider`: Mocked Sentry provider for testing
#### **Makefile Targets** (`Makefile.telemetry`)
```makefile
test-all: Run all telemetry tests
test-unit: Run unit tests only
test-integration: Run integration tests only
test-privacy: Run privacy tests only
test-performance: Run performance tests only
test-coverage: Run tests with coverage report
test-watch: Run tests in watch mode
test-parallel: Run tests in parallel
```
## 🎯 Key Features Tested
### Privacy Compliance
- ✅ No URLs captured in telemetry data
- ✅ No content captured in telemetry data
- ✅ No PII (personally identifiable information) captured
- ✅ Sanitized context only (error types, stack traces without content)
### Performance Impact
- ✅ Telemetry decorator overhead < 1ms
- ✅ Async decorator overhead < 1ms
- ✅ Disabled telemetry has minimal performance impact
- ✅ Configuration loading performance acceptable
- ✅ Multiple exception capture scalability
- ✅ Concurrent exception capture handling
### Integration Points
- ✅ CLI command integration (status, enable, disable)
- ✅ AsyncWebCrawler decorator integration
- ✅ Docker environment auto-detection
- ✅ Sentry provider initialization
- ✅ Graceful degradation without Sentry
- ✅ Environment variable overrides
### Core Functionality
- ✅ Configuration persistence and loading
- ✅ Consent management (Docker defaults, user prompts)
- ✅ Environment detection (CLI, Docker, Jupyter, etc.)
- ✅ Singleton pattern for TelemetryManager
- ✅ Exception capture and forwarding
- ✅ Provider abstraction (Sentry, Null)
## 🚀 Usage Examples
### Run All Tests
```bash
make -f Makefile.telemetry test-all
```
### Run Specific Test Categories
```bash
# Unit tests only
make -f Makefile.telemetry test-unit
# Integration tests only
make -f Makefile.telemetry test-integration
# Privacy tests only
make -f Makefile.telemetry test-privacy
# Performance tests only
make -f Makefile.telemetry test-performance
```
### Coverage Report
```bash
make -f Makefile.telemetry test-coverage
```
### Parallel Execution
```bash
make -f Makefile.telemetry test-parallel
```
## 📁 File Structure
```
tests/
├── conftest.py # Shared pytest fixtures
└── telemetry/
├── test_hello_world_telemetry.py # Basic functionality test
├── test_telemetry.py # Unit tests
├── test_integration.py # Integration tests
└── test_privacy_performance.py # Privacy & performance tests
# Configuration
pytest.ini # Pytest configuration with markers
Makefile.telemetry # Convenient test execution targets
```
## 🔍 Test Isolation & Mocking
### Environment Isolation
- Tests run in isolated temporary directories
- Environment variables are properly mocked/isolated
- No interference between test runs
- Clean state for each test
### Mock Strategies
- `unittest.mock` for external dependencies
- Temporary file systems for configuration testing
- Subprocess mocking for CLI command testing
- Time measurement for performance testing
## 📈 Coverage Analysis
Current test coverage: **51%** (625 statements)
### Well-Covered Areas:
- Core configuration management (78%)
- Telemetry initialization (69%)
- Environment detection (64%)
### Areas for Future Enhancement:
- Consent management UI (20% - interactive prompts)
- Sentry provider implementation (25% - network calls)
- Base provider abstractions (49% - error handling paths)
## 🎉 Implementation Success
The comprehensive testing strategy has been **successfully implemented** with:
-**100% test pass rate** (40/40 tests passing)
-**Complete test infrastructure** (fixtures, configuration, targets)
-**Privacy compliance verification** (no PII, URLs, or content captured)
-**Performance validation** (minimal overhead confirmed)
-**Integration testing** (CLI, Docker, AsyncWebCrawler)
-**CI/CD ready** (Makefile targets for automation)
The telemetry system now has robust test coverage ensuring reliability, privacy compliance, and performance characteristics while maintaining comprehensive validation of all core functionality.

View File

@@ -3,12 +3,12 @@ import warnings
from .async_webcrawler import AsyncWebCrawler, CacheMode
# MODIFIED: Add SeedingConfig and VirtualScrollConfig here
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig, LinkPreviewConfig, MatchMode
from .content_scraping_strategy import (
ContentScrapingStrategy,
WebScrapingStrategy,
LXMLWebScrapingStrategy,
WebScrapingStrategy, # Backward compatibility alias
)
from .async_logger import (
AsyncLoggerBase,
@@ -29,6 +29,12 @@ from .extraction_strategy import (
)
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .table_extraction import (
TableExtractionStrategy,
DefaultTableExtraction,
NoTableExtraction,
LLMTableExtraction,
)
from .content_filter_strategy import (
PruningContentFilter,
BM25ContentFilter,
@@ -69,6 +75,14 @@ from .deep_crawling import (
)
# NEW: Import AsyncUrlSeeder
from .async_url_seeder import AsyncUrlSeeder
# Adaptive Crawler
from .adaptive_crawler import (
AdaptiveCrawler,
AdaptiveConfig,
CrawlState,
CrawlStrategy,
StatisticalStrategy
)
# C4A Script Language Support
from .script import (
@@ -80,6 +94,13 @@ from .script import (
ErrorDetail
)
# Browser Adapters
from .browser_adapter import (
BrowserAdapter,
PlaywrightAdapter,
UndetectedAdapter
)
from .utils import (
start_colab_display_server,
setup_colab_environment
@@ -97,6 +118,12 @@ __all__ = [
"VirtualScrollConfig",
# NEW: Add AsyncUrlSeeder
"AsyncUrlSeeder",
# Adaptive Crawler
"AdaptiveCrawler",
"AdaptiveConfig",
"CrawlState",
"CrawlStrategy",
"StatisticalStrategy",
"DeepCrawlStrategy",
"BFSDeepCrawlStrategy",
"BestFirstCrawlingStrategy",
@@ -118,6 +145,7 @@ __all__ = [
"CrawlResult",
"CrawlerHub",
"CacheMode",
"MatchMode",
"ContentScrapingStrategy",
"WebScrapingStrategy",
"LXMLWebScrapingStrategy",
@@ -134,6 +162,9 @@ __all__ = [
"ChunkingStrategy",
"RegexChunking",
"DefaultMarkdownGenerator",
"TableExtractionStrategy",
"DefaultTableExtraction",
"NoTableExtraction",
"RelevantContentFilter",
"PruningContentFilter",
"BM25ContentFilter",
@@ -159,6 +190,11 @@ __all__ = [
"CompilationResult",
"ValidationResult",
"ErrorDetail",
# Browser Adapters
"BrowserAdapter",
"PlaywrightAdapter",
"UndetectedAdapter",
"LinkPreviewConfig"
]

View File

@@ -1,3 +1,8 @@
# crawl4ai/_version.py
__version__ = "0.6.3"
# crawl4ai/__version__.py
# This is the version that will be used for stable releases
__version__ = "0.7.4"
# For nightly builds, this gets set during build process
__nightly_version__ = None

File diff suppressed because it is too large Load Diff

1861
crawl4ai/adaptive_crawler.py Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -18,17 +18,25 @@ from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy, LXMLWebScrapingStrategy
from .content_scraping_strategy import ContentScrapingStrategy, LXMLWebScrapingStrategy
from .deep_crawling import DeepCrawlStrategy
from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
from .cache_context import CacheMode
from .proxy_strategy import ProxyRotationStrategy
from typing import Union, List
from typing import Union, List, Callable
import inspect
from typing import Any, Dict, Optional
from enum import Enum
# Type alias for URL matching
UrlMatcher = Union[str, Callable[[str], bool], List[Union[str, Callable[[str], bool]]]]
class MatchMode(Enum):
OR = "or"
AND = "and"
# from .proxy_strategy import ProxyConfig
@@ -383,6 +391,8 @@ class BrowserConfig:
light_mode (bool): Disables certain background features for performance gains. Default: False.
extra_args (list): Additional command-line arguments passed to the browser.
Default: [].
enable_stealth (bool): If True, applies playwright-stealth to bypass basic bot detection.
Cannot be used with use_undetected browser mode. Default: False.
"""
def __init__(
@@ -423,6 +433,7 @@ class BrowserConfig:
extra_args: list = None,
debugging_port: int = 9222,
host: str = "localhost",
enable_stealth: bool = False,
):
self.browser_type = browser_type
self.headless = headless
@@ -438,6 +449,10 @@ class BrowserConfig:
self.chrome_channel = ""
self.proxy = proxy
self.proxy_config = proxy_config
if isinstance(self.proxy_config, dict):
self.proxy_config = ProxyConfig.from_dict(self.proxy_config)
if isinstance(self.proxy_config, str):
self.proxy_config = ProxyConfig.from_string(self.proxy_config)
self.viewport_width = viewport_width
@@ -463,6 +478,7 @@ class BrowserConfig:
self.verbose = verbose
self.debugging_port = debugging_port
self.host = host
self.enable_stealth = enable_stealth
fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode == "random":
@@ -494,6 +510,13 @@ class BrowserConfig:
# If persistent context is requested, ensure managed browser is enabled
if self.use_persistent_context:
self.use_managed_browser = True
# Validate stealth configuration
if self.enable_stealth and self.use_managed_browser and self.browser_mode == "builtin":
raise ValueError(
"enable_stealth cannot be used with browser_mode='builtin'. "
"Stealth mode requires a dedicated browser instance."
)
@staticmethod
def from_kwargs(kwargs: dict) -> "BrowserConfig":
@@ -530,6 +553,7 @@ class BrowserConfig:
extra_args=kwargs.get("extra_args", []),
debugging_port=kwargs.get("debugging_port", 9222),
host=kwargs.get("host", "localhost"),
enable_stealth=kwargs.get("enable_stealth", False),
)
def to_dict(self):
@@ -564,6 +588,7 @@ class BrowserConfig:
"verbose": self.verbose,
"debugging_port": self.debugging_port,
"host": self.host,
"enable_stealth": self.enable_stealth,
}
@@ -862,7 +887,7 @@ class CrawlerRunConfig():
parser_type (str): Type of parser to use for HTML parsing.
Default: "lxml".
scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
Default: WebScrapingStrategy.
Default: LXMLWebScrapingStrategy.
proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
@@ -926,6 +951,8 @@ class CrawlerRunConfig():
Default: False.
scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
Default: 0.2.
max_scroll_steps (Optional[int]): Maximum number of scroll steps to perform during full page scan.
If None, scrolls until the entire page is loaded. Default: None.
process_iframes (bool): If True, attempts to process and inline iframe content.
Default: False.
remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
@@ -956,6 +983,8 @@ class CrawlerRunConfig():
Default: False.
table_score_threshold (int): Minimum score threshold for processing a table.
Default: 7.
table_extraction (TableExtractionStrategy): Strategy to use for table extraction.
Default: DefaultTableExtraction with table_score_threshold.
# Virtual Scroll Parameters
virtual_scroll_config (VirtualScrollConfig or dict or None): Configuration for handling virtual scroll containers.
@@ -1066,6 +1095,7 @@ class CrawlerRunConfig():
ignore_body_visibility: bool = True,
scan_full_page: bool = False,
scroll_delay: float = 0.2,
max_scroll_steps: Optional[int] = None,
process_iframes: bool = False,
remove_overlay_elements: bool = False,
simulate_user: bool = False,
@@ -1081,6 +1111,7 @@ class CrawlerRunConfig():
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
table_score_threshold: int = 7,
table_extraction: TableExtractionStrategy = None,
exclude_external_images: bool = False,
exclude_all_images: bool = False,
# Link and Domain Handling Parameters
@@ -1110,6 +1141,9 @@ class CrawlerRunConfig():
link_preview_config: Union[LinkPreviewConfig, Dict[str, Any]] = None,
# Virtual Scroll Parameters
virtual_scroll_config: Union[VirtualScrollConfig, Dict[str, Any]] = None,
# URL Matching Parameters
url_matcher: Optional[UrlMatcher] = None,
match_mode: MatchMode = MatchMode.OR,
# Experimental Parameters
experimental: Dict[str, Any] = None,
):
@@ -1133,6 +1167,11 @@ class CrawlerRunConfig():
self.parser_type = parser_type
self.scraping_strategy = scraping_strategy or LXMLWebScrapingStrategy()
self.proxy_config = proxy_config
if isinstance(proxy_config, dict):
self.proxy_config = ProxyConfig.from_dict(proxy_config)
if isinstance(proxy_config, str):
self.proxy_config = ProxyConfig.from_string(proxy_config)
self.proxy_rotation_strategy = proxy_rotation_strategy
# Browser Location and Identity Parameters
@@ -1170,6 +1209,7 @@ class CrawlerRunConfig():
self.ignore_body_visibility = ignore_body_visibility
self.scan_full_page = scan_full_page
self.scroll_delay = scroll_delay
self.max_scroll_steps = max_scroll_steps
self.process_iframes = process_iframes
self.remove_overlay_elements = remove_overlay_elements
self.simulate_user = simulate_user
@@ -1188,6 +1228,12 @@ class CrawlerRunConfig():
self.exclude_external_images = exclude_external_images
self.exclude_all_images = exclude_all_images
self.table_score_threshold = table_score_threshold
# Table extraction strategy (default to DefaultTableExtraction if not specified)
if table_extraction is None:
self.table_extraction = DefaultTableExtraction(table_score_threshold=table_score_threshold)
else:
self.table_extraction = table_extraction
# Link and Domain Handling Parameters
self.exclude_social_media_domains = (
@@ -1262,6 +1308,10 @@ class CrawlerRunConfig():
else:
raise ValueError("virtual_scroll_config must be VirtualScrollConfig object or dict")
# URL Matching Parameters
self.url_matcher = url_matcher
self.match_mode = match_mode
# Experimental Parameters
self.experimental = experimental or {}
@@ -1317,6 +1367,51 @@ class CrawlerRunConfig():
if "compilation error" not in str(e).lower():
raise ValueError(f"Failed to compile C4A script: {str(e)}")
raise
def is_match(self, url: str) -> bool:
"""Check if this config matches the given URL.
Args:
url: The URL to check against this config's matcher
Returns:
bool: True if this config should be used for the URL or if no matcher is set.
"""
if self.url_matcher is None:
return True
if callable(self.url_matcher):
# Single function matcher
return self.url_matcher(url)
elif isinstance(self.url_matcher, str):
# Single pattern string
from fnmatch import fnmatch
return fnmatch(url, self.url_matcher)
elif isinstance(self.url_matcher, list):
# List of mixed matchers
if not self.url_matcher: # Empty list
return False
results = []
for matcher in self.url_matcher:
if callable(matcher):
results.append(matcher(url))
elif isinstance(matcher, str):
from fnmatch import fnmatch
results.append(fnmatch(url, matcher))
else:
# Skip invalid matchers
continue
# Apply match mode logic
if self.match_mode == MatchMode.OR:
return any(results) if results else False
else: # AND mode
return all(results) if results else False
return False
def __getattr__(self, name):
@@ -1387,6 +1482,7 @@ class CrawlerRunConfig():
ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
scan_full_page=kwargs.get("scan_full_page", False),
scroll_delay=kwargs.get("scroll_delay", 0.2),
max_scroll_steps=kwargs.get("max_scroll_steps"),
process_iframes=kwargs.get("process_iframes", False),
remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
simulate_user=kwargs.get("simulate_user", False),
@@ -1409,6 +1505,7 @@ class CrawlerRunConfig():
"image_score_threshold", IMAGE_SCORE_THRESHOLD
),
table_score_threshold=kwargs.get("table_score_threshold", 7),
table_extraction=kwargs.get("table_extraction", None),
exclude_all_images=kwargs.get("exclude_all_images", False),
exclude_external_images=kwargs.get("exclude_external_images", False),
# Link and Domain Handling Parameters
@@ -1438,6 +1535,9 @@ class CrawlerRunConfig():
# Link Extraction Parameters
link_preview_config=kwargs.get("link_preview_config"),
url=kwargs.get("url"),
# URL Matching Parameters
url_matcher=kwargs.get("url_matcher"),
match_mode=kwargs.get("match_mode", MatchMode.OR),
# Experimental Parameters
experimental=kwargs.get("experimental"),
)
@@ -1499,6 +1599,7 @@ class CrawlerRunConfig():
"ignore_body_visibility": self.ignore_body_visibility,
"scan_full_page": self.scan_full_page,
"scroll_delay": self.scroll_delay,
"max_scroll_steps": self.max_scroll_steps,
"process_iframes": self.process_iframes,
"remove_overlay_elements": self.remove_overlay_elements,
"simulate_user": self.simulate_user,
@@ -1513,6 +1614,7 @@ class CrawlerRunConfig():
"image_description_min_word_threshold": self.image_description_min_word_threshold,
"image_score_threshold": self.image_score_threshold,
"table_score_threshold": self.table_score_threshold,
"table_extraction": self.table_extraction,
"exclude_all_images": self.exclude_all_images,
"exclude_external_images": self.exclude_external_images,
"exclude_social_media_domains": self.exclude_social_media_domains,
@@ -1534,6 +1636,8 @@ class CrawlerRunConfig():
"deep_crawl_strategy": self.deep_crawl_strategy,
"link_preview_config": self.link_preview_config.to_dict() if self.link_preview_config else None,
"url": self.url,
"url_matcher": self.url_matcher,
"match_mode": self.match_mode,
"experimental": self.experimental,
}
@@ -1653,22 +1757,57 @@ class SeedingConfig:
"""
def __init__(
self,
source: str = "sitemap+cc", # Options: "sitemap", "cc", "sitemap+cc"
pattern: Optional[str] = "*", # URL pattern to filter discovered URLs (e.g., "*example.com/blog/*")
live_check: bool = False, # Whether to perform HEAD requests to verify URL liveness
extract_head: bool = False, # Whether to fetch and parse <head> section for metadata
max_urls: int = -1, # Maximum number of URLs to discover (default: -1 for no limit)
concurrency: int = 1000, # Maximum concurrent requests for live checks/head extraction
hits_per_sec: int = 5, # Rate limit in requests per second
force: bool = False, # If True, bypasses the AsyncUrlSeeder's internal .jsonl cache
base_directory: Optional[str] = None, # Base directory for UrlSeeder's cache files (.jsonl)
llm_config: Optional[LLMConfig] = None, # Forward LLM config for future use (e.g., relevance scoring)
verbose: Optional[bool] = None, # Override crawler's general verbose setting
query: Optional[str] = None, # Search query for relevance scoring
score_threshold: Optional[float] = None, # Minimum relevance score to include URL (0.0-1.0)
scoring_method: str = "bm25", # Scoring method: "bm25" (default), future: "semantic"
filter_nonsense_urls: bool = True, # Filter out utility URLs like robots.txt, sitemap.xml, etc.
source: str = "sitemap+cc",
pattern: Optional[str] = "*",
live_check: bool = False,
extract_head: bool = False,
max_urls: int = -1,
concurrency: int = 1000,
hits_per_sec: int = 5,
force: bool = False,
base_directory: Optional[str] = None,
llm_config: Optional[LLMConfig] = None,
verbose: Optional[bool] = None,
query: Optional[str] = None,
score_threshold: Optional[float] = None,
scoring_method: str = "bm25",
filter_nonsense_urls: bool = True,
):
"""
Initialize URL seeding configuration.
Args:
source: Discovery source(s) to use. Options: "sitemap", "cc" (Common Crawl),
or "sitemap+cc" (both). Default: "sitemap+cc"
pattern: URL pattern to filter discovered URLs (e.g., "*example.com/blog/*").
Supports glob-style wildcards. Default: "*" (all URLs)
live_check: Whether to perform HEAD requests to verify URL liveness.
Default: False
extract_head: Whether to fetch and parse <head> section for metadata extraction.
Required for BM25 relevance scoring. Default: False
max_urls: Maximum number of URLs to discover. Use -1 for no limit.
Default: -1
concurrency: Maximum concurrent requests for live checks/head extraction.
Default: 1000
hits_per_sec: Rate limit in requests per second to avoid overwhelming servers.
Default: 5
force: If True, bypasses the AsyncUrlSeeder's internal .jsonl cache and
re-fetches URLs. Default: False
base_directory: Base directory for UrlSeeder's cache files (.jsonl).
If None, uses default ~/.crawl4ai/. Default: None
llm_config: LLM configuration for future features (e.g., semantic scoring).
Currently unused. Default: None
verbose: Override crawler's general verbose setting for seeding operations.
Default: None (inherits from crawler)
query: Search query for BM25 relevance scoring (e.g., "python tutorials").
Requires extract_head=True. Default: None
score_threshold: Minimum relevance score (0.0-1.0) to include URL.
Only applies when query is provided. Default: None
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
Future: "semantic". Default: "bm25"
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
ads.txt, favicon.ico, etc. Default: True
"""
self.source = source
self.pattern = pattern
self.live_check = live_check

File diff suppressed because it is too large Load Diff

View File

@@ -21,6 +21,7 @@ from .async_logger import AsyncLogger
from .ssl_certificate import SSLCertificate
from .user_agent_generator import ValidUAGenerator
from .browser_manager import BrowserManager
from .browser_adapter import BrowserAdapter, PlaywrightAdapter, UndetectedAdapter
import aiofiles
import aiohttp
@@ -71,7 +72,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"""
def __init__(
self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, **kwargs
self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, browser_adapter: BrowserAdapter = None, **kwargs
):
"""
Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration.
@@ -80,11 +81,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
browser_config (BrowserConfig): Configuration object containing browser settings.
If None, will be created from kwargs for backwards compatibility.
logger: Logger instance for recording events and errors.
browser_adapter (BrowserAdapter): Browser adapter for handling browser-specific operations.
If None, defaults to PlaywrightAdapter.
**kwargs: Additional arguments for backwards compatibility and extending functionality.
"""
# Initialize browser config, either from provided object or kwargs
self.browser_config = browser_config or BrowserConfig.from_kwargs(kwargs)
self.logger = logger
# Initialize browser adapter
self.adapter = browser_adapter or PlaywrightAdapter()
# Initialize session management
self._downloaded_files = []
@@ -104,7 +110,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Initialize browser manager with config
self.browser_manager = BrowserManager(
browser_config=self.browser_config, logger=self.logger
browser_config=self.browser_config,
logger=self.logger,
use_undetected=isinstance(self.adapter, UndetectedAdapter)
)
async def __aenter__(self):
@@ -322,7 +330,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"""
try:
result = await page.evaluate(wrapper_js)
result = await self.adapter.evaluate(page, wrapper_js)
return result
except Exception as e:
if "Error evaluating condition" in str(e):
@@ -367,7 +375,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Replace the iframe with a div containing the extracted content
_iframe = iframe_content.replace("`", "\\`")
await page.evaluate(
await self.adapter.evaluate(page,
f"""
() => {{
const iframe = document.getElementById('iframe-{i}');
@@ -445,6 +453,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
return await self._crawl_web(url, config)
elif url.startswith("file://"):
# initialize empty lists for console messages
captured_console = []
# Process local file
local_file_path = url[7:] # Remove 'file://' prefix
if not os.path.exists(local_file_path):
@@ -466,9 +477,15 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
console_messages=captured_console,
)
elif url.startswith("raw:") or url.startswith("raw://"):
#####
# Since both "raw:" and "raw://" start with "raw:", the first condition is always true for both, so "raw://" will be sliced as "//...", which is incorrect.
# Fix: Check for "raw://" first, then "raw:"
# Also, the prefix "raw://" is actually 6 characters long, not 7, so it should be sliced accordingly: url[6:]
#####
elif url.startswith("raw://") or url.startswith("raw:"):
# Process raw HTML content
raw_html = url[4:] if url[:4] == "raw:" else url[7:]
# raw_html = url[4:] if url[:4] == "raw:" else url[7:]
raw_html = url[6:] if url.startswith("raw://") else url[4:]
html = raw_html
if config.screenshot:
screenshot_data = await self._generate_screenshot_from_html(html)
@@ -619,91 +636,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
page.on("requestfailed", handle_request_failed_capture)
# Console Message Capturing
handle_console = None
handle_error = None
if config.capture_console_messages:
def handle_console_capture(msg):
try:
message_type = "unknown"
try:
message_type = msg.type
except:
pass
message_text = "unknown"
try:
message_text = msg.text
except:
pass
# Basic console message with minimal content
entry = {
"type": message_type,
"text": message_text,
"timestamp": time.time()
}
captured_console.append(entry)
except Exception as e:
if self.logger:
self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
# Still add something to the list even on error
captured_console.append({
"type": "console_capture_error",
"error": str(e),
"timestamp": time.time()
})
def handle_pageerror_capture(err):
try:
error_message = "Unknown error"
try:
error_message = err.message
except:
pass
error_stack = ""
try:
error_stack = err.stack
except:
pass
captured_console.append({
"type": "error",
"text": error_message,
"stack": error_stack,
"timestamp": time.time()
})
except Exception as e:
if self.logger:
self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE")
captured_console.append({
"type": "pageerror_capture_error",
"error": str(e),
"timestamp": time.time()
})
# Add event listeners directly
page.on("console", handle_console_capture)
page.on("pageerror", handle_pageerror_capture)
# Set up console capture using adapter
handle_console = await self.adapter.setup_console_capture(page, captured_console)
handle_error = await self.adapter.setup_error_capture(page, captured_console)
# Set up console logging if requested
if config.log_console:
def log_consol(
msg, console_log_type="debug"
): # Corrected the parameter syntax
if console_log_type == "error":
self.logger.error(
message=f"Console error: {msg}", # Use f-string for variable interpolation
tag="CONSOLE"
)
elif console_log_type == "debug":
self.logger.debug(
message=f"Console: {msg}", # Use f-string for variable interpolation
tag="CONSOLE"
)
page.on("console", log_consol)
page.on("pageerror", lambda e: log_consol(e, "error"))
# Note: For undetected browsers, console logging won't work directly
# but captured messages can still be logged after retrieval
try:
# Get SSL certificate information if requested and URL is HTTPS
@@ -741,18 +683,49 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
)
redirected_url = page.url
except Error as e:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
# Allow navigation to be aborted when downloading files
# This is expected behavior for downloads in some browser engines
if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
self.logger.info(
message=f"Navigation aborted, likely due to file download: {url}",
tag="GOTO",
params={"url": url},
)
response = None
else:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
await self.execute_hook(
"after_goto", page, context=context, url=url, response=response, config=config
)
# ──────────────────────────────────────────────────────────────
# Walk the redirect chain. Playwright returns only the last
# hop, so we trace the `request.redirected_from` links until the
# first response that differs from the final one and surface its
# status-code.
# ──────────────────────────────────────────────────────────────
if response is None:
status_code = 200
response_headers = {}
else:
status_code = response.status
response_headers = response.headers
first_resp = response
req = response.request
while req and req.redirected_from:
prev_req = req.redirected_from
prev_resp = await prev_req.response()
if prev_resp: # keep earliest
first_resp = prev_resp
req = prev_req
status_code = first_resp.status
response_headers = first_resp.headers
# if response is None:
# status_code = 200
# response_headers = {}
# else:
# status_code = response.status
# response_headers = response.headers
else:
status_code = 200
@@ -784,7 +757,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
except Error:
visibility_info = await self.check_visibility(page)
if self.browser_config.config.verbose:
if self.browser_config.verbose:
self.logger.debug(
message="Body visibility info: {info}",
tag="DEBUG",
@@ -896,7 +869,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Handle full page scanning
if config.scan_full_page:
await self._handle_full_page_scan(page, config.scroll_delay)
# await self._handle_full_page_scan(page, config.scroll_delay)
await self._handle_full_page_scan(page, config.scroll_delay, config.max_scroll_steps)
# Handle virtual scroll if configured
if config.virtual_scroll_config:
@@ -957,7 +931,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await page.wait_for_load_state("domcontentloaded", timeout=5)
except PlaywrightTimeoutError:
pass
await page.evaluate(update_image_dimensions_js)
await self.adapter.evaluate(page, update_image_dimensions_js)
except Exception as e:
self.logger.error(
message="Error updating image dimensions: {error}",
@@ -986,7 +960,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
for selector in selectors:
try:
content = await page.evaluate(
content = await self.adapter.evaluate(page,
f"""Array.from(document.querySelectorAll("{selector}"))
.map(el => el.outerHTML)
.join('')"""
@@ -1044,6 +1018,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await asyncio.sleep(delay)
return await page.content()
# For undetected browsers, retrieve console messages before returning
if config.capture_console_messages and hasattr(self.adapter, 'retrieve_console_messages'):
final_messages = await self.adapter.retrieve_console_messages(page)
captured_console.extend(final_messages)
# Return complete response
return AsyncCrawlResponse(
html=html,
@@ -1082,13 +1061,19 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
page.remove_listener("response", handle_response_capture)
page.remove_listener("requestfailed", handle_request_failed_capture)
if config.capture_console_messages:
page.remove_listener("console", handle_console_capture)
page.remove_listener("pageerror", handle_pageerror_capture)
# Retrieve any final console messages for undetected browsers
if hasattr(self.adapter, 'retrieve_console_messages'):
final_messages = await self.adapter.retrieve_console_messages(page)
captured_console.extend(final_messages)
# Clean up console capture
await self.adapter.cleanup_console_capture(page, handle_console, handle_error)
# Close the page
await page.close()
async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
# async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1, max_scroll_steps: Optional[int] = None):
"""
Helper method to handle full page scanning.
@@ -1103,6 +1088,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Args:
page (Page): The Playwright page object
scroll_delay (float): The delay between page scrolls
max_scroll_steps (Optional[int]): Maximum number of scroll steps to perform. If None, scrolls until end.
"""
try:
@@ -1127,9 +1113,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
dimensions = await self.get_page_dimensions(page)
total_height = dimensions["height"]
scroll_step_count = 0
while current_position < total_height:
####
# NEW FEATURE: Check if we've reached the maximum allowed scroll steps
# This prevents infinite scrolling on very long pages or infinite scroll scenarios
# If max_scroll_steps is None, this check is skipped (unlimited scrolling - original behavior)
####
if max_scroll_steps is not None and scroll_step_count >= max_scroll_steps:
break
current_position = min(current_position + viewport_height, total_height)
await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
# Increment the step counter for max_scroll_steps tracking
scroll_step_count += 1
# await page.evaluate(f"window.scrollTo(0, {current_position})")
# await asyncio.sleep(scroll_delay)
@@ -1299,7 +1297,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"""
# Execute virtual scroll capture
result = await page.evaluate(virtual_scroll_js, config.to_dict())
result = await self.adapter.evaluate(page, virtual_scroll_js, config.to_dict())
if result.get("replaced", False):
self.logger.success(
@@ -1383,7 +1381,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
remove_overlays_js = load_js_script("remove_overlay_elements")
try:
await page.evaluate(
await self.adapter.evaluate(page,
f"""
(() => {{
try {{
@@ -1616,12 +1614,32 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
num_segments = (page_height // viewport_height) + 1
for i in range(num_segments):
y_offset = i * viewport_height
# Special handling for the last segment
if i == num_segments - 1:
last_part_height = page_height % viewport_height
# If page_height is an exact multiple of viewport_height,
# we don't need an extra segment
if last_part_height == 0:
# Skip last segment if page height is exact multiple of viewport
break
# Adjust viewport to exactly match the remaining content height
await page.set_viewport_size({"width": page_width, "height": last_part_height})
await page.evaluate(f"window.scrollTo(0, {y_offset})")
await asyncio.sleep(0.01) # wait for render
seg_shot = await page.screenshot(full_page=False)
# Capture the current segment
# Note: Using compression options (format, quality) would go here
seg_shot = await page.screenshot(full_page=False, type="jpeg", quality=85)
# seg_shot = await page.screenshot(full_page=False)
img = Image.open(BytesIO(seg_shot)).convert("RGB")
segments.append(img)
# Reset viewport to original size after capturing segments
await page.set_viewport_size({"width": page_width, "height": viewport_height})
total_height = sum(img.height for img in segments)
stitched = Image.new("RGB", (segments[0].width, total_height))
offset = 0
@@ -1750,12 +1768,31 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# then wait for the new page to load before continuing
result = None
try:
result = await page.evaluate(
# OLD VERSION:
# result = await page.evaluate(
# f"""
# (async () => {{
# try {{
# const script_result = {script};
# return {{ success: true, result: script_result }};
# }} catch (err) {{
# return {{ success: false, error: err.toString(), stack: err.stack }};
# }}
# }})();
# """
# )
# """ NEW VERSION:
# When {script} contains statements (e.g., const link = …; link.click();),
# this forms invalid JavaScript, causing Playwright execution error: SyntaxError: Unexpected token 'const'.
# """
result = await self.adapter.evaluate(page,
f"""
(async () => {{
try {{
const script_result = {script};
return {{ success: true, result: script_result }};
return await (async () => {{
{script}
}})();
}} catch (err) {{
return {{ success: false, error: err.toString(), stack: err.stack }};
}}
@@ -1871,7 +1908,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
for script in scripts:
try:
# Execute the script and wait for network idle
result = await page.evaluate(
result = await self.adapter.evaluate(page,
f"""
(() => {{
return new Promise((resolve) => {{
@@ -1955,7 +1992,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Returns:
Boolean indicating visibility
"""
return await page.evaluate(
return await self.adapter.evaluate(page,
"""
() => {
const element = document.body;
@@ -1996,7 +2033,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Dict containing scroll status and position information
"""
try:
result = await page.evaluate(
result = await self.adapter.evaluate(page,
f"""() => {{
try {{
const startX = window.scrollX;
@@ -2053,7 +2090,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Returns:
Dict containing width and height of the page
"""
return await page.evaluate(
return await self.adapter.evaluate(page,
"""
() => {
const {scrollWidth, scrollHeight} = document.documentElement;
@@ -2073,7 +2110,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
bool: True if page needs scrolling
"""
try:
need_scroll = await page.evaluate(
need_scroll = await self.adapter.evaluate(page,
"""
() => {
const scrollHeight = document.documentElement.scrollHeight;
@@ -2353,4 +2390,4 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
tag="CRAWL",
params={"error": str(e), "url": url}
)
raise
raise

View File

@@ -1,4 +1,4 @@
from typing import Dict, Optional, List, Tuple
from typing import Dict, Optional, List, Tuple, Union
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
@@ -22,6 +22,8 @@ from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
from .utils import get_true_memory_usage_percent
class RateLimiter:
def __init__(
@@ -96,11 +98,37 @@ class BaseDispatcher(ABC):
self.rate_limiter = rate_limiter
self.monitor = monitor
def select_config(self, url: str, configs: Union[CrawlerRunConfig, List[CrawlerRunConfig]]) -> Optional[CrawlerRunConfig]:
"""Select the appropriate config for a given URL.
Args:
url: The URL to match against
configs: Single config or list of configs to choose from
Returns:
The matching config, or None if no match found
"""
# Single config - return as is
if isinstance(configs, CrawlerRunConfig):
return configs
# Empty list - return None
if not configs:
return None
# Find first matching config
for config in configs:
if config.is_match(url):
return config
# No match found - return None to indicate URL should be skipped
return None
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
@@ -111,7 +139,7 @@ class BaseDispatcher(ABC):
self,
urls: List[str],
crawler: AsyncWebCrawler, # noqa: F821
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
@@ -147,7 +175,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
async def _memory_monitor_task(self):
"""Background task to continuously monitor memory usage and update state"""
while True:
self.current_memory_percent = psutil.virtual_memory().percent
self.current_memory_percent = get_true_memory_usage_percent()
# Enter memory pressure mode if we cross the threshold
if self.current_memory_percent >= self.memory_threshold_percent:
@@ -200,7 +228,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
task_id: str,
retry_count: int = 0,
) -> CrawlerTaskResult:
@@ -208,6 +236,37 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
error_message = ""
memory_usage = peak_memory = 0.0
# Select appropriate config for this URL
selected_config = self.select_config(url, config)
# If no config matches, return failed result
if selected_config is None:
error_message = f"No matching configuration found for URL: {url}"
if self.monitor:
self.monitor.update_task(
task_id,
status=CrawlStatus.FAILED,
error_message=error_message
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=CrawlResult(
url=url,
html="",
metadata={"status": "no_config_match"},
success=False,
error_message=error_message
),
memory_usage=0,
peak_memory=0,
start_time=start_time,
end_time=time.time(),
error_message=error_message,
retry_count=retry_count
)
# Get starting memory for accurate measurement
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
@@ -257,8 +316,8 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
retry_count=retry_count + 1
)
# Execute the crawl
result = await self.crawler.arun(url, config=config, session_id=task_id)
# Execute the crawl with selected config
result = await self.crawler.arun(url, config=selected_config, session_id=task_id)
# Measure memory usage
end_memory = process.memory_info().rss / (1024 * 1024)
@@ -316,7 +375,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
self,
urls: List[str],
crawler: AsyncWebCrawler,
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
) -> List[CrawlerTaskResult]:
self.crawler = crawler
@@ -348,32 +407,34 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
t.cancel()
raise exc
# If memory pressure is low, start new tasks
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
try:
# Try to get a task with timeout to avoid blocking indefinitely
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
self.task_queue.get(), timeout=0.1
)
# Create and start the task
task = asyncio.create_task(
self.crawl_url(url, config, task_id, retry_count)
)
active_tasks.append(task)
# Update waiting time in monitor
if self.monitor:
wait_time = time.time() - enqueue_time
self.monitor.update_task(
task_id,
wait_time=wait_time,
status=CrawlStatus.IN_PROGRESS
)
# If memory pressure is low, greedily fill all available slots
if not self.memory_pressure_mode:
slots = self.max_session_permit - len(active_tasks)
while slots > 0:
try:
# Use get_nowait() to immediately get tasks without blocking
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
except asyncio.TimeoutError:
# No tasks in queue, that's fine
pass
# Create and start the task
task = asyncio.create_task(
self.crawl_url(url, config, task_id, retry_count)
)
active_tasks.append(task)
# Update waiting time in monitor
if self.monitor:
wait_time = time.time() - enqueue_time
self.monitor.update_task(
task_id,
wait_time=wait_time,
status=CrawlStatus.IN_PROGRESS
)
slots -= 1
except asyncio.QueueEmpty:
# No more tasks in queue, exit the loop
break
# Wait for completion even if queue is starved
if active_tasks:
@@ -470,7 +531,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
self,
urls: List[str],
crawler: AsyncWebCrawler,
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
) -> AsyncGenerator[CrawlerTaskResult, None]:
self.crawler = crawler
@@ -500,32 +561,34 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
for t in active_tasks:
t.cancel()
raise exc
# If memory pressure is low, start new tasks
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
try:
# Try to get a task with timeout
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
self.task_queue.get(), timeout=0.1
)
# Create and start the task
task = asyncio.create_task(
self.crawl_url(url, config, task_id, retry_count)
)
active_tasks.append(task)
# Update waiting time in monitor
if self.monitor:
wait_time = time.time() - enqueue_time
self.monitor.update_task(
task_id,
wait_time=wait_time,
status=CrawlStatus.IN_PROGRESS
)
# If memory pressure is low, greedily fill all available slots
if not self.memory_pressure_mode:
slots = self.max_session_permit - len(active_tasks)
while slots > 0:
try:
# Use get_nowait() to immediately get tasks without blocking
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
except asyncio.TimeoutError:
# No tasks in queue, that's fine
pass
# Create and start the task
task = asyncio.create_task(
self.crawl_url(url, config, task_id, retry_count)
)
active_tasks.append(task)
# Update waiting time in monitor
if self.monitor:
wait_time = time.time() - enqueue_time
self.monitor.update_task(
task_id,
wait_time=wait_time,
status=CrawlStatus.IN_PROGRESS
)
slots -= 1
except asyncio.QueueEmpty:
# No more tasks in queue, exit the loop
break
# Process completed tasks and yield results
if active_tasks:
@@ -572,7 +635,7 @@ class SemaphoreDispatcher(BaseDispatcher):
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
@@ -580,6 +643,36 @@ class SemaphoreDispatcher(BaseDispatcher):
error_message = ""
memory_usage = peak_memory = 0.0
# Select appropriate config for this URL
selected_config = self.select_config(url, config)
# If no config matches, return failed result
if selected_config is None:
error_message = f"No matching configuration found for URL: {url}"
if self.monitor:
self.monitor.update_task(
task_id,
status=CrawlStatus.FAILED,
error_message=error_message
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=CrawlResult(
url=url,
html="",
metadata={"status": "no_config_match"},
success=False,
error_message=error_message
),
memory_usage=0,
peak_memory=0,
start_time=start_time,
end_time=time.time(),
error_message=error_message
)
try:
if self.monitor:
self.monitor.update_task(
@@ -592,7 +685,7 @@ class SemaphoreDispatcher(BaseDispatcher):
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
result = await self.crawler.arun(url, config=selected_config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
@@ -654,7 +747,7 @@ class SemaphoreDispatcher(BaseDispatcher):
self,
crawler: AsyncWebCrawler, # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
config: Union[CrawlerRunConfig, List[CrawlerRunConfig]],
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:

View File

@@ -39,6 +39,7 @@ class LogColor(str, Enum):
YELLOW = "yellow"
MAGENTA = "magenta"
DIM_MAGENTA = "dim magenta"
RED = "red"
def __str__(self):
"""Automatically convert rich color to string."""

View File

@@ -424,10 +424,21 @@ class AsyncUrlSeeder:
self._log("info", "Finished URL seeding for {domain}. Total URLs: {count}",
params={"domain": domain, "count": len(results)}, tag="URL_SEED")
# Sort by relevance score if query was provided
# Apply BM25 scoring if query was provided
if query and extract_head and scoring_method == "bm25":
results.sort(key=lambda x: x.get(
"relevance_score", 0.0), reverse=True)
# Apply collective BM25 scoring across all documents
results = await self._apply_bm25_scoring(results, config)
# Filter by score threshold if specified
if score_threshold is not None:
original_count = len(results)
results = [r for r in results if r.get("relevance_score", 0) >= score_threshold]
if original_count > len(results):
self._log("info", "Filtered {filtered} URLs below score threshold {threshold}",
params={"filtered": original_count - len(results), "threshold": score_threshold}, tag="URL_SEED")
# Sort by relevance score
results.sort(key=lambda x: x.get("relevance_score", 0.0), reverse=True)
self._log("info", "Sorted {count} URLs by relevance score for query: '{query}'",
params={"count": len(results), "query": query}, tag="URL_SEED")
elif query and not extract_head:
@@ -818,7 +829,7 @@ class AsyncUrlSeeder:
async def _iter_sitemap(self, url: str):
try:
r = await self.client.get(url, timeout=15)
r = await self.client.get(url, timeout=15, follow_redirects=True)
r.raise_for_status()
except httpx.HTTPStatusError as e:
self._log("warning", "Failed to fetch sitemap {url}: HTTP {status_code}",
@@ -982,28 +993,6 @@ class AsyncUrlSeeder:
"head_data": head_data,
}
# Apply BM25 scoring if query is provided and head data exists
if query and ok and scoring_method == "bm25" and head_data:
text_context = self._extract_text_context(head_data)
if text_context:
# Calculate BM25 score for this single document
# scores = self._calculate_bm25_score(query, [text_context])
scores = await asyncio.to_thread(self._calculate_bm25_score, query, [text_context])
relevance_score = scores[0] if scores else 0.0
entry["relevance_score"] = float(relevance_score)
else:
# No text context, use URL-based scoring as fallback
relevance_score = self._calculate_url_relevance_score(
query, entry["url"])
entry["relevance_score"] = float(relevance_score)
elif query:
# Query provided but no head data - we reject this entry
self._log("debug", "No head data for {url}, using URL-based scoring",
params={"url": url}, tag="URL_SEED")
return
# relevance_score = self._calculate_url_relevance_score(query, entry["url"])
# entry["relevance_score"] = float(relevance_score)
elif live:
self._log("debug", "Performing live check for {url}", params={
"url": url}, tag="URL_SEED")
@@ -1013,35 +1002,13 @@ class AsyncUrlSeeder:
params={"status": status.upper(), "url": url}, tag="URL_SEED")
entry = {"url": url, "status": status, "head_data": {}}
# Apply URL-based scoring if query is provided
if query:
relevance_score = self._calculate_url_relevance_score(
query, url)
entry["relevance_score"] = float(relevance_score)
else:
entry = {"url": url, "status": "unknown", "head_data": {}}
# Apply URL-based scoring if query is provided
if query:
relevance_score = self._calculate_url_relevance_score(
query, url)
entry["relevance_score"] = float(relevance_score)
# Now decide whether to add the entry based on score threshold
if query and "relevance_score" in entry:
if score_threshold is None or entry["relevance_score"] >= score_threshold:
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
else:
self._log("debug", "URL {url} filtered out with score {score} < {threshold}",
params={"url": url, "score": entry["relevance_score"], "threshold": score_threshold}, tag="URL_SEED")
else:
# No query or no scoring - add as usual
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
# Add entry to results (scoring will be done later)
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
async def _head_ok(self, url: str, timeout: int) -> bool:
try:
@@ -1436,8 +1403,19 @@ class AsyncUrlSeeder:
scores = bm25.get_scores(query_tokens)
# Normalize scores to 0-1 range
max_score = max(scores) if max(scores) > 0 else 1.0
normalized_scores = [score / max_score for score in scores]
# BM25 can return negative scores, so we need to handle the full range
if len(scores) == 0:
return []
min_score = min(scores)
max_score = max(scores)
# If all scores are the same, return 0.5 for all
if max_score == min_score:
return [0.5] * len(scores)
# Normalize to 0-1 range using min-max normalization
normalized_scores = [(score - min_score) / (max_score - min_score) for score in scores]
return normalized_scores
except Exception as e:

View File

@@ -49,6 +49,9 @@ from .utils import (
preprocess_html_for_schema,
)
# Import telemetry
from .telemetry import capture_exception, telemetry_decorator, async_telemetry_decorator
class AsyncWebCrawler:
"""
@@ -201,6 +204,7 @@ class AsyncWebCrawler:
"""异步空上下文管理器"""
yield
@async_telemetry_decorator
async def arun(
self,
url: str,
@@ -363,7 +367,7 @@ class AsyncWebCrawler:
pdf_data=pdf_data,
verbose=config.verbose,
is_raw_html=True if url.startswith("raw:") else False,
redirected_url=async_response.redirected_url,
redirected_url=async_response.redirected_url,
**kwargs,
)
@@ -430,6 +434,7 @@ class AsyncWebCrawler:
)
)
@async_telemetry_decorator
async def aprocess_html(
self,
url: str,
@@ -509,7 +514,7 @@ class AsyncWebCrawler:
tables = media.pop("tables", []) if isinstance(media, dict) else []
links = result.links.model_dump() if hasattr(result.links, 'model_dump') else result.links
metadata = result.metadata
fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)
################################
@@ -591,11 +596,13 @@ class AsyncWebCrawler:
# Choose content based on input_format
content_format = config.extraction_strategy.input_format
if content_format == "fit_markdown" and not markdown_result.fit_markdown:
self.logger.warning(
message="Fit markdown requested but not available. Falling back to raw markdown.",
tag="EXTRACT",
params={"url": _url},
)
self.logger.url_status(
url=_url,
success=bool(html),
timing=time.perf_counter() - t1,
tag="EXTRACT",
)
content_format = "markdown"
content = {
@@ -619,11 +626,12 @@ class AsyncWebCrawler:
)
# Log extraction completion
self.logger.info(
message="Completed for {url:.50}... | Time: {timing}s",
tag="EXTRACT",
params={"url": _url, "timing": time.perf_counter() - t1},
)
self.logger.url_status(
url=_url,
success=bool(html),
timing=time.perf_counter() - t1,
tag="EXTRACT",
)
# Apply HTML formatting if requested
if config.prettiify:
@@ -650,7 +658,7 @@ class AsyncWebCrawler:
async def arun_many(
self,
urls: List[str],
config: Optional[CrawlerRunConfig] = None,
config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
dispatcher: Optional[BaseDispatcher] = None,
# Legacy parameters maintained for backwards compatibility
# word_count_threshold=MIN_WORD_THRESHOLD,
@@ -671,7 +679,9 @@ class AsyncWebCrawler:
Args:
urls: List of URLs to crawl
config: Configuration object controlling crawl behavior for all URLs
config: Configuration object(s) controlling crawl behavior. Can be:
- Single CrawlerRunConfig: Used for all URLs
- List[CrawlerRunConfig]: Configs with url_matcher for URL-specific settings
dispatcher: The dispatcher strategy instance to use. Defaults to MemoryAdaptiveDispatcher
[other parameters maintained for backwards compatibility]
@@ -736,7 +746,11 @@ class AsyncWebCrawler:
or task_result.result
)
stream = config.stream
# Handle stream setting - use first config's stream setting if config is a list
if isinstance(config, list):
stream = config[0].stream if config else False
else:
stream = config.stream
if stream:

293
crawl4ai/browser_adapter.py Normal file
View File

@@ -0,0 +1,293 @@
# browser_adapter.py
"""
Browser adapter for Crawl4AI to support both Playwright and undetected browsers
with minimal changes to existing codebase.
"""
from abc import ABC, abstractmethod
from typing import List, Dict, Any, Optional, Callable
import time
import json
# Import both, but use conditionally
try:
from playwright.async_api import Page
except ImportError:
Page = Any
try:
from patchright.async_api import Page as UndetectedPage
except ImportError:
UndetectedPage = Any
class BrowserAdapter(ABC):
"""Abstract adapter for browser-specific operations"""
@abstractmethod
async def evaluate(self, page: Page, expression: str, arg: Any = None) -> Any:
"""Execute JavaScript in the page"""
pass
@abstractmethod
async def setup_console_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup console message capturing, returns handler function if needed"""
pass
@abstractmethod
async def setup_error_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup error capturing, returns handler function if needed"""
pass
@abstractmethod
async def retrieve_console_messages(self, page: Page) -> List[Dict]:
"""Retrieve captured console messages (for undetected browsers)"""
pass
@abstractmethod
async def cleanup_console_capture(self, page: Page, handle_console: Optional[Callable], handle_error: Optional[Callable]):
"""Clean up console event listeners"""
pass
@abstractmethod
def get_imports(self) -> tuple:
"""Get the appropriate imports for this adapter"""
pass
class PlaywrightAdapter(BrowserAdapter):
"""Adapter for standard Playwright"""
async def evaluate(self, page: Page, expression: str, arg: Any = None) -> Any:
"""Standard Playwright evaluate"""
if arg is not None:
return await page.evaluate(expression, arg)
return await page.evaluate(expression)
async def setup_console_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup console capture using Playwright's event system"""
def handle_console_capture(msg):
try:
message_type = "unknown"
try:
message_type = msg.type
except:
pass
message_text = "unknown"
try:
message_text = msg.text
except:
pass
entry = {
"type": message_type,
"text": message_text,
"timestamp": time.time()
}
captured_console.append(entry)
except Exception as e:
captured_console.append({
"type": "console_capture_error",
"error": str(e),
"timestamp": time.time()
})
page.on("console", handle_console_capture)
return handle_console_capture
async def setup_error_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup error capture using Playwright's event system"""
def handle_pageerror_capture(err):
try:
error_message = "Unknown error"
try:
error_message = err.message
except:
pass
error_stack = ""
try:
error_stack = err.stack
except:
pass
captured_console.append({
"type": "error",
"text": error_message,
"stack": error_stack,
"timestamp": time.time()
})
except Exception as e:
captured_console.append({
"type": "pageerror_capture_error",
"error": str(e),
"timestamp": time.time()
})
page.on("pageerror", handle_pageerror_capture)
return handle_pageerror_capture
async def retrieve_console_messages(self, page: Page) -> List[Dict]:
"""Not needed for Playwright - messages are captured via events"""
return []
async def cleanup_console_capture(self, page: Page, handle_console: Optional[Callable], handle_error: Optional[Callable]):
"""Remove event listeners"""
if handle_console:
page.remove_listener("console", handle_console)
if handle_error:
page.remove_listener("pageerror", handle_error)
def get_imports(self) -> tuple:
"""Return Playwright imports"""
from playwright.async_api import Page, Error
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
return Page, Error, PlaywrightTimeoutError
class UndetectedAdapter(BrowserAdapter):
"""Adapter for undetected browser automation with stealth features"""
def __init__(self):
self._console_script_injected = {}
async def evaluate(self, page: UndetectedPage, expression: str, arg: Any = None) -> Any:
"""Undetected browser evaluate with isolated context"""
# For most evaluations, use isolated context for stealth
# Only use non-isolated when we need to access our injected console capture
isolated = not (
"__console" in expression or
"__captured" in expression or
"__error" in expression or
"window.__" in expression
)
if arg is not None:
return await page.evaluate(expression, arg, isolated_context=isolated)
return await page.evaluate(expression, isolated_context=isolated)
async def setup_console_capture(self, page: UndetectedPage, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup console capture using JavaScript injection for undetected browsers"""
if not self._console_script_injected.get(page, False):
await page.add_init_script("""
// Initialize console capture
window.__capturedConsole = [];
window.__capturedErrors = [];
// Store original console methods
const originalConsole = {};
['log', 'info', 'warn', 'error', 'debug'].forEach(method => {
originalConsole[method] = console[method];
console[method] = function(...args) {
try {
window.__capturedConsole.push({
type: method,
text: args.map(arg => {
try {
if (typeof arg === 'object') {
return JSON.stringify(arg);
}
return String(arg);
} catch (e) {
return '[Object]';
}
}).join(' '),
timestamp: Date.now()
});
} catch (e) {
// Fail silently to avoid detection
}
// Call original method
originalConsole[method].apply(console, args);
};
});
""")
self._console_script_injected[page] = True
return None # No handler function needed for undetected browser
async def setup_error_capture(self, page: UndetectedPage, captured_console: List[Dict]) -> Optional[Callable]:
"""Setup error capture using JavaScript injection for undetected browsers"""
if not self._console_script_injected.get(page, False):
await page.add_init_script("""
// Capture errors
window.addEventListener('error', (event) => {
try {
window.__capturedErrors.push({
type: 'error',
text: event.message,
stack: event.error ? event.error.stack : '',
filename: event.filename,
lineno: event.lineno,
colno: event.colno,
timestamp: Date.now()
});
} catch (e) {
// Fail silently
}
});
// Capture unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
try {
window.__capturedErrors.push({
type: 'unhandledrejection',
text: event.reason ? String(event.reason) : 'Unhandled Promise Rejection',
stack: event.reason && event.reason.stack ? event.reason.stack : '',
timestamp: Date.now()
});
} catch (e) {
// Fail silently
}
});
""")
self._console_script_injected[page] = True
return None # No handler function needed for undetected browser
async def retrieve_console_messages(self, page: UndetectedPage) -> List[Dict]:
"""Retrieve captured console messages and errors from the page"""
messages = []
try:
# Get console messages
console_messages = await page.evaluate(
"() => { const msgs = window.__capturedConsole || []; window.__capturedConsole = []; return msgs; }",
isolated_context=False
)
messages.extend(console_messages)
# Get errors
errors = await page.evaluate(
"() => { const errs = window.__capturedErrors || []; window.__capturedErrors = []; return errs; }",
isolated_context=False
)
messages.extend(errors)
# Convert timestamps from JS to Python format
for msg in messages:
if 'timestamp' in msg and isinstance(msg['timestamp'], (int, float)):
msg['timestamp'] = msg['timestamp'] / 1000.0 # Convert from ms to seconds
except Exception:
# If retrieval fails, return empty list
pass
return messages
async def cleanup_console_capture(self, page: UndetectedPage, handle_console: Optional[Callable], handle_error: Optional[Callable]):
"""Clean up for undetected browser - retrieve final messages"""
# For undetected browser, we don't have event listeners to remove
# but we should retrieve any final messages
final_messages = await self.retrieve_console_messages(page)
return final_messages
def get_imports(self) -> tuple:
"""Return undetected browser imports"""
from patchright.async_api import Page, Error
from patchright.async_api import TimeoutError as PlaywrightTimeoutError
return Page, Error, PlaywrightTimeoutError

View File

@@ -14,23 +14,8 @@ import hashlib
from .js_snippet import load_js_script
from .config import DOWNLOAD_PAGE_TIMEOUT
from .async_configs import BrowserConfig, CrawlerRunConfig
from playwright_stealth import StealthConfig
from .utils import get_chromium_path
stealth_config = StealthConfig(
webdriver=True,
chrome_app=True,
chrome_csi=True,
chrome_load_times=True,
chrome_runtime=True,
navigator_languages=True,
navigator_plugins=True,
navigator_permissions=True,
webgl_vendor=True,
outerdimensions=True,
navigator_hardware_concurrency=True,
media_codecs=True,
)
BROWSER_DISABLE_OPTIONS = [
"--disable-background-networking",
@@ -588,21 +573,26 @@ class BrowserManager:
_playwright_instance = None
@classmethod
async def get_playwright(cls):
from playwright.async_api import async_playwright
async def get_playwright(cls, use_undetected: bool = False):
if use_undetected:
from patchright.async_api import async_playwright
else:
from playwright.async_api import async_playwright
cls._playwright_instance = await async_playwright().start()
return cls._playwright_instance
def __init__(self, browser_config: BrowserConfig, logger=None):
def __init__(self, browser_config: BrowserConfig, logger=None, use_undetected: bool = False):
"""
Initialize the BrowserManager with a browser configuration.
Args:
browser_config (BrowserConfig): Configuration object containing all browser settings
logger: Logger instance for recording events and errors
use_undetected (bool): Whether to use undetected browser (Patchright)
"""
self.config: BrowserConfig = browser_config
self.logger = logger
self.use_undetected = use_undetected
# Browser state
self.browser = None
@@ -616,7 +606,16 @@ class BrowserManager:
# Keep track of contexts by a "config signature," so each unique config reuses a single context
self.contexts_by_config = {}
self._contexts_lock = asyncio.Lock()
self._contexts_lock = asyncio.Lock()
# Serialize context.new_page() across concurrent tasks to avoid races
# when using a shared persistent context (context.pages may be empty
# for all racers). Prevents 'Target page/context closed' errors.
self._page_lock = asyncio.Lock()
# Stealth-related attributes
self._stealth_instance = None
self._stealth_cm = None
# Initialize ManagedBrowser if needed
if self.config.use_managed_browser:
@@ -645,9 +644,21 @@ class BrowserManager:
if self.playwright is not None:
await self.close()
from playwright.async_api import async_playwright
if self.use_undetected:
from patchright.async_api import async_playwright
else:
from playwright.async_api import async_playwright
self.playwright = await async_playwright().start()
# Initialize playwright with or without stealth
if self.config.enable_stealth and not self.use_undetected:
# Import stealth only when needed
from playwright_stealth import Stealth
# Use the recommended stealth wrapper approach
self._stealth_instance = Stealth()
self._stealth_cm = self._stealth_instance.use_async(async_playwright())
self.playwright = await self._stealth_cm.__aenter__()
else:
self.playwright = await async_playwright().start()
if self.config.cdp_url or self.config.use_managed_browser:
self.config.use_managed_browser = True
@@ -1021,13 +1032,26 @@ class BrowserManager:
context = await self.create_browser_context(crawlerRunConfig)
ctx = self.default_context # default context, one window only
ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
page = await ctx.new_page()
# Avoid concurrent new_page on shared persistent context
# See GH-1198: context.pages can be empty under races
async with self._page_lock:
page = await ctx.new_page()
else:
context = self.default_context
pages = context.pages
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
if not page:
page = context.pages[0] # await context.new_page()
if pages:
page = pages[0]
else:
# Double-check under lock to avoid TOCTOU and ensure only
# one task calls new_page when pages=[] concurrently
async with self._page_lock:
pages = context.pages
if pages:
page = pages[0]
else:
page = await context.new_page()
else:
# Otherwise, check if we have an existing context for this config
config_signature = self._make_config_signature(crawlerRunConfig)
@@ -1109,5 +1133,19 @@ class BrowserManager:
self.managed_browser = None
if self.playwright:
await self.playwright.stop()
# Handle stealth context manager cleanup if it exists
if hasattr(self, '_stealth_cm') and self._stealth_cm is not None:
try:
await self._stealth_cm.__aexit__(None, None, None)
except Exception as e:
if self.logger:
self.logger.error(
message="Error closing stealth context: {error}",
tag="ERROR",
params={"error": str(e)}
)
self._stealth_cm = None
self._stealth_instance = None
else:
await self.playwright.stop()
self.playwright = None

View File

@@ -65,6 +65,213 @@ class BrowserProfiler:
self.builtin_config_file = os.path.join(self.builtin_browser_dir, "browser_config.json")
os.makedirs(self.builtin_browser_dir, exist_ok=True)
def _is_windows(self) -> bool:
"""Check if running on Windows platform."""
return sys.platform.startswith('win') or sys.platform == 'cygwin'
def _is_macos(self) -> bool:
"""Check if running on macOS platform."""
return sys.platform == 'darwin'
def _is_linux(self) -> bool:
"""Check if running on Linux platform."""
return sys.platform.startswith('linux')
def _get_quit_message(self, tag: str) -> str:
"""Get appropriate quit message based on context."""
if tag == "PROFILE":
return "Closing browser and saving profile..."
elif tag == "CDP":
return "Closing browser..."
else:
return "Closing browser..."
async def _listen_windows(self, user_done_event, check_browser_process, tag: str):
"""Windows-specific keyboard listener using msvcrt."""
try:
import msvcrt
except ImportError:
raise ImportError("msvcrt module not available on this platform")
while True:
try:
# Check for keyboard input
if msvcrt.kbhit():
raw = msvcrt.getch()
# Handle Unicode decoding more robustly
key = None
try:
key = raw.decode("utf-8")
except UnicodeDecodeError:
try:
# Try different encodings
key = raw.decode("latin1")
except UnicodeDecodeError:
# Skip if we can't decode
continue
# Validate key
if not key or len(key) != 1:
continue
# Check for printable characters only
if not key.isprintable():
continue
# Check for quit command
if key.lower() == "q":
self.logger.info(
self._get_quit_message(tag),
tag=tag,
base_color=LogColor.GREEN
)
user_done_event.set()
return
# Check if browser process ended
if await check_browser_process():
return
# Small delay to prevent busy waiting
await asyncio.sleep(0.1)
except Exception as e:
self.logger.warning(f"Error in Windows keyboard listener: {e}", tag=tag)
# Continue trying instead of failing completely
await asyncio.sleep(0.1)
continue
async def _listen_unix(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
"""Unix/Linux/macOS keyboard listener using termios and select."""
try:
import termios
import tty
import select
except ImportError:
raise ImportError("termios/tty/select modules not available on this platform")
# Get stdin file descriptor
try:
fd = sys.stdin.fileno()
except (AttributeError, OSError):
raise ImportError("stdin is not a terminal")
# Save original terminal settings
old_settings = None
try:
old_settings = termios.tcgetattr(fd)
except termios.error as e:
raise ImportError(f"Cannot get terminal attributes: {e}")
try:
# Switch to non-canonical mode (cbreak mode)
tty.setcbreak(fd)
while True:
try:
# Use select to check if input is available (non-blocking)
# Timeout of 0.5 seconds to periodically check browser process
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
if readable:
# Read one character
key = sys.stdin.read(1)
if key and key.lower() == "q":
self.logger.info(
self._get_quit_message(tag),
tag=tag,
base_color=LogColor.GREEN
)
user_done_event.set()
return
# Check if browser process ended
if await check_browser_process():
return
# Small delay to prevent busy waiting
await asyncio.sleep(0.1)
except (KeyboardInterrupt, EOFError):
# Handle Ctrl+C or EOF gracefully
self.logger.info("Keyboard interrupt received", tag=tag)
user_done_event.set()
return
except Exception as e:
self.logger.warning(f"Error in Unix keyboard listener: {e}", tag=tag)
await asyncio.sleep(0.1)
continue
finally:
# Always restore terminal settings
if old_settings is not None:
try:
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
except Exception as e:
self.logger.error(f"Failed to restore terminal settings: {e}", tag=tag)
async def _listen_fallback(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
"""Fallback keyboard listener using simple input() method."""
self.logger.info("Using fallback input mode. Type 'q' and press Enter to quit.", tag=tag)
# Run input in a separate thread to avoid blocking
import threading
import queue
input_queue = queue.Queue()
def input_thread():
"""Thread function to handle input."""
try:
while not user_done_event.is_set():
try:
# Use input() with a prompt
user_input = input("Press 'q' + Enter to quit: ").strip().lower()
input_queue.put(user_input)
if user_input == 'q':
break
except (EOFError, KeyboardInterrupt):
input_queue.put('q')
break
except Exception as e:
self.logger.warning(f"Error in input thread: {e}", tag=tag)
break
except Exception as e:
self.logger.error(f"Input thread failed: {e}", tag=tag)
# Start input thread
thread = threading.Thread(target=input_thread, daemon=True)
thread.start()
try:
while not user_done_event.is_set():
# Check for user input
try:
user_input = input_queue.get_nowait()
if user_input == 'q':
self.logger.info(
self._get_quit_message(tag),
tag=tag,
base_color=LogColor.GREEN
)
user_done_event.set()
return
except queue.Empty:
pass
# Check if browser process ended
if await check_browser_process():
return
# Small delay
await asyncio.sleep(0.5)
except Exception as e:
self.logger.error(f"Fallback listener failed: {e}", tag=tag)
user_done_event.set()
async def create_profile(self,
profile_name: Optional[str] = None,
browser_config: Optional[BrowserConfig] = None) -> Optional[str]:
@@ -180,42 +387,38 @@ class BrowserProfiler:
# Run keyboard input loop in a separate task
async def listen_for_quit_command():
import termios
import tty
import select
"""Cross-platform keyboard listener that waits for 'q' key press."""
# First output the prompt
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
# Save original terminal settings
fd = sys.stdin.fileno()
old_settings = termios.tcgetattr(fd)
self.logger.info(
"Press {segment} when you've finished using the browser...",
tag="PROFILE",
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
base_color=LogColor.CYAN
)
async def check_browser_process():
"""Check if browser process is still running."""
if (
managed_browser.browser_process
and managed_browser.browser_process.poll() is not None
):
self.logger.info(
"Browser already closed. Ending input listener.", tag="PROFILE"
)
user_done_event.set()
return True
return False
# Try platform-specific implementations with fallback
try:
# Switch to non-canonical mode (no line buffering)
tty.setcbreak(fd)
while True:
# Check if input is available (non-blocking)
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
if readable:
key = sys.stdin.read(1)
if key.lower() == 'q':
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
user_done_event.set()
return
# Check if the browser process has already exited
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
self.logger.info("Browser already closed. Ending input listener.", tag="PROFILE")
user_done_event.set()
return
await asyncio.sleep(0.1)
finally:
# Restore terminal settings
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
if self._is_windows():
await self._listen_windows(user_done_event, check_browser_process, "PROFILE")
else:
await self._listen_unix(user_done_event, check_browser_process, "PROFILE")
except Exception as e:
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="PROFILE")
self.logger.info("Falling back to simple input mode...", tag="PROFILE")
await self._listen_fallback(user_done_event, check_browser_process, "PROFILE")
try:
from playwright.async_api import async_playwright
@@ -480,7 +683,7 @@ class BrowserProfiler:
self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
exit_option = "4"
self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
self.logger.info(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
choice = input()
if choice == "1":
@@ -637,9 +840,18 @@ class BrowserProfiler:
self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
self.logger.info(f"Headless mode: {headless}", tag="CDP")
# create browser config
browser_config = BrowserConfig(
browser_type=browser_type,
headless=headless,
user_data_dir=profile_path,
debugging_port=debugging_port,
verbose=True
)
# Create managed browser instance
managed_browser = ManagedBrowser(
browser_type=browser_type,
browser_config=browser_config,
user_data_dir=profile_path,
headless=headless,
logger=self.logger,
@@ -673,42 +885,33 @@ class BrowserProfiler:
# Run keyboard input loop in a separate task
async def listen_for_quit_command():
import termios
import tty
import select
"""Cross-platform keyboard listener that waits for 'q' key press."""
# First output the prompt
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
# Save original terminal settings
fd = sys.stdin.fileno()
old_settings = termios.tcgetattr(fd)
self.logger.info(
"Press {segment} to stop the browser and exit...",
tag="CDP",
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
base_color=LogColor.CYAN
)
async def check_browser_process():
"""Check if browser process is still running."""
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
user_done_event.set()
return True
return False
# Try platform-specific implementations with fallback
try:
# Switch to non-canonical mode (no line buffering)
tty.setcbreak(fd)
while True:
# Check if input is available (non-blocking)
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
if readable:
key = sys.stdin.read(1)
if key.lower() == 'q':
self.logger.info("Closing browser...", tag="CDP")
user_done_event.set()
return
# Check if the browser process has already exited
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
user_done_event.set()
return
await asyncio.sleep(0.1)
finally:
# Restore terminal settings
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
if self._is_windows():
await self._listen_windows(user_done_event, check_browser_process, "CDP")
else:
await self._listen_unix(user_done_event, check_browser_process, "CDP")
except Exception as e:
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="CDP")
self.logger.info("Falling back to simple input mode...", tag="CDP")
await self._listen_fallback(user_done_event, check_browser_process, "CDP")
# Function to retrieve and display CDP JSON config
async def get_cdp_json(port):

View File

@@ -1385,6 +1385,97 @@ def profiles_cmd():
# Run interactive profile manager
anyio.run(manage_profiles)
@cli.group("telemetry")
def telemetry_cmd():
"""Manage telemetry settings for Crawl4AI
Telemetry helps improve Crawl4AI by sending anonymous crash reports.
No personal data or crawled content is ever collected.
"""
pass
@telemetry_cmd.command("enable")
@click.option("--email", "-e", help="Optional email for follow-up on critical issues")
@click.option("--always/--once", default=True, help="Always send errors (default) or just once")
def telemetry_enable_cmd(email: Optional[str], always: bool):
"""Enable telemetry to help improve Crawl4AI
Examples:
crwl telemetry enable # Enable telemetry
crwl telemetry enable --email me@ex.com # Enable with email
crwl telemetry enable --once # Send only next error
"""
from crawl4ai.telemetry import enable
try:
enable(email=email, always=always, once=not always)
console.print("[green]✅ Telemetry enabled successfully[/green]")
if email:
console.print(f" Email: {email}")
console.print(f" Mode: {'Always send errors' if always else 'Send next error only'}")
except Exception as e:
console.print(f"[red]❌ Failed to enable telemetry: {e}[/red]")
sys.exit(1)
@telemetry_cmd.command("disable")
def telemetry_disable_cmd():
"""Disable telemetry
Stop sending anonymous crash reports to help improve Crawl4AI.
"""
from crawl4ai.telemetry import disable
try:
disable()
console.print("[green]✅ Telemetry disabled successfully[/green]")
except Exception as e:
console.print(f"[red]❌ Failed to disable telemetry: {e}[/red]")
sys.exit(1)
@telemetry_cmd.command("status")
def telemetry_status_cmd():
"""Show current telemetry status
Display whether telemetry is enabled and current settings.
"""
from crawl4ai.telemetry import status
try:
info = status()
# Create status table
table = Table(title="Telemetry Status", show_header=False)
table.add_column("Setting", style="cyan")
table.add_column("Value")
# Status emoji
status_icon = "" if info['enabled'] else ""
table.add_row("Status", f"{status_icon} {'Enabled' if info['enabled'] else 'Disabled'}")
table.add_row("Consent", info['consent'].replace('_', ' ').title())
if info['email']:
table.add_row("Email", info['email'])
table.add_row("Environment", info['environment'])
table.add_row("Provider", info['provider'])
if info['errors_sent'] > 0:
table.add_row("Errors Sent", str(info['errors_sent']))
console.print(table)
# Add helpful messages
if not info['enabled']:
console.print("\n[yellow] Telemetry is disabled. Enable it to help improve Crawl4AI:[/yellow]")
console.print(" [dim]crwl telemetry enable[/dim]")
except Exception as e:
console.print(f"[red]❌ Failed to get telemetry status: {e}[/red]")
sys.exit(1)
@cli.command(name="")
@click.argument("url", required=False)
@click.option("--example", is_flag=True, help="Show usage examples")

File diff suppressed because it is too large Load Diff

View File

@@ -150,6 +150,14 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
break
# Calculate how many more URLs we can process in this batch
remaining = self.max_pages - self._pages_crawled
batch_size = min(BATCH_SIZE, remaining)
if batch_size <= 0:
# No more pages to crawl
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
break
batch: List[Tuple[float, int, str, Optional[str]]] = []
# Retrieve up to BATCH_SIZE items from the priority queue.
for _ in range(BATCH_SIZE):
@@ -184,6 +192,10 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
# Count only successful crawls toward max_pages limit
if result.success:
self._pages_crawled += 1
# Check if we've reached the limit during batch processing
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
break # Exit the generator
yield result

View File

@@ -157,6 +157,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
results: List[CrawlResult] = []
while current_level and not self._cancel_event.is_set():
# Check if we've already reached max_pages before starting a new level
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
break
next_level: List[Tuple[str, Optional[str]]] = []
urls = [url for url, _ in current_level]
@@ -221,6 +226,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
# Count only successful crawls
if result.success:
self._pages_crawled += 1
# Check if we've reached the limit during batch processing
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
break # Exit the generator
results_count += 1
yield result

View File

@@ -49,6 +49,10 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
# Count only successful crawls toward max_pages limit
if result.success:
self._pages_crawled += 1
# Check if we've reached the limit during batch processing
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
break # Exit the generator
# Only discover links from successful crawls
new_links: List[Tuple[str, Optional[str]]] = []
@@ -94,6 +98,10 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
# and only discover links from successful crawls
if result.success:
self._pages_crawled += 1
# Check if we've reached the limit during batch processing
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
break # Exit the generator
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, url, depth, visited, new_links, depths)

View File

@@ -227,10 +227,21 @@ class URLPatternFilter(URLFilter):
# Prefix check (/foo/*)
if self._simple_prefixes:
path = url.split("?")[0]
if any(path.startswith(p) for p in self._simple_prefixes):
result = True
self._update_stats(result)
return not result if self._reverse else result
# if any(path.startswith(p) for p in self._simple_prefixes):
# result = True
# self._update_stats(result)
# return not result if self._reverse else result
####
# Modified the prefix matching logic to ensure path boundary checking:
# - Check if the matched prefix is followed by a path separator (`/`), query parameter (`?`), fragment (`#`), or is at the end of the path
# - This ensures `/api/` only matches complete path segments, not substrings like `/apiv2/`
####
for prefix in self._simple_prefixes:
if path.startswith(prefix):
if len(path) == len(prefix) or path[len(prefix)] in ['/', '?', '#']:
result = True
self._update_stats(result)
return not result if self._reverse else result
# Complex patterns
if self._path_patterns:
@@ -337,6 +348,15 @@ class ContentTypeFilter(URLFilter):
"sqlite": "application/vnd.sqlite3",
# Placeholder
"unknown": "application/octet-stream", # Fallback for unknown file types
# php
"php": "application/x-httpd-php",
"php3": "application/x-httpd-php",
"php4": "application/x-httpd-php",
"php5": "application/x-httpd-php",
"php7": "application/x-httpd-php",
"phtml": "application/x-httpd-php",
"phps": "application/x-httpd-php-source",
}
@staticmethod

View File

@@ -73,6 +73,8 @@ class Crawl4aiDockerClient:
def _prepare_request(self, urls: List[str], browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None) -> Dict[str, Any]:
"""Prepare request data from configs."""
if self._token:
self._http_client.headers["Authorization"] = f"Bearer {self._token}"
return {
"urls": urls,
"browser_config": browser_config.dump() if browser_config else {},
@@ -103,8 +105,6 @@ class Crawl4aiDockerClient:
crawler_config: Optional[CrawlerRunConfig] = None
) -> Union[CrawlResult, List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""Execute a crawl operation."""
if not self._token:
raise Crawl4aiClientError("Authentication required. Call authenticate() first.")
await self._check_server()
data = self._prepare_request(urls, browser_config, crawler_config)
@@ -140,8 +140,6 @@ class Crawl4aiDockerClient:
async def get_schema(self) -> Dict[str, Any]:
"""Retrieve configuration schemas."""
if not self._token:
raise Crawl4aiClientError("Authentication required. Call authenticate() first.")
response = await self._request("GET", "/schema")
return response.json()
@@ -167,4 +165,4 @@ async def main():
print(schema)
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -656,11 +656,11 @@ class LLMExtractionStrategy(ExtractionStrategy):
self.total_usage.total_tokens += usage.total_tokens
try:
response = response.choices[0].message.content
content = response.choices[0].message.content
blocks = None
if self.force_json_response:
blocks = json.loads(response)
blocks = json.loads(content)
if isinstance(blocks, dict):
# If it has only one key which calue is list then assign that to blocks, exampled: {"news": [..]}
if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
@@ -673,7 +673,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
blocks = blocks
else:
# blocks = extract_xml_data(["blocks"], response.choices[0].message.content)["blocks"]
blocks = extract_xml_data(["blocks"], response)["blocks"]
blocks = extract_xml_data(["blocks"], content)["blocks"]
blocks = json.loads(blocks)
for block in blocks:

View File

@@ -119,6 +119,32 @@ def install_playwright():
logger.warning(
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
)
# Install Patchright browsers for undetected browser support
logger.info("Installing Patchright browsers for undetected mode...", tag="INIT")
try:
subprocess.check_call(
[
sys.executable,
"-m",
"patchright",
"install",
"--with-deps",
"--force",
"chromium",
]
)
logger.success(
"Patchright installation completed successfully.", tag="COMPLETE"
)
except subprocess.CalledProcessError:
logger.warning(
f"Please run '{sys.executable} -m patchright install --with-deps' manually after the installation."
)
except Exception:
logger.warning(
f"Please run '{sys.executable} -m patchright install --with-deps' manually after the installation."
)
def run_migration():

View File

@@ -11,7 +11,7 @@ from .extraction_strategy import *
from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .content_scraping_strategy import WebScrapingStrategy
from ..content_scraping_strategy import LXMLWebScrapingStrategy as WebScrapingStrategy
from .config import *
import warnings
import json

View File

@@ -1056,7 +1056,7 @@ Your output must:
</output_requirements>
"""
GENERATE_SCRIPT_PROMPT = """You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction.
GENERATE_SCRIPT_PROMPT = r"""You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction.
Your scripts run **before the crawl** to handle dynamic content, user interactions, and other obstacles. You are a master of two tools: raw **JavaScript** and the high-level **Crawl4ai Script (c4a)**.

1396
crawl4ai/table_extraction.py Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,440 @@
"""
Crawl4AI Telemetry Module.
Provides opt-in error tracking to improve stability.
"""
import os
import sys
import functools
import traceback
from typing import Optional, Any, Dict, Callable, Type
from contextlib import contextmanager, asynccontextmanager
from .base import TelemetryProvider, NullProvider
from .config import TelemetryConfig, TelemetryConsent
from .consent import ConsentManager
from .environment import Environment, EnvironmentDetector
class TelemetryManager:
"""
Main telemetry manager for Crawl4AI.
Coordinates provider, config, and consent management.
"""
_instance: Optional['TelemetryManager'] = None
def __init__(self):
"""Initialize telemetry manager."""
self.config = TelemetryConfig()
self.consent_manager = ConsentManager(self.config)
self.environment = EnvironmentDetector.detect()
self._provider: Optional[TelemetryProvider] = None
self._initialized = False
self._error_count = 0
self._max_errors = 100 # Prevent telemetry spam
# Load provider based on config
self._setup_provider()
@classmethod
def get_instance(cls) -> 'TelemetryManager':
"""
Get singleton instance of telemetry manager.
Returns:
TelemetryManager instance
"""
if cls._instance is None:
cls._instance = cls()
return cls._instance
def _setup_provider(self) -> None:
"""Setup telemetry provider based on configuration."""
# Update config from environment
self.config.update_from_env()
# Check if telemetry is enabled
if not self.config.is_enabled():
self._provider = NullProvider()
return
# Try to load Sentry provider
try:
from .providers.sentry import SentryProvider
# Get Crawl4AI version for release tracking
try:
from crawl4ai import __version__
release = f"crawl4ai@{__version__}"
except ImportError:
release = "crawl4ai@unknown"
self._provider = SentryProvider(
environment=self.environment.value,
release=release
)
# Initialize provider
if not self._provider.initialize():
# Fallback to null provider if init fails
self._provider = NullProvider()
except ImportError:
# Sentry not installed - use null provider
self._provider = NullProvider()
self._initialized = True
def capture_exception(
self,
exception: Exception,
context: Optional[Dict[str, Any]] = None
) -> bool:
"""
Capture and send an exception.
Args:
exception: The exception to capture
context: Optional additional context
Returns:
True if exception was sent
"""
# Check error count limit
if self._error_count >= self._max_errors:
return False
# Check consent on first error
if self._error_count == 0:
consent = self.consent_manager.check_and_prompt()
# Update provider if consent changed
if consent == TelemetryConsent.DENIED:
self._provider = NullProvider()
return False
elif consent in [TelemetryConsent.ONCE, TelemetryConsent.ALWAYS]:
if isinstance(self._provider, NullProvider):
self._setup_provider()
# Check if we should send this error
if not self.config.should_send_current():
return False
# Prepare context
full_context = EnvironmentDetector.get_environment_context()
if context:
full_context.update(context)
# Add user email if available
email = self.config.get_email()
if email:
full_context['email'] = email
# Add source info
full_context['source'] = 'crawl4ai'
# Send exception
try:
if self._provider:
success = self._provider.send_exception(exception, full_context)
if success:
self._error_count += 1
return success
except Exception:
# Telemetry itself failed - ignore
pass
return False
def capture_message(
self,
message: str,
level: str = 'info',
context: Optional[Dict[str, Any]] = None
) -> bool:
"""
Capture a message event.
Args:
message: Message to send
level: Message level (info, warning, error)
context: Optional context
Returns:
True if message was sent
"""
if not self.config.is_enabled():
return False
payload = {
'level': level,
'message': message
}
if context:
payload.update(context)
try:
if self._provider:
return self._provider.send_event(message, payload)
except Exception:
pass
return False
def enable(
self,
email: Optional[str] = None,
always: bool = True,
once: bool = False
) -> None:
"""
Enable telemetry.
Args:
email: Optional email for follow-up
always: If True, always send errors
once: If True, send only next error
"""
if once:
consent = TelemetryConsent.ONCE
elif always:
consent = TelemetryConsent.ALWAYS
else:
consent = TelemetryConsent.ALWAYS
self.config.set_consent(consent, email)
self._setup_provider()
print("✅ Telemetry enabled")
if email:
print(f" Email: {email}")
print(f" Mode: {'once' if once else 'always'}")
def disable(self) -> None:
"""Disable telemetry."""
self.config.set_consent(TelemetryConsent.DENIED)
self._provider = NullProvider()
print("✅ Telemetry disabled")
def status(self) -> Dict[str, Any]:
"""
Get telemetry status.
Returns:
Dictionary with status information
"""
return {
'enabled': self.config.is_enabled(),
'consent': self.config.get_consent().value,
'email': self.config.get_email(),
'environment': self.environment.value,
'provider': type(self._provider).__name__ if self._provider else 'None',
'errors_sent': self._error_count
}
def flush(self) -> None:
"""Flush any pending telemetry data."""
if self._provider:
self._provider.flush()
def shutdown(self) -> None:
"""Shutdown telemetry."""
if self._provider:
self._provider.shutdown()
# Global instance
_telemetry_manager: Optional[TelemetryManager] = None
def get_telemetry() -> TelemetryManager:
"""
Get global telemetry manager instance.
Returns:
TelemetryManager instance
"""
global _telemetry_manager
if _telemetry_manager is None:
_telemetry_manager = TelemetryManager.get_instance()
return _telemetry_manager
def capture_exception(
exception: Exception,
context: Optional[Dict[str, Any]] = None
) -> bool:
"""
Capture an exception for telemetry.
Args:
exception: Exception to capture
context: Optional context
Returns:
True if sent successfully
"""
try:
return get_telemetry().capture_exception(exception, context)
except Exception:
return False
def telemetry_decorator(func: Callable) -> Callable:
"""
Decorator to capture exceptions from a function.
Args:
func: Function to wrap
Returns:
Wrapped function
"""
@functools.wraps(func)
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except Exception as e:
# Capture exception
capture_exception(e, {
'function': func.__name__,
'module': func.__module__
})
# Re-raise the exception
raise
return wrapper
def async_telemetry_decorator(func: Callable) -> Callable:
"""
Decorator to capture exceptions from an async function.
Args:
func: Async function to wrap
Returns:
Wrapped async function
"""
@functools.wraps(func)
async def wrapper(*args, **kwargs):
try:
return await func(*args, **kwargs)
except Exception as e:
# Capture exception
capture_exception(e, {
'function': func.__name__,
'module': func.__module__
})
# Re-raise the exception
raise
return wrapper
@contextmanager
def telemetry_context(operation: str):
"""
Context manager for capturing exceptions.
Args:
operation: Name of the operation
Example:
with telemetry_context("web_crawl"):
# Your code here
pass
"""
try:
yield
except Exception as e:
capture_exception(e, {'operation': operation})
raise
@asynccontextmanager
async def async_telemetry_context(operation: str):
"""
Async context manager for capturing exceptions in async code.
Args:
operation: Name of the operation
Example:
async with async_telemetry_context("async_crawl"):
# Your async code here
await something()
"""
try:
yield
except Exception as e:
capture_exception(e, {'operation': operation})
raise
def install_exception_handler():
"""Install global exception handler for uncaught exceptions."""
original_hook = sys.excepthook
def telemetry_exception_hook(exc_type, exc_value, exc_traceback):
"""Custom exception hook with telemetry."""
# Don't capture KeyboardInterrupt
if not issubclass(exc_type, KeyboardInterrupt):
capture_exception(exc_value, {
'uncaught': True,
'type': exc_type.__name__
})
# Call original hook
original_hook(exc_type, exc_value, exc_traceback)
sys.excepthook = telemetry_exception_hook
# Public API
def enable(email: Optional[str] = None, always: bool = True, once: bool = False) -> None:
"""
Enable telemetry.
Args:
email: Optional email for follow-up
always: If True, always send errors (default)
once: If True, send only the next error
"""
get_telemetry().enable(email=email, always=always, once=once)
def disable() -> None:
"""Disable telemetry."""
get_telemetry().disable()
def status() -> Dict[str, Any]:
"""
Get telemetry status.
Returns:
Dictionary with status information
"""
return get_telemetry().status()
# Auto-install exception handler on import
# (Only for main library usage, not for Docker/API)
if EnvironmentDetector.detect() not in [Environment.DOCKER, Environment.API_SERVER]:
install_exception_handler()
__all__ = [
'TelemetryManager',
'get_telemetry',
'capture_exception',
'telemetry_decorator',
'async_telemetry_decorator',
'telemetry_context',
'async_telemetry_context',
'enable',
'disable',
'status',
]

140
crawl4ai/telemetry/base.py Normal file
View File

@@ -0,0 +1,140 @@
"""
Base telemetry provider interface for Crawl4AI.
Provides abstraction for different telemetry backends.
"""
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional, Union
import traceback
class TelemetryProvider(ABC):
"""Abstract base class for telemetry providers."""
def __init__(self, **kwargs):
"""Initialize the provider with optional configuration."""
self.config = kwargs
self._initialized = False
@abstractmethod
def initialize(self) -> bool:
"""
Initialize the telemetry provider.
Returns True if initialization successful, False otherwise.
"""
pass
@abstractmethod
def send_exception(
self,
exc: Exception,
context: Optional[Dict[str, Any]] = None
) -> bool:
"""
Send an exception to the telemetry backend.
Args:
exc: The exception to report
context: Optional context data (email, environment, etc.)
Returns:
True if sent successfully, False otherwise
"""
pass
@abstractmethod
def send_event(
self,
event_name: str,
payload: Optional[Dict[str, Any]] = None
) -> bool:
"""
Send a generic telemetry event.
Args:
event_name: Name of the event
payload: Optional event data
Returns:
True if sent successfully, False otherwise
"""
pass
@abstractmethod
def flush(self) -> None:
"""Flush any pending telemetry data."""
pass
@abstractmethod
def shutdown(self) -> None:
"""Clean shutdown of the provider."""
pass
def sanitize_data(self, data: Dict[str, Any]) -> Dict[str, Any]:
"""
Remove sensitive information from telemetry data.
Override in subclasses for custom sanitization.
Args:
data: Raw data dictionary
Returns:
Sanitized data dictionary
"""
# Default implementation - remove common sensitive fields
sensitive_keys = {
'password', 'token', 'api_key', 'secret', 'credential',
'auth', 'authorization', 'cookie', 'session'
}
def _sanitize_dict(d: Dict) -> Dict:
sanitized = {}
for key, value in d.items():
key_lower = key.lower()
if any(sensitive in key_lower for sensitive in sensitive_keys):
sanitized[key] = '[REDACTED]'
elif isinstance(value, dict):
sanitized[key] = _sanitize_dict(value)
elif isinstance(value, list):
sanitized[key] = [
_sanitize_dict(item) if isinstance(item, dict) else item
for item in value
]
else:
sanitized[key] = value
return sanitized
return _sanitize_dict(data) if isinstance(data, dict) else data
class NullProvider(TelemetryProvider):
"""No-op provider for when telemetry is disabled."""
def initialize(self) -> bool:
"""No initialization needed for null provider."""
self._initialized = True
return True
def send_exception(
self,
exc: Exception,
context: Optional[Dict[str, Any]] = None
) -> bool:
"""No-op exception sending."""
return True
def send_event(
self,
event_name: str,
payload: Optional[Dict[str, Any]] = None
) -> bool:
"""No-op event sending."""
return True
def flush(self) -> None:
"""No-op flush."""
pass
def shutdown(self) -> None:
"""No-op shutdown."""
pass

View File

@@ -0,0 +1,196 @@
"""
Configuration management for Crawl4AI telemetry.
Handles user preferences and persistence.
"""
import json
import os
from pathlib import Path
from typing import Dict, Any, Optional
from enum import Enum
class TelemetryConsent(Enum):
"""Telemetry consent levels."""
NOT_SET = "not_set"
DENIED = "denied"
ONCE = "once" # Send current error only
ALWAYS = "always" # Send all errors
class TelemetryConfig:
"""Manages telemetry configuration and persistence."""
def __init__(self, config_dir: Optional[Path] = None):
"""
Initialize configuration manager.
Args:
config_dir: Optional custom config directory
"""
if config_dir:
self.config_dir = config_dir
else:
# Default to ~/.crawl4ai/
self.config_dir = Path.home() / '.crawl4ai'
self.config_file = self.config_dir / 'config.json'
self._config: Dict[str, Any] = {}
self._load_config()
def _ensure_config_dir(self) -> None:
"""Ensure configuration directory exists."""
self.config_dir.mkdir(parents=True, exist_ok=True)
def _load_config(self) -> None:
"""Load configuration from disk."""
if self.config_file.exists():
try:
with open(self.config_file, 'r') as f:
self._config = json.load(f)
except (json.JSONDecodeError, IOError):
# Corrupted or inaccessible config - start fresh
self._config = {}
else:
self._config = {}
def _save_config(self) -> bool:
"""
Save configuration to disk.
Returns:
True if saved successfully
"""
try:
self._ensure_config_dir()
# Write to temporary file first
temp_file = self.config_file.with_suffix('.tmp')
with open(temp_file, 'w') as f:
json.dump(self._config, f, indent=2)
# Atomic rename
temp_file.replace(self.config_file)
return True
except (IOError, OSError):
return False
def get_telemetry_settings(self) -> Dict[str, Any]:
"""
Get current telemetry settings.
Returns:
Dictionary with telemetry settings
"""
return self._config.get('telemetry', {
'consent': TelemetryConsent.NOT_SET.value,
'email': None
})
def get_consent(self) -> TelemetryConsent:
"""
Get current consent status.
Returns:
TelemetryConsent enum value
"""
settings = self.get_telemetry_settings()
consent_value = settings.get('consent', TelemetryConsent.NOT_SET.value)
# Handle legacy boolean values
if isinstance(consent_value, bool):
consent_value = TelemetryConsent.ALWAYS.value if consent_value else TelemetryConsent.DENIED.value
try:
return TelemetryConsent(consent_value)
except ValueError:
return TelemetryConsent.NOT_SET
def set_consent(
self,
consent: TelemetryConsent,
email: Optional[str] = None
) -> bool:
"""
Set telemetry consent and optional email.
Args:
consent: Consent level
email: Optional email for follow-up
Returns:
True if saved successfully
"""
if 'telemetry' not in self._config:
self._config['telemetry'] = {}
self._config['telemetry']['consent'] = consent.value
# Only update email if provided
if email is not None:
self._config['telemetry']['email'] = email
return self._save_config()
def get_email(self) -> Optional[str]:
"""
Get stored email if any.
Returns:
Email address or None
"""
settings = self.get_telemetry_settings()
return settings.get('email')
def is_enabled(self) -> bool:
"""
Check if telemetry is enabled.
Returns:
True if telemetry should send data
"""
consent = self.get_consent()
return consent in [TelemetryConsent.ONCE, TelemetryConsent.ALWAYS]
def should_send_current(self) -> bool:
"""
Check if current error should be sent.
Used for one-time consent.
Returns:
True if current error should be sent
"""
consent = self.get_consent()
if consent == TelemetryConsent.ONCE:
# After sending once, reset to NOT_SET
self.set_consent(TelemetryConsent.NOT_SET)
return True
return consent == TelemetryConsent.ALWAYS
def clear(self) -> bool:
"""
Clear all telemetry settings.
Returns:
True if cleared successfully
"""
if 'telemetry' in self._config:
del self._config['telemetry']
return self._save_config()
return True
def update_from_env(self) -> None:
"""Update configuration from environment variables."""
# Check for telemetry disable flag
if os.environ.get('CRAWL4AI_TELEMETRY') == '0':
self.set_consent(TelemetryConsent.DENIED)
# Check for email override
env_email = os.environ.get('CRAWL4AI_TELEMETRY_EMAIL')
if env_email and self.is_enabled():
current_settings = self.get_telemetry_settings()
self.set_consent(
TelemetryConsent(current_settings['consent']),
email=env_email
)

View File

@@ -0,0 +1,314 @@
"""
User consent handling for Crawl4AI telemetry.
Provides interactive prompts for different environments.
"""
import sys
from typing import Optional, Tuple
from .config import TelemetryConsent, TelemetryConfig
from .environment import Environment, EnvironmentDetector
class ConsentManager:
"""Manages user consent for telemetry."""
def __init__(self, config: Optional[TelemetryConfig] = None):
"""
Initialize consent manager.
Args:
config: Optional TelemetryConfig instance
"""
self.config = config or TelemetryConfig()
self.environment = EnvironmentDetector.detect()
def check_and_prompt(self) -> TelemetryConsent:
"""
Check consent status and prompt if needed.
Returns:
Current consent status
"""
current_consent = self.config.get_consent()
# If already set, return current value
if current_consent != TelemetryConsent.NOT_SET:
return current_consent
# Docker/API server: default enabled (check env var)
if self.environment in [Environment.DOCKER, Environment.API_SERVER]:
return self._handle_docker_consent()
# Interactive environments: prompt user
if EnvironmentDetector.is_interactive():
return self._prompt_for_consent()
# Non-interactive: default disabled
return TelemetryConsent.DENIED
def _handle_docker_consent(self) -> TelemetryConsent:
"""
Handle consent in Docker environment.
Default enabled unless disabled via env var.
"""
import os
if os.environ.get('CRAWL4AI_TELEMETRY') == '0':
self.config.set_consent(TelemetryConsent.DENIED)
return TelemetryConsent.DENIED
# Default enabled for Docker
self.config.set_consent(TelemetryConsent.ALWAYS)
return TelemetryConsent.ALWAYS
def _prompt_for_consent(self) -> TelemetryConsent:
"""
Prompt user for consent based on environment.
Returns:
User's consent choice
"""
if self.environment == Environment.CLI:
return self._cli_prompt()
elif self.environment in [Environment.JUPYTER, Environment.COLAB]:
return self._notebook_prompt()
else:
return TelemetryConsent.DENIED
def _cli_prompt(self) -> TelemetryConsent:
"""
Show CLI prompt for consent.
Returns:
User's consent choice
"""
print("\n" + "="*60)
print("🚨 Crawl4AI Error Detection")
print("="*60)
print("\nWe noticed an error occurred. Help improve Crawl4AI by")
print("sending anonymous crash reports?")
print("\n[1] Yes, send this error only")
print("[2] Yes, always send errors")
print("[3] No, don't send")
print("\n" + "-"*60)
# Get choice
while True:
try:
choice = input("Your choice (1/2/3): ").strip()
if choice == '1':
consent = TelemetryConsent.ONCE
break
elif choice == '2':
consent = TelemetryConsent.ALWAYS
break
elif choice == '3':
consent = TelemetryConsent.DENIED
break
else:
print("Please enter 1, 2, or 3")
except (KeyboardInterrupt, EOFError):
# User cancelled - treat as denial
consent = TelemetryConsent.DENIED
break
# Optional email
email = None
if consent != TelemetryConsent.DENIED:
print("\nOptional: Enter email for follow-up (or press Enter to skip):")
try:
email_input = input("Email: ").strip()
if email_input and '@' in email_input:
email = email_input
except (KeyboardInterrupt, EOFError):
pass
# Save choice
self.config.set_consent(consent, email)
if consent != TelemetryConsent.DENIED:
print("\n✅ Thank you for helping improve Crawl4AI!")
else:
print("\n✅ Telemetry disabled. You can enable it anytime with:")
print(" crawl4ai telemetry enable")
print("="*60 + "\n")
return consent
def _notebook_prompt(self) -> TelemetryConsent:
"""
Show notebook prompt for consent.
Uses widgets if available, falls back to print + code.
Returns:
User's consent choice
"""
if EnvironmentDetector.supports_widgets():
return self._widget_prompt()
else:
return self._notebook_fallback_prompt()
def _widget_prompt(self) -> TelemetryConsent:
"""
Show interactive widget prompt in Jupyter/Colab.
Returns:
User's consent choice
"""
try:
import ipywidgets as widgets
from IPython.display import display, HTML
# Create styled HTML
html = HTML("""
<div style="padding: 15px; border: 2px solid #ff6b6b; border-radius: 8px; background: #fff5f5;">
<h3 style="color: #c92a2a; margin-top: 0;">🚨 Crawl4AI Error Detected</h3>
<p style="color: #495057;">Help us improve by sending anonymous crash reports?</p>
</div>
""")
display(html)
# Create buttons
btn_once = widgets.Button(
description='Send this error',
button_style='info',
icon='check'
)
btn_always = widgets.Button(
description='Always send',
button_style='success',
icon='check-circle'
)
btn_never = widgets.Button(
description='Don\'t send',
button_style='danger',
icon='times'
)
# Email input
email_input = widgets.Text(
placeholder='Optional: your@email.com',
description='Email:',
style={'description_width': 'initial'}
)
# Output area for feedback
output = widgets.Output()
# Container
button_box = widgets.HBox([btn_once, btn_always, btn_never])
container = widgets.VBox([button_box, email_input, output])
# Variable to store choice
consent_choice = {'value': None}
def on_button_click(btn):
"""Handle button click."""
with output:
output.clear_output()
if btn == btn_once:
consent_choice['value'] = TelemetryConsent.ONCE
print("✅ Sending this error only")
elif btn == btn_always:
consent_choice['value'] = TelemetryConsent.ALWAYS
print("✅ Always sending errors")
else:
consent_choice['value'] = TelemetryConsent.DENIED
print("✅ Telemetry disabled")
# Save with email if provided
email = email_input.value.strip() if email_input.value else None
self.config.set_consent(consent_choice['value'], email)
# Disable buttons after choice
btn_once.disabled = True
btn_always.disabled = True
btn_never.disabled = True
email_input.disabled = True
# Attach handlers
btn_once.on_click(on_button_click)
btn_always.on_click(on_button_click)
btn_never.on_click(on_button_click)
# Display widget
display(container)
# Wait for user choice (in notebook, this is non-blocking)
# Return NOT_SET for now, actual choice will be saved via callback
return consent_choice.get('value', TelemetryConsent.NOT_SET)
except Exception:
# Fallback if widgets fail
return self._notebook_fallback_prompt()
def _notebook_fallback_prompt(self) -> TelemetryConsent:
"""
Fallback prompt for notebooks without widget support.
Returns:
User's consent choice (defaults to DENIED)
"""
try:
from IPython.display import display, Markdown
markdown_content = """
### 🚨 Crawl4AI Error Detected
Help us improve by sending anonymous crash reports.
**Telemetry is currently OFF.** To enable, run:
```python
import crawl4ai
crawl4ai.telemetry.enable(email="your@email.com", always=True)
```
To send just this error:
```python
crawl4ai.telemetry.enable(once=True)
```
To keep telemetry disabled:
```python
crawl4ai.telemetry.disable()
```
"""
display(Markdown(markdown_content))
except ImportError:
# Pure print fallback
print("\n" + "="*60)
print("🚨 Crawl4AI Error Detected")
print("="*60)
print("\nTelemetry is OFF. To enable, run:")
print("\nimport crawl4ai")
print('crawl4ai.telemetry.enable(email="you@example.com", always=True)')
print("\n" + "="*60)
# Default to disabled in fallback mode
return TelemetryConsent.DENIED
def force_prompt(self) -> Tuple[TelemetryConsent, Optional[str]]:
"""
Force a consent prompt regardless of current settings.
Used for manual telemetry configuration.
Returns:
Tuple of (consent choice, optional email)
"""
# Temporarily reset consent to force prompt
original_consent = self.config.get_consent()
self.config.set_consent(TelemetryConsent.NOT_SET)
try:
new_consent = self._prompt_for_consent()
email = self.config.get_email()
return new_consent, email
except Exception:
# Restore original on error
self.config.set_consent(original_consent)
raise

View File

@@ -0,0 +1,199 @@
"""
Environment detection for Crawl4AI telemetry.
Detects whether we're running in CLI, Docker, Jupyter, etc.
"""
import os
import sys
from enum import Enum
from typing import Optional
class Environment(Enum):
"""Detected runtime environment."""
CLI = "cli"
DOCKER = "docker"
JUPYTER = "jupyter"
COLAB = "colab"
API_SERVER = "api_server"
UNKNOWN = "unknown"
class EnvironmentDetector:
"""Detects the current runtime environment."""
@staticmethod
def detect() -> Environment:
"""
Detect current runtime environment.
Returns:
Environment enum value
"""
# Check for Docker
if EnvironmentDetector._is_docker():
# Further check if it's API server
if EnvironmentDetector._is_api_server():
return Environment.API_SERVER
return Environment.DOCKER
# Check for Google Colab
if EnvironmentDetector._is_colab():
return Environment.COLAB
# Check for Jupyter
if EnvironmentDetector._is_jupyter():
return Environment.JUPYTER
# Check for CLI
if EnvironmentDetector._is_cli():
return Environment.CLI
return Environment.UNKNOWN
@staticmethod
def _is_docker() -> bool:
"""Check if running inside Docker container."""
# Check for Docker-specific files
if os.path.exists('/.dockerenv'):
return True
# Check cgroup for docker signature
try:
with open('/proc/1/cgroup', 'r') as f:
return 'docker' in f.read()
except (IOError, OSError):
pass
# Check environment variable (if set in Dockerfile)
return os.environ.get('CRAWL4AI_DOCKER', '').lower() == 'true'
@staticmethod
def _is_api_server() -> bool:
"""Check if running as API server."""
# Check for API server indicators
return (
os.environ.get('CRAWL4AI_API_SERVER', '').lower() == 'true' or
'deploy/docker/server.py' in ' '.join(sys.argv) or
'deploy/docker/api.py' in ' '.join(sys.argv)
)
@staticmethod
def _is_jupyter() -> bool:
"""Check if running in Jupyter notebook."""
try:
# Check for IPython
from IPython import get_ipython
ipython = get_ipython()
if ipython is None:
return False
# Check for notebook kernel
if 'IPKernelApp' in ipython.config:
return True
# Check for Jupyter-specific attributes
if hasattr(ipython, 'kernel'):
return True
except (ImportError, AttributeError):
pass
return False
@staticmethod
def _is_colab() -> bool:
"""Check if running in Google Colab."""
try:
import google.colab
return True
except ImportError:
pass
# Alternative check
return 'COLAB_GPU' in os.environ or 'COLAB_TPU_ADDR' in os.environ
@staticmethod
def _is_cli() -> bool:
"""Check if running from command line."""
# Check if we have a terminal
return (
hasattr(sys, 'ps1') or
sys.stdin.isatty() or
bool(os.environ.get('TERM'))
)
@staticmethod
def is_interactive() -> bool:
"""
Check if environment supports interactive prompts.
Returns:
True if interactive prompts are supported
"""
env = EnvironmentDetector.detect()
# Docker/API server are non-interactive
if env in [Environment.DOCKER, Environment.API_SERVER]:
return False
# CLI with TTY is interactive
if env == Environment.CLI:
return sys.stdin.isatty()
# Jupyter/Colab can be interactive with widgets
if env in [Environment.JUPYTER, Environment.COLAB]:
return True
return False
@staticmethod
def supports_widgets() -> bool:
"""
Check if environment supports IPython widgets.
Returns:
True if widgets are supported
"""
env = EnvironmentDetector.detect()
if env not in [Environment.JUPYTER, Environment.COLAB]:
return False
try:
import ipywidgets
from IPython.display import display
return True
except ImportError:
return False
@staticmethod
def get_environment_context() -> dict:
"""
Get environment context for telemetry.
Returns:
Dictionary with environment information
"""
env = EnvironmentDetector.detect()
context = {
'environment_type': env.value,
'python_version': f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}",
'platform': sys.platform,
}
# Add environment-specific context
if env == Environment.DOCKER:
context['docker'] = True
context['container_id'] = os.environ.get('HOSTNAME', 'unknown')
elif env == Environment.COLAB:
context['colab'] = True
context['gpu'] = bool(os.environ.get('COLAB_GPU'))
elif env == Environment.JUPYTER:
context['jupyter'] = True
return context

View File

@@ -0,0 +1,15 @@
"""
Telemetry providers for Crawl4AI.
"""
from ..base import TelemetryProvider, NullProvider
__all__ = ['TelemetryProvider', 'NullProvider']
# Try to import Sentry provider if available
try:
from .sentry import SentryProvider
__all__.append('SentryProvider')
except ImportError:
# Sentry SDK not installed
pass

View File

@@ -0,0 +1,234 @@
"""
Sentry telemetry provider for Crawl4AI.
"""
import os
from typing import Dict, Any, Optional
from ..base import TelemetryProvider
# Hardcoded DSN for Crawl4AI project
# This is safe to embed as it's the public part of the DSN
# TODO: Replace with actual Crawl4AI Sentry project DSN before release
# Format: "https://<public_key>@<organization>.ingest.sentry.io/<project_id>"
DEFAULT_SENTRY_DSN = "https://your-public-key@sentry.io/your-project-id"
class SentryProvider(TelemetryProvider):
"""Sentry implementation of telemetry provider."""
def __init__(self, dsn: Optional[str] = None, **kwargs):
"""
Initialize Sentry provider.
Args:
dsn: Optional DSN override (for testing/development)
**kwargs: Additional Sentry configuration
"""
super().__init__(**kwargs)
# Allow DSN override via environment variable or parameter
self.dsn = (
dsn or
os.environ.get('CRAWL4AI_SENTRY_DSN') or
DEFAULT_SENTRY_DSN
)
self._sentry_sdk = None
self.environment = kwargs.get('environment', 'production')
self.release = kwargs.get('release', None)
def initialize(self) -> bool:
"""Initialize Sentry SDK."""
try:
import sentry_sdk
from sentry_sdk.integrations.stdlib import StdlibIntegration
from sentry_sdk.integrations.excepthook import ExcepthookIntegration
# Initialize Sentry with minimal integrations
sentry_sdk.init(
dsn=self.dsn,
environment=self.environment,
release=self.release,
# Performance monitoring disabled by default
traces_sample_rate=0.0,
# Only capture errors, not transactions
# profiles_sample_rate=0.0,
# Minimal integrations
integrations=[
StdlibIntegration(),
ExcepthookIntegration(always_run=False),
],
# Privacy settings
send_default_pii=False,
attach_stacktrace=True,
# Before send hook for additional sanitization
before_send=self._before_send,
# Disable automatic breadcrumbs
max_breadcrumbs=0,
# Disable request data collection
# request_bodies='never',
# # Custom transport options
# transport_options={
# 'keepalive': True,
# },
)
self._sentry_sdk = sentry_sdk
self._initialized = True
return True
except ImportError:
# Sentry SDK not installed
return False
except Exception:
# Initialization failed silently
return False
def _before_send(self, event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
"""
Process event before sending to Sentry.
Provides additional privacy protection.
"""
# Remove sensitive data
if 'request' in event:
event['request'] = self._sanitize_request(event['request'])
# Remove local variables that might contain sensitive data
if 'exception' in event and 'values' in event['exception']:
for exc in event['exception']['values']:
if 'stacktrace' in exc and 'frames' in exc['stacktrace']:
for frame in exc['stacktrace']['frames']:
# Remove local variables from frames
frame.pop('vars', None)
# Apply general sanitization
event = self.sanitize_data(event)
return event
def _sanitize_request(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Sanitize request data to remove sensitive information."""
sanitized = request_data.copy()
# Remove sensitive fields
sensitive_fields = ['cookies', 'headers', 'data', 'query_string', 'env']
for field in sensitive_fields:
if field in sanitized:
sanitized[field] = '[REDACTED]'
# Keep only safe fields
safe_fields = ['method', 'url']
return {k: v for k, v in sanitized.items() if k in safe_fields}
def send_exception(
self,
exc: Exception,
context: Optional[Dict[str, Any]] = None
) -> bool:
"""
Send exception to Sentry.
Args:
exc: Exception to report
context: Optional context (email, environment info)
Returns:
True if sent successfully
"""
if not self._initialized:
if not self.initialize():
return False
try:
if self._sentry_sdk:
with self._sentry_sdk.push_scope() as scope:
# Add user context if email provided
if context and 'email' in context:
scope.set_user({'email': context['email']})
# Add additional context
if context:
for key, value in context.items():
if key != 'email':
scope.set_context(key, value)
# Add tags for filtering
scope.set_tag('source', context.get('source', 'unknown'))
scope.set_tag('environment_type', context.get('environment_type', 'unknown'))
# Capture the exception
self._sentry_sdk.capture_exception(exc)
return True
except Exception:
# Silently fail - telemetry should never crash the app
return False
return False
def send_event(
self,
event_name: str,
payload: Optional[Dict[str, Any]] = None
) -> bool:
"""
Send custom event to Sentry.
Args:
event_name: Name of the event
payload: Event data
Returns:
True if sent successfully
"""
if not self._initialized:
if not self.initialize():
return False
try:
if self._sentry_sdk:
# Sanitize payload
safe_payload = self.sanitize_data(payload) if payload else {}
# Send as a message with extra data
self._sentry_sdk.capture_message(
event_name,
level='info',
extras=safe_payload
)
return True
except Exception:
return False
return False
def flush(self) -> None:
"""Flush pending events to Sentry."""
if self._initialized and self._sentry_sdk:
try:
self._sentry_sdk.flush(timeout=2.0)
except Exception:
pass
def shutdown(self) -> None:
"""Shutdown Sentry client."""
if self._initialized and self._sentry_sdk:
try:
self._sentry_sdk.flush(timeout=2.0)
# Note: sentry_sdk doesn't have a shutdown method
# Flush is sufficient for cleanup
except Exception:
pass
finally:
self._initialized = False

View File

@@ -23,8 +23,9 @@ SeedingConfig = Union['SeedingConfigType']
# Content scraping types
ContentScrapingStrategy = Union['ContentScrapingStrategyType']
WebScrapingStrategy = Union['WebScrapingStrategyType']
LXMLWebScrapingStrategy = Union['LXMLWebScrapingStrategyType']
# Backward compatibility alias
WebScrapingStrategy = Union['LXMLWebScrapingStrategyType']
# Proxy types
ProxyRotationStrategy = Union['ProxyRotationStrategyType']
@@ -114,7 +115,6 @@ if TYPE_CHECKING:
# Content scraping imports
from .content_scraping_strategy import (
ContentScrapingStrategy as ContentScrapingStrategyType,
WebScrapingStrategy as WebScrapingStrategyType,
LXMLWebScrapingStrategy as LXMLWebScrapingStrategyType,
)

View File

@@ -16,7 +16,7 @@ from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, IM
import httpx
from socket import gaierror
from pathlib import Path
from typing import Dict, Any, List, Optional, Callable
from typing import Dict, Any, List, Optional, Callable, Generator, Tuple, Iterable
from urllib.parse import urljoin
import requests
from requests.exceptions import InvalidSchema
@@ -32,7 +32,6 @@ import hashlib
from urllib.robotparser import RobotFileParser
import aiohttp
from urllib.parse import urlparse, urlunparse
from functools import lru_cache
from packaging import version
@@ -41,7 +40,37 @@ from typing import Sequence
from itertools import chain
from collections import deque
from typing import Generator, Iterable
import psutil
import numpy as np
from urllib.parse import (
urljoin, urlparse, urlunparse,
parse_qsl, urlencode, quote, unquote
)
# Monkey patch to fix wildcard handling in urllib.robotparser
from urllib.robotparser import RuleLine
import re
original_applies_to = RuleLine.applies_to
def patched_applies_to(self, filename):
# Handle wildcards in paths
if '*' in self.path or '%2A' in self.path or self.path in ("*", "%2A"):
pattern = self.path.replace('%2A', '*')
pattern = re.escape(pattern).replace('\\*', '.*')
pattern = '^' + pattern
if pattern.endswith('\\$'):
pattern = pattern[:-2] + '$'
try:
return bool(re.match(pattern, filename))
except re.error:
return original_applies_to(self, filename)
return original_applies_to(self, filename)
RuleLine.applies_to = patched_applies_to
# Monkey patch ends
def chunk_documents(
documents: Iterable[str],
@@ -311,7 +340,7 @@ class RobotsParser:
robots_url = f"{scheme}://{domain}/robots.txt"
async with aiohttp.ClientSession() as session:
async with session.get(robots_url, timeout=2) as response:
async with session.get(robots_url, timeout=2, ssl=False) as response:
if response.status == 200:
rules = await response.text()
self._cache_rules(domain, rules)
@@ -1487,8 +1516,29 @@ def extract_metadata_using_lxml(html, doc=None):
head = head[0]
# Title - using XPath
# title = head.xpath(".//title/text()")
# metadata["title"] = title[0].strip() if title else None
# === Title Extraction - New Approach ===
# Attempt to extract <title> using XPath
title = head.xpath(".//title/text()")
metadata["title"] = title[0].strip() if title else None
title = title[0] if title else None
# Fallback: Use .find() in case XPath fails due to malformed HTML
if not title:
title_el = doc.find(".//title")
title = title_el.text if title_el is not None else None
# Final fallback: Use OpenGraph or Twitter title if <title> is missing or empty
if not title:
title_candidates = (
doc.xpath("//meta[@property='og:title']/@content") or
doc.xpath("//meta[@name='twitter:title']/@content")
)
title = title_candidates[0] if title_candidates else None
# Strip and assign title
metadata["title"] = title.strip() if title else None
# Meta description - using XPath with multiple attribute conditions
description = head.xpath('.//meta[@name="description"]/@content')
@@ -1517,6 +1567,14 @@ def extract_metadata_using_lxml(html, doc=None):
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
# Article metadata
article_tags = head.xpath('.//meta[starts-with(@property, "article:")]')
for tag in article_tags:
property_name = tag.get("property", "").strip()
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
return metadata
@@ -1592,7 +1650,15 @@ def extract_metadata(html, soup=None):
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
# Article metadata
article_tags = head.find_all("meta", attrs={"property": re.compile(r"^article:")})
for tag in article_tags:
property_name = tag.get("property", "").strip()
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
return metadata
@@ -2061,13 +2127,101 @@ def normalize_url(href, base_url):
parsed_base = urlparse(base_url)
if not parsed_base.scheme or not parsed_base.netloc:
raise ValueError(f"Invalid base URL format: {base_url}")
# Ensure base_url ends with a trailing slash if it's a directory path
if not base_url.endswith('/'):
base_url = base_url + '/'
if parsed_base.scheme.lower() not in ["http", "https"]:
# Handle special protocols
raise ValueError(f"Invalid base URL format: {base_url}")
cleaned_href = href.strip()
# Use urljoin to handle all cases
normalized = urljoin(base_url, href.strip())
return urljoin(base_url, cleaned_href)
def normalize_url(
href: str,
base_url: str,
*,
drop_query_tracking=True,
sort_query=True,
keep_fragment=False,
extra_drop_params=None
):
"""
Extended URL normalizer
Parameters
----------
href : str
The raw link extracted from a page.
base_url : str
The pages canonical URL (used to resolve relative links).
drop_query_tracking : bool (default True)
Remove common tracking query parameters.
sort_query : bool (default True)
Alphabetically sort query keys for deterministic output.
keep_fragment : bool (default False)
Preserve the hash fragment (#section) if you need in-page links.
extra_drop_params : Iterable[str] | None
Additional query keys to strip (case-insensitive).
Returns
-------
str | None
A clean, canonical URL or None if href is empty/None.
"""
if not href:
return None
# Resolve relative paths first
full_url = urljoin(base_url, href.strip())
# Parse once, edit parts, then rebuild
parsed = urlparse(full_url)
# ── netloc ──
netloc = parsed.netloc.lower()
# ── path ──
# Strip duplicate slashes and trailing “/” (except root)
path = quote(unquote(parsed.path))
if path.endswith('/') and path != '/':
path = path.rstrip('/')
# ── query ──
query = parsed.query
if query:
# explode, mutate, then rebuild
params = [(k.lower(), v) for k, v in parse_qsl(query, keep_blank_values=True)]
if drop_query_tracking:
default_tracking = {
'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
'utm_content', 'gclid', 'fbclid', 'ref', 'ref_src'
}
if extra_drop_params:
default_tracking |= {p.lower() for p in extra_drop_params}
params = [(k, v) for k, v in params if k not in default_tracking]
if sort_query:
params.sort(key=lambda kv: kv[0])
query = urlencode(params, doseq=True) if params else ''
# ── fragment ──
fragment = parsed.fragment if keep_fragment else ''
# Re-assemble
normalized = urlunparse((
parsed.scheme,
netloc,
path,
parsed.params,
query,
fragment
))
return normalized
@@ -3148,3 +3302,190 @@ def calculate_total_score(
return max(0.0, min(total, 10.0))
# Embedding utilities
async def get_text_embeddings(
texts: List[str],
llm_config: Optional[Dict] = None,
model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
batch_size: int = 32
) -> np.ndarray:
"""
Compute embeddings for a list of texts using specified model.
Args:
texts: List of texts to embed
llm_config: Optional LLM configuration for API-based embeddings
model_name: Model name (used when llm_config is None)
batch_size: Batch size for processing
Returns:
numpy array of embeddings
"""
import numpy as np
if not texts:
return np.array([])
# If LLMConfig provided, use litellm for embeddings
if llm_config is not None:
from litellm import aembedding
# Get embedding model from config or use default
embedding_model = llm_config.get('provider', 'text-embedding-3-small')
api_base = llm_config.get('base_url', llm_config.get('api_base'))
# Prepare kwargs
kwargs = {
'model': embedding_model,
'input': texts,
'api_key': llm_config.get('api_token', llm_config.get('api_key'))
}
if api_base:
kwargs['api_base'] = api_base
# Handle OpenAI-compatible endpoints
if api_base and 'openai/' not in embedding_model:
kwargs['model'] = f"openai/{embedding_model}"
# Get embeddings
response = await aembedding(**kwargs)
# Extract embeddings from response
embeddings = []
for item in response.data:
embeddings.append(item['embedding'])
return np.array(embeddings)
# Default: use sentence-transformers
else:
# Lazy load to avoid importing heavy libraries unless needed
try:
from sentence_transformers import SentenceTransformer
except ImportError:
raise ImportError(
"sentence-transformers is required for local embeddings. "
"Install it with: pip install 'crawl4ai[transformer]' or pip install sentence-transformers"
)
# Cache the model in function attribute to avoid reloading
if not hasattr(get_text_embeddings, '_models'):
get_text_embeddings._models = {}
if model_name not in get_text_embeddings._models:
get_text_embeddings._models[model_name] = SentenceTransformer(model_name)
encoder = get_text_embeddings._models[model_name]
# Batch encode for efficiency
embeddings = encoder.encode(
texts,
batch_size=batch_size,
show_progress_bar=False,
convert_to_numpy=True
)
return embeddings
def get_text_embeddings_sync(
texts: List[str],
llm_config: Optional[Dict] = None,
model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
batch_size: int = 32
) -> np.ndarray:
"""Synchronous wrapper for get_text_embeddings"""
import numpy as np
return asyncio.run(get_text_embeddings(texts, llm_config, model_name, batch_size))
def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
"""Calculate cosine similarity between two vectors"""
import numpy as np
dot_product = np.dot(vec1, vec2)
norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
return float(dot_product / norm_product) if norm_product != 0 else 0.0
def cosine_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
"""Calculate cosine distance (1 - similarity) between two vectors"""
return 1 - cosine_similarity(vec1, vec2)
# Memory utilities
def get_true_available_memory_gb() -> float:
"""Get truly available memory including inactive pages (cross-platform)"""
vm = psutil.virtual_memory()
if platform.system() == 'Darwin': # macOS
# On macOS, we need to include inactive memory too
try:
# Use vm_stat to get accurate values
result = subprocess.run(['vm_stat'], capture_output=True, text=True)
lines = result.stdout.split('\n')
page_size = 16384 # macOS page size
pages = {}
for line in lines:
if 'Pages free:' in line:
pages['free'] = int(line.split()[-1].rstrip('.'))
elif 'Pages inactive:' in line:
pages['inactive'] = int(line.split()[-1].rstrip('.'))
elif 'Pages speculative:' in line:
pages['speculative'] = int(line.split()[-1].rstrip('.'))
elif 'Pages purgeable:' in line:
pages['purgeable'] = int(line.split()[-1].rstrip('.'))
# Calculate total available (free + inactive + speculative + purgeable)
total_available_pages = (
pages.get('free', 0) +
pages.get('inactive', 0) +
pages.get('speculative', 0) +
pages.get('purgeable', 0)
)
available_gb = (total_available_pages * page_size) / (1024**3)
return available_gb
except:
# Fallback to psutil
return vm.available / (1024**3)
else:
# For Windows and Linux, psutil.available is accurate
return vm.available / (1024**3)
def get_true_memory_usage_percent() -> float:
"""
Get memory usage percentage that accounts for platform differences.
Returns:
float: Memory usage percentage (0-100)
"""
vm = psutil.virtual_memory()
total_gb = vm.total / (1024**3)
available_gb = get_true_available_memory_gb()
# Calculate used percentage based on truly available memory
used_percent = 100.0 * (total_gb - available_gb) / total_gb
# Ensure it's within valid range
return max(0.0, min(100.0, used_percent))
def get_memory_stats() -> Tuple[float, float, float]:
"""
Get comprehensive memory statistics.
Returns:
Tuple[float, float, float]: (used_percent, available_gb, total_gb)
"""
vm = psutil.virtual_memory()
total_gb = vm.total / (1024**3)
available_gb = get_true_available_memory_gb()
used_percent = get_true_memory_usage_percent()
return used_percent, available_gb, total_gb

View File

@@ -5,4 +5,9 @@ ANTHROPIC_API_KEY=your_anthropic_key_here
GROQ_API_KEY=your_groq_key_here
TOGETHER_API_KEY=your_together_key_here
MISTRAL_API_KEY=your_mistral_key_here
GEMINI_API_TOKEN=your_gemini_key_here
GEMINI_API_TOKEN=your_gemini_key_here
# Optional: Override the default LLM provider
# Examples: "openai/gpt-4", "anthropic/claude-3-opus", "deepseek/chat", etc.
# If not set, uses the provider specified in config.yml (default: openai/gpt-4o-mini)
# LLM_PROVIDER=anthropic/claude-3-opus

View File

@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
```bash
# Pull the release candidate (recommended for latest features)
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Or pull the latest stable version
# Or pull the current stable version (0.6.0)
docker pull unclecode/crawl4ai:latest
```
@@ -99,7 +101,7 @@ EOL
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
unclecode/crawl4ai:0.7.0-r1
```
* **With LLM support:**
@@ -110,7 +112,7 @@ EOL
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
unclecode/crawl4ai:0.7.0-r1
```
> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
@@ -152,6 +154,29 @@ cp deploy/docker/.llm.env.example .llm.env
# Now edit .llm.env and add your API keys
```
**Flexible LLM Provider Configuration:**
The Docker setup now supports flexible LLM provider configuration through three methods:
1. **Environment Variable** (Highest Priority): Set `LLM_PROVIDER` to override the default
```bash
export LLM_PROVIDER="anthropic/claude-3-opus"
# Or in your .llm.env file:
# LLM_PROVIDER=anthropic/claude-3-opus
```
2. **API Request Parameter**: Specify provider per request
```json
{
"url": "https://example.com",
"provider": "groq/mixtral-8x7b"
}
```
3. **Config File Default**: Falls back to `config.yml` (default: `openai/gpt-4o-mini`)
The system automatically selects the appropriate API key based on the provider.
#### 3. Build and Run with Compose
The `docker-compose.yml` file in the project root provides a simplified approach that automatically handles architecture detection using buildx.
@@ -160,7 +185,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number docker compose up -d
IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
```
* **Build and Run Locally:**
@@ -666,7 +691,7 @@ app:
# Default LLM Configuration
llm:
provider: "openai/gpt-4o-mini"
provider: "openai/gpt-4o-mini" # Can be overridden by LLM_PROVIDER env var
api_key_env: "OPENAI_API_KEY"
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored

View File

@@ -5,6 +5,7 @@ from typing import List, Tuple, Dict
from functools import partial
from uuid import uuid4
from datetime import datetime
from base64 import b64encode
import logging
from typing import Optional, AsyncGenerator
@@ -39,7 +40,9 @@ from utils import (
get_base_url,
is_task_id,
should_cleanup_task,
decode_redis_hash
decode_redis_hash,
get_llm_api_key,
validate_llm_provider
)
import psutil, time
@@ -62,7 +65,7 @@ async def handle_llm_qa(
) -> str:
"""Process QA using LLM with crawled content as context."""
try:
if not url.startswith(('http://', 'https://')):
if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")):
url = 'https://' + url
# Extract base URL by finding last '?q=' occurrence
last_q_index = url.rfind('?q=')
@@ -88,10 +91,12 @@ async def handle_llm_qa(
Answer:"""
# api_token=os.environ.get(config["llm"].get("api_key_env", ""))
response = perform_completion_with_backoff(
provider=config["llm"]["provider"],
prompt_with_variables=prompt,
api_token=os.environ.get(config["llm"].get("api_key_env", ""))
api_token=get_llm_api_key(config)
)
return response.choices[0].message.content
@@ -109,19 +114,23 @@ async def process_llm_extraction(
url: str,
instruction: str,
schema: Optional[str] = None,
cache: str = "0"
cache: str = "0",
provider: Optional[str] = None
) -> None:
"""Process LLM extraction in background."""
try:
# If config['llm'] has api_key then ignore the api_key_env
api_key = ""
if "api_key" in config["llm"]:
api_key = config["llm"]["api_key"]
else:
api_key = os.environ.get(config["llm"].get("api_key_env", None), "")
# Validate provider
is_valid, error_msg = validate_llm_provider(config, provider)
if not is_valid:
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.FAILED,
"error": error_msg
})
return
api_key = get_llm_api_key(config, provider)
llm_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider=config["llm"]["provider"],
provider=provider or config["llm"]["provider"],
api_token=api_key
),
instruction=instruction,
@@ -168,12 +177,21 @@ async def handle_markdown_request(
filter_type: FilterType,
query: Optional[str] = None,
cache: str = "0",
config: Optional[dict] = None
config: Optional[dict] = None,
provider: Optional[str] = None
) -> str:
"""Handle markdown generation requests."""
try:
# Validate provider if using LLM filter
if filter_type == FilterType.LLM:
is_valid, error_msg = validate_llm_provider(config, provider)
if not is_valid:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=error_msg
)
decoded_url = unquote(url)
if not decoded_url.startswith(('http://', 'https://')):
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
decoded_url = 'https://' + decoded_url
if filter_type == FilterType.RAW:
@@ -184,8 +202,8 @@ async def handle_markdown_request(
FilterType.BM25: BM25ContentFilter(user_query=query or ""),
FilterType.LLM: LLMContentFilter(
llm_config=LLMConfig(
provider=config["llm"]["provider"],
api_token=os.environ.get(config["llm"].get("api_key_env", None), ""),
provider=provider or config["llm"]["provider"],
api_token=get_llm_api_key(config, provider),
),
instruction=query or "Extract main content"
)
@@ -229,7 +247,8 @@ async def handle_llm_request(
query: Optional[str] = None,
schema: Optional[str] = None,
cache: str = "0",
config: Optional[dict] = None
config: Optional[dict] = None,
provider: Optional[str] = None
) -> JSONResponse:
"""Handle LLM extraction requests."""
base_url = get_base_url(request)
@@ -259,7 +278,8 @@ async def handle_llm_request(
schema,
cache,
base_url,
config
config,
provider
)
except Exception as e:
@@ -303,11 +323,12 @@ async def create_new_task(
schema: Optional[str],
cache: str,
base_url: str,
config: dict
config: dict,
provider: Optional[str] = None
) -> JSONResponse:
"""Create and initialize a new task."""
decoded_url = unquote(input_path)
if not decoded_url.startswith(('http://', 'https://')):
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
decoded_url = 'https://' + decoded_url
from datetime import datetime
@@ -327,7 +348,8 @@ async def create_new_task(
decoded_url,
query,
schema,
cache
cache,
provider
)
return JSONResponse({
@@ -371,6 +393,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
server_memory_mb = _get_memory_mb()
result_dict = result.model_dump()
result_dict['server_memory_mb'] = server_memory_mb
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
data = json.dumps(result_dict, default=datetime_handler) + "\n"
yield data.encode('utf-8')
@@ -403,7 +428,7 @@ async def handle_crawl_request(
peak_mem_mb = start_mem_mb
try:
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) else url for url in urls]
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
browser_config = BrowserConfig.load(browser_config)
crawler_config = CrawlerRunConfig.load(crawler_config)
@@ -443,10 +468,19 @@ async def handle_crawl_request(
mem_delta_mb = end_mem_mb - start_mem_mb # <--- Calculate delta
peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb) # <--- Get peak memory
logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")
# Process results to handle PDF bytes
processed_results = []
for result in results:
result_dict = result.model_dump()
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
processed_results.append(result_dict)
return {
"success": True,
"results": [result.model_dump() for result in results],
"results": processed_results,
"server_processing_time_s": end_time - start_time,
"server_memory_delta_mb": mem_delta_mb,
"server_peak_memory_mb": peak_mem_mb
@@ -459,7 +493,7 @@ async def handle_crawl_request(
# await crawler.close()
# except Exception as close_e:
# logger.error(f"Error closing crawler during exception handling: {close_e}")
logger.error(f"Error closing crawler during exception handling: {close_e}")
logger.error(f"Error closing crawler during exception handling: {str(e)}")
# Measure memory even on error if possible
end_mem_mb_error = _get_memory_mb()
@@ -518,7 +552,7 @@ async def handle_stream_crawl_request(
# await crawler.close()
# except Exception as close_e:
# logger.error(f"Error closing crawler during stream setup exception: {close_e}")
logger.error(f"Error closing crawler during stream setup exception: {close_e}")
logger.error(f"Error closing crawler during stream setup exception: {str(e)}")
logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
# Raising HTTPException here will prevent streaming response
raise HTTPException(

View File

@@ -332,7 +332,7 @@ The `clone()` method:
### Key fields to note
1. **`provider`**:
- Which LLM provoder to use.
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
@@ -403,7 +403,7 @@ async def main():
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
options={"ignore_links": True})
# 4) Crawler run config: skip cache, use extraction
run_conf = CrawlerRunConfig(
@@ -3760,11 +3760,11 @@ To crawl a live web page, provide the URL starting with `http://` or `https://`,
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_web():
config = CrawlerRunConfig(bypass_cache=True)
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/apple",
@@ -3785,13 +3785,13 @@ To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_local_file():
local_file_path = "/path/to/apple.html" # Replace with your file path
file_url = f"file://{local_file_path}"
config = CrawlerRunConfig(bypass_cache=True)
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=file_url, config=config)
@@ -3810,13 +3810,13 @@ To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_raw_html():
raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
raw_html_url = f"raw:{raw_html}"
config = CrawlerRunConfig(bypass_cache=True)
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=raw_html_url, config=config)
@@ -3845,7 +3845,7 @@ import os
import sys
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
async def main():
@@ -3856,7 +3856,7 @@ async def main():
async with AsyncWebCrawler() as crawler:
# Step 1: Crawl the Web URL
print("\n=== Step 1: Crawling the Wikipedia URL ===")
web_config = CrawlerRunConfig(bypass_cache=True)
web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
result = await crawler.arun(url=wikipedia_url, config=web_config)
if not result.success:
@@ -3871,7 +3871,7 @@ async def main():
# Step 2: Crawl from the Local HTML File
print("=== Step 2: Crawling from the Local HTML File ===")
file_url = f"file://{html_file_path.resolve()}"
file_config = CrawlerRunConfig(bypass_cache=True)
file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
local_result = await crawler.arun(url=file_url, config=file_config)
if not local_result.success:
@@ -3887,7 +3887,7 @@ async def main():
with open(html_file_path, 'r', encoding='utf-8') as f:
raw_html_content = f.read()
raw_html_url = f"raw:{raw_html_content}"
raw_config = CrawlerRunConfig(bypass_cache=True)
raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
if not raw_result.success:
@@ -4152,7 +4152,7 @@ prune_filter = PruningContentFilter(
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
async def main():
@@ -4175,8 +4175,13 @@ async def main():
verbose=True
)
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
)
config = CrawlerRunConfig(
content_filter=filter
markdown_generator=md_generator
)
async with AsyncWebCrawler() as crawler:
@@ -5428,29 +5433,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
pdf=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
config=run_config
)
if result.success:
# Save screenshot
print(f"Screenshot data present: {result.screenshot is not None}")
print(f"PDF data present: {result.pdf is not None}")
if result.screenshot:
print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Save PDF
else:
print("[WARN] Screenshot data is None.")
if result.pdf:
print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
print("[OK] PDF & screenshot captured.")
else:
print("[WARN] PDF data is None.")
else:
print("[ERROR]", result.error_message)

View File

@@ -36,6 +36,7 @@ class LlmJobPayload(BaseModel):
q: str
schema: Optional[str] = None
cache: bool = False
provider: Optional[str] = None
class CrawlJobPayload(BaseModel):
@@ -61,6 +62,7 @@ async def llm_job_enqueue(
schema=payload.schema,
cache=payload.cache,
config=_config,
provider=payload.provider,
)

View File

@@ -15,3 +15,4 @@ PyJWT==2.10.1
mcp>=1.6.0
websockets>=15.0.1
httpx[http2]>=0.27.2
sentry-sdk>=2.0.0

View File

@@ -12,10 +12,10 @@ class CrawlRequest(BaseModel):
class MarkdownRequest(BaseModel):
"""Request body for the /md endpoint."""
url: str = Field(..., description="Absolute http/https URL to fetch")
f: FilterType = Field(FilterType.FIT,
description="Contentfilter strategy: FIT, RAW, BM25, or LLM")
f: FilterType = Field(FilterType.FIT, description="Contentfilter strategy: fit, raw, bm25, or llm")
q: Optional[str] = Field(None, description="Query string used by BM25/LLM filters")
c: Optional[str] = Field("0", description="Cachebust / revision counter")
provider: Optional[str] = Field(None, description="LLM provider override (e.g., 'anthropic/claude-3-opus')")
class RawCode(BaseModel):

View File

@@ -74,6 +74,32 @@ setup_logging(config)
__version__ = "0.5.1-d1"
# ───────────────────── telemetry setup ────────────────────────
# Docker/API server telemetry: enabled by default unless CRAWL4AI_TELEMETRY=0
import os as _os
if _os.environ.get('CRAWL4AI_TELEMETRY') != '0':
# Set environment variable to indicate we're in API server mode
_os.environ['CRAWL4AI_API_SERVER'] = 'true'
# Import and enable telemetry for Docker/API environment
from crawl4ai.telemetry import enable as enable_telemetry
from crawl4ai.telemetry import capture_exception
# Enable telemetry automatically in Docker mode
enable_telemetry(always=True)
import logging
telemetry_logger = logging.getLogger("telemetry")
telemetry_logger.info("✅ Telemetry enabled for Docker/API server")
else:
# Define no-op for capture_exception if telemetry is disabled
def capture_exception(exc, context=None):
pass
import logging
telemetry_logger = logging.getLogger("telemetry")
telemetry_logger.info("❌ Telemetry disabled via CRAWL4AI_TELEMETRY=0")
# ── global page semaphore (hard cap) ─────────────────────────
MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)
@@ -237,11 +263,11 @@ async def get_markdown(
body: MarkdownRequest,
_td: Dict = Depends(token_dep),
):
if not body.url.startswith(("http://", "https://")):
if not body.url.startswith(("http://", "https://")) and not body.url.startswith(("raw:", "raw://")):
raise HTTPException(
400, "URL must be absolute and start with http/https")
400, "Invalid URL format. Must start with http://, https://, or for raw HTML (raw:, raw://)")
markdown = await handle_markdown_request(
body.url, body.f, body.q, body.c, config
body.url, body.f, body.q, body.c, config, body.provider
)
return JSONResponse({
"url": body.url,
@@ -401,7 +427,7 @@ async def llm_endpoint(
):
if not q:
raise HTTPException(400, "Query parameter 'q' is required")
if not url.startswith(("http://", "https://")):
if not url.startswith(("http://", "https://")) and not url.startswith(("raw:", "raw://")):
url = "https://" + url
answer = await handle_llm_qa(url, q, config)
return JSONResponse({"answer": answer})

View File

@@ -671,6 +671,16 @@
method: 'GET',
headers: { 'Accept': 'application/json' }
});
responseData = await response.json();
const time = Math.round(performance.now() - startTime);
if (!response.ok) {
updateStatus('error', time);
throw new Error(responseData.error || 'Request failed');
}
updateStatus('success', time);
document.querySelector('#response-content code').textContent = JSON.stringify(responseData, null, 2);
document.querySelector('#response-content code').className = 'json hljs';
forceHighlightElement(document.querySelector('#response-content code'));
} else if (endpoint === 'crawl_stream') {
// Stream processing
response = await fetch(api, {

View File

@@ -1,6 +1,7 @@
import dns.resolver
import logging
import yaml
import os
from datetime import datetime
from enum import Enum
from pathlib import Path
@@ -19,10 +20,24 @@ class FilterType(str, Enum):
LLM = "llm"
def load_config() -> Dict:
"""Load and return application configuration."""
"""Load and return application configuration with environment variable overrides."""
config_path = Path(__file__).parent / "config.yml"
with open(config_path, "r") as config_file:
return yaml.safe_load(config_file)
config = yaml.safe_load(config_file)
# Override LLM provider from environment if set
llm_provider = os.environ.get("LLM_PROVIDER")
if llm_provider:
config["llm"]["provider"] = llm_provider
logging.info(f"LLM provider overridden from environment: {llm_provider}")
# Also support direct API key from environment if the provider-specific key isn't set
llm_api_key = os.environ.get("LLM_API_KEY")
if llm_api_key and "api_key" not in config["llm"]:
config["llm"]["api_key"] = llm_api_key
logging.info("LLM API key loaded from LLM_API_KEY environment variable")
return config
def setup_logging(config: Dict) -> None:
"""Configure application logging."""
@@ -56,6 +71,52 @@ def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:
def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> str:
"""Get the appropriate API key based on the LLM provider.
Args:
config: The application configuration dictionary
provider: Optional provider override (e.g., "openai/gpt-4")
Returns:
The API key for the provider, or empty string if not found
"""
# Use provided provider or fall back to config
if not provider:
provider = config["llm"]["provider"]
# Check if direct API key is configured
if "api_key" in config["llm"]:
return config["llm"]["api_key"]
# Fall back to the configured api_key_env if no match
return os.environ.get(config["llm"].get("api_key_env", ""), "")
def validate_llm_provider(config: Dict, provider: Optional[str] = None) -> tuple[bool, str]:
"""Validate that the LLM provider has an associated API key.
Args:
config: The application configuration dictionary
provider: Optional provider override (e.g., "openai/gpt-4")
Returns:
Tuple of (is_valid, error_message)
"""
# Use provided provider or fall back to config
if not provider:
provider = config["llm"]["provider"]
# Get the API key for this provider
api_key = get_llm_api_key(config, provider)
if not api_key:
return False, f"No API key found for provider '{provider}'. Please set the appropriate environment variable."
return True, ""
def verify_email_domain(email: str) -> bool:
try:
domain = email.split('@')[1]

View File

@@ -14,6 +14,7 @@ x-base-config: &base-config
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
- LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
volumes:
- /dev/shm:/dev/shm # Chromium performance
deploy:

343
docs/blog/release-v0.7.0.md Normal file
View File

@@ -0,0 +1,343 @@
# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update
*January 28, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
## 🎯 What's New at a Glance
- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
- **Link Preview with Intelligent Scoring**: Intelligent link analysis and prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
- **Performance Optimizations**: Significant speed and memory improvements
## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning
**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.
### Technical Deep-Dive
The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores
```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio
async def main():
# Configure adaptive crawler
config = AdaptiveConfig(
strategy="statistical", # or "embedding" for semantic understanding
max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
asyncio.run(main())
```
**Expected Real-World Impact:**
- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates
- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance
- **Research Data Collection**: Build robust academic datasets that survive website redesigns
- **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites
## 🌊 Virtual Scroll: Complete Content Capture
**The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.
### Implementation Details
```python
from crawl4ai import VirtualScrollConfig
# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
container_selector="[data-testid='primaryColumn']",
scroll_count=20, # Number of scrolls
scroll_by="container_height", # Smart scrolling by container size
wait_after_scroll=1.0 # Let content load
)
# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
container_selector="main .product-grid",
scroll_count=30,
scroll_by=800, # Fixed pixel scrolling
wait_after_scroll=1.5 # Images need time
)
# For news feeds with lazy loading
news_config = VirtualScrollConfig(
container_selector=".article-feed",
scroll_count=50,
scroll_by="page_height", # Viewport-based scrolling
wait_after_scroll=0.5 # Wait for content to load
)
# Use it in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://twitter.com/trending",
config=CrawlerRunConfig(
virtual_scroll_config=twitter_config,
# Combine with other features
extraction_strategy=JsonCssExtractionStrategy({
"tweets": {
"selector": "[data-testid='tweet']",
"fields": {
"text": {"selector": "[data-testid='tweetText']", "type": "text"},
"likes": {"selector": "[data-testid='like']", "type": "text"}
}
}
})
)
)
print(f"Captured {len(result.extracted_content['tweets'])} tweets")
```
**Key Capabilities:**
- **DOM Recycling Awareness**: Detects and handles virtual DOM element recycling
- **Smart Scroll Physics**: Three modes - container height, page height, or fixed pixels
- **Content Preservation**: Captures content before it's destroyed
- **Intelligent Stopping**: Stops when no new content appears
- **Memory Efficient**: Streams content instead of holding everything in memory
**Expected Real-World Impact:**
- **Social Media Analysis**: Capture entire Twitter threads with hundreds of replies, not just top 10
- **E-commerce Scraping**: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- **News Aggregation**: Get all articles from modern news sites, not just above-the-fold content
- **Research Applications**: Complete data extraction from academic databases using virtual pagination
## 🔗 Link Preview: Intelligent Link Analysis and Scoring
**The Problem:** You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.
**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
### Intelligent Link Analysis and Scoring
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig
async def main():
# Configure intelligent link analysis
link_config = LinkPreviewConfig(
include_internal=True,
include_external=False,
max_links=10,
concurrency=5,
query="python tutorial", # For contextual scoring
score_threshold=0.3,
verbose=True
)
# Use in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.geeksforgeeks.org/",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
)
# Access scored and sorted links
if result.success and result.links:
for link in result.links.get("internal", []):
text = link.get('text', 'No text')[:40]
print(
text,
f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
)
asyncio.run(main())
```
**Scoring Components:**
1. **Intrinsic Score**: Based on link quality indicators
- Position on page (navigation, content, footer)
- Link attributes (rel, title, class names)
- Anchor text quality and length
- URL structure and depth
2. **Contextual Score**: Relevance to your query using BM25 algorithm
- Keyword matching in link text and title
- Meta description analysis
- Content preview scoring
3. **Total Score**: Combined score for final ranking
**Expected Real-World Impact:**
- **Research Efficiency**: Find relevant papers 10x faster by following only high-score links
- **Competitive Analysis**: Automatically identify important pages on competitor sites
- **Content Discovery**: Build topic-focused crawlers that stay on track
- **SEO Audits**: Identify and prioritize high-value internal linking opportunities
## 🎣 Async URL Seeder: Automated URL Discovery at Scale
**The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
**My Solution:** I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
### Technical Architecture
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig
async def main():
async with AsyncUrlSeeder() as seeder:
# Discover Python tutorial URLs
config = SeedingConfig(
source="sitemap", # Use sitemap
pattern="*python*", # URL pattern filter
extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
scoring_method="bm25",
score_threshold=0.2,
max_urls=10
)
print("Discovering Python async tutorial URLs...")
urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
asyncio.run(main())
```
**Discovery Methods:**
- **Sitemap Mining**: Parses robots.txt and all linked sitemaps
- **Common Crawl**: Queries the Common Crawl index for historical URLs
- **Intelligent Crawling**: Follows links with smart depth control
- **Pattern Analysis**: Learns URL structures and generates variations
**Expected Real-World Impact:**
- **Migration Projects**: Discover 10,000+ URLs from legacy sites in under 60 seconds
- **Market Research**: Map entire competitor ecosystems automatically
- **Academic Research**: Build comprehensive datasets without manual URL collection
- **SEO Audits**: Find every indexable page with content scoring
- **Content Archival**: Ensure no content is left behind during site migrations
## ⚡ Performance Optimizations
This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
### What We Optimized
```python
# Optimized crawling with v0.7.0 improvements
results = []
for url in urls:
result = await crawler.arun(
url,
config=CrawlerRunConfig(
# Performance optimizations
wait_until="domcontentloaded", # Faster than networkidle
cache_mode=CacheMode.ENABLED # Enable caching
)
)
results.append(result)
```
**Performance Gains:**
- **Startup Time**: 70% faster browser initialization
- **Page Loading**: 40% reduction with smart resource blocking
- **Extraction**: 3x faster with compiled CSS selectors
- **Memory Usage**: 60% reduction with streaming processing
- **Concurrent Crawls**: Handle 5x more parallel requests
## 🔧 Important Changes
### Breaking Changes
- `link_extractor` renamed to `link_preview` (better reflects functionality)
- Minimum Python version now 3.9
- `CrawlerConfig` split into `CrawlerRunConfig` and `BrowserConfig`
### Migration Guide
```python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)
# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```
## 🤖 Coming Soon: Intelligent Web Automation
I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:
- **Crawl Agents**: Autonomous crawlers that understand your goals and adapt their strategies
- **Auto JS Generation**: Automatic JavaScript code generation for complex interactions
- **Smart Form Handling**: Intelligent form detection and filling
- **Context-Aware Actions**: Crawlers that understand page context and make decisions
These features are under active development and will revolutionize how we approach web automation. Stay tuned!
## 🚀 Get Started
```bash
pip install crawl4ai==0.7.0
```
Check out the [updated documentation](https://docs.crawl4ai.com).
Questions? Issues? I'm always listening:
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️
---
*P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.*

View File

@@ -0,0 +1,43 @@
# 🛠️ Crawl4AI v0.7.1: Minor Cleanup Update
*July 17, 2025 • 2 min read*
---
A small maintenance release that removes unused code and improves documentation.
## 🎯 What's Changed
- **Removed unused StealthConfig** from `crawl4ai/browser_manager.py`
- **Updated documentation** with better examples and parameter explanations
- **Fixed virtual scroll configuration** examples in docs
## 🧹 Code Cleanup
Removed unused `StealthConfig` import and configuration that wasn't being used anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead.
```python
# Removed unused code:
from playwright_stealth import StealthConfig
stealth_config = StealthConfig(...) # This was never used
```
## 📖 Documentation Updates
- Fixed adaptive crawling parameter examples
- Updated session management documentation
- Corrected virtual scroll configuration examples
## 🚀 Installation
```bash
pip install crawl4ai==0.7.1
```
No breaking changes - upgrade directly from v0.7.0.
---
Questions? Issues?
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)

350
docs/blog/release-v0.7.3.md Normal file
View File

@@ -0,0 +1,350 @@
# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update
*August 6, 2025 • 5 min read*
---
Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.
## 🎯 What's New at a Glance
- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
- **🔧 Bug Fixes**: Resolved several critical issues for better stability
- **📚 Documentation Updates**: Clearer examples and improved API documentation
## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.
**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.
### Technical Implementation
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
# Define specialized configs for different content types
configs = [
# Documentation sites - aggressive caching, include links
CrawlerRunConfig(
url_matcher=["*docs*", "*documentation*"],
cache_mode="write",
markdown_generator_options={"include_links": True}
),
# News/blog sites - fresh content, scroll for lazy loading
CrawlerRunConfig(
url_matcher=lambda url: 'blog' in url or 'news' in url,
cache_mode="bypass",
js_code="window.scrollTo(0, document.body.scrollHeight/2);"
),
# API endpoints - structured extraction
CrawlerRunConfig(
url_matcher=["*.json", "*api*"],
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
extraction_type="structured"
)
),
# Default fallback for everything else
CrawlerRunConfig() # No url_matcher = matches everything
]
# Crawl multiple URLs with appropriate configs
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=[
"https://docs.python.org/3/", # → Uses documentation config
"https://blog.python.org/", # → Uses blog config
"https://api.github.com/users", # → Uses API config
"https://example.com/" # → Uses default config
],
config=configs
)
```
**Matching Capabilities:**
- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
- **Function Matchers**: Lambda functions for complex logic
- **Mixed Matchers**: Combine strings and functions with AND/OR logic
- **Fallback Support**: Default config when nothing matches
**Expected Real-World Impact:**
- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
- **Reduced Complexity**: No more if/else forests in your extraction code
- **Better Performance**: Each URL gets exactly the processing it needs
## 🕵️ Undetected Browser Support: Stealth Mode Activated
**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
### Technical Implementation
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
# Enable undetected mode for stealth crawling
browser_config = BrowserConfig(
browser_type="undetected", # Use undetected Chrome
headless=True, # Can run headless with stealth
extra_args=[
"--disable-blink-features=AutomationControlled",
"--disable-web-security",
"--disable-features=VizDisplayCompositor"
]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# This will bypass most bot detection systems
result = await crawler.arun("https://protected-site.com")
if result.success:
print("✅ Successfully bypassed bot detection!")
print(f"Content length: {len(result.markdown)}")
```
**Advanced Anti-Bot Strategies:**
```python
# Combine multiple stealth techniques
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(
# Random user agents and headers
headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1"
},
# Human-like behavior simulation
js_code="""
// Random mouse movements
const simulateHuman = () => {
const event = new MouseEvent('mousemove', {
clientX: Math.random() * window.innerWidth,
clientY: Math.random() * window.innerHeight
});
document.dispatchEvent(event);
};
setInterval(simulateHuman, 100 + Math.random() * 200);
// Random scrolling
const randomScroll = () => {
const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
window.scrollTo(0, scrollY);
};
setTimeout(randomScroll, 500 + Math.random() * 1000);
""",
# Delay to appear more human
delay_before_return_html=2.0
)
result = await crawler.arun("https://bot-protected-site.com", config=config)
```
**Expected Real-World Impact:**
- **Enterprise Scraping**: Access previously blocked corporate sites and databases
- **Market Research**: Gather data from competitor sites with protection
- **Price Monitoring**: Track e-commerce sites that block automated access
- **Content Aggregation**: Collect news and social media despite anti-bot measures
- **Compliance Testing**: Verify your own site's bot protection effectiveness
## 🧠 Memory Monitoring & Optimization
**The Problem:** Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites.
**My Solution:** Built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
### Memory Tracking Implementation
```python
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
# Monitor memory during crawling
monitor = MemoryMonitor()
async with AsyncWebCrawler() as crawler:
# Start monitoring
monitor.start_monitoring()
# Perform memory-intensive operations
results = await crawler.arun_many([
"https://heavy-js-site.com",
"https://large-images-site.com",
"https://dynamic-content-site.com"
])
# Get detailed memory report
memory_report = monitor.get_report()
print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
# Automatic cleanup suggestions
if memory_report['peak_mb'] > 1000: # > 1GB
print("💡 Consider batch size optimization")
print("💡 Enable aggressive garbage collection")
```
**Expected Real-World Impact:**
- **Production Stability**: Prevent memory-related crashes in long-running services
- **Cost Optimization**: Right-size server resources based on actual usage
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
- **Scalability Planning**: Understand memory patterns for horizontal scaling
## 📊 Enhanced Table Extraction
**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.
**My Solution:** Dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.
### New Table Access Pattern
```python
# Old way (deprecated)
# tables_data = result.media.get('tables', [])
# New way (v0.7.3+)
result = await crawler.arun("https://site-with-tables.com")
# Direct table access
if result.tables:
print(f"Found {len(result.tables)} tables")
# Convert to pandas DataFrame instantly
import pandas as pd
for i, table in enumerate(result.tables):
df = pd.DataFrame(table['data'])
print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
print(df.head())
# Table metadata
print(f"Source: {table.get('source_xpath', 'Unknown')}")
print(f"Headers: {table.get('headers', [])}")
```
**Expected Real-World Impact:**
- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
- **ETL Pipelines**: Cleaner integration with data processing workflows
- **Reporting**: Simplified table extraction for automated reporting systems
## 💰 Community Support: GitHub Sponsors
I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.
**Sponsorship Tiers:**
- **🌱 Supporter ($5/month)**: Community support + early feature previews
- **🚀 Professional ($25/month)**: Priority support + beta access
- **🏢 Business ($100/month)**: Direct consultation + custom integrations
- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development
**Why Sponsor?**
- Ensure continuous development and maintenance
- Get priority support and feature requests
- Access to premium documentation and examples
- Direct line to the development team
[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
## 🐳 Docker: Flexible LLM Provider Configuration
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
### Deployment Flexibility
```bash
# Option 1: Direct environment variables
docker run -d \
-e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
-e GROQ_API_KEY="your-key" \
-p 11235:11235 \
unclecode/crawl4ai:latest
# Option 2: Using .llm.env file (recommended for production)
# Create .llm.env file:
# LLM_PROVIDER=openai/gpt-4o-mini
# OPENAI_API_KEY=your-openai-key
# GROQ_API_KEY=your-groq-key
docker run -d \
--env-file .llm.env \
-p 11235:11235 \
unclecode/crawl4ai:latest
```
Override per request when needed:
```python
# Use default provider from .llm.env
response = requests.post("http://localhost:11235/crawl", json={
"url": "https://example.com",
"extraction_strategy": {"type": "llm"}
})
# Override to use different provider for this specific request
response = requests.post("http://localhost:11235/crawl", json={
"url": "https://complex-page.com",
"extraction_strategy": {
"type": "llm",
"provider": "openai/gpt-4" # Override default
}
})
```
**Expected Real-World Impact:**
- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex
- **A/B Testing**: Compare provider performance without deployment changes
- **Fallback Strategies**: Switch providers on-the-fly during outages
- **Development Flexibility**: Test locally with one provider, deploy with another
- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands
## 🔧 Bug Fixes & Improvements
This release includes several important bug fixes that improve stability and reliability:
- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
- **Table Extraction**: Improved table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures
## 📚 Documentation Enhancements
Based on community feedback, we've updated:
- Clearer examples for multi-URL configuration
- Improved CrawlResult documentation with all available fields
- Fixed typos and inconsistencies across documentation
- Added real-world URLs in examples for better understanding
- New comprehensive demo showcasing all v0.7.3 features
## 🙏 Acknowledgments
Thanks to our contributors and the entire community for feedback and bug reports.
## 📚 Resources
- [Full Documentation](https://docs.crawl4ai.com)
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Discord Community](https://discord.gg/crawl4ai)
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)
---
*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!*
**Happy Crawling! 🕷️**
*- The Crawl4AI Team*

305
docs/blog/release-v0.7.4.md Normal file
View File

@@ -0,0 +1,305 @@
# 🚀 Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update
*August 17, 2025 • 6 min read*
---
Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Performance Update. This release introduces revolutionary LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
## 🎯 What's New at a Glance
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration with dict and string formats
- **🐳 Docker Improvements**: Better API handling and raw HTML support
## 🚀 LLMTableExtraction: Revolutionary Table Processing
**The Problem:** Complex tables with rowspan, colspan, nested structures, or massive datasets that traditional HTML parsing can't handle effectively. Large tables that exceed token limits crash extraction processes.
**My Solution:** I developed LLMTableExtraction—an intelligent table extraction strategy that uses Large Language Models with automatic chunking to handle tables of any size and complexity.
### Technical Implementation
```python
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
LLMConfig,
LLMTableExtraction,
CacheMode
)
# Configure LLM for table extraction
llm_config = LLMConfig(
provider="openai/gpt-4.1-mini",
api_token="env:OPENAI_API_KEY",
temperature=0.1, # Low temperature for consistency
max_tokens=32000
)
# Create intelligent table extraction strategy
table_strategy = LLMTableExtraction(
llm_config=llm_config,
verbose=True,
max_tries=2,
enable_chunking=True, # Handle massive tables
chunk_token_threshold=5000, # Smart chunking threshold
overlap_threshold=100, # Maintain context between chunks
extraction_type="structured" # Get structured data output
)
# Apply to crawler configuration
config = CrawlerRunConfig(
table_extraction_strategy=table_strategy,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
# Extract complex tables with intelligence
result = await crawler.arun(
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP",
config=config
)
# Access extracted tables directly
for i, table in enumerate(result.tables):
print(f"Table {i}: {len(table['data'])} rows × {len(table['headers'])} columns")
# Convert to pandas DataFrame instantly
import pandas as pd
df = pd.DataFrame(table['data'], columns=table['headers'])
print(df.head())
```
**Intelligent Chunking for Massive Tables:**
```python
# Handle tables that exceed token limits
large_table_strategy = LLMTableExtraction(
llm_config=llm_config,
enable_chunking=True,
chunk_token_threshold=3000, # Conservative threshold
overlap_threshold=150, # Preserve context
max_concurrent_chunks=3, # Parallel processing
merge_strategy="intelligent" # Smart chunk merging
)
# Process Wikipedia comparison tables, financial reports, etc.
config = CrawlerRunConfig(
table_extraction_strategy=large_table_strategy,
# Target specific table containers
css_selector="div.wikitable, table.sortable",
delay_before_return_html=2.0
)
result = await crawler.arun(
"https://en.wikipedia.org/wiki/Comparison_of_operating_systems",
config=config
)
# Tables are automatically chunked, processed, and merged
print(f"Extracted {len(result.tables)} complex tables")
for table in result.tables:
print(f"Merged table: {len(table['data'])} total rows")
```
**Advanced Features:**
- **Intelligent Chunking**: Automatically splits massive tables while preserving structure
- **Context Preservation**: Overlapping chunks maintain column relationships
- **Parallel Processing**: Concurrent chunk processing for speed
- **Smart Merging**: Reconstructs complete tables from processed chunks
- **Complex Structure Support**: Handles rowspan, colspan, nested tables
- **Metadata Extraction**: Captures table context, captions, and relationships
**Expected Real-World Impact:**
- **Financial Analysis**: Extract complex earnings tables and financial statements
- **Research & Academia**: Process large datasets from Wikipedia, research papers
- **E-commerce**: Handle product comparison tables with complex layouts
- **Government Data**: Extract census data, statistical tables from official sources
- **Competitive Intelligence**: Process competitor pricing and feature tables
## ⚡ Enhanced Concurrency: True Performance Gains
**The Problem:** The `arun_many()` method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations.
**My Solution:** I implemented true concurrency improvements in the dispatcher that enable genuine parallel processing for fast-completing tasks.
### Performance Optimization
```python
# Before v0.7.4: Sequential-like behavior for fast tasks
# After v0.7.4: True concurrency
async with AsyncWebCrawler() as crawler:
# These will now run with true concurrency
urls = [
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1"
]
# Processes in truly parallel fashion
results = await crawler.arun_many(urls)
# Performance improvement: ~4x faster for fast-completing tasks
print(f"Processed {len(results)} URLs with true concurrency")
```
**Expected Real-World Impact:**
- **API Crawling**: 3-4x faster processing of REST endpoints and API documentation
- **Batch URL Processing**: Significant speedup for large URL lists
- **Monitoring Systems**: Faster health checks and status page monitoring
- **Data Aggregation**: Improved performance for real-time data collection
## 🧹 Memory Management Refactor: Cleaner Architecture
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
### Improved Memory Handling
```python
# All memory utilities now consolidated
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor
# Enhanced memory monitoring
monitor = MemoryMonitor()
monitor.start_monitoring()
async with AsyncWebCrawler() as crawler:
# Memory-efficient batch processing
results = await crawler.arun_many(large_url_list)
# Get accurate memory metrics
memory_usage = get_true_memory_usage_percent()
memory_report = monitor.get_report()
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
```
**Expected Real-World Impact:**
- **Production Stability**: More reliable memory tracking and management
- **Code Maintainability**: Cleaner architecture for easier debugging
- **Import Clarity**: Resolved potential conflicts and import issues
- **Developer Experience**: Simpler API for memory monitoring
## 🔧 Critical Stability Fixes
### Browser Manager Race Condition Resolution
**The Problem:** Concurrent page creation in persistent browser contexts caused "Target page/context closed" errors during high-concurrency operations.
**My Solution:** Implemented thread-safe page creation with proper locking mechanisms.
```python
# Fixed: Safe concurrent page creation
browser_config = BrowserConfig(
browser_type="chromium",
use_persistent_context=True, # Now thread-safe
max_concurrent_sessions=10 # Safely handle concurrent requests
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# These concurrent operations are now stable
tasks = [crawler.arun(url) for url in url_list]
results = await asyncio.gather(*tasks) # No more race conditions
```
### Enhanced Browser Profiler
**The Problem:** Inconsistent keyboard handling across platforms and unreliable quit mechanisms.
**My Solution:** Cross-platform keyboard listeners with improved quit handling.
### Advanced URL Processing
**The Problem:** Raw URL formats (`raw://` and `raw:`) weren't properly handled, and base tag link resolution was incomplete.
**My Solution:** Enhanced URL preprocessing and base tag support.
```python
# Now properly handles all URL formats
urls = [
"https://example.com",
"raw://static-html-content",
"raw:file://local-file.html"
]
# Base tag links are now correctly resolved
config = CrawlerRunConfig(
include_links=True, # Links properly resolved with base tags
resolve_absolute_urls=True
)
```
## 🛡️ Enhanced Proxy Configuration
**The Problem:** Proxy configuration only accepted specific formats, limiting flexibility.
**My Solution:** Enhanced ProxyConfig to support both dictionary and string formats.
```python
# Multiple proxy configuration formats now supported
from crawl4ai import BrowserConfig, ProxyConfig
# String format
proxy_config = ProxyConfig("http://proxy.example.com:8080")
# Dictionary format
proxy_config = ProxyConfig({
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
# Use with crawler
browser_config = BrowserConfig(proxy_config=proxy_config)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://httpbin.org/ip")
```
## 🐳 Docker & Infrastructure Improvements
This release includes several Docker and infrastructure improvements:
- **Better API Token Handling**: Improved Docker example scripts with correct endpoints
- **Raw HTML Support**: Enhanced Docker API to handle raw HTML content properly
- **Documentation Updates**: Comprehensive Docker deployment examples
- **Test Coverage**: Expanded test suite with better coverage
## 📚 Documentation & Examples
Enhanced documentation includes:
- **LLM Table Extraction Guide**: Comprehensive examples and best practices
- **Migration Documentation**: Updated patterns for new table extraction methods
- **Docker Deployment**: Complete deployment guide with examples
- **Performance Optimization**: Guidelines for concurrent crawling
## 🙏 Acknowledgments
Thanks to our contributors and community for feedback, bug reports, and feature requests that made this release possible.
## 📚 Resources
- [Full Documentation](https://docs.crawl4ai.com)
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Discord Community](https://discord.gg/crawl4ai)
- [LLM Table Extraction Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_table_extraction_example.py)
---
*Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impossible to process, while concurrency improvements make batch operations 3-4x faster. Try the intelligent table extraction—it's a game changer for data extraction workflows!*
**Happy Crawling! 🕷️**
*- The Crawl4AI Team*

View File

@@ -0,0 +1,85 @@
# Adaptive Crawling Examples
This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
## Examples Overview
### 1. `basic_usage.py`
- Simple introduction to adaptive crawling
- Uses default statistical strategy
- Shows how to get crawl statistics and relevant content
### 2. `embedding_strategy.py` ⭐ NEW
- Demonstrates the embedding-based strategy for semantic understanding
- Shows query expansion and irrelevance detection
- Includes configuration for both local and API-based embeddings
### 3. `embedding_vs_statistical.py` ⭐ NEW
- Direct comparison between statistical and embedding strategies
- Helps you choose the right strategy for your use case
- Shows performance and accuracy trade-offs
### 4. `embedding_configuration.py` ⭐ NEW
- Advanced configuration options for embedding strategy
- Parameter tuning guide for different scenarios
- Examples for research, exploration, and quality-focused crawling
### 5. `advanced_configuration.py`
- Shows various configuration options for both strategies
- Demonstrates threshold tuning and performance optimization
### 6. `custom_strategies.py`
- How to implement your own crawling strategy
- Extends the base CrawlStrategy class
- Advanced use case for specialized requirements
### 7. `export_import_kb.py`
- Export crawled knowledge base to JSONL
- Import and continue crawling from saved state
- Useful for building persistent knowledge bases
## Quick Start
For your first adaptive crawling experience, run:
```bash
python basic_usage.py
```
To try the new embedding strategy with semantic understanding:
```bash
python embedding_strategy.py
```
To compare strategies and see which works best for your use case:
```bash
python embedding_vs_statistical.py
```
## Strategy Selection Guide
### Use Statistical Strategy (Default) When:
- Working with technical documentation
- Queries contain specific terms or code
- Speed is critical
- No API access available
### Use Embedding Strategy When:
- Queries are conceptual or ambiguous
- Need semantic understanding beyond exact matches
- Want to detect irrelevant content
- Working with diverse content sources
## Requirements
- Crawl4AI installed
- For embedding strategy with local models: `sentence-transformers`
- For embedding strategy with OpenAI: Set `OPENAI_API_KEY` environment variable
## Learn More
- [Adaptive Crawling Documentation](https://docs.crawl4ai.com/core/adaptive-crawling/)
- [Mathematical Framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md)
- [Blog: The Adaptive Crawling Revolution](https://docs.crawl4ai.com/blog/adaptive-crawling-revolution/)

View File

@@ -0,0 +1,207 @@
"""
Advanced Adaptive Crawling Configuration
This example demonstrates all configuration options available for adaptive crawling,
including threshold tuning, persistence, and custom parameters.
"""
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def main():
"""Demonstrate advanced configuration options"""
# Example 1: Custom thresholds for different use cases
print("="*60)
print("EXAMPLE 1: Custom Confidence Thresholds")
print("="*60)
# High-precision configuration (exhaustive crawling)
high_precision_config = AdaptiveConfig(
confidence_threshold=0.9, # Very high confidence required
max_pages=50, # Allow more pages
top_k_links=5, # Follow more links per page
min_gain_threshold=0.02 # Lower threshold to continue
)
# Balanced configuration (default use case)
balanced_config = AdaptiveConfig(
confidence_threshold=0.7, # Moderate confidence
max_pages=20, # Reasonable limit
top_k_links=3, # Moderate branching
min_gain_threshold=0.05 # Standard gain threshold
)
# Quick exploration configuration
quick_config = AdaptiveConfig(
confidence_threshold=0.5, # Lower confidence acceptable
max_pages=10, # Strict limit
top_k_links=2, # Minimal branching
min_gain_threshold=0.1 # High gain required
)
async with AsyncWebCrawler(verbose=False) as crawler:
# Test different configurations
for config_name, config in [
("High Precision", high_precision_config),
("Balanced", balanced_config),
("Quick Exploration", quick_config)
]:
print(f"\nTesting {config_name} configuration...")
adaptive = AdaptiveCrawler(crawler, config=config)
result = await adaptive.digest(
start_url="https://httpbin.org",
query="http headers authentication"
)
print(f" - Pages crawled: {len(result.crawled_urls)}")
print(f" - Confidence achieved: {adaptive.confidence:.2%}")
print(f" - Coverage score: {adaptive.coverage_stats['coverage']:.2f}")
# Example 2: Persistence and state management
print("\n" + "="*60)
print("EXAMPLE 2: State Persistence")
print("="*60)
state_file = "crawl_state_demo.json"
# Configuration with persistence
persistent_config = AdaptiveConfig(
confidence_threshold=0.8,
max_pages=30,
save_state=True, # Enable auto-save
state_path=state_file # Specify save location
)
async with AsyncWebCrawler(verbose=False) as crawler:
# First crawl - will be interrupted
print("\nStarting initial crawl (will interrupt after 5 pages)...")
interrupt_config = AdaptiveConfig(
confidence_threshold=0.8,
max_pages=5, # Artificially low to simulate interruption
save_state=True,
state_path=state_file
)
adaptive = AdaptiveCrawler(crawler, config=interrupt_config)
result1 = await adaptive.digest(
start_url="https://docs.python.org/3/",
query="exception handling try except finally"
)
print(f"First crawl completed: {len(result1.crawled_urls)} pages")
print(f"Confidence reached: {adaptive.confidence:.2%}")
# Resume crawl with higher page limit
print("\nResuming crawl from saved state...")
resume_config = AdaptiveConfig(
confidence_threshold=0.8,
max_pages=20, # Increase limit
save_state=True,
state_path=state_file
)
adaptive2 = AdaptiveCrawler(crawler, config=resume_config)
result2 = await adaptive2.digest(
start_url="https://docs.python.org/3/",
query="exception handling try except finally",
resume_from=state_file
)
print(f"Resumed crawl completed: {len(result2.crawled_urls)} total pages")
print(f"Final confidence: {adaptive2.confidence:.2%}")
# Clean up
Path(state_file).unlink(missing_ok=True)
# Example 3: Link selection strategies
print("\n" + "="*60)
print("EXAMPLE 3: Link Selection Strategies")
print("="*60)
# Conservative link following
conservative_config = AdaptiveConfig(
confidence_threshold=0.7,
max_pages=15,
top_k_links=1, # Only follow best link
min_gain_threshold=0.15 # High threshold
)
# Aggressive link following
aggressive_config = AdaptiveConfig(
confidence_threshold=0.7,
max_pages=15,
top_k_links=10, # Follow many links
min_gain_threshold=0.01 # Very low threshold
)
async with AsyncWebCrawler(verbose=False) as crawler:
for strategy_name, config in [
("Conservative", conservative_config),
("Aggressive", aggressive_config)
]:
print(f"\n{strategy_name} link selection:")
adaptive = AdaptiveCrawler(crawler, config=config)
result = await adaptive.digest(
start_url="https://httpbin.org",
query="api endpoints"
)
# Analyze crawl pattern
print(f" - Total pages: {len(result.crawled_urls)}")
print(f" - Unique domains: {len(set(url.split('/')[2] for url in result.crawled_urls))}")
print(f" - Max depth reached: {max(url.count('/') for url in result.crawled_urls) - 2}")
# Show saturation trend
if hasattr(result, 'new_terms_history') and result.new_terms_history:
print(f" - New terms discovered: {result.new_terms_history[:5]}...")
print(f" - Saturation trend: {'decreasing' if result.new_terms_history[-1] < result.new_terms_history[0] else 'increasing'}")
# Example 4: Monitoring crawl progress
print("\n" + "="*60)
print("EXAMPLE 4: Progress Monitoring")
print("="*60)
# Configuration with detailed monitoring
monitor_config = AdaptiveConfig(
confidence_threshold=0.75,
max_pages=10,
top_k_links=3
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config=monitor_config)
# Start crawl
print("\nMonitoring crawl progress...")
result = await adaptive.digest(
start_url="https://httpbin.org",
query="http methods headers"
)
# Detailed statistics
print("\nDetailed crawl analysis:")
adaptive.print_stats(detailed=True)
# Export for analysis
print("\nExporting knowledge base for external analysis...")
adaptive.export_knowledge_base("knowledge_export_demo.jsonl")
print("Knowledge base exported to: knowledge_export_demo.jsonl")
# Show sample of exported data
with open("knowledge_export_demo.jsonl", 'r') as f:
first_line = f.readline()
print(f"Sample export: {first_line[:100]}...")
# Clean up
Path("knowledge_export_demo.jsonl").unlink(missing_ok=True)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,76 @@
"""
Basic Adaptive Crawling Example
This example demonstrates the simplest use case of adaptive crawling:
finding information about a specific topic and knowing when to stop.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
async def main():
"""Basic adaptive crawling example"""
# Initialize the crawler
async with AsyncWebCrawler(verbose=True) as crawler:
# Create an adaptive crawler with default settings (statistical strategy)
adaptive = AdaptiveCrawler(crawler)
# Note: You can also use embedding strategy for semantic understanding:
# from crawl4ai import AdaptiveConfig
# config = AdaptiveConfig(strategy="embedding")
# adaptive = AdaptiveCrawler(crawler, config)
# Start adaptive crawling
print("Starting adaptive crawl for Python async programming information...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/library/asyncio.html",
query="async await context managers coroutines"
)
# Display crawl statistics
print("\n" + "="*50)
print("CRAWL STATISTICS")
print("="*50)
adaptive.print_stats(detailed=False)
# Get the most relevant content found
print("\n" + "="*50)
print("MOST RELEVANT PAGES")
print("="*50)
relevant_pages = adaptive.get_relevant_content(top_k=5)
for i, page in enumerate(relevant_pages, 1):
print(f"\n{i}. {page['url']}")
print(f" Relevance Score: {page['score']:.2%}")
# Show a snippet of the content
content = page['content'] or ""
if content:
snippet = content[:200].replace('\n', ' ')
if len(content) > 200:
snippet += "..."
print(f" Preview: {snippet}")
# Show final confidence
print(f"\n{'='*50}")
print(f"Final Confidence: {adaptive.confidence:.2%}")
print(f"Total Pages Crawled: {len(result.crawled_urls)}")
print(f"Knowledge Base Size: {len(adaptive.state.knowledge_base)} documents")
# Example: Check if we can answer specific questions
print(f"\n{'='*50}")
print("INFORMATION SUFFICIENCY CHECK")
print(f"{'='*50}")
if adaptive.confidence >= 0.8:
print("✓ High confidence - can answer detailed questions about async Python")
elif adaptive.confidence >= 0.6:
print("~ Moderate confidence - can answer basic questions")
else:
print("✗ Low confidence - need more information")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,373 @@
"""
Custom Adaptive Crawling Strategies
This example demonstrates how to implement custom scoring strategies
for domain-specific crawling needs.
"""
import asyncio
import re
from typing import List, Dict, Set
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
from crawl4ai.adaptive_crawler import CrawlState, Link
import math
class APIDocumentationStrategy:
"""
Custom strategy optimized for API documentation crawling.
Prioritizes endpoint references, code examples, and parameter descriptions.
"""
def __init__(self):
# Keywords that indicate high-value API documentation
self.api_keywords = {
'endpoint', 'request', 'response', 'parameter', 'authentication',
'header', 'body', 'query', 'path', 'method', 'get', 'post', 'put',
'delete', 'patch', 'status', 'code', 'example', 'curl', 'python'
}
# URL patterns that typically contain API documentation
self.valuable_patterns = [
r'/api/',
r'/reference/',
r'/endpoints?/',
r'/methods?/',
r'/resources?/'
]
# Patterns to avoid
self.avoid_patterns = [
r'/blog/',
r'/news/',
r'/about/',
r'/contact/',
r'/legal/'
]
def score_link(self, link: Link, query: str, state: CrawlState) -> float:
"""Custom link scoring for API documentation"""
score = 1.0
url = link.href.lower()
# Boost API-related URLs
for pattern in self.valuable_patterns:
if re.search(pattern, url):
score *= 2.0
break
# Reduce score for non-API content
for pattern in self.avoid_patterns:
if re.search(pattern, url):
score *= 0.1
break
# Boost if preview contains API keywords
if link.text:
preview_lower = link.text.lower()
keyword_count = sum(1 for kw in self.api_keywords if kw in preview_lower)
score *= (1 + keyword_count * 0.2)
# Prioritize shallow URLs (likely overview pages)
depth = url.count('/') - 2 # Subtract protocol slashes
if depth <= 3:
score *= 1.5
elif depth > 6:
score *= 0.5
return score
def calculate_api_coverage(self, state: CrawlState, query: str) -> Dict[str, float]:
"""Calculate specialized coverage metrics for API documentation"""
metrics = {
'endpoint_coverage': 0.0,
'example_coverage': 0.0,
'parameter_coverage': 0.0
}
# Analyze knowledge base for API-specific content
endpoint_patterns = [r'GET\s+/', r'POST\s+/', r'PUT\s+/', r'DELETE\s+/']
example_patterns = [r'```\w+', r'curl\s+-', r'import\s+requests']
param_patterns = [r'param(?:eter)?s?\s*:', r'required\s*:', r'optional\s*:']
total_docs = len(state.knowledge_base)
if total_docs == 0:
return metrics
docs_with_endpoints = 0
docs_with_examples = 0
docs_with_params = 0
for doc in state.knowledge_base:
content = doc.markdown.raw_markdown if hasattr(doc, 'markdown') else str(doc)
# Check for endpoints
if any(re.search(pattern, content, re.IGNORECASE) for pattern in endpoint_patterns):
docs_with_endpoints += 1
# Check for examples
if any(re.search(pattern, content, re.IGNORECASE) for pattern in example_patterns):
docs_with_examples += 1
# Check for parameters
if any(re.search(pattern, content, re.IGNORECASE) for pattern in param_patterns):
docs_with_params += 1
metrics['endpoint_coverage'] = docs_with_endpoints / total_docs
metrics['example_coverage'] = docs_with_examples / total_docs
metrics['parameter_coverage'] = docs_with_params / total_docs
return metrics
class ResearchPaperStrategy:
"""
Strategy optimized for crawling research papers and academic content.
Prioritizes citations, abstracts, and methodology sections.
"""
def __init__(self):
self.academic_keywords = {
'abstract', 'introduction', 'methodology', 'results', 'conclusion',
'references', 'citation', 'paper', 'study', 'research', 'analysis',
'hypothesis', 'experiment', 'findings', 'doi'
}
self.citation_patterns = [
r'\[\d+\]', # [1] style citations
r'\(\w+\s+\d{4}\)', # (Author 2024) style
r'doi:\s*\S+', # DOI references
]
def calculate_academic_relevance(self, content: str, query: str) -> float:
"""Calculate relevance score for academic content"""
score = 0.0
content_lower = content.lower()
# Check for academic keywords
keyword_matches = sum(1 for kw in self.academic_keywords if kw in content_lower)
score += keyword_matches * 0.1
# Check for citations
citation_count = sum(
len(re.findall(pattern, content))
for pattern in self.citation_patterns
)
score += min(citation_count * 0.05, 1.0) # Cap at 1.0
# Check for query terms in academic context
query_terms = query.lower().split()
for term in query_terms:
# Boost if term appears near academic keywords
for keyword in ['abstract', 'conclusion', 'results']:
if keyword in content_lower:
section = content_lower[content_lower.find(keyword):content_lower.find(keyword) + 500]
if term in section:
score += 0.2
return min(score, 2.0) # Cap total score
async def demo_custom_strategies():
"""Demonstrate custom strategy usage"""
# Example 1: API Documentation Strategy
print("="*60)
print("EXAMPLE 1: Custom API Documentation Strategy")
print("="*60)
api_strategy = APIDocumentationStrategy()
async with AsyncWebCrawler() as crawler:
# Standard adaptive crawler
config = AdaptiveConfig(
confidence_threshold=0.8,
max_pages=15
)
adaptive = AdaptiveCrawler(crawler, config)
# Override link scoring with custom strategy
original_rank_links = adaptive._rank_links
def custom_rank_links(links, query, state):
# Apply custom scoring
scored_links = []
for link in links:
base_score = api_strategy.score_link(link, query, state)
scored_links.append((link, base_score))
# Sort by score
scored_links.sort(key=lambda x: x[1], reverse=True)
return [link for link, _ in scored_links[:config.top_k_links]]
adaptive._rank_links = custom_rank_links
# Crawl API documentation
print("\nCrawling API documentation with custom strategy...")
state = await adaptive.digest(
start_url="https://httpbin.org",
query="api endpoints authentication headers"
)
# Calculate custom metrics
api_metrics = api_strategy.calculate_api_coverage(state, "api endpoints")
print(f"\nResults:")
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"Confidence: {adaptive.confidence:.2%}")
print(f"\nAPI-Specific Metrics:")
print(f" - Endpoint coverage: {api_metrics['endpoint_coverage']:.2%}")
print(f" - Example coverage: {api_metrics['example_coverage']:.2%}")
print(f" - Parameter coverage: {api_metrics['parameter_coverage']:.2%}")
# Example 2: Combined Strategy
print("\n" + "="*60)
print("EXAMPLE 2: Hybrid Strategy Combining Multiple Approaches")
print("="*60)
class HybridStrategy:
"""Combines multiple strategies with weights"""
def __init__(self):
self.api_strategy = APIDocumentationStrategy()
self.research_strategy = ResearchPaperStrategy()
self.weights = {
'api': 0.7,
'research': 0.3
}
def score_content(self, content: str, query: str) -> float:
# Get scores from each strategy
api_score = self._calculate_api_score(content, query)
research_score = self.research_strategy.calculate_academic_relevance(content, query)
# Weighted combination
total_score = (
api_score * self.weights['api'] +
research_score * self.weights['research']
)
return total_score
def _calculate_api_score(self, content: str, query: str) -> float:
# Simplified API scoring based on keyword presence
content_lower = content.lower()
api_keywords = self.api_strategy.api_keywords
keyword_count = sum(1 for kw in api_keywords if kw in content_lower)
return min(keyword_count * 0.1, 2.0)
hybrid_strategy = HybridStrategy()
async with AsyncWebCrawler() as crawler:
adaptive = AdaptiveCrawler(crawler)
# Crawl with hybrid scoring
print("\nTesting hybrid strategy on technical documentation...")
state = await adaptive.digest(
start_url="https://docs.python.org/3/library/asyncio.html",
query="async await coroutines api"
)
# Analyze results with hybrid strategy
print(f"\nHybrid Strategy Analysis:")
total_score = 0
for doc in adaptive.get_relevant_content(top_k=5):
content = doc['content'] or ""
score = hybrid_strategy.score_content(content, "async await api")
total_score += score
print(f" - {doc['url'][:50]}... Score: {score:.2f}")
print(f"\nAverage hybrid score: {total_score/5:.2f}")
async def demo_performance_optimization():
"""Demonstrate performance optimization with custom strategies"""
print("\n" + "="*60)
print("EXAMPLE 3: Performance-Optimized Strategy")
print("="*60)
class PerformanceOptimizedStrategy:
"""Strategy that balances thoroughness with speed"""
def __init__(self):
self.url_cache: Set[str] = set()
self.domain_scores: Dict[str, float] = {}
def should_crawl_domain(self, url: str) -> bool:
"""Implement domain-level filtering"""
domain = url.split('/')[2] if url.startswith('http') else url
# Skip if we've already crawled many pages from this domain
domain_count = sum(1 for cached in self.url_cache if domain in cached)
if domain_count > 5:
return False
# Skip low-scoring domains
if domain in self.domain_scores and self.domain_scores[domain] < 0.3:
return False
return True
def update_domain_score(self, url: str, relevance: float):
"""Track domain-level performance"""
domain = url.split('/')[2] if url.startswith('http') else url
if domain not in self.domain_scores:
self.domain_scores[domain] = relevance
else:
# Moving average
self.domain_scores[domain] = (
0.7 * self.domain_scores[domain] + 0.3 * relevance
)
perf_strategy = PerformanceOptimizedStrategy()
async with AsyncWebCrawler() as crawler:
config = AdaptiveConfig(
confidence_threshold=0.7,
max_pages=10,
top_k_links=2 # Fewer links for speed
)
adaptive = AdaptiveCrawler(crawler, config)
# Track performance
import time
start_time = time.time()
state = await adaptive.digest(
start_url="https://httpbin.org",
query="http methods headers"
)
elapsed = time.time() - start_time
print(f"\nPerformance Results:")
print(f" - Time elapsed: {elapsed:.2f} seconds")
print(f" - Pages crawled: {len(state.crawled_urls)}")
print(f" - Pages per second: {len(state.crawled_urls)/elapsed:.2f}")
print(f" - Final confidence: {adaptive.confidence:.2%}")
print(f" - Efficiency: {adaptive.confidence/len(state.crawled_urls):.2%} confidence per page")
async def main():
"""Run all demonstrations"""
try:
await demo_custom_strategies()
await demo_performance_optimization()
print("\n" + "="*60)
print("All custom strategy examples completed!")
print("="*60)
except Exception as e:
print(f"Error: {e}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,206 @@
"""
Advanced Embedding Configuration Example
This example demonstrates all configuration options available for the
embedding strategy, including fine-tuning parameters for different use cases.
"""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def test_configuration(name: str, config: AdaptiveConfig, url: str, query: str):
"""Test a specific configuration"""
print(f"\n{'='*60}")
print(f"Configuration: {name}")
print(f"{'='*60}")
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(start_url=url, query=query)
print(f"Pages crawled: {len(result.crawled_urls)}")
print(f"Final confidence: {adaptive.confidence:.1%}")
print(f"Stopped reason: {result.metrics.get('stopped_reason', 'max_pages')}")
if result.metrics.get('is_irrelevant', False):
print("⚠️ Query detected as irrelevant!")
return result
async def main():
"""Demonstrate various embedding configurations"""
print("EMBEDDING STRATEGY CONFIGURATION EXAMPLES")
print("=" * 60)
# Base URL and query for testing
test_url = "https://docs.python.org/3/library/asyncio.html"
# 1. Default Configuration
config_default = AdaptiveConfig(
strategy="embedding",
max_pages=10
)
await test_configuration(
"Default Settings",
config_default,
test_url,
"async programming patterns"
)
# 2. Strict Coverage Requirements
config_strict = AdaptiveConfig(
strategy="embedding",
max_pages=20,
# Stricter similarity requirements
embedding_k_exp=5.0, # Default is 3.0, higher = stricter
embedding_coverage_radius=0.15, # Default is 0.2, lower = stricter
# Higher validation threshold
embedding_validation_min_score=0.6, # Default is 0.3
# More query variations for better coverage
n_query_variations=15 # Default is 10
)
await test_configuration(
"Strict Coverage (Research/Academic)",
config_strict,
test_url,
"comprehensive guide async await"
)
# 3. Fast Exploration
config_fast = AdaptiveConfig(
strategy="embedding",
max_pages=10,
top_k_links=5, # Follow more links per page
# Relaxed requirements for faster convergence
embedding_k_exp=1.0, # Lower = more lenient
embedding_min_relative_improvement=0.05, # Stop earlier
# Lower quality thresholds
embedding_quality_min_confidence=0.5, # Display lower confidence
embedding_quality_max_confidence=0.85,
# Fewer query variations for speed
n_query_variations=5
)
await test_configuration(
"Fast Exploration (Quick Overview)",
config_fast,
test_url,
"async basics"
)
# 4. Irrelevance Detection Focus
config_irrelevance = AdaptiveConfig(
strategy="embedding",
max_pages=5,
# Aggressive irrelevance detection
embedding_min_confidence_threshold=0.2, # Higher threshold (default 0.1)
embedding_k_exp=5.0, # Strict similarity
# Quick stopping for irrelevant content
embedding_min_relative_improvement=0.15
)
await test_configuration(
"Irrelevance Detection",
config_irrelevance,
test_url,
"recipe for chocolate cake" # Irrelevant query
)
# 5. High-Quality Knowledge Base
config_quality = AdaptiveConfig(
strategy="embedding",
max_pages=30,
# Deduplication settings
embedding_overlap_threshold=0.75, # More aggressive deduplication
# Quality focus
embedding_validation_min_score=0.5,
embedding_quality_scale_factor=1.0, # Linear quality mapping
# Balanced parameters
embedding_k_exp=3.0,
embedding_nearest_weight=0.8, # Focus on best matches
embedding_top_k_weight=0.2
)
await test_configuration(
"High-Quality Knowledge Base",
config_quality,
test_url,
"asyncio advanced patterns best practices"
)
# 6. Custom Embedding Provider
if os.getenv('OPENAI_API_KEY'):
config_openai = AdaptiveConfig(
strategy="embedding",
max_pages=10,
# Use OpenAI embeddings
embedding_llm_config={
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
},
# OpenAI embeddings are high quality, can be stricter
embedding_k_exp=4.0,
n_query_variations=12
)
await test_configuration(
"OpenAI Embeddings",
config_openai,
test_url,
"event-driven architecture patterns"
)
# Parameter Guide
print("\n" + "="*60)
print("PARAMETER TUNING GUIDE")
print("="*60)
print("\n📊 Key Parameters and Their Effects:")
print("\n1. embedding_k_exp (default: 3.0)")
print(" - Lower (1-2): More lenient, faster convergence")
print(" - Higher (4-5): Stricter, better precision")
print("\n2. embedding_coverage_radius (default: 0.2)")
print(" - Lower (0.1-0.15): Requires closer matches")
print(" - Higher (0.25-0.3): Accepts broader matches")
print("\n3. n_query_variations (default: 10)")
print(" - Lower (5-7): Faster, less comprehensive")
print(" - Higher (15-20): Better coverage, slower")
print("\n4. embedding_min_confidence_threshold (default: 0.1)")
print(" - Set to 0.15-0.2 for aggressive irrelevance detection")
print(" - Set to 0.05 to crawl even barely relevant content")
print("\n5. embedding_validation_min_score (default: 0.3)")
print(" - Higher (0.5-0.6): Requires strong validation")
print(" - Lower (0.2): More permissive stopping")
print("\n💡 Tips:")
print("- For research: High k_exp, more variations, strict validation")
print("- For exploration: Low k_exp, fewer variations, relaxed thresholds")
print("- For quality: Focus on overlap_threshold and validation scores")
print("- For speed: Reduce variations, increase min_relative_improvement")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,109 @@
"""
Embedding Strategy Example for Adaptive Crawling
This example demonstrates how to use the embedding-based strategy
for semantic understanding and intelligent crawling.
"""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def main():
"""Demonstrate embedding strategy for adaptive crawling"""
# Configure embedding strategy
config = AdaptiveConfig(
strategy="embedding", # Use embedding strategy
embedding_model="sentence-transformers/all-MiniLM-L6-v2", # Default model
n_query_variations=10, # Generate 10 semantic variations
max_pages=15,
top_k_links=3,
min_gain_threshold=0.05,
# Embedding-specific parameters
embedding_k_exp=3.0, # Higher = stricter similarity requirements
embedding_min_confidence_threshold=0.1, # Stop if <10% relevant
embedding_validation_min_score=0.4 # Validation threshold
)
# Optional: Use OpenAI embeddings instead
if os.getenv('OPENAI_API_KEY'):
config.embedding_llm_config = {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}
print("Using OpenAI embeddings")
else:
print("Using sentence-transformers (local embeddings)")
async with AsyncWebCrawler(verbose=True) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
# Test 1: Relevant query with semantic understanding
print("\n" + "="*50)
print("TEST 1: Semantic Query Understanding")
print("="*50)
result = await adaptive.digest(
start_url="https://docs.python.org/3/library/asyncio.html",
query="concurrent programming event-driven architecture"
)
print("\nQuery Expansion:")
print(f"Original query expanded to {len(result.expanded_queries)} variations")
for i, q in enumerate(result.expanded_queries[:3], 1):
print(f" {i}. {q}")
print(" ...")
print("\nResults:")
adaptive.print_stats(detailed=False)
# Test 2: Detecting irrelevant queries
print("\n" + "="*50)
print("TEST 2: Irrelevant Query Detection")
print("="*50)
# Reset crawler for new query
adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
start_url="https://docs.python.org/3/library/asyncio.html",
query="how to bake chocolate chip cookies"
)
if result.metrics.get('is_irrelevant', False):
print("\n✅ Successfully detected irrelevant query!")
print(f"Stopped after just {len(result.crawled_urls)} pages")
print(f"Reason: {result.metrics.get('stopped_reason', 'unknown')}")
else:
print("\n❌ Failed to detect irrelevance")
print(f"Final confidence: {adaptive.confidence:.1%}")
# Test 3: Semantic gap analysis
print("\n" + "="*50)
print("TEST 3: Semantic Gap Analysis")
print("="*50)
# Show how embedding strategy identifies gaps
adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
start_url="https://realpython.com",
query="python decorators advanced patterns"
)
print(f"\nSemantic gaps identified: {len(result.semantic_gaps)}")
print(f"Knowledge base embeddings shape: {result.kb_embeddings.shape if result.kb_embeddings is not None else 'None'}")
# Show coverage metrics specific to embedding strategy
print("\nEmbedding-specific metrics:")
print(f" Average best similarity: {result.metrics.get('avg_best_similarity', 0):.3f}")
print(f" Coverage score: {result.metrics.get('coverage_score', 0):.3f}")
print(f" Validation confidence: {result.metrics.get('validation_confidence', 0):.2%}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,167 @@
"""
Comparison: Embedding vs Statistical Strategy
This example demonstrates the differences between statistical and embedding
strategies for adaptive crawling, showing when to use each approach.
"""
import asyncio
import time
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def crawl_with_strategy(url: str, query: str, strategy: str, **kwargs):
"""Helper function to crawl with a specific strategy"""
config = AdaptiveConfig(
strategy=strategy,
max_pages=20,
top_k_links=3,
min_gain_threshold=0.05,
**kwargs
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
start_time = time.time()
result = await adaptive.digest(start_url=url, query=query)
elapsed = time.time() - start_time
return {
'result': result,
'crawler': adaptive,
'elapsed': elapsed,
'pages': len(result.crawled_urls),
'confidence': adaptive.confidence
}
async def main():
"""Compare embedding and statistical strategies"""
# Test scenarios
test_cases = [
{
'name': 'Technical Documentation (Specific Terms)',
'url': 'https://docs.python.org/3/library/asyncio.html',
'query': 'asyncio.create_task event_loop.run_until_complete'
},
{
'name': 'Conceptual Query (Semantic Understanding)',
'url': 'https://docs.python.org/3/library/asyncio.html',
'query': 'concurrent programming patterns'
},
{
'name': 'Ambiguous Query',
'url': 'https://realpython.com',
'query': 'python performance optimization'
}
]
# Configure embedding strategy
embedding_config = {}
if os.getenv('OPENAI_API_KEY'):
embedding_config['embedding_llm_config'] = {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}
for test in test_cases:
print("\n" + "="*70)
print(f"TEST: {test['name']}")
print(f"URL: {test['url']}")
print(f"Query: '{test['query']}'")
print("="*70)
# Run statistical strategy
print("\n📊 Statistical Strategy:")
stat_result = await crawl_with_strategy(
test['url'],
test['query'],
'statistical'
)
print(f" Pages crawled: {stat_result['pages']}")
print(f" Time taken: {stat_result['elapsed']:.2f}s")
print(f" Confidence: {stat_result['confidence']:.1%}")
print(f" Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
# Show term coverage
if hasattr(stat_result['result'], 'term_frequencies'):
query_terms = test['query'].lower().split()
covered = sum(1 for term in query_terms
if term in stat_result['result'].term_frequencies)
print(f" Term coverage: {covered}/{len(query_terms)} query terms found")
# Run embedding strategy
print("\n🧠 Embedding Strategy:")
emb_result = await crawl_with_strategy(
test['url'],
test['query'],
'embedding',
**embedding_config
)
print(f" Pages crawled: {emb_result['pages']}")
print(f" Time taken: {emb_result['elapsed']:.2f}s")
print(f" Confidence: {emb_result['confidence']:.1%}")
print(f" Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
# Show semantic understanding
if emb_result['result'].expanded_queries:
print(f" Query variations: {len(emb_result['result'].expanded_queries)}")
print(f" Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
# Compare results
print("\n📈 Comparison:")
efficiency_diff = ((stat_result['pages'] - emb_result['pages']) /
stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
print(f" Efficiency: ", end="")
if efficiency_diff > 0:
print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
else:
print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
print(f" Speed: ", end="")
if stat_result['elapsed'] < emb_result['elapsed']:
print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
else:
print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
print(f" Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
# Recommendation
print("\n💡 Recommendation:")
if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
print(" → Statistical strategy is likely better for this use case (specific terms)")
elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
print(" → Embedding strategy is likely better for this use case (semantic understanding)")
else:
if emb_result['confidence'] > stat_result['confidence'] + 0.1:
print(" → Embedding strategy achieved significantly better understanding")
elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
print(" → Statistical strategy is much faster with similar results")
else:
print(" → Both strategies performed similarly; choose based on your priorities")
# Summary recommendations
print("\n" + "="*70)
print("STRATEGY SELECTION GUIDE")
print("="*70)
print("\n✅ Use STATISTICAL strategy when:")
print(" - Queries contain specific technical terms")
print(" - Speed is critical")
print(" - No API access available")
print(" - Working with well-structured documentation")
print("\n✅ Use EMBEDDING strategy when:")
print(" - Queries are conceptual or ambiguous")
print(" - Semantic understanding is important")
print(" - Need to detect irrelevant content")
print(" - Working with diverse content sources")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,232 @@
"""
Knowledge Base Export and Import
This example demonstrates how to export crawled knowledge bases and
import them for reuse, sharing, or analysis.
"""
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def build_knowledge_base():
"""Build a knowledge base about web technologies"""
print("="*60)
print("PHASE 1: Building Knowledge Base")
print("="*60)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler)
# Crawl information about HTTP
print("\n1. Gathering HTTP protocol information...")
await adaptive.digest(
start_url="https://httpbin.org",
query="http methods headers status codes"
)
print(f" - Pages crawled: {len(adaptive.state.crawled_urls)}")
print(f" - Confidence: {adaptive.confidence:.2%}")
# Add more information about APIs
print("\n2. Adding API documentation knowledge...")
await adaptive.digest(
start_url="https://httpbin.org/anything",
query="rest api json response request"
)
print(f" - Total pages: {len(adaptive.state.crawled_urls)}")
print(f" - Confidence: {adaptive.confidence:.2%}")
# Export the knowledge base
export_path = "web_tech_knowledge.jsonl"
print(f"\n3. Exporting knowledge base to {export_path}")
adaptive.export_knowledge_base(export_path)
# Show export statistics
export_size = Path(export_path).stat().st_size / 1024
with open(export_path, 'r') as f:
line_count = sum(1 for _ in f)
print(f" - Exported {line_count} documents")
print(f" - File size: {export_size:.1f} KB")
return export_path
async def analyze_knowledge_base(kb_path):
"""Analyze the exported knowledge base"""
print("\n" + "="*60)
print("PHASE 2: Analyzing Exported Knowledge Base")
print("="*60)
# Read and analyze JSONL
documents = []
with open(kb_path, 'r') as f:
for line in f:
documents.append(json.loads(line))
print(f"\nKnowledge base contains {len(documents)} documents:")
# Analyze document properties
total_content_length = 0
urls_by_domain = {}
for doc in documents:
# Content analysis
content_length = len(doc.get('content', ''))
total_content_length += content_length
# URL analysis
url = doc.get('url', '')
domain = url.split('/')[2] if url.startswith('http') else 'unknown'
urls_by_domain[domain] = urls_by_domain.get(domain, 0) + 1
# Show sample document
if documents.index(doc) == 0:
print(f"\nSample document structure:")
print(f" - URL: {url}")
print(f" - Content length: {content_length} chars")
print(f" - Has metadata: {'metadata' in doc}")
print(f" - Has links: {len(doc.get('links', []))} links")
print(f" - Query: {doc.get('query', 'N/A')}")
print(f"\nContent statistics:")
print(f" - Total content: {total_content_length:,} characters")
print(f" - Average per document: {total_content_length/len(documents):,.0f} chars")
print(f"\nDomain distribution:")
for domain, count in urls_by_domain.items():
print(f" - {domain}: {count} pages")
async def import_and_continue():
"""Import a knowledge base and continue crawling"""
print("\n" + "="*60)
print("PHASE 3: Importing and Extending Knowledge Base")
print("="*60)
kb_path = "web_tech_knowledge.jsonl"
async with AsyncWebCrawler(verbose=False) as crawler:
# Create new adaptive crawler
adaptive = AdaptiveCrawler(crawler)
# Import existing knowledge base
print(f"\n1. Importing knowledge base from {kb_path}")
adaptive.import_knowledge_base(kb_path)
print(f" - Imported {len(adaptive.state.knowledge_base)} documents")
print(f" - Existing URLs: {len(adaptive.state.crawled_urls)}")
# Check current state
print("\n2. Checking imported knowledge state:")
adaptive.print_stats(detailed=False)
# Continue crawling with new query
print("\n3. Extending knowledge with new query...")
await adaptive.digest(
start_url="https://httpbin.org/status/200",
query="error handling retry timeout"
)
print("\n4. Final knowledge base state:")
adaptive.print_stats(detailed=False)
# Export extended knowledge base
extended_path = "web_tech_knowledge_extended.jsonl"
adaptive.export_knowledge_base(extended_path)
print(f"\n5. Extended knowledge base exported to {extended_path}")
async def share_knowledge_bases():
"""Demonstrate sharing knowledge bases between projects"""
print("\n" + "="*60)
print("PHASE 4: Sharing Knowledge Between Projects")
print("="*60)
# Simulate two different projects
project_a_kb = "project_a_knowledge.jsonl"
project_b_kb = "project_b_knowledge.jsonl"
async with AsyncWebCrawler(verbose=False) as crawler:
# Project A: Security documentation
print("\n1. Project A: Building security knowledge...")
crawler_a = AdaptiveCrawler(crawler)
await crawler_a.digest(
start_url="https://httpbin.org/basic-auth/user/pass",
query="authentication security headers"
)
crawler_a.export_knowledge_base(project_a_kb)
print(f" - Exported {len(crawler_a.state.knowledge_base)} documents")
# Project B: API testing
print("\n2. Project B: Building testing knowledge...")
crawler_b = AdaptiveCrawler(crawler)
await crawler_b.digest(
start_url="https://httpbin.org/anything",
query="testing endpoints mocking"
)
crawler_b.export_knowledge_base(project_b_kb)
print(f" - Exported {len(crawler_b.state.knowledge_base)} documents")
# Merge knowledge bases
print("\n3. Merging knowledge bases...")
merged_crawler = AdaptiveCrawler(crawler)
# Import both knowledge bases
merged_crawler.import_knowledge_base(project_a_kb)
initial_size = len(merged_crawler.state.knowledge_base)
merged_crawler.import_knowledge_base(project_b_kb)
final_size = len(merged_crawler.state.knowledge_base)
print(f" - Project A documents: {initial_size}")
print(f" - Additional from Project B: {final_size - initial_size}")
print(f" - Total merged documents: {final_size}")
# Export merged knowledge
merged_kb = "merged_knowledge.jsonl"
merged_crawler.export_knowledge_base(merged_kb)
print(f"\n4. Merged knowledge base exported to {merged_kb}")
# Show combined coverage
print("\n5. Combined knowledge coverage:")
merged_crawler.print_stats(detailed=False)
async def main():
"""Run all examples"""
try:
# Build initial knowledge base
kb_path = await build_knowledge_base()
# Analyze the export
await analyze_knowledge_base(kb_path)
# Import and extend
await import_and_continue()
# Demonstrate sharing
await share_knowledge_bases()
print("\n" + "="*60)
print("All examples completed successfully!")
print("="*60)
finally:
# Clean up generated files
print("\nCleaning up generated files...")
for file in [
"web_tech_knowledge.jsonl",
"web_tech_knowledge_extended.jsonl",
"project_a_knowledge.jsonl",
"project_b_knowledge.jsonl",
"merged_knowledge.jsonl"
]:
Path(file).unlink(missing_ok=True)
print("Cleanup complete.")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -3,8 +3,8 @@ C4A-Script API Usage Examples
Shows how to use the new Result-based API in various scenarios
"""
from c4a_compile import compile, validate, compile_file
from c4a_result import CompilationResult, ValidationResult
from crawl4ai.script.c4a_compile import compile, validate, compile_file
from crawl4ai.script.c4a_result import CompilationResult, ValidationResult
import json

View File

@@ -3,7 +3,7 @@ C4A-Script Hello World
A concise example showing how to use the C4A-Script compiler
"""
from c4a_compile import compile
from crawl4ai.script.c4a_compile import compile
# Define your C4A-Script
script = """

View File

@@ -3,7 +3,7 @@ C4A-Script Hello World - Error Example
Shows how error handling works
"""
from c4a_compile import compile
from crawl4ai.script.c4a_compile import compile
# Define a script with an error (missing THEN)
script = """

View File

@@ -0,0 +1,303 @@
"""
🎯 Multi-Config URL Matching Demo
=================================
Learn how to use different crawler configurations for different URL patterns
in a single crawl batch with Crawl4AI's multi-config feature.
Part 1: Understanding URL Matching (Pattern Testing)
Part 2: Practical Example with Real Crawling
"""
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
MatchMode
)
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
def print_section(title):
"""Print a formatted section header"""
print(f"\n{'=' * 60}")
print(f"{title}")
print(f"{'=' * 60}\n")
def test_url_matching(config, test_urls, config_name):
"""Test URL matching for a config and show results"""
print(f"Config: {config_name}")
print(f"Matcher: {config.url_matcher}")
if hasattr(config, 'match_mode'):
print(f"Mode: {config.match_mode.value}")
print("-" * 40)
for url in test_urls:
matches = config.is_match(url)
symbol = "" if matches else ""
print(f"{symbol} {url}")
print()
# ==============================================================================
# PART 1: Understanding URL Matching
# ==============================================================================
def demo_part1_pattern_matching():
"""Part 1: Learn how URL matching works without crawling"""
print_section("PART 1: Understanding URL Matching")
print("Let's explore different ways to match URLs with configs.\n")
# Test URLs we'll use throughout
test_urls = [
"https://example.com/report.pdf",
"https://example.com/data.json",
"https://example.com/blog/post-1",
"https://example.com/article/news",
"https://api.example.com/v1/users",
"https://example.com/about"
]
# 1.1 Simple String Pattern
print("1.1 Simple String Pattern Matching")
print("-" * 40)
pdf_config = CrawlerRunConfig(
url_matcher="*.pdf"
)
test_url_matching(pdf_config, test_urls, "PDF Config")
# 1.2 Multiple String Patterns
print("1.2 Multiple String Patterns (OR logic)")
print("-" * 40)
blog_config = CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
match_mode=MatchMode.OR # This is default, shown for clarity
)
test_url_matching(blog_config, test_urls, "Blog/Article Config")
# 1.3 Single Function Matcher
print("1.3 Function-based Matching")
print("-" * 40)
api_config = CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json')
)
test_url_matching(api_config, test_urls, "API Config")
# 1.4 List of Functions
print("1.4 Multiple Functions with AND Logic")
print("-" * 40)
# Must be HTTPS AND contain 'api' AND have version number
secure_api_config = CrawlerRunConfig(
url_matcher=[
lambda url: url.startswith('https://'),
lambda url: 'api' in url,
lambda url: '/v' in url # Version indicator
],
match_mode=MatchMode.AND
)
test_url_matching(secure_api_config, test_urls, "Secure API Config")
# 1.5 Mixed: String and Function Together
print("1.5 Mixed Patterns: String + Function")
print("-" * 40)
# Match JSON files OR any API endpoint
json_or_api_config = CrawlerRunConfig(
url_matcher=[
"*.json", # String pattern
lambda url: 'api' in url # Function
],
match_mode=MatchMode.OR
)
test_url_matching(json_or_api_config, test_urls, "JSON or API Config")
# 1.6 Complex: Multiple Strings + Multiple Functions
print("1.6 Complex Matcher: Mixed Types with AND Logic")
print("-" * 40)
# Must be: HTTPS AND (.com domain) AND (blog OR article) AND NOT a PDF
complex_config = CrawlerRunConfig(
url_matcher=[
lambda url: url.startswith('https://'), # Function: HTTPS check
"*.com/*", # String: .com domain
lambda url: any(pattern in url for pattern in ['/blog/', '/article/']), # Function: Blog OR article
lambda url: not url.endswith('.pdf') # Function: Not PDF
],
match_mode=MatchMode.AND
)
test_url_matching(complex_config, test_urls, "Complex Mixed Config")
print("\n✅ Key Takeaway: First matching config wins when passed to arun_many()!")
# ==============================================================================
# PART 2: Practical Multi-URL Crawling
# ==============================================================================
async def demo_part2_practical_crawling():
"""Part 2: Real-world example with different content types"""
print_section("PART 2: Practical Multi-URL Crawling")
print("Now let's see multi-config in action with real URLs.\n")
# Create specialized configs for different content types
configs = [
# Config 1: PDF documents - only match files ending with .pdf
CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
),
# Config 2: Blog/article pages with content filtering
CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48)
)
),
# Config 3: Dynamic pages requiring JavaScript
CrawlerRunConfig(
url_matcher=lambda url: 'github.com' in url,
js_code="window.scrollTo(0, 500);" # Scroll to load content
),
# Config 4: Mixed matcher - API endpoints (string OR function)
CrawlerRunConfig(
url_matcher=[
"*.json", # String pattern for JSON files
lambda url: 'api' in url or 'httpbin.org' in url # Function for API endpoints
],
match_mode=MatchMode.OR,
),
# Config 5: Complex matcher - Secure documentation sites
CrawlerRunConfig(
url_matcher=[
lambda url: url.startswith('https://'), # Must be HTTPS
"*.org/*", # String: .org domain
lambda url: any(doc in url for doc in ['docs', 'documentation', 'reference']), # Has docs
lambda url: not url.endswith(('.pdf', '.json')) # Not PDF or JSON
],
match_mode=MatchMode.AND,
# wait_for="css:.content, css:article" # Wait for content to load
),
# Default config for everything else
# CrawlerRunConfig() # No url_matcher means it matches everything (use it as fallback)
]
# URLs to crawl - each will use a different config
urls = [
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # → PDF config
"https://blog.python.org/", # → Blog config with content filter
"https://github.com/microsoft/playwright", # → JS config
"https://httpbin.org/json", # → Mixed matcher config (API)
"https://docs.python.org/3/reference/", # → Complex matcher config
"https://www.w3schools.com/", # → Default config, if you uncomment the default config line above, if not you will see `Error: No matching configuration`
]
print("URLs to crawl:")
for i, url in enumerate(urls, 1):
print(f"{i}. {url}")
print("\nCrawling with appropriate config for each URL...\n")
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
config=configs
)
# Display results
print("Results:")
print("-" * 60)
for result in results:
if result.success:
# Determine which config was used
config_type = "Default"
if result.url.endswith('.pdf'):
config_type = "PDF Strategy"
elif any(pattern in result.url for pattern in ['blog', 'python.org']) and 'docs' not in result.url:
config_type = "Blog + Content Filter"
elif 'github.com' in result.url:
config_type = "JavaScript Enabled"
elif 'httpbin.org' in result.url or result.url.endswith('.json'):
config_type = "Mixed Matcher (API)"
elif 'docs.python.org' in result.url:
config_type = "Complex Matcher (Secure Docs)"
print(f"\n{result.url}")
print(f" Config used: {config_type}")
print(f" Content size: {len(result.markdown)} chars")
# Show if we have fit_markdown (from content filter)
if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
print(f" Fit markdown size: {len(result.markdown.fit_markdown)} chars")
reduction = (1 - len(result.markdown.fit_markdown) / len(result.markdown)) * 100
print(f" Content reduced by: {reduction:.1f}%")
# Show extracted data if using extraction strategy
if hasattr(result, 'extracted_content') and result.extracted_content:
print(f" Extracted data: {str(result.extracted_content)[:100]}...")
else:
print(f"\n{result.url}")
print(f" Error: {result.error_message}")
print("\n" + "=" * 60)
print("✅ Multi-config crawling complete!")
print("\nBenefits demonstrated:")
print("- PDFs handled with specialized scraper")
print("- Blog content filtered for relevance")
print("- JavaScript executed only where needed")
print("- Mixed matchers (string + function) for flexible matching")
print("- Complex matchers for precise URL targeting")
print("- Each URL got optimal configuration automatically!")
async def main():
"""Run both parts of the demo"""
print("""
🎯 Multi-Config URL Matching Demo
=================================
Learn how Crawl4AI can use different configurations
for different URLs in a single batch.
""")
# Part 1: Pattern matching
demo_part1_pattern_matching()
print("\nPress Enter to continue to Part 2...")
try:
input()
except EOFError:
# Running in non-interactive mode, skip input
pass
# Part 2: Practical crawling
await demo_part2_practical_crawling()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -8,26 +8,20 @@ from typing import Dict, Any
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
def __init__(self, base_url: str = "http://localhost:11235"):
self.base_url = base_url
self.api_token = (
api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
) # Check environment variable as fallback
self.headers = (
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
)
def submit_and_wait(
self, request_data: Dict[str, Any], timeout: int = 300
) -> Dict[str, Any]:
# Submit crawl job
# Submit crawl job using async endpoint
response = requests.post(
f"{self.base_url}/crawl", json=request_data, headers=self.headers
f"{self.base_url}/crawl/job", json=request_data
)
if response.status_code == 403:
raise Exception("API token is invalid or missing")
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
response.raise_for_status()
job_response = response.json()
task_id = job_response["task_id"]
print(f"Submitted job with task_id: {task_id}")
# Poll for result
start_time = time.time()
@@ -38,8 +32,9 @@ class Crawl4AiTester:
)
result = requests.get(
f"{self.base_url}/task/{task_id}", headers=self.headers
f"{self.base_url}/crawl/job/{task_id}"
)
result.raise_for_status()
status = result.json()
if status["status"] == "failed":
@@ -52,10 +47,10 @@ class Crawl4AiTester:
time.sleep(2)
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
# Use synchronous crawl endpoint
response = requests.post(
f"{self.base_url}/crawl_sync",
f"{self.base_url}/crawl",
json=request_data,
headers=self.headers,
timeout=60,
)
if response.status_code == 408:
@@ -63,20 +58,9 @@ class Crawl4AiTester:
response.raise_for_status()
return response.json()
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Directly crawl without using task queue"""
response = requests.post(
f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
)
response.raise_for_status()
return response.json()
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester(
base_url="http://localhost:11235",
# base_url="https://api.crawl4ai.com" # just for example
# api_token="test" # just for example
)
print(f"Testing Crawl4AI Docker {version} version")
@@ -95,11 +79,8 @@ def test_docker_deployment(version="basic"):
time.sleep(5)
# Test cases based on version
test_basic_crawl_direct(tester)
test_basic_crawl(tester)
test_basic_crawl(tester)
test_basic_crawl_sync(tester)
if version in ["full", "transformer"]:
test_cosine_extraction(tester)
@@ -112,115 +93,129 @@ def test_docker_deployment(version="basic"):
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
print("\n=== Testing Basic Crawl (Async) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {},
"crawler_config": {}
}
result = tester.submit_and_wait(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
print(f"Basic crawl result count: {len(result['result']['results'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
assert len(result["result"]["results"]) > 0
assert len(result["result"]["results"][0]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {},
"crawler_config": {}
}
result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["status"] == "completed"
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_direct(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Direct) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
# "session_id": "test"
"cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
}
result = tester.crawl_direct(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
print(f"Basic crawl result count: {len(result['results'])}")
assert result["success"]
assert len(result["results"]) > 0
assert len(result["results"][0]["markdown"]) > 0
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)",
"crawler_params": {"headless": True},
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {"headless": True},
"crawler_config": {
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if(loadMoreButton) loadMoreButton.click();"
],
"wait_for": "wide-tease-item__wrapper df flex-column flex-row-m flex-nowrap-m enable-new-sports-feed-mobile-design(10)"
}
}
result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}")
print(f"JS execution result count: {len(result['result']['results'])}")
assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 7,
"css_selector": ".wide-tease-item__description",
"crawler_params": {"headless": True},
"extra": {"word_count_threshold": 10},
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {"headless": True},
"crawler_config": {
"css_selector": ".wide-tease-item__description",
"word_count_threshold": 10
}
}
result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}")
print(f"CSS selector result count: {len(result['result']['results'])}")
assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===")
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"name": "Cryptocurrency Prices",
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
"name": "asset_name",
"selector": "td:nth-child(2) p.cds-headline-h4steop",
"type": "text"
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
"name": "asset_symbol",
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
"type": "text"
},
{
"name": "asset_image_url",
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
"type": "attribute",
"attribute": "src"
},
{
"name": "asset_url",
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
"type": "attribute",
"attribute": "href"
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
"type": "text"
},
],
{
"name": "change",
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
"type": "text"
}
]
}
request = {
"urls": "https://www.coinbase.com/explore",
"priority": 9,
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
"urls": ["https://www.coinbase.com/explore"],
"browser_config": {},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {"schema": schema}
}
}
}
}
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
print(f"Extracted {len(extracted)} items")
print("Sample item:", json.dumps(extracted[0], indent=2))
if extracted:
print("Sample item:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
assert len(extracted) > 0
@@ -230,43 +225,54 @@ def test_llm_extraction(tester: Crawl4AiTester):
schema = {
"type": "object",
"properties": {
"model_name": {
"asset_name": {
"type": "string",
"description": "Name of the OpenAI model.",
"description": "Name of the asset.",
},
"input_fee": {
"price": {
"type": "string",
"description": "Fee for input token for the OpenAI model.",
"description": "Price of the asset.",
},
"output_fee": {
"change": {
"type": "string",
"description": "Fee for output token for the OpenAI model.",
"description": "Change in price of the asset.",
},
},
"required": ["model_name", "input_fee", "output_fee"],
"required": ["asset_name", "price", "change"],
}
request = {
"urls": "https://openai.com/api/pricing",
"priority": 8,
"extraction_config": {
"type": "llm",
"urls": ["https://www.coinbase.com/en-in/explore"],
"browser_config": {},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema,
"extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
},
},
"crawler_params": {"word_count_threshold": 1},
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "gemini/gemini-2.0-flash-exp",
"api_token": os.getenv("GEMINI_API_KEY")
}
},
"schema": schema,
"extraction_type": "schema",
"instruction": "From the crawled content, extract asset names along with their prices and change in price.",
}
},
"word_count_threshold": 1
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} model pricing entries")
print("Sample entry:", json.dumps(extracted[0], indent=2))
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
print(f"Extracted {len(extracted)} asset pricing entries")
if extracted:
print("Sample entry:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
@@ -274,6 +280,16 @@ def test_llm_extraction(tester: Crawl4AiTester):
def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===")
# Check if Ollama is accessible first
try:
ollama_response = requests.get("http://localhost:11434/api/tags", timeout=5)
ollama_response.raise_for_status()
print("Ollama is accessible")
except:
print("Ollama is not accessible, skipping test")
return
schema = {
"type": "object",
"properties": {
@@ -294,24 +310,33 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
}
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "llm",
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {"verbose": True},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"provider": "ollama/llama2",
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics.",
},
},
"extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True},
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "ollama/llama3.2:latest",
}
},
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics.",
}
},
"word_count_threshold": 1
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
print("Extracted content:", json.dumps(extracted, indent=2))
assert result["result"]["success"]
except Exception as e:
@@ -321,24 +346,30 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "cosine",
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3,
},
},
"extraction_strategy": {
"type": "CosineStrategy",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3,
}
}
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
print(f"Extracted {len(extracted)} text clusters")
print("First cluster tags:", extracted[0]["tags"])
if extracted:
print("First cluster tags:", extracted[0]["tags"])
assert result["result"]["success"]
except Exception as e:
print(f"Cosine extraction test failed: {str(e)}")
@@ -347,20 +378,25 @@ def test_cosine_extraction(tester: Crawl4AiTester):
def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 5,
"screenshot": True,
"crawler_params": {"headless": True},
"urls": ["https://www.nbcnews.com/business"],
"browser_config": {"headless": True},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"screenshot": True
}
}
}
result = tester.submit_and_wait(request)
print("Screenshot captured:", bool(result["result"]["screenshot"]))
screenshot_data = result["result"]["results"][0]["screenshot"]
print("Screenshot captured:", bool(screenshot_data))
if result["result"]["screenshot"]:
if screenshot_data:
# Save screenshot
screenshot_data = base64.b64decode(result["result"]["screenshot"])
screenshot_bytes = base64.b64decode(screenshot_data)
with open("test_screenshot.jpg", "wb") as f:
f.write(screenshot_data)
f.write(screenshot_bytes)
print("Screenshot saved as test_screenshot.jpg")
assert result["result"]["success"]
@@ -368,5 +404,4 @@ def test_screenshot(tester: Crawl4AiTester):
if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full"
test_docker_deployment(version)

View File

@@ -0,0 +1,57 @@
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
DefaultMarkdownGenerator,
PruningContentFilter,
CrawlResult,
UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
# Create browser config
browser_config = BrowserConfig(
headless=False,
verbose=True,
)
# Create the undetected adapter
undetected_adapter = UndetectedAdapter()
# Create the crawler strategy with the undetected adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
# Create the crawler with our custom strategy
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
# Configure the crawl
crawler_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter()
),
capture_console_messages=True, # Enable console capture to test adapter
)
# Test on a site that typically detects bots
print("Testing undetected adapter...")
result: CrawlResult = await crawler.arun(
url="https://www.helloworld.org",
config=crawler_config
)
print(f"Status: {result.status_code}")
print(f"Success: {result.success}")
print(f"Console messages captured: {len(result.console_messages or [])}")
print(f"Markdown content (first 500 chars):\n{result.markdown.raw_markdown[:500]}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -18,7 +18,7 @@ Usage:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig
from crawl4ai import LinkPreviewConfig
async def basic_link_head_extraction():

View File

@@ -1,43 +1,55 @@
from crawl4ai import LLMConfig
from crawl4ai import AsyncWebCrawler, LLMExtractionStrategy
import asyncio
import os
import json
from pydantic import BaseModel, Field
url = "https://openai.com/api/pricing/"
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from typing import Dict
import os
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama":
print(f"API token is required for {provider}. Skipping this example.")
return
browser_config = BrowserConfig(headless=True)
extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
if extra_headers:
extra_args["extra_headers"] = extra_headers
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
word_count_threshold=1,
page_timeout=80000,
extraction_strategy=LLMExtractionStrategy(
llm_config=LLMConfig(provider=provider, api_token=api_token),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content.""",
extra_args=extra_args,
),
)
async def main():
# Use AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
llm_config=LLMConfig(provider="groq/llama-3.1-70b-versatile", api_token=os.getenv("GROQ_API_KEY")),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "
"fees for input and output tokens. Make sure not to miss anything in the entire content. "
"One extracted model JSON format should look like this: "
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
),
url="https://openai.com/api/pricing/",
config=crawler_config
)
print("Success:", result.success)
model_fees = json.loads(result.extracted_content)
print(len(model_fees))
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
print(result.extracted_content)
asyncio.run(main())
if __name__ == "__main__":
asyncio.run(
extract_structured_data_using_llm(
provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
)
)

View File

@@ -0,0 +1,356 @@
#!/usr/bin/env python3
"""
Example demonstrating LLM-based table extraction in Crawl4AI.
This example shows how to use the LLMTableExtraction strategy to extract
complex tables from web pages, including handling rowspan, colspan, and nested tables.
"""
import os
import sys
# Get the grandparent directory
grandparent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(grandparent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
LLMConfig,
LLMTableExtraction,
CacheMode
)
import pandas as pd
# Example 1: Basic LLM Table Extraction
async def basic_llm_extraction():
"""Extract tables using LLM with default settings."""
print("\n=== Example 1: Basic LLM Table Extraction ===")
# Configure LLM (using OpenAI GPT-4o-mini for cost efficiency)
llm_config = LLMConfig(
provider="openai/gpt-4.1-mini",
api_token="env:OPENAI_API_KEY", # Uses environment variable
temperature=0.1, # Low temperature for consistency
max_tokens=32000
)
# Create LLM table extraction strategy
table_strategy = LLMTableExtraction(
llm_config=llm_config,
verbose=True,
# css_selector="div.mw-content-ltr",
max_tries=2,
enable_chunking=True,
chunk_token_threshold=5000, # Lower threshold to force chunking
min_rows_per_chunk=10,
max_parallel_chunks=3
)
# Configure crawler with the strategy
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=table_strategy
)
async with AsyncWebCrawler() as crawler:
# Extract tables from a Wikipedia page
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_chemical_elements",
config=config
)
if result.success:
print(f"✓ Found {len(result.tables)} tables")
# Display first table
if result.tables:
first_table = result.tables[0]
print(f"\nFirst table:")
print(f" Headers: {first_table['headers'][:5]}...")
print(f" Rows: {len(first_table['rows'])}")
# Convert to pandas DataFrame
df = pd.DataFrame(
first_table['rows'],
columns=first_table['headers']
)
print(f"\nDataFrame shape: {df.shape}")
print(df.head())
else:
print(f"✗ Extraction failed: {result.error}")
# Example 2: Focused Extraction with CSS Selector
async def focused_extraction():
"""Extract tables from specific page sections using CSS selectors."""
print("\n=== Example 2: Focused Extraction with CSS Selector ===")
# HTML with multiple tables
test_html = """
<html>
<body>
<div class="sidebar">
<table role="presentation">
<tr><td>Navigation</td></tr>
</table>
</div>
<div class="main-content">
<table id="data-table">
<caption>Quarterly Sales Report</caption>
<thead>
<tr>
<th rowspan="2">Product</th>
<th colspan="3">Q1 2024</th>
</tr>
<tr>
<th>Jan</th>
<th>Feb</th>
<th>Mar</th>
</tr>
</thead>
<tbody>
<tr>
<td>Widget A</td>
<td>100</td>
<td>120</td>
<td>140</td>
</tr>
<tr>
<td>Widget B</td>
<td>200</td>
<td>180</td>
<td>220</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
"""
llm_config = LLMConfig(
provider="openai/gpt-4.1-mini",
api_token="env:OPENAI_API_KEY"
)
# Focus only on main content area
table_strategy = LLMTableExtraction(
llm_config=llm_config,
css_selector=".main-content", # Only extract from main content
verbose=True
)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=table_strategy
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{test_html}",
config=config
)
if result.success and result.tables:
table = result.tables[0]
print(f"✓ Extracted table: {table.get('caption', 'No caption')}")
print(f" Headers: {table['headers']}")
print(f" Metadata: {table['metadata']}")
# The LLM should have handled the rowspan/colspan correctly
print("\nProcessed data (rowspan/colspan handled):")
for i, row in enumerate(table['rows']):
print(f" Row {i+1}: {row}")
# Example 3: Comparing with Default Extraction
async def compare_strategies():
"""Compare LLM extraction with default extraction on complex tables."""
print("\n=== Example 3: Comparing LLM vs Default Extraction ===")
# Complex table with nested structure
complex_html = """
<html>
<body>
<table>
<tr>
<th rowspan="3">Category</th>
<th colspan="2">2023</th>
<th colspan="2">2024</th>
</tr>
<tr>
<th>H1</th>
<th>H2</th>
<th>H1</th>
<th>H2</th>
</tr>
<tr>
<td colspan="4">All values in millions</td>
</tr>
<tr>
<td>Revenue</td>
<td>100</td>
<td>120</td>
<td>130</td>
<td>145</td>
</tr>
<tr>
<td>Profit</td>
<td>20</td>
<td>25</td>
<td>28</td>
<td>32</td>
</tr>
</table>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
# Test with default extraction
from crawl4ai import DefaultTableExtraction
default_strategy = DefaultTableExtraction(
table_score_threshold=3,
verbose=True
)
config_default = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=default_strategy
)
result_default = await crawler.arun(
url=f"raw:{complex_html}",
config=config_default
)
# Test with LLM extraction
llm_strategy = LLMTableExtraction(
llm_config=LLMConfig(
provider="openai/gpt-4.1-mini",
api_token="env:OPENAI_API_KEY"
),
verbose=True
)
config_llm = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=llm_strategy
)
result_llm = await crawler.arun(
url=f"raw:{complex_html}",
config=config_llm
)
# Compare results
print("\nDefault Extraction:")
if result_default.tables:
table = result_default.tables[0]
print(f" Headers: {table.get('headers', [])}")
print(f" Rows: {len(table.get('rows', []))}")
for i, row in enumerate(table.get('rows', [])[:3]):
print(f" Row {i+1}: {row}")
print("\nLLM Extraction (handles complex structure better):")
if result_llm.tables:
table = result_llm.tables[0]
print(f" Headers: {table.get('headers', [])}")
print(f" Rows: {len(table.get('rows', []))}")
for i, row in enumerate(table.get('rows', [])):
print(f" Row {i+1}: {row}")
print(f" Metadata: {table.get('metadata', {})}")
# Example 4: Batch Processing Multiple Pages
async def batch_extraction():
"""Extract tables from multiple pages efficiently."""
print("\n=== Example 4: Batch Table Extraction ===")
urls = [
"https://www.worldometers.info/geography/alphabetical-list-of-countries/",
# "https://en.wikipedia.org/wiki/List_of_chemical_elements",
]
llm_config = LLMConfig(
provider="openai/gpt-4.1-mini",
api_token="env:OPENAI_API_KEY",
temperature=0.1,
max_tokens=1500
)
table_strategy = LLMTableExtraction(
llm_config=llm_config,
css_selector="div.datatable-container", # Wikipedia data tables
verbose=False,
enable_chunking=True,
chunk_token_threshold=5000, # Lower threshold to force chunking
min_rows_per_chunk=10,
max_parallel_chunks=3
)
config = CrawlerRunConfig(
table_extraction=table_strategy,
cache_mode=CacheMode.BYPASS
)
all_tables = []
async with AsyncWebCrawler() as crawler:
for url in urls:
print(f"\nProcessing: {url.split('/')[-1][:50]}...")
result = await crawler.arun(url=url, config=config)
if result.success and result.tables:
print(f" ✓ Found {len(result.tables)} tables")
# Store first table from each page
if result.tables:
all_tables.append({
'url': url,
'table': result.tables[0]
})
# Summary
print(f"\n=== Summary ===")
print(f"Extracted {len(all_tables)} tables from {len(urls)} pages")
for item in all_tables:
table = item['table']
print(f"\nFrom {item['url'].split('/')[-1][:30]}:")
print(f" Columns: {len(table['headers'])}")
print(f" Rows: {len(table['rows'])}")
async def main():
"""Run all examples."""
print("=" * 60)
print("LLM TABLE EXTRACTION EXAMPLES")
print("=" * 60)
# Run examples (comment out ones you don't want to run)
# Basic extraction
await basic_llm_extraction()
# # Focused extraction with CSS
# await focused_extraction()
# # Compare strategies
# await compare_strategies()
# # Batch processing
# await batch_extraction()
print("\n" + "=" * 60)
print("ALL EXAMPLES COMPLETED")
print("=" * 60)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,5 +1,6 @@
import time, re
from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
# WebScrapingStrategy is now an alias for LXMLWebScrapingStrategy
import time
import functools
from collections import defaultdict
@@ -57,7 +58,7 @@ methods_to_profile = [
# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
for strategy, name in [(LXMLWebScrapingStrategy, "LXML")]:
for method in methods_to_profile:
apply_decorators(strategy, method, name)
@@ -85,7 +86,7 @@ def generate_large_html(n_elements=1000):
def test_scraping():
# Initialize both scrapers
original_scraper = WebScrapingStrategy()
original_scraper = LXMLWebScrapingStrategy()
selected_scraper = LXMLWebScrapingStrategy()
# Generate test HTML

View File

@@ -0,0 +1,59 @@
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Example 1: Stealth Mode
async def stealth_mode_example():
browser_config = BrowserConfig(
enable_stealth=True,
headless=False
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://example.com")
return result.html[:500]
# Example 2: Undetected Browser
async def undetected_browser_example():
browser_config = BrowserConfig(
headless=False
)
adapter = UndetectedAdapter()
strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
async with AsyncWebCrawler(
crawler_strategy=strategy,
config=browser_config
) as crawler:
result = await crawler.arun("https://example.com")
return result.html[:500]
# Example 3: Both Combined
async def combined_example():
browser_config = BrowserConfig(
enable_stealth=True,
headless=False
)
adapter = UndetectedAdapter()
strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
async with AsyncWebCrawler(
crawler_strategy=strategy,
config=browser_config
) as crawler:
result = await crawler.arun("https://example.com")
return result.html[:500]
# Run examples
if __name__ == "__main__":
asyncio.run(stealth_mode_example())
asyncio.run(undetected_browser_example())
asyncio.run(combined_example())

View File

@@ -0,0 +1,522 @@
"""
Stealth Mode Example with Crawl4AI
This example demonstrates how to use the stealth mode feature to bypass basic bot detection.
The stealth mode uses playwright-stealth to modify browser fingerprints and behaviors
that are commonly used to detect automated browsers.
Key features demonstrated:
1. Comparing crawling with and without stealth mode
2. Testing against bot detection sites
3. Accessing sites that block automated browsers
4. Best practices for stealth crawling
"""
import asyncio
import json
from typing import Dict, Any
from colorama import Fore, Style, init
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Initialize colorama for colored output
init()
# Create a logger for better output
logger = AsyncLogger(verbose=True)
async def test_bot_detection(use_stealth: bool = False) -> Dict[str, Any]:
"""Test against a bot detection service"""
logger.info(
f"Testing bot detection with stealth={'ON' if use_stealth else 'OFF'}",
tag="STEALTH"
)
# Configure browser with or without stealth
browser_config = BrowserConfig(
headless=False, # Use False to see the browser in action
enable_stealth=use_stealth,
viewport_width=1280,
viewport_height=800
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# JavaScript to extract bot detection results
detection_script = """
// Comprehensive bot detection checks
(() => {
const detectionResults = {
// Basic WebDriver detection
webdriver: navigator.webdriver,
// Chrome specific
chrome: !!window.chrome,
chromeRuntime: !!window.chrome?.runtime,
// Automation indicators
automationControlled: navigator.webdriver,
// Permissions API
permissionsPresent: !!navigator.permissions?.query,
// Plugins
pluginsLength: navigator.plugins.length,
pluginsArray: Array.from(navigator.plugins).map(p => p.name),
// Languages
languages: navigator.languages,
language: navigator.language,
// User agent
userAgent: navigator.userAgent,
// Screen and window properties
screen: {
width: screen.width,
height: screen.height,
availWidth: screen.availWidth,
availHeight: screen.availHeight,
colorDepth: screen.colorDepth,
pixelDepth: screen.pixelDepth
},
// WebGL vendor
webglVendor: (() => {
try {
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl') || canvas.getContext('experimental-webgl');
const ext = gl.getExtension('WEBGL_debug_renderer_info');
return gl.getParameter(ext.UNMASKED_VENDOR_WEBGL);
} catch (e) {
return 'Error';
}
})(),
// Platform
platform: navigator.platform,
// Hardware concurrency
hardwareConcurrency: navigator.hardwareConcurrency,
// Device memory
deviceMemory: navigator.deviceMemory,
// Connection
connection: navigator.connection?.effectiveType
};
// Log results for console capture
console.log('DETECTION_RESULTS:', JSON.stringify(detectionResults, null, 2));
// Return results
return detectionResults;
})();
"""
# Crawl bot detection test page
config = CrawlerRunConfig(
js_code=detection_script,
capture_console_messages=True,
wait_until="networkidle",
delay_before_return_html=2.0 # Give time for all checks to complete
)
result = await crawler.arun(
url="https://bot.sannysoft.com",
config=config
)
if result.success:
# Extract detection results from console
detection_data = None
for msg in result.console_messages or []:
if "DETECTION_RESULTS:" in msg.get("text", ""):
try:
json_str = msg["text"].replace("DETECTION_RESULTS:", "").strip()
detection_data = json.loads(json_str)
except:
pass
# Also try to get from JavaScript execution result
if not detection_data and result.js_execution_result:
detection_data = result.js_execution_result
return {
"success": True,
"url": result.url,
"detection_data": detection_data,
"page_title": result.metadata.get("title", ""),
"stealth_enabled": use_stealth
}
else:
return {
"success": False,
"error": result.error_message,
"stealth_enabled": use_stealth
}
async def test_cloudflare_site(use_stealth: bool = False) -> Dict[str, Any]:
"""Test accessing a Cloudflare-protected site"""
logger.info(
f"Testing Cloudflare site with stealth={'ON' if use_stealth else 'OFF'}",
tag="STEALTH"
)
browser_config = BrowserConfig(
headless=True, # Cloudflare detection works better in headless mode with stealth
enable_stealth=use_stealth,
viewport_width=1920,
viewport_height=1080
)
async with AsyncWebCrawler(config=browser_config) as crawler:
config = CrawlerRunConfig(
wait_until="networkidle",
page_timeout=30000, # 30 seconds
delay_before_return_html=3.0
)
# Test on a site that often shows Cloudflare challenges
result = await crawler.arun(
url="https://nowsecure.nl",
config=config
)
# Check if we hit Cloudflare challenge
cloudflare_detected = False
if result.html:
cloudflare_indicators = [
"Checking your browser",
"Just a moment",
"cf-browser-verification",
"cf-challenge",
"ray ID"
]
cloudflare_detected = any(indicator in result.html for indicator in cloudflare_indicators)
return {
"success": result.success,
"url": result.url,
"cloudflare_challenge": cloudflare_detected,
"status_code": result.status_code,
"page_title": result.metadata.get("title", "") if result.metadata else "",
"stealth_enabled": use_stealth,
"html_snippet": result.html[:500] if result.html else ""
}
async def test_anti_bot_site(use_stealth: bool = False) -> Dict[str, Any]:
"""Test against sites with anti-bot measures"""
logger.info(
f"Testing anti-bot site with stealth={'ON' if use_stealth else 'OFF'}",
tag="STEALTH"
)
browser_config = BrowserConfig(
headless=False,
enable_stealth=use_stealth,
# Additional browser arguments that help with stealth
extra_args=[
"--disable-blink-features=AutomationControlled",
"--disable-features=site-per-process"
] if not use_stealth else [] # These are automatically applied with stealth
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Some sites check for specific behaviors
behavior_script = """
(async () => {
// Simulate human-like behavior
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
// Random mouse movement
const moveX = Math.random() * 100;
const moveY = Math.random() * 100;
// Simulate reading time
await sleep(1000 + Math.random() * 2000);
// Scroll slightly
window.scrollBy(0, 100 + Math.random() * 200);
console.log('Human behavior simulation complete');
return true;
})()
"""
config = CrawlerRunConfig(
js_code=behavior_script,
wait_until="networkidle",
delay_before_return_html=5.0, # Longer delay to appear more human
capture_console_messages=True
)
# Test on a site that implements anti-bot measures
result = await crawler.arun(
url="https://www.g2.com/",
config=config
)
# Check for common anti-bot blocks
blocked_indicators = [
"Access Denied",
"403 Forbidden",
"Security Check",
"Verify you are human",
"captcha",
"challenge"
]
blocked = False
if result.html:
blocked = any(indicator.lower() in result.html.lower() for indicator in blocked_indicators)
return {
"success": result.success and not blocked,
"url": result.url,
"blocked": blocked,
"status_code": result.status_code,
"page_title": result.metadata.get("title", "") if result.metadata else "",
"stealth_enabled": use_stealth
}
async def compare_results():
"""Run all tests with and without stealth mode and compare results"""
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
print(f"{Fore.CYAN}Crawl4AI Stealth Mode Comparison{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}\n")
# Test 1: Bot Detection
print(f"{Fore.YELLOW}1. Bot Detection Test (bot.sannysoft.com){Style.RESET_ALL}")
print("-" * 40)
# Without stealth
regular_detection = await test_bot_detection(use_stealth=False)
if regular_detection["success"] and regular_detection["detection_data"]:
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
data = regular_detection["detection_data"]
print(f" • WebDriver detected: {data.get('webdriver', 'Unknown')}")
print(f" • Chrome: {data.get('chrome', 'Unknown')}")
print(f" • Languages: {data.get('languages', 'Unknown')}")
print(f" • Plugins: {data.get('pluginsLength', 'Unknown')}")
print(f" • User Agent: {data.get('userAgent', 'Unknown')[:60]}...")
# With stealth
stealth_detection = await test_bot_detection(use_stealth=True)
if stealth_detection["success"] and stealth_detection["detection_data"]:
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
data = stealth_detection["detection_data"]
print(f" • WebDriver detected: {data.get('webdriver', 'Unknown')}")
print(f" • Chrome: {data.get('chrome', 'Unknown')}")
print(f" • Languages: {data.get('languages', 'Unknown')}")
print(f" • Plugins: {data.get('pluginsLength', 'Unknown')}")
print(f" • User Agent: {data.get('userAgent', 'Unknown')[:60]}...")
# Test 2: Cloudflare Site
print(f"\n\n{Fore.YELLOW}2. Cloudflare Protected Site Test{Style.RESET_ALL}")
print("-" * 40)
# Without stealth
regular_cf = await test_cloudflare_site(use_stealth=False)
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
print(f" • Success: {regular_cf['success']}")
print(f" • Cloudflare Challenge: {regular_cf['cloudflare_challenge']}")
print(f" • Status Code: {regular_cf['status_code']}")
print(f" • Page Title: {regular_cf['page_title']}")
# With stealth
stealth_cf = await test_cloudflare_site(use_stealth=True)
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
print(f" • Success: {stealth_cf['success']}")
print(f" • Cloudflare Challenge: {stealth_cf['cloudflare_challenge']}")
print(f" • Status Code: {stealth_cf['status_code']}")
print(f" • Page Title: {stealth_cf['page_title']}")
# Test 3: Anti-bot Site
print(f"\n\n{Fore.YELLOW}3. Anti-Bot Site Test{Style.RESET_ALL}")
print("-" * 40)
# Without stealth
regular_antibot = await test_anti_bot_site(use_stealth=False)
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
print(f" • Success: {regular_antibot['success']}")
print(f" • Blocked: {regular_antibot['blocked']}")
print(f" • Status Code: {regular_antibot['status_code']}")
print(f" • Page Title: {regular_antibot['page_title']}")
# With stealth
stealth_antibot = await test_anti_bot_site(use_stealth=True)
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
print(f" • Success: {stealth_antibot['success']}")
print(f" • Blocked: {stealth_antibot['blocked']}")
print(f" • Status Code: {stealth_antibot['status_code']}")
print(f" • Page Title: {stealth_antibot['page_title']}")
# Summary
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
print(f"{Fore.CYAN}Summary:{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
print(f"\nStealth mode helps bypass basic bot detection by:")
print(f" • Hiding webdriver property")
print(f" • Modifying browser fingerprints")
print(f" • Adjusting navigator properties")
print(f" • Emulating real browser plugin behavior")
print(f"\n{Fore.YELLOW}Note:{Style.RESET_ALL} Stealth mode is not a silver bullet.")
print(f"Advanced anti-bot systems may still detect automation.")
print(f"Always respect robots.txt and website terms of service.")
async def stealth_best_practices():
"""Demonstrate best practices for using stealth mode"""
print(f"\n\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
print(f"{Fore.CYAN}Stealth Mode Best Practices{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}\n")
# Best Practice 1: Combine with realistic behavior
print(f"{Fore.YELLOW}1. Combine with Realistic Behavior:{Style.RESET_ALL}")
browser_config = BrowserConfig(
headless=False,
enable_stealth=True,
viewport_width=1920,
viewport_height=1080
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Simulate human-like behavior
human_behavior_script = """
(async () => {
// Wait random time between actions
const randomWait = () => Math.random() * 2000 + 1000;
// Simulate reading
await new Promise(resolve => setTimeout(resolve, randomWait()));
// Smooth scroll
const smoothScroll = async () => {
const totalHeight = document.body.scrollHeight;
const viewHeight = window.innerHeight;
let currentPosition = 0;
while (currentPosition < totalHeight - viewHeight) {
const scrollAmount = Math.random() * 300 + 100;
window.scrollBy({
top: scrollAmount,
behavior: 'smooth'
});
currentPosition += scrollAmount;
await new Promise(resolve => setTimeout(resolve, randomWait()));
}
};
await smoothScroll();
console.log('Human-like behavior simulation completed');
return true;
})()
"""
config = CrawlerRunConfig(
js_code=human_behavior_script,
wait_until="networkidle",
delay_before_return_html=3.0,
capture_console_messages=True
)
result = await crawler.arun(
url="https://example.com",
config=config
)
print(f" ✓ Simulated human-like scrolling and reading patterns")
print(f" ✓ Added random delays between actions")
print(f" ✓ Result: {result.success}")
# Best Practice 2: Use appropriate viewport and user agent
print(f"\n{Fore.YELLOW}2. Use Realistic Viewport and User Agent:{Style.RESET_ALL}")
# Get a realistic user agent
from crawl4ai.user_agent_generator import UserAgentGenerator
ua_generator = UserAgentGenerator()
browser_config = BrowserConfig(
headless=True,
enable_stealth=True,
viewport_width=1920,
viewport_height=1080,
user_agent=ua_generator.generate(device_type="desktop", browser_type="chrome")
)
print(f" ✓ Using realistic viewport: 1920x1080")
print(f" ✓ Using current Chrome user agent")
print(f" ✓ Stealth mode will ensure consistency")
# Best Practice 3: Manage request rate
print(f"\n{Fore.YELLOW}3. Manage Request Rate:{Style.RESET_ALL}")
print(f" ✓ Add delays between requests")
print(f" ✓ Randomize timing patterns")
print(f" ✓ Respect robots.txt")
# Best Practice 4: Session management
print(f"\n{Fore.YELLOW}4. Use Session Management:{Style.RESET_ALL}")
browser_config = BrowserConfig(
headless=False,
enable_stealth=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Create a session for multiple requests
session_id = "stealth_session_1"
config = CrawlerRunConfig(
session_id=session_id,
wait_until="domcontentloaded"
)
# First request
result1 = await crawler.arun(
url="https://example.com",
config=config
)
# Subsequent request reuses the same browser context
result2 = await crawler.arun(
url="https://example.com/about",
config=config
)
print(f" ✓ Reused browser session for multiple requests")
print(f" ✓ Maintains cookies and state between requests")
print(f" ✓ More efficient and realistic browsing pattern")
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
async def main():
"""Run all examples"""
# Run comparison tests
await compare_results()
# Show best practices
await stealth_best_practices()
print(f"\n{Fore.GREEN}Examples completed!{Style.RESET_ALL}")
print(f"\n{Fore.YELLOW}Remember:{Style.RESET_ALL}")
print(f"• Stealth mode helps with basic bot detection")
print(f"• Always respect website terms of service")
print(f"• Consider rate limiting and ethical scraping practices")
print(f"• For advanced protection, consider additional measures")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,215 @@
"""
Quick Start: Using Stealth Mode in Crawl4AI
This example shows practical use cases for the stealth mode feature.
Stealth mode helps bypass basic bot detection mechanisms.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def example_1_basic_stealth():
"""Example 1: Basic stealth mode usage"""
print("\n=== Example 1: Basic Stealth Mode ===")
# Enable stealth mode in browser config
browser_config = BrowserConfig(
enable_stealth=True, # This is the key parameter
headless=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print(f"✓ Crawled {result.url} successfully")
print(f"✓ Title: {result.metadata.get('title', 'N/A')}")
async def example_2_stealth_with_screenshot():
"""Example 2: Stealth mode with screenshot to show detection results"""
print("\n=== Example 2: Stealth Mode Visual Verification ===")
browser_config = BrowserConfig(
enable_stealth=True,
headless=False # Set to False to see the browser
)
async with AsyncWebCrawler(config=browser_config) as crawler:
config = CrawlerRunConfig(
screenshot=True,
wait_until="networkidle"
)
result = await crawler.arun(
url="https://bot.sannysoft.com",
config=config
)
if result.success:
print(f"✓ Successfully crawled bot detection site")
print(f"✓ With stealth enabled, many detection tests should show as passed")
if result.screenshot:
# Save screenshot for verification
import base64
with open("stealth_detection_results.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print(f"✓ Screenshot saved as 'stealth_detection_results.png'")
print(f" Check the screenshot to see detection results!")
async def example_3_stealth_for_protected_sites():
"""Example 3: Using stealth for sites with bot protection"""
print("\n=== Example 3: Stealth for Protected Sites ===")
browser_config = BrowserConfig(
enable_stealth=True,
headless=True,
viewport_width=1920,
viewport_height=1080
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Add human-like behavior
config = CrawlerRunConfig(
wait_until="networkidle",
delay_before_return_html=2.0, # Wait 2 seconds
js_code="""
// Simulate human-like scrolling
window.scrollTo({
top: document.body.scrollHeight / 2,
behavior: 'smooth'
});
"""
)
# Try accessing a site that might have bot protection
result = await crawler.arun(
url="https://www.g2.com/products/slack/reviews",
config=config
)
if result.success:
print(f"✓ Successfully accessed protected site")
print(f"✓ Retrieved {len(result.html)} characters of HTML")
else:
print(f"✗ Failed to access site: {result.error_message}")
async def example_4_stealth_with_sessions():
"""Example 4: Stealth mode with session management"""
print("\n=== Example 4: Stealth + Session Management ===")
browser_config = BrowserConfig(
enable_stealth=True,
headless=False
)
async with AsyncWebCrawler(config=browser_config) as crawler:
session_id = "my_stealth_session"
# First request - establish session
config = CrawlerRunConfig(
session_id=session_id,
wait_until="domcontentloaded"
)
result1 = await crawler.arun(
url="https://news.ycombinator.com",
config=config
)
print(f"✓ First request completed: {result1.url}")
# Second request - reuse session
await asyncio.sleep(2) # Brief delay between requests
result2 = await crawler.arun(
url="https://news.ycombinator.com/best",
config=config
)
print(f"✓ Second request completed: {result2.url}")
print(f"✓ Session reused, maintaining cookies and state")
async def example_5_stealth_comparison():
"""Example 5: Compare results with and without stealth using screenshots"""
print("\n=== Example 5: Stealth Mode Comparison ===")
test_url = "https://bot.sannysoft.com"
# First test WITHOUT stealth
print("\nWithout stealth:")
regular_config = BrowserConfig(
enable_stealth=False,
headless=True
)
async with AsyncWebCrawler(config=regular_config) as crawler:
config = CrawlerRunConfig(
screenshot=True,
wait_until="networkidle"
)
result = await crawler.arun(url=test_url, config=config)
if result.success and result.screenshot:
import base64
with open("comparison_without_stealth.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print(f" ✓ Screenshot saved: comparison_without_stealth.png")
print(f" Many tests will show as FAILED (red)")
# Then test WITH stealth
print("\nWith stealth:")
stealth_config = BrowserConfig(
enable_stealth=True,
headless=True
)
async with AsyncWebCrawler(config=stealth_config) as crawler:
config = CrawlerRunConfig(
screenshot=True,
wait_until="networkidle"
)
result = await crawler.arun(url=test_url, config=config)
if result.success and result.screenshot:
import base64
with open("comparison_with_stealth.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print(f" ✓ Screenshot saved: comparison_with_stealth.png")
print(f" More tests should show as PASSED (green)")
print("\nCompare the two screenshots to see the difference!")
async def main():
"""Run all examples"""
print("Crawl4AI Stealth Mode Examples")
print("==============================")
# Run basic example
await example_1_basic_stealth()
# Run screenshot verification example
await example_2_stealth_with_screenshot()
# Run protected site example
await example_3_stealth_for_protected_sites()
# Run session example
await example_4_stealth_with_sessions()
# Run comparison example
await example_5_stealth_comparison()
print("\n" + "="*50)
print("Tips for using stealth mode effectively:")
print("- Use realistic viewport sizes (1920x1080, 1366x768)")
print("- Add delays between requests to appear more human")
print("- Combine with session management for better results")
print("- Remember: stealth mode is for legitimate scraping only")
print("="*50)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,62 @@
"""
Simple test to verify stealth mode is working
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def test_stealth():
"""Test stealth mode effectiveness"""
# Test WITHOUT stealth
print("=== WITHOUT Stealth ===")
config1 = BrowserConfig(
headless=False,
enable_stealth=False
)
async with AsyncWebCrawler(config=config1) as crawler:
result = await crawler.arun(
url="https://bot.sannysoft.com",
config=CrawlerRunConfig(
wait_until="networkidle",
screenshot=True
)
)
print(f"Success: {result.success}")
# Take screenshot
if result.screenshot:
with open("without_stealth.png", "wb") as f:
import base64
f.write(base64.b64decode(result.screenshot))
print("Screenshot saved: without_stealth.png")
# Test WITH stealth
print("\n=== WITH Stealth ===")
config2 = BrowserConfig(
headless=False,
enable_stealth=True
)
async with AsyncWebCrawler(config=config2) as crawler:
result = await crawler.arun(
url="https://bot.sannysoft.com",
config=CrawlerRunConfig(
wait_until="networkidle",
screenshot=True
)
)
print(f"Success: {result.success}")
# Take screenshot
if result.screenshot:
with open("with_stealth.png", "wb") as f:
import base64
f.write(base64.b64decode(result.screenshot))
print("Screenshot saved: with_stealth.png")
print("\nCheck the screenshots to see the difference in bot detection results!")
if __name__ == "__main__":
asyncio.run(test_stealth())

View File

@@ -0,0 +1,276 @@
"""
Example: Using Table Extraction Strategies in Crawl4AI
This example demonstrates how to use different table extraction strategies
to extract tables from web pages.
"""
import asyncio
import pandas as pd
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
CacheMode,
DefaultTableExtraction,
NoTableExtraction,
TableExtractionStrategy
)
from typing import Dict, List, Any
async def example_default_extraction():
"""Example 1: Using default table extraction (automatic)."""
print("\n" + "="*50)
print("Example 1: Default Table Extraction")
print("="*50)
async with AsyncWebCrawler() as crawler:
# No need to specify table_extraction - uses DefaultTableExtraction automatically
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_score_threshold=7 # Adjust sensitivity (default: 7)
)
result = await crawler.arun(
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)",
config=config
)
if result.success and result.tables:
print(f"Found {len(result.tables)} tables")
# Convert first table to pandas DataFrame
if result.tables:
first_table = result.tables[0]
df = pd.DataFrame(
first_table['rows'],
columns=first_table['headers'] if first_table['headers'] else None
)
print(f"\nFirst table preview:")
print(df.head())
print(f"Shape: {df.shape}")
async def example_custom_configuration():
"""Example 2: Custom table extraction configuration."""
print("\n" + "="*50)
print("Example 2: Custom Table Configuration")
print("="*50)
async with AsyncWebCrawler() as crawler:
# Create custom extraction strategy with specific settings
table_strategy = DefaultTableExtraction(
table_score_threshold=5, # Lower threshold for more permissive detection
min_rows=3, # Only extract tables with at least 3 rows
min_cols=2, # Only extract tables with at least 2 columns
verbose=True
)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=table_strategy,
# Target specific tables using CSS selector
css_selector="div.main-content"
)
result = await crawler.arun(
"https://example.com/data",
config=config
)
if result.success:
print(f"Found {len(result.tables)} tables matching criteria")
for i, table in enumerate(result.tables):
print(f"\nTable {i+1}:")
print(f" Caption: {table.get('caption', 'No caption')}")
print(f" Size: {table['metadata']['row_count']} rows × {table['metadata']['column_count']} columns")
print(f" Has headers: {table['metadata']['has_headers']}")
async def example_disable_extraction():
"""Example 3: Disable table extraction when not needed."""
print("\n" + "="*50)
print("Example 3: Disable Table Extraction")
print("="*50)
async with AsyncWebCrawler() as crawler:
# Use NoTableExtraction to skip table processing entirely
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=NoTableExtraction() # No tables will be extracted
)
result = await crawler.arun(
"https://example.com",
config=config
)
if result.success:
print(f"Tables extracted: {len(result.tables)} (should be 0)")
print("Table extraction disabled - better performance for non-table content")
class FinancialTableExtraction(TableExtractionStrategy):
"""
Custom strategy for extracting financial tables with specific requirements.
"""
def __init__(self, currency_symbols=None, **kwargs):
super().__init__(**kwargs)
self.currency_symbols = currency_symbols or ['$', '', '£', '¥']
def extract_tables(self, element, **kwargs):
"""Extract only tables that appear to contain financial data."""
tables_data = []
for table in element.xpath(".//table"):
# Check if table contains currency symbols
table_text = ''.join(table.itertext())
has_currency = any(symbol in table_text for symbol in self.currency_symbols)
if not has_currency:
continue
# Extract using base logic (could reuse DefaultTableExtraction logic)
headers = []
rows = []
# Extract headers
for th in table.xpath(".//thead//th | .//tr[1]//th"):
headers.append(th.text_content().strip())
# Extract rows
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
row = []
for td in tr.xpath(".//td"):
cell_text = td.text_content().strip()
# Clean currency values
for symbol in self.currency_symbols:
cell_text = cell_text.replace(symbol, '')
row.append(cell_text)
if row:
rows.append(row)
if headers or rows:
tables_data.append({
"headers": headers,
"rows": rows,
"caption": table.xpath(".//caption/text()")[0] if table.xpath(".//caption") else "",
"summary": table.get("summary", ""),
"metadata": {
"type": "financial",
"has_currency": True,
"row_count": len(rows),
"column_count": len(headers) if headers else len(rows[0]) if rows else 0
}
})
return tables_data
async def example_custom_strategy():
"""Example 4: Custom table extraction strategy."""
print("\n" + "="*50)
print("Example 4: Custom Financial Table Strategy")
print("="*50)
async with AsyncWebCrawler() as crawler:
# Use custom strategy for financial tables
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
table_extraction=FinancialTableExtraction(
currency_symbols=['$', ''],
verbose=True
)
)
result = await crawler.arun(
"https://finance.yahoo.com/",
config=config
)
if result.success:
print(f"Found {len(result.tables)} financial tables")
for table in result.tables:
if table['metadata'].get('type') == 'financial':
print(f" ✓ Financial table with {table['metadata']['row_count']} rows")
async def example_combined_extraction():
"""Example 5: Combine table extraction with other strategies."""
print("\n" + "="*50)
print("Example 5: Combined Extraction Strategies")
print("="*50)
from crawl4ai import LLMExtractionStrategy, LLMConfig
async with AsyncWebCrawler() as crawler:
# Define schema for structured extraction
schema = {
"type": "object",
"properties": {
"page_title": {"type": "string"},
"main_topic": {"type": "string"},
"key_figures": {
"type": "array",
"items": {"type": "string"}
}
}
}
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
# Table extraction
table_extraction=DefaultTableExtraction(
table_score_threshold=6,
min_rows=2
),
# LLM extraction for structured data
extraction_strategy=LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai"),
schema=schema
)
)
result = await crawler.arun(
"https://en.wikipedia.org/wiki/Economy_of_the_United_States",
config=config
)
if result.success:
print(f"Tables found: {len(result.tables)}")
# Tables are in result.tables
if result.tables:
print(f"First table has {len(result.tables[0]['rows'])} rows")
# Structured data is in result.extracted_content
if result.extracted_content:
import json
structured_data = json.loads(result.extracted_content)
print(f"Page title: {structured_data.get('page_title', 'N/A')}")
print(f"Main topic: {structured_data.get('main_topic', 'N/A')}")
async def main():
"""Run all examples."""
print("\n" + "="*60)
print("CRAWL4AI TABLE EXTRACTION EXAMPLES")
print("="*60)
# Run examples
await example_default_extraction()
await example_custom_configuration()
await example_disable_extraction()
await example_custom_strategy()
# await example_combined_extraction() # Requires OpenAI API key
print("\n" + "="*60)
print("EXAMPLES COMPLETED")
print("="*60)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,74 @@
"""
Basic Undetected Browser Test
Simple example to test if undetected mode works
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def test_regular_mode():
"""Test with regular browser"""
print("Testing Regular Browser Mode...")
browser_config = BrowserConfig(
headless=False,
verbose=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://www.example.com")
print(f"Regular Mode - Success: {result.success}")
print(f"Regular Mode - Status: {result.status_code}")
print(f"Regular Mode - Content length: {len(result.markdown.raw_markdown)}")
print(f"Regular Mode - First 100 chars: {result.markdown.raw_markdown[:100]}...")
return result.success
async def test_undetected_mode():
"""Test with undetected browser"""
print("\nTesting Undetected Browser Mode...")
from crawl4ai import UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
browser_config = BrowserConfig(
headless=False,
verbose=True
)
# Create undetected adapter
undetected_adapter = UndetectedAdapter()
# Create strategy with undetected adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
result = await crawler.arun(url="https://www.example.com")
print(f"Undetected Mode - Success: {result.success}")
print(f"Undetected Mode - Status: {result.status_code}")
print(f"Undetected Mode - Content length: {len(result.markdown.raw_markdown)}")
print(f"Undetected Mode - First 100 chars: {result.markdown.raw_markdown[:100]}...")
return result.success
async def main():
"""Run both tests"""
print("🤖 Crawl4AI Basic Adapter Test\n")
# Test regular mode
regular_success = await test_regular_mode()
# Test undetected mode
undetected_success = await test_undetected_mode()
# Summary
print("\n" + "="*50)
print("Summary:")
print(f"Regular Mode: {'✅ Success' if regular_success else '❌ Failed'}")
print(f"Undetected Mode: {'✅ Success' if undetected_success else '❌ Failed'}")
print("="*50)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,155 @@
"""
Bot Detection Test - Compare Regular vs Undetected
Tests browser fingerprinting differences at bot.sannysoft.com
"""
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
UndetectedAdapter,
CrawlResult
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Bot detection test site
TEST_URL = "https://bot.sannysoft.com"
def analyze_bot_detection(result: CrawlResult) -> dict:
"""Analyze bot detection results from the page"""
detections = {
"webdriver": False,
"headless": False,
"automation": False,
"user_agent": False,
"total_tests": 0,
"failed_tests": 0
}
if not result.success or not result.html:
return detections
# Look for specific test results in the HTML
html_lower = result.html.lower()
# Check for common bot indicators
if "webdriver" in html_lower and ("fail" in html_lower or "true" in html_lower):
detections["webdriver"] = True
detections["failed_tests"] += 1
if "headless" in html_lower and ("fail" in html_lower or "true" in html_lower):
detections["headless"] = True
detections["failed_tests"] += 1
if "automation" in html_lower and "detected" in html_lower:
detections["automation"] = True
detections["failed_tests"] += 1
# Count total tests (approximate)
detections["total_tests"] = html_lower.count("test") + html_lower.count("check")
return detections
async def test_browser_mode(adapter_name: str, adapter=None):
"""Test a browser mode and return results"""
print(f"\n{'='*60}")
print(f"Testing: {adapter_name}")
print(f"{'='*60}")
browser_config = BrowserConfig(
headless=False, # Run in headed mode for better results
verbose=True,
viewport_width=1920,
viewport_height=1080,
)
if adapter:
# Use undetected mode
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
crawler = AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
)
else:
# Use regular mode
crawler = AsyncWebCrawler(config=browser_config)
async with crawler:
config = CrawlerRunConfig(
delay_before_return_html=3.0, # Let detection scripts run
wait_for_images=True,
screenshot=True,
simulate_user=False, # Don't simulate for accurate detection
)
result = await crawler.arun(url=TEST_URL, config=config)
print(f"\n✓ Success: {result.success}")
print(f"✓ Status Code: {result.status_code}")
if result.success:
# Analyze detection results
detections = analyze_bot_detection(result)
print(f"\n🔍 Bot Detection Analysis:")
print(f" - WebDriver Detected: {'❌ Yes' if detections['webdriver'] else '✅ No'}")
print(f" - Headless Detected: {'❌ Yes' if detections['headless'] else '✅ No'}")
print(f" - Automation Detected: {'❌ Yes' if detections['automation'] else '✅ No'}")
print(f" - Failed Tests: {detections['failed_tests']}")
# Show some content
if result.markdown.raw_markdown:
print(f"\nContent preview:")
lines = result.markdown.raw_markdown.split('\n')
for line in lines[:20]: # Show first 20 lines
if any(keyword in line.lower() for keyword in ['test', 'pass', 'fail', 'yes', 'no']):
print(f" {line.strip()}")
return result, detections if result.success else {}
async def main():
"""Run the comparison"""
print("🤖 Crawl4AI - Bot Detection Test")
print(f"Testing at: {TEST_URL}")
print("This site runs various browser fingerprinting tests\n")
# Test regular browser
regular_result, regular_detections = await test_browser_mode("Regular Browser")
# Small delay
await asyncio.sleep(2)
# Test undetected browser
undetected_adapter = UndetectedAdapter()
undetected_result, undetected_detections = await test_browser_mode(
"Undetected Browser",
undetected_adapter
)
# Summary comparison
print(f"\n{'='*60}")
print("COMPARISON SUMMARY")
print(f"{'='*60}")
print(f"\n{'Test':<25} {'Regular':<15} {'Undetected':<15}")
print(f"{'-'*55}")
if regular_detections and undetected_detections:
print(f"{'WebDriver Detection':<25} {'❌ Detected' if regular_detections['webdriver'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['webdriver'] else '✅ Passed':<15}")
print(f"{'Headless Detection':<25} {'❌ Detected' if regular_detections['headless'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['headless'] else '✅ Passed':<15}")
print(f"{'Automation Detection':<25} {'❌ Detected' if regular_detections['automation'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['automation'] else '✅ Passed':<15}")
print(f"{'Failed Tests':<25} {regular_detections['failed_tests']:<15} {undetected_detections['failed_tests']:<15}")
print(f"\n{'='*60}")
if undetected_detections.get('failed_tests', 0) < regular_detections.get('failed_tests', 1):
print("✅ Undetected browser performed better at evading detection!")
else:
print(" Both browsers had similar detection results")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,164 @@
"""
Undetected Browser Test - Cloudflare Protected Site
Tests the difference between regular and undetected modes on a Cloudflare-protected site
"""
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Test URL with Cloudflare protection
TEST_URL = "https://nowsecure.nl"
async def test_regular_browser():
"""Test with regular browser - likely to be blocked"""
print("=" * 60)
print("Testing with Regular Browser")
print("=" * 60)
browser_config = BrowserConfig(
headless=False,
verbose=True,
viewport_width=1920,
viewport_height=1080,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
config = CrawlerRunConfig(
delay_before_return_html=2.0,
simulate_user=True,
magic=True, # Try with magic mode too
)
result = await crawler.arun(url=TEST_URL, config=config)
print(f"\n✓ Success: {result.success}")
print(f"✓ Status Code: {result.status_code}")
print(f"✓ HTML Length: {len(result.html)}")
# Check for Cloudflare challenge
if result.html:
cf_indicators = [
"Checking your browser",
"Please stand by",
"cloudflare",
"cf-browser-verification",
"Access denied",
"Ray ID"
]
detected = False
for indicator in cf_indicators:
if indicator.lower() in result.html.lower():
print(f"⚠️ Cloudflare Challenge Detected: '{indicator}' found")
detected = True
break
if not detected and len(result.markdown.raw_markdown) > 100:
print("✅ Successfully bypassed Cloudflare!")
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
elif not detected:
print("⚠️ Page loaded but content seems minimal")
return result
async def test_undetected_browser():
"""Test with undetected browser - should bypass Cloudflare"""
print("\n" + "=" * 60)
print("Testing with Undetected Browser")
print("=" * 60)
browser_config = BrowserConfig(
headless=False, # Headless is easier to detect
verbose=True,
viewport_width=1920,
viewport_height=1080,
)
# Create undetected adapter
undetected_adapter = UndetectedAdapter()
# Create strategy with undetected adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
config = CrawlerRunConfig(
delay_before_return_html=2.0,
simulate_user=True,
)
result = await crawler.arun(url=TEST_URL, config=config)
print(f"\n✓ Success: {result.success}")
print(f"✓ Status Code: {result.status_code}")
print(f"✓ HTML Length: {len(result.html)}")
# Check for Cloudflare challenge
if result.html:
cf_indicators = [
"Checking your browser",
"Please stand by",
"cloudflare",
"cf-browser-verification",
"Access denied",
"Ray ID"
]
detected = False
for indicator in cf_indicators:
if indicator.lower() in result.html.lower():
print(f"⚠️ Cloudflare Challenge Detected: '{indicator}' found")
detected = True
break
if not detected and len(result.markdown.raw_markdown) > 100:
print("✅ Successfully bypassed Cloudflare!")
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
elif not detected:
print("⚠️ Page loaded but content seems minimal")
return result
async def main():
"""Compare regular vs undetected browser"""
print("🤖 Crawl4AI - Cloudflare Bypass Test")
print(f"Testing URL: {TEST_URL}\n")
# Test regular browser
regular_result = await test_regular_browser()
# Small delay
await asyncio.sleep(2)
# Test undetected browser
undetected_result = await test_undetected_browser()
# Summary
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"Regular Browser:")
print(f" - Success: {regular_result.success}")
print(f" - Content Length: {len(regular_result.markdown.raw_markdown) if regular_result.markdown else 0}")
print(f"\nUndetected Browser:")
print(f" - Success: {undetected_result.success}")
print(f" - Content Length: {len(undetected_result.markdown.raw_markdown) if undetected_result.markdown else 0}")
if undetected_result.success and len(undetected_result.markdown.raw_markdown) > len(regular_result.markdown.raw_markdown):
print("\n✅ Undetected browser successfully bypassed protection!")
print("=" * 60)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,184 @@
"""
Undetected vs Regular Browser Comparison
This example demonstrates the difference between regular and undetected browser modes
when accessing sites with bot detection services.
Based on tested anti-bot services:
- Cloudflare
- Kasada
- Akamai
- DataDome
- Bet365
- And others
"""
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
PlaywrightAdapter,
UndetectedAdapter,
CrawlResult
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Test URLs for various bot detection services
TEST_SITES = {
"Cloudflare Protected": "https://nowsecure.nl",
# "Bot Detection Test": "https://bot.sannysoft.com",
# "Fingerprint Test": "https://fingerprint.com/products/bot-detection",
# "Browser Scan": "https://browserscan.net",
# "CreepJS": "https://abrahamjuliot.github.io/creepjs",
}
async def test_with_adapter(url: str, adapter_name: str, adapter):
"""Test a URL with a specific adapter"""
browser_config = BrowserConfig(
headless=False, # Better for avoiding detection
viewport_width=1920,
viewport_height=1080,
verbose=True,
)
# Create the crawler strategy with the adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
print(f"\n{'='*60}")
print(f"Testing with {adapter_name} adapter")
print(f"URL: {url}")
print(f"{'='*60}")
try:
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
crawler_config = CrawlerRunConfig(
delay_before_return_html=3.0, # Give page time to load
wait_for_images=True,
screenshot=True,
simulate_user=True, # Add user simulation
)
result: CrawlResult = await crawler.arun(
url=url,
config=crawler_config
)
# Check results
print(f"✓ Status Code: {result.status_code}")
print(f"✓ Success: {result.success}")
print(f"✓ HTML Length: {len(result.html)}")
print(f"✓ Markdown Length: {len(result.markdown.raw_markdown)}")
# Check for common bot detection indicators
detection_indicators = [
"Access denied",
"Please verify you are human",
"Checking your browser",
"Enable JavaScript",
"captcha",
"403 Forbidden",
"Bot detection",
"Security check"
]
content_lower = result.markdown.raw_markdown.lower()
detected = False
for indicator in detection_indicators:
if indicator.lower() in content_lower:
print(f"⚠️ Possible detection: Found '{indicator}'")
detected = True
break
if not detected:
print("✅ No obvious bot detection triggered!")
# Show first 200 chars of content
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
return result.success and not detected
except Exception as e:
print(f"❌ Error: {str(e)}")
return False
async def compare_adapters(url: str, site_name: str):
"""Compare regular and undetected adapters on the same URL"""
print(f"\n{'#'*60}")
print(f"# Testing: {site_name}")
print(f"{'#'*60}")
# Test with regular adapter
regular_adapter = PlaywrightAdapter()
regular_success = await test_with_adapter(url, "Regular", regular_adapter)
# Small delay between tests
await asyncio.sleep(2)
# Test with undetected adapter
undetected_adapter = UndetectedAdapter()
undetected_success = await test_with_adapter(url, "Undetected", undetected_adapter)
# Summary
print(f"\n{'='*60}")
print(f"Summary for {site_name}:")
print(f"Regular Adapter: {'✅ Passed' if regular_success else '❌ Blocked/Detected'}")
print(f"Undetected Adapter: {'✅ Passed' if undetected_success else '❌ Blocked/Detected'}")
print(f"{'='*60}")
return regular_success, undetected_success
async def main():
"""Run comparison tests on multiple sites"""
print("🤖 Crawl4AI Browser Adapter Comparison")
print("Testing regular vs undetected browser modes\n")
results = {}
# Test each site
for site_name, url in TEST_SITES.items():
regular, undetected = await compare_adapters(url, site_name)
results[site_name] = {
"regular": regular,
"undetected": undetected
}
# Delay between different sites
await asyncio.sleep(3)
# Final summary
print(f"\n{'#'*60}")
print("# FINAL RESULTS")
print(f"{'#'*60}")
print(f"{'Site':<30} {'Regular':<15} {'Undetected':<15}")
print(f"{'-'*60}")
for site, result in results.items():
regular_status = "✅ Passed" if result["regular"] else "❌ Blocked"
undetected_status = "✅ Passed" if result["undetected"] else "❌ Blocked"
print(f"{site:<30} {regular_status:<15} {undetected_status:<15}")
# Calculate success rates
regular_success = sum(1 for r in results.values() if r["regular"])
undetected_success = sum(1 for r in results.values() if r["undetected"])
total = len(results)
print(f"\n{'='*60}")
print(f"Success Rates:")
print(f"Regular Adapter: {regular_success}/{total} ({regular_success/total*100:.1f}%)")
print(f"Undetected Adapter: {undetected_success}/{total} ({undetected_success/total*100:.1f}%)")
print(f"{'='*60}")
if __name__ == "__main__":
# Note: This example may take a while to run as it tests multiple sites
# You can comment out sites in TEST_SITES to run faster tests
asyncio.run(main())

View File

@@ -0,0 +1,118 @@
"""
Simple Undetected Browser Demo
Demonstrates the basic usage of undetected browser mode
"""
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def crawl_with_regular_browser(url: str):
"""Crawl with regular browser"""
print("\n[Regular Browser Mode]")
browser_config = BrowserConfig(
headless=False,
verbose=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
config=CrawlerRunConfig(
delay_before_return_html=2.0
)
)
print(f"Success: {result.success}")
print(f"Status: {result.status_code}")
print(f"Content length: {len(result.markdown.raw_markdown)}")
# Check for bot detection keywords
content = result.markdown.raw_markdown.lower()
if any(word in content for word in ["cloudflare", "checking your browser", "please wait"]):
print("⚠️ Bot detection triggered!")
else:
print("✅ Page loaded successfully")
return result
async def crawl_with_undetected_browser(url: str):
"""Crawl with undetected browser"""
print("\n[Undetected Browser Mode]")
browser_config = BrowserConfig(
headless=False,
verbose=True,
)
# Create undetected adapter and strategy
undetected_adapter = UndetectedAdapter()
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
result = await crawler.arun(
url=url,
config=CrawlerRunConfig(
delay_before_return_html=2.0
)
)
print(f"Success: {result.success}")
print(f"Status: {result.status_code}")
print(f"Content length: {len(result.markdown.raw_markdown)}")
# Check for bot detection keywords
content = result.markdown.raw_markdown.lower()
if any(word in content for word in ["cloudflare", "checking your browser", "please wait"]):
print("⚠️ Bot detection triggered!")
else:
print("✅ Page loaded successfully")
return result
async def main():
"""Demo comparing regular vs undetected modes"""
print("🤖 Crawl4AI Undetected Browser Demo")
print("="*50)
# Test URLs - you can change these
test_urls = [
"https://www.example.com", # Simple site
"https://httpbin.org/headers", # Shows request headers
]
for url in test_urls:
print(f"\n📍 Testing URL: {url}")
# Test with regular browser
regular_result = await crawl_with_regular_browser(url)
# Small delay
await asyncio.sleep(2)
# Test with undetected browser
undetected_result = await crawl_with_undetected_browser(url)
# Compare results
print(f"\n📊 Comparison for {url}:")
print(f"Regular browser content: {len(regular_result.markdown.raw_markdown)} chars")
print(f"Undetected browser content: {len(undetected_result.markdown.raw_markdown)} chars")
if url == "https://httpbin.org/headers":
# Show headers for comparison
print("\nHeaders seen by server:")
print("Regular:", regular_result.markdown.raw_markdown[:500])
print("\nUndetected:", undetected_result.markdown.raw_markdown[:500])
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,432 @@
# Advanced Adaptive Strategies
## Overview
While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.
## The Three-Layer Scoring System
### 1. Coverage Score
Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.
#### Mathematical Foundation
```python
Coverage(K, Q) = Σ(t Q) score(t, K) / |Q|
where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```
#### Components
- **Document Coverage**: Percentage of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently
#### Tuning Coverage
```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
confidence_threshold=0.85, # Require high coverage
top_k_links=5 # Cast wider net
)
# For general topics with synonyms
config = AdaptiveConfig(
confidence_threshold=0.6, # Lower threshold
top_k_links=2 # More focused
)
```
### 2. Consistency Score
Consistency evaluates whether the information across pages is coherent and non-contradictory.
#### How It Works
1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns normalized score (0-1)
#### Practical Impact
- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information, need more sources
### 3. Saturation Score
Saturation detects when new pages stop providing novel information.
#### Detection Algorithm
```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30 # 60% of first
new_terms_page_3 = 15 # 50% of second
new_terms_page_4 = 5 # 33% of third
# Saturation detected: rapidly diminishing returns
```
#### Configuration
```python
config = AdaptiveConfig(
min_gain_threshold=0.1 # Stop if <10% new information
)
```
## Link Ranking Algorithm
### Expected Information Gain
Each uncrawled link is scored based on:
```python
ExpectedGain(link) = Relevance × Novelty × Authority
```
#### 1. Relevance Scoring
Uses BM25 algorithm on link preview text:
```python
relevance = BM25(link.preview_text, query)
```
Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization
#### 2. Novelty Estimation
Measures how different the link appears from already-crawled content:
```python
novelty = 1 - max_similarity(preview, knowledge_base)
```
Prevents crawling duplicate or highly similar pages.
#### 3. Authority Calculation
URL structure and domain analysis:
```python
authority = f(domain_rank, url_depth, url_structure)
```
Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
### Custom Link Scoring
```python
class CustomLinkScorer:
def score(self, link: Link, query: str, state: CrawlState) -> float:
# Prioritize specific URL patterns
if "/api/reference/" in link.href:
return 2.0 # Double the score
# Deprioritize certain sections
if "/archive/" in link.href:
return 0.1 # Reduce score by 90%
# Default scoring
return 1.0
# Use with adaptive crawler
adaptive = AdaptiveCrawler(
crawler,
config=config,
link_scorer=CustomLinkScorer()
)
```
## Domain-Specific Configurations
### Technical Documentation
```python
tech_doc_config = AdaptiveConfig(
confidence_threshold=0.85,
max_pages=30,
top_k_links=3,
min_gain_threshold=0.05 # Keep crawling for small gains
)
```
Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth
### News & Articles
```python
news_config = AdaptiveConfig(
confidence_threshold=0.6,
max_pages=10,
top_k_links=5,
min_gain_threshold=0.15 # Stop quickly on repetition
)
```
Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)
### E-commerce
```python
ecommerce_config = AdaptiveConfig(
confidence_threshold=0.7,
max_pages=20,
top_k_links=2,
min_gain_threshold=0.1
)
```
Rationale:
- Balanced threshold for product variations
- Focused link following (avoid infinite products)
- Standard gain threshold
### Research & Academic
```python
research_config = AdaptiveConfig(
confidence_threshold=0.9,
max_pages=50,
top_k_links=4,
min_gain_threshold=0.02 # Very low - capture citations
)
```
Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references
## Performance Optimization
### Memory Management
```python
# For large crawls, use streaming
config = AdaptiveConfig(
max_pages=100,
save_state=True,
state_path="large_crawl.json"
)
# Periodically clean state
if len(state.knowledge_base) > 1000:
# Keep only most relevant
state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
```
### Parallel Processing
```python
# Use multiple start points
start_urls = [
"https://docs.example.com/intro",
"https://docs.example.com/api",
"https://docs.example.com/guides"
]
# Crawl in parallel
tasks = [
adaptive.digest(url, query)
for url in start_urls
]
results = await asyncio.gather(*tasks)
```
### Caching Strategy
```python
# Enable caching for repeated crawls
async with AsyncWebCrawler(
config=BrowserConfig(
cache_mode=CacheMode.ENABLED
)
) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
```
## Debugging & Analysis
### Enable Verbose Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```
### Analyze Crawl Patterns
```python
# After crawling
state = await adaptive.digest(start_url, query)
# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
print(f"{i+1}. {url}")
# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
print(f"Page {i+1}: {new_terms} new terms")
# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```
### Export for Analysis
```python
# Export detailed metrics
import json
metrics = {
"query": query,
"total_pages": len(state.crawled_urls),
"confidence": adaptive.confidence,
"coverage_stats": adaptive.coverage_stats,
"crawl_order": state.crawl_order,
"term_frequencies": dict(state.term_frequencies),
"new_terms_history": state.new_terms_history
}
with open("crawl_analysis.json", "w") as f:
json.dump(metrics, f, indent=2)
```
## Custom Strategies
### Implementing a Custom Strategy
```python
from crawl4ai.adaptive_crawler import BaseStrategy
class DomainSpecificStrategy(BaseStrategy):
def calculate_coverage(self, state: CrawlState) -> float:
# Custom coverage calculation
# e.g., weight certain terms more heavily
pass
def calculate_consistency(self, state: CrawlState) -> float:
# Custom consistency logic
# e.g., domain-specific validation
pass
def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
# Custom link ranking
# e.g., prioritize specific URL patterns
pass
# Use custom strategy
adaptive = AdaptiveCrawler(
crawler,
config=config,
strategy=DomainSpecificStrategy()
)
```
### Combining Strategies
```python
class HybridStrategy(BaseStrategy):
def __init__(self):
self.strategies = [
TechnicalDocStrategy(),
SemanticSimilarityStrategy(),
URLPatternStrategy()
]
def calculate_confidence(self, state: CrawlState) -> float:
# Weighted combination of strategies
scores = [s.calculate_confidence(state) for s in self.strategies]
weights = [0.5, 0.3, 0.2]
return sum(s * w for s, w in zip(scores, weights))
```
## Best Practices
### 1. Start Conservative
Begin with default settings and adjust based on results:
```python
# Start with defaults
result = await adaptive.digest(url, query)
# Analyze and adjust
if adaptive.confidence < 0.7:
config.max_pages += 10
config.confidence_threshold -= 0.1
```
### 2. Monitor Resource Usage
```python
import psutil
# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
config.max_pages = min(config.max_pages, 20)
```
### 3. Use Domain Knowledge
```python
# For API documentation
if "api" in start_url:
config.top_k_links = 2 # APIs have clear structure
# For blogs
if "blog" in start_url:
config.min_gain_threshold = 0.2 # Avoid similar posts
```
### 4. Validate Results
```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)
# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()
for doc in relevant_content:
content_lower = doc['content'].lower()
for term in query_terms:
if term in content_lower:
covered_terms.add(term)
coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")
```
## Next Steps
- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)

View File

@@ -66,29 +66,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
pdf=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
config=run_config
)
if result.success:
# Save screenshot
print(f"Screenshot data present: {result.screenshot is not None}")
print(f"PDF data present: {result.pdf is not None}")
if result.screenshot:
print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Save PDF
else:
print("[WARN] Screenshot data is None.")
if result.pdf:
print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
print("[OK] PDF & screenshot captured.")
else:
print("[WARN] PDF data is None.")
else:
print("[ERROR]", result.error_message)
@@ -349,9 +358,77 @@ if __name__ == "__main__":
---
---
## 7. Anti-Bot Features (Stealth Mode & Undetected Browser)
Crawl4AI provides two powerful features to bypass bot detection:
### 7.1 Stealth Mode
Stealth mode uses playwright-stealth to modify browser fingerprints and behaviors. Enable it with a simple flag:
```python
browser_config = BrowserConfig(
enable_stealth=True, # Activates stealth mode
headless=False
)
```
**When to use**: Sites with basic bot detection (checking navigator.webdriver, plugins, etc.)
### 7.2 Undetected Browser
For advanced bot detection, use the undetected browser adapter:
```python
from crawl4ai import UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Create undetected adapter
adapter = UndetectedAdapter()
strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
async with AsyncWebCrawler(crawler_strategy=strategy, config=browser_config) as crawler:
# Your crawling code
```
**When to use**: Sites with sophisticated bot detection (Cloudflare, DataDome, etc.)
### 7.3 Combining Both
For maximum evasion, combine stealth mode with undetected browser:
```python
browser_config = BrowserConfig(
enable_stealth=True, # Enable stealth
headless=False
)
adapter = UndetectedAdapter() # Use undetected browser
```
### Choosing the Right Approach
| Detection Level | Recommended Approach |
|----------------|---------------------|
| No protection | Regular browser |
| Basic checks | Regular + Stealth mode |
| Advanced protection | Undetected browser |
| Maximum evasion | Undetected + Stealth mode |
**Best Practice**: Start with regular browser + stealth mode. Only use undetected browser if needed, as it may be slightly slower.
See [Undetected Browser Mode](undetected-browser.md) for detailed examples.
---
## Conclusion & Next Steps
Youve now explored several **advanced** features:
You've now explored several **advanced** features:
- **Proxy Usage**
- **PDF & Screenshot** capturing for large or critical pages
@@ -359,7 +436,10 @@ Youve now explored several **advanced** features:
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
- **Robots.txt Compliance**
- **Anti-Bot Features** (Stealth Mode & Undetected Browser)
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, manage sessions across multiple runs, and bypass bot detection—streamlining your entire data collection pipeline.
**Last Updated**: 2025-01-01
**Note**: In future versions, we may enable stealth mode and undetected browser by default. For now, users should explicitly enable these features when needed.
**Last Updated**: 2025-01-17

View File

@@ -404,7 +404,182 @@ for result in results:
print(f"Duration: {dr.end_time - dr.start_time}")
```
## 6. Summary
## 6. URL-Specific Configurations
When crawling diverse content types, you often need different configurations for different URLs. For example:
- PDFs need specialized extraction
- Blog pages benefit from content filtering
- Dynamic sites need JavaScript execution
- API endpoints need JSON parsing
### 6.1 Basic URL Pattern Matching
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def crawl_mixed_content():
# Configure different strategies for different content
configs = [
# PDF files - specialized extraction
CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
),
# Blog/article pages - content filtering
CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*"],
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48)
)
),
# Dynamic pages - JavaScript execution
CrawlerRunConfig(
url_matcher=lambda url: 'github.com' in url,
js_code="window.scrollTo(0, 500);"
),
# API endpoints - JSON extraction
CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json'),
# Custome settings for JSON extraction
),
# Default config for everything else
CrawlerRunConfig() # No url_matcher means it matches ALL URLs (fallback)
]
# Mixed URLs
urls = [
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
"https://blog.python.org/",
"https://github.com/microsoft/playwright",
"https://httpbin.org/json",
"https://example.com/"
]
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
config=configs # Pass list of configs
)
for result in results:
print(f"{result.url}: {len(result.markdown)} chars")
```
### 6.2 Advanced Pattern Matching
**Important**: A `CrawlerRunConfig` without `url_matcher` (or with `url_matcher=None`) matches ALL URLs. This makes it perfect as a default/fallback configuration.
The `url_matcher` parameter supports three types of patterns:
#### Glob Patterns (Strings)
```python
# Simple patterns
"*.pdf" # Any PDF file
"*/api/*" # Any URL with /api/ in path
"https://*.example.com/*" # Subdomain matching
"*://example.com/blog/*" # Any protocol
```
#### Custom Functions
```python
# Complex logic with lambdas
lambda url: url.startswith('https://') and 'secure' in url
lambda url: len(url) > 50 and url.count('/') > 5
lambda url: any(domain in url for domain in ['api.', 'data.', 'feed.'])
```
#### Mixed Lists with AND/OR Logic
```python
# Combine multiple conditions
CrawlerRunConfig(
url_matcher=[
"https://*", # Must be HTTPS
lambda url: 'internal' in url, # Must contain 'internal'
lambda url: not url.endswith('.pdf') # Must not be PDF
],
match_mode=MatchMode.AND # ALL conditions must match
)
```
### 6.3 Practical Example: News Site Crawler
```python
async def crawl_news_site():
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
rate_limiter=RateLimiter(base_delay=(1.0, 2.0))
)
configs = [
# Homepage - light extraction
CrawlerRunConfig(
url_matcher=lambda url: url.rstrip('/') == 'https://news.ycombinator.com',
css_selector="nav, .headline",
extraction_strategy=None
),
# Article pages - full extraction
CrawlerRunConfig(
url_matcher="*/article/*",
extraction_strategy=CosineStrategy(
semantic_filter="article content",
word_count_threshold=100
),
screenshot=True,
excluded_tags=["nav", "aside", "footer"]
),
# Author pages - metadata focus
CrawlerRunConfig(
url_matcher="*/author/*",
extraction_strategy=JsonCssExtractionStrategy({
"name": "h1.author-name",
"bio": ".author-bio",
"articles": "article.post-card h2"
})
),
# Everything else
CrawlerRunConfig()
]
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=news_urls,
config=configs,
dispatcher=dispatcher
)
```
### 6.4 Best Practices
1. **Order Matters**: Configs are evaluated in order - put specific patterns before general ones
2. **Default Config Behavior**:
- A config without `url_matcher` matches ALL URLs
- Always include a default config as the last item if you want to handle all URLs
- Without a default config, unmatched URLs will fail with "No matching configuration found"
3. **Test Your Patterns**: Use the config's `is_match()` method to test patterns:
```python
config = CrawlerRunConfig(url_matcher="*.pdf")
print(config.is_match("https://example.com/doc.pdf")) # True
default_config = CrawlerRunConfig() # No url_matcher
print(default_config.is_match("https://any-url.com")) # True - matches everything!
```
4. **Optimize for Performance**:
- Disable JS for static content
- Skip screenshots for data APIs
- Use appropriate extraction strategies
## 7. Summary
1.**Two Dispatcher Types**:

View File

@@ -0,0 +1,201 @@
# PDF Processing Strategies
Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.
## `PDFCrawlerStrategy`
### Overview
`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.
### When to Use
Use `PDFCrawlerStrategy` when you need to:
- Process PDF files using the `AsyncWebCrawler`.
- Handle PDFs from both web URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
- Integrate PDF content extraction into a unified `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.
### Key Methods and Their Behavior
- **`__init__(self, logger: AsyncLogger = None)`**:
- Initializes the strategy.
- `logger`: An optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
- **`async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`**:
- This method is called by the `AsyncWebCrawler` during the `arun` process.
- It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`.
- The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to the `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy).
- It sets `response_headers` to indicate "application/pdf" and `status_code` to 200.
- **`async close(self)`**:
- A method for cleaning up any resources used by the strategy. For `PDFCrawlerStrategy`, this is usually minimal.
- **`async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`**:
- Enables asynchronous context management for the strategy, allowing it to be used with `async with`.
### Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
async def main():
# Initialize the PDF crawler strategy
pdf_crawler_strategy = PDFCrawlerStrategy()
# PDFCrawlerStrategy is typically used in conjunction with PDFContentScrapingStrategy
# The scraping strategy handles the actual PDF content extraction
pdf_scraping_strategy = PDFContentScrapingStrategy()
run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)
async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
# Example with a remote PDF URL
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # A public PDF from arXiv
print(f"Attempting to process PDF: {pdf_url}")
result = await crawler.arun(url=pdf_url, config=run_config)
if result.success:
print(f"Successfully processed PDF: {result.url}")
print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
# Further processing of result.markdown, result.media, etc.
# would be done here, based on what PDFContentScrapingStrategy extracts.
if result.markdown and hasattr(result.markdown, 'raw_markdown'):
print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
else:
print("No markdown (text) content extracted.")
else:
print(f"Failed to process PDF: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
### Pros and Cons
**Pros:**
- Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.
- Provides a consistent interface for specifying PDF sources (URLs or local paths).
- Abstracts the source handling, allowing a separate scraping strategy to focus on PDF content parsing.
**Cons:**
- Does not perform any PDF data extraction itself; it strictly relies on a compatible scraping strategy (like `PDFContentScrapingStrategy`) to process the PDF.
- Has limited utility on its own; most of its value comes from being paired with a PDF-specific content scraping strategy.
---
## `PDFContentScrapingStrategy`
### Overview
`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. This strategy uses the `NaivePDFProcessorStrategy` internally to perform the low-level PDF parsing.
### When to Use
Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
- Extract textual content page by page from a PDF document.
- Retrieve standard metadata embedded within the PDF (e.g., title, author, subject, creation date, page count).
- Optionally, extract images contained within the PDF pages. These images can be saved to a local directory or made available for further processing.
- Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).
### Key Configuration Attributes
When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:
- **`extract_images: bool = False`**: If `True`, the strategy will attempt to extract images from the PDF.
- **`save_images_locally: bool = False`**: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in the `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
- **`image_save_dir: str = None`**: Specifies the directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
- **`batch_size: int = 4`**: Defines how many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
- **`logger: AsyncLogger = None`**: An optional `AsyncLogger` instance for logging.
### Key Methods and Their Behavior
- **`__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`**:
- Initializes the strategy with configurations for image handling, batch processing, and logging. It sets up an internal `NaivePDFProcessorStrategy` instance which performs the actual PDF parsing.
- **`scrap(self, url: str, html: str, **params) -> ScrapingResult`**:
- This is the primary synchronous method called by the crawler (via `ascrap`) to process the PDF.
- `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
- `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
- It first ensures the PDF is accessible locally (downloads it to a temporary file if `url` is remote).
- It then uses its internal PDF processor to extract text, metadata, and images (if configured).
- The extracted information is compiled into a `ScrapingResult` object:
- `cleaned_html`: Contains an HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
- `media`: A dictionary where `media["images"]` will contain information about extracted images if `extract_images` was `True`.
- `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
- `metadata`: A dictionary holding PDF metadata (e.g., title, author, num_pages).
- **`async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`**:
- The asynchronous version of `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
- **`_get_pdf_path(self, url: str) -> str`**:
- A private helper method to manage PDF file access. If the `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
### Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
import os # For creating image directory
async def main():
# Define the directory for saving extracted images
image_output_dir = "./my_pdf_images"
os.makedirs(image_output_dir, exist_ok=True)
# Configure the PDF content scraping strategy
# Enable image extraction and specify where to save them
pdf_scraping_cfg = PDFContentScrapingStrategy(
extract_images=True,
save_images_locally=True,
image_save_dir=image_output_dir,
batch_size=2 # Process 2 pages at a time for demonstration
)
# The PDFCrawlerStrategy is needed to tell AsyncWebCrawler how to "crawl" a PDF
pdf_crawler_cfg = PDFCrawlerStrategy()
# Configure the overall crawl run
run_cfg = CrawlerRunConfig(
scraping_strategy=pdf_scraping_cfg # Use our PDF scraping strategy
)
# Initialize the crawler with the PDF-specific crawler strategy
async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # Example PDF
print(f"Starting PDF processing for: {pdf_url}")
result = await crawler.arun(url=pdf_url, config=run_cfg)
if result.success:
print("\n--- PDF Processing Successful ---")
print(f"Processed URL: {result.url}")
print("\n--- Metadata ---")
for key, value in result.metadata.items():
print(f" {key.replace('_', ' ').title()}: {value}")
if result.markdown and hasattr(result.markdown, 'raw_markdown'):
print(f"\n--- Extracted Text (Markdown Snippet) ---")
print(result.markdown.raw_markdown[:500].strip() + "...")
else:
print("\nNo text (markdown) content extracted.")
if result.media and result.media.get("images"):
print(f"\n--- Image Extraction ---")
print(f"Extracted {len(result.media['images'])} image(s).")
for i, img_info in enumerate(result.media["images"][:2]): # Show info for first 2 images
print(f" Image {i+1}:")
print(f" Page: {img_info.get('page')}")
print(f" Format: {img_info.get('format', 'N/A')}")
if img_info.get('path'):
print(f" Saved at: {img_info.get('path')}")
else:
print("\nNo images were extracted (or extract_images was False).")
else:
print(f"\n--- PDF Processing Failed ---")
print(f"Error: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
### Pros and Cons
**Pros:**
- Provides a comprehensive way to extract text, metadata, and (optionally) images from PDF documents.
- Handles both remote PDFs (via URL) and local PDF files.
- Configurable image extraction allows saving images to disk or accessing their data.
- Integrates smoothly with the `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.
**Cons:**
- Extraction quality and performance can vary significantly depending on the PDF's complexity, encoding, and whether it's image-based (scanned) or text-based.
- Image extraction can be resource-intensive (both CPU and disk space if `save_images_locally` is true).
- Relies on `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
- Link extraction from PDFs can be basic and depends on how hyperlinks are embedded in the document.

View File

@@ -25,44 +25,70 @@ Use an authenticated proxy with `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig
proxy_config = {
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
}
browser_config = BrowserConfig(proxy_config=proxy_config)
browser_config = BrowserConfig(proxy="http://[username]:[password]@[host]:[port]")
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
Here's the corrected documentation:
## Rotating Proxies
Example using a proxy rotation service dynamically:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def get_next_proxy():
# Your proxy rotation logic here
return {"server": "http://next.proxy.com:8080"}
import re
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
RoundRobinProxyStrategy,
)
import asyncio
from crawl4ai import ProxyConfig
async def main():
browser_config = BrowserConfig()
run_config = CrawlerRunConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
# For each URL, create a new run config with different proxy
for url in urls:
proxy = await get_next_proxy()
# Clone the config and update proxy - this creates a new browser context
current_config = run_config.clone(proxy_config=proxy)
result = await crawler.arun(url=url, config=current_config)
# Load proxies and create rotation strategy
proxies = ProxyConfig.from_env()
#eg: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
if not proxies:
print("No proxies found in environment. Set PROXIES env variable!")
return
proxy_strategy = RoundRobinProxyStrategy(proxies)
# Create configs
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=proxy_strategy
)
async with AsyncWebCrawler(config=browser_config) as crawler:
urls = ["https://httpbin.org/ip"] * (len(proxies) * 2) # Test each proxy twice
print("\n📈 Initializing crawler with proxy rotation...")
async with AsyncWebCrawler(config=browser_config) as crawler:
print("\n🚀 Starting batch crawl with proxy rotation...")
results = await crawler.arun_many(
urls=urls,
config=run_config
)
for result in results:
if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
current_proxy = run_config.proxy_config if run_config.proxy_config else None
if current_proxy and ip_match:
print(f"URL {result.url}")
print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
verified = ip_match.group(0) == current_proxy.ip
if verified:
print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
else:
print("❌ Proxy failed or IP mismatch!")
print("---")
asyncio.run(main())
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```

View File

@@ -49,46 +49,75 @@ from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():
async with AsyncWebCrawler() as crawler:
session_id = "github_commits_session"
url = "https://github.com/microsoft/TypeScript/commits/main"
all_commits = []
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "wait_for_session"
all_commits = []
# Define extraction schema
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [{
"name": "title", "selector": "h4.markdown-title", "type": "text"
}],
}
extraction_strategy = JsonCssExtractionStrategy(schema)
js_next_page = """
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length > 0) {
window.lastCommit = commits[0].textContent.trim();
}
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) {button.click(); console.log('button clicked') }
"""
# JavaScript and wait configurations
js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
# Crawl multiple pages
wait_for = """() => {
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length === 0) return false;
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.lastCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li[data-testid='commit-row-item']",
"fields": [
{
"name": "title",
"selector": "h4 a",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
verbose=True,
headless=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
for page in range(3):
config = CrawlerRunConfig(
url=url,
crawler_config = CrawlerRunConfig(
session_id=session_id,
css_selector="li[data-testid='commit-row-item']",
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS
cache_mode=CacheMode.BYPASS,
capture_console_messages=True,
)
result = await crawler.arun(config=config)
if result.success:
result = await crawler.arun(url=url, config=crawler_config)
if result.console_messages:
print(f"Page {page + 1} console messages:", result.console_messages)
if result.extracted_content:
# print(f"Page {page + 1} result:", result.extracted_content)
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
else:
print(f"Page {page + 1}: No content extracted")
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
return all_commits
```
---

View File

@@ -0,0 +1,394 @@
# Undetected Browser Mode
## Overview
Crawl4AI offers two powerful anti-bot features to help you access websites with bot detection:
1. **Stealth Mode** - Uses playwright-stealth to modify browser fingerprints and behaviors
2. **Undetected Browser Mode** - Advanced browser adapter with deep-level patches for sophisticated bot detection
This guide covers both features and helps you choose the right approach for your needs.
## Anti-Bot Features Comparison
| Feature | Regular Browser | Stealth Mode | Undetected Browser |
|---------|----------------|--------------|-------------------|
| WebDriver Detection | ❌ | ✅ | ✅ |
| Navigator Properties | ❌ | ✅ | ✅ |
| Plugin Emulation | ❌ | ✅ | ✅ |
| CDP Detection | ❌ | Partial | ✅ |
| Deep Browser Patches | ❌ | ❌ | ✅ |
| Performance Impact | None | Minimal | Moderate |
| Setup Complexity | None | None | Minimal |
## When to Use Each Approach
### Use Regular Browser + Stealth Mode When:
- Sites have basic bot detection (checking navigator.webdriver, plugins, etc.)
- You need good performance with basic protection
- Sites check for common automation indicators
### Use Undetected Browser When:
- Sites employ sophisticated bot detection services (Cloudflare, DataDome, etc.)
- Stealth mode alone isn't sufficient
- You're willing to trade some performance for better evasion
### Best Practice: Progressive Enhancement
1. **Start with**: Regular browser + Stealth mode
2. **If blocked**: Switch to Undetected browser
3. **If still blocked**: Combine Undetected browser + Stealth mode
## Stealth Mode
Stealth mode is the simpler anti-bot solution that works with both regular and undetected browsers:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
# Enable stealth mode with regular browser
browser_config = BrowserConfig(
enable_stealth=True, # Simple flag to enable
headless=False # Better for avoiding detection
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://example.com")
```
### What Stealth Mode Does:
- Removes `navigator.webdriver` flag
- Modifies browser fingerprints
- Emulates realistic plugin behavior
- Adjusts navigator properties
- Fixes common automation leaks
## Undetected Browser Mode
For sites with sophisticated bot detection that stealth mode can't bypass, use the undetected browser adapter:
### Key Features
- **Drop-in Replacement**: Uses the same API as regular browser mode
- **Enhanced Stealth**: Built-in patches to evade common detection methods
- **Browser Adapter Pattern**: Seamlessly switch between regular and undetected modes
- **Automatic Installation**: `crawl4ai-setup` installs all necessary browser dependencies
### Quick Start
```python
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
# Create the undetected adapter
undetected_adapter = UndetectedAdapter()
# Create browser config
browser_config = BrowserConfig(
headless=False, # Headless mode can be detected easier
verbose=True,
)
# Create the crawler strategy with undetected adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
# Create the crawler with our custom strategy
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
# Your crawling code here
result = await crawler.arun(
url="https://example.com",
config=CrawlerRunConfig()
)
print(result.markdown[:500])
asyncio.run(main())
```
## Combining Both Features
For maximum evasion, combine stealth mode with undetected browser:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
# Create browser config with stealth enabled
browser_config = BrowserConfig(
enable_stealth=True, # Enable stealth mode
headless=False
)
# Create undetected adapter
adapter = UndetectedAdapter()
# Create strategy with both features
strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
async with AsyncWebCrawler(
crawler_strategy=strategy,
config=browser_config
) as crawler:
result = await crawler.arun("https://protected-site.com")
```
## Examples
### Example 1: Basic Stealth Mode
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def test_stealth_mode():
# Simple stealth mode configuration
browser_config = BrowserConfig(
enable_stealth=True,
headless=False
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://bot.sannysoft.com",
config=CrawlerRunConfig(screenshot=True)
)
if result.success:
print("✓ Successfully accessed bot detection test site")
# Save screenshot to verify detection results
if result.screenshot:
import base64
with open("stealth_test.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print("✓ Screenshot saved - check for green (passed) tests")
asyncio.run(test_stealth_mode())
```
### Example 2: Undetected Browser Mode
```python
import asyncio
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
# Create browser config
browser_config = BrowserConfig(
headless=False,
verbose=True,
)
# Create the undetected adapter
undetected_adapter = UndetectedAdapter()
# Create the crawler strategy with the undetected adapter
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=undetected_adapter
)
# Create the crawler with our custom strategy
async with AsyncWebCrawler(
crawler_strategy=crawler_strategy,
config=browser_config
) as crawler:
# Configure the crawl
crawler_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter()
),
capture_console_messages=True, # Test adapter console capture
)
# Test on a site that typically detects bots
print("Testing undetected adapter...")
result: CrawlResult = await crawler.arun(
url="https://www.helloworld.org",
config=crawler_config
)
print(f"Status: {result.status_code}")
print(f"Success: {result.success}")
print(f"Console messages captured: {len(result.console_messages or [])}")
print(f"Markdown content (first 500 chars):\n{result.markdown.raw_markdown[:500]}")
if __name__ == "__main__":
asyncio.run(main())
```
## Browser Adapter Pattern
The undetected browser support is implemented using an adapter pattern, allowing seamless switching between different browser implementations:
```python
# Regular browser adapter (default)
from crawl4ai import PlaywrightAdapter
regular_adapter = PlaywrightAdapter()
# Undetected browser adapter
from crawl4ai import UndetectedAdapter
undetected_adapter = UndetectedAdapter()
```
The adapter handles:
- JavaScript execution
- Console message capture
- Error handling
- Browser-specific optimizations
## Best Practices
1. **Avoid Headless Mode**: Detection is easier in headless mode
```python
browser_config = BrowserConfig(headless=False)
```
2. **Use Reasonable Delays**: Don't rush through pages
```python
crawler_config = CrawlerRunConfig(
wait_time=3.0, # Wait 3 seconds after page load
delay_before_return_html=2.0 # Additional delay
)
```
3. **Rotate User Agents**: You can customize user agents
```python
browser_config = BrowserConfig(
headers={"User-Agent": "your-user-agent"}
)
```
4. **Handle Failures Gracefully**: Some sites may still detect and block
```python
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
## Advanced Usage Tips
### Progressive Detection Handling
```python
async def crawl_with_progressive_evasion(url):
# Step 1: Try regular browser with stealth
browser_config = BrowserConfig(
enable_stealth=True,
headless=False
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url)
if result.success and "Access Denied" not in result.html:
return result
# Step 2: If blocked, try undetected browser
print("Regular + stealth blocked, trying undetected browser...")
adapter = UndetectedAdapter()
strategy = AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
browser_adapter=adapter
)
async with AsyncWebCrawler(
crawler_strategy=strategy,
config=browser_config
) as crawler:
result = await crawler.arun(url)
return result
```
## Installation
The undetected browser dependencies are automatically installed when you run:
```bash
crawl4ai-setup
```
This command installs all necessary browser dependencies for both regular and undetected modes.
## Limitations
- **Performance**: Slightly slower than regular mode due to additional patches
- **Headless Detection**: Some sites can still detect headless mode
- **Resource Usage**: May use more resources than regular mode
- **Not 100% Guaranteed**: Advanced anti-bot services are constantly evolving
## Troubleshooting
### Browser Not Found
Run the setup command:
```bash
crawl4ai-setup
```
### Detection Still Occurring
Try combining with other features:
```python
crawler_config = CrawlerRunConfig(
simulate_user=True, # Add user simulation
magic=True, # Enable magic mode
wait_time=5.0, # Longer waits
)
```
### Performance Issues
If experiencing slow performance:
```python
# Use selective undetected mode only for protected sites
if is_protected_site(url):
adapter = UndetectedAdapter()
else:
adapter = PlaywrightAdapter() # Default adapter
```
## Future Plans
**Note**: In future versions of Crawl4AI, we may enable stealth mode and undetected browser by default to provide better out-of-the-box success rates. For now, users should explicitly enable these features when needed.
## Conclusion
Crawl4AI provides flexible anti-bot solutions:
1. **Start Simple**: Use regular browser + stealth mode for most sites
2. **Escalate if Needed**: Switch to undetected browser for sophisticated protection
3. **Combine for Maximum Effect**: Use both features together when facing the toughest challenges
Remember:
- Always respect robots.txt and website terms of service
- Use appropriate delays to avoid overwhelming servers
- Consider the performance trade-offs of each approach
- Test progressively to find the minimum necessary evasion level
## See Also
- [Advanced Features](advanced-features.md) - Overview of all advanced features
- [Proxy & Security](proxy-security.md) - Using proxies with anti-bot features
- [Session Management](session-management.md) - Maintaining sessions across requests
- [Identity Based Crawling](identity-based-crawling.md) - Additional anti-detection strategies

View File

@@ -91,13 +91,12 @@ async def crawl_twitter_timeline():
wait_after_scroll=1.0 # Twitter needs time to load
)
browser_config = BrowserConfig(headless=True) # Set to False to watch it work
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config,
# Optional: Set headless=False to watch it work
# browser_config=BrowserConfig(headless=False)
virtual_scroll_config=virtual_config
)
async with AsyncWebCrawler() as crawler:
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://twitter.com/search?q=AI",
config=config
@@ -200,7 +199,7 @@ Use **scan_full_page** when:
Virtual Scroll works seamlessly with extraction strategies:
```python
from crawl4ai import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Define extraction schema
schema = {
@@ -222,7 +221,7 @@ config = CrawlerRunConfig(
scroll_count=20
),
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
schema=schema
)
)

View File

@@ -0,0 +1,244 @@
# AdaptiveCrawler
The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
## Constructor
```python
AdaptiveCrawler(
crawler: AsyncWebCrawler,
config: Optional[AdaptiveConfig] = None
)
```
### Parameters
- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
## Primary Method
### digest()
The main method that performs adaptive crawling starting from a URL with a specific query.
```python
async def digest(
start_url: str,
query: str,
resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
#### Parameters
- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from
#### Returns
- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics
#### Example
```python
async with AsyncWebCrawler() as crawler:
adaptive = AdaptiveCrawler(crawler)
state = await adaptive.digest(
start_url="https://docs.python.org",
query="async context managers"
)
```
## Properties
### confidence
Current confidence score (0-1) indicating information sufficiency.
```python
@property
def confidence(self) -> float
```
### coverage_stats
Dictionary containing detailed coverage statistics.
```python
@property
def coverage_stats(self) -> Dict[str, float]
```
Returns:
- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score
### is_sufficient
Boolean indicating whether sufficient information has been gathered.
```python
@property
def is_sufficient(self) -> bool
```
### state
Access to the current crawl state.
```python
@property
def state(self) -> CrawlState
```
## Methods
### get_relevant_content()
Retrieve the most relevant content from the knowledge base.
```python
def get_relevant_content(
self,
top_k: int = 5
) -> List[Dict[str, Any]]
```
#### Parameters
- **top_k** (`int`): Number of top relevant documents to return (default: 5)
#### Returns
List of dictionaries containing:
- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata
### print_stats()
Display crawl statistics in formatted output.
```python
def print_stats(
self,
detailed: bool = False
) -> None
```
#### Parameters
- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.
### export_knowledge_base()
Export the collected knowledge base to a JSONL file.
```python
def export_knowledge_base(
self,
path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Output file path for JSONL export
#### Example
```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```
### import_knowledge_base()
Import a previously exported knowledge base.
```python
def import_knowledge_base(
self,
path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Path to JSONL file to import
## Configuration
The `AdaptiveConfig` class controls the behavior of adaptive crawling:
```python
@dataclass
class AdaptiveConfig:
confidence_threshold: float = 0.8 # Stop when confidence reaches this
max_pages: int = 50 # Maximum pages to crawl
top_k_links: int = 5 # Links to follow per page
min_gain_threshold: float = 0.1 # Minimum expected gain to continue
save_state: bool = False # Auto-save crawl state
state_path: Optional[str] = None # Path for state persistence
```
### Example with Custom Config
```python
config = AdaptiveConfig(
confidence_threshold=0.7,
max_pages=20,
top_k_links=3
)
adaptive = AdaptiveCrawler(crawler, config=config)
```
## Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def main():
# Configure adaptive crawling
config = AdaptiveConfig(
confidence_threshold=0.75,
max_pages=15,
save_state=True,
state_path="my_crawl.json"
)
async with AsyncWebCrawler() as crawler:
adaptive = AdaptiveCrawler(crawler, config)
# Start crawling
state = await adaptive.digest(
start_url="https://example.com/docs",
query="authentication oauth2 jwt"
)
# Check results
print(f"Confidence achieved: {adaptive.confidence:.0%}")
adaptive.print_stats()
# Get most relevant pages
for page in adaptive.get_relevant_content(top_k=3):
print(f"- {page['url']} (score: {page['score']:.2f})")
# Export for later use
adaptive.export_knowledge_base("auth_knowledge.jsonl")
if __name__ == "__main__":
asyncio.run(main())
```
## See Also
- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)

Some files were not shown because too many files have changed in this diff Show More