Compare commits

...

123 Commits

Author SHA1 Message Date
ntohidi
61be862ab0 fix: add disk cleanup step to Docker workflow 2025-12-11 10:28:15 +01:00
ntohidi
9672afded2 docs: add section for Crawl4AI Cloud API closed beta with application link 2025-12-09 10:27:15 +01:00
Nasrin
60d6173914 Merge pull request #1661 from unclecode/waitlist
announcement: add application form for cloud API closed beta
2025-12-09 16:44:15 +08:00
ntohidi
48c31c4cb9 Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
2025-12-08 15:42:29 +01:00
Aravind Karnam
48b6283e71 announcement: add application form for cloud API closed beta 2025-12-08 14:00:57 +05:30
Nasrin
5a8fb57795 Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter
[Fix]: Docker server does not decode ContentRelevanceFilter
2025-12-03 18:36:07 +08:00
ntohidi
df4d87ed78 refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 2025-12-03 10:59:18 +01:00
Nasrin
f32cfc6db0 Merge pull request #1645 from unclecode/fix/configurable-backoff
Make LLM backoff configurable end-to-end
2025-12-02 21:07:49 +08:00
Nasrin
d06c39e8ab Merge pull request #1641 from unclecode/fix/serialize-proxy-config
Fix BrowserConfig proxy_config serialization
2025-12-02 21:06:02 +08:00
ntohidi
afc31e144a Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-12-02 13:01:11 +01:00
ntohidi
07ccf13be6 Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 2025-12-02 13:00:54 +01:00
Aravind
3a07c5962c Sponsors/new (#1643) 2025-12-02 00:49:39 +01:00
Chris Murphy
6893094f58 parameterized tests 2025-12-01 16:19:19 -05:00
Chris Murphy
3a8f8298d3 import modules from enhanceable deserialization 2025-12-01 16:18:59 -05:00
Chris Murphy
e95e8e1a97 generalized query in ContentRelevanceFilter to be a str or list 2025-12-01 16:16:31 -05:00
Chris Murphy
eb76df2c0d added missing deep crawling objects to init 2025-12-01 16:15:58 -05:00
Chris Murphy
6ec6bc4d8a pass timeout parameter to docker client request 2025-12-01 16:15:27 -05:00
Chris Murphy
33a3cc3933 reproduced AttributeError from #1642 2025-12-01 11:31:07 -05:00
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
Nasrin
dcb77c94bf Merge pull request #1623 from unclecode/fix/deprecated_pydantic
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
2025-11-27 20:05:42 +08:00
Soham Kukreti
a0c5f0f79a fix: ensure BrowserConfig.to_dict serializes proxy_config 2025-11-26 17:44:06 +05:30
ntohidi
b36c6daa5c Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 2025-11-25 11:51:59 +01:00
Nasrin
94c8a833bf Merge pull request #1447 from rbushri/fix/wrong_url_raw
Fix: Wrong URL variable used for extraction of raw html
2025-11-25 17:49:44 +08:00
ntohidi
84bfea8bd1 Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 2025-11-25 10:46:00 +01:00
Aravind
0024c82cdc Sponsors/new (#1637) 2025-11-24 13:29:33 +01:00
Rachel Bushrian
7771ed3894 Merge branch 'develop' into fix/wrong_url_raw 2025-11-24 13:54:07 +02:00
AHMET YILMAZ
eca04b0368 Refactor Pydantic model configuration to use ConfigDict for arbitrary types 2025-11-18 15:40:17 +08:00
ntohidi
c2c4d42be4 Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
2025-11-17 12:21:23 +01:00
Aravind
f68e7531e3 Sponsors/scrapeless (#1619) 2025-11-17 07:44:52 +01:00
UncleCode
cb637fb5c4 Merge pull request #1613 from unclecode/release/v0.7.7 2025-11-16 12:26:54 +01:00
ntohidi
6244f56f36 Release v0.7.7
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
2025-11-14 10:23:31 +01:00
ntohidi
2c973b1183 Merge branch 'develop' into release/v0.7.7 2025-11-13 14:54:05 +01:00
Nasrin
f3146de969 Merge pull request #1609 from unclecode/fix/update-config-documentation
Update browser and crawler run config documentation to match async_configs.py implementation
2025-11-13 21:52:53 +08:00
Soham Kukreti
d6b6d11a2d docs: update browser and crawler run config documentation to match async_configs.py implementation
Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)

- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth

- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental

- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
2025-11-13 14:54:16 +05:30
ntohidi
b58579548c Bump version to 0.7.7 for stable release 2025-11-13 09:52:18 +01:00
Nasrin
466be69e72 Merge pull request #1607 from unclecode/fix/dfs_deep_crawling
Fix/dfs deep crawling
2025-11-13 16:43:47 +08:00
AHMET YILMAZ
ceade853c3 Enhance DFSDeepCrawlStrategy documentation for clarity and detail 2025-11-13 16:39:08 +08:00
ntohidi
998c809e08 Rename folder name for NSTProxy integration examples for crawl4ai 2025-11-13 09:36:39 +01:00
ntohidi
d0fb53540d Update proxy-security documentation 2025-11-13 09:23:44 +01:00
Nasrin
8116b15b63 Merge pull request #1596 from unclecode/docs-proxy-security
#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2025-11-13 16:22:28 +08:00
AHMET YILMAZ
fe353c4e27 Refactor proxy configuration documentation for clarity and consistency 2025-11-13 11:20:24 +08:00
ntohidi
89cc29fe44 Merge branch 'fix/docker' into develop 2025-11-12 17:06:31 +01:00
Nasrin
cdcb8836b7 Merge pull request #1605 from Nstproxy/feat/nstproxy
feat: Add Nstproxy Proxies
2025-11-12 23:56:14 +08:00
Nasrin
b207ae2848 Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing
Add CDP endpoint verification with exponential backoff for managed browsers
2025-11-12 23:53:57 +08:00
Nasrin
be00fc3a42 Merge pull request #1598 from unclecode/fix/sitemap_seeder
#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…
2025-11-12 18:09:34 +08:00
Nasrin
124ac583bb Merge pull request #1599 from unclecode/docs-llm-strategies-update
#1551 : Fix casing and variable name consistency for LLMConfig in doc…
2025-11-12 17:54:26 +08:00
AHMET YILMAZ
1bd3de6a47 #1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking 2025-11-12 17:44:43 +08:00
nstproxy
80452166c8 feat: Add Nstproxy Proxies 2025-11-12 16:25:39 +08:00
UncleCode
a99cd37c0e Merge pull request #1597 from unclecode/sponsors/capsolver 2025-11-11 14:50:44 +08:00
AHMET YILMAZ
2e8f8c9b49 #1551 : Fix casing and variable name consistency for LLMConfig in documentation 2025-11-10 15:38:14 +08:00
AHMET YILMAZ
80745bceb9 #1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder 2025-11-10 14:15:54 +08:00
Aravind Karnam
4bee230c37 docs: Add a tip for captcha solving usecases using a third party integration 2025-11-10 11:20:48 +05:30
Aravind
006e29f308 Merge pull request #1589 from capsolver/main
Add some examples of using capsolver to solve captcha
2025-11-10 10:45:16 +05:30
AHMET YILMAZ
263ac890fd #1591
: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
2025-11-10 11:42:07 +08:00
Nasrin
d56b0eb9a9 Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser
feat(ManagedBrowser): add viewport size configuration for browser launch
2025-11-06 18:42:45 +08:00
Nasrin
66175e132b Merge pull request #1590 from unclecode/fix/async-llm-extraction-arunMany
This commit resolves issue #1055 where LLM extraction was blocking async
2025-11-06 18:40:42 +08:00
ntohidi
a30548a98f This commit resolves issue #1055 where LLM extraction was blocking async
execution, causing URLs to be processed sequentially instead of in parallel.

  Changes:
  - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
  - Implemented arun() method in ExtractionStrategy base class with thread pool fallback
  - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
  - Updated AsyncWebCrawler.arun() to detect and use arun() when available
  - Added comprehensive test suite to verify parallel execution

  Impact:
  - LLM extraction now runs truly in parallel across multiple URLs
  - Significant performance improvement for multi-URL crawls with LLM strategies
  - Backward compatible - existing extraction strategies continue to work
  - No breaking changes to public API

  Technical details:
  - Uses litellm.acompletion for non-blocking LLM calls
  - Leverages asyncio.gather for concurrent chunk processing
  - Maintains backward compatibility via asyncio.to_thread fallback
  - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
2025-11-06 11:22:45 +01:00
CapSolver
2ae9899eac Clarify CapSolver integration instructions
Updated text for clarity and capitalization.
2025-11-06 15:49:30 +08:00
CapSolver
57aeb70f00 Add CapSolver Captcha Solver 2025-11-06 15:37:31 +08:00
Nasrin
2c918155aa Merge pull request #1529 from unclecode/fix/remove_overlay_elements
Fix remove_overlay_elements functionality by calling injected JS function.
2025-11-06 00:10:32 +08:00
Nasrin
854694ef33 Merge pull request #1537 from unclecode/fix/docker-compose-llm-env
fix(docker): Remove environment variable overrides in docker-compose.yml
2025-11-06 00:07:51 +08:00
Nasrin
6534ece026 Merge pull request #1532 from unclecode/fix/update-documentation
Standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
2025-11-05 23:37:05 +08:00
Nasrin
89e28d4eee Merge pull request #1558 from unclecode/claude/fix-update-pyopenssl-security-011CUPexU25DkNvoxfu5ZrnB
Claude/fix update pyopenssl security 011 cu pex u25 dk nvoxfu5 zrn b
2025-10-28 17:09:11 +08:00
ntohidi
c0f1865287 feat(api): update marketplace version and build date in root endpoint response 2025-10-26 11:35:39 +01:00
ntohidi
46ef1116c4 fix(app-detail): enhance tab functionality, hide documentation and support tabs in marketplace 2025-10-26 11:21:29 +01:00
Nasrin
4df83893ac Merge pull request #1560 from unclecode/fix/marketplace
Fix/marketplace
2025-10-23 22:17:06 +08:00
ntohidi
13e116610d fix(marketplace): improve app detail page content rendering and UX
Fixed multiple issues with app detail page content display and formatting
2025-10-23 16:12:30 +02:00
Claude
613097d121 test: add verification tests for pyOpenSSL security update
- Add lightweight security test to verify version requirements
- Add comprehensive integration test for crawl4ai functionality
- Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7
- All tests passing: security vulnerability is resolved

Related to #1545

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 06:57:25 +00:00
Claude
44ef0682b0 fix: update pyOpenSSL to >=25.3.0 to address security vulnerability
- Updates pyOpenSSL from >=24.3.0 to >=25.3.0
- This resolves CVE affecting cryptography package versions >=37.0.0 & <43.0.1
- pyOpenSSL 25.3.0 requires cryptography>=45.0.7, which is above the vulnerable range
- Fixes issue #1545

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 06:51:25 +00:00
Nasrin
40173eeb73 Update Docker hooks and Webhook documents (#1557)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* docs: enhance README and docker-deployment documentation with Job Queue and Webhook API details

* docs: update docker_hooks_examples.py with comprehensive examples and improved structure

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 22:34:19 +08:00
ntohidi
b74524fdfb docs: update docker_hooks_examples.py with comprehensive examples and improved structure 2025-10-22 16:29:19 +02:00
ntohidi
bcac486921 docs: enhance README and docker-deployment documentation with Job Queue and Webhook API details 2025-10-22 16:19:30 +02:00
ntohidi
6aef5a120f Merge branch 'main' into develop 2025-10-22 15:53:54 +02:00
Nasrin
7cac008c10 Release/v0.7.6 (#1556)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* Release v0.7.6: The 0.7.6 Update

- Updated version to 0.7.6
- Added comprehensive demo and release notes
- Updated all documentation
- Update the veriosn in Dockerfile to 0.7.6

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 20:41:06 +08:00
ntohidi
7e8fb3a8f3 Merge branch 'release/v0.7.5' into develop 2025-10-22 13:16:16 +02:00
ntohidi
3efb59fb9a fix: update Crawl4AI Docker container port from 11234 to 11235 2025-10-22 13:14:11 +02:00
ntohidi
c7b7475b92 fix: remove duplicate comma in webhook_config parameter 2025-10-22 13:12:42 +02:00
ntohidi
b71d624168 Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop 2025-10-22 13:12:25 +02:00
ntohidi
d670dcde0a feat: add webhook support for /llm/job endpoint
Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)
2025-10-22 13:03:09 +02:00
unclecode
f8606f6865 fix: properly serialize Pydantic HttpUrl in webhook config
Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.
2025-10-22 15:50:25 +08:00
Claude
52da8d72bc test: add comprehensive webhook feature test script
Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-22 00:35:07 +00:00
Claude
8b7e67566e test: add webhook implementation validation tests
Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-22 00:25:35 +00:00
Claude
7388baa205 docs: add webhook example for Docker deployment
Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:38:53 +00:00
Claude
897bc3a493 docs: add webhook documentation to Docker README
Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:21:07 +00:00
Claude
8a37710313 feat: add webhook notifications for crawl job completion
Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:17:40 +00:00
ntohidi
97c92c4f62 fix(marketplace): replace hardcoded app detail content with database-driven fields.
The app detail page was displaying hardcoded/templated content instead of
using actual data from the database. This prevented admins from controlling
the content shown in Overview, Integration, and Documentation tabs.
2025-10-21 15:39:04 +02:00
ntohidi
f6a02c4358 Merge branch 'develop' into release/v0.7.5 2025-10-21 09:25:29 +02:00
unclecode
6d1a398419 feat(ci): split release pipeline and add Docker caching
- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup
2025-10-21 10:53:12 +08:00
unclecode
c107617920 fix: thoroughly verify and fix all Crawl4AI skill examples
- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)
2025-10-19 17:08:04 +08:00
unclecode
69d0ef89dd fix: update Crawl4AI skill with corrected parameters and examples
- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working
2025-10-19 16:16:20 +08:00
unclecode
1bf85bcb1a fix: remove non-existent wiki link and clarify skill usage instructions 2025-10-19 13:19:14 +08:00
unclecode
749232ba1a feat: add AI assistant skill package for Crawl4AI
- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.
2025-10-19 13:19:14 +08:00
unclecode
c7288dd2f1 docs: add complete SDK reference documentation
Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB
2025-10-19 13:19:14 +08:00
UncleCode
fdbcddbf1a Merge pull request #1546 from unclecode/sponsors 2025-10-17 18:07:16 +08:00
Aravind Karnam
564d437d97 docs: fix order of star history and Current sponsors 2025-10-17 15:31:29 +05:30
Aravind Karnam
9cd06ea7eb docs: fix order of star history and Current sponsors 2025-10-17 15:30:02 +05:30
ntohidi
c91b235cb7 docs: Update 0.7.5 video walkthrough 2025-10-14 13:49:57 +08:00
Aravind Karnam
eb257c2ba3 docs: fixed sponsorship link 2025-10-13 17:47:42 +05:30
Aravind Karnam
8d364a0731 docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:45:10 +05:30
Aravind Karnam
6aff0e55aa docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:42:29 +05:30
Aravind Karnam
38a0742708 docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:41:19 +05:30
Aravind Karnam
a720a3a9fe docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:32:34 +05:30
Aravind Karnam
017144c2dd docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:30:22 +05:30
Aravind Karnam
32887ea40d docs: Adjust background of sponsor logo to compensate for light themes 2025-10-13 17:13:52 +05:30
Aravind Karnam
eea41bf1ca docs: Add a slight background to compensate light theme on github docs 2025-10-13 17:00:24 +05:30
Aravind Karnam
21c302f439 docs: Add Current sponsors section in README file 2025-10-13 16:45:16 +05:30
ntohidi
8fc1747225 docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features. 2025-10-13 13:59:34 +08:00
ntohidi
aadab30c3d fix(docs): clarify Docker Hooks System with function-based API in README 2025-10-13 13:08:47 +08:00
ntohidi
4a04b8506a feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:53:33 +08:00
ntohidi
7dadb65b80 Merge branch 'develop' into release/v0.7.5 2025-10-13 12:34:45 +08:00
ntohidi
a3f057e19f feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:34:08 +08:00
ntohidi
611d48f93b Merge branch 'develop' into release/v0.7.5 2025-10-09 12:53:39 +08:00
ntohidi
936397ee0e Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-10-09 12:53:15 +08:00
Soham Kukreti
46e1a67f61 fix(docker): Remove environment variable overrides in docker-compose.yml (#1411)
The docker-compose.yml had an `environment:` section with variable
substitutions (${VAR:-}) that was overriding values from .llm.env with
empty strings.

- Commented out the `environment:` section to prevent overwrites
- Added clear warning comment explaining the override behavior
- .llm.env values now load directly into container without interference
2025-10-06 14:41:22 +05:30
Soham Kukreti
7dfe528d43 fix(docs): standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
- Switch installs to pip install -r requirements.txt (tutorial and app docs)
- Update local run steps to python server.py and http://localhost:8000
- Set default PORT to 8000; update port-in-use commands and alt port 8001
- Replace unsupported :contains() example with accessible attribute selector
- Update example URLs in tutorial servers to 127.0.0.1:8000
- Add “Identity-based crawling” section with crwl profiles CLI workflow and code usage
- Replace legacy-docs note with sponsorship message in docs/md_v2/index.md
- Minor copy and consistency fixes across pages
2025-10-03 22:00:46 +05:30
Nasrin
9900f63f97 Merge pull request #1531 from unclecode/develop
Marketplace and brand book changes
2025-10-03 13:24:51 +08:00
ntohidi
9292b265fc Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-10-03 12:57:23 +08:00
ntohidi
70af81d9d7 refactor(release): remove memory management section for cleaner documentation. ref #1443 2025-09-30 11:54:21 +08:00
Soham Kukreti
2dc6588573 fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396
- Fix critical bug where overlay removal JS function was injected but never called
  - Change remove_overlay_elements() to properly execute the injected async function
  - Wrap JS execution in async to handle the async overlay removal logic
  - Add test_remove_overlay_elements() test case to verify functionality works
  - Ensure overlay elements (cookie banners, popups, modals) are actually removed

  The remove_overlay_elements feature now works as intended:
  - Before: Function definition injected but never executed (silent failure)
  - After: Function injected and called, successfully removing overlay elements
2025-09-29 20:40:08 +05:30
Soham Kukreti
34c0996ee4 fix: Add CDP endpoint verification with exponential backoff for managed browsers (#1445)
browser_manager:
- Add CDP endpoint verification with retry logic and exponential backoff
- Call verification before connecting to CDP in `start()` method
- Graceful handling of timing issues during browser startup

test_cdp_strategy:
- Fix cookie persistence test by adding storage state management
- Fix session management test to work with managed browser architecture
- Add comprehensive CDP timing tests covering:
  - Fast startup scenarios
  - Delayed browser startup simulation
  - Exponential backoff behavior validation
  - Concurrent browser connections
  - Stress testing with multiple successive startups
  - Retry count verification

Impact:
- Eliminates browser startup failures due to CDP timing issues
- Provides robust fallback with automatic retries
- Maintains fast startup when CDP is immediately available
- Comprehensive test coverage ensures reliability

Resolves CDP connection timing issues in managed browser mode.
2025-09-29 19:31:09 +05:30
ntohidi
361499d291 Release v0.7.5: The Update
- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation
2025-09-29 18:05:26 +08:00
AHMET YILMAZ
e3467c08f6 #1490 feat(ManagedBrowser): add viewport size configuration for browser launch 2025-09-17 17:40:38 +08:00
rbushria
edd0b576b1 Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
2025-09-01 23:15:56 +03:00
109 changed files with 23797 additions and 1131 deletions

100
.github/workflows/docker-release.yml vendored Normal file
View File

@@ -0,0 +1,100 @@
name: Docker Release
on:
release:
types: [published]
push:
tags:
- 'docker-rebuild-v*' # Allow manual Docker rebuilds via tags
jobs:
docker:
runs-on: ubuntu-latest
steps:
- name: Free up disk space
run: |
echo "=== Disk space before cleanup ==="
df -h
# Remove unnecessary tools and libraries (frees ~25GB)
sudo rm -rf /usr/share/dotnet
sudo rm -rf /usr/local/lib/android
sudo rm -rf /opt/ghc
sudo rm -rf /opt/hostedtoolcache/CodeQL
sudo rm -rf /usr/local/share/boost
sudo rm -rf /usr/share/swift
# Clean apt cache
sudo apt-get clean
echo "=== Disk space after cleanup ==="
df -h
- name: Checkout code
uses: actions/checkout@v4
- name: Extract version from release or tag
id: get_version
run: |
if [ "${{ github.event_name }}" == "release" ]; then
# Triggered by release event
VERSION="${{ github.event.release.tag_name }}"
VERSION=${VERSION#v} # Remove 'v' prefix
else
# Triggered by docker-rebuild-v* tag
VERSION=${GITHUB_REF#refs/tags/docker-rebuild-v}
fi
echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
echo "Building Docker images for version: $VERSION"
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
echo "Semantic versions - Major: $MAJOR, Minor: $MINOR"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🐳 Docker Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Published Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Platforms" >> $GITHUB_STEP_SUMMARY
echo "- linux/amd64" >> $GITHUB_STEP_SUMMARY
echo "- linux/arm64" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🚀 Pull Command" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

917
.github/workflows/docs/ARCHITECTURE.md vendored Normal file
View File

@@ -0,0 +1,917 @@
# Workflow Architecture Documentation
## Overview
This document describes the technical architecture of the split release pipeline for Crawl4AI.
---
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ Developer │
│ │ │
│ ▼ │
│ git tag v1.2.3 │
│ git push --tags │
└──────────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ GitHub Repository │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Tag Event: v1.2.3 │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ release.yml (Release Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version │ │ │
│ │ │ v1.2.3 → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Validate Version │ │ │
│ │ │ Tag == __version__.py │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Build Python Package │ │ │
│ │ │ - Source dist (.tar.gz) │ │ │
│ │ │ - Wheel (.whl) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Upload to PyPI │ │ │
│ │ │ - Authenticate with token │ │ │
│ │ │ - Upload dist/* │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Create GitHub Release │ │ │
│ │ │ - Tag: v1.2.3 │ │ │
│ │ │ - Body: Install instructions │ │ │
│ │ │ - Status: Published │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Release Event: published (v1.2.3) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ docker-release.yml (Docker Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version from Release │ │ │
│ │ │ github.event.release.tag_name → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Parse Semantic Versions │ │ │
│ │ │ 1.2.3 → Major: 1, Minor: 1.2 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Setup Multi-Arch Build │ │ │
│ │ │ - Docker Buildx │ │ │
│ │ │ - QEMU emulation │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Authenticate Docker Hub │ │ │
│ │ │ - Username: DOCKER_USERNAME │ │ │
│ │ │ - Token: DOCKER_TOKEN │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Build Multi-Arch Images │ │ │
│ │ │ ┌────────────────┬────────────────┐ │ │ │
│ │ │ │ linux/amd64 │ linux/arm64 │ │ │ │
│ │ │ └────────────────┴────────────────┘ │ │ │
│ │ │ Cache: GitHub Actions (type=gha) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 6. Push to Docker Hub │ │ │
│ │ │ Tags: │ │ │
│ │ │ - unclecode/crawl4ai:1.2.3 │ │ │
│ │ │ - unclecode/crawl4ai:1.2 │ │ │
│ │ │ - unclecode/crawl4ai:1 │ │ │
│ │ │ - unclecode/crawl4ai:latest │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyPI │ │ Docker Hub │ │ GitHub │ │
│ │ │ │ │ │ │ │
│ │ crawl4ai │ │ unclecode/ │ │ Releases │ │
│ │ 1.2.3 │ │ crawl4ai │ │ v1.2.3 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Component Details
### 1. Release Pipeline (release.yml)
#### Purpose
Fast publication of Python package and GitHub release.
#### Input
- **Trigger**: Git tag matching `v*` (excluding `test-v*`)
- **Example**: `v1.2.3`
#### Processing Stages
##### Stage 1: Version Extraction
```bash
Input: refs/tags/v1.2.3
Output: VERSION=1.2.3
```
**Implementation**:
```bash
TAG_VERSION=${GITHUB_REF#refs/tags/v} # Remove 'refs/tags/v' prefix
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
```
##### Stage 2: Version Validation
```bash
Input: TAG_VERSION=1.2.3
Check: crawl4ai/__version__.py contains __version__ = "1.2.3"
Output: Pass/Fail
```
**Implementation**:
```bash
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
exit 1
fi
```
##### Stage 3: Package Build
```bash
Input: Source code + pyproject.toml
Output: dist/crawl4ai-1.2.3.tar.gz
dist/crawl4ai-1.2.3-py3-none-any.whl
```
**Implementation**:
```bash
python -m build
# Uses build backend defined in pyproject.toml
```
##### Stage 4: PyPI Upload
```bash
Input: dist/*.{tar.gz,whl}
Auth: PYPI_TOKEN
Output: Package published to PyPI
```
**Implementation**:
```bash
twine upload dist/*
# Environment:
# TWINE_USERNAME: __token__
# TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
```
##### Stage 5: GitHub Release Creation
```bash
Input: Tag: v1.2.3
Body: Markdown content
Output: Published GitHub release
```
**Implementation**:
```yaml
uses: softprops/action-gh-release@v2
with:
tag_name: v1.2.3
name: Release v1.2.3
body: |
Installation instructions and changelog
draft: false
prerelease: false
```
#### Output
- **PyPI Package**: https://pypi.org/project/crawl4ai/1.2.3/
- **GitHub Release**: Published release on repository
- **Event**: `release.published` (triggers Docker workflow)
#### Timeline
```
0:00 - Tag pushed
0:01 - Checkout + Python setup
0:02 - Version validation
0:03 - Package build
0:04 - PyPI upload starts
0:06 - PyPI upload complete
0:07 - GitHub release created
0:08 - Workflow complete
```
---
### 2. Docker Release Pipeline (docker-release.yml)
#### Purpose
Build and publish multi-architecture Docker images.
#### Inputs
##### Input 1: Release Event (Automatic)
```yaml
Event: release.published
Data: github.event.release.tag_name = "v1.2.3"
```
##### Input 2: Docker Rebuild Tag (Manual)
```yaml
Tag: docker-rebuild-v1.2.3
```
#### Processing Stages
##### Stage 1: Version Detection
```bash
# From release event:
VERSION = github.event.release.tag_name.strip("v")
# Result: "1.2.3"
# From rebuild tag:
VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "")
# Result: "1.2.3"
```
##### Stage 2: Semantic Version Parsing
```bash
Input: VERSION=1.2.3
Output: MAJOR=1
MINOR=1.2
PATCH=3 (implicit)
```
**Implementation**:
```bash
MAJOR=$(echo $VERSION | cut -d. -f1) # Extract first component
MINOR=$(echo $VERSION | cut -d. -f1-2) # Extract first two components
```
##### Stage 3: Multi-Architecture Setup
```yaml
Setup:
- Docker Buildx (multi-platform builder)
- QEMU (ARM emulation on x86)
Platforms:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
```
**Architecture**:
```
GitHub Runner (linux/amd64)
├─ Buildx Builder
│ ├─ Native: Build linux/amd64 image
│ └─ QEMU: Emulate ARM to build linux/arm64 image
└─ Generate manifest list (points to both images)
```
##### Stage 4: Docker Hub Authentication
```bash
Input: DOCKER_USERNAME
DOCKER_TOKEN
Output: Authenticated Docker client
```
##### Stage 5: Build with Cache
```yaml
Cache Configuration:
cache-from: type=gha # Read from GitHub Actions cache
cache-to: type=gha,mode=max # Write all layers
Cache Key Components:
- Workflow file path
- Branch name
- Architecture (amd64/arm64)
```
**Cache Hierarchy**:
```
Cache Entry: main/docker-release.yml/linux-amd64
├─ Layer: sha256:abc123... (FROM python:3.12)
├─ Layer: sha256:def456... (RUN apt-get update)
├─ Layer: sha256:ghi789... (COPY requirements.txt)
├─ Layer: sha256:jkl012... (RUN pip install)
└─ Layer: sha256:mno345... (COPY . /app)
Cache Hit/Miss Logic:
- If layer input unchanged → cache hit → skip build
- If layer input changed → cache miss → rebuild + all subsequent layers
```
##### Stage 6: Tag Generation
```bash
Input: VERSION=1.2.3, MAJOR=1, MINOR=1.2
Output Tags:
- unclecode/crawl4ai:1.2.3 (exact version)
- unclecode/crawl4ai:1.2 (minor version)
- unclecode/crawl4ai:1 (major version)
- unclecode/crawl4ai:latest (latest stable)
```
**Tag Strategy**:
- All tags point to same image SHA
- Users can pin to desired stability level
- Pushing new version updates `1`, `1.2`, and `latest` automatically
##### Stage 7: Push to Registry
```bash
For each tag:
For each platform (amd64, arm64):
Push image to Docker Hub
Create manifest list:
Manifest: unclecode/crawl4ai:1.2.3
├─ linux/amd64: sha256:abc...
└─ linux/arm64: sha256:def...
Docker CLI automatically selects correct platform on pull
```
#### Output
- **Docker Images**: 4 tags × 2 platforms = 8 image variants + 4 manifests
- **Docker Hub**: https://hub.docker.com/r/unclecode/crawl4ai/tags
#### Timeline
**Cold Cache (First Build)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64)
0:08 - Complete amd64 build
0:09 - Start build (arm64)
0:14 - Complete arm64 build
0:15 - Generate manifests
0:16 - Push all tags
0:17 - Workflow complete
```
**Warm Cache (Code Change Only)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64) - cache hit for layers 1-4
0:04 - Complete amd64 build (only layer 5 rebuilt)
0:05 - Start build (arm64) - cache hit for layers 1-4
0:06 - Complete arm64 build (only layer 5 rebuilt)
0:07 - Generate manifests
0:08 - Push all tags
0:09 - Workflow complete
```
---
## Data Flow
### Version Information Flow
```
Developer
crawl4ai/__version__.py
__version__ = "1.2.3"
├─► Git Tag
│ v1.2.3
│ │
│ ▼
│ release.yml
│ │
│ ├─► Validation
│ │ ✓ Match
│ │
│ ├─► PyPI Package
│ │ crawl4ai==1.2.3
│ │
│ └─► GitHub Release
│ v1.2.3
│ │
│ ▼
│ docker-release.yml
│ │
│ └─► Docker Tags
│ 1.2.3, 1.2, 1, latest
└─► Package Metadata
pyproject.toml
version = "1.2.3"
```
### Secrets Flow
```
GitHub Secrets (Encrypted at Rest)
├─► PYPI_TOKEN
│ │
│ ▼
│ release.yml
│ │
│ ▼
│ TWINE_PASSWORD env var (masked in logs)
│ │
│ ▼
│ PyPI API (HTTPS)
├─► DOCKER_USERNAME
│ │
│ ▼
│ docker-release.yml
│ │
│ ▼
│ docker/login-action (masked in logs)
│ │
│ ▼
│ Docker Hub API (HTTPS)
└─► DOCKER_TOKEN
docker-release.yml
docker/login-action (masked in logs)
Docker Hub API (HTTPS)
```
### Artifact Flow
```
Source Code
├─► release.yml
│ │
│ ▼
│ python -m build
│ │
│ ├─► crawl4ai-1.2.3.tar.gz
│ │ │
│ │ ▼
│ │ PyPI Storage
│ │ │
│ │ ▼
│ │ pip install crawl4ai
│ │
│ └─► crawl4ai-1.2.3-py3-none-any.whl
│ │
│ ▼
│ PyPI Storage
│ │
│ ▼
│ pip install crawl4ai
└─► docker-release.yml
docker build
├─► Image: linux/amd64
│ │
│ └─► Docker Hub
│ unclecode/crawl4ai:1.2.3-amd64
└─► Image: linux/arm64
└─► Docker Hub
unclecode/crawl4ai:1.2.3-arm64
```
---
## State Machines
### Release Pipeline State Machine
```
┌─────────┐
│ START │
└────┬────┘
┌──────────────┐
│ Extract │
│ Version │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Validate │─────►│ FAILED │
│ Version │ No │ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Yes
┌──────────────┐
│ Build │
│ Package │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Upload │─────►│ FAILED │
│ to PyPI │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Create │
│ GH Release │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
│ (Emit Event) │
└──────────────┘
```
### Docker Pipeline State Machine
```
┌─────────┐
│ START │
│ (Event) │
└────┬────┘
┌──────────────┐
│ Detect │
│ Version │
│ Source │
└──────┬───────┘
┌──────────────┐
│ Parse │
│ Semantic │
│ Versions │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Authenticate │─────►│ FAILED │
│ Docker Hub │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Build │
│ amd64 │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Build │─────►│ FAILED │
│ arm64 │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Push All │
│ Tags │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
└──────────────┘
```
---
## Security Architecture
### Threat Model
#### Threats Mitigated
1. **Secret Exposure**
- Mitigation: GitHub Actions secret masking
- Evidence: Secrets never appear in logs
2. **Unauthorized Package Upload**
- Mitigation: Scoped PyPI tokens
- Evidence: Token limited to `crawl4ai` project
3. **Man-in-the-Middle**
- Mitigation: HTTPS for all API calls
- Evidence: PyPI, Docker Hub, GitHub all use TLS
4. **Supply Chain Tampering**
- Mitigation: Immutable artifacts, content checksums
- Evidence: PyPI stores SHA256, Docker uses content-addressable storage
#### Trust Boundaries
```
┌─────────────────────────────────────────┐
│ Trusted Zone │
│ ┌────────────────────────────────┐ │
│ │ GitHub Actions Runner │ │
│ │ - Ephemeral VM │ │
│ │ - Isolated environment │ │
│ │ - Access to secrets │ │
│ └────────────────────────────────┘ │
│ │ │
│ │ HTTPS (TLS 1.2+) │
│ ▼ │
└─────────────────────────────────────────┘
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ PyPI │ │ Docker │ │ GitHub │
│ API │ │ Hub │ │ API │
└────────┘ └─────────┘ └──────────┘
External External External
Service Service Service
```
### Secret Management
#### Secret Lifecycle
```
Creation (Developer)
├─► PyPI: Create API token (scoped to project)
├─► Docker Hub: Create access token (read/write)
Storage (GitHub)
├─► Encrypted at rest (AES-256)
├─► Access controlled (repo-scoped)
Usage (Workflow)
├─► Injected as env vars
├─► Masked in logs (GitHub redacts on output)
├─► Never persisted to disk (in-memory only)
Transmission (API Call)
├─► HTTPS only
├─► TLS 1.2+ with strong ciphers
Rotation (Manual)
└─► Regenerate on PyPI/Docker Hub
Update GitHub secret
```
---
## Performance Characteristics
### Release Pipeline Performance
| Metric | Value | Notes |
|--------|-------|-------|
| Cold start | ~2-3 min | First run on new runner |
| Warm start | ~2-3 min | Minimal caching benefit |
| PyPI upload | ~30-60 sec | Network-bound |
| Package build | ~30 sec | CPU-bound |
| Parallelization | None | Sequential by design |
### Docker Pipeline Performance
| Metric | Cold Cache | Warm Cache (code) | Warm Cache (deps) |
|--------|-----------|-------------------|-------------------|
| Total time | 10-15 min | 1-2 min | 3-5 min |
| amd64 build | 5-7 min | 30-60 sec | 1-2 min |
| arm64 build | 5-7 min | 30-60 sec | 1-2 min |
| Push time | 1-2 min | 30 sec | 30 sec |
| Cache hit rate | 0% | 85% | 60% |
### Cache Performance Model
```python
def estimate_build_time(changes):
base_time = 60 # seconds (setup + push)
if "Dockerfile" in changes:
return base_time + (10 * 60) # Full rebuild: ~11 min
elif "requirements.txt" in changes:
return base_time + (3 * 60) # Deps rebuild: ~4 min
elif any(f.endswith(".py") for f in changes):
return base_time + 60 # Code only: ~2 min
else:
return base_time # No changes: ~1 min
```
---
## Scalability Considerations
### Current Limits
| Resource | Limit | Impact |
|----------|-------|--------|
| Workflow concurrency | 20 (default) | Max 20 releases in parallel |
| Artifact storage | 500 MB/artifact | PyPI packages small (<10 MB) |
| Cache storage | 10 GB/repo | Docker layers fit comfortably |
| Workflow run time | 6 hours | Plenty of headroom |
### Scaling Strategies
#### Horizontal Scaling (Multiple Repos)
```
crawl4ai (main)
├─ release.yml
└─ docker-release.yml
crawl4ai-plugins (separate)
├─ release.yml
└─ docker-release.yml
Each repo has independent:
- Secrets
- Cache (10 GB each)
- Concurrency limits (20 each)
```
#### Vertical Scaling (Larger Runners)
```yaml
jobs:
docker:
runs-on: ubuntu-latest-8-cores # GitHub-hosted larger runner
# 4x faster builds for CPU-bound layers
```
---
## Disaster Recovery
### Failure Scenarios
#### Scenario 1: Release Pipeline Fails
**Failure Point**: PyPI upload fails (network error)
**State**:
- ✓ Version validated
- ✓ Package built
- ✗ PyPI upload
- ✗ GitHub release
**Recovery**:
```bash
# Manual upload
twine upload dist/*
# Retry workflow (re-run from GitHub Actions UI)
```
**Prevention**: Add retry logic to PyPI upload
#### Scenario 2: Docker Pipeline Fails
**Failure Point**: ARM build fails (dependency issue)
**State**:
- ✓ PyPI published
- ✓ GitHub release created
- ✓ amd64 image built
- ✗ arm64 image build
**Recovery**:
```bash
# Fix Dockerfile
git commit -am "fix: ARM build dependency"
# Trigger rebuild
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
**Impact**: PyPI package available, only Docker ARM users affected
#### Scenario 3: Partial Release
**Failure Point**: GitHub release creation fails
**State**:
- ✓ PyPI published
- ✗ GitHub release
- ✗ Docker images
**Recovery**:
```bash
# Create release manually
gh release create v1.2.3 \
--title "Release v1.2.3" \
--notes "..."
# This triggers docker-release.yml automatically
```
---
## Monitoring and Observability
### Metrics to Track
#### Release Pipeline
- Success rate (target: >99%)
- Duration (target: <3 min)
- PyPI upload time (target: <60 sec)
#### Docker Pipeline
- Success rate (target: >95%)
- Duration (target: <15 min cold, <2 min warm)
- Cache hit rate (target: >80% for code changes)
### Alerting
**Critical Alerts**:
- Release pipeline failure (blocks release)
- PyPI authentication failure (expired token)
**Warning Alerts**:
- Docker build >15 min (performance degradation)
- Cache hit rate <50% (cache issue)
### Logging
**GitHub Actions Logs**:
- Retention: 90 days
- Downloadable: Yes
- Searchable: Limited
**Recommended External Logging**:
```yaml
- name: Send logs to external service
if: failure()
run: |
curl -X POST https://logs.example.com/api/v1/logs \
-H "Content-Type: application/json" \
-d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}"
```
---
## Future Enhancements
### Planned Improvements
1. **Automated Changelog Generation**
- Use conventional commits
- Generate CHANGELOG.md automatically
2. **Pre-release Testing**
- Test builds on `test-v*` tags
- Upload to TestPyPI
3. **Notification System**
- Slack/Discord notifications on release
- Email on failure
4. **Performance Optimization**
- Parallel Docker builds (amd64 + arm64 simultaneously)
- Persistent runners for better caching
5. **Enhanced Validation**
- Smoke tests after PyPI upload
- Container security scanning
---
## References
- [GitHub Actions Architecture](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions)
- [Docker Build Cache](https://docs.docker.com/build/cache/)
- [PyPI API Documentation](https://warehouse.pypa.io/api-reference/)
---
**Last Updated**: 2025-01-21
**Version**: 2.0

1029
.github/workflows/docs/README.md vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,287 @@
# Workflow Quick Reference
## Quick Commands
### Standard Release
```bash
# 1. Update version
vim crawl4ai/__version__.py # Set to "1.2.3"
# 2. Commit and tag
git add crawl4ai/__version__.py
git commit -m "chore: bump version to 1.2.3"
git tag v1.2.3
git push origin main
git push origin v1.2.3
# 3. Monitor
# - PyPI: ~2-3 minutes
# - Docker: ~1-15 minutes
```
### Docker Rebuild Only
```bash
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### Delete Tag (Undo Release)
```bash
# Local
git tag -d v1.2.3
# Remote
git push --delete origin v1.2.3
# GitHub Release
gh release delete v1.2.3
```
---
## Workflow Triggers
### release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Tag push | `v*` | `v1.2.3` |
| Excludes | `test-v*` | `test-v1.2.3` |
### docker-release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Release published | `release.published` | Automatic |
| Tag push | `docker-rebuild-v*` | `docker-rebuild-v1.2.3` |
---
## Environment Variables
### release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Git tag | `1.2.3` |
| `TWINE_USERNAME` | Static | `__token__` |
| `TWINE_PASSWORD` | Secret | `pypi-Ag...` |
| `GITHUB_TOKEN` | Auto | `ghp_...` |
### docker-release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Release/Tag | `1.2.3` |
| `MAJOR` | Computed | `1` |
| `MINOR` | Computed | `1.2` |
| `DOCKER_USERNAME` | Secret | `unclecode` |
| `DOCKER_TOKEN` | Secret | `dckr_pat_...` |
---
## Docker Tags Generated
| Version | Tags Created |
|---------|-------------|
| v1.0.0 | `1.0.0`, `1.0`, `1`, `latest` |
| v1.1.0 | `1.1.0`, `1.1`, `1`, `latest` |
| v1.2.3 | `1.2.3`, `1.2`, `1`, `latest` |
| v2.0.0 | `2.0.0`, `2.0`, `2`, `latest` |
---
## Workflow Outputs
### release.yml
| Output | Location | Time |
|--------|----------|------|
| PyPI Package | https://pypi.org/project/crawl4ai/ | ~2-3 min |
| GitHub Release | Repository → Releases | ~2-3 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
### docker-release.yml
| Output | Location | Time |
|--------|----------|------|
| Docker Images | https://hub.docker.com/r/unclecode/crawl4ai | ~1-15 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
---
## Common Issues
| Issue | Solution |
|-------|----------|
| Version mismatch | Update `crawl4ai/__version__.py` to match tag |
| PyPI 403 Forbidden | Check `PYPI_TOKEN` secret |
| PyPI 400 File exists | Version already published, increment version |
| Docker auth failed | Regenerate `DOCKER_TOKEN` |
| Docker build timeout | Check Dockerfile, review build logs |
| Cache not working | First build on branch always cold |
---
## Secrets Checklist
- [ ] `PYPI_TOKEN` - PyPI API token (project or account scope)
- [ ] `DOCKER_USERNAME` - Docker Hub username
- [ ] `DOCKER_TOKEN` - Docker Hub access token (read/write)
- [ ] `GITHUB_TOKEN` - Auto-provided (no action needed)
---
## Workflow Dependencies
### release.yml Dependencies
```yaml
Python: 3.12
Actions:
- actions/checkout@v4
- actions/setup-python@v5
- softprops/action-gh-release@v2
PyPI Packages:
- build
- twine
```
### docker-release.yml Dependencies
```yaml
Actions:
- actions/checkout@v4
- docker/setup-buildx-action@v3
- docker/login-action@v3
- docker/build-push-action@v5
Docker:
- Buildx
- QEMU (for multi-arch)
```
---
## Cache Information
### Type
- GitHub Actions Cache (`type=gha`)
### Storage
- **Limit**: 10GB per repository
- **Retention**: 7 days for unused entries
- **Cleanup**: Automatic LRU eviction
### Performance
| Scenario | Cache Hit | Build Time |
|----------|-----------|------------|
| First build | 0% | 10-15 min |
| Code change only | 85% | 1-2 min |
| Dependency update | 60% | 3-5 min |
| No changes | 100% | 30-60 sec |
---
## Build Platforms
| Platform | Architecture | Devices |
|----------|--------------|---------|
| linux/amd64 | x86_64 | Intel/AMD servers, AWS EC2, GCP |
| linux/arm64 | aarch64 | Apple Silicon, AWS Graviton, Raspberry Pi |
---
## Version Validation
### Pre-Tag Checklist
```bash
# Check current version
python -c "from crawl4ai.__version__ import __version__; print(__version__)"
# Verify it matches intended tag
# If tag is v1.2.3, version should be "1.2.3"
```
### Post-Release Verification
```bash
# PyPI
pip install crawl4ai==1.2.3
python -c "import crawl4ai; print(crawl4ai.__version__)"
# Docker
docker pull unclecode/crawl4ai:1.2.3
docker run unclecode/crawl4ai:1.2.3 python -c "import crawl4ai; print(crawl4ai.__version__)"
```
---
## Monitoring URLs
| Service | URL |
|---------|-----|
| GitHub Actions | `https://github.com/{owner}/{repo}/actions` |
| PyPI Project | `https://pypi.org/project/crawl4ai/` |
| Docker Hub | `https://hub.docker.com/r/unclecode/crawl4ai` |
| GitHub Releases | `https://github.com/{owner}/{repo}/releases` |
---
## Rollback Strategy
### PyPI (Cannot Delete)
```bash
# Increment patch version
git tag v1.2.4
git push origin v1.2.4
```
### Docker (Can Overwrite)
```bash
# Rebuild with fix
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### GitHub Release
```bash
# Delete release
gh release delete v1.2.3
# Delete tag
git push --delete origin v1.2.3
```
---
## Status Badge Markdown
```markdown
[![Release Pipeline](https://github.com/{owner}/{repo}/actions/workflows/release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/release.yml)
[![Docker Release](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml)
```
---
## Timeline Example
```
0:00 - Push tag v1.2.3
0:01 - release.yml starts
0:02 - Version validation passes
0:03 - Package built
0:04 - PyPI upload starts
0:06 - PyPI upload complete ✓
0:07 - GitHub release created ✓
0:08 - release.yml complete
0:08 - docker-release.yml triggered
0:10 - Docker build starts
0:12 - amd64 image built (cache hit)
0:14 - arm64 image built (cache hit)
0:15 - Images pushed to Docker Hub ✓
0:16 - docker-release.yml complete
Total: ~16 minutes
Critical path (PyPI + GitHub): ~8 minutes
```
---
## Contact
For workflow issues:
1. Check Actions tab for logs
2. Review this reference
3. See [README.md](./README.md) for detailed docs

View File

@@ -10,53 +10,53 @@ jobs:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
@@ -65,37 +65,7 @@ jobs:
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
@@ -103,26 +73,29 @@ jobs:
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
**Note:** Docker images are being built and will be available shortly.
Check the [Docker Release workflow](https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml) for build status.
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
@@ -132,11 +105,9 @@ jobs:
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "Docker images are being built in a separate workflow." >> $GITHUB_STEP_SUMMARY
echo "Check: https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml" >> $GITHUB_STEP_SUMMARY

142
.github/workflows/release.yml.backup vendored Normal file
View File

@@ -0,0 +1,142 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

2
.gitignore vendored
View File

@@ -266,6 +266,8 @@ continue_config.json
.llm.env
.private/
.claude/
CLAUDE_MONITOR.md
CLAUDE.md

View File

@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build
# C4ai version
ARG C4AI_VER=0.7.0-r1
ARG C4AI_VER=0.7.8
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER
@@ -167,6 +167,11 @@ RUN mkdir -p /home/appuser/.cache/ms-playwright \
RUN crawl4ai-doctor
# Ensure all cache directories belong to appuser
# This fixes permission issues with .cache/url_seeder and other runtime cache dirs
RUN mkdir -p /home/appuser/.cache \
&& chown -R appuser:appuser /home/appuser/.cache
# Copy application code
COPY deploy/docker/* ${APP_HOME}/

212
README.md
View File

@@ -12,6 +12,16 @@
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_Well be onboarding in phases and working closely with early users.
Limited slots._
---
<p align="center">
<a href="https://x.com/crawl4ai">
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
@@ -27,11 +37,13 @@
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
[✨ Check out latest update v0.7.4](#-recent-updates)
[✨ Check out latest update v0.7.8](#-recent-updates)
✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
**New in v0.7.8**: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues (ContentRelevanceFilter, ProxyConfig, cache permissions), LLM extraction improvements (configurable backoff, HTML input format), URL handling fixes, and dependency updates (pypdf, Pydantic v2). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
✨ Recent v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
✨ Previous v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -177,7 +189,7 @@ No rate-limited APIs. No lock-in. Build and own your data pipeline with direct g
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
@@ -294,6 +306,7 @@ pip install -e ".[all]" # Install all optional features
### New Docker Features
The new Docker implementation includes:
- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
@@ -308,7 +321,8 @@ The new Docker implementation includes:
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Visit the playground at http://localhost:11235/playground
# Visit the monitoring dashboard at http://localhost:11235/dashboard
# Or the playground at http://localhost:11235/playground
```
### Quick Test
@@ -337,7 +351,7 @@ else:
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).
</details>
@@ -542,8 +556,160 @@ async def test_news_crawl():
</details>
---
> **💡 Tip:** Some websites may use **CAPTCHA** based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as <strong>[CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration)</strong>. They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target websites terms of service and applicable laws.
## ✨ Recent Updates
<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>
This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability.
- **🐳 Docker API Fixes**:
- Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642)
- Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629)
- Fixed `.cache` folder permissions in Docker image (#1638)
- **🤖 LLM Extraction Improvements**:
- Configurable rate limiter backoff with new `LLMConfig` parameters (#1269):
```python
from crawl4ai import LLMConfig
config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5s on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
```
- HTML input format support for `LLMExtractionStrategy` (#1178):
```python
from crawl4ai import LLMExtractionStrategy
strategy = LLMExtractionStrategy(
llm_config=config,
instruction="Extract table data",
input_format="html" # Now supports: "html", "markdown", "fit_markdown"
)
```
- Fixed raw HTML URL variable - extraction strategies now receive `"Raw HTML"` instead of HTML blob (#1116)
- **🔗 URL Handling**:
- Fixed relative URL resolution after JavaScript redirects (#1268)
- Fixed import statement formatting in extracted code (#1181)
- **📦 Dependency Updates**:
- Replaced deprecated PyPDF2 with pypdf (#1412)
- Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678)
- **🧠 AdaptiveCrawler**:
- Fixed query expansion to actually use LLM instead of hardcoded mock data (#1621)
[Full v0.7.8 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
</details>
<details>
<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
```python
# Access the monitoring dashboard
# Visit: http://localhost:11235/dashboard
# Real-time metrics include:
# - System health (CPU, memory, network, uptime)
# - Active and completed request tracking
# - Browser pool management (permanent/hot/cold)
# - Janitor cleanup events
# - Error monitoring with full context
```
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
```python
import httpx
async with httpx.AsyncClient() as client:
# System health
health = await client.get("http://localhost:11235/monitor/health")
# Request tracking
requests = await client.get("http://localhost:11235/monitor/requests")
# Browser pool status
browsers = await client.get("http://localhost:11235/monitor/browsers")
# Endpoint statistics
stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
```
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
- **🧹 Janitor System**: Automatic resource management with event logging
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
- **🐛 Critical Bug Fixes**:
- Fixed async LLM extraction blocking issue (#1055)
- Enhanced DFS deep crawl strategy (#1607)
- Fixed sitemap parsing in AsyncUrlSeeder (#1598)
- Resolved browser viewport configuration (#1495)
- Fixed CDP timing with exponential backoff (#1528)
- Security update for pyOpenSSL (>=25.3.0)
[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
</details>
<details>
<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>
- **🔧 Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
- **✨ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
```python
from crawl4ai import hooks_to_string
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as regular Python functions
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({'X-Crawl4AI': 'v0.7.5'})
return page
# Option 1: Use hooks_to_string() utility for REST API
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Option 2: Docker client with automatic conversion (Recommended)
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
}
)
# ✓ Full IDE support, type checking, and reusability!
```
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
[Full v0.7.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
</details>
<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>
@@ -919,6 +1085,40 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
</details>
## 🌟 Current Sponsors
### 🏢 Enterprise Sponsors & Partners
Our enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.
| Company | About | Sponsorship Tier |
|------|------|----------------------------|
| <a href="https://app.nstproxy.com/register?i=ecOqW9" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.github.com/aravindkarnam/62f82bd4818d3079d9dd3c31df432cf8/raw/nst-light.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://www.nstproxy.com/logo.svg"><img alt="nstproxy" src="ttps://www.nstproxy.com/logo.svg"></picture></a> | NstProxy is a trusted proxy provider with over 110M+ real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1/GB, it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |
| <a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=crawl4ai" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.githubusercontent.com/aravindkarnam/0d275b942705604263e5c32d2db27bc1/raw/Scrapeless-light-logo.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"><img alt="Scrapeless" src="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"></picture></a> | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
| <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="DataSync" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
| <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 518, offering both online and on-campus education. | 🥇 Gold |
| <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based Aleph Null is Asias leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
### 🧑‍🤝 Individual Sponsors
A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving!
<p align="left">
<a href="https://github.com/hafezparast"><img src="https://avatars.githubusercontent.com/u/14273305?s=60&v=4" style="border-radius:50%;" width="64px;"/></a>
<a href="https://github.com/ntohidi"><img src="https://avatars.githubusercontent.com/u/17140097?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Sjoeborg"><img src="https://avatars.githubusercontent.com/u/17451310?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/romek-rozen"><img src="https://avatars.githubusercontent.com/u/30595969?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Kourosh-Kiyani"><img src="https://avatars.githubusercontent.com/u/34105600?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/Etherdrake"><img src="https://avatars.githubusercontent.com/u/67021215?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/shaman247"><img src="https://avatars.githubusercontent.com/u/211010067?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
<a href="https://github.com/work-flow-manager"><img src="https://avatars.githubusercontent.com/u/217665461?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
</p>
> Want to join them? [Sponsor Crawl4AI →](https://github.com/sponsors/unclecode)
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

View File

@@ -72,6 +72,8 @@ from .deep_crawling import (
BestFirstCrawlingStrategy,
DFSDeepCrawlStrategy,
DeepCrawlDecorator,
ContentRelevanceFilter,
ContentTypeScorer,
)
# NEW: Import AsyncUrlSeeder
from .async_url_seeder import AsyncUrlSeeder
@@ -103,7 +105,8 @@ from .browser_adapter import (
from .utils import (
start_colab_display_server,
setup_colab_environment
setup_colab_environment,
hooks_to_string
)
__all__ = [
@@ -183,6 +186,7 @@ __all__ = [
"ProxyConfig",
"start_colab_display_server",
"setup_colab_environment",
"hooks_to_string",
# C4A Script additions
"c4a_compile",
"c4a_validate",

View File

@@ -1,7 +1,7 @@
# crawl4ai/__version__.py
# This is the version that will be used for stable releases
__version__ = "0.7.4"
__version__ = "0.7.8"
# For nightly builds, this gets set during build process
__nightly_version__ = None

View File

@@ -728,18 +728,18 @@ class EmbeddingStrategy(CrawlStrategy):
provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None
# response = perform_completion_with_backoff(
# provider=provider,
# prompt_with_variables=prompt,
# api_token=api_token,
# json_response=True
# )
response = perform_completion_with_backoff(
provider=provider,
prompt_with_variables=prompt,
api_token=api_token,
json_response=True
)
# variations = json.loads(response.choices[0].message.content)
variations = json.loads(response.choices[0].message.content)
# # Mock data with more variations for split
variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
# variations = {'queries': [

View File

@@ -1,6 +1,7 @@
import importlib
import os
from typing import Union
import warnings
import requests
from .config import (
DEFAULT_PROVIDER,
DEFAULT_PROVIDER_API_KEY,
@@ -26,14 +27,14 @@ from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
from .cache_context import CacheMode
from .proxy_strategy import ProxyRotationStrategy
from typing import Union, List, Callable
import inspect
from typing import Any, Dict, Optional
from typing import Any, Callable, Dict, List, Optional, Union
from enum import Enum
# Type alias for URL matching
UrlMatcher = Union[str, Callable[[str], bool], List[Union[str, Callable[[str], bool]]]]
class MatchMode(Enum):
OR = "or"
AND = "and"
@@ -41,8 +42,7 @@ class MatchMode(Enum):
# from .proxy_strategy import ProxyConfig
def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
def to_serializable_dict(obj: Any, ignore_default_value : bool = False):
"""
Recursively convert an object to a serializable dictionary using {type, params} structure
for complex objects.
@@ -109,8 +109,6 @@ def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
# if value is not None:
# current_values[attr_name] = to_serializable_dict(value)
return {
"type": obj.__class__.__name__,
"params": current_values
@@ -136,12 +134,20 @@ def from_serializable_dict(data: Any) -> Any:
if data["type"] == "dict" and "value" in data:
return {k: from_serializable_dict(v) for k, v in data["value"].items()}
# Import from crawl4ai for class instances
import crawl4ai
if hasattr(crawl4ai, data["type"]):
cls = getattr(crawl4ai, data["type"])
cls = None
# If you are receiving an error while trying to convert a dict to an object:
# Either add a module to `modules_paths` list, or add the `data["type"]` to the crawl4ai __init__.py file
module_paths = ["crawl4ai"]
for module_path in module_paths:
try:
mod = importlib.import_module(module_path)
if hasattr(mod, data["type"]):
cls = getattr(mod, data["type"])
break
except (ImportError, AttributeError):
continue
if cls is not None:
# Handle Enum
if issubclass(cls, Enum):
return cls(data["params"])
@@ -597,7 +603,7 @@ class BrowserConfig:
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"proxy_config": self.proxy_config.to_dict() if self.proxy_config else None,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
@@ -649,6 +655,85 @@ class BrowserConfig:
return config
return BrowserConfig.from_kwargs(config)
def set_nstproxy(
self,
token: str,
channel_id: str,
country: str = "ANY",
state: str = "",
city: str = "",
protocol: str = "http",
session_duration: int = 10,
):
"""
Fetch a proxy from NSTProxy API and automatically assign it to proxy_config.
Get your NSTProxy token from: https://app.nstproxy.com/profile
Args:
token (str): NSTProxy API token.
channel_id (str): NSTProxy channel ID.
country (str, optional): Country code (default: "ANY").
state (str, optional): State code (default: "").
city (str, optional): City name (default: "").
protocol (str, optional): Proxy protocol ("http" or "socks5"). Defaults to "http".
session_duration (int, optional): Session duration in minutes (0 = rotate each request). Defaults to 10.
Raises:
ValueError: If the API response format is invalid.
PermissionError: If the API returns an error message.
"""
# --- Validate input early ---
if not token or not channel_id:
raise ValueError("[NSTProxy] token and channel_id are required")
if protocol not in ("http", "socks5"):
raise ValueError(f"[NSTProxy] Invalid protocol: {protocol}")
# --- Build NSTProxy API URL ---
params = {
"fType": 2,
"count": 1,
"channelId": channel_id,
"country": country,
"protocol": protocol,
"sessionDuration": session_duration,
"token": token,
}
if state:
params["state"] = state
if city:
params["city"] = city
url = "https://api.nstproxy.com/api/v1/generate/apiproxies"
try:
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
# --- Handle API error response ---
if isinstance(data, dict) and data.get("err"):
raise PermissionError(f"[NSTProxy] API Error: {data.get('msg', 'Unknown error')}")
if not isinstance(data, list) or not data:
raise ValueError("[NSTProxy] Invalid API response — expected a non-empty list")
proxy_info = data[0]
# --- Apply proxy config ---
self.proxy_config = ProxyConfig(
server=f"{protocol}://{proxy_info['ip']}:{proxy_info['port']}",
username=proxy_info["username"],
password=proxy_info["password"],
)
except Exception as e:
print(f"[NSTProxy] ❌ Failed to set proxy: {e}")
raise
class VirtualScrollConfig:
"""Configuration for virtual scroll handling.
@@ -1712,7 +1797,10 @@ class LLMConfig:
frequency_penalty: Optional[float] = None,
presence_penalty: Optional[float] = None,
stop: Optional[List[str]] = None,
n: Optional[int] = None,
n: Optional[int] = None,
backoff_base_delay: Optional[int] = None,
backoff_max_attempts: Optional[int] = None,
backoff_exponential_factor: Optional[int] = None,
):
"""Configuaration class for LLM provider and API token."""
self.provider = provider
@@ -1741,6 +1829,9 @@ class LLMConfig:
self.presence_penalty = presence_penalty
self.stop = stop
self.n = n
self.backoff_base_delay = backoff_base_delay if backoff_base_delay is not None else 2
self.backoff_max_attempts = backoff_max_attempts if backoff_max_attempts is not None else 3
self.backoff_exponential_factor = backoff_exponential_factor if backoff_exponential_factor is not None else 2
@staticmethod
def from_kwargs(kwargs: dict) -> "LLMConfig":
@@ -1754,7 +1845,10 @@ class LLMConfig:
frequency_penalty=kwargs.get("frequency_penalty"),
presence_penalty=kwargs.get("presence_penalty"),
stop=kwargs.get("stop"),
n=kwargs.get("n")
n=kwargs.get("n"),
backoff_base_delay=kwargs.get("backoff_base_delay"),
backoff_max_attempts=kwargs.get("backoff_max_attempts"),
backoff_exponential_factor=kwargs.get("backoff_exponential_factor")
)
def to_dict(self):
@@ -1768,7 +1862,10 @@ class LLMConfig:
"frequency_penalty": self.frequency_penalty,
"presence_penalty": self.presence_penalty,
"stop": self.stop,
"n": self.n
"n": self.n,
"backoff_base_delay": self.backoff_base_delay,
"backoff_max_attempts": self.backoff_max_attempts,
"backoff_exponential_factor": self.backoff_exponential_factor
}
def clone(self, **kwargs):

View File

@@ -1023,6 +1023,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
final_messages = await self.adapter.retrieve_console_messages(page)
captured_console.extend(final_messages)
###
# This ensures we capture the current page URL at the time we return the response,
# which correctly reflects any JavaScript navigation that occurred.
###
redirected_url = page.url # Use current page URL to capture JS redirects
# Return complete response
return AsyncCrawlResponse(
html=html,
@@ -1383,9 +1389,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
try:
await self.adapter.evaluate(page,
f"""
(() => {{
(async () => {{
try {{
{remove_overlays_js}
const removeOverlays = {remove_overlays_js};
await removeOverlays();
return {{ success: true }};
}} catch (error) {{
return {{

View File

@@ -845,6 +845,15 @@ class AsyncUrlSeeder:
return
data = gzip.decompress(r.content) if url.endswith(".gz") else r.content
base_url = str(r.url)
def _normalize_loc(raw: Optional[str]) -> Optional[str]:
if not raw:
return None
normalized = urljoin(base_url, raw.strip())
if not normalized:
return None
return normalized
# Detect if this is a sitemap index by checking for <sitemapindex> or presence of <sitemap> elements
is_sitemap_index = False
@@ -857,25 +866,42 @@ class AsyncUrlSeeder:
# Use XML parser for sitemaps, not HTML parser
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser=parser)
# Namespace-agnostic lookups using local-name() so we honor custom or missing namespaces
sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
# Define namespace for sitemap
ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
self._log(
"debug",
"Parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemap_loc_nodes),
"url_count": len(url_loc_nodes),
},
tag="URL_SEED",
)
# Check for sitemap index entries
sitemap_locs = root.xpath('//s:sitemap/s:loc', namespaces=ns)
if sitemap_locs:
if sitemap_loc_nodes:
is_sitemap_index = True
for sitemap_elem in sitemap_locs:
loc = sitemap_elem.text.strip() if sitemap_elem.text else ""
for sitemap_elem in sitemap_loc_nodes:
loc = _normalize_loc(sitemap_elem.text)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for loc_elem in root.xpath('//s:url/s:loc', namespaces=ns):
loc = loc_elem.text.strip() if loc_elem.text else ""
for loc_elem in url_loc_nodes:
loc = _normalize_loc(loc_elem.text)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "LXML parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
@@ -892,19 +918,39 @@ class AsyncUrlSeeder:
# Check for sitemap index entries
sitemaps = root.findall('.//sitemap')
url_entries = root.findall('.//url')
self._log(
"debug",
"ElementTree parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
params={
"url": url,
"sitemap_count": len(sitemaps),
"url_count": len(url_entries),
},
tag="URL_SEED",
)
if sitemaps:
is_sitemap_index = True
for sitemap in sitemaps:
loc_elem = sitemap.find('loc')
if loc_elem is not None and loc_elem.text:
sub_sitemaps.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
sub_sitemaps.append(loc)
# If not a sitemap index, get regular URLs
if not is_sitemap_index:
for url_elem in root.findall('.//url'):
for url_elem in url_entries:
loc_elem = url_elem.find('loc')
if loc_elem is not None and loc_elem.text:
regular_urls.append(loc_elem.text.strip())
loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
if loc:
regular_urls.append(loc)
if not regular_urls:
self._log(
"warning",
"No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
params={"url": url},
tag="URL_SEED",
)
except Exception as e:
self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")

View File

@@ -617,7 +617,17 @@ class AsyncWebCrawler:
else config.chunking_strategy
)
sections = chunking.chunk(content)
extracted_content = config.extraction_strategy.run(url, sections)
# extracted_content = config.extraction_strategy.run(_url, sections)
# Use async version if available for better parallelism
if hasattr(config.extraction_strategy, 'arun'):
extracted_content = await config.extraction_strategy.arun(_url, sections)
else:
# Fallback to sync version run in thread pool to avoid blocking
extracted_content = await asyncio.to_thread(
config.extraction_strategy.run, url, sections
)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)

View File

@@ -369,6 +369,9 @@ class ManagedBrowser:
]
if self.headless:
flags.append("--headless=new")
# Add viewport flag if specified in config
if self.browser_config.viewport_height and self.browser_config.viewport_width:
flags.append(f"--window-size={self.browser_config.viewport_width},{self.browser_config.viewport_height}")
# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
elif self.browser_type == "firefox":
@@ -658,6 +661,11 @@ class BrowserManager:
if self.config.cdp_url or self.config.use_managed_browser:
self.config.use_managed_browser = True
cdp_url = await self.managed_browser.start() if not self.config.cdp_url else self.config.cdp_url
# Add CDP endpoint verification before connecting
if not await self._verify_cdp_ready(cdp_url):
raise Exception(f"CDP endpoint at {cdp_url} is not ready after startup")
self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
contexts = self.browser.contexts
if contexts:
@@ -678,6 +686,24 @@ class BrowserManager:
self.default_context = self.browser
async def _verify_cdp_ready(self, cdp_url: str) -> bool:
"""Verify CDP endpoint is ready with exponential backoff"""
import aiohttp
self.logger.debug(f"Starting CDP verification for {cdp_url}", tag="BROWSER")
for attempt in range(5):
try:
async with aiohttp.ClientSession() as session:
async with session.get(f"{cdp_url}/json/version", timeout=aiohttp.ClientTimeout(total=2)) as response:
if response.status == 200:
self.logger.debug(f"CDP endpoint ready after {attempt + 1} attempts", tag="BROWSER")
return True
except Exception as e:
self.logger.debug(f"CDP check attempt {attempt + 1} failed: {e}", tag="BROWSER")
delay = 0.5 * (1.4 ** attempt)
self.logger.debug(f"Waiting {delay:.2f}s before next CDP check...", tag="BROWSER")
await asyncio.sleep(delay)
self.logger.debug(f"CDP verification failed after 5 attempts", tag="BROWSER")
return False
def _build_browser_args(self) -> dict:
"""Build browser launch arguments from config."""

View File

@@ -980,6 +980,9 @@ class LLMContentFilter(RelevantContentFilter):
prompt,
api_token,
base_url=base_url,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=extra_args,
)

View File

@@ -542,6 +542,19 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
if el.tag in bypass_tags:
continue
# Skip elements inside <pre> or <code> tags where whitespace is significant
# This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
is_in_code_block = False
ancestor = el.getparent()
while ancestor is not None:
if ancestor.tag in ("pre", "code"):
is_in_code_block = True
break
ancestor = ancestor.getparent()
if is_in_code_block:
continue
text_content = (el.text_content() or "").strip()
if (
len(text_content.split()) < word_count_threshold

View File

@@ -4,14 +4,26 @@ from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
from ..models import CrawlResult
from .bfs_strategy import BFSDeepCrawlStrategy # noqa
from ..types import AsyncWebCrawler, CrawlerRunConfig
from ..utils import normalize_url_for_deep_crawl
class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
"""
Depth-First Search (DFS) deep crawling strategy.
Depth-first deep crawling with familiar BFS rules.
Inherits URL validation and link discovery from BFSDeepCrawlStrategy.
Overrides _arun_batch and _arun_stream to use a stack (LIFO) for DFS traversal.
We reuse the same filters, scoring, and page limits from :class:`BFSDeepCrawlStrategy`,
but walk the graph with a stack so we fully explore one branch before hopping to the
next. DFS also keeps its own ``_dfs_seen`` set so we can drop duplicate links at
discovery time without accidentally marking them as “already crawled”.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._dfs_seen: Set[str] = set()
def _reset_seen(self, start_url: str) -> None:
"""Start each crawl with a clean dedupe set seeded with the root URL."""
self._dfs_seen = {start_url}
async def _arun_batch(
self,
start_url: str,
@@ -19,14 +31,19 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> List[CrawlResult]:
"""
Batch (non-streaming) DFS mode.
Uses a stack to traverse URLs in DFS order, aggregating CrawlResults into a list.
Crawl level-by-level but emit results at the end.
We keep a stack of ``(url, parent, depth)`` tuples, pop one at a time, and
hand it to ``crawler.arun_many`` with deep crawling disabled so we remain
in control of traversal. Every successful page bumps ``_pages_crawled`` and
seeds new stack items discovered via :meth:`link_discovery`.
"""
visited: Set[str] = set()
# Stack items: (url, parent_url, depth)
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
results: List[CrawlResult] = []
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -71,12 +88,16 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlResult, None]:
"""
Streaming DFS mode.
Uses a stack to traverse URLs in DFS order and yields CrawlResults as they become available.
Same traversal as :meth:`_arun_batch`, but yield pages immediately.
Each popped URL is crawled, its metadata annotated, then the result gets
yielded before we even look at the next stack entry. Successful crawls
still feed :meth:`link_discovery`, keeping DFS order intact.
"""
visited: Set[str] = set()
stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
depths: Dict[str, int] = {start_url: 0}
self._reset_seen(start_url)
while stack and not self._cancel_event.is_set():
url, parent, depth = stack.pop()
@@ -108,3 +129,92 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
async def link_discovery(
self,
result: CrawlResult,
source_url: str,
current_depth: int,
_visited: Set[str],
next_level: List[Tuple[str, Optional[str]]],
depths: Dict[str, int],
) -> None:
"""
Find the next URLs we should push onto the DFS stack.
Parameters
----------
result : CrawlResult
Output of the page we just crawled; its ``links`` block is our raw material.
source_url : str
URL of the parent page; stored so callers can track ancestry.
current_depth : int
Depth of the parent; children naturally sit at ``current_depth + 1``.
_visited : Set[str]
Present to match the BFS signature, but we rely on ``_dfs_seen`` instead.
next_level : list of tuples
The stack buffer supplied by the caller; we append new ``(url, parent)`` items here.
depths : dict
Shared depth map so future metadata tagging knows how deep each URL lives.
Notes
-----
- ``_dfs_seen`` keeps us from pushing duplicates without touching the traversal guard.
- Validation, scoring, and capacity trimming mirror the BFS version so behaviour stays consistent.
"""
next_depth = current_depth + 1
if next_depth > self.max_depth:
return
remaining_capacity = self.max_pages - self._pages_crawled
if remaining_capacity <= 0:
self.logger.info(
f"Max pages limit ({self.max_pages}) reached, stopping link discovery"
)
return
links = result.links.get("internal", [])
if self.include_external:
links += result.links.get("external", [])
seen = self._dfs_seen
valid_links: List[Tuple[str, float]] = []
for link in links:
raw_url = link.get("href")
if not raw_url:
continue
normalized_url = normalize_url_for_deep_crawl(raw_url, source_url)
if not normalized_url or normalized_url in seen:
continue
if not await self.can_process_url(raw_url, next_depth):
self.stats.urls_skipped += 1
continue
score = self.url_scorer.score(normalized_url) if self.url_scorer else 0
if score < self.score_threshold:
self.logger.debug(
f"URL {normalized_url} skipped: score {score} below threshold {self.score_threshold}"
)
self.stats.urls_skipped += 1
continue
seen.add(normalized_url)
valid_links.append((normalized_url, score))
if len(valid_links) > remaining_capacity:
if self.url_scorer:
valid_links.sort(key=lambda x: x[1], reverse=True)
valid_links = valid_links[:remaining_capacity]
self.logger.info(
f"Limiting to {remaining_capacity} URLs due to max_pages limit"
)
for url, score in valid_links:
if score:
result.metadata = result.metadata or {}
result.metadata["score"] = score
next_level.append((url, source_url))
depths[url] = next_depth

View File

@@ -509,18 +509,22 @@ class DomainFilter(URLFilter):
class ContentRelevanceFilter(URLFilter):
"""BM25-based relevance filter using head section content"""
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl")
__slots__ = ("query_terms", "threshold", "k1", "b", "avgdl", "query")
def __init__(
self,
query: str,
query: Union[str, List[str]],
threshold: float,
k1: float = 1.2,
b: float = 0.75,
avgdl: int = 1000,
):
super().__init__(name="BM25RelevanceFilter")
self.query_terms = self._tokenize(query)
if isinstance(query, list):
self.query = " ".join(query)
else:
self.query = query
self.query_terms = self._tokenize(self.query)
self.threshold = threshold
self.k1 = k1 # TF saturation parameter
self.b = b # Length normalization parameter

View File

@@ -1,4 +1,4 @@
from typing import List, Optional, Union, AsyncGenerator, Dict, Any
from typing import List, Optional, Union, AsyncGenerator, Dict, Any, Callable
import httpx
import json
from urllib.parse import urljoin
@@ -7,6 +7,7 @@ import asyncio
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .async_logger import AsyncLogger, LogLevel
from .utils import hooks_to_string
class Crawl4aiClientError(Exception):
@@ -70,17 +71,41 @@ class Crawl4aiDockerClient:
self.logger.error(f"Server unreachable: {str(e)}", tag="ERROR")
raise ConnectionError(f"Cannot connect to server: {str(e)}")
def _prepare_request(self, urls: List[str], browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None) -> Dict[str, Any]:
def _prepare_request(
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Dict[str, Any]:
"""Prepare request data from configs."""
if self._token:
self._http_client.headers["Authorization"] = f"Bearer {self._token}"
return {
request_data = {
"urls": urls,
"browser_config": browser_config.dump() if browser_config else {},
"crawler_config": crawler_config.dump() if crawler_config else {}
}
# Handle hooks if provided
if hooks:
# Check if hooks are already strings or need conversion
if any(callable(v) for v in hooks.values()):
# Convert function objects to strings
hooks_code = hooks_to_string(hooks)
else:
# Already in string format
hooks_code = hooks
request_data["hooks"] = {
"code": hooks_code,
"timeout": hooks_timeout
}
return request_data
async def _request(self, method: str, endpoint: str, **kwargs) -> httpx.Response:
"""Make an HTTP request with error handling."""
url = urljoin(self.base_url, endpoint)
@@ -102,16 +127,42 @@ class Crawl4aiDockerClient:
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Union[CrawlResult, List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""Execute a crawl operation."""
"""
Execute a crawl operation.
Args:
urls: List of URLs to crawl
browser_config: Browser configuration
crawler_config: Crawler configuration
hooks: Optional hooks - can be either:
- Dict[str, Callable]: Function objects that will be converted to strings
- Dict[str, str]: Already stringified hook code
hooks_timeout: Timeout in seconds for each hook execution (1-120)
Returns:
Single CrawlResult, list of results, or async generator for streaming
Example with function hooks:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> result = await client.crawl(
... ["https://example.com"],
... hooks={"on_page_context_created": my_hook}
... )
"""
await self._check_server()
data = self._prepare_request(urls, browser_config, crawler_config)
data = self._prepare_request(urls, browser_config, crawler_config, hooks, hooks_timeout)
is_streaming = crawler_config and crawler_config.stream
self.logger.info(f"Crawling {len(urls)} URLs {'(streaming)' if is_streaming else ''}", tag="CRAWL")
if is_streaming:
async def stream_results() -> AsyncGenerator[CrawlResult, None]:
async with self._http_client.stream("POST", f"{self.base_url}/crawl/stream", json=data) as response:
@@ -128,12 +179,12 @@ class Crawl4aiDockerClient:
else:
yield CrawlResult(**result)
return stream_results()
response = await self._request("POST", "/crawl", json=data)
response = await self._request("POST", "/crawl", json=data, timeout=hooks_timeout)
result_data = response.json()
if not result_data.get("success", False):
raise RequestError(f"Crawl failed: {result_data.get('msg', 'Unknown error')}")
results = [CrawlResult(**r) for r in result_data.get("results", [])]
self.logger.success(f"Crawl completed with {len(results)} results", tag="CRAWL")
return results[0] if len(results) == 1 else results

View File

@@ -94,6 +94,20 @@ class ExtractionStrategy(ABC):
extracted_content.extend(future.result())
return extracted_content
async def arun(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
"""
Async version: Process sections of text in parallel using asyncio.
Default implementation runs the sync version in a thread pool.
Subclasses can override this for true async processing.
:param url: The URL of the webpage.
:param sections: List of sections (strings) to process.
:return: A list of processed JSON blocks.
"""
import asyncio
return await asyncio.to_thread(self.run, url, sections, *q, **kwargs)
class NoExtractionStrategy(ExtractionStrategy):
"""
@@ -635,6 +649,9 @@ class LLMExtractionStrategy(ExtractionStrategy):
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
) # , json_response=self.extract_type == "schema")
# Track usage
usage = TokenUsage(
@@ -780,6 +797,180 @@ class LLMExtractionStrategy(ExtractionStrategy):
return extracted_content
async def aextract(self, url: str, ix: int, html: str) -> List[Dict[str, Any]]:
"""
Async version: Extract meaningful blocks or chunks from the given HTML using an LLM.
How it works:
1. Construct a prompt with variables.
2. Make an async request to the LLM using the prompt.
3. Parse the response and extract blocks or chunks.
Args:
url: The URL of the webpage.
ix: Index of the block.
html: The HTML content of the webpage.
Returns:
A list of extracted blocks or chunks.
"""
from .utils import aperform_completion_with_backoff
if self.verbose:
print(f"[LOG] Call LLM for {url} - block index: {ix}")
variable_values = {
"URL": url,
"HTML": escape_json_string(sanitize_html(html)),
}
prompt_with_variables = PROMPT_EXTRACT_BLOCKS
if self.instruction:
variable_values["REQUEST"] = self.instruction
prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
if self.extract_type == "schema" and self.schema:
variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
if self.extract_type == "schema" and not self.schema:
prompt_with_variables = PROMPT_EXTRACT_INFERRED_SCHEMA
for variable in variable_values:
prompt_with_variables = prompt_with_variables.replace(
"{" + variable + "}", variable_values[variable]
)
try:
response = await aperform_completion_with_backoff(
self.llm_config.provider,
prompt_with_variables,
self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=self.force_json_response,
extra_args=self.extra_args,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor
)
# Track usage
usage = TokenUsage(
completion_tokens=response.usage.completion_tokens,
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
completion_tokens_details=response.usage.completion_tokens_details.__dict__
if response.usage.completion_tokens_details
else {},
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
if response.usage.prompt_tokens_details
else {},
)
self.usages.append(usage)
# Update totals
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
try:
content = response.choices[0].message.content
blocks = None
if self.force_json_response:
blocks = json.loads(content)
if isinstance(blocks, dict):
if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
blocks = list(blocks.values())[0]
else:
blocks = [blocks]
elif isinstance(blocks, list):
blocks = blocks
else:
blocks = extract_xml_data(["blocks"], content)["blocks"]
blocks = json.loads(blocks)
for block in blocks:
block["error"] = False
except Exception:
parsed, unparsed = split_and_parse_json_objects(
response.choices[0].message.content
)
blocks = parsed
if unparsed:
blocks.append(
{"index": 0, "error": True, "tags": ["error"], "content": unparsed}
)
if self.verbose:
print(
"[LOG] Extracted",
len(blocks),
"blocks from URL:",
url,
"block index:",
ix,
)
return blocks
except Exception as e:
if self.verbose:
print(f"[LOG] Error in LLM extraction: {e}")
return [
{
"index": ix,
"error": True,
"tags": ["error"],
"content": str(e),
}
]
async def arun(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
"""
Async version: Process sections with true parallelism using asyncio.gather.
Args:
url: The URL of the webpage.
sections: List of sections (strings) to process.
Returns:
A list of extracted blocks or chunks.
"""
import asyncio
merged_sections = self._merge(
sections,
self.chunk_token_threshold,
overlap=int(self.chunk_token_threshold * self.overlap_rate),
)
extracted_content = []
# Create tasks for all sections to run in parallel
tasks = [
self.aextract(url, ix, sanitize_input_encode(section))
for ix, section in enumerate(merged_sections)
]
# Execute all tasks concurrently
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results
for result in results:
if isinstance(result, Exception):
if self.verbose:
print(f"Error in async extraction: {result}")
extracted_content.append(
{
"index": 0,
"error": True,
"tags": ["error"],
"content": str(result),
}
)
else:
extracted_content.extend(result)
return extracted_content
def show_usage(self) -> None:
"""Print a detailed token usage report showing total and per-request usage."""
print("\n=== Token Usage Summary ===")

View File

@@ -1,4 +1,4 @@
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field, ConfigDict
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from typing import AsyncGenerator
from typing import Generic, TypeVar
@@ -153,8 +153,7 @@ class CrawlResult(BaseModel):
console_messages: Optional[List[Dict[str, Any]]] = None
tables: List[Dict] = Field(default_factory=list) # NEW [{headers,rows,caption,summary}]
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
# NOTE: The StringCompatibleMarkdown class, custom __init__ method, property getters/setters,
# and model_dump override all exist to support a smooth transition from markdown as a string
@@ -332,8 +331,7 @@ class AsyncCrawlResponse(BaseModel):
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
class Config:
arbitrary_types_allowed = True
model_config = ConfigDict(arbitrary_types_allowed=True)
###############################
# Scraping Models

View File

@@ -15,9 +15,9 @@ from .utils import (
clean_pdf_text_to_html,
)
# Remove direct PyPDF2 imports from the top
# import PyPDF2
# from PyPDF2 import PdfReader
# Remove direct pypdf imports from the top
# import pypdf
# from pypdf import PdfReader
logger = logging.getLogger(__name__)
@@ -59,9 +59,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
save_images_locally: bool = False, image_save_dir: Optional[Path] = None, batch_size: int = 4):
# Import check at initialization time
try:
import PyPDF2
import pypdf
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
self.image_dpi = image_dpi
self.image_quality = image_quality
@@ -75,9 +75,9 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
def process(self, pdf_path: Path) -> PDFProcessResult:
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
start_time = time()
result = PDFProcessResult(
@@ -125,15 +125,15 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
"""Like process() but processes PDF pages in parallel batches"""
# Import inside method to allow dependency to be optional
try:
from PyPDF2 import PdfReader
import PyPDF2 # For type checking
from pypdf import PdfReader
import pypdf # For type checking
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
import concurrent.futures
import threading
# Initialize PyPDF2 thread support
# Initialize pypdf thread support
if not hasattr(threading.current_thread(), "_children"):
threading.current_thread()._children = set()
@@ -232,11 +232,11 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
return pdf_page
def _extract_images(self, page, image_dir: Optional[Path]) -> List[Dict]:
# Import PyPDF2 for type checking only when needed
# Import pypdf for type checking only when needed
try:
import PyPDF2
from pypdf.generic import IndirectObject
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
if not self.extract_images:
return []
@@ -266,7 +266,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
width = xobj.get('/Width', 0)
height = xobj.get('/Height', 0)
color_space = xobj.get('/ColorSpace', '/DeviceRGB')
if isinstance(color_space, PyPDF2.generic.IndirectObject):
if isinstance(color_space, IndirectObject):
color_space = color_space.get_object()
# Handle different image encodings
@@ -277,7 +277,7 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
if '/FlateDecode' in filters:
try:
decode_parms = xobj.get('/DecodeParms', {})
if isinstance(decode_parms, PyPDF2.generic.IndirectObject):
if isinstance(decode_parms, IndirectObject):
decode_parms = decode_parms.get_object()
predictor = decode_parms.get('/Predictor', 1)
@@ -416,10 +416,10 @@ class NaivePDFProcessorStrategy(PDFProcessorStrategy):
# Import inside method to allow dependency to be optional
if reader is None:
try:
from PyPDF2 import PdfReader
from pypdf import PdfReader
reader = PdfReader(pdf_path)
except ImportError:
raise ImportError("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
raise ImportError("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
meta = reader.metadata or {}
created = self._parse_pdf_date(meta.get('/CreationDate', ''))
@@ -459,11 +459,11 @@ if __name__ == "__main__":
from pathlib import Path
try:
# Import PyPDF2 only when running the file directly
import PyPDF2
from PyPDF2 import PdfReader
# Import pypdf only when running the file directly
import pypdf
from pypdf import PdfReader
except ImportError:
print("PyPDF2 is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
print("pypdf is required for PDF processing. Install with 'pip install crawl4ai[pdf]'")
exit(1)
current_dir = Path(__file__).resolve().parent

View File

@@ -795,6 +795,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)
@@ -1116,6 +1119,9 @@ Return only a JSON array of extracted tables following the specified format."""
api_token=self.llm_config.api_token,
base_url=self.llm_config.base_url,
json_response=True,
base_delay=self.llm_config.backoff_base_delay,
max_attempts=self.llm_config.backoff_max_attempts,
exponential_factor=self.llm_config.backoff_exponential_factor,
extra_args=self.extra_args
)

View File

@@ -47,6 +47,7 @@ from urllib.parse import (
urljoin, urlparse, urlunparse,
parse_qsl, urlencode, quote, unquote
)
import inspect
# Monkey patch to fix wildcard handling in urllib.robotparser
@@ -1744,6 +1745,9 @@ def perform_completion_with_backoff(
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
@@ -1760,6 +1764,9 @@ def perform_completion_with_backoff(
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
@@ -1769,9 +1776,6 @@ def perform_completion_with_backoff(
from litellm import completion
from litellm.exceptions import RateLimitError
max_attempts = 3
base_delay = 2 # Base delay in seconds, you can adjust this based on your needs
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
extra_args["response_format"] = {"type": "json_object"}
@@ -1797,7 +1801,7 @@ def perform_completion_with_backoff(
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (2**attempt) # Exponential backoff formula
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
time.sleep(delay)
else:
@@ -1824,6 +1828,85 @@ def perform_completion_with_backoff(
# ]
async def aperform_completion_with_backoff(
provider,
prompt_with_variables,
api_token,
json_response=False,
base_url=None,
base_delay=2,
max_attempts=3,
exponential_factor=2,
**kwargs,
):
"""
Async version: Perform an API completion request with exponential backoff.
How it works:
1. Sends an async completion request to the API.
2. Retries on rate-limit errors with exponential delays (async).
3. Returns the API response or an error after all retries.
Args:
provider (str): The name of the API provider.
prompt_with_variables (str): The input prompt for the completion request.
api_token (str): The API token for authentication.
json_response (bool): Whether to request a JSON response. Defaults to False.
base_url (Optional[str]): The base URL for the API. Defaults to None.
base_delay (int): The base delay in seconds. Defaults to 2.
max_attempts (int): The maximum number of attempts. Defaults to 3.
exponential_factor (int): The exponential factor. Defaults to 2.
**kwargs: Additional arguments for the API request.
Returns:
dict: The API response or an error message after all retries.
"""
from litellm import acompletion
from litellm.exceptions import RateLimitError
import asyncio
extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
if json_response:
extra_args["response_format"] = {"type": "json_object"}
if kwargs.get("extra_args"):
extra_args.update(kwargs["extra_args"])
for attempt in range(max_attempts):
try:
response = await acompletion(
model=provider,
messages=[{"role": "user", "content": prompt_with_variables}],
**extra_args,
)
return response # Return the successful response
except RateLimitError as e:
print("Rate limit error:", str(e))
if attempt == max_attempts - 1:
# Last attempt failed, raise the error.
raise
# Check if we have exhausted our max attempts
if attempt < max_attempts - 1:
# Calculate the delay and wait
delay = base_delay * (exponential_factor**attempt) # Exponential backoff formula
print(f"Waiting for {delay} seconds before retrying...")
await asyncio.sleep(delay)
else:
# Return an error response after exhausting all retries
return [
{
"index": 0,
"tags": ["error"],
"content": ["Rate limit error. Please try again later."],
}
]
except Exception as e:
raise e # Raise any other exceptions immediately
def extract_blocks(url, html, provider=DEFAULT_PROVIDER, api_token=None, base_url=None):
"""
Extract content blocks from website HTML using an AI provider.
@@ -3529,4 +3612,52 @@ def get_memory_stats() -> Tuple[float, float, float]:
available_gb = get_true_available_memory_gb()
used_percent = get_true_memory_usage_percent()
return used_percent, available_gb, total_gb
return used_percent, available_gb, total_gb
# Hook utilities for Docker API
def hooks_to_string(hooks: Dict[str, Callable]) -> Dict[str, str]:
"""
Convert hook function objects to string representations for Docker API.
This utility simplifies the process of using hooks with the Docker API by converting
Python function objects into the string format required by the API.
Args:
hooks: Dictionary mapping hook point names to Python function objects.
Functions should be async and follow hook signature requirements.
Returns:
Dictionary mapping hook point names to string representations of the functions.
Example:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> hooks_dict = {"on_page_context_created": my_hook}
>>> api_hooks = hooks_to_string(hooks_dict)
>>> # api_hooks is now ready to use with Docker API
Raises:
ValueError: If a hook is not callable or source cannot be extracted
"""
result = {}
for hook_name, hook_func in hooks.items():
if not callable(hook_func):
raise ValueError(f"Hook '{hook_name}' must be a callable function, got {type(hook_func)}")
try:
# Get the source code of the function
source = inspect.getsource(hook_func)
# Remove any leading indentation to get clean source
source = textwrap.dedent(source)
result[hook_name] = source
except (OSError, TypeError) as e:
raise ValueError(
f"Cannot extract source code for hook '{hook_name}'. "
f"Make sure the function is defined in a file (not interactively). Error: {e}"
)
return result

View File

@@ -12,6 +12,7 @@
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Asynchronous Jobs with Webhooks](#asynchronous-jobs-with-webhooks)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
@@ -58,15 +59,13 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
Our latest stable release is `0.7.7`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
```bash
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Pull the latest stable version (0.7.7)
docker pull unclecode/crawl4ai:0.7.7
# Or pull the current stable version (0.6.0)
# Or use the latest tag (points to 0.7.7)
docker pull unclecode/crawl4ai:latest
```
@@ -101,7 +100,7 @@ EOL
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:0.7.0-r1
unclecode/crawl4ai:0.7.7
```
* **With LLM support:**
@@ -112,7 +111,7 @@ EOL
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:0.7.0-r1
unclecode/crawl4ai:0.7.7
```
> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -185,7 +184,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
IMAGE=unclecode/crawl4ai:0.7.7 docker compose up -d
```
* **Build and Run Locally:**
@@ -648,6 +647,194 @@ async def test_stream_crawl(token: str = None): # Made token optional
# asyncio.run(test_stream_crawl())
```
### Asynchronous Jobs with Webhooks
For long-running crawls or when you want to avoid keeping connections open, use the job queue endpoints. Instead of polling for results, configure a webhook to receive notifications when jobs complete.
#### Why Use Jobs & Webhooks?
- **No Polling Required** - Get notified when crawls complete instead of constantly checking status
- **Better Resource Usage** - Free up client connections while jobs run in the background
- **Scalable Architecture** - Ideal for high-volume crawling with TypeScript/Node.js clients or microservices
- **Reliable Delivery** - Automatic retry with exponential backoff (5 attempts: 1s → 2s → 4s → 8s → 16s)
#### How It Works
1. **Submit Job** → POST to `/crawl/job` with optional `webhook_config`
2. **Get Task ID** → Receive a `task_id` immediately
3. **Job Runs** → Crawl executes in the background
4. **Webhook Fired** → Server POSTs completion notification to your webhook URL
5. **Fetch Results** → If data wasn't included in webhook, GET `/crawl/job/{task_id}`
#### Quick Example
```bash
# Submit a crawl job with webhook notification
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false
}
}'
# Response: {"task_id": "crawl_a1b2c3d4"}
```
**Your webhook receives:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"]
}
```
Then fetch the results:
```bash
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
```
#### Include Data in Webhook
Set `webhook_data_in_payload: true` to receive the full crawl results directly in the webhook:
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": true
}
}'
```
**Your webhook receives the complete data:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"data": {
"markdown": "...",
"html": "...",
"links": {...},
"metadata": {...}
}
}
```
#### Webhook Authentication
Add custom headers for authentication:
```json
{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token",
"X-Service-ID": "crawl4ai-prod"
}
}
}
```
#### Global Default Webhook
Configure a default webhook URL in `config.yml` for all jobs:
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default"
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
```
Now jobs without `webhook_config` automatically use the default webhook.
#### Job Status Polling (Without Webhooks)
If you prefer polling instead of webhooks, just omit `webhook_config`:
```bash
# Submit job
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"]}'
# Response: {"task_id": "crawl_xyz"}
# Poll for status
curl http://localhost:11235/crawl/job/crawl_xyz
```
The response includes `status` field: `"processing"`, `"completed"`, or `"failed"`.
#### LLM Extraction Jobs with Webhooks
The same webhook system works for LLM extraction jobs via `/llm/job`:
```bash
# Submit LLM extraction job with webhook
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and main points",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}'
# Response: {"task_id": "llm_1234567890"}
```
**Your webhook receives:**
```json
{
"task_id": "llm_1234567890",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T12:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"main_points": ["Point 1", "Point 2", "Point 3"]
}
}
}
```
**Key Differences for LLM Jobs:**
- Task type is `"llm_extraction"` instead of `"crawl"`
- Extracted data is in `data.extracted_content`
- Single URL only (not an array)
- Supports schema-based extraction with `schema` parameter
> 💡 **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples including TypeScript client code, Flask webhook handlers, and failure handling.
---
## Metrics & Monitoring
@@ -826,10 +1013,11 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Asynchronous job queues with webhook notifications
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment

View File

@@ -0,0 +1,378 @@
# Webhook Feature Examples
This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.
## Overview
The webhook feature allows you to receive notifications when crawl jobs complete, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to ensure reliable delivery.
## Configuration
### Global Configuration (config.yml)
You can configure default webhook settings in `config.yml`:
```yaml
webhooks:
enabled: true
default_url: null # Optional: default webhook URL for all jobs
data_in_payload: false # Optional: default behavior for including data
retry:
max_attempts: 5
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
max_delay_ms: 32000
timeout_ms: 30000 # 30s timeout per webhook call
headers: # Optional: default headers to include
User-Agent: "Crawl4AI-Webhook/1.0"
```
## API Usage Examples
### Example 1: Basic Webhook (Notification Only)
Send a webhook notification without including the crawl data in the payload.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false
}
}'
```
**Response:**
```json
{
"task_id": "crawl_a1b2c3d4"
}
```
**Webhook Payload Received:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"]
}
```
Your webhook handler should then fetch the results:
```bash
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
```
### Example 2: Webhook with Data Included
Include the full crawl results in the webhook payload.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": true
}
}'
```
**Webhook Payload Received:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"data": {
"markdown": "...",
"html": "...",
"links": {...},
"metadata": {...}
}
}
```
### Example 3: Webhook with Custom Headers
Include custom headers for authentication or identification.
**Request:**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "my-secret-token",
"X-Service-ID": "crawl4ai-production"
}
}
}'
```
The webhook will be sent with these additional headers plus the default headers from config.
### Example 4: Failure Notification
When a crawl job fails, a webhook is sent with error details.
**Webhook Payload on Failure:**
```json
{
"task_id": "crawl_a1b2c3d4",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30s"
}
```
### Example 5: Using Global Default Webhook
If you set a `default_url` in config.yml, jobs without webhook_config will use it:
**config.yml:**
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default"
data_in_payload: false
```
**Request (no webhook_config needed):**
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"]
}'
```
The webhook will be sent to the default URL configured in config.yml.
### Example 6: LLM Extraction Job with Webhook
Use webhooks with the LLM extraction endpoint for asynchronous processing.
**Request:**
```bash
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and publication date",
"schema": "{\"type\": \"object\", \"properties\": {\"title\": {\"type\": \"string\"}, \"author\": {\"type\": \"string\"}, \"date\": {\"type\": \"string\"}}}",
"cache": false,
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true
}
}'
```
**Response:**
```json
{
"task_id": "llm_1698765432_12345"
}
```
**Webhook Payload Received:**
```json
{
"task_id": "llm_1698765432_12345",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-21"
}
}
}
```
## Webhook Handler Example
Here's a simple Python Flask webhook handler that supports both crawl and LLM extraction jobs:
```python
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
@app.route('/webhooks/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
if status == 'completed':
# If data not in payload, fetch it
if 'data' not in payload:
# Determine endpoint based on task type
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
data = response.json()
else:
data = payload['data']
# Process based on task type
if task_type == 'crawl':
print(f"Processing crawl results for {task_id}")
# Handle crawl results
results = data.get('results', [])
for result in results:
print(f" - {result.get('url')}: {len(result.get('markdown', ''))} chars")
elif task_type == 'llm_extraction':
print(f"Processing LLM extraction for {task_id}")
# Handle LLM extraction
# Note: Webhook sends 'extracted_content', API returns 'result'
extracted = data.get('extracted_content', data.get('result', {}))
print(f" - Extracted: {extracted}")
# Your business logic here...
elif status == 'failed':
error = payload.get('error', 'Unknown error')
print(f"{task_type} job {task_id} failed: {error}")
# Handle failure...
return jsonify({"status": "received"}), 200
if __name__ == '__main__':
app.run(port=8080)
```
## Retry Logic
The webhook delivery service uses exponential backoff retry logic:
- **Attempts:** Up to 5 attempts by default
- **Delays:** 1s → 2s → 4s → 8s → 16s
- **Timeout:** 30 seconds per attempt
- **Retry Conditions:**
- Server errors (5xx status codes)
- Network errors
- Timeouts
- **No Retry:**
- Client errors (4xx status codes)
- Successful delivery (2xx status codes)
## Benefits
1. **No Polling Required** - Eliminates constant API calls to check job status
2. **Real-time Notifications** - Immediate notification when jobs complete
3. **Reliable Delivery** - Exponential backoff ensures webhooks are delivered
4. **Flexible** - Choose between notification-only or full data delivery
5. **Secure** - Support for custom headers for authentication
6. **Configurable** - Global defaults or per-job configuration
7. **Universal Support** - Works with both `/crawl/job` and `/llm/job` endpoints
## TypeScript Client Example
```typescript
interface WebhookConfig {
webhook_url: string;
webhook_data_in_payload?: boolean;
webhook_headers?: Record<string, string>;
}
interface CrawlJobRequest {
urls: string[];
browser_config?: Record<string, any>;
crawler_config?: Record<string, any>;
webhook_config?: WebhookConfig;
}
interface LLMJobRequest {
url: string;
q: string;
schema?: string;
cache?: boolean;
provider?: string;
webhook_config?: WebhookConfig;
}
async function createCrawlJob(request: CrawlJobRequest) {
const response = await fetch('http://localhost:11235/crawl/job', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(request)
});
const { task_id } = await response.json();
return task_id;
}
async function createLLMJob(request: LLMJobRequest) {
const response = await fetch('http://localhost:11235/llm/job', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(request)
});
const { task_id } = await response.json();
return task_id;
}
// Usage - Crawl Job
const crawlTaskId = await createCrawlJob({
urls: ['https://example.com'],
webhook_config: {
webhook_url: 'https://myapp.com/webhooks/crawl-complete',
webhook_data_in_payload: false,
webhook_headers: {
'X-Webhook-Secret': 'my-secret'
}
}
});
// Usage - LLM Extraction Job
const llmTaskId = await createLLMJob({
url: 'https://example.com/article',
q: 'Extract the main points from this article',
provider: 'openai/gpt-4o-mini',
webhook_config: {
webhook_url: 'https://myapp.com/webhooks/llm-complete',
webhook_data_in_payload: true,
webhook_headers: {
'X-Webhook-Secret': 'my-secret'
}
}
});
```
## Monitoring and Debugging
Webhook delivery attempts are logged at INFO level:
- Successful deliveries
- Retry attempts with delays
- Final failures after max attempts
Check the application logs for webhook delivery status:
```bash
docker logs crawl4ai-container | grep -i webhook
```

View File

@@ -46,6 +46,7 @@ from utils import (
get_llm_temperature,
get_llm_base_url
)
from webhook import WebhookDeliveryService
import psutil, time
@@ -107,7 +108,10 @@ async def handle_llm_qa(
prompt_with_variables=prompt,
api_token=get_llm_api_key(config), # Returns None to let litellm handle it
temperature=get_llm_temperature(config),
base_url=get_llm_base_url(config)
base_url=get_llm_base_url(config),
base_delay=config["llm"].get("backoff_base_delay", 2),
max_attempts=config["llm"].get("backoff_max_attempts", 3),
exponential_factor=config["llm"].get("backoff_exponential_factor", 2)
)
return response.choices[0].message.content
@@ -127,10 +131,14 @@ async def process_llm_extraction(
schema: Optional[str] = None,
cache: str = "0",
provider: Optional[str] = None,
webhook_config: Optional[Dict] = None,
temperature: Optional[float] = None,
base_url: Optional[str] = None
) -> None:
"""Process LLM extraction in background."""
# Initialize webhook service
webhook_service = WebhookDeliveryService(config)
try:
# Validate provider
is_valid, error_msg = validate_llm_provider(config, provider)
@@ -139,6 +147,16 @@ async def process_llm_extraction(
"status": TaskStatus.FAILED,
"error": error_msg
})
# Send webhook notification on failure
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="llm_extraction",
status="failed",
urls=[url],
webhook_config=webhook_config,
error=error_msg
)
return
api_key = get_llm_api_key(config, provider) # Returns None to let litellm handle it
llm_strategy = LLMExtractionStrategy(
@@ -169,17 +187,40 @@ async def process_llm_extraction(
"status": TaskStatus.FAILED,
"error": result.error_message
})
# Send webhook notification on failure
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="llm_extraction",
status="failed",
urls=[url],
webhook_config=webhook_config,
error=result.error_message
)
return
try:
content = json.loads(result.extracted_content)
except json.JSONDecodeError:
content = result.extracted_content
result_data = {"extracted_content": content}
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.COMPLETED,
"result": json.dumps(content)
})
# Send webhook notification on successful completion
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="llm_extraction",
status="completed",
urls=[url],
webhook_config=webhook_config,
result=result_data
)
except Exception as e:
logger.error(f"LLM extraction error: {str(e)}", exc_info=True)
await redis.hset(f"task:{task_id}", mapping={
@@ -187,6 +228,16 @@ async def process_llm_extraction(
"error": str(e)
})
# Send webhook notification on failure
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="llm_extraction",
status="failed",
urls=[url],
webhook_config=webhook_config,
error=str(e)
)
async def handle_markdown_request(
url: str,
filter_type: FilterType,
@@ -275,6 +326,7 @@ async def handle_llm_request(
cache: str = "0",
config: Optional[dict] = None,
provider: Optional[str] = None,
webhook_config: Optional[Dict] = None,
temperature: Optional[float] = None,
api_base_url: Optional[str] = None
) -> JSONResponse:
@@ -308,6 +360,7 @@ async def handle_llm_request(
base_url,
config,
provider,
webhook_config,
temperature,
api_base_url
)
@@ -355,6 +408,7 @@ async def create_new_task(
base_url: str,
config: dict,
provider: Optional[str] = None,
webhook_config: Optional[Dict] = None,
temperature: Optional[float] = None,
api_base_url: Optional[str] = None
) -> JSONResponse:
@@ -365,12 +419,18 @@ async def create_new_task(
from datetime import datetime
task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
await redis.hset(f"task:{task_id}", mapping={
task_data = {
"status": TaskStatus.PROCESSING,
"created_at": datetime.now().isoformat(),
"url": decoded_url
})
}
# Store webhook config if provided
if webhook_config:
task_data["webhook_config"] = json.dumps(webhook_config)
await redis.hset(f"task:{task_id}", mapping=task_data)
background_tasks.add_task(
process_llm_extraction,
@@ -382,6 +442,7 @@ async def create_new_task(
schema,
cache,
provider,
webhook_config,
temperature,
api_base_url
)
@@ -723,6 +784,7 @@ async def handle_crawl_job(
browser_config: Dict,
crawler_config: Dict,
config: Dict,
webhook_config: Optional[Dict] = None,
) -> Dict:
"""
Fire-and-forget version of handle_crawl_request.
@@ -730,13 +792,24 @@ async def handle_crawl_job(
lets /crawl/job/{task_id} polling fetch the result.
"""
task_id = f"crawl_{uuid4().hex[:8]}"
await redis.hset(f"task:{task_id}", mapping={
# Store task data in Redis
task_data = {
"status": TaskStatus.PROCESSING, # <-- keep enum values consistent
"created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
"url": json.dumps(urls), # store list as JSON string
"result": "",
"error": "",
})
}
# Store webhook config if provided
if webhook_config:
task_data["webhook_config"] = json.dumps(webhook_config)
await redis.hset(f"task:{task_id}", mapping=task_data)
# Initialize webhook service
webhook_service = WebhookDeliveryService(config)
async def _runner():
try:
@@ -750,6 +823,17 @@ async def handle_crawl_job(
"status": TaskStatus.COMPLETED,
"result": json.dumps(result),
})
# Send webhook notification on successful completion
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="crawl",
status="completed",
urls=urls,
webhook_config=webhook_config,
result=result
)
await asyncio.sleep(5) # Give Redis time to process the update
except Exception as exc:
await redis.hset(f"task:{task_id}", mapping={
@@ -757,5 +841,15 @@ async def handle_crawl_job(
"error": str(exc),
})
# Send webhook notification on failure
await webhook_service.notify_job_completion(
task_id=task_id,
task_type="crawl",
status="failed",
urls=urls,
webhook_config=webhook_config,
error=str(exc)
)
background_tasks.add_task(_runner)
return {"task_id": task_id}

View File

@@ -87,4 +87,17 @@ observability:
enabled: True
endpoint: "/metrics"
health_check:
endpoint: "/health"
endpoint: "/health"
# Webhook Configuration
webhooks:
enabled: true
default_url: null # Optional: default webhook URL for all jobs
data_in_payload: false # Optional: default behavior for including data
retry:
max_attempts: 5
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
max_delay_ms: 32000
timeout_ms: 30000 # 30s timeout per webhook call
headers: # Optional: default headers to include
User-Agent: "Crawl4AI-Webhook/1.0"

View File

@@ -12,6 +12,7 @@ from api import (
handle_crawl_job,
handle_task_status,
)
from schemas import WebhookConfig
# ------------- dependency placeholders -------------
_redis = None # will be injected from server.py
@@ -37,6 +38,7 @@ class LlmJobPayload(BaseModel):
schema: Optional[str] = None
cache: bool = False
provider: Optional[str] = None
webhook_config: Optional[WebhookConfig] = None
temperature: Optional[float] = None
base_url: Optional[str] = None
@@ -45,6 +47,7 @@ class CrawlJobPayload(BaseModel):
urls: list[HttpUrl]
browser_config: Dict = {}
crawler_config: Dict = {}
webhook_config: Optional[WebhookConfig] = None
# ---------- LLM job ---------------------------------------------------------
@@ -55,6 +58,10 @@ async def llm_job_enqueue(
request: Request,
_td: Dict = Depends(lambda: _token_dep()), # late-bound dep
):
webhook_config = None
if payload.webhook_config:
webhook_config = payload.webhook_config.model_dump(mode='json')
return await handle_llm_request(
_redis,
background_tasks,
@@ -65,6 +72,7 @@ async def llm_job_enqueue(
cache=payload.cache,
config=_config,
provider=payload.provider,
webhook_config=webhook_config,
temperature=payload.temperature,
api_base_url=payload.base_url,
)
@@ -86,6 +94,10 @@ async def crawl_job_enqueue(
background_tasks: BackgroundTasks,
_td: Dict = Depends(lambda: _token_dep()),
):
webhook_config = None
if payload.webhook_config:
webhook_config = payload.webhook_config.model_dump(mode='json')
return await handle_crawl_job(
_redis,
background_tasks,
@@ -93,6 +105,7 @@ async def crawl_job_enqueue(
payload.browser_config,
payload.crawler_config,
config=_config,
webhook_config=webhook_config,
)

View File

@@ -12,6 +12,6 @@ pydantic>=2.11
rank-bm25==0.2.2
anyio==4.9.0
PyJWT==2.10.1
mcp>=1.6.0
mcp>=1.18.0
websockets>=15.0.1
httpx[http2]>=0.27.2

View File

@@ -1,6 +1,6 @@
from typing import List, Optional, Dict
from enum import Enum
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, HttpUrl
from utils import FilterType
@@ -85,4 +85,22 @@ class JSEndpointRequest(BaseModel):
scripts: List[str] = Field(
...,
description="List of separated JavaScript snippets to execute"
)
)
class WebhookConfig(BaseModel):
"""Configuration for webhook notifications."""
webhook_url: HttpUrl
webhook_data_in_payload: bool = False
webhook_headers: Optional[Dict[str, str]] = None
class WebhookPayload(BaseModel):
"""Payload sent to webhook endpoints."""
task_id: str
task_type: str # "crawl", "llm_extraction", etc.
status: str # "completed" or "failed"
timestamp: str # ISO 8601 format
urls: List[str]
error: Optional[str] = None
data: Optional[Dict] = None # Included only if webhook_data_in_payload=True

159
deploy/docker/webhook.py Normal file
View File

@@ -0,0 +1,159 @@
"""
Webhook delivery service for Crawl4AI.
This module provides webhook notification functionality with exponential backoff retry logic.
"""
import asyncio
import httpx
import logging
from typing import Dict, Optional
from datetime import datetime, timezone
logger = logging.getLogger(__name__)
class WebhookDeliveryService:
"""Handles webhook delivery with exponential backoff retry logic."""
def __init__(self, config: Dict):
"""
Initialize the webhook delivery service.
Args:
config: Application configuration dictionary containing webhook settings
"""
self.config = config.get("webhooks", {})
self.max_attempts = self.config.get("retry", {}).get("max_attempts", 5)
self.initial_delay = self.config.get("retry", {}).get("initial_delay_ms", 1000) / 1000
self.max_delay = self.config.get("retry", {}).get("max_delay_ms", 32000) / 1000
self.timeout = self.config.get("retry", {}).get("timeout_ms", 30000) / 1000
async def send_webhook(
self,
webhook_url: str,
payload: Dict,
headers: Optional[Dict[str, str]] = None
) -> bool:
"""
Send webhook with exponential backoff retry logic.
Args:
webhook_url: The URL to send the webhook to
payload: The JSON payload to send
headers: Optional custom headers
Returns:
bool: True if delivered successfully, False otherwise
"""
default_headers = self.config.get("headers", {})
merged_headers = {**default_headers, **(headers or {})}
merged_headers["Content-Type"] = "application/json"
async with httpx.AsyncClient(timeout=self.timeout) as client:
for attempt in range(self.max_attempts):
try:
logger.info(
f"Sending webhook (attempt {attempt + 1}/{self.max_attempts}) to {webhook_url}"
)
response = await client.post(
webhook_url,
json=payload,
headers=merged_headers
)
# Success or client error (don't retry client errors)
if response.status_code < 500:
if 200 <= response.status_code < 300:
logger.info(f"Webhook delivered successfully to {webhook_url}")
return True
else:
logger.warning(
f"Webhook rejected with status {response.status_code}: {response.text[:200]}"
)
return False # Client error - don't retry
# Server error - retry with backoff
logger.warning(
f"Webhook failed with status {response.status_code}, will retry"
)
except httpx.TimeoutException as exc:
logger.error(f"Webhook timeout (attempt {attempt + 1}): {exc}")
except httpx.RequestError as exc:
logger.error(f"Webhook request error (attempt {attempt + 1}): {exc}")
except Exception as exc:
logger.error(f"Webhook delivery error (attempt {attempt + 1}): {exc}")
# Calculate exponential backoff delay
if attempt < self.max_attempts - 1:
delay = min(self.initial_delay * (2 ** attempt), self.max_delay)
logger.info(f"Retrying in {delay}s...")
await asyncio.sleep(delay)
logger.error(
f"Webhook delivery failed after {self.max_attempts} attempts to {webhook_url}"
)
return False
async def notify_job_completion(
self,
task_id: str,
task_type: str,
status: str,
urls: list,
webhook_config: Optional[Dict],
result: Optional[Dict] = None,
error: Optional[str] = None
):
"""
Notify webhook of job completion.
Args:
task_id: The task identifier
task_type: Type of task (e.g., "crawl", "llm_extraction")
status: Task status ("completed" or "failed")
urls: List of URLs that were crawled
webhook_config: Webhook configuration from the job request
result: Optional crawl result data
error: Optional error message if failed
"""
# Determine webhook URL
webhook_url = None
data_in_payload = self.config.get("data_in_payload", False)
custom_headers = None
if webhook_config:
webhook_url = webhook_config.get("webhook_url")
data_in_payload = webhook_config.get("webhook_data_in_payload", data_in_payload)
custom_headers = webhook_config.get("webhook_headers")
if not webhook_url:
webhook_url = self.config.get("default_url")
if not webhook_url:
logger.debug("No webhook URL configured, skipping notification")
return
# Check if webhooks are enabled
if not self.config.get("enabled", True):
logger.debug("Webhooks are disabled, skipping notification")
return
# Build payload
payload = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls": urls
}
if error:
payload["error"] = error
if data_in_payload and result:
payload["data"] = result
# Send webhook (fire and forget - don't block on completion)
await self.send_webhook(webhook_url, payload, custom_headers)

View File

@@ -6,15 +6,16 @@ x-base-config: &base-config
- "11235:11235" # Gunicorn port
env_file:
- .llm.env # API keys (create from .llm.env.example)
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
- GROQ_API_KEY=${GROQ_API_KEY:-}
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
- LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
# Uncomment to set default environment variables (will overwrite .llm.env)
# environment:
# - OPENAI_API_KEY=${OPENAI_API_KEY:-}
# - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
# - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
# - GROQ_API_KEY=${GROQ_API_KEY:-}
# - TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
# - MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
# - GEMINI_API_KEY=${GEMINI_API_KEY:-}
# - LLM_PROVIDER=${LLM_PROVIDER:-} # Optional: Override default provider (e.g., "anthropic/claude-3-opus")
volumes:
- /dev/shm:/dev/shm # Chromium performance
deploy:

View File

@@ -10,7 +10,6 @@ Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Perform
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
@@ -158,40 +157,6 @@ async with AsyncWebCrawler() as crawler:
- **Monitoring Systems**: Faster health checks and status page monitoring
- **Data Aggregation**: Improved performance for real-time data collection
## 🧹 Memory Management Refactor: Cleaner Architecture
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
### Improved Memory Handling
```python
# All memory utilities now consolidated
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor
# Enhanced memory monitoring
monitor = MemoryMonitor()
monitor.start_monitoring()
async with AsyncWebCrawler() as crawler:
# Memory-efficient batch processing
results = await crawler.arun_many(large_url_list)
# Get accurate memory metrics
memory_usage = get_true_memory_usage_percent()
memory_report = monitor.get_report()
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
```
**Expected Real-World Impact:**
- **Production Stability**: More reliable memory tracking and management
- **Code Maintainability**: Cleaner architecture for easier debugging
- **Import Clarity**: Resolved potential conflicts and import issues
- **Developer Experience**: Simpler API for memory monitoring
## 🔧 Critical Stability Fixes
### Browser Manager Race Condition Resolution

318
docs/blog/release-v0.7.5.md Normal file
View File

@@ -0,0 +1,318 @@
# 🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
*September 29, 2025 • 8 min read*
---
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
## 🎯 What's New at a Glance
- **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API
- **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion
- **Enhanced LLM Integration**: Custom providers with temperature control
- **HTTPS Preservation**: Secure internal link handling
- **Bug Fixes**: Resolved multiple community-reported issues
- **Improved Docker Error Handling**: Better debugging and reliability
## 🔧 Docker Hooks System: Pipeline Customization
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
### Real Example: Authentication & Performance
```python
import requests
# Real working hooks for httpbin.org
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("Hook: Setting up page context")
# Block images to speed up crawling
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
print("Hook: Images blocked")
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print("Hook: Before retrieving HTML")
# Scroll to bottom to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
print("Hook: Scrolled to bottom")
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f"Hook: About to navigate to {url}")
# Add custom headers
await page.set_extra_http_headers({
'X-Test-Header': 'crawl4ai-hooks-test'
})
return page
"""
}
# Test with Docker API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {
"code": hooks_config,
"timeout": 30
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()
if result.get('success'):
print("✅ Hooks executed successfully!")
print(f"Content length: {len(result.get('markdown', ''))} characters")
```
**Available Hook Points:**
- `on_browser_created`: Browser setup
- `on_page_context_created`: Page context configuration
- `before_goto`: Pre-navigation setup
- `after_goto`: Post-navigation processing
- `on_user_agent_updated`: User agent changes
- `on_execution_started`: Crawl initialization
- `before_retrieve_html`: Pre-extraction processing
- `before_return_html`: Final HTML processing
### Function-Based Hooks API
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
**Option 1: Using the `hooks_to_string()` Utility**
```python
from crawl4ai import hooks_to_string
import requests
# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({
'X-Crawl4AI': 'v0.7.5',
'X-Custom-Header': 'my-value'
})
return page
# Convert functions to strings
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Use with REST API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
**Option 2: Docker Client with Automatic Conversion (Recommended!)**
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
async def before_retrieve_html(page, context, **kwargs):
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_retrieve_html": before_retrieve_html
},
hooks_timeout=30
)
if results and results.success:
print(f"✅ Hooks executed! HTML length: {len(results.html)}")
```
**Benefits of Function-Based Hooks:**
- ✅ Full IDE support (autocomplete, syntax highlighting)
- ✅ Type checking and linting
- ✅ Easier to test and debug
- ✅ Reusable across projects
- ✅ Automatic conversion in Docker client
- ✅ No breaking changes - string hooks still work!
## 🤖 Enhanced LLM Integration
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
### Multi-Provider Support
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Test with different providers
async def test_llm_providers():
# OpenAI with custom temperature
openai_strategy = LLMExtractionStrategy(
provider="gemini/gemini-2.5-flash-lite",
api_token="your-api-token",
temperature=0.7, # New in v0.7.5
instruction="Summarize this page in one sentence"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=CrawlerRunConfig(extraction_strategy=openai_strategy)
)
if result.success:
print("✅ LLM extraction completed")
print(result.extracted_content)
# Docker API with enhanced LLM config
llm_payload = {
"url": "https://example.com",
"f": "llm",
"q": "Summarize this page in one sentence.",
"provider": "gemini/gemini-2.5-flash-lite",
"temperature": 0.7
}
response = requests.post("http://localhost:11235/md", json=llm_payload)
```
**New Features:**
- Custom `temperature` parameter for creativity control
- `base_url` for custom API endpoints
- Multi-provider environment variable support
- Docker API integration
## 🔒 HTTPS Preservation
**The Problem:** Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
**Solution:** HTTPS preservation maintains secure protocols throughout crawling.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
async def test_https_preservation():
# Enable HTTPS preservation
url_filter = URLPatternFilter(
patterns=["^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
)
config = CrawlerRunConfig(
exclude_external_links=True,
preserve_https_for_internal_links=True, # New in v0.7.5
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
filter_chain=FilterChain([url_filter])
)
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun(
url="https://quotes.toscrape.com",
config=config
):
# All internal links maintain HTTPS
internal_links = [link['href'] for link in result.links['internal']]
https_links = [link for link in internal_links if link.startswith('https://')]
print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
for link in https_links[:3]:
print(f"{link}")
```
## 🛠️ Bug Fixes and Improvements
### Major Fixes
- **URL Processing**: Fixed '+' sign preservation in query parameters (#1332)
- **Proxy Configuration**: Enhanced proxy string parsing (old `proxy` parameter deprecated)
- **Docker Error Handling**: Comprehensive error messages with status codes
- **Memory Management**: Fixed leaks in long-running sessions
- **JWT Authentication**: Fixed Docker JWT validation issues (#1442)
- **Playwright Stealth**: Fixed stealth features for Playwright integration (#1481)
- **API Configuration**: Fixed config handling to prevent overriding user-provided settings (#1505)
- **Docker Filter Serialization**: Resolved JSON encoding errors in deep crawl strategy (#1419)
- **LLM Provider Support**: Fixed custom LLM provider integration for adaptive crawler (#1291)
- **Performance Issues**: Resolved backoff strategy failures and timeout handling (#989)
### Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization
### Configuration Updates
```python
# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")
# New enhanced proxy config
browser_config = BrowserConfig(
proxy_config={
"server": "http://proxy:8080",
"username": "optional-user",
"password": "optional-pass"
}
)
```
## 🔄 Breaking Changes
1. **Python 3.10+ Required**: Upgrade from Python 3.9
2. **Proxy Parameter Deprecated**: Use new `proxy_config` structure
3. **New Dependency**: Added `cssselect` for better CSS handling
## 🚀 Get Started
```bash
# Install latest version
pip install crawl4ai==0.7.5
# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
```
**Try the Demo:**
```bash
# Run working examples
python docs/releases_review/demo_v0.7.5.py
```
**Resources:**
- 📖 Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- 🐙 GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 💬 Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- 🐦 Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️

314
docs/blog/release-v0.7.6.md Normal file
View File

@@ -0,0 +1,314 @@
# Crawl4AI v0.7.6 Release Notes
*Release Date: October 22, 2025*
I'm excited to announce Crawl4AI v0.7.6, featuring a complete webhook infrastructure for the Docker job queue API! This release eliminates polling and brings real-time notifications to both crawling and LLM extraction workflows.
## 🎯 What's New
### Webhook Support for Docker Job Queue API
The headline feature of v0.7.6 is comprehensive webhook support for asynchronous job processing. No more constant polling to check if your jobs are done - get instant notifications when they complete!
**Key Capabilities:**
-**Universal Webhook Support**: Both `/crawl/job` and `/llm/job` endpoints now support webhooks
-**Flexible Delivery Modes**: Choose notification-only or include full data in the webhook payload
-**Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
-**Custom Authentication**: Add custom headers for webhook authentication
-**Global Configuration**: Set default webhook URL in `config.yml` for all jobs
-**Task Type Identification**: Distinguish between `crawl` and `llm_extraction` tasks
### How It Works
Instead of constantly checking job status:
**OLD WAY (Polling):**
```python
# Submit job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()['task_id']
# Poll until complete
while True:
status = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
if status.json()['status'] == 'completed':
break
time.sleep(5) # Wait and try again
```
**NEW WAY (Webhooks):**
```python
# Submit job with webhook
payload = {
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
response = requests.post("http://localhost:11235/crawl/job", json=payload)
# Done! Webhook will notify you when complete
# Your webhook handler receives the results automatically
```
### Crawl Job Webhooks
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {"headless": true},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}'
```
### LLM Extraction Job Webhooks (NEW!)
```bash
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and publication date",
"schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true
}
}'
```
### Webhook Payload Structure
**Success (with data):**
```json
{
"task_id": "llm_1698765432",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-22"
}
}
}
```
**Failure:**
```json
{
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30s"
}
```
### Simple Webhook Handler Example
```python
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def handle_webhook():
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
if status == 'completed':
if 'data' in payload:
# Process data directly
data = payload['data']
else:
# Fetch from API
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
data = response.json()
# Your business logic here
print(f"Job {task_id} completed!")
elif status == 'failed':
error = payload.get('error', 'Unknown error')
print(f"Job {task_id} failed: {error}")
return jsonify({"status": "received"}), 200
app.run(port=8080)
```
## 📊 Performance Improvements
- **Reduced Server Load**: Eliminates constant polling requests
- **Lower Latency**: Instant notification vs. polling interval delay
- **Better Resource Usage**: Frees up client connections while jobs run in background
- **Scalable Architecture**: Handles high-volume crawling workflows efficiently
## 🐛 Bug Fixes
- Fixed webhook configuration serialization for Pydantic HttpUrl fields
- Improved error handling in webhook delivery service
- Enhanced Redis task storage for webhook config persistence
## 🌍 Expected Real-World Impact
### For Web Scraping Workflows
- **Reduced Costs**: Less API calls = lower bandwidth and server costs
- **Better UX**: Instant notifications improve user experience
- **Scalability**: Handle 100s of concurrent jobs without polling overhead
### For LLM Extraction Pipelines
- **Async Processing**: Submit LLM extraction jobs and move on
- **Batch Processing**: Queue multiple extractions, get notified as they complete
- **Integration**: Easy integration with workflow automation tools (Zapier, n8n, etc.)
### For Microservices
- **Event-Driven**: Perfect for event-driven microservice architectures
- **Decoupling**: Decouple job submission from result processing
- **Reliability**: Automatic retries ensure webhooks are delivered
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- Webhook configuration is optional
- Existing code continues to work without modification
- Polling is still supported for jobs without webhook config
## 📚 Documentation
### New Documentation
- **[WEBHOOK_EXAMPLES.md](../deploy/docker/WEBHOOK_EXAMPLES.md)** - Comprehensive webhook usage guide
- **[docker_webhook_example.py](../docs/examples/docker_webhook_example.py)** - Working code examples
### Updated Documentation
- **[Docker README](../deploy/docker/README.md)** - Added webhook sections
- API documentation with webhook examples
## 🛠️ Migration Guide
No migration needed! Webhooks are opt-in:
1. **To use webhooks**: Add `webhook_config` to your job payload
2. **To keep polling**: Continue using your existing code
### Quick Start
```python
# Just add webhook_config to your existing payload
payload = {
# Your existing configuration
"urls": ["https://example.com"],
"browser_config": {...},
"crawler_config": {...},
# NEW: Add webhook configuration
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
```
## 🔧 Configuration
### Global Webhook Configuration (config.yml)
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default" # Optional
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
headers:
User-Agent: "Crawl4AI-Webhook/1.0"
```
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest image
docker pull unclecode/crawl4ai:0.7.6
# Or use latest tag
docker pull unclecode/crawl4ai:latest
# Run with webhook support
docker run -d \
-p 11235:11235 \
--env-file .llm.env \
--name crawl4ai \
unclecode/crawl4ai:0.7.6
```
### Python Package
```bash
pip install --upgrade crawl4ai
```
## 💡 Pro Tips
1. **Use notification-only mode** for large results - fetch data separately to avoid large webhook payloads
2. **Set custom headers** for webhook authentication and request tracking
3. **Configure global default webhook** for consistent handling across all jobs
4. **Implement idempotent webhook handlers** - same webhook may be delivered multiple times on retry
5. **Use structured schemas** with LLM extraction for predictable webhook data
## 🎬 Demo
Try the release demo:
```bash
python docs/releases_review/demo_v0.7.6.py
```
This comprehensive demo showcases:
- Crawl job webhooks (notification-only and with data)
- LLM extraction webhooks (with JSON schema support)
- Custom headers for authentication
- Webhook retry mechanism
- Real-time webhook receiver
## 🙏 Acknowledgments
Thank you to the community for the feedback that shaped this feature! Special thanks to everyone who requested webhook support for asynchronous job processing.
## 📞 Support
- **Documentation**: https://docs.crawl4ai.com
- **GitHub Issues**: https://github.com/unclecode/crawl4ai/issues
- **Discord**: https://discord.gg/crawl4ai
---
**Happy crawling with webhooks!** 🕷️🪝
*- unclecode*

626
docs/blog/release-v0.7.7.md Normal file
View File

@@ -0,0 +1,626 @@
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
*November 14, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
## 🎯 What's New at a Glance
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
- **📈 Production Metrics**: 6 critical metrics for operational excellence
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
## 📊 Real-time Monitoring Dashboard: Complete Visibility
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
### The Self-Hosting Value Proposition
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
- **🔒 Data Privacy**: Your data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing or rate limits
- **🎯 Full Customization**: Complete control over configurations and strategies
- **📊 Complete Transparency**: Real-time visibility into every aspect
- **⚡ Performance**: Direct access without network overhead
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
### Interactive Monitoring Dashboard
Access the dashboard at `http://localhost:11235/dashboard` to see:
- **System Health Overview**: CPU, memory, network, and uptime in real-time
- **Live Request Tracking**: Active and completed requests with full details
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
- **Janitor Events Log**: Automatic cleanup activities
- **Error Monitoring**: Full context error logs
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
## 🔌 Monitor API: Programmatic Access
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
### System Health Endpoint
```python
import httpx
import asyncio
async def monitor_system_health():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/health")
health = response.json()
print(f"Container Metrics:")
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Uptime: {health['container']['uptime_seconds']}s")
print(f"\nBrowser Pool:")
print(f" Permanent: {health['pool']['permanent']['active']} active")
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
print(f"\nStatistics:")
print(f" Total Requests: {health['stats']['total_requests']}")
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
asyncio.run(monitor_system_health())
```
### Request Tracking
```python
async def track_requests():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/requests")
requests_data = response.json()
print(f"Active Requests: {len(requests_data['active'])}")
print(f"Completed Requests: {len(requests_data['completed'])}")
# See details of recent requests
for req in requests_data['completed'][:5]:
status_icon = "" if req['success'] else ""
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
```
### Browser Pool Management
```python
async def monitor_browser_pool():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"Pool Summary:")
print(f" Total Browsers: {browsers['summary']['total_count']}")
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
# List all browsers
for browser in browsers['permanent']:
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
```
### Endpoint Performance Statistics
```python
async def get_endpoint_stats():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = response.json()
print("Endpoint Analytics:")
for endpoint, data in stats.items():
print(f" {endpoint}:")
print(f" Requests: {data['count']}")
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
```
### Complete API Reference
The Monitor API includes these endpoints:
- `GET /monitor/health` - System health with pool statistics
- `GET /monitor/requests` - Active and completed request tracking
- `GET /monitor/browsers` - Browser pool details and efficiency
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
- `GET /monitor/logs/errors?limit=10` - Error logs with context
- `POST /monitor/actions/cleanup` - Force immediate cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset accumulated statistics
## ⚡ WebSocket Streaming: Real-time Updates
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
### WebSocket Integration Example
```python
import websockets
import json
import asyncio
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to real-time monitoring stream")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Access all monitoring data
print(f"\n--- Update at {update['timestamp']} ---")
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
if update['errors']:
print(f"⚠️ Recent Errors: {len(update['errors'])}")
asyncio.run(monitor_realtime())
```
**Expected Real-World Impact:**
- **Custom Dashboards**: Build tailored monitoring UIs for your team
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
- **Integration**: Feed live data into monitoring tools like Grafana
- **Automation**: React to events in real-time without polling
## 🔥 Smart Browser Pool: 3-Tier Architecture
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
### How It Works
```python
import httpx
async def demonstrate_browser_pool():
async with httpx.AsyncClient() as client:
# Request 1-3: Default config → Uses permanent browser
print("Phase 1: Using permanent browser")
for i in range(3):
await client.post(
"http://localhost:11235/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"]}
)
print(f" Request {i+1}: Reused permanent browser")
# Request 4-6: Custom viewport → Cold pool (first use)
print("\nPhase 2: Custom config creates cold pool browser")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
"http://localhost:11235/crawl",
json={
"urls": [f"https://httpbin.org/json?v={i}"],
"browser_config": viewport_config
}
)
if i < 2:
print(f" Request {i+1}: Cold pool browser")
else:
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
# Check pool status
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"\nPool Status:")
print(f" Permanent: {len(browsers['permanent'])} (always active)")
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
print(f" Cold: {len(browsers['cold'])} (on-demand)")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
asyncio.run(demonstrate_browser_pool())
```
**Pool Tiers:**
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
**Expected Real-World Impact:**
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
- **Performance**: Instant access to frequently-used configurations
- **Automatic Optimization**: Pool adapts to your usage patterns
- **Resource Management**: Janitor automatically cleans up idle browsers
## 🧹 Janitor System: Automatic Cleanup
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
```python
async def monitor_janitor_activity():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
logs = response.json()
print("Recent Cleanup Activities:")
for log in logs:
print(f" {log['timestamp']}: {log['message']}")
# Example output:
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
```
## 🎮 Control Actions: Manual Management
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
**My Solution:** Manual control actions via the API for operational troubleshooting.
### Force Cleanup
```python
async def force_cleanup():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
result = response.json()
print(f"Cleanup completed:")
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
```
### Kill Specific Browser
```python
async def kill_stuck_browser(browser_id: str):
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11235/monitor/actions/kill_browser",
json={"browser_id": browser_id}
)
if response.status_code == 200:
print(f"✅ Browser {browser_id} killed successfully")
```
### Reset Statistics
```python
async def reset_stats():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/stats/reset")
print("📊 Statistics reset for fresh monitoring")
```
## 📈 Production Integration Patterns
### Prometheus Integration
```python
# Export metrics for Prometheus scraping
async def export_prometheus_metrics():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Export in Prometheus format
metrics = f"""
# HELP crawl4ai_memory_usage_percent Memory usage percentage
# TYPE crawl4ai_memory_usage_percent gauge
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
# HELP crawl4ai_request_success_rate Request success rate
# TYPE crawl4ai_request_success_rate gauge
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
# HELP crawl4ai_browser_pool_count Total browsers in pool
# TYPE crawl4ai_browser_pool_count gauge
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
"""
return metrics
```
### Alerting Example
```python
async def check_alerts():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Memory alert
if data['container']['memory_percent'] > 80:
print("🚨 ALERT: Memory usage above 80%")
# Trigger cleanup
await client.post("http://localhost:11235/monitor/actions/cleanup")
# Success rate alert
if data['stats']['success_rate_percent'] < 90:
print("🚨 ALERT: Success rate below 90%")
# Check error logs
errors = await client.get("http://localhost:11235/monitor/logs/errors")
print(f"Recent errors: {len(errors.json())}")
# Latency alert
if data['stats']['avg_latency_ms'] > 5000:
print("🚨 ALERT: Average latency above 5s")
```
### Key Metrics to Track
```python
CRITICAL_METRICS = {
"memory_usage": {
"current": "container.memory_percent",
"target": "<80%",
"alert_threshold": ">80%",
"action": "Force cleanup or scale"
},
"success_rate": {
"current": "stats.success_rate_percent",
"target": ">95%",
"alert_threshold": "<90%",
"action": "Check error logs"
},
"avg_latency": {
"current": "stats.avg_latency_ms",
"target": "<2000ms",
"alert_threshold": ">5000ms",
"action": "Investigate slow requests"
},
"browser_reuse_rate": {
"current": "browsers.summary.reuse_rate_percent",
"target": ">80%",
"alert_threshold": "<60%",
"action": "Check pool configuration"
},
"total_browsers": {
"current": "browsers.summary.total_count",
"target": "<15",
"alert_threshold": ">20",
"action": "Check for browser leaks"
},
"error_frequency": {
"current": "len(errors)",
"target": "<5/hour",
"alert_threshold": ">10/hour",
"action": "Review error patterns"
}
}
```
## 🐛 Critical Bug Fixes
This release includes significant bug fixes that improve stability and performance:
### Async LLM Extraction (#1590)
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
```python
# Before v0.7.7: Sequential processing
# After v0.7.7: True parallel processing
async with AsyncWebCrawler() as crawler:
urls = ["url1", "url2", "url3", "url4"]
# Now processes truly in parallel with LLM extraction
results = await crawler.arun_many(
urls,
config=CrawlerRunConfig(
extraction_strategy=LLMExtractionStrategy(...)
)
)
# 4x faster for parallel LLM extraction!
```
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
### DFS Deep Crawling (#1607)
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
### Browser & Crawler Config Documentation (#1609)
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
### Sitemap Seeder (#1598)
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
### Remove Overlay Elements (#1529)
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
**The Fix:** Fixed by properly calling the injected JavaScript function.
### Viewport Configuration (#1495)
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
**The Fix:** Added proper viewport size configuration support for browser launch.
### Managed Browser CDP Timing (#1528)
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
### Security Updates
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
- Added verification tests for the security update
### Docker Fixes
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
### Other Important Fixes
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
## 📊 Expected Real-World Impact
### For DevOps & Infrastructure Teams
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
- **Proactive Monitoring**: Catch issues before they become problems
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
- **Operational Control**: Manual intervention when automated systems need help
### For Production Deployments
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
- **Debugging**: Real-time logs and error tracking
- **Capacity Planning**: Historical metrics for scaling decisions
- **SLA Monitoring**: Track success rates and latency against targets
### For Development Teams
- **Local Monitoring**: Understand crawler behavior during development
- **Performance Testing**: Measure impact of configuration changes
- **Troubleshooting**: Quickly identify and fix issues
- **Learning**: See exactly how the browser pool works
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- All existing Docker configurations continue to work
- No API changes to existing endpoints
- Monitoring is additive functionality
- No migration required
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.7
# Or use the latest tag
docker pull unclecode/crawl4ai:latest
# Run with monitoring enabled (default)
docker run -d \
-p 11235:11235 \
--shm-size=1g \
--name crawl4ai \
unclecode/crawl4ai:0.7.7
# Access the monitoring dashboard
open http://localhost:11235/dashboard
```
### Python Package
```bash
# Upgrade to latest version
pip install --upgrade crawl4ai
# Or install specific version
pip install crawl4ai==0.7.7
```
## 🎬 Try the Demo
Run the comprehensive demo that showcases all monitoring features:
```bash
python docs/releases_review/demo_v0.7.7.py
```
**The demo includes:**
1. System health overview with live metrics
2. Request tracking with active/completed monitoring
3. Browser pool management (permanent/hot/cold)
4. Complete Monitor API endpoint examples
5. WebSocket streaming demonstration
6. Control actions (cleanup, kill, restart)
7. Production metrics and alerting patterns
8. Self-hosting value proposition
## 📚 Documentation
### New Documentation
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
### Updated Documentation
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
- Added comprehensive monitoring sections
- Production integration patterns
- WebSocket streaming examples
## 💡 Pro Tips
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
7. **Check janitor logs** - Understand automatic cleanup patterns
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
## 🙏 Acknowledgments
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
## 📞 Support & Resources
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
---
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
**Happy crawling with full visibility!** 🕷️📊
*- unclecode*

327
docs/blog/release-v0.7.8.md Normal file
View File

@@ -0,0 +1,327 @@
# Crawl4AI v0.7.8: Stability & Bug Fix Release
*December 2025*
---
I'm releasing Crawl4AI v0.7.8—a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
## What's Fixed at a Glance
- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of hardcoded mock data
## Bug Fixes
### Docker & API Fixes
#### ContentRelevanceFilter Deserialization (#1642)
**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.
**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.
```python
# This now works correctly in Docker API
import httpx
request = {
"urls": ["https://docs.example.com"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 2,
"filter_chain": [
{
"type": "ContentRelevanceFilter",
"query": "API documentation",
"threshold": 0.3
}
]
}
}
}
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/crawl", json=request)
# Previously failed, now works!
```
#### ProxyConfig JSON Serialization (#1629)
**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.
```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="user",
password="pass"
)
config = BrowserConfig(headless=True, proxy_config=proxy)
# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict) # Valid JSON
```
#### Docker Cache Folder Permissions (#1638)
**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.
**The Fix:** Corrected ownership and permissions during image build.
```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
--shm-size=1g \
-v ./my-cache:/app/.cache \
unclecode/crawl4ai:0.7.8
```
---
### LLM & Extraction Fixes
#### Configurable Rate Limiter Backoff (#1269)
**The Problem:** The LLM rate limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.
**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.
```python
from crawl4ai import LLMConfig
# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2
# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5 seconds on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```
#### LLM Strategy HTML Input Support (#1178)
**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with HTML structure preserved.
**The Fix:** Added `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.
```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract product information"
)
# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract the data table preserving structure",
input_format="html"
)
# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Summarize the main content",
input_format="fit_markdown"
)
```
#### Raw HTML URL Variable (#1116)
**The Problem:** When using `url="raw:<html>..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.
**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
html = "<html><body><h1>Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html}",
config=CrawlerRunConfig(extraction_strategy=my_strategy)
)
# extraction_strategy receives url="Raw HTML" instead of the HTML blob
```
---
### URL Handling Fix
#### Relative URLs After Redirects (#1268)
**The Problem:** When JavaScript caused a page redirect, relative links were resolved against the original URL instead of the final URL.
**The Fix:** `redirected_url` now captures the actual page URL after all JavaScript execution completes.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# Page at /old-page redirects via JS to /new-page
result = await crawler.arun(url="https://example.com/old-page")
# BEFORE: redirected_url = "https://example.com/old-page"
# AFTER: redirected_url = "https://example.com/new-page"
# Links are now correctly resolved against the final URL
for link in result.links['internal']:
print(link['href']) # Relative links resolved correctly
```
---
### Dependency & Compatibility Fixes
#### PyPDF2 Replaced with pypdf (#1412)
**The Problem:** PyPDF2 was deprecated in 2022 and is no longer maintained.
**The Fix:** Replaced with the actively maintained `pypdf` library.
```python
# Installation (unchanged)
pip install crawl4ai[pdf]
# The PDF processor now uses pypdf internally
# No code changes required - API remains the same
```
#### Pydantic v2 ConfigDict Compatibility (#678)
**The Problem:** Using the deprecated `class Config` syntax caused deprecation warnings with Pydantic v2.
**The Fix:** Migrated to `model_config = ConfigDict(...)` syntax.
```python
# No more deprecation warnings when importing crawl4ai models
from crawl4ai.models import CrawlResult
from crawl4ai import CrawlerRunConfig, BrowserConfig
# All models are now Pydantic v2 compatible
```
---
### AdaptiveCrawler Fix
#### Query Expansion Using LLM (#1621)
**The Problem:** The `EmbeddingStrategy` in AdaptiveCrawler had commented-out LLM code and was using hardcoded mock query variations instead.
**The Fix:** Uncommented and activated the LLM call for actual query expansion.
```python
# AdaptiveCrawler query expansion now actually uses the LLM
# Instead of hardcoded variations like:
# variations = {'queries': ['what are the best vegetables...']}
# The LLM generates relevant query variations based on your actual query
```
---
### Code Formatting Fix
#### Import Statement Formatting (#1181)
**The Problem:** When extracting code from web pages, import statements were sometimes concatenated without proper line separation.
**The Fix:** Import statements now maintain proper newline separation.
```python
# BEFORE: "import osimport sysfrom pathlib import Path"
# AFTER:
# import os
# import sys
# from pathlib import Path
```
---
## Breaking Changes
**None!** This release is fully backward compatible.
- All existing code continues to work without modification
- New parameters have sensible defaults matching previous behavior
- No API changes to existing functionality
---
## Upgrade Instructions
### Python Package
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.7.8
```
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.8
# Run
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
```
---
## Verification
Run the verification tests to confirm all fixes are working:
```bash
python docs/releases_review/demo_v0.7.8.py
```
This runs actual tests that verify each bug fix is properly implemented.
---
## Acknowledgments
Thank you to everyone who reported these issues and provided detailed reproduction steps. Your bug reports make Crawl4AI better for everyone.
Issues fixed: #1642, #1638, #1629, #1621, #1412, #1269, #1268, #1181, #1178, #1116, #678
---
## Support & Resources
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **Twitter**: [@unclecode](https://x.com/unclecode)
---
**This stability release ensures Crawl4AI works reliably across Docker deployments, LLM extraction workflows, and various edge cases. Thank you for your continued support and feedback!**
**Happy crawling!**
*- unclecode*

View File

@@ -18,7 +18,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
2. **Install Dependencies**
```bash
pip install flask
pip install -r requirements.txt
```
3. **Launch the Server**
@@ -28,7 +28,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
4. **Open in Browser**
```
http://localhost:8080
http://localhost:8000
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
@@ -325,7 +325,7 @@ Powers the recording functionality:
### Configuration
```python
# server.py configuration
PORT = 8080
PORT = 8000
DEBUG = True
THREADED = True
```
@@ -343,9 +343,9 @@ THREADED = True
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
lsof -ti:8000 | xargs kill -9
# Or use different port
python server.py --port 8081
python server.py --port 8001
```
**Blockly Not Loading**

View File

@@ -216,7 +216,7 @@ def get_examples():
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
GO http://127.0.0.1:8000/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''

View File

@@ -0,0 +1,62 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/awsWaf/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://nft.porsche.com/onboarding@6" # page url of your target site
cookie_domain = ".nft.porsche.com" # the domain name to which you want to apply the cookie
captcha_type = "AntiAwsWafTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get aws waf cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
})
cookie = solution["cookie"]
print("aws waf cookie:", cookie)
js_code = """
document.cookie = \'aws-waf-token=""" + cookie + """;domain=""" + cookie_domain + """;path=/\';
location.reload();
"""
wait_condition = """() => {
return document.title === \'Join Porsches journey into Web3\';
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,60 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_challenge/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_url = "https://gitlab.com/users/sign_in" # page url of your target site
captcha_type = "AntiCloudflareTask" # type of your target captcha
# your http proxy to solve cloudflare challenge
proxy_server = "proxy.example.com:8080"
proxy_username = "myuser"
proxy_password = "mypass"
capsolver.api_key = api_key
async def main():
# get challenge cookie using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"proxy": f"{proxy_server}:{proxy_username}:{proxy_password}",
})
cookies = solution["cookies"]
user_agent = solution["userAgent"]
print("challenge cookies:", cookies)
cookies_list = []
for name, value in cookies.items():
cookies_list.append({
"name": name,
"value": value,
"url": site_url,
})
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
user_agent=user_agent,
cookies=cookies_list,
proxy_config={
"server": f"http://{proxy_server}",
"username": proxy_username,
"password": proxy_password,
},
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,64 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/cloudflare_turnstile/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "0x4AAAAAAAGlwMzq_9z6S9Mh" # site key of your target site
site_url = "https://clifford.io/demo/cloudflare-turnstile" # page url of your target site
captcha_type = "AntiTurnstileTaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get turnstile token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["token"]
print("turnstile token:", token)
js_code = """
document.querySelector(\'input[name="cf-turnstile-response"]\').value = \'"""+token+"""\';
document.querySelector(\'button[type="submit"]\').click();
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h1\');
return items.length === 0;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,67 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV2/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LfW6wATAAAAAHLqO2pb8bDBahxlMxNdo9g947u9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php" # page url of your target site
captcha_type = "ReCaptchaV2TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
js_code = """
const textarea = document.getElementById(\'g-recaptcha-response\');
if (textarea) {
textarea.value = \"""" + token + """\";
document.querySelector(\'button.form-field[type="submit"]\').click();
}
"""
wait_condition = """() => {
const items = document.querySelectorAll(\'h2\');
return items.length > 1;
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,75 @@
import asyncio
import capsolver
from crawl4ai import *
# TODO: set your config
# Docs: https://docs.capsolver.com/guide/captcha/ReCaptchaV3/
api_key = "CAP-xxxxxxxxxxxxxxxxxxxxx" # your api key of capsolver
site_key = "6LdKlZEpAAAAAAOQjzC2v_d36tWxCl6dWsozdSy9" # site key of your target site
site_url = "https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php" # page url of your target site
page_action = "examples/v3scores" # page action of your target site
captcha_type = "ReCaptchaV3TaskProxyLess" # type of your target captcha
capsolver.api_key = api_key
async def main():
browser_config = BrowserConfig(
verbose=True,
headless=False,
use_persistent_context=True,
)
# get recaptcha token using capsolver sdk
solution = capsolver.solve({
"type": captcha_type,
"websiteURL": site_url,
"websiteKey": site_key,
"pageAction": page_action,
})
token = solution["gRecaptchaResponse"]
print("recaptcha token:", token)
async with AsyncWebCrawler(config=browser_config) as crawler:
await crawler.arun(
url=site_url,
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
js_code = """
const originalFetch = window.fetch;
window.fetch = function(...args) {
if (typeof args[0] === 'string' && args[0].includes('/recaptcha-v3-verify.php')) {
const url = new URL(args[0], window.location.origin);
url.searchParams.set('action', '""" + token + """');
args[0] = url.toString();
document.querySelector('.token').innerHTML = "fetch('/recaptcha-v3-verify.php?action=examples/v3scores&token=""" + token + """')";
console.log('Fetch URL hooked:', args[0]);
}
return originalFetch.apply(this, args);
};
"""
wait_condition = """() => {
return document.querySelector('.step3:not(.hidden)');
}"""
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test",
js_code=js_code,
js_only=True,
wait_for=f"js:{wait_condition}"
)
result_next = await crawler.arun(
url=site_url,
config=run_config,
)
print(result_next.markdown)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://nft.porsche.com/onboarding@6",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://gitlab.com/users/sign_in",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://clifford.io/demo/cloudflare-turnstile",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,36 @@
import time
import asyncio
from crawl4ai import *
# TODO: the user data directory that includes the capsolver extension
user_data_dir = "/browser-profile/Default1"
"""
The capsolver extension supports more features, such as:
- Telling the extension when to start solving captcha.
- Calling functions to check whether the captcha has been solved, etc.
Reference blog: https://docs.capsolver.com/guide/automation-tool-integration/
"""
browser_config = BrowserConfig(
verbose=True,
headless=False,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
async def main():
async with AsyncWebCrawler(config=browser_config) as crawler:
result_initial = await crawler.arun(
url="https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php",
cache_mode=CacheMode.BYPASS,
session_id="session_captcha_test"
)
# do something later
time.sleep(300)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,61 @@
import json
import asyncio
from urllib.parse import quote, urlencode
from crawl4ai import CrawlerRunConfig, BrowserConfig, AsyncWebCrawler
# Scrapeless provides a free anti-detection fingerprint browser client and cloud browsers:
# https://www.scrapeless.com/en/blog/scrapeless-nstbrowser-strategic-integration
async def main():
# customize browser fingerprint
fingerprint = {
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36",
"platform": "Windows",
"screen": {
"width": 1280, "height": 1024
},
"localization": {
"languages": ["zh-HK", "en-US", "en"], "timezone": "Asia/Hong_Kong",
}
}
fingerprint_json = json.dumps(fingerprint)
encoded_fingerprint = quote(fingerprint_json)
scrapeless_params = {
"token": "your token",
"sessionTTL": 1000,
"sessionName": "Demo",
"fingerprint": encoded_fingerprint,
# Sets the target country/region for the proxy, sending requests via an IP address from that region. You can specify a country code (e.g., US for the United States, GB for the United Kingdom, ANY for any country). See country codes for all supported options.
# "proxyCountry": "ANY",
# create profile on scrapeless
# "profileId": "your profileId",
# For more usage details, please refer to https://docs.scrapeless.com/en/scraping-browser/quickstart/getting-started
}
query_string = urlencode(scrapeless_params)
scrapeless_connection_url = f"wss://browser.scrapeless.com/api/v2/browser?{query_string}"
async with AsyncWebCrawler(
config=BrowserConfig(
headless=False,
browser_mode="cdp",
cdp_url=scrapeless_connection_url,
)
) as crawler:
result = await crawler.arun(
url="https://www.scrapeless.com/en",
config=CrawlerRunConfig(
wait_for="css:.content",
scan_full_page=True,
),
)
print("-" * 20)
print(f'Status Code: {result.status_code}')
print("-" * 20)
print(f'Title: {result.metadata["title"]}')
print(f'Description: {result.metadata["description"]}')
print("-" * 20)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,39 @@
"""
Simple demonstration of the DFS deep crawler visiting multiple pages.
Run with: python docs/examples/dfs_crawl_demo.py
"""
import asyncio
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode
from crawl4ai.deep_crawling.dfs_strategy import DFSDeepCrawlStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main() -> None:
dfs_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=50,
include_external=False,
)
config = CrawlerRunConfig(
deep_crawl_strategy=dfs_strategy,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(),
stream=True,
)
seed_url = "https://docs.python.org/3/" # Plenty of internal links
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
async for result in await crawler.arun(url=seed_url, config=config):
depth = result.metadata.get("depth")
status = "SUCCESS" if result.success else "FAILED"
print(f"[{status}] depth={depth} url={result.url}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,522 @@
#!/usr/bin/env python3
"""
Comprehensive hooks examples using Docker Client with function objects.
This approach is recommended because:
- Write hooks as regular Python functions
- Full IDE support (autocomplete, type checking)
- Automatic conversion to API format
- Reusable and testable code
- Clean, readable syntax
"""
import asyncio
from crawl4ai import Crawl4aiDockerClient
# API_BASE_URL = "http://localhost:11235"
API_BASE_URL = "http://localhost:11234"
# ============================================================================
# Hook Function Definitions
# ============================================================================
# --- All Hooks Demo ---
async def browser_created_hook(browser, **kwargs):
"""Called after browser is created"""
print("[HOOK] Browser created and ready")
return browser
async def page_context_hook(page, context, **kwargs):
"""Setup page environment"""
print("[HOOK] Setting up page environment")
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Add cookies
await context.add_cookies([{
"name": "test_session",
"value": "abc123xyz",
"domain": ".httpbin.org",
"path": "/"
}])
# Block resources
await context.route("**/*.{png,jpg,jpeg,gif}", lambda route: route.abort())
await context.route("**/analytics/*", lambda route: route.abort())
print("[HOOK] Environment configured")
return page
async def user_agent_hook(page, context, user_agent, **kwargs):
"""Called when user agent is updated"""
print(f"[HOOK] User agent: {user_agent[:50]}...")
return page
async def before_goto_hook(page, context, url, **kwargs):
"""Called before navigating to URL"""
print(f"[HOOK] Navigating to: {url}")
await page.set_extra_http_headers({
"X-Custom-Header": "crawl4ai-test",
"Accept-Language": "en-US"
})
return page
async def after_goto_hook(page, context, url, response, **kwargs):
"""Called after page loads"""
print(f"[HOOK] Page loaded: {url}")
await page.wait_for_timeout(1000)
try:
await page.wait_for_selector("body", timeout=2000)
print("[HOOK] Body element ready")
except:
print("[HOOK] Timeout, continuing")
return page
async def execution_started_hook(page, context, **kwargs):
"""Called when custom JS execution starts"""
print("[HOOK] JS execution started")
await page.evaluate("console.log('[HOOK] Custom JS');")
return page
async def before_retrieve_hook(page, context, **kwargs):
"""Called before retrieving HTML"""
print("[HOOK] Preparing HTML retrieval")
# Scroll for lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(500)
await page.evaluate("window.scrollTo(0, 0);")
print("[HOOK] Scrolling complete")
return page
async def before_return_hook(page, context, html, **kwargs):
"""Called before returning HTML"""
print(f"[HOOK] HTML ready: {len(html)} chars")
metrics = await page.evaluate('''() => ({
images: document.images.length,
links: document.links.length,
scripts: document.scripts.length
})''')
print(f"[HOOK] Metrics - Images: {metrics['images']}, Links: {metrics['links']}")
return page
# --- Authentication Hooks ---
async def auth_context_hook(page, context, **kwargs):
"""Setup authentication context"""
print("[HOOK] Setting up authentication")
# Add auth cookies
await context.add_cookies([{
"name": "auth_token",
"value": "fake_jwt_token",
"domain": ".httpbin.org",
"path": "/",
"httpOnly": True
}])
# Set localStorage
await page.evaluate('''
localStorage.setItem('user_id', '12345');
localStorage.setItem('auth_time', new Date().toISOString());
''')
print("[HOOK] Auth context ready")
return page
async def auth_headers_hook(page, context, url, **kwargs):
"""Add authentication headers"""
print(f"[HOOK] Adding auth headers for {url}")
import base64
credentials = base64.b64encode(b"user:passwd").decode('ascii')
await page.set_extra_http_headers({
'Authorization': f'Basic {credentials}',
'X-API-Key': 'test-key-123'
})
return page
# --- Performance Optimization Hooks ---
async def performance_hook(page, context, **kwargs):
"""Optimize page for performance"""
print("[HOOK] Optimizing for performance")
# Block resource-heavy content
await context.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda r: r.abort())
await context.route("**/*.{woff,woff2,ttf}", lambda r: r.abort())
await context.route("**/*.{mp4,webm,ogg}", lambda r: r.abort())
await context.route("**/googletagmanager.com/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
await context.route("**/facebook.com/*", lambda r: r.abort())
# Disable animations
await page.add_style_tag(content='''
*, *::before, *::after {
animation-duration: 0s !important;
transition-duration: 0s !important;
}
''')
print("[HOOK] Optimizations applied")
return page
async def cleanup_hook(page, context, **kwargs):
"""Clean page before extraction"""
print("[HOOK] Cleaning page")
await page.evaluate('''() => {
const selectors = [
'.ad', '.ads', '.advertisement',
'.popup', '.modal', '.overlay',
'.cookie-banner', '.newsletter'
];
selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(el => el.remove());
});
document.querySelectorAll('script, style').forEach(el => el.remove());
}''')
print("[HOOK] Page cleaned")
return page
# --- Content Extraction Hooks ---
async def wait_dynamic_content_hook(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
print(f"[HOOK] Waiting for dynamic content on {url}")
await page.wait_for_timeout(2000)
# Click "Load More" if exists
try:
load_more = await page.query_selector('[class*="load-more"], button:has-text("Load More")')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
print("[HOOK] Clicked 'Load More'")
except:
pass
return page
async def extract_metadata_hook(page, context, **kwargs):
"""Extract page metadata"""
print("[HOOK] Extracting metadata")
metadata = await page.evaluate('''() => {
const getMeta = (name) => {
const el = document.querySelector(`meta[name="${name}"], meta[property="${name}"]`);
return el ? el.getAttribute('content') : null;
};
return {
title: document.title,
description: getMeta('description'),
author: getMeta('author'),
keywords: getMeta('keywords'),
};
}''')
print(f"[HOOK] Metadata: {metadata}")
# Infinite scroll
for i in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
print(f"[HOOK] Scroll {i+1}/3")
return page
# --- Multi-URL Hooks ---
async def url_specific_hook(page, context, url, **kwargs):
"""Apply URL-specific logic"""
print(f"[HOOK] Processing URL: {url}")
# URL-specific headers
if 'html' in url:
await page.set_extra_http_headers({"X-Type": "HTML"})
elif 'json' in url:
await page.set_extra_http_headers({"X-Type": "JSON"})
return page
async def track_progress_hook(page, context, url, response, **kwargs):
"""Track crawl progress"""
status = response.status if response else 'unknown'
print(f"[HOOK] Loaded {url} - Status: {status}")
return page
# ============================================================================
# Test Functions
# ============================================================================
async def test_all_hooks_comprehensive():
"""Test all 8 hook types"""
print("=" * 70)
print("Test 1: All Hooks Comprehensive Demo (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling with all 8 hooks...")
# Define hooks with function objects
hooks = {
"on_browser_created": browser_created_hook,
"on_page_context_created": page_context_hook,
"on_user_agent_updated": user_agent_hook,
"before_goto": before_goto_hook,
"after_goto": after_goto_hook,
"on_execution_started": execution_started_hook,
"before_retrieve_html": before_retrieve_hook,
"before_return_html": before_return_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=30
)
print("\n✅ Success!")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" HTML: {len(result.html)} chars")
async def test_authentication_workflow():
"""Test authentication with hooks"""
print("\n" + "=" * 70)
print("Test 2: Authentication Workflow (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting authentication...")
hooks = {
"on_page_context_created": auth_context_hook,
"before_goto": auth_headers_hook
}
result = await client.crawl(
["https://httpbin.org/basic-auth/user/passwd"],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Authentication completed")
if result.success:
if '"authenticated"' in result.html and 'true' in result.html:
print(" ✅ Basic auth successful!")
else:
print(" ⚠️ Auth status unclear")
else:
print(f" ❌ Failed: {result.error_message}")
async def test_performance_optimization():
"""Test performance optimization"""
print("\n" + "=" * 70)
print("Test 3: Performance Optimization (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting performance hooks...")
hooks = {
"on_page_context_created": performance_hook,
"before_retrieve_html": cleanup_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=10
)
print("\n✅ Optimization completed")
print(f" HTML size: {len(result.html):,} chars")
print(" Resources blocked, ads removed")
async def test_content_extraction():
"""Test content extraction"""
print("\n" + "=" * 70)
print("Test 4: Content Extraction (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting extraction hooks...")
hooks = {
"after_goto": wait_dynamic_content_hook,
"before_retrieve_html": extract_metadata_hook
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Extraction completed")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" Metadata: {result.metadata}")
async def test_multi_url_crawl():
"""Test hooks with multiple URLs"""
print("\n" + "=" * 70)
print("Test 5: Multi-URL Crawl (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling multiple URLs...")
hooks = {
"before_goto": url_specific_hook,
"after_goto": track_progress_hook
}
results = await client.crawl(
[
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://httpbin.org/xml"
],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Multi-URL crawl completed")
print(f"\n Crawled {len(results)} URLs:")
for i, result in enumerate(results, 1):
status = "" if result.success else ""
print(f" {status} {i}. {result.url}")
async def test_reusable_hook_library():
"""Test using reusable hook library"""
print("\n" + "=" * 70)
print("Test 6: Reusable Hook Library (Docker Client)")
print("=" * 70)
# Create a library of reusable hooks
class HookLibrary:
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images"""
await context.route("**/*.{png,jpg,jpeg,gif}", lambda r: r.abort())
print("[LIBRARY] Images blocked")
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics"""
await context.route("**/analytics/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
print("[LIBRARY] Analytics blocked")
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll"""
for i in range(5):
prev = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
curr = await page.evaluate("document.body.scrollHeight")
if curr == prev:
break
print("[LIBRARY] Infinite scroll complete")
return page
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nUsing hook library...")
hooks = {
"on_page_context_created": HookLibrary.block_images,
"before_retrieve_html": HookLibrary.scroll_infinite
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Library hooks completed")
print(f" Success: {result.success}")
# ============================================================================
# Main
# ============================================================================
async def main():
"""Run all Docker client hook examples"""
print("🔧 Crawl4AI Docker Client - Hooks Examples (Function-Based)")
print("Using Python function objects with automatic conversion")
print("=" * 70)
tests = [
("All Hooks Demo", test_all_hooks_comprehensive),
("Authentication", test_authentication_workflow),
("Performance", test_performance_optimization),
("Extraction", test_content_extraction),
("Multi-URL", test_multi_url_crawl),
("Hook Library", test_reusable_hook_library)
]
for i, (name, test_func) in enumerate(tests, 1):
try:
await test_func()
print(f"\n✅ Test {i}/{len(tests)}: {name} completed\n")
except Exception as e:
print(f"\n❌ Test {i}/{len(tests)}: {name} failed: {e}\n")
import traceback
traceback.print_exc()
print("=" * 70)
print("🎉 All Docker client hook examples completed!")
print("\n💡 Key Benefits of Function-Based Hooks:")
print(" • Write as regular Python functions")
print(" • Full IDE support (autocomplete, types)")
print(" • Automatic conversion to API format")
print(" • Reusable across projects")
print(" • Clean, readable code")
print(" • Easy to test and debug")
print("=" * 70)
if __name__ == "__main__":
asyncio.run(main())

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,461 @@
"""
Docker Webhook Example for Crawl4AI
This example demonstrates how to use webhooks with the Crawl4AI job queue API.
Instead of polling for results, webhooks notify your application when jobs complete.
Supports both:
- /crawl/job - Raw crawling with markdown extraction
- /llm/job - LLM-powered content extraction
Prerequisites:
1. Crawl4AI Docker container running on localhost:11235
2. Flask installed: pip install flask requests
3. LLM API key configured in .llm.env (for LLM extraction examples)
Usage:
1. Run this script: python docker_webhook_example.py
2. The webhook server will start on http://localhost:8080
3. Jobs will be submitted and webhooks will be received automatically
"""
import requests
import json
import time
from flask import Flask, request, jsonify
from threading import Thread
# Configuration
CRAWL4AI_BASE_URL = "http://localhost:11235"
WEBHOOK_BASE_URL = "http://localhost:8080" # Your webhook receiver URL
# Initialize Flask app for webhook receiver
app = Flask(__name__)
# Store received webhook data for demonstration
received_webhooks = []
@app.route('/webhooks/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
"""
Webhook handler that receives notifications when crawl jobs complete.
Payload structure:
{
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "completed" or "failed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "error message" (only if failed),
"data": {...} (only if webhook_data_in_payload=True)
}
"""
payload = request.json
print(f"\n{'='*60}")
print(f"📬 Webhook received for task: {payload['task_id']}")
print(f" Status: {payload['status']}")
print(f" Timestamp: {payload['timestamp']}")
print(f" URLs: {payload['urls']}")
if payload['status'] == 'completed':
# If data is in payload, process it directly
if 'data' in payload:
print(f" ✅ Data included in webhook")
data = payload['data']
# Process the crawl results here
for result in data.get('results', []):
print(f" - Crawled: {result.get('url')}")
print(f" - Markdown length: {len(result.get('markdown', ''))}")
else:
# Fetch results from API if not included
print(f" 📥 Fetching results from API...")
task_id = payload['task_id']
result_response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
if result_response.ok:
data = result_response.json()
print(f" ✅ Results fetched successfully")
# Process the crawl results here
for result in data['result'].get('results', []):
print(f" - Crawled: {result.get('url')}")
print(f" - Markdown length: {len(result.get('markdown', ''))}")
elif payload['status'] == 'failed':
print(f" ❌ Job failed: {payload.get('error', 'Unknown error')}")
print(f"{'='*60}\n")
# Store webhook for demonstration
received_webhooks.append(payload)
# Return 200 OK to acknowledge receipt
return jsonify({"status": "received"}), 200
@app.route('/webhooks/llm-complete', methods=['POST'])
def handle_llm_webhook():
"""
Webhook handler that receives notifications when LLM extraction jobs complete.
Payload structure:
{
"task_id": "llm_1698765432_12345",
"task_type": "llm_extraction",
"status": "completed" or "failed",
"timestamp": "2025-10-21T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"error": "error message" (only if failed),
"data": {"extracted_content": {...}} (only if webhook_data_in_payload=True)
}
"""
payload = request.json
print(f"\n{'='*60}")
print(f"🤖 LLM Webhook received for task: {payload['task_id']}")
print(f" Task Type: {payload['task_type']}")
print(f" Status: {payload['status']}")
print(f" Timestamp: {payload['timestamp']}")
print(f" URL: {payload['urls'][0]}")
if payload['status'] == 'completed':
# If data is in payload, process it directly
if 'data' in payload:
print(f" ✅ Data included in webhook")
data = payload['data']
# Webhook wraps extracted content in 'extracted_content' field
extracted = data.get('extracted_content', {})
print(f" - Extracted content:")
print(f" {json.dumps(extracted, indent=8)}")
else:
# Fetch results from API if not included
print(f" 📥 Fetching results from API...")
task_id = payload['task_id']
result_response = requests.get(f"{CRAWL4AI_BASE_URL}/llm/job/{task_id}")
if result_response.ok:
data = result_response.json()
print(f" ✅ Results fetched successfully")
# API returns unwrapped content in 'result' field
extracted = data['result']
print(f" - Extracted content:")
print(f" {json.dumps(extracted, indent=8)}")
elif payload['status'] == 'failed':
print(f" ❌ Job failed: {payload.get('error', 'Unknown error')}")
print(f"{'='*60}\n")
# Store webhook for demonstration
received_webhooks.append(payload)
# Return 200 OK to acknowledge receipt
return jsonify({"status": "received"}), 200
def start_webhook_server():
"""Start the Flask webhook server in a separate thread"""
app.run(host='0.0.0.0', port=8080, debug=False, use_reloader=False)
def submit_crawl_job_with_webhook(urls, webhook_url, include_data=False):
"""
Submit a crawl job with webhook notification.
Args:
urls: List of URLs to crawl
webhook_url: URL to receive webhook notifications
include_data: Whether to include full results in webhook payload
Returns:
task_id: The job's task identifier
"""
payload = {
"urls": urls,
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": webhook_url,
"webhook_data_in_payload": include_data,
# Optional: Add custom headers for authentication
# "webhook_headers": {
# "X-Webhook-Secret": "your-secret-token"
# }
}
}
print(f"\n🚀 Submitting crawl job...")
print(f" URLs: {urls}")
print(f" Webhook: {webhook_url}")
print(f" Include data: {include_data}")
response = requests.post(
f"{CRAWL4AI_BASE_URL}/crawl/job",
json=payload,
headers={"Content-Type": "application/json"}
)
if response.ok:
data = response.json()
task_id = data['task_id']
print(f" ✅ Job submitted successfully")
print(f" Task ID: {task_id}")
return task_id
else:
print(f" ❌ Failed to submit job: {response.text}")
return None
def submit_llm_job_with_webhook(url, query, webhook_url, include_data=False, schema=None, provider=None):
"""
Submit an LLM extraction job with webhook notification.
Args:
url: URL to extract content from
query: Instruction for the LLM (e.g., "Extract article title and author")
webhook_url: URL to receive webhook notifications
include_data: Whether to include full results in webhook payload
schema: Optional JSON schema for structured extraction
provider: Optional LLM provider (e.g., "openai/gpt-4o-mini")
Returns:
task_id: The job's task identifier
"""
payload = {
"url": url,
"q": query,
"cache": False,
"webhook_config": {
"webhook_url": webhook_url,
"webhook_data_in_payload": include_data,
# Optional: Add custom headers for authentication
# "webhook_headers": {
# "X-Webhook-Secret": "your-secret-token"
# }
}
}
if schema:
payload["schema"] = schema
if provider:
payload["provider"] = provider
print(f"\n🤖 Submitting LLM extraction job...")
print(f" URL: {url}")
print(f" Query: {query}")
print(f" Webhook: {webhook_url}")
print(f" Include data: {include_data}")
if provider:
print(f" Provider: {provider}")
response = requests.post(
f"{CRAWL4AI_BASE_URL}/llm/job",
json=payload,
headers={"Content-Type": "application/json"}
)
if response.ok:
data = response.json()
task_id = data['task_id']
print(f" ✅ Job submitted successfully")
print(f" Task ID: {task_id}")
return task_id
else:
print(f" ❌ Failed to submit job: {response.text}")
return None
def submit_job_without_webhook(urls):
"""
Submit a job without webhook (traditional polling approach).
Args:
urls: List of URLs to crawl
Returns:
task_id: The job's task identifier
"""
payload = {
"urls": urls,
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"}
}
print(f"\n🚀 Submitting crawl job (without webhook)...")
print(f" URLs: {urls}")
response = requests.post(
f"{CRAWL4AI_BASE_URL}/crawl/job",
json=payload
)
if response.ok:
data = response.json()
task_id = data['task_id']
print(f" ✅ Job submitted successfully")
print(f" Task ID: {task_id}")
return task_id
else:
print(f" ❌ Failed to submit job: {response.text}")
return None
def poll_job_status(task_id, timeout=60):
"""
Poll for job status (used when webhook is not configured).
Args:
task_id: The job's task identifier
timeout: Maximum time to wait in seconds
"""
print(f"\n⏳ Polling for job status...")
start_time = time.time()
while time.time() - start_time < timeout:
response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
if response.ok:
data = response.json()
status = data.get('status', 'unknown')
if status == 'completed':
print(f" ✅ Job completed!")
return data
elif status == 'failed':
print(f" ❌ Job failed: {data.get('error', 'Unknown error')}")
return data
else:
print(f" ⏳ Status: {status}, waiting...")
time.sleep(2)
else:
print(f" ❌ Failed to get status: {response.text}")
return None
print(f" ⏰ Timeout reached")
return None
def main():
"""Run the webhook demonstration"""
# Check if Crawl4AI is running
try:
health = requests.get(f"{CRAWL4AI_BASE_URL}/health", timeout=5)
print(f"✅ Crawl4AI is running: {health.json()}")
except:
print(f"❌ Cannot connect to Crawl4AI at {CRAWL4AI_BASE_URL}")
print(" Please make sure Docker container is running:")
print(" docker run -d -p 11235:11235 --name crawl4ai unclecode/crawl4ai:latest")
return
# Start webhook server in background thread
print(f"\n🌐 Starting webhook server at {WEBHOOK_BASE_URL}...")
webhook_thread = Thread(target=start_webhook_server, daemon=True)
webhook_thread.start()
time.sleep(2) # Give server time to start
# Example 1: Job with webhook (notification only, fetch data separately)
print(f"\n{'='*60}")
print("Example 1: Webhook Notification Only")
print(f"{'='*60}")
task_id_1 = submit_crawl_job_with_webhook(
urls=["https://example.com"],
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/crawl-complete",
include_data=False
)
# Example 2: Job with webhook (data included in payload)
time.sleep(5) # Wait a bit between requests
print(f"\n{'='*60}")
print("Example 2: Webhook with Full Data")
print(f"{'='*60}")
task_id_2 = submit_crawl_job_with_webhook(
urls=["https://www.python.org"],
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/crawl-complete",
include_data=True
)
# Example 3: LLM extraction with webhook (notification only)
time.sleep(5) # Wait a bit between requests
print(f"\n{'='*60}")
print("Example 3: LLM Extraction with Webhook (Notification Only)")
print(f"{'='*60}")
task_id_3 = submit_llm_job_with_webhook(
url="https://www.example.com",
query="Extract the main heading and description from this page.",
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/llm-complete",
include_data=False,
provider="openai/gpt-4o-mini"
)
# Example 4: LLM extraction with webhook (data included + schema)
time.sleep(5) # Wait a bit between requests
print(f"\n{'='*60}")
print("Example 4: LLM Extraction with Schema and Full Data")
print(f"{'='*60}")
# Define a schema for structured extraction
schema = json.dumps({
"type": "object",
"properties": {
"title": {"type": "string", "description": "Page title"},
"description": {"type": "string", "description": "Page description"}
},
"required": ["title"]
})
task_id_4 = submit_llm_job_with_webhook(
url="https://www.python.org",
query="Extract the title and description of this website",
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/llm-complete",
include_data=True,
schema=schema,
provider="openai/gpt-4o-mini"
)
# Example 5: Traditional polling (no webhook)
time.sleep(5) # Wait a bit between requests
print(f"\n{'='*60}")
print("Example 5: Traditional Polling (No Webhook)")
print(f"{'='*60}")
task_id_5 = submit_job_without_webhook(
urls=["https://github.com"]
)
if task_id_5:
result = poll_job_status(task_id_5)
if result and result.get('status') == 'completed':
print(f" ✅ Results retrieved via polling")
# Wait for webhooks to arrive
print(f"\n⏳ Waiting for webhooks to be received...")
time.sleep(30) # Give jobs time to complete and webhooks to arrive (longer for LLM)
# Summary
print(f"\n{'='*60}")
print("Summary")
print(f"{'='*60}")
print(f"Total webhooks received: {len(received_webhooks)}")
crawl_webhooks = [w for w in received_webhooks if w['task_type'] == 'crawl']
llm_webhooks = [w for w in received_webhooks if w['task_type'] == 'llm_extraction']
print(f"\n📊 Breakdown:")
print(f" - Crawl webhooks: {len(crawl_webhooks)}")
print(f" - LLM extraction webhooks: {len(llm_webhooks)}")
print(f"\n📋 Details:")
for i, webhook in enumerate(received_webhooks, 1):
task_type = webhook['task_type']
icon = "🕷️" if task_type == "crawl" else "🤖"
print(f"{i}. {icon} Task {webhook['task_id']}: {webhook['status']} ({task_type})")
print(f"\n✅ Demo completed!")
print(f"\n💡 Pro tips:")
print(f" - In production, your webhook URL should be publicly accessible")
print(f" (e.g., https://myapp.com/webhooks) or use ngrok for testing")
print(f" - Both /crawl/job and /llm/job support the same webhook configuration")
print(f" - Use webhook_data_in_payload=true to get results directly in the webhook")
print(f" - LLM jobs may take longer, adjust timeouts accordingly")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,48 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio, requests
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Dynamically fetch a proxy from NSTProxy API before crawling.
"""
NST_TOKEN = "YOUR_NST_PROXY_TOKEN" # Get from https://app.nstproxy.com/profile
CHANNEL_ID = "YOUR_NST_PROXY_CHANNEL_ID" # Your NSTProxy Channel ID
country = "ANY" # e.g. "ANY", "US", "DE"
# Fetch proxy from NSTProxy API
api_url = (
f"https://api.nstproxy.com/api/v1/generate/apiproxies"
f"?fType=2&channelId={CHANNEL_ID}&country={country}"
f"&protocol=http&sessionDuration=10&count=1&token={NST_TOKEN}"
)
response = requests.get(api_url, timeout=10).json()
proxy = response[0]
ip = proxy.get("ip")
port = proxy.get("port")
username = proxy.get("username", "")
password = proxy.get("password", "")
browser_config = BrowserConfig(proxy_config={
"server": f"http://{ip}:{port}",
"username": username,
"password": password,
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[API Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,31 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Use NSTProxy with manual username/password authentication.
"""
browser_config = BrowserConfig(proxy_config={
"server": "http://gate.nstproxy.io:24125",
"username": "your_username",
"password": "your_password",
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[Auth Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,29 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
# Using HTTP proxy
browser_config = BrowserConfig(proxy_config={"server": "http://gate.nstproxy.io:24125"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[HTTP Proxy] Status:", result.status_code)
# Using SOCKS proxy
browser_config = BrowserConfig(proxy_config={"server": "socks5://gate.nstproxy.io:24125"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[SOCKS5 Proxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,39 @@
"""
NSTProxy Integration Examples for crawl4ai
------------------------------------------
NSTProxy is a premium residential proxy provider.
👉 Purchase Proxies: https://nstproxy.com
💰 Use coupon code "crawl4ai" for 10% off your plan.
"""
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
async def main():
"""
Example: Using NSTProxy with AsyncWebCrawler.
"""
NST_TOKEN = "YOUR_NST_PROXY_TOKEN" # Get from https://app.nstproxy.com/profile
CHANNEL_ID = "YOUR_NST_PROXY_CHANNEL_ID" # Your NSTProxy Channel ID
browser_config = BrowserConfig()
browser_config.set_nstproxy(
token=NST_TOKEN,
channel_id=CHANNEL_ID,
country="ANY", # e.g. "US", "JP", or "ANY"
state="", # optional, leave empty if not needed
city="", # optional, leave empty if not needed
session_duration=0 # Session duration in minutes,0 = rotate on every request
)
# === Run crawler ===
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
print("[Nstproxy] Status:", result.status_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -82,6 +82,42 @@ If you installed Crawl4AI (which installs Playwright under the hood), you alread
---
### Creating a Profile Using the Crawl4AI CLI (Easiest)
If you prefer a guided, interactive setup, use the built-in CLI to create and manage persistent browser profiles.
1.Launch the profile manager:
```bash
crwl profiles
```
2.Choose "Create new profile" and enter a profile name. A Chromium window opens so you can log in to sites and configure settings. When finished, return to the terminal and press `q` to save the profile.
3.Profiles are saved under `~/.crawl4ai/profiles/<profile_name>` (for example: `/home/<you>/.crawl4ai/profiles/test_profile_1`) along with a `storage_state.json` for cookies and session data.
4.Optionally, choose "List profiles" in the CLI to view available profiles and their paths.
5.Use the saved path with `BrowserConfig.user_data_dir`:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
profile_path = "/home/<you>/.crawl4ai/profiles/test_profile_1"
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir=profile_path,
browser_type="chromium",
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com/private")
```
The CLI also supports listing and deleting profiles, and even testing a crawl directly from the menu.
---
## 3. Using Managed Browsers in Crawl4AI
Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

View File

@@ -1,98 +1,304 @@
# Proxy
# Proxy & Security
This guide covers proxy configuration and security features in Crawl4AI, including SSL certificate analysis and proxy rotation strategies.
## Understanding Proxy Configuration
Crawl4AI recommends configuring proxies per request through `CrawlerRunConfig.proxy_config`. This gives you precise control, enables rotation strategies, and keeps examples simple enough to copy, paste, and run.
## Basic Proxy Setup
Simple proxy configuration with `BrowserConfig`:
Configure proxies that apply to each crawl operation:
```python
from crawl4ai.async_configs import BrowserConfig
# Using HTTP proxy
browser_config = BrowserConfig(proxy_config={"server": "http://proxy.example.com:8080"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
# Using SOCKS proxy
browser_config = BrowserConfig(proxy_config={"server": "socks5://proxy.example.com:1080"})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Authenticated Proxy
Use an authenticated proxy with `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig
browser_config = BrowserConfig(proxy_config={
"server": "http://[host]:[port]",
"username": "[username]",
"password": "[password]",
})
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Rotating Proxies
Example using a proxy rotation service dynamically:
```python
import re
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
RoundRobinProxyStrategy,
)
import asyncio
from crawl4ai import ProxyConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig
run_config = CrawlerRunConfig(proxy_config=ProxyConfig(server="http://proxy.example.com:8080"))
# run_config = CrawlerRunConfig(proxy_config={"server": "http://proxy.example.com:8080"})
# run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
async def main():
# Load proxies and create rotation strategy
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
print(f"Success: {result.success} -> {result.url}")
if __name__ == "__main__":
asyncio.run(main())
```
!!! note "Why request-level?"
`CrawlerRunConfig.proxy_config` keeps each request self-contained, so swapping proxies or rotation strategies is just a matter of building a new run configuration.
## Supported Proxy Formats
The `ProxyConfig.from_string()` method supports multiple formats:
```python
from crawl4ai import ProxyConfig
# HTTP proxy with authentication
proxy1 = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")
# HTTPS proxy
proxy2 = ProxyConfig.from_string("https://proxy.example.com:8080")
# SOCKS5 proxy
proxy3 = ProxyConfig.from_string("socks5://proxy.example.com:1080")
# Simple IP:port format
proxy4 = ProxyConfig.from_string("192.168.1.1:8080")
# IP:port:user:pass format
proxy5 = ProxyConfig.from_string("192.168.1.1:8080:user:pass")
```
## Authenticated Proxies
For proxies requiring authentication:
```python
import asyncio
from crawl4ai import AsyncWebCrawler,BrowserConfig, CrawlerRunConfig, ProxyConfig
run_config = CrawlerRunConfig(
proxy_config=ProxyConfig(
server="http://proxy.example.com:8080",
username="your_username",
password="your_password",
)
)
# Or dictionary style:
# run_config = CrawlerRunConfig(proxy_config={
# "server": "http://proxy.example.com:8080",
# "username": "your_username",
# "password": "your_password",
# })
async def main():
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
print(f"Success: {result.success} -> {result.url}")
if __name__ == "__main__":
asyncio.run(main())
```
## Environment Variable Configuration
Load proxies from environment variables for easy configuration:
```python
import os
from crawl4ai import ProxyConfig, CrawlerRunConfig
# Set environment variable
os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3"
# Load all proxies
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")
# Use first proxy
if proxies:
run_config = CrawlerRunConfig(proxy_config=proxies[0])
```
## Rotating Proxies
Crawl4AI supports automatic proxy rotation to distribute requests across multiple proxy servers. Rotation is applied per request using a rotation strategy on `CrawlerRunConfig`.
### Proxy Rotation (recommended)
```python
import asyncio
import re
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
async def main():
# Load proxies from environment
proxies = ProxyConfig.from_env()
#eg: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
if not proxies:
print("No proxies found in environment. Set PROXIES env variable!")
print("No proxies found! Set PROXIES environment variable.")
return
# Create rotation strategy
proxy_strategy = RoundRobinProxyStrategy(proxies)
# Create configs
# Configure per-request with proxy rotation
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=proxy_strategy
proxy_rotation_strategy=proxy_strategy,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
urls = ["https://httpbin.org/ip"] * (len(proxies) * 2) # Test each proxy twice
print("\n📈 Initializing crawler with proxy rotation...")
async with AsyncWebCrawler(config=browser_config) as crawler:
print("\n🚀 Starting batch crawl with proxy rotation...")
results = await crawler.arun_many(
urls=urls,
config=run_config
)
for result in results:
if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
current_proxy = run_config.proxy_config if run_config.proxy_config else None
print(f"🚀 Testing {len(proxies)} proxies with rotation...")
results = await crawler.arun_many(urls=urls, config=run_config)
if current_proxy and ip_match:
print(f"URL {result.url}")
print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
verified = ip_match.group(0) == current_proxy.ip
if verified:
print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
else:
print("❌ Proxy failed or IP mismatch!")
print("---")
for i, result in enumerate(results):
if result.success:
# Extract IP from response
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
if ip_match:
detected_ip = ip_match.group(0)
proxy_index = i % len(proxies)
expected_ip = proxies[proxy_index].ip
asyncio.run(main())
print(f"✅ Request {i+1}: Proxy {proxy_index+1} -> IP {detected_ip}")
if detected_ip == expected_ip:
print(" 🎯 IP matches proxy configuration")
else:
print(f" ⚠️ IP mismatch (expected {expected_ip})")
else:
print(f"❌ Request {i+1}: Could not extract IP from response")
else:
print(f"❌ Request {i+1}: Failed - {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
## SSL Certificate Analysis
Combine proxy usage with SSL certificate inspection for enhanced security analysis. SSL certificate fetching is configured per request via `CrawlerRunConfig`.
### Per-Request SSL Certificate Analysis
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
run_config = CrawlerRunConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass",
},
fetch_ssl_certificate=True, # Enable SSL certificate analysis for this request
)
async def main():
browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
if result.success:
print(f"✅ Crawled via proxy: {result.url}")
# Analyze SSL certificate
if result.ssl_certificate:
cert = result.ssl_certificate
print("🔒 SSL Certificate Info:")
print(f" Issuer: {cert.issuer}")
print(f" Subject: {cert.subject}")
print(f" Valid until: {cert.valid_until}")
print(f" Fingerprint: {cert.fingerprint}")
# Export certificate
cert.to_json("certificate.json")
print("💾 Certificate exported to certificate.json")
else:
print("⚠️ No SSL certificate information available")
if __name__ == "__main__":
asyncio.run(main())
```
## Security Best Practices
### 1. Proxy Rotation for Anonymity
```python
from crawl4ai import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
# Use multiple proxies to avoid IP blocking
proxies = ProxyConfig.from_env("PROXIES")
strategy = RoundRobinProxyStrategy(proxies)
# Configure rotation per request (recommended)
run_config = CrawlerRunConfig(proxy_rotation_strategy=strategy)
# For a fixed proxy across all requests, just reuse the same run_config instance
static_run_config = run_config
```
### 2. SSL Certificate Verification
```python
from crawl4ai import CrawlerRunConfig
# Always verify SSL certificates when possible
# Per-request (affects specific requests)
run_config = CrawlerRunConfig(fetch_ssl_certificate=True)
```
### 3. Environment Variable Security
```bash
# Use environment variables for sensitive proxy credentials
# Avoid hardcoding usernames/passwords in code
export PROXIES="ip1:port1:user1:pass1,ip2:port2:user2:pass2"
```
### 4. SOCKS5 for Enhanced Security
```python
from crawl4ai import CrawlerRunConfig
# Prefer SOCKS5 proxies for better protocol support
run_config = CrawlerRunConfig(proxy_config="socks5://proxy.example.com:1080")
```
## Migration from Deprecated `proxy` Parameter
- "Deprecation Notice"
The legacy `proxy` argument on `BrowserConfig` is deprecated. Configure proxies through `CrawlerRunConfig.proxy_config` so each request fully describes its network settings.
```python
# Old (deprecated) approach
# from crawl4ai import BrowserConfig
# browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
# New (preferred) approach
from crawl4ai import CrawlerRunConfig
run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
```
### Safe Logging of Proxies
```python
from crawl4ai import ProxyConfig
def safe_proxy_repr(proxy: ProxyConfig):
if getattr(proxy, "username", None):
return f"{proxy.server} (auth: ****)"
return proxy.server
```
## Troubleshooting
### Common Issues
- "Proxy connection failed"
- Verify the proxy server is reachable from your network.
- Double-check authentication credentials.
- Ensure the protocol matches (`http`, `https`, or `socks5`).
- "SSL certificate errors"
- Some proxies break SSL inspection; switch proxies if you see repeated failures.
- Consider temporarily disabling certificate fetching to isolate the issue.
- "Environment variables not loading"
- Confirm `PROXIES` (or your custom env var) is set before running the script.
- Check formatting: `ip:port:user:pass,ip:port:user:pass`.
- "Proxy rotation not working"
- Ensure `ProxyConfig.from_env()` actually loaded entries (`len(proxies) > 0`).
- Attach `proxy_rotation_strategy` to `CrawlerRunConfig`.
- Validate the proxy definitions you pass into the strategy.

View File

@@ -21,21 +21,35 @@ browser_cfg = BrowserConfig(
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`browser_mode`** | `str` (default: `"dedicated"`) | How browser is initialized: `"dedicated"` (new instance), `"builtin"` (CDP background), `"custom"` (explicit CDP), `"docker"` (container). |
| **`use_managed_browser`** | `bool` (default: `False`) | Launch browser via CDP for advanced control. Set automatically based on `browser_mode`. |
| **`cdp_url`** | `str` (default: `None`) | Chrome DevTools Protocol endpoint URL (e.g., `"ws://localhost:9222/devtools/browser/"`). Set automatically based on `browser_mode`. |
| **`debugging_port`** | `int` (default: `9222`) | Port for browser debugging protocol. |
| **`host`** | `str` (default: `"localhost"`) | Host for browser connection. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`viewport`** | `dict` (default: `None`) | Viewport dimensions dict. If set, overrides `viewport_width` and `viewport_height`. |
| **`proxy`** | `str` (deprecated) | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`proxy_config`** | `ProxyConfig or dict` (default: `None`)| For advanced or multi-proxy needs, specify `ProxyConfig` object or dict like `{"server": "...", "username": "...", "password": "..."}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`chrome_channel`** | `str` (default: `"chromium"`) | Chrome channel to launch (e.g., "chrome", "msedge"). Only for `browser_type="chromium"`. Auto-set to empty for Firefox/WebKit. |
| **`channel`** | `str` (default: `"chromium"`) | Alias for `chrome_channel`. |
| **`accept_downloads`** | `bool` (default: `False`) | Whether to allow file downloads. Requires `downloads_path` if `True`. |
| **`downloads_path`** | `str or None` (default: `None`) | Directory to store downloaded files. |
| **`storage_state`** | `str or dict or None` (default: `None`)| In-memory storage state (cookies, localStorage) to restore browser state. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`sleep_on_close`** | `bool` (default: `False`) | Add a small delay when closing browser (can help with cleanup issues). |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom user agent string. |
| **`user_agent_mode`** | `str` (default: `""`) | Set to `"random"` to randomize user agent from a pool (helps with bot detection). |
| **`user_agent_generator_config`** | `dict` (default: `{}`) | Configuration dict for user agent generation when `user_agent_mode="random"`. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
| **`enable_stealth`** | `bool` (default: `False`) | Enable playwright-stealth mode to bypass bot detection. Cannot be used with `browser_mode="builtin"`. |
**Tips**:
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
@@ -70,6 +84,7 @@ We group them by category.
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`chunking_strategy`** | `ChunkingStrategy` (default: RegexChunking()) | Strategy to chunk content before extraction. Can be customized for different chunking approaches. |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
@@ -78,32 +93,50 @@ We group them by category.
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
| **`keep_attrs`** | `list` (default: []) | List of HTML attributes to keep during processing (e.g., `["id", "class", "data-value"]`). |
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
| **`parser_type`** | `str` (default: "lxml") | HTML parser to use (e.g., "lxml", "html.parser"). |
| **`scraping_strategy`** | `ContentScrapingStrategy` (default: LXMLWebScrapingStrategy()) | Strategy to use for content scraping. Can be customized for different scraping needs (e.g., PDF extraction). |
---
### B) **Caching & Session**
### B) **Browser Location and Identity**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|--------------------------------------------------------------------------------------------------------|
| **`locale`** | `str or None` (None) | Browser's locale (e.g., "en-US", "fr-FR") for language preferences. |
| **`timezone_id`** | `str or None` (None) | Browser's timezone (e.g., "America/New_York", "Europe/Paris"). |
| **`geolocation`** | `GeolocationConfig or None` (None) | GPS coordinates configuration. Use `GeolocationConfig(latitude=..., longitude=..., accuracy=...)`. |
| **`fetch_ssl_certificate`** | `bool` (False) | If `True`, fetches and includes SSL certificate information in the result. |
| **`proxy_config`** | `ProxyConfig or dict or None` (None) | Proxy configuration for this specific crawl. Can override browser-level proxy settings. |
| **`proxy_rotation_strategy`** | `ProxyRotationStrategy` (None) | Strategy for rotating proxies during crawl operations. |
---
### C) **Caching & Session**
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
| **`bypass_cache`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.BYPASS`. Use `cache_mode` instead. |
| **`disable_cache`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.DISABLED`. Use `cache_mode` instead. |
| **`no_cache_read`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). Use `cache_mode` instead. |
| **`no_cache_write`** | `bool` (False) | **Deprecated.** If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). Use `cache_mode` instead. |
| **`shared_data`** | `dict or None` (None) | Shared data to be passed between hooks and accessible across crawl operations. |
Use these for controlling whether you read or write from a local content cache. Handy for large batch crawls or repeated site visits.
---
### C) **Page Navigation & Timing**
### D) **Page Navigation & Timing**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to complete. Often `"networkidle"` or `"domcontentloaded"`. |
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to "complete". Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_timeout`** | `int or None` (None) | Specific timeout in ms for the `wait_for` condition. If None, uses `page_timeout`. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. |
@@ -112,15 +145,17 @@ Use these for controlling whether you read or write from a local content cache.
---
### D) **Page Interaction**
### E) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates were reusing an existing session and only applying JS. No full reload. |
| **`c4a_script`** | `str or list[str]` (None) | C4A script that compiles to JavaScript. Alternative to writing raw JS. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
@@ -132,7 +167,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
---
### E) **Media Handling**
### F) **Media Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
@@ -141,13 +176,16 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an images alt text or description to be considered valid. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image's alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
| **`exclude_all_images`** | `bool` (False) | If `True`, excludes all images from processing (both internal and external). |
| **`table_score_threshold`** | `int` (7) | Minimum score threshold for processing a table. Lower values include more tables. |
| **`table_extraction`** | `TableExtractionStrategy` (DefaultTableExtraction) | Strategy for table extraction. Defaults to DefaultTableExtraction with configured threshold. |
---
### F) **Link/Domain Handling**
### G) **Link/Domain Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
@@ -155,23 +193,39 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
| **`exclude_internal_links`** | `bool` (False) | If `True`, excludes internal links from the results. |
| **`score_links`** | `bool` (False) | If `True`, calculates intrinsic quality scores for all links using URL structure, text quality, and contextual metrics. |
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
---
### G) **Debug & Logging**
### H) **Debug, Logging & Network Monitoring**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the pages JavaScript console output if you want deeper JS debugging.|
| **`log_console`** | `bool` (False) | Logs the page's JavaScript console output if you want deeper JS debugging.|
| **`capture_network_requests`** | `bool` (False) | If `True`, captures network requests made by the page in `result.captured_requests`. |
| **`capture_console_messages`** | `bool` (False) | If `True`, captures console messages from the page in `result.console_messages`. |
---
### I) **Connection & HTTP Parameters**
### H) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`method`** | `str` ("GET") | HTTP method to use when using AsyncHTTPCrawlerStrategy (e.g., "GET", "POST"). |
| **`stream`** | `bool` (False) | If `True`, enables streaming mode for `arun_many()` to process URLs as they complete rather than waiting for all. |
| **`url`** | `str or None` (None) | URL for this specific config. Not typically set directly but used internally for URL-specific configurations. |
| **`user_agent`** | `str or None` (None) | Custom User-Agent string for this crawl. Can override browser-level user agent. |
| **`user_agent_mode`** | `str or None` (None) | Set to `"random"` to randomize user agent. Can override browser-level setting. |
| **`user_agent_generator_config`** | `dict` ({}) | Configuration for user agent generation when `user_agent_mode="random"`. |
---
### J) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -211,7 +265,7 @@ See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detaile
---
### I) **URL Matching Configuration**
### K) **URL Matching Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -274,7 +328,25 @@ default_config = CrawlerRunConfig() # No url_matcher = matches everything
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
- Always include a default config as the last item if you want to handle all URLs
---## 2.2 Helper Methods
---
### L) **Advanced Crawling Features**
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`deep_crawl_strategy`** | `DeepCrawlStrategy or None` (None) | Strategy for deep/recursive crawling. Enables automatic link following and multi-level site crawling. |
| **`link_preview_config`** | `LinkPreviewConfig or dict or None` (None) | Configuration for link head extraction and scoring. Fetches and scores link metadata without full page loads. |
| **`experimental`** | `dict or None` (None) | Dictionary for experimental/beta features not yet integrated into main parameters. Use with caution. |
**Deep Crawl Strategy** enables automatic site exploration by following links according to defined rules. Useful for sitemap generation or comprehensive site archiving.
**Link Preview Config** allows efficient link discovery and scoring by fetching only the `<head>` section of linked pages, enabling smart crawl prioritization without the overhead of full page loads.
**Experimental** parameters are features in beta testing. They may change or be removed in future versions. Check documentation for currently available experimental features.
---
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
@@ -367,10 +439,19 @@ LLMConfig is useful to pass LLM provider config to strategies and functions that
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use.
| **`api_token`** |1.Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables <br/> 2. API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` <br/> 3. Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"` | API token to use for the given provider
| **`base_url`** |Optional. Custom API endpoint | If your provider has a custom endpoint
| **`backoff_base_delay`** |Optional. `int` *(default: `2`)* | Seconds to wait before the first retry when the provider throttles a request.
| **`backoff_max_attempts`** |Optional. `int` *(default: `3`)* | Total tries (initial call + retries) before surfacing an error.
| **`backoff_exponential_factor`** |Optional. `int` *(default: `2`)* | Multiplier that increases the wait time for each retry (`delay = base_delay * factor^attempt`).
## 3.2 Example Usage
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
backoff_base_delay=1, # optional
backoff_max_attempts=5, # optional
backoff_exponential_factor=3, # optional
)
```
## 4. Putting It All Together

View File

@@ -18,7 +18,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
2. **Install Dependencies**
```bash
pip install flask
pip install -r requirements.txt
```
3. **Launch the Server**
@@ -28,7 +28,7 @@ A comprehensive web-based tutorial for learning and experimenting with C4A-Scrip
4. **Open in Browser**
```
http://localhost:8080
http://localhost:8000
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
@@ -325,7 +325,7 @@ Powers the recording functionality:
### Configuration
```python
# server.py configuration
PORT = 8080
PORT = 8000
DEBUG = True
THREADED = True
```
@@ -343,9 +343,9 @@ THREADED = True
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
lsof -ti:8000 | xargs kill -9
# Or use different port
python server.py --port 8081
python server.py --port 8001
```
**Blockly Not Loading**

View File

@@ -216,7 +216,7 @@ def get_examples():
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
GO http://127.0.0.1:8000/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''
@@ -283,7 +283,7 @@ WAIT `.success-message` 5'''
return jsonify(examples)
if __name__ == '__main__':
port = int(os.environ.get('PORT', 8080))
port = int(os.environ.get('PORT', 8000))
print(f"""
╔══════════════════════════════════════════════════════════╗
║ C4A-Script Interactive Tutorial Server ║

Binary file not shown.

View File

@@ -20,22 +20,73 @@ Ever wondered why your AI coding assistant struggles with your library despite c
## Latest Release
### [Crawl4AI v0.7.4 The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
*August 17, 2025*
### [Crawl4AI v0.7.8 Stability & Bug Fix Release](../blog/release-v0.7.8.md)
*December 2025*
Crawl4AI v0.7.4 introduces revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. While there are no new features, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
Key highlights:
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing issue in arun_many for fast-completing tasks
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
- **🐳 Docker API Fixes**: ContentRelevanceFilter deserialization, ProxyConfig serialization, cache folder permissions
- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support, raw HTML URL handling
- **🔗 URL Handling**: Correct relative URL resolution after JavaScript redirects
- **📦 Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **🧠 AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of mock data
[Read full release notes →](../blog/release-v0.7.4.md)
[Read full release notes →](../blog/release-v0.7.8.md)
## Recent Releases
### [Crawl4AI v0.7.7 The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
*November 14, 2025*
Crawl4AI v0.7.7 transforms Docker into a complete self-hosting platform with enterprise-grade real-time monitoring, comprehensive observability, and full operational control.
Key highlights:
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds
- **🔥 Smart Browser Pool**: 3-tier architecture with automatic promotion and cleanup
[Read full release notes →](../blog/release-v0.7.7.md)
### [Crawl4AI v0.7.6 The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
*October 22, 2025*
Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows. No more polling!
Key highlights:
- **🪝 Complete Webhook Support**: Real-time notifications for both `/crawl/job` and `/llm/job` endpoints
- **🔄 Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
- **🔐 Custom Authentication**: Add custom headers for webhook authentication
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs
[Read full release notes →](../blog/release-v0.7.6.md)
### [Crawl4AI v0.7.5 The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
*September 29, 2025*
Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.
Key highlights:
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
[Read full release notes →](../blog/release-v0.7.5.md)
---
## Older Releases
| Version | Date | Highlights |
|---------|------|------------|
| [v0.7.4](../blog/release-v0.7.4.md) | August 2025 | LLM-powered table extraction, performance improvements |
| [v0.7.3](../blog/release-v0.7.3.md) | July 2025 | Undetected browser, multi-URL config, memory monitoring |
| [v0.7.1](../blog/release-v0.7.1.md) | June 2025 | Bug fixes and stability improvements |
| [v0.7.0](../blog/release-v0.7.0.md) | May 2025 | Adaptive crawling, virtual scroll, link analysis |
## Project History
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.

View File

@@ -0,0 +1,314 @@
# Crawl4AI v0.7.6 Release Notes
*Release Date: October 22, 2025*
I'm excited to announce Crawl4AI v0.7.6, featuring a complete webhook infrastructure for the Docker job queue API! This release eliminates polling and brings real-time notifications to both crawling and LLM extraction workflows.
## 🎯 What's New
### Webhook Support for Docker Job Queue API
The headline feature of v0.7.6 is comprehensive webhook support for asynchronous job processing. No more constant polling to check if your jobs are done - get instant notifications when they complete!
**Key Capabilities:**
-**Universal Webhook Support**: Both `/crawl/job` and `/llm/job` endpoints now support webhooks
-**Flexible Delivery Modes**: Choose notification-only or include full data in the webhook payload
-**Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
-**Custom Authentication**: Add custom headers for webhook authentication
-**Global Configuration**: Set default webhook URL in `config.yml` for all jobs
-**Task Type Identification**: Distinguish between `crawl` and `llm_extraction` tasks
### How It Works
Instead of constantly checking job status:
**OLD WAY (Polling):**
```python
# Submit job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()['task_id']
# Poll until complete
while True:
status = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
if status.json()['status'] == 'completed':
break
time.sleep(5) # Wait and try again
```
**NEW WAY (Webhooks):**
```python
# Submit job with webhook
payload = {
"urls": ["https://example.com"],
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
response = requests.post("http://localhost:11235/crawl/job", json=payload)
# Done! Webhook will notify you when complete
# Your webhook handler receives the results automatically
```
### Crawl Job Webhooks
```bash
curl -X POST http://localhost:11235/crawl/job \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {"headless": true},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
"webhook_data_in_payload": false,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}'
```
### LLM Extraction Job Webhooks (NEW!)
```bash
curl -X POST http://localhost:11235/llm/job \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"q": "Extract the article title, author, and publication date",
"schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://myapp.com/webhooks/llm-complete",
"webhook_data_in_payload": true
}
}'
```
### Webhook Payload Structure
**Success (with data):**
```json
{
"task_id": "llm_1698765432",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-22"
}
}
}
```
**Failure:**
```json
{
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-22T10:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30s"
}
```
### Simple Webhook Handler Example
```python
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def handle_webhook():
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
if status == 'completed':
if 'data' in payload:
# Process data directly
data = payload['data']
else:
# Fetch from API
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
data = response.json()
# Your business logic here
print(f"Job {task_id} completed!")
elif status == 'failed':
error = payload.get('error', 'Unknown error')
print(f"Job {task_id} failed: {error}")
return jsonify({"status": "received"}), 200
app.run(port=8080)
```
## 📊 Performance Improvements
- **Reduced Server Load**: Eliminates constant polling requests
- **Lower Latency**: Instant notification vs. polling interval delay
- **Better Resource Usage**: Frees up client connections while jobs run in background
- **Scalable Architecture**: Handles high-volume crawling workflows efficiently
## 🐛 Bug Fixes
- Fixed webhook configuration serialization for Pydantic HttpUrl fields
- Improved error handling in webhook delivery service
- Enhanced Redis task storage for webhook config persistence
## 🌍 Expected Real-World Impact
### For Web Scraping Workflows
- **Reduced Costs**: Less API calls = lower bandwidth and server costs
- **Better UX**: Instant notifications improve user experience
- **Scalability**: Handle 100s of concurrent jobs without polling overhead
### For LLM Extraction Pipelines
- **Async Processing**: Submit LLM extraction jobs and move on
- **Batch Processing**: Queue multiple extractions, get notified as they complete
- **Integration**: Easy integration with workflow automation tools (Zapier, n8n, etc.)
### For Microservices
- **Event-Driven**: Perfect for event-driven microservice architectures
- **Decoupling**: Decouple job submission from result processing
- **Reliability**: Automatic retries ensure webhooks are delivered
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- Webhook configuration is optional
- Existing code continues to work without modification
- Polling is still supported for jobs without webhook config
## 📚 Documentation
### New Documentation
- **[WEBHOOK_EXAMPLES.md](../deploy/docker/WEBHOOK_EXAMPLES.md)** - Comprehensive webhook usage guide
- **[docker_webhook_example.py](../docs/examples/docker_webhook_example.py)** - Working code examples
### Updated Documentation
- **[Docker README](../deploy/docker/README.md)** - Added webhook sections
- API documentation with webhook examples
## 🛠️ Migration Guide
No migration needed! Webhooks are opt-in:
1. **To use webhooks**: Add `webhook_config` to your job payload
2. **To keep polling**: Continue using your existing code
### Quick Start
```python
# Just add webhook_config to your existing payload
payload = {
# Your existing configuration
"urls": ["https://example.com"],
"browser_config": {...},
"crawler_config": {...},
# NEW: Add webhook configuration
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True
}
}
```
## 🔧 Configuration
### Global Webhook Configuration (config.yml)
```yaml
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default" # Optional
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
headers:
User-Agent: "Crawl4AI-Webhook/1.0"
```
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest image
docker pull unclecode/crawl4ai:0.7.6
# Or use latest tag
docker pull unclecode/crawl4ai:latest
# Run with webhook support
docker run -d \
-p 11235:11235 \
--env-file .llm.env \
--name crawl4ai \
unclecode/crawl4ai:0.7.6
```
### Python Package
```bash
pip install --upgrade crawl4ai
```
## 💡 Pro Tips
1. **Use notification-only mode** for large results - fetch data separately to avoid large webhook payloads
2. **Set custom headers** for webhook authentication and request tracking
3. **Configure global default webhook** for consistent handling across all jobs
4. **Implement idempotent webhook handlers** - same webhook may be delivered multiple times on retry
5. **Use structured schemas** with LLM extraction for predictable webhook data
## 🎬 Demo
Try the release demo:
```bash
python docs/releases_review/demo_v0.7.6.py
```
This comprehensive demo showcases:
- Crawl job webhooks (notification-only and with data)
- LLM extraction webhooks (with JSON schema support)
- Custom headers for authentication
- Webhook retry mechanism
- Real-time webhook receiver
## 🙏 Acknowledgments
Thank you to the community for the feedback that shaped this feature! Special thanks to everyone who requested webhook support for asynchronous job processing.
## 📞 Support
- **Documentation**: https://docs.crawl4ai.com
- **GitHub Issues**: https://github.com/unclecode/crawl4ai/issues
- **Discord**: https://discord.gg/crawl4ai
---
**Happy crawling with webhooks!** 🕷️🪝
*- unclecode*

View File

@@ -0,0 +1,318 @@
# 🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update
*September 29, 2025 • 8 min read*
---
Today I'm releasing Crawl4AI v0.7.5—focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
## 🎯 What's New at a Glance
- **Docker Hooks System**: Custom Python functions at key pipeline points with function-based API
- **Function-Based Hooks**: New `hooks_to_string()` utility with Docker client auto-conversion
- **Enhanced LLM Integration**: Custom providers with temperature control
- **HTTPS Preservation**: Secure internal link handling
- **Bug Fixes**: Resolved multiple community-reported issues
- **Improved Docker Error Handling**: Better debugging and reliability
## 🔧 Docker Hooks System: Pipeline Customization
Every scraping project needs custom logic—authentication, performance optimization, content processing. Traditional solutions require forking or complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
### Real Example: Authentication & Performance
```python
import requests
# Real working hooks for httpbin.org
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("Hook: Setting up page context")
# Block images to speed up crawling
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
print("Hook: Images blocked")
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print("Hook: Before retrieving HTML")
# Scroll to bottom to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
print("Hook: Scrolled to bottom")
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f"Hook: About to navigate to {url}")
# Add custom headers
await page.set_extra_http_headers({
'X-Test-Header': 'crawl4ai-hooks-test'
})
return page
"""
}
# Test with Docker API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {
"code": hooks_config,
"timeout": 30
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()
if result.get('success'):
print("✅ Hooks executed successfully!")
print(f"Content length: {len(result.get('markdown', ''))} characters")
```
**Available Hook Points:**
- `on_browser_created`: Browser setup
- `on_page_context_created`: Page context configuration
- `before_goto`: Pre-navigation setup
- `after_goto`: Post-navigation processing
- `on_user_agent_updated`: User agent changes
- `on_execution_started`: Crawl initialization
- `before_retrieve_html`: Pre-extraction processing
- `before_return_html`: Final HTML processing
### Function-Based Hooks API
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
**Option 1: Using the `hooks_to_string()` Utility**
```python
from crawl4ai import hooks_to_string
import requests
# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
"""Block images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto(page, context, url, **kwargs):
"""Add custom headers"""
await page.set_extra_http_headers({
'X-Crawl4AI': 'v0.7.5',
'X-Custom-Header': 'my-value'
})
return page
# Convert functions to strings
hooks_code = hooks_to_string({
"on_page_context_created": on_page_context_created,
"before_goto": before_goto
})
# Use with REST API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
**Option 2: Docker Client with Automatic Conversion (Recommended!)**
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
async def before_retrieve_html(page, context, **kwargs):
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
# Use Docker client - conversion happens automatically!
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created,
"before_retrieve_html": before_retrieve_html
},
hooks_timeout=30
)
if results and results.success:
print(f"✅ Hooks executed! HTML length: {len(results.html)}")
```
**Benefits of Function-Based Hooks:**
- ✅ Full IDE support (autocomplete, syntax highlighting)
- ✅ Type checking and linting
- ✅ Easier to test and debug
- ✅ Reusable across projects
- ✅ Automatic conversion in Docker client
- ✅ No breaking changes - string hooks still work!
## 🤖 Enhanced LLM Integration
Enhanced LLM integration with custom providers, temperature control, and base URL configuration.
### Multi-Provider Support
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Test with different providers
async def test_llm_providers():
# OpenAI with custom temperature
openai_strategy = LLMExtractionStrategy(
provider="gemini/gemini-2.5-flash-lite",
api_token="your-api-token",
temperature=0.7, # New in v0.7.5
instruction="Summarize this page in one sentence"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=CrawlerRunConfig(extraction_strategy=openai_strategy)
)
if result.success:
print("✅ LLM extraction completed")
print(result.extracted_content)
# Docker API with enhanced LLM config
llm_payload = {
"url": "https://example.com",
"f": "llm",
"q": "Summarize this page in one sentence.",
"provider": "gemini/gemini-2.5-flash-lite",
"temperature": 0.7
}
response = requests.post("http://localhost:11235/md", json=llm_payload)
```
**New Features:**
- Custom `temperature` parameter for creativity control
- `base_url` for custom API endpoints
- Multi-provider environment variable support
- Docker API integration
## 🔒 HTTPS Preservation
**The Problem:** Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
**Solution:** HTTPS preservation maintains secure protocols throughout crawling.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
async def test_https_preservation():
# Enable HTTPS preservation
url_filter = URLPatternFilter(
patterns=["^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
)
config = CrawlerRunConfig(
exclude_external_links=True,
preserve_https_for_internal_links=True, # New in v0.7.5
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
filter_chain=FilterChain([url_filter])
)
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun(
url="https://quotes.toscrape.com",
config=config
):
# All internal links maintain HTTPS
internal_links = [link['href'] for link in result.links['internal']]
https_links = [link for link in internal_links if link.startswith('https://')]
print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
for link in https_links[:3]:
print(f"{link}")
```
## 🛠️ Bug Fixes and Improvements
### Major Fixes
- **URL Processing**: Fixed '+' sign preservation in query parameters (#1332)
- **Proxy Configuration**: Enhanced proxy string parsing (old `proxy` parameter deprecated)
- **Docker Error Handling**: Comprehensive error messages with status codes
- **Memory Management**: Fixed leaks in long-running sessions
- **JWT Authentication**: Fixed Docker JWT validation issues (#1442)
- **Playwright Stealth**: Fixed stealth features for Playwright integration (#1481)
- **API Configuration**: Fixed config handling to prevent overriding user-provided settings (#1505)
- **Docker Filter Serialization**: Resolved JSON encoding errors in deep crawl strategy (#1419)
- **LLM Provider Support**: Fixed custom LLM provider integration for adaptive crawler (#1291)
- **Performance Issues**: Resolved backoff strategy failures and timeout handling (#989)
### Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization
### Configuration Updates
```python
# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")
# New enhanced proxy config
browser_config = BrowserConfig(
proxy_config={
"server": "http://proxy:8080",
"username": "optional-user",
"password": "optional-pass"
}
)
```
## 🔄 Breaking Changes
1. **Python 3.10+ Required**: Upgrade from Python 3.9
2. **Proxy Parameter Deprecated**: Use new `proxy_config` structure
3. **New Dependency**: Added `cssselect` for better CSS handling
## 🚀 Get Started
```bash
# Install latest version
pip install crawl4ai==0.7.5
# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
```
**Try the Demo:**
```bash
# Run working examples
python docs/releases_review/demo_v0.7.5.py
```
**Resources:**
- 📖 Documentation: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- 🐙 GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 💬 Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- 🐦 Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️

View File

@@ -0,0 +1,626 @@
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
*November 14, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
## 🎯 What's New at a Glance
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
- **📈 Production Metrics**: 6 critical metrics for operational excellence
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
## 📊 Real-time Monitoring Dashboard: Complete Visibility
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
### The Self-Hosting Value Proposition
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
- **🔒 Data Privacy**: Your data never leaves your infrastructure
- **💰 Cost Control**: No per-request pricing or rate limits
- **🎯 Full Customization**: Complete control over configurations and strategies
- **📊 Complete Transparency**: Real-time visibility into every aspect
- **⚡ Performance**: Direct access without network overhead
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
### Interactive Monitoring Dashboard
Access the dashboard at `http://localhost:11235/dashboard` to see:
- **System Health Overview**: CPU, memory, network, and uptime in real-time
- **Live Request Tracking**: Active and completed requests with full details
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
- **Janitor Events Log**: Automatic cleanup activities
- **Error Monitoring**: Full context error logs
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
## 🔌 Monitor API: Programmatic Access
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
### System Health Endpoint
```python
import httpx
import asyncio
async def monitor_system_health():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/health")
health = response.json()
print(f"Container Metrics:")
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
print(f" Memory: {health['container']['memory_percent']:.1f}%")
print(f" Uptime: {health['container']['uptime_seconds']}s")
print(f"\nBrowser Pool:")
print(f" Permanent: {health['pool']['permanent']['active']} active")
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
print(f"\nStatistics:")
print(f" Total Requests: {health['stats']['total_requests']}")
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
asyncio.run(monitor_system_health())
```
### Request Tracking
```python
async def track_requests():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/requests")
requests_data = response.json()
print(f"Active Requests: {len(requests_data['active'])}")
print(f"Completed Requests: {len(requests_data['completed'])}")
# See details of recent requests
for req in requests_data['completed'][:5]:
status_icon = "" if req['success'] else ""
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
```
### Browser Pool Management
```python
async def monitor_browser_pool():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"Pool Summary:")
print(f" Total Browsers: {browsers['summary']['total_count']}")
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
# List all browsers
for browser in browsers['permanent']:
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
```
### Endpoint Performance Statistics
```python
async def get_endpoint_stats():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
stats = response.json()
print("Endpoint Analytics:")
for endpoint, data in stats.items():
print(f" {endpoint}:")
print(f" Requests: {data['count']}")
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
```
### Complete API Reference
The Monitor API includes these endpoints:
- `GET /monitor/health` - System health with pool statistics
- `GET /monitor/requests` - Active and completed request tracking
- `GET /monitor/browsers` - Browser pool details and efficiency
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
- `GET /monitor/logs/errors?limit=10` - Error logs with context
- `POST /monitor/actions/cleanup` - Force immediate cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset accumulated statistics
## ⚡ WebSocket Streaming: Real-time Updates
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
### WebSocket Integration Example
```python
import websockets
import json
import asyncio
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
print("Connected to real-time monitoring stream")
while True:
# Receive update every 2 seconds
data = await websocket.recv()
update = json.loads(data)
# Access all monitoring data
print(f"\n--- Update at {update['timestamp']} ---")
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
if update['errors']:
print(f"⚠️ Recent Errors: {len(update['errors'])}")
asyncio.run(monitor_realtime())
```
**Expected Real-World Impact:**
- **Custom Dashboards**: Build tailored monitoring UIs for your team
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
- **Integration**: Feed live data into monitoring tools like Grafana
- **Automation**: React to events in real-time without polling
## 🔥 Smart Browser Pool: 3-Tier Architecture
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
### How It Works
```python
import httpx
async def demonstrate_browser_pool():
async with httpx.AsyncClient() as client:
# Request 1-3: Default config → Uses permanent browser
print("Phase 1: Using permanent browser")
for i in range(3):
await client.post(
"http://localhost:11235/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"]}
)
print(f" Request {i+1}: Reused permanent browser")
# Request 4-6: Custom viewport → Cold pool (first use)
print("\nPhase 2: Custom config creates cold pool browser")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
"http://localhost:11235/crawl",
json={
"urls": [f"https://httpbin.org/json?v={i}"],
"browser_config": viewport_config
}
)
if i < 2:
print(f" Request {i+1}: Cold pool browser")
else:
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
# Check pool status
response = await client.get("http://localhost:11235/monitor/browsers")
browsers = response.json()
print(f"\nPool Status:")
print(f" Permanent: {len(browsers['permanent'])} (always active)")
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
print(f" Cold: {len(browsers['cold'])} (on-demand)")
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
asyncio.run(demonstrate_browser_pool())
```
**Pool Tiers:**
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
**Expected Real-World Impact:**
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
- **Performance**: Instant access to frequently-used configurations
- **Automatic Optimization**: Pool adapts to your usage patterns
- **Resource Management**: Janitor automatically cleans up idle browsers
## 🧹 Janitor System: Automatic Cleanup
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
```python
async def monitor_janitor_activity():
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
logs = response.json()
print("Recent Cleanup Activities:")
for log in logs:
print(f" {log['timestamp']}: {log['message']}")
# Example output:
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
```
## 🎮 Control Actions: Manual Management
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
**My Solution:** Manual control actions via the API for operational troubleshooting.
### Force Cleanup
```python
async def force_cleanup():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
result = response.json()
print(f"Cleanup completed:")
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
```
### Kill Specific Browser
```python
async def kill_stuck_browser(browser_id: str):
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11235/monitor/actions/kill_browser",
json={"browser_id": browser_id}
)
if response.status_code == 200:
print(f"✅ Browser {browser_id} killed successfully")
```
### Reset Statistics
```python
async def reset_stats():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/monitor/stats/reset")
print("📊 Statistics reset for fresh monitoring")
```
## 📈 Production Integration Patterns
### Prometheus Integration
```python
# Export metrics for Prometheus scraping
async def export_prometheus_metrics():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Export in Prometheus format
metrics = f"""
# HELP crawl4ai_memory_usage_percent Memory usage percentage
# TYPE crawl4ai_memory_usage_percent gauge
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
# HELP crawl4ai_request_success_rate Request success rate
# TYPE crawl4ai_request_success_rate gauge
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
# HELP crawl4ai_browser_pool_count Total browsers in pool
# TYPE crawl4ai_browser_pool_count gauge
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
"""
return metrics
```
### Alerting Example
```python
async def check_alerts():
async with httpx.AsyncClient() as client:
health = await client.get("http://localhost:11235/monitor/health")
data = health.json()
# Memory alert
if data['container']['memory_percent'] > 80:
print("🚨 ALERT: Memory usage above 80%")
# Trigger cleanup
await client.post("http://localhost:11235/monitor/actions/cleanup")
# Success rate alert
if data['stats']['success_rate_percent'] < 90:
print("🚨 ALERT: Success rate below 90%")
# Check error logs
errors = await client.get("http://localhost:11235/monitor/logs/errors")
print(f"Recent errors: {len(errors.json())}")
# Latency alert
if data['stats']['avg_latency_ms'] > 5000:
print("🚨 ALERT: Average latency above 5s")
```
### Key Metrics to Track
```python
CRITICAL_METRICS = {
"memory_usage": {
"current": "container.memory_percent",
"target": "<80%",
"alert_threshold": ">80%",
"action": "Force cleanup or scale"
},
"success_rate": {
"current": "stats.success_rate_percent",
"target": ">95%",
"alert_threshold": "<90%",
"action": "Check error logs"
},
"avg_latency": {
"current": "stats.avg_latency_ms",
"target": "<2000ms",
"alert_threshold": ">5000ms",
"action": "Investigate slow requests"
},
"browser_reuse_rate": {
"current": "browsers.summary.reuse_rate_percent",
"target": ">80%",
"alert_threshold": "<60%",
"action": "Check pool configuration"
},
"total_browsers": {
"current": "browsers.summary.total_count",
"target": "<15",
"alert_threshold": ">20",
"action": "Check for browser leaks"
},
"error_frequency": {
"current": "len(errors)",
"target": "<5/hour",
"alert_threshold": ">10/hour",
"action": "Review error patterns"
}
}
```
## 🐛 Critical Bug Fixes
This release includes significant bug fixes that improve stability and performance:
### Async LLM Extraction (#1590)
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
```python
# Before v0.7.7: Sequential processing
# After v0.7.7: True parallel processing
async with AsyncWebCrawler() as crawler:
urls = ["url1", "url2", "url3", "url4"]
# Now processes truly in parallel with LLM extraction
results = await crawler.arun_many(
urls,
config=CrawlerRunConfig(
extraction_strategy=LLMExtractionStrategy(...)
)
)
# 4x faster for parallel LLM extraction!
```
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
### DFS Deep Crawling (#1607)
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
### Browser & Crawler Config Documentation (#1609)
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
### Sitemap Seeder (#1598)
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
### Remove Overlay Elements (#1529)
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
**The Fix:** Fixed by properly calling the injected JavaScript function.
### Viewport Configuration (#1495)
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
**The Fix:** Added proper viewport size configuration support for browser launch.
### Managed Browser CDP Timing (#1528)
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
### Security Updates
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
- Added verification tests for the security update
### Docker Fixes
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
### Other Important Fixes
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
## 📊 Expected Real-World Impact
### For DevOps & Infrastructure Teams
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
- **Proactive Monitoring**: Catch issues before they become problems
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
- **Operational Control**: Manual intervention when automated systems need help
### For Production Deployments
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
- **Debugging**: Real-time logs and error tracking
- **Capacity Planning**: Historical metrics for scaling decisions
- **SLA Monitoring**: Track success rates and latency against targets
### For Development Teams
- **Local Monitoring**: Understand crawler behavior during development
- **Performance Testing**: Measure impact of configuration changes
- **Troubleshooting**: Quickly identify and fix issues
- **Learning**: See exactly how the browser pool works
## 🔄 Breaking Changes
**None!** This release is fully backward compatible.
- All existing Docker configurations continue to work
- No API changes to existing endpoints
- Monitoring is additive functionality
- No migration required
## 🚀 Upgrade Instructions
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.7
# Or use the latest tag
docker pull unclecode/crawl4ai:latest
# Run with monitoring enabled (default)
docker run -d \
-p 11235:11235 \
--shm-size=1g \
--name crawl4ai \
unclecode/crawl4ai:0.7.7
# Access the monitoring dashboard
open http://localhost:11235/dashboard
```
### Python Package
```bash
# Upgrade to latest version
pip install --upgrade crawl4ai
# Or install specific version
pip install crawl4ai==0.7.7
```
## 🎬 Try the Demo
Run the comprehensive demo that showcases all monitoring features:
```bash
python docs/releases_review/demo_v0.7.7.py
```
**The demo includes:**
1. System health overview with live metrics
2. Request tracking with active/completed monitoring
3. Browser pool management (permanent/hot/cold)
4. Complete Monitor API endpoint examples
5. WebSocket streaming demonstration
6. Control actions (cleanup, kill, restart)
7. Production metrics and alerting patterns
8. Self-hosting value proposition
## 📚 Documentation
### New Documentation
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
### Updated Documentation
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
- Added comprehensive monitoring sections
- Production integration patterns
- WebSocket streaming examples
## 💡 Pro Tips
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
7. **Check janitor logs** - Understand automatic cleanup patterns
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
## 🙏 Acknowledgments
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
## 📞 Support & Resources
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
---
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
**Happy crawling with full visibility!** 🕷️📊
*- unclecode*

View File

@@ -0,0 +1,327 @@
# Crawl4AI v0.7.8: Stability & Bug Fix Release
*December 2025*
---
I'm releasing Crawl4AI v0.7.8—a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
## What's Fixed at a Glance
- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of hardcoded mock data
## Bug Fixes
### Docker & API Fixes
#### ContentRelevanceFilter Deserialization (#1642)
**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.
**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.
```python
# This now works correctly in Docker API
import httpx
request = {
"urls": ["https://docs.example.com"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 2,
"filter_chain": [
{
"type": "ContentRelevanceFilter",
"query": "API documentation",
"threshold": 0.3
}
]
}
}
}
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:11235/crawl", json=request)
# Previously failed, now works!
```
#### ProxyConfig JSON Serialization (#1629)
**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.
```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="user",
password="pass"
)
config = BrowserConfig(headless=True, proxy_config=proxy)
# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict) # Valid JSON
```
#### Docker Cache Folder Permissions (#1638)
**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.
**The Fix:** Corrected ownership and permissions during image build.
```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
--shm-size=1g \
-v ./my-cache:/app/.cache \
unclecode/crawl4ai:0.7.8
```
---
### LLM & Extraction Fixes
#### Configurable Rate Limiter Backoff (#1269)
**The Problem:** The LLM rate limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.
**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.
```python
from crawl4ai import LLMConfig
# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2
# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5, # Wait 5 seconds on first retry
backoff_max_attempts=5, # Try up to 5 times
backoff_exponential_factor=3 # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```
#### LLM Strategy HTML Input Support (#1178)
**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with HTML structure preserved.
**The Fix:** Added `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.
```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract product information"
)
# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Extract the data table preserving structure",
input_format="html"
)
# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
instruction="Summarize the main content",
input_format="fit_markdown"
)
```
#### Raw HTML URL Variable (#1116)
**The Problem:** When using `url="raw:<html>..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.
**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
html = "<html><body><h1>Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html}",
config=CrawlerRunConfig(extraction_strategy=my_strategy)
)
# extraction_strategy receives url="Raw HTML" instead of the HTML blob
```
---
### URL Handling Fix
#### Relative URLs After Redirects (#1268)
**The Problem:** When JavaScript caused a page redirect, relative links were resolved against the original URL instead of the final URL.
**The Fix:** `redirected_url` now captures the actual page URL after all JavaScript execution completes.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# Page at /old-page redirects via JS to /new-page
result = await crawler.arun(url="https://example.com/old-page")
# BEFORE: redirected_url = "https://example.com/old-page"
# AFTER: redirected_url = "https://example.com/new-page"
# Links are now correctly resolved against the final URL
for link in result.links['internal']:
print(link['href']) # Relative links resolved correctly
```
---
### Dependency & Compatibility Fixes
#### PyPDF2 Replaced with pypdf (#1412)
**The Problem:** PyPDF2 was deprecated in 2022 and is no longer maintained.
**The Fix:** Replaced with the actively maintained `pypdf` library.
```python
# Installation (unchanged)
pip install crawl4ai[pdf]
# The PDF processor now uses pypdf internally
# No code changes required - API remains the same
```
#### Pydantic v2 ConfigDict Compatibility (#678)
**The Problem:** Using the deprecated `class Config` syntax caused deprecation warnings with Pydantic v2.
**The Fix:** Migrated to `model_config = ConfigDict(...)` syntax.
```python
# No more deprecation warnings when importing crawl4ai models
from crawl4ai.models import CrawlResult
from crawl4ai import CrawlerRunConfig, BrowserConfig
# All models are now Pydantic v2 compatible
```
---
### AdaptiveCrawler Fix
#### Query Expansion Using LLM (#1621)
**The Problem:** The `EmbeddingStrategy` in AdaptiveCrawler had commented-out LLM code and was using hardcoded mock query variations instead.
**The Fix:** Uncommented and activated the LLM call for actual query expansion.
```python
# AdaptiveCrawler query expansion now actually uses the LLM
# Instead of hardcoded variations like:
# variations = {'queries': ['what are the best vegetables...']}
# The LLM generates relevant query variations based on your actual query
```
---
### Code Formatting Fix
#### Import Statement Formatting (#1181)
**The Problem:** When extracting code from web pages, import statements were sometimes concatenated without proper line separation.
**The Fix:** Import statements now maintain proper newline separation.
```python
# BEFORE: "import osimport sysfrom pathlib import Path"
# AFTER:
# import os
# import sys
# from pathlib import Path
```
---
## Breaking Changes
**None!** This release is fully backward compatible.
- All existing code continues to work without modification
- New parameters have sensible defaults matching previous behavior
- No API changes to existing functionality
---
## Upgrade Instructions
### Python Package
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.7.8
```
### Docker
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.8
# Run
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
```
---
## Verification
Run the verification tests to confirm all fixes are working:
```bash
python docs/releases_review/demo_v0.7.8.py
```
This runs actual tests that verify each bug fix is properly implemented.
---
## Acknowledgments
Thank you to everyone who reported these issues and provided detailed reproduction steps. Your bug reports make Crawl4AI better for everyone.
Issues fixed: #1642, #1638, #1629, #1621, #1412, #1269, #1268, #1181, #1178, #1116, #678
---
## Support & Resources
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- **Twitter**: [@unclecode](https://x.com/unclecode)
---
**This stability release ensures Crawl4AI works reliably across Docker deployments, LLM extraction workflows, and various edge cases. Thank you for your continued support and feedback!**
**Happy crawling!**
*- unclecode*

File diff suppressed because it is too large Load Diff

View File

@@ -17,6 +17,11 @@ class BrowserConfig:
def __init__(
browser_type="chromium",
headless=True,
browser_mode="dedicated",
use_managed_browser=False,
cdp_url=None,
debugging_port=9222,
host="localhost",
proxy_config=None,
viewport_width=1080,
viewport_height=600,
@@ -25,7 +30,13 @@ class BrowserConfig:
user_data_dir=None,
cookies=None,
headers=None,
user_agent=None,
user_agent=(
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
# "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
),
user_agent_mode="",
text_mode=False,
light_mode=False,
extra_args=None,
@@ -37,17 +48,33 @@ class BrowserConfig:
### Key Fields to Note
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
1.**`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
2.**`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
- A dictionary with fields like:
3.**`browser_mode`**
- Determines how the browser should be initialized:
- `"dedicated"` (default): Creates a new browser instance each time
- `"builtin"`: Uses the builtin CDP browser running in background
- `"custom"`: Uses explicit CDP settings provided in `cdp_url`
- `"docker"`: Runs browser in Docker container with isolation
4.**`use_managed_browser`** & **`cdp_url`**
- `use_managed_browser=True`: Launch browser using Chrome DevTools Protocol (CDP) for advanced control
- `cdp_url`: URL for CDP endpoint (e.g., `"ws://localhost:9222/devtools/browser/"`)
- Automatically set based on `browser_mode`
5.**`debugging_port`** & **`host`**
- `debugging_port`: Port for browser debugging protocol (default: 9222)
- `host`: Host for browser connection (default: "localhost")
6.**`proxy_config`**
- A `ProxyConfig` object or dictionary with fields like:
```json
{
"server": "http://proxy.example.com:8080",
@@ -57,35 +84,35 @@ class BrowserConfig:
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:
7.**`viewport_width` & `viewport_height`**
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
8.**`verbose`**
- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:
9.**`use_persistent_context`**
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
- If you want to start with specific cookies or add universal HTTP headers, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
10.**`cookies`** & **`headers`**
- If you want to start with specific cookies or add universal HTTP headers to the browser context, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
11.**`user_agent`** & **`user_agent_mode`**
- `user_agent`: Custom User-Agent string. If `None`, a default is used.
- `user_agent_mode`: Set to `"random"` for randomization (helps fight bot detection).
9. **`text_mode`** & **`light_mode`**:
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
12.**`text_mode`** & **`light_mode`**
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
13.**`extra_args`**
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
11. **`enable_stealth`**:
14.**`enable_stealth`**
- If `True`, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is `False`. Recommended for sites with bot protection.
@@ -134,9 +161,11 @@ class CrawlerRunConfig:
def __init__(
word_count_threshold=200,
extraction_strategy=None,
chunking_strategy=RegexChunking(),
markdown_generator=None,
cache_mode=None,
cache_mode=CacheMode.BYPASS,
js_code=None,
c4a_script=None,
wait_for=None,
screenshot=False,
pdf=False,
@@ -145,13 +174,18 @@ class CrawlerRunConfig:
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode=None,
# Proxy Configuration
proxy_config=None,
proxy_rotation_strategy=None,
# Page Interaction Parameters
scan_full_page=False,
scroll_delay=0.2,
wait_until="domcontentloaded",
page_timeout=60000,
delay_before_return_html=0.1,
# URL Matching Parameters
url_matcher=None, # For URL-specific configurations
match_mode=MatchMode.OR,
verbose=True,
stream=False, # Enable streaming for arun_many()
# ... other advanced parameters omitted
@@ -161,69 +195,68 @@ class CrawlerRunConfig:
### Key Fields to Note
1. **`word_count_threshold`**:
1.**`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
2.**`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
3.**`chunking_strategy`**:
- Strategy to chunk content before extraction.
- Defaults to `RegexChunking()`. Can be customized for different chunking approaches.
4.**`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
5.**`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
- Defaults to `CacheMode.BYPASS`.
5. **`js_code`**:
- A string or list of JS strings to execute.
6.**`js_code`** & **`c4a_script`**:
- `js_code`: A string or list of JavaScript strings to execute.
- `c4a_script`: C4A script that compiles to JavaScript.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
7.**`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
8.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **Location Parameters**:
9.**Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
10.**Proxy Configuration**:
- **`proxy_config`**: Proxy server configuration (ProxyConfig object or dict) e.g. {"server": "...", "username": "...", "password"}
- **`proxy_rotation_strategy`**: Strategy for rotating proxies during crawls
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
11.**Page Interaction Parameters**:
- **`scan_full_page`**: If `True`, scroll through the entire page to load all content
- **`wait_until`**: Condition to wait for when navigating (e.g., "domcontentloaded", "networkidle")
- **`page_timeout`**: Timeout in milliseconds for page operations (default: 60000)
- **`delay_before_return_html`**: Delay in seconds before retrieving final HTML.
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
14. **`url_matcher`** & **`match_mode`**:
12.**`url_matcher`** & **`match_mode`**:
- Enable URL-specific configurations when used with `arun_many()`.
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
- See [URL-Specific Configurations](../api/arun_many.md#url-specific-configurations) for examples.
15. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
13.**`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
14.**`stream`**:
- If `True`, enables streaming mode for `arun_many()` to process URLs as they complete.
- Allows handling results incrementally instead of waiting for all URLs to finish.
### Helper Methods
@@ -263,20 +296,32 @@ The `clone()` method:
### Key fields to note
1. **`provider`**:
1.**`provider`**:
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
2.**`api_token`**:
- Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
3. **`base_url`**:
3.**`base_url`**:
- If your provider has a custom endpoint
4.**Retry/backoff controls** *(optional)*:
- `backoff_base_delay` *(default `2` seconds)* base delay inserted before the first retry when the provider returns a rate-limit response.
- `backoff_max_attempts` *(default `3`)* total number of attempts (initial call plus retries) before the request is surfaced as an error.
- `backoff_exponential_factor` *(default `2`)* growth rate for the retry delay (`delay = base_delay * factor^attempt`).
- These values are forwarded to the shared `perform_completion_with_backoff` helper, ensuring every strategy that consumes your `LLMConfig` honors the same throttling policy.
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
backoff_base_delay=1, # optional
backoff_max_attempts=5, # optional
backoff_exponential_factor=3, #optional
)
```
## 4. Putting It All Together

View File

@@ -69,12 +69,12 @@ The tutorial includes a Flask-based web interface with:
cd docs/examples/c4a_script/tutorial/
# Install dependencies
pip install flask
pip install -r requirements.txt
# Launch the tutorial server
python app.py
python server.py
# Open http://localhost:5000 in your browser
# Open http://localhost:8000 in your browser
```
## Core Concepts
@@ -111,8 +111,8 @@ CLICK `.submit-btn`
# By attribute
CLICK `button[type="submit"]`
# By text content
CLICK `button:contains("Sign In")`
# By accessible attributes
CLICK `button[aria-label="Search"][title="Search"]`
# Complex selectors
CLICK `.form-container input[name="email"]`

View File

@@ -11,6 +11,12 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
## Proxies
| Example | Description | Link |
|----------|--------------|------|
| **NSTProxy** | [NSTProxy](https://www.nstproxy.com/?utm_source=crawl4ai) Seamlessly integrates with crawl4ai — no setup required. Access high-performance residential, datacenter, ISP, and IPv6 proxies with smart rotation and anti-blocking technology. Starts from $0.1/GB. Use code crawl4ai for 10% off. | [View Code](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/proxy) |
## Browser & Crawling Features
| Example | Description | Link |
@@ -56,13 +62,14 @@ This page provides a comprehensive list of example scripts that demonstrate vari
## Anti-Bot & Stealth Features
| Example | Description | Link |
|---------|-------------|------|
| Stealth Mode Quick Start | Five practical examples showing how to use stealth mode for bypassing basic bot detection. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_quick_start.py) |
| Example | Description | Link |
|----------------------------|-------------|------|
| Stealth Mode Quick Start | Five practical examples showing how to use stealth mode for bypassing basic bot detection. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_quick_start.py) |
| Stealth Mode Comprehensive | Comprehensive demonstration of stealth mode features with bot detection testing and comparisons. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_example.py) |
| Undetected Browser | Simple example showing how to use the undetected browser adapter. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world_undetected.py) |
| Undetected Browser Demo | Basic demo comparing regular and undetected browser modes. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/undetected_simple_demo.py) |
| Undetected Tests | Advanced tests comparing regular vs undetected browsers on various bot detection services. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/undetectability/) |
| Undetected Browser | Simple example showing how to use the undetected browser adapter. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world_undetected.py) |
| Undetected Browser Demo | Basic demo comparing regular and undetected browser modes. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/undetected_simple_demo.py) |
| Undetected Tests | Advanced tests comparing regular vs undetected browsers on various bot detection services. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/undetectability/) |
| CapSolver Captcha Solver | Seamlessly integrate with [CapSolver](https://www.capsolver.com/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration) to automatically solve reCAPTCHA v2/v3, Cloudflare Turnstile / Challenges, AWS WAF and more for uninterrupted scraping and automation. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/capsolver_captcha_solver/) |
## Customization & Security

View File

@@ -22,18 +22,6 @@ When you self-host, you can scale from a single container to a full browser infr
- [Option 1: Using Pre-built Docker Hub Images (Recommended)](#option-1-using-pre-built-docker-hub-images-recommended)
- [Option 2: Using Docker Compose](#option-2-using-docker-compose)
- [Option 3: Manual Local Build & Run](#option-3-manual-local-build--run)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [Library Context Endpoint](#library-context-endpoint)
- [MCP (Model Context Protocol) Support](#mcp-model-context-protocol-support)
- [What is MCP?](#what-is-mcp)
- [Connecting via MCP](#connecting-via-mcp)
@@ -79,13 +67,13 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release is `0.7.3`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> 💡 **Note**: The `latest` tag points to the stable `0.7.3` version.
> 💡 **Note**: The `latest` tag points to the stable `0.7.6` version.
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.3
docker pull unclecode/crawl4ai:0.7.6
# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest
@@ -157,7 +145,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.3`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.6`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
@@ -853,6 +841,733 @@ else:
> 💡 **Remember**: Always test your hooks on safe, known websites first before using them on production sites. Never crawl sites that you don't have permission to access or that might be malicious.
### Hooks Utility: Function-Based Approach (Python)
For Python developers, Crawl4AI provides a more convenient way to work with hooks using the `hooks_to_string()` utility function and Docker client integration.
#### Why Use Function-Based Hooks?
**String-Based Approach (shown above)**:
```python
hooks_code = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
"""
}
```
**Function-Based Approach (recommended for Python)**:
```python
from crawl4ai import Crawl4aiDockerClient
async def my_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
result = await client.crawl(
["https://example.com"],
hooks={"on_page_context_created": my_hook}
)
```
**Benefits**:
- ✅ Write hooks as regular Python functions
- ✅ Full IDE support (autocomplete, syntax highlighting, type checking)
- ✅ Easy to test and debug
- ✅ Reusable hook libraries
- ✅ Automatic conversion to API format
#### Using the Hooks Utility
The `hooks_to_string()` utility converts Python function objects to the string format required by the API:
```python
from crawl4ai import hooks_to_string
# Define your hooks as functions
async def setup_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.add_cookies([{
"name": "session",
"value": "token",
"domain": ".example.com"
}])
return page
async def scroll_hook(page, context, **kwargs):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
return page
# Convert to string format
hooks_dict = {
"on_page_context_created": setup_hook,
"before_retrieve_html": scroll_hook
}
hooks_string = hooks_to_string(hooks_dict)
# Now use with REST API or Docker client
# hooks_string contains the string representations
```
#### Docker Client with Automatic Conversion
The Docker client automatically detects and converts function objects:
```python
from crawl4ai import Crawl4aiDockerClient
async def auth_hook(page, context, **kwargs):
"""Add authentication cookies"""
await context.add_cookies([{
"name": "auth_token",
"value": "your_token",
"domain": ".example.com"
}])
return page
async def performance_hook(page, context, **kwargs):
"""Block unnecessary resources"""
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
await context.route("**/analytics/*", lambda r: r.abort())
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
# Pass functions directly - automatic conversion!
result = await client.crawl(
["https://example.com"],
hooks={
"on_page_context_created": performance_hook,
"before_goto": auth_hook
},
hooks_timeout=30 # Optional timeout in seconds (1-120)
)
print(f"Success: {result.success}")
print(f"HTML: {len(result.html)} chars")
```
#### Creating Reusable Hook Libraries
Build collections of reusable hooks:
```python
# hooks_library.py
class CrawlHooks:
"""Reusable hook collection for common crawling tasks"""
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda r: r.abort())
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics and tracking scripts"""
tracking_domains = [
"**/google-analytics.com/*",
"**/googletagmanager.com/*",
"**/facebook.com/tr/*",
"**/doubleclick.net/*"
]
for domain in tracking_domains:
await context.route(domain, lambda r: r.abort())
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll to load more content"""
previous_height = 0
for i in range(5): # Max 5 scrolls
current_height = await page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
previous_height = current_height
return page
@staticmethod
async def wait_for_dynamic_content(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
await page.wait_for_timeout(2000)
try:
# Click "Load More" if present
load_more = await page.query_selector('[class*="load-more"]')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
except:
pass
return page
# Use in your application
from hooks_library import CrawlHooks
from crawl4ai import Crawl4aiDockerClient
async def crawl_with_optimizations(url):
async with Crawl4aiDockerClient() as client:
result = await client.crawl(
[url],
hooks={
"on_page_context_created": CrawlHooks.block_images,
"before_retrieve_html": CrawlHooks.scroll_infinite
}
)
return result
```
#### Choosing the Right Approach
| Approach | Best For | IDE Support | Language |
|----------|----------|-------------|----------|
| **String-based** | Non-Python clients, REST APIs, other languages | ❌ None | Any |
| **Function-based** | Python applications, local development | ✅ Full | Python only |
| **Docker Client** | Python apps with automatic conversion | ✅ Full | Python only |
**Recommendation**:
- **Python applications**: Use Docker client with function objects (easiest)
- **Non-Python or REST API**: Use string-based hooks (most flexible)
- **Manual control**: Use `hooks_to_string()` utility (middle ground)
#### Complete Example with Function Hooks
```python
from crawl4ai import Crawl4aiDockerClient, BrowserConfig, CrawlerRunConfig, CacheMode
# Define hooks as regular Python functions
async def setup_environment(page, context, **kwargs):
"""Setup crawling environment"""
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Block resources for speed
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
# Add custom headers
await page.set_extra_http_headers({
"Accept-Language": "en-US",
"X-Custom-Header": "Crawl4AI"
})
print("[HOOK] Environment configured")
return page
async def extract_content(page, context, **kwargs):
"""Extract and prepare content"""
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
# Extract metadata
metadata = await page.evaluate('''() => ({
title: document.title,
links: document.links.length,
images: document.images.length
})''')
print(f"[HOOK] Page metadata: {metadata}")
return page
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
# Configure crawl
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# Crawl with hooks
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=browser_config,
crawler_config=crawler_config,
hooks={
"on_page_context_created": setup_environment,
"before_retrieve_html": extract_content
},
hooks_timeout=30
)
if result.success:
print(f"✅ Crawl successful!")
print(f" URL: {result.url}")
print(f" HTML: {len(result.html)} chars")
print(f" Markdown: {len(result.markdown)} chars")
else:
print(f"❌ Crawl failed: {result.error_message}")
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
#### Additional Resources
- **Comprehensive Examples**: See `/docs/examples/hooks_docker_client_example.py` for Python function-based examples
- **REST API Examples**: See `/docs/examples/hooks_rest_api_example.py` for string-based examples
- **Comparison Guide**: See `/docs/examples/README_HOOKS.md` for detailed comparison
- **Utility Documentation**: See `/docs/hooks-utility-guide.md` for complete guide
---
## Job Queue & Webhook API
The Docker deployment includes a powerful asynchronous job queue system with webhook support for both crawling and LLM extraction tasks. Instead of waiting for long-running operations to complete, submit jobs and receive real-time notifications via webhooks when they finish.
### Why Use the Job Queue API?
**Traditional Synchronous API (`/crawl`):**
- Client waits for entire crawl to complete
- Timeout issues with long-running crawls
- Resource blocking during execution
- Constant polling required for status updates
**Asynchronous Job Queue API (`/crawl/job`, `/llm/job`):**
- ✅ Submit job and continue immediately
- ✅ No timeout concerns for long operations
- ✅ Real-time webhook notifications on completion
- ✅ Better resource utilization
- ✅ Perfect for batch processing
- ✅ Ideal for microservice architectures
### Available Endpoints
#### 1. Crawl Job Endpoint
```
POST /crawl/job
```
Submit an asynchronous crawl job with optional webhook notification.
**Request Body:**
```json
{
"urls": ["https://example.com"],
"cache_mode": "bypass",
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"schema": {
"title": "h1",
"content": ".article-body"
}
},
"webhook_config": {
"webhook_url": "https://your-app.com/webhook/crawl-complete",
"webhook_data_in_payload": true,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token",
"X-Custom-Header": "value"
}
}
}
```
**Response:**
```json
{
"task_id": "crawl_1698765432",
"message": "Crawl job submitted"
}
```
#### 2. LLM Extraction Job Endpoint
```
POST /llm/job
```
Submit an asynchronous LLM extraction job with optional webhook notification.
**Request Body:**
```json
{
"url": "https://example.com/article",
"q": "Extract the article title, author, publication date, and main points",
"provider": "openai/gpt-4o-mini",
"schema": "{\"title\": \"string\", \"author\": \"string\", \"date\": \"string\", \"points\": [\"string\"]}",
"cache": false,
"webhook_config": {
"webhook_url": "https://your-app.com/webhook/llm-complete",
"webhook_data_in_payload": true,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}
```
**Response:**
```json
{
"task_id": "llm_1698765432",
"message": "LLM job submitted"
}
```
#### 3. Job Status Endpoint
```
GET /job/{task_id}
```
Check the status and retrieve results of a submitted job.
**Response (In Progress):**
```json
{
"task_id": "crawl_1698765432",
"status": "processing",
"message": "Job is being processed"
}
```
**Response (Completed):**
```json
{
"task_id": "crawl_1698765432",
"status": "completed",
"result": {
"markdown": "# Page Title\n\nContent...",
"extracted_content": {...},
"links": {...}
}
}
```
### Webhook Configuration
Webhooks provide real-time notifications when your jobs complete, eliminating the need for constant polling.
#### Webhook Config Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `webhook_url` | string | Yes | Your HTTP(S) endpoint to receive notifications |
| `webhook_data_in_payload` | boolean | No | Include full result data in webhook payload (default: false) |
| `webhook_headers` | object | No | Custom headers for authentication/identification |
#### Webhook Payload Format
**Success Notification (Crawl Job):**
```json
{
"task_id": "crawl_1698765432",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-22T12:30:00.000000+00:00",
"urls": ["https://example.com"],
"data": {
"markdown": "# Page content...",
"extracted_content": {...},
"links": {...}
}
}
```
**Success Notification (LLM Job):**
```json
{
"task_id": "llm_1698765432",
"task_type": "llm_extraction",
"status": "completed",
"timestamp": "2025-10-22T12:30:00.000000+00:00",
"urls": ["https://example.com/article"],
"data": {
"extracted_content": {
"title": "Understanding Web Scraping",
"author": "John Doe",
"date": "2025-10-22",
"points": ["Point 1", "Point 2"]
}
}
}
```
**Failure Notification:**
```json
{
"task_id": "crawl_1698765432",
"task_type": "crawl",
"status": "failed",
"timestamp": "2025-10-22T12:30:00.000000+00:00",
"urls": ["https://example.com"],
"error": "Connection timeout after 30 seconds"
}
```
#### Webhook Delivery & Retry
- **Delivery Method:** HTTP POST to your `webhook_url`
- **Content-Type:** `application/json`
- **Retry Policy:** Exponential backoff with 5 attempts
- Attempt 1: Immediate
- Attempt 2: 1 second delay
- Attempt 3: 2 seconds delay
- Attempt 4: 4 seconds delay
- Attempt 5: 8 seconds delay
- **Success Status Codes:** 200-299
- **Custom Headers:** Your `webhook_headers` are included in every request
### Usage Examples
#### Example 1: Python with Webhook Handler (Flask)
```python
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
# Webhook handler
@app.route('/webhook/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
payload = request.json
if payload['status'] == 'completed':
print(f"✅ Job {payload['task_id']} completed!")
print(f"Task type: {payload['task_type']}")
# Access the crawl results
if 'data' in payload:
markdown = payload['data'].get('markdown', '')
extracted = payload['data'].get('extracted_content', {})
print(f"Extracted {len(markdown)} characters")
print(f"Structured data: {extracted}")
else:
print(f"❌ Job {payload['task_id']} failed: {payload.get('error')}")
return jsonify({"status": "received"}), 200
# Submit a crawl job with webhook
def submit_crawl_job():
response = requests.post(
"http://localhost:11235/crawl/job",
json={
"urls": ["https://example.com"],
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"schema": {
"name": "Example Schema",
"baseSelector": "body",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "description", "selector": "meta[name='description']", "type": "attribute", "attribute": "content"}
]
}
},
"webhook_config": {
"webhook_url": "https://your-app.com/webhook/crawl-complete",
"webhook_data_in_payload": True,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}
)
task_id = response.json()['task_id']
print(f"Job submitted: {task_id}")
return task_id
if __name__ == '__main__':
app.run(port=5000)
```
#### Example 2: LLM Extraction with Webhooks
```python
import requests
def submit_llm_job_with_webhook():
response = requests.post(
"http://localhost:11235/llm/job",
json={
"url": "https://example.com/article",
"q": "Extract the article title, author, and main points",
"provider": "openai/gpt-4o-mini",
"webhook_config": {
"webhook_url": "https://your-app.com/webhook/llm-complete",
"webhook_data_in_payload": True,
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
}
}
)
task_id = response.json()['task_id']
print(f"LLM job submitted: {task_id}")
return task_id
# Webhook handler for LLM jobs
@app.route('/webhook/llm-complete', methods=['POST'])
def handle_llm_webhook():
payload = request.json
if payload['status'] == 'completed':
extracted = payload['data']['extracted_content']
print(f"✅ LLM extraction completed!")
print(f"Results: {extracted}")
else:
print(f"❌ LLM extraction failed: {payload.get('error')}")
return jsonify({"status": "received"}), 200
```
#### Example 3: Without Webhooks (Polling)
If you don't use webhooks, you can poll for results:
```python
import requests
import time
# Submit job
response = requests.post(
"http://localhost:11235/crawl/job",
json={"urls": ["https://example.com"]}
)
task_id = response.json()['task_id']
# Poll for results
while True:
result = requests.get(f"http://localhost:11235/job/{task_id}")
data = result.json()
if data['status'] == 'completed':
print("Job completed!")
print(data['result'])
break
elif data['status'] == 'failed':
print(f"Job failed: {data.get('error')}")
break
print("Still processing...")
time.sleep(2)
```
#### Example 4: Global Webhook Configuration
Set a default webhook URL in your `config.yml` to avoid repeating it in every request:
```yaml
# config.yml
api:
crawler:
# ... other settings ...
webhook:
default_url: "https://your-app.com/webhook/default"
default_headers:
X-Webhook-Secret: "your-secret-token"
```
Then submit jobs without webhook config:
```python
# Uses the global webhook configuration
response = requests.post(
"http://localhost:11235/crawl/job",
json={"urls": ["https://example.com"]}
)
```
### Webhook Best Practices
1. **Authentication:** Always use custom headers for webhook authentication
```json
"webhook_headers": {
"X-Webhook-Secret": "your-secret-token"
}
```
2. **Idempotency:** Design your webhook handler to be idempotent (safe to receive duplicate notifications)
3. **Fast Response:** Return HTTP 200 quickly; process data asynchronously if needed
```python
@app.route('/webhook', methods=['POST'])
def webhook():
payload = request.json
# Queue for background processing
queue.enqueue(process_webhook, payload)
return jsonify({"status": "received"}), 200
```
4. **Error Handling:** Handle both success and failure notifications
```python
if payload['status'] == 'completed':
# Process success
elif payload['status'] == 'failed':
# Log error, retry, or alert
```
5. **Validation:** Verify webhook authenticity using custom headers
```python
secret = request.headers.get('X-Webhook-Secret')
if secret != os.environ['EXPECTED_SECRET']:
return jsonify({"error": "Unauthorized"}), 401
```
6. **Logging:** Log webhook deliveries for debugging
```python
logger.info(f"Webhook received: {payload['task_id']} - {payload['status']}")
```
### Use Cases
**1. Batch Processing**
Submit hundreds of URLs and get notified as each completes:
```python
urls = ["https://site1.com", "https://site2.com", ...]
for url in urls:
submit_crawl_job(url, webhook_url="https://app.com/webhook")
```
**2. Microservice Integration**
Integrate with event-driven architectures:
```python
# Service A submits job
task_id = submit_crawl_job(url)
# Service B receives webhook and triggers next step
@app.route('/webhook')
def webhook():
process_result(request.json)
trigger_next_service()
return "OK", 200
```
**3. Long-Running Extractions**
Handle complex LLM extractions without timeouts:
```python
submit_llm_job(
url="https://long-article.com",
q="Comprehensive summary with key points and analysis",
webhook_url="https://app.com/webhook/llm"
)
```
### Troubleshooting
**Webhook not receiving notifications?**
- Check your webhook URL is publicly accessible
- Verify firewall/security group settings
- Use webhook testing tools like webhook.site for debugging
- Check server logs for delivery attempts
- Ensure your handler returns 200-299 status code
**Job stuck in processing?**
- Check Redis connection: `docker logs <container_name> | grep redis`
- Verify worker processes: `docker exec <container_name> ps aux | grep worker`
- Check server logs: `docker logs <container_name>`
**Need to cancel a job?**
Jobs are processed asynchronously. If you need to cancel:
- Delete the task from Redis (requires Redis CLI access)
- Or implement a cancellation endpoint in your webhook handler
---
## Dockerfile Parameters
@@ -913,10 +1628,12 @@ This is the easiest way to translate Python configuration to JSON requests when
Install the SDK: `pip install crawl4ai`
The Python SDK provides a convenient way to interact with the Docker API, including **automatic hook conversion** when using function objects.
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Point to the correct server port
@@ -928,23 +1645,22 @@ async def main():
print("--- Running Non-Streaming Crawl ---")
results = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True), # Use library classes for config aid
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
if results: # client.crawl returns None on failure
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results: # Iterate through the CrawlResultContainer
print(f"URL: {result.url}, Success: {result.success}")
if results:
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
else:
print("Non-streaming crawl failed.")
# Example Streaming crawl
print("\n--- Running Streaming Crawl ---")
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
try:
async for result in await client.crawl( # client.crawl returns an async generator for streaming
async for result in await client.crawl(
["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
browser_config=BrowserConfig(headless=True),
crawler_config=stream_config
@@ -953,17 +1669,56 @@ async def main():
except Exception as e:
print(f"Streaming crawl failed: {e}")
# Example with hooks (Python function objects)
print("\n--- Crawl with Hooks ---")
async def my_hook(page, context, **kwargs):
"""Custom hook to optimize performance"""
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.route("**/*.{png,jpg}", lambda r: r.abort())
print("[HOOK] Page optimized")
return page
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
hooks={"on_page_context_created": my_hook}, # Pass function directly!
hooks_timeout=30
)
print(f"Crawl with hooks success: {result.success}")
# Example Get schema
print("\n--- Getting Schema ---")
schema = await client.get_schema()
print(f"Schema received: {bool(schema)}") # Print whether schema was received
print(f"Schema received: {bool(schema)}")
if __name__ == "__main__":
asyncio.run(main())
```
*(SDK parameters like timeout, verify_ssl etc. remain the same)*
#### SDK Parameters
The Docker client supports the following parameters:
**Client Initialization**:
- `base_url` (str): URL of the Docker server (default: `http://localhost:8000`)
- `timeout` (float): Request timeout in seconds (default: 30.0)
- `verify_ssl` (bool): Verify SSL certificates (default: True)
- `verbose` (bool): Enable verbose logging (default: True)
- `log_file` (Optional[str]): Path to log file (default: None)
**crawl() Method**:
- `urls` (List[str]): List of URLs to crawl
- `browser_config` (Optional[BrowserConfig]): Browser configuration
- `crawler_config` (Optional[CrawlerRunConfig]): Crawler configuration
- `hooks` (Optional[Dict]): Hook functions or strings - **automatically converts function objects!**
- `hooks_timeout` (int): Timeout for each hook execution in seconds (default: 30)
**Returns**:
- Single URL: `CrawlResult` object
- Multiple URLs: `List[CrawlResult]`
- Streaming: `AsyncGenerator[CrawlResult]`
### Second Approach: Direct API Calls

View File

@@ -20,10 +20,10 @@ In some cases, you need to extract **complex or unstructured** information from
## 2. Provider-Agnostic via LiteLLM
You can use LlmConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LlmConfig [here](/api/parameters).
You can use LLMConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LLMConfig [here](/api/parameters).
```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
@@ -58,7 +58,7 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`llmConfig`** (LlmConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
1. **`llm_config`** (LLMConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
3. **`extraction_type`** (str): `"schema"` or `"block"`.
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
@@ -112,7 +112,7 @@ async def main():
# 1. Define the LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
schema=Product.schema_json(), # Or use model_json_schema()
schema=Product.model_json_schema(), # Or use model_json_schema()
extraction_type="schema",
instruction="Extract all product objects with 'name' and 'price' from the content.",
chunk_token_threshold=1000,
@@ -238,7 +238,7 @@ class KnowledgeGraph(BaseModel):
async def main():
# LLM extraction strategy
llm_strat = LLMExtractionStrategy(
llmConfig = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
llm_config = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="Extract entities and relationships from the content. Return valid JSON.",

View File

@@ -55,9 +55,40 @@
</div>
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_Well be onboarding in phases and working closely with early users.
Limited slots._
---
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
> **Note**: If you're looking for the old documentation, you can access it [here](https://old.docs.crawl4ai.com).
> Enjoy using Crawl4AI? Consider **[becoming a sponsor](https://github.com/sponsors/unclecode)** to support ongoing development and community growth!
## 🆕 AI Assistant Skill Now Available!
<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; margin: 20px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
<h3 style="color: white; margin: 0 0 10px 0;">🤖 Crawl4AI Skill for Claude & AI Assistants</h3>
<p style="color: white; margin: 10px 0;">Supercharge your AI coding assistant with complete Crawl4AI knowledge! Download our comprehensive skill package that includes:</p>
<ul style="color: white; margin: 10px 0;">
<li>📚 Complete SDK reference (23K+ words)</li>
<li>🚀 Ready-to-use extraction scripts</li>
<li>⚡ Schema generation for efficient scraping</li>
<li>🔧 Version 0.7.4 compatible</li>
</ul>
<div style="text-align: center; margin-top: 15px;">
<a href="assets/crawl4ai-skill.zip" download style="background: white; color: #667eea; padding: 12px 30px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; transition: transform 0.2s;">
📦 Download Skill Package
</a>
</div>
<p style="color: white; margin: 15px 0 0 0; font-size: 0.9em; text-align: center;">
Works with Claude, Cursor, Windsurf, and other AI coding assistants. Import the .zip file into your AI assistant's skill/knowledge system.
</p>
</div>
## 🎯 New: Adaptive Web Crawling

View File

@@ -529,8 +529,19 @@ class AdminDashboard {
</label>
</div>
<div class="form-group full-width">
<label>Integration Guide</label>
<textarea id="form-integration" rows="10">${app?.integration_guide || ''}</textarea>
<label>Long Description (Markdown - Overview tab)</label>
<textarea id="form-long-description" rows="10" placeholder="Enter detailed description with markdown formatting...">${app?.long_description || ''}</textarea>
<small>Markdown support: **bold**, *italic*, [links](url), # headers, code blocks, lists</small>
</div>
<div class="form-group full-width">
<label>Integration Guide (Markdown - Integration tab)</label>
<textarea id="form-integration" rows="20" placeholder="Enter integration guide with installation, examples, and code snippets using markdown...">${app?.integration_guide || ''}</textarea>
<small>Single markdown field with installation, examples, and complete guide. Code blocks get auto copy buttons.</small>
</div>
<div class="form-group full-width">
<label>Documentation (Markdown - Documentation tab)</label>
<textarea id="form-documentation" rows="20" placeholder="Enter documentation with API reference, examples, and best practices using markdown...">${app?.documentation || ''}</textarea>
<small>Full documentation with API reference, examples, best practices, etc.</small>
</div>
</div>
`;
@@ -712,7 +723,9 @@ class AdminDashboard {
data.contact_email = document.getElementById('form-email').value;
data.featured = document.getElementById('form-featured').checked ? 1 : 0;
data.sponsored = document.getElementById('form-sponsored').checked ? 1 : 0;
data.long_description = document.getElementById('form-long-description').value;
data.integration_guide = document.getElementById('form-integration').value;
data.documentation = document.getElementById('form-documentation').value;
} else if (type === 'articles') {
data.title = document.getElementById('form-title').value;
data.slug = this.generateSlug(data.title);

View File

@@ -278,12 +278,12 @@
}
.tab-content {
display: none;
display: none !important;
padding: 2rem;
}
.tab-content.active {
display: block;
display: block !important;
}
/* Overview Layout */
@@ -510,6 +510,31 @@
line-height: 1.5;
}
/* Markdown rendered code blocks */
.integration-content pre,
.docs-content pre {
background: var(--bg-dark);
border: 1px solid var(--border-color);
margin: 1rem 0;
padding: 1rem;
padding-top: 2.5rem; /* Space for copy button */
overflow-x: auto;
position: relative;
max-height: none; /* Remove any height restrictions */
height: auto; /* Allow content to expand */
}
.integration-content pre code,
.docs-content pre code {
background: transparent;
padding: 0;
color: var(--text-secondary);
font-size: 0.875rem;
line-height: 1.5;
white-space: pre; /* Preserve whitespace and line breaks */
display: block;
}
/* Feature Grid */
.feature-grid {
display: grid;

View File

@@ -73,27 +73,14 @@
<div class="tabs">
<button class="tab-btn active" data-tab="overview">Overview</button>
<button class="tab-btn" data-tab="integration">Integration</button>
<button class="tab-btn" data-tab="docs">Documentation</button>
<button class="tab-btn" data-tab="support">Support</button>
<!-- <button class="tab-btn" data-tab="docs">Documentation</button>
<button class="tab-btn" data-tab="support">Support</button> -->
</div>
<section id="overview-tab" class="tab-content active">
<div class="overview-columns">
<div class="overview-main">
<h2>Overview</h2>
<div id="app-overview">Overview content goes here.</div>
<h3>Key Features</h3>
<ul id="app-features" class="features-list">
<li>Feature 1</li>
<li>Feature 2</li>
<li>Feature 3</li>
</ul>
<h3>Use Cases</h3>
<div id="app-use-cases" class="use-cases">
<p>Describe how this app can help your workflow.</p>
</div>
</div>
<aside class="sidebar">
@@ -142,37 +129,16 @@
</section>
<section id="integration-tab" class="tab-content">
<div class="integration-content">
<h2>Integration Guide</h2>
<h3>Installation</h3>
<div class="code-block">
<pre><code id="install-code"># Installation instructions will appear here</code></pre>
</div>
<h3>Basic Usage</h3>
<div class="code-block">
<pre><code id="usage-code"># Usage example will appear here</code></pre>
</div>
<h3>Complete Integration Example</h3>
<div class="code-block">
<button class="copy-btn" id="copy-integration">Copy</button>
<pre><code id="integration-code"># Complete integration guide will appear here</code></pre>
</div>
<div class="integration-content" id="app-integration">
</div>
</section>
<section id="docs-tab" class="tab-content">
<div class="docs-content">
<h2>Documentation</h2>
<div id="app-docs" class="doc-sections">
<p>Documentation coming soon.</p>
</div>
<!-- <section id="docs-tab" class="tab-content">
<div class="docs-content" id="app-docs">
</div>
</section>
</section> -->
<section id="support-tab" class="tab-content">
<!-- <section id="support-tab" class="tab-content">
<div class="docs-content">
<h2>Support</h2>
<div class="support-grid">
@@ -190,7 +156,7 @@
</div>
</div>
</div>
</section>
</section> -->
</div>
</main>

View File

@@ -112,7 +112,7 @@ class AppDetailPage {
}
// Contact
document.getElementById('app-contact').textContent = this.appData.contact_email || 'Not available';
document.getElementById('app-contact') && (document.getElementById('app-contact').textContent = this.appData.contact_email || 'Not available');
// Sidebar info
document.getElementById('sidebar-downloads').textContent = this.formatNumber(this.appData.downloads || 0);
@@ -123,144 +123,132 @@ class AppDetailPage {
document.getElementById('sidebar-pricing').textContent = this.appData.pricing || 'Free';
document.getElementById('sidebar-contact').textContent = this.appData.contact_email || 'contact@example.com';
// Integration guide
this.renderIntegrationGuide();
// Render tab contents from database fields
this.renderTabContents();
}
renderIntegrationGuide() {
// Installation code
const installCode = document.getElementById('install-code');
if (installCode) {
if (this.appData.type === 'Open Source' && this.appData.github_url) {
installCode.textContent = `# Clone from GitHub
git clone ${this.appData.github_url}
# Install dependencies
pip install -r requirements.txt`;
} else if (this.appData.name.toLowerCase().includes('api')) {
installCode.textContent = `# Install via pip
pip install ${this.appData.slug}
# Or install from source
pip install git+${this.appData.github_url || 'https://github.com/example/repo'}`;
renderTabContents() {
// Overview tab - use long_description from database
const overviewDiv = document.getElementById('app-overview');
if (overviewDiv) {
if (this.appData.long_description) {
overviewDiv.innerHTML = this.renderMarkdown(this.appData.long_description);
} else {
overviewDiv.innerHTML = `<p>${this.appData.description || 'No overview available.'}</p>`;
}
}
// Usage code - customize based on category
const usageCode = document.getElementById('usage-code');
if (usageCode) {
if (this.appData.category === 'Browser Automation') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
from ${this.appData.slug.replace(/-/g, '_')} import ${this.appData.name.replace(/\s+/g, '')}
async def main():
# Initialize ${this.appData.name}
automation = ${this.appData.name.replace(/\s+/g, '')}()
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
browser_config=automation.config,
wait_for="css:body"
)
print(result.markdown)`;
} else if (this.appData.category === 'Proxy Services') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
import ${this.appData.slug.replace(/-/g, '_')}
# Configure proxy
proxy_config = {
"server": "${this.appData.website_url || 'https://proxy.example.com'}",
"username": "your_username",
"password": "your_password"
}
async with AsyncWebCrawler(proxy=proxy_config) as crawler:
result = await crawler.arun(
url="https://example.com",
bypass_cache=True
)
print(result.status_code)`;
} else if (this.appData.category === 'LLM Integration') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Configure LLM extraction
strategy = LLMExtractionStrategy(
provider="${this.appData.name.toLowerCase().includes('gpt') ? 'openai' : 'anthropic'}",
api_key="your-api-key",
model="${this.appData.name.toLowerCase().includes('gpt') ? 'gpt-4' : 'claude-3'}",
instruction="Extract structured data"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
print(result.extracted_content)`;
// Integration tab - use integration_guide field from database
const integrationDiv = document.getElementById('app-integration');
if (integrationDiv) {
if (this.appData.integration_guide) {
integrationDiv.innerHTML = this.renderMarkdown(this.appData.integration_guide);
// Add copy buttons to all code blocks
this.addCopyButtonsToCodeBlocks(integrationDiv);
} else {
integrationDiv.innerHTML = '<p>Integration guide not yet available. Please check the official website for details.</p>';
}
}
// Integration example
const integrationCode = document.getElementById('integration-code');
if (integrationCode) {
integrationCode.textContent = this.appData.integration_guide ||
`# Complete ${this.appData.name} Integration Example
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def crawl_with_${this.appData.slug.replace(/-/g, '_')}():
"""
Complete example showing how to use ${this.appData.name}
with Crawl4AI for production web scraping
"""
# Define extraction schema
schema = {
"name": "ProductList",
"baseSelector": "div.product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
// Documentation tab - use documentation field from database
const docsDiv = document.getElementById('app-docs');
if (docsDiv) {
if (this.appData.documentation) {
docsDiv.innerHTML = this.renderMarkdown(this.appData.documentation);
// Add copy buttons to all code blocks
this.addCopyButtonsToCodeBlocks(docsDiv);
} else {
docsDiv.innerHTML = '<p>Documentation coming soon.</p>';
}
}
}
# Initialize crawler with ${this.appData.name}
async with AsyncWebCrawler(
browser_type="chromium",
headless=True,
verbose=True
) as crawler:
addCopyButtonsToCodeBlocks(container) {
// Find all code blocks and add copy buttons
const codeBlocks = container.querySelectorAll('pre code');
codeBlocks.forEach(codeBlock => {
const pre = codeBlock.parentElement;
# Crawl with extraction
result = await crawler.arun(
url="https://example.com/products",
extraction_strategy=JsonCssExtractionStrategy(schema),
cache_mode="bypass",
wait_for="css:.product",
screenshot=True
)
// Skip if already has a copy button
if (pre.querySelector('.copy-btn')) return;
# Process results
if result.success:
products = json.loads(result.extracted_content)
print(f"Found {len(products)} products")
// Create copy button
const copyBtn = document.createElement('button');
copyBtn.className = 'copy-btn';
copyBtn.textContent = 'Copy';
copyBtn.onclick = () => {
navigator.clipboard.writeText(codeBlock.textContent).then(() => {
copyBtn.textContent = '✓ Copied!';
setTimeout(() => {
copyBtn.textContent = 'Copy';
}, 2000);
});
};
for product in products[:5]:
print(f"- {product['title']}: {product['price']}")
// Add button to pre element
pre.style.position = 'relative';
pre.insertBefore(copyBtn, codeBlock);
});
}
return products
renderMarkdown(text) {
if (!text) return '';
# Run the crawler
if __name__ == "__main__":
import asyncio
asyncio.run(crawl_with_${this.appData.slug.replace(/-/g, '_')}())`;
}
// Store code blocks temporarily to protect them from processing
const codeBlocks = [];
let processed = text.replace(/```(\w+)?\n([\s\S]*?)```/g, (match, lang, code) => {
const placeholder = `___CODE_BLOCK_${codeBlocks.length}___`;
codeBlocks.push(`<pre><code class="language-${lang || ''}">${this.escapeHtml(code)}</code></pre>`);
return placeholder;
});
// Store inline code temporarily
const inlineCodes = [];
processed = processed.replace(/`([^`]+)`/g, (match, code) => {
const placeholder = `___INLINE_CODE_${inlineCodes.length}___`;
inlineCodes.push(`<code>${this.escapeHtml(code)}</code>`);
return placeholder;
});
// Now process the rest of the markdown
processed = processed
// Headers
.replace(/^### (.*$)/gim, '<h3>$1</h3>')
.replace(/^## (.*$)/gim, '<h2>$1</h2>')
.replace(/^# (.*$)/gim, '<h1>$1</h1>')
// Bold
.replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
// Italic
.replace(/\*(.*?)\*/g, '<em>$1</em>')
// Links
.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2" target="_blank">$1</a>')
// Line breaks
.replace(/\n\n/g, '</p><p>')
.replace(/\n/g, '<br>')
// Lists
.replace(/^\* (.*)$/gim, '<li>$1</li>')
.replace(/^- (.*)$/gim, '<li>$1</li>')
// Wrap in paragraphs
.replace(/^(?!<[h|p|pre|ul|ol|li])/gim, '<p>')
.replace(/(?<![>])$/gim, '</p>');
// Restore inline code
inlineCodes.forEach((code, i) => {
processed = processed.replace(`___INLINE_CODE_${i}___`, code);
});
// Restore code blocks
codeBlocks.forEach((block, i) => {
processed = processed.replace(`___CODE_BLOCK_${i}___`, block);
});
return processed;
}
escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
formatNumber(num) {
@@ -275,45 +263,27 @@ if __name__ == "__main__":
setupEventListeners() {
// Tab switching
const tabs = document.querySelectorAll('.tab-btn');
tabs.forEach(tab => {
tab.addEventListener('click', () => {
// Update active tab
// Update active tab button
tabs.forEach(t => t.classList.remove('active'));
tab.classList.add('active');
// Show corresponding content
const tabName = tab.dataset.tab;
document.querySelectorAll('.tab-content').forEach(content => {
// Hide all tab contents
const allTabContents = document.querySelectorAll('.tab-content');
allTabContents.forEach(content => {
content.classList.remove('active');
});
document.getElementById(`${tabName}-tab`).classList.add('active');
});
});
// Copy integration code
document.getElementById('copy-integration').addEventListener('click', () => {
const code = document.getElementById('integration-code').textContent;
navigator.clipboard.writeText(code).then(() => {
const btn = document.getElementById('copy-integration');
const originalText = btn.innerHTML;
btn.innerHTML = '<span>✓</span> Copied!';
setTimeout(() => {
btn.innerHTML = originalText;
}, 2000);
});
});
// Copy code buttons
document.querySelectorAll('.copy-btn').forEach(btn => {
btn.addEventListener('click', (e) => {
const codeBlock = e.target.closest('.code-block');
const code = codeBlock.querySelector('code').textContent;
navigator.clipboard.writeText(code).then(() => {
btn.textContent = 'Copied!';
setTimeout(() => {
btn.textContent = 'Copy';
}, 2000);
});
// Show the selected tab content
const targetTab = document.getElementById(`${tabName}-tab`);
if (targetTab) {
targetTab.classList.add('active');
}
});
});
}

View File

@@ -471,13 +471,17 @@ async def delete_sponsor(sponsor_id: int):
app.include_router(router)
# Version info
VERSION = "1.1.0"
BUILD_DATE = "2025-10-26"
@app.get("/")
async def root():
"""API info"""
return {
"name": "Crawl4AI Marketplace API",
"version": "1.0.0",
"version": VERSION,
"build_date": BUILD_DATE,
"endpoints": [
"/marketplace/api/apps",
"/marketplace/api/articles",

View File

@@ -0,0 +1,338 @@
"""
🚀 Crawl4AI v0.7.5 Release Demo - Working Examples
==================================================
This demo showcases key features introduced in v0.7.5 with real, executable examples.
Featured Demos:
1. ✅ Docker Hooks System - Real API calls with custom hooks (string & function-based)
2. ✅ Enhanced LLM Integration - Working LLM configurations
3. ✅ HTTPS Preservation - Live crawling with HTTPS maintenance
Requirements:
- crawl4ai v0.7.5 installed
- Docker running with crawl4ai image (optional for Docker demos)
- Valid API keys for LLM demos (optional)
"""
import asyncio
import requests
import time
import sys
from crawl4ai import (AsyncWebCrawler, CrawlerRunConfig, BrowserConfig,
CacheMode, FilterChain, URLPatternFilter, BFSDeepCrawlStrategy,
hooks_to_string)
from crawl4ai.docker_client import Crawl4aiDockerClient
def print_section(title: str, description: str = ""):
"""Print a section header"""
print(f"\n{'=' * 60}")
print(f"{title}")
if description:
print(f"{description}")
print(f"{'=' * 60}\n")
async def demo_1_docker_hooks_system():
"""Demo 1: Docker Hooks System - Real API calls with custom hooks"""
print_section(
"Demo 1: Docker Hooks System",
"Testing both string-based and function-based hooks (NEW in v0.7.5!)"
)
# Check Docker service availability
def check_docker_service():
try:
response = requests.get("http://localhost:11235/", timeout=3)
return response.status_code == 200
except:
return False
print("Checking Docker service...")
docker_running = check_docker_service()
if not docker_running:
print("⚠️ Docker service not running on localhost:11235")
print("To test Docker hooks:")
print("1. Run: docker run -p 11235:11235 unclecode/crawl4ai:latest")
print("2. Wait for service to start")
print("3. Re-run this demo\n")
return
print("✓ Docker service detected!")
# ============================================================================
# PART 1: Traditional String-Based Hooks (Works with REST API)
# ============================================================================
print("\n" + "" * 60)
print("Part 1: String-Based Hooks (REST API)")
print("" * 60)
hooks_config_string = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("[String Hook] Setting up page context")
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print("[String Hook] Before retrieving HTML")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
"""
}
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {
"code": hooks_config_string,
"timeout": 30
}
}
print("🔧 Using string-based hooks for REST API...")
try:
start_time = time.time()
response = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
execution_time = time.time() - start_time
if response.status_code == 200:
result = response.json()
print(f"✅ String-based hooks executed in {execution_time:.2f}s")
if result.get('results') and result['results'][0].get('success'):
html_length = len(result['results'][0].get('html', ''))
print(f" 📄 HTML length: {html_length} characters")
else:
print(f"❌ Request failed: {response.status_code}")
except Exception as e:
print(f"❌ Error: {str(e)}")
# ============================================================================
# PART 2: NEW Function-Based Hooks with Docker Client (v0.7.5)
# ============================================================================
print("\n" + "" * 60)
print("Part 2: Function-Based Hooks with Docker Client (✨ NEW!)")
print("" * 60)
# Define hooks as regular Python functions
async def on_page_context_created_func(page, context, **kwargs):
"""Block images to speed up crawling"""
print("[Function Hook] Setting up page context")
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def before_goto_func(page, context, url, **kwargs):
"""Add custom headers before navigation"""
print(f"[Function Hook] About to navigate to {url}")
await page.set_extra_http_headers({
'X-Crawl4AI': 'v0.7.5-function-hooks',
'X-Test-Header': 'demo'
})
return page
async def before_retrieve_html_func(page, context, **kwargs):
"""Scroll to load lazy content"""
print("[Function Hook] Scrolling page for lazy-loaded content")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(500)
await page.evaluate("window.scrollTo(0, 0)")
return page
# Use the hooks_to_string utility (can be used standalone)
print("\n📦 Converting functions to strings with hooks_to_string()...")
hooks_as_strings = hooks_to_string({
"on_page_context_created": on_page_context_created_func,
"before_goto": before_goto_func,
"before_retrieve_html": before_retrieve_html_func
})
print(f" ✓ Converted {len(hooks_as_strings)} hooks to string format")
# OR use Docker Client which does conversion automatically!
print("\n🐳 Using Docker Client with automatic conversion...")
try:
client = Crawl4aiDockerClient(base_url="http://localhost:11235")
# Pass function objects directly - conversion happens automatically!
results = await client.crawl(
urls=["https://httpbin.org/html"],
hooks={
"on_page_context_created": on_page_context_created_func,
"before_goto": before_goto_func,
"before_retrieve_html": before_retrieve_html_func
},
hooks_timeout=30
)
if results and results.success:
print(f"✅ Function-based hooks executed successfully!")
print(f" 📄 HTML length: {len(results.html)} characters")
print(f" 🎯 URL: {results.url}")
else:
print("⚠️ Crawl completed but may have warnings")
except Exception as e:
print(f"❌ Docker client error: {str(e)}")
# Show the benefits
print("\n" + "=" * 60)
print("✨ Benefits of Function-Based Hooks:")
print("=" * 60)
print("✓ Full IDE support (autocomplete, syntax highlighting)")
print("✓ Type checking and linting")
print("✓ Easier to test and debug")
print("✓ Reusable across projects")
print("✓ Automatic conversion in Docker client")
print("=" * 60)
async def demo_2_enhanced_llm_integration():
"""Demo 2: Enhanced LLM Integration - Working LLM configurations"""
print_section(
"Demo 2: Enhanced LLM Integration",
"Testing custom LLM providers and configurations"
)
print("🤖 Testing Enhanced LLM Integration Features")
provider = "gemini/gemini-2.5-flash-lite"
payload = {
"url": "https://example.com",
"f": "llm",
"q": "Summarize this page in one sentence.",
"provider": provider, # Explicitly set provider
"temperature": 0.7
}
try:
response = requests.post(
"http://localhost:11235/md",
json=payload,
timeout=60
)
if response.status_code == 200:
result = response.json()
print(f"✓ Request successful with provider: {provider}")
print(f" - Response keys: {list(result.keys())}")
print(f" - Content length: {len(result.get('markdown', ''))} characters")
print(f" - Note: Actual LLM call may fail without valid API key")
else:
print(f"❌ Request failed: {response.status_code}")
print(f" - Response: {response.text[:500]}")
except Exception as e:
print(f"[red]Error: {e}[/]")
async def demo_3_https_preservation():
"""Demo 3: HTTPS Preservation - Live crawling with HTTPS maintenance"""
print_section(
"Demo 3: HTTPS Preservation",
"Testing HTTPS preservation for internal links"
)
print("🔒 Testing HTTPS Preservation Feature")
# Test with HTTPS preservation enabled
print("\nTest 1: HTTPS Preservation ENABLED")
url_filter = URLPatternFilter(
patterns=["^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
)
config = CrawlerRunConfig(
exclude_external_links=True,
stream=True,
verbose=False,
preserve_https_for_internal_links=True,
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5,
filter_chain=FilterChain([url_filter])
)
)
test_url = "https://quotes.toscrape.com"
print(f"🎯 Testing URL: {test_url}")
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun(url=test_url, config=config):
print("✓ HTTPS Preservation Test Completed")
internal_links = [i['href'] for i in result.links['internal']]
for link in internal_links:
print(f"{link}")
async def main():
"""Run all demos"""
print("\n" + "=" * 60)
print("🚀 Crawl4AI v0.7.5 Working Demo")
print("=" * 60)
# Check system requirements
print("🔍 System Requirements Check:")
print(f" - Python version: {sys.version.split()[0]} {'' if sys.version_info >= (3, 10) else '❌ (3.10+ required)'}")
try:
import requests
print(f" - Requests library: ✓")
except ImportError:
print(f" - Requests library: ❌")
print()
demos = [
("Docker Hooks System", demo_1_docker_hooks_system),
("Enhanced LLM Integration", demo_2_enhanced_llm_integration),
("HTTPS Preservation", demo_3_https_preservation),
]
for i, (name, demo_func) in enumerate(demos, 1):
try:
print(f"\n📍 Starting Demo {i}/{len(demos)}: {name}")
await demo_func()
if i < len(demos):
print(f"\n✨ Demo {i} complete! Press Enter for next demo...")
input()
except KeyboardInterrupt:
print(f"\n⏹️ Demo interrupted by user")
break
except Exception as e:
print(f"❌ Demo {i} error: {str(e)}")
print("Continuing to next demo...")
continue
print("\n" + "=" * 60)
print("🎉 Demo Complete!")
print("=" * 60)
print("You've experienced the power of Crawl4AI v0.7.5!")
print("")
print("Key Features Demonstrated:")
print("🔧 Docker Hooks - String-based & function-based (NEW!)")
print(" • hooks_to_string() utility for function conversion")
print(" • Docker client with automatic conversion")
print(" • Full IDE support and type checking")
print("🤖 Enhanced LLM - Better AI integration")
print("🔒 HTTPS Preservation - Secure link handling")
print("")
print("Ready to build something amazing? 🚀")
print("")
print("📖 Docs: https://docs.crawl4ai.com/")
print("🐙 GitHub: https://github.com/unclecode/crawl4ai")
print("=" * 60)
if __name__ == "__main__":
print("🚀 Crawl4AI v0.7.5 Live Demo Starting...")
print("Press Ctrl+C anytime to exit\n")
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n👋 Demo stopped by user. Thanks for trying Crawl4AI v0.7.5!")
except Exception as e:
print(f"\n❌ Demo error: {str(e)}")
print("Make sure you have the required dependencies installed.")

View File

@@ -0,0 +1,359 @@
#!/usr/bin/env python3
"""
Crawl4AI v0.7.6 Release Demo
============================
This demo showcases the major feature in v0.7.6:
**Webhook Support for Docker Job Queue API**
Features Demonstrated:
1. Asynchronous job processing with webhook notifications
2. Webhook support for /crawl/job endpoint
3. Webhook support for /llm/job endpoint
4. Notification-only vs data-in-payload modes
5. Custom webhook headers for authentication
6. Structured extraction with JSON schemas
7. Exponential backoff retry for reliable delivery
Prerequisites:
- Crawl4AI Docker container running on localhost:11235
- Flask installed: pip install flask requests
- LLM API key configured (for LLM examples)
Usage:
python docs/releases_review/demo_v0.7.6.py
"""
import requests
import json
import time
from flask import Flask, request, jsonify
from threading import Thread
# Configuration
CRAWL4AI_BASE_URL = "http://localhost:11235"
WEBHOOK_BASE_URL = "http://localhost:8080"
# Flask app for webhook receiver
app = Flask(__name__)
received_webhooks = []
@app.route('/webhook', methods=['POST'])
def webhook_handler():
"""Universal webhook handler for both crawl and LLM extraction jobs."""
payload = request.json
task_id = payload['task_id']
task_type = payload['task_type']
status = payload['status']
print(f"\n{'='*70}")
print(f"📬 Webhook Received!")
print(f" Task ID: {task_id}")
print(f" Task Type: {task_type}")
print(f" Status: {status}")
print(f" Timestamp: {payload['timestamp']}")
if status == 'completed':
if 'data' in payload:
print(f" ✅ Data included in webhook")
if task_type == 'crawl':
results = payload['data'].get('results', [])
print(f" 📊 Crawled {len(results)} URL(s)")
elif task_type == 'llm_extraction':
extracted = payload['data'].get('extracted_content', {})
print(f" 🤖 Extracted: {json.dumps(extracted, indent=6)}")
else:
print(f" 📥 Notification only (fetch data separately)")
elif status == 'failed':
print(f" ❌ Error: {payload.get('error', 'Unknown')}")
print(f"{'='*70}\n")
received_webhooks.append(payload)
return jsonify({"status": "received"}), 200
def start_webhook_server():
"""Start Flask webhook server in background."""
app.run(host='0.0.0.0', port=8080, debug=False, use_reloader=False)
def demo_1_crawl_webhook_notification_only():
"""Demo 1: Crawl job with webhook notification (data fetched separately)."""
print("\n" + "="*70)
print("DEMO 1: Crawl Job - Webhook Notification Only")
print("="*70)
print("Submitting crawl job with webhook notification...")
payload = {
"urls": ["https://example.com"],
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
"webhook_data_in_payload": False,
"webhook_headers": {
"X-Demo": "v0.7.6",
"X-Type": "crawl"
}
}
}
response = requests.post(f"{CRAWL4AI_BASE_URL}/crawl/job", json=payload)
if response.ok:
task_id = response.json()['task_id']
print(f"✅ Job submitted: {task_id}")
print("⏳ Webhook will notify when complete...")
return task_id
else:
print(f"❌ Failed: {response.text}")
return None
def demo_2_crawl_webhook_with_data():
"""Demo 2: Crawl job with full data in webhook payload."""
print("\n" + "="*70)
print("DEMO 2: Crawl Job - Webhook with Full Data")
print("="*70)
print("Submitting crawl job with data included in webhook...")
payload = {
"urls": ["https://www.python.org"],
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
"webhook_data_in_payload": True,
"webhook_headers": {
"X-Demo": "v0.7.6",
"X-Type": "crawl-with-data"
}
}
}
response = requests.post(f"{CRAWL4AI_BASE_URL}/crawl/job", json=payload)
if response.ok:
task_id = response.json()['task_id']
print(f"✅ Job submitted: {task_id}")
print("⏳ Webhook will include full results...")
return task_id
else:
print(f"❌ Failed: {response.text}")
return None
def demo_3_llm_webhook_notification_only():
"""Demo 3: LLM extraction with webhook notification (NEW in v0.7.6!)."""
print("\n" + "="*70)
print("DEMO 3: LLM Extraction - Webhook Notification Only (NEW!)")
print("="*70)
print("Submitting LLM extraction job with webhook notification...")
payload = {
"url": "https://www.example.com",
"q": "Extract the main heading and description from this page",
"provider": "openai/gpt-4o-mini",
"cache": False,
"webhook_config": {
"webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
"webhook_data_in_payload": False,
"webhook_headers": {
"X-Demo": "v0.7.6",
"X-Type": "llm"
}
}
}
response = requests.post(f"{CRAWL4AI_BASE_URL}/llm/job", json=payload)
if response.ok:
task_id = response.json()['task_id']
print(f"✅ Job submitted: {task_id}")
print("⏳ Webhook will notify when LLM extraction completes...")
return task_id
else:
print(f"❌ Failed: {response.text}")
return None
def demo_4_llm_webhook_with_schema():
"""Demo 4: LLM extraction with JSON schema and data in webhook (NEW in v0.7.6!)."""
print("\n" + "="*70)
print("DEMO 4: LLM Extraction - Schema + Full Data in Webhook (NEW!)")
print("="*70)
print("Submitting LLM extraction with JSON schema...")
schema = {
"type": "object",
"properties": {
"title": {"type": "string", "description": "Page title"},
"description": {"type": "string", "description": "Page description"},
"main_topics": {
"type": "array",
"items": {"type": "string"},
"description": "Main topics covered"
}
},
"required": ["title"]
}
payload = {
"url": "https://www.python.org",
"q": "Extract the title, description, and main topics from this website",
"schema": json.dumps(schema),
"provider": "openai/gpt-4o-mini",
"cache": False,
"webhook_config": {
"webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
"webhook_data_in_payload": True,
"webhook_headers": {
"X-Demo": "v0.7.6",
"X-Type": "llm-with-schema"
}
}
}
response = requests.post(f"{CRAWL4AI_BASE_URL}/llm/job", json=payload)
if response.ok:
task_id = response.json()['task_id']
print(f"✅ Job submitted: {task_id}")
print("⏳ Webhook will include structured extraction results...")
return task_id
else:
print(f"❌ Failed: {response.text}")
return None
def demo_5_global_webhook_config():
"""Demo 5: Using global webhook configuration from config.yml."""
print("\n" + "="*70)
print("DEMO 5: Global Webhook Configuration")
print("="*70)
print("💡 You can configure a default webhook URL in config.yml:")
print("""
webhooks:
enabled: true
default_url: "https://myapp.com/webhooks/default"
data_in_payload: false
retry:
max_attempts: 5
initial_delay_ms: 1000
max_delay_ms: 32000
timeout_ms: 30000
""")
print("Then submit jobs WITHOUT webhook_config - they'll use the default!")
print("This is useful for consistent webhook handling across all jobs.")
def demo_6_webhook_retry_logic():
"""Demo 6: Webhook retry mechanism with exponential backoff."""
print("\n" + "="*70)
print("DEMO 6: Webhook Retry Logic")
print("="*70)
print("🔄 Webhook delivery uses exponential backoff retry:")
print(" • Max attempts: 5")
print(" • Delays: 1s → 2s → 4s → 8s → 16s")
print(" • Timeout: 30s per attempt")
print(" • Retries on: 5xx errors, network errors, timeouts")
print(" • No retry on: 4xx client errors")
print("\nThis ensures reliable webhook delivery even with temporary failures!")
def print_summary():
"""Print demo summary and results."""
print("\n" + "="*70)
print("📊 DEMO SUMMARY")
print("="*70)
print(f"Total webhooks received: {len(received_webhooks)}")
crawl_webhooks = [w for w in received_webhooks if w['task_type'] == 'crawl']
llm_webhooks = [w for w in received_webhooks if w['task_type'] == 'llm_extraction']
print(f"\nBreakdown:")
print(f" 🕷️ Crawl jobs: {len(crawl_webhooks)}")
print(f" 🤖 LLM extraction jobs: {len(llm_webhooks)}")
print(f"\nDetails:")
for i, webhook in enumerate(received_webhooks, 1):
icon = "🕷️" if webhook['task_type'] == 'crawl' else "🤖"
print(f" {i}. {icon} {webhook['task_id']}: {webhook['status']}")
print("\n" + "="*70)
print("✨ v0.7.6 KEY FEATURES DEMONSTRATED:")
print("="*70)
print("✅ Webhook support for /crawl/job")
print("✅ Webhook support for /llm/job (NEW!)")
print("✅ Notification-only mode (fetch data separately)")
print("✅ Data-in-payload mode (get full results in webhook)")
print("✅ Custom headers for authentication")
print("✅ JSON schema for structured LLM extraction")
print("✅ Exponential backoff retry for reliable delivery")
print("✅ Global webhook configuration support")
print("✅ Universal webhook handler for both job types")
print("\n💡 Benefits:")
print(" • No more polling - get instant notifications")
print(" • Better resource utilization")
print(" • Reliable delivery with automatic retries")
print(" • Consistent API across crawl and LLM jobs")
print(" • Production-ready webhook infrastructure")
def main():
"""Run all demos."""
print("\n" + "="*70)
print("🚀 Crawl4AI v0.7.6 Release Demo")
print("="*70)
print("Feature: Webhook Support for Docker Job Queue API")
print("="*70)
# Check if server is running
try:
health = requests.get(f"{CRAWL4AI_BASE_URL}/health", timeout=5)
print(f"✅ Crawl4AI server is running")
except:
print(f"❌ Cannot connect to Crawl4AI at {CRAWL4AI_BASE_URL}")
print("Please start Docker container:")
print(" docker run -d -p 11235:11235 --env-file .llm.env unclecode/crawl4ai:0.7.6")
return
# Start webhook server
print(f"\n🌐 Starting webhook server at {WEBHOOK_BASE_URL}...")
webhook_thread = Thread(target=start_webhook_server, daemon=True)
webhook_thread.start()
time.sleep(2)
# Run demos
demo_1_crawl_webhook_notification_only()
time.sleep(5)
demo_2_crawl_webhook_with_data()
time.sleep(5)
demo_3_llm_webhook_notification_only()
time.sleep(5)
demo_4_llm_webhook_with_schema()
time.sleep(5)
demo_5_global_webhook_config()
demo_6_webhook_retry_logic()
# Wait for webhooks
print("\n⏳ Waiting for all webhooks to arrive...")
time.sleep(30)
# Print summary
print_summary()
print("\n" + "="*70)
print("✅ Demo completed!")
print("="*70)
print("\n📚 Documentation:")
print(" • deploy/docker/WEBHOOK_EXAMPLES.md")
print(" • docs/examples/docker_webhook_example.py")
print("\n🔗 Upgrade:")
print(" docker pull unclecode/crawl4ai:0.7.6")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,628 @@
#!/usr/bin/env python3
"""
Crawl4AI v0.7.7 Release Demo
============================
This demo showcases the major feature in v0.7.7:
**Self-Hosting with Real-time Monitoring Dashboard**
Features Demonstrated:
1. System health monitoring with live metrics
2. Real-time request tracking (active & completed)
3. Browser pool management (permanent/hot/cold pools)
4. Monitor API endpoints for programmatic access
5. WebSocket streaming for real-time updates
6. Control actions (kill browser, cleanup, restart)
7. Production metrics (efficiency, reuse rates, memory)
Prerequisites:
- Crawl4AI Docker container running on localhost:11235
- Python packages: pip install httpx websockets
Usage:
python docs/releases_review/demo_v0.7.7.py
"""
import asyncio
import httpx
import json
import time
from datetime import datetime
from typing import Dict, Any
# Configuration
CRAWL4AI_BASE_URL = "http://localhost:11235"
MONITOR_DASHBOARD_URL = f"{CRAWL4AI_BASE_URL}/dashboard"
def print_section(title: str, description: str = ""):
"""Print a formatted section header"""
print(f"\n{'=' * 70}")
print(f"📊 {title}")
if description:
print(f"{description}")
print(f"{'=' * 70}\n")
def print_subsection(title: str):
"""Print a formatted subsection header"""
print(f"\n{'-' * 70}")
print(f"{title}")
print(f"{'-' * 70}")
async def check_server_health():
"""Check if Crawl4AI server is running"""
try:
async with httpx.AsyncClient(timeout=5.0) as client:
response = await client.get(f"{CRAWL4AI_BASE_URL}/health")
return response.status_code == 200
except:
return False
async def demo_1_system_health_overview():
"""Demo 1: System Health Overview - Live metrics and pool status"""
print_section(
"Demo 1: System Health Overview",
"Real-time monitoring of system resources and browser pool"
)
async with httpx.AsyncClient(timeout=30.0) as client:
print("🔍 Fetching system health metrics...")
try:
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/health")
health = response.json()
print("\n✅ System Health Report:")
print(f"\n🖥️ Container Metrics:")
print(f" • CPU Usage: {health['container']['cpu_percent']:.1f}%")
print(f" • Memory Usage: {health['container']['memory_percent']:.1f}% "
f"({health['container']['memory_mb']:.0f} MB)")
print(f" • Network RX: {health['container']['network_rx_mb']:.2f} MB")
print(f" • Network TX: {health['container']['network_tx_mb']:.2f} MB")
print(f" • Uptime: {health['container']['uptime_seconds']:.0f}s")
print(f"\n🌐 Browser Pool Status:")
print(f" Permanent Browser:")
print(f" • Active: {health['pool']['permanent']['active']}")
print(f" • Total Requests: {health['pool']['permanent']['total_requests']}")
print(f" Hot Pool (Frequently Used Configs):")
print(f" • Count: {health['pool']['hot']['count']}")
print(f" • Total Requests: {health['pool']['hot']['total_requests']}")
print(f" Cold Pool (On-Demand Configs):")
print(f" • Count: {health['pool']['cold']['count']}")
print(f" • Total Requests: {health['pool']['cold']['total_requests']}")
print(f"\n📈 Overall Statistics:")
print(f" • Total Requests: {health['stats']['total_requests']}")
print(f" • Success Rate: {health['stats']['success_rate_percent']:.1f}%")
print(f" • Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
print(f"\n💡 Dashboard URL: {MONITOR_DASHBOARD_URL}")
except Exception as e:
print(f"❌ Error fetching health: {e}")
async def demo_2_request_tracking():
"""Demo 2: Real-time Request Tracking - Generate and monitor requests"""
print_section(
"Demo 2: Real-time Request Tracking",
"Submit crawl jobs and watch them in real-time"
)
async with httpx.AsyncClient(timeout=60.0) as client:
print("🚀 Submitting crawl requests...")
# Submit multiple requests
urls_to_crawl = [
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://example.com"
]
tasks = []
for url in urls_to_crawl:
task = client.post(
f"{CRAWL4AI_BASE_URL}/crawl",
json={"urls": [url], "crawler_config": {}}
)
tasks.append(task)
print(f" • Submitting {len(urls_to_crawl)} requests in parallel...")
results = await asyncio.gather(*tasks, return_exceptions=True)
successful = sum(1 for r in results if not isinstance(r, Exception) and r.status_code == 200)
print(f"{successful}/{len(urls_to_crawl)} requests submitted")
# Check request tracking
print("\n📊 Checking request tracking...")
await asyncio.sleep(2) # Wait for requests to process
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/requests")
requests_data = response.json()
print(f"\n📋 Request Status:")
print(f" • Active Requests: {len(requests_data['active'])}")
print(f" • Completed Requests: {len(requests_data['completed'])}")
if requests_data['completed']:
print(f"\n📝 Recent Completed Requests:")
for req in requests_data['completed'][:3]:
status_icon = "" if req['success'] else ""
print(f" {status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
async def demo_3_browser_pool_management():
"""Demo 3: Browser Pool Management - 3-tier architecture in action"""
print_section(
"Demo 3: Browser Pool Management",
"Understanding permanent, hot, and cold browser pools"
)
async with httpx.AsyncClient(timeout=60.0) as client:
print("🌊 Testing browser pool with different configurations...")
# Test 1: Default config (permanent browser)
print("\n🔥 Test 1: Default Config → Permanent Browser")
for i in range(3):
await client.post(
f"{CRAWL4AI_BASE_URL}/crawl",
json={"urls": [f"https://httpbin.org/html?req={i}"], "crawler_config": {}}
)
print(f" • Request {i+1}/3 sent (should use permanent browser)")
await asyncio.sleep(2)
# Test 2: Custom viewport (cold → hot promotion after 3 uses)
print("\n♨️ Test 2: Custom Viewport → Cold Pool (promoting to Hot)")
viewport_config = {"viewport": {"width": 1280, "height": 720}}
for i in range(4):
await client.post(
f"{CRAWL4AI_BASE_URL}/crawl",
json={
"urls": [f"https://httpbin.org/json?viewport={i}"],
"browser_config": viewport_config,
"crawler_config": {}
}
)
print(f" • Request {i+1}/4 sent (cold→hot promotion after 3rd use)")
await asyncio.sleep(2)
# Check browser pool status
print("\n📊 Browser Pool Report:")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
browsers = response.json()
print(f"\n🎯 Pool Summary:")
print(f" • Total Browsers: {browsers['summary']['total_count']}")
print(f" • Total Memory: {browsers['summary']['total_memory_mb']} MB")
print(f" • Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
print(f"\n📋 Browser Pool Details:")
if browsers['permanent']:
for browser in browsers['permanent']:
print(f" 🔥 Permanent: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
if browsers['hot']:
for browser in browsers['hot']:
print(f" ♨️ Hot: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
if browsers['cold']:
for browser in browsers['cold']:
print(f" ❄️ Cold: {browser['browser_id'][:8]}... | "
f"Requests: {browser['request_count']} | "
f"Memory: {browser['memory_mb']:.0f} MB")
async def demo_4_monitor_api_endpoints():
"""Demo 4: Monitor API Endpoints - Complete API surface"""
print_section(
"Demo 4: Monitor API Endpoints",
"Programmatic access to all monitoring data"
)
async with httpx.AsyncClient(timeout=30.0) as client:
print("🔌 Testing Monitor API endpoints...")
# Endpoint performance statistics
print_subsection("Endpoint Performance Statistics")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/endpoints/stats")
endpoint_stats = response.json()
print("\n📊 Per-Endpoint Analytics:")
for endpoint, stats in endpoint_stats.items():
print(f" {endpoint}:")
print(f" • Requests: {stats['count']}")
print(f" • Avg Latency: {stats['avg_latency_ms']:.0f}ms")
print(f" • Success Rate: {stats['success_rate_percent']:.1f}%")
# Timeline data for charts
print_subsection("Timeline Data (for Charts)")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/timeline?minutes=5")
timeline = response.json()
print(f"\n📈 Timeline Metrics (last 5 minutes):")
print(f" • Data Points: {len(timeline['memory'])}")
if timeline['memory']:
latest = timeline['memory'][-1]
print(f" • Latest Memory: {latest['value']:.1f}%")
print(f" • Timestamp: {latest['timestamp']}")
# Janitor logs
print_subsection("Janitor Cleanup Events")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/janitor?limit=3")
janitor_logs = response.json()
print(f"\n🧹 Recent Cleanup Activities:")
if janitor_logs:
for log in janitor_logs[:3]:
print(f"{log['timestamp']}: {log['message']}")
else:
print(" (No cleanup events yet - janitor runs periodically)")
# Error logs
print_subsection("Error Monitoring")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/errors?limit=3")
error_logs = response.json()
print(f"\n❌ Recent Errors:")
if error_logs:
for log in error_logs[:3]:
print(f"{log['timestamp']}: {log['error_type']}")
print(f" {log['message'][:100]}...")
else:
print(" ✅ No recent errors!")
async def demo_5_websocket_streaming():
"""Demo 5: WebSocket Streaming - Real-time updates"""
print_section(
"Demo 5: WebSocket Streaming",
"Live monitoring with 2-second update intervals"
)
print("⚡ WebSocket Streaming Demo")
print("\n💡 The monitoring dashboard uses WebSocket for real-time updates")
print(f" • Connection: ws://localhost:11235/monitor/ws")
print(f" • Update Interval: 2 seconds")
print(f" • Data: Health, requests, browsers, memory, errors")
print("\n📝 Sample WebSocket Integration Code:")
print("""
import websockets
import json
async def monitor_realtime():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
while True:
data = await websocket.recv()
update = json.loads(data)
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
print(f"Active Requests: {len(update['requests']['active'])}")
print(f"Browser Pool: {update['health']['pool']['permanent']['active']}")
""")
print("\n🌐 Open the dashboard to see WebSocket in action:")
print(f" {MONITOR_DASHBOARD_URL}")
async def demo_6_control_actions():
"""Demo 6: Control Actions - Manual browser management"""
print_section(
"Demo 6: Control Actions",
"Manual control over browser pool and cleanup"
)
async with httpx.AsyncClient(timeout=30.0) as client:
print("🎮 Testing control actions...")
# Force cleanup
print_subsection("Force Immediate Cleanup")
print("🧹 Triggering manual cleanup...")
try:
response = await client.post(f"{CRAWL4AI_BASE_URL}/monitor/actions/cleanup")
if response.status_code == 200:
result = response.json()
print(f" ✅ Cleanup completed")
print(f" • Browsers cleaned: {result.get('cleaned_count', 0)}")
print(f" • Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
else:
print(f" ⚠️ Response: {response.status_code}")
except Exception as e:
print(f" Cleanup action: {e}")
# Get browser list for potential kill/restart
print_subsection("Browser Management")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
browsers = response.json()
cold_browsers = browsers.get('cold', [])
if cold_browsers:
browser_id = cold_browsers[0]['browser_id']
print(f"\n🎯 Example: Kill specific browser")
print(f" POST /monitor/actions/kill_browser")
print(f" JSON: {{'browser_id': '{browser_id[:16]}...'}}")
print(f" → Kills the browser and frees resources")
print(f"\n🔄 Example: Restart browser")
print(f" POST /monitor/actions/restart_browser")
print(f" JSON: {{'browser_id': 'browser_id_here'}}")
print(f" → Restart a specific browser instance")
# Reset statistics
print_subsection("Reset Statistics")
print("📊 Statistics can be reset for fresh monitoring:")
print(f" POST /monitor/stats/reset")
print(f" → Clears all accumulated statistics")
async def demo_7_production_metrics():
"""Demo 7: Production Metrics - Key indicators for operations"""
print_section(
"Demo 7: Production Metrics",
"Critical metrics for production monitoring"
)
async with httpx.AsyncClient(timeout=30.0) as client:
print("📊 Key Production Metrics:")
# Overall health
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/health")
health = response.json()
# Browser efficiency
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
browsers = response.json()
print("\n🎯 Critical Metrics to Track:")
print(f"\n1⃣ Memory Usage Trends")
print(f" • Current: {health['container']['memory_percent']:.1f}%")
print(f" • Alert if: >80%")
print(f" • Action: Trigger cleanup or scale")
print(f"\n2⃣ Request Success Rate")
print(f" • Current: {health['stats']['success_rate_percent']:.1f}%")
print(f" • Target: >95%")
print(f" • Alert if: <90%")
print(f"\n3⃣ Average Latency")
print(f" • Current: {health['stats']['avg_latency_ms']:.0f}ms")
print(f" • Target: <2000ms")
print(f" • Alert if: >5000ms")
print(f"\n4⃣ Browser Pool Efficiency")
print(f" • Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
print(f" • Target: >80%")
print(f" • Indicates: Effective browser pooling")
print(f"\n5⃣ Total Browsers")
print(f" • Current: {browsers['summary']['total_count']}")
print(f" • Alert if: >20 (possible leak)")
print(f" • Check: Janitor is running correctly")
print(f"\n6⃣ Error Frequency")
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/errors?limit=10")
errors = response.json()
print(f" • Recent Errors: {len(errors)}")
print(f" • Alert if: >10 in last hour")
print(f" • Action: Review error patterns")
print("\n💡 Integration Examples:")
print(" • Prometheus: Scrape /monitor/health")
print(" • Alerting: Monitor memory, success rate, latency")
print(" • Dashboards: WebSocket streaming to custom UI")
print(" • Log Aggregation: Collect /monitor/logs/* endpoints")
async def demo_8_self_hosting_value():
"""Demo 8: Self-Hosting Value Proposition"""
print_section(
"Demo 8: Why Self-Host Crawl4AI?",
"The value proposition of owning your infrastructure"
)
print("🎯 Self-Hosting Benefits:\n")
print("🔒 Data Privacy & Security")
print(" • Your data never leaves your infrastructure")
print(" • No third-party access to crawled content")
print(" • Keep sensitive workflows behind your firewall")
print("\n💰 Cost Control")
print(" • No per-request pricing or rate limits")
print(" • Predictable infrastructure costs")
print(" • Scale based on your actual needs")
print("\n🎯 Full Customization")
print(" • Complete control over browser configs")
print(" • Custom hooks and strategies")
print(" • Tailored monitoring and alerting")
print("\n📊 Complete Transparency")
print(" • Real-time monitoring dashboard")
print(" • Full visibility into system performance")
print(" • Detailed request and error tracking")
print("\n⚡ Performance & Flexibility")
print(" • Direct access, no network overhead")
print(" • Integrate with existing infrastructure")
print(" • Custom resource allocation")
print("\n🛡️ Enterprise-Grade Operations")
print(" • Prometheus integration ready")
print(" • WebSocket for real-time dashboards")
print(" • Full API for automation")
print(" • Manual controls for troubleshooting")
print(f"\n🌐 Get Started:")
print(f" docker pull unclecode/crawl4ai:0.7.7")
print(f" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")
print(f" # Visit: {MONITOR_DASHBOARD_URL}")
def print_summary():
"""Print comprehensive demo summary"""
print("\n" + "=" * 70)
print("📊 DEMO SUMMARY - Crawl4AI v0.7.7")
print("=" * 70)
print("\n✨ Features Demonstrated:")
print("=" * 70)
print("✅ System Health Overview")
print(" → Real-time CPU, memory, network, and uptime monitoring")
print("\n✅ Request Tracking")
print(" → Active and completed request monitoring with full details")
print("\n✅ Browser Pool Management")
print(" → 3-tier architecture: Permanent, Hot, and Cold pools")
print(" → Automatic promotion and cleanup")
print("\n✅ Monitor API Endpoints")
print(" → Complete REST API for programmatic access")
print(" → Health, requests, browsers, timeline, logs, errors")
print("\n✅ WebSocket Streaming")
print(" → Real-time updates every 2 seconds")
print(" → Build custom dashboards with live data")
print("\n✅ Control Actions")
print(" → Manual browser management (kill, restart)")
print(" → Force cleanup and statistics reset")
print("\n✅ Production Metrics")
print(" → 6 critical metrics for operational excellence")
print(" → Prometheus integration patterns")
print("\n✅ Self-Hosting Value")
print(" → Data privacy, cost control, full customization")
print(" → Enterprise-grade transparency and control")
print("\n" + "=" * 70)
print("🎯 What's New in v0.7.7?")
print("=" * 70)
print("• 📊 Complete Real-time Monitoring System")
print("• 🌐 Interactive Web Dashboard (/dashboard)")
print("• 🔌 Comprehensive Monitor API")
print("• ⚡ WebSocket Streaming (2-second updates)")
print("• 🎮 Manual Control Actions")
print("• 📈 Production Integration Examples")
print("• 🏭 Prometheus, Alerting, Log Aggregation")
print("• 🔥 Smart Browser Pool (Permanent/Hot/Cold)")
print("• 🧹 Automatic Janitor Cleanup")
print("• 📋 Full Request & Error Tracking")
print("\n" + "=" * 70)
print("💡 Why This Matters")
print("=" * 70)
print("Before v0.7.7: Docker was just a containerized crawler")
print("After v0.7.7: Complete self-hosting platform with enterprise monitoring")
print("\nYou now have:")
print(" • Full visibility into what's happening inside")
print(" • Real-time operational dashboards")
print(" • Complete control over browser resources")
print(" • Production-ready observability")
print(" • Zero external dependencies")
print("\n" + "=" * 70)
print("📚 Next Steps")
print("=" * 70)
print(f"1. Open the dashboard: {MONITOR_DASHBOARD_URL}")
print("2. Read the docs: https://docs.crawl4ai.com/basic/self-hosting/")
print("3. Try the Monitor API endpoints yourself")
print("4. Set up Prometheus integration for production")
print("5. Build custom dashboards with WebSocket streaming")
print("\n" + "=" * 70)
print("🔗 Resources")
print("=" * 70)
print(f"• Dashboard: {MONITOR_DASHBOARD_URL}")
print(f"• Health API: {CRAWL4AI_BASE_URL}/monitor/health")
print(f"• Documentation: https://docs.crawl4ai.com/")
print(f"• GitHub: https://github.com/unclecode/crawl4ai")
print("\n" + "=" * 70)
print("🎉 You're now in control of your web crawling destiny!")
print("=" * 70)
async def main():
"""Run all demos"""
print("\n" + "=" * 70)
print("🚀 Crawl4AI v0.7.7 Release Demo")
print("=" * 70)
print("Feature: Self-Hosting with Real-time Monitoring Dashboard")
print("=" * 70)
# Check if server is running
print("\n🔍 Checking Crawl4AI server...")
server_running = await check_server_health()
if not server_running:
print(f"❌ Cannot connect to Crawl4AI at {CRAWL4AI_BASE_URL}")
print("\nPlease start the Docker container:")
print(" docker pull unclecode/crawl4ai:0.7.7")
print(" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")
print("\nThen re-run this demo.")
return
print(f"✅ Crawl4AI server is running!")
print(f"📊 Dashboard available at: {MONITOR_DASHBOARD_URL}")
# Run all demos
demos = [
demo_1_system_health_overview,
demo_2_request_tracking,
demo_3_browser_pool_management,
demo_4_monitor_api_endpoints,
demo_5_websocket_streaming,
demo_6_control_actions,
demo_7_production_metrics,
demo_8_self_hosting_value,
]
for i, demo_func in enumerate(demos, 1):
try:
await demo_func()
if i < len(demos):
await asyncio.sleep(2) # Brief pause between demos
except KeyboardInterrupt:
print(f"\n\n⚠️ Demo interrupted by user")
return
except Exception as e:
print(f"\n❌ Demo {i} error: {e}")
print("Continuing to next demo...\n")
continue
# Print comprehensive summary
print_summary()
print("\n" + "=" * 70)
print("✅ Demo completed!")
print("=" * 70)
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n\n👋 Demo stopped by user. Thanks for trying Crawl4AI v0.7.7!")
except Exception as e:
print(f"\n\n❌ Demo failed: {e}")
print("Make sure the Docker container is running:")
print(" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")

View File

@@ -0,0 +1,910 @@
#!/usr/bin/env python3
"""
Crawl4AI v0.7.8 Release Demo - Verification Tests
==================================================
This demo ACTUALLY RUNS and VERIFIES the bug fixes in v0.7.8.
Each test executes real code and validates the fix is working.
Bug Fixes Verified:
1. ProxyConfig JSON serialization (#1629)
2. Configurable backoff parameters (#1269)
3. LLM Strategy input_format support (#1178)
4. Raw HTML URL variable (#1116)
5. Relative URLs after redirects (#1268)
6. pypdf migration (#1412)
7. Pydantic v2 ConfigDict (#678)
8. Docker ContentRelevanceFilter (#1642) - requires Docker
9. Docker .cache permissions (#1638) - requires Docker
10. AdaptiveCrawler query expansion (#1621) - requires LLM API key
11. Import statement formatting (#1181)
Usage:
python docs/releases_review/demo_v0.7.8.py
For Docker tests:
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.8
python docs/releases_review/demo_v0.7.8.py
"""
import asyncio
import json
import sys
import warnings
import os
import tempfile
from typing import Tuple, Optional
from dataclasses import dataclass
# Test results tracking
@dataclass
class TestResult:
name: str
issue: str
passed: bool
message: str
skipped: bool = False
results: list[TestResult] = []
def print_header(title: str):
print(f"\n{'=' * 70}")
print(f"{title}")
print(f"{'=' * 70}")
def print_test(name: str, issue: str):
print(f"\n[TEST] {name} ({issue})")
print("-" * 50)
def record_result(name: str, issue: str, passed: bool, message: str, skipped: bool = False):
results.append(TestResult(name, issue, passed, message, skipped))
if skipped:
print(f" SKIPPED: {message}")
elif passed:
print(f" PASSED: {message}")
else:
print(f" FAILED: {message}")
# =============================================================================
# TEST 1: ProxyConfig JSON Serialization (#1629)
# =============================================================================
async def test_proxy_config_serialization():
"""
Verify BrowserConfig.to_dict() properly serializes ProxyConfig to JSON.
BEFORE: ProxyConfig was included as object, causing JSON serialization to fail
AFTER: ProxyConfig.to_dict() is called, producing valid JSON
"""
print_test("ProxyConfig JSON Serialization", "#1629")
try:
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig
# Create config with ProxyConfig
proxy = ProxyConfig(
server="http://proxy.example.com:8080",
username="testuser",
password="testpass"
)
browser_config = BrowserConfig(headless=True, proxy_config=proxy)
# Test 1: to_dict() should return dict for proxy_config
config_dict = browser_config.to_dict()
proxy_dict = config_dict.get('proxy_config')
if not isinstance(proxy_dict, dict):
record_result("ProxyConfig Serialization", "#1629", False,
f"proxy_config is {type(proxy_dict)}, expected dict")
return
# Test 2: Should be JSON serializable
try:
json_str = json.dumps(config_dict)
json.loads(json_str) # Verify valid JSON
except (TypeError, json.JSONDecodeError) as e:
record_result("ProxyConfig Serialization", "#1629", False,
f"JSON serialization failed: {e}")
return
# Test 3: Verify proxy data is preserved
if proxy_dict.get('server') != "http://proxy.example.com:8080":
record_result("ProxyConfig Serialization", "#1629", False,
"Proxy server not preserved in serialization")
return
record_result("ProxyConfig Serialization", "#1629", True,
"BrowserConfig with ProxyConfig serializes to valid JSON")
except Exception as e:
record_result("ProxyConfig Serialization", "#1629", False, f"Exception: {e}")
# =============================================================================
# TEST 2: Configurable Backoff Parameters (#1269)
# =============================================================================
async def test_configurable_backoff():
"""
Verify LLMConfig accepts and stores backoff configuration parameters.
BEFORE: Backoff was hardcoded (delay=2, attempts=3, factor=2)
AFTER: LLMConfig accepts backoff_base_delay, backoff_max_attempts, backoff_exponential_factor
"""
print_test("Configurable Backoff Parameters", "#1269")
try:
from crawl4ai import LLMConfig
# Test 1: Default values
default_config = LLMConfig(provider="openai/gpt-4o-mini")
if default_config.backoff_base_delay != 2:
record_result("Configurable Backoff", "#1269", False,
f"Default base_delay is {default_config.backoff_base_delay}, expected 2")
return
if default_config.backoff_max_attempts != 3:
record_result("Configurable Backoff", "#1269", False,
f"Default max_attempts is {default_config.backoff_max_attempts}, expected 3")
return
if default_config.backoff_exponential_factor != 2:
record_result("Configurable Backoff", "#1269", False,
f"Default exponential_factor is {default_config.backoff_exponential_factor}, expected 2")
return
# Test 2: Custom values
custom_config = LLMConfig(
provider="openai/gpt-4o-mini",
backoff_base_delay=5,
backoff_max_attempts=10,
backoff_exponential_factor=3
)
if custom_config.backoff_base_delay != 5:
record_result("Configurable Backoff", "#1269", False,
f"Custom base_delay is {custom_config.backoff_base_delay}, expected 5")
return
if custom_config.backoff_max_attempts != 10:
record_result("Configurable Backoff", "#1269", False,
f"Custom max_attempts is {custom_config.backoff_max_attempts}, expected 10")
return
if custom_config.backoff_exponential_factor != 3:
record_result("Configurable Backoff", "#1269", False,
f"Custom exponential_factor is {custom_config.backoff_exponential_factor}, expected 3")
return
# Test 3: to_dict() includes backoff params
config_dict = custom_config.to_dict()
if 'backoff_base_delay' not in config_dict:
record_result("Configurable Backoff", "#1269", False,
"backoff_base_delay missing from to_dict()")
return
record_result("Configurable Backoff", "#1269", True,
"LLMConfig accepts and stores custom backoff parameters")
except Exception as e:
record_result("Configurable Backoff", "#1269", False, f"Exception: {e}")
# =============================================================================
# TEST 3: LLM Strategy Input Format (#1178)
# =============================================================================
async def test_llm_input_format():
"""
Verify LLMExtractionStrategy accepts input_format parameter.
BEFORE: Always used markdown input
AFTER: Supports "markdown", "html", "fit_markdown", "cleaned_html", "fit_html"
"""
print_test("LLM Strategy Input Format", "#1178")
try:
from crawl4ai import LLMExtractionStrategy, LLMConfig
llm_config = LLMConfig(provider="openai/gpt-4o-mini")
# Test 1: Default is markdown
default_strategy = LLMExtractionStrategy(
llm_config=llm_config,
instruction="Extract data"
)
if default_strategy.input_format != "markdown":
record_result("LLM Input Format", "#1178", False,
f"Default input_format is '{default_strategy.input_format}', expected 'markdown'")
return
# Test 2: Can set to html
html_strategy = LLMExtractionStrategy(
llm_config=llm_config,
instruction="Extract data",
input_format="html"
)
if html_strategy.input_format != "html":
record_result("LLM Input Format", "#1178", False,
f"HTML input_format is '{html_strategy.input_format}', expected 'html'")
return
# Test 3: Can set to fit_markdown
fit_strategy = LLMExtractionStrategy(
llm_config=llm_config,
instruction="Extract data",
input_format="fit_markdown"
)
if fit_strategy.input_format != "fit_markdown":
record_result("LLM Input Format", "#1178", False,
f"fit_markdown input_format is '{fit_strategy.input_format}'")
return
record_result("LLM Input Format", "#1178", True,
"LLMExtractionStrategy accepts all input_format options")
except Exception as e:
record_result("LLM Input Format", "#1178", False, f"Exception: {e}")
# =============================================================================
# TEST 4: Raw HTML URL Variable (#1116)
# =============================================================================
async def test_raw_html_url_variable():
"""
Verify that raw: prefix URLs pass "Raw HTML" to extraction strategy.
BEFORE: Entire HTML blob was passed as URL parameter
AFTER: "Raw HTML" string is passed as URL parameter
"""
print_test("Raw HTML URL Variable", "#1116")
try:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import ExtractionStrategy
# Custom strategy to capture what URL is passed
class URLCapturingStrategy(ExtractionStrategy):
captured_url = None
def extract(self, url: str, html: str, *args, **kwargs):
URLCapturingStrategy.captured_url = url
return [{"content": "test"}]
html_content = "<html><body><h1>Test</h1></body></html>"
strategy = URLCapturingStrategy()
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html_content}",
config=CrawlerRunConfig(
extraction_strategy=strategy
)
)
captured = URLCapturingStrategy.captured_url
if captured is None:
record_result("Raw HTML URL Variable", "#1116", False,
"Extraction strategy was not called")
return
if captured == html_content or captured.startswith("<html"):
record_result("Raw HTML URL Variable", "#1116", False,
f"URL contains HTML content instead of 'Raw HTML': {captured[:50]}...")
return
if captured != "Raw HTML":
record_result("Raw HTML URL Variable", "#1116", False,
f"URL is '{captured}', expected 'Raw HTML'")
return
record_result("Raw HTML URL Variable", "#1116", True,
"Extraction strategy receives 'Raw HTML' as URL for raw: prefix")
except Exception as e:
record_result("Raw HTML URL Variable", "#1116", False, f"Exception: {e}")
# =============================================================================
# TEST 5: Relative URLs After Redirects (#1268)
# =============================================================================
async def test_redirect_url_handling():
"""
Verify that redirected_url reflects the final URL after JS navigation.
BEFORE: redirected_url was the original URL, not the final URL
AFTER: redirected_url is captured after JS execution completes
"""
print_test("Relative URLs After Redirects", "#1268")
try:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Test with a URL that we know the final state of
# We'll use httpbin which doesn't redirect, but verify the mechanism works
test_url = "https://httpbin.org/html"
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=test_url,
config=CrawlerRunConfig()
)
# Verify redirected_url is populated
if not result.redirected_url:
record_result("Redirect URL Handling", "#1268", False,
"redirected_url is empty")
return
# For non-redirecting URL, should match original or be the final URL
if not result.redirected_url.startswith("https://httpbin.org"):
record_result("Redirect URL Handling", "#1268", False,
f"redirected_url is unexpected: {result.redirected_url}")
return
# Verify links are present and resolved
if result.links:
# Check that internal links have full URLs
internal_links = result.links.get('internal', [])
external_links = result.links.get('external', [])
all_links = internal_links + external_links
for link in all_links[:5]: # Check first 5 links
href = link.get('href', '')
if href and not href.startswith(('http://', 'https://', 'mailto:', 'tel:', '#', 'javascript:')):
record_result("Redirect URL Handling", "#1268", False,
f"Link not resolved to absolute URL: {href}")
return
record_result("Redirect URL Handling", "#1268", True,
f"redirected_url correctly captured: {result.redirected_url}")
except Exception as e:
record_result("Redirect URL Handling", "#1268", False, f"Exception: {e}")
# =============================================================================
# TEST 6: pypdf Migration (#1412)
# =============================================================================
async def test_pypdf_migration():
"""
Verify pypdf is used instead of deprecated PyPDF2.
BEFORE: Used PyPDF2 (deprecated since 2022)
AFTER: Uses pypdf (actively maintained)
"""
print_test("pypdf Migration", "#1412")
try:
# Test 1: pypdf should be importable (if pdf extra is installed)
try:
import pypdf
pypdf_available = True
pypdf_version = pypdf.__version__
except ImportError:
pypdf_available = False
pypdf_version = None
# Test 2: PyPDF2 should NOT be imported by crawl4ai
# Check if the processor uses pypdf
try:
from crawl4ai.processors.pdf import processor
processor_source = open(processor.__file__).read()
uses_pypdf = 'from pypdf' in processor_source or 'import pypdf' in processor_source
uses_pypdf2 = 'from PyPDF2' in processor_source or 'import PyPDF2' in processor_source
if uses_pypdf2 and not uses_pypdf:
record_result("pypdf Migration", "#1412", False,
"PDF processor still uses PyPDF2")
return
if uses_pypdf:
record_result("pypdf Migration", "#1412", True,
f"PDF processor uses pypdf{' v' + pypdf_version if pypdf_version else ''}")
return
else:
record_result("pypdf Migration", "#1412", True,
"PDF processor found, pypdf dependency updated", skipped=not pypdf_available)
return
except ImportError:
# PDF processor not available
if pypdf_available:
record_result("pypdf Migration", "#1412", True,
f"pypdf v{pypdf_version} is installed (PDF processor not loaded)")
else:
record_result("pypdf Migration", "#1412", True,
"PDF support not installed (optional feature)", skipped=True)
return
except Exception as e:
record_result("pypdf Migration", "#1412", False, f"Exception: {e}")
# =============================================================================
# TEST 7: Pydantic v2 ConfigDict (#678)
# =============================================================================
async def test_pydantic_configdict():
"""
Verify no Pydantic deprecation warnings for Config class.
BEFORE: Used deprecated 'class Config' syntax
AFTER: Uses ConfigDict for Pydantic v2 compatibility
"""
print_test("Pydantic v2 ConfigDict", "#678")
try:
import pydantic
from pydantic import __version__ as pydantic_version
# Capture warnings during import
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always", DeprecationWarning)
# Import models that might have Config classes
from crawl4ai.models import CrawlResult, MarkdownGenerationResult
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
# Filter for Pydantic-related deprecation warnings
pydantic_warnings = [
warning for warning in w
if 'pydantic' in str(warning.message).lower()
or 'config' in str(warning.message).lower()
]
if pydantic_warnings:
warning_msgs = [str(w.message) for w in pydantic_warnings[:3]]
record_result("Pydantic ConfigDict", "#678", False,
f"Deprecation warnings: {warning_msgs}")
return
# Verify models work correctly
try:
# Test that models can be instantiated without issues
config = CrawlerRunConfig()
browser = BrowserConfig()
record_result("Pydantic ConfigDict", "#678", True,
f"No deprecation warnings with Pydantic v{pydantic_version}")
except Exception as e:
record_result("Pydantic ConfigDict", "#678", False,
f"Model instantiation failed: {e}")
except Exception as e:
record_result("Pydantic ConfigDict", "#678", False, f"Exception: {e}")
# =============================================================================
# TEST 8: Docker ContentRelevanceFilter (#1642)
# =============================================================================
async def test_docker_content_filter():
"""
Verify ContentRelevanceFilter deserializes correctly in Docker API.
BEFORE: Docker API failed to import/instantiate ContentRelevanceFilter
AFTER: Filter is properly exported and deserializable
"""
print_test("Docker ContentRelevanceFilter", "#1642")
# First verify the fix in local code
try:
# Test 1: ContentRelevanceFilter should be importable from crawl4ai
from crawl4ai import ContentRelevanceFilter
# Test 2: Should be instantiable
filter_instance = ContentRelevanceFilter(
query="test query",
threshold=0.3
)
if not hasattr(filter_instance, 'query'):
record_result("Docker ContentRelevanceFilter", "#1642", False,
"ContentRelevanceFilter missing query attribute")
return
except ImportError as e:
record_result("Docker ContentRelevanceFilter", "#1642", False,
f"ContentRelevanceFilter not exported: {e}")
return
except Exception as e:
record_result("Docker ContentRelevanceFilter", "#1642", False,
f"ContentRelevanceFilter instantiation failed: {e}")
return
# Test Docker API if available
try:
import httpx
async with httpx.AsyncClient(timeout=5.0) as client:
response = await client.get("http://localhost:11235/health")
if response.status_code != 200:
raise Exception("Docker not available")
# Docker is running, test the API
async with httpx.AsyncClient(timeout=30.0) as client:
request = {
"urls": ["https://httpbin.org/html"],
"crawler_config": {
"deep_crawl_strategy": {
"type": "BFSDeepCrawlStrategy",
"max_depth": 1,
"filter_chain": [
{
"type": "ContentTypeFilter",
"allowed_types": ["text/html"]
}
]
}
}
}
response = await client.post(
"http://localhost:11235/crawl",
json=request
)
if response.status_code == 200:
record_result("Docker ContentRelevanceFilter", "#1642", True,
"Filter deserializes correctly in Docker API")
else:
record_result("Docker ContentRelevanceFilter", "#1642", False,
f"Docker API returned {response.status_code}: {response.text[:100]}")
except ImportError:
record_result("Docker ContentRelevanceFilter", "#1642", True,
"ContentRelevanceFilter exportable (Docker test skipped - httpx not installed)",
skipped=True)
except Exception as e:
record_result("Docker ContentRelevanceFilter", "#1642", True,
f"ContentRelevanceFilter exportable (Docker test skipped: {e})",
skipped=True)
# =============================================================================
# TEST 9: Docker Cache Permissions (#1638)
# =============================================================================
async def test_docker_cache_permissions():
"""
Verify Docker image has correct .cache folder permissions.
This test requires Docker container to be running.
"""
print_test("Docker Cache Permissions", "#1638")
try:
import httpx
async with httpx.AsyncClient(timeout=5.0) as client:
response = await client.get("http://localhost:11235/health")
if response.status_code != 200:
raise Exception("Docker not available")
# Test by making a crawl request with caching
async with httpx.AsyncClient(timeout=60.0) as client:
request = {
"urls": ["https://httpbin.org/html"],
"crawler_config": {
"cache_mode": "enabled"
}
}
response = await client.post(
"http://localhost:11235/crawl",
json=request
)
if response.status_code == 200:
result = response.json()
# Check if there were permission errors
if "permission" in str(result).lower() and "denied" in str(result).lower():
record_result("Docker Cache Permissions", "#1638", False,
"Permission denied error in response")
else:
record_result("Docker Cache Permissions", "#1638", True,
"Crawl with caching succeeded in Docker")
else:
error_text = response.text[:200]
if "permission" in error_text.lower():
record_result("Docker Cache Permissions", "#1638", False,
f"Permission error: {error_text}")
else:
record_result("Docker Cache Permissions", "#1638", False,
f"Request failed: {response.status_code}")
except ImportError:
record_result("Docker Cache Permissions", "#1638", True,
"Skipped - httpx not installed", skipped=True)
except Exception as e:
record_result("Docker Cache Permissions", "#1638", True,
f"Skipped - Docker not available: {e}", skipped=True)
# =============================================================================
# TEST 10: AdaptiveCrawler Query Expansion (#1621)
# =============================================================================
async def test_adaptive_crawler_embedding():
"""
Verify EmbeddingStrategy LLM code is uncommented and functional.
BEFORE: LLM call was commented out, using hardcoded mock data
AFTER: Actually calls LLM for query expansion
"""
print_test("AdaptiveCrawler Query Expansion", "#1621")
try:
# Read the source file to verify the fix
import crawl4ai.adaptive_crawler as adaptive_module
source_file = adaptive_module.__file__
with open(source_file, 'r') as f:
source_code = f.read()
# Check that the LLM call is NOT commented out
# Look for the perform_completion_with_backoff call
# Find the EmbeddingStrategy section
if 'class EmbeddingStrategy' not in source_code:
record_result("AdaptiveCrawler Query Expansion", "#1621", True,
"EmbeddingStrategy not in adaptive_crawler (may have moved)",
skipped=True)
return
# Check if the mock data line is commented out
# and the actual LLM call is NOT commented out
lines = source_code.split('\n')
in_embedding_strategy = False
found_llm_call = False
mock_data_commented = False
for i, line in enumerate(lines):
if 'class EmbeddingStrategy' in line:
in_embedding_strategy = True
elif in_embedding_strategy and line.strip().startswith('class '):
in_embedding_strategy = False
if in_embedding_strategy:
# Check for uncommented LLM call
if 'perform_completion_with_backoff' in line and not line.strip().startswith('#'):
found_llm_call = True
# Check for commented mock data
if "variations ={'queries'" in line or 'variations = {\'queries\'' in line:
if line.strip().startswith('#'):
mock_data_commented = True
if found_llm_call:
record_result("AdaptiveCrawler Query Expansion", "#1621", True,
"LLM call is active in EmbeddingStrategy")
else:
# Check if the entire embedding strategy exists but might be structured differently
if 'perform_completion_with_backoff' in source_code:
record_result("AdaptiveCrawler Query Expansion", "#1621", True,
"perform_completion_with_backoff found in module")
else:
record_result("AdaptiveCrawler Query Expansion", "#1621", False,
"LLM call not found or still commented out")
except Exception as e:
record_result("AdaptiveCrawler Query Expansion", "#1621", False, f"Exception: {e}")
# =============================================================================
# TEST 11: Import Statement Formatting (#1181)
# =============================================================================
async def test_import_formatting():
"""
Verify code extraction properly formats import statements.
BEFORE: Import statements were concatenated without newlines
AFTER: Import statements have proper newline separation
"""
print_test("Import Statement Formatting", "#1181")
try:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Create HTML with code containing imports
html_with_code = """
<html>
<body>
<pre><code>
import os
import sys
from pathlib import Path
from typing import List, Dict
def main():
pass
</code></pre>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=f"raw:{html_with_code}",
config=CrawlerRunConfig()
)
markdown = result.markdown.raw_markdown if result.markdown else ""
# Check that imports are not concatenated on the same line
# Bad: "import osimport sys" (no newline between statements)
# This is the actual bug - statements getting merged on same line
bad_patterns = [
"import os import sys", # Space but no newline
"import osimport sys", # No space or newline
"import os from pathlib", # Space but no newline
"import osfrom pathlib", # No space or newline
]
markdown_single_line = markdown.replace('\n', ' ') # Convert newlines to spaces
for pattern in bad_patterns:
# Check if pattern exists without proper line separation
if pattern.replace(' ', '') in markdown_single_line.replace(' ', ''):
# Verify it's actually on same line (not just adjacent after newline removal)
lines = markdown.split('\n')
for line in lines:
if 'import' in line.lower():
# Count import statements on this line
import_count = line.lower().count('import ')
if import_count > 1:
record_result("Import Formatting", "#1181", False,
f"Multiple imports on same line: {line[:60]}...")
return
# Verify imports are present
if "import" in markdown.lower():
record_result("Import Formatting", "#1181", True,
"Import statements are properly line-separated")
else:
record_result("Import Formatting", "#1181", True,
"No import statements found to verify (test HTML may have changed)")
except Exception as e:
record_result("Import Formatting", "#1181", False, f"Exception: {e}")
# =============================================================================
# COMPREHENSIVE CRAWL TEST
# =============================================================================
async def test_comprehensive_crawl():
"""
Run a comprehensive crawl to verify overall stability.
"""
print_test("Comprehensive Crawl Test", "Overall")
try:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
result = await crawler.arun(
url="https://httpbin.org/html",
config=CrawlerRunConfig()
)
# Verify result
checks = []
if result.success:
checks.append("success=True")
else:
record_result("Comprehensive Crawl", "Overall", False,
f"Crawl failed: {result.error_message}")
return
if result.html and len(result.html) > 100:
checks.append(f"html={len(result.html)} chars")
if result.markdown and result.markdown.raw_markdown:
checks.append(f"markdown={len(result.markdown.raw_markdown)} chars")
if result.redirected_url:
checks.append("redirected_url present")
record_result("Comprehensive Crawl", "Overall", True,
f"All checks passed: {', '.join(checks)}")
except Exception as e:
record_result("Comprehensive Crawl", "Overall", False, f"Exception: {e}")
# =============================================================================
# MAIN
# =============================================================================
def print_summary():
"""Print test results summary"""
print_header("TEST RESULTS SUMMARY")
passed = sum(1 for r in results if r.passed and not r.skipped)
failed = sum(1 for r in results if not r.passed and not r.skipped)
skipped = sum(1 for r in results if r.skipped)
print(f"\nTotal: {len(results)} tests")
print(f" Passed: {passed}")
print(f" Failed: {failed}")
print(f" Skipped: {skipped}")
if failed > 0:
print("\nFailed Tests:")
for r in results:
if not r.passed and not r.skipped:
print(f" - {r.name} ({r.issue}): {r.message}")
if skipped > 0:
print("\nSkipped Tests:")
for r in results:
if r.skipped:
print(f" - {r.name} ({r.issue}): {r.message}")
print("\n" + "=" * 70)
if failed == 0:
print("All tests passed! v0.7.8 bug fixes verified.")
else:
print(f"WARNING: {failed} test(s) failed!")
print("=" * 70)
return failed == 0
async def main():
"""Run all verification tests"""
print_header("Crawl4AI v0.7.8 - Bug Fix Verification Tests")
print("Running actual tests to verify bug fixes...")
# Run all tests
tests = [
test_proxy_config_serialization, # #1629
test_configurable_backoff, # #1269
test_llm_input_format, # #1178
test_raw_html_url_variable, # #1116
test_redirect_url_handling, # #1268
test_pypdf_migration, # #1412
test_pydantic_configdict, # #678
test_docker_content_filter, # #1642
test_docker_cache_permissions, # #1638
test_adaptive_crawler_embedding, # #1621
test_import_formatting, # #1181
test_comprehensive_crawl, # Overall
]
for test_func in tests:
try:
await test_func()
except Exception as e:
print(f"\nTest {test_func.__name__} crashed: {e}")
results.append(TestResult(
test_func.__name__,
"Unknown",
False,
f"Crashed: {e}"
))
# Print summary
all_passed = print_summary()
return 0 if all_passed else 1
if __name__ == "__main__":
try:
exit_code = asyncio.run(main())
sys.exit(exit_code)
except KeyboardInterrupt:
print("\n\nTests interrupted by user.")
sys.exit(1)
except Exception as e:
print(f"\n\nTest suite failed: {e}")
import traceback
traceback.print_exc()
sys.exit(1)

View File

@@ -0,0 +1,655 @@
#!/usr/bin/env python3
"""
🚀 Crawl4AI v0.7.5 - Docker Hooks System Complete Demonstration
================================================================
This file demonstrates the NEW Docker Hooks System introduced in v0.7.5.
The Docker Hooks System is a completely NEW feature that provides pipeline
customization through user-provided Python functions. It offers three approaches:
1. String-based hooks for REST API
2. hooks_to_string() utility to convert functions
3. Docker Client with automatic conversion (most convenient)
All three approaches are part of this NEW v0.7.5 feature!
Perfect for video recording and demonstration purposes.
Requirements:
- Docker container running: docker run -p 11235:11235 unclecode/crawl4ai:latest
- crawl4ai v0.7.5 installed: pip install crawl4ai==0.7.5
"""
import asyncio
import requests
import json
import time
from typing import Dict, Any
# Import Crawl4AI components
from crawl4ai import hooks_to_string
from crawl4ai.docker_client import Crawl4aiDockerClient
# Configuration
DOCKER_URL = "http://localhost:11235"
# DOCKER_URL = "http://localhost:11234"
TEST_URLS = [
# "https://httpbin.org/html",
"https://www.kidocode.com",
"https://quotes.toscrape.com",
]
def print_section(title: str, description: str = ""):
"""Print a formatted section header"""
print("\n" + "=" * 70)
print(f" {title}")
if description:
print(f" {description}")
print("=" * 70 + "\n")
def check_docker_service() -> bool:
"""Check if Docker service is running"""
try:
response = requests.get(f"{DOCKER_URL}/health", timeout=3)
return response.status_code == 200
except:
return False
# ============================================================================
# REUSABLE HOOK LIBRARY (NEW in v0.7.5)
# ============================================================================
async def performance_optimization_hook(page, context, **kwargs):
"""
Performance Hook: Block unnecessary resources to speed up crawling
"""
print(" [Hook] 🚀 Optimizing performance - blocking images and ads...")
# Block images
await context.route(
"**/*.{png,jpg,jpeg,gif,webp,svg,ico}",
lambda route: route.abort()
)
# Block ads and analytics
await context.route("**/analytics/*", lambda route: route.abort())
await context.route("**/ads/*", lambda route: route.abort())
await context.route("**/google-analytics.com/*", lambda route: route.abort())
print(" [Hook] ✓ Performance optimization applied")
return page
async def viewport_setup_hook(page, context, **kwargs):
"""
Viewport Hook: Set consistent viewport size for rendering
"""
print(" [Hook] 🖥️ Setting viewport to 1920x1080...")
await page.set_viewport_size({"width": 1920, "height": 1080})
print(" [Hook] ✓ Viewport configured")
return page
async def authentication_headers_hook(page, context, url, **kwargs):
"""
Headers Hook: Add custom authentication and tracking headers
"""
print(f" [Hook] 🔐 Adding custom headers for {url[:50]}...")
await page.set_extra_http_headers({
'X-Crawl4AI-Version': '0.7.5',
'X-Custom-Hook': 'function-based-demo',
'Accept-Language': 'en-US,en;q=0.9',
'User-Agent': 'Crawl4AI/0.7.5 (Educational Demo)'
})
print(" [Hook] ✓ Custom headers added")
return page
async def lazy_loading_handler_hook(page, context, **kwargs):
"""
Content Hook: Handle lazy-loaded content by scrolling
"""
print(" [Hook] 📜 Scrolling to load lazy content...")
# Scroll to bottom
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
# Scroll to middle
await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 2)")
await page.wait_for_timeout(500)
# Scroll back to top
await page.evaluate("window.scrollTo(0, 0)")
await page.wait_for_timeout(500)
print(" [Hook] ✓ Lazy content loaded")
return page
async def page_analytics_hook(page, context, **kwargs):
"""
Analytics Hook: Log page metrics before extraction
"""
print(" [Hook] 📊 Collecting page analytics...")
metrics = await page.evaluate('''
() => ({
title: document.title,
images: document.images.length,
links: document.links.length,
scripts: document.scripts.length,
headings: document.querySelectorAll('h1, h2, h3').length,
paragraphs: document.querySelectorAll('p').length
})
''')
print(f" [Hook] 📈 Page: {metrics['title'][:50]}...")
print(f" Links: {metrics['links']}, Images: {metrics['images']}, "
f"Headings: {metrics['headings']}, Paragraphs: {metrics['paragraphs']}")
return page
# ============================================================================
# DEMO 1: String-Based Hooks (NEW Docker Hooks System)
# ============================================================================
def demo_1_string_based_hooks():
"""
Demonstrate string-based hooks with REST API (part of NEW Docker Hooks System)
"""
print_section(
"DEMO 1: String-Based Hooks (REST API)",
"Part of the NEW Docker Hooks System - hooks as strings"
)
# Define hooks as strings
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print(" [String Hook] Setting up page context...")
# Block images for performance
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f" [String Hook] Navigating to {url[:50]}...")
await page.set_extra_http_headers({
'X-Crawl4AI': 'string-based-hooks',
'X-Demo': 'v0.7.5'
})
return page
""",
"before_retrieve_html": """
async def hook(page, context, **kwargs):
print(" [String Hook] Scrolling page...")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
"""
}
# Prepare request payload
payload = {
"urls": [TEST_URLS[0]],
"hooks": {
"code": hooks_config,
"timeout": 30
},
"crawler_config": {
"cache_mode": "bypass"
}
}
print(f"🎯 Target URL: {TEST_URLS[0]}")
print(f"🔧 Configured {len(hooks_config)} string-based hooks")
print(f"📡 Sending request to Docker API...\n")
try:
start_time = time.time()
response = requests.post(f"{DOCKER_URL}/crawl", json=payload, timeout=60)
execution_time = time.time() - start_time
if response.status_code == 200:
result = response.json()
print(f"\n✅ Request successful! (took {execution_time:.2f}s)")
# Display results
if result.get('results') and result['results'][0].get('success'):
crawl_result = result['results'][0]
html_length = len(crawl_result.get('html', ''))
markdown_length = len(crawl_result.get('markdown', ''))
print(f"\n📊 Results:")
print(f" • HTML length: {html_length:,} characters")
print(f" • Markdown length: {markdown_length:,} characters")
print(f" • URL: {crawl_result.get('url')}")
# Check hooks execution
if 'hooks' in result:
hooks_info = result['hooks']
print(f"\n🎣 Hooks Execution:")
print(f" • Status: {hooks_info['status']['status']}")
print(f" • Attached hooks: {len(hooks_info['status']['attached_hooks'])}")
if 'summary' in hooks_info:
summary = hooks_info['summary']
print(f" • Total executions: {summary['total_executions']}")
print(f" • Successful: {summary['successful']}")
print(f" • Success rate: {summary['success_rate']:.1f}%")
else:
print(f"⚠️ Crawl completed but no results")
else:
print(f"❌ Request failed with status {response.status_code}")
print(f" Error: {response.text[:200]}")
except requests.exceptions.Timeout:
print("⏰ Request timed out after 60 seconds")
except Exception as e:
print(f"❌ Error: {str(e)}")
print("\n" + "" * 70)
print("✓ String-based hooks demo complete\n")
# ============================================================================
# DEMO 2: Function-Based Hooks with hooks_to_string() Utility
# ============================================================================
def demo_2_hooks_to_string_utility():
"""
Demonstrate the new hooks_to_string() utility for converting functions
"""
print_section(
"DEMO 2: hooks_to_string() Utility (NEW! ✨)",
"Convert Python functions to strings for REST API"
)
print("📦 Creating hook functions...")
print(" • performance_optimization_hook")
print(" • viewport_setup_hook")
print(" • authentication_headers_hook")
print(" • lazy_loading_handler_hook")
# Convert function objects to strings using the NEW utility
print("\n🔄 Converting functions to strings with hooks_to_string()...")
hooks_dict = {
"on_page_context_created": performance_optimization_hook,
"before_goto": authentication_headers_hook,
"before_retrieve_html": lazy_loading_handler_hook,
}
hooks_as_strings = hooks_to_string(hooks_dict)
print(f"✅ Successfully converted {len(hooks_as_strings)} functions to strings")
# Show a preview
print("\n📝 Sample converted hook (first 250 characters):")
print("" * 70)
sample_hook = list(hooks_as_strings.values())[0]
print(sample_hook[:250] + "...")
print("" * 70)
# Use the converted hooks with REST API
print("\n📡 Using converted hooks with REST API...")
payload = {
"urls": [TEST_URLS[0]],
"hooks": {
"code": hooks_as_strings,
"timeout": 30
}
}
try:
start_time = time.time()
response = requests.post(f"{DOCKER_URL}/crawl", json=payload, timeout=60)
execution_time = time.time() - start_time
if response.status_code == 200:
result = response.json()
print(f"\n✅ Request successful! (took {execution_time:.2f}s)")
if result.get('results') and result['results'][0].get('success'):
crawl_result = result['results'][0]
print(f" • HTML length: {len(crawl_result.get('html', '')):,} characters")
print(f" • Hooks executed successfully!")
else:
print(f"❌ Request failed: {response.status_code}")
except Exception as e:
print(f"❌ Error: {str(e)}")
print("\n💡 Benefits of hooks_to_string():")
print(" ✓ Write hooks as regular Python functions")
print(" ✓ Full IDE support (autocomplete, syntax highlighting)")
print(" ✓ Type checking and linting")
print(" ✓ Easy to test and debug")
print(" ✓ Reusable across projects")
print(" ✓ Works with any REST API client")
print("\n" + "" * 70)
print("✓ hooks_to_string() utility demo complete\n")
# ============================================================================
# DEMO 3: Docker Client with Automatic Conversion (RECOMMENDED! 🌟)
# ============================================================================
async def demo_3_docker_client_auto_conversion():
"""
Demonstrate Docker Client with automatic hook conversion (RECOMMENDED)
"""
print_section(
"DEMO 3: Docker Client with Auto-Conversion (RECOMMENDED! 🌟)",
"Pass function objects directly - conversion happens automatically!"
)
print("🐳 Initializing Crawl4AI Docker Client...")
client = Crawl4aiDockerClient(base_url=DOCKER_URL)
print("✅ Client ready!\n")
# Use our reusable hook library - just pass the function objects!
print("📚 Using reusable hook library:")
print(" • performance_optimization_hook")
print(" • viewport_setup_hook")
print(" • authentication_headers_hook")
print(" • lazy_loading_handler_hook")
print(" • page_analytics_hook")
print("\n🎯 Target URL: " + TEST_URLS[1])
print("🚀 Starting crawl with automatic hook conversion...\n")
try:
start_time = time.time()
# Pass function objects directly - NO manual conversion needed! ✨
results = await client.crawl(
urls=[TEST_URLS[0]],
hooks={
"on_page_context_created": performance_optimization_hook,
"before_goto": authentication_headers_hook,
"before_retrieve_html": lazy_loading_handler_hook,
"before_return_html": page_analytics_hook,
},
hooks_timeout=30
)
execution_time = time.time() - start_time
print(f"\n✅ Crawl completed! (took {execution_time:.2f}s)\n")
# Display results
if results and results.success:
result = results
print(f"📊 Results:")
print(f" • URL: {result.url}")
print(f" • Success: {result.success}")
print(f" • HTML length: {len(result.html):,} characters")
print(f" • Markdown length: {len(result.markdown):,} characters")
# Show metadata
if result.metadata:
print(f"\n📋 Metadata:")
print(f" • Title: {result.metadata.get('title', 'N/A')}")
print(f" • Description: {result.metadata.get('description', 'N/A')}")
# Show links
if result.links:
internal_count = len(result.links.get('internal', []))
external_count = len(result.links.get('external', []))
print(f"\n🔗 Links Found:")
print(f" • Internal: {internal_count}")
print(f" • External: {external_count}")
else:
print(f"⚠️ Crawl completed but no successful results")
if results:
print(f" Error: {results.error_message}")
except Exception as e:
print(f"❌ Error: {str(e)}")
import traceback
traceback.print_exc()
print("\n🌟 Why Docker Client is RECOMMENDED:")
print(" ✓ Automatic function-to-string conversion")
print(" ✓ No manual hooks_to_string() calls needed")
print(" ✓ Cleaner, more Pythonic code")
print(" ✓ Full type hints and IDE support")
print(" ✓ Built-in error handling")
print(" ✓ Async/await support")
print("\n" + "" * 70)
print("✓ Docker Client auto-conversion demo complete\n")
# ============================================================================
# DEMO 4: Advanced Use Case - Complete Hook Pipeline
# ============================================================================
async def demo_4_complete_hook_pipeline():
"""
Demonstrate a complete hook pipeline using all 8 hook points
"""
print_section(
"DEMO 4: Complete Hook Pipeline",
"Using all 8 available hook points for comprehensive control"
)
# Define all 8 hooks
async def on_browser_created_hook(browser, **kwargs):
"""Hook 1: Called after browser is created"""
print(" [Pipeline] 1/8 Browser created")
return browser
async def on_page_context_created_hook(page, context, **kwargs):
"""Hook 2: Called after page context is created"""
print(" [Pipeline] 2/8 Page context created - setting up...")
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def on_user_agent_updated_hook(page, context, user_agent, **kwargs):
"""Hook 3: Called when user agent is updated"""
print(f" [Pipeline] 3/8 User agent updated: {user_agent[:50]}...")
return page
async def before_goto_hook(page, context, url, **kwargs):
"""Hook 4: Called before navigating to URL"""
print(f" [Pipeline] 4/8 Before navigation to: {url[:60]}...")
return page
async def after_goto_hook(page, context, url, response, **kwargs):
"""Hook 5: Called after navigation completes"""
print(f" [Pipeline] 5/8 After navigation - Status: {response.status if response else 'N/A'}")
await page.wait_for_timeout(1000)
return page
async def on_execution_started_hook(page, context, **kwargs):
"""Hook 6: Called when JavaScript execution starts"""
print(" [Pipeline] 6/8 JavaScript execution started")
return page
async def before_retrieve_html_hook(page, context, **kwargs):
"""Hook 7: Called before retrieving HTML"""
print(" [Pipeline] 7/8 Before HTML retrieval - scrolling...")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
return page
async def before_return_html_hook(page, context, html, **kwargs):
"""Hook 8: Called before returning HTML"""
print(f" [Pipeline] 8/8 Before return - HTML length: {len(html):,} chars")
return page
print("🎯 Target URL: " + TEST_URLS[0])
print("🔧 Configured ALL 8 hook points for complete pipeline control\n")
client = Crawl4aiDockerClient(base_url=DOCKER_URL)
try:
print("🚀 Starting complete pipeline crawl...\n")
start_time = time.time()
results = await client.crawl(
urls=[TEST_URLS[0]],
hooks={
"on_browser_created": on_browser_created_hook,
"on_page_context_created": on_page_context_created_hook,
"on_user_agent_updated": on_user_agent_updated_hook,
"before_goto": before_goto_hook,
"after_goto": after_goto_hook,
"on_execution_started": on_execution_started_hook,
"before_retrieve_html": before_retrieve_html_hook,
"before_return_html": before_return_html_hook,
},
hooks_timeout=45
)
execution_time = time.time() - start_time
if results and results.success:
print(f"\n✅ Complete pipeline executed successfully! (took {execution_time:.2f}s)")
print(f" • All 8 hooks executed in sequence")
print(f" • HTML length: {len(results.html):,} characters")
else:
print(f"⚠️ Pipeline completed with warnings")
except Exception as e:
print(f"❌ Error: {str(e)}")
print("\n📚 Available Hook Points:")
print(" 1. on_browser_created - Browser initialization")
print(" 2. on_page_context_created - Page context setup")
print(" 3. on_user_agent_updated - User agent configuration")
print(" 4. before_goto - Pre-navigation setup")
print(" 5. after_goto - Post-navigation processing")
print(" 6. on_execution_started - JavaScript execution start")
print(" 7. before_retrieve_html - Pre-extraction processing")
print(" 8. before_return_html - Final HTML processing")
print("\n" + "" * 70)
print("✓ Complete hook pipeline demo complete\n")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
async def main():
"""
Run all demonstrations
"""
print("\n" + "=" * 70)
print(" 🚀 Crawl4AI v0.7.5 - Docker Hooks Complete Demonstration")
print("=" * 70)
# Check Docker service
print("\n🔍 Checking Docker service status...")
if not check_docker_service():
print("❌ Docker service is not running!")
print("\n📋 To start the Docker service:")
print(" docker run -p 11235:11235 unclecode/crawl4ai:latest")
print("\nPlease start the service and run this demo again.")
return
print("✅ Docker service is running!\n")
# Run all demos
demos = [
("String-Based Hooks (REST API)", demo_1_string_based_hooks, False),
("hooks_to_string() Utility", demo_2_hooks_to_string_utility, False),
("Docker Client Auto-Conversion", demo_3_docker_client_auto_conversion, True),
# ("Complete Hook Pipeline", demo_4_complete_hook_pipeline, True),
]
for i, (name, demo_func, is_async) in enumerate(demos, 1):
print(f"\n{'🔷' * 35}")
print(f"Starting Demo {i}/{len(demos)}: {name}")
print(f"{'🔷' * 35}\n")
try:
if is_async:
await demo_func()
else:
demo_func()
print(f"✅ Demo {i} completed successfully!")
# Pause between demos (except the last one)
if i < len(demos):
print("\n⏸️ Press Enter to continue to next demo...")
# input()
except KeyboardInterrupt:
print(f"\n⏹️ Demo interrupted by user")
break
except Exception as e:
print(f"\n❌ Demo {i} failed: {str(e)}")
import traceback
traceback.print_exc()
print("\nContinuing to next demo...\n")
continue
# Final summary
print("\n" + "=" * 70)
print(" 🎉 All Demonstrations Complete!")
print("=" * 70)
print("\n📊 Summary of v0.7.5 Docker Hooks System:")
print("\n🆕 COMPLETELY NEW FEATURE in v0.7.5:")
print(" The Docker Hooks System lets you customize the crawling pipeline")
print(" with user-provided Python functions at 8 strategic points.")
print("\n✨ Three Ways to Use Docker Hooks (All NEW!):")
print(" 1. String-based - Write hooks as strings for REST API")
print(" 2. hooks_to_string() - Convert Python functions to strings")
print(" 3. Docker Client - Automatic conversion (RECOMMENDED)")
print("\n💡 Key Benefits:")
print(" ✓ Full IDE support (autocomplete, syntax highlighting)")
print(" ✓ Type checking and linting")
print(" ✓ Easy to test and debug")
print(" ✓ Reusable across projects")
print(" ✓ Complete pipeline control")
print("\n🎯 8 Hook Points Available:")
print(" • on_browser_created, on_page_context_created")
print(" • on_user_agent_updated, before_goto, after_goto")
print(" • on_execution_started, before_retrieve_html, before_return_html")
print("\n📚 Resources:")
print(" • Docs: https://docs.crawl4ai.com")
print(" • GitHub: https://github.com/unclecode/crawl4ai")
print(" • Discord: https://discord.gg/jP8KfhDhyN")
print("\n" + "=" * 70)
print(" Happy Crawling with v0.7.5! 🕷️")
print("=" * 70 + "\n")
if __name__ == "__main__":
print("\n🎬 Starting Crawl4AI v0.7.5 Docker Hooks Demonstration...")
print("Press Ctrl+C anytime to exit\n")
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n\n👋 Demo stopped by user. Thanks for exploring Crawl4AI v0.7.5!")
except Exception as e:
print(f"\n\n❌ Demo error: {str(e)}")
import traceback
traceback.print_exc()

File diff suppressed because it is too large Load Diff

View File

@@ -7,6 +7,7 @@ docs_dir: docs/md_v2
nav:
- Home: 'index.md'
- "📚 Complete SDK Reference": "complete-sdk-reference.md"
- "Ask AI": "core/ask-ai.md"
- "Quick Start": "core/quickstart.md"
- "Code Examples": "core/examples.md"

View File

@@ -31,7 +31,7 @@ dependencies = [
"rank-bm25~=0.2",
"snowballstemmer~=2.2",
"pydantic>=2.10",
"pyOpenSSL>=24.3.0",
"pyOpenSSL>=25.3.0",
"psutil>=6.1.1",
"PyYAML>=6.0",
"nltk>=3.9.1",
@@ -59,13 +59,13 @@ classifiers = [
]
[project.optional-dependencies]
pdf = ["PyPDF2"]
pdf = ["pypdf"]
torch = ["torch", "nltk", "scikit-learn"]
transformer = ["transformers", "tokenizers", "sentence-transformers"]
cosine = ["torch", "transformers", "nltk", "sentence-transformers"]
sync = ["selenium"]
all = [
"PyPDF2",
"pypdf",
"torch",
"nltk",
"scikit-learn",

View File

@@ -19,7 +19,7 @@ rank-bm25~=0.2
colorama~=0.4
snowballstemmer~=2.2
pydantic>=2.10
pyOpenSSL>=24.3.0
pyOpenSSL>=25.3.0
psutil>=6.1.1
PyYAML>=6.0
nltk>=3.9.1
@@ -33,4 +33,4 @@ shapely>=2.0.0
fake-useragent>=2.2.0
pdf2image>=1.17.0
PyPDF2>=3.0.1
pypdf>=6.0.0

401
test_llm_webhook_feature.py Normal file
View File

@@ -0,0 +1,401 @@
#!/usr/bin/env python3
"""
Test script to validate webhook implementation for /llm/job endpoint.
This tests that the /llm/job endpoint now supports webhooks
following the same pattern as /crawl/job.
"""
import sys
import os
# Add deploy/docker to path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'deploy', 'docker'))
def test_llm_job_payload_model():
"""Test that LlmJobPayload includes webhook_config field"""
print("=" * 60)
print("TEST 1: LlmJobPayload Model")
print("=" * 60)
try:
from job import LlmJobPayload
from schemas import WebhookConfig
from pydantic import ValidationError
# Test with webhook_config
payload_dict = {
"url": "https://example.com",
"q": "Extract main content",
"schema": None,
"cache": False,
"provider": None,
"webhook_config": {
"webhook_url": "https://myapp.com/webhook",
"webhook_data_in_payload": True,
"webhook_headers": {"X-Secret": "token"}
}
}
payload = LlmJobPayload(**payload_dict)
print(f"✅ LlmJobPayload accepts webhook_config")
print(f" - URL: {payload.url}")
print(f" - Query: {payload.q}")
print(f" - Webhook URL: {payload.webhook_config.webhook_url}")
print(f" - Data in payload: {payload.webhook_config.webhook_data_in_payload}")
# Test without webhook_config (should be optional)
minimal_payload = {
"url": "https://example.com",
"q": "Extract content"
}
payload2 = LlmJobPayload(**minimal_payload)
assert payload2.webhook_config is None, "webhook_config should be optional"
print(f"✅ LlmJobPayload works without webhook_config (optional)")
return True
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_handle_llm_request_signature():
"""Test that handle_llm_request accepts webhook_config parameter"""
print("\n" + "=" * 60)
print("TEST 2: handle_llm_request Function Signature")
print("=" * 60)
try:
from api import handle_llm_request
import inspect
sig = inspect.signature(handle_llm_request)
params = list(sig.parameters.keys())
print(f"Function parameters: {params}")
if 'webhook_config' in params:
print(f"✅ handle_llm_request has webhook_config parameter")
# Check that it's optional with default None
webhook_param = sig.parameters['webhook_config']
if webhook_param.default is None or webhook_param.default == inspect.Parameter.empty:
print(f"✅ webhook_config is optional (default: {webhook_param.default})")
else:
print(f"⚠️ webhook_config default is: {webhook_param.default}")
return True
else:
print(f"❌ handle_llm_request missing webhook_config parameter")
return False
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_process_llm_extraction_signature():
"""Test that process_llm_extraction accepts webhook_config parameter"""
print("\n" + "=" * 60)
print("TEST 3: process_llm_extraction Function Signature")
print("=" * 60)
try:
from api import process_llm_extraction
import inspect
sig = inspect.signature(process_llm_extraction)
params = list(sig.parameters.keys())
print(f"Function parameters: {params}")
if 'webhook_config' in params:
print(f"✅ process_llm_extraction has webhook_config parameter")
webhook_param = sig.parameters['webhook_config']
if webhook_param.default is None or webhook_param.default == inspect.Parameter.empty:
print(f"✅ webhook_config is optional (default: {webhook_param.default})")
else:
print(f"⚠️ webhook_config default is: {webhook_param.default}")
return True
else:
print(f"❌ process_llm_extraction missing webhook_config parameter")
return False
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_webhook_integration_in_api():
"""Test that api.py properly integrates webhook notifications"""
print("\n" + "=" * 60)
print("TEST 4: Webhook Integration in process_llm_extraction")
print("=" * 60)
try:
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
with open(api_file, 'r') as f:
api_content = f.read()
# Check for WebhookDeliveryService initialization
if 'webhook_service = WebhookDeliveryService(config)' in api_content:
print("✅ process_llm_extraction initializes WebhookDeliveryService")
else:
print("❌ Missing WebhookDeliveryService initialization in process_llm_extraction")
return False
# Check for notify_job_completion calls with llm_extraction
if 'task_type="llm_extraction"' in api_content:
print("✅ Uses correct task_type='llm_extraction' for notifications")
else:
print("❌ Missing task_type='llm_extraction' in webhook notifications")
return False
# Count webhook notification calls (should have at least 3: success + 2 failure paths)
notification_count = api_content.count('await webhook_service.notify_job_completion')
# Find only in process_llm_extraction function
llm_func_start = api_content.find('async def process_llm_extraction')
llm_func_end = api_content.find('\nasync def ', llm_func_start + 1)
if llm_func_end == -1:
llm_func_end = len(api_content)
llm_func_content = api_content[llm_func_start:llm_func_end]
llm_notification_count = llm_func_content.count('await webhook_service.notify_job_completion')
print(f"✅ Found {llm_notification_count} webhook notification calls in process_llm_extraction")
if llm_notification_count >= 3:
print(f"✅ Sufficient notification points (success + failure paths)")
else:
print(f"⚠️ Expected at least 3 notification calls, found {llm_notification_count}")
return True
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_job_endpoint_integration():
"""Test that /llm/job endpoint extracts and passes webhook_config"""
print("\n" + "=" * 60)
print("TEST 5: /llm/job Endpoint Integration")
print("=" * 60)
try:
job_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'job.py')
with open(job_file, 'r') as f:
job_content = f.read()
# Find the llm_job_enqueue function
llm_job_start = job_content.find('async def llm_job_enqueue')
llm_job_end = job_content.find('\n\n@router', llm_job_start + 1)
if llm_job_end == -1:
llm_job_end = job_content.find('\n\nasync def', llm_job_start + 1)
llm_job_func = job_content[llm_job_start:llm_job_end]
# Check for webhook_config extraction
if 'webhook_config = None' in llm_job_func:
print("✅ llm_job_enqueue initializes webhook_config variable")
else:
print("❌ Missing webhook_config initialization")
return False
if 'if payload.webhook_config:' in llm_job_func:
print("✅ llm_job_enqueue checks for payload.webhook_config")
else:
print("❌ Missing webhook_config check")
return False
if 'webhook_config = payload.webhook_config.model_dump(mode=\'json\')' in llm_job_func:
print("✅ llm_job_enqueue converts webhook_config to dict")
else:
print("❌ Missing webhook_config.model_dump conversion")
return False
if 'webhook_config=webhook_config' in llm_job_func:
print("✅ llm_job_enqueue passes webhook_config to handle_llm_request")
else:
print("❌ Missing webhook_config parameter in handle_llm_request call")
return False
return True
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_create_new_task_integration():
"""Test that create_new_task stores webhook_config in Redis"""
print("\n" + "=" * 60)
print("TEST 6: create_new_task Webhook Storage")
print("=" * 60)
try:
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
with open(api_file, 'r') as f:
api_content = f.read()
# Find create_new_task function
create_task_start = api_content.find('async def create_new_task')
create_task_end = api_content.find('\nasync def ', create_task_start + 1)
if create_task_end == -1:
create_task_end = len(api_content)
create_task_func = api_content[create_task_start:create_task_end]
# Check for webhook_config storage
if 'if webhook_config:' in create_task_func:
print("✅ create_new_task checks for webhook_config")
else:
print("❌ Missing webhook_config check in create_new_task")
return False
if 'task_data["webhook_config"] = json.dumps(webhook_config)' in create_task_func:
print("✅ create_new_task stores webhook_config in Redis task data")
else:
print("❌ Missing webhook_config storage in task_data")
return False
# Check that webhook_config is passed to process_llm_extraction
if 'webhook_config' in create_task_func and 'background_tasks.add_task' in create_task_func:
print("✅ create_new_task passes webhook_config to background task")
else:
print("⚠️ Could not verify webhook_config passed to background task")
return True
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def test_pattern_consistency():
"""Test that /llm/job follows the same pattern as /crawl/job"""
print("\n" + "=" * 60)
print("TEST 7: Pattern Consistency with /crawl/job")
print("=" * 60)
try:
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
with open(api_file, 'r') as f:
api_content = f.read()
# Find handle_crawl_job to compare pattern
crawl_job_start = api_content.find('async def handle_crawl_job')
crawl_job_end = api_content.find('\nasync def ', crawl_job_start + 1)
if crawl_job_end == -1:
crawl_job_end = len(api_content)
crawl_job_func = api_content[crawl_job_start:crawl_job_end]
# Find process_llm_extraction
llm_extract_start = api_content.find('async def process_llm_extraction')
llm_extract_end = api_content.find('\nasync def ', llm_extract_start + 1)
if llm_extract_end == -1:
llm_extract_end = len(api_content)
llm_extract_func = api_content[llm_extract_start:llm_extract_end]
print("Checking pattern consistency...")
# Both should initialize WebhookDeliveryService
crawl_has_service = 'webhook_service = WebhookDeliveryService(config)' in crawl_job_func
llm_has_service = 'webhook_service = WebhookDeliveryService(config)' in llm_extract_func
if crawl_has_service and llm_has_service:
print("✅ Both initialize WebhookDeliveryService")
else:
print(f"❌ Service initialization mismatch (crawl: {crawl_has_service}, llm: {llm_has_service})")
return False
# Both should call notify_job_completion on success
crawl_notifies_success = 'status="completed"' in crawl_job_func and 'notify_job_completion' in crawl_job_func
llm_notifies_success = 'status="completed"' in llm_extract_func and 'notify_job_completion' in llm_extract_func
if crawl_notifies_success and llm_notifies_success:
print("✅ Both notify on success")
else:
print(f"❌ Success notification mismatch (crawl: {crawl_notifies_success}, llm: {llm_notifies_success})")
return False
# Both should call notify_job_completion on failure
crawl_notifies_failure = 'status="failed"' in crawl_job_func and 'error=' in crawl_job_func
llm_notifies_failure = 'status="failed"' in llm_extract_func and 'error=' in llm_extract_func
if crawl_notifies_failure and llm_notifies_failure:
print("✅ Both notify on failure")
else:
print(f"❌ Failure notification mismatch (crawl: {crawl_notifies_failure}, llm: {llm_notifies_failure})")
return False
print("✅ /llm/job follows the same pattern as /crawl/job")
return True
except Exception as e:
print(f"❌ Failed: {e}")
import traceback
traceback.print_exc()
return False
def main():
"""Run all tests"""
print("\n🧪 LLM Job Webhook Feature Validation")
print("=" * 60)
print("Testing that /llm/job now supports webhooks like /crawl/job")
print("=" * 60 + "\n")
results = []
# Run all tests
results.append(("LlmJobPayload Model", test_llm_job_payload_model()))
results.append(("handle_llm_request Signature", test_handle_llm_request_signature()))
results.append(("process_llm_extraction Signature", test_process_llm_extraction_signature()))
results.append(("Webhook Integration", test_webhook_integration_in_api()))
results.append(("/llm/job Endpoint", test_job_endpoint_integration()))
results.append(("create_new_task Storage", test_create_new_task_integration()))
results.append(("Pattern Consistency", test_pattern_consistency()))
# Print summary
print("\n" + "=" * 60)
print("TEST SUMMARY")
print("=" * 60)
passed = sum(1 for _, result in results if result)
total = len(results)
for test_name, result in results:
status = "✅ PASS" if result else "❌ FAIL"
print(f"{status} - {test_name}")
print(f"\n{'=' * 60}")
print(f"Results: {passed}/{total} tests passed")
print(f"{'=' * 60}")
if passed == total:
print("\n🎉 All tests passed! /llm/job webhook feature is correctly implemented.")
print("\n📝 Summary of changes:")
print(" 1. LlmJobPayload model includes webhook_config field")
print(" 2. /llm/job endpoint extracts and passes webhook_config")
print(" 3. handle_llm_request accepts webhook_config parameter")
print(" 4. create_new_task stores webhook_config in Redis")
print(" 5. process_llm_extraction sends webhook notifications")
print(" 6. Follows the same pattern as /crawl/job")
return 0
else:
print(f"\n⚠️ {total - passed} test(s) failed. Please review the output above.")
return 1
if __name__ == "__main__":
exit(main())

View File

@@ -0,0 +1,307 @@
"""
Simple test script to validate webhook implementation without running full server.
This script tests:
1. Webhook module imports and syntax
2. WebhookDeliveryService initialization
3. Payload construction logic
4. Configuration parsing
"""
import sys
import os
import json
from datetime import datetime, timezone
# Add deploy/docker to path to import modules
# sys.path.insert(0, '/home/user/crawl4ai/deploy/docker')
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'deploy', 'docker'))
def test_imports():
"""Test that all webhook-related modules can be imported"""
print("=" * 60)
print("TEST 1: Module Imports")
print("=" * 60)
try:
from webhook import WebhookDeliveryService
print("✅ webhook.WebhookDeliveryService imported successfully")
except Exception as e:
print(f"❌ Failed to import webhook module: {e}")
return False
try:
from schemas import WebhookConfig, WebhookPayload
print("✅ schemas.WebhookConfig imported successfully")
print("✅ schemas.WebhookPayload imported successfully")
except Exception as e:
print(f"❌ Failed to import schemas: {e}")
return False
return True
def test_webhook_service_init():
"""Test WebhookDeliveryService initialization"""
print("\n" + "=" * 60)
print("TEST 2: WebhookDeliveryService Initialization")
print("=" * 60)
try:
from webhook import WebhookDeliveryService
# Test with default config
config = {
"webhooks": {
"enabled": True,
"default_url": None,
"data_in_payload": False,
"retry": {
"max_attempts": 5,
"initial_delay_ms": 1000,
"max_delay_ms": 32000,
"timeout_ms": 30000
},
"headers": {
"User-Agent": "Crawl4AI-Webhook/1.0"
}
}
}
service = WebhookDeliveryService(config)
print(f"✅ Service initialized successfully")
print(f" - Max attempts: {service.max_attempts}")
print(f" - Initial delay: {service.initial_delay}s")
print(f" - Max delay: {service.max_delay}s")
print(f" - Timeout: {service.timeout}s")
# Verify calculations
assert service.max_attempts == 5, "Max attempts should be 5"
assert service.initial_delay == 1.0, "Initial delay should be 1.0s"
assert service.max_delay == 32.0, "Max delay should be 32.0s"
assert service.timeout == 30.0, "Timeout should be 30.0s"
print("✅ All configuration values correct")
return True
except Exception as e:
print(f"❌ Service initialization failed: {e}")
import traceback
traceback.print_exc()
return False
def test_webhook_config_model():
"""Test WebhookConfig Pydantic model"""
print("\n" + "=" * 60)
print("TEST 3: WebhookConfig Model Validation")
print("=" * 60)
try:
from schemas import WebhookConfig
from pydantic import ValidationError
# Test valid config
valid_config = {
"webhook_url": "https://example.com/webhook",
"webhook_data_in_payload": True,
"webhook_headers": {"X-Secret": "token123"}
}
config = WebhookConfig(**valid_config)
print(f"✅ Valid config accepted:")
print(f" - URL: {config.webhook_url}")
print(f" - Data in payload: {config.webhook_data_in_payload}")
print(f" - Headers: {config.webhook_headers}")
# Test minimal config
minimal_config = {
"webhook_url": "https://example.com/webhook"
}
config2 = WebhookConfig(**minimal_config)
print(f"✅ Minimal config accepted (defaults applied):")
print(f" - URL: {config2.webhook_url}")
print(f" - Data in payload: {config2.webhook_data_in_payload}")
print(f" - Headers: {config2.webhook_headers}")
# Test invalid URL
try:
invalid_config = {
"webhook_url": "not-a-url"
}
config3 = WebhookConfig(**invalid_config)
print(f"❌ Invalid URL should have been rejected")
return False
except ValidationError as e:
print(f"✅ Invalid URL correctly rejected")
return True
except Exception as e:
print(f"❌ Model validation test failed: {e}")
import traceback
traceback.print_exc()
return False
def test_payload_construction():
"""Test webhook payload construction logic"""
print("\n" + "=" * 60)
print("TEST 4: Payload Construction")
print("=" * 60)
try:
# Simulate payload construction from notify_job_completion
task_id = "crawl_abc123"
task_type = "crawl"
status = "completed"
urls = ["https://example.com"]
payload = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls": urls
}
print(f"✅ Basic payload constructed:")
print(json.dumps(payload, indent=2))
# Test with error
error_payload = {
"task_id": "crawl_xyz789",
"task_type": "crawl",
"status": "failed",
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls": ["https://example.com"],
"error": "Connection timeout"
}
print(f"\n✅ Error payload constructed:")
print(json.dumps(error_payload, indent=2))
# Test with data
data_payload = {
"task_id": "crawl_def456",
"task_type": "crawl",
"status": "completed",
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls": ["https://example.com"],
"data": {
"results": [
{"url": "https://example.com", "markdown": "# Example"}
]
}
}
print(f"\n✅ Data payload constructed:")
print(json.dumps(data_payload, indent=2))
return True
except Exception as e:
print(f"❌ Payload construction failed: {e}")
import traceback
traceback.print_exc()
return False
def test_exponential_backoff():
"""Test exponential backoff calculation"""
print("\n" + "=" * 60)
print("TEST 5: Exponential Backoff Calculation")
print("=" * 60)
try:
initial_delay = 1.0 # 1 second
max_delay = 32.0 # 32 seconds
print("Backoff delays for 5 attempts:")
for attempt in range(5):
delay = min(initial_delay * (2 ** attempt), max_delay)
print(f" Attempt {attempt + 1}: {delay}s")
# Verify the sequence: 1s, 2s, 4s, 8s, 16s
expected = [1.0, 2.0, 4.0, 8.0, 16.0]
actual = [min(initial_delay * (2 ** i), max_delay) for i in range(5)]
assert actual == expected, f"Expected {expected}, got {actual}"
print("✅ Exponential backoff sequence correct")
return True
except Exception as e:
print(f"❌ Backoff calculation failed: {e}")
return False
def test_api_integration():
"""Test that api.py imports webhook module correctly"""
print("\n" + "=" * 60)
print("TEST 6: API Integration")
print("=" * 60)
try:
# Check if api.py can import webhook module
api_path = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
with open(api_path, 'r') as f:
api_content = f.read()
if 'from webhook import WebhookDeliveryService' in api_content:
print("✅ api.py imports WebhookDeliveryService")
else:
print("❌ api.py missing webhook import")
return False
if 'WebhookDeliveryService(config)' in api_content:
print("✅ api.py initializes WebhookDeliveryService")
else:
print("❌ api.py doesn't initialize WebhookDeliveryService")
return False
if 'notify_job_completion' in api_content:
print("✅ api.py calls notify_job_completion")
else:
print("❌ api.py doesn't call notify_job_completion")
return False
return True
except Exception as e:
print(f"❌ API integration check failed: {e}")
return False
def main():
"""Run all tests"""
print("\n🧪 Webhook Implementation Validation Tests")
print("=" * 60)
results = []
# Run tests
results.append(("Module Imports", test_imports()))
results.append(("Service Initialization", test_webhook_service_init()))
results.append(("Config Model", test_webhook_config_model()))
results.append(("Payload Construction", test_payload_construction()))
results.append(("Exponential Backoff", test_exponential_backoff()))
results.append(("API Integration", test_api_integration()))
# Print summary
print("\n" + "=" * 60)
print("TEST SUMMARY")
print("=" * 60)
passed = sum(1 for _, result in results if result)
total = len(results)
for test_name, result in results:
status = "✅ PASS" if result else "❌ FAIL"
print(f"{status} - {test_name}")
print(f"\n{'=' * 60}")
print(f"Results: {passed}/{total} tests passed")
print(f"{'=' * 60}")
if passed == total:
print("\n🎉 All tests passed! Webhook implementation is valid.")
return 0
else:
print(f"\n⚠️ {total - passed} test(s) failed. Please review the output above.")
return 1
if __name__ == "__main__":
exit(main())

View File

@@ -0,0 +1,251 @@
# Webhook Feature Test Script
This directory contains a comprehensive test script for the webhook feature implementation.
## Overview
The `test_webhook_feature.sh` script automates the entire process of testing the webhook feature:
1. ✅ Fetches and switches to the webhook feature branch
2. ✅ Activates the virtual environment
3. ✅ Installs all required dependencies
4. ✅ Starts Redis server in background
5. ✅ Starts Crawl4AI server in background
6. ✅ Runs webhook integration test
7. ✅ Verifies job completion via webhook
8. ✅ Cleans up and returns to original branch
## Prerequisites
- Python 3.10+
- Virtual environment already created (`venv/` in project root)
- Git repository with the webhook feature branch
- `redis-server` (script will attempt to install if missing)
- `curl` and `lsof` commands available
## Usage
### Quick Start
From the project root:
```bash
./tests/test_webhook_feature.sh
```
Or from the tests directory:
```bash
cd tests
./test_webhook_feature.sh
```
### What the Script Does
#### Step 1: Branch Management
- Saves your current branch
- Fetches the webhook feature branch from remote
- Switches to the webhook feature branch
#### Step 2: Environment Setup
- Activates your existing virtual environment
- Installs dependencies from `deploy/docker/requirements.txt`
- Installs Flask for the webhook receiver
#### Step 3: Service Startup
- Starts Redis server on port 6379
- Starts Crawl4AI server on port 11235
- Waits for server health check to pass
#### Step 4: Webhook Test
- Creates a webhook receiver on port 8080
- Submits a crawl job for `https://example.com` with webhook config
- Waits for webhook notification (60s timeout)
- Verifies webhook payload contains expected data
#### Step 5: Cleanup
- Stops webhook receiver
- Stops Crawl4AI server
- Stops Redis server
- Returns to your original branch
## Expected Output
```
[INFO] Starting webhook feature test script
[INFO] Project root: /path/to/crawl4ai
[INFO] Step 1: Fetching PR branch...
[INFO] Current branch: develop
[SUCCESS] Branch fetched
[INFO] Step 2: Switching to branch: claude/implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp
[SUCCESS] Switched to webhook feature branch
[INFO] Step 3: Activating virtual environment...
[SUCCESS] Virtual environment activated
[INFO] Step 4: Installing server dependencies...
[SUCCESS] Dependencies installed
[INFO] Step 5a: Starting Redis...
[SUCCESS] Redis started (PID: 12345)
[INFO] Step 5b: Starting server on port 11235...
[INFO] Server started (PID: 12346)
[INFO] Waiting for server to be ready...
[SUCCESS] Server is ready!
[INFO] Step 6: Creating webhook test script...
[INFO] Running webhook test...
🚀 Submitting crawl job with webhook...
✅ Job submitted successfully, task_id: crawl_abc123
⏳ Waiting for webhook notification...
✅ Webhook received: {
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-22T00:00:00.000000+00:00",
"urls": ["https://example.com"],
"data": { ... }
}
✅ Webhook received!
Task ID: crawl_abc123
Status: completed
URLs: ['https://example.com']
✅ Data included in webhook payload
📄 Crawled 1 URL(s)
- https://example.com: 1234 chars
🎉 Webhook test PASSED!
[INFO] Step 7: Verifying test results...
[SUCCESS] ✅ Webhook test PASSED!
[SUCCESS] All tests completed successfully! 🎉
[INFO] Cleanup will happen automatically...
[INFO] Starting cleanup...
[INFO] Stopping webhook receiver...
[INFO] Stopping server...
[INFO] Stopping Redis...
[INFO] Switching back to branch: develop
[SUCCESS] Cleanup complete
```
## Troubleshooting
### Server Failed to Start
If the server fails to start, check the logs:
```bash
tail -100 /tmp/crawl4ai_server.log
```
Common issues:
- Port 11235 already in use: `lsof -ti:11235 | xargs kill -9`
- Missing dependencies: Check that all packages are installed
### Redis Connection Failed
Check if Redis is running:
```bash
redis-cli ping
# Should return: PONG
```
If not running:
```bash
redis-server --port 6379 --daemonize yes
```
### Webhook Not Received
The script has a 60-second timeout for webhook delivery. If the webhook isn't received:
1. Check server logs: `/tmp/crawl4ai_server.log`
2. Verify webhook receiver is running on port 8080
3. Check network connectivity between components
### Script Interruption
If the script is interrupted (Ctrl+C), cleanup happens automatically via trap. The script will:
- Kill all background processes
- Stop Redis
- Return to your original branch
To manually cleanup if needed:
```bash
# Kill processes by port
lsof -ti:11235 | xargs kill -9 # Server
lsof -ti:8080 | xargs kill -9 # Webhook receiver
lsof -ti:6379 | xargs kill -9 # Redis
# Return to your branch
git checkout develop # or your branch name
```
## Testing Different URLs
To test with a different URL, modify the script or create a custom test:
```python
payload = {
"urls": ["https://your-url-here.com"],
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": "http://localhost:8080/webhook",
"webhook_data_in_payload": True
}
}
```
## Files Generated
The script creates temporary files:
- `/tmp/crawl4ai_server.log` - Server output logs
- `/tmp/test_webhook.py` - Webhook test Python script
These are not cleaned up automatically so you can review them after the test.
## Exit Codes
- `0` - All tests passed successfully
- `1` - Test failed (check output for details)
## Safety Features
- ✅ Automatic cleanup on exit, interrupt, or error
- ✅ Returns to original branch on completion
- ✅ Kills all background processes
- ✅ Comprehensive error handling
- ✅ Colored output for easy reading
- ✅ Detailed logging at each step
## Notes
- The script uses `set -e` to exit on any command failure
- All background processes are tracked and cleaned up
- The virtual environment must exist before running
- Redis must be available (installed or installable via apt-get/brew)
## Integration with CI/CD
This script can be integrated into CI/CD pipelines:
```yaml
# Example GitHub Actions
- name: Test Webhook Feature
run: |
chmod +x tests/test_webhook_feature.sh
./tests/test_webhook_feature.sh
```
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Review server logs at `/tmp/crawl4ai_server.log`
3. Ensure all prerequisites are met
4. Open an issue with the full output of the script

View File

@@ -0,0 +1,118 @@
"""Test delayed redirect WITH wait_for - does link resolution use correct URL?"""
import asyncio
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler
class RedirectTestHandler(SimpleHTTPRequestHandler):
def log_message(self, format, *args):
pass
def do_GET(self):
if self.path == "/page-a":
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
content = """
<!DOCTYPE html>
<html>
<head><title>Page A</title></head>
<body>
<h1>Page A - Will redirect after 200ms</h1>
<script>
setTimeout(function() {
window.location.href = '/redirect-target/';
}, 200);
</script>
</body>
</html>
"""
self.wfile.write(content.encode())
elif self.path.startswith("/redirect-target"):
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
content = """
<!DOCTYPE html>
<html>
<head><title>Redirect Target</title></head>
<body>
<h1>Redirect Target</h1>
<nav id="target-nav">
<a href="subpage-1">Subpage 1</a>
<a href="subpage-2">Subpage 2</a>
</nav>
</body>
</html>
"""
self.wfile.write(content.encode())
else:
self.send_response(404)
self.end_headers()
async def main():
import socket
class ReuseAddrHTTPServer(HTTPServer):
allow_reuse_address = True
server = ReuseAddrHTTPServer(("localhost", 8769), RedirectTestHandler)
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()
try:
import sys
sys.path.insert(0, '/Users/nasrin/vscode/c4ai-uc/develop')
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
print("=" * 60)
print("TEST: Delayed JS redirect WITH wait_for='css:#target-nav'")
print("This waits for the redirect to complete")
print("=" * 60)
browser_config = BrowserConfig(headless=True, verbose=False)
crawl_config = CrawlerRunConfig(
cache_mode="bypass",
wait_for="css:#target-nav" # Wait for element on redirect target
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="http://localhost:8769/page-a",
config=crawl_config
)
print(f"Original URL: http://localhost:8769/page-a")
print(f"Redirected URL returned: {result.redirected_url}")
print(f"HTML contains 'Redirect Target': {'Redirect Target' in result.html}")
print()
if "/redirect-target" in (result.redirected_url or ""):
print("✓ redirected_url is CORRECT")
else:
print("✗ BUG #1: redirected_url is WRONG - still shows original URL!")
# Check links
all_links = []
if isinstance(result.links, dict):
all_links = result.links.get("internal", []) + result.links.get("external", [])
print(f"\nLinks found ({len(all_links)} total):")
bug_found = False
for link in all_links:
href = link.get("href", "") if isinstance(link, dict) else getattr(link, 'href', "")
if "subpage" in href:
print(f" {href}")
if "/page-a/" in href:
print(" ^^^ BUG #2: Link resolved with WRONG base URL!")
bug_found = True
elif "/redirect-target/" in href:
print(" ^^^ CORRECT")
if not bug_found and all_links:
print("\n✓ Link resolution is CORRECT")
finally:
server.shutdown()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -7,12 +7,13 @@ and serve as functional tests.
import asyncio
import os
import sys
import time
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.browser_manager import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
@@ -24,8 +25,8 @@ async def test_cdp_launch_connect():
logger.info("Testing launch and connect via CDP", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
browser_mode="cdp",
use_managed_browser=True,
headless=True
)
@@ -62,17 +63,18 @@ async def test_cdp_launch_connect():
return False
async def test_cdp_with_user_data_dir():
"""Test CDP browser with a user data directory."""
"""Test CDP browser with a user data directory and storage state."""
logger.info("Testing CDP browser with user data directory", tag="TEST")
# Create a temporary user data directory
import tempfile
user_data_dir = tempfile.mkdtemp(prefix="crawl4ai-test-")
storage_state_file = os.path.join(user_data_dir, "storage_state.json")
logger.info(f"Created temporary user data directory: {user_data_dir}", tag="TEST")
browser_config = BrowserConfig(
headless=True,
browser_mode="cdp",
use_managed_browser=True,
user_data_dir=user_data_dir
)
@@ -86,38 +88,59 @@ async def test_cdp_with_user_data_dir():
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
# Set a cookie
# Visit the site first
await page.goto("https://example.com", wait_until="domcontentloaded")
# Set a cookie via JavaScript (more reliable for persistence)
await page.evaluate("""
document.cookie = 'test_cookie=test_value; path=/; max-age=86400';
""")
# Also set via context API for double coverage
await context.add_cookies([{
"name": "test_cookie",
"value": "test_value",
"url": "https://example.com"
"name": "test_cookie_api",
"value": "test_value_api",
"domain": "example.com",
"path": "/"
}])
# Visit the site
await page.goto("https://example.com")
# Verify cookie was set
# Verify cookies were set
cookies = await context.cookies(["https://example.com"])
has_test_cookie = any(cookie["name"] == "test_cookie" for cookie in cookies)
has_test_cookie = any(cookie["name"] in ["test_cookie", "test_cookie_api"] for cookie in cookies)
logger.info(f"Cookie set successfully: {has_test_cookie}", tag="TEST")
# Save storage state before closing
await context.storage_state(path=storage_state_file)
logger.info(f"Storage state saved to: {storage_state_file}", tag="TEST")
# Close the browser
await manager.close()
logger.info("First browser session closed", tag="TEST")
# Start a new browser with the same user data directory
# Wait a moment for clean shutdown
await asyncio.sleep(1.0)
# Start a new browser with the same user data directory and storage state
logger.info("Starting second browser session with same user data directory", tag="TEST")
manager2 = BrowserManager(browser_config=browser_config, logger=logger)
browser_config2 = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir=user_data_dir,
storage_state=storage_state_file
)
manager2 = BrowserManager(browser_config=browser_config2, logger=logger)
await manager2.start()
# Get a new page and check if the cookie persists
page2, context2 = await manager2.get_page(crawler_config)
await page2.goto("https://example.com")
await page2.goto("https://example.com", wait_until="domcontentloaded")
# Verify cookie persisted
cookies2 = await context2.cookies(["https://example.com"])
has_test_cookie2 = any(cookie["name"] == "test_cookie" for cookie in cookies2)
has_test_cookie2 = any(cookie["name"] in ["test_cookie", "test_cookie_api"] for cookie in cookies2)
logger.info(f"Cookie persisted across sessions: {has_test_cookie2}", tag="TEST")
logger.info(f"Cookies found: {[c['name'] for c in cookies2]}", tag="TEST")
# Clean up
await manager2.close()
@@ -134,6 +157,10 @@ async def test_cdp_with_user_data_dir():
await manager.close()
except:
pass
try:
await manager2.close()
except:
pass
# Clean up temporary directory
try:
@@ -145,7 +172,7 @@ async def test_cdp_with_user_data_dir():
return False
async def test_cdp_session_management():
"""Test session management with CDP browser."""
"""Test session management with CDP browser - focused on session tracking."""
logger.info("Testing session management with CDP browser", tag="TEST")
browser_config = BrowserConfig(
@@ -159,45 +186,104 @@ async def test_cdp_session_management():
await manager.start()
logger.info("Browser launched successfully", tag="TEST")
# Create two sessions
# Test session tracking and lifecycle management
session1_id = "test_session_1"
session2_id = "test_session_2"
# Set up first session
crawler_config1 = CrawlerRunConfig(session_id=session1_id)
page1, context1 = await manager.get_page(crawler_config1)
await page1.goto("https://example.com")
await page1.evaluate("localStorage.setItem('session1_data', 'test_value')")
logger.info(f"Set up session 1 with ID: {session1_id}", tag="TEST")
await page1.goto("https://example.com", wait_until="domcontentloaded")
# Set up second session
# Get page URL and title for verification
page1_url = page1.url
page1_title = await page1.title()
logger.info(f"Session 1 setup - URL: {page1_url}, Title: {page1_title}", tag="TEST")
# Set up second session
crawler_config2 = CrawlerRunConfig(session_id=session2_id)
page2, context2 = await manager.get_page(crawler_config2)
await page2.goto("https://example.org")
await page2.evaluate("localStorage.setItem('session2_data', 'test_value2')")
logger.info(f"Set up session 2 with ID: {session2_id}", tag="TEST")
await page2.goto("https://httpbin.org/html", wait_until="domcontentloaded")
# Get first session again
page1_again, _ = await manager.get_page(crawler_config1)
page2_url = page2.url
page2_title = await page2.title()
logger.info(f"Session 2 setup - URL: {page2_url}, Title: {page2_title}", tag="TEST")
# Verify it's the same page and data persists
# Verify sessions exist in manager
session1_exists = session1_id in manager.sessions
session2_exists = session2_id in manager.sessions
logger.info(f"Sessions in manager - S1: {session1_exists}, S2: {session2_exists}", tag="TEST")
# Test session reuse
page1_again, context1_again = await manager.get_page(crawler_config1)
is_same_page = page1 == page1_again
data1 = await page1_again.evaluate("localStorage.getItem('session1_data')")
logger.info(f"Session 1 reuse successful: {is_same_page}, data: {data1}", tag="TEST")
is_same_context = context1 == context1_again
# Kill first session
logger.info(f"Session 1 reuse - Same page: {is_same_page}, Same context: {is_same_context}", tag="TEST")
# Test that sessions are properly tracked with timestamps
session1_info = manager.sessions.get(session1_id)
session2_info = manager.sessions.get(session2_id)
session1_has_timestamp = session1_info and len(session1_info) == 3
session2_has_timestamp = session2_info and len(session2_info) == 3
logger.info(f"Session tracking - S1 complete: {session1_has_timestamp}, S2 complete: {session2_has_timestamp}", tag="TEST")
# In managed browser mode, pages might be shared. Let's test what actually happens
pages_same_or_different = page1 == page2
logger.info(f"Pages same object: {pages_same_or_different}", tag="TEST")
# Test that we can distinguish sessions by their stored info
session1_context, session1_page, session1_time = session1_info
session2_context, session2_page, session2_time = session2_info
sessions_have_different_timestamps = session1_time != session2_time
logger.info(f"Sessions have different timestamps: {sessions_have_different_timestamps}", tag="TEST")
# Test session killing
await manager.kill_session(session1_id)
logger.info(f"Killed session 1", tag="TEST")
# Verify second session still works
data2 = await page2.evaluate("localStorage.getItem('session2_data')")
logger.info(f"Session 2 still functional after killing session 1, data: {data2}", tag="TEST")
# Verify session was removed
session1_removed = session1_id not in manager.sessions
session2_still_exists = session2_id in manager.sessions
logger.info(f"After kill - S1 removed: {session1_removed}, S2 exists: {session2_still_exists}", tag="TEST")
# Test page state after killing session
page1_closed = page1.is_closed()
logger.info(f"Page1 closed after kill: {page1_closed}", tag="TEST")
# Clean up remaining session
try:
await manager.kill_session(session2_id)
logger.info("Killed session 2", tag="TEST")
session2_removed = session2_id not in manager.sessions
except Exception as e:
logger.info(f"Session 2 cleanup: {e}", tag="TEST")
session2_removed = False
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return is_same_page and data1 == "test_value" and data2 == "test_value2"
# Success criteria for managed browser sessions:
# 1. Sessions can be created and tracked with proper info
# 2. Same page/context returned for same session ID
# 3. Sessions have proper timestamp tracking
# 4. Sessions can be killed and removed from tracking
# 5. Session cleanup works properly
success = (session1_exists and
session2_exists and
is_same_page and
session1_has_timestamp and
session2_has_timestamp and
sessions_have_different_timestamps and
session1_removed and
session2_removed)
logger.info(f"Test success: {success}", tag="TEST")
return success
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
@@ -206,14 +292,170 @@ async def test_cdp_session_management():
pass
return False
async def test_cdp_timing_fix_fast_startup():
"""
Test that the CDP timing fix handles fast browser startup correctly.
This should work without any delays or retries.
"""
logger.info("Testing CDP timing fix with fast startup", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
browser_mode="cdp",
headless=True,
debugging_port=9223, # Use different port to avoid conflicts
verbose=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
start_time = time.time()
await manager.start()
startup_time = time.time() - start_time
logger.info(f"Browser started successfully in {startup_time:.2f}s", tag="TEST")
# Test basic functionality
crawler_config = CrawlerRunConfig(url="https://example.com")
page, context = await manager.get_page(crawler_config)
await page.goto("https://example.com", wait_until="domcontentloaded")
title = await page.title()
logger.info(f"Successfully navigated to page: {title}", tag="TEST")
await manager.close()
logger.success("test_cdp_timing_fix_fast_startup completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_cdp_timing_fix_fast_startup failed: {str(e)}", tag="TEST")
try:
await manager.close()
except:
pass
return False
async def test_cdp_timing_fix_delayed_browser_start():
"""
Test CDP timing fix by actually delaying the browser startup process.
This simulates a real scenario where the browser takes time to expose CDP.
"""
logger.info("Testing CDP timing fix with delayed browser startup", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
browser_mode="cdp",
headless=True,
debugging_port=9224,
verbose=True
)
# Start the managed browser separately to control timing
from crawl4ai.browser_manager import ManagedBrowser
managed_browser = ManagedBrowser(browser_config=browser_config, logger=logger)
try:
# Start browser process but it will take time for CDP to be ready
cdp_url = await managed_browser.start()
logger.info(f"Managed browser started at {cdp_url}", tag="TEST")
# Small delay to simulate the browser needing time to fully initialize CDP
await asyncio.sleep(1.0)
# Now create BrowserManager and connect - this should use the CDP verification fix
manager = BrowserManager(browser_config=browser_config, logger=logger)
manager.config.cdp_url = cdp_url # Use the CDP URL from managed browser
start_time = time.time()
await manager.start()
startup_time = time.time() - start_time
logger.info(f"BrowserManager connected successfully in {startup_time:.2f}s", tag="TEST")
# Test basic functionality
crawler_config = CrawlerRunConfig(url="https://example.com")
page, context = await manager.get_page(crawler_config)
await page.goto("https://example.com", wait_until="domcontentloaded")
title = await page.title()
logger.info(f"Successfully navigated to page: {title}", tag="TEST")
# Clean up
await manager.close()
await managed_browser.cleanup()
logger.success("test_cdp_timing_fix_delayed_browser_start completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_cdp_timing_fix_delayed_browser_start failed: {str(e)}", tag="TEST")
try:
await manager.close()
await managed_browser.cleanup()
except:
pass
return False
async def test_cdp_verification_backoff_behavior():
"""
Test the exponential backoff behavior of CDP verification in isolation.
"""
logger.info("Testing CDP verification exponential backoff behavior", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
debugging_port=9225, # Use different port
verbose=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# Test with a non-existent CDP URL to trigger retries
fake_cdp_url = "http://localhost:19999" # This should not exist
start_time = time.time()
result = await manager._verify_cdp_ready(fake_cdp_url)
elapsed_time = time.time() - start_time
# Should return False after all retries
assert result is False, "Expected CDP verification to fail with non-existent endpoint"
# Should take some time due to retries and backoff
assert elapsed_time > 2.0, f"Expected backoff delays, but took only {elapsed_time:.2f}s"
logger.info(f"CDP verification correctly failed after {elapsed_time:.2f}s with exponential backoff", tag="TEST")
logger.success("test_cdp_verification_backoff_behavior completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_cdp_verification_backoff_behavior failed: {str(e)}", tag="TEST")
return False
async def run_tests():
"""Run all tests sequentially."""
import time
results = []
# Original CDP strategy tests
logger.info("Running original CDP strategy tests", tag="SUITE")
# results.append(await test_cdp_launch_connect())
results.append(await test_cdp_with_user_data_dir())
results.append(await test_cdp_session_management())
# CDP timing fix tests
logger.info("Running CDP timing fix tests", tag="SUITE")
results.append(await test_cdp_timing_fix_fast_startup())
results.append(await test_cdp_timing_fix_delayed_browser_start())
results.append(await test_cdp_verification_backoff_behavior())
# Print summary
total = len(results)
passed = sum(results)

View File

@@ -71,7 +71,7 @@ PACKAGE_MAPPINGS = {
'sentence_transformers': 'sentence-transformers',
'rank_bm25': 'rank-bm25',
'snowballstemmer': 'snowballstemmer',
'PyPDF2': 'PyPDF2',
'pypdf': 'pypdf',
'pdf2image': 'pdf2image',
}

Some files were not shown because too many files have changed in this diff Show More