Release v0.8.0

- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build

# C4ai version
-ARG C4AI_VER=0.7.8
+ARG C4AI_VER=0.8.0
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER
47  README.md
@@ -37,13 +37,13 @@ Limited slots._
Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

-[✨ Check out latest update v0.7.8](#-recent-updates)
+[✨ Check out latest update v0.8.0](#-recent-updates)

-✨ **New in v0.7.8**: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues (ContentRelevanceFilter, ProxyConfig, cache permissions), LLM extraction improvements (configurable backoff, HTML input format), URL handling fixes, and dependency updates (pypdf, Pydantic v2). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
+✨ **New in v0.8.0**: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. Critical security fixes for Docker API (hooks disabled by default, file:// URLs blocked). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

-✨ Recent v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
+✨ Recent v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)

-✨ Previous v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
+✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -562,6 +562,45 @@ async def test_news_crawl():

## ✨ Recent Updates

<details open>
<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>

This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.

- **🔄 Deep Crawl Crash Recovery**:
  - `on_state_change` callback fires after each URL for real-time state persistence
  - `resume_state` parameter to continue from a saved checkpoint
  - JSON-serializable state for Redis/database storage
  - Works with BFS, DFS, and Best-First strategies

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,       # Continue from checkpoint
    on_state_change=save_to_redis,  # Called after each URL
)
```
- **⚡ Prefetch Mode for Fast URL Discovery**:
  - `prefetch=True` skips markdown, extraction, and media processing
  - 5-10x faster than full processing
  - Perfect for two-phase crawling: discover first, process selectively

```python
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun("https://example.com", config=config)
# Returns HTML and links only - no markdown generation
```
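
The "discover first, process selectively" pattern means phase 1 only collects links and phase 2 runs full processing on the subset you care about. The selection step is ordinary filtering; a sketch with a hypothetical discovered-link list:

```python
from fnmatch import fnmatch

# Phase 1 (prefetch) would yield links like these; this list is hypothetical.
discovered = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/about",
    "https://example.com/careers",
]

def select_for_processing(urls: list[str], pattern: str) -> list[str]:
    # Phase 2 input: only URLs matching the glob pattern get full processing.
    return [u for u in urls if fnmatch(u, pattern)]

targets = select_for_processing(discovered, "https://example.com/blog/*")
assert targets == ["https://example.com/blog/post-1", "https://example.com/blog/post-2"]
```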
- **🔒 Security Fixes (Docker API)**:
  - Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
  - `file://` URLs blocked on API endpoints to prevent LFI
  - `__import__` removed from hook execution sandbox
[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

</details>

<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>
@@ -1,7 +1,7 @@
# crawl4ai/__version__.py

# This is the version that will be used for stable releases
-__version__ = "0.7.8"
+__version__ = "0.8.0"

# For nightly builds, this gets set during build process
__nightly_version__ = None
@@ -59,13 +59,13 @@ Pull and run images directly from Docker Hub without building locally.

#### 1. Pull the Image

-Our latest stable release is `0.7.7`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest stable release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

```bash
-# Pull the latest stable version (0.7.7)
-docker pull unclecode/crawl4ai:0.7.7
+# Pull the latest stable version (0.8.0)
+docker pull unclecode/crawl4ai:0.8.0

-# Or use the latest tag (points to 0.7.7)
+# Or use the latest tag (points to 0.8.0)
docker pull unclecode/crawl4ai:latest
```

@@ -100,7 +100,7 @@ EOL
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
-  unclecode/crawl4ai:0.7.7
+  unclecode/crawl4ai:0.8.0
```

* **With LLM support:**

@@ -111,7 +111,7 @@ EOL
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
-  unclecode/crawl4ai:0.7.7
+  unclecode/crawl4ai:0.8.0
```

> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.

@@ -184,7 +184,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach

```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
-IMAGE=unclecode/crawl4ai:0.7.7 docker compose up -d
+IMAGE=unclecode/crawl4ai:0.8.0 docker compose up -d
```

* **Build and Run Locally:**
243  docs/blog/release-v0.8.0.md  Normal file
@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes

**Release Date**: January 2026
**Previous Version**: v0.7.8
**Status**: Release Candidate

---

## Highlights

- **Critical Security Fixes** for Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see migration guide below

---

## Breaking Changes

### 1. Docker API: Hooks Disabled by Default

**What changed**: Hooks are now disabled by default on the Docker API.

**Why**: Security fix for a Remote Code Execution (RCE) vulnerability.

**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.

**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked

**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.

**Why**: Security fix for a Local File Inclusion (LFI) vulnerability.

**Who is affected**: Users who were reading local files via the Docker API.

**Migration**: Use the Python library directly for local file processing:
```python
# Instead of an API call with a file:// URL, use the library:
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="file:///path/to/file.html")
```

---

## Security Fixes

### Critical: Remote Code Execution via Hooks (CVE Pending)

**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with malicious `hooks` parameter

**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.

**Fix**:
1. Removed `__import__` from allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)

### High: Local File Inclusion via file:// URLs (CVE Pending)

**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`

**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.

**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
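
An allow-list of URL schemes, as described above, is a small check; a sketch of the idea (hypothetical helper, not Crawl4AI's actual server-side validation):

```python
from urllib.parse import urlsplit

# Only these schemes may reach the crawler via the API.
ALLOWED_SCHEMES = {"http", "https", "raw"}

def is_url_allowed(url: str) -> bool:
    scheme = urlsplit(url).scheme.lower()
    return scheme in ALLOWED_SCHEMES

assert is_url_allowed("https://example.com/page")
assert is_url_allowed("raw:<html></html>")
assert not is_url_allowed("file:///etc/passwd")  # LFI vector rejected
```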
### Credits

Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025

---

## New Features

### 1. init_scripts Support for BrowserConfig

Pre-page-load JavaScript injection for stealth evasions.

```python
config = BrowserConfig(
    init_scripts=[
        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
    ]
)
```

### 2. CDP Connection Improvements

- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections

### 3. Crash Recovery for Deep Crawl Strategies

All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,      # Resume from checkpoint
    on_state_change=save_callback  # Persist state in real-time
)
```
### 4. PDF and MHTML for raw:/file:// URLs

Generate PDFs and MHTML from cached HTML content.

### 5. Screenshots for raw:/file:// URLs

Render cached HTML and capture screenshots.

### 6. base_url Parameter for CrawlerRunConfig

Proper URL resolution for raw: HTML processing:

```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

### 7. Prefetch Mode for Two-Phase Deep Crawling

Fast link extraction without full page processing:

```python
config = CrawlerRunConfig(prefetch=True)
```

### 8. Proxy Rotation and Configuration

Enhanced proxy rotation with sticky-session support.

### 9. Proxy Support for HTTP Strategy

The non-browser HTTP crawler strategy now supports proxies.

### 10. Browser Pipeline for raw:/file:// URLs

New `process_in_browser` parameter for browser operations on local content:

```python
config = CrawlerRunConfig(
    process_in_browser=True,  # Force browser processing
    screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```

### 11. Smart TTL Cache for Sitemap URL Seeder

Intelligent cache invalidation for sitemaps:

```python
config = SeedingConfig(
    cache_ttl_hours=24,
    validate_sitemap_lastmod=True
)
```
---

## Bug Fixes

### raw: URL Parsing Truncates at # Character

**Problem**: CSS color codes like `#eee` were being truncated because the `#` was treated as a URL fragment delimiter.

**Before**: `raw:body{background:#eee}` → `body{background:`
**After**: `raw:body{background:#eee}` → `body{background:#eee}`
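
The truncation is exactly what standard URL parsing does to a `#`; treating everything after the `raw:` prefix as an opaque payload avoids it. A sketch reproducing both behaviours:

```python
from urllib.parse import urlsplit

payload = "raw:body{background:#eee}"

# Standard URL parsing treats '#' as a fragment delimiter and truncates:
assert urlsplit(payload).path == "body{background:"

def raw_payload(url: str) -> str:
    # Keep the full payload after 'raw:' intact, '#' and all.
    prefix = "raw:"
    return url[len(prefix):] if url.startswith(prefix) else url

assert raw_payload(payload) == "body{background:#eee}"
```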
### Caching System Improvements

Various fixes to cache validation and persistence.

---

## Documentation Updates

- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)

---

## Upgrade Guide

### From v0.7.x to v0.8.0

1. **Update the package**:
   ```bash
   pip install --upgrade crawl4ai
   ```

2. **Docker API users**:
   - Hooks are now disabled by default
   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
   - `file://` URLs no longer work on the API (use the library directly)

3. **Review security settings**:
   ```yaml
   # config.yml - recommended for production
   security:
     enabled: true
     jwt_enabled: true
   ```

4. **Test your integration** before deploying to production.

### Breaking Change Checklist

- [ ] Check if you use the `hooks` parameter in API calls
- [ ] Check if you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review security configuration

---

## Full Changelog

See [CHANGELOG.md](../CHANGELOG.md) for complete version history.

---

## Contributors

Thanks to all contributors who made this release possible.

Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.

---

*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
@@ -20,22 +20,32 @@ Ever wondered why your AI coding assistant struggles with your library despite c

## Latest Release

### [Crawl4AI v0.8.0 – Crash Recovery & Prefetch Mode](../blog/release-v0.8.0.md)
*January 2026*

Crawl4AI v0.8.0 introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.

Key highlights:
- **🔄 Deep Crawl Crash Recovery**: `on_state_change` callback for real-time state persistence, `resume_state` to continue from checkpoints
- **⚡ Prefetch Mode**: `prefetch=True` for 5-10x faster URL discovery, perfect for two-phase crawling patterns
- **🔒 Security Fixes**: Hooks disabled by default, `file://` URLs blocked on Docker API, `__import__` removed from sandbox

[Read full release notes →](../blog/release-v0.8.0.md)

## Recent Releases

### [Crawl4AI v0.7.8 – Stability & Bug Fix Release](../blog/release-v0.7.8.md)
*December 2025*

-Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. While there are no new features, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
+Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. Fixes for Docker deployments, LLM extraction, URL handling, and dependency compatibility.

Key highlights:
- **🐳 Docker API Fixes**: ContentRelevanceFilter deserialization, ProxyConfig serialization, cache folder permissions
-- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support, raw HTML URL handling
- **🔗 URL Handling**: Correct relative URL resolution after JavaScript redirects
+- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support
- **📦 Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **🧠 AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of mock data

[Read full release notes →](../blog/release-v0.7.8.md)

## Recent Releases

### [Crawl4AI v0.7.7 – The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
*November 14, 2025*

@@ -52,36 +62,22 @@ Key highlights:
### [Crawl4AI v0.7.6 – The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
*October 22, 2025*

-Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows. No more polling!
+Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows.

Key highlights:
- **🪝 Complete Webhook Support**: Real-time notifications for both `/crawl/job` and `/llm/job` endpoints
-- **🔄 Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
+- **🔄 Reliable Delivery**: Exponential backoff retry mechanism
- **🔐 Custom Authentication**: Add custom headers for webhook authentication
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs
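
The retry schedule quoted in the v0.7.6 highlights (5 attempts: 1s → 2s → 4s → 8s → 16s) is plain exponential backoff; a sketch of the delay sequence (hypothetical helper, not Crawl4AI's delivery code):

```python
def backoff_delays(attempts: int = 5, base: float = 1.0, factor: float = 2.0) -> list[float]:
    # Delay before attempt i doubles each time: base, base*2, base*4, ...
    return [base * factor**i for i in range(attempts)]

assert backoff_delays() == [1.0, 2.0, 4.0, 8.0, 16.0]
```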
[Read full release notes →](../blog/release-v0.7.6.md)

### [Crawl4AI v0.7.5 – The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
*September 29, 2025*

Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.

Key highlights:
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance

[Read full release notes →](../blog/release-v0.7.5.md)

---

## Older Releases

| Version | Date | Highlights |
|---------|------|------------|
| [v0.7.5](../blog/release-v0.7.5.md) | September 2025 | Docker Hooks System, enhanced LLM integration, HTTPS preservation |
| [v0.7.4](../blog/release-v0.7.4.md) | August 2025 | LLM-powered table extraction, performance improvements |
| [v0.7.3](../blog/release-v0.7.3.md) | July 2025 | Undetected browser, multi-URL config, memory monitoring |
| [v0.7.1](../blog/release-v0.7.1.md) | June 2025 | Bug fixes and stability improvements |
243  docs/md_v2/blog/releases/v0.8.0.md  Normal file
@@ -67,13 +67,13 @@ Pull and run images directly from Docker Hub without building locally.

#### 1. Pull the Image

-Our latest release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

-> 💡 **Note**: The `latest` tag points to the stable `0.7.6` version.
+> 💡 **Note**: The `latest` tag points to the stable `0.8.0` version.

```bash
# Pull the latest version
-docker pull unclecode/crawl4ai:0.7.6
+docker pull unclecode/crawl4ai:0.8.0

# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest
```

@@ -145,7 +145,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained

* **Image Name:** `unclecode/crawl4ai`
-* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.6`)
+* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.8.0`)
  * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
  * `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
633  docs/releases_review/demo_v0.8.0.py  Normal file
@@ -0,0 +1,633 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Crawl4AI v0.8.0 Release Demo - Feature Verification Tests
|
||||
==========================================================
|
||||
|
||||
This demo ACTUALLY RUNS and VERIFIES the new features in v0.8.0.
|
||||
Each test executes real code and validates the feature is working.
|
||||
|
||||
New Features Verified:
|
||||
1. Crash Recovery - on_state_change callback for real-time state persistence
|
||||
2. Crash Recovery - resume_state for resuming from checkpoint
|
||||
3. Crash Recovery - State is JSON serializable
|
||||
4. Prefetch Mode - Returns HTML and links only
|
||||
5. Prefetch Mode - Skips heavy processing (markdown, extraction)
|
||||
6. Prefetch Mode - Two-phase crawl pattern
|
||||
7. Security - Hooks disabled by default (Docker API)
|
||||
|
||||
Breaking Changes in v0.8.0:
|
||||
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)
|
||||
- file:// URLs blocked on Docker API endpoints
|
||||
|
||||
Usage:
|
||||
python docs/releases_review/demo_v0.8.0.py
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from typing import Dict, Any, List, Optional
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
# Test results tracking
|
||||
@dataclass
|
||||
class TestResult:
|
||||
name: str
|
||||
feature: str
|
||||
passed: bool
|
||||
message: str
|
||||
skipped: bool = False
|
||||
|
||||
|
||||
results: list[TestResult] = []
|
||||
|
||||
|
||||
def print_header(title: str):
|
||||
print(f"\n{'=' * 70}")
|
||||
print(f"{title}")
|
||||
print(f"{'=' * 70}")
|
||||
|
||||
|
||||
def print_test(name: str, feature: str):
|
||||
print(f"\n[TEST] {name} ({feature})")
|
||||
print("-" * 50)
|
||||
|
||||
|
||||
def record_result(name: str, feature: str, passed: bool, message: str, skipped: bool = False):
|
||||
results.append(TestResult(name, feature, passed, message, skipped))
|
||||
if skipped:
|
||||
print(f" SKIPPED: {message}")
|
||||
elif passed:
|
||||
print(f" PASSED: {message}")
|
||||
else:
|
||||
print(f" FAILED: {message}")
|
||||
|
||||
|
||||
# =============================================================================
# TEST 1: Crash Recovery - State Capture with on_state_change
# =============================================================================
async def test_crash_recovery_state_capture():
    """
    Verify the on_state_change callback is called after each URL is processed.

    NEW in v0.8.0: Deep crawl strategies support an on_state_change callback
    for real-time state persistence (useful for cloud deployments).
    """
    print_test("Crash Recovery - State Capture", "on_state_change")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

        captured_states: List[Dict[str, Any]] = []

        async def capture_state(state: Dict[str, Any]):
            """Callback that fires after each URL is processed."""
            captured_states.append(state.copy())

        strategy = BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=3,
            on_state_change=capture_state,
        )

        config = CrawlerRunConfig(
            deep_crawl_strategy=strategy,
            verbose=False,
        )

        async with AsyncWebCrawler(verbose=False) as crawler:
            await crawler.arun("https://books.toscrape.com", config=config)

        # Verify states were captured
        if len(captured_states) == 0:
            record_result("State Capture", "on_state_change", False,
                          "No states captured - callback not called")
            return

        # Verify the callback was called once per crawled page
        pages_crawled = captured_states[-1].get("pages_crawled", 0)
        if pages_crawled != len(captured_states):
            record_result("State Capture", "on_state_change", False,
                          f"Callback count {len(captured_states)} != pages_crawled {pages_crawled}")
            return

        record_result("State Capture", "on_state_change", True,
                      f"Callback fired {len(captured_states)} times (once per URL)")

    except Exception as e:
        record_result("State Capture", "on_state_change", False, f"Exception: {e}")


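# -----------------------------------------------------------------------------
# Sketch (illustrative, not exercised by the tests): because the state dict
# handed to on_state_change is JSON-serializable, a file-based checkpoint takes
# only a few lines. The helper names and the default path below are assumptions
# for this demo, not part of the crawl4ai API. Imports are repeated so the
# sketch stands alone.
# -----------------------------------------------------------------------------
import json
from typing import Any, Dict, Optional


def save_checkpoint(state: Dict[str, Any], path: str = "crawl_state.json") -> None:
    """Persist a state snapshot so a later run can pass it as resume_state."""
    with open(path, "w") as f:
        json.dump(state, f)


def load_checkpoint(path: str = "crawl_state.json") -> Optional[Dict[str, Any]]:
    """Load a previously saved snapshot, or None if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None

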
# =============================================================================
# TEST 2: Crash Recovery - Resume from Checkpoint
# =============================================================================
async def test_crash_recovery_resume():
    """
    Verify a crawl can resume from a saved checkpoint without re-crawling visited URLs.

    NEW in v0.8.0: BFSDeepCrawlStrategy accepts a resume_state parameter
    to continue from a previously saved checkpoint.
    """
    print_test("Crash Recovery - Resume from Checkpoint", "resume_state")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

        # Phase 1: Start crawl and capture state after 2 pages
        crash_after = 2
        captured_states: List[Dict] = []
        phase1_urls: List[str] = []

        async def capture_until_crash(state: Dict[str, Any]):
            captured_states.append(state.copy())
            phase1_urls.clear()
            phase1_urls.extend(state["visited"])
            if state["pages_crawled"] >= crash_after:
                raise Exception("Simulated crash")

        strategy1 = BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=5,
            on_state_change=capture_until_crash,
        )

        config1 = CrawlerRunConfig(
            deep_crawl_strategy=strategy1,
            verbose=False,
        )

        # Run until "crash"
        try:
            async with AsyncWebCrawler(verbose=False) as crawler:
                await crawler.arun("https://books.toscrape.com", config=config1)
        except Exception:
            pass  # Expected crash

        if not captured_states:
            record_result("Resume from Checkpoint", "resume_state", False,
                          "No state captured before crash")
            return

        saved_state = captured_states[-1]
        print(f" Phase 1: Crawled {len(phase1_urls)} URLs before crash")

        # Phase 2: Resume from checkpoint
        phase2_urls: List[str] = []

        async def track_phase2(state: Dict[str, Any]):
            new_urls = set(state["visited"]) - set(saved_state["visited"])
            for url in new_urls:
                if url not in phase2_urls:
                    phase2_urls.append(url)

        strategy2 = BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=5,
            resume_state=saved_state,  # Resume from checkpoint!
            on_state_change=track_phase2,
        )

        config2 = CrawlerRunConfig(
            deep_crawl_strategy=strategy2,
            verbose=False,
        )

        async with AsyncWebCrawler(verbose=False) as crawler:
            await crawler.arun("https://books.toscrape.com", config=config2)

        print(f" Phase 2: Crawled {len(phase2_urls)} new URLs after resume")

        # Verify no duplicates
        duplicates = set(phase2_urls) & set(phase1_urls)
        if duplicates:
            record_result("Resume from Checkpoint", "resume_state", False,
                          f"Re-crawled {len(duplicates)} URLs: {list(duplicates)[:2]}")
            return

        record_result("Resume from Checkpoint", "resume_state", True,
                      "Resumed successfully, no duplicate crawls")

    except Exception as e:
        record_result("Resume from Checkpoint", "resume_state", False, f"Exception: {e}")


# =============================================================================
# TEST 3: Crash Recovery - State is JSON Serializable
# =============================================================================
async def test_crash_recovery_json_serializable():
    """
    Verify the state dictionary can be serialized to JSON (for Redis/DB storage).

    NEW in v0.8.0: The state dictionary is designed to be JSON-serializable
    for easy storage in Redis, databases, or files.
    """
    print_test("Crash Recovery - JSON Serializable", "State Structure")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

        captured_state: Optional[Dict] = None

        async def capture_state(state: Dict[str, Any]):
            nonlocal captured_state
            captured_state = state

        strategy = BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=2,
            on_state_change=capture_state,
        )

        config = CrawlerRunConfig(
            deep_crawl_strategy=strategy,
            verbose=False,
        )

        async with AsyncWebCrawler(verbose=False) as crawler:
            await crawler.arun("https://books.toscrape.com", config=config)

        if not captured_state:
            record_result("JSON Serializable", "State Structure", False,
                          "No state captured")
            return

        # Test JSON serialization round-trip
        try:
            json_str = json.dumps(captured_state)
            restored = json.loads(json_str)
        except (TypeError, json.JSONDecodeError) as e:
            record_result("JSON Serializable", "State Structure", False,
                          f"JSON serialization failed: {e}")
            return

        # Verify state structure
        required_fields = ["strategy_type", "visited", "pending", "depths", "pages_crawled"]
        missing = [field for field in required_fields if field not in restored]
        if missing:
            record_result("JSON Serializable", "State Structure", False,
                          f"Missing fields: {missing}")
            return

        # Verify types
        if not isinstance(restored["visited"], list):
            record_result("JSON Serializable", "State Structure", False,
                          "visited is not a list")
            return

        if not isinstance(restored["pages_crawled"], int):
            record_result("JSON Serializable", "State Structure", False,
                          "pages_crawled is not an int")
            return

        record_result("JSON Serializable", "State Structure", True,
                      f"State serializes to {len(json_str)} chars, all fields present")

    except Exception as e:
        record_result("JSON Serializable", "State Structure", False, f"Exception: {e}")


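# -----------------------------------------------------------------------------
# Sketch (illustrative): before feeding a stored snapshot back in as
# resume_state, it is cheap to validate its shape first. The field list mirrors
# the structure checked in the test above; the helper itself is an assumption
# for this demo, not crawl4ai API. Import repeated so the sketch stands alone.
# -----------------------------------------------------------------------------
from typing import Any, Dict

REQUIRED_STATE_FIELDS = ("strategy_type", "visited", "pending", "depths", "pages_crawled")


def is_valid_state(state: Dict[str, Any]) -> bool:
    """Check that a restored checkpoint has the fields resume_state expects."""
    return (
        all(field in state for field in REQUIRED_STATE_FIELDS)
        and isinstance(state.get("visited"), list)
        and isinstance(state.get("pages_crawled"), int)
    )

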
# =============================================================================
# TEST 4: Prefetch Mode - Returns HTML and Links Only
# =============================================================================
async def test_prefetch_returns_html_links():
    """
    Verify prefetch mode returns HTML and links but skips markdown generation.

    NEW in v0.8.0: CrawlerRunConfig accepts prefetch=True for fast
    URL discovery without heavy processing.
    """
    print_test("Prefetch Mode - HTML and Links", "prefetch=True")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

        config = CrawlerRunConfig(prefetch=True)

        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun("https://books.toscrape.com", config=config)

        # Verify HTML is present
        if not result.html or len(result.html) < 100:
            record_result("Prefetch HTML/Links", "prefetch=True", False,
                          "HTML not returned or too short")
            return

        # Verify links are present
        if not result.links:
            record_result("Prefetch HTML/Links", "prefetch=True", False,
                          "Links not returned")
            return

        internal_count = len(result.links.get("internal", []))
        external_count = len(result.links.get("external", []))

        if internal_count == 0:
            record_result("Prefetch HTML/Links", "prefetch=True", False,
                          "No internal links extracted")
            return

        record_result("Prefetch HTML/Links", "prefetch=True", True,
                      f"HTML: {len(result.html)} chars, Links: {internal_count} internal, {external_count} external")

    except Exception as e:
        record_result("Prefetch HTML/Links", "prefetch=True", False, f"Exception: {e}")


# =============================================================================
# TEST 5: Prefetch Mode - Skips Heavy Processing
# =============================================================================
async def test_prefetch_skips_processing():
    """
    Verify prefetch mode skips markdown generation and content extraction.

    NEW in v0.8.0: prefetch=True skips markdown generation, content scraping,
    media extraction, and LLM extraction for maximum speed.
    """
    print_test("Prefetch Mode - Skips Processing", "prefetch=True")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

        config = CrawlerRunConfig(prefetch=True)

        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun("https://books.toscrape.com", config=config)

        # Check that heavy processing was skipped
        checks = []

        # Markdown should be None or empty
        if result.markdown is None:
            checks.append("markdown=None")
        elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown is None:
            checks.append("raw_markdown=None")
        else:
            record_result("Prefetch Skips Processing", "prefetch=True", False,
                          "Markdown was generated (should be skipped)")
            return

        # cleaned_html should be None
        if result.cleaned_html is None:
            checks.append("cleaned_html=None")
        else:
            record_result("Prefetch Skips Processing", "prefetch=True", False,
                          "cleaned_html was generated (should be skipped)")
            return

        # extracted_content should be None
        if result.extracted_content is None:
            checks.append("extracted_content=None")
        else:
            record_result("Prefetch Skips Processing", "prefetch=True", False,
                          "extracted_content was generated (should be skipped)")
            return

        record_result("Prefetch Skips Processing", "prefetch=True", True,
                      f"Heavy processing skipped: {', '.join(checks)}")

    except Exception as e:
        record_result("Prefetch Skips Processing", "prefetch=True", False, f"Exception: {e}")


# =============================================================================
# TEST 6: Prefetch Mode - Two-Phase Crawl Pattern
# =============================================================================
async def test_prefetch_two_phase():
    """
    Verify the two-phase crawl pattern: prefetch for discovery, then full processing.

    NEW in v0.8.0: Prefetch mode enables efficient two-phase crawling where
    you discover URLs quickly, then selectively process important ones.
    """
    print_test("Prefetch Mode - Two-Phase Crawl", "Two-Phase Pattern")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

        async with AsyncWebCrawler(verbose=False) as crawler:
            # Phase 1: Fast discovery with prefetch
            prefetch_config = CrawlerRunConfig(prefetch=True)

            start = time.time()
            discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
            prefetch_time = time.time() - start

            all_urls = [link["href"] for link in discovery.links.get("internal", [])]

            # Filter to specific pages (e.g., book detail pages)
            book_urls = [
                url for url in all_urls
                if "catalogue/" in url and "category/" not in url
            ][:2]  # Just 2 for demo

            print(f" Phase 1: Found {len(all_urls)} URLs in {prefetch_time:.2f}s")
            print(f" Filtered to {len(book_urls)} book pages for full processing")

            if len(book_urls) == 0:
                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
                              "No book URLs found to process")
                return

            # Phase 2: Full processing on selected URLs
            full_config = CrawlerRunConfig()  # Normal mode

            start = time.time()
            processed = []
            for url in book_urls:
                result = await crawler.arun(url, config=full_config)
                if result.success and result.markdown:
                    processed.append(result)

            full_time = time.time() - start

            print(f" Phase 2: Processed {len(processed)} pages in {full_time:.2f}s")

            if len(processed) == 0:
                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
                              "No pages successfully processed in phase 2")
                return

            # Verify full processing includes markdown
            if not processed[0].markdown or not processed[0].markdown.raw_markdown:
                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
                              "Full processing did not generate markdown")
                return

            record_result("Two-Phase Crawl", "Two-Phase Pattern", True,
                          f"Discovered {len(all_urls)} URLs (prefetch), processed {len(processed)} (full)")

    except Exception as e:
        record_result("Two-Phase Crawl", "Two-Phase Pattern", False, f"Exception: {e}")


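# -----------------------------------------------------------------------------
# Sketch (illustrative): the two-phase test above inlines its URL filter. For
# larger crawls the same predicate is easier to reuse and test as a helper. The
# "catalogue/"/"category/" predicate is specific to this demo's target site,
# not crawl4ai API. Import repeated so the sketch stands alone.
# -----------------------------------------------------------------------------
from typing import List


def select_detail_pages(urls: List[str], limit: int = 10) -> List[str]:
    """Keep product detail pages, dropping category listings, up to `limit`."""
    picked: List[str] = []
    for url in urls:
        if len(picked) >= limit:
            break
        if "catalogue/" in url and "category/" not in url:
            picked.append(url)
    return picked

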
# =============================================================================
# TEST 7: Security - Hooks Disabled by Default
# =============================================================================
async def test_security_hooks_disabled():
    """
    Verify hooks are disabled by default in the Docker API for security.

    NEW in v0.8.0: Docker API hooks are disabled by default to prevent
    Remote Code Execution. Set CRAWL4AI_HOOKS_ENABLED=true to enable.
    """
    print_test("Security - Hooks Disabled", "CRAWL4AI_HOOKS_ENABLED")

    try:
        import os

        # Check the default environment variable
        hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower()

        if hooks_enabled == "true":
            record_result("Hooks Disabled Default", "Security", True,
                          "CRAWL4AI_HOOKS_ENABLED is explicitly set to 'true' (user override)",
                          skipped=True)
            return

        # Verify default is "false"
        if hooks_enabled == "false":
            record_result("Hooks Disabled Default", "Security", True,
                          "Hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)")
        else:
            record_result("Hooks Disabled Default", "Security", True,
                          f"CRAWL4AI_HOOKS_ENABLED='{hooks_enabled}' (not 'true', hooks disabled)")

    except Exception as e:
        record_result("Hooks Disabled Default", "Security", False, f"Exception: {e}")


# =============================================================================
# TEST 8: Comprehensive Crawl Test
# =============================================================================
async def test_comprehensive_crawl():
    """
    Run a comprehensive crawl to verify overall stability with new features.
    """
    print_test("Comprehensive Crawl Test", "Overall")

    try:
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig

        async with AsyncWebCrawler(config=BrowserConfig(headless=True), verbose=False) as crawler:
            result = await crawler.arun(
                url="https://httpbin.org/html",
                config=CrawlerRunConfig()
            )

        checks = []

        if result.success:
            checks.append("success=True")
        else:
            record_result("Comprehensive Crawl", "Overall", False,
                          f"Crawl failed: {result.error_message}")
            return

        if result.html and len(result.html) > 100:
            checks.append(f"html={len(result.html)} chars")

        if result.markdown and result.markdown.raw_markdown:
            checks.append(f"markdown={len(result.markdown.raw_markdown)} chars")

        if result.links:
            total_links = len(result.links.get("internal", [])) + len(result.links.get("external", []))
            checks.append(f"links={total_links}")

        record_result("Comprehensive Crawl", "Overall", True,
                      f"All checks passed: {', '.join(checks)}")

    except Exception as e:
        record_result("Comprehensive Crawl", "Overall", False, f"Exception: {e}")


# =============================================================================
# MAIN
# =============================================================================

def print_summary():
    """Print test results summary"""
    print_header("TEST RESULTS SUMMARY")

    passed = sum(1 for r in results if r.passed and not r.skipped)
    failed = sum(1 for r in results if not r.passed and not r.skipped)
    skipped = sum(1 for r in results if r.skipped)

    print(f"\nTotal: {len(results)} tests")
    print(f" Passed: {passed}")
    print(f" Failed: {failed}")
    print(f" Skipped: {skipped}")

    if failed > 0:
        print("\nFailed Tests:")
        for r in results:
            if not r.passed and not r.skipped:
                print(f" - {r.name} ({r.feature}): {r.message}")

    if skipped > 0:
        print("\nSkipped Tests:")
        for r in results:
            if r.skipped:
                print(f" - {r.name} ({r.feature}): {r.message}")

    print("\n" + "=" * 70)
    if failed == 0:
        print("All tests passed! v0.8.0 features verified.")
    else:
        print(f"WARNING: {failed} test(s) failed!")
    print("=" * 70)

    return failed == 0


async def main():
    """Run all verification tests"""
    print_header("Crawl4AI v0.8.0 - Feature Verification Tests")
    print("Running actual tests to verify new features...")
    print("\nKey Features in v0.8.0:")
    print(" - Crash Recovery for Deep Crawl (resume_state, on_state_change)")
    print(" - Prefetch Mode for Fast URL Discovery (prefetch=True)")
    print(" - Security: Hooks disabled by default on Docker API")

    # Run all tests
    tests = [
        test_crash_recovery_state_capture,      # on_state_change
        test_crash_recovery_resume,             # resume_state
        test_crash_recovery_json_serializable,  # State structure
        test_prefetch_returns_html_links,       # prefetch=True basics
        test_prefetch_skips_processing,         # prefetch skips heavy work
        test_prefetch_two_phase,                # Two-phase pattern
        test_security_hooks_disabled,           # Security check
        test_comprehensive_crawl,               # Overall stability
    ]

    for test_func in tests:
        try:
            await test_func()
        except Exception as e:
            print(f"\nTest {test_func.__name__} crashed: {e}")
            results.append(TestResult(
                test_func.__name__,
                "Unknown",
                False,
                f"Crashed: {e}"
            ))

    # Print summary
    all_passed = print_summary()

    return 0 if all_passed else 1


if __name__ == "__main__":
    try:
        exit_code = asyncio.run(main())
        sys.exit(exit_code)
    except KeyboardInterrupt:
        print("\n\nTests interrupted by user.")
        sys.exit(1)
    except Exception as e:
        print(f"\n\nTest suite failed: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

@@ -1,4 +1,4 @@
-site_name: Crawl4AI Documentation (v0.7.x)
+site_name: Crawl4AI Documentation (v0.8.x)
 site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
 site_url: https://docs.crawl4ai.com
 repo_url: https://github.com/unclecode/crawl4ai