From 122b4fe3f03993b1c9fecc90758724a313a9bc7d Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Mon, 12 Jan 2026 13:46:39 +0100
Subject: [PATCH] Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
---
 docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md | 335 ++++++++++++++++++++++
 1 file changed, 335 insertions(+)
 create mode 100644 docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md

# Crawl4AI Release Notes v0.7.9

**Period**: December 13, 2025 - January 12, 2026
**Total Commits**: 18
**Status**: DRAFT - Pending review

---

## Breaking Changes ⚠️

### 1. Docker API Hooks Disabled by Default
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for an RCE vulnerability
- **Impact**: Hooks no longer run unless `CRAWL4AI_HOOKS_ENABLED=true` is set
- **Migration**: Set the environment variable `CRAWL4AI_HOOKS_ENABLED=true` in the Docker container

### 2. file:// URLs Blocked on Docker API
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for an LFI vulnerability
- **Impact**: The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints reject `file://` URLs
- **Migration**: Use the library directly for local file processing

---

## Security Fixes 🔒

### Fix Critical RCE and LFI Vulnerabilities in Docker API
- **Commit**: `f24396c`
- **Date**: January 12, 2026
- **Severity**: CRITICAL
- **Files Changed**:
  - `deploy/docker/config.yml`
  - `deploy/docker/hook_manager.py`
  - `deploy/docker/server.py`
  - `deploy/docker/tests/run_security_tests.py`
  - `deploy/docker/tests/test_security_fixes.py`

**Details**:
1. **Remote Code Execution via Hooks (CVE pending)**
   - Removed `__import__` from `allowed_builtins` in `hook_manager.py`
   - Prevents arbitrary module imports (`os`, `subprocess`, etc.)
   - Hooks are now disabled by default via the `CRAWL4AI_HOOKS_ENABLED` env var

2. **Local File Inclusion via file:// URLs (CVE pending)**
   - Added URL scheme validation to `/execute_js`, `/screenshot`, `/pdf`, and `/html`
   - Blocks `file://`, `javascript:`, `data:`, and other dangerous schemes
   - Only allows `http://`, `https://`, and `raw:` (where appropriate)

3. **Security hardening**
   - Added `CRAWL4AI_HOOKS_ENABLED=false` as the default (hooks are opt-in)
   - Added security warning comments in `config.yml`
   - Added a `validate_url_scheme()` helper for consistent validation

---

## New Features ✨

### 1. init_scripts Support for BrowserConfig
- **Commit**: `d10ca38`
- **Date**: December 14, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/browser_manager.py`

**Description**: Adds the ability to inject JavaScript before page load. Useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

**Usage**:
```python
config = BrowserConfig(
    init_scripts=["Object.defineProperty(navigator, 'webdriver', {get: () => false})"]
)
```

---

### 2. CDP Connection Improvements
- **Commit**: `02acad1`
- **Date**: December 18, 2025
- **Files Changed**:
  - `crawl4ai/browser_manager.py`
  - `tests/browser/test_cdp_cleanup_reuse.py`

**Description**:
- Support for WebSocket URLs (ws://, wss://) in CDP connections
- Proper cleanup when `cdp_cleanup_on_close=True`
- Enables reusing the same browser across multiple sequential connections

---

### 3. Crash Recovery for Deep Crawl Strategies
- **Commit**: `31ebf37`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/deep_crawling/bff_strategy.py`
  - `crawl4ai/deep_crawling/bfs_strategy.py`
  - `crawl4ai/deep_crawling/dfs_strategy.py`
  - `tests/deep_crawling/test_deep_crawl_resume.py`
  - `tests/deep_crawling/test_deep_crawl_resume_integration.py`

**Description**: Optional `resume_state` and `on_state_change` parameters for all deep crawl strategies (BFS, DFS, Best-First), enabling crash recovery in cloud deployments.

**Features**:
- `resume_state`: Pass saved state to resume from a checkpoint
- `on_state_change`: Async callback fired after each URL for real-time state persistence
- `export_state()`: Get the last captured state manually
- Zero overhead when the features are disabled (None defaults)
- State is JSON-serializable

---

### 4. PDF and MHTML Support for raw:/file:// URLs
- **Commit**: `67e03d6`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Generate PDFs and MHTML from cached HTML content. Replaced `_generate_screenshot_from_html` with a unified `_generate_media_from_html` method.

---

### 5. Screenshots for raw:/file:// URLs
- **Commit**: `444cb14`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Enables cached HTML to be rendered and screenshotted. Loads the HTML into the browser via `page.set_content()` and takes a screenshot.

---

### 6. base_url Parameter for CrawlerRunConfig
- **Commit**: `3937efc`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`

**Description**: When processing raw: HTML (e.g., from cache), provides proper URL resolution context for markdown link generation.
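The underlying need is plain base-relative link resolution: a relative link inside cached raw: HTML is ambiguous without a base. A standard-library sketch (hypothetical URLs, not crawl4ai code) shows what the added context enables:

```python
from urllib.parse import urljoin

# Without a base URL, "/docs/page.html" inside raw: HTML cannot be
# resolved; with one, it becomes an absolute link in the markdown output.
print(urljoin("https://example.com", "/docs/page.html"))
# → https://example.com/docs/page.html
```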

**Usage**:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

---

### 7. Prefetch Mode for Two-Phase Deep Crawling
- **Commit**: `fde4e9f`
- **Date**: December 25, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/utils.py`
  - `tests/test_prefetch_integration.py`
  - `tests/test_prefetch_mode.py`
  - `tests/test_prefetch_regression.py`

**Description**:
- Added a `prefetch` parameter to CrawlerRunConfig
- Added a `quick_extract_links()` function for fast link extraction
- Short-circuit in `aprocess_html()` for prefetch mode
- 42 tests added (unit, integration, regression)

---

### 8. Proxy Rotation and Configuration Updates
- **Commit**: `9e7f5aa`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/proxy_strategy.py`
  - `tests/proxy/test_sticky_sessions.py`

**Description**: Enhanced proxy rotation and proxy configuration options.

---

### 9. Proxy Support for HTTP Crawler Strategy
- **Commit**: `a43256b`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Added proxy support to the HTTP (non-browser) crawler strategy.

---

### 10. Browser Pipeline Support for raw:/file:// URLs
- **Commit**: `2550f3d`
- **Date**: December 27, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_crawler_strategy.py`
  - `crawl4ai/browser_manager.py`
  - `tests/test_raw_html_browser.py`
  - `tests/test_raw_html_edge_cases.py`

**Description**:
- Added a `process_in_browser` parameter to CrawlerRunConfig
- Routes raw:/file:// URLs through the browser when browser operations are needed
- Uses `page.set_content()` instead of `goto()` for local content
- Auto-detects browser requirements: js_code, wait_for, screenshot, etc.
- Maintains the fast path for raw:/file:// URLs without browser params

**Fixes**: #310

---

### 11. Smart TTL Cache for Sitemap URL Seeder
- **Commit**: `3d78001`
- **Date**: December 30, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_url_seeder.py`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from the old .jsonl format to the new .json format
- Fixes a bug where an incomplete cache was used indefinitely

---

## Bug Fixes 🐛

### 1. HTTP Strategy raw: URL Parsing Truncates at # Character
- **Commit**: `624e341`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Problem**: `urlparse()` treated `#` as a URL fragment delimiter, breaking CSS color codes like `#eee`.

**Before**: `raw:body{background:#eee}` → parsed to `body{background:`
**After**: `raw:body{background:#eee}` → parsed to `body{background:#eee}`

**Fix**: Strip the `raw:` prefix directly instead of using `urlparse()`.

---

### 2. Caching Debugging and Fixes
- **Commit**: `48426f7`
- **Date**: December 21, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_database.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/cache_validator.py`
  - `crawl4ai/models.py`
  - `crawl4ai/utils.py`
  - `tests/cache_validation/` (multiple test files)

**Description**: Various fixes and debugging improvements to the caching system.

---

## Documentation Updates 📚

### 1. Multi-Sample Schema Generation Section
- **Commit**: `6b2dca7`
- **Date**: January 4, 2026
- **Files Changed**:
  - `docs/md_v2/extraction/no-llm-strategies.md`

**Description**: Added documentation explaining how to pass multiple HTML samples to `generate_schema()` for stable selectors that work across pages with varying DOM structures.

---

### 2. URL Seeder Docs - Smart TTL Cache Parameters
- **Commit**: `db61ab8`
- **Date**: December 30, 2025
- **Files Changed**:
  - `docs/md_v2/core/url-seeding.md`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` to the parameter table
- Documented smart TTL cache validation with examples
- Added cache-related troubleshooting entries

---

## Other Changes 🔧

| Date | Commit | Description |
|------|--------|-------------|
| Dec 30, 2025 | `0d3f9e6` | Add MEMORY.md to gitignore |
| Dec 21, 2025 | `f6b29a8` | Update gitignore |

---

## Files Changed Summary

### Core Library
- `crawl4ai/async_configs.py` - Multiple changes (init_scripts, base_url, prefetch, proxy, process_in_browser, cache TTL)
- `crawl4ai/async_webcrawler.py` - Prefetch mode, base_url, proxy
- `crawl4ai/async_crawler_strategy.py` - Media generation, proxy, raw URL fixes
- `crawl4ai/browser_manager.py` - init_scripts, CDP cleanup, cookie handling
- `crawl4ai/async_url_seeder.py` - Smart TTL cache
- `crawl4ai/proxy_strategy.py` - Proxy rotation
- `crawl4ai/deep_crawling/*.py` - Crash recovery for all strategies
- `crawl4ai/utils.py` - quick_extract_links, cache fixes

### Docker Deployment
- `deploy/docker/server.py` - Security fixes
- `deploy/docker/hook_manager.py` - RCE fix
- `deploy/docker/config.yml` - Security warnings

### Documentation
- `docs/md_v2/extraction/no-llm-strategies.md` - Schema generation
- `docs/md_v2/core/url-seeding.md` - Cache parameters

### Tests Added
- `tests/test_prefetch_*.py` - 42 prefetch tests
- `tests/test_raw_html_*.py` - Raw HTML browser tests
- `tests/deep_crawling/test_deep_crawl_resume*.py` - Resume tests
- `tests/browser/test_cdp_cleanup_reuse.py` - CDP tests
- `tests/proxy/test_sticky_sessions.py` - Proxy tests
- `tests/cache_validation/*.py` - Cache tests
- `deploy/docker/tests/test_security_fixes.py` - Security tests
- `deploy/docker/tests/run_security_tests.py` - Security integration tests

---

## Questions for Main Developer

1. [ ] Are there any other breaking changes not captured here?
2. [ ] Should the security fixes get their own patch release?
3. [ ] Any features that need additional documentation?

---

*Generated: January 12, 2026*
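
---

## Appendix: Scheme Validation Sketch 🔒

The URL scheme validation introduced in `f24396c` is described above only in prose. As a rough illustration of the check (an approximation, not the actual `validate_url_scheme()` helper from `deploy/docker/server.py`; the allowed set is taken from the notes above), it amounts to:

```python
from urllib.parse import urlsplit

def validate_url_scheme(url: str, allow_raw: bool = False) -> bool:
    """Illustrative only: permit http(s), optionally raw:, reject the rest."""
    if allow_raw and url.startswith("raw:"):
        return True
    # file://, javascript:, data:, and unknown schemes all fail this check.
    return urlsplit(url).scheme.lower() in {"http", "https"}

print(validate_url_scheme("https://example.com"))             # True
print(validate_url_scheme("file:///etc/passwd"))              # False
print(validate_url_scheme("raw:<h1>hi</h1>", allow_raw=True)) # True
```

Endpoints that never accept `raw:` would call the helper with the default `allow_raw=False`.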