# Crawl4AI Release Notes v0.7.9

Period: December 13, 2025 - January 12, 2026
Total Commits: 18
Status: DRAFT - Pending review
## Breaking Changes ⚠️

### 1. Docker API Hooks Disabled by Default
- Commit: `f24396c` (Jan 12, 2026)
- Reason: Security fix for RCE vulnerability
- Impact: Hooks no longer work unless `CRAWL4AI_HOOKS_ENABLED=true` is set
- Migration: Set the environment variable `CRAWL4AI_HOOKS_ENABLED=true` in the Docker container

### 2. file:// URLs Blocked on Docker API
- Commit: `f24396c` (Jan 12, 2026)
- Reason: Security fix for LFI vulnerability
- Impact: The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints reject `file://` URLs
- Migration: Use the library directly for local file processing
## Security Fixes 🔒

### Fix Critical RCE and LFI Vulnerabilities in Docker API
- Commit: `f24396c`
- Date: January 12, 2026
- Severity: CRITICAL
- Files Changed:
  - `deploy/docker/config.yml`
  - `deploy/docker/hook_manager.py`
  - `deploy/docker/server.py`
  - `deploy/docker/tests/run_security_tests.py`
  - `deploy/docker/tests/test_security_fixes.py`

Details:

- **Remote Code Execution via Hooks (CVE pending)**
  - Removed `__import__` from `allowed_builtins` in `hook_manager.py`, preventing arbitrary module imports (`os`, `subprocess`, etc.)
  - Hooks are now disabled by default via the `CRAWL4AI_HOOKS_ENABLED` env var
- **Local File Inclusion via file:// URLs (CVE pending)**
  - Added URL scheme validation to `/execute_js`, `/screenshot`, `/pdf`, and `/html`
  - Blocks `file://`, `javascript:`, `data:`, and other dangerous schemes
  - Allows only `http://`, `https://`, and `raw:` (where appropriate)
- **Security hardening**
  - Added `CRAWL4AI_HOOKS_ENABLED=false` as the default (hooks are opt-in)
  - Added security warning comments in `config.yml`
  - Added a `validate_url_scheme()` helper for consistent validation
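The allow-list idea behind that helper can be sketched roughly as follows. Note this is an illustrative guess at the behavior, not the shipped code: the real `validate_url_scheme()` lives in the Docker server and its signature may differ.

```python
from urllib.parse import urlparse

# Illustrative sketch of scheme allow-listing as described above. The real
# validate_url_scheme() helper in deploy/docker/server.py may differ; this
# only shows the general idea: allow-list safe schemes, reject everything else.

def validate_url_scheme(url: str, allow_raw: bool = False) -> bool:
    allowed = {"http", "https"} | ({"raw"} if allow_raw else set())
    return urlparse(url).scheme.lower() in allowed
```

Allow-listing (rather than block-listing `file://` and friends) means newly dangerous schemes are rejected by default.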
## New Features ✨

### 1. init_scripts Support for BrowserConfig
- Commit: `d10ca38`
- Date: December 14, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/browser_manager.py`

Description: Pre-page-load JavaScript injection. Useful for stealth evasions (canvas/audio fingerprinting, `userAgentData`).

Usage:

```python
config = BrowserConfig(
    init_scripts=["Object.defineProperty(navigator, 'webdriver', {get: () => false})"]
)
```
### 2. CDP Connection Improvements
- Commit: `02acad1`
- Date: December 18, 2025
- Files Changed:
  - `crawl4ai/browser_manager.py`
  - `tests/browser/test_cdp_cleanup_reuse.py`

Description:
- Support WebSocket URLs (`ws://`, `wss://`) for CDP connections
- Proper cleanup when `cdp_cleanup_on_close=True`
- Enables reusing the same browser across multiple sequential connections
### 3. Crash Recovery for Deep Crawl Strategies
- Commit: `31ebf37`
- Date: December 22, 2025
- Files Changed:
  - `crawl4ai/deep_crawling/bff_strategy.py`
  - `crawl4ai/deep_crawling/bfs_strategy.py`
  - `crawl4ai/deep_crawling/dfs_strategy.py`
  - `tests/deep_crawling/test_deep_crawl_resume.py`
  - `tests/deep_crawling/test_deep_crawl_resume_integration.py`

Description: Optional `resume_state` and `on_state_change` parameters for all deep crawl strategies (BFS, DFS, Best-First), enabling crash recovery in cloud deployments.

Features:
- `resume_state`: Pass saved state to resume from a checkpoint
- `on_state_change`: Async callback fired after each URL for real-time state persistence
- `export_state()`: Get the last captured state manually
- Zero overhead when the features are disabled (`None` defaults)
- State is JSON-serializable
### 4. PDF and MHTML Support for raw:/file:// URLs
- Commit: `67e03d6`
- Date: December 22, 2025
- Files Changed:
  - `crawl4ai/async_crawler_strategy.py`

Description: Generates PDFs and MHTML from cached HTML content. Replaced `_generate_screenshot_from_html` with a unified `_generate_media_from_html` method.
### 5. Screenshots for raw:/file:// URLs
- Commit: `444cb14`
- Date: December 22, 2025
- Files Changed:
  - `crawl4ai/async_crawler_strategy.py`

Description: Enables cached HTML to be rendered and screenshotted. Loads the HTML into the browser via `page.set_content()` and takes a screenshot.
### 6. base_url Parameter for CrawlerRunConfig
- Commit: `3937efc`
- Date: December 24, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`

Description: When processing `raw:` HTML (e.g., from cache), provides the proper URL resolution context for markdown link generation.

Usage:

```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url=f'raw:{html}', config=config)
```
### 7. Prefetch Mode for Two-Phase Deep Crawling
- Commit: `fde4e9f`
- Date: December 25, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/utils.py`
  - `tests/test_prefetch_integration.py`
  - `tests/test_prefetch_mode.py`
  - `tests/test_prefetch_regression.py`

Description:
- Added a `prefetch` parameter to CrawlerRunConfig
- Added a `quick_extract_links()` function for fast link extraction
- Short-circuit in `aprocess_html()` for prefetch mode
- 42 tests added (unit, integration, regression)
### 8. Proxy Rotation and Configuration Updates
- Commit: `9e7f5aa`
- Date: December 26, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/proxy_strategy.py`
  - `tests/proxy/test_sticky_sessions.py`

Description: Enhanced proxy rotation and proxy configuration options.
### 9. Proxy Support for HTTP Crawler Strategy
- Commit: `a43256b`
- Date: December 26, 2025
- Files Changed:
  - `crawl4ai/async_crawler_strategy.py`

Description: Added proxy support to the HTTP (non-browser) crawler strategy.
### 10. Browser Pipeline Support for raw:/file:// URLs
- Commit: `2550f3d`
- Date: December 27, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_crawler_strategy.py`
  - `crawl4ai/browser_manager.py`
  - `tests/test_raw_html_browser.py`
  - `tests/test_raw_html_edge_cases.py`

Description:
- Added a `process_in_browser` parameter to CrawlerRunConfig
- Routes raw:/file:// URLs through the browser when browser operations are needed
- Uses `page.set_content()` instead of `goto()` for local content
- Auto-detects browser requirements: `js_code`, `wait_for`, `screenshot`, etc.
- Maintains the fast path for raw:/file:// URLs without browser params

Fixes: #310
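The auto-detection rule can be sketched as a simple predicate: raw:/file:// URLs only enter the browser pipeline when the run config actually needs one. The field names below mirror the release notes, but the check itself is a hypothetical sketch; the real logic lives inside crawl4ai's crawler strategy and may consider more options.

```python
# Options that cannot be satisfied without a live browser page.
BROWSER_ONLY_FIELDS = ("js_code", "wait_for", "screenshot", "pdf")

def needs_browser(config: dict) -> bool:
    # Any truthy browser-only option forces the browser path; otherwise the
    # raw:/file:// content stays on the fast non-browser path.
    return any(config.get(field) for field in BROWSER_ONLY_FIELDS)

fast_path = {"screenshot": False}                     # no browser needed
browser_path = {"js_code": "window.scrollTo(0, 999)"} # needs a real page
```

Keeping the predicate conservative preserves the fast path: plain raw: HTML with no browser-only options never pays browser startup cost.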
### 11. Smart TTL Cache for Sitemap URL Seeder
- Commit: `3d78001`
- Date: December 30, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_url_seeder.py`

Description:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from the old .jsonl format to the new .json format
- Fixes a bug where an incomplete cache was used indefinitely
## Bug Fixes 🐛

### 1. HTTP Strategy raw: URL Parsing Truncates at # Character
- Commit: `624e341`
- Date: December 24, 2025
- Files Changed:
  - `crawl4ai/async_crawler_strategy.py`

Problem: `urlparse()` treated `#` as the URL fragment delimiter, breaking CSS color codes like `#eee`.

Before: `raw:body{background:#eee}` → parsed to `body{background:`
After: `raw:body{background:#eee}` → parsed to `body{background:#eee}`

Fix: Strip the `raw:` prefix directly instead of using `urlparse`.
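The before/after above can be reproduced directly with the standard library, which shows why slicing off the prefix is the right fix:

```python
from urllib.parse import urlparse

# urlparse() splits on '#' (the URL fragment delimiter), so an inline CSS
# payload loses everything after a color code like #eee. Slicing off the
# 'raw:' prefix keeps the payload intact.

raw_url = "raw:body{background:#eee}"

broken = urlparse(raw_url).path   # '#eee' is swallowed as a fragment
fixed = raw_url[len("raw:"):]     # the full payload survives
```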
### 2. Caching Debugging and Fixes
- Commit: `48426f7`
- Date: December 21, 2025
- Files Changed:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_database.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/cache_validator.py`
  - `crawl4ai/models.py`
  - `crawl4ai/utils.py`
  - `tests/cache_validation/` (multiple test files)

Description: Assorted debugging improvements and fixes to the caching system.
## Documentation Updates 📚

### 1. Multi-Sample Schema Generation Section
- Commit: `6b2dca7`
- Date: January 4, 2026
- Files Changed:
  - `docs/md_v2/extraction/no-llm-strategies.md`

Description: Added documentation explaining how to pass multiple HTML samples to `generate_schema()` to produce stable selectors that work across pages with varying DOM structures.

### 2. URL Seeder Docs - Smart TTL Cache Parameters
- Commit: `db61ab8`
- Date: December 30, 2025
- Files Changed:
  - `docs/md_v2/core/url-seeding.md`

Description:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` to the parameter table
- Documented smart TTL cache validation with examples
- Added cache-related troubleshooting entries
## Other Changes 🔧

| Date | Commit | Description |
|---|---|---|
| Dec 30, 2025 | `0d3f9e6` | Add MEMORY.md to gitignore |
| Dec 21, 2025 | `f6b29a8` | Update gitignore |
## Files Changed Summary

### Core Library
- `crawl4ai/async_configs.py` - Multiple changes (init_scripts, base_url, prefetch, proxy, process_in_browser, cache TTL)
- `crawl4ai/async_webcrawler.py` - Prefetch mode, base_url, proxy
- `crawl4ai/async_crawler_strategy.py` - Media generation, proxy, raw URL fixes
- `crawl4ai/browser_manager.py` - init_scripts, CDP cleanup, cookie handling
- `crawl4ai/async_url_seeder.py` - Smart TTL cache
- `crawl4ai/proxy_strategy.py` - Proxy rotation
- `crawl4ai/deep_crawling/*.py` - Crash recovery for all strategies
- `crawl4ai/utils.py` - quick_extract_links, cache fixes

### Docker Deployment
- `deploy/docker/server.py` - Security fixes
- `deploy/docker/hook_manager.py` - RCE fix
- `deploy/docker/config.yml` - Security warnings

### Documentation
- `docs/md_v2/extraction/no-llm-strategies.md` - Schema generation
- `docs/md_v2/core/url-seeding.md` - Cache parameters

### Tests Added
- `tests/test_prefetch_*.py` - 42 prefetch tests
- `tests/test_raw_html_*.py` - Raw HTML browser tests
- `tests/deep_crawling/test_deep_crawl_resume*.py` - Resume tests
- `tests/browser/test_cdp_cleanup_reuse.py` - CDP tests
- `tests/proxy/test_sticky_sessions.py` - Proxy tests
- `tests/cache_validation/*.py` - Cache tests
- `deploy/docker/tests/test_security_fixes.py` - Security tests
- `deploy/docker/tests/run_security_tests.py` - Security integration tests
## Questions for Main Developer
- Are there any other breaking changes not captured here?
- Should the security fixes get their own patch release?
- Any features that need additional documentation?
Generated: January 12, 2026