* Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
* Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
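For context, this is the Pydantic v2 pattern the refactor moves to: `model_config = ConfigDict(...)` replaces the old inner `class Config`. The model and field names below are illustrative, not Crawl4AI's actual models.

```python
# Pydantic v2 configuration via ConfigDict. arbitrary_types_allowed=True
# lets a model hold fields of plain, non-Pydantic types.
from pydantic import BaseModel, ConfigDict

class BrowserHandle:  # an arbitrary, non-Pydantic type
    def __init__(self, name: str):
        self.name = name

class CrawlState(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    handle: BrowserHandle
    url: str

state = CrawlState(handle=BrowserHandle("chromium"), url="https://example.com")
print(state.url)
```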
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: ensure BrowserConfig.to_dict serializes proxy_config
* feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
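The shape of the change can be sketched as a backoff wrapper whose delay, attempt count, and growth factor are parameters rather than constants. Names and signatures here (`perform_with_backoff`, `base_delay`, `factor`) are illustrative, not the library's actual API.

```python
# Hypothetical sketch of configurable exponential backoff: the knobs that
# used to be hard-coded are now arguments threaded in by the caller.
import time

def perform_with_backoff(call, *, max_attempts=3, base_delay=1.0,
                         factor=2.0, sleep=time.sleep):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= factor  # exponential growth between retries

# Demo: fail twice, then succeed; record the delays actually requested.
calls = {"n": 0}
delays = []

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = perform_with_backoff(flaky, max_attempts=5, base_delay=0.5,
                              factor=3.0, sleep=delays.append)
print(result, delays)  # → ok [0.5, 1.5]
```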
* reproduced AttributeError from #1642
* pass timeout parameter to docker client request
* added missing deep crawling objects to init
* generalized query in ContentRelevanceFilter to be a str or list
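One common way to handle such a generalization is to normalize the argument up front so the rest of the filter always sees a list. The helper below is an illustrative sketch, not Crawl4AI's actual code.

```python
# Accept either a single query string or a list of queries, and
# normalize to a list at the boundary.
from typing import List, Union

def normalize_query(query: Union[str, List[str]]) -> List[str]:
    if isinstance(query, str):
        return [query]
    return list(query)

print(normalize_query("web crawling"))
print(normalize_query(["web crawling", "scraping"]))
```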
* import modules required for extensible deserialization
* parameterized tests
* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
* announcement: add application form for cloud API closed beta
* Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced a focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
* docs: add section for Crawl4AI Cloud API closed beta with application link
* fix: add disk cleanup step to Docker workflow
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.
Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
* Data privacy and ownership
* Cost control and transparency
* Performance and security advantages
* Full customization capabilities
- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
* Monitoring Dashboard section documenting the /monitor UI
* Complete feature breakdown (system health, requests, browsers, janitor, errors)
* Monitor API Endpoints with all REST endpoints and examples
* WebSocket Streaming integration guide with Python examples
* Control Actions for manual browser management
* Production Integration patterns (Prometheus, custom dashboards, alerting)
* Key production metrics to track
- Enhanced summary section:
* What users learned checklist
* Why self-hosting matters
* Clear next steps
* Key resources with monitoring dashboard URL
The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.
This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
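A consumer of the WebSocket stream might look like the sketch below. The endpoint path and message shape are assumptions for illustration only; the actual URLs and payloads are documented in the self-hosting guide.

```python
# Hedged sketch of a monitor WebSocket consumer. MONITOR_WS and the
# event fields ("type", "detail") are assumed, not the documented API.
import json

MONITOR_WS = "ws://localhost:11235/monitor/ws"  # assumed path

def summarize(event: dict) -> str:
    # Reduce one monitor event to a single log line.
    return f"{event.get('type', '?')}: {event.get('detail', '')}"

async def watch(url: str = MONITOR_WS):
    import websockets  # third-party: pip install websockets
    async with websockets.connect(url) as ws:
        async for raw in ws:
            print(summarize(json.loads(raw)))
```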
LLM extraction strategies made blocking, synchronous LLM calls during async execution, causing URLs to be processed sequentially instead of in parallel.
Changes:
- Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
- Implemented arun() method in ExtractionStrategy base class with thread pool fallback
- Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
- Updated AsyncWebCrawler.arun() to detect and use arun() when available
- Added comprehensive test suite to verify parallel execution
Impact:
- LLM extraction now runs truly in parallel across multiple URLs
- Significant performance improvement for multi-URL crawls with LLM strategies
- Backward compatible - existing extraction strategies continue to work
- No breaking changes to public API
Technical details:
- Uses litellm.acompletion for non-blocking LLM calls
- Leverages asyncio.gather for concurrent chunk processing
- Maintains backward compatibility via asyncio.to_thread fallback
- Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
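The pattern described above can be sketched in miniature. All names here are illustrative stand-ins (a fake coroutine replaces `litellm.acompletion`): an async path fans chunks out with `asyncio.gather`, while a dispatcher-style `arun()` falls back to `asyncio.to_thread` for legacy synchronous strategies.

```python
# Minimal sketch of parallel chunk extraction with a backward-compatible
# fallback for sync strategies.
import asyncio

async def fake_llm_call(chunk: str) -> str:  # stands in for litellm.acompletion
    await asyncio.sleep(0.01)                # non-blocking "network" wait
    return chunk.upper()

async def aextract(chunks):
    # All chunks are awaited concurrently instead of one-by-one.
    return await asyncio.gather(*(fake_llm_call(c) for c in chunks))

def sync_extract(chunks):                    # legacy blocking strategy
    return [c.upper() for c in chunks]

async def arun(strategy, chunks):
    if asyncio.iscoroutinefunction(strategy):
        return await strategy(chunks)
    # Backward-compatible fallback: run the sync strategy off the event loop.
    return await asyncio.to_thread(strategy, chunks)

results = asyncio.run(arun(aextract, ["import torch", "import os"]))
print(results)
```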
- Add lightweight security test to verify version requirements
- Add comprehensive integration test for crawl4ai functionality
- Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7
- All tests passing: security vulnerability is resolved
Related to #1545
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Updates pyOpenSSL from >=24.3.0 to >=25.3.0
- This resolves CVE affecting cryptography package versions >=37.0.0 & <43.0.1
- pyOpenSSL 25.3.0 requires cryptography>=45.0.7, which is above the vulnerable range
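The equivalent local upgrade for this dependency bump:

```shell
pip install --upgrade "pyOpenSSL>=25.3.0" "cryptography>=45.0.7"
```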
- Fixes issue #1545
Co-Authored-By: Claude <noreply@anthropic.com>