Add ability to capture web pages as MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. - Add capture_mhtml parameter to CrawlerRunConfig - Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy - Add mhtml field to CrawlResult and AsyncCrawlResponse models - Add comprehensive tests for MHTML capture functionality - Update documentation with MHTML capture details - Add exclude_all_images option for better memory management Breaking changes: None
2.5 KiB
2.5 KiB
Development Journal
This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
[2025-04-09] Added MHTML Capture Feature
Feature: MHTML snapshot capture of crawled pages
Changes Made:
- Added
capture_mhtml: bool = Falseparameter toCrawlerRunConfigclass - Added
mhtml: Optional[str] = Nonefield toCrawlResultmodel - Added
mhtml_data: Optional[str] = Nonefield toAsyncCrawlResponseclass - Implemented
capture_mhtml()method inAsyncPlaywrightCrawlerStrategyclass to capture MHTML via CDP - Modified the crawler to capture MHTML when enabled and pass it to the result
Implementation Details:
- MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
- The implementation waits for page to fully load before capturing MHTML content
- Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
- We ensure all browser resources are properly cleaned up after capture
Files Modified:
crawl4ai/models.py: Added the mhtml field to CrawlResultcrawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py: Implemented MHTML capture logiccrawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
Testing:
- Created comprehensive tests in
tests/20241401/test_mhtml.pycovering:- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages
Challenges:
- Had to improve page loading detection to ensure JavaScript content was fully rendered
- Tests needed to be run independently due to Playwright browser instance management
- Modified test expected content to match actual MHTML output
Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:
- Offline viewing of captured pages
- Creating permanent snapshots of web content for archival
- Ensuring consistent content for later analysis, even if the original site changes
Future Enhancements to Consider:
- Add option to save MHTML to file
- Support for filtering what resources get included in MHTML
- Add support for specifying MHTML capture options