feat(crawler): add MHTML capture functionality

Add ability to capture web pages as MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. - Add capture_mhtml parameter to CrawlerRunConfig - Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy - Add mhtml field to CrawlResult and AsyncCrawlResponse models - Add comprehensive tests for MHTML capture functionality - Update documentation with MHTML capture details - Add exclude_all_images option for better memory management Breaking changes: None
2025-04-09 15:39:04 +08:00
parent 9038e9acbd
commit a2061bf31e
14 changed files with 467 additions and 24 deletions
--- a/JOURNAL.md
+++ b/JOURNAL.md
@@ -0,0 +1,49 @@
+# Development Journal
+
+This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
+
+## [2025-04-09] Added MHTML Capture Feature
+
+**Feature:** MHTML snapshot capture of crawled pages
+
+**Changes Made:**
+1. Added `capture_mhtml: bool = False` parameter to `CrawlerRunConfig` class
+2. Added `mhtml: Optional[str] = None` field to `CrawlResult` model
+3. Added `mhtml_data: Optional[str] = None` field to `AsyncCrawlResponse` class
+4. Implemented `capture_mhtml()` method in `AsyncPlaywrightCrawlerStrategy` class to capture MHTML via CDP
+5. Modified the crawler to capture MHTML when enabled and pass it to the result
+
+**Implementation Details:**
+- MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
+- The implementation waits for page to fully load before capturing MHTML content
+- Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
+- We ensure all browser resources are properly cleaned up after capture
+
+**Files Modified:**
+- `crawl4ai/models.py`: Added the mhtml field to CrawlResult
+- `crawl4ai/async_configs.py`: Added capture_mhtml parameter to CrawlerRunConfig
+- `crawl4ai/async_crawler_strategy.py`: Implemented MHTML capture logic
+- `crawl4ai/async_webcrawler.py`: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
+
+**Testing:**
+- Created comprehensive tests in `tests/20241401/test_mhtml.py` covering:
+  - Capturing MHTML when enabled
+  - Ensuring mhtml is None when disabled explicitly
+  - Ensuring mhtml is None by default
+  - Capturing MHTML on JavaScript-enabled pages
+
+**Challenges:**
+- Had to improve page loading detection to ensure JavaScript content was fully rendered
+- Tests needed to be run independently due to Playwright browser instance management
+- Modified test expected content to match actual MHTML output
+
+**Why This Feature:**
+The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:
+1. Offline viewing of captured pages
+2. Creating permanent snapshots of web content for archival
+3. Ensuring consistent content for later analysis, even if the original site changes
+
+**Future Enhancements to Consider:**
+- Add option to save MHTML to file
+- Support for filtering what resources get included in MHTML
+- Add support for specifying MHTML capture options