Files

UncleCode a2061bf31e feat(crawler): add MHTML capture functionality

Add ability to capture web pages as MHTML format, which includes all page resources
in a single file. This enables complete page archival and offline viewing.

- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management

Breaking changes: None

2025-04-09 15:39:04 +08:00

2.5 KiB

Raw Blame History

Development Journal

This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.

[2025-04-09] Added MHTML Capture Feature

Feature: MHTML snapshot capture of crawled pages

Changes Made:

Added capture_mhtml: bool = False parameter to CrawlerRunConfig class
Added mhtml: Optional[str] = None field to CrawlResult model
Added mhtml_data: Optional[str] = None field to AsyncCrawlResponse class
Implemented capture_mhtml() method in AsyncPlaywrightCrawlerStrategy class to capture MHTML via CDP
Modified the crawler to capture MHTML when enabled and pass it to the result

Implementation Details:

MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
The implementation waits for page to fully load before capturing MHTML content
Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
We ensure all browser resources are properly cleaned up after capture

Files Modified:

crawl4ai/models.py: Added the mhtml field to CrawlResult
crawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfig
crawl4ai/async_crawler_strategy.py: Implemented MHTML capture logic
crawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml

Testing:

Created comprehensive tests in tests/20241401/test_mhtml.py covering:
- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages

Challenges:

Had to improve page loading detection to ensure JavaScript content was fully rendered
Tests needed to be run independently due to Playwright browser instance management
Modified test expected content to match actual MHTML output

Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:

Offline viewing of captured pages
Creating permanent snapshots of web content for archival
Ensuring consistent content for later analysis, even if the original site changes

Future Enhancements to Consider:

Add option to save MHTML to file
Support for filtering what resources get included in MHTML
Add support for specifying MHTML capture options

2.5 KiB Raw Blame History

Development Journal

[2025-04-09] Added MHTML Capture Feature

2.5 KiB

Raw Blame History