crawl4ai/JOURNAL.md
UncleCode a2061bf31e feat(crawler): add MHTML capture functionality
Add ability to capture web pages as MHTML format, which includes all page resources
in a single file. This enables complete page archival and offline viewing.

- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management

Breaking changes: None
2025-04-09 15:39:04 +08:00

Development Journal

This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.

[2025-04-09] Added MHTML Capture Feature

Feature: MHTML snapshot capture of crawled pages

Changes Made:

  1. Added capture_mhtml: bool = False parameter to CrawlerRunConfig class
  2. Added mhtml: Optional[str] = None field to CrawlResult model
  3. Added mhtml_data: Optional[str] = None field to AsyncCrawlResponse class
  4. Implemented capture_mhtml() method in AsyncPlaywrightCrawlerStrategy class to capture MHTML via CDP
  5. Modified the crawler to capture MHTML when enabled and pass it to the result
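The data flow described in the list above can be sketched with plain dataclasses. This is only an illustration under stated assumptions: `RunConfigSketch`, `ResponseSketch`, `ResultSketch`, `fetch`, and `to_result` are hypothetical stand-ins invented here, not the project's classes; only the field names `capture_mhtml`, `mhtml_data`, and `mhtml` and their defaults come from this journal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in for CrawlerRunConfig; only the new flag is modelled."""
    capture_mhtml: bool = False


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for AsyncCrawlResponse."""
    html: str
    mhtml_data: Optional[str] = None


@dataclass
class ResultSketch:
    """Hypothetical stand-in for CrawlResult."""
    html: str
    mhtml: Optional[str] = None


def fetch(config: RunConfigSketch) -> ResponseSketch:
    """Pretend crawl: a snapshot is produced only when the flag is set."""
    snapshot = "MIME-Version: 1.0\r\n(placeholder snapshot)" if config.capture_mhtml else None
    return ResponseSketch(html="<html></html>", mhtml_data=snapshot)


def to_result(response: ResponseSketch) -> ResultSketch:
    """Copy the strategy-level field onto the user-facing result."""
    return ResultSketch(html=response.html, mhtml=response.mhtml_data)


default_result = to_result(fetch(RunConfigSketch()))
enabled_result = to_result(fetch(RunConfigSketch(capture_mhtml=True)))
```

With the flag left at its default, `default_result.mhtml` stays `None`; with `capture_mhtml=True`, the snapshot string is carried through to `enabled_result.mhtml`.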

Implementation Details:

  • MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
  • The implementation waits for the page to load fully before capturing the MHTML content
  • The wait also uses requestAnimationFrame to improve capture of JavaScript-rendered content
  • All browser resources are cleaned up after capture
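A minimal sketch of how such a capture might look with Playwright's async API follows. It is not the project's actual code: the function name `capture_mhtml_sketch`, the constant `WAIT_FOR_FRAMES_JS`, the choice of the `"networkidle"` load state, and the two-frame wait are all assumptions made for illustration. The CDP command `Page.captureSnapshot` with `format: "mhtml"` is the standard DevTools call for this purpose and works only on Chromium-based browsers.

```python
from typing import Any, Optional

# Assumed wait strategy: resolve after two animation frames so that
# pending DOM updates have had a chance to be painted.
WAIT_FOR_FRAMES_JS = (
    "() => new Promise(resolve => "
    "requestAnimationFrame(() => requestAnimationFrame(resolve)))"
)


async def capture_mhtml_sketch(page: Any) -> Optional[str]:
    """Return an MHTML snapshot of a Playwright async page (hypothetical helper)."""
    # Wait for network activity to settle, then for the next frames to render.
    await page.wait_for_load_state("networkidle")
    await page.evaluate(WAIT_FOR_FRAMES_JS)

    # Open a CDP session bound to this page and request the snapshot.
    session = await page.context.new_cdp_session(page)
    try:
        snapshot = await session.send("Page.captureSnapshot", {"format": "mhtml"})
        return snapshot.get("data")
    finally:
        # Always release the CDP session, even if the capture fails.
        await session.detach()
```

The `try`/`finally` block mirrors the cleanup point above: the CDP session is detached whether or not the snapshot succeeds.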

Files Modified:

  • crawl4ai/models.py: Added the mhtml field to CrawlResult
  • crawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfig
  • crawl4ai/async_crawler_strategy.py: Implemented MHTML capture logic
  • crawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml

Testing:

  • Created comprehensive tests in tests/20241401/test_mhtml.py covering:
    • Capturing MHTML when enabled
    • Ensuring mhtml is None when disabled explicitly
    • Ensuring mhtml is None by default
    • Capturing MHTML on JavaScript-enabled pages
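One way a test could inspect captured output is to parse it as a MIME message, since MHTML is a `multipart/related` document. The sketch below uses only the standard library. `SAMPLE_MHTML` is a hand-written, simplified sample, not real browser output, and `mhtml_part_types` is a hypothetical helper that is not part of the project's test suite.

```python
from email import message_from_string
from typing import List

# Hand-written, simplified MHTML document used only for illustration.
SAMPLE_MHTML = (
    "From: <Saved by Blink>\r\n"
    "Snapshot-Content-Location: https://example.com/\r\n"
    "Subject: Example\r\n"
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; type="text/html"; boundary="----SketchBoundary"\r\n'
    "\r\n"
    "------SketchBoundary\r\n"
    "Content-Type: text/html\r\n"
    "Content-Location: https://example.com/\r\n"
    "\r\n"
    "<html><body><h1>Example</h1></body></html>\r\n"
    "------SketchBoundary\r\n"
    "Content-Type: text/css\r\n"
    "Content-Location: https://example.com/style.css\r\n"
    "\r\n"
    "h1 { color: red; }\r\n"
    "------SketchBoundary--\r\n"
)


def mhtml_part_types(mhtml: str) -> List[str]:
    """List the content type of each embedded resource, or [] if not multipart."""
    message = message_from_string(mhtml)
    if not message.is_multipart():
        return []
    return [part.get_content_type() for part in message.get_payload()]


part_types = mhtml_part_types(SAMPLE_MHTML)
```

A check along these lines would confirm both that the snapshot is well-formed MIME and that the expected resource types (HTML, CSS, and so on) were embedded.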

Challenges:

  • Had to improve page-load detection to ensure JavaScript content was fully rendered
  • Tests had to be run independently because of how Playwright browser instances are managed
  • Adjusted the expected content in the tests to match the actual MHTML output

Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:

  1. Offline viewing of captured pages
  2. Creating permanent snapshots of web content for archival
  3. Ensuring consistent content for later analysis, even if the original site changes

Future Enhancements to Consider:

  • Add an option to save MHTML to a file
  • Support filtering which resources are included in the MHTML
  • Support specifying MHTML capture options
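Until a built-in save option exists, a caller could write the captured string to disk directly. The helper below is a hypothetical sketch (`save_mhtml` is not part of the library). It writes bytes, not text, so that the CRLF line endings that MIME documents normally use are not altered by platform newline translation.

```python
from pathlib import Path
from typing import Optional, Union


def save_mhtml(mhtml: Optional[str], path: Union[str, Path]) -> Optional[Path]:
    """Write an MHTML string to `path`; return None if nothing was captured."""
    if mhtml is None:
        # Capture was disabled or failed, so there is nothing to save.
        return None
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Encode explicitly and write bytes to keep line endings untouched.
    target.write_bytes(mhtml.encode("utf-8"))
    return target
```

A caller would pass the `mhtml` field of a crawl result and a destination such as `snapshots/page.mhtml`; Chromium-based browsers can typically open the resulting file for offline viewing.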