Development Journal

This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.

[2025-04-17] Added Content Source Selection for Markdown Generation

Feature: Configurable content source for markdown generation

Changes Made:

  1. Added content_source: str = "cleaned_html" parameter to MarkdownGenerationStrategy class
  2. Updated DefaultMarkdownGenerator to accept and pass the content source parameter
  3. Renamed the cleaned_html parameter to input_html in the generate_markdown method
  4. Modified AsyncWebCrawler.aprocess_html to select the appropriate HTML source based on the generator's config
  5. Added preprocess_html_for_schema import in async_webcrawler.py

Implementation Details:

  • Added a new content_source parameter to specify which HTML input to use for markdown generation
  • Options include: "cleaned_html" (default), "raw_html", and "fit_html"
  • Used a dictionary dispatch pattern in aprocess_html to select the appropriate HTML source
  • Added proper error handling with fallback to cleaned_html if content source selection fails
  • Ensured backward compatibility by defaulting to "cleaned_html" option
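
The dispatch-and-fallback logic described above can be sketched in isolation. `CrawlContext` and `select_html_source` are illustrative stand-ins here, not the actual crawl4ai classes:

```python
from dataclasses import dataclass


@dataclass
class CrawlContext:
    """Stand-in for the crawl state holding the three HTML variants."""
    cleaned_html: str
    raw_html: str
    fit_html: str


def select_html_source(ctx: CrawlContext, content_source: str = "cleaned_html") -> str:
    """Pick the HTML input for markdown generation via dictionary dispatch."""
    dispatch = {
        "cleaned_html": lambda c: c.cleaned_html,
        "raw_html": lambda c: c.raw_html,
        "fit_html": lambda c: c.fit_html,
    }
    try:
        return dispatch[content_source](ctx)
    except KeyError:
        # Unknown source: fall back to the backward-compatible default.
        return ctx.cleaned_html
```

The fallback in the `except` branch is what preserves the pre-existing behavior when a caller passes an unrecognized source name.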

Files Modified:

  • crawl4ai/markdown_generation_strategy.py: Added content_source parameter and updated the method signature
  • crawl4ai/async_webcrawler.py: Added HTML source selection logic and updated imports

Examples:

  • Created docs/examples/content_source_example.py demonstrating how to use the new parameter

Challenges:

  • Maintaining backward compatibility while reorganizing the parameter flow
  • Ensuring proper error handling for all content source options
  • Making the change with minimal code modifications

Why This Feature: The content source selection feature allows users to choose which HTML content to use as input for markdown generation:

  1. "cleaned_html" - Uses the post-processed HTML produced by the scraping strategy (original behavior)
  2. "raw_html" - Uses the original raw HTML directly from the web page
  3. "fit_html" - Uses the preprocessed HTML optimized for schema extraction

This feature provides greater flexibility in how users generate markdown, enabling them to:

  • Capture more detailed content from the original HTML when needed
  • Use schema-optimized HTML when working with structured data
  • Choose the approach that best suits their specific use case

[2025-04-09] Added MHTML Capture Feature

Feature: MHTML snapshot capture of crawled pages

Changes Made:

  1. Added capture_mhtml: bool = False parameter to CrawlerRunConfig class
  2. Added mhtml: Optional[str] = None field to CrawlResult model
  3. Added mhtml_data: Optional[str] = None field to AsyncCrawlResponse class
  4. Implemented capture_mhtml() method in AsyncPlaywrightCrawlerStrategy class to capture MHTML via CDP
  5. Modified the crawler to capture MHTML when enabled and pass it to the result

Implementation Details:

  • MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
  • The implementation waits for the page to fully load before capturing the MHTML content
  • Enhanced the wait with requestAnimationFrame so JavaScript-rendered content is fully captured
  • Ensured all browser resources are properly cleaned up after capture
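
The CDP-based capture can be sketched as follows. `capture_mhtml` is a hypothetical helper name; `page` is assumed to be a Playwright async `Page` backed by Chromium, since MHTML snapshots (`Page.captureSnapshot`) are a Chrome DevTools Protocol feature:

```python
import asyncio
from typing import Optional


async def capture_mhtml(page) -> Optional[str]:
    """Capture an MHTML snapshot of the current page via a CDP session."""
    session = await page.context.new_cdp_session(page)
    try:
        # Let pending rendering work settle before snapshotting.
        await page.evaluate(
            "() => new Promise(resolve => requestAnimationFrame(() => resolve()))"
        )
        result = await session.send("Page.captureSnapshot", {"format": "mhtml"})
        return result.get("data")
    finally:
        # Detach the CDP session so browser resources are released.
        await session.detach()
```

The `finally` block mirrors the cleanup concern noted above: the session is detached even when the snapshot call raises.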

Files Modified:

  • crawl4ai/models.py: Added the mhtml field to CrawlResult
  • crawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfig
  • crawl4ai/async_crawler_strategy.py: Implemented MHTML capture logic
  • crawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml

Testing:

  • Created comprehensive tests in tests/20241401/test_mhtml.py covering:
    • Capturing MHTML when enabled
    • Ensuring mhtml is None when disabled explicitly
    • Ensuring mhtml is None by default
    • Capturing MHTML on JavaScript-enabled pages

Challenges:

  • Had to improve page loading detection to ensure JavaScript content was fully rendered
  • Tests needed to be run independently due to Playwright browser instance management
  • Adjusted the tests' expected content to match the actual MHTML output

Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:

  1. Offline viewing of captured pages
  2. Creating permanent snapshots of web content for archival
  3. Ensuring consistent content for later analysis, even if the original site changes

Future Enhancements to Consider:

  • Add option to save MHTML to file
  • Support for filtering what resources get included in MHTML
  • Add support for specifying MHTML capture options
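
The first enhancement listed could start as a small helper. `save_mhtml` below is a hypothetical name, not part of crawl4ai; it simply persists an already-captured snapshot string:

```python
from pathlib import Path
from typing import Optional


def save_mhtml(mhtml: Optional[str], path: str) -> bool:
    """Write captured MHTML to disk; returns False when nothing was captured."""
    if mhtml is None:
        # capture_mhtml was disabled or capture failed, so there is nothing to save.
        return False
    Path(path).write_text(mhtml, encoding="utf-8")
    return True
```

Returning a boolean lets callers distinguish "capture was off" from a successful write without a try/except.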

[2025-04-10] Added Network Request and Console Message Capturing

Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling

Changes Made:

  1. Added capture_network_requests: bool = False and capture_console_messages: bool = False parameters to CrawlerRunConfig class
  2. Added network_requests: Optional[List[Dict[str, Any]]] = None and console_messages: Optional[List[Dict[str, Any]]] = None fields to both AsyncCrawlResponse and CrawlResult models
  3. Implemented event listeners in AsyncPlaywrightCrawlerStrategy._crawl_web() to capture browser network events and console messages
  4. Added proper event listener cleanup in the finally block to prevent resource leaks
  5. Modified the crawler flow to pass captured data from AsyncCrawlResponse to CrawlResult

Implementation Details:

  • Network capture uses Playwright event listeners (request, response, and requestfailed) to record all network activity
  • Console capture uses Playwright event listeners (console and pageerror) to record console messages and errors
  • Each network event includes metadata like URL, headers, status, and timing information
  • Each console message includes type, text content, and source location when available
  • All captured events include timestamps for chronological analysis
  • Error handling ensures even failed capture attempts won't crash the main crawling process
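
The listener wiring and record shapes described above can be sketched like this. `attach_capture`, `_event`, and the exact field names are illustrative, not crawl4ai's actual keys; the `page` object is assumed to follow Playwright's `page.on` / `page.remove_listener` event API:

```python
import time
from typing import Any, Callable, Dict, List


def _event(kind: str, url: str, **extra: Any) -> Dict[str, Any]:
    # Every record carries a timestamp for chronological analysis.
    return {"event_type": kind, "url": url, "timestamp": time.time(), **extra}


def attach_capture(page, network_log: List[dict], console_log: List[dict]) -> Callable[[], None]:
    """Register network/console listeners; returns a callback that detaches them."""
    handlers = {
        "request": lambda r: network_log.append(_event("request", r.url, method=r.method)),
        "response": lambda r: network_log.append(_event("response", r.url, status=r.status)),
        "requestfailed": lambda r: network_log.append(_event("request_failed", r.url, failure=r.failure)),
        "console": lambda m: console_log.append(
            {"type": m.type, "text": m.text, "timestamp": time.time()}
        ),
        "pageerror": lambda e: console_log.append(
            {"type": "pageerror", "text": str(e), "timestamp": time.time()}
        ),
    }
    for name, handler in handlers.items():
        page.on(name, handler)

    def detach() -> None:
        # Removing listeners (e.g. in a finally block) prevents leaks across crawls.
        for name, handler in handlers.items():
            page.remove_listener(name, handler)

    return detach
```

Keeping the handler references in one dict is what makes the cleanup step possible: `remove_listener` needs the exact same callables that were registered.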

Files Modified:

  • crawl4ai/models.py: Added new fields to AsyncCrawlResponse and CrawlResult
  • crawl4ai/async_configs.py: Added new configuration parameters to CrawlerRunConfig
  • crawl4ai/async_crawler_strategy.py: Implemented capture logic using event listeners
  • crawl4ai/async_webcrawler.py: Added data transfer from AsyncCrawlResponse to CrawlResult

Documentation:

  • Created detailed documentation in docs/md_v2/advanced/network-console-capture.md
  • Added feature to site navigation in mkdocs.yml
  • Updated CrawlResult documentation in docs/md_v2/api/crawl-result.md
  • Created comprehensive example in docs/examples/network_console_capture_example.py

Testing:

  • Created tests/general/test_network_console_capture.py with tests for:
    • Verifying capture is disabled by default
    • Testing network request capturing
    • Testing console message capturing
    • Ensuring both capture types can be enabled simultaneously
    • Checking correct content is captured in expected formats

Challenges:

  • Initial implementation had synchronous/asynchronous mismatches in event handlers
  • Needed to correct confusion between property access and method calls in the handlers
  • Required careful cleanup of event listeners to prevent memory leaks

Why This Feature: The network and console capture feature provides deep visibility into web page activity, enabling:

  1. Debugging complex web applications by seeing all network requests and errors
  2. Security analysis to detect unexpected third-party requests and data flows
  3. Performance profiling to identify slow-loading resources
  4. API discovery in single-page applications
  5. Comprehensive analysis of web application behavior

Future Enhancements to Consider:

  • Option to filter captured events by type, domain, or content
  • Support for capturing response bodies (with size limits)
  • Aggregate statistics calculation for performance metrics
  • Integration with visualization tools for network waterfall analysis
  • Exporting captures in HAR format for use with external tools