Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose
8.1 KiB
Development Journal
This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
[2025-04-17] Added Content Source Selection for Markdown Generation
Feature: Configurable content source for markdown generation
Changes Made:
- Added
content_source: str = "cleaned_html"parameter toMarkdownGenerationStrategyclass - Updated
DefaultMarkdownGeneratorto accept and pass the content source parameter - Renamed the
cleaned_htmlparameter toinput_htmlin thegenerate_markdownmethod - Modified
AsyncWebCrawler.aprocess_htmlto select the appropriate HTML source based on the generator's config - Added
preprocess_html_for_schemaimport inasync_webcrawler.py
Implementation Details:
- Added a new
content_sourceparameter to specify which HTML input to use for markdown generation - Options include: "cleaned_html" (default), "raw_html", and "fit_html"
- Used a dictionary dispatch pattern in
aprocess_htmlto select the appropriate HTML source - Added proper error handling with fallback to cleaned_html if content source selection fails
- Ensured backward compatibility by defaulting to "cleaned_html" option
Files Modified:
crawl4ai/markdown_generation_strategy.py: Added content_source parameter and updated the method signaturecrawl4ai/async_webcrawler.py: Added HTML source selection logic and updated imports
Examples:
- Created
docs/examples/content_source_example.pydemonstrating how to use the new parameter
Challenges:
- Maintaining backward compatibility while reorganizing the parameter flow
- Ensuring proper error handling for all content source options
- Making the change with minimal code modifications
Why This Feature: The content source selection feature allows users to choose which HTML content to use as input for markdown generation:
- "cleaned_html" - Uses the post-processed HTML after scraping strategy (original behavior)
- "raw_html" - Uses the original raw HTML directly from the web page
- "fit_html" - Uses the preprocessed HTML optimized for schema extraction
This feature provides greater flexibility in how users generate markdown, enabling them to:
- Capture more detailed content from the original HTML when needed
- Use schema-optimized HTML when working with structured data
- Choose the approach that best suits their specific use case
[2025-04-09] Added MHTML Capture Feature
Feature: MHTML snapshot capture of crawled pages
Changes Made:
- Added
capture_mhtml: bool = Falseparameter toCrawlerRunConfigclass - Added
mhtml: Optional[str] = Nonefield toCrawlResultmodel - Added
mhtml_data: Optional[str] = Nonefield toAsyncCrawlResponseclass - Implemented
capture_mhtml()method inAsyncPlaywrightCrawlerStrategyclass to capture MHTML via CDP - Modified the crawler to capture MHTML when enabled and pass it to the result
Implementation Details:
- MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
- The implementation waits for page to fully load before capturing MHTML content
- Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
- We ensure all browser resources are properly cleaned up after capture
Files Modified:
crawl4ai/models.py: Added the mhtml field to CrawlResultcrawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py: Implemented MHTML capture logiccrawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
Testing:
- Created comprehensive tests in
tests/20241401/test_mhtml.pycovering:- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages
Challenges:
- Had to improve page loading detection to ensure JavaScript content was fully rendered
- Tests needed to be run independently due to Playwright browser instance management
- Modified test expected content to match actual MHTML output
Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:
- Offline viewing of captured pages
- Creating permanent snapshots of web content for archival
- Ensuring consistent content for later analysis, even if the original site changes
Future Enhancements to Consider:
- Add option to save MHTML to file
- Support for filtering what resources get included in MHTML
- Add support for specifying MHTML capture options
[2025-04-10] Added Network Request and Console Message Capturing
Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling
Changes Made:
- Added
capture_network_requests: bool = Falseandcapture_console_messages: bool = Falseparameters toCrawlerRunConfigclass - Added
network_requests: Optional[List[Dict[str, Any]]] = Noneandconsole_messages: Optional[List[Dict[str, Any]]] = Nonefields to bothAsyncCrawlResponseandCrawlResultmodels - Implemented event listeners in
AsyncPlaywrightCrawlerStrategy._crawl_web()to capture browser network events and console messages - Added proper event listener cleanup in the finally block to prevent resource leaks
- Modified the crawler flow to pass captured data from AsyncCrawlResponse to CrawlResult
Implementation Details:
- Network capture uses Playwright event listeners (
request,response, andrequestfailed) to record all network activity - Console capture uses Playwright event listeners (
consoleandpageerror) to record console messages and errors - Each network event includes metadata like URL, headers, status, and timing information
- Each console message includes type, text content, and source location when available
- All captured events include timestamps for chronological analysis
- Error handling ensures even failed capture attempts won't crash the main crawling process
Files Modified:
crawl4ai/models.py: Added new fields to AsyncCrawlResponse and CrawlResultcrawl4ai/async_configs.py: Added new configuration parameters to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py: Implemented capture logic using event listenerscrawl4ai/async_webcrawler.py: Added data transfer from AsyncCrawlResponse to CrawlResult
Documentation:
- Created detailed documentation in
docs/md_v2/advanced/network-console-capture.md - Added feature to site navigation in
mkdocs.yml - Updated CrawlResult documentation in
docs/md_v2/api/crawl-result.md - Created comprehensive example in
docs/examples/network_console_capture_example.py
Testing:
- Created
tests/general/test_network_console_capture.pywith tests for:- Verifying capture is disabled by default
- Testing network request capturing
- Testing console message capturing
- Ensuring both capture types can be enabled simultaneously
- Checking correct content is captured in expected formats
Challenges:
- Initial implementation had synchronous/asynchronous mismatches in event handlers
- Needed to fix type of property access vs. method calls in handlers
- Required careful cleanup of event listeners to prevent memory leaks
Why This Feature: The network and console capture feature provides deep visibility into web page activity, enabling:
- Debugging complex web applications by seeing all network requests and errors
- Security analysis to detect unexpected third-party requests and data flows
- Performance profiling to identify slow-loading resources
- API discovery in single-page applications
- Comprehensive analysis of web application behavior
Future Enhancements to Consider:
- Option to filter captured events by type, domain, or content
- Support for capturing response bodies (with size limits)
- Aggregate statistics calculation for performance metrics
- Integration with visualization tools for network waterfall analysis
- Exporting captures in HAR format for use with external tools