# Crawl4ai - AI Friendly Documentation (aka LLM.TXT) This document provides AI-friendly documentation for the Crawl4ai library. It contains three types of content: **Memory (Facts)**: Similar to traditional LLM.txt files, this section contains factual information about the library - what it is, what it does, its components, APIs, and capabilities. This is the reference knowledge that an AI needs to understand the library. **Reasoning (Instructions)**: This section instructs AI models on how to use the factual knowledge, think about problems in the way the library authors intended, and provide solutions that align with the library's design philosophy. It guides the AI's problem-solving approach when working with Crawl4ai. **Examples**: Practical code examples demonstrating how to use the library's features in real-world scenarios. These examples help AI models understand the practical application of the concepts. ## Content (Memory) Okay, I have processed the instructions and the provided information. I will now generate the detailed "Foundational Memory" document for the `markdown` component of `crawl4ai`, adhering strictly to the outline you've provided and focusing on factual API details. ```markdown # Detailed Outline for crawl4ai - markdown Component **Target Document Type:** memory **Target Output Filename Suggestion:** `llm_memory_markdown.md` **Library Version Context:** 0.6.3 **Outline Generation Date:** 2025-05-24 --- ## 1. Introduction to Markdown Generation in Crawl4ai * 1.1. Purpose: This section outlines the `markdown` component of the `crawl4ai` library. Its primary role is to convert HTML content, obtained during web crawling, into various Markdown formats. These formats are designed to be suitable for consumption by Large Language Models (LLMs), as well as for other applications requiring structured text from web pages. * 1.2. Key Abstractions: * `MarkdownGenerationStrategy`: An abstract base class that defines the interface for different markdown generation algorithms and approaches. This allows for customizable Markdown conversion processes. * `DefaultMarkdownGenerator`: The standard, out-of-the-box implementation of `MarkdownGenerationStrategy`. It handles the conversion of HTML to Markdown, including features like link-to-citation conversion and integration with content filtering. * `MarkdownGenerationResult`: A Pydantic data model that encapsulates the various outputs of the markdown generation process, such as raw markdown, markdown with citations, and markdown derived from filtered content. * `CrawlerRunConfig.markdown_generator`: An attribute within the `CrawlerRunConfig` class that allows users to specify which instance of a `MarkdownGenerationStrategy` should be used for a particular crawl operation. * 1.3. Relationship with Content Filtering: The markdown generation process can be integrated with `RelevantContentFilter` strategies. When a content filter is applied, it first refines the input HTML, and then this filtered HTML is used to produce a `fit_markdown` output, providing a more focused version of the content. ## 2. Core Interface: `MarkdownGenerationStrategy` * 2.1. Purpose: The `MarkdownGenerationStrategy` class is an abstract base class (ABC) that defines the contract for all markdown generation strategies within `crawl4ai`. It ensures that any custom markdown generator will adhere to a common interface, making them pluggable into the crawling process. * 2.2. Source File: `crawl4ai/markdown_generation_strategy.py` * 2.3. Initialization (`__init__`) * 2.3.1. Signature: ```python class MarkdownGenerationStrategy(ABC): def __init__( self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None, verbose: bool = False, content_source: str = "cleaned_html", ): # ... ``` * 2.3.2. Parameters: * `content_filter (Optional[RelevantContentFilter]`, default: `None`)`: An optional `RelevantContentFilter` instance. If provided, this filter will be used to process the HTML before generating the `fit_markdown` and `fit_html` outputs in the `MarkdownGenerationResult`. * `options (Optional[Dict[str, Any]]`, default: `None`)`: A dictionary for strategy-specific custom options. This allows subclasses to receive additional configuration parameters. Defaults to an empty dictionary if `None`. * `verbose (bool`, default: `False`)`: If `True`, enables verbose logging for the markdown generation process. * `content_source (str`, default: `"cleaned_html"`)`: A string indicating the source of HTML to use for Markdown generation. Common values might include `"raw_html"` (original HTML from the page), `"cleaned_html"` (HTML after initial cleaning by the scraping strategy), or `"fit_html"` (HTML after being processed by `content_filter`). The actual available sources depend on the `ScrapingResult` provided to the markdown generator. * 2.4. Abstract Methods: * 2.4.1. `generate_markdown(self, input_html: str, base_url: str = "", html2text_options: Optional[Dict[str, Any]] = None, content_filter: Optional[RelevantContentFilter] = None, citations: bool = True, **kwargs) -> MarkdownGenerationResult` * Purpose: This abstract method must be implemented by concrete subclasses. It is responsible for taking an HTML string and converting it into various Markdown representations, encapsulated within a `MarkdownGenerationResult` object. * Parameters: * `input_html (str)`: The HTML string content to be converted to Markdown. * `base_url (str`, default: `""`)`: The base URL of the crawled page. This is crucial for resolving relative URLs, especially when converting links to citations. * `html2text_options (Optional[Dict[str, Any]]`, default: `None`)`: A dictionary of options to be passed to the underlying HTML-to-text conversion engine (e.g., `CustomHTML2Text`). * `content_filter (Optional[RelevantContentFilter]`, default: `None`)`: An optional `RelevantContentFilter` instance. If provided, this filter is used to generate `fit_markdown` and `fit_html`. This parameter overrides any filter set during the strategy's initialization for this specific call. * `citations (bool`, default: `True`)`: A boolean flag indicating whether to convert Markdown links into a citation format (e.g., `[text]^[1]^`) with a corresponding reference list. * `**kwargs`: Additional keyword arguments to allow for future extensions or strategy-specific parameters. * Returns: (`MarkdownGenerationResult`) An object containing the results of the Markdown generation, including `raw_markdown`, `markdown_with_citations`, `references_markdown`, and potentially `fit_markdown` and `fit_html`. ## 3. Default Implementation: `DefaultMarkdownGenerator` * 3.1. Purpose: `DefaultMarkdownGenerator` is the standard concrete implementation of `MarkdownGenerationStrategy`. It provides a robust mechanism for converting HTML to Markdown, featuring link-to-citation conversion and the ability to integrate with `RelevantContentFilter` strategies for focused content output. * 3.2. Source File: `crawl4ai/markdown_generation_strategy.py` * 3.3. Inheritance: Inherits from `MarkdownGenerationStrategy`. * 3.4. Initialization (`__init__`) * 3.4.1. Signature: ```python class DefaultMarkdownGenerator(MarkdownGenerationStrategy): def __init__( self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None, # content_source parameter from parent is available # verbose parameter from parent is available ): super().__init__(content_filter, options, content_source=kwargs.get("content_source", "cleaned_html"), verbose=kwargs.get("verbose", False)) ``` *(Note: The provided code snippet for `DefaultMarkdownGenerator.__init__` does not explicitly list `verbose` and `content_source`, but they are passed to `super().__init__` through `**kwargs` in the actual library code, so their effective signature matches the parent.)* * 3.4.2. Parameters: * `content_filter (Optional[RelevantContentFilter]`, default: `None`)`: As defined in `MarkdownGenerationStrategy`. * `options (Optional[Dict[str, Any]]`, default: `None`)`: As defined in `MarkdownGenerationStrategy`. * `verbose (bool`, default: `False`)`: (Passed via `kwargs` to parent) As defined in `MarkdownGenerationStrategy`. * `content_source (str`, default: `"cleaned_html"`)`: (Passed via `kwargs` to parent) As defined in `MarkdownGenerationStrategy`. * 3.5. Key Class Attributes: * 3.5.1. `LINK_PATTERN (re.Pattern)`: A compiled regular expression pattern used to find Markdown links. The pattern is `r'!\[(.[^\]]*)\]\(([^)]*?)(?:\s*\"(.*)\")?\)'`. * 3.6. Key Public Methods: * 3.6.1. `generate_markdown(self, input_html: str, base_url: str = "", html2text_options: Optional[Dict[str, Any]] = None, content_filter: Optional[RelevantContentFilter] = None, citations: bool = True, **kwargs) -> MarkdownGenerationResult` * Purpose: Implements the conversion of HTML to Markdown. It uses `CustomHTML2Text` for the base conversion, handles link-to-citation transformation, and integrates with an optional `RelevantContentFilter` to produce `fit_markdown`. * Parameters: * `input_html (str)`: The HTML content to convert. * `base_url (str`, default: `""`)`: Base URL for resolving relative links. * `html2text_options (Optional[Dict[str, Any]]`, default: `None`)`: Options for the `CustomHTML2Text` converter. If not provided, it uses `self.options`. * `content_filter (Optional[RelevantContentFilter]`, default: `None`)`: Overrides the instance's `content_filter` for this call. * `citations (bool`, default: `True`)`: Whether to convert links to citations. * `**kwargs`: Additional arguments (not currently used by this specific implementation beyond parent class). * Core Logic: 1. Instantiates `CustomHTML2Text` using `base_url` and the resolved `html2text_options` (merged from method arg, `self.options`, and defaults). 2. Converts `input_html` to `raw_markdown` using the `CustomHTML2Text` instance. 3. If `citations` is `True`, calls `self.convert_links_to_citations(raw_markdown, base_url)` to get `markdown_with_citations` and `references_markdown`. 4. If `citations` is `False`, `markdown_with_citations` is set to `raw_markdown`, and `references_markdown` is an empty string. 5. Determines the active `content_filter` (parameter or instance's `self.content_filter`). 6. If an active `content_filter` exists: * Calls `active_filter.filter_content(input_html)` to get a list of filtered HTML strings. * Joins these strings with `\n` and wraps them in `
` and `` blocks are appropriately marked in Markdown.
* `escape_snob (bool`, default: `False` in `CustomHTML2Text`): If `True`, more aggressive escaping of special Markdown characters is performed.
* *Note: This list is based on common `html2text` options; refer to `crawl4ai/html2text.py` for the exact implementation and default behaviors within `CustomHTML2Text`.*
* 7.3. Impact of `citations (bool)` in `generate_markdown`:
* When `citations=True` (default in `DefaultMarkdownGenerator.generate_markdown`):
* Standard Markdown links `[text](url)` are converted to `[text]^[citation_number]^`.
* A `references_markdown` string is generated, listing all unique URLs with their corresponding citation numbers. This helps LLMs trace information back to its source and can reduce token count if URLs are long or repetitive.
* When `citations=False`:
* Links remain in their original Markdown format `[text](url)`.
* `references_markdown` will be an empty string.
* This might be preferred if the LLM needs to directly process the URLs or if the citation format is not desired.
* 7.4. Role of `content_source` in `MarkdownGenerationStrategy`:
* This parameter (defaulting to `"cleaned_html"` in `DefaultMarkdownGenerator`) specifies which HTML version is used as the primary input for the `generate_markdown` method.
* `"cleaned_html"`: Typically refers to HTML that has undergone initial processing by the `ContentScrapingStrategy` (e.g., removal of scripts, styles, and potentially some boilerplate based on the scraping strategy's rules). This is usually the recommended source for general Markdown conversion.
* `"raw_html"`: The original, unmodified HTML content fetched from the web page. Using this source would bypass any initial cleaning done by the scraping strategy.
* `"fit_html"`: This source is relevant when a `RelevantContentFilter` is used. `fit_html` is the HTML output *after* the `RelevantContentFilter` has processed the `input_html` (which itself is determined by `content_source`). If `content_source` is, for example, `"cleaned_html"`, then `fit_html` is the result of filtering that cleaned HTML. `fit_markdown` is then generated from this `fit_html`.
* 7.5. `fit_markdown` vs. `raw_markdown`/`markdown_with_citations`:
* `raw_markdown` (or `markdown_with_citations` if `citations=True`) is generated from the HTML specified by `content_source` (e.g., `"cleaned_html"`). It represents a general conversion of that source.
* `fit_markdown` is generated *only if* a `RelevantContentFilter` is active (either set in `DefaultMarkdownGenerator` or passed to `generate_markdown`). It is derived from the `fit_html` (the output of the content filter).
* **Choosing which to use for LLMs:**
* Use `fit_markdown` when you need a concise, highly relevant subset of the page's content tailored to a specific query or set of criteria defined by the filter. This can reduce noise and token count for the LLM.
* Use `raw_markdown` or `markdown_with_citations` when you need a more comprehensive representation of the page's textual content, or when no specific filtering criteria are applied.
```
## Content (Reasoning)
```markdown
# Detailed Outline for crawl4ai - markdown Component
**Target Document Type:** reasoning
**Target Output Filename Suggestion:** `llm_reasoning_markdown_generation.md`
**Library Version Context:** 0.6.3
**Outline Generation Date:** 2025-05-24
---
## 1. Introduction to Markdown Generation in Crawl4AI
* 1.1. **Why Markdown Generation Matters for LLMs**
* 1.1.1. The role of clean, structured text for Large Language Model consumption.
* **Explanation:** LLMs perform significantly better when input data is well-structured and free of irrelevant noise (like HTML tags, scripts, or complex layouts not meant for textual understanding). Markdown, with its simple syntax, provides a human-readable and machine-parseable format that captures essential semantic structure (headings, lists, paragraphs, code blocks, tables) without the clutter of full HTML. This makes it easier for LLMs to understand the content's hierarchy, identify key information, and perform tasks like summarization, question-answering, or RAG (Retrieval Augmented Generation) more accurately and efficiently.
* 1.1.2. Benefits of Markdown: readability, structure preservation, common format.
* **Explanation:**
* **Readability:** Markdown is designed to be easily readable in its raw form, making it simple for developers and users to inspect and understand the crawled content.
* **Structure Preservation:** It effectively preserves the semantic structure of the original HTML (headings, lists, emphasis, etc.), which is crucial context for LLMs.
* **Common Format:** Markdown is a widely adopted standard, ensuring compatibility with a vast ecosystem of tools, editors, and LLM input pipelines.
* 1.1.3. How Crawl4AI's Markdown generation facilitates RAG and other LLM applications.
* **Explanation:** For RAG, Crawl4AI's Markdown output, especially when combined with content filtering, provides clean, relevant text chunks that can be easily embedded and indexed. This improves the quality of retrieved context for LLM prompts. For fine-tuning or direct prompting, the structured Markdown helps the LLM focus on the core content, leading to better quality responses and reducing token consumption by eliminating HTML overhead.
* 1.2. **Overview of Crawl4AI's Markdown Generation Pipeline**
* 1.2.1. High-level flow: HTML -> (Optional Filtering) -> Markdown Conversion -> (Optional Citation Handling).
* **Explanation:**
1. **Input HTML:** The process starts with either raw HTML from the crawled page or a cleaned/selected HTML segment.
2. **Optional Content Filtering:** Before Markdown conversion, a `RelevantContentFilter` can be applied to the HTML. This step aims to remove boilerplate, ads, or irrelevant sections, resulting in `fit_html`. This is crucial for generating `fit_markdown`.
3. **Markdown Conversion:** The selected HTML (either the original, cleaned, or filtered `fit_html`) is converted into Markdown using an underlying `html2text` library, specifically `CustomHTML2Text` in Crawl4AI for enhanced control.
4. **Optional Citation Handling:** If enabled, inline links in the generated Markdown are converted to a citation format (e.g., `text [^1^]`), and a separate list of references is created.
* 1.2.2. Key components involved: `MarkdownGenerationStrategy`, `DefaultMarkdownGenerator`, `CustomHTML2Text`, `RelevantContentFilter`.
* **Explanation:**
* **`MarkdownGenerationStrategy`:** An interface defining how Markdown should be generated. Allows for custom implementations.
* **`DefaultMarkdownGenerator`:** The standard implementation of `MarkdownGenerationStrategy`, using `CustomHTML2Text`. It orchestrates filtering (if provided) and citation handling.
* **`CustomHTML2Text`:** An enhanced version of the `html2text` library, providing fine-grained control over the HTML-to-Markdown conversion.
* **`RelevantContentFilter`:** An interface for strategies that filter HTML content before it's converted to Markdown, producing `fit_html` and consequently `fit_markdown`.
* 1.2.3. How `CrawlerRunConfig` ties these components together.
* **Explanation:** The `CrawlerRunConfig` object allows you to specify which `MarkdownGenerationStrategy` (and by extension, which filters and `CustomHTML2Text` options) should be used for a particular crawl run via its `markdown_generator` parameter. This provides run-specific control over the Markdown output.
* 1.3. **Goals of this Guide**
* 1.3.1. Understanding how to configure and customize Markdown output.
* **Explanation:** This guide will walk you through the various configuration options available, from choosing HTML sources and content filters to fine-tuning the `html2text` conversion itself.
* 1.3.2. Best practices for generating LLM-friendly Markdown.
* **Explanation:** We'll discuss tips and techniques to produce Markdown that is optimally structured and cleaned for consumption by Large Language Models.
* 1.3.3. Troubleshooting common Markdown generation issues.
* **Explanation:** We'll cover common problems encountered during Markdown generation (e.g., noisy output, missing content) and provide strategies for diagnosing and resolving them.
## 2. Core Concepts in Markdown Generation
* 2.1. **The `MarkdownGenerationStrategy` Interface**
* 2.1.1. **Purpose and Design Rationale:**
* Why use a strategy pattern for Markdown generation? (Flexibility, extensibility).
* **Explanation:** The strategy pattern allows Crawl4AI to define a common interface for Markdown generation while enabling different concrete implementations. This means users can easily swap out the default Markdown generator for a custom one without altering the core crawler logic. It promotes flexibility and makes the system extensible for future Markdown conversion needs or integration with other libraries.
* Core problem it solves: Decoupling Markdown generation logic from the crawler.
* **Explanation:** By abstracting Markdown generation into a strategy, the `AsyncWebCrawler` itself doesn't need to know the specifics of *how* Markdown is created. It simply delegates the task to the configured strategy. This separation of concerns makes the codebase cleaner and easier to maintain.
* 2.1.2. **When to Implement a Custom `MarkdownGenerationStrategy`:**
* Scenarios requiring completely different Markdown conversion logic.
* **Example:** If you need to convert HTML to a very specific dialect of Markdown not supported by `html2text`, or if you want to use a different underlying conversion library entirely.
* Integrating third-party Markdown conversion libraries.
* **Example:** If you prefer to use a library like `turndown` or `mistune` for its specific features or output style.
* Advanced pre/post-processing of Markdown.
* **Example:** If you need to perform complex transformations on the Markdown *after* initial generation, such as custom table formatting, complex footnote handling beyond standard citations, or domain-specific semantic tagging within the Markdown.
* 2.1.3. **How to Implement a Custom `MarkdownGenerationStrategy`:**
* Key methods to override (`generate_markdown`).
* **Explanation:** The primary method to implement is `generate_markdown(self, input_html: str, base_url: str = "", html2text_options: Optional[Dict[str, Any]] = None, content_filter: Optional[RelevantContentFilter] = None, citations: bool = True, **kwargs) -> MarkdownGenerationResult`. This method will receive the HTML (based on `content_source`), and it's responsible for returning a `MarkdownGenerationResult` object.
* Input parameters and expected output (`MarkdownGenerationResult`).
* **Explanation:** Your custom strategy will receive the `input_html`, the `base_url` (for resolving relative links if needed), `html2text_options` (which you can choose to use or ignore), an optional `content_filter`, and a `citations` flag. It must return an instance of `MarkdownGenerationResult` populated with the relevant Markdown strings.
* *Code Example:*
```python
from crawl4ai import MarkdownGenerationStrategy, MarkdownGenerationResult, RelevantContentFilter
from typing import Optional, Dict, Any
class MyCustomMarkdownStrategy(MarkdownGenerationStrategy):
def __init__(self, content_source: str = "cleaned_html", **kwargs):
super().__init__(content_source=content_source, **kwargs)
# Initialize any custom resources if needed
def generate_markdown(
self,
input_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None, # You can use or ignore these
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True, # You can decide how to handle this
**kwargs
) -> MarkdownGenerationResult:
# 1. Apply content filter if provided and desired
fit_html_output = ""
if content_filter:
# Assuming content_filter.filter_content returns a list of HTML strings
filtered_html_blocks = content_filter.filter_content(input_html)
fit_html_output = "\n".join(filtered_html_blocks)
# 2. Your custom HTML to Markdown conversion logic
# This is where you'd use your preferred library or custom logic
raw_markdown_text = f"# Custom Markdown for {base_url}\n\n{input_html[:200]}..." # Placeholder
markdown_with_citations_text = raw_markdown_text # Placeholder for citation logic
references_markdown_text = "" # Placeholder for references
# If you used a filter, also generate fit_markdown
fit_markdown_text = ""
if fit_html_output:
fit_markdown_text = f"# Custom Filtered Markdown\n\n{fit_html_output[:200]}..." # Placeholder
return MarkdownGenerationResult(
raw_markdown=raw_markdown_text,
markdown_with_citations=markdown_with_citations_text,
references_markdown=references_markdown_text,
fit_markdown=fit_markdown_text,
fit_html=fit_html_output
)
# Usage:
# custom_md_generator = MyCustomMarkdownStrategy()
# run_config = CrawlerRunConfig(markdown_generator=custom_md_generator)
```
* Common pitfalls when creating custom strategies.
* **Explanation:**
* Forgetting to handle all fields in `MarkdownGenerationResult` (even if some are empty strings).
* Incorrectly managing `base_url` for relative links if your custom converter doesn't handle it.
* Performance bottlenecks if your custom logic is inefficient.
* Not properly integrating with the `content_filter` if one is provided.
* 2.1.4. **Understanding `content_source` in `MarkdownGenerationStrategy`**
* 2.1.4.1. Purpose: What HTML source should be used for Markdown generation?
* **Explanation:** The `content_source` attribute of a `MarkdownGenerationStrategy` (including `DefaultMarkdownGenerator`) tells the strategy which version of the HTML to use as the primary input for generating `raw_markdown` and `markdown_with_citations`.
* 2.1.4.2. Available options: `"cleaned_html"`, `"raw_html"`, `"fit_html"`.
* **`"cleaned_html"` (Default):** This is the HTML after Crawl4AI's internal `ContentScrapingStrategy` (e.g., `WebScrapingStrategy` or `LXMLWebScrapingStrategy`) has processed it. This usually involves removing scripts, styles, and applying structural cleaning or selection based on `target_elements` or `css_selector` in `CrawlerRunConfig`.
* **`"raw_html"`:** The original, unmodified HTML fetched from the page. This is useful if you want to apply your own complete cleaning and Markdown conversion pipeline.
* **`"fit_html"`:** The HTML *after* a `RelevantContentFilter` (if provided to the `MarkdownGenerationStrategy`) has processed the input HTML (which would be `cleaned_html` or `raw_html` depending on the initial source). This option is powerful when you want Markdown generated *only* from the most relevant parts of the page.
* 2.1.4.3. **Decision Guide: Choosing the Right `content_source`**:
* **When to use `"cleaned_html"`:** This is the recommended default for most LLM use cases. It provides a good balance of structured content without excessive noise, as common boilerplate is often removed by the scraping strategy.
* **When to use `"raw_html"`:** Choose this if you need absolute control over the HTML input for your Markdown converter, or if Crawl4AI's default cleaning removes elements you wish to keep. Be aware that this might result in noisier Markdown.
* **When to use `"fit_html"`:** Opt for this when you are using a `RelevantContentFilter` with your `MarkdownGenerationStrategy` and you want the `raw_markdown` and `markdown_with_citations` to be based *only* on the filtered content. This is distinct from just using the `fit_markdown` field in the result, as it makes the filtered content the *primary* source for all main Markdown outputs.
* **Impact on performance and output quality:**
* `"raw_html"` might be slightly faster if Crawl4AI's cleaning is complex, but could lead to lower quality Markdown due to more noise.
* `"cleaned_html"` offers a good trade-off.
* `"fit_html"` depends on the performance of the `RelevantContentFilter` itself.
* 2.1.4.4. *Example Scenarios:*
* **General Summarization:** `"cleaned_html"` is usually best.
* **Highly Specific Q&A on a Section:** Use a `RelevantContentFilter` to produce `fit_html`, then set `content_source="fit_html"` (or just use the `fit_markdown` from the result if `raw_markdown` from `"cleaned_html"` is also desired).
* **Archiving Raw Structure:** `"raw_html"` might be chosen if the goal is to convert the entire, unmodified page structure to Markdown, perhaps for later, more nuanced processing.
* 2.2. **The `MarkdownGenerationResult` Model**
* 2.2.1. **Understanding its Purpose:** Why a structured result object?
* **Explanation:** A structured object like `MarkdownGenerationResult` is used instead of a single Markdown string to provide different views or versions of the generated Markdown, catering to various use cases. This allows users to pick the representation that best suits their needs (e.g., with or without citations, raw vs. filtered) without re-processing. It also clearly separates the main content from metadata like references or the intermediate `fit_html`.
* 2.2.2. **Deep Dive into `MarkdownGenerationResult` Fields:**
* `raw_markdown`:
* **What it is:** This is the direct, primary Markdown output generated from the `content_source` (e.g., `cleaned_html`) defined in the `MarkdownGenerationStrategy`. It does *not* have inline links converted to citation format.
* **How to use it:** Use this when you need the most "vanilla" Markdown, perhaps for LLMs that are sensitive to citation formats or if you plan to implement your own link/reference handling.
* **When it's useful:** For direct input to LLMs that don't require source attribution within the text, or as a base for further custom Markdown processing.
* `markdown_with_citations`:
* **What it is:** This takes the `raw_markdown` and converts its inline links (e.g., `[link text](http://example.com)`) into a citation format (e.g., `link text [^1^]`).
* **How it's generated:** The `DefaultMarkdownGenerator` (via `CustomHTML2Text`) scans `raw_markdown` for links, assigns unique numerical IDs to each unique URL, replaces the inline link with the text and citation marker, and populates `references_markdown`.
* **How to use it:** This is often the most useful Markdown for LLM tasks requiring RAG or for generating human-readable documents where sources are important. Combine it with `references_markdown`.
* *Example:*
```html
Crawl4AI is an open-source library.
```
```markdown
// Resulting markdown_with_citations (simplified)
Crawl4AI is an open-source [^1^] library.
```
* `references_markdown`:
* **What it is:** A separate Markdown string that lists all unique URLs found and converted to citations, formatted typically as a numbered list.
* **How to use it:** Append this string to the end of `markdown_with_citations` to create a complete document with a bibliography or reference section.
* **Why it's separate:** This provides flexibility. You can choose to display references at the end, in a sidebar, or not at all.
* *Example:*
```markdown
## References
[^1^]: https://github.com/unclecode/crawl4ai
```
* `fit_markdown`:
* **What it is:** This is Markdown generated *exclusively* from the `fit_html`. `fit_html` itself is the output of a `RelevantContentFilter` if one was provided to the `MarkdownGenerationStrategy`. If no filter was used, `fit_markdown` will likely be empty or reflect the `raw_markdown`.
* **How to use it:** When your primary goal is to feed an LLM with the most relevant, filtered content. This is excellent for tasks like generating concise summaries or providing highly focused context for RAG.
* **Relationship with `raw_markdown`:** If a filter is active, `fit_markdown` is based on a *subset* or *transformed version* of the HTML that `raw_markdown` was based on (assuming `content_source` wasn't `"fit_html"`). If `content_source` *was* `"fit_html"`, then `raw_markdown` and `fit_markdown` would be derived from the same filtered HTML, but `fit_markdown` might still undergo different processing if the strategy handles it distinctly.
* *Example:* Imagine a news article page. `raw_markdown` might contain the article, comments, ads, and navigation. If a `BM25ContentFilter` is used with a query about "stock market impact", `fit_markdown` would ideally only contain paragraphs related to that topic, stripped of other page elements.
* `fit_html`:
* **What it is:** The actual HTML string *after* a `RelevantContentFilter` (like `PruningContentFilter` or `LLMContentFilter`) has processed the input HTML. If no filter is applied, this field will be empty.
* **How to use it:** Primarily for debugging your content filters. You can inspect `fit_html` to see exactly what HTML content was deemed "relevant" by your filter before it was converted to `fit_markdown`. It can also be useful if you need this filtered HTML for purposes other than Markdown generation.
* **Why it's included:** It provides transparency into the filtering process and allows advanced users to work with the intermediate filtered HTML directly.
## 3. The `DefaultMarkdownGenerator` - Your Go-To Solution
* 3.1. **Understanding the `DefaultMarkdownGenerator`**
* 3.1.1. **Purpose and Design:** The `DefaultMarkdownGenerator` is Crawl4AI's standard, out-of-the-box mechanism for converting HTML content into various Markdown representations. It's designed to be a robust and generally applicable solution for most common use cases, especially when targeting LLM consumption.
* 3.1.2. Core Functionality: Its primary task is to orchestrate the HTML-to-Markdown conversion. It internally uses an instance of `CustomHTML2Text` (Crawl4AI's enhanced `html2text` wrapper) to perform the actual conversion.
* 3.1.3. How it handles citations and references by default.
* **Explanation:** If the `citations` parameter in its `generate_markdown` method is `True` (which it is by default), `DefaultMarkdownGenerator` will post-process the initially generated Markdown to convert inline links into citation markers (e.g., `[^1^]`) and generate a corresponding `references_markdown` block. This is done by its internal `CustomHTML2Text` instance.
* 3.2. **Configuring `DefaultMarkdownGenerator`**
* 3.2.1. **Initialization Options:**
* `content_filter (Optional[RelevantContentFilter])`:
* **Why use it:** To refine the HTML *before* it's converted to Markdown. This is essential if you want `fit_markdown` (and consequently `fit_html`) to contain only the most relevant parts of the page, leading to a more focused Markdown output.
* **How it integrates:** When `generate_markdown` is called, if a `content_filter` is present, `DefaultMarkdownGenerator` first passes the `input_html` (determined by `content_source`) to this filter. The filter returns a list of HTML strings (or a single string if merged). This filtered HTML becomes the `fit_html`. Then, `fit_markdown` is generated from this `fit_html`. The `raw_markdown` and `markdown_with_citations` are still generated from the original `content_source` unless `content_source` itself is set to `"fit_html"`.
* *Impact:* Directly influences `fit_markdown` and `fit_html` fields in `MarkdownGenerationResult`. Can significantly reduce the noise and improve the relevance of the final Markdown for LLMs.
* *Code Example:*
```python
from crawl4ai import DefaultMarkdownGenerator, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
# Initialize a filter
pruning_filter = PruningContentFilter(threshold_type="fixed", threshold=0.5)
# Initialize DefaultMarkdownGenerator with the filter
md_generator_with_filter = DefaultMarkdownGenerator(content_filter=pruning_filter)
# This generator will now produce 'fit_markdown' based on pruning.
# run_config = CrawlerRunConfig(markdown_generator=md_generator_with_filter)
# result = await crawler.arun(url="...", config=run_config)
# print(result.markdown.fit_markdown)
```
* `options (Optional[Dict[str, Any]])`:
* **What it is:** This dictionary allows you to pass configuration options directly to the underlying `CustomHTML2Text` instance. These options control the specifics of the HTML-to-Markdown conversion process.
* **How to use it:** Provide a dictionary where keys are `html2text` option names (e.g., `body_width`, `ignore_links`) and values are their desired settings.
* *See Section 6: Mastering `CustomHTML2Text` for detailed options.*
* `content_source (str)`:
* **Reiteration:** As discussed in section 2.1.4, this determines the primary HTML input for `raw_markdown` and `markdown_with_citations`.
* **How it interacts with `content_filter`:**
* If `content_source` is, for example, `"cleaned_html"` and a `content_filter` is also provided, the `content_filter` will process this `"cleaned_html"` to produce `fit_html`. The `fit_markdown` field in `MarkdownGenerationResult` will be based on this `fit_html`.
* However, `raw_markdown` and `markdown_with_citations` will still be based on the original `"cleaned_html"` (unless `content_source` was explicitly set to `"fit_html"`). This allows you to have both a "fuller" Markdown and a "filtered" Markdown from a single generation step.
* 3.3. **Common Workflows with `DefaultMarkdownGenerator`**
* 3.3.1. **Workflow: Generating Basic Markdown with Citations**
* Steps: Instantiate `DefaultMarkdownGenerator` (or use the crawler's default). The crawler calls its `generate_markdown` method. Access `result.markdown.markdown_with_citations` and `result.markdown.references_markdown`.
* *Code Example:*
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
async def basic_markdown_workflow():
# DefaultMarkdownGenerator is used implicitly if none is specified in CrawlerRunConfig
# Or explicitly:
md_generator = DefaultMarkdownGenerator()
run_config = CrawlerRunConfig(markdown_generator=md_generator)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
if result.success:
print("--- Markdown with Citations ---")
print(result.markdown.markdown_with_citations[:500]) # Show first 500 chars
print("\n--- References ---")
print(result.markdown.references_markdown)
else:
print(f"Crawl failed: {result.error_message}")
```
* 3.3.2. **Workflow: Generating Focused Markdown using a Content Filter**
* Steps:
1. Choose and instantiate a `RelevantContentFilter` (e.g., `BM25ContentFilter`).
2. Instantiate `DefaultMarkdownGenerator`, passing the filter to its `content_filter` parameter.
3. Set this `DefaultMarkdownGenerator` instance in `CrawlerRunConfig.markdown_generator`.
4. After crawling, access `result.markdown.fit_markdown`.
* Key configuration considerations for the filter and generator:
* For `BM25ContentFilter`, ensure you provide a relevant `user_query`.
* Adjust filter thresholds (e.g., `bm25_threshold`) as needed.
* The `content_source` for `DefaultMarkdownGenerator` will be the input to the filter.
* *Code Example:*
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
async def filtered_markdown_workflow():
user_query = "information about Crawl4AI library"
bm25_filter = BM25ContentFilter(user_query=user_query, bm25_threshold=0.1)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
run_config = CrawlerRunConfig(
markdown_generator=md_generator,
cache_mode=CacheMode.BYPASS # For consistent demo results
)
async with AsyncWebCrawler() as crawler:
# Using a page that hopefully has content related to the query
result = await crawler.arun(url="https://github.com/unclecode/crawl4ai", config=run_config)
if result.success:
print("--- Fit Markdown (BM25 Filtered) ---")
print(result.markdown.fit_markdown) # This is the key output
# You can also inspect fit_html to see what the filter selected
# print("\n--- Fit HTML ---")
# print(result.markdown.fit_html[:500])
else:
print(f"Crawl failed: {result.error_message}")
```
* 3.3.3. **Workflow: Customizing Markdown Style via `html2text_options`**
* Steps: Instantiate `DefaultMarkdownGenerator` passing a dictionary of `html2text` options to its `options` parameter.
* *Code Example:*
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
async def custom_style_markdown_workflow():
# Example: Disable line wrapping and ignore images
html2text_opts = {
"body_width": 0, # Disable line wrapping
"ignore_images": True # Don't include image markdown 
}
md_generator = DefaultMarkdownGenerator(options=html2text_opts)
run_config = CrawlerRunConfig(markdown_generator=md_generator)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
if result.success:
print("--- Custom Styled Markdown (No Wrap, No Images) ---")
print(result.markdown.raw_markdown[:500]) # raw_markdown will reflect these options
else:
print(f"Crawl failed: {result.error_message}")
```
* 3.4. **Best Practices for `DefaultMarkdownGenerator`**
* **When to use `DefaultMarkdownGenerator` vs. a custom strategy:**
* Use `DefaultMarkdownGenerator` for most cases. It's robust and highly configurable through `content_filter` and `html2text_options`.
* Opt for a custom strategy only if you need fundamentally different conversion logic or integration with external Markdown libraries that `CustomHTML2Text` doesn't cover.
* **Tips for choosing the right `content_source` and `content_filter`:**
* Start with `content_source="cleaned_html"` (default) and no filter.
* If the output is too noisy, introduce a `RelevantContentFilter`. `PruningContentFilter` is a good first step for general boilerplate. Use `BM25ContentFilter` or `LLMContentFilter` for more targeted filtering based on semantic relevance.
* If your filter is very effective and you *only* want Markdown from the filtered content, consider setting `content_source="fit_html"` in your `DefaultMarkdownGenerator` instance.
* **How to leverage `MarkdownGenerationResult` effectively:**
* For LLM input where source attribution is important, use `markdown_with_citations` + `references_markdown`.
* For tasks needing maximum conciseness based on relevance, use `fit_markdown` (after configuring a `content_filter`).
* Use `raw_markdown` if you need the "purest" Markdown conversion without citation processing.
* Inspect `fit_html` to debug your content filters.
## 4. Integrating Content Filters for Smarter Markdown (`fit_markdown`)
* 4.1. **The "Why": Purpose of Content Filtering Before Markdown Generation**
* 4.1.1. Reducing noise and improving relevance for LLMs.
* **Explanation:** Web pages often contain much more than just the main article content (e.g., navigation, ads, footers, related articles). These can be detrimental to LLM performance, increasing token count, processing time, and potentially confusing the model. Content filters aim to isolate the core, relevant information.
* 4.1.2. Generating more concise and focused Markdown (`fit_markdown`).
* **Explanation:** By filtering the HTML *before* converting it to Markdown, the resulting `fit_markdown` is inherently more concise and focused on what the filter deemed important. This is ideal for tasks where brevity and relevance are key.
* 4.1.3. How `fit_html` is generated and its role.
* **Explanation:** When a `RelevantContentFilter` is used with a `MarkdownGenerationStrategy`, the strategy first passes the input HTML (e.g., `cleaned_html`) to the filter's `filter_content` method. This method returns a list of HTML strings (or a single merged string). This output is stored as `fit_html` in the `MarkdownGenerationResult`. `fit_markdown` is then generated by converting this `fit_html` to Markdown.
* 4.2. **Overview of `RelevantContentFilter` Strategies**
* 4.2.1. **`PruningContentFilter`**:
* **How it works:** Applies heuristic rules to remove common boilerplate. For example, it might remove elements with very short text content, elements with a high link-to-text ratio, or elements matching common boilerplate CSS classes/IDs (like "footer", "nav", "sidebar").
* **When to use it:** A good first-pass filter for general-purpose cleaning. It's fast and doesn't require LLM calls or complex configuration.
* **Impact on `fit_markdown`:** Typically good at removing obvious non-content sections, resulting in a cleaner, more article-focused Markdown.
* 4.2.2. **`BM25ContentFilter`**:
* **How it works:** This filter uses the BM25 algorithm, a classical information retrieval technique. It tokenizes the HTML content into chunks and scores each chunk's relevance against a `user_query`. Chunks exceeding a `bm25_threshold` are kept.
* **When to use it:** When you want to extract content specifically related to a user's query from a larger page. Excellent for targeted information retrieval.
* **Impact on `fit_markdown`:** The output will be highly tailored to the query. If the query is "Tell me about Crawl4AI's caching", `fit_markdown` should primarily contain sections discussing caching.
* 4.2.3. **`LLMContentFilter`**:
* **How it works:** This is the most powerful and flexible filter. It chunks the input HTML and sends each chunk (or a summary) to an LLM with specific `instructions` (e.g., "Extract only the paragraphs discussing financial results"). The LLM decides which chunks are relevant.
* **When to use it:** For complex filtering criteria that are hard to express with rules or keywords, or when nuanced understanding of content is required.
* **Impact on `fit_markdown`:** Can produce very precise and contextually relevant Markdown. However, it's generally slower and can be more expensive due to LLM API calls.
* 4.3. **Decision Guide: Choosing the Right `RelevantContentFilter`**
* *Table:*
| Filter | Speed | Cost (LLM API) | Accuracy/Nuance | Use Case Examples | Configuration Complexity |
|-----------------------|------------|----------------|-----------------|----------------------------------------------------|--------------------------|
| `PruningContentFilter`| Very Fast | None | Low-Medium | General boilerplate removal, quick cleaning. | Low |
| `BM25ContentFilter` | Fast | None | Medium | Query-focused extraction, finding relevant sections. | Medium (query, threshold)|
| `LLMContentFilter` | Slow | Potentially High| High | Complex criteria, nuanced extraction, summarization. | High (prompt engineering) |
* Factors to consider:
* **Desired Output Quality:** For the highest semantic relevance, `LLMContentFilter` is often best, but at a cost.
* **Performance Constraints:** If speed is critical, `PruningContentFilter` or `BM25ContentFilter` are preferred.
* **Nature of the HTML Content:** For well-structured articles, `PruningContentFilter` might be sufficient. For diverse content or Q&A, `BM25ContentFilter` or `LLMContentFilter` might be better.
* **Specificity of Task:** If you have a clear query, `BM25ContentFilter` excels. If you have complex instructions, `LLMContentFilter` is suitable.
* 4.4. **Code Examples: Combining Filters with `DefaultMarkdownGenerator`**
* 4.4.1. *Example:* [Using `PruningContentFilter` to generate `fit_markdown`].
```python
from crawl4ai import DefaultMarkdownGenerator, CrawlerRunConfig, AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
async def pruning_filter_example():
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed") # Adjust threshold as needed
md_generator = DefaultMarkdownGenerator(content_filter=pruning_filter)
run_config = CrawlerRunConfig(markdown_generator=md_generator, cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://en.wikipedia.org/wiki/Python_(programming_language)", config=run_config)
if result.success:
print("--- Fit Markdown (Pruned) ---")
print(result.markdown.fit_markdown[:1000]) # Show first 1000 chars
# print("\n--- Original Raw Markdown (for comparison) ---")
# print(result.markdown.raw_markdown[:1000])
```
* 4.4.2. *Example:* [Using `BM25ContentFilter` with a query to generate query-focused `fit_markdown`].
```python
from crawl4ai import DefaultMarkdownGenerator, CrawlerRunConfig, AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
async def bm25_filter_example():
user_query = "Python syntax and semantics"
bm25_filter = BM25ContentFilter(user_query=user_query, bm25_threshold=0.1)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
run_config = CrawlerRunConfig(markdown_generator=md_generator, cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://en.wikipedia.org/wiki/Python_(programming_language)", config=run_config)
if result.success:
print(f"--- Fit Markdown (BM25 Filtered for query: '{user_query}') ---")
print(result.markdown.fit_markdown)
```
* 4.4.3. *Example:* [Using `LLMContentFilter` for nuanced content selection before Markdown generation].
```python
from crawl4ai import DefaultMarkdownGenerator, CrawlerRunConfig, AsyncWebCrawler, LLMConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
import os
async def llm_filter_example():
# Ensure OPENAI_API_KEY is set in your environment
if not os.getenv("OPENAI_API_KEY"):
print("OPENAI_API_KEY not set. Skipping LLMContentFilter example.")
return
llm_config_obj = LLMConfig(provider="openai/gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
instruction = "Extract only the sections that discuss Python's history and its creator."
llm_filter = LLMContentFilter(
llm_config=llm_config_obj,
instruction=instruction,
# chunk_token_threshold=1000 # Adjust as needed
)
md_generator = DefaultMarkdownGenerator(content_filter=llm_filter, content_source="cleaned_html")
run_config = CrawlerRunConfig(markdown_generator=md_generator, cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://en.wikipedia.org/wiki/Python_(programming_language)", config=run_config)
if result.success:
print(f"--- Fit Markdown (LLM Filtered with instruction: '{instruction}') ---")
print(result.markdown.fit_markdown)
llm_filter.show_usage() # Display token usage
else:
print(f"Crawl failed: {result.error_message}")
```
* 4.5. **Best Practices for Content Filtering for Markdown**
* **Start Simple:** Begin with `PruningContentFilter` for general cleanup. It's fast and often effective for removing common boilerplate.
* **Query-Specific Tasks:** If your goal is to extract information relevant to a specific query, `BM25ContentFilter` is a great, cost-effective choice.
* **Nuanced Selection:** Reserve `LLMContentFilter` for tasks requiring deeper semantic understanding or complex filtering logic that rules-based or keyword-based approaches can't handle. Be mindful of its cost and latency.
* **Iterate and Test:** Content filtering is often an iterative process. Test your filter configurations on various pages to ensure they behave as expected. Inspect `fit_html` to understand what the filter is selecting/discarding.
* **Combine with `content_source`:** Remember that `fit_markdown` is derived from the output of the filter. If you also need Markdown from the pre-filtered content, ensure your `MarkdownGenerationStrategy`'s `content_source` is set appropriately (e.g., `"cleaned_html"`) so that `raw_markdown` reflects that, while `fit_markdown` reflects the filtered version.
## 5. Customizing Markdown Output via `CrawlerRunConfig`
* 5.1. **The Role of `CrawlerRunConfig.markdown_generator`**
* 5.1.1. How it allows specifying a custom Markdown generation strategy for a crawl run.
* **Explanation:** The `markdown_generator` parameter within the `CrawlerRunConfig` object is the primary way to control how Markdown is generated for a specific crawl operation (i.e., a call to `crawler.arun()` or tasks within `crawler.arun_many()`). You can assign an instance of any class that adheres to the `MarkdownGenerationStrategy` interface to it.
* 5.1.2. Overriding the default Markdown generation behavior.
* **Explanation:** If `CrawlerRunConfig.markdown_generator` is not set (i.e., it's `None`), Crawl4AI will use a default instance of `DefaultMarkdownGenerator` with its standard settings. By providing your own `MarkdownGenerationStrategy` instance (be it a configured `DefaultMarkdownGenerator` or a custom class), you override this default behavior for that particular run.
* 5.2. **Scenarios for Using `CrawlerRunConfig.markdown_generator`**
* 5.2.1. Applying a pre-configured `DefaultMarkdownGenerator` with specific filters or options.
* **Why:** You might want different filtering logic or `html2text` options for different URLs or types of content you're crawling, even within the same `AsyncWebCrawler` instance.
* 5.2.2. Plugging in a completely custom `MarkdownGenerationStrategy`.
* **Why:** As discussed in section 2.1.2, if you have unique Markdown requirements or want to use a different conversion library.
* 5.2.3. Disabling Markdown generation entirely by setting it to `None` (if applicable, or by using a "NoOp" strategy).
* **Why:** If, for a specific crawl, you only need the HTML or extracted structured data and don't require Markdown output, you can pass `markdown_generator=None` (or a strategy that does nothing) to save processing time.
* *Note:* To truly disable Markdown generation and its associated `CustomHTML2Text` processing, you might need a "NoOpMarkdownGenerator". If `markdown_generator` is `None`, the crawler might still fall back to a default. A NoOp strategy would explicitly do nothing.
```python
# class NoOpMarkdownGenerator(MarkdownGenerationStrategy):
# def generate_markdown(self, input_html: str, **kwargs) -> MarkdownGenerationResult:
# return MarkdownGenerationResult(raw_markdown="", markdown_with_citations="", references_markdown="")
# run_config = CrawlerRunConfig(markdown_generator=NoOpMarkdownGenerator())
```
* 5.3. **Code Examples:**
* 5.3.1. *Example:* [Setting a `DefaultMarkdownGenerator` with a `PruningContentFilter` in `CrawlerRunConfig`].
```python
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
DefaultMarkdownGenerator,
CacheMode
)
from crawl4ai.content_filter_strategy import PruningContentFilter
async def run_with_specific_md_generator():
# Configure a specific markdown generator
pruning_filter = PruningContentFilter(threshold=0.6)
specific_md_generator = DefaultMarkdownGenerator(
content_filter=pruning_filter,
options={"body_width": 0, "ignore_links": True}
)
# Configure the crawl run to use this generator
run_config = CrawlerRunConfig(
markdown_generator=specific_md_generator,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com/article1", config=run_config)
if result.success:
print("--- Markdown from Article 1 (Pruned, No Links, No Wrap) ---")
print(result.markdown.fit_markdown[:500])
# raw_markdown would also reflect no-wrap and no-links from html2text_options
# For another URL, you could use a different (or default) generator
# default_run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# result2 = await crawler.arun(url="https://example.com/article2", config=default_run_config)
# asyncio.run(run_with_specific_md_generator())
```
* 5.3.2. *Example:* [Setting a custom `MyMarkdownStrategy` in `CrawlerRunConfig` (assuming `MyCustomMarkdownStrategy` from 2.1.3)].
```python
# Assuming MyCustomMarkdownStrategy is defined as in section 2.1.3
# from crawl4ai import CrawlerRunConfig, AsyncWebCrawler, CacheMode
# from your_module import MyCustomMarkdownStrategy # If it's in another file
# async def run_with_custom_md_strategy():
# custom_strategy = MyCustomMarkdownStrategy(content_source="raw_html")
# run_config_custom = CrawlerRunConfig(
# markdown_generator=custom_strategy,
# cache_mode=CacheMode.BYPASS
# )
# async with AsyncWebCrawler() as crawler:
# result = await crawler.arun(url="https://example.com", config=run_config_custom)
# if result.success:
# print("--- Markdown from Custom Strategy ---")
# print(result.markdown.raw_markdown) # Or other fields your strategy populates
# asyncio.run(run_with_custom_md_strategy())
```
* 5.4. **Interaction with Global vs. Run-Specific Configurations**
* **Explanation:** `AsyncWebCrawler` itself does not have a global `markdown_generator` setting during its initialization. Markdown generation is configured *per run* via `CrawlerRunConfig`. This design choice provides maximum flexibility, allowing different Markdown strategies for different URLs or tasks within the same crawler instance lifecycle. If `CrawlerRunConfig.markdown_generator` is not provided, a default `DefaultMarkdownGenerator` instance is used for that specific run.
## 6. Mastering `CustomHTML2Text` for Fine-Grained Control
* 6.1. **Understanding `CustomHTML2Text`**
* 6.1.1. **Purpose:** Why Crawl4AI includes its own `html2text` extension.
* **Enhanced control:** `CustomHTML2Text` is a subclass of the standard `html2text.HTML2Text` library. Crawl4AI uses this custom version to gain more precise control over the HTML-to-Markdown conversion process, particularly to make the output more suitable for LLMs.
* **Specific adaptations:** It includes logic for handling Crawl4AI's citation and reference generation (`convert_links_to_citations`), and potentially other tweaks that improve the quality and utility of the Markdown output for AI applications.
* 6.1.2. **How it's used by `DefaultMarkdownGenerator`**.
* **Explanation:** `DefaultMarkdownGenerator` instantiates `CustomHTML2Text` internally. When you pass `options` to `DefaultMarkdownGenerator`, these are ultimately used to configure this `CustomHTML2Text` instance. The `handle()` method of `CustomHTML2Text` is what performs the core HTML to Markdown conversion.
* 6.2. **Key `html2text_options` and Their Impact**
* (These options are passed via `DefaultMarkdownGenerator(options=...)`)
* 6.2.1. `body_width`:
* **What it does:** Controls the maximum width of lines in the generated Markdown before wrapping.
* **Why configure it:** For LLM consumption, it's often best to disable automatic line wrapping to allow the LLM to process text based on natural paragraph breaks. Setting `body_width=0` achieves this.
* *Example:*
* `body_width=80` (default-ish for some tools):
```markdown
This is a longer sentence that will be wrapped by html2text if the body_width is
set to a value like 80 characters.
```
* `body_width=0`:
```markdown
This is a longer sentence that will not be wrapped by html2text if body_width is 0, allowing the LLM to handle line breaks.
```
* 6.2.2. `ignore_links`:
* **What it does:** If `True`, all hyperlink information (`[text](url)`) is removed, leaving only the link text.
* **Why configure it:** Set to `True` if links are considered noise for your LLM task and you don't need source attribution. If `False` (default for Crawl4AI's `CustomHTML2Text` unless overridden), links are preserved and can then be converted to citations by `DefaultMarkdownGenerator`.
* *Example:*
* `ignore_links=False` (then processed for citations): `Visit [Crawl4AI](https://crawl4ai.com)` -> `Visit Crawl4AI [^1^]`
* `ignore_links=True`: `Visit [Crawl4AI](https://crawl4ai.com)` -> `Visit Crawl4AI`
* 6.2.3. `ignore_images`:
* **What it does:** If `True`, image tags (`
`) are completely ignored, and no Markdown image syntax (``) is generated.
* **Why configure it:** Useful if image information is irrelevant to your LLM task and you want cleaner, more text-focused Markdown.
* *Example:*
* HTML: `
`
* `ignore_images=False`: ``
* `ignore_images=True`: (nothing is output for the image)
* 6.2.4. `protect_links`:
* **What it does:** If `True`, surrounds link URLs with `<` and `>`. E.g., `[text]()`.
* **Why configure it:** This can sometimes help Markdown parsers that might misinterpret URLs containing special characters. However, with Crawl4AI's citation handling, this is generally not needed, as the raw URLs are moved to the reference section.
* 6.2.5. `mark_code`:
* **What it does:** Controls how `` and `` tags are handled. If `True`, it attempts to use Markdown code block syntax (backticks).
* **Why configure it:** Essential for preserving code snippets correctly. Usually, you'd want this to be `True`.
* 6.2.6. `default_image_alt`:
* **What it does:** Provides a default alt text string if an `
` tag is missing an `alt` attribute.
* **Why configure it:** Can make Markdown more consistent if you choose to include images.
* 6.2.7. `bypass_tables`:
* **What it does:** If `True`, `