Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduce Managed Browsers for an enhanced crawling experience.
- Update documentation for clearer navigation on configuration.
- Rename 'text_only' to 'text_mode' in configuration and methods.
- Improve performance and relevance in content filtering strategies.
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions
--- a/docs/md_v2/basic/output-formats.md
+++ b/docs/md_v2/basic/output-formats.md
@@ -1,6 +1,6 @@
 # Output Formats

-Crawl4AI provides multiple output formats to suit different needs, from raw HTML to structured data using LLM or pattern-based extraction.
+Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.

 ## Basic Formats

@@ -8,18 +8,20 @@ Crawl4AI provides multiple output formats to suit different needs, from raw HTML
 result = await crawler.arun(url="https://example.com")

 # Access different formats
-raw_html = result.html           # Original HTML
-clean_html = result.cleaned_html # Sanitized HTML
-markdown = result.markdown       # Standard markdown
-fit_md = result.fit_markdown    # Most relevant content in markdown
+raw_html = result.html                # Original HTML
+clean_html = result.cleaned_html      # Sanitized HTML
+markdown_v2 = result.markdown_v2      # Detailed markdown generation results
+fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown
 ```

+> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.
+
 ## Raw HTML

 Original, unmodified HTML from the webpage. Useful when you need to:
-- Preserve the exact page structure
-- Process HTML with your own tools
-- Debug page issues
+- Preserve the exact page structure.
+- Process HTML with your own tools.
+- Debug page issues.

 ```python
 result = await crawler.arun(url="https://example.com")
@@ -29,167 +31,72 @@ print(result.html)  # Complete HTML including headers, scripts, etc.
 ## Cleaned HTML

 Sanitized HTML with unnecessary elements removed. Automatically:
-- Removes scripts and styles
-- Cleans up formatting
-- Preserves semantic structure
+- Removes scripts and styles.
+- Cleans up formatting.
+- Preserves semantic structure.

 ```python
-result = await crawler.arun(
-    url="https://example.com",
+config = CrawlerRunConfig(
     excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
     keep_data_attributes=False  # Remove data-* attributes
 )
+result = await crawler.arun(url="https://example.com", config=config)
 print(result.cleaned_html)
 ```

 ## Standard Markdown

-HTML converted to clean markdown format. Great for:
-- Content analysis
-- Documentation
-- Readability
+HTML converted to clean markdown format. This output is useful for:
+- Content analysis.
+- Documentation.
+- Readability.

 ```python
-result = await crawler.arun(
-    url="https://example.com",
-    include_links_on_markdown=True  # Include links in markdown
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"include_links": True}  # Include links in markdown
+    )
 )
-print(result.markdown)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.raw_markdown)  # Standard markdown with links
 ```

 ## Fit Markdown

-Most relevant content extracted and converted to markdown. Ideal for:
-- Article extraction
-- Main content focus
-- Removing boilerplate
+Extract and convert only the most relevant content into markdown format. Best suited for:
+- Article extraction.
+- Focusing on the main content.
+- Removing boilerplate.
+
+To generate `fit_markdown`, use a content filter like `PruningContentFilter`:

 ```python
-result = await crawler.arun(url="https://example.com")
-print(result.fit_markdown)  # Only the main content
+from crawl4ai.content_filter_strategy import PruningContentFilter
+
+config = CrawlerRunConfig(
+    content_filter=PruningContentFilter(
+        threshold=0.7,
+        threshold_type="dynamic",
+        min_word_threshold=100
+    )
+)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
 ```

-## Structured Data Extraction
+## Markdown with Citations

-Crawl4AI offers two powerful approaches for structured data extraction:
-
-### 1. LLM-Based Extraction
-
-Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
+Generate markdown that includes citations for links. This format is ideal for:
+- Creating structured documentation.
+- Including references for extracted content.

 ```python
-from pydantic import BaseModel
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-class KnowledgeGraph(BaseModel):
-    entities: List[dict]
-    relationships: List[dict]
-
-strategy = LLMExtractionStrategy(
-    provider="ollama/nemotron",  # or "huggingface/...", "ollama/..."
-    api_token="your-token",   # not needed for Ollama
-    schema=KnowledgeGraph.schema(),
-    instruction="Extract entities and relationships from the content"
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"citations": True}  # Enable citations
+    )
 )
-
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-knowledge_graph = json.loads(result.extracted_content)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.markdown_with_citations)
+print(result.markdown_v2.references_markdown)  # Citations section
 ```
-
-### 2. Pattern-Based Extraction
-
-For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
-
-```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-schema = {
-    "name": "Product Listing",
-    "baseSelector": ".product-card",  # Repeated element
-    "fields": [
-        {"name": "title", "selector": "h2", "type": "text"},
-        {"name": "price", "selector": ".price", "type": "text"},
-        {"name": "description", "selector": ".desc", "type": "text"}
-    ]
-}
-
-strategy = JsonCssExtractionStrategy(schema)
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-products = json.loads(result.extracted_content)
-```
-
-## Content Customization
-
-### HTML to Text Options
-
-Configure markdown conversion:
-
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    html2text={
-        "escape_dot": False,
-        "body_width": 0,
-        "protect_links": True,
-        "unicode_snob": True
-    }
-)
-```
-
-### Content Filters
-
-Control what content is included:
-
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    word_count_threshold=10,        # Minimum words per block
-    exclude_external_links=True,    # Remove external links
-    exclude_external_images=True,   # Remove external images
-    excluded_tags=['form', 'nav']   # Remove specific HTML tags
-)
-```
-
-## Comprehensive Example
-
-Here's how to use multiple output formats together:
-
-```python
-async def crawl_content(url: str):
-    async with AsyncWebCrawler() as crawler:
-        # Extract main content with fit markdown
-        result = await crawler.arun(
-            url=url,
-            word_count_threshold=10,
-            exclude_external_links=True
-        )
-        
-        # Get structured data using LLM
-        llm_result = await crawler.arun(
-            url=url,
-            extraction_strategy=LLMExtractionStrategy(
-                provider="ollama/nemotron",
-                schema=YourSchema.schema(),
-                instruction="Extract key information"
-            )
-        )
-        
-        # Get repeated patterns (if any)
-        pattern_result = await crawler.arun(
-            url=url,
-            extraction_strategy=JsonCssExtractionStrategy(your_schema)
-        )
-        
-        return {
-            "main_content": result.fit_markdown,
-            "structured_data": json.loads(llm_result.extracted_content),
-            "pattern_data": json.loads(pattern_result.extracted_content),
-            "media": result.media
-        }
-```
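
Review note: this diff moves `fit_markdown` from `result.fit_markdown` to `result.markdown_v2.fit_markdown`, and the added note says `markdown_v2` will itself be replaced later. Callers that must work across that transition might wrap the access in a small helper. The sketch below is only an illustration under stated assumptions: `get_fit_markdown` is a hypothetical name, not part of Crawl4AI, and the result objects are stand-ins built with `SimpleNamespace` rather than real crawl results.

```python
from types import SimpleNamespace


def get_fit_markdown(result):
    """Return fit markdown from either result layout, or None if absent.

    Hypothetical helper for illustration; not part of the Crawl4AI API.
    """
    # Layout introduced in this diff: result.markdown_v2.fit_markdown
    md_v2 = getattr(result, "markdown_v2", None)
    if md_v2 is not None and getattr(md_v2, "fit_markdown", None):
        return md_v2.fit_markdown
    # Layout removed by this diff: result.fit_markdown
    return getattr(result, "fit_markdown", None)


# Stand-in objects; a real result would come from `await crawler.arun(...)`.
old_style = SimpleNamespace(fit_markdown="# Old layout")
new_style = SimpleNamespace(markdown_v2=SimpleNamespace(fit_markdown="# New layout"))
empty = SimpleNamespace()

print(get_fit_markdown(old_style))
print(get_fit_markdown(new_style))
print(get_fit_markdown(empty))
```

The helper checks the newer attribute first and falls back to the older one, so the same call site keeps working regardless of which layout the installed version exposes.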