Enhance Crawl4AI with new features and documentation
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
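The first bullet mentions image tags whose URLs live in `srcset`-style attributes rather than `src`. As a rough illustration of what covering those attributes involves, here is a minimal standard-library sketch. It is not Crawl4AI's implementation: the names `parse_srcset`, `ImageCollector`, and `collect_image_urls` are hypothetical, and the HTML attribute is assumed to be spelled `data-srcset`.

```python
from html.parser import HTMLParser


def parse_srcset(value: str) -> list[str]:
    """Return the URLs from a srcset-style value such as 'a.png 1x, b.png 2x'.

    Naive split on commas: URLs that themselves contain commas
    (for example data: URIs) are not handled by this sketch.
    """
    urls = []
    for candidate in value.split(","):
        parts = candidate.strip().split()
        if parts:
            urls.append(parts[0])  # first token is the URL, the rest is a descriptor
    return urls


class ImageCollector(HTMLParser):
    """Collect image URLs from src, srcset, and data-srcset on <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if not value:
                continue
            if name == "src":
                self.urls.append(value)
            elif name in ("srcset", "data-srcset"):
                self.urls.extend(parse_srcset(value))


def collect_image_urls(html: str) -> list[str]:
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


if __name__ == "__main__":
    sample = '<img src="a.png" srcset="a-1x.png 1x, a-2x.png 2x">'
    print(collect_image_urls(sample))
```

A crawler that reads only `src` would miss the responsive variants and any lazily loaded image that carries its URLs in `data-srcset` alone.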
@@ -1,6 +1,6 @@
 # Output Formats
 
-Crawl4AI provides multiple output formats to suit different needs, from raw HTML to structured data using LLM or pattern-based extraction.
+Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.
 
 ## Basic Formats
 
@@ -8,18 +8,20 @@ Crawl4AI provides multiple output formats to suit different needs, from raw HTML
 result = await crawler.arun(url="https://example.com")
 
 # Access different formats
-raw_html = result.html # Original HTML
-clean_html = result.cleaned_html # Sanitized HTML
-markdown = result.markdown # Standard markdown
-fit_md = result.fit_markdown # Most relevant content in markdown
+raw_html = result.html # Original HTML
+clean_html = result.cleaned_html # Sanitized HTML
+markdown_v2 = result.markdown_v2 # Detailed markdown generation results
+fit_md = result.markdown_v2.fit_markdown # Most relevant content in markdown
 ```
 
+> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.
+
 ## Raw HTML
 
 Original, unmodified HTML from the webpage. Useful when you need to:
-- Preserve the exact page structure
-- Process HTML with your own tools
-- Debug page issues
+- Preserve the exact page structure.
+- Process HTML with your own tools.
+- Debug page issues.
 
 ```python
 result = await crawler.arun(url="https://example.com")
@@ -29,167 +31,72 @@ print(result.html) # Complete HTML including headers, scripts, etc.
 ## Cleaned HTML
 
 Sanitized HTML with unnecessary elements removed. Automatically:
-- Removes scripts and styles
-- Cleans up formatting
-- Preserves semantic structure
+- Removes scripts and styles.
+- Cleans up formatting.
+- Preserves semantic structure.
 
 ```python
-result = await crawler.arun(
-    url="https://example.com",
+config = CrawlerRunConfig(
     excluded_tags=['form', 'header', 'footer'], # Additional tags to remove
     keep_data_attributes=False # Remove data-* attributes
 )
+result = await crawler.arun(url="https://example.com", config=config)
 print(result.cleaned_html)
 ```
 
 ## Standard Markdown
 
-HTML converted to clean markdown format. Great for:
-- Content analysis
-- Documentation
-- Readability
+HTML converted to clean markdown format. This output is useful for:
+- Content analysis.
+- Documentation.
+- Readability.
 
 ```python
-result = await crawler.arun(
-    url="https://example.com",
-    include_links_on_markdown=True # Include links in markdown
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"include_links": True} # Include links in markdown
+    )
 )
-print(result.markdown)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.raw_markdown) # Standard markdown with links
 ```
 
 ## Fit Markdown
 
-Most relevant content extracted and converted to markdown. Ideal for:
-- Article extraction
-- Main content focus
-- Removing boilerplate
+Extract and convert only the most relevant content into markdown format. Best suited for:
+- Article extraction.
+- Focusing on the main content.
+- Removing boilerplate.
+
+To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
 
 ```python
-result = await crawler.arun(url="https://example.com")
-print(result.fit_markdown) # Only the main content
+from crawl4ai.content_filter_strategy import PruningContentFilter
+
+config = CrawlerRunConfig(
+    content_filter=PruningContentFilter(
+        threshold=0.7,
+        threshold_type="dynamic",
+        min_word_threshold=100
+    )
+)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.fit_markdown) # Extracted main content in markdown
 ```
 
-## Structured Data Extraction
+## Markdown with Citations
 
-Crawl4AI offers two powerful approaches for structured data extraction:
-
-### 1. LLM-Based Extraction
-
-Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
+Generate markdown that includes citations for links. This format is ideal for:
+- Creating structured documentation.
+- Including references for extracted content.
 
 ```python
-from pydantic import BaseModel
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-class KnowledgeGraph(BaseModel):
-    entities: List[dict]
-    relationships: List[dict]
-
-strategy = LLMExtractionStrategy(
-    provider="ollama/nemotron", # or "huggingface/...", "ollama/..."
-    api_token="your-token", # not needed for Ollama
-    schema=KnowledgeGraph.schema(),
-    instruction="Extract entities and relationships from the content"
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"citations": True} # Enable citations
+    )
 )
-
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-knowledge_graph = json.loads(result.extracted_content)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.markdown_with_citations)
+print(result.markdown_v2.references_markdown) # Citations section
 ```
-
-### 2. Pattern-Based Extraction
-
-For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
-
-```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-schema = {
-    "name": "Product Listing",
-    "baseSelector": ".product-card", # Repeated element
-    "fields": [
-        {"name": "title", "selector": "h2", "type": "text"},
-        {"name": "price", "selector": ".price", "type": "text"},
-        {"name": "description", "selector": ".desc", "type": "text"}
-    ]
-}
-
-strategy = JsonCssExtractionStrategy(schema)
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-products = json.loads(result.extracted_content)
-```
-
-## Content Customization
-
-### HTML to Text Options
-
-Configure markdown conversion:
-
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    html2text={
-        "escape_dot": False,
-        "body_width": 0,
-        "protect_links": True,
-        "unicode_snob": True
-    }
-)
-```
-
-### Content Filters
-
-Control what content is included:
-
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    word_count_threshold=10, # Minimum words per block
-    exclude_external_links=True, # Remove external links
-    exclude_external_images=True, # Remove external images
-    excluded_tags=['form', 'nav'] # Remove specific HTML tags
-)
-```
-
-## Comprehensive Example
-
-Here's how to use multiple output formats together:
-
-```python
-async def crawl_content(url: str):
-    async with AsyncWebCrawler() as crawler:
-        # Extract main content with fit markdown
-        result = await crawler.arun(
-            url=url,
-            word_count_threshold=10,
-            exclude_external_links=True
-        )
-
-        # Get structured data using LLM
-        llm_result = await crawler.arun(
-            url=url,
-            extraction_strategy=LLMExtractionStrategy(
-                provider="ollama/nemotron",
-                schema=YourSchema.schema(),
-                instruction="Extract key information"
-            )
-        )
-
-        # Get repeated patterns (if any)
-        pattern_result = await crawler.arun(
-            url=url,
-            extraction_strategy=JsonCssExtractionStrategy(your_schema)
-        )
-
-        return {
-            "main_content": result.fit_markdown,
-            "structured_data": json.loads(llm_result.extracted_content),
-            "pattern_data": json.loads(pattern_result.extracted_content),
-            "media": result.media
-        }
-```
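The "Markdown with Citations" section in this diff describes markdown whose links become citations, plus a separate references section. The diff does not show the exact output format, so the sketch below only illustrates the general idea and makes no claim about what Crawl4AI emits. The function `to_citations`, the `text[n]` marker style, and the `[n] url` reference lines are all assumptions chosen for the example.

```python
import re

# Inline markdown links like [text](url); the lookbehind skips images (![alt](src)).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown: str) -> tuple[str, str]:
    """Replace inline links with numbered markers and build a reference list.

    Returns (body, references). Repeated URLs reuse the same number.
    """
    numbers: dict[str, int] = {}

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        # Assign the next number the first time a URL is seen.
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text}[{number}]"

    body = LINK_PATTERN.sub(replace, markdown)
    references = "\n".join(f"[{n}] {url}" for url, n in numbers.items())
    return body, references


if __name__ == "__main__":
    body, refs = to_citations("See [docs](https://example.com/docs) for details.")
    print(body)
    print(refs)
```

Keeping the body and the reference list as two separate strings mirrors the two properties printed in the diff, `markdown_with_citations` and `references_markdown`, and lets a caller choose whether to append the references.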