Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduce Managed Browsers for an enhanced crawling experience.
- Update documentation for clearer navigation on configuration.
- Rename 'text_only' to 'text_mode' in configuration and methods.
- Improve performance and relevance in content filtering strategies.
This commit is contained in:
UncleCode
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions

@@ -1,6 +1,6 @@
# Output Formats
-Crawl4AI provides multiple output formats to suit different needs, from raw HTML to structured data using LLM or pattern-based extraction.
+Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.
## Basic Formats
@@ -8,18 +8,20 @@ Crawl4AI provides multiple output formats to suit different needs, from raw HTML
result = await crawler.arun(url="https://example.com")
# Access different formats
-raw_html = result.html              # Original HTML
-clean_html = result.cleaned_html    # Sanitized HTML
-markdown = result.markdown          # Standard markdown
-fit_md = result.fit_markdown        # Most relevant content in markdown
+raw_html = result.html                    # Original HTML
+clean_html = result.cleaned_html          # Sanitized HTML
+markdown_v2 = result.markdown_v2          # Detailed markdown generation results
+fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown
```
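Because the updated snippet reads `fit_markdown` through `result.markdown_v2`, and the documentation states that `markdown_v2` is expected to be replaced by `markdown`, calling code may want to avoid depending on either attribute name directly. The sketch below is an illustration only: `pick_markdown_result` is a hypothetical helper, not part of Crawl4AI, and it assumes the future `markdown` attribute will expose the same fields (such as `fit_markdown`) that `markdown_v2` exposes today. `SimpleNamespace` objects stand in for real crawl results.

```python
from types import SimpleNamespace


def pick_markdown_result(result):
    """Return the detailed markdown object from a crawl result.

    Hypothetical helper: prefers `markdown_v2`, and falls back to `markdown`
    only when that attribute looks like the detailed object (assumed here to
    mean it has a `fit_markdown` field) rather than a plain string.
    """
    detailed = getattr(result, "markdown_v2", None)
    if detailed is not None:
        return detailed
    candidate = getattr(result, "markdown", None)
    if candidate is not None and hasattr(candidate, "fit_markdown"):
        return candidate
    raise AttributeError("result exposes no detailed markdown object")


# Stand-ins for crawl results before and after the rename (assumed shapes).
current = SimpleNamespace(
    markdown="# Title",
    markdown_v2=SimpleNamespace(raw_markdown="# Title", fit_markdown="Title"),
)
future = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="# Title", fit_markdown="Title"),
)

print(pick_markdown_result(current).fit_markdown)
print(pick_markdown_result(future).fit_markdown)
```

With a helper like this, only one function needs to change if the final shape of `result.markdown` differs from the assumption made here.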
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.
## Raw HTML
Original, unmodified HTML from the webpage. Useful when you need to:
-- Preserve the exact page structure
-- Process HTML with your own tools
-- Debug page issues
+- Preserve the exact page structure.
+- Process HTML with your own tools.
+- Debug page issues.
```python
result = await crawler.arun(url="https://example.com")
@@ -29,167 +31,72 @@ print(result.html) # Complete HTML including headers, scripts, etc.
## Cleaned HTML
Sanitized HTML with unnecessary elements removed. Automatically:
-- Removes scripts and styles
-- Cleans up formatting
-- Preserves semantic structure
+- Removes scripts and styles.
+- Cleans up formatting.
+- Preserves semantic structure.
```python
-result = await crawler.arun(
-    url="https://example.com",
+config = CrawlerRunConfig(
     excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
     keep_data_attributes=False  # Remove data-* attributes
 )
+result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```
## Standard Markdown
-HTML converted to clean markdown format. Great for:
-- Content analysis
-- Documentation
-- Readability
+HTML converted to clean markdown format. This output is useful for:
+- Content analysis.
+- Documentation.
+- Readability.
```python
-result = await crawler.arun(
-    url="https://example.com",
-    include_links_on_markdown=True  # Include links in markdown
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"include_links": True}  # Include links in markdown
+    )
 )
-print(result.markdown)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```
## Fit Markdown
-Most relevant content extracted and converted to markdown. Ideal for:
-- Article extraction
-- Main content focus
-- Removing boilerplate
+Extract and convert only the most relevant content into markdown format. Best suited for:
+- Article extraction.
+- Focusing on the main content.
+- Removing boilerplate.
+To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
```python
-result = await crawler.arun(url="https://example.com")
-print(result.fit_markdown)  # Only the main content
+from crawl4ai.content_filter_strategy import PruningContentFilter
+config = CrawlerRunConfig(
+    content_filter=PruningContentFilter(
+        threshold=0.7,
+        threshold_type="dynamic",
+        min_word_threshold=100
+    )
+)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```
-## Structured Data Extraction
+## Markdown with Citations
-Crawl4AI offers two powerful approaches for structured data extraction:
-### 1. LLM-Based Extraction
-Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
+Generate markdown that includes citations for links. This format is ideal for:
+- Creating structured documentation.
+- Including references for extracted content.
```python
-from pydantic import BaseModel
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-class KnowledgeGraph(BaseModel):
-    entities: List[dict]
-    relationships: List[dict]
-strategy = LLMExtractionStrategy(
-    provider="ollama/nemotron",  # or "huggingface/...", "ollama/..."
-    api_token="your-token",  # not needed for Ollama
-    schema=KnowledgeGraph.schema(),
-    instruction="Extract entities and relationships from the content"
+config = CrawlerRunConfig(
+    markdown_generator=DefaultMarkdownGenerator(
+        options={"citations": True}  # Enable citations
+    )
 )
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-knowledge_graph = json.loads(result.extracted_content)
+result = await crawler.arun(url="https://example.com", config=config)
+print(result.markdown_v2.markdown_with_citations)
+print(result.markdown_v2.references_markdown)  # Citations section
```
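To make the idea of citation-style markdown concrete, here is a rough, standalone sketch of the transformation: inline links are replaced by numbered markers and the URLs are collected into a references section. This is not Crawl4AI's implementation. `to_citations` is a hypothetical function, and the `[n]` marker style and the reference layout are assumptions chosen for illustration.

```python
import re

# Matches inline markdown links such as [text](https://example.com),
# skipping image links, which start with "!".
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^)\s]+)\)")


def to_citations(markdown):
    """Return (body, references) for a markdown string.

    Each distinct URL gets one number, assigned in order of first appearance;
    repeated links to the same URL reuse that number.
    """
    numbers = {}  # url -> citation number

    def replace(match):
        text, url = match.group(1), match.group(2)
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text} [{number}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(f"[{n}] {url}" for url, n in numbers.items())
    return body, references


body, references = to_citations(
    "See [docs](https://a.example) and [api](https://b.example), "
    "then [docs again](https://a.example)."
)
print(body)
print(references)
```

Keeping the body and the reference list as separate strings mirrors the two properties printed above, `markdown_with_citations` and `references_markdown`, though the exact output format of those properties may differ from this sketch.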
-### 2. Pattern-Based Extraction
-For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
-```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-schema = {
-    "name": "Product Listing",
-    "baseSelector": ".product-card",  # Repeated element
-    "fields": [
-        {"name": "title", "selector": "h2", "type": "text"},
-        {"name": "price", "selector": ".price", "type": "text"},
-        {"name": "description", "selector": ".desc", "type": "text"}
-    ]
-}
-strategy = JsonCssExtractionStrategy(schema)
-result = await crawler.arun(
-    url="https://example.com",
-    extraction_strategy=strategy
-)
-products = json.loads(result.extracted_content)
-```
-## Content Customization
-### HTML to Text Options
-Configure markdown conversion:
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    html2text={
-        "escape_dot": False,
-        "body_width": 0,
-        "protect_links": True,
-        "unicode_snob": True
-    }
-)
-```
-### Content Filters
-Control what content is included:
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    word_count_threshold=10,  # Minimum words per block
-    exclude_external_links=True,  # Remove external links
-    exclude_external_images=True,  # Remove external images
-    excluded_tags=['form', 'nav']  # Remove specific HTML tags
-)
-```
-## Comprehensive Example
-Here's how to use multiple output formats together:
-```python
-async def crawl_content(url: str):
-    async with AsyncWebCrawler() as crawler:
-        # Extract main content with fit markdown
-        result = await crawler.arun(
-            url=url,
-            word_count_threshold=10,
-            exclude_external_links=True
-        )
-        # Get structured data using LLM
-        llm_result = await crawler.arun(
-            url=url,
-            extraction_strategy=LLMExtractionStrategy(
-                provider="ollama/nemotron",
-                schema=YourSchema.schema(),
-                instruction="Extract key information"
-            )
-        )
-        # Get repeated patterns (if any)
-        pattern_result = await crawler.arun(
-            url=url,
-            extraction_strategy=JsonCssExtractionStrategy(your_schema)
-        )
-        return {
-            "main_content": result.fit_markdown,
-            "structured_data": json.loads(llm_result.extracted_content),
-            "pattern_data": json.loads(pattern_result.extracted_content),
-            "media": result.media
-        }
-```