refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
This commit is contained in:
UncleCode
2025-01-07 20:49:50 +08:00
parent ae376f15fb
commit ca3e33122e
87 changed files with 4869 additions and 8951 deletions

View File

@@ -1,302 +1,330 @@
# CrawlResult
# `CrawlResult` Reference
The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
## Class Definition
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
"""Result of a web crawling operation."""
# Basic Information
url: str # Crawled URL
success: bool # Whether crawl succeeded
status_code: Optional[int] = None # HTTP status code
error_message: Optional[str] = None # Error message if failed
# Content
html: str # Raw HTML content
cleaned_html: Optional[str] = None # Cleaned HTML
fit_html: Optional[str] = None # Most relevant HTML content
markdown: Optional[str] = None # HTML converted to markdown
fit_markdown: Optional[str] = None # Most relevant markdown content
downloaded_files: Optional[List[str]] = None # Downloaded files
# Extracted Data
extracted_content: Optional[str] = None # Content from extraction strategy
media: Dict[str, List[Dict]] = {} # Extracted media information
links: Dict[str, List[Dict]] = {} # Extracted links
metadata: Optional[dict] = None # Page metadata
# Additional Data
screenshot: Optional[str] = None # Base64 encoded screenshot
session_id: Optional[str] = None # Session identifier
response_headers: Optional[dict] = None # HTTP response headers
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
...
```
## Properties and Their Data Structures
Below is a **field-by-field** explanation and possible usage patterns.
### Basic Information
---
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
**What**: The final crawled URL (after any redirects).
**Usage**:
```python
# Access basic information
result = await crawler.arun(url="https://example.com")
print(result.url) # "https://example.com"
print(result.success) # True/False
print(result.status_code) # 200, 404, etc.
print(result.error_message) # Error details if failed
print(result.url) # e.g., "https://example.com/"
```
### Content Properties
#### HTML Content
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
**Usage**:
```python
# Raw HTML
html_content = result.html
# Cleaned HTML (removed ads, popups, etc.)
clean_content = result.cleaned_html
# Most relevant HTML content
main_content = result.fit_html
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
#### Markdown Content
### 1.3 **`status_code`** *(Optional[int])*
**What**: The pages HTTP status code (e.g., 200, 404).
**Usage**:
```python
# Full markdown version
markdown_content = result.markdown
# Most relevant markdown content
main_content = result.fit_markdown
if result.status_code == 404:
print("Page not found!")
```
### Media Content
The media dictionary contains organized media elements:
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
**Usage**:
```python
# Structure
media = {
"images": [
{
"src": str, # Image URL
"alt": str, # Alt text
"desc": str, # Contextual description
"score": float, # Relevance score (0-10)
"type": str, # "image"
"width": int, # Image width (if available)
"height": int, # Image height (if available)
"context": str, # Surrounding text
"lazy": bool # Whether image was lazy-loaded
}
],
"videos": [
{
"src": str, # Video URL
"type": str, # "video"
"title": str, # Video title
"poster": str, # Thumbnail URL
"duration": str, # Video duration
"description": str # Video description
}
],
"audios": [
{
"src": str, # Audio URL
"type": str, # "audio"
"title": str, # Audio title
"duration": str, # Audio duration
"description": str # Audio description
}
]
}
# Example usage
for image in result.media["images"]:
if image["score"] > 5: # High-relevance images
print(f"High-quality image: {image['src']}")
print(f"Context: {image['context']}")
if not result.success:
print("Error:", result.error_message)
```
### Link Analysis
The links dictionary organizes discovered links:
### 1.5 **`session_id`** *(Optional[str])*
**What**: The ID used for reusing a browser context across multiple calls.
**Usage**:
```python
# Structure
links = {
"internal": [
{
"href": str, # URL
"text": str, # Link text
"title": str, # Title attribute
"type": str, # Link type (nav, content, etc.)
"context": str, # Surrounding text
"score": float # Relevance score
}
],
"external": [
{
"href": str, # External URL
"text": str, # Link text
"title": str, # Title attribute
"domain": str, # Domain name
"type": str, # Link type
"context": str # Surrounding text
}
]
}
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
# Example usage
### 1.6 **`response_headers`** *(Optional[dict])*
**What**: Final HTTP response headers.
**Usage**:
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the sites certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
`subject`, `valid_from`, `valid_until`, etc.
**Usage**:
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
---
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
**What**: The **original** unmodified HTML from the final page load.
**Usage**:
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
**Usage**:
```python
print(result.cleaned_html[:500]) # Show a snippet
```
### 2.3 **`fit_html`** *(Optional[str])*
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.fit_html:
print("High-value HTML content:", result.fit_html[:300])
```
---
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*
**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
**Usage**:
```python
if result.markdown_v2:
md_res = result.markdown_v2
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.
**Usage**:
```python
# Soon, you might see:
if isinstance(result.markdown, MarkdownGenerationResult):
print(result.markdown.raw_markdown[:200])
else:
print(result.markdown)
```
### 3.4 **`fit_markdown`** *(Optional[str])*
**What**: A direct reference to the final filtered markdown (legacy approach).
**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.
**Usage**:
```python
print(result.fit_markdown) # Legacy field, prefer result.markdown_v2.fit_markdown
```
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
---
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.
**Common Fields** in each item:
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawlers heuristic found it “important”
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
**Usage**:
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
**Common Fields**:
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
**Usage**:
```python
for link in result.links["internal"]:
print(f"Internal link: {link['href']}")
print(f"Context: {link['context']}")
print(f"Internal link to {link['href']} with text {link['text']}")
```
### Metadata
---
The metadata dictionary contains page information:
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
**Usage**:
```python
# Structure
metadata = {
"title": str, # Page title
"description": str, # Meta description
"keywords": List[str], # Meta keywords
"author": str, # Author information
"published_date": str, # Publication date
"modified_date": str, # Last modified date
"language": str, # Page language
"canonical_url": str, # Canonical URL
"og_data": Dict, # Open Graph data
"twitter_data": Dict # Twitter card data
}
# Example usage
if result.metadata:
print(f"Title: {result.metadata['title']}")
print(f"Author: {result.metadata.get('author', 'Unknown')}")
```
### Extracted Content
Content from extraction strategies:
```python
# For LLM or CSS extraction strategies
if result.extracted_content:
structured_data = json.loads(result.extracted_content)
print(structured_data)
data = json.loads(result.extracted_content)
print(data)
```
### Screenshot
Base64 encoded screenshot:
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.
**Usage**:
```python
# Save screenshot if available
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
**Usage**:
```python
import base64
if result.screenshot:
import base64
# Decode and save
with open("screenshot.png", "wb") as f:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
## Usage Examples
### Basic Content Access
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
**Usage**:
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
---
## 6. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
if result.success:
# Get clean content
print(result.fit_markdown)
# Process images
for image in result.media["images"]:
if image["score"] > 7:
print(f"High-quality image: {image['src']}")
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown_v2:
print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
if result.markdown_v2.fit_markdown:
print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
else:
print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
```
### Complete Data Processing
```python
async def process_webpage(url: str) -> Dict:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url)
if not result.success:
raise Exception(f"Crawl failed: {result.error_message}")
return {
"content": result.fit_markdown,
"images": [
img for img in result.media["images"]
if img["score"] > 5
],
"internal_links": [
link["href"] for link in result.links["internal"]
],
"metadata": result.metadata,
"status": result.status_code
}
```
---
### Error Handling
```python
async def safe_crawl(url: str) -> Dict:
async with AsyncWebCrawler() as crawler:
try:
result = await crawler.arun(url=url)
if not result.success:
return {
"success": False,
"error": result.error_message,
"status": result.status_code
}
return {
"success": True,
"content": result.fit_markdown,
"status": result.status_code
}
except Exception as e:
return {
"success": False,
"error": str(e),
"status": None
}
```
## 7. Key Points & Future
## Best Practices
1. **`markdown_v2` vs `markdown`**
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
- In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
1. **Always Check Success**
```python
if not result.success:
print(f"Error: {result.error_message}")
return
```
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
2. **Use fit_markdown for Articles**
```python
# Better for article content
content = result.fit_markdown if result.fit_markdown else result.markdown
```
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), youll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing.
3. **Filter Media by Score**
```python
relevant_images = [
img for img in result.media["images"]
if img["score"] > 5
]
```
4. **Links & Media**
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
4. **Handle Missing Data**
```python
metadata = result.metadata or {}
title = metadata.get('title', 'Unknown Title')
```
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
Use **`CrawlResult`** to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**.