refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
This commit is contained in:
UncleCode
2025-01-07 20:49:50 +08:00
parent ae376f15fb
commit ca3e33122e
87 changed files with 4869 additions and 8951 deletions

View File

@@ -1,244 +1,305 @@
# Complete Parameter Guide for arun()
Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style, ensuring a clear, modern approach.
The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.
---
## Core Parameters
# `arun()` Parameter Guide (New Approach)
In Crawl4AIs **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
url="https://example.com", # Required: URL to crawl
verbose=True, # Enable detailed logging
cache_mode=CacheMode.ENABLED, # Control cache behavior
warmup=True # Whether to run warmup check
url="https://example.com",
config=my_run_config
)
```
## Cache Control
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
---
## 1. Core Usage
```python
from crawl4ai import CacheMode
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
await crawler.arun(
cache_mode=CacheMode.ENABLED, # Normal caching (read/write)
# Other cache modes:
# cache_mode=CacheMode.DISABLED # No caching at all
# cache_mode=CacheMode.READ_ONLY # Only read from cache
# cache_mode=CacheMode.WRITE_ONLY # Only write to cache
# cache_mode=CacheMode.BYPASS # Skip cache for this operation
async def main():
run_config = CrawlerRunConfig(
verbose=True, # Detailed logging
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
# ... other parameters
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
print(result.cleaned_html[:500])
```
**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
---
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesnt read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
```
## Content Processing Parameters
**Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
---
## 3. Content Processing & Selection
### 3.1 Text Processing
### Text Processing
```python
await crawler.arun(
word_count_threshold=10, # Minimum words per content block
image_description_min_word_threshold=5, # Minimum words for image descriptions
only_text=False, # Extract only text content
excluded_tags=['form', 'nav'], # HTML tags to exclude
keep_data_attributes=False, # Preserve data-* attributes
run_config = CrawlerRunConfig(
word_count_threshold=10, # Ignore text blocks <10 words
only_text=False, # If True, tries to remove non-text elements
keep_data_attributes=False # Keep or discard data-* attributes
)
```
### Content Selection
### 3.2 Content Selection
```python
await crawler.arun(
css_selector=".main-content", # CSS selector for content extraction
remove_forms=True, # Remove all form elements
remove_overlay_elements=True, # Remove popups/modals/overlays
run_config = CrawlerRunConfig(
css_selector=".main-content", # Focus on .main-content region only
excluded_tags=["form", "nav"], # Remove entire tag blocks
remove_forms=True, # Specifically strip <form> elements
remove_overlay_elements=True, # Attempt to remove modals/popups
)
```
### Link Handling
### 3.3 Link Handling
```python
await crawler.arun(
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_external_images=True, # Remove external images
exclude_domains=["ads.example.com"], # Specific domains to exclude
social_media_domains=[ # Additional social media domains
"facebook.com",
"twitter.com",
"instagram.com"
]
run_config = CrawlerRunConfig(
exclude_external_links=True, # Remove external links from final content
exclude_social_media_links=True, # Remove links to known social sites
exclude_domains=["ads.example.com"], # Exclude links to these domains
exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)
```
## Browser Control Parameters
### 3.4 Media Filtering
### Basic Browser Settings
```python
await crawler.arun(
headless=True, # Run browser in headless mode
browser_type="chromium", # Browser engine: "chromium", "firefox", "webkit"
page_timeout=60000, # Page load timeout in milliseconds
user_agent="custom-agent", # Custom user agent
run_config = CrawlerRunConfig(
exclude_external_images=True # Strip images from other domains
)
```
### Navigation and Waiting
---
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python
await crawler.arun(
wait_for="css:.dynamic-content", # Wait for element/condition
delay_before_return_html=2.0, # Wait before returning HTML (seconds)
run_config = CrawlerRunConfig(
wait_for="css:.dynamic-content", # Wait for .dynamic-content
delay_before_return_html=2.0, # Wait 2s before capturing final HTML
page_timeout=60000, # Navigation & script timeout (ms)
)
```
### JavaScript Execution
**Key Fields**:
- `wait_for`:
- `"css:selector"` or
- `"js:() => boolean"`
e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
### 4.2 JavaScript Execution
```python
await crawler.arun(
js_code=[ # JavaScript to execute (string or list)
run_config = CrawlerRunConfig(
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();"
"document.querySelector('.load-more')?.click();"
],
js_only=False, # Only execute JavaScript without reloading page
js_only=False
)
```
### Anti-Bot Features
- `js_code` can be a single string or a list of strings.
- `js_only=True` means “Im continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python
await crawler.arun(
magic=True, # Enable all anti-detection features
simulate_user=True, # Simulate human behavior
override_navigator=True # Override navigator properties
run_config = CrawlerRunConfig(
magic=True,
simulate_user=True,
override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
---
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
screenshot=True, # Grab a screenshot as base64
screenshot_wait_for=1.0, # Wait 1s before capturing
pdf=True, # Also produce a PDF
image_description_min_word_threshold=5, # If analyzing alt text
image_score_threshold=3, # Filter out low-score images
)
```
**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
---
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
extraction_strategy=my_css_or_llm_strategy
)
```
### Session Management
```python
await crawler.arun(
session_id="my_session", # Session identifier for persistent browsing
)
```
The extracted data will appear in `result.extracted_content`.
### Screenshot Options
```python
await crawler.arun(
screenshot=True, # Take page screenshot
screenshot_wait_for=2.0, # Wait before screenshot (seconds)
)
```
---
## 8. Comprehensive Example
Below is a snippet combining many parameters:
### Proxy Configuration
```python
await crawler.arun(
proxy="http://proxy.example.com:8080", # Simple proxy URL
proxy_config={ # Advanced proxy settings
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main():
# Example schema
schema = {
"name": "Articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
)
```
## Content Extraction Parameters
### Extraction Strategy
```python
await crawler.arun(
extraction_strategy=LLMExtractionStrategy(
provider="ollama/llama2",
schema=MySchema.schema(),
instruction="Extract specific data"
run_config = CrawlerRunConfig(
# Core
verbose=True,
cache_mode=CacheMode.ENABLED,
# Content
word_count_threshold=10,
css_selector="main.content",
excluded_tags=["nav", "footer"],
exclude_external_links=True,
# Page & JS
js_code="document.querySelector('.show-more')?.click();",
wait_for="css:.loaded-block",
page_timeout=30000,
# Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
# Session
session_id="persistent_session",
# Media
screenshot=True,
pdf=True,
# Anti-bot
simulate_user=True,
magic=True,
)
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/posts", config=run_config)
if result.success:
print("HTML length:", len(result.cleaned_html))
print("Extraction JSON:", result.extracted_content)
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
### Chunking Strategy
```python
await crawler.arun(
chunking_strategy=RegexChunking(
patterns=[r'\n\n', r'\.\s+']
)
)
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
### HTML to Text Options
```python
await crawler.arun(
html2text={
"ignore_links": False,
"ignore_images": False,
"escape_dot": False,
"body_width": 0,
"protect_links": True,
"unicode_snob": True
}
)
```
---
## Debug Options
```python
await crawler.arun(
log_console=True, # Log browser console messages
)
```
## 9. Best Practices
## Parameter Interactions and Notes
1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs—especially if youre part of a large codebase with multiple crawls.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system cant handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
1. **Cache and Performance Setup**
```python
# Optimal caching for repeated crawls
await crawler.arun(
cache_mode=CacheMode.ENABLED,
word_count_threshold=10,
process_iframes=False
)
```
---
2. **Dynamic Content Handling**
```python
# Handle lazy-loaded content
await crawler.arun(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.lazy-content",
delay_before_return_html=2.0,
cache_mode=CacheMode.WRITE_ONLY # Cache results after dynamic load
)
```
## 10. Conclusion
3. **Content Extraction Pipeline**
```python
# Complete extraction setup
await crawler.arun(
css_selector=".main-content",
word_count_threshold=20,
extraction_strategy=my_strategy,
chunking_strategy=my_chunking,
process_iframes=True,
remove_overlay_elements=True,
cache_mode=CacheMode.ENABLED
)
```
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
## Best Practices
- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.
1. **Performance Optimization**
```python
await crawler.arun(
cache_mode=CacheMode.ENABLED, # Use full caching
word_count_threshold=10, # Filter out noise
process_iframes=False # Skip iframes if not needed
)
```
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
2. **Reliable Scraping**
```python
await crawler.arun(
magic=True, # Enable anti-detection
delay_before_return_html=1.0, # Wait for dynamic content
page_timeout=60000, # Longer timeout for slow pages
cache_mode=CacheMode.WRITE_ONLY # Cache results after successful crawl
)
```
3. **Clean Content**
```python
await crawler.arun(
remove_overlay_elements=True, # Remove popups
excluded_tags=['nav', 'aside'],# Remove unnecessary elements
keep_data_attributes=False, # Remove data attributes
cache_mode=CacheMode.ENABLED # Use cache for faster processing
)
```
Happy crawling with your **structured, flexible** config approach!

View File

@@ -1,320 +1,283 @@
Below is the **updated** guide for the **AsyncWebCrawler** class, reflecting the **new** recommended approach of configuring the browser via **`BrowserConfig`** and each crawl via **`CrawlerRunConfig`**. While the crawler still accepts legacy parameters for backward compatibility, the modern, maintainable way is shown below.
---
# AsyncWebCrawler
The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
## Constructor
**Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
---
## 1. Constructor Overview
```python
AsyncWebCrawler(
# Browser Settings
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
headless: bool = True, # Run browser in headless mode
verbose: bool = False, # Enable verbose logging
# Cache Settings
always_by_pass_cache: bool = False, # Always bypass cache
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), # Base directory for cache
# Network Settings
proxy: str = None, # Simple proxy URL
proxy_config: Dict = None, # Advanced proxy configuration
# Browser Behavior
sleep_on_close: bool = False, # Wait before closing browser
# Custom Settings
user_agent: str = None, # Custom user agent
headers: Dict[str, str] = {}, # Custom HTTP headers
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
)
class AsyncWebCrawler:
def __init__(
self,
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
config: Optional[BrowserConfig] = None,
always_bypass_cache: bool = False, # deprecated
always_by_pass_cache: Optional[bool] = None, # also deprecated
base_directory: str = ...,
thread_safe: bool = False,
**kwargs,
):
"""
Create an AsyncWebCrawler instance.
Args:
crawler_strategy: (Advanced) Provide a custom crawler strategy if needed.
config: A BrowserConfig object specifying how the browser is set up.
always_bypass_cache: (Deprecated) Use CrawlerRunConfig.cache_mode instead.
base_directory: Folder for storing caches/logs (if relevant).
thread_safe: If True, attempts some concurrency safeguards. Usually False.
**kwargs: Additional legacy or debugging parameters.
"""
```
### Parameters in Detail
### Typical Initialization
#### Browser Settings
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
- **browser_type** (str, optional)
- Default: `"chromium"`
- Options: `"chromium"`, `"firefox"`, `"webkit"`
- Controls which browser engine to use
```python
# Example: Using Firefox
crawler = AsyncWebCrawler(browser_type="firefox")
```
browser_cfg = BrowserConfig(
browser_type="chromium",
headless=True,
verbose=True
)
- **headless** (bool, optional)
- Default: `True`
- When `True`, browser runs without GUI
- Set to `False` for debugging
```python
# Visible browser for debugging
crawler = AsyncWebCrawler(headless=False)
```
crawler = AsyncWebCrawler(config=browser_cfg)
```
- **verbose** (bool, optional)
- Default: `False`
- Enables detailed logging
```python
# Enable detailed logging
crawler = AsyncWebCrawler(verbose=True)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
#### Cache Settings
---
- **always_by_pass_cache** (bool, optional)
- Default: `False`
- When `True`, always fetches fresh content
```python
# Always fetch fresh content
crawler = AsyncWebCrawler(always_by_pass_cache=True)
```
## 2. Lifecycle: Start/Close or Context Manager
- **base_directory** (str, optional)
- Default: User's home directory
- Base path for cache storage
```python
# Custom cache directory
crawler = AsyncWebCrawler(base_directory="/path/to/cache")
```
### 2.1 Context Manager (Recommended)
#### Network Settings
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://example.com")
# The crawler automatically starts/closes resources
```
- **proxy** (str, optional)
- Simple proxy URL
```python
# Using simple proxy
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
- **proxy_config** (Dict, optional)
- Advanced proxy configuration with authentication
```python
# Advanced proxy with auth
crawler = AsyncWebCrawler(proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
```
### 2.2 Manual Start & Close
#### Browser Behavior
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
- **sleep_on_close** (bool, optional)
- Default: `False`
- Adds delay before closing browser
```python
# Wait before closing
crawler = AsyncWebCrawler(sleep_on_close=True)
```
result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")
#### Custom Settings
await crawler.close()
```
- **user_agent** (str, optional)
- Custom user agent string
```python
# Custom user agent
crawler = AsyncWebCrawler(
user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
)
```
Use this style if you have a **long-running** application or need full control of the crawlers lifecycle.
- **headers** (Dict[str, str], optional)
- Custom HTTP headers
```python
# Custom headers
crawler = AsyncWebCrawler(
headers={
"Accept-Language": "en-US",
"Custom-Header": "Value"
}
)
```
---
- **js_code** (Union[str, List[str]], optional)
- Default JavaScript to execute on each page
```python
# Default JavaScript
crawler = AsyncWebCrawler(
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();"
]
)
```
## Methods
### arun()
The primary method for crawling web pages.
## 3. Primary Method: `arun()`
```python
async def arun(
# Required
url: str, # URL to crawl
# Content Selection
css_selector: str = None, # CSS selector for content
word_count_threshold: int = 10, # Minimum words per block
# Cache Control
bypass_cache: bool = False, # Bypass cache for this request
# Session Management
session_id: str = None, # Session identifier
# Screenshot Options
screenshot: bool = False, # Take screenshot
screenshot_wait_for: float = None, # Wait before screenshot
# Content Processing
process_iframes: bool = False, # Process iframe content
remove_overlay_elements: bool = False, # Remove popups/modals
# Anti-Bot Settings
simulate_user: bool = False, # Simulate human behavior
override_navigator: bool = False, # Override navigator properties
magic: bool = False, # Enable all anti-detection
# Content Filtering
excluded_tags: List[str] = None, # HTML tags to exclude
exclude_external_links: bool = False, # Remove external links
exclude_social_media_links: bool = False, # Remove social media links
# JavaScript Handling
js_code: Union[str, List[str]] = None, # JavaScript to execute
wait_for: str = None, # Wait condition
# Page Loading
page_timeout: int = 60000, # Page load timeout (ms)
delay_before_return_html: float = None, # Wait before return
# Extraction
extraction_strategy: ExtractionStrategy = None # Extraction strategy
self,
url: str,
config: Optional[CrawlerRunConfig] = None,
# Legacy parameters for backward compatibility...
) -> CrawlResult:
...
```
### Usage Examples
### 3.1 New Approach
#### Basic Crawling
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
```
#### Advanced Crawling
```python
async with AsyncWebCrawler(
browser_type="firefox",
verbose=True,
headers={"Custom-Header": "Value"}
) as crawler:
result = await crawler.arun(
url="https://example.com",
css_selector=".main-content",
word_count_threshold=20,
process_iframes=True,
magic=True,
wait_for="css:.dynamic-content",
screenshot=True
)
```
#### Session Management
```python
async with AsyncWebCrawler() as crawler:
# First request
result1 = await crawler.arun(
url="https://example.com/login",
session_id="my_session"
)
# Subsequent request using same session
result2 = await crawler.arun(
url="https://example.com/protected",
session_id="my_session"
)
```
## Context Manager
AsyncWebCrawler implements the async context manager protocol:
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
```python
async def __aenter__(self) -> 'AsyncWebCrawler':
# Initialize browser and resources
return self
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode
async def __aexit__(self, *args):
# Cleanup resources
pass
```
Always use AsyncWebCrawler with async context manager:
```python
async with AsyncWebCrawler() as crawler:
# Your crawling code here
pass
```
## Best Practices
1. **Resource Management**
```python
# Always use context manager
async with AsyncWebCrawler() as crawler:
# Crawler will be properly cleaned up
pass
```
2. **Error Handling**
```python
try:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
if not result.success:
print(f"Crawl failed: {result.error_message}")
except Exception as e:
print(f"Error: {str(e)}")
```
3. **Performance Optimization**
```python
# Enable caching for better performance
crawler = AsyncWebCrawler(
always_by_pass_cache=False,
verbose=True
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
css_selector="main.article",
word_count_threshold=10,
screenshot=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://example.com/news", config=run_cfg)
print("Crawled HTML length:", len(result.cleaned_html))
if result.screenshot:
print("Screenshot base64 length:", len(result.screenshot))
```
4. **Anti-Detection**
### 3.2 Legacy Parameters Still Accepted
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
---
## 4. Helper Methods
### 4.1 `arun_many()`
```python
# Maximum stealth
crawler = AsyncWebCrawler(
headless=True,
user_agent="Mozilla/5.0...",
headers={"Accept-Language": "en-US"}
)
result = await crawler.arun(
url="https://example.com",
magic=True,
simulate_user=True
)
async def arun_many(
self,
urls: List[str],
config: Optional[CrawlerRunConfig] = None,
# Legacy parameters...
) -> List[CrawlResult]:
...
```
## Note on Browser Types
Crawls multiple URLs in concurrency. Accepts the same style `CrawlerRunConfig`. Example:
Each browser type has its characteristics:
- **chromium**: Best overall compatibility
- **firefox**: Good for specific use cases
- **webkit**: Lighter weight, good for basic crawling
Choose based on your specific needs:
```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")
run_cfg = CrawlerRunConfig(
# e.g., concurrency, wait_for, caching, extraction, etc.
semaphore_count=5
)
# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(
urls=["https://example.com", "https://another.com"],
config=run_cfg
)
for r in results:
print(r.url, ":", len(r.cleaned_html))
```
### 4.2 `start()` & `close()`
Allows manual lifecycle usage instead of context manager:
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)
await crawler.close()
```
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
For details, see [CrawlResult doc](./crawl-result.md).
---
## 6. Quick Example
Below is an example hooking it all together:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def main():
# 1. Browser config
browser_cfg = BrowserConfig(
browser_type="firefox",
headless=False,
verbose=True
)
# 2. Run config
schema = {
"name": "Articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
word_count_threshold=15,
remove_overlay_elements=True,
wait_for="css:.post" # Wait for posts to appear
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com/blog",
config=run_cfg
)
if result.success:
print("Cleaned HTML length:", len(result.cleaned_html))
if result.extracted_content:
articles = json.loads(result.extracted_content)
print("Extracted articles:", articles[:2])
else:
print("Error:", result.error_message)
asyncio.run(main())
```
**Explanation**:
- We define a **`BrowserConfig`** with Firefox, no headless, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
---
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings about the browsers environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```
4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
---
## 8. Summary
**AsyncWebCrawler** is your entry point to asynchronous crawling:
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).

View File

@@ -1,85 +0,0 @@
# CrawlerRunConfig Parameters Documentation
## Content Processing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
## Caching Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |
## Page Navigation and Timing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
## Page Interaction Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
## Media Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
## Link and Domain Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
## Debugging and Logging Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |

View File

@@ -1,302 +1,330 @@
# CrawlResult
# `CrawlResult` Reference
The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
## Class Definition
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
"""Result of a web crawling operation."""
# Basic Information
url: str # Crawled URL
success: bool # Whether crawl succeeded
status_code: Optional[int] = None # HTTP status code
error_message: Optional[str] = None # Error message if failed
# Content
html: str # Raw HTML content
cleaned_html: Optional[str] = None # Cleaned HTML
fit_html: Optional[str] = None # Most relevant HTML content
markdown: Optional[str] = None # HTML converted to markdown
fit_markdown: Optional[str] = None # Most relevant markdown content
downloaded_files: Optional[List[str]] = None # Downloaded files
# Extracted Data
extracted_content: Optional[str] = None # Content from extraction strategy
media: Dict[str, List[Dict]] = {} # Extracted media information
links: Dict[str, List[Dict]] = {} # Extracted links
metadata: Optional[dict] = None # Page metadata
# Additional Data
screenshot: Optional[str] = None # Base64 encoded screenshot
session_id: Optional[str] = None # Session identifier
response_headers: Optional[dict] = None # HTTP response headers
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
...
```
## Properties and Their Data Structures
Below is a **field-by-field** explanation and possible usage patterns.
### Basic Information
---
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
**What**: The final crawled URL (after any redirects).
**Usage**:
```python
# Access basic information
result = await crawler.arun(url="https://example.com")
print(result.url) # "https://example.com"
print(result.success) # True/False
print(result.status_code) # 200, 404, etc.
print(result.error_message) # Error details if failed
print(result.url) # e.g., "https://example.com/"
```
### Content Properties
#### HTML Content
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
**Usage**:
```python
# Raw HTML
html_content = result.html
# Cleaned HTML (removed ads, popups, etc.)
clean_content = result.cleaned_html
# Most relevant HTML content
main_content = result.fit_html
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
#### Markdown Content
### 1.3 **`status_code`** *(Optional[int])*
**What**: The pages HTTP status code (e.g., 200, 404).
**Usage**:
```python
# Full markdown version
markdown_content = result.markdown
# Most relevant markdown content
main_content = result.fit_markdown
if result.status_code == 404:
print("Page not found!")
```
### Media Content
The media dictionary contains organized media elements:
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
**Usage**:
```python
# Structure
media = {
"images": [
{
"src": str, # Image URL
"alt": str, # Alt text
"desc": str, # Contextual description
"score": float, # Relevance score (0-10)
"type": str, # "image"
"width": int, # Image width (if available)
"height": int, # Image height (if available)
"context": str, # Surrounding text
"lazy": bool # Whether image was lazy-loaded
}
],
"videos": [
{
"src": str, # Video URL
"type": str, # "video"
"title": str, # Video title
"poster": str, # Thumbnail URL
"duration": str, # Video duration
"description": str # Video description
}
],
"audios": [
{
"src": str, # Audio URL
"type": str, # "audio"
"title": str, # Audio title
"duration": str, # Audio duration
"description": str # Audio description
}
]
}
# Example usage
for image in result.media["images"]:
if image["score"] > 5: # High-relevance images
print(f"High-quality image: {image['src']}")
print(f"Context: {image['context']}")
if not result.success:
print("Error:", result.error_message)
```
### Link Analysis
The links dictionary organizes discovered links:
### 1.5 **`session_id`** *(Optional[str])*
**What**: The ID used for reusing a browser context across multiple calls.
**Usage**:
```python
# Structure
links = {
"internal": [
{
"href": str, # URL
"text": str, # Link text
"title": str, # Title attribute
"type": str, # Link type (nav, content, etc.)
"context": str, # Surrounding text
"score": float # Relevance score
}
],
"external": [
{
"href": str, # External URL
"text": str, # Link text
"title": str, # Title attribute
"domain": str, # Domain name
"type": str, # Link type
"context": str # Surrounding text
}
]
}
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
# Example usage
### 1.6 **`response_headers`** *(Optional[dict])*
**What**: Final HTTP response headers.
**Usage**:
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the sites certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
`subject`, `valid_from`, `valid_until`, etc.
**Usage**:
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
---
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
**What**: The **original** unmodified HTML from the final page load.
**Usage**:
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
**Usage**:
```python
print(result.cleaned_html[:500]) # Show a snippet
```
### 2.3 **`fit_html`** *(Optional[str])*
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.fit_html:
print("High-value HTML content:", result.fit_html[:300])
```
---
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*
**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
**Usage**:
```python
if result.markdown_v2:
md_res = result.markdown_v2
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.
**Usage**:
```python
# Soon, you might see:
if isinstance(result.markdown, MarkdownGenerationResult):
print(result.markdown.raw_markdown[:200])
else:
print(result.markdown)
```
### 3.4 **`fit_markdown`** *(Optional[str])*
**What**: A direct reference to the final filtered markdown (legacy approach).
**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.
**Usage**:
```python
print(result.fit_markdown) # Legacy field, prefer result.markdown_v2.fit_markdown
```
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
---
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.
**Common Fields** in each item:
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawlers heuristic found it “important”
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
**Usage**:
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
**Common Fields**:
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
**Usage**:
```python
for link in result.links["internal"]:
print(f"Internal link: {link['href']}")
print(f"Context: {link['context']}")
print(f"Internal link to {link['href']} with text {link['text']}")
```
### Metadata
---
The metadata dictionary contains page information:
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
**Usage**:
```python
# Structure
metadata = {
"title": str, # Page title
"description": str, # Meta description
"keywords": List[str], # Meta keywords
"author": str, # Author information
"published_date": str, # Publication date
"modified_date": str, # Last modified date
"language": str, # Page language
"canonical_url": str, # Canonical URL
"og_data": Dict, # Open Graph data
"twitter_data": Dict # Twitter card data
}
# Example usage
if result.metadata:
print(f"Title: {result.metadata['title']}")
print(f"Author: {result.metadata.get('author', 'Unknown')}")
```
### Extracted Content
Content from extraction strategies:
```python
# For LLM or CSS extraction strategies
if result.extracted_content:
structured_data = json.loads(result.extracted_content)
print(structured_data)
data = json.loads(result.extracted_content)
print(data)
```
### Screenshot
Base64 encoded screenshot:
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.
**Usage**:
```python
# Save screenshot if available
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
**Usage**:
```python
import base64
if result.screenshot:
import base64
# Decode and save
with open("screenshot.png", "wb") as f:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
## Usage Examples
### Basic Content Access
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
**Usage**:
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
---
## 6. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
if result.success:
# Get clean content
print(result.fit_markdown)
# Process images
for image in result.media["images"]:
if image["score"] > 7:
print(f"High-quality image: {image['src']}")
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown_v2:
print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
if result.markdown_v2.fit_markdown:
print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
else:
print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
```
### Complete Data Processing
```python
async def process_webpage(url: str) -> Dict:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url)
if not result.success:
raise Exception(f"Crawl failed: {result.error_message}")
return {
"content": result.fit_markdown,
"images": [
img for img in result.media["images"]
if img["score"] > 5
],
"internal_links": [
link["href"] for link in result.links["internal"]
],
"metadata": result.metadata,
"status": result.status_code
}
```
---
### Error Handling
```python
async def safe_crawl(url: str) -> Dict:
async with AsyncWebCrawler() as crawler:
try:
result = await crawler.arun(url=url)
if not result.success:
return {
"success": False,
"error": result.error_message,
"status": result.status_code
}
return {
"success": True,
"content": result.fit_markdown,
"status": result.status_code
}
except Exception as e:
return {
"success": False,
"error": str(e),
"status": None
}
```
## 7. Key Points & Future
## Best Practices
1. **`markdown_v2` vs `markdown`**
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
- In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
1. **Always Check Success**
```python
if not result.success:
print(f"Error: {result.error_message}")
return
```
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
2. **Use fit_markdown for Articles**
```python
# Better for article content
content = result.fit_markdown if result.fit_markdown else result.markdown
```
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), youll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing.
3. **Filter Media by Score**
```python
relevant_images = [
img for img in result.media["images"]
if img["score"] > 5
]
```
4. **Links & Media**
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
4. **Handle Missing Data**
```python
metadata = result.metadata or {}
title = metadata.get('title', 'Unknown Title')
```
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
Use **`CrawlResult`** to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**.

View File

@@ -1,36 +1,226 @@
# Parameter Reference Table
# 1. **BrowserConfig** Controlling the Browser
`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
browser_type="chromium",
headless=True,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@proxy:8080",
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```
## 1.1 Parameter Highlights
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`proxy`** | `str` (default: `None`) | Single-proxy URL if you want all traffic to go through it, e.g. `"http://user:pass@proxy:8080"`. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
**Tips**:
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
---
# 2. **CrawlerRunConfig** Controlling Each Crawl
While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
run_cfg = CrawlerRunConfig(
wait_for="css:.main-content",
word_count_threshold=15,
excluded_tags=["nav", "footer"],
exclude_external_links=True,
)
```
## 2.1 Parameter Highlights
We group them by category.
### A) **Content Processing**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
| **`content_filter`** | `RelevantContentFilter` (None) | Filters out irrelevant text blocks. E.g., `PruningContentFilter` or `BM25ContentFilter`. |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
| **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. |
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
---
### B) **Caching & Session**
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
Use these for controlling whether you read or write from a local content cache. Handy for large batch crawls or repeated site visits.
---
### C) **Page Navigation & Timing**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
---
### D) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates were reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
| **`adjust_viewport_to_content`** | `bool` (False) | Resizes viewport to match page content height. |
If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
---
### E) **Media Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
| **`screenshot`** | `bool` (False) | Capture a screenshot (base64) in `result.screenshot`. |
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an images alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
---
### F) **Link/Domain Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output. |
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
---
### G) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the pages JavaScript console output if you want deeper JS debugging.|
---
## 2.2 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Configure the browser
browser_cfg = BrowserConfig(
headless=False,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@myproxy:8080",
text_mode=True
)
# Configure the run
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="my_session",
css_selector="main.article",
excluded_tags=["script", "style"],
exclude_external_links=True,
wait_for="css:.article-loaded",
screenshot=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com/news",
config=run_cfg
)
if result.success:
print("Final cleaned_html length:", len(result.cleaned_html))
if result.screenshot:
print("Screenshot captured (base64, length):", len(result.screenshot))
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Whats Happening**:
- **`text_mode=True`** avoids loading images and other heavy resources, speeding up the crawl.
- We disable caching (`cache_mode=CacheMode.BYPASS`) to always fetch fresh content.
- We only keep `main.article` content by specifying `css_selector="main.article"`.
- We exclude external links (`exclude_external_links=True`).
- We do a quick screenshot (`screenshot=True`) before finishing.
---
## 3. Putting It All Together
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawls **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|-----------|---------------|------------|----------------|-------------|
| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
| async_webcrawler.py | cache_mode | `kwargs.get("cache_mode", CacheMode.ENABLE)` | AsyncWebCrawler | Cache handling mode for request |

View File

@@ -218,12 +218,12 @@ result = await crawler.arun(
## Best Practices
1. **Choose the Right Strategy**
1. **Choose the Right Strategy**
- Use `LLMExtractionStrategy` for complex, unstructured content
- Use `JsonCssExtractionStrategy` for well-structured HTML
- Use `CosineStrategy` for content similarity and clustering
2. **Optimize Chunking**
2. **Optimize Chunking**
```python
# For long documents
strategy = LLMExtractionStrategy(
@@ -232,7 +232,7 @@ result = await crawler.arun(
)
```
3. **Handle Errors**
3. **Handle Errors**
```python
try:
result = await crawler.arun(
@@ -245,7 +245,7 @@ result = await crawler.arun(
print(f"Extraction failed: {e}")
```
4. **Monitor Performance**
4. **Monitor Performance**
```python
strategy = CosineStrategy(
verbose=True, # Enable logging