refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add the rich library for better CLI output. Remove redundant tutorial files and consolidate their content into the core sections. Add a personal story to the index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized.
@@ -1,244 +1,305 @@
|
||||
# Complete Parameter Guide for arun()
|
||||
Below is the **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**. It reflects the **new** approach, in which parameters are passed through a **`CrawlerRunConfig`** object rather than directly to `arun()`. Each section includes example usage in the new style.
|
||||
|
||||
The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.
|
||||
---
|
||||
|
||||
## Core Parameters
|
||||
# `arun()` Parameter Guide (New Approach)
|
||||
|
||||
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
url="https://example.com", # Required: URL to crawl
|
||||
verbose=True, # Enable detailed logging
|
||||
cache_mode=CacheMode.ENABLED, # Control cache behavior
|
||||
warmup=True # Whether to run warmup check
|
||||
url="https://example.com",
|
||||
config=my_run_config
|
||||
)
|
||||
```
|
||||
|
||||
## Cache Control
|
||||
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
|
||||
|
||||
---
|
||||
|
||||
## 1. Core Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import CacheMode
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
|
||||
await crawler.arun(
|
||||
cache_mode=CacheMode.ENABLED, # Normal caching (read/write)
|
||||
# Other cache modes:
|
||||
# cache_mode=CacheMode.DISABLED # No caching at all
|
||||
# cache_mode=CacheMode.READ_ONLY # Only read from cache
|
||||
# cache_mode=CacheMode.WRITE_ONLY # Only write to cache
|
||||
# cache_mode=CacheMode.BYPASS # Skip cache for this operation
|
||||
async def main():
|
||||
run_config = CrawlerRunConfig(
|
||||
verbose=True, # Detailed logging
|
||||
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
|
||||
# ... other parameters
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=run_config
|
||||
)
|
||||
print(result.cleaned_html[:500])
|
||||
|
||||
```
|
||||
|
||||
**Key Fields**:
|
||||
- `verbose=True` logs each crawl step.
|
||||
- `cache_mode` decides how to read/write the local crawl cache.
|
||||
|
||||
---
|
||||
|
||||
## 2. Cache Control
|
||||
|
||||
**`cache_mode`** (default: `CacheMode.ENABLED`)
|
||||
Use a built-in enum from `CacheMode`:
|
||||
- `ENABLED`: Normal caching—reads if available, writes if missing.
|
||||
- `DISABLED`: No caching—always refetch pages.
|
||||
- `READ_ONLY`: Reads from cache only; no new writes.
|
||||
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
|
||||
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
|
||||
|
||||
```python
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
```
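One way to read the mode list above is as two yes/no decisions per mode: may this crawl read from the cache, and may it write to it. The sketch below models that with a stand-in enum. It is not Crawl4AI's implementation: `CacheModeSketch` and `cache_permissions` are hypothetical names, and the `BYPASS` row assumes no write, which the description above leaves open.

```python
from enum import Enum
from typing import Tuple


class CacheModeSketch(Enum):
    """Stand-in for the documented modes; not the library's CacheMode."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def cache_permissions(mode: CacheModeSketch) -> Tuple[bool, bool]:
    """Return (may_read, may_write) for a mode, following the list above."""
    may_read = mode in (CacheModeSketch.ENABLED, CacheModeSketch.READ_ONLY)
    # Assumption: BYPASS does not write either; the guide leaves this open.
    may_write = mode in (CacheModeSketch.ENABLED, CacheModeSketch.WRITE_ONLY)
    return may_read, may_write


for mode in CacheModeSketch:
    print(mode.name, cache_permissions(mode))
```

Framing the modes this way can make it easier to choose one: decide first whether stale data is acceptable (read), then whether this crawl should refresh the stored copy (write).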
|
||||
|
||||
## Content Processing Parameters
|
||||
**Additional flags**:
|
||||
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
|
||||
- `disable_cache=True` acts like `CacheMode.DISABLED`.
|
||||
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
|
||||
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
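The correspondence in this list can be written as a small lookup, which may help when migrating older call sites. The sketch is illustrative only: `legacy_flags_to_mode` is a hypothetical helper, it returns mode names as strings rather than real `CacheMode` members, and the order in which conflicting flags are resolved is an assumption.

```python
def legacy_flags_to_mode(
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> str:
    """Map the legacy boolean flags to the name of the matching cache mode."""
    # Assumption: when several flags are set, the first match below wins.
    if disable_cache:
        return "DISABLED"
    if bypass_cache:
        return "BYPASS"
    if no_cache_read:
        return "WRITE_ONLY"
    if no_cache_write:
        return "READ_ONLY"
    return "ENABLED"  # the documented default when no flag is set


print(legacy_flags_to_mode(no_cache_read=True))
```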
|
||||
|
||||
---
|
||||
|
||||
## 3. Content Processing & Selection
|
||||
|
||||
### 3.1 Text Processing
|
||||
|
||||
### Text Processing
|
||||
```python
|
||||
await crawler.arun(
|
||||
word_count_threshold=10, # Minimum words per content block
|
||||
image_description_min_word_threshold=5, # Minimum words for image descriptions
|
||||
only_text=False, # Extract only text content
|
||||
excluded_tags=['form', 'nav'], # HTML tags to exclude
|
||||
keep_data_attributes=False, # Preserve data-* attributes
|
||||
run_config = CrawlerRunConfig(
|
||||
word_count_threshold=10, # Ignore text blocks <10 words
|
||||
only_text=False, # If True, tries to remove non-text elements
|
||||
keep_data_attributes=False # Keep or discard data-* attributes
|
||||
)
|
||||
```
|
||||
|
||||
### Content Selection
|
||||
### 3.2 Content Selection
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
css_selector=".main-content", # CSS selector for content extraction
|
||||
remove_forms=True, # Remove all form elements
|
||||
remove_overlay_elements=True, # Remove popups/modals/overlays
|
||||
run_config = CrawlerRunConfig(
|
||||
css_selector=".main-content", # Focus on .main-content region only
|
||||
excluded_tags=["form", "nav"], # Remove entire tag blocks
|
||||
remove_forms=True, # Specifically strip <form> elements
|
||||
remove_overlay_elements=True, # Attempt to remove modals/popups
|
||||
)
|
||||
```
|
||||
|
||||
### Link Handling
|
||||
### 3.3 Link Handling
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_external_images=True, # Remove external images
|
||||
exclude_domains=["ads.example.com"], # Specific domains to exclude
|
||||
social_media_domains=[ # Additional social media domains
|
||||
"facebook.com",
|
||||
"twitter.com",
|
||||
"instagram.com"
|
||||
]
|
||||
run_config = CrawlerRunConfig(
|
||||
exclude_external_links=True, # Remove external links from final content
|
||||
exclude_social_media_links=True, # Remove links to known social sites
|
||||
exclude_domains=["ads.example.com"], # Exclude links to these domains
|
||||
exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
|
||||
)
|
||||
```
|
||||
|
||||
## Browser Control Parameters
|
||||
### 3.4 Media Filtering
|
||||
|
||||
### Basic Browser Settings
|
||||
```python
|
||||
await crawler.arun(
|
||||
headless=True, # Run browser in headless mode
|
||||
browser_type="chromium", # Browser engine: "chromium", "firefox", "webkit"
|
||||
page_timeout=60000, # Page load timeout in milliseconds
|
||||
user_agent="custom-agent", # Custom user agent
|
||||
run_config = CrawlerRunConfig(
|
||||
exclude_external_images=True # Strip images from other domains
|
||||
)
|
||||
```
|
||||
|
||||
### Navigation and Waiting
|
||||
---
|
||||
|
||||
## 4. Page Navigation & Timing
|
||||
|
||||
### 4.1 Basic Browser Flow
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
wait_for="css:.dynamic-content", # Wait for element/condition
|
||||
delay_before_return_html=2.0, # Wait before returning HTML (seconds)
|
||||
run_config = CrawlerRunConfig(
|
||||
wait_for="css:.dynamic-content", # Wait for .dynamic-content
|
||||
delay_before_return_html=2.0, # Wait 2s before capturing final HTML
|
||||
page_timeout=60000, # Navigation & script timeout (ms)
|
||||
)
|
||||
```
|
||||
|
||||
### JavaScript Execution
|
||||
**Key Fields**:
|
||||
- `wait_for`:
|
||||
- `"css:selector"` or
|
||||
- `"js:() => boolean"`
|
||||
e.g. `js:() => document.querySelectorAll('.item').length > 10`.
|
||||
|
||||
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
|
||||
- `semaphore_count`: concurrency limit when crawling multiple URLs.
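The `wait_for` value carries its kind in a prefix, `css:` or `js:`. As a rough illustration of that convention, the sketch below splits such a string into its kind and expression. `split_wait_condition` is a hypothetical helper, not a Crawl4AI function, and rejecting unprefixed strings is an assumption; the library may treat them differently.

```python
from typing import Tuple


def split_wait_condition(wait_for: str) -> Tuple[str, str]:
    """Split a wait_for string into (kind, expression) by its prefix."""
    for kind in ("css", "js"):
        prefix = kind + ":"
        if wait_for.startswith(prefix):
            return kind, wait_for[len(prefix):].strip()
    # Assumption: unprefixed values are rejected here; the library may differ.
    raise ValueError(f"expected a 'css:' or 'js:' prefix, got {wait_for!r}")


print(split_wait_condition("css:.dynamic-content"))
print(split_wait_condition("js:() => document.querySelectorAll('.item').length > 10"))
```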
|
||||
|
||||
### 4.2 JavaScript Execution
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
js_code=[ # JavaScript to execute (string or list)
|
||||
run_config = CrawlerRunConfig(
|
||||
js_code=[
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
"document.querySelector('.load-more').click();"
|
||||
"document.querySelector('.load-more')?.click();"
|
||||
],
|
||||
js_only=False, # Only execute JavaScript without reloading page
|
||||
js_only=False
|
||||
)
|
||||
```
|
||||
|
||||
### Anti-Bot Features
|
||||
- `js_code` can be a single string or a list of strings.
|
||||
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
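Because `js_code` accepts either a string or a list, code that builds run configs programmatically may find it convenient to normalize the value first. A minimal sketch, assuming a hypothetical helper named `normalize_js_code` that is not part of Crawl4AI:

```python
from typing import List, Optional, Union


def normalize_js_code(js_code: Optional[Union[str, List[str]]]) -> List[str]:
    """Return js_code as a list of snippets, whatever form it was given in."""
    if js_code is None:
        return []
    if isinstance(js_code, str):
        return [js_code]
    return list(js_code)  # copy, so callers can append without side effects


steps = normalize_js_code("window.scrollTo(0, document.body.scrollHeight);")
steps.append("document.querySelector('.load-more')?.click();")
print(steps)
```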
|
||||
|
||||
### 4.3 Anti-Bot
|
||||
|
||||
```python
|
||||
await crawler.arun(
|
||||
magic=True, # Enable all anti-detection features
|
||||
simulate_user=True, # Simulate human behavior
|
||||
override_navigator=True # Override navigator properties
|
||||
run_config = CrawlerRunConfig(
|
||||
magic=True,
|
||||
simulate_user=True,
|
||||
override_navigator=True
|
||||
)
|
||||
```
|
||||
- `magic=True` tries multiple stealth features.
|
||||
- `simulate_user=True` mimics mouse movements or random delays.
|
||||
- `override_navigator=True` fakes some navigator properties (like user agent checks).
|
||||
|
||||
---
|
||||
|
||||
## 5. Session Management
|
||||
|
||||
**`session_id`**:
|
||||
```python
|
||||
run_config = CrawlerRunConfig(
|
||||
session_id="my_session123"
|
||||
)
|
||||
```
|
||||
When the same `session_id` is reused in subsequent `arun()` calls, the crawler continues in the same tab/page context, which helps with multi-step tasks or stateful browsing.
|
||||
|
||||
---
|
||||
|
||||
## 6. Screenshot, PDF & Media Options
|
||||
|
||||
```python
|
||||
run_config = CrawlerRunConfig(
|
||||
screenshot=True, # Grab a screenshot as base64
|
||||
screenshot_wait_for=1.0, # Wait 1s before capturing
|
||||
pdf=True, # Also produce a PDF
|
||||
image_description_min_word_threshold=5, # If analyzing alt text
|
||||
image_score_threshold=3, # Filter out low-score images
|
||||
)
|
||||
```
|
||||
**Where they appear**:
|
||||
- `result.screenshot` → Base64 screenshot string.
|
||||
- `result.pdf` → Byte array with PDF data.
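Since the screenshot arrives as a base64 string and the PDF as raw bytes, the two need slightly different handling when saved to disk. The sketch below shows one way to do it with the standard library. `save_capture` and the file names are hypothetical, and treating the screenshot as PNG data is an assumption.

```python
import base64
from pathlib import Path
from typing import List, Optional


def save_capture(
    screenshot_b64: Optional[str], pdf_bytes: Optional[bytes], out_dir: str
) -> List[Path]:
    """Write the screenshot and PDF to out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    if screenshot_b64:
        # The screenshot is a base64 string, so decode it back to raw bytes.
        path = out / "page.png"  # assumption: the image is PNG data
        path.write_bytes(base64.b64decode(screenshot_b64))
        written.append(path)
    if pdf_bytes:
        # The PDF is already raw bytes and can be written as-is.
        path = out / "page.pdf"
        path.write_bytes(pdf_bytes)
        written.append(path)
    return written
```

With a real crawl result this would be called as `save_capture(result.screenshot, result.pdf, "captures")`.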
|
||||
|
||||
---
|
||||
|
||||
## 7. Extraction Strategy
|
||||
|
||||
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
|
||||
|
||||
```python
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=my_css_or_llm_strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Session Management
|
||||
```python
|
||||
await crawler.arun(
|
||||
session_id="my_session", # Session identifier for persistent browsing
|
||||
)
|
||||
```
|
||||
The extracted data will appear in `result.extracted_content`.
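For CSS and LLM strategies this guide describes the extracted content as JSON, so it typically needs decoding before use. A minimal sketch, assuming the value is a JSON string; `parse_extracted` and the sample payload are hypothetical and only illustrate the shape a schema with `title` and `link` fields might produce.

```python
import json
from typing import Any, List, Optional


def parse_extracted(extracted_content: Optional[str]) -> List[Any]:
    """Decode extracted_content into Python objects; empty list if absent."""
    if not extracted_content:
        return []
    data = json.loads(extracted_content)
    # Assumption: wrap a single object so callers can always iterate.
    return data if isinstance(data, list) else [data]


# Hypothetical payload, shaped like a schema with "title" and "link" fields.
sample = '[{"title": "First post", "link": "/posts/1"}]'
for item in parse_extracted(sample):
    print(item["title"], item["link"])
```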
|
||||
|
||||
### Screenshot Options
|
||||
```python
|
||||
await crawler.arun(
|
||||
screenshot=True, # Take page screenshot
|
||||
screenshot_wait_for=2.0, # Wait before screenshot (seconds)
|
||||
)
|
||||
```
|
||||
---
|
||||
|
||||
## 8. Comprehensive Example
|
||||
|
||||
Below is a snippet combining many parameters:
|
||||
|
||||
### Proxy Configuration
|
||||
```python
|
||||
await crawler.arun(
|
||||
proxy="http://proxy.example.com:8080", # Simple proxy URL
|
||||
proxy_config={ # Advanced proxy settings
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def main():
|
||||
# Example schema
|
||||
schema = {
|
||||
"name": "Articles",
|
||||
"baseSelector": "article.post",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Content Extraction Parameters
|
||||
|
||||
### Extraction Strategy
|
||||
```python
|
||||
await crawler.arun(
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="ollama/llama2",
|
||||
schema=MySchema.schema(),
|
||||
instruction="Extract specific data"
|
||||
run_config = CrawlerRunConfig(
|
||||
# Core
|
||||
verbose=True,
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
|
||||
# Content
|
||||
word_count_threshold=10,
|
||||
css_selector="main.content",
|
||||
excluded_tags=["nav", "footer"],
|
||||
exclude_external_links=True,
|
||||
|
||||
# Page & JS
|
||||
js_code="document.querySelector('.show-more')?.click();",
|
||||
wait_for="css:.loaded-block",
|
||||
page_timeout=30000,
|
||||
|
||||
# Extraction
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||
|
||||
# Session
|
||||
session_id="persistent_session",
|
||||
|
||||
# Media
|
||||
screenshot=True,
|
||||
pdf=True,
|
||||
|
||||
# Anti-bot
|
||||
simulate_user=True,
|
||||
magic=True,
|
||||
)
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com/posts", config=run_config)
|
||||
if result.success:
|
||||
print("HTML length:", len(result.cleaned_html))
|
||||
print("Extraction JSON:", result.extracted_content)
|
||||
if result.screenshot:
|
||||
print("Screenshot length:", len(result.screenshot))
|
||||
if result.pdf:
|
||||
print("PDF bytes length:", len(result.pdf))
|
||||
else:
|
||||
print("Error:", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Chunking Strategy
|
||||
```python
|
||||
await crawler.arun(
|
||||
chunking_strategy=RegexChunking(
|
||||
patterns=[r'\n\n', r'\.\s+']
|
||||
)
|
||||
)
|
||||
```
|
||||
**What we covered**:
|
||||
1. **Crawling** the main content region, ignoring external links.
|
||||
2. Running **JavaScript** to click “.show-more”.
|
||||
3. **Waiting** for “.loaded-block” to appear.
|
||||
4. Generating a **screenshot** & **PDF** of the final page.
|
||||
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
|
||||
|
||||
### HTML to Text Options
|
||||
```python
|
||||
await crawler.arun(
|
||||
html2text={
|
||||
"ignore_links": False,
|
||||
"ignore_images": False,
|
||||
"escape_dot": False,
|
||||
"body_width": 0,
|
||||
"protect_links": True,
|
||||
"unicode_snob": True
|
||||
}
|
||||
)
|
||||
```
|
||||
---
|
||||
|
||||
## Debug Options
|
||||
```python
|
||||
await crawler.arun(
|
||||
log_console=True, # Log browser console messages
|
||||
)
|
||||
```
|
||||
## 9. Best Practices
|
||||
|
||||
## Parameter Interactions and Notes
|
||||
1. **Use `BrowserConfig`** for **global** browser settings (headless, user agent).
|
||||
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
|
||||
3. Keep your **parameters consistent** across run configs, especially in a large codebase with multiple crawls.
|
||||
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
|
||||
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
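To make the concurrency advice concrete, the sketch below shows the general technique a cap such as `semaphore_count` represents: an `asyncio.Semaphore` that lets only a fixed number of tasks run at once. It is not Crawl4AI's implementation. `crawl_all` is a hypothetical function, and the short sleep is a placeholder for a real page fetch.

```python
import asyncio
from typing import List


async def crawl_all(urls: List[str], limit: int) -> int:
    """Run a placeholder task per URL, at most `limit` at a time.

    Returns the highest number of tasks that were in flight at once.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fetch(url: str) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits while `limit` tasks are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the real page fetch
            in_flight -= 1

    await asyncio.gather(*(fetch(u) for u in urls))
    return peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
print("peak concurrency:", asyncio.run(crawl_all(urls, limit=3)))
```

Lowering `limit` trades throughput for a lighter load on both the target site and the local machine.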
|
||||
|
||||
1. **Cache and Performance Setup**
|
||||
```python
|
||||
# Optimal caching for repeated crawls
|
||||
await crawler.arun(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
word_count_threshold=10,
|
||||
process_iframes=False
|
||||
)
|
||||
```
|
||||
---
|
||||
|
||||
2. **Dynamic Content Handling**
|
||||
```python
|
||||
# Handle lazy-loaded content
|
||||
await crawler.arun(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="css:.lazy-content",
|
||||
delay_before_return_html=2.0,
|
||||
cache_mode=CacheMode.WRITE_ONLY # Cache results after dynamic load
|
||||
)
|
||||
```
|
||||
## 10. Conclusion
|
||||
|
||||
3. **Content Extraction Pipeline**
|
||||
```python
|
||||
# Complete extraction setup
|
||||
await crawler.arun(
|
||||
css_selector=".main-content",
|
||||
word_count_threshold=20,
|
||||
extraction_strategy=my_strategy,
|
||||
chunking_strategy=my_chunking,
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
cache_mode=CacheMode.ENABLED
|
||||
)
|
||||
```
|
||||
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
|
||||
|
||||
## Best Practices
|
||||
- Makes code **clearer** and **more maintainable**.
|
||||
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
|
||||
- Allows you to create **reusable** config objects for different pages or tasks.
|
||||
|
||||
1. **Performance Optimization**
|
||||
```python
|
||||
await crawler.arun(
|
||||
cache_mode=CacheMode.ENABLED, # Use full caching
|
||||
word_count_threshold=10, # Filter out noise
|
||||
process_iframes=False # Skip iframes if not needed
|
||||
)
|
||||
```
|
||||
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
|
||||
|
||||
2. **Reliable Scraping**
|
||||
```python
|
||||
await crawler.arun(
|
||||
magic=True, # Enable anti-detection
|
||||
delay_before_return_html=1.0, # Wait for dynamic content
|
||||
page_timeout=60000, # Longer timeout for slow pages
|
||||
cache_mode=CacheMode.WRITE_ONLY # Cache results after successful crawl
|
||||
)
|
||||
```
|
||||
|
||||
3. **Clean Content**
|
||||
```python
|
||||
await crawler.arun(
|
||||
remove_overlay_elements=True, # Remove popups
|
||||
excluded_tags=['nav', 'aside'],# Remove unnecessary elements
|
||||
keep_data_attributes=False, # Remove data attributes
|
||||
cache_mode=CacheMode.ENABLED # Use cache for faster processing
|
||||
)
|
||||
```
|
||||
Happy crawling with your **structured, flexible** config approach!
|
||||
@@ -1,320 +1,283 @@
|
||||
Below is the **updated** guide for the **AsyncWebCrawler** class, reflecting the **new** recommended approach of configuring the browser via **`BrowserConfig`** and each crawl via **`CrawlerRunConfig`**. While the crawler still accepts legacy parameters for backward compatibility, the modern, maintainable way is shown below.
|
||||
|
||||
---
|
||||
|
||||
# AsyncWebCrawler
|
||||
|
||||
The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.
|
||||
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
|
||||
|
||||
## Constructor
|
||||
**Recommended usage**:
|
||||
1. **Create** a `BrowserConfig` for global browser settings.
|
||||
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
|
||||
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
|
||||
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
|
||||
|
||||
---
|
||||
|
||||
## 1. Constructor Overview
|
||||
|
||||
```python
|
||||
AsyncWebCrawler(
|
||||
# Browser Settings
|
||||
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
|
||||
headless: bool = True, # Run browser in headless mode
|
||||
verbose: bool = False, # Enable verbose logging
|
||||
|
||||
# Cache Settings
|
||||
always_by_pass_cache: bool = False, # Always bypass cache
|
||||
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), # Base directory for cache
|
||||
|
||||
# Network Settings
|
||||
proxy: str = None, # Simple proxy URL
|
||||
proxy_config: Dict = None, # Advanced proxy configuration
|
||||
|
||||
# Browser Behavior
|
||||
sleep_on_close: bool = False, # Wait before closing browser
|
||||
|
||||
# Custom Settings
|
||||
user_agent: str = None, # Custom user agent
|
||||
headers: Dict[str, str] = {}, # Custom HTTP headers
|
||||
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
|
||||
)
|
||||
class AsyncWebCrawler:
|
||||
def __init__(
|
||||
self,
|
||||
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
||||
config: Optional[BrowserConfig] = None,
|
||||
always_bypass_cache: bool = False, # deprecated
|
||||
always_by_pass_cache: Optional[bool] = None, # also deprecated
|
||||
base_directory: str = ...,
|
||||
thread_safe: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Create an AsyncWebCrawler instance.
|
||||
|
||||
Args:
|
||||
crawler_strategy: (Advanced) Provide a custom crawler strategy if needed.
|
||||
config: A BrowserConfig object specifying how the browser is set up.
|
||||
always_bypass_cache: (Deprecated) Use CrawlerRunConfig.cache_mode instead.
|
||||
base_directory: Folder for storing caches/logs (if relevant).
|
||||
thread_safe: If True, attempts some concurrency safeguards. Usually False.
|
||||
**kwargs: Additional legacy or debugging parameters.
|
||||
"""
|
||||
```
|
||||
|
||||
### Parameters in Detail
|
||||
### Typical Initialization
|
||||
|
||||
#### Browser Settings
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
- **browser_type** (str, optional)
|
||||
- Default: `"chromium"`
|
||||
- Options: `"chromium"`, `"firefox"`, `"webkit"`
|
||||
- Controls which browser engine to use
|
||||
```python
|
||||
# Example: Using Firefox
|
||||
crawler = AsyncWebCrawler(browser_type="firefox")
|
||||
```
|
||||
browser_cfg = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
- **headless** (bool, optional)
|
||||
- Default: `True`
|
||||
- When `True`, browser runs without GUI
|
||||
- Set to `False` for debugging
|
||||
```python
|
||||
# Visible browser for debugging
|
||||
crawler = AsyncWebCrawler(headless=False)
|
||||
```
|
||||
crawler = AsyncWebCrawler(config=browser_cfg)
|
||||
```
|
||||
|
||||
- **verbose** (bool, optional)
|
||||
- Default: `False`
|
||||
- Enables detailed logging
|
||||
```python
|
||||
# Enable detailed logging
|
||||
crawler = AsyncWebCrawler(verbose=True)
|
||||
```
|
||||
**Notes**:
|
||||
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
|
||||
|
||||
#### Cache Settings
|
||||
---
|
||||
|
||||
- **always_by_pass_cache** (bool, optional)
|
||||
- Default: `False`
|
||||
- When `True`, always fetches fresh content
|
||||
```python
|
||||
# Always fetch fresh content
|
||||
crawler = AsyncWebCrawler(always_by_pass_cache=True)
|
||||
```
|
||||
## 2. Lifecycle: Start/Close or Context Manager
|
||||
|
||||
- **base_directory** (str, optional)
|
||||
- Default: User's home directory
|
||||
- Base path for cache storage
|
||||
```python
|
||||
# Custom cache directory
|
||||
crawler = AsyncWebCrawler(base_directory="/path/to/cache")
|
||||
```
|
||||
### 2.1 Context Manager (Recommended)
|
||||
|
||||
#### Network Settings
|
||||
```python
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
# The crawler automatically starts/closes resources
|
||||
```
|
||||
|
||||
- **proxy** (str, optional)
|
||||
- Simple proxy URL
|
||||
```python
|
||||
# Using simple proxy
|
||||
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
|
||||
```
|
||||
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
|
||||
|
||||
- **proxy_config** (Dict, optional)
|
||||
- Advanced proxy configuration with authentication
|
||||
```python
|
||||
# Advanced proxy with auth
|
||||
crawler = AsyncWebCrawler(proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
})
|
||||
```
|
||||
### 2.2 Manual Start & Close
|
||||
|
||||
#### Browser Behavior
|
||||
```python
|
||||
crawler = AsyncWebCrawler(config=browser_cfg)
|
||||
await crawler.start()
|
||||
|
||||
- **sleep_on_close** (bool, optional)
|
||||
- Default: `False`
|
||||
- Adds delay before closing browser
|
||||
```python
|
||||
# Wait before closing
|
||||
crawler = AsyncWebCrawler(sleep_on_close=True)
|
||||
```
|
||||
result1 = await crawler.arun("https://example.com")
|
||||
result2 = await crawler.arun("https://another.com")
|
||||
|
||||
#### Custom Settings
|
||||
await crawler.close()
|
||||
```
|
||||
|
||||
- **user_agent** (str, optional)
|
||||
- Custom user agent string
|
||||
```python
|
||||
# Custom user agent
|
||||
crawler = AsyncWebCrawler(
|
||||
user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
|
||||
)
|
||||
```
|
||||
Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
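With manual lifecycle control, an exception raised between `start()` and `close()` would skip the cleanup that `async with` performs automatically. Wrapping the work in `try`/`finally` is a common guard. The sketch below uses `StubCrawler`, a hypothetical stand-in that only mimics the `start()`/`arun()`/`close()` shape, so it runs without a browser.

```python
import asyncio
from typing import List


class StubCrawler:
    """Hypothetical stand-in with the start()/arun()/close() shape used above."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def close(self) -> None:
        self.running = False

    async def arun(self, url: str) -> str:
        if not self.running:
            raise RuntimeError("crawler was not started")
        if "bad" in url:
            raise ValueError(f"cannot crawl {url}")
        return f"crawled {url}"


async def crawl_with_cleanup(crawler: StubCrawler, urls: List[str]) -> List[str]:
    """Start the crawler, crawl each URL, and always close it afterwards."""
    await crawler.start()
    try:
        return [await crawler.arun(u) for u in urls]
    finally:
        # Runs on success and on error, so the crawler is never left open.
        await crawler.close()


crawler = StubCrawler()
print(asyncio.run(crawl_with_cleanup(crawler, ["https://example.com"])))
```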
|
||||
|
||||
- **headers** (Dict[str, str], optional)
|
||||
- Custom HTTP headers
|
||||
```python
|
||||
# Custom headers
|
||||
crawler = AsyncWebCrawler(
|
||||
headers={
|
||||
"Accept-Language": "en-US",
|
||||
"Custom-Header": "Value"
|
||||
}
|
||||
)
|
||||
```
|
||||
---
|
||||
|
||||
- **js_code** (Union[str, List[str]], optional)
|
||||
- Default JavaScript to execute on each page
|
||||
```python
|
||||
# Default JavaScript
|
||||
crawler = AsyncWebCrawler(
|
||||
js_code=[
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
"document.querySelector('.load-more').click();"
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Methods
|
||||
|
||||
### arun()
|
||||
|
||||
The primary method for crawling web pages.
|
||||
## 3. Primary Method: `arun()`
|
||||
|
||||
```python
|
||||
async def arun(
|
||||
# Required
|
||||
url: str, # URL to crawl
|
||||
|
||||
# Content Selection
|
||||
css_selector: str = None, # CSS selector for content
|
||||
word_count_threshold: int = 10, # Minimum words per block
|
||||
|
||||
# Cache Control
|
||||
bypass_cache: bool = False, # Bypass cache for this request
|
||||
|
||||
# Session Management
|
||||
session_id: str = None, # Session identifier
|
||||
|
||||
# Screenshot Options
|
||||
screenshot: bool = False, # Take screenshot
|
||||
screenshot_wait_for: float = None, # Wait before screenshot
|
||||
|
||||
# Content Processing
|
||||
process_iframes: bool = False, # Process iframe content
|
||||
remove_overlay_elements: bool = False, # Remove popups/modals
|
||||
|
||||
# Anti-Bot Settings
|
||||
simulate_user: bool = False, # Simulate human behavior
|
||||
override_navigator: bool = False, # Override navigator properties
|
||||
magic: bool = False, # Enable all anti-detection
|
||||
|
||||
# Content Filtering
|
||||
excluded_tags: List[str] = None, # HTML tags to exclude
|
||||
exclude_external_links: bool = False, # Remove external links
|
||||
exclude_social_media_links: bool = False, # Remove social media links
|
||||
|
||||
# JavaScript Handling
|
||||
js_code: Union[str, List[str]] = None, # JavaScript to execute
|
||||
wait_for: str = None, # Wait condition
|
||||
|
||||
# Page Loading
|
||||
page_timeout: int = 60000, # Page load timeout (ms)
|
||||
delay_before_return_html: float = None, # Wait before return
|
||||
|
||||
# Extraction
|
||||
extraction_strategy: ExtractionStrategy = None # Extraction strategy
|
||||
self,
|
||||
url: str,
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
# Legacy parameters for backward compatibility...
|
||||
) -> CrawlResult:
|
||||
...
|
||||
```
|
||||
|
||||
### Usage Examples
|
||||
### 3.1 New Approach
|
||||
|
||||
#### Basic Crawling
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
#### Advanced Crawling
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
browser_type="firefox",
|
||||
verbose=True,
|
||||
headers={"Custom-Header": "Value"}
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
css_selector=".main-content",
|
||||
word_count_threshold=20,
|
||||
process_iframes=True,
|
||||
magic=True,
|
||||
wait_for="css:.dynamic-content",
|
||||
screenshot=True
|
||||
)
|
||||
```
|
||||
|
||||
#### Session Management
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# First request
|
||||
result1 = await crawler.arun(
|
||||
url="https://example.com/login",
|
||||
session_id="my_session"
|
||||
)
|
||||
|
||||
# Subsequent request using same session
|
||||
result2 = await crawler.arun(
|
||||
url="https://example.com/protected",
|
||||
session_id="my_session"
|
||||
)
|
||||
```
|
||||
|
||||
## Context Manager
|
||||
|
||||
AsyncWebCrawler implements the async context manager protocol:
|
||||
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
|
||||
|
||||
```python
|
||||
async def __aenter__(self) -> 'AsyncWebCrawler':
|
||||
# Initialize browser and resources
|
||||
return self
|
||||
import asyncio
|
||||
from crawl4ai import CrawlerRunConfig, CacheMode
|
||||
|
||||
async def __aexit__(self, *args):
|
||||
# Cleanup resources
|
||||
pass
|
||||
```
|
||||
|
||||
Always use AsyncWebCrawler with async context manager:
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Your crawling code here
|
||||
pass
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Resource Management**
|
||||
```python
|
||||
# Always use context manager
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Crawler will be properly cleaned up
|
||||
pass
|
||||
```
|
||||
|
||||
2. **Error Handling**
|
||||
```python
|
||||
try:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
if not result.success:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
except Exception as e:
|
||||
print(f"Error: {str(e)}")
|
||||
```
|
||||
|
||||
3. **Performance Optimization**
|
||||
```python
|
||||
# Enable caching for better performance
|
||||
crawler = AsyncWebCrawler(
|
||||
always_by_pass_cache=False,
|
||||
verbose=True
|
||||
run_cfg = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
css_selector="main.article",
|
||||
word_count_threshold=10,
|
||||
screenshot=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://example.com/news", config=run_cfg)
|
||||
print("Crawled HTML length:", len(result.cleaned_html))
|
||||
if result.screenshot:
|
||||
print("Screenshot base64 length:", len(result.screenshot))
|
||||
```
|
||||
|
||||
4. **Anti-Detection**
|
||||
### 3.2 Legacy Parameters Still Accepted
|
||||
|
||||
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
|
||||
|
||||
---
|
||||
|
||||
## 4. Helper Methods
|
||||
|
||||
### 4.1 `arun_many()`
|
||||
|
||||
```python
|
||||
# Maximum stealth
|
||||
crawler = AsyncWebCrawler(
|
||||
headless=True,
|
||||
user_agent="Mozilla/5.0...",
|
||||
headers={"Accept-Language": "en-US"}
|
||||
)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True,
|
||||
simulate_user=True
|
||||
)
|
||||
async def arun_many(
|
||||
self,
|
||||
urls: List[str],
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
# Legacy parameters...
|
||||
) -> List[CrawlResult]:
|
||||
...
|
||||
```
|
||||
|
||||
## Note on Browser Types
|
||||
Crawls multiple URLs concurrently. It accepts the same kind of `CrawlerRunConfig` as `arun()`. Example:
|
||||
|
||||
Each browser type has its characteristics:
|
||||
|
||||
- **chromium**: Best overall compatibility
|
||||
- **firefox**: Good for specific use cases
|
||||
- **webkit**: Lighter weight, good for basic crawling
|
||||
|
||||
Choose based on your specific needs:
|
||||
```python
|
||||
# High compatibility
|
||||
crawler = AsyncWebCrawler(browser_type="chromium")
|
||||
run_cfg = CrawlerRunConfig(
|
||||
# e.g., concurrency, wait_for, caching, extraction, etc.
|
||||
semaphore_count=5
|
||||
)
|
||||
|
||||
# Memory efficient
|
||||
crawler = AsyncWebCrawler(browser_type="webkit")
|
||||
```
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
results = await crawler.arun_many(
|
||||
urls=["https://example.com", "https://another.com"],
|
||||
config=run_cfg
|
||||
)
|
||||
for r in results:
|
||||
print(r.url, ":", len(r.cleaned_html))
|
||||
```
|
||||
|
||||
### 4.2 `start()` & `close()`
|
||||
|
||||
Allows manual lifecycle management instead of a context manager:
|
||||
|
||||
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)

await crawler.close()
```
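
With manual lifecycle control, an exception raised between `start()` and `close()` can leave the browser running. A minimal sketch of a guard, assuming only that the crawler exposes awaitable `start()` and `close()` methods as described in this section; `run_with_lifecycle` and `FakeCrawler` are hypothetical names used for illustration and are not part of Crawl4AI:

```python
import asyncio


async def run_with_lifecycle(crawler, work):
    """Start the crawler, run `work(crawler)`, and always close it afterwards."""
    await crawler.start()
    try:
        return await work(crawler)
    finally:
        # Runs even if `work` raises, so the browser is not left open.
        await crawler.close()


class FakeCrawler:
    """Stand-in that only records the call order (not a Crawl4AI class)."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")


async def failing_work(crawler):
    crawler.events.append("work")
    raise RuntimeError("simulated crawl failure")


def demo():
    """Run the failing job and return the recorded call order."""
    crawler = FakeCrawler()
    try:
        asyncio.run(run_with_lifecycle(crawler, failing_work))
    except RuntimeError:
        pass
    return crawler.events
```

In real use, `work` would be a coroutine function that issues the `arun()` calls; the context-manager form gives the same guarantee with less code.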
|
||||
|
||||
---
|
||||
|
||||
## 5. `CrawlResult` Output
|
||||
|
||||
Each `arun()` returns a **`CrawlResult`** containing:
|
||||
|
||||
- `url`: Final URL (if redirected).
|
||||
- `html`: Original HTML.
|
||||
- `cleaned_html`: Sanitized HTML.
|
||||
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
|
||||
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
|
||||
- `screenshot`, `pdf`: If screenshots/PDF requested.
|
||||
- `media`, `links`: Information about discovered images/links.
|
||||
- `success`, `error_message`: Status info.
|
||||
|
||||
For details, see [CrawlResult doc](./crawl-result.md).
|
||||
|
||||
---
|
||||
|
||||
## 6. Quick Example
|
||||
|
||||
Below is an example that ties it all together:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
|
||||
|
||||
**Explanation**:
|
||||
- We define a **`BrowserConfig`** that uses Firefox with headless mode disabled and `verbose=True`.
|
||||
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
|
||||
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
|
||||
|
||||
---
|
||||
|
||||
## 7. Best Practices & Migration Notes
|
||||
|
||||
1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
|
||||
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
|
||||
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
|
||||
|
||||
```python
|
||||
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
|
||||
result = await crawler.arun(url="...", config=run_cfg)
|
||||
```
|
||||
|
||||
4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
|
||||
|
||||
---
|
||||
|
||||
## 8. Summary
|
||||
|
||||
**AsyncWebCrawler** is your entry point to asynchronous crawling:
|
||||
|
||||
- **Constructor** accepts **`BrowserConfig`** (or defaults).
|
||||
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
|
||||
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
|
||||
- For advanced lifecycle control, use `start()` and `close()` explicitly.
|
||||
|
||||
**Migration**:
|
||||
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
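
When migrating a call site that passes many legacy keyword arguments, it can help to sort them into the two groups first. The sketch below is a hypothetical helper, not part of Crawl4AI; the key sets are a small illustrative subset of the parameters named in this documentation, and the real configs accept many more:

```python
# Illustrative subsets only; extend them to match the parameters you actually use.
BROWSER_KEYS = {
    "browser_type", "headless", "user_agent", "proxy",
    "viewport_width", "viewport_height",
}
RUN_KEYS = {
    "css_selector", "word_count_threshold", "cache_mode", "wait_for",
    "screenshot", "magic", "simulate_user",
}


def split_legacy_kwargs(legacy):
    """Partition legacy keyword arguments into browser-level and per-crawl groups.

    Returns (browser_kwargs, run_kwargs, unknown_kwargs).
    """
    browser, run, unknown = {}, {}, {}
    for key, value in legacy.items():
        if key in BROWSER_KEYS:
            browser[key] = value
        elif key in RUN_KEYS:
            run[key] = value
        else:
            # Left for manual review rather than guessed at.
            unknown[key] = value
    return browser, run, unknown
```

The first two dictionaries can then be unpacked into `BrowserConfig(**browser_kwargs)` and `CrawlerRunConfig(**run_kwargs)`; anything in the third needs a manual decision.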
|
||||
|
||||
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
|
||||
@@ -1,85 +0,0 @@
|
||||
# CrawlerRunConfig Parameters Documentation
|
||||
|
||||
## Content Processing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
|
||||
|
||||
## Caching Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |
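
The legacy flags in the table above each correspond to one `CacheMode` value. A minimal sketch of that mapping follows; the `CacheMode` enum here is a local stand-in that only mirrors the mode names in the table (it is not imported from `crawl4ai`), `legacy_flags_to_cache_mode` is a hypothetical helper, and the precedence used when several flags are set is an assumption:

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the mode names in the table above."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def legacy_flags_to_cache_mode(bypass_cache=False, disable_cache=False,
                               no_cache_read=False, no_cache_write=False):
    """Translate the legacy boolean flags into a single cache mode."""
    # Precedence when several flags are True is an assumption of this sketch.
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY  # never read, still write
    if no_cache_write:
        return CacheMode.READ_ONLY   # still read, never write
    return CacheMode.ENABLED
```

Setting `cache_mode` explicitly avoids any ambiguity about how combined flags are resolved.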
|
||||
|
||||
## Page Navigation and Timing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
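
The table describes `mean_delay` as a base delay and `max_range` as the maximum random addition. A minimal sketch of how such a delay could be computed, assuming the two are summed with a uniform random draw; this combination is an assumption for illustration rather than a confirmed description of Crawl4AI internals, and `request_delay` is a hypothetical helper:

```python
import random


def request_delay(mean_delay=0.1, max_range=0.3, rng=random):
    """Return the base delay plus a random extra in [0, max_range], in seconds."""
    return mean_delay + rng.uniform(0, max_range)
```

Under this reading, the defaults give a pause between 0.1 and 0.4 seconds per request.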
|
||||
|
||||
## Page Interaction Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
|
||||
|
||||
## Media Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
|
||||
|
||||
## Link and Domain Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
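
To make the effect of a domain exclusion list concrete, here is a small sketch of the kind of check such a filter might perform. The matching rule (exact host or any subdomain) is an assumption for illustration and may differ from what Crawl4AI does internally; `is_excluded` is a hypothetical helper:

```python
from urllib.parse import urlparse


def is_excluded(href, exclude_domains):
    """Return True if the link's host equals, or is a subdomain of, an excluded domain."""
    host = urlparse(href).netloc.lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in exclude_domains
    )
```

For example, with `exclude_domains=["example.org"]`, links to both `example.org` and `cdn.example.org` would be dropped under this rule, while `notexample.org` would be kept.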
|
||||
|
||||
## Debugging and Logging Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |
|
||||
@@ -1,302 +1,330 @@
|
||||
# CrawlResult
|
||||
# `CrawlResult` Reference
|
||||
|
||||
The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
|
||||
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
|
||||
|
||||
## Class Definition
|
||||
**Location**: `crawl4ai/crawler/models.py` (for reference)
|
||||
|
||||
```python
|
||||
class CrawlResult(BaseModel):
|
||||
"""Result of a web crawling operation."""
|
||||
|
||||
# Basic Information
|
||||
url: str # Crawled URL
|
||||
success: bool # Whether crawl succeeded
|
||||
status_code: Optional[int] = None # HTTP status code
|
||||
error_message: Optional[str] = None # Error message if failed
|
||||
|
||||
# Content
|
||||
html: str # Raw HTML content
|
||||
cleaned_html: Optional[str] = None # Cleaned HTML
|
||||
fit_html: Optional[str] = None # Most relevant HTML content
|
||||
markdown: Optional[str] = None # HTML converted to markdown
|
||||
fit_markdown: Optional[str] = None # Most relevant markdown content
|
||||
downloaded_files: Optional[List[str]] = None # Downloaded files
|
||||
|
||||
# Extracted Data
|
||||
extracted_content: Optional[str] = None # Content from extraction strategy
|
||||
media: Dict[str, List[Dict]] = {} # Extracted media information
|
||||
links: Dict[str, List[Dict]] = {} # Extracted links
|
||||
metadata: Optional[dict] = None # Page metadata
|
||||
|
||||
# Additional Data
|
||||
screenshot: Optional[str] = None # Base64 encoded screenshot
|
||||
session_id: Optional[str] = None # Session identifier
|
||||
response_headers: Optional[dict] = None # HTTP response headers
|
||||
url: str
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
downloaded_files: Optional[List[str]] = None
|
||||
screenshot: Optional[str] = None
|
||||
pdf: Optional[bytes] = None
|
||||
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
||||
markdown_v2: Optional[MarkdownGenerationResult] = None
|
||||
fit_markdown: Optional[str] = None
|
||||
fit_html: Optional[str] = None
|
||||
extracted_content: Optional[str] = None
|
||||
metadata: Optional[dict] = None
|
||||
error_message: Optional[str] = None
|
||||
session_id: Optional[str] = None
|
||||
response_headers: Optional[dict] = None
|
||||
status_code: Optional[int] = None
|
||||
ssl_certificate: Optional[SSLCertificate] = None
|
||||
...
|
||||
```
|
||||
|
||||
## Properties and Their Data Structures
|
||||
Below is a **field-by-field** explanation and possible usage patterns.
|
||||
|
||||
### Basic Information
|
||||
---
|
||||
|
||||
## 1. Basic Crawl Info
|
||||
|
||||
### 1.1 **`url`** *(str)*
|
||||
**What**: The final crawled URL (after any redirects).
|
||||
**Usage**:
|
||||
```python
|
||||
# Access basic information
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
print(result.url) # "https://example.com"
|
||||
print(result.success) # True/False
|
||||
print(result.status_code) # 200, 404, etc.
|
||||
print(result.error_message) # Error details if failed
|
||||
print(result.url) # e.g., "https://example.com/"
|
||||
```
|
||||
|
||||
### Content Properties
|
||||
|
||||
#### HTML Content
|
||||
### 1.2 **`success`** *(bool)*
|
||||
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
|
||||
**Usage**:
|
||||
```python
|
||||
# Raw HTML
|
||||
html_content = result.html
|
||||
|
||||
# Cleaned HTML (removed ads, popups, etc.)
|
||||
clean_content = result.cleaned_html
|
||||
|
||||
# Most relevant HTML content
|
||||
main_content = result.fit_html
|
||||
if not result.success:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
```
|
||||
|
||||
#### Markdown Content
|
||||
### 1.3 **`status_code`** *(Optional[int])*
|
||||
**What**: The page’s HTTP status code (e.g., 200, 404).
|
||||
**Usage**:
|
||||
```python
|
||||
# Full markdown version
|
||||
markdown_content = result.markdown
|
||||
|
||||
# Most relevant markdown content
|
||||
main_content = result.fit_markdown
|
||||
if result.status_code == 404:
|
||||
print("Page not found!")
|
||||
```
|
||||
|
||||
### Media Content
|
||||
|
||||
The media dictionary contains organized media elements:
|
||||
|
||||
### 1.4 **`error_message`** *(Optional[str])*
|
||||
**What**: If `success=False`, a textual description of the failure.
|
||||
**Usage**:
|
||||
```python
|
||||
# Structure
|
||||
media = {
|
||||
"images": [
|
||||
{
|
||||
"src": str, # Image URL
|
||||
"alt": str, # Alt text
|
||||
"desc": str, # Contextual description
|
||||
"score": float, # Relevance score (0-10)
|
||||
"type": str, # "image"
|
||||
"width": int, # Image width (if available)
|
||||
"height": int, # Image height (if available)
|
||||
"context": str, # Surrounding text
|
||||
"lazy": bool # Whether image was lazy-loaded
|
||||
}
|
||||
],
|
||||
"videos": [
|
||||
{
|
||||
"src": str, # Video URL
|
||||
"type": str, # "video"
|
||||
"title": str, # Video title
|
||||
"poster": str, # Thumbnail URL
|
||||
"duration": str, # Video duration
|
||||
"description": str # Video description
|
||||
}
|
||||
],
|
||||
"audios": [
|
||||
{
|
||||
"src": str, # Audio URL
|
||||
"type": str, # "audio"
|
||||
"title": str, # Audio title
|
||||
"duration": str, # Audio duration
|
||||
"description": str # Audio description
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Example usage
|
||||
for image in result.media["images"]:
|
||||
if image["score"] > 5: # High-relevance images
|
||||
print(f"High-quality image: {image['src']}")
|
||||
print(f"Context: {image['context']}")
|
||||
if not result.success:
|
||||
print("Error:", result.error_message)
|
||||
```
|
||||
|
||||
### Link Analysis
|
||||
|
||||
The links dictionary organizes discovered links:
|
||||
|
||||
### 1.5 **`session_id`** *(Optional[str])*
|
||||
**What**: The ID used for reusing a browser context across multiple calls.
|
||||
**Usage**:
|
||||
```python
|
||||
# Structure
|
||||
links = {
|
||||
"internal": [
|
||||
{
|
||||
"href": str, # URL
|
||||
"text": str, # Link text
|
||||
"title": str, # Title attribute
|
||||
"type": str, # Link type (nav, content, etc.)
|
||||
"context": str, # Surrounding text
|
||||
"score": float # Relevance score
|
||||
}
|
||||
],
|
||||
"external": [
|
||||
{
|
||||
"href": str, # External URL
|
||||
"text": str, # Link text
|
||||
"title": str, # Title attribute
|
||||
"domain": str, # Domain name
|
||||
"type": str, # Link type
|
||||
"context": str # Surrounding text
|
||||
}
|
||||
]
|
||||
}
|
||||
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
|
||||
print("Session:", result.session_id)
|
||||
```
|
||||
|
||||
# Example usage
|
||||
### 1.6 **`response_headers`** *(Optional[dict])*
|
||||
**What**: Final HTTP response headers.
|
||||
**Usage**:
|
||||
```python
|
||||
if result.response_headers:
|
||||
print("Server:", result.response_headers.get("Server", "Unknown"))
|
||||
```
|
||||
|
||||
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
|
||||
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc.
|
||||
**Usage**:
|
||||
```python
|
||||
if result.ssl_certificate:
|
||||
print("Issuer:", result.ssl_certificate.issuer)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Raw / Cleaned Content
|
||||
|
||||
### 2.1 **`html`** *(str)*
|
||||
**What**: The **original** unmodified HTML from the final page load.
|
||||
**Usage**:
|
||||
```python
|
||||
# Possibly large
|
||||
print(len(result.html))
|
||||
```
|
||||
|
||||
### 2.2 **`cleaned_html`** *(Optional[str])*
|
||||
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
|
||||
**Usage**:
|
||||
```python
|
||||
print(result.cleaned_html[:500]) # Show a snippet
|
||||
```
|
||||
|
||||
### 2.3 **`fit_html`** *(Optional[str])*
|
||||
**What**: The “fit” (post-filter) version of the HTML, produced when a **content filter** or heuristic (e.g., Pruning/BM25) modifies it.
|
||||
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
|
||||
**Usage**:
|
||||
```python
|
||||
if result.fit_html:
|
||||
print("High-value HTML content:", result.fit_html[:300])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Markdown Fields
|
||||
|
||||
### 3.1 The Markdown Generation Approach
|
||||
|
||||
Crawl4AI can convert HTML→Markdown, optionally including:
|
||||
|
||||
- **Raw** markdown
|
||||
- **Links as citations** (with a references section)
|
||||
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
|
||||
|
||||
### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*
|
||||
**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.
|
||||
|
||||
**`MarkdownGenerationResult`** includes:
|
||||
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
|
||||
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
|
||||
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
|
||||
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
|
||||
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
if result.markdown_v2:
|
||||
md_res = result.markdown_v2
|
||||
print("Raw MD:", md_res.raw_markdown[:300])
|
||||
print("Citations MD:", md_res.markdown_with_citations[:300])
|
||||
print("References:", md_res.references_markdown)
|
||||
if md_res.fit_markdown:
|
||||
print("Pruned text:", md_res.fit_markdown[:300])
|
||||
```
|
||||
|
||||
### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
|
||||
**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.
|
||||
**Usage**:
|
||||
```python
|
||||
# Soon, you might see:
|
||||
if isinstance(result.markdown, MarkdownGenerationResult):
|
||||
print(result.markdown.raw_markdown[:200])
|
||||
else:
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### 3.4 **`fit_markdown`** *(Optional[str])*
|
||||
**What**: A direct reference to the final filtered markdown (legacy approach).
|
||||
**When**: This is set if a filter or content strategy explicitly writes there. It is usually superseded by `markdown_v2.fit_markdown`.
|
||||
**Usage**:
|
||||
```python
|
||||
print(result.fit_markdown) # Legacy field, prefer result.markdown_v2.fit_markdown
|
||||
```
|
||||
|
||||
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
|
||||
|
||||
---
|
||||
|
||||
## 4. Media & Links
|
||||
|
||||
### 4.1 **`media`** *(Dict[str, List[Dict]])*
|
||||
**What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.
|
||||
**Common Fields** in each item:
|
||||
|
||||
- `src` *(str)*: Media URL
|
||||
- `alt` or `title` *(str)*: Descriptive text
|
||||
- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”
|
||||
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
images = result.media.get("images", [])
|
||||
for img in images:
|
||||
if img.get("score", 0) > 5:
|
||||
print("High-value image:", img["src"])
|
||||
```
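
Since `score` may be missing on some entries, a reusable filter is safer than indexing directly. A minimal sketch operating on a plain dictionary shaped like `result.media`; `top_images` and the default threshold of 5 are illustrative choices, not library behavior:

```python
def top_images(media, min_score=5, limit=None):
    """Return image entries scoring above `min_score`, highest score first."""
    images = media.get("images", [])
    # Treat a missing or None score as 0 so such entries are filtered out safely.
    selected = [img for img in images if (img.get("score") or 0) > min_score]
    selected.sort(key=lambda img: img.get("score") or 0, reverse=True)
    return selected if limit is None else selected[:limit]
```

Calling `top_images(result.media, limit=3)` would then yield at most three of the highest-scoring images.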
|
||||
|
||||
### 4.2 **`links`** *(Dict[str, List[Dict]])*
|
||||
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
|
||||
**Common Fields**:
|
||||
|
||||
- `href` *(str)*: The link target
|
||||
- `text` *(str)*: Link text
|
||||
- `title` *(str)*: Title attribute
|
||||
- `context` *(str)*: Surrounding text snippet
|
||||
- `domain` *(str)*: If external, the domain
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
for link in result.links["internal"]:
|
||||
print(f"Internal link: {link['href']}")
|
||||
print(f"Context: {link['context']}")
|
||||
print(f"Internal link to {link['href']} with text {link['text']}")
|
||||
```
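
The `domain` field on external links makes it easy to summarize where a page points. A minimal sketch over a dictionary shaped like `result.links`; it falls back to parsing `href` when `domain` is absent, and `external_domains` is a hypothetical helper name:

```python
from collections import Counter
from urllib.parse import urlparse


def external_domains(links):
    """Count external links per domain, using `domain` when present, else parsing `href`."""
    counts = Counter()
    for link in links.get("external", []):
        domain = link.get("domain") or urlparse(link.get("href", "")).netloc
        if domain:
            counts[domain] += 1
    return counts
```

`external_domains(result.links).most_common(5)` would list the five most frequently linked external domains.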
|
||||
|
||||
### Metadata
|
||||
---
|
||||
|
||||
The metadata dictionary contains page information:
|
||||
## 5. Additional Fields
|
||||
|
||||
### 5.1 **`extracted_content`** *(Optional[str])*
|
||||
**What**: The structured output (a JSON string) produced if you used an **`extraction_strategy`** (CSS, LLM, etc.).
|
||||
**Usage**:
|
||||
```python
|
||||
# Structure
|
||||
metadata = {
|
||||
"title": str, # Page title
|
||||
"description": str, # Meta description
|
||||
"keywords": List[str], # Meta keywords
|
||||
"author": str, # Author information
|
||||
"published_date": str, # Publication date
|
||||
"modified_date": str, # Last modified date
|
||||
"language": str, # Page language
|
||||
"canonical_url": str, # Canonical URL
|
||||
"og_data": Dict, # Open Graph data
|
||||
"twitter_data": Dict # Twitter card data
|
||||
}
|
||||
|
||||
# Example usage
|
||||
if result.metadata:
|
||||
print(f"Title: {result.metadata['title']}")
|
||||
print(f"Author: {result.metadata.get('author', 'Unknown')}")
|
||||
```
|
||||
|
||||
### Extracted Content
|
||||
|
||||
Content from extraction strategies:
|
||||
|
||||
```python
|
||||
# For LLM or CSS extraction strategies
|
||||
if result.extracted_content:
|
||||
structured_data = json.loads(result.extracted_content)
|
||||
print(structured_data)
|
||||
data = json.loads(result.extracted_content)
|
||||
print(data)
|
||||
```
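
`json.loads` raises on malformed input, and a strategy may return either a single object or a list. A minimal defensive sketch, assuming only that `extracted_content` is an optional JSON string as described above; `parse_extracted` is a hypothetical helper:

```python
import json


def parse_extracted(extracted_content):
    """Return the extracted data as a list of items, or [] if missing or malformed."""
    if not extracted_content:
        return []
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return []
    # Normalize a single object to a one-element list for uniform handling.
    return data if isinstance(data, list) else [data]
```

Returning a list in every case lets callers iterate without first checking the shape of the payload.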
|
||||
|
||||
### Screenshot
|
||||
|
||||
Base64 encoded screenshot:
|
||||
|
||||
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
|
||||
**What**: If `accept_downloads=True` and a `downloads_path` are set in your `BrowserConfig`, this lists the local file paths of downloaded items.
|
||||
**Usage**:
|
||||
```python
|
||||
# Save screenshot if available
|
||||
if result.downloaded_files:
|
||||
for file_path in result.downloaded_files:
|
||||
print("Downloaded:", file_path)
|
||||
```
|
||||
|
||||
### 5.3 **`screenshot`** *(Optional[str])*
|
||||
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
|
||||
**Usage**:
|
||||
```python
|
||||
import base64
|
||||
if result.screenshot:
|
||||
import base64
|
||||
|
||||
# Decode and save
|
||||
with open("screenshot.png", "wb") as f:
|
||||
with open("page.png", "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
```
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Basic Content Access
|
||||
### 5.4 **`pdf`** *(Optional[bytes])*
|
||||
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
|
||||
**Usage**:
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
if result.pdf:
|
||||
with open("page.pdf", "wb") as f:
|
||||
f.write(result.pdf)
|
||||
```
|
||||
|
||||
### 5.5 **`metadata`** *(Optional[dict])*
|
||||
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
|
||||
**Usage**:
|
||||
```python
|
||||
if result.metadata:
|
||||
print("Title:", result.metadata.get("title"))
|
||||
print("Author:", result.metadata.get("author"))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Example: Accessing Everything
|
||||
|
||||
```python
|
||||
async def handle_result(result: CrawlResult):
|
||||
if not result.success:
|
||||
print("Crawl error:", result.error_message)
|
||||
return
|
||||
|
||||
if result.success:
|
||||
# Get clean content
|
||||
print(result.fit_markdown)
|
||||
|
||||
# Process images
|
||||
for image in result.media["images"]:
|
||||
if image["score"] > 7:
|
||||
print(f"High-quality image: {image['src']}")
|
||||
# Basic info
|
||||
print("Crawled URL:", result.url)
|
||||
print("Status code:", result.status_code)
|
||||
|
||||
# HTML
|
||||
print("Original HTML size:", len(result.html))
|
||||
print("Cleaned HTML size:", len(result.cleaned_html or ""))
|
||||
|
||||
# Markdown output
|
||||
if result.markdown_v2:
|
||||
print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
|
||||
print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
|
||||
if result.markdown_v2.fit_markdown:
|
||||
print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
|
||||
else:
|
||||
print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
|
||||
|
||||
# Media & Links
|
||||
if "images" in result.media:
|
||||
print("Image count:", len(result.media["images"]))
|
||||
if "internal" in result.links:
|
||||
print("Internal link count:", len(result.links["internal"]))
|
||||
|
||||
# Extraction strategy result
|
||||
if result.extracted_content:
|
||||
print("Structured data:", result.extracted_content)
|
||||
|
||||
# Screenshot/PDF
|
||||
if result.screenshot:
|
||||
print("Screenshot length:", len(result.screenshot))
|
||||
if result.pdf:
|
||||
print("PDF bytes length:", len(result.pdf))
|
||||
```
|
||||
|
||||
### Complete Data Processing
|
||||
```python
|
||||
async def process_webpage(url: str) -> Dict:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=url)
|
||||
|
||||
if not result.success:
|
||||
raise Exception(f"Crawl failed: {result.error_message}")
|
||||
|
||||
return {
|
||||
"content": result.fit_markdown,
|
||||
"images": [
|
||||
img for img in result.media["images"]
|
||||
if img["score"] > 5
|
||||
],
|
||||
"internal_links": [
|
||||
link["href"] for link in result.links["internal"]
|
||||
],
|
||||
"metadata": result.metadata,
|
||||
"status": result.status_code
|
||||
}
|
||||
```
|
||||
---
|
||||
|
||||
### Error Handling
|
||||
```python
|
||||
async def safe_crawl(url: str) -> Dict:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
try:
|
||||
result = await crawler.arun(url=url)
|
||||
|
||||
if not result.success:
|
||||
return {
|
||||
"success": False,
|
||||
"error": result.error_message,
|
||||
"status": result.status_code
|
||||
}
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"content": result.fit_markdown,
|
||||
"status": result.status_code
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"success": False,
|
||||
"error": str(e),
|
||||
"status": None
|
||||
}
|
||||
```
|
||||
## 7. Key Points & Future
|
||||
|
||||
## Best Practices
|
||||
1. **`markdown_v2` vs `markdown`**
|
||||
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
|
||||
- In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
|
||||
|
||||
1. **Always Check Success**
|
||||
```python
|
||||
if not result.success:
|
||||
print(f"Error: {result.error_message}")
|
||||
return
|
||||
```
|
||||
2. **Fit Content**
|
||||
- **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
|
||||
- If no filter is used, they remain `None`.
|
||||
|
||||
2. **Use fit_markdown for Articles**
|
||||
```python
|
||||
# Better for article content
|
||||
content = result.fit_markdown if result.fit_markdown else result.markdown
|
||||
```
|
||||
3. **References & Citations**
|
||||
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This is useful when feeding large language models or when you need academic-style referencing.
|
||||
|
||||
3. **Filter Media by Score**
|
||||
```python
|
||||
relevant_images = [
|
||||
img for img in result.media["images"]
|
||||
if img["score"] > 5
|
||||
]
|
||||
```
|
||||
4. **Links & Media**
|
||||
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
|
||||
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
|
||||
|
||||
4. **Handle Missing Data**
|
||||
```python
|
||||
metadata = result.metadata or {}
|
||||
title = metadata.get('title', 'Unknown Title')
|
||||
```
|
||||
5. **Error Cases**
|
||||
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
|
||||
- `status_code` might be `None` if we failed before an HTTP response.
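
The key points above can be folded into one defensive helper. The following is a minimal sketch, not part of Crawl4AI: `summarize_result` is a hypothetical name, and the `SimpleNamespace` objects are only stand-ins for a real `CrawlResult` so the snippet runs without a browser. It assumes the attributes named above (`success`, `error_message`, `status_code`, `fit_markdown`, `markdown`, `metadata`).

```python
from types import SimpleNamespace


def summarize_result(result):
    """Return a small dict describing a crawl result (hypothetical helper)."""
    if not result.success:
        # status_code may be None when the crawl failed before any HTTP response
        return {
            "ok": False,
            "status": getattr(result, "status_code", None),
            "error": result.error_message,
        }

    # Prefer filtered markdown when a content filter produced it; otherwise fall back
    content = getattr(result, "fit_markdown", None) or result.markdown
    metadata = getattr(result, "metadata", None) or {}
    return {
        "ok": True,
        "status": getattr(result, "status_code", None),
        "title": metadata.get("title", "Unknown Title"),
        "content": content,
    }


# Stand-ins for CrawlResult objects, for illustration only
failed = SimpleNamespace(success=False, error_message="Timeout", status_code=None)
succeeded = SimpleNamespace(
    success=True,
    status_code=200,
    markdown="# Full page",
    fit_markdown=None,
    metadata=None,
)

print(summarize_result(failed))
print(summarize_result(succeeded))
```

Returning a plain dict keeps the failure and success paths in the same shape, so downstream code can branch on `"ok"` without touching the result object again.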
|
||||
|
||||
Use **`CrawlResult`** to collect the final outputs and feed them into your data pipelines, AI models, or archives. A properly configured **BrowserConfig** and **CrawlerRunConfig** together determine what ends up in **`CrawlResult`**.
|
||||
@@ -1,36 +1,226 @@
|
||||
# Parameter Reference Table
|
||||
# 1. **BrowserConfig** – Controlling the Browser
|
||||
|
||||
`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```
|
||||
|
||||
## 1.1 Parameter Highlights
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
|
||||
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
|
||||
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
|
||||
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
|
||||
| **`proxy`** | `str` (default: `None`) | Single-proxy URL if you want all traffic to go through it, e.g. `"http://user:pass@proxy:8080"`. |
|
||||
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
|
||||
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
|
||||
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
|
||||
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
|
||||
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
|
||||
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
|
||||
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
|
||||
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
|
||||
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
|
||||
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
|
||||
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
|
||||
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
|
||||
|
||||
**Tips**:
|
||||
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
|
||||
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
|
||||
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
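
As a rough illustration of the persistent-session tip, the sketch below builds the keyword arguments for such a setup. `persistent_browser_kwargs` and the profile path are hypothetical; only the parameter names (`headless`, `use_persistent_context`, `user_data_dir`) come from the table above.

```python
from pathlib import Path


def persistent_browser_kwargs(profile_dir, headless=True):
    """Keyword arguments for a BrowserConfig that keeps cookies between runs."""
    return {
        "headless": headless,
        "use_persistent_context": True,
        # A profile directory must be set for sessions to survive restarts
        "user_data_dir": str(Path(profile_dir).expanduser()),
    }


kwargs = persistent_browser_kwargs("~/.my_crawler_profile", headless=False)
print(kwargs)
# With crawl4ai installed, these would be passed as BrowserConfig(**kwargs)
```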
|
||||
|
||||
---
|
||||
|
||||
# 2. **CrawlerRunConfig** – Controlling Each Crawl
|
||||
|
||||
While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
)
```
|
||||
|
||||
## 2.1 Parameter Highlights
|
||||
|
||||
We group them by category.
|
||||
|
||||
### A) **Content Processing**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
|
||||
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
|
||||
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
|
||||
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
|
||||
| **`content_filter`** | `RelevantContentFilter` (None) | Filters out irrelevant text blocks. E.g., `PruningContentFilter` or `BM25ContentFilter`. |
|
||||
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. |
|
||||
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
|
||||
| **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. |
|
||||
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
|
||||
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
|
||||
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
|
||||
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
|
||||
|
||||
---
|
||||
|
||||
### B) **Caching & Session**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
|
||||
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
|
||||
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
|
||||
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
|
||||
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
|
||||
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
|
||||
|
||||
Use these for controlling whether you read or write from a local content cache. Handy for large batch crawls or repeated site visits.
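
To make the flag-to-mode correspondence in the table concrete, here is a small sketch. It is not the library's implementation: `CacheModeSketch` is a local stand-in that only mirrors the mode names listed above, `resolve_cache_mode` is a hypothetical helper, and the precedence between flags is an assumption made for the example.

```python
from enum import Enum


class CacheModeSketch(Enum):
    """Local stand-in mirroring the mode names in the table (not the library enum)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(bypass_cache=False, disable_cache=False,
                       no_cache_read=False, no_cache_write=False):
    """Map the boolean flags to a single mode, following the table above.

    The precedence order used here is an assumption made for this sketch.
    """
    if disable_cache:
        return CacheModeSketch.DISABLED
    if bypass_cache:
        return CacheModeSketch.BYPASS
    if no_cache_read:
        return CacheModeSketch.WRITE_ONLY  # writes cache, never reads
    if no_cache_write:
        return CacheModeSketch.READ_ONLY   # reads cache, never writes
    return CacheModeSketch.ENABLED


print(resolve_cache_mode(bypass_cache=True))
print(resolve_cache_mode(no_cache_read=True))
```

Setting `cache_mode` explicitly avoids having to reason about how several boolean flags combine.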
|
||||
|
||||
---
|
||||
|
||||
### C) **Page Navigation & Timing**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
|
||||
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`. |
|
||||
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
|
||||
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
|
||||
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
|
||||
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
|
||||
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
|
||||
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
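
The interplay of `semaphore_count`, `mean_delay`, and `max_range` can be pictured with plain `asyncio`. This is a sketch under stated assumptions, not how `arun_many()` is actually implemented: `crawl_many_sketch` and `fake_fetch` are hypothetical names, and the delay formula (base delay plus a random extra of up to `max_range` seconds) is an assumption.

```python
import asyncio
import random


async def crawl_many_sketch(urls, fetch, semaphore_count=5, mean_delay=0.1, max_range=0.3):
    """Run fetch(url) for every URL with bounded concurrency and a random pause."""
    semaphore = asyncio.Semaphore(semaphore_count)

    async def worker(url):
        async with semaphore:
            # Assumed formula: base delay plus a random extra of up to max_range seconds
            await asyncio.sleep(mean_delay + random.uniform(0, max_range))
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_fetch(url):
    # Placeholder for a real crawl call
    return f"fetched {url}"


pages = asyncio.run(
    crawl_many_sketch(
        ["https://example.com/a", "https://example.com/b"],
        fake_fetch,
        semaphore_count=1,
        mean_delay=0.0,
        max_range=0.01,
    )
)
print(pages)
```

A higher `semaphore_count` lets more crawls overlap, while larger delays spread requests out at the cost of total run time.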
|
||||
|
||||
---
|
||||
|
||||
### D) **Page Interaction**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
|
||||
| **`js_only`** | `bool` (False) | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload. |
|
||||
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
|
||||
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
|
||||
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
|
||||
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
|
||||
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
|
||||
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
|
||||
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
|
||||
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
|
||||
| **`adjust_viewport_to_content`** | `bool` (False) | Resizes viewport to match page content height. |
|
||||
|
||||
If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
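
One way to organize such a multi-step session is to build the per-step settings first and run them in order. The sketch below is illustrative only: `build_session_steps` and `run_steps` are hypothetical helpers, and it assumes `CrawlerRunConfig` accepts `session_id`, `js_code`, and `js_only` as keyword arguments, as the tables above suggest.

```python
import asyncio


def build_session_steps(session_id, js_snippets):
    """Return keyword arguments for one CrawlerRunConfig per step (hypothetical helper)."""
    steps = [{"session_id": session_id}]  # first call: normal page load
    for snippet in js_snippets:
        steps.append({
            "session_id": session_id,  # same ID, so the same tab is reused
            "js_code": snippet,
            "js_only": True,           # apply JS only, no full reload
        })
    return steps


async def run_steps(url, steps):
    # Imported lazily so the helper above can be used without crawl4ai installed
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    results = []
    async with AsyncWebCrawler() as crawler:
        for kwargs in steps:
            results.append(await crawler.arun(url=url, config=CrawlerRunConfig(**kwargs)))
    return results


steps = build_session_steps(
    "my_session",
    ["document.querySelector('button')?.click();"],
)
print(steps)
# To execute against a real page: asyncio.run(run_steps("https://example.com", steps))
```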
|
||||
|
||||
---
|
||||
|
||||
### E) **Media Handling**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
|
||||
| **`screenshot`** | `bool` (False) | Capture a screenshot (base64) in `result.screenshot`. |
|
||||
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
|
||||
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
|
||||
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
|
||||
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image’s alt text or description to be considered valid. |
|
||||
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
|
||||
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
|
||||
|
||||
---
|
||||
|
||||
### F) **Link/Domain Handling**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output. |
|
||||
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
|
||||
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
|
||||
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
|
||||
|
||||
Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
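
The effect of a domain exclusion list can be illustrated with a few lines of standard-library Python applied to already-extracted links. This is a post-processing sketch, not the crawler's own filtering code: `drop_excluded_links` is a hypothetical helper, and the `"href"` key on each link entry is an assumption about the shape of the data.

```python
from urllib.parse import urlparse


def drop_excluded_links(links, exclude_domains):
    """Remove links whose host is, or is a subdomain of, an excluded domain."""
    blocked = {d.lower() for d in exclude_domains}

    def is_blocked(href):
        host = (urlparse(href).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in blocked)

    return [link for link in links if not is_blocked(link["href"])]


# Sample data shaped like an assumed list of external links
external = [
    {"href": "https://ads.com/banner"},
    {"href": "https://cdn.trackers.io/pixel"},
    {"href": "https://docs.python.org/3/"},
]
kept = drop_excluded_links(external, ["ads.com", "trackers.io"])
print([link["href"] for link in kept])
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives when an excluded domain merely appears in a path or query string.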
|
||||
|
||||
---
|
||||
|
||||
### G) **Debug & Logging**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|----------------|--------------------|---------------------------------------------------------------------------|
|
||||
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
|
||||
| **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.|
|
||||
|
||||
---
|
||||
|
||||
## 2.2 Example Usage
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**What’s Happening**:
|
||||
- **`text_mode=True`** avoids loading images and other heavy resources, speeding up the crawl.
|
||||
- We bypass the cache (`cache_mode=CacheMode.BYPASS`) to always fetch fresh content.
|
||||
- We only keep `main.article` content by specifying `css_selector="main.article"`.
|
||||
- We exclude external links (`exclude_external_links=True`).
|
||||
- We do a quick screenshot (`screenshot=True`) before finishing.
|
||||
|
||||
---
|
||||
|
||||
## 3. Putting It All Together
|
||||
|
||||
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
|
||||
- **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
|
||||
- **Pass** the `BrowserConfig` to `AsyncWebCrawler` and the `CrawlerRunConfig` to `arun()`.
|
||||
|
||||
| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|
||||
|-----------|---------------|------------|----------------|-------------|
|
||||
| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
|
||||
| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
|
||||
| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
|
||||
| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
|
||||
| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
|
||||
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
|
||||
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
|
||||
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
|
||||
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
|
||||
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
|
||||
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
|
||||
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
|
||||
| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
|
||||
| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
|
||||
| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
|
||||
| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
|
||||
| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
|
||||
| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
|
||||
| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
|
||||
| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
|
||||
| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
|
||||
| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
|
||||
| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
|
||||
| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
|
||||
| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
|
||||
| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
|
||||
| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
|
||||
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
|
||||
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
|
||||
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
|
||||
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
|
||||
| async_webcrawler.py | cache_mode | `kwargs.get("cache_mode", CacheMode.ENABLED)` | AsyncWebCrawler | Cache handling mode for request |
|
||||
@@ -218,12 +218,12 @@ result = await crawler.arun(
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose the Right Strategy**
|
||||
1. **Choose the Right Strategy**
|
||||
- Use `LLMExtractionStrategy` for complex, unstructured content
|
||||
- Use `JsonCssExtractionStrategy` for well-structured HTML
|
||||
- Use `CosineStrategy` for content similarity and clustering
|
||||
|
||||
2. **Optimize Chunking**
|
||||
2. **Optimize Chunking**
|
||||
```python
|
||||
# For long documents
|
||||
strategy = LLMExtractionStrategy(
|
||||
@@ -232,7 +232,7 @@ result = await crawler.arun(
|
||||
)
|
||||
```
|
||||
|
||||
3. **Handle Errors**
|
||||
3. **Handle Errors**
|
||||
```python
|
||||
try:
|
||||
result = await crawler.arun(
|
||||
@@ -245,7 +245,7 @@ result = await crawler.arun(
|
||||
print(f"Extraction failed: {e}")
|
||||
```
|
||||
|
||||
4. **Monitor Performance**
|
||||
4. **Monitor Performance**
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
verbose=True, # Enable logging
|
||||
|
||||