Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style.

---

# `arun()` Parameter Guide (New Approach)

In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```

Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).

---

## 1. Core Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.cleaned_html[:500])

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.

---
## 2. Cache Control

**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:

- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```

**Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
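If you prefer the boolean shortcuts over the enum, here is a minimal sketch (assuming the flags map to the cache modes exactly as listed above):

```python
from crawl4ai import CrawlerRunConfig

# Equivalent to cache_mode=CacheMode.DISABLED: never read or write the cache
run_config = CrawlerRunConfig(
    disable_cache=True
)
```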
---

## 3. Content Processing & Selection

### 3.1 Text Processing

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```

### 3.2 Content Selection

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)
```

### 3.3 Link Handling

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```

### 3.4 Media Filtering

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```

---

## 4. Page Navigation & Timing

### 4.1 Basic Browser Flow

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```

**Key Fields**:
- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`
    e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
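A rough sketch of a multi-URL crawl combining these fields (it assumes `arun_many()` accepts a list of URLs plus the same config object; exact parameter names may vary between versions):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_config = CrawlerRunConfig(
    # JS-based wait condition instead of a CSS selector
    wait_for="js:() => document.querySelectorAll('.item').length > 10",
    mean_delay=1.0,     # average pause between requests in arun_many()
    max_range=0.5,      # random jitter added to that pause
    semaphore_count=5,  # crawl at most 5 URLs concurrently
)

async def crawl_all():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/page1", "https://example.com/page2"],
            config=run_config
        )
        for result in results:
            print(result.url, result.success)
```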
### 4.2 JavaScript Execution

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```

- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

### 4.3 Anti-Bot

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```

- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---

## 5. Session Management

**`session_id`**:
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```

If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
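For example, a rough two-step sketch (assuming the target page has a clickable “next” control and a hypothetical `.next-page-loaded` marker) might reuse the session like this:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_step_crawl():
    async with AsyncWebCrawler() as crawler:
        # Step 1: normal navigation, opening the named session
        first = await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(session_id="my_session123")
        )

        # Step 2: same tab, only run more JS (no fresh navigation)
        second = await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(
                session_id="my_session123",
                js_only=True,
                js_code="document.querySelector('.next')?.click();",
                wait_for="css:.next-page-loaded"  # hypothetical selector
            )
        )
        print(first.success, second.success)
```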
---

## 6. Screenshot, PDF & Media Options

```python
run_config = CrawlerRunConfig(
    screenshot=True,          # Grab a screenshot as base64
    screenshot_wait_for=1.0,  # Wait 1s before capturing
    pdf=True,                 # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                 # Filter out low-score images
)
```

**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
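A short sketch of persisting both artifacts to disk (assuming, as noted above, the screenshot is a base64-encoded string and `result.pdf` holds raw bytes):

```python
import base64

# after: result = await crawler.arun(url, config=run_config)
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))  # decode base64 string to image bytes

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)  # already raw bytes
```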
---

## 7. Extraction Strategy

**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:

```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```

The extracted data will appear in `result.extracted_content`.
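For instance, assuming the strategy serializes its output as a JSON string (as the CSS- and LLM-based strategies typically do), you can load it back into Python objects:

```python
import json

# after: result = await crawler.arun(url, config=run_config)
if result.extracted_content:
    data = json.loads(result.extracted_content)  # e.g. a list of extracted records
    print(len(data), "items extracted")
```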
---

## 8. Comprehensive Example

Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.

---

## 9. Best Practices

1. **Use `BrowserConfig` for global browser settings** (headless, user agent); see the sketch after this list.
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs—especially in a large codebase with multiple crawls.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
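As referenced in points 1–2, here is a rough sketch of the split between the two config objects (assuming standard `headless` and `user_agent` fields on `BrowserConfig`):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Global browser settings: created once, reused for every crawl
browser_config = BrowserConfig(
    headless=True,
    user_agent="MyCrawler/1.0"
)

# Per-crawl settings: vary these per page or task
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.content"
)

async def crawl():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun("https://example.com", config=run_config)
```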
---

## 10. Conclusion

All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:

- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.

For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).

Happy crawling with your **structured, flexible** config approach!