Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style.

---

# `arun()` Parameter Guide (New Approach)

In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```

Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).

---

## 1. Core Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.cleaned_html[:500])

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.

---
## 2. Cache Control

**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:

- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```

**Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
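If you prefer the boolean shortcuts over the enum, here is a minimal sketch (assuming the flags map to the cache modes exactly as listed above):

```python
from crawl4ai import CrawlerRunConfig

# Equivalent to cache_mode=CacheMode.DISABLED: never read or write the cache
run_config = CrawlerRunConfig(
    disable_cache=True
)
```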
---

## 3. Content Processing & Selection

### 3.1 Text Processing

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```

### 3.2 Content Selection

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)
```

### 3.3 Link Handling

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```

### 3.4 Media Filtering

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```

---

## 4. Page Navigation & Timing

### 4.1 Basic Browser Flow

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```

**Key Fields**:
- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`
    e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
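A rough sketch of a multi-URL crawl combining these fields (it assumes `arun_many()` accepts a list of URLs plus the same config object; exact parameter names may vary between versions):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_config = CrawlerRunConfig(
    # JS-based wait condition instead of a CSS selector
    wait_for="js:() => document.querySelectorAll('.item').length > 10",
    mean_delay=1.0,     # average pause between requests in arun_many()
    max_range=0.5,      # random jitter added to that pause
    semaphore_count=5,  # crawl at most 5 URLs concurrently
)

async def crawl_all():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/page1", "https://example.com/page2"],
            config=run_config
        )
        for result in results:
            print(result.url, result.success)
```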
### 4.2 JavaScript Execution

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```

- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

### 4.3 Anti-Bot

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```

- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---

## 5. Session Management

**`session_id`**:
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```

If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
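For example, a rough two-step sketch (assuming the target page has a clickable “next” control and a hypothetical `.next-page-loaded` marker) might reuse the session like this:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_step_crawl():
    async with AsyncWebCrawler() as crawler:
        # Step 1: normal navigation, opening the named session
        first = await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(session_id="my_session123")
        )

        # Step 2: same tab, only run more JS (no fresh navigation)
        second = await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(
                session_id="my_session123",
                js_only=True,
                js_code="document.querySelector('.next')?.click();",
                wait_for="css:.next-page-loaded"  # hypothetical selector
            )
        )
        print(first.success, second.success)
```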
---

## 6. Screenshot, PDF & Media Options

```python
run_config = CrawlerRunConfig(
    screenshot=True,          # Grab a screenshot as base64
    screenshot_wait_for=1.0,  # Wait 1s before capturing
    pdf=True,                 # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                 # Filter out low-score images
)
```

**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
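A short sketch of persisting both artifacts to disk (assuming, as noted above, the screenshot is a base64-encoded string and `result.pdf` holds raw bytes):

```python
import base64

# after: result = await crawler.arun(url, config=run_config)
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))  # decode base64 string to image bytes

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)  # already raw bytes
```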
---

## 7. Extraction Strategy

**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:

```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```

The extracted data will appear in `result.extracted_content`.
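For instance, assuming the strategy serializes its output as a JSON string (as the CSS- and LLM-based strategies typically do), you can load it back into Python objects:

```python
import json

# after: result = await crawler.arun(url, config=run_config)
if result.extracted_content:
    data = json.loads(result.extracted_content)  # e.g. a list of extracted records
    print(len(data), "items extracted")
```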
---

## 8. Comprehensive Example

Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.

---

## 9. Best Practices

1. **Use `BrowserConfig` for global browser settings** (headless, user agent); see the sketch after this list.
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs—especially in a large codebase with multiple crawls.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
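As referenced in points 1–2, here is a rough sketch of the split between the two config objects (assuming standard `headless` and `user_agent` fields on `BrowserConfig`):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Global browser settings: created once, reused for every crawl
browser_config = BrowserConfig(
    headless=True,
    user_agent="MyCrawler/1.0"
)

# Per-crawl settings: vary these per page or task
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.content"
)

async def crawl():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun("https://example.com", config=run_config)
```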
---

## 10. Conclusion

All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:

- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.

For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).

Happy crawling with your **structured, flexible** config approach!