docs(api): improve formatting and readability of API documentation
Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files:

- arun.md
- arun_many.md
- async-webcrawler.md
- parameters.md

Changes include:

- Consistent list formatting and indentation
- Better spacing between sections
- Clearer separation of content blocks
- Fixed quotation marks and code block formatting
@@ -1,6 +1,6 @@
|
|||||||
# `arun()` Parameter Guide (New Approach)
|
# `arun()` Parameter Guide (New Approach)
|
||||||
|
|
||||||
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
|
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
await crawler.arun(
|
await crawler.arun(
|
||||||
@@ -9,11 +9,11 @@ await crawler.arun(
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
|
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 1. Core Usage
|
## 1. Core Usage
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||||
@@ -23,7 +23,7 @@ async def main():
|
|||||||
verbose=True, # Detailed logging
|
verbose=True, # Detailed logging
|
||||||
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
|
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
|
||||||
check_robots_txt=True, # Respect robots.txt rules
|
check_robots_txt=True, # Respect robots.txt rules
|
||||||
# ... other parameters
|
# ... other parameters
|
||||||
)
|
)
|
||||||
|
|
||||||
async with AsyncWebCrawler() as crawler:
|
async with AsyncWebCrawler() as crawler:
|
||||||
@@ -38,15 +38,16 @@ async def main():
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Key Fields**:
|
**Key Fields**:
|
||||||
- `verbose=True` logs each crawl step.
|
- `verbose=True` logs each crawl step.
|
||||||
- `cache_mode` decides how to read/write the local crawl cache.
|
- `cache_mode` decides how to read/write the local crawl cache.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 2. Cache Control
|
## 2. Cache Control
|
||||||
|
|
||||||
**`cache_mode`** (default: `CacheMode.ENABLED`)
|
**`cache_mode`** (default: `CacheMode.ENABLED`)
|
||||||
Use a built-in enum from `CacheMode`:
|
Use a built-in enum from `CacheMode`:
|
||||||
|
|
||||||
- `ENABLED`: Normal caching—reads if available, writes if missing.
|
- `ENABLED`: Normal caching—reads if available, writes if missing.
|
||||||
- `DISABLED`: No caching—always refetch pages.
|
- `DISABLED`: No caching—always refetch pages.
|
||||||
- `READ_ONLY`: Reads from cache only; no new writes.
|
- `READ_ONLY`: Reads from cache only; no new writes.
|
||||||
@@ -60,6 +61,7 @@ run_config = CrawlerRunConfig(
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Additional flags**:
|
**Additional flags**:
|
||||||
|
|
||||||
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
|
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
|
||||||
- `disable_cache=True` acts like `CacheMode.DISABLED`.
|
- `disable_cache=True` acts like `CacheMode.DISABLED`.
|
||||||
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
|
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
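
The legacy flags listed above can be read as shorthand for `CacheMode` members. The sketch below records that mapping as plain data. `LEGACY_FLAG_TO_MODE` and `resolve_cache_mode` are hypothetical names used only for illustration, not part of Crawl4AI, and the precedence between conflicting flags is an assumption.

```python
# Mapping taken from the list above: legacy flag -> CacheMode member name.
LEGACY_FLAG_TO_MODE = {
    "bypass_cache": "BYPASS",
    "disable_cache": "DISABLED",
    "no_cache_read": "WRITE_ONLY",
}


def resolve_cache_mode(default="ENABLED", **flags):
    """Return the CacheMode member name implied by the legacy flags.

    Hypothetical helper: the first flag set to True wins, in the order
    listed in LEGACY_FLAG_TO_MODE. Crawl4AI's own precedence may differ.
    """
    for flag, mode in LEGACY_FLAG_TO_MODE.items():
        if flags.get(flag):
            return mode
    return default
```

With the library installed, `getattr(CacheMode, resolve_cache_mode(bypass_cache=True))` would give `CacheMode.BYPASS`, which can then be passed as `cache_mode=` to `CrawlerRunConfig`.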
|
||||||
@@ -67,7 +69,7 @@ run_config = CrawlerRunConfig(
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 3. Content Processing & Selection
|
## 3. Content Processing & Selection
|
||||||
|
|
||||||
### 3.1 Text Processing
|
### 3.1 Text Processing
|
||||||
|
|
||||||
@@ -111,7 +113,7 @@ run_config = CrawlerRunConfig(
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 4. Page Navigation & Timing
|
## 4. Page Navigation & Timing
|
||||||
|
|
||||||
### 4.1 Basic Browser Flow
|
### 4.1 Basic Browser Flow
|
||||||
|
|
||||||
@@ -124,12 +126,13 @@ run_config = CrawlerRunConfig(
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Key Fields**:
|
**Key Fields**:
|
||||||
|
|
||||||
- `wait_for`:
|
- `wait_for`:
|
||||||
- `"css:selector"` or
|
- `"css:selector"` or
|
||||||
- `"js:() => boolean"`
|
- `"js:() => boolean"`
|
||||||
e.g. `js:() => document.querySelectorAll('.item').length > 10`.
|
e.g. `js:() => document.querySelectorAll('.item').length > 10`.
|
||||||
|
|
||||||
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
|
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
|
||||||
- `semaphore_count`: concurrency limit when crawling multiple URLs.
|
- `semaphore_count`: concurrency limit when crawling multiple URLs.
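
The two `wait_for` forms described above are plain strings with a `css:` or `js:` prefix. The following sketch builds them with two small helpers. `css_wait` and `js_wait` are hypothetical names introduced here, not Crawl4AI functions.

```python
def css_wait(selector: str) -> str:
    """Build a wait_for value that waits for a CSS selector to match."""
    return f"css:{selector}"


def js_wait(expression: str) -> str:
    """Build a wait_for value from a JavaScript boolean expression."""
    return f"js:() => {expression}"


# The second value reproduces the example given in the list above.
wait_for_block = css_wait(".loaded-block")
wait_for_items = js_wait("document.querySelectorAll('.item').length > 10")
```

Either string can be passed as `wait_for=` when constructing a `CrawlerRunConfig`.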
|
||||||
|
|
||||||
### 4.2 JavaScript Execution
|
### 4.2 JavaScript Execution
|
||||||
@@ -144,7 +147,7 @@ run_config = CrawlerRunConfig(
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
- `js_code` can be a single string or a list of strings.
|
- `js_code` can be a single string or a list of strings.
|
||||||
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
|
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
|
||||||
|
|
||||||
### 4.3 Anti-Bot
|
### 4.3 Anti-Bot
|
||||||
@@ -156,13 +159,13 @@ run_config = CrawlerRunConfig(
|
|||||||
override_navigator=True
|
override_navigator=True
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
- `magic=True` tries multiple stealth features.
|
- `magic=True` tries multiple stealth features.
|
||||||
- `simulate_user=True` mimics mouse movements or random delays.
|
- `simulate_user=True` mimics mouse movements or random delays.
|
||||||
- `override_navigator=True` fakes some `navigator` properties that sites inspect (for example, in user-agent checks).
|
- `override_navigator=True` fakes some `navigator` properties that sites inspect (for example, in user-agent checks).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 5. Session Management
|
## 5. Session Management
|
||||||
|
|
||||||
**`session_id`**:
|
**`session_id`**:
|
||||||
```python
|
```python
|
||||||
@@ -174,7 +177,7 @@ If re-used in subsequent `arun()` calls, the same tab/page context is continued
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Screenshot, PDF & Media Options
|
## 6. Screenshot, PDF & Media Options
|
||||||
|
|
||||||
```python
|
```python
|
||||||
run_config = CrawlerRunConfig(
|
run_config = CrawlerRunConfig(
|
||||||
@@ -191,7 +194,7 @@ run_config = CrawlerRunConfig(
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 7. Extraction Strategy
|
## 7. Extraction Strategy
|
||||||
|
|
||||||
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
|
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
|
||||||
|
|
||||||
@@ -205,7 +208,7 @@ The extracted data will appear in `result.extracted_content`.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 8. Comprehensive Example
|
## 8. Comprehensive Example
|
||||||
|
|
||||||
Below is a snippet combining many parameters:
|
Below is a snippet combining many parameters:
|
||||||
|
|
||||||
@@ -274,32 +277,33 @@ if __name__ == "__main__":
|
|||||||
```
|
```
|
||||||
|
|
||||||
**What we covered**:
|
**What we covered**:
|
||||||
1. **Crawling** the main content region, ignoring external links.
|
|
||||||
2. Running **JavaScript** to click “.show-more”.
|
1. **Crawling** the main content region, ignoring external links.
|
||||||
3. **Waiting** for “.loaded-block” to appear.
|
2. Running **JavaScript** to click “.show-more”.
|
||||||
4. Generating a **screenshot** & **PDF** of the final page.
|
3. **Waiting** for “.loaded-block” to appear.
|
||||||
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
|
4. Generating a **screenshot** & **PDF** of the final page.
|
||||||
|
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 9. Best Practices
|
## 9. Best Practices
|
||||||
|
|
||||||
1. **Use `BrowserConfig`** for **global** browser settings (headless, user agent).
|
1. **Use `BrowserConfig`** for **global** browser settings (headless, user agent).
|
||||||
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
|
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
|
||||||
3. Keep your **parameters consistent** across run configs, especially if you work in a large codebase with multiple crawls.
|
3. Keep your **parameters consistent** across run configs, especially if you work in a large codebase with multiple crawls.
|
||||||
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
|
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
|
||||||
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
|
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 10. Conclusion
|
## 10. Conclusion
|
||||||
|
|
||||||
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
|
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
|
||||||
|
|
||||||
- Makes code **clearer** and **more maintainable**.
|
- Makes code **clearer** and **more maintainable**.
|
||||||
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
|
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
|
||||||
- Allows you to create **reusable** config objects for different pages or tasks.
|
- Allows you to create **reusable** config objects for different pages or tasks.
|
||||||
|
|
||||||
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
|
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
|
||||||
|
|
||||||
Happy crawling with your **structured, flexible** config approach!
|
Happy crawling with your **structured, flexible** config approach!
|
||||||
@@ -1,6 +1,6 @@
|
|||||||
# `arun_many(...)` Reference
|
# `arun_many(...)` Reference
|
||||||
|
|
||||||
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
|
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
|
||||||
|
|
||||||
## Function Signature
|
## Function Signature
|
||||||
|
|
||||||
@@ -16,7 +16,7 @@ async def arun_many(
|
|||||||
|
|
||||||
:param urls: A list of URLs (or tasks) to crawl.
|
:param urls: A list of URLs (or tasks) to crawl.
|
||||||
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
|
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
|
||||||
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
|
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
|
||||||
...
|
...
|
||||||
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
|
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
|
||||||
"""
|
"""
|
||||||
@@ -24,22 +24,26 @@ async def arun_many(
|
|||||||
|
|
||||||
## Differences from `arun()`
|
## Differences from `arun()`
|
||||||
|
|
||||||
1. **Multiple URLs**:
|
1. **Multiple URLs**:
|
||||||
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
|
|
||||||
|
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
|
||||||
- The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
|
- The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
|
||||||
|
|
||||||
2. **Concurrency & Dispatchers**:
|
2. **Concurrency & Dispatchers**:
|
||||||
- **`dispatcher`** param allows advanced concurrency control.
|
|
||||||
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
|
- **`dispatcher`** param allows advanced concurrency control.
|
||||||
|
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
|
||||||
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
|
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
|
||||||
|
|
||||||
3. **Streaming Support**:
|
3. **Streaming Support**:
|
||||||
|
|
||||||
- Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
|
- Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
|
||||||
- When streaming, use `async for` to process results as they become available.
|
- When streaming, use `async for` to process results as they become available.
|
||||||
- Ideal for processing large numbers of URLs without waiting for all to complete.
|
- Ideal for processing large numbers of URLs without waiting for all to complete.
|
||||||
|
|
||||||
4. **Parallel Execution**:
|
4. **Parallel Execution**:
|
||||||
- `arun_many()` can run multiple requests concurrently under the hood.
|
|
||||||
|
- `arun_many()` can run multiple requests concurrently under the hood.
|
||||||
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
|
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
|
||||||
|
|
||||||
### Basic Example (Batch Mode)
|
### Basic Example (Batch Mode)
|
||||||
@@ -93,19 +97,19 @@ results = await crawler.arun_many(
|
|||||||
|
|
||||||
**Key Points**:
|
**Key Points**:
|
||||||
- Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
|
- Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
|
||||||
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
|
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
|
||||||
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
|
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
|
||||||
|
|
||||||
### Return Value
|
### Return Value
|
||||||
|
|
||||||
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
|
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
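
Because the return value has two possible shapes, calling code sometimes needs to accept both. The sketch below normalizes either shape into a list. `collect_results` and `_fake_stream` are hypothetical names, and the fake stream stands in for a real streaming `arun_many()` call so the example runs without a browser.

```python
import asyncio
from typing import Any, AsyncIterator, List, Union


async def collect_results(results: Union[List[Any], AsyncIterator[Any]]) -> List[Any]:
    """Normalize either return shape of arun_many() into a plain list."""
    if hasattr(results, "__aiter__"):
        # Streaming mode: drain the async generator as results arrive.
        return [item async for item in results]
    # Batch mode: the value is already a list.
    return list(results)


async def _fake_stream():
    # Stand-in for a streaming arun_many() call; yields placeholder values.
    for value in ("a", "b"):
        yield value


batch = asyncio.run(collect_results(["a", "b"]))
streamed = asyncio.run(collect_results(_fake_stream()))
```

Collecting a stream into a list gives up the main benefit of streaming. In real streaming code it is usually better to handle each result inside the `async for` loop as it arrives.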
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Dispatcher Reference
|
## Dispatcher Reference
|
||||||
|
|
||||||
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
|
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
|
||||||
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
|
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
|
||||||
|
|
||||||
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
|
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
|
||||||
|
|
||||||
@@ -113,12 +117,14 @@ For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers]
|
|||||||
|
|
||||||
## Common Pitfalls
|
## Common Pitfalls
|
||||||
|
|
||||||
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
|
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
|
||||||
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
|
|
||||||
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
|
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
|
||||||
|
|
||||||
|
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
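
The error-handling advice above can be sketched as a small partition step. `partition_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that carry only the `url`, `success`, and `error_message` attributes. Real code would pass the `CrawlResult` objects returned by `arun_many()`.

```python
from types import SimpleNamespace


def partition_results(results):
    """Split crawl results into (succeeded, failed) by their success flag."""
    succeeded, failed = [], []
    for result in results:
        (succeeded if result.success else failed).append(result)
    return succeeded, failed


# Stand-in objects with the attributes named above.
sample = [
    SimpleNamespace(url="https://example.com/a", success=True, error_message=None),
    SimpleNamespace(url="https://example.com/b", success=False, error_message="timeout"),
]
ok, bad = partition_results(sample)
for result in bad:
    print(f"{result.url} failed: {result.error_message}")
```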
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Conclusion
|
## Conclusion
|
||||||
|
|
||||||
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
|
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
|
||||||
@@ -1,16 +1,20 @@
|
|||||||
# AsyncWebCrawler
|
# AsyncWebCrawler
|
||||||
|
|
||||||
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
|
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
|
||||||
|
|
||||||
**Recommended usage**:
|
**Recommended usage**:
|
||||||
1. **Create** a `BrowserConfig` for global browser settings.
|
|
||||||
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
|
1. **Create** a `BrowserConfig` for global browser settings.
|
||||||
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
|
|
||||||
|
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
|
||||||
|
|
||||||
|
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
|
||||||
|
|
||||||
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
|
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 1. Constructor Overview
|
## 1. Constructor Overview
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class AsyncWebCrawler:
|
class AsyncWebCrawler:
|
||||||
@@ -37,7 +41,7 @@ class AsyncWebCrawler:
|
|||||||
base_directory:
|
base_directory:
|
||||||
Folder for storing caches/logs (if relevant).
|
Folder for storing caches/logs (if relevant).
|
||||||
thread_safe:
|
thread_safe:
|
||||||
If True, attempts some concurrency safeguards. Usually False.
|
If True, attempts some concurrency safeguards. Usually False.
|
||||||
**kwargs:
|
**kwargs:
|
||||||
Additional legacy or debugging parameters.
|
Additional legacy or debugging parameters.
|
||||||
"""
|
"""
|
||||||
@@ -58,11 +62,12 @@ crawler = AsyncWebCrawler(config=browser_cfg)
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Notes**:
|
**Notes**:
|
||||||
|
|
||||||
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
|
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 2. Lifecycle: Start/Close or Context Manager
|
## 2. Lifecycle: Start/Close or Context Manager
|
||||||
|
|
||||||
### 2.1 Context Manager (Recommended)
|
### 2.1 Context Manager (Recommended)
|
||||||
|
|
||||||
@@ -90,7 +95,7 @@ Use this style if you have a **long-running** application or need full control o
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 3. Primary Method: `arun()`
|
## 3. Primary Method: `arun()`
|
||||||
|
|
||||||
```python
|
```python
|
||||||
async def arun(
|
async def arun(
|
||||||
@@ -130,7 +135,7 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 4. Batch Processing: `arun_many()`
|
## 4. Batch Processing: `arun_many()`
|
||||||
|
|
||||||
```python
|
```python
|
||||||
async def arun_many(
|
async def arun_many(
|
||||||
@@ -147,6 +152,7 @@ async def arun_many(
|
|||||||
### 4.1 Resource-Aware Crawling
|
### 4.1 Resource-Aware Crawling
|
||||||
|
|
||||||
The `arun_many()` method now uses an intelligent dispatcher that:
|
The `arun_many()` method now uses an intelligent dispatcher that:
|
||||||
|
|
||||||
- Monitors system memory usage
|
- Monitors system memory usage
|
||||||
- Implements adaptive rate limiting
|
- Implements adaptive rate limiting
|
||||||
- Provides detailed progress monitoring
|
- Provides detailed progress monitoring
|
||||||
@@ -192,30 +198,34 @@ async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
|||||||
|
|
||||||
### 4.3 Key Features
|
### 4.3 Key Features
|
||||||
|
|
||||||
1. **Rate Limiting**
|
1. **Rate Limiting**
|
||||||
|
|
||||||
- Automatic delay between requests
|
- Automatic delay between requests
|
||||||
- Exponential backoff on rate limit detection
|
- Exponential backoff on rate limit detection
|
||||||
- Domain-specific rate limiting
|
- Domain-specific rate limiting
|
||||||
- Configurable retry strategy
|
- Configurable retry strategy
|
||||||
|
|
||||||
2. **Resource Monitoring**
|
2. **Resource Monitoring**
|
||||||
|
|
||||||
- Memory usage tracking
|
- Memory usage tracking
|
||||||
- Adaptive concurrency based on system load
|
- Adaptive concurrency based on system load
|
||||||
- Automatic pausing when resources are constrained
|
- Automatic pausing when resources are constrained
|
||||||
|
|
||||||
3. **Progress Monitoring**
|
3. **Progress Monitoring**
|
||||||
|
|
||||||
- Detailed or aggregated progress display
|
- Detailed or aggregated progress display
|
||||||
- Real-time status updates
|
- Real-time status updates
|
||||||
- Memory usage statistics
|
- Memory usage statistics
|
||||||
|
|
||||||
4. **Error Handling**
|
4. **Error Handling**
|
||||||
|
|
||||||
- Graceful handling of rate limits
|
- Graceful handling of rate limits
|
||||||
- Automatic retries with backoff
|
- Automatic retries with backoff
|
||||||
- Detailed error reporting
|
- Detailed error reporting
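
To make "exponential backoff" concrete, here is a minimal sketch of the general technique. It is not Crawl4AI's internal implementation. The function name `backoff_delay` and its parameters are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The delay doubles with each attempt and is capped at `cap`; `jitter`
    adds up to that many extra seconds at random.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter)


# With no jitter the schedule is deterministic: 1, 2, 4, ... capped at 60.
delays = [backoff_delay(n) for n in range(8)]
```

A small amount of jitter helps keep many concurrent crawls from retrying against the same domain at the same moment.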
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 5. `CrawlResult` Output
|
## 5. `CrawlResult` Output
|
||||||
|
|
||||||
Each `arun()` returns a **`CrawlResult`** containing:
|
Each `arun()` returns a **`CrawlResult`** containing:
|
||||||
|
|
||||||
@@ -232,7 +242,7 @@ For details, see [CrawlResult doc](./crawl-result.md).
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Quick Example
|
## 6. Quick Example
|
||||||
|
|
||||||
Below is an example tying it all together:
|
Below is an example tying it all together:
|
||||||
|
|
||||||
@@ -243,14 +253,14 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
|||||||
import json
|
import json
|
||||||
|
|
||||||
async def main():
|
async def main():
|
||||||
# 1. Browser config
|
# 1. Browser config
|
||||||
browser_cfg = BrowserConfig(
|
browser_cfg = BrowserConfig(
|
||||||
browser_type="firefox",
|
browser_type="firefox",
|
||||||
headless=False,
|
headless=False,
|
||||||
verbose=True
|
verbose=True
|
||||||
)
|
)
|
||||||
|
|
||||||
# 2. Run config
|
# 2. Run config
|
||||||
schema = {
|
schema = {
|
||||||
"name": "Articles",
|
"name": "Articles",
|
||||||
"baseSelector": "article.post",
|
"baseSelector": "article.post",
|
||||||
@@ -295,17 +305,18 @@ asyncio.run(main())
|
|||||||
```
|
```
|
||||||
|
|
||||||
**Explanation**:
|
**Explanation**:
|
||||||
- We define a **`BrowserConfig`** with Firefox, headless mode disabled, and `verbose=True`.
|
|
||||||
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
|
- We define a **`BrowserConfig`** with Firefox, headless mode disabled, and `verbose=True`.
|
||||||
|
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
|
||||||
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
|
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 7. Best Practices & Migration Notes
|
## 7. Best Practices & Migration Notes
|
||||||
|
|
||||||
1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
|
1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
|
||||||
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
|
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
|
||||||
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
|
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
|
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
|
||||||
@@ -316,16 +327,17 @@ asyncio.run(main())
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 8. Summary
|
## 8. Summary
|
||||||
|
|
||||||
**AsyncWebCrawler** is your entry point to asynchronous crawling:
|
**AsyncWebCrawler** is your entry point to asynchronous crawling:
|
||||||
|
|
||||||
- **Constructor** accepts **`BrowserConfig`** (or defaults).
|
- **Constructor** accepts **`BrowserConfig`** (or defaults).
|
||||||
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
|
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
|
||||||
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
|
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
|
||||||
- For advanced lifecycle control, use `start()` and `close()` explicitly.
|
- For advanced lifecycle control, use `start()` and `close()` explicitly.
|
||||||
|
|
||||||
**Migration**:
|
**Migration**:
|
||||||
|
|
||||||
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
|
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
|
||||||
|
|
||||||
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
|
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
|
||||||
@@ -294,3 +294,4 @@ stream_cfg = run_cfg.clone(
|
|||||||
stream=True,
|
stream=True,
|
||||||
cache_mode=CacheMode.BYPASS
|
cache_mode=CacheMode.BYPASS
|
||||||
)
|
)
|
||||||
|
```
|
||||||
|
|||||||