docs(api): improve formatting and readability of API documentation

Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files:
- arun.md
- arun_many.md
- async-webcrawler.md
- parameters.md

Changes include:
- Consistent list formatting and indentation
- Better spacing between sections
- Clearer separation of content blocks
- Fixed quotation marks and code block formatting
UncleCode · 2025-01-25 22:06:11 +08:00
parent 09ac7ed008 · commit 54c84079c4
4 changed files with 103 additions and 80 deletions


@@ -1,16 +1,20 @@
# AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
**Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
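
A minimal sketch of that flow (the URL here is a placeholder, and `CrawlerRunConfig()` is left at its defaults):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1. Global browser settings
    browser_cfg = BrowserConfig(headless=True)
    # 2 & 3. Instantiate and use as an async context manager
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. One CrawlerRunConfig per crawl
        run_cfg = CrawlerRunConfig()
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(str(result.markdown)[:300])

asyncio.run(main())
```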
---
## 1. Constructor Overview
```python
class AsyncWebCrawler:
@@ -37,7 +41,7 @@ class AsyncWebCrawler:
        base_directory:
            Folder for storing caches/logs (if relevant).
        thread_safe:
            If True, attempts some concurrency safeguards. Usually False.
        **kwargs:
            Additional legacy or debugging parameters.
        """
@@ -58,11 +62,12 @@ crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
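
For example, caching is now expressed per crawl rather than per crawler. A short sketch, assuming the `CacheMode` enum that ships with crawl4ai:

```python
from crawl4ai import CacheMode, CrawlerRunConfig

# Per-crawl cache control replaces the legacy always_bypass_cache flag
run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```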
---
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
@@ -90,7 +95,7 @@ Use this style if you have a **long-running** application or need full control o
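
For reference, a sketch of the manual lifecycle style (using the `start()`/`close()` methods named in section 8; the URL is a placeholder):

```python
# Inside an async function:
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
try:
    result = await crawler.arun("https://example.com", config=run_cfg)
finally:
    # Always release the browser, even if a crawl raises
    await crawler.close()
```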
---
## 3. Primary Method: `arun()`
```python
async def arun(
    url: str,
    config: CrawlerRunConfig = None,
    # ...
)
```
@@ -130,7 +135,7 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
---
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    urls: List[str],
    config: CrawlerRunConfig = None,
    # ...
)
```
@@ -147,6 +152,7 @@ async def arun_many(
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
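
A short usage sketch (the URLs are placeholders; `browser_cfg` and `run_cfg` are assumed to be defined as in earlier examples):

```python
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for res in results:
        print(res.url, "->", "ok" if res.success else res.error_message)
```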
@@ -192,30 +198,34 @@ async with AsyncWebCrawler(config=browser_cfg) as crawler:
### 4.3 Key Features
1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
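
These behaviors are driven by the dispatcher. A hedged sketch, assuming the `MemoryAdaptiveDispatcher` and `RateLimiter` helpers present in recent crawl4ai releases (verify the import path and parameter names against your installed version):

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,      # pause new tasks above this memory usage
    max_session_permit=10,              # cap on concurrent crawls
    rate_limiter=RateLimiter(
        base_delay=(1.0, 2.0),          # random per-request delay range (seconds)
        max_delay=30.0,                 # ceiling for exponential backoff
        max_retries=2,
    ),
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg, dispatcher=dispatcher)
```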
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
@@ -232,7 +242,7 @@ For details, see [CrawlResult doc](./crawl-result.md).
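
A typical access pattern (field names as listed in the CrawlResult doc; `run_cfg` as above):

```python
result = await crawler.arun("https://example.com", config=run_cfg)
if result.success:
    print(str(result.markdown)[:200])  # markdown rendering of the page
else:
    print(result.error_message)
```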
---
## 6. Quick Example
Below is an example tying it all together:
@@ -243,14 +253,14 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
@@ -295,17 +305,18 @@ asyncio.run(main())
```
**Explanation**:
- We define a **`BrowserConfig`** with Firefox, non-headless, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
---
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings about the browser's environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url, config=run_cfg)
```
@@ -316,16 +327,17 @@ asyncio.run(main())
---
## 8. Summary
**AsyncWebCrawler** is your entry point to asynchronous crawling:
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
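
A before/after sketch of that migration (the selector and browser type are illustrative):

```python
# Before (legacy style):
# crawler = AsyncWebCrawler(browser_type="chromium", css_selector=".main-content")

# After: browser settings move to BrowserConfig, crawl logic to CrawlerRunConfig
browser_cfg = BrowserConfig(browser_type="chromium")
run_cfg = CrawlerRunConfig(css_selector=".main-content")

crawler = AsyncWebCrawler(config=browser_cfg)
result = await crawler.arun("https://example.com", config=run_cfg)  # inside an async function
```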
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).