docs(api): improve formatting and readability of API documentation
Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files:

- arun.md
- arun_many.md
- async-webcrawler.md
- parameters.md

Changes include:

- Consistent list formatting and indentation
- Better spacing between sections
- Clearer separation of content blocks
- Fixed quotation marks and code block formatting

@@ -1,16 +1,20 @@

# AsyncWebCrawler

The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.

**Recommended usage**:

1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
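
A minimal sketch of those four steps (the URL is a placeholder, and the config values shown are illustrative defaults, not requirements):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)      # 1. global browser settings
    crawler = AsyncWebCrawler(config=browser_cfg)   # 2. one crawler instance
    async with crawler:                             # 3. context manager handles start/close
        run_cfg = CrawlerRunConfig()                # per-crawl settings (defaults here)
        result = await crawler.arun(url="https://example.com", config=run_cfg)  # 4. crawl a page
        print(result.success)

asyncio.run(main())
```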
---

## 1. Constructor Overview

```python
class AsyncWebCrawler:
@@ -37,7 +41,7 @@ class AsyncWebCrawler:
        base_directory:
            Folder for storing caches/logs (if relevant).
        thread_safe:
            If True, attempts some concurrency safeguards. Usually False.
        **kwargs:
            Additional legacy or debugging parameters.
        """
@@ -58,11 +62,12 @@ crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:

- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
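
For example, a short sketch of the config-based approach, assuming the `CacheMode` enum that newer Crawl4AI versions export for this purpose:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Instead of legacy constructor flags like always_bypass_cache=True,
# declare cache behavior per crawl:
run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```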
---

## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

@@ -90,7 +95,7 @@ Use this style if you have a **long-running** application or need full control o
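
The hunk elides the code here; a sketch of the manual lifecycle, assuming only the `start()`/`close()` methods this doc mentions in the Summary:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    await crawler.start()        # launch the browser once, up front
    try:
        result_a = await crawler.arun(url="https://example.com/a")
        result_b = await crawler.arun(url="https://example.com/b")
        print(result_a.success, result_b.success)
    finally:
        await crawler.close()    # always release browser resources

asyncio.run(main())
```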

---

## 3. Primary Method: `arun()`

```python
async def arun(
```
@@ -130,7 +135,7 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
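
To make the contrast concrete, a hedged sketch; `css_selector` is just one example of a direct argument, and the config style is the preferred form:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Legacy: direct keyword argument on arun() (still accepted).
        legacy = await crawler.arun(url="https://example.com", css_selector=".main-content")

        # Preferred: bundle per-crawl options into CrawlerRunConfig.
        run_cfg = CrawlerRunConfig(css_selector=".main-content")
        preferred = await crawler.arun(url="https://example.com", config=run_cfg)

asyncio.run(main())
```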

---

## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
```
@@ -147,6 +152,7 @@ async def arun_many(

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
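
A minimal sketch of a batch call relying on that dispatcher's defaults (URLs are placeholders; the dispatcher works behind the scenes with no extra wiring):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    async with AsyncWebCrawler() as crawler:
        # The intelligent dispatcher throttles and monitors these crawls.
        results = await crawler.arun_many(urls=urls, config=CrawlerRunConfig())
        for r in results:
            print(r.url, r.success)

asyncio.run(main())
```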
@@ -192,30 +198,34 @@ async with AsyncWebCrawler(config=browser_cfg) as crawler:

### 4.3 Key Features

1. **Rate Limiting**

   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**

   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**

   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**

   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
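
A hedged sketch of wiring these features up explicitly. The class names, import path, and parameters below are assumptions drawn from the feature list above — verify them against the arun_many.md reference:

```python
# Assumed API — check arun_many.md for the exact names and defaults.
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,        # pause new tasks above this memory usage
    rate_limiter=RateLimiter(
        base_delay=(1.0, 3.0),            # random per-request delay range (seconds)
        max_delay=60.0,                   # ceiling for exponential backoff
        max_retries=3,                    # retries after rate-limit responses
        rate_limit_codes=[429, 503],      # status codes that trigger backoff
    ),
)
# Then pass it per batch:
# results = await crawler.arun_many(urls=urls, config=run_cfg, dispatcher=dispatcher)
```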
---

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

@@ -232,7 +242,7 @@ For details, see [CrawlResult doc](./crawl-result.md).
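
A short sketch of reading the common fields (inside an `async with AsyncWebCrawler() as crawler:` block; `markdown` may be a plain string or a richer object depending on version):

```python
result = await crawler.arun(url="https://example.com")

if result.success:
    print(result.status_code)        # HTTP status of the fetch
    print(result.markdown)           # generated markdown
    print(result.extracted_content)  # JSON string when an extraction strategy ran
else:
    print(result.error_message)
```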

---

## 6. Quick Example

Below is an example hooking it all together:

@@ -243,14 +253,14 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
```python
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
@@ -295,17 +305,18 @@ asyncio.run(main())
```

**Explanation**:

- We define a **`BrowserConfig`** with Firefox, headless disabled, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.

---

## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:

```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
```
@@ -316,16 +327,17 @@ asyncio.run(main())

---

## 8. Summary

**AsyncWebCrawler** is your entry point to asynchronous crawling:

- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.

**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
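
As a before/after sketch (the legacy call is illustrative):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Before (legacy, mixed concerns):
# crawler = AsyncWebCrawler(browser_type="chromium", css_selector=".main-content")

# After (split by responsibility):
browser_cfg = BrowserConfig(browser_type="chromium")        # browser environment
run_cfg = CrawlerRunConfig(css_selector=".main-content")    # per-crawl logic

crawler = AsyncWebCrawler(config=browser_cfg)
# ... await crawler.arun(url="https://example.com", config=run_cfg)
```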

This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).