diff --git a/docs/md_v2/api/arun.md b/docs/md_v2/api/arun.md index b951b9a5..ea0f8176 100644 --- a/docs/md_v2/api/arun.md +++ b/docs/md_v2/api/arun.md @@ -1,6 +1,6 @@ # `arun()` Parameter Guide (New Approach) -In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide: +In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide: ```python await crawler.arun( @@ -9,11 +9,11 @@ await crawler.arun( ) ``` -Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md). +Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md). --- -## 1. Core Usage +## 1. Core Usage ```python from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode @@ -23,7 +23,7 @@ async def main(): verbose=True, # Detailed logging cache_mode=CacheMode.ENABLED, # Use normal read/write cache check_robots_txt=True, # Respect robots.txt rules - # ... other parameters + # ... other parameters ) async with AsyncWebCrawler() as crawler: @@ -38,15 +38,16 @@ async def main(): ``` **Key Fields**: -- `verbose=True` logs each crawl step. +- `verbose=True` logs each crawl step.  - `cache_mode` decides how to read/write the local crawl cache. --- -## 2. Cache Control +## 2. Cache Control **`cache_mode`** (default: `CacheMode.ENABLED`) Use a built-in enum from `CacheMode`: + - `ENABLED`: Normal caching—reads if available, writes if missing. - `DISABLED`: No caching—always refetch pages. - `READ_ONLY`: Reads from cache only; no new writes. @@ -60,6 +61,7 @@ run_config = CrawlerRunConfig( ``` **Additional flags**: + - `bypass_cache=True` acts like `CacheMode.BYPASS`. - `disable_cache=True` acts like `CacheMode.DISABLED`. - `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`. @@ -67,7 +69,7 @@ run_config = CrawlerRunConfig( --- -## 3. Content Processing & Selection +## 3. Content Processing & Selection ### 3.1 Text Processing @@ -111,7 +113,7 @@ run_config = CrawlerRunConfig( --- -## 4. Page Navigation & Timing +## 4. Page Navigation & Timing ### 4.1 Basic Browser Flow @@ -124,12 +126,13 @@ run_config = CrawlerRunConfig( ``` **Key Fields**: + - `wait_for`: - `"css:selector"` or - `"js:() => boolean"` - e.g. `js:() => document.querySelectorAll('.item').length > 10`. + e.g. `js:() => document.querySelectorAll('.item').length > 10`. -- `mean_delay` & `max_range`: define random delays for `arun_many()` calls. +- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.  - `semaphore_count`: concurrency limit when crawling multiple URLs. ### 4.2 JavaScript Execution @@ -144,7 +147,7 @@ run_config = CrawlerRunConfig( ) ``` -- `js_code` can be a single string or a list of strings. +- `js_code` can be a single string or a list of strings.  - `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.” ### 4.3 Anti-Bot @@ -156,13 +159,13 @@ run_config = CrawlerRunConfig( override_navigator=True ) ``` -- `magic=True` tries multiple stealth features. -- `simulate_user=True` mimics mouse movements or random delays. 
+- `magic=True` tries multiple stealth features.  +- `simulate_user=True` mimics mouse movements or random delays.  - `override_navigator=True` fakes some navigator properties (like user agent checks). --- -## 5. Session Management +## 5. Session Management **`session_id`**: ```python @@ -174,7 +177,7 @@ If re-used in subsequent `arun()` calls, the same tab/page context is continued --- -## 6. Screenshot, PDF & Media Options +## 6. Screenshot, PDF & Media Options ```python run_config = CrawlerRunConfig( @@ -191,7 +194,7 @@ run_config = CrawlerRunConfig( --- -## 7. Extraction Strategy +## 7. Extraction Strategy **For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`: @@ -205,7 +208,7 @@ The extracted data will appear in `result.extracted_content`. --- -## 8. Comprehensive Example +## 8. Comprehensive Example Below is a snippet combining many parameters: @@ -274,32 +277,33 @@ if __name__ == "__main__": ``` **What we covered**: -1. **Crawling** the main content region, ignoring external links. -2. Running **JavaScript** to click “.show-more”. -3. **Waiting** for “.loaded-block” to appear. -4. Generating a **screenshot** & **PDF** of the final page. -5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy. + +1. **Crawling** the main content region, ignoring external links.  +2. Running **JavaScript** to click “.show-more”.  +3. **Waiting** for “.loaded-block” to appear.  +4. Generating a **screenshot** & **PDF** of the final page.  +5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy. --- -## 9. Best Practices +## 9. Best Practices -1. **Use `BrowserConfig` for global browser** settings (headless, user agent). -2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc. -3. Keep your **parameters consistent** in run configs—especially if you’re part of a large codebase with multiple crawls. -4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it. -5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content. +1. **Use `BrowserConfig` for global browser** settings (headless, user agent).  +2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.  +3. Keep your **parameters consistent** in run configs—especially if you’re part of a large codebase with multiple crawls.  +4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.  +5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content. --- -## 10. Conclusion +## 10. Conclusion -All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach: +All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach: -- Makes code **clearer** and **more maintainable**. -- Minimizes confusion about which arguments affect global vs. per-crawl behavior. +- Makes code **clearer** and **more maintainable**.  +- Minimizes confusion about which arguments affect global vs. per-crawl behavior.  - Allows you to create **reusable** config objects for different pages or tasks. -For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md). +For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).  Happy crawling with your **structured, flexible** config approach! 
\ No newline at end of file
diff --git a/docs/md_v2/api/arun_many.md b/docs/md_v2/api/arun_many.md
index edc01145..98d91c08 100644
--- a/docs/md_v2/api/arun_many.md
+++ b/docs/md_v2/api/arun_many.md
@@ -1,6 +1,6 @@
 # `arun_many(...)` Reference
 
-> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
+> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences. 
 
 ## Function Signature
 
@@ -16,7 +16,7 @@ async def arun_many(
 
     :param urls: A list of URLs (or tasks) to crawl.
     :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
-    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
+    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher). 
     ...
     :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
     """
@@ -24,22 +24,26 @@ async def arun_many(
 
 ## Differences from `arun()`
 
-1. **Multiple URLs**:
-   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
+1. **Multiple URLs**:
+
+   - Instead of crawling a single URL, you pass a list of them (strings or tasks). 
    - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
 
-2. **Concurrency & Dispatchers**:
-   - **`dispatcher`** param allows advanced concurrency control.
-   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
+2. **Concurrency & Dispatchers**:
+
+   - **`dispatcher`** param allows advanced concurrency control. 
+   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally. 
    - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
 
-3. **Streaming Support**:
+3. **Streaming Support**:
+
    - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
    - When streaming, use `async for` to process results as they become available.
    - Ideal for processing large numbers of URLs without waiting for all to complete.
 
-4. **Parallel** Execution**:
-   - `arun_many()` can run multiple requests concurrently under the hood.
+4. **Parallel Execution**:
+
+   - `arun_many()` can run multiple requests concurrently under the hood. 
    - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
 
 ### Basic Example (Batch Mode)
@@ -93,19 +97,19 @@ results = await crawler.arun_many(
 
 **Key Points**:
 
 - Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
-- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
+- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info. 
 - If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
 
 ### Return Value
 
-Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
+Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
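+
+A minimal sketch of consuming both shapes (assuming an open `crawler`, a `urls` list, and a base `run_conf`; the variable names are placeholders):
+
+```python
+# Batch mode: resolves to a list once every URL has finished.
+results = await crawler.arun_many(urls, config=run_conf)
+for res in results:
+    print(res.url, "ok" if res.success else res.error_message)
+
+# Streaming mode: with stream=True, results arrive as each crawl completes.
+async for res in await crawler.arun_many(urls, config=run_conf.clone(stream=True)):
+    print(res.url, "ok" if res.success else res.error_message)
+```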
 
 ---
 
 ## Dispatcher Reference
 
-- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
-- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
+- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage. 
+- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive. 
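+
+A sketch of passing one explicitly (the import path and constructor arguments here are assumptions; see the dispatcher docs linked below for the exact API):
+
+```python
+from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
+
+dispatcher = MemoryAdaptiveDispatcher(
+    memory_threshold_percent=80.0,  # stop launching new crawls above this memory usage
+    max_session_permit=10,          # upper bound on concurrent sessions
+)
+results = await crawler.arun_many(urls, config=run_conf, dispatcher=dispatcher)
+```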
 
 For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
@@ -113,12 +117,14 @@
 
 ## Common Pitfalls
 
-1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
-2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
-3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
+1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate limits. A dispatcher can help. 
+
+2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly. 
+
+3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
 
 ---
 
 ## Conclusion
 
-Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
\ No newline at end of file
+Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
\ No newline at end of file
diff --git a/docs/md_v2/api/async-webcrawler.md b/docs/md_v2/api/async-webcrawler.md
index e9c6cc6b..eacd3de4 100644
--- a/docs/md_v2/api/async-webcrawler.md
+++ b/docs/md_v2/api/async-webcrawler.md
@@ -1,16 +1,20 @@
 # AsyncWebCrawler
 
-The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
+The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects. 
 
 **Recommended usage**:
-1. **Create** a `BrowserConfig` for global browser settings.
-2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
-3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
+
+1. **Create** a `BrowserConfig` for global browser settings. 
+
+2. **Instantiate** `AsyncWebCrawler(config=browser_config)`. 
+
+3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually. 
+
 4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
 
 ---
 
-## 1. Constructor Overview
+## 1. Constructor Overview 
 
 ```python
 class AsyncWebCrawler:
@@ -37,7 +41,7 @@ class AsyncWebCrawler:
         base_directory: Folder for storing caches/logs (if relevant).
 
         thread_safe:
-            If True, attempts some concurrency safeguards. Usually False.
+            If True, attempts some concurrency safeguards. Usually False. 
 
         **kwargs:
             Additional legacy or debugging parameters.
         """
@@ -58,11 +62,12 @@ crawler = AsyncWebCrawler(config=browser_cfg)
 ```
 
 **Notes**:
+
 - **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
 
 ---
 
-## 2. Lifecycle: Start/Close or Context Manager
+## 2. Lifecycle: Start/Close or Context Manager 
 
 ### 2.1 Context Manager (Recommended)
 
@@ -90,7 +95,7 @@ Use this style if you have a **long-running** application or need full control o
 
 ---
 
-## 3. Primary Method: `arun()`
+## 3. Primary Method: `arun()` 
 
 ```python
 async def arun(
@@ -130,7 +135,7 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
 
 ---
 
-## 4. Batch Processing: `arun_many()`
+## 4. Batch Processing: `arun_many()` 
 
 ```python
 async def arun_many(
@@ -147,6 +152,7 @@ async def arun_many(
 ### 4.1 Resource-Aware Crawling
 
 The `arun_many()` method now uses an intelligent dispatcher that:
+
 - Monitors system memory usage
 - Implements adaptive rate limiting
 - Provides detailed progress monitoring
@@ -192,30 +198,34 @@ async with AsyncWebCrawler(config=browser_cfg) as crawler:
 
 ### 4.3 Key Features
 
-1. **Rate Limiting**
+1. **Rate Limiting** 
+
    - Automatic delay between requests
    - Exponential backoff on rate limit detection
    - Domain-specific rate limiting
    - Configurable retry strategy
 
-2. **Resource Monitoring**
+2. **Resource Monitoring** 
+
    - Memory usage tracking
    - Adaptive concurrency based on system load
    - Automatic pausing when resources are constrained
 
-3. **Progress Monitoring**
+3. **Progress Monitoring** 
+
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics
 
-4. **Error Handling**
+4. **Error Handling** 
+
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
 
 ---
 
-## 5. `CrawlResult` Output
+## 5. `CrawlResult` Output 
 
 Each `arun()` returns a **`CrawlResult`** containing:
 
@@ -232,7 +242,7 @@
 For details, see [CrawlResult doc](./crawl-result.md).
 
 ---
 
-## 6. Quick Example
+## 6. Quick Example 
 
 Below is an example hooking it all together:
 
@@ -243,14 +253,14 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 import json
 
 async def main():
-    # 1. Browser config
+    # 1. Browser config 
     browser_cfg = BrowserConfig(
         browser_type="firefox",
         headless=False,
         verbose=True
     )
 
-    # 2. Run config
+    # 2. Run config
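+    # The schema drives JsonCssExtractionStrategy below: "baseSelector"
+    # matches each repeated element, and every entry in "fields" maps a
+    # CSS selector to a key in the extracted JSON.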
     schema = {
         "name": "Articles",
         "baseSelector": "article.post",
@@ -295,17 +305,18 @@ asyncio.run(main())
 ```
 
 **Explanation**:
-- We define a **`BrowserConfig`** with Firefox, no headless, and `verbose=True`.
-- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
+
+- We define a **`BrowserConfig`** with Firefox, no headless, and `verbose=True`. 
+- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc. 
 - We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
 
 ---
 
-## 7. Best Practices & Migration Notes
+## 7. Best Practices & Migration Notes 
 
-1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
-2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
-3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
+1. **Use** `BrowserConfig` for **global** settings about the browser’s environment. 
+2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions). 
+3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
 
 ```python
 run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
@@ -316,16 +327,17 @@
 
 ---
 
-## 8. Summary
+## 8. Summary 
 
 **AsyncWebCrawler** is your entry point to asynchronous crawling:
 
-- **Constructor** accepts **`BrowserConfig`** (or defaults).
-- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
-- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
-- For advanced lifecycle control, use `start()` and `close()` explicitly.
+- **Constructor** accepts **`BrowserConfig`** (or defaults). 
+- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls. 
+- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs. 
+- For advanced lifecycle control, use `start()` and `close()` explicitly. 
 
 **Migration**:
+
 - If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
 
-This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
\ No newline at end of file
+This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
\ No newline at end of file
diff --git a/docs/md_v2/api/parameters.md b/docs/md_v2/api/parameters.md
index 932a2642..e60c1025 100644
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -294,3 +294,4 @@ stream_cfg = run_cfg.clone(
     stream=True,
     cache_mode=CacheMode.BYPASS
 )
+```
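+
+`clone()` is meant to return a copy with only the listed fields overridden, leaving the base config untouched, so one object can spawn per-task variants. A small sketch (`page_timeout` is an assumption; `run_cfg` is the base object from the snippet above):
+
+```python
+selector_cfg = run_cfg.clone(css_selector="article.post")
+patient_cfg = run_cfg.clone(page_timeout=120000)  # milliseconds, if supported
+```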