Files
crawl4ai/docs/md/full_details/crawl_request_parameters.md

179 lines
5.8 KiB
Markdown

# Crawl Request Parameters for AsyncWebCrawler
The `arun` method in Crawl4AI's `AsyncWebCrawler` is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `arun` method, along with their descriptions, possible values, and examples.
## Parameters
### url (str)
**Description:** The URL of the webpage to crawl.
**Required:** Yes
**Example:**
```python
url = "https://www.nbcnews.com/business"
```
### word_count_threshold (int)
**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is defined by `MIN_WORD_THRESHOLD`.
**Required:** No
**Default Value:** `MIN_WORD_THRESHOLD`
**Example:**
```python
word_count_threshold = 10
```
### extraction_strategy (ExtractionStrategy)
**Description:** The strategy to use for extracting content from the HTML. It must be an instance of `ExtractionStrategy`. If not provided, the default is `NoExtractionStrategy`.
**Required:** No
**Default Value:** `NoExtractionStrategy()`
**Example:**
```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```
### chunking_strategy (ChunkingStrategy)
**Description:** The strategy to use for chunking the text before processing. It must be an instance of `ChunkingStrategy`. The default value is `RegexChunking()`.
**Required:** No
**Default Value:** `RegexChunking()`
**Example:**
```python
chunking_strategy = NlpSentenceChunking()
```
### bypass_cache (bool)
**Description:** Whether to force a fresh crawl even if the URL has been previously crawled. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
bypass_cache = True
```
### css_selector (str)
**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
**Required:** No
**Default Value:** `None`
**Example:**
```python
css_selector = "div.article-content"
```
### screenshot (bool)
**Description:** Whether to take screenshots of the page. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
screenshot = True
```
### user_agent (str)
**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
**Required:** No
**Default Value:** `None`
**Example:**
```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```
### verbose (bool)
**Description:** Whether to enable verbose logging. The default value is `True`.
**Required:** No
**Default Value:** `True`
**Example:**
```python
verbose = True
```
### **kwargs
Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:
- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
- **session_id (str):** A unique identifier for the crawling session. This is useful for maintaining state across multiple requests.
- **js_code (str or list):** JavaScript code to be executed on the page before extraction.
- **wait_for (str):** A CSS selector or JavaScript function to wait for before considering the page load complete.
**Example:**
```python
result = await crawler.arun(
url="https://www.nbcnews.com/business",
css_selector="p",
only_text=True,
session_id="unique_session_123",
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="article.main-article"
)
```
## Example Usage
Here's an example of how to use the `arun` method with various parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking
async def main():
# Create the AsyncWebCrawler instance
async with AsyncWebCrawler(verbose=True) as crawler:
# Run the crawler with custom parameters
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=10,
extraction_strategy=CosineStrategy(semantic_filter="finance"),
chunking_strategy=NlpSentenceChunking(),
bypass_cache=True,
css_selector="div.article-content",
screenshot=True,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
verbose=True,
only_text=True,
session_id="business_news_session",
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="footer"
)
print(result)
# Run the async function
asyncio.run(main())
```
This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.
## Additional Asynchronous Methods
The `AsyncWebCrawler` class also provides other useful asynchronous methods:
### arun_many
**Description:** Crawl multiple URLs concurrently.
**Example:**
```python
urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
```
### aclear_cache
**Description:** Clear the crawler's cache.
**Example:**
```python
await crawler.aclear_cache()
```
### aflush_cache
**Description:** Completely flush the crawler's cache.
**Example:**
```python
await crawler.aflush_cache()
```
### aget_cache_size
**Description:** Get the current size of the cache.
**Example:**
```python
cache_size = await crawler.aget_cache_size()
print(f"Current cache size: {cache_size}")
```
These asynchronous methods allow for efficient and flexible use of the AsyncWebCrawler in various scenarios.