Files
crawl4ai/docs/md/full_details/crawl_request_parameters.md
2024-06-21 17:56:54 +08:00

131 lines
3.9 KiB
Markdown

# Crawl Request Parameters
The `run` function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `run` function, along with their descriptions, possible values, and examples.
## Parameters
### url (str)
**Description:** The URL of the webpage to crawl.
**Required:** Yes
**Example:**
```python
url = "https://www.nbcnews.com/business"
```
### word_count_threshold (int)
**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is `5`.
**Required:** No
**Default Value:** `5`
**Example:**
```python
word_count_threshold = 10
```
### extraction_strategy (ExtractionStrategy)
**Description:** The strategy to use for extracting content from the HTML. It must be an instance of `ExtractionStrategy`. If not provided, the default is `NoExtractionStrategy`.
**Required:** No
**Default Value:** `NoExtractionStrategy()`
**Example:**
```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```
### chunking_strategy (ChunkingStrategy)
**Description:** The strategy to use for chunking the text before processing. It must be an instance of `ChunkingStrategy`. The default value is `RegexChunking()`.
**Required:** No
**Default Value:** `RegexChunking()`
**Example:**
```python
chunking_strategy = NlpSentenceChunking()
```
### bypass_cache (bool)
**Description:** Whether to force a fresh crawl even if the URL has been previously crawled. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
bypass_cache = True
```
### css_selector (str)
**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
**Required:** No
**Default Value:** `None`
**Example:**
```python
css_selector = "div.article-content"
```
### screenshot (bool)
**Description:** Whether to take screenshots of the page. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
screenshot = True
```
### user_agent (str)
**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
**Required:** No
**Default Value:** `None`
**Example:**
```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```
### verbose (bool)
**Description:** Whether to enable verbose logging. The default value is `True`.
**Required:** No
**Default Value:** `True`
**Example:**
```python
verbose = True
```
### **kwargs
Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:
- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
**Example:**
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="p",
only_text=True
)
```
## Example Usage
Here's an example of how to use the `run` function with various parameters:
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler with custom parameters
result = crawler.run(
url="https://www.nbcnews.com/business",
word_count_threshold=10,
extraction_strategy=CosineStrategy(semantic_filter="finance"),
chunking_strategy=NlpSentenceChunking(),
bypass_cache=True,
css_selector="div.article-content",
screenshot=True,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
verbose=True,
only_text=True
)
print(result)
```
This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.