# Crawl Request Parameters

The `run` function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `run` function, along with their descriptions, possible values, and examples.

## Parameters
### url (str)

**Description:** The URL of the webpage to crawl.

**Required:** Yes

**Example:**

```python
url = "https://www.nbcnews.com/business"
```
### word_count_threshold (int)

**Description:** The minimum number of words a block must contain to be considered meaningful; shorter blocks are discarded.

**Required:** No

**Default Value:** `5`

**Example:**

```python
word_count_threshold = 10
```
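To make the effect concrete, here is a minimal, self-contained sketch of the idea behind `word_count_threshold` (an illustration only, not Crawl4AI's actual implementation; `filter_blocks` is a hypothetical helper):

```python
# Sketch: drop any text block with fewer words than the threshold.
# Short blocks are typically noise such as navigation labels or button text.
def filter_blocks(blocks, word_count_threshold=5):
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = [
    "Home",
    "Subscribe now",
    "Markets rallied sharply on Friday after the jobs report.",
]
print(filter_blocks(blocks, word_count_threshold=5))
# → ['Markets rallied sharply on Friday after the jobs report.']
```

Raising the threshold keeps only longer, more substantive blocks; lowering it retains more of the page.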
### extraction_strategy (ExtractionStrategy)

**Description:** The strategy used to extract content from the HTML. Must be an instance of `ExtractionStrategy`.

**Required:** No

**Default Value:** `NoExtractionStrategy()`

**Example:**

```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```
### chunking_strategy (ChunkingStrategy)

**Description:** The strategy used to chunk the text before processing. Must be an instance of `ChunkingStrategy`.

**Required:** No

**Default Value:** `RegexChunking()`

**Example:**

```python
chunking_strategy = NlpSentenceChunking()
```
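For intuition, here is a rough sketch of regex-based sentence chunking in the spirit of the default `RegexChunking` (an illustration only; `regex_chunk` and its pattern are assumptions, not the library's code):

```python
import re

# Sketch: split text into sentence-like chunks on whitespace that follows
# sentence-ending punctuation, dropping empty pieces.
def regex_chunk(text, pattern=r"(?<=[.!?])\s+"):
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

text = "Stocks rose today. The dollar fell! What happens next?"
print(regex_chunk(text))
# → ['Stocks rose today.', 'The dollar fell!', 'What happens next?']
```

An NLP-based strategy such as `NlpSentenceChunking` handles harder cases (abbreviations, decimals) that a simple regex misses.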
### bypass_cache (bool)

**Description:** Whether to force a fresh crawl even if the URL has been crawled before.

**Required:** No

**Default Value:** `False`

**Example:**

```python
bypass_cache = True
```
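Conceptually, the flag behaves like this hedged sketch (not Crawl4AI's internals; `fetch`, `_cache`, and `crawl_count` are hypothetical names):

```python
# Sketch: a cached result is reused unless the caller forces a fresh fetch.
_cache = {}
crawl_count = 0

def fetch(url, bypass_cache=False):
    global crawl_count
    if not bypass_cache and url in _cache:
        return _cache[url]              # cache hit: skip the crawl
    crawl_count += 1                    # stand-in for a real network crawl
    _cache[url] = f"content #{crawl_count} for {url}"
    return _cache[url]

fetch("https://example.com")                     # crawls (count 1)
fetch("https://example.com")                     # cache hit (still 1)
fetch("https://example.com", bypass_cache=True)  # forced re-crawl (count 2)
print(crawl_count)  # → 2
```

Use `bypass_cache=True` when the page content may have changed since the last crawl.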
### css_selector (str)

**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.

**Required:** No

**Default Value:** `None`

**Example:**

```python
css_selector = "div.article-content"
```
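As a conceptual illustration of selector scoping (not Crawl4AI's implementation, which supports real CSS selectors; this stdlib sketch handles only a `tag.class` pattern):

```python
from html.parser import HTMLParser

# Sketch: keep only text inside subtrees matching a "tag.class" selector,
# discarding everything else on the page before further processing.
class ClassScope(HTMLParser):
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0          # >0 while inside a matching subtree
        self.texts = []

    def handle_starttag(self, t, attrs):
        if self.depth:
            self.depth += 1 if t == self.tag else 0
        elif t == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, t):
        if self.depth and t == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())

p = ClassScope("div", "article-content")
p.feed('<body><div class="nav">Menu</div>'
       '<div class="article-content"><p>Story text.</p></div></body>')
print(p.texts)  # → ['Story text.']
```

Scoping extraction this way avoids pulling in navigation, footers, and other page chrome.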
### screenshot (bool)

**Description:** Whether to take a screenshot of the page.

**Required:** No

**Default Value:** `False`

**Example:**

```python
screenshot = True
```
### user_agent (str)

**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.

**Required:** No

**Default Value:** `None`

**Example:**

```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```
### verbose (bool)

**Description:** Whether to enable verbose logging.

**Required:** No

**Default Value:** `True`

**Example:**

```python
verbose = True
```
### **kwargs

Additional keyword arguments can be passed to further customize the crawling process. Notable options include:

- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.

**Example:**

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True
)
```
## Example Usage

Here's an example of how to use the `run` function with various parameters:

```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

# Create the WebCrawler instance
crawler = WebCrawler()

# Run the crawler with custom parameters
result = crawler.run(
    url="https://www.nbcnews.com/business",
    word_count_threshold=10,
    extraction_strategy=CosineStrategy(semantic_filter="finance"),
    chunking_strategy=NlpSentenceChunking(),
    bypass_cache=True,
    css_selector="div.article-content",
    screenshot=True,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    verbose=True,
    only_text=True
)

print(result)
```

This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.