# Crawl Request Parameters for AsyncWebCrawler

The `arun` method in Crawl4AI's `AsyncWebCrawler` is highly configurable, letting you tailor the crawling and extraction process to your needs. Below are the parameters you can pass to `arun`, along with their descriptions, possible values, and examples.

## Parameters

### url (str)

**Description:** The URL of the webpage to crawl.
**Required:** Yes

**Example:**
```python
url = "https://www.nbcnews.com/business"
```

### word_count_threshold (int)

**Description:** The minimum number of words a block must contain to be considered meaningful.
**Required:** No
**Default Value:** `MIN_WORD_THRESHOLD`

**Example:**
```python
word_count_threshold = 10
```

### extraction_strategy (ExtractionStrategy)

**Description:** The strategy for extracting content from the HTML. Must be an instance of `ExtractionStrategy`.
**Required:** No
**Default Value:** `NoExtractionStrategy()`

**Example:**
```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```

### chunking_strategy (ChunkingStrategy)

**Description:** The strategy for chunking the text before processing. Must be an instance of `ChunkingStrategy`.
**Required:** No
**Default Value:** `RegexChunking()`

**Example:**
```python
chunking_strategy = NlpSentenceChunking()
```

### bypass_cache (bool)

**Description:** Whether to force a fresh crawl even if the URL has been crawled before.
**Required:** No
**Default Value:** `False`

**Example:**
```python
bypass_cache = True
```

### css_selector (str)

**Description:** A CSS selector targeting the specific parts of the HTML to extract. If not provided, the entire HTML is processed.
**Required:** No
**Default Value:** `None`

**Example:**
```python
css_selector = "div.article-content"
```

### screenshot (bool)

**Description:** Whether to take a screenshot of the page.
**Required:** No
**Default Value:** `False`

**Example:**
```python
screenshot = True
```

### user_agent (str)

**Description:** The user agent to use for HTTP requests. If not provided, a default user agent is used.
**Required:** No
**Default Value:** `None`

**Example:**
```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```

### verbose (bool)

**Description:** Whether to enable verbose logging.
**Required:** No
**Default Value:** `True`

**Example:**
```python
verbose = True
```

### **kwargs

Additional keyword arguments that further customize the crawling process. Notable options include:

- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
- **session_id (str):** A unique identifier for the crawling session, useful for maintaining state across multiple requests.
- **js_code (str or list):** JavaScript code to execute on the page before extraction.
- **wait_for (str):** A CSS selector or JavaScript function to wait for before considering the page load complete (a sketch of the JavaScript form follows this list).
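The examples in this document all pass a plain CSS selector to `wait_for`, so here is a minimal sketch of the JavaScript form. Note that the `js:` prefix marking the condition is an assumption based on recent Crawl4AI releases and may differ in your version:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def wait_for_js_demo():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            # Wait until at least ten <article> elements have rendered.
            # The "js:" prefix marking a JavaScript condition is an
            # assumption and may vary by Crawl4AI version.
            wait_for="js:() => document.querySelectorAll('article').length >= 10",
        )
        print(result.success)

asyncio.run(wait_for_js_demo())
```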
**Example (combining several of these options):**
```python
result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True,
    session_id="unique_session_123",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="article.main-article"
)
```

## Example Usage

Here's an example of how to use the `arun` method with various parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

async def main():
    # Create the AsyncWebCrawler instance
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler with custom parameters
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=10,
            extraction_strategy=CosineStrategy(semantic_filter="finance"),
            chunking_strategy=NlpSentenceChunking(),
            bypass_cache=True,
            css_selector="div.article-content",
            screenshot=True,
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            verbose=True,
            only_text=True,
            session_id="business_news_session",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
        )
        print(result)

# Run the async function
asyncio.run(main())
```

This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.

## Additional Asynchronous Methods

The `AsyncWebCrawler` class also provides other useful asynchronous methods:

### arun_many

**Description:** Crawl multiple URLs concurrently.

**Example:**
```python
urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
```

### aclear_cache

**Description:** Clear the crawler's cache.

**Example:**
```python
await crawler.aclear_cache()
```

### aflush_cache

**Description:** Completely flush the crawler's cache, removing all entries.

**Example:**
```python
await crawler.aflush_cache()
```

### aget_cache_size

**Description:** Get the current size of the cache.

**Example:**
```python
cache_size = await crawler.aget_cache_size()
print(f"Current cache size: {cache_size}")
```

These asynchronous methods allow for efficient and flexible use of the AsyncWebCrawler in various scenarios.
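A closing note on `arun_many`: it returns one result object per URL, so a typical pattern is to check each result's `success` flag before using its content. A minimal sketch, assuming the batch results expose the same `success`, `url`, `markdown`, and `error_message` attributes as single-URL `arun` results:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_batch():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com",
    ]
    async with AsyncWebCrawler(verbose=True) as crawler:
        # One result per input URL
        results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)

    for result in results:
        if result.success:
            print(f"{result.url}: {len(result.markdown)} characters of markdown")
        else:
            # error_message is assumed to carry the failure reason
            print(f"{result.url} failed: {result.error_message}")

asyncio.run(crawl_batch())
```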