
Crawl Request Parameters for AsyncWebCrawler

The arun method in Crawl4AI's AsyncWebCrawler is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the arun method, along with their descriptions, possible values, and examples.

Parameters

url (str)

Description: The URL of the webpage to crawl.
Required: Yes
Example:

url = "https://www.nbcnews.com/business"

word_count_threshold (int)

Description: The minimum number of words a block must contain to be considered meaningful.
Required: No
Default Value: MIN_WORD_THRESHOLD
Example:

word_count_threshold = 10
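
Conceptually, the threshold acts as a filter over extracted text blocks. A minimal stdlib-only sketch of the idea — illustrative only, not crawl4ai's actual filtering logic, and `blocks` is a hypothetical input:

```python
# Illustrative only: drop text blocks with fewer words than the threshold.
# This mimics the idea behind word_count_threshold; crawl4ai's real
# filtering is more involved.
def filter_blocks(blocks, word_count_threshold=10):
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = [
    "Short snippet",
    "This block contains enough words to pass a threshold of five easily",
]
meaningful = filter_blocks(blocks, word_count_threshold=5)
```

With a threshold of 5, only the second block survives; navigation labels, button captions, and similar short fragments are what this setting is meant to exclude.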

extraction_strategy (ExtractionStrategy)

Description: The strategy to use for extracting content from the HTML. Must be an instance of ExtractionStrategy.
Required: No
Default Value: NoExtractionStrategy()
Example:

extraction_strategy = CosineStrategy(semantic_filter="finance")
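
To see the intuition behind a semantic filter, here is a bag-of-words sketch of cosine similarity — a deliberate simplification; CosineStrategy's real scoring uses richer text representations than raw word counts:

```python
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    # Cosine similarity over word-count vectors: 1.0 for identical
    # word distributions, 0.0 for texts sharing no words.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

finance_score = cosine("stock market news", "stock market report")
offtopic_score = cosine("stock market news", "cooking recipes")
```

A semantic filter like "finance" keeps blocks whose similarity to the filter topic is high and discards the rest.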

chunking_strategy (ChunkingStrategy)

Description: The strategy to use for chunking the text before processing. Must be an instance of ChunkingStrategy.
Required: No
Default Value: RegexChunking()
Example:

chunking_strategy = NlpSentenceChunking()
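
The idea behind sentence-level chunking can be sketched with a naive regex — again illustrative only; NlpSentenceChunking uses proper NLP tooling rather than this pattern:

```python
import re

def sentence_chunks(text: str) -> list[str]:
    # Split on whitespace that follows sentence-ending punctuation.
    # A rough stand-in for what a sentence-chunking strategy produces.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = sentence_chunks("Markets rose today. Analysts were surprised!")
```

Each chunk is then fed to the extraction strategy independently, which is why the choice of chunking granularity affects extraction quality.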

bypass_cache (bool)

Description: Whether to force a fresh crawl even if the URL has been previously crawled.
Required: No
Default Value: False
Example:

bypass_cache = True

css_selector (str)

Description: The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML is processed.
Required: No
Default Value: None
Example:

css_selector = "div.article-content"

screenshot (bool)

Description: Whether to take a screenshot of the page.
Required: No
Default Value: False
Example:

screenshot = True

user_agent (str)

Description: The user agent to use for HTTP requests. If not provided, a default user agent is used.
Required: No
Default Value: None
Example:

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"

verbose (bool)

Description: Whether to enable verbose logging.
Required: No
Default Value: True
Example:

verbose = True

**kwargs

Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:

  • only_text (bool): Whether to extract only text content, excluding HTML tags. Default is False.
  • session_id (str): A unique identifier for the crawling session. This is useful for maintaining state across multiple requests.
  • js_code (str or list): JavaScript code to be executed on the page before extraction.
  • wait_for (str): A CSS selector or JavaScript function to wait for before considering the page load complete.

Example:

result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True,
    session_id="unique_session_123",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="article.main-article"
)
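
Since wait_for accepts either a CSS selector or a JavaScript condition, the two forms can be distinguished with a prefix. The values below are hypothetical illustrations — verify the exact prefix syntax against the crawl4ai version you are using:

```python
# Two forms a wait_for value can take (check your crawl4ai docs for
# the exact syntax supported by your version):
wait_for_selector = "css:article.main-article"          # wait for an element to appear
wait_for_predicate = "js:() => window.loaded === true"  # wait for a JS condition to hold
```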

Example Usage

Here's an example of how to use the arun method with various parameters:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

async def main():
    # Create the AsyncWebCrawler instance 
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler with custom parameters
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=10,
            extraction_strategy=CosineStrategy(semantic_filter="finance"),
            chunking_strategy=NlpSentenceChunking(),
            bypass_cache=True,
            css_selector="div.article-content",
            screenshot=True,
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            verbose=True,
            only_text=True,
            session_id="business_news_session",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
        )

        print(result)

# Run the async function
asyncio.run(main())

This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.

Additional Asynchronous Methods

The AsyncWebCrawler class also provides other useful asynchronous methods:

arun_many

Description: Crawl multiple URLs concurrently.
Example:

urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
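
The fan-out pattern behind arun_many can be sketched with plain asyncio, using a stub in place of the real crawl — crawl_one here is a hypothetical placeholder, not a crawl4ai API:

```python
import asyncio

async def crawl_one(url: str) -> str:
    # Stand-in for a single crawl; arun_many schedules many such
    # tasks concurrently and gathers their results.
    await asyncio.sleep(0)
    return f"result for {url}"

async def crawl_all(urls: list[str]) -> list[str]:
    # gather() preserves input order, so results line up with urls.
    return await asyncio.gather(*(crawl_one(u) for u in urls))

results = asyncio.run(crawl_all(["https://example1.com", "https://example2.com"]))
```

Because the crawls run concurrently rather than one after another, total wall-clock time is roughly that of the slowest URL instead of the sum of all of them.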

aclear_cache

Description: Clear the crawler's cache.
Example:

await crawler.aclear_cache()

aflush_cache

Description: Completely flush the crawler's cache.
Example:

await crawler.aflush_cache()

aget_cache_size

Description: Get the current size of the cache.
Example:

cache_size = await crawler.aget_cache_size()
print(f"Current cache size: {cache_size}")

These asynchronous methods allow for efficient and flexible use of the AsyncWebCrawler in various scenarios.