Files

UncleCode 1c9464b988 Update all documents

2025-01-08 19:31:31 +08:00

8.8 KiB

Raw Blame History

AsyncWebCrawler

The AsyncWebCrawler is the core class for asynchronous web crawling in Crawl4AI. You typically create it once, optionally customize it with a BrowserConfig (e.g., headless, user agent), then run multiple arun() calls with different CrawlerRunConfig objects.

Recommended usage: 1. Create a BrowserConfig for global browser settings.
2. Instantiate AsyncWebCrawler(config=browser_config).
3. Use the crawler in an async context manager (async with) or manage start/close manually.
4. Call arun(url, config=crawler_run_config) for each page you want.

1. Constructor Overview

class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy: 
                (Advanced) Provide a custom crawler strategy if needed.
            config: 
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache: 
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:     
                Folder for storing caches/logs (if relevant).
            thread_safe: 
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs: 
                Additional legacy or debugging parameters.
        """
    )

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)

Notes:

Legacy parameters like always_bypass_cache remain for backward compatibility, but prefer to set caching in CrawlerRunConfig.

2. Lifecycle: Start/Close or Context Manager

2.1 Context Manager (Recommended)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources

When the async with block ends, the crawler cleans up (closes the browser, etc.).

2.2 Manual Start & Close

crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()

Use this style if you have a long-running application or need full control of the crawler’s lifecycle.

3. Primary Method: `arun()`

async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...

3.1 New Approach

You pass a CrawlerRunConfig object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.

import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))

3.2 Legacy Parameters Still Accepted

For backward compatibility, arun() can still accept direct arguments like css_selector=..., word_count_threshold=..., etc., but we strongly advise migrating them into a CrawlerRunConfig.

4. Helper Methods

4.1 `arun_many()`

async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters...
) -> List[CrawlResult]:
    ...

Crawls multiple URLs in concurrency. Accepts the same style CrawlerRunConfig. Example:

run_cfg = CrawlerRunConfig(
    # e.g., concurrency, wait_for, caching, extraction, etc.
    semaphore_count=5
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com", "https://another.com"],
        config=run_cfg
    )
    for r in results:
        print(r.url, ":", len(r.cleaned_html))

4.2 `start()` & `close()`

Allows manual lifecycle usage instead of context manager:

crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)

await crawler.close()

5. `CrawlResult` Output

Each arun() returns a CrawlResult containing:

url: Final URL (if redirected).
html: Original HTML.
cleaned_html: Sanitized HTML.
markdown_v2 (or future markdown): Markdown outputs (raw, fit, etc.).
extracted_content: If an extraction strategy was used (JSON for CSS/LLM strategies).
screenshot, pdf: If screenshots/PDF requested.
media, links: Information about discovered images/links.
success, error_message: Status info.

For details, see CrawlResult doc.

6. Quick Example

Below is an example hooking it all together:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title", 
                "selector": "h2", 
                "type": "text"
            },
            {
                "name": "url", 
                "selector": "a", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())

Explanation:

We define a BrowserConfig with Firefox, no headless, and verbose=True.
We define a CrawlerRunConfig that bypasses cache, uses a CSS extraction schema, has a word_count_threshold=15, etc.
We pass them to AsyncWebCrawler(config=...) and arun(url=..., config=...).

7. Best Practices & Migration Notes

1. Use BrowserConfig for global settings about the browser’s environment.
2. Use CrawlerRunConfig for per-crawl logic (caching, content filtering, extraction strategies, wait conditions).
3. Avoid legacy parameters like css_selector or word_count_threshold directly in arun(). Instead:

run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)

4. Context Manager usage is simplest unless you want a persistent crawler across many calls.

8. Summary

AsyncWebCrawler is your entry point to asynchronous crawling:

Constructor accepts BrowserConfig (or defaults).
arun(url, config=CrawlerRunConfig) is the main method for single-page crawls.
arun_many(urls, config=CrawlerRunConfig) handles concurrency across multiple URLs.
For advanced lifecycle control, use start() and close() explicitly.

Migration:

If you used AsyncWebCrawler(browser_type="chromium", css_selector="..."), move browser settings to BrowserConfig(...) and content/crawl logic to CrawlerRunConfig(...).

This modular approach ensures your code is clean, scalable, and easy to maintain. For any advanced or rarely used parameters, see the BrowserConfig docs.

8.8 KiB Raw Blame History Unescape Escape