Below is the updated guide for the AsyncWebCrawler class, reflecting the new recommended approach of configuring the browser via BrowserConfig and each crawl via CrawlerRunConfig. While the crawler still accepts legacy parameters for backward compatibility, the modern, maintainable way is shown below.
AsyncWebCrawler
The AsyncWebCrawler is the core class for asynchronous web crawling in Crawl4AI. You typically create it once, optionally customize it with a BrowserConfig (e.g., headless, user agent), then run multiple arun() calls with different CrawlerRunConfig objects.
Recommended usage:
1. Create a BrowserConfig for global browser settings.
2. Instantiate AsyncWebCrawler(config=browser_config).
3. Use the crawler in an async context manager (async with) or manage start/close manually.
4. Call arun(url, config=crawler_run_config) for each page you want.
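Put together, the four steps above look like this (a minimal sketch; example.com stands in for whatever page you want to crawl):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1. Global browser settings
    browser_cfg = BrowserConfig(headless=True)

    # 2-3. Instantiate the crawler and use it as an async context manager
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. One CrawlerRunConfig per crawl
        run_cfg = CrawlerRunConfig(word_count_threshold=10)
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success)

asyncio.run(main())
```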
1. Constructor Overview
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy: (Advanced) Provide a custom crawler strategy if needed.
            config: A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache: (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory: Folder for storing caches/logs (if relevant).
            thread_safe: If True, attempts some concurrency safeguards. Usually False.
            **kwargs: Additional legacy or debugging parameters.
        """
Typical Initialization
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
Notes:
- Legacy parameters like always_bypass_cache remain for backward compatibility, but prefer to set caching in CrawlerRunConfig.
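For example, a legacy call that bypassed the cache globally can be migrated like this (a sketch; the deprecated keyword is shown commented out only for comparison):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Old (deprecated): caching behavior baked into the crawler itself
# crawler = AsyncWebCrawler(always_bypass_cache=True)

# New: caching decided per crawl via CrawlerRunConfig
crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# result = await crawler.arun("https://example.com", config=run_cfg)
```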
2. Lifecycle: Start/Close or Context Manager
2.1 Context Manager (Recommended)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
When the async with block ends, the crawler cleans up (closes the browser, etc.).
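Conceptually, the context manager behaves like a start()/close() pair wrapped in try/finally (a sketch of the pattern, not the library's exact implementation):

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
try:
    result = await crawler.arun("https://example.com")
finally:
    # Runs even if arun() raises, mirroring the context manager's cleanup
    await crawler.close()
```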
2.2 Manual Start & Close
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
Use this style if you have a long-running application or need full control of the crawler’s lifecycle.
3. Primary Method: arun()
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
3.1 New Approach
You pass a CrawlerRunConfig object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
3.2 Legacy Parameters Still Accepted
For backward compatibility, arun() can still accept direct arguments like css_selector=..., word_count_threshold=..., etc., but we strongly advise migrating them into a CrawlerRunConfig.
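For instance, the two calls below are intended to be equivalent; the first style still works, but the second is the recommended form (a sketch assuming an already-started crawler):

```python
# Legacy: per-crawl options passed directly to arun()
result = await crawler.arun(
    "https://example.com",
    css_selector="main.article",
    word_count_threshold=10,
)

# Recommended: the same options gathered into a CrawlerRunConfig
run_cfg = CrawlerRunConfig(
    css_selector="main.article",
    word_count_threshold=10,
)
result = await crawler.arun("https://example.com", config=run_cfg)
```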
4. Helper Methods
4.1 arun_many()
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters...
) -> List[CrawlResult]:
    ...
Crawls multiple URLs concurrently. Accepts the same style of CrawlerRunConfig. Example:

run_cfg = CrawlerRunConfig(
    # e.g., concurrency, wait_for, caching, extraction, etc.
    semaphore_count=5
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com", "https://another.com"],
        config=run_cfg
    )
    for r in results:
        print(r.url, ":", len(r.cleaned_html))
4.2 start() & close()
Allow manual lifecycle management instead of the context manager:

crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)

await crawler.close()
5. CrawlResult Output
Each arun() returns a CrawlResult containing:
- url: Final URL (if redirected).
- html: Original HTML.
- cleaned_html: Sanitized HTML.
- markdown_v2 (or future markdown): Markdown outputs (raw, fit, etc.).
- extracted_content: If an extraction strategy was used (JSON for CSS/LLM strategies).
- screenshot, pdf: If screenshots/PDF requested.
- media, links: Information about discovered images/links.
- success, error_message: Status info.
For details, see CrawlResult doc.
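Since extracted_content is a JSON string when a CSS or LLM extraction strategy ran, it is typically decoded with json.loads. The snippet below uses a hypothetical hand-built payload shaped like such a result, so it runs without an actual crawl:

```python
import json

# Hypothetical payload, shaped like the output of a JsonCssExtractionStrategy run
extracted_content = json.dumps([
    {"title": "First Post", "url": "/blog/first"},
    {"title": "Second Post", "url": "/blog/second"},
])

articles = json.loads(extracted_content)
print(len(articles))         # 2
print(articles[0]["title"])  # First Post
```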
6. Quick Example
Below is an example hooking it all together:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
Explanation:
- We define a BrowserConfig with Firefox, headless disabled, and verbose=True.
- We define a CrawlerRunConfig that bypasses cache, uses a CSS extraction schema, has a word_count_threshold=15, etc.
- We pass them to AsyncWebCrawler(config=...) and arun(url=..., config=...).
7. Best Practices & Migration Notes
1. Use BrowserConfig for global settings about the browser’s environment.
2. Use CrawlerRunConfig for per-crawl logic (caching, content filtering, extraction strategies, wait conditions).
3. Avoid legacy parameters like css_selector or word_count_threshold directly in arun(). Instead:
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
4. Context Manager usage is simplest unless you want a persistent crawler across many calls.
8. Summary
AsyncWebCrawler is your entry point to asynchronous crawling:
- Constructor accepts BrowserConfig (or defaults).
- arun(url, config=CrawlerRunConfig) is the main method for single-page crawls.
- arun_many(urls, config=CrawlerRunConfig) handles concurrency across multiple URLs.
- For advanced lifecycle control, use start() and close() explicitly.
Migration:
- If you used AsyncWebCrawler(browser_type="chromium", css_selector="..."), move the browser settings to BrowserConfig(...) and the content/crawl logic to CrawlerRunConfig(...).
This modular approach ensures your code is clean, scalable, and easy to maintain. For any advanced or rarely used parameters, see the BrowserConfig docs.