# Optimized Multi-URL Crawling
> **Note**: We’re developing a new **executor module** that uses a sophisticated algorithm to dynamically manage multi-URL crawling, optimizing for speed and memory usage. The approaches in this document remain fully valid, but keep an eye on **Crawl4AI**’s upcoming releases for this powerful feature! Follow [@unclecode](https://twitter.com/unclecode) on X and check the changelogs to stay updated.
Crawl4AI’s **AsyncWebCrawler** can handle multiple URLs in a single run, which can greatly reduce overhead and speed up crawling. This guide shows how to:
1. **Sequentially** crawl a list of URLs using the **same** session, avoiding repeated browser creation.
2. **Parallel**-crawl subsets of URLs in batches, again reusing the same browser.
When the entire process finishes, you close the browser once—**minimizing** memory and resource usage.
---
## 1. Why Avoid Simple Loops per URL?
If you naively do:
```python
for url in urls:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
```
You end up:
1. Spinning up a **new** browser for each URL
2. Closing it immediately after the single crawl
3. Potentially using a lot of CPU/memory for short-lived browsers
4. Missing out on session reuse if you need logins or persistent state
**Better** approaches ensure you **create** the browser once, then crawl multiple URLs with minimal overhead.
---
## 2. Sequential Crawling with Session Reuse
### 2.1 Overview
1. **One** `AsyncWebCrawler` instance for **all** URLs.
2. **One** session (via `session_id`) so we can preserve local storage or cookies across URLs if needed.
3. The crawler is only closed at the **end**.
**This** is the simplest pattern if your workload is moderate (dozens to a few hundred URLs).
### 2.2 Example Code
```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    await crawl_sequential(urls)

if __name__ == "__main__":
    asyncio.run(main())
```
**Why It’s Good**:
- **One** browser launch.
- Minimal memory usage.
- If the site requires login, you can log in once within the shared `session_id` context and preserve authentication across all URLs; see the sketch below.
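
For example, a minimal sketch of that login-once pattern might look like the following. The login URL, the form selectors, and the `js_code` snippet are placeholders you would adapt to your own site, and it assumes `js_code` support in `CrawlerRunConfig`; the point is that the first `arun()` establishes the authenticated state and every later call with the same `session_id` reuses it.

```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def crawl_after_login(urls: List[str]):
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    await crawler.start()
    try:
        session_id = "auth_session"  # shared by every call below

        # Hypothetical login step: URL and selectors are placeholders for your site.
        login_js = """
            document.querySelector('#username').value = 'my_user';
            document.querySelector('#password').value = 'my_pass';
            document.querySelector('form').submit();
        """
        await crawler.arun(
            url="https://example.com/login",
            config=CrawlerRunConfig(js_code=login_js),
            session_id=session_id,
        )

        # Later crawls reuse the same session, so cookies/local storage persist.
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(),
                session_id=session_id,
            )
            print(url, "OK" if result.success else result.error_message)
    finally:
        await crawler.close()

if __name__ == "__main__":
    asyncio.run(crawl_after_login(["https://example.com/account"]))
```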
---
## 3. Parallel Crawling with Browser Reuse
### 3.1 Overview
To speed up crawling further, you can crawl multiple URLs in **parallel** (batches or a concurrency limit). The crawler still uses **one** browser, but spawns different sessions (or the same, depending on your logic) for each task.
### 3.2 Example Code
For this example, make sure to install the [psutil](https://pypi.org/project/psutil/) package.
```bash
pip install psutil
```
Then you can run the following code:
```python
import os
import sys
import psutil
import asyncio

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = os.path.join(__location__, "output")

# Append parent directory to system path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    success_count += 1
                else:
                    fail_count += 1

        print(f"\nSummary:")
        print(f" - Successfully crawled: {success_count}")
        print(f" - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()
        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4"
    ]
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())
```
**Notes**:
- We **reuse** the same `AsyncWebCrawler` instance for all parallel tasks, launching **one** browser.
- Each parallel sub-task might get its own `session_id` so they don’t share cookies/localStorage (unless that’s desired).
- We limit concurrency (e.g., `max_concurrent=2` or 3) to avoid saturating CPU/memory; if you prefer a rolling limit over fixed batches, see the semaphore sketch below.
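
As an alternative to fixed batches, here is a minimal sketch that caps in-flight requests with `asyncio.Semaphore`. It reuses the same single crawler; the session naming is arbitrary, and this is just a different scheduling strategy on top of the same API used above.

```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_with_semaphore(urls: List[str], max_concurrent: int = 3):
    browser_config = BrowserConfig(
        headless=True,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        semaphore = asyncio.Semaphore(max_concurrent)

        async def crawl_one(idx: int, url: str):
            # The semaphore caps how many arun() calls are in flight at any moment.
            async with semaphore:
                return await crawler.arun(
                    url=url,
                    config=crawl_config,
                    session_id=f"semaphore_session_{idx}",
                )

        results = await asyncio.gather(
            *(crawl_one(i, url) for i, url in enumerate(urls)),
            return_exceptions=True,
        )

        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                print(f"Error crawling {url}: {result}")
            elif result.success:
                print(f"Successfully crawled: {url}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        await crawler.close()
```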
---
## 4. Performance Tips
1. **Extra Browser Args**
- `--disable-gpu`, `--no-sandbox` can help in Docker or restricted environments.
- `--disable-dev-shm-usage` avoids using `/dev/shm`, which can be small on some systems.
2. **Session Reuse**
- If your site requires a login or you want to maintain local data across URLs, share the **same** `session_id`.
- If you want isolation (each URL fresh), create unique sessions.
3. **Batching**
- If you have **many** URLs (like thousands), you can do parallel crawling in chunks (like `max_concurrent=5`).
- Use `arun_many()` for a built-in approach if you prefer (see the sketch after this list), but the manual batching example above is often more flexible.
4. **Cache**
- If your pages share many resources or you’re re-crawling the same domain repeatedly, consider setting `cache_mode=CacheMode.ENABLED` in `CrawlerRunConfig`.
- If you need fresh data each time, keep `cache_mode=CacheMode.BYPASS`.
5. **Hooks**
- You can set up global hooks for each crawler (for example, to block images) or configure them per run.
- Keep them consistent if you’re reusing sessions.
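
For reference, here is a minimal sketch of the `arun_many()` approach mentioned in the batching tip, with caching enabled as suggested in the cache tip. It assumes `arun_many()` accepts the same config object and returns one result per URL; check your installed version for the exact behavior and dispatch options.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(headless=True)
    # Enable caching when re-crawling the same domain; switch to BYPASS for always-fresh data.
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(urls, config=crawl_config)
        for result in results:
            if result.success:
                print(f"Crawled {result.url}: {len(result.markdown_v2.raw_markdown)} chars")
            else:
                print(f"Failed {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```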
---
## 5. Summary
- **One** `AsyncWebCrawler` + multiple calls to `.arun()` is far more efficient than launching a new crawler per URL.
- **Sequential** approach with a shared session is simple and memory-friendly for moderate sets of URLs.
- **Parallel** approach can speed up large crawls through concurrency, but keep the concurrency level balanced to avoid overhead.
- Close the crawler once at the end, ensuring the browser is only opened/closed once.
For even more advanced memory optimizations or dynamic concurrency patterns, see future sections on hooking or distributed crawling. The patterns above suffice for the majority of multi-URL scenarios—**giving you speed, simplicity, and minimal resource usage**. Enjoy your optimized crawling!