Merge: with v-0.4.3b
This commit is contained in:
@@ -8,6 +8,7 @@ Crawl4AI offers multiple power-user features that go beyond simple crawling. Thi
3. **Handling SSL Certificates**
4. **Custom Headers**
5. **Session Persistence & Local Storage**
6. **Robots.txt Compliance**

> **Prerequisites**
> - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
@@ -251,6 +252,42 @@ You can sign in once, export the browser context, and reuse it later—without r

---

## 6. Robots.txt Compliance

Crawl4AI supports respecting robots.txt rules with efficient caching:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )

        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**

- Robots.txt files are cached locally for efficiency
- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
- Cache has a default TTL of 7 days
- If robots.txt can't be fetched, crawling is allowed
- Returns 403 status code if URL is disallowed
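
Conceptually, the TTL-based cache works like the following stdlib sketch (the `CachedRobots` class, the in-memory dict, and the hard-coded rules are illustrative assumptions, not Crawl4AI's actual SQLite-backed implementation):

```python
import time
import urllib.robotparser

class CachedRobots:
    """Illustrative per-domain robots.txt cache with a TTL (7 days by default)."""

    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.cache = {}  # domain -> (fetched_at, parser)

    def allowed(self, domain, user_agent, path):
        entry = self.cache.get(domain)
        if entry is None or time.time() - entry[0] > self.ttl:
            rp = urllib.robotparser.RobotFileParser()
            # A real crawler would fetch https://<domain>/robots.txt here;
            # we parse fixed rules to keep the sketch self-contained.
            rp.parse(["User-agent: *", "Disallow: /private/"])
            entry = (time.time(), rp)
            self.cache[domain] = entry
        return entry[1].can_fetch(user_agent, path)
```

A disallowed path corresponds to the 403 result described above.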

---

## Putting It All Together

Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
@@ -321,6 +358,7 @@ You’ve now explored several **advanced** features:
- **SSL Certificate** retrieval & exporting
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
- **Robots.txt Compliance**

With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.

@@ -1,264 +0,0 @@

# Optimized Multi-URL Crawling

> **Note**: We’re developing a new **executor module** that uses a sophisticated algorithm to dynamically manage multi-URL crawling, optimizing for speed and memory usage. The approaches in this document remain fully valid, but keep an eye on **Crawl4AI**’s upcoming releases for this powerful feature! Follow [@unclecode](https://twitter.com/unclecode) on X and check the changelogs to stay updated.

Crawl4AI’s **AsyncWebCrawler** can handle multiple URLs in a single run, which can greatly reduce overhead and speed up crawling. This guide shows how to:

1. **Sequentially** crawl a list of URLs using the **same** session, avoiding repeated browser creation.
2. **Parallel**-crawl subsets of URLs in batches, again reusing the same browser.

When the entire process finishes, you close the browser once—**minimizing** memory and resource usage.

---

## 1. Why Avoid Simple Loops per URL?

If you naively do:
```python
for url in urls:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
```

You end up:

1. Spinning up a **new** browser for each URL
2. Closing it immediately after the single crawl
3. Potentially using a lot of CPU/memory for short-lived browsers
4. Missing out on session reusability if you have login or ongoing states

**Better** approaches ensure you **create** the browser once, then crawl multiple URLs with minimal overhead.

---

## 2. Sequential Crawling with Session Reuse

### 2.1 Overview

1. **One** `AsyncWebCrawler` instance for **all** URLs.
2. **One** session (via `session_id`) so we can preserve local storage or cookies across URLs if needed.
3. The crawler is only closed at the **end**.

**This** is the simplest pattern if your workload is moderate (dozens to a few hundred URLs).

### 2.2 Example Code

```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    await crawl_sequential(urls)

if __name__ == "__main__":
    asyncio.run(main())
```

**Why It’s Good**:

- **One** browser launch.
- Minimal memory usage.
- If the site requires login, you can log in once in `session_id` context and preserve auth across all URLs.

---

## 3. Parallel Crawling with Browser Reuse

### 3.1 Overview

To speed up crawling further, you can crawl multiple URLs in **parallel** (batches or a concurrency limit). The crawler still uses **one** browser, but spawns different sessions (or the same, depending on your logic) for each task.

### 3.2 Example Code

For this example, make sure to install the [psutil](https://pypi.org/project/psutil/) package:

```bash
pip install psutil
```

Then you can run the following code:

```python
import os
import sys
import psutil
import asyncio

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = os.path.join(__location__, "output")

# Append parent directory to system path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    success_count += 1
                else:
                    fail_count += 1

        print("\nSummary:")
        print(f"  - Successfully crawled: {success_count}")
        print(f"  - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()
        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4"
    ]
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())
```

**Notes**:

- We **reuse** the same `AsyncWebCrawler` instance for all parallel tasks, launching **one** browser.
- Each parallel sub-task might get its own `session_id` so they don’t share cookies/localStorage (unless that’s desired).
- We limit concurrency to `max_concurrent=2` or 3 to avoid saturating CPU/memory.

---

## 4. Performance Tips

1. **Extra Browser Args**
   - `--disable-gpu`, `--no-sandbox` can help in Docker or restricted environments.
   - `--disable-dev-shm-usage` avoids using `/dev/shm`, which can be small on some systems.

2. **Session Reuse**
   - If your site requires a login or you want to maintain local data across URLs, share the **same** `session_id`.
   - If you want isolation (each URL fresh), create unique sessions.

3. **Batching**
   - If you have **many** URLs (like thousands), you can do parallel crawling in chunks (like `max_concurrent=5`).
   - Use `arun_many()` for a built-in approach if you prefer, but the example above is often more flexible.

4. **Cache**
   - If your pages share many resources or you’re re-crawling the same domain repeatedly, consider setting `cache_mode=CacheMode.ENABLED` in `CrawlerRunConfig`.
   - If you need fresh data each time, keep `cache_mode=CacheMode.BYPASS`.

5. **Hooks**
   - You can set up global hooks for each crawler (like to block images) or per-run if you want.
   - Keep them consistent if you’re reusing sessions.
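
The batching tip above boils down to this plain-asyncio pattern (`worker` stands in for `crawler.arun`; the function name is illustrative):

```python
import asyncio

async def crawl_in_batches(urls, worker, max_concurrent=5):
    """Process urls in chunks of max_concurrent, awaiting each chunk together."""
    results = []
    for i in range(0, len(urls), max_concurrent):
        batch = urls[i:i + max_concurrent]
        # All tasks in the batch run concurrently; the next batch starts
        # only after the current one finishes, capping peak concurrency.
        results.extend(await asyncio.gather(*(worker(u) for u in batch)))
    return results
```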

---

## 5. Summary

- **One** `AsyncWebCrawler` + multiple calls to `.arun()` is far more efficient than launching a new crawler per URL.
- **Sequential** approach with a shared session is simple and memory-friendly for moderate sets of URLs.
- **Parallel** approach can speed up large crawls through concurrency, but keep concurrency balanced to avoid overhead.
- Close the crawler once at the end, ensuring the browser is only opened/closed once.

For even more advanced memory optimizations or dynamic concurrency patterns, see future sections on hooking or distributed crawling. The patterns above suffice for the majority of multi-URL scenarios—**giving you speed, simplicity, and minimal resource usage**. Enjoy your optimized crawling!

@@ -5,16 +5,20 @@

## 1. Introduction

When crawling many URLs:

- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)

**Why Dispatchers?**

- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency

---

## 2. Core Components

### 2.1 Rate Limiter

@@ -22,34 +26,116 @@ When crawling many URLs:
```python
class RateLimiter:
    def __init__(
        # Random delay range between requests
        base_delay: Tuple[float, float] = (1.0, 3.0),

        # Maximum backoff delay
        max_delay: float = 60.0,

        # Retries before giving up
        max_retries: int = 3,

        # Status codes triggering backoff
        rate_limit_codes: List[int] = [429, 503]
    )
```

The RateLimiter provides:

- Random delays between requests
- Exponential backoff on rate limit responses
- Domain-specific rate limiting
- Automatic retry handling

#### RateLimiter Constructor Parameters

The **RateLimiter** is a utility that helps manage the pace of requests to avoid overloading servers or getting blocked due to rate limits. It operates internally to delay requests and handle retries but can be configured using its constructor parameters.

**Parameters of the `RateLimiter` constructor:**

1. **`base_delay`** (`Tuple[float, float]`, default: `(1.0, 3.0)`)
   The range for a random delay (in seconds) between consecutive requests to the same domain.

   - A random delay is chosen between `base_delay[0]` and `base_delay[1]` for each request.
   - This prevents sending requests at a predictable frequency, reducing the chances of triggering rate limits.

   **Example:**
   If `base_delay = (2.0, 5.0)`, delays could be randomly chosen as `2.3s`, `4.1s`, etc.

2. **`max_delay`** (`float`, default: `60.0`)
   The maximum allowable delay when rate-limiting errors occur.

   - When servers return rate-limit responses (e.g., 429 or 503), the delay increases exponentially with jitter.
   - The `max_delay` ensures the delay doesn’t grow unreasonably high, capping it at this value.

   **Example:**
   For a `max_delay = 30.0`, even if backoff calculations suggest a delay of `45s`, it will cap at `30s`.

3. **`max_retries`** (`int`, default: `3`)
   The maximum number of retries for a request if rate-limiting errors occur.

   - After encountering a rate-limit response, the `RateLimiter` retries the request up to this number of times.
   - If all retries fail, the request is marked as failed, and the process continues.

   **Example:**
   If `max_retries = 3`, the system retries a failed request three times before giving up.

4. **`rate_limit_codes`** (`List[int]`, default: `[429, 503]`)
   A list of HTTP status codes that trigger the rate-limiting logic.

   - These status codes indicate the server is overwhelmed or actively limiting requests.
   - You can customize this list to include other codes based on specific server behavior.

   **Example:**
   If `rate_limit_codes = [429, 503, 504]`, the crawler will back off on these three error codes.
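
Putting the exponential growth, jitter, and `max_delay` cap together, the per-attempt delay can be sketched as follows (the exact jitter range is an assumption for illustration, not the library's formula):

```python
import random

def backoff_delay(attempt, base=1.0, max_delay=30.0):
    """Exponential backoff with jitter, capped at max_delay."""
    raw = base * (2 ** attempt)                   # 1s, 2s, 4s, 8s, ...
    jittered = raw * random.uniform(0.75, 1.25)   # jitter avoids synchronized retries
    return min(jittered, max_delay)               # never exceed the cap
```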

---
**How to Use the `RateLimiter`:**

Here’s an example of initializing and using a `RateLimiter` in your project:

```python
from crawl4ai import RateLimiter

# Create a RateLimiter with custom settings
rate_limiter = RateLimiter(
    base_delay=(2.0, 4.0),       # Random delay between 2-4 seconds
    max_delay=30.0,              # Cap delay at 30 seconds
    max_retries=5,               # Retry up to 5 times on rate-limiting errors
    rate_limit_codes=[429, 503]  # Handle these HTTP status codes
)

# RateLimiter will handle delays and retries internally
# No additional setup is required for its operation
```

The `RateLimiter` integrates seamlessly with dispatchers like `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher`, ensuring requests are paced correctly without user intervention. Its internal mechanisms manage delays and retries to avoid overwhelming servers while maximizing efficiency.
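
Domain-specific pacing amounts to remembering the last request time per domain; a minimal sketch (the class and method names are illustrative, not Crawl4AI internals):

```python
import time

class DomainPacer:
    """Track last request time per domain to enforce a minimum gap."""

    def __init__(self, min_gap=1.0):
        self.min_gap = min_gap
        self.last = {}  # domain -> timestamp of last request

    def wait_time(self, domain, now=None):
        """Seconds to sleep before the next request to this domain."""
        now = time.monotonic() if now is None else now
        prev = self.last.get(domain)
        self.last[domain] = now
        if prev is None:
            return 0.0  # first request to this domain: no wait
        return max(0.0, self.min_gap - (now - prev))
```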

### 2.2 Crawler Monitor

The CrawlerMonitor provides real-time visibility into crawling operations:

```python
from crawl4ai import CrawlerMonitor, DisplayMode

monitor = CrawlerMonitor(
    # Maximum rows in live display
    max_visible_rows=15,

    # DETAILED or AGGREGATED view
    display_mode=DisplayMode.DETAILED
)
```

**Display Modes**:

1. **DETAILED**: Shows individual task status, memory usage, and timing
2. **AGGREGATED**: Displays summary statistics and overall progress

---

## 3. Available Dispatchers

### 3.1 MemoryAdaptiveDispatcher (Default)

@@ -57,8 +143,10 @@ monitor = CrawlerMonitor(

Automatically manages concurrency based on system memory usage:

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
@@ -73,13 +161,37 @@ dispatcher = MemoryAdaptiveDispatcher(
)
```

**Constructor Parameters:**

1. **`memory_threshold_percent`** (`float`, default: `90.0`)
   Specifies the memory usage threshold (as a percentage). If system memory usage exceeds this value, the dispatcher pauses crawling to prevent system overload.

2. **`check_interval`** (`float`, default: `1.0`)
   The interval (in seconds) at which the dispatcher checks system memory usage.

3. **`max_session_permit`** (`int`, default: `10`)
   The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.

4. **`memory_wait_timeout`** (`float`, default: `300.0`)
   Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.

5. **`rate_limiter`** (`RateLimiter`, default: `None`)
   Optional rate-limiting logic to avoid server-side blocking (e.g., for handling 429 or 503 errors). See **RateLimiter** for details.

6. **`monitor`** (`CrawlerMonitor`, default: `None`)
   Optional monitoring for real-time task tracking and performance insights. See **CrawlerMonitor** for details.

---

### 3.2 SemaphoreDispatcher

Provides simple concurrency control with a fixed limit:

```python
from crawl4ai.async_dispatcher import SemaphoreDispatcher

dispatcher = SemaphoreDispatcher(
    max_session_permit=20,     # Maximum concurrent tasks
    rate_limiter=RateLimiter(  # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
@@ -91,6 +203,19 @@ dispatcher = SemaphoreDispatcher(
)
```

**Constructor Parameters:**

1. **`max_session_permit`** (`int`, default: `20`)
   The maximum number of concurrent crawling tasks allowed, irrespective of semaphore slots.

2. **`rate_limiter`** (`RateLimiter`, default: `None`)
   Optional rate-limiting logic to avoid overwhelming servers. See **RateLimiter** for details.

3. **`monitor`** (`CrawlerMonitor`, default: `None`)
   Optional monitoring for tracking task progress and resource usage. See **CrawlerMonitor** for details.

---

## 4. Usage Examples

### 4.1 Batch Processing (Default)

@@ -128,6 +253,14 @@ async def crawl_batch():
                print(f"Failed to crawl {result.url}: {result.error_message}")
```

**Review:**

- **Purpose:** Executes a batch crawl with all URLs processed together after crawling is complete.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.
- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.
- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.

---

### 4.2 Streaming Mode

```python
@@ -161,6 +294,14 @@ async def crawl_streaming():
                print(f"Failed to crawl {result.url}: {result.error_message}")
```

**Review:**

- **Purpose:** Enables streaming to process results as soon as they’re available.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` for concurrency and memory management.
- **Stream:** Enabled (`stream=True`), allowing real-time processing during crawling.
- **Best Use Case:** When you need to act on results immediately, such as for real-time analytics or progressive data storage.

---

### 4.3 Semaphore-based Crawling

```python
@@ -189,6 +330,54 @@ async def crawl_with_semaphore(urls):
    return results
```

**Review:**

- **Purpose:** Uses `SemaphoreDispatcher` to limit concurrency with a fixed number of slots.
- **Dispatcher:** Configured with a semaphore to control parallel crawling tasks.
- **Rate Limiter:** Prevents servers from being overwhelmed by pacing requests.
- **Best Use Case:** When you want precise control over the number of concurrent requests, independent of system memory.

---

### 4.4 Robots.txt Consideration

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Will respect robots.txt for each URL
        semaphore_count=3       # Max concurrent requests
    )

    async with AsyncWebCrawler() as crawler:
        async for result in crawler.arun_many(urls, config=config):
            if result.success:
                print(f"Successfully crawled {result.url}")
            elif result.status_code == 403 and "robots.txt" in result.error_message:
                print(f"Skipped {result.url} - blocked by robots.txt")
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Review:**

- **Purpose:** Ensures compliance with `robots.txt` rules for ethical and legal web crawling.
- **Configuration:** Set `check_robots_txt=True` to validate each URL against `robots.txt` before crawling.
- **Dispatcher:** Handles requests with concurrency limits (`semaphore_count=3`).
- **Best Use Case:** When crawling websites that strictly enforce robots.txt policies or for responsible crawling practices.

---

## 5. Dispatch Results

Each crawl result includes dispatch information:

@@ -217,20 +406,24 @@ for result in results:

## 6. Summary

1. **Two Dispatcher Types**:

   - MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
   - SemaphoreDispatcher: Fixed concurrency limit

2. **Optional Components**:

   - RateLimiter: Smart request pacing and backoff
   - CrawlerMonitor: Real-time progress visualization

3. **Key Benefits**:

   - Automatic memory management
   - Built-in rate limiting
   - Live progress monitoring
   - Flexible concurrency control

Choose the dispatcher that best fits your needs:

- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios

@@ -36,23 +36,33 @@ async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Rotating Proxies

Example using a proxy rotation service dynamically:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}

async def main():
    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig()
    urls = ["https://example.com/page1", "https://example.com/page2"]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # For each URL, create a new run config with a different proxy
        for url in urls:
            proxy = await get_next_proxy()
            # Clone the config and update proxy - this creates a new browser context
            current_config = run_config.clone(proxy_config=proxy)
            result = await crawler.arun(url=url, config=current_config)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```
@@ -22,6 +22,7 @@ async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

@@ -30,8 +31,10 @@ async def main():
        url="https://example.com",
        config=run_config
    )
    print(result.cleaned_html[:500])

    # Check if blocked by robots.txt
    if not result.success and result.status_code == 403:
        print(f"Error: {result.error_message}")
```

**Key Fields**:

@@ -226,6 +229,7 @@ async def main():
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,

@@ -106,6 +106,7 @@ Use these for controlling whether you read or write from a local content cache.
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
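
As a rough sketch of how `mean_delay` and `max_range` could combine into a per-crawl pause (the formula here is an illustrative assumption, not the library's implementation):

```python
import random

def next_delay(mean_delay=0.1, max_range=0.3):
    """Random pause between consecutive crawls in an arun_many()-style loop."""
    return mean_delay + random.uniform(0, max_range)
```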
@@ -266,17 +267,21 @@ async def main():
if __name__ == "__main__":
    asyncio.run(main())

## 2.4 Compliance & Ethics

| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`check_robots_txt`**| `bool` (False) | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend. |
| **`user_agent`** | `str` (None) | User agent string to identify your crawler. Used for robots.txt checking when enabled. |

```python
run_config = CrawlerRunConfig(
    check_robots_txt=True,  # Enable robots.txt compliance
    user_agent="MyBot/1.0"  # Identify your crawler
)
```

**What’s Happening**:

- **`text_mode=True`** avoids loading images and other heavy resources, speeding up the crawl.
- We disable caching (`cache_mode=CacheMode.BYPASS`) to always fetch fresh content.
- We only keep `main.article` content by specifying `css_selector="main.article"`.
- We exclude external links (`exclude_external_links=True`).
- We take a quick screenshot (`screenshot=True`) before finishing.

---

## 3. Putting It All Together

- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.

@@ -95,6 +95,10 @@ strong {
}

div.highlight {
    margin-bottom: 2em;
}

.terminal-card > header {
    color: var(--font-color);
    text-align: center;
@@ -231,6 +235,16 @@ pre {
|
||||
font-size: 2em;
|
||||
}
|
||||
|
||||
.terminal h2 {
|
||||
font-size: 1.5em;
|
||||
margin-bottom: 0.8em;
|
||||
}
|
||||
|
||||
.terminal h3 {
|
||||
font-size: 1.3em;
|
||||
margin-bottom: 0.8em;
|
||||
}
|
||||
|
||||
.terminal h1, .terminal h2, .terminal h3, .terminal h4, .terminal h5, .terminal h6 {
|
||||
text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color), 0 0 0px var(--font-color);
|
||||
}
|
||||
|
||||
docs/md_v2/blog/releases/v0.4.3b1.md (new file, 138 lines)
@@ -0,0 +1,138 @@
# Crawl4AI 0.4.3: Major Performance Boost & LLM Integration

We're excited to announce Crawl4AI 0.4.3, focusing on three key areas: Speed & Efficiency, LLM Integration, and Core Platform Improvements. This release significantly improves crawling performance while adding powerful new LLM-powered features.

## ⚡ Speed & Efficiency Improvements

### 1. Memory-Adaptive Dispatcher System
The new dispatcher system provides intelligent resource management and real-time monitoring:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, CrawlerMonitor

async def main():
    urls = ["https://example1.com", "https://example2.com"] * 50

    # Configure memory-aware dispatch
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Auto-throttle at 80% memory
        check_interval=0.5,             # Check every 0.5 seconds
        max_session_permit=20,          # Max concurrent sessions
        monitor=CrawlerMonitor(         # Real-time monitoring
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=CrawlerRunConfig()
        )

if __name__ == "__main__":
    asyncio.run(main())
```
### 2. Streaming Support
Process crawled URLs in real-time instead of waiting for all results:

```python
config = CrawlerRunConfig(stream=True)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun_many(urls, config=config):
        print(f"Got result for {result.url}")
        # Process each result immediately
```
### 3. LXML-Based Scraping
New LXML scraping strategy offering up to 20x faster parsing:

```python
config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy(),
    cache_mode=CacheMode.ENABLED
)
```
## 🤖 LLM Integration

### 1. LLM-Powered Markdown Generation
Smart content filtering and organization using LLMs:

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=LLMContentFilter(
            provider="openai/gpt-4o",
            instruction="Extract technical documentation and code examples"
        )
    )
)
```
### 2. Automatic Schema Generation
Generate extraction schemas instantly using LLMs instead of manual CSS/XPath writing:

```python
schema = JsonCssExtractionStrategy.generate_schema(
    html_content,
    schema_type="CSS",
    query="Extract product name, price, and description"
)
```
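A generated schema of this kind is then typically plugged back into the extraction strategy, so the LLM runs once and subsequent crawls extract with plain CSS. A minimal sketch, assuming `schema` is the dict produced above:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Reuse the LLM-generated schema for fast, LLM-free extraction on every crawl
strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
```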
## 🔧 Core Improvements

### 1. Proxy Support & Rotation
Integrated proxy support with automatic rotation and verification:

```python
config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```
### 2. Robots.txt Compliance
Built-in robots.txt support with SQLite caching:

```python
config = CrawlerRunConfig(check_robots_txt=True)
result = await crawler.arun(url, config=config)
if result.status_code == 403:
    print("Access blocked by robots.txt")
```
### 3. URL Redirection Tracking
Track final URLs after redirects:

```python
result = await crawler.arun(url)
print(f"Initial URL: {url}")
print(f"Final URL: {result.redirected_url}")
```
## Performance Impact

- Memory usage reduced by up to 40% with the adaptive dispatcher
- Parsing speed increased up to 20x with the LXML strategy
- Streaming reduces the memory footprint of large crawls by ~60%
## Getting Started

```bash
pip install -U crawl4ai
```

For complete examples, check our [demo repository](https://github.com/unclecode/crawl4ai/examples).
## Stay Connected

- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode)
- Join our [Discord](https://discord.gg/crawl4ai)

Happy crawling! 🕷️
@@ -181,7 +181,7 @@ from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="openai/gpt-4",  # or your preferred provider
        provider="openai/gpt-4o",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.