# Crawl4AI 0.4.3: Major Performance Boost & LLM Integration

We're excited to announce Crawl4AI 0.4.3, focusing on three key areas: Speed & Efficiency, LLM Integration, and Core Platform Improvements. This release significantly improves crawling performance while adding powerful new LLM-powered features.

## ⚡ Speed & Efficiency Improvements

### 1. Memory-Adaptive Dispatcher System
The new dispatcher system provides intelligent resource management and real-time monitoring:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, CrawlerMonitor

async def main():
    urls = ["https://example1.com", "https://example2.com"] * 50

    # Configure memory-aware dispatch
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Auto-throttle at 80% memory
        check_interval=0.5,             # Check every 0.5 seconds
        max_session_permit=20,          # Max concurrent sessions
        monitor=CrawlerMonitor(         # Real-time monitoring
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=CrawlerRunConfig()
        )
```
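
To make the throttling idea concrete, here is a minimal, standard-library-only sketch of memory-aware dispatch. It is not Crawl4AI's implementation: `run_throttled`, `fake_fetch`, `make_probe`, and the injected `memory_percent` probe are hypothetical names used for illustration, and the probe replays canned readings so the example runs without a system-metrics dependency.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_throttled(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    memory_percent: Callable[[], float],  # hypothetical probe, e.g. backed by psutil
    memory_threshold_percent: float = 80.0,
    check_interval: float = 0.01,
    max_session_permit: int = 20,
) -> List[str]:
    """Run fetch() over urls, pausing new work while memory is above the threshold."""
    semaphore = asyncio.Semaphore(max_session_permit)  # caps concurrent tasks

    async def worker(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    tasks = []
    for url in urls:
        # Hold back new tasks until the probe reports headroom again.
        while memory_percent() >= memory_threshold_percent:
            await asyncio.sleep(check_interval)
        tasks.append(asyncio.create_task(worker(url)))
    return list(await asyncio.gather(*tasks))  # results keep the input order


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl; echoes the URL after yielding control."""
    await asyncio.sleep(0)
    return f"fetched {url}"


def make_probe(readings: List[float]) -> Callable[[], float]:
    """Return a probe that replays canned readings, then reports 10% forever."""
    it = iter(readings)
    return lambda: next(it, 10.0)


demo_urls = [f"https://example.com/{i}" for i in range(5)]
demo_results = asyncio.run(
    run_throttled(demo_urls, fake_fetch, make_probe([95.0, 90.0, 50.0]))
)
print(demo_results[0])
```

The first two probe readings are above the 80% threshold, so the loop sleeps twice before launching any task. A real dispatcher would also need timeouts and error handling, which this sketch leaves out.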

### 2. Streaming Support
Process crawled URLs in real time instead of waiting for all results:

```python
config = CrawlerRunConfig(stream=True)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun_many(urls, config=config):
        print(f"Got result for {result.url}")
        # Process each result immediately
```
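
The streaming pattern itself can be sketched with plain `asyncio`. This only illustrates consuming results as they complete; it makes no claim about how `arun_many` works internally, and `fake_crawl`, `stream_results`, and the example hosts are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_crawl(url: str, delay: float) -> str:
    """Stand-in for a crawl that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return url


async def stream_results(jobs: Dict[str, float]) -> AsyncIterator[str]:
    """Yield each result as soon as its task finishes, not in submission order."""
    tasks = [asyncio.create_task(fake_crawl(u, d)) for u, d in jobs.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo() -> List[str]:
    # The slow job is submitted first but should arrive last.
    jobs = {"https://slow.example": 0.2, "https://fast.example": 0.01}
    seen = []
    async for url in stream_results(jobs):
        seen.append(url)  # process each result immediately
    return seen


arrival_order = asyncio.run(demo())
print(arrival_order)
```

Because each result is handled as it arrives, the caller never has to hold the full result list in memory at once.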

### 3. LXML-Based Scraping
A new LXML-based scraping strategy offers up to 20x faster parsing:

```python
config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy(),
    cache_mode=CacheMode.ENABLED
)
```

## 🤖 LLM Integration

### 1. LLM-Powered Markdown Generation
Smart content filtering and organization using LLMs:

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=LLMContentFilter(
            provider="openai/gpt-4o",
            instruction="Extract technical documentation and code examples"
        )
    )
)
```

### 2. Automatic Schema Generation
Generate extraction schemas with an LLM instead of writing CSS/XPath selectors by hand:

```python
schema = JsonCssExtractionStrategy.generate_schema(
    html_content,
    schema_type="CSS",
    query="Extract product name, price, and description"
)
```

## 🔧 Core Improvements

### 1. Proxy Support & Rotation
Integrated proxy support with automatic rotation and verification:

```python
config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```
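
The example above configures a single proxy. As a rough illustration of what rotation means, the sketch below cycles through a pool round-robin using only the standard library. The pool entries and the `rotating_proxies` helper are hypothetical; only the dict shape (`server`, `username`, `password`) is taken from the example above, and this is not a description of Crawl4AI's built-in rotation.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Hypothetical proxy pool; replace with real endpoints and credentials.
PROXIES: List[Dict[str, str]] = [
    {"server": "http://proxy-a:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy-b:8080", "username": "user", "password": "pass"},
]


def rotating_proxies(pool: List[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield proxy configs round-robin, copying each so callers can't mutate the pool."""
    if not pool:
        raise ValueError("proxy pool must not be empty")
    for proxy in cycle(pool):
        yield dict(proxy)


rotation = rotating_proxies(PROXIES)
assigned = [next(rotation)["server"] for _ in range(3)]
print(assigned)
```

Each dict drawn from `rotation` could then be passed as the `proxy_config` for the next request.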

### 2. Robots.txt Compliance
Built-in robots.txt support with SQLite caching:

```python
config = CrawlerRunConfig(check_robots_txt=True)
result = await crawler.arun(url, config=config)
if result.status_code == 403:
    print("Access blocked by robots.txt")
```
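
For readers unfamiliar with what a robots.txt check involves, the sketch below uses Python's standard `urllib.robotparser` to evaluate a rule set against two URLs. It illustrates the concept only and is not Crawl4AI's implementation; the `ROBOTS_TXT` body, the `is_allowed` helper, and the `MyCrawler` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a crawler would normally fetch this from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "MyCrawler", "https://example.com/docs/")
private_ok = is_allowed(ROBOTS_TXT, "MyCrawler", "https://example.com/private/data")
print(public_ok, private_ok)
```

Caching the parsed rules per host, as the SQLite caching mentioned above suggests, avoids re-fetching robots.txt for every URL on the same site.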

### 3. URL Redirection Tracking
Track final URLs after redirects:

```python
result = await crawler.arun(url)
print(f"Initial URL: {url}")
print(f"Final URL: {result.redirected_url}")
```

## Performance Impact

- Memory usage reduced by up to 40% with the adaptive dispatcher
- Parsing speed increased by up to 20x with the LXML strategy
- Streaming reduces the memory footprint of large crawls by ~60%

## Getting Started

```bash
pip install -U crawl4ai
```

For complete examples, check our [demo repository](https://github.com/unclecode/crawl4ai/examples).

## Stay Connected

- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode)
- Join our [Discord](https://discord.gg/crawl4ai)

Happy crawling! 🕷️