feat: add v0.7.3 release notes, changelog updates, and documentation for new features
CHANGELOG.md (70 changed lines)
@@ -5,6 +5,76 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.7.3] - 2025-08-09

### Added

- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
  - `browser_adapter.py` with undetected Chrome integration
  - Bypasses sophisticated bot-detection systems (Cloudflare, Akamai, custom solutions)
  - Support for headless stealth mode with anti-detection techniques
  - Human-like behavior simulation with random mouse movements and scrolling
  - Comprehensive examples for anti-bot strategies and stealth crawling
  - Full documentation guide for undetected browser usage

- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
  - Different crawling strategies for different URL patterns in a single batch
  - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
  - Lambda-function matchers for complex URL logic
  - Mixed matchers combining strings and functions with AND/OR logic
  - First-match-wins configuration selection, with optional fallback configuration when no patterns match

- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
  - New `memory_utils.py` module for memory monitoring and optimization
  - Real-time memory usage tracking during crawl sessions
  - Memory-leak detection and reporting
  - Performance-optimization recommendations
  - Peak memory usage analysis and efficiency metrics
  - Automatic cleanup suggestions for memory-intensive operations

- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
  - Direct `result.tables` interface replacing the generic `result.media` approach
  - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
  - Enhanced table detection algorithms for better accuracy
  - Table metadata including source XPath and headers
  - Improved table structure preservation during extraction

- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
  - Supporter ($5/month): Community support + early feature previews
  - Professional ($25/month): Priority support + beta access
  - Business ($100/month): Direct consultation + custom integrations
  - Enterprise ($500/month): Dedicated support + feature development
  - Custom arrangement options for larger organizations

- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
  - `LLM_PROVIDER` environment variable support for dynamic provider switching
  - `.llm.env` file support for secure configuration management
  - Per-request provider override capabilities in API endpoints
  - Support for OpenAI, Groq, and other providers without rebuilding images
  - Enhanced Docker documentation with deployment examples

### Fixed

- **URL Matcher Fallback**: Resolved edge cases in URL pattern-matching logic
- **Memory Management**: Fixed memory leaks in long-running crawl sessions
- **Sitemap Processing**: Improved redirect handling in sitemap fetching
- **Table Extraction**: Enhanced table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures

### Changed

- **Architecture Refactoring**: Major cleanup and optimization
  - Moved 2,450+ lines from the main `async_crawler_strategy.py` to backup
  - Cleaner separation of concerns in the crawler architecture
  - Better maintainability and code organization
  - Preserved backward compatibility while improving performance

### Documentation

- **Comprehensive Examples**: Added real-world URLs and practical use cases
- **API Documentation**: Complete CrawlResult field documentation with all available fields
- **Migration Guides**: Updated table-extraction patterns from `result.media` to `result.tables`
- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
- **Multi-Config Examples**: Detailed examples for URL-specific configurations
- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration

## [0.7.x] - 2025-06-29

### Added
README.md (90 changed lines)
@@ -27,9 +27,9 @@
Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

[✨ Check out the latest update v0.7.3](#-recent-updates)

✨ New in v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, and GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -542,7 +542,89 @@ async def test_news_crawl():
## ✨ Recent Updates

<details>
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>

- **🕵️ Undetected Browser Support**: Bypass sophisticated bot-detection systems:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://protected-site.com")
    # Bypasses Cloudflare, Akamai, and custom bot detection
```

- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:

```python
from crawl4ai import CrawlerRunConfig, MatchMode

configs = [
    # Documentation sites - aggressive caching
    CrawlerRunConfig(
        url_matcher=["*docs*", "*documentation*"],
        cache_mode="write",
        markdown_generator_options={"include_links": True}
    ),

    # News/blog sites - fresh content
    CrawlerRunConfig(
        url_matcher=lambda url: 'blog' in url or 'news' in url,
        cache_mode="bypass"
    ),

    # Fallback for everything else
    CrawlerRunConfig()
]

results = await crawler.arun_many(urls, config=configs)
# Each URL gets the right configuration automatically
```

- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:

```python
from crawl4ai.memory_utils import MemoryMonitor

monitor = MemoryMonitor()
monitor.start_monitoring()

results = await crawler.arun_many(large_url_list)

report = monitor.get_report()
print(f"Peak memory: {report['peak_mb']:.1f} MB")
print(f"Efficiency: {report['efficiency']:.1f}%")
# Also surfaces optimization recommendations
```

- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:

```python
result = await crawler.arun("https://site-with-tables.com")

# New way - direct table access
if result.tables:
    import pandas as pd
    for table in result.tables:
        df = pd.DataFrame(table['data'])
        print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
```

- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables

[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

</details>

<details>
<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>

- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
@@ -607,6 +689,8 @@ async def test_news_crawl():
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

</details>

## Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
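For final releases, PEP 440 ordering is a simple numeric, component-wise comparison. A minimal illustration (final releases only; real PEP 440 parsing also handles pre-release, post-release, and dev segments):

```python
def release_tuple(version: str) -> tuple:
    # Minimal parse for final releases only, e.g. "0.7.3" -> (0, 7, 3)
    return tuple(int(part) for part in version.split("."))

# Components compare numerically, not lexicographically
assert release_tuple("0.7.3") > release_tuple("0.7.0")
assert release_tuple("0.10.0") > release_tuple("0.9.9")
```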
@@ -8,10 +8,14 @@ Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This
## 🎯 What's New at a Glance

- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
- **🔧 Bug Fixes**: Resolved several critical issues for better stability
- **📚 Documentation Updates**: Clearer examples and improved API documentation

## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
@@ -78,6 +82,182 @@ async with AsyncWebCrawler() as crawler:
- **Reduced Complexity**: No more if/else forests in your extraction code
- **Better Performance**: Each URL gets exactly the processing it needs
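Under the hood this is a first-match-wins lookup over the config list. A minimal, library-free sketch of that selection logic (the helper names here are illustrative, not Crawl4AI internals):

```python
from fnmatch import fnmatch

def url_matches(matcher, url: str) -> bool:
    # A matcher may be a glob-style string or a callable, as in the examples above
    if callable(matcher):
        return bool(matcher(url))
    return fnmatch(url, matcher)

def select_config(url: str, configs, fallback=None):
    # First-match-wins: the first config whose matcher accepts the URL is used
    for matcher, config in configs:
        if url_matches(matcher, url):
            return config
    return fallback  # used when no pattern matches

configs = [
    ("*/blog/*", "fresh-content"),
    (lambda u: u.endswith(".pdf"), "pdf-handling"),
]
select_config("https://site.com/blog/post", configs, fallback="default")   # → "fresh-content"
select_config("https://site.com/report.pdf", configs, fallback="default")  # → "pdf-handling"
select_config("https://site.com/about", configs, fallback="default")       # → "default"
```

Because the first matching entry wins, order configs from most to least specific and leave the catch-all fallback last.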
## 🕵️ Undetected Browser Support: Stealth Mode Activated

**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.

**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.

### Technical Implementation

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Enable undetected mode for stealth crawling
browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security",
        "--disable-features=VizDisplayCompositor"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    # This will bypass most bot detection systems
    result = await crawler.arun("https://protected-site.com")

    if result.success:
        print("✅ Successfully bypassed bot detection!")
        print(f"Content length: {len(result.markdown)}")
```

**Advanced Anti-Bot Strategies:**

```python
# Combine multiple stealth techniques
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # Random user agents and headers
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1"
    },

    # Human-like behavior simulation
    js_code="""
    // Random mouse movements
    const simulateHuman = () => {
        const event = new MouseEvent('mousemove', {
            clientX: Math.random() * window.innerWidth,
            clientY: Math.random() * window.innerHeight
        });
        document.dispatchEvent(event);
    };
    setInterval(simulateHuman, 100 + Math.random() * 200);

    // Random scrolling
    const randomScroll = () => {
        const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
        window.scrollTo(0, scrollY);
    };
    setTimeout(randomScroll, 500 + Math.random() * 1000);
    """,

    # Delay to appear more human
    delay_before_return_html=2.0
)

result = await crawler.arun("https://bot-protected-site.com", config=config)
```

**Expected Real-World Impact:**

- **Enterprise Scraping**: Access previously blocked corporate sites and databases
- **Market Research**: Gather data from competitor sites with protection
- **Price Monitoring**: Track e-commerce sites that block automated access
- **Content Aggregation**: Collect news and social media despite anti-bot measures
- **Compliance Testing**: Verify your own site's bot protection effectiveness
## 🧠 Memory Monitoring & Optimization

**The Problem:** Long-running crawl sessions can consume excessive memory, especially when processing large batches or JavaScript-heavy sites.

**My Solution:** I built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.

### Memory Tracking Implementation

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info

# Monitor memory during crawling
monitor = MemoryMonitor()

async with AsyncWebCrawler() as crawler:
    # Start monitoring
    monitor.start_monitoring()

    # Perform memory-intensive operations
    results = await crawler.arun_many([
        "https://heavy-js-site.com",
        "https://large-images-site.com",
        "https://dynamic-content-site.com"
    ])

    # Get detailed memory report
    memory_report = monitor.get_report()
    print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")

    # Automatic cleanup suggestions
    if memory_report['peak_mb'] > 1000:  # > 1 GB
        print("💡 Consider batch size optimization")
        print("💡 Enable aggressive garbage collection")
```

**Expected Real-World Impact:**

- **Production Stability**: Prevent memory-related crashes in long-running services
- **Cost Optimization**: Right-size server resources based on actual usage
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
- **Scalability Planning**: Understand memory patterns for horizontal scaling
## 📊 Enhanced Table Extraction

**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.

**My Solution:** A dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.

### New Table Access Pattern

```python
# Old way (deprecated)
# tables_data = result.media.get('tables', [])

# New way (v0.7.3+)
result = await crawler.arun("https://site-with-tables.com")

# Direct table access
if result.tables:
    print(f"Found {len(result.tables)} tables")

    # Convert to pandas DataFrame instantly
    import pandas as pd

    for i, table in enumerate(result.tables):
        df = pd.DataFrame(table['data'])
        print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
        print(df.head())

        # Table metadata
        print(f"Source: {table.get('source_xpath', 'Unknown')}")
        print(f"Headers: {table.get('headers', [])}")
```

**Expected Real-World Impact:**

- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
- **ETL Pipelines**: Cleaner integration with data-processing workflows
- **Reporting**: Simplified table extraction for automated reporting systems
## 💰 Community Support: GitHub Sponsors

I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.

**Sponsorship Tiers:**

- **🌱 Supporter ($5/month)**: Community support + early feature previews
- **🚀 Professional ($25/month)**: Priority support + beta access
- **🏢 Business ($100/month)**: Direct consultation + custom integrations
- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development

**Why Sponsor?**

- Ensure continuous development and maintenance
- Get priority support and feature requests
- Access to premium documentation and examples
- Direct line to the development team

[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
## 🐳 Docker: Flexible LLM Provider Configuration

**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
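Based on the feature list above (`LLM_PROVIDER` variable and `.llm.env` file support), a hedged deployment sketch — the image name, port, and provider-string format are assumptions, not confirmed by this commit:

```shell
# .llm.env — keeps provider choice and secrets out of the image
# (file name from the release notes; variable values below are illustrative)
# LLM_PROVIDER=openai/gpt-4o
# OPENAI_API_KEY=sk-...

# Switch providers by editing the env file and restarting — no rebuild needed
docker run -d --env-file .llm.env -p 11235:11235 unclecode/crawl4ai
```

The release notes also mention per-request provider overrides in the API endpoints, so the env file would act as the default rather than a hard limit.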