diff --git a/CHANGELOG.md b/CHANGELOG.md
index 45769940..9788caf2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,76 @@
 All notable changes to Crawl4AI will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.7.3] - 2025-08-09
+
+### Added
+- **πŸ•΅οΈ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
+  - `browser_adapter.py` with undetected Chrome integration
+  - Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
+  - Support for headless stealth mode with anti-detection techniques
+  - Human-like behavior simulation with random mouse movements and scrolling
+  - Comprehensive examples for anti-bot strategies and stealth crawling
+  - Full documentation guide for undetected browser usage
+
+- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
+  - Different crawling strategies for different URL patterns in a single batch
+  - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
+  - Lambda function matchers for complex URL logic
+  - Mixed matchers combining strings and functions with AND/OR logic
+  - Fallback configuration support when no patterns match
+  - First-match-wins configuration selection with optional fallback
+
+- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
+  - New `memory_utils.py` module for memory monitoring and optimization
+  - Real-time memory usage tracking during crawl sessions
+  - Memory leak detection and reporting
+  - Performance optimization recommendations
+  - Peak memory usage analysis and efficiency metrics
+  - Automatic cleanup suggestions for memory-intensive operations
+
+- **πŸ“Š Enhanced Table Extraction**: Improved table access and DataFrame conversion
+  - Direct `result.tables` interface replacing the generic `result.media` approach
+  - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
+  - Enhanced table detection algorithms for better accuracy
+  - Table metadata including source XPath and headers
+  - Improved table structure preservation during extraction
+
+- **πŸ’° GitHub Sponsors Integration**: 4-tier sponsorship system
+  - Supporter ($5/month): Community support + early feature previews
+  - Professional ($25/month): Priority support + beta access
+  - Business ($100/month): Direct consultation + custom integrations
+  - Enterprise ($500/month): Dedicated support + feature development
+  - Custom arrangement options for larger organizations
+
+- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
+  - `LLM_PROVIDER` environment variable support for dynamic provider switching
+  - `.llm.env` file support for secure configuration management
+  - Per-request provider override capabilities in API endpoints
+  - Support for OpenAI, Groq, and other providers without rebuilding images
+  - Enhanced Docker documentation with deployment examples
+
+### Fixed
+- **URL Matcher Fallback**: Resolved edge cases in URL pattern matching logic
+- **Memory Management**: Fixed memory leaks in long-running crawl sessions
+- **Sitemap Processing**: Improved redirect handling in sitemap fetching
+- **Table Extraction**: Enhanced table detection and extraction accuracy
+- **Error Handling**: Better error messages and recovery from network failures
+
+### Changed
+- **Architecture Refactoring**: Major cleanup and optimization
+  - Moved 2,450+ lines from main `async_crawler_strategy.py` to backup
+  - Cleaner separation of concerns in crawler architecture
+  - Better maintainability and code organization
+  - Preserved backward compatibility while improving performance
+
+### Documentation
+- **Comprehensive Examples**: Added real-world URLs and practical use cases
+- **API Documentation**: Complete CrawlResult field documentation with all available fields
+- **Migration Guides**: Updated table extraction patterns from `result.media` to `result.tables`
+- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
+- **Multi-Config Examples**: Detailed examples for URL-specific configurations
+- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration
+
 ## [0.7.x] - 2025-06-29
 
 ### Added
diff --git a/README.md b/README.md
index a6b279a0..c6309d0c 100644
--- a/README.md
+++ b/README.md
@@ -27,9 +27,9 @@
 Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
 
-[✨ Check out latest update v0.7.0](#-recent-updates)
+[✨ Check out latest update v0.7.3](#-recent-updates)
 
-✨ New in v0.7.0, Adaptive Crawling, Virtual Scroll, Link Preview scoring, Async URL Seeder, big performance gains. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
+✨ New in v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
 πŸ€“ My Personal Story
@@ -542,7 +542,89 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates
 
-### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
+
+### Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update
+
+- **πŸ•΅οΈ Undetected Browser Support**: Bypass sophisticated bot detection systems:
+  ```python
+  from crawl4ai import AsyncWebCrawler, BrowserConfig
+
+  browser_config = BrowserConfig(
+      browser_type="undetected",  # Use undetected Chrome
+      headless=True,              # Can run headless with stealth
+      extra_args=[
+          "--disable-blink-features=AutomationControlled",
+          "--disable-web-security"
+      ]
+  )
+
+  async with AsyncWebCrawler(config=browser_config) as crawler:
+      result = await crawler.arun("https://protected-site.com")
+      # Successfully bypass Cloudflare, Akamai, and custom bot detection
+  ```
+
+- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
+  ```python
+  from crawl4ai import CrawlerRunConfig, MatchMode
+
+  configs = [
+      # Documentation sites - aggressive caching
+      CrawlerRunConfig(
+          url_matcher=["*docs*", "*documentation*"],
+          cache_mode="write",
+          markdown_generator_options={"include_links": True}
+      ),
+
+      # News/blog sites - fresh content
+      CrawlerRunConfig(
+          url_matcher=lambda url: 'blog' in url or 'news' in url,
+          cache_mode="bypass"
+      ),
+
+      # Fallback for everything else
+      CrawlerRunConfig()
+  ]
+
+  results = await crawler.arun_many(urls, config=configs)
+  # Each URL gets the perfect configuration automatically
+  ```
+
+- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
+  ```python
+  from crawl4ai.memory_utils import MemoryMonitor
+
+  monitor = MemoryMonitor()
+  monitor.start_monitoring()
+
+  results = await crawler.arun_many(large_url_list)
+
+  report = monitor.get_report()
+  print(f"Peak memory: {report['peak_mb']:.1f} MB")
+  print(f"Efficiency: {report['efficiency']:.1f}%")
+  # Get optimization recommendations
+  ```
+
+- **πŸ“Š Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
+  ```python
+  result = await crawler.arun("https://site-with-tables.com")
+
+  # New way - direct table access
+  if result.tables:
+      import pandas as pd
+      for table in result.tables:
+          df = pd.DataFrame(table['data'])
+          print(f"Table: {df.shape[0]} rows Γ— {df.shape[1]} columns")
+  ```
+
+- **πŸ’° GitHub Sponsors**: 4-tier sponsorship system for project sustainability
+- **🐳 Docker LLM Flexibility**: Configure providers via environment variables
+
+[Full v0.7.3 Release Notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
+
+
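The multi-URL configuration above resolves by first match with an optional fallback. A minimal illustrative sketch of that selection rule, using plain dicts in place of `CrawlerRunConfig` and a hypothetical `select_config` helper (this is not the library's internal API):

```python
import fnmatch


def select_config(url, configs):
    """Return the first config whose matcher accepts the URL.

    A config without a matcher acts as a catch-all fallback.
    """
    for cfg in configs:
        matcher = cfg.get("url_matcher")
        if matcher is None:  # no matcher: fallback, matches everything
            return cfg
        if callable(matcher) and matcher(url):  # lambda matcher
            return cfg
        if isinstance(matcher, list) and any(  # wildcard string patterns
            fnmatch.fnmatch(url, pattern) for pattern in matcher
        ):
            return cfg
    return None


configs = [
    {"name": "docs", "url_matcher": ["*docs*", "*documentation*"]},
    {"name": "news", "url_matcher": lambda url: "blog" in url or "news" in url},
    {"name": "default"},  # fallback: no matcher
]

print(select_config("https://docs.crawl4ai.com/intro", configs)["name"])  # docs
print(select_config("https://example.com/blog/post", configs)["name"])    # news
print(select_config("https://example.com/pricing", configs)["name"])      # default
```

Order matters under first-match-wins: the fallback goes last, since an unconditional config earlier in the list would shadow everything after it.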
+
+
+
+### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
 
 - **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
   ```python
@@ -607,6 +689,8 @@ async def test_news_crawl():
 
 Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
+
 ## Version Numbering in Crawl4AI
 
 Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
diff --git a/docs/blog/release-v0.7.3.md b/docs/blog/release-v0.7.3.md
index d08d4774..59b2a9be 100644
--- a/docs/blog/release-v0.7.3.md
+++ b/docs/blog/release-v0.7.3.md
@@ -8,10 +8,14 @@ Today I'm releasing Crawl4AI v0.7.3β€”the Multi-Config Intelligence Update. This
 ## 🎯 What's New at a Glance
 
-- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
-- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
-- **Bug Fixes**: Resolved several critical issues for better stability
-- **Documentation Updates**: Clearer examples and improved API documentation
+- **πŸ•΅οΈ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
+- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
+- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
+- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
+- **πŸ“Š Enhanced Table Extraction**: Improved table access and DataFrame conversion
+- **πŸ’° GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
+- **πŸ”§ Bug Fixes**: Resolved several critical issues for better stability
+- **πŸ“š Documentation Updates**: Clearer examples and improved API documentation
 
 ## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
 
@@ -78,6 +82,182 @@ async with AsyncWebCrawler() as crawler:
 - **Reduced Complexity**: No more if/else forests in your extraction code
 - **Better Performance**: Each URL gets exactly the processing it needs
 
+## πŸ•΅οΈ Undetected Browser Support: Stealth Mode Activated
+
+**The Problem:** Modern websites employ sophisticated bot detection systems.
Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
+
+**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
+
+### Technical Implementation
+
+```python
+from crawl4ai import AsyncWebCrawler, BrowserConfig
+
+# Enable undetected mode for stealth crawling
+browser_config = BrowserConfig(
+    browser_type="undetected",  # Use undetected Chrome
+    headless=True,              # Can run headless with stealth
+    extra_args=[
+        "--disable-blink-features=AutomationControlled",
+        "--disable-web-security",
+        "--disable-features=VizDisplayCompositor"
+    ]
+)
+
+async with AsyncWebCrawler(config=browser_config) as crawler:
+    # This will bypass most bot detection systems
+    result = await crawler.arun("https://protected-site.com")
+
+    if result.success:
+        print("βœ… Successfully bypassed bot detection!")
+        print(f"Content length: {len(result.markdown)}")
+```
+
+**Advanced Anti-Bot Strategies:**
+
+```python
+# Combine multiple stealth techniques
+from crawl4ai import CrawlerRunConfig
+
+config = CrawlerRunConfig(
+    # Random user agents and headers
+    headers={
+        "Accept-Language": "en-US,en;q=0.9",
+        "Accept-Encoding": "gzip, deflate, br",
+        "DNT": "1"
+    },
+
+    # Human-like behavior simulation
+    js_code="""
+    // Random mouse movements
+    const simulateHuman = () => {
+        const event = new MouseEvent('mousemove', {
+            clientX: Math.random() * window.innerWidth,
+            clientY: Math.random() * window.innerHeight
+        });
+        document.dispatchEvent(event);
+    };
+    setInterval(simulateHuman, 100 + Math.random() * 200);
+
+    // Random scrolling
+    const randomScroll = () => {
+        const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
+        window.scrollTo(0, scrollY);
+    };
+    setTimeout(randomScroll, 500 + Math.random() * 1000);
+    """,
+
+    # Delay to appear more human
+    delay_before_return_html=2.0
+)
+
+result = await crawler.arun("https://bot-protected-site.com", config=config)
+```
+
+**Expected Real-World Impact:**
+- **Enterprise Scraping**: Access previously blocked corporate sites and databases
+- **Market Research**: Gather data from competitor sites with protection
+- **Price Monitoring**: Track e-commerce sites that block automated access
+- **Content Aggregation**: Collect news and social media despite anti-bot measures
+- **Compliance Testing**: Verify your own site's bot protection effectiveness
+
+## 🧠 Memory Monitoring & Optimization
+
+**The Problem:** Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites.
+
+**My Solution:** Built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
+
+### Memory Tracking Implementation
+
+```python
+from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
+
+# Monitor memory during crawling
+monitor = MemoryMonitor()
+
+async with AsyncWebCrawler() as crawler:
+    # Start monitoring
+    monitor.start_monitoring()
+
+    # Perform memory-intensive operations
+    results = await crawler.arun_many([
+        "https://heavy-js-site.com",
+        "https://large-images-site.com",
+        "https://dynamic-content-site.com"
+    ])
+
+    # Get detailed memory report
+    memory_report = monitor.get_report()
+    print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
+    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
+
+    # Automatic cleanup suggestions
+    if memory_report['peak_mb'] > 1000:  # > 1GB
+        print("πŸ’‘ Consider batch size optimization")
+        print("πŸ’‘ Enable aggressive garbage collection")
+```
+
+**Expected Real-World Impact:**
+- **Production Stability**: Prevent memory-related crashes in long-running services
+- **Cost Optimization**: Right-size server resources based on actual usage
+- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
+- **Scalability Planning**: Understand memory patterns for horizontal scaling
+
+## πŸ“Š Enhanced Table Extraction
+
+**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.
+
+**My Solution:** Dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.
+
+### New Table Access Pattern
+
+```python
+# Old way (deprecated)
+# tables_data = result.media.get('tables', [])
+
+# New way (v0.7.3+)
+result = await crawler.arun("https://site-with-tables.com")
+
+# Direct table access
+if result.tables:
+    print(f"Found {len(result.tables)} tables")
+
+    # Convert to pandas DataFrame instantly
+    import pandas as pd
+
+    for i, table in enumerate(result.tables):
+        df = pd.DataFrame(table['data'])
+        print(f"Table {i}: {df.shape[0]} rows Γ— {df.shape[1]} columns")
+        print(df.head())
+
+        # Table metadata
+        print(f"Source: {table.get('source_xpath', 'Unknown')}")
+        print(f"Headers: {table.get('headers', [])}")
+```
+
+**Expected Real-World Impact:**
+- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
+- **ETL Pipelines**: Cleaner integration with data processing workflows
+- **Reporting**: Simplified table extraction for automated reporting systems
+
+## πŸ’° Community Support: GitHub Sponsors
+
+I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.
+
+**Sponsorship Tiers:**
+- **🌱 Supporter ($5/month)**: Community support + early feature previews
+- **πŸš€ Professional ($25/month)**: Priority support + beta access
+- **🏒 Business ($100/month)**: Direct consultation + custom integrations
+- **πŸ›οΈ Enterprise ($500/month)**: Dedicated support + feature development
+
+**Why Sponsor?**
+- Ensure continuous development and maintenance
+- Get priority support and feature requests
+- Access to premium documentation and examples
+- Direct line to the development team
+
+[**Become a Sponsor β†’**](https://github.com/sponsors/unclecode)
+
 ## 🐳 Docker: Flexible LLM Provider Configuration
 
 **The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
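To make the env-based setup concrete, here is a sketch of what a `.llm.env` file might contain, based on the `LLM_PROVIDER` variable and `.llm.env` support listed in the changelog. The provider-string format and the API-key variable names are assumptions for illustration, not confirmed values:

```ini
# .llm.env -- example values only (assumed layout)
LLM_PROVIDER=openai/gpt-4o-mini
OPENAI_API_KEY=your-key-here

# Switching to Groq is an edit plus a container restart, no image rebuild:
# LLM_PROVIDER=groq/llama3-70b-8192
# GROQ_API_KEY=your-key-here
```

The container would then pick the file up at startup via something like `docker run -d -p 11235:11235 --env-file .llm.env unclecode/crawl4ai:latest` (or an `env_file` entry in docker-compose).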