# πŸš€ Crawl4AI v0.7.3: The Multi-Config Intelligence Update

*August 6, 2025 β€’ 5 min read*

---

Today I'm releasing Crawl4AI v0.7.3β€”the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.

## 🎯 What's New at a Glance

- **πŸ•΅οΈ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
- **πŸ“Š Enhanced Table Extraction**: Improved table access and DataFrame conversion
- **πŸ’° GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
- **πŸ”§ Bug Fixes**: Resolved several critical issues for better stability
- **πŸ“š Documentation Updates**: Clearer examples and improved API documentation

## 🎨 Multi-URL Configurations: One Size Doesn't Fit All

**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handlingβ€”caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.

**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.

### Technical Implementation

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode, LLMExtractionStrategy

# Define specialized configs for different content types
configs = [
    # Documentation sites - aggressive caching, include links
    CrawlerRunConfig(
        url_matcher=["*docs*", "*documentation*"],
        cache_mode="write",
        markdown_generator_options={"include_links": True}
    ),
    # News/blog sites - fresh content, scroll for lazy loading
    CrawlerRunConfig(
        url_matcher=lambda url: 'blog' in url or 'news' in url,
        cache_mode="bypass",
        js_code="window.scrollTo(0, document.body.scrollHeight/2);"
    ),
    # API endpoints - structured extraction
    CrawlerRunConfig(
        url_matcher=["*.json", "*api*"],
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o-mini",
            extraction_type="structured"
        )
    ),
    # Default fallback for everything else
    CrawlerRunConfig()  # No url_matcher = matches everything
]

# Crawl multiple URLs with appropriate configs
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=[
            "https://docs.python.org/3/",    # β†’ Uses documentation config
            "https://blog.python.org/",      # β†’ Uses blog config
            "https://api.github.com/users",  # β†’ Uses API config
            "https://example.com/"           # β†’ Uses default config
        ],
        config=configs
    )
```

**Matching Capabilities:**

- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
- **Function Matchers**: Lambda functions for complex logic
- **Mixed Matchers**: Combine strings and functions with AND/OR logic (see the sketch after this list)
- **Fallback Support**: Default config when nothing matches
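
The example above imports `MatchMode` without exercising it, so here is a minimal sketch of the mixed-matcher case. It assumes `match_mode` is the parameter that decides whether the listed matchers combine with AND or OR semantics; treat it as an illustration rather than a definitive API reference:

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Sketch: select this config only when BOTH matchers agree --
# the URL must be served over HTTPS AND point at a PDF.
# (With MatchMode.OR, any single matcher passing would be enough.)
pdf_config = CrawlerRunConfig(
    url_matcher=[
        "https://*",                       # string pattern with wildcard
        lambda url: url.endswith(".pdf"),  # function matcher
    ],
    match_mode=MatchMode.AND,
)
```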
**Expected Real-World Impact:**

- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
- **Reduced Complexity**: No more if/else forests in your extraction code
- **Better Performance**: Each URL gets exactly the processing it needs

## πŸ•΅οΈ Undetected Browser Support: Stealth Mode Activated

**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.

**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.

### Technical Implementation

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Enable undetected mode for stealth crawling
browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security",
        "--disable-features=VizDisplayCompositor"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    # This will bypass most bot detection systems
    result = await crawler.arun("https://protected-site.com")
    if result.success:
        print("βœ… Successfully bypassed bot detection!")
        print(f"Content length: {len(result.markdown)}")
```

**Advanced Anti-Bot Strategies:**

```python
# Combine multiple stealth techniques
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # Random user agents and headers
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1"
    },
    # Human-like behavior simulation
    js_code="""
    // Random mouse movements
    const simulateHuman = () => {
        const event = new MouseEvent('mousemove', {
            clientX: Math.random() * window.innerWidth,
            clientY: Math.random() * window.innerHeight
        });
        document.dispatchEvent(event);
    };
    setInterval(simulateHuman, 100 + Math.random() * 200);

    // Random scrolling
    const randomScroll = () => {
        const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
        window.scrollTo(0, scrollY);
    };
    setTimeout(randomScroll, 500 + Math.random() * 1000);
    """,
    # Delay to appear more human
    delay_before_return_html=2.0
)

# Inside an AsyncWebCrawler context, as in the snippet above:
result = await crawler.arun("https://bot-protected-site.com", config=config)
```
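
The two snippets above compose naturally. Here is a self-contained sketch (the target URL is a placeholder) that runs the human-behavior config through the undetected browser in one call:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def stealth_crawl(url: str):
    # Undetected Chrome, as in the first snippet above
    browser_config = BrowserConfig(browser_type="undetected", headless=True)
    # Human-like delay, as in the second snippet above
    run_config = CrawlerRunConfig(delay_before_return_html=2.0)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url, config=run_config)

# Placeholder URL -- point this at a site you're permitted to crawl
result = asyncio.run(stealth_crawl("https://protected-site.com"))
print(result.success)
```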
**Expected Real-World Impact:**

- **Enterprise Scraping**: Access previously blocked corporate sites and databases
- **Market Research**: Gather data from competitor sites with protection
- **Price Monitoring**: Track e-commerce sites that block automated access
- **Content Aggregation**: Collect news and social media despite anti-bot measures
- **Compliance Testing**: Verify your own site's bot protection effectiveness

## 🧠 Memory Monitoring & Optimization

**The Problem:** Long-running crawl sessions consume excessive memory, especially when processing large batches or heavy JavaScript sites.

**My Solution:** I built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.

### Memory Tracking Implementation

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info

# Monitor memory during crawling
monitor = MemoryMonitor()

async with AsyncWebCrawler() as crawler:
    # Start monitoring
    monitor.start_monitoring()

    # Perform memory-intensive operations
    results = await crawler.arun_many([
        "https://heavy-js-site.com",
        "https://large-images-site.com",
        "https://dynamic-content-site.com"
    ])

    # Get detailed memory report
    memory_report = monitor.get_report()
    print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")

    # Automatic cleanup suggestions
    if memory_report['peak_mb'] > 1000:  # > 1 GB
        print("πŸ’‘ Consider batch size optimization")
        print("πŸ’‘ Enable aggressive garbage collection")
```

**Expected Real-World Impact:**

- **Production Stability**: Prevent memory-related crashes in long-running services
- **Cost Optimization**: Right-size server resources based on actual usage
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
- **Scalability Planning**: Understand memory patterns for horizontal scaling

## πŸ“Š Enhanced Table Extraction

**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.

**My Solution:** A dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.

### New Table Access Pattern

```python
# Old way (deprecated)
# tables_data = result.media.get('tables', [])

# New way (v0.7.3+)
result = await crawler.arun("https://site-with-tables.com")

# Direct table access
if result.tables:
    print(f"Found {len(result.tables)} tables")

    # Convert to pandas DataFrame instantly
    import pandas as pd
    for i, table in enumerate(result.tables):
        df = pd.DataFrame(table['data'])
        print(f"Table {i}: {df.shape[0]} rows Γ— {df.shape[1]} columns")
        print(df.head())

        # Table metadata
        print(f"Source: {table.get('source_xpath', 'Unknown')}")
        print(f"Headers: {table.get('headers', [])}")
```
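
Continuing from the `result` above, a short follow-up sketch shows one way to hand the extracted tables to an ETL step: write each one to CSV, reusing the `data` and `headers` keys from the access pattern just shown. The CSV file names are arbitrary, and the sketch assumes the header count matches the row width:

```python
import pandas as pd

# Sketch: persist every extracted table as its own CSV file,
# reusing the 'data' and 'headers' keys shown above.
for i, table in enumerate(result.tables):
    headers = table.get('headers') or None  # fall back to default column names
    df = pd.DataFrame(table['data'], columns=headers)
    df.to_csv(f"table_{i}.csv", index=False)
    print(f"Saved table_{i}.csv ({df.shape[0]} rows)")
```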
**Expected Real-World Impact:**

- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
- **ETL Pipelines**: Cleaner integration with data processing workflows
- **Reporting**: Simplified table extraction for automated reporting systems

## πŸ’° Community Support: GitHub Sponsors

I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.

**Sponsorship Tiers:**

- **🌱 Supporter ($5/month)**: Community support + early feature previews
- **πŸš€ Professional ($25/month)**: Priority support + beta access
- **🏒 Business ($100/month)**: Direct consultation + custom integrations
- **πŸ›οΈ Enterprise ($500/month)**: Dedicated support + feature development

**Why Sponsor?**

- Ensure continuous development and maintenance
- Get priority support and feature requests
- Access premium documentation and examples
- Get a direct line to the development team

[**Become a Sponsor β†’**](https://github.com/sponsors/unclecode)

## 🐳 Docker: Flexible LLM Provider Configuration

**The Problem:** LLM providers were hardcoded in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.

**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.

### Deployment Flexibility

```bash
# Option 1: Direct environment variables
docker run -d \
  -e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
  -e GROQ_API_KEY="your-key" \
  -p 11235:11235 \
  unclecode/crawl4ai:latest

# Option 2: Using .llm.env file (recommended for production)
# Create .llm.env file:
# LLM_PROVIDER=openai/gpt-4o-mini
# OPENAI_API_KEY=your-openai-key
# GROQ_API_KEY=your-groq-key

docker run -d \
  --env-file .llm.env \
  -p 11235:11235 \
  unclecode/crawl4ai:latest
```

Override per request when needed:

```python
import requests

# Use default provider from .llm.env
response = requests.post("http://localhost:11235/crawl", json={
    "url": "https://example.com",
    "extraction_strategy": {"type": "llm"}
})

# Override to use a different provider for this specific request
response = requests.post("http://localhost:11235/crawl", json={
    "url": "https://complex-page.com",
    "extraction_strategy": {
        "type": "llm",
        "provider": "openai/gpt-4"  # Override default
    }
})
```

**Expected Real-World Impact:**

- **Cost Optimization**: Use cheaper models for simple tasks, premium models for complex ones
- **A/B Testing**: Compare provider performance without deployment changes
- **Fallback Strategies**: Switch providers on the fly during outages
- **Development Flexibility**: Test locally with one provider, deploy with another
- **Secure Configuration**: Keep API keys in the `.llm.env` file, not in commands

## πŸ”§ Bug Fixes & Improvements

This release includes several important bug fixes that improve stability and reliability:

- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
- **Table Extraction**: Improved table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures

## πŸ“š Documentation Enhancements

Based on community feedback, we've updated the documentation with:

- Clearer examples for multi-URL configuration
- Improved CrawlResult documentation covering all available fields
- Fixes for typos and inconsistencies across the docs
- Real-world URLs in examples for better understanding
- A new comprehensive demo showcasing all v0.7.3 features

## πŸ™ Acknowledgments

Thanks to our contributors and the entire community for feedback and bug reports.

## πŸ“š Resources

- [Full Documentation](https://docs.crawl4ai.com)
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Discord Community](https://discord.gg/crawl4ai)
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)

---

*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deploymentβ€”they're game changers!*

**Happy Crawling! πŸ•·οΈ**

*- The Crawl4AI Team*