# 🚀 Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update

*August 17, 2025 • 6 min read*

---

Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Performance Update. This release introduces LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.

## 🎯 What's New at a Glance

- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
- **⚡ Enhanced Concurrency**: True concurrency for fast-completing tasks in batch operations
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration with dict and string formats
- **🐳 Docker Improvements**: Better API handling and raw HTML support

## 🚀 LLMTableExtraction: Revolutionary Table Processing

**The Problem:** Complex tables with rowspan, colspan, nested structures, or massive datasets that traditional HTML parsing can't handle effectively. Large tables that exceed token limits crash extraction processes.

**My Solution:** I developed LLMTableExtraction—a table extraction strategy that uses Large Language Models with automatic chunking to handle tables of any size and complexity.

### Technical Implementation

```python
import pandas as pd

from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    LLMConfig,
    LLMTableExtraction,
    CacheMode
)

# Configure LLM for table extraction
llm_config = LLMConfig(
    provider="openai/gpt-4.1-mini",
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=32000
)

# Create intelligent table extraction strategy
table_strategy = LLMTableExtraction(
    llm_config=llm_config,
    verbose=True,
    max_tries=2,
    enable_chunking=True,         # Handle massive tables
    chunk_token_threshold=5000,   # Smart chunking threshold
    overlap_threshold=100,        # Maintain context between chunks
    extraction_type="structured"  # Get structured data output
)

# Apply to crawler configuration
config = CrawlerRunConfig(
    table_extraction_strategy=table_strategy,
    cache_mode=CacheMode.BYPASS
)

async with AsyncWebCrawler() as crawler:
    # Extract complex tables with intelligence
    result = await crawler.arun(
        "https://en.wikipedia.org/wiki/List_of_countries_by_GDP",
        config=config
    )

    # Access extracted tables directly
    for i, table in enumerate(result.tables):
        print(f"Table {i}: {len(table['data'])} rows × {len(table['headers'])} columns")

        # Convert to a pandas DataFrame instantly
        df = pd.DataFrame(table['data'], columns=table['headers'])
        print(df.head())
```

**Intelligent Chunking for Massive Tables:**

```python
# Handle tables that exceed token limits
large_table_strategy = LLMTableExtraction(
    llm_config=llm_config,
    enable_chunking=True,
    chunk_token_threshold=3000,   # Conservative threshold
    overlap_threshold=150,        # Preserve context
    max_concurrent_chunks=3,      # Parallel processing
    merge_strategy="intelligent"  # Smart chunk merging
)

# Process Wikipedia comparison tables, financial reports, etc.
config = CrawlerRunConfig(
    table_extraction_strategy=large_table_strategy,
    # Target specific table containers
    css_selector="table.wikitable, table.sortable",
    delay_before_return_html=2.0
)

# (inside the same `async with AsyncWebCrawler()` block as above)
result = await crawler.arun(
    "https://en.wikipedia.org/wiki/Comparison_of_operating_systems",
    config=config
)

# Tables are automatically chunked, processed, and merged
print(f"Extracted {len(result.tables)} complex tables")
for table in result.tables:
    print(f"Merged table: {len(table['data'])} total rows")
```

**Advanced Features:**

- **Intelligent Chunking**: Automatically splits massive tables while preserving structure (see the sketch after this list)
- **Context Preservation**: Overlapping chunks maintain column relationships
- **Parallel Processing**: Concurrent chunk processing for speed
- **Smart Merging**: Reconstructs complete tables from processed chunks
- **Complex Structure Support**: Handles rowspan, colspan, nested tables
- **Metadata Extraction**: Captures table context, captions, and relationships
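To make the chunk-and-merge idea concrete, here is a minimal, illustrative sketch of the overlap technique. This is not Crawl4AI's internal implementation; the `estimate_tokens` heuristic, the `overlap_rows` parameter, and the helper names are assumptions for illustration. The idea is simply to split rows into token-bounded chunks that share a few overlap rows, then drop the duplicated overlap when merging:

```python
# Illustrative sketch of overlap-based table chunking.
# NOT Crawl4AI's internal code; names and heuristics are hypothetical.
from typing import Any, List

def estimate_tokens(row: List[Any]) -> int:
    """Rough heuristic: ~1 token per 4 characters of cell text."""
    return max(1, sum(len(str(cell)) for cell in row) // 4)

def chunk_rows(rows: List[List[Any]], token_threshold: int = 5000,
               overlap_rows: int = 2) -> List[List[List[Any]]]:
    """Split table rows into token-bounded chunks with shared overlap."""
    chunks, current, tokens = [], [], 0
    for row in rows:
        current.append(row)
        tokens += estimate_tokens(row)
        if tokens >= token_threshold:
            chunks.append(current)
            # Seed the next chunk with the tail of this one so the
            # model keeps the surrounding row context.
            current = current[-overlap_rows:]
            tokens = sum(estimate_tokens(r) for r in current)
    # Keep the final partial chunk unless it is pure overlap.
    if current and (not chunks or len(current) > overlap_rows):
        chunks.append(current)
    return chunks

def merge_chunks(chunks: List[List[List[Any]]],
                 overlap_rows: int = 2) -> List[List[Any]]:
    """Rebuild the full table, dropping the duplicated overlap rows."""
    merged = list(chunks[0])
    for chunk in chunks[1:]:
        merged.extend(chunk[overlap_rows:])
    return merged

# 10,000 small rows split into chunks and reassembled losslessly
rows = [[i, f"country-{i}", i * 1000] for i in range(10_000)]
chunks = chunk_rows(rows, token_threshold=500)
assert merge_chunks(chunks) == rows
print(f"{len(rows)} rows -> {len(chunks)} chunks -> merged back")
```

The overlap is what preserves column relationships across chunk boundaries: each chunk the LLM sees carries a few rows of context from its neighbor, and the merge step deduplicates them afterward.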
**Expected Real-World Impact:**

- **Financial Analysis**: Extract complex earnings tables and financial statements
- **Research & Academia**: Process large datasets from Wikipedia and research papers
- **E-commerce**: Handle product comparison tables with complex layouts
- **Government Data**: Extract census data and statistical tables from official sources
- **Competitive Intelligence**: Process competitor pricing and feature tables

## ⚡ Enhanced Concurrency: True Performance Gains

**The Problem:** The `arun_many()` method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations.

**My Solution:** I implemented dispatcher improvements that enable genuine parallel processing for fast-completing tasks.

### Performance Optimization

```python
from crawl4ai import AsyncWebCrawler

# Before v0.7.4: sequential-like behavior for fast tasks
# After v0.7.4: true concurrency
async with AsyncWebCrawler() as crawler:
    # These will now run with true concurrency
    urls = [
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1"
    ]

    # Processes in truly parallel fashion
    results = await crawler.arun_many(urls)

    # Performance improvement: ~4x faster for fast-completing tasks
    print(f"Processed {len(results)} URLs with true concurrency")
```

**Expected Real-World Impact:**

- **API Crawling**: 3-4x faster processing of REST endpoints and API documentation
- **Batch URL Processing**: Significant speedup for large URL lists
- **Monitoring Systems**: Faster health checks and status page monitoring
- **Data Aggregation**: Improved performance for real-time data collection

## 🔧 Critical Stability Fixes

### Browser Manager Race Condition Resolution

**The Problem:** Concurrent page creation in persistent browser contexts caused "Target page/context closed" errors during high-concurrency operations.

**My Solution:** Implemented thread-safe page creation with proper locking mechanisms.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig

# Fixed: safe concurrent page creation
browser_config = BrowserConfig(
    browser_type="chromium",
    use_persistent_context=True,  # Now thread-safe
    max_concurrent_sessions=10    # Safely handle concurrent requests
)

url_list = ["https://example.com", "https://example.org"]

async with AsyncWebCrawler(config=browser_config) as crawler:
    # These concurrent operations are now stable
    tasks = [crawler.arun(url) for url in url_list]
    results = await asyncio.gather(*tasks)  # No more race conditions
```

### Enhanced Browser Profiler

**The Problem:** Inconsistent keyboard handling across platforms and unreliable quit mechanisms.

**My Solution:** Cross-platform keyboard listeners with improved quit handling.
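For the curious, here is a minimal, stdlib-only sketch of what cross-platform quit-key handling generally looks like: `msvcrt` polling on Windows and `termios`/`select` on POSIX. This illustrates the technique only; it is not the profiler's actual code, and `quit_requested` is a hypothetical helper name:

```python
# Illustrative sketch of cross-platform, non-blocking quit-key polling.
# NOT the profiler's actual implementation.
import sys

if sys.platform == "win32":
    import msvcrt

    def quit_requested() -> bool:
        """Non-blocking check for a 'q' keypress on Windows."""
        return msvcrt.kbhit() and msvcrt.getwch().lower() == "q"
else:
    import select
    import termios
    import tty

    def quit_requested() -> bool:
        """Non-blocking check for a 'q' keypress on POSIX terminals."""
        fd = sys.stdin.fileno()
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setcbreak(fd)  # Read single keys without waiting for Enter
            ready, _, _ = select.select([sys.stdin], [], [], 0)
            return bool(ready) and sys.stdin.read(1).lower() == "q"
        finally:
            # Always restore the terminal to its previous state
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)

# Typical use: poll between units of work in the profiler's main loop:
#     while not quit_requested():
#         ...  # profiling work
```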
### Advanced URL Processing

**The Problem:** Raw URL formats (`raw://` and `raw:`) weren't properly handled, and base tag link resolution was incomplete.

**My Solution:** Enhanced URL preprocessing and base tag support.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Both raw URL prefixes now properly carry inline HTML content
urls = [
    "https://example.com",
    "raw://<html><body><h1>Static HTML content</h1></body></html>",
    "raw:<html><body><p>Also valid with the short prefix</p></body></html>"
]

# Links in documents with <base href="..."> are now correctly resolved
config = CrawlerRunConfig(
    include_links=True,  # Links properly resolved with base tags
    resolve_absolute_urls=True
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls, config=config)
```

## 🛡️ Enhanced Proxy Configuration

**The Problem:** Proxy configuration only accepted specific formats, limiting flexibility.

**My Solution:** Enhanced ProxyConfig to support both dictionary and string formats.

```python
# Multiple proxy configuration formats now supported
from crawl4ai import AsyncWebCrawler, BrowserConfig, ProxyConfig

# String format
proxy_config = ProxyConfig("http://proxy.example.com:8080")

# Dictionary format
proxy_config = ProxyConfig({
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
})

# Use with crawler
browser_config = BrowserConfig(proxy_config=proxy_config)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://httpbin.org/ip")
```

## 🐳 Docker & Infrastructure Improvements

This release includes several Docker and infrastructure improvements:

- **Better API Token Handling**: Improved Docker example scripts with correct endpoints
- **Raw HTML Support**: Enhanced Docker API to handle raw HTML content properly
- **Documentation Updates**: Comprehensive Docker deployment examples
- **Test Coverage**: Expanded test suite with better coverage

## 📚 Documentation & Examples

Enhanced documentation includes:

- **LLM Table Extraction Guide**: Comprehensive examples and best practices
- **Migration Documentation**: Updated patterns for new table extraction methods
- **Docker Deployment**: Complete deployment guide with examples
- **Performance Optimization**: Guidelines for concurrent crawling

## 🙏 Acknowledgments

Thanks to our contributors and community for the feedback, bug reports, and feature requests that made this release possible.

## 📚 Resources

- [Full Documentation](https://docs.crawl4ai.com)
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Discord Community](https://discord.gg/crawl4ai)
- [LLM Table Extraction Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_table_extraction_example.py)

---

*Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impractical to process, while the concurrency improvements make batch operations 3-4x faster. Try the intelligent table extraction—it's a game changer for data extraction workflows!*

**Happy Crawling! 🕷️**

*- The Crawl4AI Team*