# Migration Guide: Table Extraction v0.7.3

## Overview

Version 0.7.3 introduces the Table Extraction Strategy Pattern, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.
## What's New

### Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

- Consistent Architecture: Aligns with the extraction, chunking, and markdown strategies
- Extensibility: Easy to create custom table extraction strategies
- Better Separation: Table logic moved from content scraping into a dedicated module
- Full Control: Fine-grained control over table detection and extraction
### New Classes

```python
from crawl4ai import (
    TableExtractionStrategy,  # Abstract base class
    DefaultTableExtraction,   # Current implementation (default)
    NoTableExtraction,        # Explicitly disable extraction
)
```
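The relationship between these classes can be sketched as follows. This is a standalone illustration of the pattern, not the library's actual source: the `extract_tables(self, element, **kwargs)` signature matches the one shown in the Troubleshooting section, but the class bodies and the default parameter values here are assumptions.

```python
from abc import ABC, abstractmethod

# Standalone sketch of the strategy pattern; the class names mirror
# Crawl4AI's, but the bodies are illustrative, not the library's code.
class TableExtractionStrategy(ABC):
    @abstractmethod
    def extract_tables(self, element, **kwargs):
        """Return a list of table dicts, e.g. {'headers': [...], 'rows': [...]}."""

class NoTableExtraction(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        return []  # skip table extraction entirely

class DefaultTableExtraction(TableExtractionStrategy):
    def __init__(self, table_score_threshold=7, min_rows=0, min_cols=0, verbose=False):
        self.table_score_threshold = table_score_threshold
        self.min_rows = min_rows
        self.min_cols = min_cols
        self.verbose = verbose

    def extract_tables(self, element, **kwargs):
        # The real implementation scores and parses <table> elements;
        # that logic is elided in this sketch.
        return []
```

Because the crawler only depends on the abstract interface, any object implementing `extract_tables` can be plugged in via `CrawlerRunConfig(table_extraction=...)`.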
## Backward Compatibility

✅ All existing code continues to work without changes.

### No Changes Required

If your code looks like this, it will continue to work:

```python
# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data
```
### What Happens Behind the Scenes

When you don't specify a `table_extraction` strategy:

- `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
- It uses your `table_score_threshold` parameter
- Tables are extracted exactly as before
- Results appear in `result.tables` with the same structure
## New Capabilities

### 1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

```python
# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,    # New: minimum row filter
    min_cols=2,    # New: minimum column filter
    verbose=True   # New: detailed logging
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### 2. Disable Table Extraction

Improve performance when tables aren't needed:

```python
# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction
```
### 3. Custom Extraction Strategies

Create specialized extractors:

```python
class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
```
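The custom logic can be anything that turns markup into the `headers`/`rows` structure used by `result.tables`. As a simplified, self-contained sketch, the helper below parses a plain HTML fragment using only Python's standard library; `parse_simple_table` is a hypothetical name for illustration, and in real use the `element` passed by Crawl4AI is a parsed node rather than a raw string.

```python
from html.parser import HTMLParser

class _SimpleTableParser(HTMLParser):
    """Collect cell text from the rows of an HTML <table> fragment."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_simple_table(html: str) -> dict:
    """Return a table dict in the headers/rows shape used by result.tables."""
    parser = _SimpleTableParser()
    parser.feed(html)
    headers = parser.rows[0] if parser.rows else []
    return {"headers": headers, "rows": parser.rows[1:]}
```

A strategy's `extract_tables` would then return a list of such dicts for the tables it finds under `element`.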
## Migration Scenarios

### Scenario 1: Basic Usage (No Changes Needed)

Before (v0.7.2):

```python
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

After (v0.7.3):

```python
# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```
### Scenario 2: Custom Threshold (No Changes Needed)

Before (v0.7.2):

```python
config = CrawlerRunConfig(
    table_score_threshold=5
)
```

After (v0.7.3):

```python
# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use the new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### Scenario 3: Advanced Filtering (New Feature)

Before (v0.7.2):

```python
# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]
```

After (v0.7.3):

```python
# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables is already filtered
```
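If you must keep supporting v0.7.2 alongside v0.7.3, the manual post-hoc filtering can live in one small helper so the rest of your code is version-agnostic. `filter_tables` is a hypothetical name, not a Crawl4AI function; it assumes the `headers`/`rows` table shape shown above.

```python
def filter_tables(tables, min_rows=0, min_cols=0):
    """Post-hoc equivalent of DefaultTableExtraction's min_rows/min_cols filters."""
    return [
        t for t in tables
        if len(t["rows"]) >= min_rows and len(t["headers"]) >= min_cols
    ]
```

On v0.7.3 you would pass the thresholds to `DefaultTableExtraction` instead and skip this step.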
## Code Organization Changes

### Module Structure

Before (v0.7.2):

```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()        # Table detection
      - extract_table_data()   # Table extraction
```

After (v0.7.3):

```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy  # Table methods removed, uses strategy
  table_extraction.py (NEW)
    - TableExtractionStrategy  # Base class
    - DefaultTableExtraction   # Moved logic here
    - NoTableExtraction        # New option
```
### Import Changes

New imports are available (optional):

```python
# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction,
)
```
## Performance Implications

### No Performance Impact

For existing code, performance remains identical:

- Same extraction logic
- Same scoring algorithm
- Same processing time

### Performance Improvements Available

New options for better performance:

```python
# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)
```
## Testing Your Migration

### Verification Script

Run this to verify your extraction still works:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"
    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']
        print("✓ Table content identical")

asyncio.run(verify_extraction())
```
## Deprecation Notes

### No Deprecations

- All existing parameters continue to work
- `table_score_threshold` in `CrawlerRunConfig` is still supported
- No breaking changes

### Internal Changes (Transparent to Users)

- `LXMLWebScrapingStrategy.is_data_table()` - moved to `DefaultTableExtraction`
- `LXMLWebScrapingStrategy.extract_table_data()` - moved to `DefaultTableExtraction`

These methods were internal and not part of the public API.
## Benefits of Upgrading

While not required, using the new pattern provides:

- Better Control: Filter tables during extraction, not after
- Performance Options: Skip extraction when not needed
- Extensibility: Create custom extractors for specific needs
- Consistency: Same pattern as other Crawl4AI strategies
- Future-Proof: Ready for upcoming advanced strategies
## Troubleshooting

### Issue: Different Number of Tables

Cause: Threshold or filtering differences

Solution:

```python
# Ensure the same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,  # No filtering (default)
    min_cols=0   # No filtering (default)
)
```

### Issue: Import Errors

Cause: Using new classes without importing them

Solution:

```python
# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy,
)
```

### Issue: Custom Strategy Not Working

Cause: Incorrect method signature

Solution:

```python
class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        return tables_list
```
## Getting Help

If you encounter issues:

- Check that your `table_score_threshold` matches your previous settings
- Verify imports if using the new classes
- Enable verbose logging: `DefaultTableExtraction(verbose=True)`
- Review the Table Extraction Documentation
- Check the examples
## Summary

- ✅ Full backward compatibility - No code changes required
- ✅ Same results - Identical extraction behavior by default
- ✅ New options - Additional control when needed
- ✅ Better architecture - Consistent with Crawl4AI patterns
- ✅ Ready for the future - Foundation for advanced strategies

The migration to v0.7.3 is seamless, with no required changes, while providing new capabilities for those who need them.