crawl4ai/docs/md_v2/migration/table_extraction_v073.md
ntohidi a51545c883 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-14 18:21:24 +08:00


# Migration Guide: Table Extraction v0.7.3
## Overview
Version 0.7.3 introduces the **Table Extraction Strategy Pattern**, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.
## What's New
### Strategy Pattern Implementation
Table extraction now follows the same strategy pattern used throughout Crawl4AI:
- **Consistent Architecture**: Aligns with extraction, chunking, and markdown strategies
- **Extensibility**: Easy to create custom table extraction strategies
- **Better Separation**: Table logic moved from content scraping to dedicated module
- **Full Control**: Fine-grained control over table detection and extraction
### New Classes
```python
from crawl4ai import (
    TableExtractionStrategy,  # Abstract base class
    DefaultTableExtraction,   # Current implementation (default)
    NoTableExtraction,        # Explicitly disable extraction
)
```
## Backward Compatibility
**✅ All existing code continues to work without changes.**
### No Changes Required
If your code looks like this, it will continue to work:
```python
# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data
```
### What Happens Behind the Scenes
When you don't specify a `table_extraction` strategy:
1. `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
2. It uses your `table_score_threshold` parameter
3. Tables are extracted exactly as before
4. Results appear in `result.tables` with the same structure
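
The fallback described in the steps above can be pictured with a small standalone sketch. The class names `SketchRunConfig` and `SketchDefaultExtraction` are hypothetical stand-ins, not the Crawl4AI classes, and the default value of 7 is only illustrative; the point is how a config might build a default strategy from `table_score_threshold` when none is passed.

```python
from dataclasses import dataclass
from typing import Any, Optional


class SketchDefaultExtraction:
    """Hypothetical stand-in for a default table strategy (not the library class)."""

    def __init__(self, table_score_threshold: int = 7):
        self.table_score_threshold = table_score_threshold


@dataclass
class SketchRunConfig:
    """Hypothetical stand-in for a run config (not the library class)."""

    table_score_threshold: int = 7          # illustrative default
    table_extraction: Optional[Any] = None  # no strategy given by default

    def __post_init__(self):
        # No strategy supplied: create the default one and hand it the
        # threshold the caller configured the "old" way.
        if self.table_extraction is None:
            self.table_extraction = SketchDefaultExtraction(
                table_score_threshold=self.table_score_threshold
            )


# Old-style call: only the threshold is given.
implicit = SketchRunConfig(table_score_threshold=5)

# New-style call: the strategy is passed explicitly and left untouched.
explicit = SketchRunConfig(table_extraction=SketchDefaultExtraction(5))
```

In this sketch both configs end up holding a strategy with the same threshold, which is the behavior the list above describes for existing code.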
## New Capabilities
### 1. Explicit Strategy Configuration
You can now explicitly configure table extraction:
```python
# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,    # New: minimum row filter
    min_cols=2,    # New: minimum column filter
    verbose=True   # New: detailed logging
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### 2. Disable Table Extraction
Improve performance when tables aren't needed:
```python
# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction
```
### 3. Custom Extraction Strategies
Create specialized extractors:
```python
class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic: build a list of table dicts here
        custom_tables = []
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
```
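
As a more concrete illustration, here is a minimal standalone sketch of a custom extractor. It makes several assumptions: `SketchTableStrategy` is a stand-in base class that only mirrors the `extract_tables(self, element, **kwargs)` signature shown above, the element is parsed with the standard library's `xml.etree.ElementTree` rather than whatever element type Crawl4AI passes in, and the output uses only the `headers` and `rows` keys that appear elsewhere in this guide. In real use you would subclass `TableExtractionStrategy` instead.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class SketchTableStrategy(ABC):
    """Stand-in base class; only the method signature is taken from this guide."""

    @abstractmethod
    def extract_tables(self, element, **kwargs):
        ...


class FirstRowHeaderExtractor(SketchTableStrategy):
    """Treats the first row of every <table> as the header row."""

    def extract_tables(self, element, **kwargs):
        tables = []
        for table in element.iter("table"):
            # Collect the text of each <th>/<td> cell, row by row.
            rows = [
                [(cell.text or "").strip() for cell in tr if cell.tag in ("th", "td")]
                for tr in table.iter("tr")
            ]
            rows = [row for row in rows if row]  # drop rows without cells
            if rows:
                tables.append({"headers": rows[0], "rows": rows[1:]})
        return tables


html = (
    "<div><table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr>"
    "</table></div>"
)
tables = FirstRowHeaderExtractor().extract_tables(ET.fromstring(html))
print(tables)
```

The sketch ignores nested tables, `rowspan`/`colspan`, and cells with child markup; a production extractor would need to handle those cases.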
## Migration Scenarios
### Scenario 1: Basic Usage (No Changes Needed)
**Before (v0.7.2):**
```python
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```
**After (v0.7.3):**
```python
# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```
### Scenario 2: Custom Threshold (No Changes Needed)
**Before (v0.7.2):**
```python
config = CrawlerRunConfig(
    table_score_threshold=5
)
```
**After (v0.7.3):**
```python
# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### Scenario 3: Advanced Filtering (New Feature)
**Before (v0.7.2):**
```python
# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]
```
**After (v0.7.3):**
```python
# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables already filtered
```
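
To make the filtering semantics concrete, the manual post-filter from the "Before" snippet can be written as a small helper and tried on sample data. `filter_tables` is a hypothetical name, and this is only a sketch of the old manual approach; how `DefaultTableExtraction` counts rows and columns internally (for example, whether the header row counts toward `min_rows`) is not specified in this guide.

```python
def filter_tables(tables, min_rows=0, min_cols=0):
    """Keep tables with at least min_rows data rows and min_cols header cells."""
    return [
        t for t in tables
        if len(t["rows"]) >= min_rows and len(t["headers"]) >= min_cols
    ]


# Hand-written sample data in the same shape as the dicts in result.tables.
sample = [
    {"headers": ["a", "b"], "rows": [[1, 2]]},                       # too small
    {"headers": ["a", "b", "c"], "rows": [[i, i, i] for i in range(5)]},
]

large_tables = filter_tables(sample, min_rows=5, min_cols=3)
print(len(large_tables))
```

With the default arguments of 0 the helper keeps every table, matching the "no filtering" defaults described in the Troubleshooting section.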
## Code Organization Changes
### Module Structure
**Before (v0.7.2):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()        # Table detection
      - extract_table_data()   # Table extraction
```
**After (v0.7.3):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      # Table methods removed, uses strategy
  table_extraction.py (NEW)
    - TableExtractionStrategy  # Base class
    - DefaultTableExtraction   # Moved logic here
    - NoTableExtraction        # New option
```
### Import Changes
**New imports available (optional):**
```python
# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction
)
```
## Performance Implications
### No Performance Impact
For existing code, performance remains identical:
- Same extraction logic
- Same scoring algorithm
- Same processing time
### Performance Improvements Available
New options for better performance:
```python
# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)
```
## Testing Your Migration
### Verification Script
Run this to verify your extraction still works:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"

    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']

        print("✓ Table content identical")

asyncio.run(verify_extraction())
```
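
If an assertion in the script fails, it can help to see every mismatch at once instead of stopping at the first one. The following sketch is one way to do that; `diff_tables` is a hypothetical helper, not part of Crawl4AI, and it assumes each table is a dict with `headers` and `rows` keys as used in the script above.

```python
def diff_tables(old_tables, new_tables):
    """Return a list of human-readable differences between two table lists."""
    problems = []
    if len(old_tables) != len(new_tables):
        problems.append(
            f"table count differs: {len(old_tables)} vs {len(new_tables)}"
        )
    # zip() stops at the shorter list, so only common positions are compared.
    for i, (old, new) in enumerate(zip(old_tables, new_tables)):
        for key in ("headers", "rows"):
            if old.get(key) != new.get(key):
                problems.append(f"table {i}: '{key}' differs")
    return problems


# Hand-written sample data standing in for result_old.tables / result_new.tables.
old = [{"headers": ["a", "b"], "rows": [[1, 2]]}]
new_same = [{"headers": ["a", "b"], "rows": [[1, 2]]}]
new_changed = [{"headers": ["a", "b"], "rows": [[1, 3]]}, {"headers": [], "rows": []}]

print(diff_tables(old, new_same))
print(diff_tables(old, new_changed))
```

In the verification script, the call would be `diff_tables(result_old.tables, result_new.tables)`, and an empty list means the two approaches agree.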
## Deprecation Notes
### No Deprecations
- All existing parameters continue to work
- `table_score_threshold` in `CrawlerRunConfig` is still supported
- No breaking changes
### Internal Changes (Transparent to Users)
- `LXMLWebScrapingStrategy.is_data_table()` - Moved to `DefaultTableExtraction`
- `LXMLWebScrapingStrategy.extract_table_data()` - Moved to `DefaultTableExtraction`
These methods were internal and not part of the public API.
## Benefits of Upgrading
While not required, using the new pattern provides:
1. **Better Control**: Filter tables during extraction, not after
2. **Performance Options**: Skip extraction when not needed
3. **Extensibility**: Create custom extractors for specific needs
4. **Consistency**: Same pattern as other Crawl4AI strategies
5. **Future-Proof**: Ready for upcoming advanced strategies
## Troubleshooting
### Issue: Different Number of Tables
**Cause**: Threshold or filtering differences
**Solution**:
```python
# Ensure same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,               # No filtering (default)
    min_cols=0                # No filtering (default)
)
```
### Issue: Import Errors
**Cause**: Using new classes without importing
**Solution**:
```python
# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy
)
```
### Issue: Custom Strategy Not Working
**Cause**: Incorrect method signature
**Solution**:
```python
class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        tables_list = []  # Build a list of table dicts here
        return tables_list
```
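
A quick way to catch a wrong signature before running a crawl is to inspect the method with the standard library. The helper below is a hypothetical sketch, not a Crawl4AI utility: it only checks that a class defines `extract_tables` with one positional parameter after `self` and a `**kwargs` catch-all, as in the correct signature above.

```python
import inspect


def has_expected_signature(cls):
    """Check for extract_tables(self, element, **kwargs) on a class."""
    method = getattr(cls, "extract_tables", None)
    if method is None:
        return False  # e.g. the method was named extract() by mistake
    params = list(inspect.signature(method).parameters.values())
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    # params[0] is self; params[1] should be the element argument.
    has_element = len(params) >= 2 and params[1].kind in positional
    has_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)
    return has_element and has_kwargs


# Plain classes are used here so the sketch runs without Crawl4AI installed.
class Good:
    def extract_tables(self, element, **kwargs):
        return []


class WrongName:
    def extract(self, element):
        return []


class NoKwargs:
    def extract_tables(self, html):
        return []


for candidate in (Good, WrongName, NoKwargs):
    print(candidate.__name__, has_expected_signature(candidate))
```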
## Getting Help
If you encounter issues:
1. Check your `table_score_threshold` matches previous settings
2. Verify imports if using new classes
3. Enable verbose logging: `DefaultTableExtraction(verbose=True)`
4. Review the [Table Extraction Documentation](../core/table_extraction.md)
5. Check [examples](../examples/table_extraction_example.py)
## Summary
- **Full backward compatibility** - No code changes required
- **Same results** - Identical extraction behavior by default
- **New options** - Additional control when needed
- **Better architecture** - Consistent with Crawl4AI patterns
- **Ready for future** - Foundation for advanced strategies
Migrating to v0.7.3 requires no code changes, and the new capabilities are available to those who need them.