BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

✨ NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

⚡ PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.

# Migration Guide: Table Extraction v0.7.3

## Overview

Version 0.7.3 introduces the **Table Extraction Strategy Pattern**, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.

## What's New

### Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

- **Consistent Architecture**: Aligns with the extraction, chunking, and markdown strategies
- **Extensibility**: Custom table extraction strategies are easy to create
- **Better Separation**: Table logic moved from content scraping to a dedicated module
- **Full Control**: Fine-grained control over table detection and extraction

### New Classes

```python
from crawl4ai import (
    TableExtractionStrategy,  # Abstract base class
    DefaultTableExtraction,   # Current implementation (default)
    NoTableExtraction         # Explicitly disable extraction
)
```

## Backward Compatibility

**✅ All existing code continues to work without changes.**

### No Changes Required

If your code looks like this, it will continue to work:

```python
# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data
```

### What Happens Behind the Scenes

When you don't specify a `table_extraction` strategy:

1. `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
2. It uses your `table_score_threshold` parameter
3. Tables are extracted exactly as before
4. Results appear in `result.tables` with the same structure

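The fallback described in the steps above can be pictured roughly as follows. This is a minimal sketch, not Crawl4AI's actual code: `SketchRunConfig` and `SketchDefaultTableExtraction` are hypothetical stand-ins for `CrawlerRunConfig` and `DefaultTableExtraction`, and the default threshold of 7 is an assumption taken from the examples in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchDefaultTableExtraction:
    """Hypothetical stand-in for DefaultTableExtraction; only the threshold is modeled."""
    table_score_threshold: int = 7  # assumed default, for illustration only


@dataclass
class SketchRunConfig:
    """Hypothetical stand-in for CrawlerRunConfig, not the real class."""
    table_score_threshold: int = 7
    table_extraction: Optional[SketchDefaultTableExtraction] = None

    def __post_init__(self) -> None:
        # No strategy supplied: build the default one from the legacy parameter.
        if self.table_extraction is None:
            self.table_extraction = SketchDefaultTableExtraction(
                table_score_threshold=self.table_score_threshold
            )


legacy = SketchRunConfig(table_score_threshold=5)
explicit = SketchRunConfig(table_extraction=SketchDefaultTableExtraction(9))
print(legacy.table_extraction.table_score_threshold)    # 5
print(explicit.table_extraction.table_score_threshold)  # 9
```

In this sketch an explicitly supplied strategy is used as given, while the legacy parameter only feeds the automatically created default. Whether the real `CrawlerRunConfig` resolves the two in exactly this way is not stated here, so treat the precedence as an assumption.
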
## New Capabilities

### 1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

```python
# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,    # New: minimum row filter
    min_cols=2,    # New: minimum column filter
    verbose=True   # New: detailed logging
)

config = CrawlerRunConfig(
    table_extraction=strategy
)
```

### 2. Disable Table Extraction

Improve performance when tables aren't needed:

```python
# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction
```

### 3. Custom Extraction Strategies

Create specialized extractors:

```python
from crawl4ai import CrawlerRunConfig, TableExtractionStrategy

class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic goes here; build a list of table dicts
        # (the examples in this guide read 'headers' and 'rows' from each one)
        custom_tables = []
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
```

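To make the shape of a custom extractor more concrete, here is a self-contained sketch that runs without Crawl4AI installed. `SketchTableExtractionStrategy` is a hypothetical stand-in for the real base class, the standard library's `xml.etree.ElementTree` replaces whatever element type the crawler actually passes in, and the "first row is the header" rule is purely illustrative. Only the `extract_tables(self, element, **kwargs)` signature and the `headers`/`rows` keys come from this guide.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchTableExtractionStrategy(ABC):
    """Hypothetical stand-in for crawl4ai's TableExtractionStrategy."""

    @abstractmethod
    def extract_tables(self, element: ET.Element, **kwargs: Any) -> List[Dict[str, Any]]:
        ...


class FirstRowHeaderExtractor(SketchTableExtractionStrategy):
    """Illustrative rule: treat the first row of every <table> as its headers."""

    def extract_tables(self, element: ET.Element, **kwargs: Any) -> List[Dict[str, Any]]:
        tables = []
        for table in element.iter("table"):
            # Collect the text of each th/td cell, row by row.
            rows = [
                [(cell.text or "").strip() for cell in tr if cell.tag in ("th", "td")]
                for tr in table.iter("tr")
            ]
            rows = [r for r in rows if r]  # drop rows without cells
            if rows:
                tables.append({"headers": rows[0], "rows": rows[1:]})
        return tables


html = """<div><table>
<tr><th>Name</th><th>Qty</th></tr>
<tr><td>Bolt</td><td>4</td></tr>
</table></div>"""
print(FirstRowHeaderExtractor().extract_tables(ET.fromstring(html)))
# [{'headers': ['Name', 'Qty'], 'rows': [['Bolt', '4']]}]
```

A real strategy would subclass `TableExtractionStrategy` from `crawl4ai` instead of the stand-in and would need to cope with malformed HTML, nested tables, and merged cells, none of which this sketch handles.
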
## Migration Scenarios

### Scenario 1: Basic Usage (No Changes Needed)

**Before (v0.7.2):**
```python
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

**After (v0.7.3):**
```python
# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

### Scenario 2: Custom Threshold (No Changes Needed)

**Before (v0.7.2):**
```python
config = CrawlerRunConfig(
    table_score_threshold=5
)
```

**After (v0.7.3):**
```python
# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```

### Scenario 3: Advanced Filtering (New Feature)

**Before (v0.7.2):**
```python
# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]
```

**After (v0.7.3):**
```python
# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables already filtered
```

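The size filter in this scenario boils down to a simple predicate on each candidate table. The sketch below mirrors the manual v0.7.2 check as a reusable function. `passes_size_filter` is a hypothetical helper, not part of Crawl4AI, and it assumes that column count equals the number of headers, as in the manual filter above; how `DefaultTableExtraction` counts rows and columns internally may differ.

```python
from typing import Any, Dict, List

Table = Dict[str, Any]


def passes_size_filter(table: Table, min_rows: int = 0, min_cols: int = 0) -> bool:
    """Hypothetical helper: True when the table meets both minimums."""
    return len(table["rows"]) >= min_rows and len(table["headers"]) >= min_cols


# Three made-up candidate tables with different shapes.
candidates: List[Table] = [
    {"headers": ["a", "b", "c"], "rows": [[1, 2, 3]] * 6},  # 6 rows, 3 cols
    {"headers": ["a", "b"], "rows": [[1, 2]] * 8},          # 8 rows, 2 cols
    {"headers": ["a", "b", "c"], "rows": [[1, 2, 3]] * 2},  # 2 rows, 3 cols
]

kept = [t for t in candidates if passes_size_filter(t, min_rows=5, min_cols=3)]
print(len(kept))  # 1
```

With both minimums left at 0 every table passes, which matches the "no filtering" settings shown in the troubleshooting section.
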
## Code Organization Changes

### Module Structure

**Before (v0.7.2):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()       # Table detection
      - extract_table_data()  # Table extraction
```

**After (v0.7.3):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      # Table methods removed, uses strategy

  table_extraction.py (NEW)
    - TableExtractionStrategy  # Base class
    - DefaultTableExtraction   # Moved logic here
    - NoTableExtraction        # New option
```

### Import Changes

**New imports available (optional):**
```python
# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction
)
```

## Performance Implications

### No Performance Impact

For existing code, performance remains identical:
- Same extraction logic
- Same scoring algorithm
- Same processing time

### Performance Improvements Available

New options for better performance:

```python
# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)
```

## Testing Your Migration

### Verification Script

Run this to verify your extraction still works:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"

    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']

        print("✓ Table content identical")

asyncio.run(verify_extraction())
```

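When the two runs disagree, a bare `assert` only reports the first mismatch. The following sketch collects every difference instead, which can make it easier to see whether the problem is a threshold, a filter, or changed page content. `diff_tables` is a hypothetical helper written for this guide. It assumes each table is a dict with `headers` and `rows` keys, as in the examples above.

```python
from typing import Any, Dict, List


def diff_tables(old: List[Dict[str, Any]], new: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: list human-readable differences between two table lists."""
    problems = []
    if len(old) != len(new):
        problems.append(f"table count: {len(old)} != {len(new)}")
    # zip stops at the shorter list, so only tables present in both are compared.
    for i, (a, b) in enumerate(zip(old, new)):
        if a["headers"] != b["headers"]:
            problems.append(f"table {i}: headers differ")
        if a["rows"] != b["rows"]:
            problems.append(f"table {i}: rows differ")
    return problems


# Made-up sample data standing in for result_old.tables and result_new.tables.
before = [{"headers": ["Name", "Qty"], "rows": [["Bolt", "4"]]}]
after = [{"headers": ["Name", "Qty"], "rows": [["Bolt", "5"]]}]
print(diff_tables(before, before))  # []
print(diff_tables(before, after))   # ['table 0: rows differ']
```

In the verification script, `diff_tables(result_old.tables, result_new.tables)` could replace the individual assertions, with an empty list meaning the two approaches agree.
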
## Deprecation Notes

### No Deprecations

- All existing parameters continue to work
- `table_score_threshold` in `CrawlerRunConfig` is still supported
- No breaking changes

### Internal Changes (Transparent to Users)

- `LXMLWebScrapingStrategy.is_data_table()` - Moved to `DefaultTableExtraction`
- `LXMLWebScrapingStrategy.extract_table_data()` - Moved to `DefaultTableExtraction`

These methods were internal and not part of the public API.

## Benefits of Upgrading

While not required, using the new pattern provides:

1. **Better Control**: Filter tables during extraction, not after
2. **Performance Options**: Skip extraction when not needed
3. **Extensibility**: Create custom extractors for specific needs
4. **Consistency**: Same pattern as other Crawl4AI strategies
5. **Future-Proof**: Ready for upcoming advanced strategies

## Troubleshooting

### Issue: Different Number of Tables

**Cause**: Threshold or filtering differences

**Solution**:
```python
# Ensure same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,               # No filtering (default)
    min_cols=0                # No filtering (default)
)
```

### Issue: Import Errors

**Cause**: Using new classes without importing

**Solution**:
```python
# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy
)
```

### Issue: Custom Strategy Not Working

**Cause**: Incorrect method signature

**Solution**:
```python
from crawl4ai import TableExtractionStrategy

class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        tables_list = []  # Build your list of table dicts here
        return tables_list
```

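A signature mismatch like the ones listed above can be caught before a crawl ever runs. The sketch below uses the standard library's `inspect` module to check that a class defines `extract_tables` with a positional element parameter and a `**kwargs` catch-all. `has_expected_signature` and the three sample classes are hypothetical names for illustration, and the check reflects only the signature shown in this guide.

```python
import inspect


def has_expected_signature(cls) -> bool:
    """Hypothetical check for an extract_tables(self, element, **kwargs) method."""
    method = getattr(cls, "extract_tables", None)
    if method is None:
        return False
    params = list(inspect.signature(method).parameters.values())
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    # params[0] is self; params[1] should accept the element positionally.
    accepts_element = len(params) >= 2 and params[1].kind in positional
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)
    return accepts_element and accepts_kwargs


class GoodExtractor:
    def extract_tables(self, element, **kwargs):
        return []


class MissingKwargs:
    def extract_tables(self, html):
        return []


class WrongMethodName:
    def extract(self, element):
        return []


print(has_expected_signature(GoodExtractor))    # True
print(has_expected_signature(MissingKwargs))    # False
print(has_expected_signature(WrongMethodName))  # False
```

The sample classes deliberately do not inherit from the real base class, so the third case reports a missing method. Applied to an actual subclass of `TableExtractionStrategy`, a misnamed method would more likely surface as an abstract-method error at instantiation, assuming the base class declares `extract_tables` as abstract.
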
## Getting Help

If you encounter issues:

1. Check your `table_score_threshold` matches previous settings
2. Verify imports if using new classes
3. Enable verbose logging: `DefaultTableExtraction(verbose=True)`
4. Review the [Table Extraction Documentation](../core/table_extraction.md)
5. Check [examples](../examples/table_extraction_example.py)

## Summary

- ✅ **Full backward compatibility** - No code changes required
- ✅ **Same results** - Identical extraction behavior by default
- ✅ **New options** - Additional control when needed
- ✅ **Better architecture** - Consistent with Crawl4AI patterns
- ✅ **Ready for the future** - Foundation for advanced strategies

The migration to v0.7.3 is seamless: no changes are required, and new capabilities are available for those who need them.