
# Migration Guide: Table Extraction v0.7.3

## Overview

Version 0.7.3 introduces the **Table Extraction Strategy Pattern**, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.

## What's New

### Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

- **Consistent Architecture**: Aligns with extraction, chunking, and markdown strategies
- **Extensibility**: Easy to create custom table extraction strategies
- **Better Separation**: Table logic moved from content scraping to a dedicated module
- **Full Control**: Fine-grained control over table detection and extraction

### New Classes

```python
from crawl4ai import (
    TableExtractionStrategy,    # Abstract base class
    DefaultTableExtraction,      # Current implementation (default)
    NoTableExtraction           # Explicitly disable extraction
)
```

## Backward Compatibility

**✅ All existing code continues to work without changes.**

### No Changes Required

If your code looks like this, it will continue to work:

```python
# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data
```

### What Happens Behind the Scenes

When you don't specify a `table_extraction` strategy:

1. `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
2. It uses your `table_score_threshold` parameter
3. Tables are extracted exactly as before
4. Results appear in `result.tables` with the same structure
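
The fallback described in the steps above can be pictured with a small sketch. The classes below are simplified stand-ins written for illustration, not Crawl4AI's actual source: `SketchRunConfig` and `SketchDefaultTableExtraction` are hypothetical names, and the default threshold of 7 is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchDefaultTableExtraction:
    """Stand-in for DefaultTableExtraction (illustration only)."""
    table_score_threshold: int = 7  # assumed default for this sketch


@dataclass
class SketchRunConfig:
    """Stand-in for CrawlerRunConfig (illustration only)."""
    table_score_threshold: int = 7
    table_extraction: Optional[object] = None

    def __post_init__(self):
        # No explicit strategy given: build the default one
        # from the legacy threshold parameter.
        if self.table_extraction is None:
            self.table_extraction = SketchDefaultTableExtraction(
                table_score_threshold=self.table_score_threshold
            )


# Legacy-style configuration: only the threshold is supplied.
legacy = SketchRunConfig(table_score_threshold=5)
print(type(legacy.table_extraction).__name__,
      legacy.table_extraction.table_score_threshold)
```

The point of the pattern is that old-style and new-style configurations end up on the same code path, which is why results should not change for existing callers.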

## New Capabilities

### 1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

```python
# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,              # New: minimum row filter
    min_cols=2,              # New: minimum column filter
    verbose=True             # New: detailed logging
)

config = CrawlerRunConfig(
    table_extraction=strategy
)
```

### 2. Disable Table Extraction

Improve performance when tables aren't needed:

```python
# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction
```

### 3. Custom Extraction Strategies

Create specialized extractors:

```python
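from crawl4ai import CrawlerRunConfig, TableExtractionStrategy

class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic goes here; append one dict per table,
        # e.g. {"headers": [...], "rows": [[...], ...]}
        custom_tables = []
        return custom_tables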

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
```
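
As a rough idea of what the body of a custom extractor might contain, here is a deliberately naive sketch. It assumes the element handed to `extract_tables` supports `iter`, `findall`, and `itertext`, as both lxml and the standard library's `ElementTree` elements do. The helper name `extract_simple_tables` is hypothetical, and the output uses only the `headers` and `rows` keys shown elsewhere in this guide. The demo uses `xml.etree.ElementTree` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def cell_text(cell) -> str:
    """Concatenate all text inside a cell and trim whitespace."""
    return "".join(cell.itertext()).strip()


def extract_simple_tables(element) -> List[Dict[str, Any]]:
    """Collect every <table> under `element` as {'headers': [...], 'rows': [[...]]}.

    Naive on purpose: no scoring, no colspan/rowspan handling,
    and rows of nested tables are folded into their parent.
    """
    tables = []
    for table in element.iter("table"):
        headers: List[str] = []
        rows: List[List[str]] = []
        for tr in table.iter("tr"):
            header_cells = [cell_text(c) for c in tr.findall("th")]
            data_cells = [cell_text(c) for c in tr.findall("td")]
            if header_cells and not headers:
                headers = header_cells  # first header row wins
            elif data_cells:
                rows.append(data_cells)
        tables.append({"headers": headers, "rows": rows})
    return tables


html = (
    "<div><table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr>"
    "</table></div>"
)
tables = extract_simple_tables(ET.fromstring(html))
print(tables)
```

Inside a `TableExtractionStrategy` subclass, `extract_tables(self, element, **kwargs)` could then simply return `extract_simple_tables(element)`.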

## Migration Scenarios

### Scenario 1: Basic Usage (No Changes Needed)

**Before (v0.7.2):**
```python
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

**After (v0.7.3):**
```python
# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

### Scenario 2: Custom Threshold (No Changes Needed)

**Before (v0.7.2):**
```python
config = CrawlerRunConfig(
    table_score_threshold=5
)
```

**After (v0.7.3):**
```python
# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```

### Scenario 3: Advanced Filtering (New Feature)

**Before (v0.7.2):**
```python
# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]
```

**After (v0.7.3):**
```python
# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables already filtered
```

## Code Organization Changes

### Module Structure

**Before (v0.7.2):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()      # Table detection
      - extract_table_data() # Table extraction
```

**After (v0.7.3):**
```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      # Table methods removed; delegates to the table extraction strategy

  table_extraction.py (NEW)
    - TableExtractionStrategy    # Base class
    - DefaultTableExtraction      # Moved logic here
    - NoTableExtraction          # New option
```

### Import Changes

**New imports available (optional):**
```python
# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction
)
```

## Performance Implications

### No Performance Impact

For existing code, performance should be effectively unchanged, because the default path runs the same logic:
- Same extraction logic
- Same scoring algorithm
- Comparable processing time

### Performance Improvements Available

New options for better performance:

```python
# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)
```
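
Whether these options help in practice depends on your pages, so it is worth measuring. The helper below is a minimal timing sketch using only the standard library. `time_async` and `fake_crawl` are hypothetical names, and the placeholder workload stands in for a real crawl call.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def time_async(run: Callable[[], Awaitable[object]], repeats: int = 3) -> float:
    """Return the best wall-clock time (seconds) over `repeats` awaited calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        await run()
        best = min(best, time.perf_counter() - start)
    return best


async def demo() -> float:
    # Placeholder workload; in real use, pass something like
    # `lambda: crawler.arun(url, config)` instead.
    async def fake_crawl():
        await asyncio.sleep(0.01)

    return await time_async(fake_crawl)


elapsed = asyncio.run(demo())
print(f"best of 3: {elapsed:.4f}s")
```

Run it once per configuration and compare the results. Network latency and any caching between runs can easily outweigh the cost of table extraction, so treat small differences with caution.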

## Testing Your Migration

### Verification Script

Run this to verify your extraction still works:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"

    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']

        print("✓ Table content identical")

asyncio.run(verify_extraction())
```

## Deprecation Notes

### No Deprecations

- All existing parameters continue to work
- `table_score_threshold` in `CrawlerRunConfig` is still supported
- No breaking changes

### Internal Changes (Transparent to Users)

- `LXMLWebScrapingStrategy.is_data_table()` - Moved to `DefaultTableExtraction`
- `LXMLWebScrapingStrategy.extract_table_data()` - Moved to `DefaultTableExtraction`

These methods were internal and not part of the public API.

## Benefits of Upgrading

While not required, using the new pattern provides:

1. **Better Control**: Filter tables during extraction, not after
2. **Performance Options**: Skip extraction when not needed
3. **Extensibility**: Create custom extractors for specific needs
4. **Consistency**: Same pattern as other Crawl4AI strategies
5. **Future-Proof**: Ready for upcoming advanced strategies

## Troubleshooting

### Issue: Different Number of Tables

**Cause**: Threshold or filtering differences

**Solution**:
```python
# Ensure same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,               # No filtering (default)
    min_cols=0                # No filtering (default)
)
```
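
If the counts still differ, it helps to see which tables are missing from one side. The following sketch compares two lists of table dicts by headers and row count. `table_key` and `diff_tables` are hypothetical helper names, and matching on headers plus row count is a heuristic that can confuse two tables sharing both.

```python
from typing import Any, Dict, List, Tuple

Table = Dict[str, Any]


def table_key(table: Table) -> Tuple[Tuple[Any, ...], int]:
    """Identify a table by its headers and row count (a heuristic, not a true ID)."""
    return (tuple(table.get("headers", [])), len(table.get("rows", [])))


def diff_tables(old: List[Table], new: List[Table]) -> Dict[str, list]:
    """Report table keys that appear in only one of the two results."""
    old_keys = [table_key(t) for t in old]
    new_keys = [table_key(t) for t in new]
    return {
        "only_in_old": [k for k in old_keys if k not in new_keys],
        "only_in_new": [k for k in new_keys if k not in old_keys],
    }


# Sample data standing in for result_old.tables and result_new.tables.
old = [
    {"headers": ["A", "B"], "rows": [[1, 2]]},
    {"headers": ["X"], "rows": []},
]
new = [{"headers": ["A", "B"], "rows": [[1, 2]]}]

report = diff_tables(old, new)
print(report)
```

A table that shows up only on one side and has few rows or columns usually points to a `min_rows` or `min_cols` difference rather than a threshold difference.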

### Issue: Import Errors

**Cause**: Using the new classes without importing them

**Solution**:
```python
# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy
)
```
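
If the import itself fails, one possible explanation (an assumption, since this guide only says the classes arrive in v0.7.3) is that an older release is installed. The sketch below checks the installed version with the standard library. `parse_version` and `installed_version` are hypothetical helpers, and pre-release suffixes are ignored for simplicity.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.7.3' or '0.7.3rc1' into (0, 7, 3); non-numeric suffixes are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def installed_version(package: str = "crawl4ai") -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("crawl4ai is not installed in this environment")
elif parse_version(found) < (0, 7, 3):
    print(f"crawl4ai {found} predates the table strategy classes; upgrade to 0.7.3 or later")
else:
    print(f"crawl4ai {found} should expose the new classes")
```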

### Issue: Custom Strategy Not Working

**Cause**: Incorrect method signature

**Solution**:
```python
class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        tables_list = []  # Fill with one dict per extracted table
        return tables_list
```
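
A quick way to catch a wrong signature before running a crawl is to inspect the method. The checker below is a sketch using the standard `inspect` module. `check_extract_tables_signature` is a hypothetical helper, and the demo classes are plain classes rather than real `TableExtractionStrategy` subclasses so that the snippet runs on its own. Requiring the parameter to be named `element` follows the convention shown above and may be stricter than the library needs.

```python
import inspect
from typing import List


def check_extract_tables_signature(cls) -> List[str]:
    """Return a list of problems with cls.extract_tables; empty means it looks right."""
    method = getattr(cls, "extract_tables", None)
    if method is None or not callable(method):
        return ["no extract_tables method defined"]
    params = list(inspect.signature(method).parameters.values())
    problems = []
    # params[0] is `self`; the table root should come next.
    if len(params) < 2 or params[1].name != "element":
        problems.append("second parameter should be named 'element'")
    if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        problems.append("method should accept **kwargs")
    return problems


class GoodExtractor:
    def extract_tables(self, element, **kwargs):
        return []


class BadExtractor:
    def extract_tables(self, html):
        return []


print(check_extract_tables_signature(GoodExtractor))
print(check_extract_tables_signature(BadExtractor))
```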

## Getting Help

If you encounter issues:

1. Check that your `table_score_threshold` matches your previous setting
2. Verify imports if using new classes
3. Enable verbose logging: `DefaultTableExtraction(verbose=True)`
4. Review the [Table Extraction Documentation](../core/table_extraction.md)
5. Check [examples](../examples/table_extraction_example.py)

## Summary

- ✅ **Full backward compatibility** - No code changes required
- ✅ **Same results** - Identical extraction behavior by default
- ✅ **New options** - Additional control when needed
- ✅ **Better architecture** - Consistent with Crawl4AI patterns
- ✅ **Ready for the future** - Foundation for advanced strategies

Migrating to v0.7.3 requires no code changes, and the new capabilities are available to those who need them.