# Migration Guide: Table Extraction v0.7.3

## Overview

Version 0.7.3 introduces the Table Extraction Strategy Pattern, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.
## What's New

### Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

- Consistent Architecture: Aligns with the extraction, chunking, and markdown strategies
- Extensibility: Easy to create custom table extraction strategies
- Better Separation: Table logic moved from content scraping into a dedicated module
- Full Control: Fine-grained control over table detection and extraction
### New Classes

```python
from crawl4ai import (
    TableExtractionStrategy,  # Abstract base class
    DefaultTableExtraction,   # Current implementation (default)
    NoTableExtraction,        # Explicitly disable extraction
)
```
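The relationship between these classes can be sketched as follows. This is a standalone illustration of the pattern, not the library's actual source: the `extract_tables(self, element, **kwargs)` signature matches the one shown in the Troubleshooting section, but the class bodies and the default parameter values here are assumptions.

```python
from abc import ABC, abstractmethod

# Standalone sketch of the strategy pattern; the class names mirror
# Crawl4AI's, but the bodies are illustrative, not the library's code.
class TableExtractionStrategy(ABC):
    @abstractmethod
    def extract_tables(self, element, **kwargs):
        """Return a list of table dicts, e.g. {'headers': [...], 'rows': [...]}."""

class NoTableExtraction(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        return []  # skip table extraction entirely

class DefaultTableExtraction(TableExtractionStrategy):
    def __init__(self, table_score_threshold=7, min_rows=0, min_cols=0, verbose=False):
        self.table_score_threshold = table_score_threshold
        self.min_rows = min_rows
        self.min_cols = min_cols
        self.verbose = verbose

    def extract_tables(self, element, **kwargs):
        # The real implementation scores and parses <table> elements;
        # that logic is elided in this sketch.
        return []
```

Because the crawler only depends on the abstract interface, any object implementing `extract_tables` can be plugged in via `CrawlerRunConfig(table_extraction=...)`.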
## Backward Compatibility

✅ All existing code continues to work without changes.

### No Changes Required

If your code looks like this, it will continue to work:

```python
# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data
```
### What Happens Behind the Scenes

When you don't specify a `table_extraction` strategy:

- `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
- It uses your `table_score_threshold` parameter
- Tables are extracted exactly as before
- Results appear in `result.tables` with the same structure
## New Capabilities

### 1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

```python
# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,    # New: minimum row filter
    min_cols=2,    # New: minimum column filter
    verbose=True   # New: detailed logging
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### 2. Disable Table Extraction

Improve performance when tables aren't needed:

```python
# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction
```
### 3. Custom Extraction Strategies

Create specialized extractors:

```python
class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
```
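The custom logic can be anything that turns markup into the `headers`/`rows` structure used by `result.tables`. As a simplified, self-contained sketch, the helper below parses a plain HTML fragment using only Python's standard library; `parse_simple_table` is a hypothetical name for illustration, and in real use the `element` passed by Crawl4AI is a parsed node rather than a raw string.

```python
from html.parser import HTMLParser

class _SimpleTableParser(HTMLParser):
    """Collect cell text from the rows of an HTML <table> fragment."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_simple_table(html: str) -> dict:
    """Return a table dict in the headers/rows shape used by result.tables."""
    parser = _SimpleTableParser()
    parser.feed(html)
    headers = parser.rows[0] if parser.rows else []
    return {"headers": headers, "rows": parser.rows[1:]}
```

A strategy's `extract_tables` would then return a list of such dicts for the tables it finds under `element`.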
## Migration Scenarios

### Scenario 1: Basic Usage (No Changes Needed)

Before (v0.7.2):

```python
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```

After (v0.7.3):

```python
# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
```
### Scenario 2: Custom Threshold (No Changes Needed)

Before (v0.7.2):

```python
config = CrawlerRunConfig(
    table_score_threshold=5
)
```

After (v0.7.3):

```python
# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use the new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
```
### Scenario 3: Advanced Filtering (New Feature)

Before (v0.7.2):

```python
# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]
```

After (v0.7.3):

```python
# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables is already filtered
```
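If you must keep supporting v0.7.2 alongside v0.7.3, the manual post-hoc filtering can live in one small helper so the rest of your code is version-agnostic. `filter_tables` is a hypothetical name, not a Crawl4AI function; it assumes the `headers`/`rows` table shape shown above.

```python
def filter_tables(tables, min_rows=0, min_cols=0):
    """Post-hoc equivalent of DefaultTableExtraction's min_rows/min_cols filters."""
    return [
        t for t in tables
        if len(t["rows"]) >= min_rows and len(t["headers"]) >= min_cols
    ]
```

On v0.7.3 you would pass the thresholds to `DefaultTableExtraction` instead and skip this step.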
## Code Organization Changes

### Module Structure

Before (v0.7.2):

```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()        # Table detection
      - extract_table_data()   # Table extraction
```

After (v0.7.3):

```
crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy  # Table methods removed, uses strategy
  table_extraction.py (NEW)
    - TableExtractionStrategy  # Base class
    - DefaultTableExtraction   # Moved logic here
    - NoTableExtraction        # New option
```
### Import Changes

New imports are available (optional):

```python
# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction,
)
```
## Performance Implications

### No Performance Impact

For existing code, performance remains identical:

- Same extraction logic
- Same scoring algorithm
- Same processing time

### Performance Improvements Available

New options for better performance:

```python
# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)
```
## Testing Your Migration

### Verification Script

Run this to verify your extraction still works:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"
    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']
        print("✓ Table content identical")

asyncio.run(verify_extraction())
```
## Deprecation Notes

### No Deprecations

- All existing parameters continue to work
- `table_score_threshold` in `CrawlerRunConfig` is still supported
- No breaking changes

### Internal Changes (Transparent to Users)

- `LXMLWebScrapingStrategy.is_data_table()` - moved to `DefaultTableExtraction`
- `LXMLWebScrapingStrategy.extract_table_data()` - moved to `DefaultTableExtraction`

These methods were internal and not part of the public API.
## Benefits of Upgrading

While not required, using the new pattern provides:

- Better Control: Filter tables during extraction, not after
- Performance Options: Skip extraction when not needed
- Extensibility: Create custom extractors for specific needs
- Consistency: Same pattern as other Crawl4AI strategies
- Future-Proof: Ready for upcoming advanced strategies
## Troubleshooting

### Issue: Different Number of Tables

Cause: Threshold or filtering differences

Solution:

```python
# Ensure the same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,  # No filtering (default)
    min_cols=0   # No filtering (default)
)
```

### Issue: Import Errors

Cause: Using new classes without importing them

Solution:

```python
# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy,
)
```

### Issue: Custom Strategy Not Working

Cause: Incorrect method signature

Solution:

```python
class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        return tables_list
```
## Getting Help

If you encounter issues:

- Check that your `table_score_threshold` matches your previous settings
- Verify imports if using the new classes
- Enable verbose logging: `DefaultTableExtraction(verbose=True)`
- Review the Table Extraction Documentation
- Check the examples
## Summary

- ✅ Full backward compatibility - No code changes required
- ✅ Same results - Identical extraction behavior by default
- ✅ New options - Additional control when needed
- ✅ Better architecture - Consistent with Crawl4AI patterns
- ✅ Ready for the future - Foundation for advanced strategies

The migration to v0.7.3 is seamless, with no required changes, while providing new capabilities for those who need them.