# Table Extraction Strategies
## Overview
**New in v0.7.3+**: Table extraction now follows the **Strategy Design Pattern**, providing unprecedented flexibility and power for handling different table structures. Don't worry - **your existing code still works!** We maintain full backward compatibility while offering new capabilities.
### What's Changed?
- **Architecture**: Table extraction now uses pluggable strategies
- **Backward Compatible**: Your existing code with `table_score_threshold` continues to work
- **More Power**: Choose from multiple strategies or create your own
- **Same Default Behavior**: By default, uses `DefaultTableExtraction` (same as before)
### Key Points
✅ **Old code still works** - No breaking changes
✅ **Same default behavior** - Uses the proven extraction algorithm
✅ **New capabilities** - Add LLM extraction or custom strategies when needed
✅ **Strategy pattern** - Clean, extensible architecture
## Quick Start
### The Simplest Way (Works Like Before)
If you're already using Crawl4AI, nothing changes:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def extract_tables():
async with AsyncWebCrawler() as crawler:
# This works exactly like before - uses DefaultTableExtraction internally
result = await crawler.arun("https://example.com/data")
# Tables are automatically extracted and available in result.tables
for table in result.tables:
print(f"Table with {len(table['rows'])} rows and {len(table['headers'])} columns")
print(f"Headers: {table['headers']}")
print(f"First row: {table['rows'][0] if table['rows'] else 'No data'}")
asyncio.run(extract_tables())
```
### Using the Old Configuration (Still Supported)
Your existing code with `table_score_threshold` continues to work:
```python
# This old approach STILL WORKS - we maintain backward compatibility
config = CrawlerRunConfig(
table_score_threshold=7 # Internally creates DefaultTableExtraction(table_score_threshold=7)
)
result = await crawler.arun(url, config)
```
## Table Extraction Strategies
### Understanding the Strategy Pattern
The strategy pattern allows you to choose different table extraction algorithms at runtime. Think of it as having different tools in a toolbox - you pick the right one for the job:
- **No explicit strategy?** → Uses `DefaultTableExtraction` automatically (same as v0.7.2 and earlier)
- **Need complex table handling?** → Choose `LLMTableExtraction` (costs money, use sparingly)
- **Want to disable tables?** → Use `NoTableExtraction`
- **Have special requirements?** → Create a custom strategy
### Available Strategies
| Strategy | Description | Use Case | Cost | When to Use |
|----------|-------------|----------|------|-------------|
| `DefaultTableExtraction` | **RECOMMENDED**: Same algorithm as before v0.7.3 | General purpose (default) | Free | **Use this first - handles 95% of cases** |
| `LLMTableExtraction` | AI-powered extraction for complex tables | Tables with complex rowspan/colspan | **$$$ Per API call** | Only when DefaultTableExtraction fails |
| `NoTableExtraction` | Disables table extraction | When tables aren't needed | Free | For text-only extraction |
| Custom strategies | User-defined extraction logic | Specialized requirements | Free | Domain-specific needs |
> **⚠️ CRITICAL COST WARNING for LLMTableExtraction**:
>
> **DO NOT USE `LLMTableExtraction` UNLESS ABSOLUTELY NECESSARY!**
>
> - **Always try `DefaultTableExtraction` first** - It's free and handles most tables perfectly
> - LLM extraction **costs money** with every API call
> - For large tables (100+ rows), LLM extraction can be **very slow**
> - **For large tables**: If you must use LLM, choose fast providers:
> - ✅ **Groq** (fastest inference)
> - ✅ **Cerebras** (optimized for speed)
> - ⚠️ Avoid: OpenAI, Anthropic for large tables (slower)
>
> **🚧 WORK IN PROGRESS**:
> We are actively developing an **advanced non-LLM algorithm** that will handle complex table structures (rowspan, colspan, nested tables) for **FREE**. This will replace the need for costly LLM extraction in most cases. Coming soon!
### DefaultTableExtraction
The default strategy uses a sophisticated scoring system to identify data tables:
```python
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
# Customize the default extraction
table_strategy = DefaultTableExtraction(
table_score_threshold=7, # Scoring threshold (default: 7)
min_rows=2, # Minimum rows required
min_cols=2, # Minimum columns required
verbose=True # Enable detailed logging
)
config = CrawlerRunConfig(
table_extraction=table_strategy
)
```
#### Scoring System
The scoring system evaluates multiple factors:
| Factor | Score Impact | Description |
|--------|--------------|-------------|
| Has `` | +2 | Semantic table structure |
| Has `
` | +1 | Organized table body |
| Has `` elements | +2 | Header cells present |
| Headers in correct position | +1 | Proper semantic structure |
| Consistent column count | +2 | Regular data structure |
| Has caption | +2 | Descriptive caption |
| Has summary | +1 | Summary attribute |
| High text density | +2 to +3 | Content-rich cells |
| Data attributes | +0.5 each | Data-* attributes |
| Nested tables | -3 | Often indicates layout |
| Role="presentation" | -3 | Explicitly non-data |
| Too few rows | -2 | Insufficient data |
### LLMTableExtraction (Use Sparingly!)
**⚠️ WARNING**: Only use this when `DefaultTableExtraction` fails with complex tables!
LLMTableExtraction uses AI to understand complex table structures that traditional parsers struggle with. It automatically handles large tables through intelligent chunking and parallel processing:
```python
from crawl4ai import LLMTableExtraction, LLMConfig, CrawlerRunConfig
# Configure LLM (costs money per call!)
llm_config = LLMConfig(
provider="groq/llama-3.3-70b-versatile", # Fast provider for large tables
api_token="your_api_key",
temperature=0.1
)
# Create LLM extraction strategy with smart chunking
table_strategy = LLMTableExtraction(
llm_config=llm_config,
max_tries=3, # Retry up to 3 times if extraction fails
css_selector="table", # Optional: focus on specific tables
enable_chunking=True, # Automatically chunk large tables (default: True)
chunk_token_threshold=3000, # Split tables larger than this (default: 3000 tokens)
min_rows_per_chunk=10, # Minimum rows per chunk (default: 10)
max_parallel_chunks=5, # Process up to 5 chunks in parallel (default: 5)
verbose=True
)
config = CrawlerRunConfig(
table_extraction=table_strategy
)
result = await crawler.arun(url, config)
```
#### When to Use LLMTableExtraction
✅ **Use ONLY when**:
- Tables have complex merged cells (rowspan/colspan) that break DefaultTableExtraction
- Nested tables that need semantic understanding
- Tables with irregular structures
- You've tried DefaultTableExtraction and it failed
❌ **Never use when**:
- DefaultTableExtraction works (99% of cases)
- Tables are simple or well-structured
- You're processing many pages (costs add up!)
- Tables have 100+ rows (very slow)
#### How Smart Chunking Works
LLMTableExtraction automatically handles large tables through intelligent chunking:
1. **Automatic Detection**: Tables exceeding the token threshold are automatically split
2. **Smart Splitting**: Chunks are created at row boundaries, preserving table structure
3. **Header Preservation**: Each chunk includes the original headers for context
4. **Parallel Processing**: Multiple chunks are processed simultaneously for speed
5. **Intelligent Merging**: Results are merged back into a single, complete table
**Chunking Parameters**:
- `enable_chunking` (default: `True`): Automatically handle large tables
- `chunk_token_threshold` (default: `3000`): When to split tables
- `min_rows_per_chunk` (default: `10`): Ensures meaningful chunk sizes
- `max_parallel_chunks` (default: `5`): Concurrent processing for speed
The chunking is completely transparent - you get the same output format whether the table was processed in one piece or multiple chunks.
#### Performance Optimization for LLMTableExtraction
**Provider Recommendations by Table Size**:
| Table Size | Recommended Providers | Why |
|------------|----------------------|-----|
| Small (<50 rows) | Any provider | Fast enough |
| Medium (50-200 rows) | Groq, Cerebras | Optimized inference |
| Large (200+ rows) | **Groq** (best), Cerebras | Fastest inference + automatic chunking |
| Very Large (500+ rows) | Groq with chunking | Parallel processing keeps it fast |
### NoTableExtraction
Disable table extraction for better performance when tables aren't needed:
```python
from crawl4ai import NoTableExtraction, CrawlerRunConfig
config = CrawlerRunConfig(
table_extraction=NoTableExtraction()
)
# Tables won't be extracted, improving performance
result = await crawler.arun(url, config)
assert len(result.tables) == 0
```
## Extracted Table Structure
Each extracted table contains:
```python
{
"headers": ["Column 1", "Column 2", ...], # Column headers
"rows": [ # Data rows
["Row 1 Col 1", "Row 1 Col 2", ...],
["Row 2 Col 1", "Row 2 Col 2", ...],
],
"caption": "Table Caption", # If present
"summary": "Table Summary", # If present
"metadata": {
"row_count": 10, # Number of rows
"column_count": 3, # Number of columns
"has_headers": True, # Headers detected
"has_caption": True, # Caption exists
"has_summary": False, # Summary exists
"id": "data-table-1", # Table ID if present
"class": "financial-data" # Table class if present
}
}
```
## Configuration Options
### Basic Configuration
```python
config = CrawlerRunConfig(
# Table extraction settings
table_score_threshold=7, # Default threshold (backward compatible)
table_extraction=strategy, # Optional: custom strategy
# Filter what to process
css_selector="main", # Focus on specific area
excluded_tags=["nav", "aside"] # Exclude page sections
)
```
### Advanced Configuration
```python
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
# Fine-tuned extraction
strategy = DefaultTableExtraction(
table_score_threshold=5, # Lower = more permissive
min_rows=3, # Require at least 3 rows
min_cols=2, # Require at least 2 columns
verbose=True # Detailed logging
)
config = CrawlerRunConfig(
table_extraction=strategy,
css_selector="article.content", # Target specific content
exclude_domains=["ads.com"], # Exclude ad domains
cache_mode=CacheMode.BYPASS # Fresh extraction
)
```
## Working with Extracted Tables
### Convert to Pandas DataFrame
```python
import pandas as pd
async def tables_to_dataframes(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url)
dataframes = []
for table_data in result.tables:
# Create DataFrame
if table_data['headers']:
df = pd.DataFrame(
table_data['rows'],
columns=table_data['headers']
)
else:
df = pd.DataFrame(table_data['rows'])
# Add metadata as DataFrame attributes
df.attrs['caption'] = table_data.get('caption', '')
df.attrs['metadata'] = table_data.get('metadata', {})
dataframes.append(df)
return dataframes
```
### Filter Tables by Criteria
```python
async def extract_large_tables(url):
async with AsyncWebCrawler() as crawler:
# Configure minimum size requirements
strategy = DefaultTableExtraction(
min_rows=10,
min_cols=3,
table_score_threshold=6
)
config = CrawlerRunConfig(
table_extraction=strategy
)
result = await crawler.arun(url, config)
# Further filter results
large_tables = [
table for table in result.tables
if table['metadata']['row_count'] > 10
and table['metadata']['column_count'] > 3
]
return large_tables
```
### Export Tables to Different Formats
```python
import json
import csv
async def export_tables(url):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url)
for i, table in enumerate(result.tables):
# Export as JSON
with open(f'table_{i}.json', 'w') as f:
json.dump(table, f, indent=2)
# Export as CSV
with open(f'table_{i}.csv', 'w', newline='') as f:
writer = csv.writer(f)
if table['headers']:
writer.writerow(table['headers'])
writer.writerows(table['rows'])
# Export as Markdown
with open(f'table_{i}.md', 'w') as f:
# Write headers
if table['headers']:
f.write('| ' + ' | '.join(table['headers']) + ' |\n')
f.write('|' + '---|' * len(table['headers']) + '\n')
# Write rows
for row in table['rows']:
f.write('| ' + ' | '.join(str(cell) for cell in row) + ' |\n')
```
## Creating Custom Strategies
Extend `TableExtractionStrategy` to create custom extraction logic:
### Example: Financial Table Extractor
```python
from crawl4ai import TableExtractionStrategy
from typing import List, Dict, Any
import re
class FinancialTableExtractor(TableExtractionStrategy):
"""Extract tables containing financial data."""
def __init__(self, currency_symbols=None, require_numbers=True, **kwargs):
super().__init__(**kwargs)
self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
self.require_numbers = require_numbers
self.number_pattern = re.compile(r'\d+[,.]?\d*')
def extract_tables(self, element, **kwargs):
tables_data = []
for table in element.xpath(".//table"):
# Check if table contains financial indicators
table_text = ''.join(table.itertext())
# Must contain currency symbols
has_currency = any(sym in table_text for sym in self.currency_symbols)
if not has_currency:
continue
# Must contain numbers if required
if self.require_numbers:
numbers = self.number_pattern.findall(table_text)
if len(numbers) < 3: # Arbitrary minimum
continue
# Extract the table data
table_data = self._extract_financial_data(table)
if table_data:
tables_data.append(table_data)
return tables_data
def _extract_financial_data(self, table):
"""Extract and clean financial data from table."""
headers = []
rows = []
# Extract headers
for th in table.xpath(".//thead//th | .//tr[1]//th"):
headers.append(th.text_content().strip())
# Extract and clean rows
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
row = []
for td in tr.xpath(".//td"):
text = td.text_content().strip()
# Clean currency formatting
text = re.sub(r'[$€£¥,]', '', text)
row.append(text)
if row:
rows.append(row)
return {
"headers": headers,
"rows": rows,
"caption": self._get_caption(table),
"summary": table.get("summary", ""),
"metadata": {
"type": "financial",
"row_count": len(rows),
"column_count": len(headers) or len(rows[0]) if rows else 0
}
}
def _get_caption(self, table):
caption = table.xpath(".//caption/text()")
return caption[0].strip() if caption else ""
# Usage
strategy = FinancialTableExtractor(
currency_symbols=['$', 'EUR'],
require_numbers=True
)
config = CrawlerRunConfig(
table_extraction=strategy
)
```
### Example: Specific Table Extractor
```python
class SpecificTableExtractor(TableExtractionStrategy):
"""Extract only tables matching specific criteria."""
def __init__(self,
required_headers=None,
id_pattern=None,
class_pattern=None,
**kwargs):
super().__init__(**kwargs)
self.required_headers = required_headers or []
self.id_pattern = id_pattern
self.class_pattern = class_pattern
def extract_tables(self, element, **kwargs):
tables_data = []
for table in element.xpath(".//table"):
# Check ID pattern
if self.id_pattern:
table_id = table.get('id', '')
if not re.match(self.id_pattern, table_id):
continue
# Check class pattern
if self.class_pattern:
table_class = table.get('class', '')
if not re.match(self.class_pattern, table_class):
continue
# Extract headers to check requirements
headers = self._extract_headers(table)
# Check if required headers are present
if self.required_headers:
if not all(req in headers for req in self.required_headers):
continue
# Extract full table data
table_data = self._extract_table_data(table, headers)
tables_data.append(table_data)
return tables_data
```
## Combining with Other Strategies
Table extraction works seamlessly with other Crawl4AI strategies:
```python
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
DefaultTableExtraction,
LLMExtractionStrategy,
JsonCssExtractionStrategy
)
async def combined_extraction(url):
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
# Table extraction
table_extraction=DefaultTableExtraction(
table_score_threshold=6,
min_rows=2
),
# CSS-based extraction for specific elements
extraction_strategy=JsonCssExtractionStrategy({
"title": "h1",
"summary": "p.summary",
"date": "time"
}),
# Focus on main content
css_selector="main.content"
)
result = await crawler.arun(url, config)
# Access different extraction results
tables = result.tables # Table data
structured = json.loads(result.extracted_content) # CSS extraction
return {
"tables": tables,
"structured_data": structured,
"markdown": result.markdown
}
```
## Performance Considerations
### Optimization Tips
1. **Disable when not needed**: Use `NoTableExtraction` if tables aren't required
2. **Target specific areas**: Use `css_selector` to limit processing scope
3. **Set minimum thresholds**: Filter out small/irrelevant tables early
4. **Cache results**: Use appropriate cache modes for repeated extractions
```python
# Optimized configuration for large pages
config = CrawlerRunConfig(
# Only process main content area
css_selector="article.main-content",
# Exclude navigation and sidebars
excluded_tags=["nav", "aside", "footer"],
# Higher threshold for stricter filtering
table_extraction=DefaultTableExtraction(
table_score_threshold=8,
min_rows=5,
min_cols=3
),
# Enable caching for repeated access
cache_mode=CacheMode.ENABLED
)
```
## Migration Guide
### Important: Your Code Still Works!
**No changes required!** The transition to the strategy pattern is **fully backward compatible**.
### How It Works Internally
#### v0.7.2 and Earlier
```python
# Old way - directly passing table_score_threshold
config = CrawlerRunConfig(
table_score_threshold=7
)
# Internally: No strategy pattern, direct implementation
```
#### v0.7.3+ (Current)
```python
# Old way STILL WORKS - we handle it internally
config = CrawlerRunConfig(
table_score_threshold=7
)
# Internally: Automatically creates DefaultTableExtraction(table_score_threshold=7)
```
### Taking Advantage of New Features
While your old code works, you can now use the strategy pattern for more control:
```python
# Option 1: Keep using the old way (perfectly fine!)
config = CrawlerRunConfig(
table_score_threshold=7 # Still supported
)
# Option 2: Use the new strategy pattern (more flexibility)
from crawl4ai import DefaultTableExtraction
strategy = DefaultTableExtraction(
table_score_threshold=7,
min_rows=2, # New capability!
min_cols=2 # New capability!
)
config = CrawlerRunConfig(
table_extraction=strategy
)
# Option 3: Use advanced strategies when needed
from crawl4ai import LLMTableExtraction, LLMConfig
# Only for complex tables that DefaultTableExtraction can't handle
# Automatically handles large tables with smart chunking
llm_strategy = LLMTableExtraction(
llm_config=LLMConfig(
provider="groq/llama-3.3-70b-versatile",
api_token="your_key"
),
max_tries=3,
enable_chunking=True, # Automatically chunk large tables
chunk_token_threshold=3000, # Chunk when exceeding 3000 tokens
max_parallel_chunks=5 # Process up to 5 chunks in parallel
)
config = CrawlerRunConfig(
table_extraction=llm_strategy # Advanced extraction with automatic chunking
)
```
### Summary
- ✅ **No breaking changes** - Old code works as-is
- ✅ **Same defaults** - DefaultTableExtraction is automatically used
- ✅ **Gradual adoption** - Use new features when you need them
- ✅ **Full compatibility** - result.tables structure unchanged
## Best Practices
### 1. Choose the Right Strategy (Cost-Conscious Approach)
**Decision Flow**:
```
1. Do you need tables?
→ No: Use NoTableExtraction
→ Yes: Continue to #2
2. Try DefaultTableExtraction first (FREE)
→ Works? Done! ✅
→ Fails? Continue to #3
3. Is the table critical and complex?
→ No: Accept DefaultTableExtraction results
→ Yes: Continue to #4
4. Use LLMTableExtraction (COSTS MONEY)
→ Small table (<50 rows): Any LLM provider
→ Large table (50+ rows): Use Groq or Cerebras
→ Very large (500+ rows): Reconsider - maybe chunk the page
```
**Strategy Selection Guide**:
- **DefaultTableExtraction**: Use for 99% of cases - it's free and effective
- **LLMTableExtraction**: Only for complex tables with merged cells that break DefaultTableExtraction
- **NoTableExtraction**: When you only need text/markdown content
- **Custom Strategy**: For specialized requirements (financial, scientific, etc.)
### 2. Validate Extracted Data
```python
def validate_table(table):
"""Validate table data quality."""
# Check structure
if not table.get('rows'):
return False
# Check consistency
if table.get('headers'):
expected_cols = len(table['headers'])
for row in table['rows']:
if len(row) != expected_cols:
return False
# Check minimum content
total_cells = sum(len(row) for row in table['rows'])
non_empty = sum(1 for row in table['rows']
for cell in row if cell.strip())
if non_empty / total_cells < 0.5: # Less than 50% non-empty
return False
return True
# Filter valid tables
valid_tables = [t for t in result.tables if validate_table(t)]
```
### 3. Handle Edge Cases
```python
async def robust_table_extraction(url):
"""Extract tables with error handling."""
async with AsyncWebCrawler() as crawler:
try:
config = CrawlerRunConfig(
table_extraction=DefaultTableExtraction(
table_score_threshold=6,
verbose=True
)
)
result = await crawler.arun(url, config)
if not result.success:
print(f"Crawl failed: {result.error}")
return []
# Process tables safely
processed_tables = []
for table in result.tables:
try:
# Validate and process
if validate_table(table):
processed_tables.append(table)
except Exception as e:
print(f"Error processing table: {e}")
continue
return processed_tables
except Exception as e:
print(f"Extraction error: {e}")
return []
```
## Troubleshooting
### Common Issues and Solutions
| Issue | Cause | Solution |
|-------|-------|----------|
| No tables extracted | Score too high | Lower `table_score_threshold` |
| Layout tables included | Score too low | Increase `table_score_threshold` |
| Missing tables | CSS selector too specific | Broaden or remove `css_selector` |
| Incomplete data | Complex table structure | Create custom strategy |
| Performance issues | Processing entire page | Use `css_selector` to limit scope |
### Debug Logging
Enable verbose logging to understand extraction decisions:
```python
import logging
# Configure logging
logging.basicConfig(level=logging.DEBUG)
# Enable verbose mode in strategy
strategy = DefaultTableExtraction(
table_score_threshold=7,
verbose=True # Detailed extraction logs
)
config = CrawlerRunConfig(
table_extraction=strategy,
verbose=True # General crawler logs
)
```
## See Also
- [Extraction Strategies](extraction-strategies.md) - Overview of all extraction strategies
- [Content Selection](content-selection.md) - Using CSS selectors and filters
- [Performance Optimization](../optimization/performance-tuning.md) - Speed up extraction
- [Examples](../examples/table_extraction_example.py) - Complete working examples |