crawl4ai/docs/examples/table-extraction-api.md
AHMET YILMAZ 00e9904609 feat: Add table extraction strategies and API documentation
- Implemented table extraction strategies: default, LLM, financial, and none in utils.py.
- Created new API documentation for table extraction endpoints and strategies.
- Added integration tests for table extraction functionality covering various strategies and error handling.
- Developed quick test script for rapid validation of table extraction features.
2025-10-17 12:30:37 +08:00


# Table Extraction API Documentation
## Overview
The Crawl4AI Docker Server extracts structured data from HTML tables through both **integrated** and **dedicated** endpoints. Four strategies are available: `default` (fast, regex-based), `llm` (LLM-powered semantic understanding), `financial` (specialized for financial data), and `none` (extraction disabled).

---
## Table of Contents
1. [Quick Start](#quick-start)
2. [Extraction Strategies](#extraction-strategies)
3. [Integrated Extraction (with /crawl)](#integrated-extraction)
4. [Dedicated Endpoints (/tables)](#dedicated-endpoints)
5. [Batch Processing](#batch-processing)
6. [Configuration Options](#configuration-options)
7. [Response Format](#response-format)
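8. [Error Handling](#error-handling)
9. [Best Practices](#best-practices)
10. [Examples by Use Case](#examples-by-use-case)
11. [API Reference Summary](#api-reference-summary)
12. [Support](#support)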
---
## Quick Start
### Extract Tables During Crawl
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/financial-data"],
    "table_extraction": {
      "strategy": "default"
    }
  }'
```
### Extract Tables from HTML
```bash
curl -X POST http://localhost:11235/tables/extract \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>100</td></tr></table>",
    "config": {
      "strategy": "default"
    }
  }'
```
---
## Extraction Strategies
### 1. **Default Strategy** (Fast, Regex-Based)
Best for general-purpose table extraction with high performance.
```json
{
  "strategy": "default"
}
```
**Use Cases:**
- General web scraping
- Simple data tables
- High-volume extraction
### 2. **LLM Strategy** (AI-Powered)
Uses Large Language Models for semantic understanding and complex table structures.
```json
{
  "strategy": "llm",
  "llm_provider": "openai",
  "llm_model": "gpt-4",
  "llm_api_key": "your-api-key",
  "llm_prompt": "Extract and structure the financial data"
}
```
**Use Cases:**
- Complex nested tables
- Tables with irregular structure
- Semantic data extraction
**Supported Providers:**
- `openai` (GPT-3.5, GPT-4)
- `anthropic` (Claude)
- `huggingface` (Open models)
### 3. **Financial Strategy** (Specialized)
Optimized for financial tables with proper numerical formatting.
```json
{
  "strategy": "financial",
  "preserve_formatting": true,
  "extract_metadata": true
}
```
**Use Cases:**
- Stock data
- Financial statements
- Accounting tables
- Price lists
### 4. **None Strategy** (No Extraction)
Disables table extraction.
```json
{
  "strategy": "none"
}
```
---
## Integrated Extraction
Add table extraction to any crawl request by including the `table_extraction` configuration.
### Example: Basic Integration
```python
import requests

response = requests.post("http://localhost:11235/crawl", json={
    "urls": ["https://finance.yahoo.com/quote/AAPL"],
    "browser_config": {
        "headless": True
    },
    "crawler_config": {
        "wait_until": "networkidle"
    },
    "table_extraction": {
        "strategy": "financial",
        "preserve_formatting": True
    }
})
data = response.json()

# Report the tables found on each successfully crawled page
for result in data["results"]:
    if result["success"]:
        print(f"Found {len(result.get('tables', []))} tables")
        for table in result.get("tables", []):
            print(f"Table: {table['headers']}")
```
### Example: Multiple URLs with Table Extraction
```javascript
// Node.js example
const axios = require('axios');

// CommonJS modules do not support top-level await, so wrap the call
async function main() {
  const response = await axios.post('http://localhost:11235/crawl', {
    urls: [
      'https://example.com/page1',
      'https://example.com/page2',
      'https://example.com/page3'
    ],
    table_extraction: {
      strategy: 'default'
    }
  });

  response.data.results.forEach((result, index) => {
    console.log(`Page ${index + 1}:`);
    console.log(`  Tables found: ${result.tables?.length || 0}`);
  });
}

main().catch(console.error);
```
### Example: LLM-Based Extraction with Custom Prompt
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/complex-data"],
    "table_extraction": {
      "strategy": "llm",
      "llm_provider": "openai",
      "llm_model": "gpt-4",
      "llm_api_key": "sk-...",
      "llm_prompt": "Extract product pricing information, including discounts and availability"
    }
  }'
```
---
## Dedicated Endpoints
### `/tables/extract` - Single Extraction
Extract tables from HTML content or by fetching a URL.
#### Extract from HTML
```python
import requests

html_content = """
<table>
  <thead>
    <tr><th>Product</th><th>Price</th><th>Stock</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget A</td><td>$19.99</td><td>In Stock</td></tr>
    <tr><td>Widget B</td><td>$29.99</td><td>Out of Stock</td></tr>
  </tbody>
</table>
"""

response = requests.post("http://localhost:11235/tables/extract", json={
    "html": html_content,
    "config": {
        "strategy": "default"
    }
})
data = response.json()

print(f"Success: {data['success']}")
print(f"Tables found: {data['table_count']}")
print(f"Strategy used: {data['strategy']}")

for table in data['tables']:
    print("\nTable:")
    print(f"  Headers: {table['headers']}")
    print(f"  Rows: {len(table['rows'])}")
```
#### Extract from URL
```python
import requests

# The server fetches the page itself when "url" is given instead of "html"
response = requests.post("http://localhost:11235/tables/extract", json={
    "url": "https://example.com/data-page",
    "config": {
        "strategy": "financial",
        "preserve_formatting": True
    }
})
data = response.json()

for table in data['tables']:
    print(f"Table with {len(table['rows'])} rows")
```
---
## Batch Processing
### `/tables/extract/batch` - Batch Extraction
Extract tables from multiple HTML contents or URLs in a single request.
#### Batch from HTML List
```python
import requests

html_contents = [
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>",
    "<table><tr><th>B</th></tr><tr><td>2</td></tr></table>",
    "<table><tr><th>C</th></tr><tr><td>3</td></tr></table>",
]

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "html_list": html_contents,
    "config": {
        "strategy": "default"
    }
})
data = response.json()

print(f"Total processed: {data['summary']['total_processed']}")
print(f"Successful: {data['summary']['successful']}")
print(f"Failed: {data['summary']['failed']}")
print(f"Total tables: {data['summary']['total_tables_extracted']}")

# Each item succeeds or fails independently
for result in data['results']:
    if result['success']:
        print(f"  {result['source']}: {result['table_count']} tables")
    else:
        print(f"  {result['source']}: Error - {result['error']}")
```
#### Batch from URL List
```python
import requests

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    "config": {
        "strategy": "financial"
    }
})
data = response.json()

for result in data['results']:
    print(f"URL: {result['source']}")
    if result['success']:
        print(f"  ✓ Found {result['table_count']} tables")
    else:
        print(f"  ✗ Failed: {result['error']}")
```
#### Mixed Batch (HTML + URLs)
```python
import requests

# HTML strings and URLs can be combined in one batch request
response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "html_list": [
        "<table><tr><th>Local</th></tr></table>"
    ],
    "url_list": [
        "https://example.com/remote"
    ],
    "config": {
        "strategy": "default"
    }
})
```
**Batch Limits:**
- Maximum 50 items per batch request
- Items are processed independently (partial failures allowed)
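
Because a request may carry at most 50 items, a longer list of sources has to be split on the client before it is sent. The following is a minimal sketch of that splitting step only; `chunked` and `MAX_BATCH_ITEMS` are hypothetical names, not part of the API, and the sketch assumes each resulting chunk is then posted to `/tables/extract/batch` as its own `url_list` or `html_list`.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_ITEMS = 50  # per-request limit stated above


def chunked(items: Sequence[str], size: int = MAX_BATCH_ITEMS) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 120 placeholder URLs split into batches that respect the limit
urls = [f"https://example.com/page{i}" for i in range(120)]
batches = list(chunked(urls))
print([len(batch) for batch in batches])  # [50, 50, 20]
```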
---
## Configuration Options
### TableExtractionConfig
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `strategy` | `"none"` \| `"default"` \| `"llm"` \| `"financial"` | `"default"` | Extraction strategy to use |
| `llm_provider` | `string` | `null` | LLM provider (required for `llm` strategy) |
| `llm_model` | `string` | `null` | Model name (required for `llm` strategy) |
| `llm_api_key` | `string` | `null` | API key (required for `llm` strategy) |
| `llm_prompt` | `string` | `null` | Custom extraction prompt |
| `preserve_formatting` | `boolean` | `false` | Keep original number/date formatting |
| `extract_metadata` | `boolean` | `false` | Include table metadata (id, class, etc.) |
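
The constraints in this table can be checked on the client before a request is sent, which avoids a round trip that would end in a 400 response. This is a rough sketch under the rules listed above (four strategy names, three fields required for `llm`); `check_table_config` is a hypothetical helper and the server's own validation may be stricter.

```python
from typing import Any, Dict, List

VALID_STRATEGIES = {"none", "default", "llm", "financial"}
LLM_REQUIRED_FIELDS = ("llm_provider", "llm_model", "llm_api_key")


def check_table_config(config: Dict[str, Any]) -> List[str]:
    """Return the problems found in a table extraction config (empty if none)."""
    problems = []
    # The table lists "default" as the value used when `strategy` is omitted
    strategy = config.get("strategy", "default")
    if strategy not in VALID_STRATEGIES:
        problems.append(f"unknown strategy: {strategy!r}")
    if strategy == "llm":
        missing = [name for name in LLM_REQUIRED_FIELDS if not config.get(name)]
        if missing:
            problems.append("llm strategy is missing: " + ", ".join(missing))
    return problems


print(check_table_config({"strategy": "llm", "llm_provider": "openai"}))
```

An empty list means the config passes these client-side checks; it does not guarantee that the provider or model name is accepted by the server.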
### Example: Full Configuration
```json
{
  "strategy": "llm",
  "llm_provider": "openai",
  "llm_model": "gpt-4",
  "llm_api_key": "sk-...",
  "llm_prompt": "Extract structured product data",
  "preserve_formatting": true,
  "extract_metadata": true
}
```
---
## Response Format
### Single Extraction Response
```json
{
  "success": true,
  "table_count": 2,
  "strategy": "default",
  "tables": [
    {
      "headers": ["Product", "Price", "Stock"],
      "rows": [
        ["Widget A", "$19.99", "In Stock"],
        ["Widget B", "$29.99", "Out of Stock"]
      ],
      "metadata": {
        "id": "product-table",
        "class": "data-table",
        "row_count": 2,
        "column_count": 3
      }
    }
  ]
}
```
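Each table arrives as a `headers` list plus a list of `rows`, so pairing the two gives one dictionary per row. A minimal sketch, assuming every row has as many cells as there are headers (`zip` silently drops the excess otherwise); `table_to_records` is a hypothetical helper name, not something the server provides.

```python
from typing import Dict, List


def table_to_records(table: dict) -> List[Dict[str, str]]:
    """Pair each row's cells with the table's headers."""
    headers = table["headers"]
    return [dict(zip(headers, row)) for row in table["rows"]]


# Same table as in the sample response
table = {
    "headers": ["Product", "Price", "Stock"],
    "rows": [
        ["Widget A", "$19.99", "In Stock"],
        ["Widget B", "$29.99", "Out of Stock"],
    ],
}
records = table_to_records(table)
print(records[0]["Price"])  # $19.99
```

Records in this shape can be handed directly to libraries that accept a list of dictionaries.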
### Batch Extraction Response
```json
{
  "success": true,
  "summary": {
    "total_processed": 3,
    "successful": 2,
    "failed": 1,
    "total_tables_extracted": 5
  },
  "strategy": "default",
  "results": [
    {
      "success": true,
      "source": "html_0",
      "table_count": 2,
      "tables": [...]
    },
    {
      "success": true,
      "source": "https://example.com",
      "table_count": 3,
      "tables": [...]
    },
    {
      "success": false,
      "source": "html_2",
      "error": "Invalid HTML structure"
    }
  ]
}
```
### Integrated Crawl Response
Tables are included in the standard crawl result:
```json
{
  "success": true,
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "html": "...",
      "markdown": "...",
      "tables": [
        {
          "headers": [...],
          "rows": [...]
        }
      ]
    }
  ]
}
```
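Since the tables sit inside each crawl result, exporting them means walking `results` and serializing every table found there. The sketch below writes CSV text with the standard library and assumes the response shape shown above; `table_to_csv` and `crawl_tables_to_csv` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, List


def table_to_csv(table: dict) -> str:
    """Serialize one table (headers, then rows) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


def crawl_tables_to_csv(crawl_response: dict) -> Dict[str, List[str]]:
    """Map each successful result's URL to a list of CSV strings, one per table."""
    return {
        result["url"]: [table_to_csv(t) for t in result.get("tables", [])]
        for result in crawl_response.get("results", [])
        if result.get("success")
    }
```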
---
## Error Handling
### Common Errors
#### 400 Bad Request
```json
{
  "detail": "Must provide either 'html' or 'url' for table extraction."
}
```
**Cause:** The request does not identify a single source to extract from (for example, neither `html` nor `url` is set).

**Solution:** Provide exactly one of `html` or `url`.
#### 400 Bad Request (LLM)
```json
{
  "detail": "Invalid table extraction config: LLM strategy requires llm_provider, llm_model, and llm_api_key"
}
```
**Cause:** The `llm` strategy was requested without one or more of `llm_provider`, `llm_model`, and `llm_api_key`.

**Solution:** Provide all three fields.
#### 500 Internal Server Error
```json
{
  "detail": "Failed to fetch and extract from URL: Connection timeout"
}
```
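**Cause:** The URL could not be fetched (for example, a connection timeout), or extraction failed on the returned HTML.

**Solution:** Check that the URL is reachable from the server and that the page returns valid HTML.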
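
All three error bodies above carry their message in a `detail` field, so a client can surface it in one place. A minimal sketch, assuming non-2xx responses always have that JSON shape (the source only shows these three cases); `TableExtractionError` and `check_response` are hypothetical names.

```python
class TableExtractionError(Exception):
    """Raised when the server answers with a non-2xx status."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def check_response(status_code: int, body: object) -> object:
    """Return `body` for 2xx statuses; otherwise raise with the `detail` message."""
    if 200 <= status_code < 300:
        return body
    if isinstance(body, dict):
        detail = body.get("detail", "no detail provided")
    else:
        detail = body  # unexpected shape: keep whatever the server sent
    raise TableExtractionError(status_code, str(detail))
```

With `requests`, this could be called as `check_response(response.status_code, response.json())` right after the POST.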
### Handling Partial Failures in Batch
```python
import requests

# Placeholder list; replace with the pages you want to process
urls = ["https://example.com/page1", "https://example.com/page2"]

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": urls,
    "config": {"strategy": "default"}
})
data = response.json()

# Split the per-item results by outcome
successful_results = [r for r in data['results'] if r['success']]
failed_results = [r for r in data['results'] if not r['success']]

print(f"Successful: {len(successful_results)}")
for result in failed_results:
    print(f"Failed: {result['source']} - {result['error']}")
```
---
## Best Practices
### 1. **Choose the Right Strategy**
- **Default**: Fast, reliable for most tables
- **LLM**: Complex structures, semantic extraction
- **Financial**: Numerical data with formatting
### 2. **Batch Processing**
- Use batch endpoints for multiple pages
- Keep each batch at or below the 50-item limit
- Handle partial failures gracefully
### 3. **Performance Optimization**
- Use `default` strategy for high-volume extraction
- Enable `preserve_formatting` only when needed
- Enable `extract_metadata` only when the metadata is needed, to reduce payload size
### 4. **LLM Strategy Tips**
- Use specific prompts for better results
- GPT-4 for complex tables, GPT-3.5 for simple ones
- Cache results to reduce API costs
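
The caching tip can be sketched as a small in-memory wrapper that skips the LLM call when the same HTML and config have been seen before. Everything here is an assumption made for illustration: `cache_key`, `extract_cached`, and the `extract` callback are hypothetical, and the API key is deliberately left out of the key so that rotating it does not invalidate cached results.

```python
import hashlib
import json
from typing import Callable, Dict

_cache: Dict[str, dict] = {}


def cache_key(html: str, config: dict) -> str:
    """Hash the HTML and the config (minus the API key) into a stable key."""
    public_config = {k: v for k, v in config.items() if k != "llm_api_key"}
    payload = json.dumps({"html": html, "config": public_config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def extract_cached(html: str, config: dict,
                   extract: Callable[[str, dict], dict]) -> dict:
    """Call `extract` only when this (html, config) pair has not been seen before."""
    key = cache_key(html, config)
    if key not in _cache:
        _cache[key] = extract(html, config)
    return _cache[key]
```

Here `extract` would be whatever function posts to `/tables/extract`; a persistent store could replace the dictionary if results need to survive restarts.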
### 5. **Error Handling**
- Always check `success` field
- Log errors for debugging
- Implement retry logic for transient failures
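
Retry logic for transient failures can be as simple as exponential backoff around the request. The sketch below is one possible shape, not a prescribed one: `with_retries` is a hypothetical helper, and the attempt count and delays are illustrative values. The exception classes in `requests` are not subclasses of the built-in ones used as defaults here, so with `requests` pass `retry_on=(requests.ConnectionError, requests.Timeout)` and wrap the POST in a zero-argument function or lambda.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponential backoff on the listed exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```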
---
## Examples by Use Case
### Financial Data Extraction
```python
import requests

response = requests.post("http://localhost:11235/crawl", json={
    "urls": ["https://finance.site.com/stocks"],
    "table_extraction": {
        "strategy": "financial",
        "preserve_formatting": True,
        "extract_metadata": True
    }
})

for result in response.json()["results"]:
    for table in result.get("tables", []):
        # Financial tables with preserved formatting
        print(table["rows"])
```
### Product Catalog Scraping
```python
import requests

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": [
        "https://shop.com/category/electronics",
        "https://shop.com/category/clothing",
        "https://shop.com/category/books",
    ],
    "config": {"strategy": "default"}
})

# Collect the rows of every table from every page that succeeded
all_products = []
for result in response.json()["results"]:
    if result["success"]:
        for table in result["tables"]:
            all_products.extend(table["rows"])

print(f"Total products: {len(all_products)}")
```
### Complex Table with LLM
```python
import requests

response = requests.post("http://localhost:11235/tables/extract", json={
    "url": "https://complex-data.com/report",
    "config": {
        "strategy": "llm",
        "llm_provider": "openai",
        "llm_model": "gpt-4",
        "llm_api_key": "sk-...",
        "llm_prompt": "Extract quarterly revenue breakdown by region and product category"
    }
})
structured_data = response.json()["tables"]
```
---
## API Reference Summary
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/crawl` | POST | Crawl with integrated table extraction |
| `/crawl/stream` | POST | Stream crawl with table extraction |
| `/tables/extract` | POST | Extract tables from HTML or URL |
| `/tables/extract/batch` | POST | Batch extract from multiple sources |

For complete API documentation, visit: `/docs` (Swagger UI)

---
## Support
For issues, feature requests, or questions:
- GitHub: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com/docs
- Discord: https://discord.gg/crawl4ai