# Table Extraction API Documentation

## Overview

The Crawl4AI Docker Server provides powerful table extraction capabilities through both **integrated** and **dedicated** endpoints. Extract structured data from HTML tables using multiple strategies: default (fast, regex-based), LLM-powered (semantic understanding), or financial (specialized for financial data).

---

## Table of Contents

1. [Quick Start](#quick-start)
2. [Extraction Strategies](#extraction-strategies)
3. [Integrated Extraction (with /crawl)](#integrated-extraction)
4. [Dedicated Endpoints (/tables)](#dedicated-endpoints)
5. [Batch Processing](#batch-processing)
6. [Configuration Options](#configuration-options)
7. [Response Format](#response-format)
8. [Error Handling](#error-handling)
9. [Best Practices](#best-practices)
10. [Examples by Use Case](#examples-by-use-case)
11. [API Reference Summary](#api-reference-summary)

---
## Quick Start

### Extract Tables During Crawl

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/financial-data"],
    "table_extraction": {
      "strategy": "default"
    }
  }'
```

### Extract Tables from HTML

```bash
curl -X POST http://localhost:11235/tables/extract \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>100</td></tr></table>",
    "config": {
      "strategy": "default"
    }
  }'
```

---
## Extraction Strategies

### 1. **Default Strategy** (Fast, Regex-Based)

Best for general-purpose table extraction with high performance.

```json
{
  "strategy": "default"
}
```

**Use Cases:**

- General web scraping
- Simple data tables
- High-volume extraction

### 2. **LLM Strategy** (AI-Powered)

Uses Large Language Models for semantic understanding of complex table structures.

```json
{
  "strategy": "llm",
  "llm_provider": "openai",
  "llm_model": "gpt-4",
  "llm_api_key": "your-api-key",
  "llm_prompt": "Extract and structure the financial data"
}
```

**Use Cases:**

- Complex nested tables
- Tables with irregular structure
- Semantic data extraction

**Supported Providers:**

- `openai` (GPT-3.5, GPT-4)
- `anthropic` (Claude)
- `huggingface` (open models)

### 3. **Financial Strategy** (Specialized)

Optimized for financial tables with proper numerical formatting.

```json
{
  "strategy": "financial",
  "preserve_formatting": true,
  "extract_metadata": true
}
```

**Use Cases:**

- Stock data
- Financial statements
- Accounting tables
- Price lists

### 4. **None Strategy** (No Extraction)

Disables table extraction.

```json
{
  "strategy": "none"
}
```

---
## Integrated Extraction

Add table extraction to any crawl request by including the `table_extraction` configuration.

### Example: Basic Integration

```python
import requests

response = requests.post("http://localhost:11235/crawl", json={
    "urls": ["https://finance.yahoo.com/quote/AAPL"],
    "browser_config": {
        "headless": True
    },
    "crawler_config": {
        "wait_until": "networkidle"
    },
    "table_extraction": {
        "strategy": "financial",
        "preserve_formatting": True
    }
})

data = response.json()
for result in data["results"]:
    if result["success"]:
        print(f"Found {len(result.get('tables', []))} tables")
        for table in result.get("tables", []):
            print(f"Table: {table['headers']}")
```
### Example: Multiple URLs with Table Extraction

```javascript
// Node.js example
// Top-level await isn't available with require(), so wrap the calls in an async IIFE.
const axios = require('axios');

(async () => {
  const response = await axios.post('http://localhost:11235/crawl', {
    urls: [
      'https://example.com/page1',
      'https://example.com/page2',
      'https://example.com/page3'
    ],
    table_extraction: {
      strategy: 'default'
    }
  });

  response.data.results.forEach((result, index) => {
    console.log(`Page ${index + 1}:`);
    console.log(`  Tables found: ${result.tables?.length || 0}`);
  });
})();
```
### Example: LLM-Based Extraction with Custom Prompt

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/complex-data"],
    "table_extraction": {
      "strategy": "llm",
      "llm_provider": "openai",
      "llm_model": "gpt-4",
      "llm_api_key": "sk-...",
      "llm_prompt": "Extract product pricing information, including discounts and availability"
    }
  }'
```

---
## Dedicated Endpoints

### `/tables/extract` - Single Extraction

Extract tables from HTML content or by fetching a URL.

#### Extract from HTML

```python
import requests

html_content = """
<table>
  <thead>
    <tr><th>Product</th><th>Price</th><th>Stock</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget A</td><td>$19.99</td><td>In Stock</td></tr>
    <tr><td>Widget B</td><td>$29.99</td><td>Out of Stock</td></tr>
  </tbody>
</table>
"""

response = requests.post("http://localhost:11235/tables/extract", json={
    "html": html_content,
    "config": {
        "strategy": "default"
    }
})

data = response.json()
print(f"Success: {data['success']}")
print(f"Tables found: {data['table_count']}")
print(f"Strategy used: {data['strategy']}")

for table in data['tables']:
    print("\nTable:")
    print(f"  Headers: {table['headers']}")
    print(f"  Rows: {len(table['rows'])}")
```
#### Extract from URL

```python
response = requests.post("http://localhost:11235/tables/extract", json={
    "url": "https://example.com/data-page",
    "config": {
        "strategy": "financial",
        "preserve_formatting": True
    }
})

data = response.json()
for table in data['tables']:
    print(f"Table with {len(table['rows'])} rows")
```

---
## Batch Processing

### `/tables/extract/batch` - Batch Extraction

Extract tables from multiple HTML contents or URLs in a single request.

#### Batch from HTML List

```python
import requests

html_contents = [
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>",
    "<table><tr><th>B</th></tr><tr><td>2</td></tr></table>",
    "<table><tr><th>C</th></tr><tr><td>3</td></tr></table>",
]

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "html_list": html_contents,
    "config": {
        "strategy": "default"
    }
})

data = response.json()
print(f"Total processed: {data['summary']['total_processed']}")
print(f"Successful: {data['summary']['successful']}")
print(f"Failed: {data['summary']['failed']}")
print(f"Total tables: {data['summary']['total_tables_extracted']}")

for result in data['results']:
    if result['success']:
        print(f"  {result['source']}: {result['table_count']} tables")
    else:
        print(f"  {result['source']}: Error - {result['error']}")
```
#### Batch from URL List

```python
response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    "config": {
        "strategy": "financial"
    }
})

data = response.json()
for result in data['results']:
    print(f"URL: {result['source']}")
    if result['success']:
        print(f"  ✓ Found {result['table_count']} tables")
    else:
        print(f"  ✗ Failed: {result['error']}")
```
#### Mixed Batch (HTML + URLs)

```python
response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "html_list": [
        "<table><tr><th>Local</th></tr></table>"
    ],
    "url_list": [
        "https://example.com/remote"
    ],
    "config": {
        "strategy": "default"
    }
})
```
**Batch Limits:**

- Maximum 50 items per batch request
- Items are processed independently (partial failures allowed)

If a job has more than 50 items, split it into chunks client-side.
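The sketch below is a hypothetical helper (not part of the API) that chunks a long URL list and merges the per-chunk results, using the batch endpoint and response shape documented above:

```python
import requests

def extract_tables_in_chunks(url_list, config, chunk_size=50,
                             endpoint="http://localhost:11235/tables/extract/batch"):
    """Split url_list into chunks of at most `chunk_size` and collect all results."""
    all_results = []
    for start in range(0, len(url_list), chunk_size):
        chunk = url_list[start:start + chunk_size]
        response = requests.post(endpoint, json={"url_list": chunk, "config": config})
        response.raise_for_status()
        all_results.extend(response.json()["results"])
    return all_results

# Usage: results = extract_tables_in_chunks(urls, {"strategy": "default"})
```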
---
## Configuration Options

### TableExtractionConfig

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `strategy` | `"none"` \| `"default"` \| `"llm"` \| `"financial"` | `"default"` | Extraction strategy to use |
| `llm_provider` | `string` | `null` | LLM provider (required for `llm` strategy) |
| `llm_model` | `string` | `null` | Model name (required for `llm` strategy) |
| `llm_api_key` | `string` | `null` | API key (required for `llm` strategy) |
| `llm_prompt` | `string` | `null` | Custom extraction prompt |
| `preserve_formatting` | `boolean` | `false` | Keep original number/date formatting |
| `extract_metadata` | `boolean` | `false` | Include table metadata (id, class, etc.) |

### Example: Full Configuration

```json
{
  "strategy": "llm",
  "llm_provider": "openai",
  "llm_model": "gpt-4",
  "llm_api_key": "sk-...",
  "llm_prompt": "Extract structured product data",
  "preserve_formatting": true,
  "extract_metadata": true
}
```
---

## Response Format

### Single Extraction Response

```json
{
  "success": true,
  "table_count": 2,
  "strategy": "default",
  "tables": [
    {
      "headers": ["Product", "Price", "Stock"],
      "rows": [
        ["Widget A", "$19.99", "In Stock"],
        ["Widget B", "$29.99", "Out of Stock"]
      ],
      "metadata": {
        "id": "product-table",
        "class": "data-table",
        "row_count": 2,
        "column_count": 3
      }
    }
  ]
}
```
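The `headers`/`rows` layout maps directly onto a tabular structure, so converting a response into `pandas` DataFrames is straightforward. A minimal sketch (assumes `pandas` is installed; not part of the API):

```python
import pandas as pd
import requests

response = requests.post("http://localhost:11235/tables/extract", json={
    "html": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>100</td></tr></table>",
    "config": {"strategy": "default"},
})
data = response.json()

# Build one DataFrame per extracted table from the documented headers/rows shape.
frames = [pd.DataFrame(t["rows"], columns=t["headers"]) for t in data["tables"]]
for df in frames:
    print(df.head())
```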
### Batch Extraction Response

```json
{
  "success": true,
  "summary": {
    "total_processed": 3,
    "successful": 2,
    "failed": 1,
    "total_tables_extracted": 5
  },
  "strategy": "default",
  "results": [
    {
      "success": true,
      "source": "html_0",
      "table_count": 2,
      "tables": [...]
    },
    {
      "success": true,
      "source": "https://example.com",
      "table_count": 3,
      "tables": [...]
    },
    {
      "success": false,
      "source": "html_2",
      "error": "Invalid HTML structure"
    }
  ]
}
```
### Integrated Crawl Response

Tables are included in the standard crawl result:

```json
{
  "success": true,
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "html": "...",
      "markdown": "...",
      "tables": [
        {
          "headers": [...],
          "rows": [...]
        }
      ]
    }
  ]
}
```
---

## Error Handling

### Common Errors

#### 400 Bad Request

```json
{
  "detail": "Must provide either 'html' or 'url' for table extraction."
}
```

**Cause:** Invalid request parameters

**Solution:** Ensure you provide exactly one of `html` or `url`

#### 400 Bad Request (LLM)

```json
{
  "detail": "Invalid table extraction config: LLM strategy requires llm_provider, llm_model, and llm_api_key"
}
```

**Cause:** Missing required LLM configuration

**Solution:** Provide all required LLM fields

#### 500 Internal Server Error

```json
{
  "detail": "Failed to fetch and extract from URL: Connection timeout"
}
```

**Cause:** URL fetch failure or extraction error

**Solution:** Check URL accessibility and HTML validity
### Handling Partial Failures in Batch

```python
urls = ["https://example.com/page1", "https://example.com/page2"]  # your URL list

response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": urls,
    "config": {"strategy": "default"}
})

data = response.json()

successful_results = [r for r in data['results'] if r['success']]
failed_results = [r for r in data['results'] if not r['success']]

print(f"Successful: {len(successful_results)}")
for result in failed_results:
    print(f"Failed: {result['source']} - {result['error']}")
```
---

## Best Practices

### 1. **Choose the Right Strategy**

- **Default**: Fast, reliable for most tables
- **LLM**: Complex structures, semantic extraction
- **Financial**: Numerical data with formatting

### 2. **Batch Processing**

- Use batch endpoints for multiple pages
- Keep batch size under 50 items
- Handle partial failures gracefully

### 3. **Performance Optimization**

- Use the `default` strategy for high-volume extraction
- Enable `preserve_formatting` only when needed
- Limit `extract_metadata` to reduce payload size

### 4. **LLM Strategy Tips**

- Use specific prompts for better results
- GPT-4 for complex tables, GPT-3.5 for simple ones
- Cache results to reduce API costs

A simple way to avoid repeated LLM calls is to cache extraction results keyed by the input HTML.
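The sketch below is a hypothetical wrapper (not part of the API) that keeps an in-memory cache keyed by an MD5 hash of the HTML and configuration:

```python
import hashlib
import json
import requests

_cache = {}

def extract_tables_cached(html, config, endpoint="http://localhost:11235/tables/extract"):
    """Return cached tables when the same HTML/config pair has already been extracted."""
    key = hashlib.md5((html + json.dumps(config, sort_keys=True)).encode("utf-8")).hexdigest()
    if key not in _cache:
        response = requests.post(endpoint, json={"html": html, "config": config})
        response.raise_for_status()
        _cache[key] = response.json()["tables"]
    return _cache[key]
```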
### 5. **Error Handling**

- Always check the `success` field
- Log errors for debugging
- Implement retry logic for transient failures

For transient failures (timeouts, temporary 5xx responses), a small retry wrapper is usually enough.
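A minimal sketch (hypothetical helper, not part of the API) using exponential backoff with `requests`:

```python
import time
import requests

def post_with_retry(url, payload, max_attempts=3, backoff_seconds=2.0):
    """Retry on connection errors and 5xx responses, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, timeout=60)
            if response.status_code < 500:
                return response  # success or a client error worth surfacing, not retrying
        except requests.RequestException:
            pass  # network-level failure; fall through to retry
        if attempt < max_attempts:
            time.sleep(backoff_seconds * (2 ** (attempt - 1)))
    raise RuntimeError(f"Request to {url} failed after {max_attempts} attempts")

# Usage:
# response = post_with_retry("http://localhost:11235/tables/extract/batch",
#                            {"url_list": ["https://example.com"], "config": {"strategy": "default"}})
```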
---

## Examples by Use Case

### Financial Data Extraction

```python
response = requests.post("http://localhost:11235/crawl", json={
    "urls": ["https://finance.site.com/stocks"],
    "table_extraction": {
        "strategy": "financial",
        "preserve_formatting": True,
        "extract_metadata": True
    }
})

for result in response.json()["results"]:
    for table in result.get("tables", []):
        # Financial tables with preserved formatting
        print(table["rows"])
```
### Product Catalog Scraping

```python
response = requests.post("http://localhost:11235/tables/extract/batch", json={
    "url_list": [
        "https://shop.com/category/electronics",
        "https://shop.com/category/clothing",
        "https://shop.com/category/books",
    ],
    "config": {"strategy": "default"}
})

all_products = []
for result in response.json()["results"]:
    if result["success"]:
        for table in result["tables"]:
            all_products.extend(table["rows"])

print(f"Total products: {len(all_products)}")
```
### Complex Table with LLM

```python
response = requests.post("http://localhost:11235/tables/extract", json={
    "url": "https://complex-data.com/report",
    "config": {
        "strategy": "llm",
        "llm_provider": "openai",
        "llm_model": "gpt-4",
        "llm_api_key": "sk-...",
        "llm_prompt": "Extract quarterly revenue breakdown by region and product category"
    }
})

structured_data = response.json()["tables"]
```
---

## API Reference Summary

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/crawl` | POST | Crawl with integrated table extraction |
| `/crawl/stream` | POST | Stream crawl with table extraction |
| `/tables/extract` | POST | Extract tables from HTML or URL |
| `/tables/extract/batch` | POST | Batch extract from multiple sources |

The `/crawl/stream` endpoint returns each result as soon as its URL finishes crawling, rather than waiting for the whole batch.
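A minimal sketch of consuming it, assuming the stream is newline-delimited JSON (one result object per line); check the Swagger UI at `/docs` for the exact response framing:

```python
import json
import requests

with requests.post("http://localhost:11235/crawl/stream", json={
    "urls": ["https://example.com/page1", "https://example.com/page2"],
    "table_extraction": {"strategy": "default"},
}, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        result = json.loads(line)  # one crawl result per line (assumed NDJSON framing)
        print(result.get("url"), len(result.get("tables", [])), "tables")
```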
For complete API documentation, visit `/docs` (Swagger UI).

---

## Support

For issues, feature requests, or questions:

- GitHub: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com/docs
- Discord: https://discord.gg/crawl4ai