Merge pull request #1366 from unclecode/fix/update-tables-documentation

docs: Update README.md and modify Media and Tables Documentation.(#1271)
2025-08-06 15:15:24 +08:00
parent 437395e490 fddae303fb
commit 45d8327d23
3 changed files with 74 additions and 85 deletions
--- a/README.md
+++ b/README.md
@@ -618,16 +618,16 @@ Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blo
        # Process results
        raw_df = pd.DataFrame()
        for result in results:
-            if result.success and result.media["tables"]:
+            if result.success and result.tables:
                raw_df = pd.DataFrame(
-                    result.media["tables"][0]["rows"],
-                    columns=result.media["tables"][0]["headers"],
+                    result.tables[0]["rows"],
+                    columns=result.tables[0]["headers"],
                )
                break
        print(raw_df.head())

    finally:
-        await crawler.stop()
+        await crawler.close()
  ```

 - **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
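
The README change above swaps `result.media["tables"]` for `result.tables`. Code that must run against results from both before and after this change could use a small accessor like the one below. This is a minimal sketch, not part of the PR: `get_tables` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real crawl results.

```python
from types import SimpleNamespace


def get_tables(result):
    """Return extracted tables from the new or the legacy location (hypothetical helper)."""
    # Newer results expose tables directly as `result.tables`.
    tables = getattr(result, "tables", None)
    if tables:
        return list(tables)
    # Older results kept them under result.media["tables"].
    media = getattr(result, "media", None) or {}
    return list(media.get("tables", []))


# Stand-in objects for illustration only; real code would pass a crawl result.
new_style = SimpleNamespace(tables=[{"headers": ["a"], "rows": [["1"]]}], media={})
old_style = SimpleNamespace(media={"tables": [{"headers": ["b"], "rows": [["2"]]}]})

print(get_tables(new_style)[0]["headers"])
print(get_tables(old_style)[0]["headers"])
```

The helper checks the new attribute first, so it keeps working once the legacy key is gone.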
--- a/docs/md_v2/core/crawler-result.md
+++ b/docs/md_v2/core/crawler-result.md
@@ -187,7 +187,7 @@ Here:

 ---

-## 5. More Fields: Links, Media, and More
+## 5. More Fields: Links, Media, Tables and More

 ### 5.1 `links`

@@ -207,7 +207,69 @@ for img in images:
    print("Image URL:", img["src"], "Alt:", img.get("alt"))
 ```

-### 5.3 `screenshot`, `pdf`, and `mhtml`
+### 5.3 `tables`
+
+The `tables` field contains structured data extracted from HTML tables found on the crawled page. Each table is scored against several criteria to decide whether it is a data table rather than a layout table, including:
+
+- Presence of `<thead>` and `<tbody>` sections
+- Use of `<th>` elements for headers
+- Column consistency
+- Text density
+- And other factors
+
+Tables that score above the threshold (default: 7) are extracted and stored in `result.tables`.
+
+### Accessing Table Data
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com/",
+        config=CrawlerRunConfig(
+            table_score_threshold=7  # Minimum score for table detection
+        )
+    )
+    
+    if result.success and result.tables:
+        print(f"Found {len(result.tables)} tables")
+        
+        for i, table in enumerate(result.tables):
+            print(f"\nTable {i+1}:")
+            print(f"Caption: {table.get('caption', 'No caption')}")
+            print(f"Headers: {table['headers']}")
+            print(f"Rows: {len(table['rows'])}")
+            
+            # Print first few rows as example
+            for j, row in enumerate(table['rows'][:3]):
+                print(f"  Row {j+1}: {row}")
+```
+
+### Configuring Table Extraction
+
+You can adjust the sensitivity of the table detection algorithm with:
+
+```python
+config = CrawlerRunConfig(
+    table_score_threshold=5  # Lower value = more tables detected (default: 7)
+)
+```
+
+Each extracted table contains: 
+
+- `headers`: Column header names 
+- `rows`: List of rows, each containing cell values
+- `caption`: Table caption text (if available) 
+- `summary`: Table summary attribute (if specified)
+
+### Table Extraction Tips
+
+- Not all HTML tables are extracted; only those detected as data tables, as opposed to layout tables, are kept.
+- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
+- If expected tables are missing, try lowering `table_score_threshold` (default: 7).
+
+The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
+
+
+### 5.4 `screenshot`, `pdf`, and `mhtml`

 If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:

@@ -228,7 +290,7 @@ if result.mhtml:

 The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.

-### 5.4 `ssl_certificate`
+### 5.5 `ssl_certificate`

 If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site’s SSL cert, such as issuer, validity dates, etc.
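
Each entry in `result.tables` is described in section 5.3 above as a dict with `headers`, `rows`, `caption`, and `summary`. The sketch below shows one way to turn such a dict into a list of per-row records. `table_to_records` and the `col_N` fallback names are assumptions for illustration, not part of Crawl4AI. The sample data mirrors the "Employee Directory" structure example that this PR removes from `link-media.md`.

```python
def table_to_records(table):
    """Convert one extracted table dict into a list of {header: cell} dicts."""
    headers = table.get("headers") or []
    records = []
    for row in table.get("rows", []):
        # Fall back to positional names when the table has no headers.
        names = headers if headers else [f"col_{i}" for i in range(len(row))]
        # zip() stops at the shorter sequence, so extra cells are dropped.
        records.append(dict(zip(names, row)))
    return records


# Sample table shaped like the documented structure.
sample_table = {
    "headers": ["Name", "Age", "Location"],
    "rows": [
        ["John Doe", "34", "New York"],
        ["Jane Smith", "28", "San Francisco"],
    ],
    "caption": "Employee Directory",
    "summary": "Directory of company employees",
}

records = table_to_records(sample_table)
for record in records:
    print(record)
```

Cell values stay as strings here; any numeric conversion is left to the caller.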

--- a/docs/md_v2/core/link-media.md
+++ b/docs/md_v2/core/link-media.md
@@ -520,7 +520,8 @@ This approach is handy when you still want external links but need to block cert

 ### 4.1 Accessing `result.media`

-By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).
+By default, Crawl4AI collects images, audio and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
+**Note: Tables have moved from `result.media["tables"]` to the top-level `result.tables` field for better organization and direct access.**

 **Basic Example**:

@@ -534,14 +535,6 @@ if result.success:
        print(f"           Alt text: {img.get('alt', '')}")
        print(f"           Score: {img.get('score')}")
        print(f"           Description: {img.get('desc', '')}\n")
-    
-    # Get tables
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables in total.")
-    for i, table in enumerate(tables):
-        print(f"[Table {i}] Caption: {table.get('caption', 'No caption')}")
-        print(f"           Columns: {len(table.get('headers', []))}")
-        print(f"           Rows: {len(table.get('rows', []))}")
 ```

 **Structure Example**:
@@ -568,19 +561,6 @@ result.media = {
  "audio": [
    # Similar structure but with audio-specific fields
  ],
-  "tables": [
-    {
-      "headers": ["Name", "Age", "Location"],
-      "rows": [
-        ["John Doe", "34", "New York"],
-        ["Jane Smith", "28", "San Francisco"],
-        ["Alex Johnson", "42", "Chicago"]
-      ],
-      "caption": "Employee Directory",
-      "summary": "Directory of company employees"
-    },
-    # More tables if present
-  ]
 }
 ```

@@ -608,53 +588,7 @@ crawler_cfg = CrawlerRunConfig(

 This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.

-### 3.3 Working with Tables
-
-Crawl4AI can detect and extract structured data from HTML tables. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
-
-- Presence of thead and tbody sections
-- Use of th elements for headers
-- Column consistency
-- Text density
-- And other factors
-
-Tables that score above the threshold (default: 7) are extracted and stored in `result.media.tables`.
-
-**Accessing Table Data**:
-
-```python
-if result.success:
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables on the page")
-    
-    if tables:
-        # Access the first table
-        first_table = tables[0]
-        print(f"Table caption: {first_table.get('caption', 'No caption')}")
-        print(f"Headers: {first_table.get('headers', [])}")
-        
-        # Print the first 3 rows
-        for i, row in enumerate(first_table.get('rows', [])[:3]):
-            print(f"Row {i+1}: {row}")
-```
-
-**Configuring Table Extraction**:
-
-You can adjust the sensitivity of the table detection algorithm with:
-
-```python
-crawler_cfg = CrawlerRunConfig(
-    table_score_threshold=5  # Lower value = more tables detected (default: 7)
-)
-```
-
-Each extracted table contains:
-- `headers`: Column header names
-- `rows`: List of rows, each containing cell values
-- `caption`: Table caption text (if available)
-- `summary`: Table summary attribute (if specified)
-
-### 3.4 Additional Media Config
+### 4.3 Additional Media Config

 - **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.  
 - **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.  
@@ -695,7 +629,7 @@ The MHTML format is particularly useful because:

 ---

-## 4. Putting It All Together: Link & Media Filtering
+## 5. Putting It All Together: Link & Media Filtering

 Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:

@@ -743,7 +677,7 @@ if __name__ == "__main__":

 ---

-## 5. Common Pitfalls & Tips
+## 6. Common Pitfalls & Tips

 1. **Conflicting Flags**:  
   - `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.  
@@ -762,10 +696,3 @@ if __name__ == "__main__":
 ---

 **That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
-### Table Extraction Tips
-
-- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
-- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
-- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
-
-The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.