feat(crawler): add network request and console message capturing

Implement comprehensive network request and console message capturing functionality:

- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.
2025-04-10 16:03:48 +08:00
parent a2061bf31e
commit 66ac07b4f3
31 changed files with 1686 additions and 10 deletions
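The commit message mentions Playwright event listeners. As a rough sketch of how such listeners might record events in the shape this commit documents (this is not the crawler's actual implementation; `make_request_recorder` and `make_console_recorder` are hypothetical names), the handlers only need the `url`, `method`, and `resource_type` properties of a Playwright request and the `type` and `text` properties of a console message. Stand-in objects are used so the sketch runs without a browser:

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def make_request_recorder(events: List[Dict[str, Any]]) -> Callable[[Any], None]:
    """Build a handler that appends one "request" event dict per call."""
    def on_request(request: Any) -> None:
        events.append({
            "event_type": "request",
            "url": request.url,
            "method": request.method,
            "resource_type": request.resource_type,
            "timestamp": time.time(),
        })
    return on_request


def make_console_recorder(messages: List[Dict[str, Any]]) -> Callable[[Any], None]:
    """Build a handler that appends one console message dict per call."""
    def on_console(msg: Any) -> None:
        messages.append({
            "type": msg.type,
            "text": msg.text,
            "timestamp": time.time(),
        })
    return on_console


# With a real Playwright page, the handlers would be attached like this:
#   page.on("request", make_request_recorder(events))
#   page.on("console", make_console_recorder(messages))

# Stand-in objects (assumption: they mimic only the attributes used above).
events: List[Dict[str, Any]] = []
messages: List[Dict[str, Any]] = []
fake_request = SimpleNamespace(
    url="https://example.com/api/data.json", method="GET", resource_type="fetch"
)
fake_message = SimpleNamespace(type="error", text="boom")

make_request_recorder(events)(fake_request)
make_console_recorder(messages)(fake_message)
print(events[0]["event_type"], messages[0]["type"])
```

Building each handler as a closure over a plain list keeps the recording logic independent of the browser, which also makes it easy to unit-test.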
--- a/docs/md_v2/advanced/network-console-capture.md
+++ b/docs/md_v2/advanced/network-console-capture.md
@@ -0,0 +1,205 @@
+# Network Requests & Console Message Capturing
+
+Crawl4AI can capture all network requests and browser console messages during a crawl, which is invaluable for debugging, security analysis, or understanding page behavior.
+
+## Configuration
+
+To enable network and console capturing, use these configuration options:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+# Enable both network request capture and console message capture
+config = CrawlerRunConfig(
+    capture_network_requests=True,  # Capture all network requests and responses
+    capture_console_messages=True   # Capture all browser console output
+)
+```
+
+## Example Usage
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+    # Enable both network request capture and console message capture
+    config = CrawlerRunConfig(
+        capture_network_requests=True,
+        capture_console_messages=True
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com",
+            config=config
+        )
+        
+        if result.success:
+            # Analyze network requests
+            if result.network_requests:
+                print(f"Captured {len(result.network_requests)} network events")
+                
+                # Count request types
+                request_count = len([r for r in result.network_requests if r.get("event_type") == "request"])
+                response_count = len([r for r in result.network_requests if r.get("event_type") == "response"])
+                failed_count = len([r for r in result.network_requests if r.get("event_type") == "request_failed"])
+                
+                print(f"Requests: {request_count}, Responses: {response_count}, Failed: {failed_count}")
+                
+                # Find API calls
+                api_calls = [r for r in result.network_requests 
+                            if r.get("event_type") == "request" and "api" in r.get("url", "")]
+                if api_calls:
+                    print(f"Detected {len(api_calls)} API calls:")
+                    for call in api_calls[:3]:  # Show first 3
+                        print(f"  - {call.get('method')} {call.get('url')}")
+            
+            # Analyze console messages
+            if result.console_messages:
+                print(f"Captured {len(result.console_messages)} console messages")
+                
+                # Group by type
+                message_types = {}
+                for msg in result.console_messages:
+                    msg_type = msg.get("type", "unknown")
+                    message_types[msg_type] = message_types.get(msg_type, 0) + 1
+                
+                print("Message types:", message_types)
+                
+                # Show errors (often the most important)
+                errors = [msg for msg in result.console_messages if msg.get("type") == "error"]
+                if errors:
+                    print(f"Found {len(errors)} console errors:")
+                    for err in errors[:2]:  # Show first 2
+                        print(f"  - {err.get('text', '')[:100]}")
+            
+            # Export all captured data to a file for detailed analysis
+            with open("network_capture.json", "w") as f:
+                json.dump({
+                    "url": result.url,
+                    "network_requests": result.network_requests or [],
+                    "console_messages": result.console_messages or []
+                }, f, indent=2)
+            
+            print("Exported detailed capture data to network_capture.json")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## Captured Data Structure
+
+### Network Requests
+
+`result.network_requests` is a list of dictionaries, each representing one network event. All events share these fields:
+
+| Field | Description |
+|-------|-------------|
+| `event_type` | Type of event: `"request"`, `"response"`, or `"request_failed"` |
+| `url` | The URL of the request |
+| `timestamp` | Unix timestamp when the event was captured |
+
+#### Request Event Fields
+
+```json
+{
+  "event_type": "request",
+  "url": "https://example.com/api/data.json",
+  "method": "GET",
+  "headers": {"User-Agent": "...", "Accept": "..."},
+  "post_data": "key=value&otherkey=value",
+  "resource_type": "fetch",
+  "is_navigation_request": false,
+  "timestamp": 1633456789.123
+}
+```
+
+#### Response Event Fields
+
+```json
+{
+  "event_type": "response",
+  "url": "https://example.com/api/data.json",
+  "status": 200,
+  "status_text": "OK",
+  "headers": {"Content-Type": "application/json", "Cache-Control": "..."},
+  "from_service_worker": false,
+  "request_timing": {"requestTime": 1234.56, "receiveHeadersEnd": 1234.78},
+  "timestamp": 1633456789.456
+}
+```
+
+#### Failed Request Event Fields
+
+```json
+{
+  "event_type": "request_failed",
+  "url": "https://example.com/missing.png",
+  "method": "GET",
+  "resource_type": "image",
+  "failure_text": "net::ERR_ABORTED",
+  "timestamp": 1633456789.789
+}
+```
+
+### Console Messages
+
+`result.console_messages` is a list of dictionaries, each representing one console message. All messages share these fields:
+
+| Field | Description |
+|-------|-------------|
+| `type` | Message type: `"log"`, `"error"`, `"warning"`, `"info"`, etc. |
+| `text` | The message text |
+| `timestamp` | Unix timestamp when the message was captured |
+
+#### Console Message Example
+
+```json
+{
+  "type": "error",
+  "text": "Uncaught TypeError: Cannot read property 'length' of undefined",
+  "location": "https://example.com/script.js:123:45",
+  "timestamp": 1633456790.123
+}
+```
+
+## Key Benefits
+
+- **Full Request Visibility**: Capture all network activity including:
+  - Requests (URLs, methods, headers, post data)
+  - Responses (status codes, headers, timing)
+  - Failed requests (with error messages)
+  
+- **Console Message Access**: View all JavaScript console output:
+  - Log messages
+  - Warnings
+  - Errors with stack traces
+  - Developer debugging information
+
+- **Debugging Power**: Identify issues such as:
+  - Failed API calls or resource loading
+  - JavaScript errors affecting page functionality
+  - CORS or other security issues
+  - Hidden API endpoints and data flows
+
+- **Security Analysis**: Detect:
+  - Unexpected third-party requests
+  - Data leakage in request payloads
+  - Suspicious script behavior
+
+- **Performance Insights**: Analyze:
+  - Request timing data
+  - Resource loading patterns
+  - Potential bottlenecks
+
+## Use Cases
+
+1. **API Discovery**: Identify hidden endpoints and data flows in single-page applications
+2. **Debugging**: Track down JavaScript errors affecting page functionality
+3. **Security Auditing**: Detect unwanted third-party requests or data leakage
+4. **Performance Analysis**: Identify slow-loading resources
+5. **Ad/Tracker Analysis**: Detect and catalog advertising or tracking calls
+
+This capability is especially valuable for JavaScript-heavy sites and single-page applications, and whenever you need to see exactly what a browser and its servers exchange.
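The security and debugging use cases listed in this file can be sketched as a small post-processing step over the captured events. The helper below is illustrative only: `summarize_capture` is a hypothetical name, it relies solely on the `event_type`, `url`, `status`, and `failure_text` fields documented above, and it treats any host that differs from the page's host as third-party, which is a crude assumption (subdomains count as third-party).

```python
from collections import Counter
from typing import Any, Dict, List
from urllib.parse import urlparse


def summarize_capture(network_events: List[Dict[str, Any]], page_url: str) -> Dict[str, Any]:
    """Summarize third-party hosts, HTTP error responses, and failed requests."""
    page_host = urlparse(page_url).hostname or ""
    third_party_hosts: Counter = Counter()
    error_responses = []
    failed_requests = []

    for event in network_events:
        kind = event.get("event_type")
        url = event.get("url", "")
        if kind == "request":
            # Count requests whose host differs from the crawled page's host.
            host = urlparse(url).hostname or ""
            if host and host != page_host:
                third_party_hosts[host] += 1
        elif kind == "response":
            # Responses with a 4xx or 5xx status are reported as errors.
            status = event.get("status") or 0
            if status >= 400:
                error_responses.append((status, url))
        elif kind == "request_failed":
            failed_requests.append((url, event.get("failure_text")))

    return {
        "third_party_hosts": dict(third_party_hosts),
        "error_responses": error_responses,
        "failed_requests": failed_requests,
    }
```

Assuming a crawl with capturing enabled, it could be called as `summarize_capture(result.network_requests or [], result.url)`.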
--- a/docs/md_v2/api/crawl-result.md
+++ b/docs/md_v2/api/crawl-result.md
@@ -281,7 +281,69 @@ for result in results:

 ---

-## 7. Example: Accessing Everything
+## 7. Network Requests & Console Messages
+
+When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:
+
+### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*
+**What**: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
+**Structure**:
+- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
+- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
+- Response events include `url`, `status`, `status_text`, `headers`, `from_service_worker`, and `request_timing`.
+- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
+- All events include a `timestamp` field.
+
+**Usage**:
+```python
+if result.network_requests:
+    # Count different types of events
+    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
+    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
+    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
+    
+    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")
+    
+    # Analyze API calls
+    api_calls = [r for r in requests if "api" in r.get("url", "")]
+    
+    # Identify failed resources
+    for failure in failures:
+        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
+```
+
+### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*
+**What**: A list of dictionaries containing all browser console messages captured during the crawl.
+**Structure**:
+- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
+- The `text` field contains the actual message text.
+- Some messages include `location` information (URL, line, column).
+- All messages include a `timestamp` field.
+
+**Usage**:
+```python
+if result.console_messages:
+    # Count messages by type
+    message_types = {}
+    for msg in result.console_messages:
+        msg_type = msg.get("type", "unknown")
+        message_types[msg_type] = message_types.get(msg_type, 0) + 1
+    
+    print(f"Message type counts: {message_types}")
+    
+    # Display errors (which are usually most important)
+    for msg in result.console_messages:
+        if msg.get("type") == "error":
+            print(f"Error: {msg.get('text')}")
+```
+
+These fields provide deep visibility into the page's network activity and browser console, which is invaluable for debugging, security analysis, and understanding complex web applications.
+
+For more details on network and console capturing, see the [Network & Console Capture documentation](../advanced/network-console-capture.md).
+
+---
+
+## 8. Example: Accessing Everything

 ```python
 async def handle_result(result: CrawlResult):
@@ -321,11 +383,29 @@ async def handle_result(result: CrawlResult):
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))
+        
+    # Network and console capturing
+    if result.network_requests:
+        print(f"Network requests captured: {len(result.network_requests)}")
+        # Count events by resource type (only events that carry a resource_type field)
+        req_types = {}
+        for req in result.network_requests:
+            if "resource_type" in req:
+                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
+        print(f"Resource types: {req_types}")
+        
+    if result.console_messages:
+        print(f"Console messages captured: {len(result.console_messages)}")
+        # Count by message type
+        msg_types = {}
+        for msg in result.console_messages:
+            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
+        print(f"Message types: {msg_types}")
 ```

 ---

-## 8. Key Points & Future
+## 9. Key Points & Future

 1. **Deprecated legacy properties of CrawlResult**  
   - `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now!