Merged next branch

This commit is contained in:
Aravind Karnam
2025-04-12 10:47:02 +05:30
62 changed files with 3225 additions and 7085 deletions

View File

@@ -0,0 +1,205 @@
# Network Requests & Console Message Capturing
Crawl4AI can capture all network requests and browser console messages during a crawl, which is invaluable for debugging, security analysis, or understanding page behavior.
## Configuration
To enable network and console capturing, use these configuration options:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Enable both network request capture and console message capture
config = CrawlerRunConfig(
capture_network_requests=True, # Capture all network requests and responses
capture_console_messages=True # Capture all browser console output
)
```
## Example Usage
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Enable both network request capture and console message capture
config = CrawlerRunConfig(
capture_network_requests=True,
capture_console_messages=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
config=config
)
if result.success:
# Analyze network requests
if result.network_requests:
print(f"Captured {len(result.network_requests)} network events")
# Count request types
request_count = len([r for r in result.network_requests if r.get("event_type") == "request"])
response_count = len([r for r in result.network_requests if r.get("event_type") == "response"])
failed_count = len([r for r in result.network_requests if r.get("event_type") == "request_failed"])
print(f"Requests: {request_count}, Responses: {response_count}, Failed: {failed_count}")
# Find API calls
api_calls = [r for r in result.network_requests
if r.get("event_type") == "request" and "api" in r.get("url", "")]
if api_calls:
print(f"Detected {len(api_calls)} API calls:")
for call in api_calls[:3]: # Show first 3
print(f" - {call.get('method')} {call.get('url')}")
# Analyze console messages
if result.console_messages:
print(f"Captured {len(result.console_messages)} console messages")
# Group by type
message_types = {}
for msg in result.console_messages:
msg_type = msg.get("type", "unknown")
message_types[msg_type] = message_types.get(msg_type, 0) + 1
print("Message types:", message_types)
# Show errors (often the most important)
errors = [msg for msg in result.console_messages if msg.get("type") == "error"]
if errors:
print(f"Found {len(errors)} console errors:")
for err in errors[:2]: # Show first 2
print(f" - {err.get('text', '')[:100]}")
# Export all captured data to a file for detailed analysis
with open("network_capture.json", "w") as f:
json.dump({
"url": result.url,
"network_requests": result.network_requests or [],
"console_messages": result.console_messages or []
}, f, indent=2)
print("Exported detailed capture data to network_capture.json")
if __name__ == "__main__":
asyncio.run(main())
```
## Captured Data Structure
### Network Requests
The `result.network_requests` contains a list of dictionaries, each representing a network event with these common fields:
| Field | Description |
|-------|-------------|
| `event_type` | Type of event: `"request"`, `"response"`, or `"request_failed"` |
| `url` | The URL of the request |
| `timestamp` | Unix timestamp when the event was captured |
#### Request Event Fields
```json
{
"event_type": "request",
"url": "https://example.com/api/data.json",
"method": "GET",
"headers": {"User-Agent": "...", "Accept": "..."},
"post_data": "key=value&otherkey=value",
"resource_type": "fetch",
"is_navigation_request": false,
"timestamp": 1633456789.123
}
```
#### Response Event Fields
```json
{
"event_type": "response",
"url": "https://example.com/api/data.json",
"status": 200,
"status_text": "OK",
"headers": {"Content-Type": "application/json", "Cache-Control": "..."},
"from_service_worker": false,
"request_timing": {"requestTime": 1234.56, "receiveHeadersEnd": 1234.78},
"timestamp": 1633456789.456
}
```
#### Failed Request Event Fields
```json
{
"event_type": "request_failed",
"url": "https://example.com/missing.png",
"method": "GET",
"resource_type": "image",
"failure_text": "net::ERR_ABORTED 404",
"timestamp": 1633456789.789
}
```
### Console Messages
The `result.console_messages` contains a list of dictionaries, each representing a console message with these common fields:
| Field | Description |
|-------|-------------|
| `type` | Message type: `"log"`, `"error"`, `"warning"`, `"info"`, etc. |
| `text` | The message text |
| `timestamp` | Unix timestamp when the message was captured |
#### Console Message Example
```json
{
"type": "error",
"text": "Uncaught TypeError: Cannot read property 'length' of undefined",
"location": "https://example.com/script.js:123:45",
"timestamp": 1633456790.123
}
```
## Key Benefits
- **Full Request Visibility**: Capture all network activity including:
- Requests (URLs, methods, headers, post data)
- Responses (status codes, headers, timing)
- Failed requests (with error messages)
- **Console Message Access**: View all JavaScript console output:
- Log messages
- Warnings
- Errors with stack traces
- Developer debugging information
- **Debugging Power**: Identify issues such as:
- Failed API calls or resource loading
- JavaScript errors affecting page functionality
- CORS or other security issues
- Hidden API endpoints and data flows
- **Security Analysis**: Detect:
- Unexpected third-party requests
- Data leakage in request payloads
- Suspicious script behavior
- **Performance Insights**: Analyze:
- Request timing data
- Resource loading patterns
- Potential bottlenecks
## Use Cases
1. **API Discovery**: Identify hidden endpoints and data flows in single-page applications
2. **Debugging**: Track down JavaScript errors affecting page functionality
3. **Security Auditing**: Detect unwanted third-party requests or data leakage
4. **Performance Analysis**: Identify slow-loading resources
5. **Ad/Tracker Analysis**: Detect and catalog advertising or tracking calls
This capability is especially valuable for complex sites with heavy JavaScript, single-page applications, or when you need to understand the exact communication happening between a browser and servers.

View File

@@ -15,6 +15,7 @@ class CrawlResult(BaseModel):
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
@@ -236,7 +237,16 @@ if result.pdf:
f.write(result.pdf)
```
### 5.5 **`metadata`** *(Optional[dict])*
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
**Usage**:
```python
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
### 5.6 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
@@ -271,7 +281,69 @@ for result in results:
---
## 7. Example: Accessing Everything
## 7. Network Requests & Console Messages
When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:
### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*
**What**: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
**Structure**:
- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
- Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`.
- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
- All events include a `timestamp` field.
**Usage**:
```python
if result.network_requests:
# Count different types of events
requests = [r for r in result.network_requests if r.get("event_type") == "request"]
responses = [r for r in result.network_requests if r.get("event_type") == "response"]
failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")
# Analyze API calls
api_calls = [r for r in requests if "api" in r.get("url", "")]
# Identify failed resources
for failure in failures:
print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
```
### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*
**What**: A list of dictionaries containing all browser console messages captured during the crawl.
**Structure**:
- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
- The `text` field contains the actual message text.
- Some messages include `location` information (URL, line, column).
- All messages include a `timestamp` field.
**Usage**:
```python
if result.console_messages:
# Count messages by type
message_types = {}
for msg in result.console_messages:
msg_type = msg.get("type", "unknown")
message_types[msg_type] = message_types.get(msg_type, 0) + 1
print(f"Message type counts: {message_types}")
# Display errors (which are usually most important)
for msg in result.console_messages:
if msg.get("type") == "error":
print(f"Error: {msg.get('text')}")
```
These fields provide deep visibility into the page's network activity and browser console, which is invaluable for debugging, security analysis, and understanding complex web applications.
For more details on network and console capturing, see the [Network & Console Capture documentation](../advanced/network-console-capture.md).
---
## 8. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
@@ -304,16 +376,36 @@ async def handle_result(result: CrawlResult):
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF
# Screenshot/PDF/MHTML
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
if result.mhtml:
print("MHTML length:", len(result.mhtml))
# Network and console capturing
if result.network_requests:
print(f"Network requests captured: {len(result.network_requests)}")
# Analyze request types
req_types = {}
for req in result.network_requests:
if "resource_type" in req:
req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
print(f"Resource types: {req_types}")
if result.console_messages:
print(f"Console messages captured: {len(result.console_messages)}")
# Count by message type
msg_types = {}
for msg in result.console_messages:
msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
print(f"Message types: {msg_types}")
```
---
## 8. Key Points & Future
## 9. Key Points & Future
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now!

View File

@@ -141,6 +141,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an images alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |

View File

@@ -136,6 +136,12 @@ class CrawlerRunConfig:
wait_for=None,
screenshot=False,
pdf=False,
capture_mhtml=False,
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode=None,
verbose=True,
stream=False, # Enable streaming for arun_many()
@@ -170,10 +176,9 @@ class CrawlerRunConfig:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`** & **`pdf`**:
- If `True`, captures a screenshot or PDF after the page is fully loaded.
- The results go to `result.screenshot` (base64) or `result.pdf` (bytes).
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browsers verbosity if also set to `True` in `BrowserConfig`.

View File

@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
@@ -51,6 +52,7 @@ class CrawlResult(BaseModel):
| **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
| **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
| **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
| **mhtml (`Optional[str]`)** | MHTML snapshot of the page if `capture_mhtml=True`. Contains the full page with all resources. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | It holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as JSON string or other text. |
| **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
@@ -190,18 +192,27 @@ for img in images:
print("Image URL:", img["src"], "Alt:", img.get("alt"))
```
### 5.3 `screenshot` and `pdf`
### 5.3 `screenshot`, `pdf`, and `mhtml`
If you set `screenshot=True` or `pdf=True` in **`CrawlerRunConfig`**, then:
If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:
- `result.screenshot` contains a base64-encoded PNG string.
- `result.screenshot` contains a base64-encoded PNG string.
- `result.pdf` contains raw PDF bytes (you can write them to a file).
- `result.mhtml` contains the MHTML snapshot of the page as a string (you can write it to a .mhtml file).
```python
# Save the PDF
with open("page.pdf", "wb") as f:
f.write(result.pdf)
# Save the MHTML
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.
### 5.4 `ssl_certificate`
If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the sites SSL cert, such as issuer, validity dates, etc.

View File

@@ -4,7 +4,35 @@ In this tutorial, youll learn how to:
1. Extract links (internal, external) from crawled pages
2. Filter or exclude specific domains (e.g., social media or custom domains)
3. Access and manage media data (especially images) in the crawl result
3. Access and ma### 3.2 Excluding Images
#### Excluding External Images
If you're dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
```python
crawler_cfg = CrawlerRunConfig(
exclude_external_images=True
)
```
This setting attempts to discard images from outside the primary domain, keeping only those from the site you're crawling.
#### Excluding All Images
If you want to completely remove all images from the page to maximize performance and reduce memory usage, use:
```python
crawler_cfg = CrawlerRunConfig(
exclude_all_images=True
)
```
This setting removes all images very early in the processing pipeline, which significantly improves memory efficiency and processing speed. This is particularly useful when:
- You don't need image data in your results
- You're crawling image-heavy pages that cause memory issues
- You want to focus only on text content
- You need to maximize crawling speeddata (especially images) in the crawl result
4. Configure your crawler to exclude or prioritize certain images
> **Prerequisites**
@@ -271,8 +299,41 @@ Each extracted table contains:
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
- **`capture_mhtml`**: Set to `True` if you want an MHTML snapshot of the page in `result.mhtml`. This format preserves the entire web page with all its resources (CSS, images, scripts) in a single file, making it perfect for archiving or offline viewing.
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
#### Example: Capturing Page as MHTML
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
crawler_cfg = CrawlerRunConfig(
capture_mhtml=True # Enable MHTML capture
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=crawler_cfg)
if result.success and result.mhtml:
# Save the MHTML snapshot to a file
with open("example.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
print("MHTML snapshot saved to example.mhtml")
else:
print("Failed to capture MHTML:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
The MHTML format is particularly useful because:
- It captures the complete page state including all resources
- It can be opened in most modern browsers for offline viewing
- It preserves the page exactly as it appeared during crawling
- It's a single file, making it easy to store and transfer
---
## 4. Putting It All Together: Link & Media Filtering