Merge next + resolve conflicts

Aravind Karnam
2025-04-23 19:44:50 +05:30
137 changed files with 36997 additions and 8947 deletions


@@ -263,7 +263,102 @@ See the full example in `docs/examples/identity_based_browsing.py` for a complet
---
## 7. Summary
## 7. Locale, Timezone, and Geolocation Control
In addition to using persistent profiles, Crawl4AI supports customizing your browser's locale, timezone, and geolocation settings. These features enhance your identity-based browsing experience by allowing you to control how websites perceive your location and regional settings.
### Setting Locale and Timezone
You can set the browser's locale and timezone through `CrawlerRunConfig`:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=CrawlerRunConfig(
            # Set browser locale (language and region formatting)
            locale="fr-FR",  # French (France)
            # Set browser timezone
            timezone_id="Europe/Paris",
            # Other normal options...
            magic=True,
            page_timeout=60000
        )
    )
```
**How it works:**
- `locale` affects language preferences, date formats, number formats, etc.
- `timezone_id` affects JavaScript's Date object and time-related functionality
- These settings are applied when creating the browser context and maintained throughout the session
### Configuring Geolocation
Control the GPS coordinates reported by the browser's geolocation API:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://maps.google.com",  # Or any location-aware site
        config=CrawlerRunConfig(
            # Configure precise GPS coordinates
            geolocation=GeolocationConfig(
                latitude=48.8566,  # Paris coordinates
                longitude=2.3522,
                accuracy=100  # Accuracy in meters (optional)
            ),
            # This site will see you as being in Paris
            page_timeout=60000
        )
    )
```
**Important notes:**
- When `geolocation` is specified, the browser is automatically granted permission to access location
- Websites using the Geolocation API will receive the exact coordinates you specify
- This affects map services, store locators, delivery services, etc.
- Combined with the appropriate `locale` and `timezone_id`, you can create a fully consistent location profile
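Because sites receive exactly the coordinates you specify, it can help to range-check them before building the config. The following is a minimal sketch, not part of Crawl4AI: `validate_coordinates` is a hypothetical helper, and the keyword names it returns are assumed to match the `GeolocationConfig` arguments shown in the example above.

```python
def validate_coordinates(latitude: float, longitude: float, accuracy: float = 0.0) -> dict:
    """Return geolocation keyword arguments, or raise ValueError if a value is out of range.

    Hypothetical helper for illustration; it only checks the usual WGS 84 bounds.
    """
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude must be within [-90, 90], got {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude must be within [-180, 180], got {longitude}")
    if accuracy < 0:
        raise ValueError(f"accuracy must be non-negative, got {accuracy}")
    return {"latitude": latitude, "longitude": longitude, "accuracy": accuracy}


# Paris coordinates from the example above
paris = validate_coordinates(48.8566, 2.3522, accuracy=100)
print(paris)
```

Assuming the constructor accepts these keyword names, the result could then be unpacked as `GeolocationConfig(**paris)`.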
### Combining with Managed Browsers
These settings can be combined with managed browsers, so that the persistent profile, locale, timezone, and coordinates all describe the same identity:
```python
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig,
    GeolocationConfig
)

browser_config = BrowserConfig(
    use_managed_browser=True,
    user_data_dir="/path/to/my-profile",
    browser_type="chromium"
)

crawl_config = CrawlerRunConfig(
    # Location settings
    locale="es-MX",  # Spanish (Mexico)
    timezone_id="America/Mexico_City",
    geolocation=GeolocationConfig(
        latitude=19.4326,  # Mexico City
        longitude=-99.1332
    )
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawl_config)
```
Combining persistent profiles with precise geolocation and region settings lets you control how sites perceive your profile, language, time zone, and location.
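One way to keep locale, timezone, and coordinates consistent is to store them together as named presets. This is only a sketch: `LOCATION_PRESETS` and `location_settings` are hypothetical names, and the values are the Paris and Mexico City settings used in the examples above.

```python
# Hypothetical preset table; each entry bundles settings that describe one place.
LOCATION_PRESETS = {
    "paris": {
        "locale": "fr-FR",
        "timezone_id": "Europe/Paris",
        "latitude": 48.8566,
        "longitude": 2.3522,
    },
    "mexico_city": {
        "locale": "es-MX",
        "timezone_id": "America/Mexico_City",
        "latitude": 19.4326,
        "longitude": -99.1332,
    },
}


def location_settings(name: str) -> dict:
    """Return a copy of the named preset so callers cannot mutate the table."""
    try:
        preset = LOCATION_PRESETS[name]
    except KeyError:
        known = ", ".join(sorted(LOCATION_PRESETS))
        raise ValueError(f"unknown preset {name!r}; choose from: {known}") from None
    return dict(preset)


settings = location_settings("mexico_city")
print(settings["locale"], settings["timezone_id"])
```

The `locale` and `timezone_id` values would go to `CrawlerRunConfig`, and the coordinates to `GeolocationConfig`, as in the managed-browser example.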
## 8. Summary
- **Create** your user-data directory either:
- By launching Chrome/Chromium externally with `--user-data-dir=/some/path`
@@ -271,6 +366,7 @@ See the full example in `docs/examples/identity_based_browsing.py` for a complet
- Or through the interactive interface with `profiler.interactive_manager()`
- **Log in** or configure sites as needed, then close the browser
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`
- **Customize** identity aspects with `locale`, `timezone_id`, and `geolocation`
- **List and reuse** profiles with `BrowserProfiler.list_profiles()`
- **Manage** your profiles with the dedicated `BrowserProfiler` class
- Enjoy **persistent** sessions that reflect your real identity


@@ -0,0 +1,205 @@
# Network Requests & Console Message Capturing
Crawl4AI can capture all network requests and browser console messages during a crawl, which is invaluable for debugging, security analysis, or understanding page behavior.
## Configuration
To enable network and console capturing, use these configuration options:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Enable both network request capture and console message capture
config = CrawlerRunConfig(
    capture_network_requests=True,  # Capture all network requests and responses
    capture_console_messages=True   # Capture all browser console output
)
```
## Example Usage
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable both network request capture and console message capture
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )

        if result.success:
            # Analyze network requests
            if result.network_requests:
                print(f"Captured {len(result.network_requests)} network events")
                # Count request types
                request_count = len([r for r in result.network_requests if r.get("event_type") == "request"])
                response_count = len([r for r in result.network_requests if r.get("event_type") == "response"])
                failed_count = len([r for r in result.network_requests if r.get("event_type") == "request_failed"])
                print(f"Requests: {request_count}, Responses: {response_count}, Failed: {failed_count}")
                # Find API calls
                api_calls = [r for r in result.network_requests
                             if r.get("event_type") == "request" and "api" in r.get("url", "")]
                if api_calls:
                    print(f"Detected {len(api_calls)} API calls:")
                    for call in api_calls[:3]:  # Show first 3
                        print(f" - {call.get('method')} {call.get('url')}")

            # Analyze console messages
            if result.console_messages:
                print(f"Captured {len(result.console_messages)} console messages")
                # Group by type
                message_types = {}
                for msg in result.console_messages:
                    msg_type = msg.get("type", "unknown")
                    message_types[msg_type] = message_types.get(msg_type, 0) + 1
                print("Message types:", message_types)
                # Show errors (often the most important)
                errors = [msg for msg in result.console_messages if msg.get("type") == "error"]
                if errors:
                    print(f"Found {len(errors)} console errors:")
                    for err in errors[:2]:  # Show first 2
                        print(f" - {err.get('text', '')[:100]}")

            # Export all captured data to a file for detailed analysis
            with open("network_capture.json", "w") as f:
                json.dump({
                    "url": result.url,
                    "network_requests": result.network_requests or [],
                    "console_messages": result.console_messages or []
                }, f, indent=2)
            print("Exported detailed capture data to network_capture.json")

if __name__ == "__main__":
    asyncio.run(main())
```
## Captured Data Structure
### Network Requests
`result.network_requests` is a list of dictionaries, each representing one network event. All events share these common fields:
| Field | Description |
|-------|-------------|
| `event_type` | Type of event: `"request"`, `"response"`, or `"request_failed"` |
| `url` | The URL of the request |
| `timestamp` | Unix timestamp when the event was captured |
#### Request Event Fields
```json
{
"event_type": "request",
"url": "https://example.com/api/data.json",
"method": "GET",
"headers": {"User-Agent": "...", "Accept": "..."},
"post_data": "key=value&otherkey=value",
"resource_type": "fetch",
"is_navigation_request": false,
"timestamp": 1633456789.123
}
```
#### Response Event Fields
```json
{
"event_type": "response",
"url": "https://example.com/api/data.json",
"status": 200,
"status_text": "OK",
"headers": {"Content-Type": "application/json", "Cache-Control": "..."},
"from_service_worker": false,
"request_timing": {"requestTime": 1234.56, "receiveHeadersEnd": 1234.78},
"timestamp": 1633456789.456
}
```
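Given response events in the shape above, a short helper can tally status classes and collect error responses. This is a sketch over hand-written sample data; `summarize_responses` is a hypothetical name, and only the `event_type`, `status`, and `url` fields documented here are assumed.

```python
from collections import Counter


def summarize_responses(events):
    """Tally response status classes and collect responses with status >= 400."""
    status_classes = Counter()
    errors = []
    for event in events:
        if event.get("event_type") != "response":
            continue
        status = event.get("status")
        if not isinstance(status, int):
            continue  # skip entries without a numeric status
        status_classes[f"{status // 100}xx"] += 1
        if status >= 400:
            errors.append((status, event.get("url")))
    return dict(status_classes), errors


# Hand-written sample data mimicking the documented event shape
sample_events = [
    {"event_type": "request", "url": "https://example.com/", "method": "GET"},
    {"event_type": "response", "url": "https://example.com/", "status": 200},
    {"event_type": "response", "url": "https://example.com/missing", "status": 404},
    {"event_type": "response", "url": "https://example.com/api/data.json", "status": 500},
]

classes, errors = summarize_responses(sample_events)
print(classes)
print(errors)
```

In a real crawl, `result.network_requests or []` would take the place of `sample_events`.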
#### Failed Request Event Fields
```json
{
"event_type": "request_failed",
"url": "https://example.com/missing.png",
"method": "GET",
"resource_type": "image",
"failure_text": "net::ERR_ABORTED 404",
"timestamp": 1633456789.789
}
```
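The three event types can be merged into one record per URL to see how each request ended. The sketch below keys on `url`, which is a simplifying assumption: if a page requests the same URL more than once, later events overwrite earlier ones. `outcomes_by_url` is a hypothetical helper, not a Crawl4AI function.

```python
def outcomes_by_url(events):
    """Merge request, response, and request_failed events into one record per URL."""
    outcomes = {}
    for event in events:
        url = event.get("url")
        if not url:
            continue
        entry = outcomes.setdefault(url, {"method": None, "status": None, "failure": None})
        kind = event.get("event_type")
        if kind == "request":
            entry["method"] = event.get("method")
        elif kind == "response":
            entry["status"] = event.get("status")
        elif kind == "request_failed":
            entry["failure"] = event.get("failure_text")
    return outcomes


# Hand-written sample data mimicking the documented event shapes
sample_events = [
    {"event_type": "request", "url": "https://example.com/api/data.json", "method": "GET"},
    {"event_type": "response", "url": "https://example.com/api/data.json", "status": 200},
    {"event_type": "request", "url": "https://example.com/missing.png", "method": "GET"},
    {"event_type": "request_failed", "url": "https://example.com/missing.png",
     "failure_text": "net::ERR_ABORTED"},
]

for url, outcome in outcomes_by_url(sample_events).items():
    print(url, outcome)
```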
### Console Messages
`result.console_messages` is a list of dictionaries, each representing one console message with these common fields:
| Field | Description |
|-------|-------------|
| `type` | Message type: `"log"`, `"error"`, `"warning"`, `"info"`, etc. |
| `text` | The message text |
| `timestamp` | Unix timestamp when the message was captured |
#### Console Message Example
```json
{
"type": "error",
"text": "Uncaught TypeError: Cannot read property 'length' of undefined",
"location": "https://example.com/script.js:123:45",
"timestamp": 1633456790.123
}
```
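If `location` follows the `url:line:column` pattern shown in the example, it can be split into its parts. That format is an assumption drawn from this one example, so the sketch falls back to returning the raw string when the pattern does not match; `parse_location` is a hypothetical helper.

```python
def parse_location(location: str):
    """Split 'url:line:column' into (url, line, column).

    Returns (location, None, None) when the trailing parts are not both numeric.
    """
    parts = location.rsplit(":", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        return parts[0], int(parts[1]), int(parts[2])
    return location, None, None


url, line, column = parse_location("https://example.com/script.js:123:45")
print(url, line, column)
```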
## Key Benefits
- **Full Request Visibility**: Capture all network activity including:
- Requests (URLs, methods, headers, post data)
- Responses (status codes, headers, timing)
- Failed requests (with error messages)
- **Console Message Access**: View all JavaScript console output:
- Log messages
- Warnings
- Errors with stack traces
- Developer debugging information
- **Debugging Power**: Identify issues such as:
- Failed API calls or resource loading
- JavaScript errors affecting page functionality
- CORS or other security issues
- Hidden API endpoints and data flows
- **Security Analysis**: Detect:
- Unexpected third-party requests
- Data leakage in request payloads
- Suspicious script behavior
- **Performance Insights**: Analyze:
- Request timing data
- Resource loading patterns
- Potential bottlenecks
## Use Cases
1. **API Discovery**: Identify hidden endpoints and data flows in single-page applications
2. **Debugging**: Track down JavaScript errors affecting page functionality
3. **Security Auditing**: Detect unwanted third-party requests or data leakage
4. **Performance Analysis**: Identify slow-loading resources
5. **Ad/Tracker Analysis**: Detect and catalog advertising or tracking calls
This capability is especially valuable for complex sites with heavy JavaScript, single-page applications, or when you need to understand the exact communication happening between a browser and servers.
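For the security-auditing and tracker-analysis use cases, a first pass might count requests whose host differs from the crawled page's host. This is a rough sketch: `third_party_hosts` is a hypothetical helper, and the suffix comparison treats any subdomain of the page host as first-party, without consulting the public suffix list.

```python
from collections import Counter
from urllib.parse import urlsplit


def third_party_hosts(events, page_url):
    """Count request events per host, skipping the page's own host and its subdomains."""
    page_host = urlsplit(page_url).hostname or ""
    counts = Counter()
    for event in events:
        if event.get("event_type") != "request":
            continue
        host = urlsplit(event.get("url", "")).hostname
        if not host:
            continue  # e.g. data: URLs have no host
        if host == page_host or host.endswith("." + page_host):
            continue
        counts[host] += 1
    return counts


# Hand-written sample data; tracker.test is a made-up host
sample_events = [
    {"event_type": "request", "url": "https://example.com/index.html"},
    {"event_type": "request", "url": "https://cdn.example.com/app.js"},
    {"event_type": "request", "url": "https://tracker.test/pixel.gif"},
    {"event_type": "request", "url": "https://tracker.test/collect"},
    {"event_type": "response", "url": "https://tracker.test/collect", "status": 204},
]

for host, count in third_party_hosts(sample_events, "https://example.com/").most_common():
    print(host, count)
```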


@@ -15,6 +15,7 @@ class CrawlResult(BaseModel):
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
@@ -236,7 +237,16 @@ if result.pdf:
f.write(result.pdf)
```
### 5.5 **`metadata`** *(Optional[dict])*
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
**Usage**:
```python
if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)
```
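Because MHTML is a MIME multipart document, Python's standard `email` package can list the resources bundled in a snapshot. The sketch below runs on a small hand-written sample rather than real crawler output; `list_mhtml_parts` is a hypothetical helper, and real snapshots may use other headers or encodings.

```python
from email import message_from_string


def list_mhtml_parts(mhtml: str):
    """Return (content_type, content_location) for each leaf part of an MHTML document."""
    message = message_from_string(mhtml)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts


# Hand-written sample mimicking the overall MHTML layout
sample_mhtml = "\n".join([
    "From: <Saved by Blink>",
    "Subject: Example",
    "MIME-Version: 1.0",
    'Content-Type: multipart/related; type="text/html"; boundary="----Boundary"',
    "",
    "------Boundary",
    "Content-Type: text/html",
    "Content-Location: https://example.com/",
    "",
    "<html><body>Hello</body></html>",
    "------Boundary",
    "Content-Type: text/css",
    "Content-Location: https://example.com/style.css",
    "",
    "body { color: red; }",
    "------Boundary--",
])

for content_type, location in list_mhtml_parts(sample_mhtml):
    print(content_type, location)
```

With a real crawl, `result.mhtml` would be passed in place of `sample_mhtml`.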
### 5.6 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
@@ -271,7 +281,69 @@ for result in results:
---
## 7. Example: Accessing Everything
## 7. Network Requests & Console Messages
When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:
### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*
**What**: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
**Structure**:
- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
- Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`.
- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
- All events include a `timestamp` field.
**Usage**:
```python
if result.network_requests:
    # Count different types of events
    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")
    # Analyze API calls
    api_calls = [r for r in requests if "api" in r.get("url", "")]
    # Identify failed resources
    for failure in failures:
        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
```
### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*
**What**: A list of dictionaries containing all browser console messages captured during the crawl.
**Structure**:
- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
- The `text` field contains the actual message text.
- Some messages include `location` information (URL, line, column).
- All messages include a `timestamp` field.
**Usage**:
```python
if result.console_messages:
    # Count messages by type
    message_types = {}
    for msg in result.console_messages:
        msg_type = msg.get("type", "unknown")
        message_types[msg_type] = message_types.get(msg_type, 0) + 1
    print(f"Message type counts: {message_types}")
    # Display errors (which are usually most important)
    for msg in result.console_messages:
        if msg.get("type") == "error":
            print(f"Error: {msg.get('text')}")
```
These fields provide deep visibility into the page's network activity and browser console, which is invaluable for debugging, security analysis, and understanding complex web applications.
For more details on network and console capturing, see the [Network & Console Capture documentation](../advanced/network-console-capture.md).
---
## 8. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
@@ -304,16 +376,36 @@ async def handle_result(result: CrawlResult):
    if result.extracted_content:
        print("Structured data:", result.extracted_content)
    # Screenshot/PDF
    # Screenshot/PDF/MHTML
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))
    # Network and console capturing
    if result.network_requests:
        print(f"Network requests captured: {len(result.network_requests)}")
        # Analyze request types
        req_types = {}
        for req in result.network_requests:
            if "resource_type" in req:
                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
        print(f"Resource types: {req_types}")
    if result.console_messages:
        print(f"Console messages captured: {len(result.console_messages)}")
        # Count by message type
        msg_types = {}
        for msg in result.console_messages:
            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
        print(f"Message types: {msg_types}")
```
---
## 8. Key Points & Future
## 9. Key Points & Future
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now!


@@ -70,7 +70,7 @@ We group them by category.
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as the `content_source` parameter, which selects the HTML input source (`'cleaned_html'`, `'raw_html'`, or `'fit_html'`). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
@@ -140,6 +140,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image's alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
@@ -231,6 +232,7 @@ async def main():
if __name__ == "__main__":
asyncio.run(main())
```
## 2.4 Compliance & Ethics


@@ -0,0 +1,444 @@
/* ==== File: docs/ask_ai/ask_ai.css ==== */
/* --- Basic Reset & Font --- */
body {
/* Attempt to inherit variables from parent window (iframe context) */
/* Fallback values if variables are not inherited */
--fallback-bg: #070708;
--fallback-font: #e8e9ed;
--fallback-secondary: #a3abba;
--fallback-primary: #50ffff;
--fallback-primary-dimmed: #09b5a5;
--fallback-border: #1d1d20;
--fallback-code-bg: #1e1e1e;
--fallback-invert-font: #222225;
--font-stack: dm, Monaco, Courier New, monospace, serif;
font-family: var(--font-stack, "Courier New", monospace); /* Use theme font stack */
background-color: var(--background-color, var(--fallback-bg));
color: var(--font-color, var(--fallback-font));
margin: 0;
padding: 0;
font-size: 14px; /* Match global font size */
line-height: 1.5em; /* Match global line height */
height: 100vh; /* Ensure body takes full height */
overflow: hidden; /* Prevent body scrollbars, panels handle scroll */
display: flex; /* Use flex for the main container */
}
a {
color: var(--secondary-color, var(--fallback-secondary));
text-decoration: none;
transition: color 0.2s;
}
a:hover {
color: var(--primary-color, var(--fallback-primary));
}
/* --- Main Container Layout --- */
.ai-assistant-container {
display: flex;
width: 100%;
height: 100%;
background-color: var(--background-color, var(--fallback-bg));
}
/* --- Sidebar Styling --- */
.sidebar {
flex-shrink: 0; /* Prevent sidebars from shrinking */
height: 100%;
display: flex;
flex-direction: column;
/* background-color: var(--code-bg-color, var(--fallback-code-bg)); */
overflow-y: hidden; /* Header fixed, list scrolls */
}
.left-sidebar {
flex-basis: 240px; /* Width of history panel */
border-right: 1px solid var(--progress-bar-background, var(--fallback-border));
}
.right-sidebar {
flex-basis: 280px; /* Width of citations panel */
border-left: 1px solid var(--progress-bar-background, var(--fallback-border));
}
.sidebar header {
padding: 0.6em 1em;
border-bottom: 1px solid var(--progress-bar-background, var(--fallback-border));
flex-shrink: 0;
display: flex;
justify-content: space-between;
align-items: center;
}
.sidebar header h3 {
margin: 0;
font-size: 1.1em;
color: var(--font-color, var(--fallback-font));
}
.sidebar ul {
list-style: none;
padding: 0;
margin: 0;
overflow-y: auto; /* Enable scrolling for the list */
flex-grow: 1; /* Allow list to take remaining space */
padding: 0.5em 0;
}
.sidebar ul li {
padding: 0.3em 1em;
}
.sidebar ul li.no-citations,
.sidebar ul li.no-history {
color: var(--secondary-color, var(--fallback-secondary));
font-style: italic;
font-size: 0.9em;
padding-left: 1em;
}
.sidebar ul li a {
color: var(--secondary-color, var(--fallback-secondary));
text-decoration: none;
display: block;
padding: 0.2em 0.5em;
border-radius: 3px;
transition: background-color 0.2s, color 0.2s;
}
.sidebar ul li a:hover {
color: var(--primary-color, var(--fallback-primary));
background-color: rgba(80, 255, 255, 0.08); /* Use primary color with alpha */
}
/* Style for active history item */
#history-list li.active a {
color: var(--primary-dimmed-color, var(--fallback-primary-dimmed));
font-weight: bold;
background-color: rgba(80, 255, 255, 0.12);
}
/* --- Chat Panel Styling --- */
#chat-panel {
flex-grow: 1; /* Take remaining space */
display: flex;
flex-direction: column;
height: 100%;
overflow: hidden; /* Prevent overflow, internal elements handle scroll */
}
#chat-messages {
flex-grow: 1;
overflow-y: auto; /* Scrollable chat history */
padding: 1em 1.5em;
border-bottom: 1px solid var(--progress-bar-background, var(--fallback-border));
}
.message {
margin-bottom: 1em;
padding: 0.8em 1.2em;
border-radius: 8px;
max-width: 90%; /* Slightly wider */
line-height: 1.6;
/* Apply pre-wrap for better handling of spaces/newlines AND wrapping */
white-space: pre-wrap;
word-wrap: break-word; /* Ensure long words break */
}
.user-message {
background-color: var(--progress-bar-background, var(--fallback-border)); /* User message background */
color: var(--font-color, var(--fallback-font));
margin-left: auto; /* Align user messages to the right */
text-align: left;
}
.ai-message {
background-color: var(--code-bg-color, var(--fallback-code-bg)); /* AI message background */
color: var(--font-color, var(--fallback-font));
margin-right: auto; /* Align AI messages to the left */
border: 1px solid var(--progress-bar-background, var(--fallback-border));
}
.ai-message.welcome-message {
border: none;
background-color: transparent;
max-width: 100%;
text-align: center;
color: var(--secondary-color, var(--fallback-secondary));
white-space: normal;
}
/* Styles for code within messages */
.ai-message code {
background-color: var(--invert-font-color, var(--fallback-invert-font)) !important; /* Use light bg for code */
/* color: var(--background-color, var(--fallback-bg)) !important; Dark text */
padding: 0.1em 0.4em;
border-radius: 4px;
font-size: 0.9em;
}
.ai-message pre {
background-color: var(--invert-font-color, var(--fallback-invert-font)) !important;
color: var(--background-color, var(--fallback-bg)) !important;
padding: 1em;
border-radius: 5px;
overflow-x: auto;
margin: 0.8em 0;
white-space: pre;
}
.ai-message pre code {
background-color: transparent !important;
padding: 0;
font-size: inherit;
}
/* Override white-space for specific elements generated by Markdown */
.ai-message p,
.ai-message ul,
.ai-message ol,
.ai-message blockquote {
white-space: normal; /* Allow standard wrapping for block elements */
}
/* --- Markdown Element Styling within Messages --- */
.message p {
margin-top: 0;
margin-bottom: 0.5em;
}
.message p:last-child {
margin-bottom: 0;
}
.message ul,
.message ol {
margin: 0.5em 0 0.5em 1.5em;
padding: 0;
}
.message li {
margin-bottom: 0.2em;
}
/* Code block styling (adjusts previous rules slightly) */
.message code {
/* Inline code */
background-color: var(--invert-font-color, var(--fallback-invert-font)) !important;
color: var(--font-color, var(--fallback-font));
padding: 0.1em 0.4em;
border-radius: 4px;
font-size: 0.9em;
/* Ensure inline code breaks nicely */
word-break: break-all;
white-space: normal; /* Allow inline code to wrap if needed */
}
.message pre {
/* Code block container */
background-color: var(--invert-font-color, var(--fallback-invert-font)) !important;
color: var(--background-color, var(--fallback-bg)) !important;
padding: 1em;
border-radius: 5px;
overflow-x: auto;
margin: 0.8em 0;
font-size: 0.9em; /* Slightly smaller code blocks */
}
.message pre code {
/* Code within code block */
background-color: transparent !important;
padding: 0;
font-size: inherit;
word-break: normal; /* Don't break words in code blocks */
white-space: pre; /* Preserve whitespace strictly in code blocks */
}
/* Thinking indicator */
.message-thinking {
display: inline-block;
width: 5px;
height: 5px;
background-color: var(--primary-color, var(--fallback-primary));
border-radius: 50%;
margin-left: 8px;
vertical-align: middle;
animation: thinking 1s infinite ease-in-out;
}
@keyframes thinking {
0%,
100% {
opacity: 0.5;
transform: scale(0.8);
}
50% {
opacity: 1;
transform: scale(1.2);
}
}
/* --- Thinking Indicator (Blinking Cursor Style) --- */
.thinking-indicator-cursor {
display: inline-block;
width: 10px; /* Width of the cursor */
height: 1.1em; /* Match line height */
background-color: var(--primary-color, var(--fallback-primary));
margin-left: 5px;
vertical-align: text-bottom; /* Align with text baseline */
animation: blink-cursor 1s step-end infinite;
}
@keyframes blink-cursor {
from,
to {
background-color: transparent;
}
50% {
background-color: var(--primary-color, var(--fallback-primary));
}
}
#chat-input-area {
flex-shrink: 0; /* Prevent input area from shrinking */
padding: 1em 1.5em;
display: flex;
align-items: flex-end; /* Align items to bottom */
gap: 10px;
background-color: var(--code-bg-color, var(--fallback-code-bg)); /* Match sidebars */
}
#chat-input-area textarea {
flex-grow: 1;
padding: 0.8em 1em;
border: 1px solid var(--progress-bar-background, var(--fallback-border));
background-color: var(--background-color, var(--fallback-bg));
color: var(--font-color, var(--fallback-font));
border-radius: 5px;
resize: none; /* Disable manual resize */
font-family: inherit;
font-size: 1em;
line-height: 1.4;
max-height: 150px; /* Limit excessive height */
overflow-y: auto;
/* rows: 2; */
}
#chat-input-area button {
/* Basic button styling - maybe inherit from main theme? */
padding: 0.6em 1.2em;
border: 1px solid var(--primary-dimmed-color, var(--fallback-primary-dimmed));
background-color: var(--primary-dimmed-color, var(--fallback-primary-dimmed));
color: var(--background-color, var(--fallback-bg));
border-radius: 5px;
cursor: pointer;
font-size: 0.9em;
transition: background-color 0.2s, border-color 0.2s;
height: min-content; /* Align with bottom of textarea */
}
#chat-input-area button:hover {
background-color: var(--primary-color, var(--fallback-primary));
border-color: var(--primary-color, var(--fallback-primary));
}
#chat-input-area button:disabled {
opacity: 0.6;
cursor: not-allowed;
}
.loading-indicator {
font-size: 0.9em;
color: var(--secondary-color, var(--fallback-secondary));
margin-right: 10px;
align-self: center;
}
/* --- Buttons --- */
/* Inherit some button styles if possible */
.btn.btn-sm {
color: var(--font-color, var(--fallback-font));
padding: 0.2em 0.5em;
font-size: 0.8em;
border: 1px solid var(--secondary-color, var(--fallback-secondary));
background: none;
border-radius: 3px;
cursor: pointer;
}
.btn.btn-sm:hover {
border-color: var(--font-color, var(--fallback-font));
background-color: var(--progress-bar-background, var(--fallback-border));
}
/* --- Basic Responsiveness --- */
@media screen and (max-width: 900px) {
.left-sidebar {
flex-basis: 200px; /* Shrink history */
}
.right-sidebar {
flex-basis: 240px; /* Shrink citations */
}
}
@media screen and (max-width: 768px) {
/* Stack layout on mobile? Or hide sidebars? Hiding for now */
.sidebar {
display: none; /* Hide sidebars on small screens */
}
/* Could add toggle buttons later */
}
/* ==== File: docs/ask_ai/ask-ai.css (Updates V4 - Delete Button) ==== */
.sidebar ul li {
/* Use flexbox to align link and delete button */
display: flex;
justify-content: space-between;
align-items: center;
padding: 0; /* Remove padding from li, add to link/button */
margin: 0.1em 0; /* Small vertical margin */
}
.sidebar ul li a {
/* Link takes most space */
flex-grow: 1;
padding: 0.3em 0.5em 0.3em 1em; /* Adjust padding */
/* Make ellipsis work for long titles */
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
/* Keep existing link styles */
color: var(--secondary-color, var(--fallback-secondary));
text-decoration: none;
display: block;
border-radius: 3px;
transition: background-color 0.2s, color 0.2s;
}
.sidebar ul li a:hover {
color: var(--primary-color, var(--fallback-primary));
background-color: rgba(80, 255, 255, 0.08);
}
/* Style for active history item's link */
#history-list li.active a {
color: var(--primary-dimmed-color, var(--fallback-primary-dimmed));
font-weight: bold;
background-color: rgba(80, 255, 255, 0.12);
}
/* --- Delete Chat Button --- */
.delete-chat-btn {
flex-shrink: 0; /* Don't shrink */
background: none;
border: none;
color: var(--secondary-color, var(--fallback-secondary));
cursor: pointer;
padding: 0.4em 0.8em; /* Padding around icon */
font-size: 0.9em;
opacity: 0.5; /* Dimmed by default */
transition: opacity 0.2s, color 0.2s;
margin-left: 5px; /* Space between link and button */
border-radius: 3px;
}
.sidebar ul li:hover .delete-chat-btn,
.delete-chat-btn:hover {
opacity: 1; /* Show fully on hover */
color: var(--error-color, #ff3c74); /* Use error color on hover */
}
.delete-chat-btn:focus {
outline: 1px dashed var(--error-color, #ff3c74); /* Accessibility */
opacity: 1;
}

docs/md_v2/ask_ai/ask-ai.js

@@ -0,0 +1,607 @@
// ==== File: docs/ask_ai/ask-ai.js (Marked, Streaming, History) ====
document.addEventListener("DOMContentLoaded", () => {
console.log("AI Assistant JS V2 Loaded");
// --- DOM Element Selectors ---
const historyList = document.getElementById("history-list");
const newChatButton = document.getElementById("new-chat-button");
const chatMessages = document.getElementById("chat-messages");
const chatInput = document.getElementById("chat-input");
const sendButton = document.getElementById("send-button");
const citationsList = document.getElementById("citations-list");
// --- Constants ---
const CHAT_INDEX_KEY = "aiAssistantChatIndex_v1";
const CHAT_PREFIX = "aiAssistantChat_v1_";
// --- State ---
let currentChatId = null;
let conversationHistory = []; // Holds message objects { sender: 'user'/'ai', text: '...' }
let isThinking = false;
let streamInterval = null; // To control the streaming interval
// --- Event Listeners ---
sendButton.addEventListener("click", handleSendMessage);
chatInput.addEventListener("keydown", handleInputKeydown);
newChatButton.addEventListener("click", handleNewChat);
chatInput.addEventListener("input", autoGrowTextarea);
// --- Initialization ---
loadChatHistoryIndex(); // Load history list on startup
const initialQuery = checkForInitialQuery(window.parent.location); // Check for query param
if (!initialQuery) {
loadInitialChat(); // Load normally if no query
}
// --- Core Functions ---
function handleSendMessage() {
const userMessageText = chatInput.value.trim();
if (!userMessageText || isThinking) return;
setThinking(true); // Start thinking state
// Add user message to state and UI
const userMessage = { sender: "user", text: userMessageText };
conversationHistory.push(userMessage);
addMessageToChat(userMessage, false); // Add user message to the UI (no thinking indicator)
chatInput.value = "";
autoGrowTextarea(); // Reset textarea height
// Prepare for AI response (create empty div)
const aiMessageDiv = addMessageToChat({ sender: "ai", text: "" }, true); // Add empty div with thinking indicator
// TODO: Generate fingerprint/JWT here
// TODO: Send `conversationHistory` + JWT to backend API
// Replace placeholder below with actual API call
// The backend should ideally return a stream of text tokens
// --- Placeholder Streaming Simulation ---
const simulatedFullResponse = `Okay, Heres a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:
\`\`\`python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:300]) # Print first 300 chars
if __name__ == "__main__":
asyncio.run(main())
\`\`\`
A code snippet: \`crawler.arun()\`. Check the [quickstart](/core/quickstart).`;
// Simulate receiving the response stream
streamSimulatedResponse(aiMessageDiv, simulatedFullResponse);
// // Simulate receiving citations *after* stream starts (or with first chunk)
// setTimeout(() => {
// addCitations([
// { title: "Simulated Doc 1", url: "#sim1" },
// { title: "Another Concept", url: "#sim2" },
// ]);
// }, 500); // Citations appear shortly after thinking starts
}
function handleInputKeydown(event) {
if (event.key === "Enter" && !event.shiftKey) {
event.preventDefault();
handleSendMessage();
}
}
function addMessageToChat(message, addThinkingIndicator = false) {
const messageDiv = document.createElement("div");
messageDiv.classList.add("message", `${message.sender}-message`);
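// Parse markdown for AI messages only; user messages are inserted as plain text
if (message.sender === "ai") {
messageDiv.innerHTML = message.text ? marked.parse(message.text) : "";
} else {
messageDiv.textContent = message.text || "";
}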
if (message.sender === "ai") {
// Apply Syntax Highlighting AFTER setting innerHTML
messageDiv.querySelectorAll("pre code:not(.hljs)").forEach((block) => {
if (typeof hljs !== "undefined") {
// Check if already highlighted to prevent double-highlighting issues
if (!block.classList.contains("hljs")) {
hljs.highlightElement(block);
}
} else {
console.warn("highlight.js (hljs) not found for syntax highlighting.");
}
});
// Add thinking indicator if needed (and not already present)
if (addThinkingIndicator && !message.text && !messageDiv.querySelector(".thinking-indicator-cursor")) {
const thinkingDiv = document.createElement("div");
thinkingDiv.className = "thinking-indicator-cursor";
messageDiv.appendChild(thinkingDiv);
}
} else {
// User messages remain plain text
// messageDiv.textContent = message.text;
}
// wrap each pre in a div.terminal
messageDiv.querySelectorAll("pre").forEach((block) => {
const wrapper = document.createElement("div");
wrapper.className = "terminal";
block.parentNode.insertBefore(wrapper, block);
wrapper.appendChild(block);
});
chatMessages.appendChild(messageDiv);
// Scroll only if user is near the bottom? (More advanced)
// Simple scroll for now:
scrollToBottom();
return messageDiv; // Return the created element
}
function streamSimulatedResponse(messageDiv, fullText) {
const thinkingIndicator = messageDiv.querySelector(".thinking-indicator-cursor");
if (thinkingIndicator) thinkingIndicator.remove();
const tokens = fullText.split(/(\s+)/);
let currentText = "";
let tokenIndex = 0;
// Clear previous interval just in case
if (streamInterval) clearInterval(streamInterval);
streamInterval = setInterval(() => {
const cursorSpan = '<span class="thinking-indicator-cursor"></span>'; // Cursor for streaming
if (tokenIndex < tokens.length) {
currentText += tokens[tokenIndex];
// Render intermediate markdown + cursor
messageDiv.innerHTML = marked.parse(currentText + cursorSpan);
// Re-highlighting on every stream update is left disabled below because it is inefficient;
// code blocks are highlighted once, when streaming completes.
// messageDiv.querySelectorAll('pre code:not(.hljs)').forEach((block) => {
// hljs.highlightElement(block);
// });
scrollToBottom(); // Keep scrolling as content streams
tokenIndex++;
} else {
// Streaming finished
clearInterval(streamInterval);
streamInterval = null;
// Final render without cursor
messageDiv.innerHTML = marked.parse(currentText);
// === Final Syntax Highlighting ===
messageDiv.querySelectorAll("pre code:not(.hljs)").forEach((block) => {
if (typeof hljs !== "undefined" && !block.classList.contains("hljs")) {
hljs.highlightElement(block);
}
});
// === Extract Citations ===
const citations = extractMarkdownLinks(currentText);
// Wrap each pre in a div.terminal
messageDiv.querySelectorAll("pre").forEach((block) => {
const wrapper = document.createElement("div");
wrapper.className = "terminal";
block.parentNode.insertBefore(wrapper, block);
wrapper.appendChild(block);
});
const aiMessage = { sender: "ai", text: currentText, citations: citations };
conversationHistory.push(aiMessage);
updateCitationsDisplay();
saveCurrentChat();
setThinking(false);
}
}, 50); // Adjust speed
}
// === NEW Function to Extract Links ===
function extractMarkdownLinks(markdownText) {
const regex = /\[([^\]]+)\]\(([^)]+)\)/g; // [text](url)
const citations = [];
let match;
while ((match = regex.exec(markdownText)) !== null) {
// Avoid adding self-links from within the citations list if AI includes them
if (!match[2].startsWith("#citation-")) {
citations.push({
title: match[1].trim(),
url: match[2].trim(),
});
}
}
// Optional: Deduplicate links based on URL
const uniqueCitations = citations.filter(
(citation, index, self) => index === self.findIndex((c) => c.url === citation.url)
);
return uniqueCitations;
}
// === REVISED Function to Display Citations ===
function updateCitationsDisplay() {
let lastCitations = null;
// Find the most recent AI message with citations
for (let i = conversationHistory.length - 1; i >= 0; i--) {
if (
conversationHistory[i].sender === "ai" &&
conversationHistory[i].citations &&
conversationHistory[i].citations.length > 0
) {
lastCitations = conversationHistory[i].citations;
break; // Found the latest citations
}
}
citationsList.innerHTML = ""; // Clear previous
if (!lastCitations) {
citationsList.innerHTML = '<li class="no-citations">No citations available.</li>';
return;
}
lastCitations.forEach((citation, index) => {
const li = document.createElement("li");
const a = document.createElement("a");
// Generate a unique ID for potential internal linking if needed
// a.id = `citation-${index}`;
a.href = citation.url || "#";
a.textContent = citation.title;
a.target = "_top"; // Open in main window
li.appendChild(a);
citationsList.appendChild(li);
});
}
function addCitations(citations) {
citationsList.innerHTML = ""; // Clear
if (!citations || citations.length === 0) {
citationsList.innerHTML = '<li class="no-citations">No citations available.</li>';
return;
}
citations.forEach((citation) => {
const li = document.createElement("li");
const a = document.createElement("a");
a.href = citation.url || "#";
a.textContent = citation.title;
a.target = "_top"; // Open in main window
li.appendChild(a);
citationsList.appendChild(li);
});
}
function setThinking(thinking) {
isThinking = thinking;
sendButton.disabled = thinking;
chatInput.disabled = thinking;
chatInput.placeholder = thinking ? "AI is responding..." : "Ask about Crawl4AI...";
// Stop any existing stream if we start thinking again (e.g., rapid resend)
if (thinking && streamInterval) {
clearInterval(streamInterval);
streamInterval = null;
}
}
function autoGrowTextarea() {
chatInput.style.height = "auto";
chatInput.style.height = `${chatInput.scrollHeight}px`;
}
function scrollToBottom() {
chatMessages.scrollTop = chatMessages.scrollHeight;
}
// --- Query Parameter Handling ---
function checkForInitialQuery(locationToCheck) {
// <-- Receive location object
if (!locationToCheck) {
console.warn("Ask AI: Could not access parent window location.");
return false;
}
const urlParams = new URLSearchParams(locationToCheck.search); // <-- Use passed location's search string
const encodedQuery = urlParams.get("qq"); // <-- Use 'qq'
if (encodedQuery) {
console.log("Initial query found (qq):", encodedQuery);
try {
const decodedText = decodeURIComponent(escape(atob(encodedQuery)));
console.log("Decoded query:", decodedText);
// Start new chat immediately
handleNewChat(true);
// Delay setting input and sending message slightly
setTimeout(() => {
chatInput.value = decodedText;
autoGrowTextarea();
handleSendMessage();
// Clean the PARENT window's URL
try {
const cleanUrl = locationToCheck.pathname;
// Use parent's history object
window.parent.history.replaceState({}, window.parent.document.title, cleanUrl);
} catch (e) {
console.warn("Ask AI: Could not clean parent URL using replaceState.", e);
// This might fail due to cross-origin restrictions if served differently,
// but should work fine with mkdocs serve on the same origin.
}
}, 100);
return true; // Query processed
} catch (e) {
console.error("Error decoding initial query (qq):", e);
// Clean the PARENT window's URL even on error
try {
const cleanUrl = locationToCheck.pathname;
window.parent.history.replaceState({}, window.parent.document.title, cleanUrl);
} catch (cleanError) {
console.warn("Ask AI: Could not clean parent URL after decode error.", cleanError);
}
return false;
}
}
return false; // No 'qq' query found
}
// --- History Management ---
function handleNewChat(isFromQuery = false) {
if (isThinking) return; // Don't allow new chat while responding
// Only save if NOT triggered immediately by a query parameter load
if (!isFromQuery) {
saveCurrentChat();
}
currentChatId = `chat_${Date.now()}`;
conversationHistory = []; // Clear message history state
chatMessages.innerHTML = ""; // Start with clean slate for query
if (!isFromQuery) {
// Show welcome only if manually started
// chatMessages.innerHTML =
// '<div class="message ai-message welcome-message">Started a new chat! Ask me anything about Crawl4AI.</div>';
chatMessages.innerHTML =
'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
}
addCitations([]); // Clear citations
updateCitationsDisplay(); // Clear UI
// Add to index and save
let index = loadChatIndex();
// Generate a generic title initially, update later
const newTitle = isFromQuery ? "Chat from Selection" : `Chat ${new Date().toLocaleString()}`;
// index.unshift({ id: currentChatId, title: `Chat ${new Date().toLocaleString()}` }); // Add to start
index.unshift({ id: currentChatId, title: newTitle });
saveChatIndex(index);
renderHistoryList(index); // Update UI
setActiveHistoryItem(currentChatId);
saveCurrentChat(); // Save the empty new chat state
}
function loadChat(chatId) {
if (isThinking || chatId === currentChatId) return;
// Check if chat data actually exists before proceeding
const storedChat = localStorage.getItem(CHAT_PREFIX + chatId);
if (storedChat === null) {
console.warn(`Attempted to load non-existent chat: ${chatId}. Removing from index.`);
deleteChatData(chatId); // Clean up index
loadChatHistoryIndex(); // Reload history list
loadInitialChat(); // Load next available chat
return;
}
console.log(`Loading chat: ${chatId}`);
saveCurrentChat(); // Save current before switching
try {
conversationHistory = JSON.parse(storedChat);
currentChatId = chatId;
renderChatMessages(conversationHistory);
updateCitationsDisplay();
setActiveHistoryItem(chatId);
} catch (e) {
console.error("Error loading chat:", chatId, e);
alert("Failed to load chat data.");
conversationHistory = [];
renderChatMessages(conversationHistory);
updateCitationsDisplay();
}
}
function saveCurrentChat() {
if (currentChatId && conversationHistory.length > 0) {
try {
localStorage.setItem(CHAT_PREFIX + currentChatId, JSON.stringify(conversationHistory));
console.log(`Chat ${currentChatId} saved.`);
// Update title in index (e.g., use first user message)
let index = loadChatIndex();
const currentItem = index.find((item) => item.id === currentChatId);
if (
currentItem &&
conversationHistory[0]?.sender === "user" &&
!currentItem.title.startsWith("Chat about:")
) {
currentItem.title = `Chat about: ${conversationHistory[0].text.substring(0, 30)}...`;
saveChatIndex(index);
// Re-render history list if title changed - small optimization needed here maybe
renderHistoryList(index);
setActiveHistoryItem(currentChatId); // Re-set active after re-render
}
} catch (e) {
console.error("Error saving chat:", currentChatId, e);
// Handle potential storage full errors
if (e.name === "QuotaExceededError") {
alert("Local storage is full. Cannot save chat history.");
// Consider implementing history pruning logic here
}
}
} else if (currentChatId) {
// Save empty state for newly created chats if needed, or remove?
localStorage.setItem(CHAT_PREFIX + currentChatId, JSON.stringify([]));
}
}
function loadChatIndex() {
try {
const storedIndex = localStorage.getItem(CHAT_INDEX_KEY);
return storedIndex ? JSON.parse(storedIndex) : [];
} catch (e) {
console.error("Error loading chat index:", e);
return []; // Return empty array on error
}
}
function saveChatIndex(indexArray) {
try {
localStorage.setItem(CHAT_INDEX_KEY, JSON.stringify(indexArray));
} catch (e) {
console.error("Error saving chat index:", e);
}
}
function renderHistoryList(indexArray) {
historyList.innerHTML = ""; // Clear existing
if (!indexArray || indexArray.length === 0) {
historyList.innerHTML = '<li class="no-history">No past chats found.</li>';
return;
}
indexArray.forEach((item) => {
const li = document.createElement("li");
li.dataset.chatId = item.id; // Add ID to li for easier selection
const a = document.createElement("a");
a.href = "#";
a.dataset.chatId = item.id;
a.textContent = item.title || `Chat ${item.id.split("_")[1] || item.id}`;
a.title = a.textContent; // Tooltip for potentially long titles
a.addEventListener("click", (e) => {
e.preventDefault();
loadChat(item.id);
});
// === Add Delete Button ===
const deleteBtn = document.createElement("button");
deleteBtn.className = "delete-chat-btn";
deleteBtn.innerHTML = "✕"; // "X" glyph used as the delete icon (could be swapped for an SVG or FontAwesome icon)
deleteBtn.title = "Delete Chat";
deleteBtn.dataset.chatId = item.id; // Store ID on button too
deleteBtn.addEventListener("click", handleDeleteChat);
li.appendChild(a);
li.appendChild(deleteBtn); // Append button to the list item
historyList.appendChild(li);
});
}
function renderChatMessages(messages) {
chatMessages.innerHTML = ""; // Clear existing messages
messages.forEach((message) => {
// Ensure highlighting is applied when loading from history
addMessageToChat(message, false);
});
if (messages.length === 0) {
// chatMessages.innerHTML =
// '<div class="message ai-message welcome-message">Chat history loaded. Ask a question!</div>';
chatMessages.innerHTML =
'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
}
// Scroll to bottom after loading messages
scrollToBottom();
}
function setActiveHistoryItem(chatId) {
document.querySelectorAll("#history-list li").forEach((li) => li.classList.remove("active"));
// Select the LI element directly now
const activeLi = document.querySelector(`#history-list li[data-chat-id="${chatId}"]`);
if (activeLi) {
activeLi.classList.add("active");
}
}
function loadInitialChat() {
const index = loadChatIndex();
if (index.length > 0) {
loadChat(index[0].id);
} else {
// Check if handleNewChat wasn't already called by query handler
if (!currentChatId) {
handleNewChat();
}
}
}
function loadChatHistoryIndex() {
const index = loadChatIndex();
renderHistoryList(index);
if (currentChatId) setActiveHistoryItem(currentChatId);
}
// === NEW Function to Handle Delete Click ===
function handleDeleteChat(event) {
event.stopPropagation(); // Prevent triggering loadChat on the link behind it
const button = event.currentTarget;
const chatIdToDelete = button.dataset.chatId;
if (!chatIdToDelete) return;
// Confirmation dialog
if (
window.confirm(
`Are you sure you want to delete this chat session?\n"${
button.previousElementSibling?.textContent || "Chat " + chatIdToDelete
}"`
)
) {
console.log(`Deleting chat: ${chatIdToDelete}`);
// Perform deletion
const updatedIndex = deleteChatData(chatIdToDelete);
// If the deleted chat was the currently active one, load another chat
if (currentChatId === chatIdToDelete) {
currentChatId = null; // Reset current ID
conversationHistory = []; // Clear state
if (updatedIndex.length > 0) {
// Load the new top chat (most recent remaining)
loadChat(updatedIndex[0].id);
} else {
// No chats left, start a new one
handleNewChat();
}
} else {
// If a different chat was deleted, just re-render the list
renderHistoryList(updatedIndex);
// Re-apply active state in case IDs shifted (though they shouldn't)
setActiveHistoryItem(currentChatId);
}
}
}
// === NEW Function to Delete Chat Data ===
function deleteChatData(chatId) {
// Remove chat data
localStorage.removeItem(CHAT_PREFIX + chatId);
// Update index
let index = loadChatIndex();
index = index.filter((item) => item.id !== chatId);
saveChatIndex(index);
console.log(`Chat ${chatId} data and index entry removed.`);
return index; // Return the updated index
}
// --- Virtual Scrolling Placeholder ---
// NOTE: Virtual scrolling is complex. For now, we do direct rendering.
// If performance becomes an issue with very long chats/history,
// investigate libraries like 'simple-virtual-scroll' or 'virtual-scroller'.
// You would replace parts of `renderChatMessages` and `renderHistoryList`
// to work with the chosen library's API (providing data and item renderers).
console.warn("Virtual scrolling not implemented. Performance may degrade with very long chat histories.");
});
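
A note on the `qq` decoding in `checkForInitialQuery` above: the `decodeURIComponent(escape(atob(...)))` idiom works, but `escape` is a deprecated legacy function. The sketch below shows one possible replacement built on `TextDecoder`, together with the matching encoder a caller would need. The names `encodeQueryParam` and `decodeQueryParam` are hypothetical and do not exist in `ask-ai.js`; treat this as an illustration under those assumptions rather than the project's implementation.

```javascript
// Hypothetical helpers (not part of ask-ai.js): UTF-8 safe Base64 round trip.

// Encode a Unicode string as Base64 of its UTF-8 bytes.
function encodeQueryParam(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // one char per byte (0-255)
  }
  return btoa(binary);
}

// Decode Base64 back to a Unicode string.
// `fatal: true` makes malformed UTF-8 throw, so an existing try/catch still fires.
function decodeQueryParam(encoded) {
  const binary = atob(encoded);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}
```

With helpers like these, the decode line could become `const decodedText = decodeQueryParam(encodedQuery);` while the surrounding error handling stays unchanged.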

@@ -0,0 +1,64 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Crawl4AI Assistant</title>
<!-- Link main styles first for variable access -->
<link rel="stylesheet" href="../assets/layout.css">
<link rel="stylesheet" href="../assets/styles.css">
<!-- Link specific AI styles -->
<link rel="stylesheet" href="../assets/highlight.css">
<link rel="stylesheet" href="ask-ai.css">
</head>
<body>
<div class="ai-assistant-container">
<!-- Left Sidebar: Conversation History -->
<aside id="history-panel" class="sidebar left-sidebar">
<header>
<h3>History</h3>
<button id="new-chat-button" class="btn btn-sm">New Chat</button>
</header>
<ul id="history-list">
<!-- History items populated by JS -->
</ul>
</aside>
<!-- Main Area: Chat Interface -->
<main id="chat-panel">
<div id="chat-messages">
<!-- Chat messages populated by JS -->
<div class="message ai-message welcome-message">
Welcome to the Crawl4AI Assistant! How can I help you today?
</div>
</div>
<div id="chat-input-area">
<!-- Loading indicator for general waiting (optional) -->
<!-- <div class="loading-indicator" style="display: none;">Thinking...</div> -->
<textarea id="chat-input" placeholder="We will roll out this feature very soon." rows="2" disabled></textarea>
<button id="send-button">Send</button>
</div>
</main>
<!-- Right Sidebar: Citations / Context -->
<aside id="citations-panel" class="sidebar right-sidebar">
<header>
<h3>Citations</h3>
</header>
<ul id="citations-list">
<!-- Citations populated by JS -->
<li class="no-citations">No citations for this response yet.</li>
</ul>
</aside>
</div>
<!-- Include Marked.js library -->
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="../assets/highlight.min.js"></script>
<!-- Your AI Assistant Logic -->
<script src="ask-ai.js"></script>
</body>
</html>

@@ -0,0 +1,62 @@
// ==== File: docs/assets/copy_code.js ====
document.addEventListener('DOMContentLoaded', () => {
// Target specifically code blocks within the main content area
const codeBlocks = document.querySelectorAll('#terminal-mkdocs-main-content pre > code');
codeBlocks.forEach((codeElement) => {
const preElement = codeElement.parentElement; // The <pre> tag
// Ensure the <pre> tag can contain a positioned button
if (window.getComputedStyle(preElement).position === 'static') {
preElement.style.position = 'relative';
}
// Create the button
const copyButton = document.createElement('button');
copyButton.className = 'copy-code-button';
copyButton.type = 'button';
copyButton.setAttribute('aria-label', 'Copy code to clipboard');
copyButton.title = 'Copy code to clipboard';
copyButton.innerHTML = 'Copy'; // Or use an icon like an SVG or FontAwesome class
// Append the button to the <pre> element
preElement.appendChild(copyButton);
// Add click event listener
copyButton.addEventListener('click', () => {
copyCodeToClipboard(codeElement, copyButton);
});
});
async function copyCodeToClipboard(codeElement, button) {
// Use innerText to get the rendered text content, preserving line breaks
const textToCopy = codeElement.innerText;
try {
await navigator.clipboard.writeText(textToCopy);
// Visual feedback
button.innerHTML = 'Copied!';
button.classList.add('copied');
button.disabled = true; // Temporarily disable
// Revert button state after a short delay
setTimeout(() => {
button.innerHTML = 'Copy';
button.classList.remove('copied');
button.disabled = false;
}, 2000); // Show "Copied!" for 2 seconds
} catch (err) {
console.error('Failed to copy code: ', err);
// Optional: Provide error feedback on the button
button.innerHTML = 'Error';
setTimeout(() => {
button.innerHTML = 'Copy';
}, 2000);
}
}
console.log("Copy Code Button script loaded.");
});
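
A note on `copyCodeToClipboard` above: `navigator.clipboard` is only exposed in secure contexts (HTTPS or localhost), so on a plain-HTTP deployment the property is undefined and the call throws before any copy is attempted. The sketch below isolates the copy step behind an explicit availability check. The helper name `copyText` and its injectable `clipboard` parameter are hypothetical, added here so the logic can be exercised without a browser.

```javascript
// Hypothetical helper (not part of copy_code.js).
// Resolves to true when the text was written, false when the Clipboard API
// is unavailable or the write was rejected (e.g. permission denied).
async function copyText(text, clipboard = globalThis.navigator?.clipboard) {
  if (!clipboard || typeof clipboard.writeText !== "function") {
    return false; // Clipboard API not available in this context
  }
  try {
    await clipboard.writeText(text);
    return true;
  } catch (err) {
    return false;
  }
}
```

A caller could then choose the button label from the boolean result instead of relying on the `catch` branch alone.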

@@ -0,0 +1,39 @@
// ==== File: docs/assets/floating_ask_ai_button.js ====
document.addEventListener('DOMContentLoaded', () => {
const askAiPagePath = '/core/ask-ai/'; // IMPORTANT: Adjust this path if needed!
const currentPath = window.location.pathname;
// NOTE: The link created below is root-relative, so it assumes the docs are served
// from the site root. Adjust `askAiPagePath` if the site is deployed in a sub-directory.
// Check if the current page IS the Ask AI page
// Use includes() for flexibility (handles trailing slash or .html)
if (currentPath.includes(askAiPagePath.replace(/\/$/, ''))) { // Remove trailing slash for includes check
console.log("Floating Ask AI Button: Not adding button on the Ask AI page itself.");
return; // Don't add the button on the target page
}
// --- Create the button ---
const fabLink = document.createElement('a');
fabLink.className = 'floating-ask-ai-button';
fabLink.href = askAiPagePath; // Root-relative link to the Ask AI page
fabLink.title = 'Ask Crawl4AI Assistant';
fabLink.setAttribute('aria-label', 'Ask Crawl4AI Assistant');
// Add content (using SVG icon for better visuals)
fabLink.innerHTML = `
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="24" height="24" fill="currentColor">
<path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2zm-2 12H6v-2h12v2zm0-3H6V9h12v2zm0-3H6V6h12v2z"/>
</svg>
<span>Ask AI</span>
`;
// Append to body
document.body.appendChild(fabLink);
console.log("Floating Ask AI Button added.");
});
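
A note on the page check above: `currentPath.includes('/core/ask-ai')` also matches unrelated paths that merely contain that string, such as a hypothetical `/core/ask-ai-faq/`. The sketch below shows a stricter comparison that still tolerates a trailing slash, `.html`, `index.html`, and a sub-directory prefix. The function names `normalizePath` and `isAskAiPage` are assumptions introduced for illustration only.

```javascript
// Hypothetical helpers (not part of floating_ask_ai_button.js).

// Strip "index.html", a ".html" suffix, and trailing slashes.
function normalizePath(path) {
  return path
    .replace(/index\.html$/, "")
    .replace(/\.html$/, "")
    .replace(/\/+$/, "");
}

// True when currentPath points at the Ask AI page, optionally under a prefix.
function isAskAiPage(currentPath, askAiPagePath = "/core/ask-ai/") {
  return normalizePath(currentPath).endsWith(normalizePath(askAiPagePath));
}
```

Using `endsWith` on normalized paths keeps the sub-directory tolerance of the original check while rejecting paths that only share a prefix with the target.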

@@ -0,0 +1,119 @@
// ==== File: assets/github_stats.js ====
document.addEventListener('DOMContentLoaded', async () => {
// --- Configuration ---
const targetHeaderSelector = '.terminal .container:first-child'; // Selector for your header container
const insertBeforeSelector = '.terminal-nav'; // Selector for the element to insert the badge BEFORE (e.g., the main nav)
// Or set to null to append at the end of the header.
// --- Find elements ---
const headerContainer = document.querySelector(targetHeaderSelector);
if (!headerContainer) {
console.warn('GitHub Stats: Header container not found with selector:', targetHeaderSelector);
return;
}
const repoLinkElement = headerContainer.querySelector('a[href*="github.com/"]'); // Find the existing GitHub link
let repoUrl = 'https://github.com/unclecode/crawl4ai';
// if (repoLinkElement) {
// repoUrl = repoLinkElement.href;
// } else {
// // Fallback: Try finding from config (requires template injection - harder)
// // Or hardcode if necessary, but reading from the link is better.
// console.warn('GitHub Stats: GitHub repo link not found in header.');
// // Try to get repo_url from mkdocs config if available globally (less likely)
// // repoUrl = window.mkdocs_config?.repo_url; // Requires setting this variable
// // if (!repoUrl) return; // Exit if still no URL
// return; // Exit for now if link isn't found
// }
// --- Extract Repo Owner/Name ---
let owner = '';
let repo = '';
try {
const url = new URL(repoUrl);
const pathParts = url.pathname.split('/').filter(part => part.length > 0);
if (pathParts.length >= 2) {
owner = pathParts[0];
repo = pathParts[1];
}
} catch (e) {
console.error('GitHub Stats: Could not parse repository URL:', repoUrl, e);
return;
}
if (!owner || !repo) {
console.warn('GitHub Stats: Could not extract owner/repo from URL:', repoUrl);
return;
}
// --- Get Version (Attempt to extract from site title) ---
let version = '';
const siteTitleElement = headerContainer.querySelector('.terminal-title, .site-title'); // Adjust selector based on theme's title element
// Example title: "Crawl4AI Documentation (v0.5.x)"
if (siteTitleElement) {
const match = siteTitleElement.textContent.match(/\((v?[^)]+)\)/); // Look for text in parentheses starting with 'v' (optional)
if (match && match[1]) {
version = match[1].trim();
}
}
if (!version) {
console.info('GitHub Stats: Could not extract version from title. You might need to adjust the selector or regex.');
// You could fallback to config.extra.version if injected into JS
// version = window.mkdocs_config?.extra?.version || 'N/A';
}
// --- Fetch GitHub API Data ---
let stars = '...';
let forks = '...';
try {
const apiUrl = `https://api.github.com/repos/${owner}/${repo}`;
const response = await fetch(apiUrl);
if (response.ok) {
const data = await response.json();
// Format large numbers (optional)
stars = data.stargazers_count >= 1000 ? `${(data.stargazers_count / 1000).toFixed(1)}k` : data.stargazers_count;
forks = data.forks_count >= 1000 ? `${(data.forks_count / 1000).toFixed(1)}k` : data.forks_count;
} else {
console.warn(`GitHub Stats: API request failed with status ${response.status}. Rate limit exceeded?`);
stars = 'N/A';
forks = 'N/A';
}
} catch (error) {
console.error('GitHub Stats: Error fetching repository data:', error);
stars = 'N/A';
forks = 'N/A';
}
// --- Create Badge HTML ---
const badgeContainer = document.createElement('div');
badgeContainer.className = 'github-stats-badge';
// Use innerHTML for simplicity, including potential icons (requires FontAwesome or similar)
// Ensure your theme loads FontAwesome or add it yourself if you want icons.
badgeContainer.innerHTML = `
<a href="${repoUrl}" target="_blank" rel="noopener">
<!-- Optional Icon (FontAwesome example) -->
<!-- <i class="fab fa-github"></i> -->
<span class="repo-name">${owner}/${repo}</span>
${version ? `<span class="stat version"><i class="fas fa-tag"></i> ${version}</span>` : ''}
<span class="stat stars"><i class="fas fa-star"></i> ${stars}</span>
<span class="stat forks"><i class="fas fa-code-branch"></i> ${forks}</span>
</a>
`;
// --- Inject Badge into Header ---
const insertBeforeElement = insertBeforeSelector ? headerContainer.querySelector(insertBeforeSelector) : null;
if (insertBeforeElement) {
// Note: the badge is appended inside the matched element rather than inserted before it
// headerContainer.insertBefore(badgeContainer, insertBeforeElement);
insertBeforeElement.appendChild(badgeContainer);
} else {
headerContainer.appendChild(badgeContainer);
}
console.info('GitHub Stats: Badge added to header.');
});
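
A note on the parsing and formatting steps above: both are pure logic and can be pulled out of the `DOMContentLoaded` handler for easier testing. The sketch below shows one way to do that. The names `parseOwnerRepo` and `formatCount` are hypothetical, and treating exactly 1000 as `1.0k` is a choice made here for illustration.

```javascript
// Hypothetical helpers (not part of github_stats.js).

// Extract { owner, repo } from a GitHub repository URL, or null if not possible.
function parseOwnerRepo(repoUrl) {
  try {
    const parts = new URL(repoUrl).pathname.split("/").filter((p) => p.length > 0);
    return parts.length >= 2 ? { owner: parts[0], repo: parts[1] } : null;
  } catch (e) {
    return null; // not a valid absolute URL
  }
}

// Abbreviate counts of one thousand or more with one decimal and a "k" suffix.
function formatCount(count) {
  return count >= 1000 ? `${(count / 1000).toFixed(1)}k` : String(count);
}
```

For example, `formatCount(1500)` yields `"1.5k"`, and `parseOwnerRepo` returns `null` for input that cannot be parsed instead of throwing.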

@@ -0,0 +1,576 @@
/* ==== File: assets/layout.css (Non-Fluid Centered Layout) ==== */
:root {
--header-height: 55px; /* Adjust if needed */
--sidebar-width: 280px; /* Adjust if needed */
--toc-width: 340px; /* As specified */
--content-max-width: 90em; /* Max width for the centered content */
--layout-transition-speed: 0.2s;
--global-space: 10px;
}
/* --- Basic Setup --- */
html {
scroll-behavior: smooth;
scroll-padding-top: calc(var(--header-height) + 15px);
box-sizing: border-box;
}
*, *:before, *:after {
box-sizing: inherit;
}
body {
padding-top: 0;
padding-bottom: 0;
background-color: var(--background-color);
color: var(--font-color);
/* Prevents horizontal scrollbars during transitions */
overflow-x: hidden;
}
/* --- Fixed Header --- */
/* Full width, fixed header */
.terminal .container:first-child { /* Assuming this targets the header container */
position: fixed;
top: 0;
left: 0;
right: 0;
height: var(--header-height);
background-color: var(--background-color);
z-index: 1000;
border-bottom: 1px solid var(--progress-bar-background);
max-width: none; /* Override any container max-width */
padding: 0 calc(var(--global-space) * 2);
}
/* --- Main Layout Container (Below Header) --- */
/* This container just provides space for the fixed header */
.container:has(.terminal-mkdocs-main-grid) {
margin: 0 auto;
padding: 0;
padding-top: var(--header-height); /* Space for fixed header */
}
/* --- Flex Container: Grid holding content and toc (CENTERED) --- */
/* THIS is the main centered block */
.terminal-mkdocs-main-grid {
display: flex;
align-items: flex-start;
/* Enforce max-width and center */
max-width: var(--content-max-width);
margin-left: auto;
margin-right: auto;
position: relative;
/* Apply side padding within the centered block */
padding-left: calc(var(--global-space) * 2);
padding-right: calc(var(--global-space) * 2);
/* Add margin-left to clear the fixed sidebar - ONLY ON DESKTOP */
margin-left: var(--sidebar-width);
}
/* --- 1. Fixed Left Sidebar (Viewport Relative) --- */
#terminal-mkdocs-side-panel {
position: fixed;
top: var(--header-height);
left: max(0px, calc((90vw - var(--content-max-width)) / 2));
bottom: 0;
width: var(--sidebar-width);
background-color: var(--background-color);
border-right: 1px solid var(--progress-bar-background);
overflow-y: auto;
z-index: 900;
padding: 1em calc(var(--global-space) * 2);
padding-bottom: 2em;
transition: left var(--layout-transition-speed) ease-in-out;
}
/* --- 2. Main Content Area (Within Centered Grid) --- */
#terminal-mkdocs-main-content {
flex-grow: 1;
flex-shrink: 1;
min-width: 0; /* Flexbox shrink fix */
/* No left/right margins needed here - handled by parent grid */
margin-left: 0;
margin-right: 0;
/* Internal Padding */
padding: 1.5em 2em;
position: relative;
z-index: 1;
}
/* --- 3. Right Table of Contents (Sticky, Within Centered Grid) --- */
#toc-sidebar {
flex-basis: var(--toc-width);
flex-shrink: 0;
width: var(--toc-width);
position: sticky; /* Sticks within the centered grid */
top: var(--header-height);
align-self: stretch;
height: calc(100vh - var(--header-height));
overflow-y: auto;
padding: 1.5em 1em;
font-size: 0.85em;
border-left: 1px solid var(--progress-bar-background);
z-index: 800;
/* display: none; /* JS handles */
}
/* (ToC link styles remain the same) */
#toc-sidebar h4 { margin-top: 0; margin-bottom: 1em; font-size: 1.1em; color: var(--secondary-color); padding-left: 0.8em; }
#toc-sidebar ul { list-style: none; padding: 0; margin: 0; }
#toc-sidebar ul li a { display: block; padding: 0.3em 0; color: var(--secondary-color); text-decoration: none; border-left: 3px solid transparent; padding-left: 0.8em; transition: all 0.1s ease-in-out; line-height: 1.4; word-break: break-word; }
#toc-sidebar ul li.toc-level-3 a { padding-left: 1.8em; }
#toc-sidebar ul li.toc-level-4 a { padding-left: 2.8em; }
#toc-sidebar ul li a:hover { color: var(--font-color); background-color: rgba(255, 255, 255, 0.05); }
#toc-sidebar ul li a.active { color: var(--primary-color); border-left-color: var(--primary-color); background-color: rgba(80, 255, 255, 0.08); }
/* --- Footer Styling (Respects Centered Layout) --- */
footer {
background-color: var(--code-bg-color);
color: var(--secondary-color);
position: relative;
z-index: 10;
margin-top: 2em;
/* Apply margin-left to clear the fixed sidebar */
margin-left: var(--sidebar-width);
/* Constrain width relative to the centered grid it follows */
max-width: calc(var(--content-max-width) - var(--sidebar-width));
margin-right: auto; /* Keep it left-aligned within the space next to sidebar */
/* Use padding consistent with the grid */
padding: 2em calc(var(--global-space) * 2);
}
/* Adjust footer grid if needed */
.terminal-mkdocs-footer-grid {
display: grid;
grid-template-columns: 1fr auto;
gap: 1em;
align-items: center;
}
/* ==========================================================================
RESPONSIVENESS (Adapting the Non-Fluid Layout)
========================================================================== */
/* --- Medium screens: Hide ToC --- */
@media screen and (max-width: 1200px) {
#toc-sidebar {
display: none;
}
.terminal-mkdocs-main-grid {
/* Grid adjusts automatically as ToC is removed */
/* Ensure grid padding remains */
padding-left: calc(var(--global-space) * 2);
padding-right: calc(var(--global-space) * 2);
}
#terminal-mkdocs-main-content {
/* Content area naturally expands */
}
footer {
/* Footer still respects the left sidebar and overall max width */
margin-left: var(--sidebar-width);
max-width: calc(var(--content-max-width) - var(--sidebar-width));
/* Padding remains consistent */
padding-left: calc(var(--global-space) * 2);
padding-right: calc(var(--global-space) * 2);
}
}
/* --- Mobile Menu Styles --- */
.mobile-menu-toggle {
display: none; /* Hidden by default, shown in mobile */
background: none;
border: none;
padding: 10px;
cursor: pointer;
z-index: 1200;
margin-right: 10px;
position: absolute;
left: 10px;
top: 50%;
transform: translateY(-50%);
/* Make sure it doesn't get moved */
min-width: 30px;
min-height: 30px;
}
.hamburger-line {
display: block;
width: 22px;
height: 2px;
margin: 5px 0;
background-color: var(--font-color);
transition: transform 0.3s, opacity 0.3s;
}
/* Hamburger animation */
.mobile-menu-toggle.is-active .hamburger-line:nth-child(1) {
transform: translateY(7px) rotate(45deg);
}
.mobile-menu-toggle.is-active .hamburger-line:nth-child(2) {
opacity: 0;
}
.mobile-menu-toggle.is-active .hamburger-line:nth-child(3) {
transform: translateY(-7px) rotate(-45deg);
}
.mobile-menu-close {
display: none; /* Hidden by default, shown in mobile */
position: absolute;
top: 10px;
right: 10px;
background: none;
border: none;
color: var(--font-color);
font-size: 24px;
cursor: pointer;
z-index: 1200;
padding: 5px 10px;
}
.mobile-menu-backdrop {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background-color: rgba(0, 0, 0, 0.7);
z-index: 1050;
}
/* --- Small screens: Hide left sidebar, full width content & footer --- */
@media screen and (max-width: 768px) {
/* Hide the terminal-menu from theme */
.terminal-menu {
display: none !important;
}
/* Add padding to site name to prevent hamburger overlap */
.terminal-mkdocs-site-name,
.terminal-logo a,
.terminal-nav .logo {
padding-left: 40px !important;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
}
/* Show mobile menu toggle button */
.mobile-menu-toggle {
display: block;
}
/* Show mobile menu close button */
.mobile-menu-close {
display: block;
}
#terminal-mkdocs-side-panel {
left: -100%; /* Hide completely off-screen */
z-index: 1100;
box-shadow: 2px 0 10px rgba(0,0,0,0.3);
top: 0; /* Start from top edge */
height: 100%; /* Full height */
transition: left 0.3s ease-in-out;
padding-top: 50px; /* Space for close button */
overflow-y: auto;
width: 85%; /* Wider on mobile */
max-width: 320px; /* Maximum width */
background-color: var(--background-color); /* Ensure solid background */
}
#terminal-mkdocs-side-panel.sidebar-visible {
left: 0;
}
/* Make navigation links more touch-friendly */
#terminal-mkdocs-side-panel a {
padding: 6px 15px;
display: block;
/* No border as requested */
}
#terminal-mkdocs-side-panel ul {
padding-left: 0;
}
#terminal-mkdocs-side-panel ul ul a {
padding-left: 10px;
}
.terminal-mkdocs-main-grid {
/* Grid now takes full width (minus body padding) */
margin-left: 0 !important; /* Override sidebar margin with !important */
margin-right: 0; /* Override auto margin */
max-width: 100%; /* Allow full width */
padding-left: var(--global-space); /* Reduce padding */
padding-right: var(--global-space);
}
#terminal-mkdocs-main-content {
padding: 1.5em 1em; /* Adjust internal padding */
}
footer {
margin-left: 0; /* Full width footer */
max-width: 100%; /* Allow full width */
padding: 2em 1em; /* Adjust internal padding */
}
.terminal-mkdocs-footer-grid {
grid-template-columns: 1fr; /* Stack footer items */
text-align: center;
gap: 0.5em;
}
}
/* ==== GitHub Stats Badge Styling ==== */
.github-stats-badge {
display: inline-block; /* Or flex if needed */
margin-left: 2em; /* Adjust spacing */
vertical-align: middle; /* Align with other header items */
font-size: 0.9em; /* Slightly smaller font */
}
.github-stats-badge a {
color: var(--secondary-color); /* Use secondary color */
text-decoration: none;
display: flex; /* Use flex for alignment */
align-items: center;
gap: 0.8em; /* Space between items */
padding: 0.2em 0.5em;
border: 1px solid var(--progress-bar-background); /* Subtle border */
border-radius: 4px;
transition: color 0.2s, background-color 0.2s;
}
.github-stats-badge a:hover {
color: var(--font-color); /* Brighter color on hover */
background-color: var(--progress-bar-background); /* Subtle background on hover */
}
.github-stats-badge .repo-name {
color: var(--font-color); /* Make repo name stand out slightly */
font-weight: 500; /* Optional bolder weight */
}
.github-stats-badge .stat {
/* Styles for individual stats (version, stars, forks) */
white-space: nowrap; /* Prevent wrapping */
}
.github-stats-badge .stat i {
/* Optional: Style for FontAwesome icons */
margin-right: 0.3em;
color: var(--secondary-dimmed-color); /* Dimmer color for icons */
}
/* Adjust positioning relative to search/nav if needed */
/* Example: If search is floated right */
/* .terminal-nav { float: left; } */
/* .github-stats-badge { float: left; } */
/* #mkdocs-search-query { float: right; } */
/* --- Responsive adjustments --- */
@media screen and (max-width: 900px) { /* Example breakpoint */
.github-stats-badge .repo-name {
display: none; /* Hide full repo name on smaller screens */
}
.github-stats-badge {
margin-left: 1em;
}
.github-stats-badge a {
gap: 0.5em;
}
}
@media screen and (max-width: 768px) {
/* Further hide or simplify on mobile if needed */
.github-stats-badge {
display: none; /* Example: Hide completely on smallest screens */
}
}
/* --- Ask AI Selection Button --- */
.ask-ai-selection-button {
background-color: var(--primary-dimmed-color, #09b5a5);
color: var(--background-color, #070708);
border: none;
padding: 6px 10px;
font-size: 0.8em;
border-radius: 4px;
cursor: pointer;
box-shadow: 0 3px 8px rgba(0, 0, 0, 0.3);
transition: background-color 0.2s ease, transform 0.15s ease;
white-space: nowrap;
display: flex;
align-items: center;
font-weight: 500;
animation: askAiButtonAppear 0.2s ease-out;
}
@keyframes askAiButtonAppear {
from {
opacity: 0;
transform: scale(0.9);
}
to {
opacity: 1;
transform: scale(1);
}
}
.ask-ai-selection-button:hover {
background-color: var(--primary-color, #50ffff);
transform: scale(1.05);
}
/* Mobile styles for Ask AI button */
@media screen and (max-width: 768px) {
.ask-ai-selection-button {
padding: 8px 12px; /* Larger touch target on mobile */
font-size: 0.9em; /* Slightly larger text */
}
}
/* ==== File: docs/assets/layout.css (Additions) ==== */
/* ... (keep all existing layout CSS) ... */
/* --- Copy Code Button Styling --- */
/* Ensure the parent <pre> can contain the absolutely positioned button */
#terminal-mkdocs-main-content pre {
position: relative; /* Needed for absolute positioning of child */
/* Add a little padding top/right to make space for the button */
padding-top: 2.5em;
padding-right: 1em; /* Ensure padding is sufficient */
}
.copy-code-button {
position: absolute;
top: 0.5em; /* Adjust spacing from top */
left: 0.5em; /* Adjust spacing from left */
z-index: 1; /* Sit on top of code */
background-color: var(--progress-bar-background, #444); /* Use a background */
color: var(--font-color, #eaeaea);
border: 1px solid var(--secondary-color, #727578);
padding: 3px 8px;
font-size: 0.8em;
font-family: var(--font-stack, monospace);
border-radius: 4px;
cursor: pointer;
opacity: 0; /* Hidden by default */
transition: opacity 0.2s ease-in-out, background-color 0.2s ease, color 0.2s ease;
white-space: nowrap;
}
/* Show button on hover of the <pre> container */
#terminal-mkdocs-main-content pre:hover .copy-code-button {
opacity: 0.8; /* Show partially */
}
.copy-code-button:hover {
opacity: 1; /* Fully visible on button hover */
background-color: var(--secondary-color, #727578);
}
.copy-code-button:focus {
opacity: 1; /* Ensure visible when focused */
outline: 1px dashed var(--primary-color);
}
/* Style for "Copied!" state */
.copy-code-button.copied {
background-color: var(--primary-dimmed-color, #09b5a5);
color: var(--background-color, #070708);
border-color: var(--primary-dimmed-color, #09b5a5);
opacity: 1; /* Ensure visible */
}
.copy-code-button.copied:hover {
background-color: var(--primary-dimmed-color, #09b5a5); /* Prevent hover change */
}
/* --- Floating Ask AI Button --- */
.floating-ask-ai-button {
position: fixed;
bottom: 25px;
right: 25px;
z-index: 1050; /* Below modals, above most content */
background-color: var(--primary-dimmed-color, #09b5a5);
color: var(--background-color, #070708);
border: none;
border-radius: 50%; /* Make it circular */
width: 60px; /* Adjust size */
height: 60px; /* Adjust size */
padding: 10px; /* Adjust padding */
box-shadow: 0 4px 10px rgba(0, 0, 0, 0.4);
cursor: pointer;
transition: background-color 0.2s ease, transform 0.2s ease;
display: flex;
flex-direction: column; /* Stack icon and text */
align-items: center;
justify-content: center;
text-decoration: none;
text-align: center;
}
.floating-ask-ai-button svg {
width: 24px; /* Control icon size */
height: 24px;
}
.floating-ask-ai-button span {
font-size: 0.7em;
margin-top: 2px; /* Space between icon and text */
display: block; /* Ensure it takes space */
line-height: 1;
}
.floating-ask-ai-button:hover {
background-color: var(--primary-color, #50ffff);
transform: scale(1.05); /* Slight grow effect */
}
.floating-ask-ai-button:focus {
outline: 2px solid var(--primary-color);
outline-offset: 2px;
}
/* Optional: Hide text on smaller screens if needed */
@media screen and (max-width: 768px) {
.floating-ask-ai-button span {
/* display: none; */ /* Uncomment to hide text */
}
.floating-ask-ai-button {
width: 55px;
height: 55px;
bottom: 20px;
right: 20px;
}
}


@@ -0,0 +1,106 @@
// mobile_menu.js - Hamburger menu for mobile view
document.addEventListener('DOMContentLoaded', () => {
// Get references to key elements
const sidePanel = document.getElementById('terminal-mkdocs-side-panel');
const mainHeader = document.querySelector('.terminal .container:first-child');
if (!sidePanel || !mainHeader) {
console.warn('Mobile menu: Required elements not found');
return;
}
// Force hide sidebar on mobile
const checkMobile = () => {
if (window.innerWidth <= 768) {
// Force with !important-like priority
sidePanel.style.setProperty('left', '-100%', 'important');
// Also hide terminal-menu from the theme
const terminalMenu = document.querySelector('.terminal-menu');
if (terminalMenu) {
terminalMenu.style.setProperty('display', 'none', 'important');
}
} else {
sidePanel.style.removeProperty('left');
// Restore terminal-menu if it exists
const terminalMenu = document.querySelector('.terminal-menu');
if (terminalMenu) {
terminalMenu.style.removeProperty('display');
}
}
};
// Run on initial load
checkMobile();
// Also run on resize
window.addEventListener('resize', checkMobile);
// Create hamburger button
const hamburgerBtn = document.createElement('button');
hamburgerBtn.className = 'mobile-menu-toggle';
hamburgerBtn.setAttribute('aria-label', 'Toggle navigation menu');
hamburgerBtn.innerHTML = `
<span class="hamburger-line"></span>
<span class="hamburger-line"></span>
<span class="hamburger-line"></span>
`;
// Create backdrop overlay
const menuBackdrop = document.createElement('div');
menuBackdrop.className = 'mobile-menu-backdrop';
menuBackdrop.style.display = 'none';
document.body.appendChild(menuBackdrop);
// Make sure it's properly hidden on page load
if (window.innerWidth <= 768) {
menuBackdrop.style.display = 'none';
}
// Insert hamburger button into header
mainHeader.insertBefore(hamburgerBtn, mainHeader.firstChild);
// Add menu close button to side panel
const closeBtn = document.createElement('button');
closeBtn.className = 'mobile-menu-close';
closeBtn.setAttribute('aria-label', 'Close navigation menu');
closeBtn.innerHTML = `&times;`;
sidePanel.insertBefore(closeBtn, sidePanel.firstChild);
// Toggle function
function toggleMobileMenu() {
const isOpen = sidePanel.classList.toggle('sidebar-visible');
// Toggle backdrop
menuBackdrop.style.display = isOpen ? 'block' : 'none';
// Toggle aria-expanded
hamburgerBtn.setAttribute('aria-expanded', isOpen ? 'true' : 'false');
// Toggle hamburger animation class
hamburgerBtn.classList.toggle('is-active');
// Force sidebar visibility setting
if (isOpen) {
sidePanel.style.setProperty('left', '0', 'important');
} else {
sidePanel.style.setProperty('left', '-100%', 'important');
}
// Prevent body scrolling when menu is open
document.body.style.overflow = isOpen ? 'hidden' : '';
}
// Event listeners
hamburgerBtn.addEventListener('click', toggleMobileMenu);
closeBtn.addEventListener('click', toggleMobileMenu);
menuBackdrop.addEventListener('click', toggleMobileMenu);
// Close menu on window resize to desktop
window.addEventListener('resize', () => {
if (window.innerWidth > 768 && sidePanel.classList.contains('sidebar-visible')) {
toggleMobileMenu();
}
});
console.log('Mobile menu initialized');
});


@@ -0,0 +1,186 @@
// ==== File: docs/assets/selection_ask_ai.js ====
document.addEventListener('DOMContentLoaded', () => {
let askAiButton = null;
const askAiPageUrl = '/core/ask-ai/'; // Adjust if your Ask AI page path is different
function createAskAiButton() {
const button = document.createElement('button');
button.id = 'ask-ai-selection-btn';
button.className = 'ask-ai-selection-button';
// Add icon and text for better visibility
button.innerHTML = `
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="12" height="12" fill="currentColor" style="margin-right: 4px; vertical-align: middle;">
<path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2z"/>
</svg>
<span>Ask AI</span>
`;
// Common styles
button.style.display = 'none'; // Initially hidden
button.style.position = 'absolute';
button.style.zIndex = '1500'; // Ensure it's on top
button.style.boxShadow = '0 3px 8px rgba(0, 0, 0, 0.4)'; // More pronounced shadow
button.style.transition = 'transform 0.15s ease, background-color 0.2s ease'; // Smooth hover effect
// Add transform on hover
button.addEventListener('mouseover', () => {
button.style.transform = 'scale(1.05)';
});
button.addEventListener('mouseout', () => {
button.style.transform = 'scale(1)';
});
document.body.appendChild(button);
button.addEventListener('click', handleAskAiClick);
return button;
}
function getSafeSelectedText() {
const selection = window.getSelection();
if (!selection || selection.rangeCount === 0) {
return null;
}
// Avoid selecting text within the button itself if it was somehow selected
const container = selection.getRangeAt(0).commonAncestorContainer;
if (askAiButton && askAiButton.contains(container)) {
return null;
}
const text = selection.toString().trim();
return text.length > 0 ? text : null;
}
function positionButton(event) {
const selection = window.getSelection();
if (!selection || selection.rangeCount === 0 || selection.isCollapsed) {
hideButton();
return;
}
const range = selection.getRangeAt(0);
const rect = range.getBoundingClientRect();
// Get viewport dimensions
const viewportWidth = window.innerWidth;
const viewportHeight = window.innerHeight;
// Calculate position based on selection
const scrollX = window.scrollX;
const scrollY = window.scrollY;
// Default position (top-right of selection)
let buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
let buttonLeft = rect.right + scrollX + 5; // 5px to the right
// Check if we're on mobile (which we define as a viewport 768px wide or narrower)
const isMobile = viewportWidth <= 768;
if (isMobile) {
// On mobile, position centered above selection to avoid edge issues
buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 10; // 10px above on mobile
buttonLeft = rect.left + scrollX + (rect.width / 2) - (askAiButton.offsetWidth / 2); // Centered
} else {
// For desktop, ensure the button doesn't go off screen
// Check right edge
if (buttonLeft + askAiButton.offsetWidth > scrollX + viewportWidth) {
buttonLeft = scrollX + viewportWidth - askAiButton.offsetWidth - 10; // 10px from right edge
}
}
// Check top edge (for all devices)
if (buttonTop < scrollY) {
// If would go above viewport, position below selection instead
buttonTop = rect.bottom + scrollY + 5; // 5px below
}
askAiButton.style.top = `${buttonTop}px`;
askAiButton.style.left = `${buttonLeft}px`;
askAiButton.style.display = 'block'; // Show the button
}
function hideButton() {
if (askAiButton) {
askAiButton.style.display = 'none';
}
}
function handleAskAiClick(event) {
event.stopPropagation(); // Prevent mousedown from hiding button immediately
const selectedText = getSafeSelectedText();
if (selectedText) {
console.log("Selected Text:", selectedText);
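// Base64-encode the selection so line breaks and special characters travel as a single token
// Run encodeURIComponent first so btoa only receives Latin-1 bytes (proper Unicode handling)
const encodedText = btoa(unescape(encodeURIComponent(selectedText)));
// Standard Base64 can contain '+', '/' and '=', so percent-encode it before placing it in the query string
const targetUrl = `${askAiPageUrl}?qq=${encodeURIComponent(encodedText)}`;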
console.log("Navigating to:", targetUrl);
window.location.href = targetUrl; // Navigate to Ask AI page
}
hideButton(); // Hide after click
}
// --- Event Listeners ---
// Function to handle selection events (both mouse and touch)
function handleSelectionEvent(event) {
// Slight delay to ensure selection is registered
setTimeout(() => {
const selectedText = getSafeSelectedText();
if (selectedText) {
if (!askAiButton) {
askAiButton = createAskAiButton();
}
// Don't position if the event was ON the button itself
if (event.target !== askAiButton) {
positionButton(event);
}
} else {
hideButton();
}
}, 10); // Small delay
}
// Mouse selection events (desktop)
document.addEventListener('mouseup', handleSelectionEvent);
// Touch selection events (mobile)
document.addEventListener('touchend', handleSelectionEvent);
document.addEventListener('selectionchange', () => {
// This helps with mobile selection which can happen without mouseup/touchend
setTimeout(() => {
const selectedText = getSafeSelectedText();
if (selectedText && askAiButton) {
positionButton();
}
}, 300); // Longer delay for selection change
});
// Hide button on various events
document.addEventListener('mousedown', (event) => {
// Hide if clicking anywhere EXCEPT the button itself
if (askAiButton && event.target !== askAiButton) {
hideButton();
}
});
document.addEventListener('touchstart', (event) => {
// Same for touch events, but only hide if not on the button
if (askAiButton && event.target !== askAiButton) {
hideButton();
}
});
document.addEventListener('scroll', hideButton, true); // Capture scroll events
// Also hide when pressing Escape key
document.addEventListener('keydown', (event) => {
if (event.key === 'Escape') {
hideButton();
}
});
console.log("Selection Ask AI script loaded.");
});


@@ -6,8 +6,8 @@
}
:root {
--global-font-size: 16px;
--global-code-font-size: 16px;
--global-font-size: 14px;
--global-code-font-size: 13px;
--global-line-height: 1.5em;
--global-space: 10px;
--font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
@@ -50,8 +50,17 @@
--display-h1-decoration: none;
--display-h1-decoration: none;
--header-height: 65px; /* Adjust based on your actual header height */
--sidebar-width: 280px; /* Adjust based on your desired sidebar width */
--toc-width: 240px; /* Adjust based on your desired ToC width */
--layout-transition-speed: 0.2s; /* For potential future animations */
--page-width : 100em; /* Adjust based on your design */
}
/* body {
background-color: var(--background-color);
color: var(--font-color);
@@ -256,4 +265,9 @@ div.badges a {
}
div.badges a > img {
width: auto;
}
table td, table th {
border: 1px solid var(--code-bg-color) !important;
}

docs/md_v2/assets/toc.js (new file, 144 lines)

@@ -0,0 +1,144 @@
// ==== File: assets/toc.js ====
document.addEventListener('DOMContentLoaded', () => {
const mainContent = document.getElementById('terminal-mkdocs-main-content');
const tocContainer = document.getElementById('toc-sidebar');
const mainGrid = document.querySelector('.terminal-mkdocs-main-grid'); // Get the flex container
if (!mainContent) {
console.warn("TOC Generator: Main content area '#terminal-mkdocs-main-content' not found.");
return;
}
// --- Create ToC container if it doesn't exist ---
let tocElement = tocContainer;
if (!tocElement) {
if (!mainGrid) {
console.warn("TOC Generator: Flex container '.terminal-mkdocs-main-grid' not found to append ToC.");
return;
}
tocElement = document.createElement('aside');
tocElement.id = 'toc-sidebar';
tocElement.style.display = 'none'; // Keep hidden initially
// Append it as the last child of the flex grid
mainGrid.appendChild(tocElement);
console.info("TOC Generator: Created '#toc-sidebar' element.");
}
// --- Find Headings (h2, h3, h4 are common for ToC) ---
const headings = mainContent.querySelectorAll('h2, h3, h4');
if (headings.length === 0) {
console.info("TOC Generator: No headings found on this page. ToC not generated.");
tocElement.style.display = 'none'; // Ensure it's hidden
return;
}
// --- Generate ToC List ---
const tocList = document.createElement('ul');
const observerTargets = []; // Store headings for IntersectionObserver
headings.forEach((heading, index) => {
// Ensure heading has an ID for linking
if (!heading.id) {
// Create a simple slug-like ID
heading.id = `toc-heading-${index}-${heading.textContent.toLowerCase().replace(/\s+/g, '-').replace(/[^a-z0-9-]/g, '')}`;
}
const listItem = document.createElement('li');
const link = document.createElement('a');
link.href = `#${heading.id}`;
link.textContent = heading.textContent;
// Add class for styling based on heading level
const level = parseInt(heading.tagName.substring(1), 10); // Get 2, 3, or 4
listItem.classList.add(`toc-level-${level}`);
listItem.appendChild(link);
tocList.appendChild(listItem);
observerTargets.push(heading); // Add to observer list
});
// --- Populate and Show ToC ---
// Optional: Add a title
const tocTitle = document.createElement('h4');
tocTitle.textContent = 'On this page'; // Customize title if needed
tocElement.innerHTML = ''; // Clear previous content if any
tocElement.appendChild(tocTitle);
tocElement.appendChild(tocList);
tocElement.style.display = ''; // Show the ToC container
console.info(`TOC Generator: Generated ToC with ${headings.length} items.`);
// --- Scroll Spy using Intersection Observer ---
const tocLinks = tocElement.querySelectorAll('a');
let activeLink = null; // Keep track of the current active link
const observerOptions = {
// Observe changes relative to the viewport, offset by the header height
// Negative top margin pushes the intersection trigger point down
// Negative bottom margin ensures elements low on the screen can trigger before they exit
rootMargin: `-${getComputedStyle(document.documentElement).getPropertyValue('--header-height').trim()} 0px -60% 0px`,
threshold: 0 // Trigger as soon as any part enters/exits the boundary
};
const observerCallback = (entries) => {
let topmostVisibleHeading = null;
entries.forEach(entry => {
const link = tocElement.querySelector(`a[href="#${entry.target.id}"]`);
if (!link) return;
// Check if the heading is intersecting (partially or fully visible within rootMargin)
if (entry.isIntersecting) {
// Among visible headings, find the one closest to the top edge (within the rootMargin)
if (!topmostVisibleHeading || entry.boundingClientRect.top < topmostVisibleHeading.boundingClientRect.top) {
topmostVisibleHeading = entry.target;
}
}
});
// If we found a topmost visible heading, activate its link
if (topmostVisibleHeading) {
const newActiveLink = tocElement.querySelector(`a[href="#${topmostVisibleHeading.id}"]`);
if (newActiveLink && newActiveLink !== activeLink) {
// Remove active class from previous link
if (activeLink) {
activeLink.classList.remove('active');
activeLink.parentElement.classList.remove('active-parent'); // Optional parent styling
}
// Add active class to the new link
newActiveLink.classList.add('active');
newActiveLink.parentElement.classList.add('active-parent'); // Optional parent styling
activeLink = newActiveLink;
// Optional: Scroll the ToC sidebar to keep the active link visible
// newActiveLink.scrollIntoView({ behavior: 'smooth', block: 'nearest' });
}
}
// If no headings are intersecting (scrolled past the last one?), maybe deactivate all
// Or keep the last one active - depends on desired behavior. Current logic keeps last active.
};
const observer = new IntersectionObserver(observerCallback, observerOptions);
// Observe all target headings
observerTargets.forEach(heading => observer.observe(heading));
// Initial check in case a heading is already in view on load
// (Requires slight delay for accurate layout calculation)
setTimeout(() => {
observerCallback(observer.takeRecords()); // Process initial state
}, 100);
// Move the footer, and the <hr> before it if present, to the end of the main content
const footer = document.querySelector('footer');
if (footer) {
const hr = footer.previousElementSibling;
if (hr && hr.tagName === 'HR') {
mainContent.appendChild(hr);
}
mainContent.appendChild(footer);
console.info("TOC Generator: Footer moved to the end of the main content.");
}
});


@@ -4,6 +4,32 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical
## Latest Release
---
### [Crawl4AI v0.6.0: World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
*April 23, 2025*
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
Other key changes:
* Native support for `result.media["tables"]` to export DataFrames
* Full network + console logs and MHTML snapshot per crawl
* Browser pooling and pre-warming for faster cold starts
* New streaming endpoints via MCP API and Playground
* Robots.txt support, proxy rotation, and improved session handling
* Deprecated old markdown names, legacy modules cleaned up
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
[Read full release notes →](releases/0.6.0.md)
---
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)


@@ -251,7 +251,7 @@ from crawl4ai import (
RoundRobinProxyStrategy,
)
import asyncio
from crawl4ai.proxy_strategy import ProxyConfig
from crawl4ai import ProxyConfig
async def main():
# Load proxies and create rotation strategy
proxies = ProxyConfig.from_env()


@@ -0,0 +1,143 @@
# Crawl4AI v0.6.0 Release Notes
We're excited to announce the release of **Crawl4AI v0.6.0**, our biggest and most feature-rich update yet. This version introduces major architectural upgrades, brand-new capabilities for geo-aware crawling, high-efficiency scraping, and real-time streaming support for scalable deployments.
---
## Highlights
### 1. **World-Aware Crawlers**
Crawl as if you're anywhere in the world. With v0.6.0, each crawl can simulate:
- Specific GPS coordinates
- Browser locale
- Timezone
Example:
```python
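from crawl4ai import CrawlerRunConfig, GeolocationConfig

# Present the crawl as a browser located in Los Angeles
CrawlerRunConfig(
    url="https://browserleaks.com/geo",
    locale="en-US",
    timezone_id="America/Los_Angeles",
    geolocation=GeolocationConfig(
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0  # Accuracy in meters
    )
)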
```
Great for accessing region-specific content or testing global behavior.
---
### 2. **Native Table Extraction**
Extract HTML tables directly into usable formats like Pandas DataFrames or CSV with zero parsing hassle. All table data is available under `result.media["tables"]`.
Example:
```python
import pandas as pd

# `result` is the object returned by `crawler.arun(...)`
raw_df = pd.DataFrame(
    result.media["tables"][0]["rows"],
    columns=result.media["tables"][0]["headers"]
)
```
This makes it ideal for scraping financial data, pricing pages, or anything tabular.
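
For the CSV route, a minimal sketch using only the standard library might look like the following. It assumes each table is a dict with `"headers"` and `"rows"` keys, as in the DataFrame snippet above. The `table_to_csv` helper and the sample data are hypothetical and stand in for `result.media["tables"][0]`.

```python
import csv
import io


def table_to_csv(table: dict) -> str:
    """Serialize one extracted table to CSV text.

    Assumes the table is a dict with "headers" (a list of column names)
    and "rows" (a list of row lists).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])  # header row first
    writer.writerows(table["rows"])    # then one CSV line per table row
    return buffer.getvalue()


# Hypothetical sample standing in for result.media["tables"][0]
sample_table = {
    "headers": ["Item", "Price"],
    "rows": [["Widget", "9.99"], ["Gadget", "19.50"]],
}

print(table_to_csv(sample_table))
```

Writing to an in-memory buffer keeps the helper free of file handling; the returned string can be saved to disk or passed on as needed.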
---
### 3. **Browser Pooling & Pre-Warming**
We've overhauled browser management. Now, multiple browser instances can be pooled and pages pre-warmed for ultra-fast launches:
- Reduces cold-start latency
- Lowers memory spikes
- Enhances parallel crawling stability
This powers the new **Docker Playground** experience and streamlines heavy-load crawling.
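
To illustrate the general idea of pooling with pre-warming, here is a generic `asyncio` sketch. It is not Crawl4AI's implementation: `WarmPool`, `make_page`, and the string "pages" are hypothetical stand-ins chosen only to show how creating resources up front removes the creation cost from the first request.

```python
import asyncio


class WarmPool:
    """Generic pool of pre-created resources (a stand-in for browser pages)."""

    def __init__(self, factory, size: int):
        self._factory = factory  # coroutine function that builds one resource
        self._size = size
        self._idle: asyncio.Queue = asyncio.Queue()

    async def warm_up(self) -> None:
        # Pay the creation cost up front instead of on the first request.
        for _ in range(self._size):
            await self._idle.put(await self._factory())

    async def acquire(self):
        # Waits if every resource is currently in use.
        return await self._idle.get()

    async def release(self, resource) -> None:
        # Return the resource so a later acquire() can reuse it.
        await self._idle.put(resource)


async def demo() -> list:
    created = []

    async def make_page():
        # Hypothetical stand-in for launching a browser page.
        page = f"page-{len(created)}"
        created.append(page)
        return page

    pool = WarmPool(make_page, size=2)
    await pool.warm_up()

    first = await pool.acquire()
    await pool.release(first)
    second = await pool.acquire()  # served from the pool; nothing new is created
    return [first, second, len(created)]


print(asyncio.run(demo()))
```

The queue gives callers backpressure for free: when all resources are busy, `acquire()` simply waits rather than launching more.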
---
### 4. **Traffic & Snapshot Capture**
Need full visibility? You can now capture:
- Full network traffic logs
- Console output
- MHTML page snapshots for post-crawl audits and debugging
No more guesswork on what happened during your crawl.
---
### 5. **MCP API and Streaming Support**
We're exposing **MCP socket and SSE endpoints**, allowing:
- Live streaming of crawl results
- Real-time integration with agents or frontends
- A new Playground UI for interactive crawling
This is a major step towards making Crawl4AI real-time ready.
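
On the client side, consuming an SSE stream comes down to splitting the text into events and decoding each `data:` field. The sketch below shows that step with the standard library only. The endpoint path and the event schema are not given in these notes, so the `url` and `status` fields in the sample are hypothetical, as is the `parse_sse` helper itself.

```python
import json


def parse_sse(stream_text: str) -> list:
    """Split raw Server-Sent Events text into decoded JSON payloads.

    Handles only `data:` fields; other SSE fields (event, id, retry)
    and comment lines are ignored in this sketch.
    """
    events = []
    data_lines = []
    for line in stream_text.splitlines():
        if line == "":
            # A blank line terminates one event.
            if data_lines:
                events.append(json.loads("\n".join(data_lines)))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # drop the single optional space after the colon
            data_lines.append(value)
    if data_lines:
        # Tolerate a final event that lacks its trailing blank line.
        events.append(json.loads("\n".join(data_lines)))
    return events


# Hypothetical payloads; the real event schema is not specified here.
raw = (
    'data: {"url": "https://example.com/a", "status": "done"}\n'
    "\n"
    'data: {"url": "https://example.com/b", "status": "done"}\n'
    "\n"
)

for event in parse_sse(raw):
    print(event["url"], event["status"])
```

In a real client the text would arrive incrementally over HTTP, so the same loop would run over lines as they are received rather than over a complete string.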
---
### 6. **Stress-Test Framework**
Want to test performance under heavy load? v0.6.0 includes a new memory stress-test suite that supports 1,000+ URL workloads. Ideal for:
- Load testing
- Performance benchmarking
- Validating memory efficiency
---
## Core Improvements
- Robots.txt compliance
- Proxy rotation support
- Improved URL normalization and session reuse
- Shared data across crawler hooks
- New page routing logic
---
## Breaking Changes & Deprecations
- Legacy `crawl4ai/browser/*` modules are removed. Update imports accordingly.
- `AsyncPlaywrightCrawlerStrategy.get_page` now uses a new function signature.
- Deprecated markdown generator aliases now point to `DefaultMarkdownGenerator` with a warning.
---
## Miscellaneous Updates
- FastAPI validators replaced custom validation logic
- Docker build now based on a Chromium layer
- Repo-wide cleanup: ~36,000 insertions, ~5,000 deletions
---
## New Examples Included
- Geo-location crawling
- Network + console log capture
- Docker MCP API usage
- Markdown selector usage
- Crypto project data extraction
---
## Watch the Release Video
Want a visual walkthrough of all these updates? Watch the video:
🔗 https://youtu.be/9x7nVcjOZks
If you're new to Crawl4AI, start here:
🔗 https://www.youtube.com/watch?v=xo3qK6Hg9AA&t=15s
---
## Join the Community
We've just opened up our **Discord** for the public. Join us to:
- Ask questions
- Share your projects
- Get help or contribute
💬 https://discord.gg/wpYFACrHR4
---
## Install or Upgrade
```bash
pip install -U crawl4ai
```
---
Live long and import crawl4ai. 🖖

docs/md_v2/core/ask-ai.md (new file, 74 lines)

@@ -0,0 +1,74 @@
<div class="ask-ai-container">
<iframe id="ask-ai-frame" src="../../ask_ai/index.html" width="100%" style="border:none; display: block;" title="Crawl4AI Assistant"></iframe>
</div>
<script>
// Iframe height adjustment
function resizeAskAiIframe() {
const iframe = document.getElementById('ask-ai-frame');
if (iframe) {
const headerHeight = parseFloat(getComputedStyle(document.documentElement).getPropertyValue('--header-height') || '55');
// Footer is removed by JS below, so calculate height based on header + small buffer
const topOffset = headerHeight + 20; // Header + buffer/margin
const availableHeight = window.innerHeight - topOffset;
iframe.style.height = Math.max(600, availableHeight) + 'px'; // Min height 600px
}
}
// Run immediately and on resize/load
resizeAskAiIframe(); // Initial call
let resizeTimer;
window.addEventListener('load', resizeAskAiIframe);
window.addEventListener('resize', () => {
clearTimeout(resizeTimer);
resizeTimer = setTimeout(resizeAskAiIframe, 150);
});
// Remove Footer & HR from parent page (DOM Ready might be safer)
document.addEventListener('DOMContentLoaded', () => {
setTimeout(() => { // Add slight delay just in case elements render slowly
const footer = window.parent.document.querySelector('footer'); // Target parent document
if (footer) {
const hrBeforeFooter = footer.previousElementSibling;
if (hrBeforeFooter && hrBeforeFooter.tagName === 'HR') {
hrBeforeFooter.remove();
}
footer.remove();
// Trigger resize again after removing footer
resizeAskAiIframe();
} else {
console.warn("Ask AI Page: Could not find footer in parent document to remove.");
}
}, 100); // Shorter delay
});
</script>
<style>
#terminal-mkdocs-main-content {
padding: 0 !important;
margin: 0;
width: 100%;
height: 100%;
overflow: hidden; /* Prevent body scrollbars, panels handle scroll */
}
/* Ensure iframe container takes full space */
#terminal-mkdocs-main-content .ask-ai-container {
/* Remove negative margins if footer removal handles space */
margin: 0;
padding: 0;
max-width: none;
/* Let the JS set the height */
/* height: 600px; Initial fallback height */
overflow: hidden; /* Hide potential overflow before JS resize */
}
/* Hide title/paragraph if they were part of the markdown */
/* Alternatively, just remove them from the .md file directly */
/* #terminal-mkdocs-main-content > h1,
#terminal-mkdocs-main-content > p:first-of-type {
display: none;
} */
</style>


@@ -1,9 +1,9 @@
# Browser, Crawler & LLM Configuration (Quick Overview)
Crawl4AIs flexibility stems from two key classes:
Crawl4AI's flexibility stems from three key classes:
1. **`BrowserConfig`** Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
1. **`BrowserConfig`** Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
3. **`LLMConfig`** - Dictates **how** LLM providers are configured (model, API token, base URL, temperature, etc.).
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
@@ -36,18 +36,16 @@ class BrowserConfig:
### Key Fields to Note
1. **`browser_type`**
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
2. **`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
3. **`proxy_config`**
- A dictionary with fields like:
```json
{
@@ -58,31 +56,31 @@ class BrowserConfig:
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:
4. **`viewport_width` & `viewport_height`**:
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
5. **`verbose`**:
- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:
6. **`use_persistent_context`**:
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
7. **`cookies`** & **`headers`**:
- If you want to start with specific cookies or add universal HTTP headers, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
8. **`user_agent`**:
- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
9. **`text_mode`** & **`light_mode`**:
9. **`text_mode`** & **`light_mode`**:
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
10. **`extra_args`**:
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
@@ -136,6 +134,12 @@ class CrawlerRunConfig:
wait_for=None,
screenshot=False,
pdf=False,
capture_mhtml=False,
# Location and Identity Parameters
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
@@ -151,58 +155,65 @@ class CrawlerRunConfig:
### Key Fields to Note
1. **`word_count_threshold`**:
1. **`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
2. **`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
3. **`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
4. **`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
5. **`js_code`**:
5. **`js_code`**:
- A string or list of JS strings to execute.
- Great for Load More buttons or user interactions.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
6. **`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`** & **`pdf`**:
- If `True`, captures a screenshot or PDF after the page is fully loaded.
- The results go to `result.screenshot` (base64) or `result.pdf` (bytes).
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **`verbose`**:
8. **Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browsers verbosity if also set to `True` in `BrowserConfig`.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
9. **`enable_rate_limiting`**:
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
10. **`memory_threshold_percent`**:
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
11. **`check_interval`**:
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
12. **`max_session_permit`**:
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
13. **`display_mode`**:
14. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
@@ -236,23 +247,20 @@ The `clone()` method:
---
## 3. LLMConfig Essentials
### Key fields to note
1. **`provider`**:
1. **`provider`**:
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
2. **`api_token`**:
- Optional. When not provided explicitly, `api_token` is read from an environment variable chosen by provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment.
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
3. **`base_url`**:
3. **`base_url`**:
- If your provider has a custom endpoint
```python
@@ -261,7 +269,7 @@ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENA
## 4. Putting It All Together
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each calls needs:
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
```python
import asyncio


@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
@@ -51,6 +52,7 @@ class CrawlResult(BaseModel):
| **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
| **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
| **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
| **mhtml (`Optional[str]`)** | MHTML snapshot of the page if `capture_mhtml=True`. Contains the full page with all resources. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | It holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as JSON string or other text. |
| **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
@@ -190,18 +192,27 @@ for img in images:
print("Image URL:", img["src"], "Alt:", img.get("alt"))
```
### 5.3 `screenshot` and `pdf`
### 5.3 `screenshot`, `pdf`, and `mhtml`
If you set `screenshot=True` or `pdf=True` in **`CrawlerRunConfig`**, then:
If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:
- `result.screenshot` contains a base64-encoded PNG string.
- `result.screenshot` contains a base64-encoded PNG string.
- `result.pdf` contains raw PDF bytes (you can write them to a file).
- `result.mhtml` contains the MHTML snapshot of the page as a string (you can write it to a .mhtml file).
```python
# Save the PDF
with open("page.pdf", "wb") as f:
f.write(result.pdf)
# Save the MHTML
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.
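The snippet above covers the PDF and MHTML outputs; the screenshot needs one extra step because it arrives as base64 text rather than raw bytes. A minimal sketch, assuming `result.screenshot` holds the base64-encoded PNG string described above. `save_screenshot` is a hypothetical helper name for illustration, not part of Crawl4AI:

```python
import base64
from pathlib import Path


def save_screenshot(b64_png: str, path: str) -> int:
    """Decode a base64 string and write the bytes to `path`; return bytes written."""
    data = base64.b64decode(b64_png)
    Path(path).write_bytes(data)
    return len(data)


# Usage (assumes `result` came from a crawl with screenshot=True):
# if result.screenshot:
#     save_screenshot(result.screenshot, "page.png")
```

Checking `result.screenshot` before decoding mirrors the `if result.mhtml:` guard used for the MHTML file, since the field is `None` when the capture was not requested.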
### 5.4 `ssl_certificate`
If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site's SSL cert, such as issuer, validity dates, etc.

File diff suppressed because it is too large Load Diff

115
docs/md_v2/core/examples.md Normal file

@@ -0,0 +1,115 @@
# Code Examples
This page provides a comprehensive list of example scripts that demonstrate various features and capabilities of Crawl4AI. Each example is designed to showcase specific functionality, making it easier for you to understand how to implement these features in your own projects.
## Getting Started Examples
| Example | Description | Link |
|---------|-------------|------|
| Hello World | A simple introductory example demonstrating basic usage of AsyncWebCrawler with JavaScript execution and content filtering. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world.py) |
| Quickstart | A comprehensive collection of examples showcasing various features including basic crawling, content cleaning, link analysis, JavaScript execution, CSS selectors, media handling, custom hooks, proxy configuration, screenshots, and multiple extraction strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py) |
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
## Browser & Crawling Features
| Example | Description | Link |
|---------|-------------|------|
| Built-in Browser | Demonstrates how to use the built-in browser capabilities. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/builtin_browser_example.py) |
| Browser Optimization | Focuses on browser performance optimization techniques. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/browser_optimization_example.py) |
| arun vs arun_many | Compares the `arun` and `arun_many` methods for single vs. multiple URL crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/arun_vs_arun_many.py) |
| Multiple URLs | Shows how to crawl multiple URLs asynchronously. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/async_webcrawler_multiple_urls_example.py) |
| Page Interaction | Guide on interacting with dynamic elements through clicks. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/tutorial_dynamic_clicks.md) |
| Crawler Monitor | Shows how to monitor the crawler's activities and status. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawler_monitor_example.py) |
| Full Page Screenshot & PDF | Guide on capturing full-page screenshots and PDFs from massive webpages. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/full_page_screenshot_and_pdf_export.md) |
## Advanced Crawling & Deep Crawling
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
## Extraction Strategies
| Example | Description | Link |
|---------|-------------|------|
| Extraction Strategies | Demonstrates different extraction strategies with various input formats (markdown, HTML, fit_markdown) and JSON-based extractors (CSS and XPath). | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/extraction_strategies_examples.py) |
| Scraping Strategies | Compares the performance of different scraping strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/scraping_strategies_performance.py) |
| LLM Extraction | Demonstrates LLM-based extraction specifically for OpenAI pricing data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_extraction_openai_pricing.py) |
| LLM Markdown | Shows how to use LLMs to generate markdown from crawled content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_markdown_generator.py) |
| Summarize Page | Shows how to summarize web page content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/summarize_page.py) |
## E-commerce & Specialized Crawling
| Example | Description | Link |
|---------|-------------|------|
| Amazon Product Extraction | Demonstrates how to extract structured product data from Amazon search results using CSS selectors. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_direct_url.py) |
| Amazon with Hooks | Shows how to use hooks with Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_hooks.py) |
| Amazon with JavaScript | Demonstrates using custom JavaScript for Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_use_javascript.py) |
| Crypto Analysis | Demonstrates how to crawl and analyze cryptocurrency data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crypto_analysis_example.py) |
| SERP API | Demonstrates using Crawl4AI with search engine result pages. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/serp_api_project_11_feb.py) |
## Customization & Security
| Example | Description | Link |
|---------|-------------|------|
| Hooks | Illustrates how to use hooks at different stages of the crawling process for advanced customization. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hooks_example.py) |
| Identity-Based Browsing | Illustrates identity-based browsing configurations for authentic browsing experiences. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/identity_based_browsing.py) |
| Proxy Rotation | Shows how to use proxy rotation for web scraping and avoiding IP blocks. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/proxy_rotation_demo.py) |
| SSL Certificate | Illustrates SSL certificate handling and verification. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/ssl_example.py) |
| Language Support | Shows how to handle different languages during crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/language_support_example.py) |
| Geolocation | Demonstrates how to use geolocation features. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/use_geo_location.py) |
## Docker & Deployment
| Example | Description | Link |
|---------|-------------|------|
| Docker Config | Demonstrates how to create and use Docker configuration objects. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_config_obj.py) |
| Docker Basic | A test suite for Docker deployment, showcasing various functionalities through the Docker API. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py) |
| Docker REST API | Shows how to interact with Crawl4AI Docker using REST API calls. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api.py) |
| Docker SDK | Demonstrates using the Python SDK for Crawl4AI Docker. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk.py) |
## Application Examples
| Example | Description | Link |
|---------|-------------|------|
| Research Assistant | Demonstrates how to build a research assistant using Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/research_assistant.py) |
| REST Call | Shows how to make REST API calls with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/rest_call.py) |
| Chainlit Integration | Shows how to integrate Crawl4AI with Chainlit. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/chainlit.md) |
| Crawl4AI vs FireCrawl | Compares Crawl4AI with the FireCrawl library. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawlai_vs_firecrawl.py) |
## Content Generation & Markdown
| Example | Description | Link |
|---------|-------------|------|
| Content Source | Demonstrates how to work with different content sources in markdown generation. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_example.py) |
| Content Source (Short) | A simplified version of content source usage. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_short_example.py) |
| Built-in Browser Guide | Guide for using the built-in browser capabilities. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/README_BUILTIN_BROWSER.md) |
## Running the Examples
To run any of these examples, you'll need to have Crawl4AI installed:
```bash
pip install crawl4ai
```
Then, you can run an example script like this:
```bash
python -m docs.examples.hello_world
```
For examples that require additional dependencies or environment variables, refer to the comments at the top of each file.
Some examples may require:
- API keys (for LLM-based examples)
- Docker setup (for Docker-related examples)
- Additional dependencies (specified in the example files)
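Before running an LLM-based example, it can help to confirm that the keys it expects are actually set. A minimal sketch using only the standard library. `missing_env_vars` is a hypothetical helper, and `OPENAI_API_KEY` is just one plausible variable name; check the comments at the top of each example for its real requirements:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["OPENAI_API_KEY"])
    if missing:
        print("Set these variables before running the example:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real environment.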
## Contributing New Examples
If you've created an interesting example that demonstrates a unique use case or feature of Crawl4AI, we encourage you to contribute it to our examples collection. Please see our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.


@@ -4,7 +4,35 @@ In this tutorial, youll learn how to:
1. Extract links (internal, external) from crawled pages
2. Filter or exclude specific domains (e.g., social media or custom domains)
3. Access and manage media data (especially images) in the crawl result
### 3.2 Excluding Images
#### Excluding External Images
If you're dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
```python
crawler_cfg = CrawlerRunConfig(
exclude_external_images=True
)
```
This setting attempts to discard images from outside the primary domain, keeping only those from the site you're crawling.
#### Excluding All Images
If you want to completely remove all images from the page to maximize performance and reduce memory usage, use:
```python
crawler_cfg = CrawlerRunConfig(
exclude_all_images=True
)
```
This setting removes all images very early in the processing pipeline, which significantly improves memory efficiency and processing speed. This is particularly useful when:
- You don't need image data in your results
- You're crawling image-heavy pages that cause memory issues
- You want to focus only on text content
- You need to maximize crawling speed
4. Configure your crawler to exclude or prioritize certain images
> **Prerequisites**
@@ -271,8 +299,41 @@ Each extracted table contains:
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
- **`capture_mhtml`**: Set to `True` if you want an MHTML snapshot of the page in `result.mhtml`. This format preserves the entire web page with all its resources (CSS, images, scripts) in a single file, making it perfect for archiving or offline viewing.
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
#### Example: Capturing Page as MHTML
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
crawler_cfg = CrawlerRunConfig(
capture_mhtml=True # Enable MHTML capture
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=crawler_cfg)
if result.success and result.mhtml:
# Save the MHTML snapshot to a file
with open("example.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
print("MHTML snapshot saved to example.mhtml")
else:
print("Failed to capture MHTML:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
The MHTML format is particularly useful because:
- It captures the complete page state including all resources
- It can be opened in most modern browsers for offline viewing
- It preserves the page exactly as it appeared during crawling
- It's a single file, making it easy to store and transfer
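Because MHTML is a MIME multipart document, Python's standard `email` package can usually read a snapshot and list what it contains. A rough sketch, assuming the snapshot follows the usual `multipart/related` layout. The sample string is hand-written for illustration and is not real crawler output, and `list_mhtml_parts` is a hypothetical helper, not a Crawl4AI API:

```python
from email import message_from_string
from typing import List, Optional, Tuple


def list_mhtml_parts(mhtml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (content type, Content-Location) for each non-container MIME part."""
    message = message_from_string(mhtml_text)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart/related wrapper itself
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts


# Hand-written sample for illustration only, not real crawler output.
SAMPLE_MHTML = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="sketch-boundary"\n'
    "\n"
    "--sketch-boundary\n"
    "Content-Type: text/html\n"
    "Content-Location: https://example.com/\n"
    "\n"
    "<html><body>Hello</body></html>\n"
    "--sketch-boundary\n"
    "Content-Type: text/css\n"
    "Content-Location: https://example.com/style.css\n"
    "\n"
    "body { margin: 0; }\n"
    "--sketch-boundary--\n"
)

for content_type, location in list_mhtml_parts(SAMPLE_MHTML):
    print(content_type, location)
```

With a real crawl you would pass `result.mhtml` instead of `SAMPLE_MHTML`; this gives a quick way to check that stylesheets and images were captured alongside the HTML.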
---
## 4. Putting It All Together: Link & Media Filtering


@@ -111,13 +111,71 @@ Some commonly used `options`:
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.
## 4. Selecting the HTML Source for Markdown Generation
The `content_source` parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown.
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Option 1: Use the raw HTML directly from the webpage (before any processing)
raw_md_generator = DefaultMarkdownGenerator(
content_source="raw_html",
options={"ignore_links": True}
)
# Option 2: Use the cleaned HTML (after scraping strategy processing - default)
cleaned_md_generator = DefaultMarkdownGenerator(
content_source="cleaned_html", # This is the default
options={"ignore_links": True}
)
# Option 3: Use preprocessed HTML optimized for schema extraction
fit_md_generator = DefaultMarkdownGenerator(
content_source="fit_html",
options={"ignore_links": True}
)
# Use one of the generators in your crawler config
config = CrawlerRunConfig(
markdown_generator=raw_md_generator # Try each of the generators
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
print("Markdown:\n", result.markdown.raw_markdown[:500])
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
### HTML Source Options
- **`"cleaned_html"`** (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed.
- **`"raw_html"`**: Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that might not be relevant to the main content.
- **`"fit_html"`**: Uses HTML preprocessed for schema extraction. This HTML is optimized for structured data extraction and may have certain elements simplified or removed.
### When to Use Each Option
- Use **`"cleaned_html"`** (default) for most cases where you want a balance of content preservation and noise removal.
- Use **`"raw_html"`** when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep.
- Use **`"fit_html"`** when working with structured data or when you need HTML that's optimized for schema extraction.
---
## 4. Content Filters
## 5. Content Filters
**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don't want.
### 4.1 BM25ContentFilter
### 5.1 BM25ContentFilter
If you have a **search query**, BM25 is a good choice:
@@ -146,7 +204,7 @@ config = CrawlerRunConfig(markdown_generator=md_generator)
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.
### 4.2 PruningContentFilter
### 5.2 PruningContentFilter
If you **don't** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
@@ -170,7 +228,7 @@ prune_filter = PruningContentFilter(
- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
### 4.3 LLMContentFilter
### 5.3 LLMContentFilter
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
@@ -247,7 +305,7 @@ filter = LLMContentFilter(
---
## 5. Using Fit Markdown
## 6. Using Fit Markdown
When a content filter is active, the library produces two forms of markdown inside `result.markdown`:
@@ -284,7 +342,7 @@ if __name__ == "__main__":
---
## 6. The `MarkdownGenerationResult` Object
## 7. The `MarkdownGenerationResult` Object
If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, you'll see fields such as:
@@ -315,7 +373,7 @@ Below is a **revised section** under “Combining Filters (BM25 + Pruning)” th
---
## 7. Combining Filters (BM25 + Pruning) in Two Passes
## 8. Combining Filters (BM25 + Pruning) in Two Passes
You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what's left** against a user query (with `BM25ContentFilter`). You don't have to crawl the page twice. Instead:
@@ -407,7 +465,7 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---
## 8. Common Pitfalls & Tips
## 9. Common Pitfalls & Tips
1. **No Markdown Output?**
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
@@ -427,11 +485,12 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---
## 9. Summary & Next Steps
## 10. Summary & Next Steps
In this **Markdown Generation Basics** tutorial, you learned to:
- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Select different HTML sources using the `content_source` parameter.
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).


@@ -2,7 +2,7 @@
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
1. Works with **any** large language model supported by [LightLLM](https://github.com/LightLLM) (Ollama, OpenAI, Claude, and more).
1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.
@@ -18,13 +18,19 @@ In some cases, you need to extract **complex or unstructured** information from
---
## 2. Provider-Agnostic via LightLLM
## 2. Provider-Agnostic via LiteLLM
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
You can use `LlmConfig` to quickly configure multiple LLM variations and experiment with them to find the best one for your use case. You can read more about `LlmConfig` [here](/api/parameters).
```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`api_base`** (optional): If your provider has a custom endpoint.
- **`base_url`** (optional): If your provider has a custom endpoint.
This means you **aren't locked** into a single LLM vendor. Switch or experiment easily.
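As a minimal sketch of the `<provider>/<model_name>` convention described above, the helper below splits a provider string into its two parts. The names `ProviderSpec` and `parse_provider_string` are hypothetical and exist only for illustration; they are not part of Crawl4AI or LiteLLM, which handle this parsing internally.

```python
from typing import NamedTuple


class ProviderSpec(NamedTuple):
    """Hypothetical container for the two halves of a provider string."""
    provider: str
    model: str


def parse_provider_string(value: str) -> ProviderSpec:
    """Split '<provider>/<model_name>' on the first slash only.

    Splitting on the first slash keeps model names that themselves
    contain slashes (e.g. 'huggingface/google/flan-t5') intact.
    """
    provider, sep, model = value.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model_name>', got {value!r}")
    return ProviderSpec(provider, model)


if __name__ == "__main__":
    for s in ("openai/gpt-4o", "ollama/llama2", "huggingface/google/flan-t5"):
        print(parse_provider_string(s))
```

Because the vendor is encoded in the string itself, switching vendors amounts to changing that one value, which is what makes quick experimentation practical.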
@@ -52,20 +58,19 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
4. **`extraction_type`** (str): `"schema"` or `"block"`.
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
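1. **`llmConfig`** (LlmConfig): An `LlmConfig` object that carries the provider string (e.g., `"openai/gpt-4"`, `"ollama/llama2"`) and, if needed, the API token for that model.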
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
3. **`extraction_type`** (str): `"schema"` or `"block"`.
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
5. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
6. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
7. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
8. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
- `"markdown"`: The raw markdown (default).
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
- `"html"`: The cleaned or raw HTML.
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
9. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
10. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
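To make the interaction between `chunk_token_threshold` and `overlap_rate` concrete, here is a rough sketch of overlapping chunking. It assumes whitespace-separated words as a stand-in for tokens, and `chunk_with_overlap` is a hypothetical helper; Crawl4AI's own chunking logic may count tokens and place boundaries differently.

```python
def chunk_with_overlap(text: str, chunk_token_threshold: int, overlap_rate: float) -> list[str]:
    """Split text into chunks of at most `chunk_token_threshold` words.

    Each chunk after the first repeats the last
    int(chunk_token_threshold * overlap_rate) words of the previous chunk.
    Words are only an approximation of LLM tokens.
    """
    if chunk_token_threshold <= 0:
        raise ValueError("chunk_token_threshold must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")

    words = text.split()
    overlap = int(chunk_token_threshold * overlap_rate)
    step = chunk_token_threshold - overlap  # always >= 1 because overlap_rate < 1

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_token_threshold]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_token_threshold >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_with_overlap(sample, chunk_token_threshold=4, overlap_rate=0.25):
        print(chunk)
```

A higher `overlap_rate` preserves more context across chunk boundaries but increases the total number of tokens sent to the LLM, so it trades cost for continuity.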
**Example**:
@@ -233,8 +238,7 @@ class KnowledgeGraph(BaseModel):
async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        llmConfig = LlmConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
        schema=KnowledgeGraph.schema_json(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
@@ -286,7 +290,7 @@ if __name__ == "__main__":
## 11. Conclusion
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LightLLM. It's perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it's **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. It's perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it's **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
@@ -317,4 +321,4 @@ If your sites data is consistent or repetitive, consider [`JsonCssExtractionS
---
That's it for **Extracting JSON (LLM)**. Now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!
That's it for **Extracting JSON (LLM)**. Now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!


@@ -72,6 +72,14 @@ asyncio.run(main())
---
## Video Tutorial
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/xo3qK6Hg9AA?start=15" title="Crawl4AI Tutorial" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
---
## What Does Crawl4AI Do?
Crawl4AI is a feature-rich crawler and scraper that aims to: