Compare commits
9 Commits
vr0.6.0...merge-pr97
| Author | SHA1 | Date | |
|---|---|---|---|
| | 0e5d672763 | | |
| | cd2b490b40 | | |
| | 50f0b83fcd | | |
| | 9499164d3c | | |
| | 2140d9aca4 | | |
| | ccec40ed17 | | |
| | ad4dfb21e1 | | |
| | 7784b2468e | | |
| | b2f3cb0dfa | | |
11
CHANGELOG.md
@@ -5,7 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
|
|||||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||||
|
|
||||||
## [0.6.0rc1] ‑ 2025‑04‑22
|
## [0.6.1] - 2025-04-24
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- New dedicated `tables` field in `CrawlResult` model for better table extraction handling
|
||||||
|
- Updated crypto_analysis_example.py to use the new tables field with backward compatibility
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Improved playground UI in Docker deployment with better endpoint handling and UI feedback
|
||||||
|
|
||||||
|
## [0.6.0] ‑ 2025‑04‑22
|
||||||
|
|
||||||
### Added
|
### Added
|
||||||
- Browser pooling with page pre‑warming and fine‑grained **geolocation, locale, and timezone** controls
|
- Browser pooling with page pre‑warming and fine‑grained **geolocation, locale, and timezone** controls
|
||||||
|
|||||||
10
README.md
@@ -21,9 +21,9 @@
|
|||||||
|
|
||||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||||
|
|
||||||
[✨ Check out latest update v0.6.0rc1](#-recent-updates)
|
[✨ Check out latest update v0.6.0](#-recent-updates)
|
||||||
|
|
||||||
🎉 **Version 0.6.0rc1 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||||
@@ -505,7 +505,7 @@ async def test_news_crawl():
|
|||||||
|
|
||||||
## ✨ Recent Updates
|
## ✨ Recent Updates
|
||||||
|
|
||||||
### Version 0.6.0rc1 Release Highlights
|
### Version 0.6.0 Release Highlights
|
||||||
|
|
||||||
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
|
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
|
||||||
```python
|
```python
|
||||||
@@ -575,7 +575,7 @@ async def test_news_crawl():
|
|||||||
|
|
||||||
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
|
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
|
||||||
|
|
||||||
Read the full details in our [0.6.0rc1 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||||
|
|
||||||
### Previous Version: 0.5.0 Major Release Highlights
|
### Previous Version: 0.5.0 Major Release Highlights
|
||||||
|
|
||||||
@@ -606,7 +606,7 @@ We use different suffixes to indicate development stages:
|
|||||||
- `dev` (0.4.3dev1): Development versions, unstable
|
- `dev` (0.4.3dev1): Development versions, unstable
|
||||||
- `a` (0.4.3a1): Alpha releases, experimental features
|
- `a` (0.4.3a1): Alpha releases, experimental features
|
||||||
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
|
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
|
||||||
- `rc` (0.4.3rc1): Release candidates, potential final version
|
- `rc` (0.4.3): Release candidates, potential final version
|
||||||
|
|
||||||
#### Installation
|
#### Installation
|
||||||
- Regular installation (stable version):
|
- Regular installation (stable version):
|
||||||
|
|||||||
@@ -1,3 +1,3 @@
|
|||||||
# crawl4ai/_version.py
|
# crawl4ai/_version.py
|
||||||
__version__ = "0.6.0"
|
__version__ = "0.6.3"
|
||||||
|
|
||||||
|
|||||||
@@ -427,7 +427,7 @@ class BrowserConfig:
|
|||||||
host: str = "localhost",
|
host: str = "localhost",
|
||||||
):
|
):
|
||||||
self.browser_type = browser_type
|
self.browser_type = browser_type
|
||||||
self.headless = headless or True
|
self.headless = headless
|
||||||
self.browser_mode = browser_mode
|
self.browser_mode = browser_mode
|
||||||
self.use_managed_browser = use_managed_browser
|
self.use_managed_browser = use_managed_browser
|
||||||
self.cdp_url = cdp_url
|
self.cdp_url = cdp_url
|
||||||
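The `headless` change in this hunk matters because `headless or True` evaluates to `True` for every input, so an explicit `headless=False` was silently discarded. A minimal sketch of the difference follows; the helper names `old_headless` and `new_headless` are hypothetical and exist only for illustration, they are not part of Crawl4AI.

```python
def old_headless(headless: bool = True) -> bool:
    # Mirrors the removed line: `or` falls through to True whenever
    # `headless` is falsy, so False can never survive.
    return headless or True


def new_headless(headless: bool = True) -> bool:
    # Mirrors the added line: the caller's value is stored unchanged.
    return headless


print(old_headless(False))  # True
print(new_headless(False))  # False
```

With the new assignment, the default still comes from the parameter's default value rather than from the `or` fallback.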
|
|||||||
@@ -171,7 +171,10 @@ class AsyncDatabaseManager:
|
|||||||
f"Code context:\n{error_context['code_context']}"
|
f"Code context:\n{error_context['code_context']}"
|
||||||
)
|
)
|
||||||
self.logger.error(
|
self.logger.error(
|
||||||
message=create_box_message(error_message, type="error"),
|
message="{error}",
|
||||||
|
tag="ERROR",
|
||||||
|
params={"error": str(error_message)},
|
||||||
|
boxes=["error"],
|
||||||
)
|
)
|
||||||
|
|
||||||
raise
|
raise
|
||||||
@@ -189,7 +192,10 @@ class AsyncDatabaseManager:
|
|||||||
f"Code context:\n{error_context['code_context']}"
|
f"Code context:\n{error_context['code_context']}"
|
||||||
)
|
)
|
||||||
self.logger.error(
|
self.logger.error(
|
||||||
message=create_box_message(error_message, type="error"),
|
message="{error}",
|
||||||
|
tag="ERROR",
|
||||||
|
params={"error": str(error_message)},
|
||||||
|
boxes=["error"],
|
||||||
)
|
)
|
||||||
raise
|
raise
|
||||||
finally:
|
finally:
|
||||||
|
|||||||
@@ -1,10 +1,12 @@
|
|||||||
from abc import ABC, abstractmethod
|
from abc import ABC, abstractmethod
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from typing import Optional, Dict, Any
|
from typing import Optional, Dict, Any, List
|
||||||
from colorama import Fore, Style, init
|
|
||||||
import os
|
import os
|
||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
from urllib.parse import unquote
|
from urllib.parse import unquote
|
||||||
|
from rich.console import Console
|
||||||
|
from rich.text import Text
|
||||||
|
from .utils import create_box_message
|
||||||
|
|
||||||
|
|
||||||
class LogLevel(Enum):
|
class LogLevel(Enum):
|
||||||
@@ -21,6 +23,26 @@ class LogLevel(Enum):
|
|||||||
FATAL = 10
|
FATAL = 10
|
||||||
|
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return self.name.lower()
|
||||||
|
|
||||||
|
class LogColor(str, Enum):
|
||||||
|
"""Enum for log colors."""
|
||||||
|
|
||||||
|
DEBUG = "lightblack"
|
||||||
|
INFO = "cyan"
|
||||||
|
SUCCESS = "green"
|
||||||
|
WARNING = "yellow"
|
||||||
|
ERROR = "red"
|
||||||
|
CYAN = "cyan"
|
||||||
|
GREEN = "green"
|
||||||
|
YELLOW = "yellow"
|
||||||
|
MAGENTA = "magenta"
|
||||||
|
DIM_MAGENTA = "dim magenta"
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
"""Automatically convert rich color to string."""
|
||||||
|
return self.value
|
||||||
|
|
||||||
|
|
||||||
class AsyncLoggerBase(ABC):
|
class AsyncLoggerBase(ABC):
|
||||||
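The `LogColor` enum added above mixes in `str` and overrides `__str__` to return the raw value, which appears intended to let members be dropped straight into markup tags such as `[cyan]...[/cyan]`. The sketch below reproduces only a subset of the enum; the `wrap` helper is hypothetical and not part of the diff.

```python
from enum import Enum


class LogColor(str, Enum):
    """Subset of the enum introduced in this hunk."""

    INFO = "cyan"
    ERROR = "red"

    def __str__(self) -> str:
        # Return the color name itself rather than "LogColor.INFO".
        return self.value


def wrap(text: str, color: LogColor) -> str:
    # Hypothetical helper: build a markup-tagged string from a color member.
    return f"[{color}]{text}[/{color}]"


print(wrap("done", LogColor.INFO))  # [cyan]done[/cyan]
```

Because the enum also subclasses `str`, `LogColor.INFO == "cyan"` holds, and the explicit `__str__` keeps f-string interpolation independent of how a given Python version formats mixed-in enums.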
@@ -52,6 +74,7 @@ class AsyncLoggerBase(ABC):
|
|||||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
|
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
class AsyncLogger(AsyncLoggerBase):
|
class AsyncLogger(AsyncLoggerBase):
|
||||||
"""
|
"""
|
||||||
Asynchronous logger with support for colored console output and file logging.
|
Asynchronous logger with support for colored console output and file logging.
|
||||||
@@ -79,17 +102,11 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
}
|
}
|
||||||
|
|
||||||
DEFAULT_COLORS = {
|
DEFAULT_COLORS = {
|
||||||
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
|
LogLevel.DEBUG: LogColor.DEBUG,
|
||||||
LogLevel.INFO: Fore.CYAN,
|
LogLevel.INFO: LogColor.INFO,
|
||||||
LogLevel.SUCCESS: Fore.GREEN,
|
LogLevel.SUCCESS: LogColor.SUCCESS,
|
||||||
LogLevel.WARNING: Fore.YELLOW,
|
LogLevel.WARNING: LogColor.WARNING,
|
||||||
LogLevel.ERROR: Fore.RED,
|
LogLevel.ERROR: LogColor.ERROR,
|
||||||
LogLevel.CRITICAL: Fore.RED + Style.BRIGHT,
|
|
||||||
LogLevel.ALERT: Fore.RED + Style.BRIGHT,
|
|
||||||
LogLevel.NOTICE: Fore.BLUE,
|
|
||||||
LogLevel.EXCEPTION: Fore.RED + Style.BRIGHT,
|
|
||||||
LogLevel.FATAL: Fore.RED + Style.BRIGHT,
|
|
||||||
LogLevel.DEFAULT: Fore.WHITE,
|
|
||||||
}
|
}
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
@@ -98,7 +115,7 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
log_level: LogLevel = LogLevel.DEBUG,
|
log_level: LogLevel = LogLevel.DEBUG,
|
||||||
tag_width: int = 10,
|
tag_width: int = 10,
|
||||||
icons: Optional[Dict[str, str]] = None,
|
icons: Optional[Dict[str, str]] = None,
|
||||||
colors: Optional[Dict[LogLevel, str]] = None,
|
colors: Optional[Dict[LogLevel, LogColor]] = None,
|
||||||
verbose: bool = True,
|
verbose: bool = True,
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
@@ -112,13 +129,13 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
colors: Custom colors for different log levels
|
colors: Custom colors for different log levels
|
||||||
verbose: Whether to output to console
|
verbose: Whether to output to console
|
||||||
"""
|
"""
|
||||||
init() # Initialize colorama
|
|
||||||
self.log_file = log_file
|
self.log_file = log_file
|
||||||
self.log_level = log_level
|
self.log_level = log_level
|
||||||
self.tag_width = tag_width
|
self.tag_width = tag_width
|
||||||
self.icons = icons or self.DEFAULT_ICONS
|
self.icons = icons or self.DEFAULT_ICONS
|
||||||
self.colors = colors or self.DEFAULT_COLORS
|
self.colors = colors or self.DEFAULT_COLORS
|
||||||
self.verbose = verbose
|
self.verbose = verbose
|
||||||
|
self.console = Console()
|
||||||
|
|
||||||
# Create log file directory if needed
|
# Create log file directory if needed
|
||||||
if log_file:
|
if log_file:
|
||||||
@@ -143,16 +160,11 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
def _write_to_file(self, message: str):
|
def _write_to_file(self, message: str):
|
||||||
"""Write a message to the log file if configured."""
|
"""Write a message to the log file if configured."""
|
||||||
if self.log_file:
|
if self.log_file:
|
||||||
|
text = Text.from_markup(message)
|
||||||
|
plain_text = text.plain
|
||||||
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
|
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
|
||||||
with open(self.log_file, "a", encoding="utf-8") as f:
|
with open(self.log_file, "a", encoding="utf-8") as f:
|
||||||
# Strip ANSI color codes for file output
|
f.write(f"[{timestamp}] {plain_text}\n")
|
||||||
clean_message = message.replace(Fore.RESET, "").replace(
|
|
||||||
Style.RESET_ALL, ""
|
|
||||||
)
|
|
||||||
for color in vars(Fore).values():
|
|
||||||
if isinstance(color, str):
|
|
||||||
clean_message = clean_message.replace(color, "")
|
|
||||||
f.write(f"[{timestamp}] {clean_message}\n")
|
|
||||||
|
|
||||||
def _log(
|
def _log(
|
||||||
self,
|
self,
|
||||||
@@ -160,8 +172,9 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
message: str,
|
message: str,
|
||||||
tag: str,
|
tag: str,
|
||||||
params: Optional[Dict[str, Any]] = None,
|
params: Optional[Dict[str, Any]] = None,
|
||||||
colors: Optional[Dict[str, str]] = None,
|
colors: Optional[Dict[str, LogColor]] = None,
|
||||||
base_color: Optional[str] = None,
|
boxes: Optional[List[str]] = None,
|
||||||
|
base_color: Optional[LogColor] = None,
|
||||||
**kwargs,
|
**kwargs,
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
@@ -173,55 +186,44 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
tag: Tag for the message
|
tag: Tag for the message
|
||||||
params: Parameters to format into the message
|
params: Parameters to format into the message
|
||||||
colors: Color overrides for specific parameters
|
colors: Color overrides for specific parameters
|
||||||
|
boxes: Box overrides for specific parameters
|
||||||
base_color: Base color for the entire message
|
base_color: Base color for the entire message
|
||||||
"""
|
"""
|
||||||
if level.value < self.log_level.value:
|
if level.value < self.log_level.value:
|
||||||
return
|
return
|
||||||
|
|
||||||
# Format the message with parameters if provided
|
# avoid conflict with rich formatting
|
||||||
|
parsed_message = message.replace("[", "[[").replace("]", "]]")
|
||||||
if params:
|
if params:
|
||||||
try:
|
# FIXME: If there are formatting strings in floating point format,
|
||||||
# First format the message with raw parameters
|
# this may result in colors and boxes not being applied properly.
|
||||||
formatted_message = message.format(**params)
|
# such as {value:.2f}, the value is 0.23333 format it to 0.23,
|
||||||
|
# but we replace("0.23333", "[color]0.23333[/color]")
|
||||||
|
formatted_message = parsed_message.format(**params)
|
||||||
|
for key, value in params.items():
|
||||||
|
# value_str may discard `[` and `]`, so we need to replace it.
|
||||||
|
value_str = str(value).replace("[", "[[").replace("]", "]]")
|
||||||
|
# check is need apply color
|
||||||
|
if colors and key in colors:
|
||||||
|
color_str = f"[{colors[key]}]{value_str}[/{colors[key]}]"
|
||||||
|
formatted_message = formatted_message.replace(value_str, color_str)
|
||||||
|
value_str = color_str
|
||||||
|
|
||||||
# Then apply colors if specified
|
# check is need apply box
|
||||||
color_map = {
|
if boxes and key in boxes:
|
||||||
"green": Fore.GREEN,
|
formatted_message = formatted_message.replace(value_str,
|
||||||
"red": Fore.RED,
|
create_box_message(value_str, type=str(level)))
|
||||||
"yellow": Fore.YELLOW,
|
|
||||||
"blue": Fore.BLUE,
|
|
||||||
"cyan": Fore.CYAN,
|
|
||||||
"magenta": Fore.MAGENTA,
|
|
||||||
"white": Fore.WHITE,
|
|
||||||
"black": Fore.BLACK,
|
|
||||||
"reset": Style.RESET_ALL,
|
|
||||||
}
|
|
||||||
if colors:
|
|
||||||
for key, color in colors.items():
|
|
||||||
# Find the formatted value in the message and wrap it with color
|
|
||||||
if color in color_map:
|
|
||||||
color = color_map[color]
|
|
||||||
if key in params:
|
|
||||||
value_str = str(params[key])
|
|
||||||
formatted_message = formatted_message.replace(
|
|
||||||
value_str, f"{color}{value_str}{Style.RESET_ALL}"
|
|
||||||
)
|
|
||||||
|
|
||||||
except KeyError as e:
|
|
||||||
formatted_message = (
|
|
||||||
f"LOGGING ERROR: Missing parameter {e} in message template"
|
|
||||||
)
|
|
||||||
level = LogLevel.ERROR
|
|
||||||
else:
|
else:
|
||||||
formatted_message = message
|
formatted_message = parsed_message
|
||||||
|
|
||||||
# Construct the full log line
|
# Construct the full log line
|
||||||
color = base_color or self.colors[level]
|
color: LogColor = base_color or self.colors[level]
|
||||||
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
|
log_line = f"[{color}]{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message} [/{color}]"
|
||||||
|
|
||||||
# Output to console if verbose
|
# Output to console if verbose
|
||||||
if self.verbose or kwargs.get("force_verbose", False):
|
if self.verbose or kwargs.get("force_verbose", False):
|
||||||
print(log_line)
|
self.console.print(log_line)
|
||||||
|
|
||||||
# Write to file if configured
|
# Write to file if configured
|
||||||
self._write_to_file(log_line)
|
self._write_to_file(log_line)
|
||||||
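The FIXME in the hunk above describes a limitation of coloring by string replacement: the code searches the formatted message for `str(value)`, but a format spec such as `{value:.2f}` changes the text that actually appears. The sketch below isolates that behavior under simplifying assumptions; `colorize` is a hypothetical stand-in for the relevant part of `_log`, and the bracket-escaping step is left out.

```python
def colorize(template: str, params: dict, colors: dict) -> str:
    """Format the template, then wrap each colored parameter's text in markup tags."""
    message = template.format(**params)
    for key, value in params.items():
        if key in colors:
            value_str = str(value)
            tagged = f"[{colors[key]}]{value_str}[/{colors[key]}]"
            # Replacement only succeeds if str(value) appears verbatim.
            message = message.replace(value_str, tagged)
    return message


plain = colorize("took {t}s", {"t": 0.23333}, {"t": "yellow"})
rounded = colorize("took {t:.2f}s", {"t": 0.23333}, {"t": "yellow"})

print(plain)    # took [yellow]0.23333[/yellow]s
print(rounded)  # took 0.23s
```

In the second call the message contains `0.23`, not `0.23333`, so the search string is never found and the value stays uncolored.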
@@ -292,8 +294,8 @@ class AsyncLogger(AsyncLoggerBase):
|
|||||||
"timing": timing,
|
"timing": timing,
|
||||||
},
|
},
|
||||||
colors={
|
colors={
|
||||||
"status": Fore.GREEN if success else Fore.RED,
|
"status": LogColor.SUCCESS if success else LogColor.ERROR,
|
||||||
"timing": Fore.YELLOW,
|
"timing": LogColor.WARNING,
|
||||||
},
|
},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -2,7 +2,6 @@ from .__version__ import __version__ as crawl4ai_version
|
|||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
import time
|
import time
|
||||||
from colorama import Fore
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Optional, List
|
from typing import Optional, List
|
||||||
import json
|
import json
|
||||||
@@ -44,7 +43,6 @@ from .utils import (
|
|||||||
sanitize_input_encode,
|
sanitize_input_encode,
|
||||||
InvalidCSSSelectorError,
|
InvalidCSSSelectorError,
|
||||||
fast_format_html,
|
fast_format_html,
|
||||||
create_box_message,
|
|
||||||
get_error_context,
|
get_error_context,
|
||||||
RobotsParser,
|
RobotsParser,
|
||||||
preprocess_html_for_schema,
|
preprocess_html_for_schema,
|
||||||
@@ -419,7 +417,7 @@ class AsyncWebCrawler:
|
|||||||
|
|
||||||
self.logger.error_status(
|
self.logger.error_status(
|
||||||
url=url,
|
url=url,
|
||||||
error=create_box_message(error_message, type="error"),
|
error=error_message,
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -496,11 +494,13 @@ class AsyncWebCrawler:
|
|||||||
cleaned_html = sanitize_input_encode(
|
cleaned_html = sanitize_input_encode(
|
||||||
result.get("cleaned_html", ""))
|
result.get("cleaned_html", ""))
|
||||||
media = result.get("media", {})
|
media = result.get("media", {})
|
||||||
|
tables = media.pop("tables", []) if isinstance(media, dict) else []
|
||||||
links = result.get("links", {})
|
links = result.get("links", {})
|
||||||
metadata = result.get("metadata", {})
|
metadata = result.get("metadata", {})
|
||||||
else:
|
else:
|
||||||
cleaned_html = sanitize_input_encode(result.cleaned_html)
|
cleaned_html = sanitize_input_encode(result.cleaned_html)
|
||||||
media = result.media.model_dump()
|
media = result.media.model_dump()
|
||||||
|
tables = media.pop("tables", [])
|
||||||
links = result.links.model_dump()
|
links = result.links.model_dump()
|
||||||
metadata = result.metadata
|
metadata = result.metadata
|
||||||
|
|
||||||
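The two added `tables = media.pop("tables", ...)` lines move table data out of `media` so it can be passed to the result as its own field. A minimal sketch of that step is shown here; `split_tables` is a hypothetical helper, and the `headers`/`rows` shape of the sample table is an assumption made for illustration only.

```python
from typing import Any, List, Tuple


def split_tables(media: Any) -> Tuple[Any, List[dict]]:
    # Same expression as the dict branch of the diff: pop "tables" when media
    # is a dict, otherwise fall back to an empty list.
    tables = media.pop("tables", []) if isinstance(media, dict) else []
    return media, tables


sample = {
    "images": [{"src": "a.png"}],
    "tables": [{"headers": ["h"], "rows": [["1"]]}],  # assumed shape
}
media, tables = split_tables(sample)

print(sorted(media))  # ['images']
print(len(tables))    # 1
```

Since `pop` mutates the dictionary, `media` no longer carries a `tables` key after this step; code that still reads `media["tables"]` needs a fallback.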
@@ -627,6 +627,7 @@ class AsyncWebCrawler:
|
|||||||
cleaned_html=cleaned_html,
|
cleaned_html=cleaned_html,
|
||||||
markdown=markdown_result,
|
markdown=markdown_result,
|
||||||
media=media,
|
media=media,
|
||||||
|
tables=tables, # NEW
|
||||||
links=links,
|
links=links,
|
||||||
metadata=metadata,
|
metadata=metadata,
|
||||||
screenshot=screenshot_data,
|
screenshot=screenshot_data,
|
||||||
|
|||||||
@@ -5,7 +5,10 @@ import os
|
|||||||
import sys
|
import sys
|
||||||
import shutil
|
import shutil
|
||||||
import tempfile
|
import tempfile
|
||||||
|
import psutil
|
||||||
|
import signal
|
||||||
import subprocess
|
import subprocess
|
||||||
|
import shlex
|
||||||
from playwright.async_api import BrowserContext
|
from playwright.async_api import BrowserContext
|
||||||
import hashlib
|
import hashlib
|
||||||
from .js_snippet import load_js_script
|
from .js_snippet import load_js_script
|
||||||
@@ -194,6 +197,45 @@ class ManagedBrowser:
|
|||||||
if self.browser_config.extra_args:
|
if self.browser_config.extra_args:
|
||||||
args.extend(self.browser_config.extra_args)
|
args.extend(self.browser_config.extra_args)
|
||||||
|
|
||||||
|
|
||||||
|
# ── make sure no old Chromium instance is owning the same port/profile ──
|
||||||
|
try:
|
||||||
|
if sys.platform == "win32":
|
||||||
|
if psutil is None:
|
||||||
|
raise RuntimeError("psutil not available, cannot clean old browser")
|
||||||
|
for p in psutil.process_iter(["pid", "name", "cmdline"]):
|
||||||
|
cl = " ".join(p.info.get("cmdline") or [])
|
||||||
|
if (
|
||||||
|
f"--remote-debugging-port={self.debugging_port}" in cl
|
||||||
|
and f"--user-data-dir={self.user_data_dir}" in cl
|
||||||
|
):
|
||||||
|
p.kill()
|
||||||
|
p.wait(timeout=5)
|
||||||
|
else: # macOS / Linux
|
||||||
|
# kill any process listening on the same debugging port
|
||||||
|
pids = (
|
||||||
|
subprocess.check_output(shlex.split(f"lsof -t -i:{self.debugging_port}"))
|
||||||
|
.decode()
|
||||||
|
.strip()
|
||||||
|
.splitlines()
|
||||||
|
)
|
||||||
|
for pid in pids:
|
||||||
|
try:
|
||||||
|
os.kill(int(pid), signal.SIGTERM)
|
||||||
|
except ProcessLookupError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# remove Chromium singleton locks, or new launch exits with
|
||||||
|
# “Opening in existing browser session.”
|
||||||
|
for f in ("SingletonLock", "SingletonSocket", "SingletonCookie"):
|
||||||
|
fp = os.path.join(self.user_data_dir, f)
|
||||||
|
if os.path.exists(fp):
|
||||||
|
os.remove(fp)
|
||||||
|
except Exception as _e:
|
||||||
|
# non-fatal — we'll try to start anyway, but log what happened
|
||||||
|
self.logger.warning(f"pre-launch cleanup failed: {_e}", tag="BROWSER")
|
||||||
|
|
||||||
|
|
||||||
# Start browser process
|
# Start browser process
|
||||||
try:
|
try:
|
||||||
# Use DETACHED_PROCESS flag on Windows to fully detach the process
|
# Use DETACHED_PROCESS flag on Windows to fully detach the process
|
||||||
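The last part of the pre-launch cleanup above deletes Chromium's singleton files from the profile directory so that a new launch does not attach to a stale session. The sketch below isolates that step; `remove_singleton_locks` is a hypothetical helper, and it uses `os.path.lexists` instead of the diff's `os.path.exists`, on the assumption that a lock entry may be a dangling symlink that `exists` would report as missing.

```python
import os
import tempfile

SINGLETON_FILES = ("SingletonLock", "SingletonSocket", "SingletonCookie")


def remove_singleton_locks(user_data_dir: str) -> list:
    """Delete stale singleton entries and return the names that were removed."""
    removed = []
    for name in SINGLETON_FILES:
        path = os.path.join(user_data_dir, name)
        # lexists is also true for symlinks whose target no longer exists.
        if os.path.lexists(path):
            os.remove(path)
            removed.append(name)
    return removed


with tempfile.TemporaryDirectory() as profile_dir:
    # Simulate a profile left behind by a crashed browser.
    open(os.path.join(profile_dir, "SingletonLock"), "w").close()
    open(os.path.join(profile_dir, "Preferences"), "w").close()
    removed = remove_singleton_locks(profile_dir)
    remaining = sorted(os.listdir(profile_dir))

print(removed)    # ['SingletonLock']
print(remaining)  # ['Preferences']
```

Only the three named entries are touched, so the rest of the profile (cookies, preferences) is left in place.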
@@ -922,7 +964,7 @@ class BrowserManager:
|
|||||||
pages = context.pages
|
pages = context.pages
|
||||||
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
||||||
if not page:
|
if not page:
|
||||||
page = await context.new_page()
|
page = context.pages[0] # await context.new_page()
|
||||||
else:
|
else:
|
||||||
# Otherwise, check if we have an existing context for this config
|
# Otherwise, check if we have an existing context for this config
|
||||||
config_signature = self._make_config_signature(crawlerRunConfig)
|
config_signature = self._make_config_signature(crawlerRunConfig)
|
||||||
|
|||||||
@@ -15,12 +15,12 @@ import shutil
|
|||||||
import json
|
import json
|
||||||
import subprocess
|
import subprocess
|
||||||
import time
|
import time
|
||||||
from typing import List, Dict, Optional, Any, Tuple
|
from typing import List, Dict, Optional, Any
|
||||||
from colorama import Fore, Style, init
|
from rich.console import Console
|
||||||
|
|
||||||
from .async_configs import BrowserConfig
|
from .async_configs import BrowserConfig
|
||||||
from .browser_manager import ManagedBrowser
|
from .browser_manager import ManagedBrowser
|
||||||
from .async_logger import AsyncLogger, AsyncLoggerBase
|
from .async_logger import AsyncLogger, AsyncLoggerBase, LogColor
|
||||||
from .utils import get_home_folder
|
from .utils import get_home_folder
|
||||||
|
|
||||||
|
|
||||||
@@ -45,8 +45,8 @@ class BrowserProfiler:
|
|||||||
logger (AsyncLoggerBase, optional): Logger for outputting messages.
|
logger (AsyncLoggerBase, optional): Logger for outputting messages.
|
||||||
If None, a default AsyncLogger will be created.
|
If None, a default AsyncLogger will be created.
|
||||||
"""
|
"""
|
||||||
# Initialize colorama for colorful terminal output
|
# Initialize rich console for colorful input prompts
|
||||||
init()
|
self.console = Console()
|
||||||
|
|
||||||
# Create a logger if not provided
|
# Create a logger if not provided
|
||||||
if logger is None:
|
if logger is None:
|
||||||
@@ -127,26 +127,30 @@ class BrowserProfiler:
|
|||||||
profile_path = os.path.join(self.profiles_dir, profile_name)
|
profile_path = os.path.join(self.profiles_dir, profile_name)
|
||||||
os.makedirs(profile_path, exist_ok=True)
|
os.makedirs(profile_path, exist_ok=True)
|
||||||
|
|
||||||
# Print instructions for the user with colorama formatting
|
# Print instructions for the user with rich formatting
|
||||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
border = "{'='*80}"
|
||||||
self.logger.info(f"\n{border}", tag="PROFILE")
|
self.logger.info("{border}", tag="PROFILE", params={"border": f"\n{border}"}, colors={"border": LogColor.CYAN})
|
||||||
self.logger.info(f"Creating browser profile: {Fore.GREEN}{profile_name}{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.info("Creating browser profile: {profile_name}", tag="PROFILE", params={"profile_name": profile_name}, colors={"profile_name": LogColor.GREEN})
|
||||||
self.logger.info(f"Profile directory: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.info("Profile directory: {profile_path}", tag="PROFILE", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||||
|
|
||||||
self.logger.info("\nInstructions:", tag="PROFILE")
|
self.logger.info("\nInstructions:", tag="PROFILE")
|
||||||
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
|
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
|
||||||
self.logger.info(f"2. {Fore.CYAN}Log in to websites{Style.RESET_ALL}, configure settings, etc. as needed.", tag="PROFILE")
|
self.logger.info("{segment}, configure settings, etc. as needed.", tag="PROFILE", params={"segment": "2. Log in to websites"}, colors={"segment": LogColor.CYAN})
|
||||||
self.logger.info(f"3. When you're done, {Fore.YELLOW}press 'q' in this terminal{Style.RESET_ALL} to close the browser.", tag="PROFILE")
|
self.logger.info("3. When you're done, {segment} to close the browser.", tag="PROFILE", params={"segment": "press 'q' in this terminal"}, colors={"segment": LogColor.YELLOW})
|
||||||
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
|
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
|
||||||
self.logger.info(f"{border}\n", tag="PROFILE")
|
self.logger.info("{border}", tag="PROFILE", params={"border": f"{border}\n"}, colors={"border": LogColor.CYAN})
|
||||||
|
|
||||||
|
browser_config.headless = False
|
||||||
|
browser_config.user_data_dir = profile_path
|
||||||
|
|
||||||
|
|
||||||
# Create managed browser instance
|
# Create managed browser instance
|
||||||
managed_browser = ManagedBrowser(
|
managed_browser = ManagedBrowser(
|
||||||
browser_type=browser_config.browser_type,
|
browser_config=browser_config,
|
||||||
user_data_dir=profile_path,
|
# user_data_dir=profile_path,
|
||||||
headless=False, # Must be visible
|
# headless=False, # Must be visible
|
||||||
logger=self.logger,
|
logger=self.logger,
|
||||||
debugging_port=browser_config.debugging_port
|
# debugging_port=browser_config.debugging_port
|
||||||
)
|
)
|
||||||
|
|
||||||
# Set up signal handlers to ensure cleanup on interrupt
|
# Set up signal handlers to ensure cleanup on interrupt
|
||||||
@@ -181,7 +185,7 @@ class BrowserProfiler:
|
|||||||
import select
|
import select
|
||||||
|
|
||||||
# First output the prompt
|
# First output the prompt
|
||||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' when you've finished using the browser...{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
|
||||||
|
|
||||||
# Save original terminal settings
|
# Save original terminal settings
|
||||||
fd = sys.stdin.fileno()
|
fd = sys.stdin.fileno()
|
||||||
@@ -197,7 +201,7 @@ class BrowserProfiler:
|
|||||||
if readable:
|
if readable:
|
||||||
key = sys.stdin.read(1)
|
key = sys.stdin.read(1)
|
||||||
if key.lower() == 'q':
|
if key.lower() == 'q':
|
||||||
self.logger.info(f"{Fore.GREEN}Closing browser and saving profile...{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
|
||||||
user_done_event.set()
|
user_done_event.set()
|
||||||
return
|
return
|
||||||
|
|
||||||
@@ -223,7 +227,7 @@ class BrowserProfiler:
|
|||||||
self.logger.error("Failed to start browser process.", tag="PROFILE")
|
self.logger.error("Failed to start browser process.", tag="PROFILE")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
self.logger.info(f"Browser launched. {Fore.CYAN}Waiting for you to finish...{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.info("Browser launched. Waiting for you to finish...", tag="PROFILE")
|
||||||
|
|
||||||
# Start listening for keyboard input
|
# Start listening for keyboard input
|
||||||
listener_task = asyncio.create_task(listen_for_quit_command())
|
listener_task = asyncio.create_task(listen_for_quit_command())
|
||||||
@@ -245,10 +249,10 @@ class BrowserProfiler:
|
|||||||
self.logger.info("Terminating browser process...", tag="PROFILE")
|
self.logger.info("Terminating browser process...", tag="PROFILE")
|
||||||
await managed_browser.cleanup()
|
await managed_browser.cleanup()
|
||||||
|
|
||||||
self.logger.success(f"Browser closed. Profile saved at: {Fore.GREEN}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
self.logger.success(f"Browser closed. Profile saved at: {profile_path}", tag="PROFILE")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
self.logger.error(f"Error creating profile: {str(e)}", tag="PROFILE")
|
self.logger.error(f"Error creating profile: {e!s}", tag="PROFILE")
|
||||||
await managed_browser.cleanup()
|
await managed_browser.cleanup()
|
||||||
return None
|
return None
|
||||||
finally:
|
finally:
|
||||||
@@ -440,25 +444,27 @@ class BrowserProfiler:
|
|||||||
```
|
```
|
||||||
"""
|
"""
|
||||||
while True:
|
while True:
|
||||||
self.logger.info(f"\n{Fore.CYAN}Profile Management Options:{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("\nProfile Management Options:", tag="MENU")
|
||||||
self.logger.info(f"1. {Fore.GREEN}Create a new profile{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("1. Create a new profile", tag="MENU", base_color=LogColor.GREEN)
|
||||||
self.logger.info(f"2. {Fore.YELLOW}List available profiles{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("2. List available profiles", tag="MENU", base_color=LogColor.YELLOW)
|
||||||
self.logger.info(f"3. {Fore.RED}Delete a profile{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("3. Delete a profile", tag="MENU", base_color=LogColor.RED)
|
||||||
|
|
||||||
# Only show crawl option if callback provided
|
# Only show crawl option if callback provided
|
||||||
if crawl_callback:
|
if crawl_callback:
|
||||||
self.logger.info(f"4. {Fore.CYAN}Use a profile to crawl a website{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("4. Use a profile to crawl a website", tag="MENU", base_color=LogColor.CYAN)
|
||||||
self.logger.info(f"5. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("5. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||||
exit_option = "5"
|
exit_option = "5"
|
||||||
else:
|
else:
|
||||||
self.logger.info(f"4. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||||
exit_option = "4"
|
exit_option = "4"
|
||||||
|
|
||||||
choice = input(f"\n{Fore.CYAN}Enter your choice (1-{exit_option}): {Style.RESET_ALL}")
|
self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
|
||||||
|
choice = input()
|
||||||
|
|
||||||
if choice == "1":
|
if choice == "1":
|
||||||
# Create new profile
|
# Create new profile
|
||||||
name = input(f"{Fore.GREEN}Enter a name for the new profile (or press Enter for auto-generated name): {Style.RESET_ALL}")
|
self.console.print("[green]Enter a name for the new profile (or press Enter for auto-generated name): [/green]", end="")
|
||||||
|
name = input()
|
||||||
await self.create_profile(name or None)
|
await self.create_profile(name or None)
|
||||||
|
|
||||||
elif choice == "2":
|
elif choice == "2":
|
||||||
@@ -472,8 +478,8 @@ class BrowserProfiler:
|
|||||||
# Print profile information with colorama formatting
|
# Print profile information with colorama formatting
|
||||||
self.logger.info("\nAvailable profiles:", tag="PROFILES")
|
self.logger.info("\nAvailable profiles:", tag="PROFILES")
|
||||||
for i, profile in enumerate(profiles):
|
for i, profile in enumerate(profiles):
|
||||||
self.logger.info(f"[{i+1}] {Fore.CYAN}{profile['name']}{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||||
self.logger.info(f" Path: {Fore.YELLOW}{profile['path']}{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.info(f" Path: {profile['path']}", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||||
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
|
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
|
||||||
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
|
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
|
||||||
self.logger.info("", tag="PROFILES") # Empty line for spacing
|
self.logger.info("", tag="PROFILES") # Empty line for spacing
|
||||||
@@ -486,12 +492,13 @@ class BrowserProfiler:
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
# Display numbered list
|
# Display numbered list
|
||||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||||
for i, profile in enumerate(profiles):
|
for i, profile in enumerate(profiles):
|
||||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||||
|
|
||||||
# Get profile to delete
|
# Get profile to delete
|
||||||
profile_idx = input(f"{Fore.RED}Enter the number of the profile to delete (or 'c' to cancel): {Style.RESET_ALL}")
|
self.console.print("[red]Enter the number of the profile to delete (or 'c' to cancel): [/red]", end="")
|
||||||
|
profile_idx = input()
|
||||||
if profile_idx.lower() == 'c':
|
if profile_idx.lower() == 'c':
|
||||||
continue
|
continue
|
||||||
|
|
||||||
@@ -499,17 +506,18 @@ class BrowserProfiler:
|
|||||||
idx = int(profile_idx) - 1
|
idx = int(profile_idx) - 1
|
||||||
if 0 <= idx < len(profiles):
|
if 0 <= idx < len(profiles):
|
||||||
profile_name = profiles[idx]["name"]
|
profile_name = profiles[idx]["name"]
|
||||||
self.logger.info(f"Deleting profile: {Fore.YELLOW}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.info(f"Deleting profile: [yellow]{profile_name}[/yellow]", tag="PROFILES")
|
||||||
|
|
||||||
# Confirm deletion
|
# Confirm deletion
|
||||||
confirm = input(f"{Fore.RED}Are you sure you want to delete this profile? (y/n): {Style.RESET_ALL}")
|
self.console.print("[red]Are you sure you want to delete this profile? (y/n): [/red]", end="")
|
||||||
|
confirm = input()
|
||||||
if confirm.lower() == 'y':
|
if confirm.lower() == 'y':
|
||||||
success = self.delete_profile(profiles[idx]["path"])
|
success = self.delete_profile(profiles[idx]["path"])
|
||||||
|
|
||||||
if success:
|
if success:
|
||||||
self.logger.success(f"Profile {Fore.GREEN}{profile_name}{Style.RESET_ALL} deleted successfully", tag="PROFILES")
|
self.logger.success(f"Profile {profile_name} deleted successfully", tag="PROFILES")
|
||||||
else:
|
else:
|
||||||
self.logger.error(f"Failed to delete profile {Fore.RED}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.error(f"Failed to delete profile {profile_name}", tag="PROFILES")
|
||||||
else:
|
else:
|
||||||
self.logger.error("Invalid profile number", tag="PROFILES")
|
self.logger.error("Invalid profile number", tag="PROFILES")
|
||||||
except ValueError:
|
except ValueError:
|
||||||
@@ -523,12 +531,13 @@ class BrowserProfiler:
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
# Display numbered list
|
# Display numbered list
|
||||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||||
for i, profile in enumerate(profiles):
|
for i, profile in enumerate(profiles):
|
||||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||||
|
|
||||||
# Get profile to use
|
# Get profile to use
|
||||||
profile_idx = input(f"{Fore.CYAN}Enter the number of the profile to use (or 'c' to cancel): {Style.RESET_ALL}")
|
self.console.print("[cyan]Enter the number of the profile to use (or 'c' to cancel): [/cyan]", end="")
|
||||||
|
profile_idx = input()
|
||||||
if profile_idx.lower() == 'c':
|
if profile_idx.lower() == 'c':
|
||||||
continue
|
continue
|
||||||
|
|
||||||
@@ -536,7 +545,8 @@ class BrowserProfiler:
|
|||||||
idx = int(profile_idx) - 1
|
idx = int(profile_idx) - 1
|
||||||
if 0 <= idx < len(profiles):
|
if 0 <= idx < len(profiles):
|
||||||
profile_path = profiles[idx]["path"]
|
profile_path = profiles[idx]["path"]
|
||||||
url = input(f"{Fore.CYAN}Enter the URL to crawl: {Style.RESET_ALL}")
|
self.console.print("[cyan]Enter the URL to crawl: [/cyan]", end="")
|
||||||
|
url = input()
|
||||||
if url:
|
if url:
|
||||||
# Call the provided crawl callback
|
# Call the provided crawl callback
|
||||||
await crawl_callback(profile_path, url)
|
await crawl_callback(profile_path, url)
|
||||||
@@ -599,11 +609,11 @@ class BrowserProfiler:
|
|||||||
# Print initial information
|
# Print initial information
|
||||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
||||||
self.logger.info(f"\n{border}", tag="CDP")
|
self.logger.info(f"\n{border}", tag="CDP")
|
||||||
self.logger.info(f"Launching standalone browser with CDP debugging", tag="CDP")
|
self.logger.info("Launching standalone browser with CDP debugging", tag="CDP")
|
||||||
self.logger.info(f"Browser type: {Fore.GREEN}{browser_type}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info("Browser type: {browser_type}", tag="CDP", params={"browser_type": browser_type}, colors={"browser_type": LogColor.CYAN})
|
||||||
self.logger.info(f"Profile path: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info("Profile path: {profile_path}", tag="CDP", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||||
self.logger.info(f"Debugging port: {Fore.CYAN}{debugging_port}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
|
||||||
self.logger.info(f"Headless mode: {Fore.CYAN}{headless}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info(f"Headless mode: {headless}", tag="CDP")
|
||||||
|
|
||||||
# Create managed browser instance
|
# Create managed browser instance
|
||||||
managed_browser = ManagedBrowser(
|
managed_browser = ManagedBrowser(
|
||||||
@@ -646,7 +656,7 @@ class BrowserProfiler:
|
|||||||
import select
|
import select
|
||||||
|
|
||||||
# First output the prompt
|
# First output the prompt
|
||||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' to stop the browser and exit...{Style.RESET_ALL}", tag="CDP")
|
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
|
||||||
|
|
||||||
# Save original terminal settings
|
# Save original terminal settings
|
||||||
fd = sys.stdin.fileno()
|
fd = sys.stdin.fileno()
|
||||||
@@ -662,7 +672,7 @@ class BrowserProfiler:
|
|||||||
if readable:
|
if readable:
|
||||||
key = sys.stdin.read(1)
|
key = sys.stdin.read(1)
|
||||||
if key.lower() == 'q':
|
if key.lower() == 'q':
|
||||||
self.logger.info(f"{Fore.GREEN}Closing browser...{Style.RESET_ALL}", tag="CDP")
|
self.logger.info("Closing browser...", tag="CDP")
|
||||||
user_done_event.set()
|
user_done_event.set()
|
||||||
return
|
return
|
||||||
|
|
||||||
@@ -716,20 +726,20 @@ class BrowserProfiler:
|
|||||||
self.logger.error("Failed to start browser process.", tag="CDP")
|
self.logger.error("Failed to start browser process.", tag="CDP")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
self.logger.info(f"Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
self.logger.info("Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
||||||
|
|
||||||
# Get CDP URL and JSON config
|
# Get CDP URL and JSON config
|
||||||
cdp_url, config_json = await get_cdp_json(debugging_port)
|
cdp_url, config_json = await get_cdp_json(debugging_port)
|
||||||
|
|
||||||
if cdp_url:
|
if cdp_url:
|
||||||
self.logger.success(f"CDP URL: {Fore.GREEN}{cdp_url}{Style.RESET_ALL}", tag="CDP")
|
self.logger.success(f"CDP URL: {cdp_url}", tag="CDP")
|
||||||
|
|
||||||
if config_json:
|
if config_json:
|
||||||
# Display relevant CDP information
|
# Display relevant CDP information
|
||||||
self.logger.info(f"Browser: {Fore.CYAN}{config_json.get('Browser', 'Unknown')}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info(f"Browser: {config_json.get('Browser', 'Unknown')}", tag="CDP", colors={"Browser": LogColor.CYAN})
|
||||||
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP")
|
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP", colors={"Protocol-Version": LogColor.CYAN})
|
||||||
if 'webSocketDebuggerUrl' in config_json:
|
if 'webSocketDebuggerUrl' in config_json:
|
||||||
self.logger.info(f"WebSocket URL: {Fore.GREEN}{config_json['webSocketDebuggerUrl']}{Style.RESET_ALL}", tag="CDP")
|
self.logger.info("WebSocket URL: {webSocketDebuggerUrl}", tag="CDP", params={"webSocketDebuggerUrl": config_json['webSocketDebuggerUrl']}, colors={"webSocketDebuggerUrl": LogColor.GREEN})
|
||||||
else:
|
else:
|
||||||
self.logger.warning("Could not retrieve CDP configuration JSON", tag="CDP")
|
self.logger.warning("Could not retrieve CDP configuration JSON", tag="CDP")
|
||||||
else:
|
else:
|
||||||
@@ -757,7 +767,7 @@ class BrowserProfiler:
|
|||||||
self.logger.info("Terminating browser process...", tag="CDP")
|
self.logger.info("Terminating browser process...", tag="CDP")
|
||||||
await managed_browser.cleanup()
|
await managed_browser.cleanup()
|
||||||
|
|
||||||
self.logger.success(f"Browser closed.", tag="CDP")
|
self.logger.success("Browser closed.", tag="CDP")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
self.logger.error(f"Error launching standalone browser: {str(e)}", tag="CDP")
|
self.logger.error(f"Error launching standalone browser: {str(e)}", tag="CDP")
|
||||||
@@ -972,3 +982,30 @@ class BrowserProfiler:
|
|||||||
'info': browser_info
|
'info': browser_info
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Example usage
|
||||||
|
profiler = BrowserProfiler()
|
||||||
|
|
||||||
|
# Create a new profile
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
home_dir = Path.home()
|
||||||
|
profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Launch a standalone browser
|
||||||
|
asyncio.run(profiler.launch_standalone_browser())
|
||||||
|
|
||||||
|
# List profiles
|
||||||
|
profiles = profiler.list_profiles()
|
||||||
|
for profile in profiles:
|
||||||
|
print(f"Profile: {profile['name']}, Path: {profile['path']}")
|
||||||
|
|
||||||
|
# Delete a profile
|
||||||
|
success = profiler.delete_profile("my-profile")
|
||||||
|
if success:
|
||||||
|
print("Profile deleted successfully")
|
||||||
|
else:
|
||||||
|
print("Failed to delete profile")
|
||||||
@@ -27,8 +27,7 @@ import json
|
|||||||
import hashlib
|
import hashlib
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from concurrent.futures import ThreadPoolExecutor
|
from concurrent.futures import ThreadPoolExecutor
|
||||||
from .async_logger import AsyncLogger, LogLevel
|
from .async_logger import AsyncLogger, LogLevel, LogColor
|
||||||
from colorama import Fore, Style
|
|
||||||
|
|
||||||
|
|
||||||
class RelevantContentFilter(ABC):
|
class RelevantContentFilter(ABC):
|
||||||
@@ -846,8 +845,7 @@ class LLMContentFilter(RelevantContentFilter):
|
|||||||
},
|
},
|
||||||
colors={
|
colors={
|
||||||
**AsyncLogger.DEFAULT_COLORS,
|
**AsyncLogger.DEFAULT_COLORS,
|
||||||
LogLevel.INFO: Fore.MAGENTA
|
LogLevel.INFO: LogColor.DIM_MAGENTA # Dimmed purple for LLM ops
|
||||||
+ Style.DIM, # Dimmed purple for LLM ops
|
|
||||||
},
|
},
|
||||||
)
|
)
|
||||||
else:
|
else:
|
||||||
@@ -892,7 +890,7 @@ class LLMContentFilter(RelevantContentFilter):
|
|||||||
"Starting LLM markdown content filtering process",
|
"Starting LLM markdown content filtering process",
|
||||||
tag="LLM",
|
tag="LLM",
|
||||||
params={"provider": self.llm_config.provider},
|
params={"provider": self.llm_config.provider},
|
||||||
colors={"provider": Fore.CYAN},
|
colors={"provider": LogColor.CYAN},
|
||||||
)
|
)
|
||||||
|
|
||||||
# Cache handling
|
# Cache handling
|
||||||
@@ -929,7 +927,7 @@ class LLMContentFilter(RelevantContentFilter):
|
|||||||
"LLM markdown: Split content into {chunk_count} chunks",
|
"LLM markdown: Split content into {chunk_count} chunks",
|
||||||
tag="CHUNK",
|
tag="CHUNK",
|
||||||
params={"chunk_count": len(html_chunks)},
|
params={"chunk_count": len(html_chunks)},
|
||||||
colors={"chunk_count": Fore.YELLOW},
|
colors={"chunk_count": LogColor.YELLOW},
|
||||||
)
|
)
|
||||||
|
|
||||||
start_time = time.time()
|
start_time = time.time()
|
||||||
@@ -1038,7 +1036,7 @@ class LLMContentFilter(RelevantContentFilter):
|
|||||||
"LLM markdown: Completed processing in {time:.2f}s",
|
"LLM markdown: Completed processing in {time:.2f}s",
|
||||||
tag="LLM",
|
tag="LLM",
|
||||||
params={"time": end_time - start_time},
|
params={"time": end_time - start_time},
|
||||||
colors={"time": Fore.YELLOW},
|
colors={"time": LogColor.YELLOW},
|
||||||
)
|
)
|
||||||
|
|
||||||
result = ordered_results if ordered_results else []
|
result = ordered_results if ordered_results else []
|
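Review note: the hunks above swap colorama constants for `LogColor` values while keeping the `params=` / `colors=` call shape. The sketch below shows one way a template such as `"... {time:.2f}s"` could be rendered with per-placeholder colour tags. It is not the `AsyncLogger` implementation; `render_message` is a hypothetical helper, plain strings stand in for `LogColor` members, and the `[color]...[/color]` tag syntax is assumed from the markup used elsewhere in this diff.

```python
import string


def render_message(template, params=None, colors=None):
    """Format `template` with `params`, wrapping coloured fields in [color] tags.

    Hypothetical helper for illustration only; conversions such as !r are ignored.
    """
    params = params or {}
    colors = colors or {}
    parts = []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
    for literal, field, spec, _conversion in string.Formatter().parse(template):
        parts.append(literal)
        if field is None:
            continue  # trailing literal text with no placeholder
        value = format(params[field], spec or "")
        color = colors.get(field)
        parts.append(f"[{color}]{value}[/{color}]" if color else value)
    return "".join(parts)


print(render_message(
    "LLM markdown: Completed processing in {time:.2f}s",
    params={"time": 1.2345},
    colors={"time": "yellow"},
))
```

Formatting each field before tagging it keeps format specs like `:.2f` intact, which a plain string replace on the raw template would not.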
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
from pydantic import BaseModel, HttpUrl, PrivateAttr
|
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
|
||||||
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
||||||
from typing import AsyncGenerator
|
from typing import AsyncGenerator
|
||||||
from typing import Generic, TypeVar
|
from typing import Generic, TypeVar
|
||||||
@@ -150,6 +150,7 @@ class CrawlResult(BaseModel):
|
|||||||
redirected_url: Optional[str] = None
|
redirected_url: Optional[str] = None
|
||||||
network_requests: Optional[List[Dict[str, Any]]] = None
|
network_requests: Optional[List[Dict[str, Any]]] = None
|
||||||
console_messages: Optional[List[Dict[str, Any]]] = None
|
console_messages: Optional[List[Dict[str, Any]]] = None
|
||||||
|
tables: List[Dict] = Field(default_factory=list) # NEW – [{headers,rows,caption,summary}]
|
||||||
|
|
||||||
class Config:
|
class Config:
|
||||||
arbitrary_types_allowed = True
|
arbitrary_types_allowed = True
|
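Review note: the new `tables` field defaults to an empty list, so callers can read it without a `None` check. The changelog mentions backward compatibility in the example script; a minimal sketch of such a fallback follows. `get_tables` is a hypothetical helper, the legacy `media["tables"]` location is an assumption, and `SimpleNamespace` objects stand in for real `CrawlResult` instances.

```python
from types import SimpleNamespace


def get_tables(result):
    """Return extracted tables, preferring the dedicated `tables` field."""
    tables = getattr(result, "tables", None)
    if tables:
        return tables
    # Assumed legacy location: older results may keep tables under media["tables"].
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])


# Stand-ins for crawl results (made-up data, not real crawl output)
new_style = SimpleNamespace(
    tables=[{"headers": ["coin", "price"], "rows": [["BTC", "1"]]}],
    media={},
)
old_style = SimpleNamespace(
    media={"tables": [{"headers": ["coin"], "rows": [["ETH"]]}]},
)

for res in (new_style, old_style):
    for table in get_tables(res):
        print(table["headers"], len(table["rows"]))
```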
||||||
|
|||||||
@@ -20,7 +20,6 @@ from urllib.parse import urljoin
|
|||||||
import requests
|
import requests
|
||||||
from requests.exceptions import InvalidSchema
|
from requests.exceptions import InvalidSchema
|
||||||
import xxhash
|
import xxhash
|
||||||
from colorama import Fore, Style, init
|
|
||||||
import textwrap
|
import textwrap
|
||||||
import cProfile
|
import cProfile
|
||||||
import pstats
|
import pstats
|
||||||
@@ -441,14 +440,13 @@ def create_box_message(
|
|||||||
str: A formatted string containing the styled message box.
|
str: A formatted string containing the styled message box.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
init()
|
|
||||||
|
|
||||||
# Define border and text colors for different types
|
# Define border and text colors for different types
|
||||||
styles = {
|
styles = {
|
||||||
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, "⚠"),
|
"warning": ("yellow", "bright_yellow", "⚠"),
|
||||||
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, "ℹ"),
|
"info": ("blue", "bright_blue", "ℹ"),
|
||||||
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, "✓"),
|
"debug": ("lightblack", "bright_black", "⋯"),
|
||||||
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
|
"success": ("green", "bright_green", "✓"),
|
||||||
|
"error": ("red", "bright_red", "×"),
|
||||||
}
|
}
|
||||||
|
|
||||||
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
|
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
|
||||||
@@ -480,12 +478,12 @@ def create_box_message(
|
|||||||
# Create the box with colored borders and lighter text
|
# Create the box with colored borders and lighter text
|
||||||
horizontal_line = h_line * (width - 1)
|
horizontal_line = h_line * (width - 1)
|
||||||
box = [
|
box = [
|
||||||
f"{border_color}{tl}{horizontal_line}{tr}",
|
f"[{border_color}]{tl}{horizontal_line}{tr}[/{border_color}]",
|
||||||
*[
|
*[
|
||||||
f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}"
|
f"[{border_color}]{v_line}[{text_color}] {line:<{width-2}}[/{text_color}][{border_color}]{v_line}[/{border_color}]"
|
||||||
for line in formatted_lines
|
for line in formatted_lines
|
||||||
],
|
],
|
||||||
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}",
|
f"[{border_color}]{bl}{horizontal_line}{br}[/{border_color}]",
|
||||||
]
|
]
|
||||||
|
|
||||||
result = "\n".join(box)
|
result = "\n".join(box)
|
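Review note: `create_box_message` now emits `[color]...[/color]` markup instead of colorama escape codes. The sketch below builds a small box the same way using only the standard library, so the resulting string can be inspected without a terminal. `markup_box` and its box-drawing characters are illustrative assumptions, and unlike the hunk above it closes every tag it opens.

```python
def markup_box(lines, border_color="blue", text_color="bright_blue", width=20):
    """Return a boxed message whose border and text carry [color] markup tags."""

    def tag(color, text):
        return f"[{color}]{text}[/{color}]"

    h_line = "─" * (width - 1)
    box = [tag(border_color, f"╭{h_line}╮")]
    for line in lines:
        # One leading space plus left-aligned text padded to width - 2,
        # so the body is exactly as wide as the horizontal border.
        body = f" {line:<{width - 2}}"
        box.append(
            tag(border_color, "│") + tag(text_color, body) + tag(border_color, "│")
        )
    box.append(tag(border_color, f"╰{h_line}╯"))
    return "\n".join(box)


print(markup_box(["hi"], width=10))
```

A markup-aware console is still needed to turn the tags into colours; printed with plain `print`, the tags appear literally.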
||||||
@@ -2778,4 +2776,3 @@ def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_thre
|
|||||||
# Fallback for parsing errors
|
# Fallback for parsing errors
|
||||||
return html_content[:max_size] if len(html_content) > max_size else html_content
|
return html_content[:max_size] if len(html_content) > max_size else html_content
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -58,7 +58,7 @@ Pull and run images directly from Docker Hub without building locally.
|
|||||||
|
|
||||||
#### 1. Pull the Image
|
#### 1. Pull the Image
|
||||||
|
|
||||||
Our latest release candidate is `0.6.0rc1-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Pull the release candidate (recommended for latest features)
|
# Pull the release candidate (recommended for latest features)
|
||||||
@@ -124,9 +124,9 @@ docker stop crawl4ai && docker rm crawl4ai
|
|||||||
#### Docker Hub Versioning Explained
|
#### Docker Hub Versioning Explained
|
||||||
|
|
||||||
* **Image Name:** `unclecode/crawl4ai`
|
* **Image Name:** `unclecode/crawl4ai`
|
||||||
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0rc1-r1`)
|
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
|
||||||
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
|
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
|
||||||
* `SUFFIX`: Optional tag for release candidates (`rc1`) and revisions (`r1`)
|
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
|
||||||
* **`latest` Tag:** Points to the most recent stable version
|
* **`latest` Tag:** Points to the most recent stable version
|
||||||
* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag
|
* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag
|
||||||
|
|
||||||
|
|||||||
@@ -193,7 +193,48 @@
|
|||||||
<textarea id="urls" class="w-full bg-dark border border-border rounded p-2 h-32 text-sm mb-4"
|
<textarea id="urls" class="w-full bg-dark border border-border rounded p-2 h-32 text-sm mb-4"
|
||||||
spellcheck="false">https://example.com</textarea>
|
spellcheck="false">https://example.com</textarea>
|
||||||
|
|
||||||
<details class="mb-4">
|
<!-- Specific options for /md endpoint -->
|
||||||
|
<details id="md-options" class="mb-4 hidden">
|
||||||
|
<summary class="text-sm text-secondary cursor-pointer">/md Options</summary>
|
||||||
|
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||||
|
<div>
|
||||||
|
<label for="md-filter" class="block text-xs text-secondary mb-1">Filter Type</label>
|
||||||
|
<select id="md-filter" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||||
|
<option value="fit">fit - Adaptive content filtering</option>
|
||||||
|
<option value="raw">raw - No filtering</option>
|
||||||
|
<option value="bm25">bm25 - BM25 keyword relevance</option>
|
||||||
|
<option value="llm">llm - LLM-based filtering</option>
|
||||||
|
</select>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<label for="md-query" class="block text-xs text-secondary mb-1">Query (for BM25/LLM filters)</label>
|
||||||
|
<input id="md-query" type="text" placeholder="Enter search terms or instructions"
|
||||||
|
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<label for="md-cache" class="block text-xs text-secondary mb-1">Cache Mode</label>
|
||||||
|
<select id="md-cache" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||||
|
<option value="0">Write-Only (0)</option>
|
||||||
|
<option value="1">Enabled (1)</option>
|
||||||
|
</select>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</details>
|
||||||
|
|
||||||
|
<!-- Specific options for /llm endpoint -->
|
||||||
|
<details id="llm-options" class="mb-4 hidden">
|
||||||
|
<summary class="text-sm text-secondary cursor-pointer">/llm Options</summary>
|
||||||
|
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||||
|
<div>
|
||||||
|
<label for="llm-question" class="block text-xs text-secondary mb-1">Question</label>
|
||||||
|
<input id="llm-question" type="text" value="What is this page about?"
|
||||||
|
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</details>
|
||||||
|
|
||||||
|
<!-- Advanced config for /crawl endpoints -->
|
||||||
|
<details id="adv-config" class="mb-4">
|
||||||
<summary class="text-sm text-secondary cursor-pointer">Advanced Config <span
|
<summary class="text-sm text-secondary cursor-pointer">Advanced Config <span
|
||||||
class="text-xs text-primary">(Python → auto‑JSON)</span></summary>
|
class="text-xs text-primary">(Python → auto‑JSON)</span></summary>
|
||||||
|
|
||||||
@@ -438,6 +479,33 @@
|
|||||||
document.getElementById('cfg-status').textContent = '';
|
document.getElementById('cfg-status').textContent = '';
|
||||||
});
|
});
|
||||||
|
|
||||||
|
// Handle endpoint selection change to show appropriate options
|
||||||
|
document.getElementById('endpoint').addEventListener('change', function(e) {
|
||||||
|
const endpoint = e.target.value;
|
||||||
|
const mdOptions = document.getElementById('md-options');
|
||||||
|
const llmOptions = document.getElementById('llm-options');
|
||||||
|
const advConfig = document.getElementById('adv-config');
|
||||||
|
|
||||||
|
// Hide all option sections first
|
||||||
|
mdOptions.classList.add('hidden');
|
||||||
|
llmOptions.classList.add('hidden');
|
||||||
|
advConfig.classList.add('hidden');
|
||||||
|
|
||||||
|
// Show the appropriate section based on endpoint
|
||||||
|
if (endpoint === 'md') {
|
||||||
|
mdOptions.classList.remove('hidden');
|
||||||
|
// Auto-open the /md options
|
||||||
|
mdOptions.setAttribute('open', '');
|
||||||
|
} else if (endpoint === 'llm') {
|
||||||
|
llmOptions.classList.remove('hidden');
|
||||||
|
// Auto-open the /llm options
|
||||||
|
llmOptions.setAttribute('open', '');
|
||||||
|
} else {
|
||||||
|
// For /crawl endpoints, show the advanced config
|
||||||
|
advConfig.classList.remove('hidden');
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
async function pyConfigToJson() {
|
async function pyConfigToJson() {
|
||||||
const code = cm.getValue().trim();
|
const code = cm.getValue().trim();
|
||||||
if (!code) return {};
|
if (!code) return {};
|
||||||
@@ -494,10 +562,18 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Generate code snippets
|
// Generate code snippets
|
||||||
function generateSnippets(api, payload) {
|
function generateSnippets(api, payload, method = 'POST') {
|
||||||
// Python snippet
|
// Python snippet
|
||||||
const pyCodeEl = document.querySelector('#python-content code');
|
const pyCodeEl = document.querySelector('#python-content code');
|
||||||
const pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
|
let pySnippet;
|
||||||
|
|
||||||
|
if (method === 'GET') {
|
||||||
|
// GET request (for /llm endpoint)
|
||||||
|
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.get(\n "${window.location.origin}${api}"\n )\n return response.json()`;
|
||||||
|
} else {
|
||||||
|
// POST request (for /crawl and /md endpoints)
|
||||||
|
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
|
||||||
|
}
|
||||||
|
|
||||||
pyCodeEl.textContent = pySnippet;
|
pyCodeEl.textContent = pySnippet;
|
||||||
pyCodeEl.className = 'python hljs'; // Reset classes
|
pyCodeEl.className = 'python hljs'; // Reset classes
|
||||||
@@ -505,7 +581,15 @@
|
|||||||
|
|
||||||
// cURL snippet
|
// cURL snippet
|
||||||
const curlCodeEl = document.querySelector('#curl-content code');
|
const curlCodeEl = document.querySelector('#curl-content code');
|
||||||
const curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
|
let curlSnippet;
|
||||||
|
|
||||||
|
if (method === 'GET') {
|
||||||
|
// GET request (for /llm endpoint)
|
||||||
|
curlSnippet = `curl -X GET "${window.location.origin}${api}"`;
|
||||||
|
} else {
|
||||||
|
// POST request (for /crawl and /md endpoints)
|
||||||
|
curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
|
||||||
|
}
|
||||||
|
|
||||||
curlCodeEl.textContent = curlSnippet;
|
curlCodeEl.textContent = curlSnippet;
|
||||||
curlCodeEl.className = 'bash hljs'; // Reset classes
|
curlCodeEl.className = 'bash hljs'; // Reset classes
|
||||||
@@ -536,20 +620,39 @@
|
|||||||
|
|
||||||
const endpointMap = {
|
const endpointMap = {
|
||||||
crawl: '/crawl',
|
crawl: '/crawl',
|
||||||
};
|
// crawl_stream: '/crawl/stream',
|
||||||
|
|
||||||
/*const endpointMap = {
|
|
||||||
crawl: '/crawl',
|
|
||||||
crawl_stream: '/crawl/stream',
|
|
||||||
md: '/md',
|
md: '/md',
|
||||||
llm: '/llm'
|
llm: '/llm'
|
||||||
};*/
|
};
|
||||||
|
|
||||||
const api = endpointMap[endpoint];
|
const api = endpointMap[endpoint];
|
||||||
const payload = {
|
let payload;
|
||||||
|
|
||||||
|
// Create appropriate payload based on endpoint type
|
||||||
|
if (endpoint === 'md') {
|
||||||
|
// Get values from the /md specific inputs
|
||||||
|
const filterType = document.getElementById('md-filter').value;
|
||||||
|
const query = document.getElementById('md-query').value.trim();
|
||||||
|
const cache = document.getElementById('md-cache').value;
|
||||||
|
|
||||||
|
// MD endpoint expects: { url, f, q, c }
|
||||||
|
payload = {
|
||||||
|
url: urls[0], // Take first URL
|
||||||
|
f: filterType, // Lowercase filter type as required by server
|
||||||
|
q: query || null, // Use the query if provided, otherwise null
|
||||||
|
c: cache
|
||||||
|
};
|
||||||
|
} else if (endpoint === 'llm') {
|
||||||
|
// LLM endpoint has a different URL pattern and uses query params
|
||||||
|
// This will be handled directly in the fetch below
|
||||||
|
payload = null;
|
||||||
|
} else {
|
||||||
|
// Default payload for /crawl and /crawl/stream
|
||||||
|
payload = {
|
||||||
urls,
|
urls,
|
||||||
...advConfig
|
...advConfig
|
||||||
};
|
};
|
||||||
|
}
|
||||||
|
|
||||||
updateStatus('processing');
|
updateStatus('processing');
|
||||||
|
|
||||||
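Review note: the `/md` branch above sends `{ url, f, q, c }`, with an empty query mapped to `null` and the cache value passed through as the select's string value. A minimal Python sketch of the same payload shape follows. `build_md_payload` is a hypothetical helper, and the allowed filter names are taken from the `<option>` values in this diff rather than from server documentation.

```python
import json

# Filter names as they appear in the playground's <select id="md-filter">
MD_FILTERS = {"fit", "raw", "bm25", "llm"}


def build_md_payload(url, filter_type="fit", query=None, cache="0"):
    """Build the JSON body the playground posts to /md (shape assumed from the diff)."""
    if filter_type not in MD_FILTERS:
        raise ValueError(f"unknown filter type: {filter_type!r}")
    query = (query or "").strip()
    return {
        "url": url,           # the playground sends only the first URL
        "f": filter_type,     # lowercase filter name
        "q": query or None,   # empty query becomes JSON null
        "c": cache,           # string value, as read from the <select>
    }


print(json.dumps(build_md_payload("https://example.com", "bm25", " pricing ")))
```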
@@ -557,7 +660,18 @@
|
|||||||
const startTime = performance.now();
|
const startTime = performance.now();
|
||||||
let response, responseData;
|
let response, responseData;
|
||||||
|
|
||||||
if (endpoint === 'crawl_stream') {
|
if (endpoint === 'llm') {
|
||||||
|
// Special handling for LLM endpoint which uses URL pattern: /llm/{encoded_url}?q={query}
|
||||||
|
const url = urls[0];
|
||||||
|
const encodedUrl = encodeURIComponent(url);
|
||||||
|
// Get the question from the LLM-specific input
|
||||||
|
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||||
|
|
||||||
|
response = await fetch(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, {
|
||||||
|
method: 'GET',
|
||||||
|
headers: { 'Accept': 'application/json' }
|
||||||
|
});
|
||||||
|
} else if (endpoint === 'crawl_stream') {
|
||||||
// Stream processing
|
// Stream processing
|
||||||
response = await fetch(api, {
|
response = await fetch(api, {
|
||||||
method: 'POST',
|
method: 'POST',
|
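Review note: the `/llm` branch builds `/llm/{encoded_url}?q={query}` with `encodeURIComponent`. A Python sketch of an equivalent encoding follows, in case the generated snippet needs to be reproduced by hand. `build_llm_path` is a hypothetical helper; the `safe` set is chosen to leave unescaped the same punctuation that `encodeURIComponent` does.

```python
from urllib.parse import quote

# Punctuation that JavaScript's encodeURIComponent leaves unescaped
_URI_COMPONENT_SAFE = "-_.!~*'()"

DEFAULT_QUESTION = "What is this page about?"


def build_llm_path(page_url, question="", base="/llm"):
    """Return the GET path used for the /llm endpoint (pattern taken from the diff)."""
    question = question.strip() or DEFAULT_QUESTION
    encoded_url = quote(page_url, safe=_URI_COMPONENT_SAFE)
    encoded_question = quote(question, safe=_URI_COMPONENT_SAFE)
    return f"{base}/{encoded_url}?q={encoded_question}"


print(build_llm_path("https://example.com"))
```

Passing an explicit `safe` set matters because `quote` leaves `/` unescaped by default, which would split the target URL across several path segments.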
||||||
@@ -597,7 +711,7 @@
|
|||||||
document.querySelector('#response-content code').className = 'json hljs'; // Reset classes
|
document.querySelector('#response-content code').className = 'json hljs'; // Reset classes
|
||||||
forceHighlightElement(document.querySelector('#response-content code'));
|
forceHighlightElement(document.querySelector('#response-content code'));
|
||||||
} else {
|
} else {
|
||||||
// Regular request
|
// Regular request (handles /crawl and /md)
|
||||||
response = await fetch(api, {
|
response = await fetch(api, {
|
||||||
method: 'POST',
|
method: 'POST',
|
||||||
headers: { 'Content-Type': 'application/json' },
|
headers: { 'Content-Type': 'application/json' },
|
||||||
@@ -625,7 +739,16 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
forceHighlightElement(document.querySelector('#response-content code'));
|
forceHighlightElement(document.querySelector('#response-content code'));
|
||||||
|
|
||||||
|
// For generateSnippets, handle the LLM case specially
|
||||||
|
if (endpoint === 'llm') {
|
||||||
|
const url = urls[0];
|
||||||
|
const encodedUrl = encodeURIComponent(url);
|
||||||
|
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||||
|
generateSnippets(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, null, 'GET');
|
||||||
|
} else {
|
||||||
generateSnippets(api, payload);
|
generateSnippets(api, payload);
|
||||||
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error('Error:', error);
|
console.error('Error:', error);
|
||||||
updateStatus('error');
|
updateStatus('error');
|
||||||
@@ -808,8 +931,23 @@
|
|||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
// Call this in your DOMContentLoaded or initialization
|
// Function to initialize UI based on selected endpoint
|
||||||
|
function initUI() {
|
||||||
|
// Trigger the endpoint change handler to set initial UI state
|
||||||
|
const endpointSelect = document.getElementById('endpoint');
|
||||||
|
const event = new Event('change');
|
||||||
|
endpointSelect.dispatchEvent(event);
|
||||||
|
|
||||||
|
// Initialize copy buttons
|
||||||
initCopyButtons();
|
initCopyButtons();
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize on page load
|
||||||
|
document.addEventListener('DOMContentLoaded', initUI);
|
||||||
|
// Also call it immediately in case the script runs after DOM is already loaded
|
||||||
|
if (document.readyState !== 'loading') {
|
||||||
|
initUI();
|
||||||
|
}
|
||||||
|
|
||||||
</script>
|
</script>
|
||||||
</body>
|
</body>
|
||||||
|
|||||||
126
docs/apps/linkdin/README.md
Normal file
@@ -0,0 +1,126 @@
|
|||||||
|
# Crawl4AI Prospect‑Wizard – step‑by‑step guide
|
||||||
|
|
||||||
|
A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.
|
||||||
|
|
||||||
|
```
|
||||||
|
prospect‑wizard/
|
||||||
|
├─ c4ai_discover.py # Stage 1 – scrape companies + people
|
||||||
|
├─ c4ai_insights.py # Stage 2 – embeddings, org‑charts, scores
|
||||||
|
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
|
||||||
|
└─ data/ # output lands here (*.jsonl / *.json)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1 Install & boot a LinkedIn profile (one‑time)
|
||||||
|
|
||||||
|
### 1.1 Install dependencies
|
||||||
|
```bash
|
||||||
|
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
|
||||||
|
```
|
||||||
|
|
||||||
|
### 1.2 Create / warm a LinkedIn browser profile
|
||||||
|
```bash
|
||||||
|
crwl profiler
|
||||||
|
```
|
||||||
|
1. The interactive shell shows **New profile** – hit **enter**.
|
||||||
|
2. Choose a name, e.g. `profile_linkedin_uc`.
|
||||||
|
3. A Chromium window opens – log in to LinkedIn, solve any CAPTCHA that appears, then close the window.
|
||||||
|
|
||||||
|
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2 Discovery – scrape companies & people
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# --geo 102713980 is the Malaysia geoUrn; --title-filters also accepts e.g. "Product,Engineering".
# --max-companies and --max-people are kept small here for workshops.
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug
|
||||||
|
```
|
||||||
|
**Outputs** in `./data/`:
|
||||||
|
* `companies.jsonl` – one JSON per company
|
||||||
|
* `people.jsonl` – one JSON per employee
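
The two files can be read back with a few lines of standard-library Python. The sketch below is only an illustration, not part of the demo scripts: the helper names `load_jsonl` and `people_by_company` are chosen here, and it assumes each people record carries the `company_handle` key that Stage 1 writes.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def people_by_company(people: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group people records by the `company_handle` key (assumed present)."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for person in people:
        grouped[person.get("company_handle", "")].append(person)
    return dict(grouped)
```

For example, `people_by_company(load_jsonl(Path("data/people.jsonl")))` returns a dict keyed by company handle, which can be matched against the `handle` field in `companies.jsonl`.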
|
||||||
|
|
||||||
|
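🛠️ **Debug defaults:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py` ignores the CLI flags and runs the built‑in demo configuration (the query and geoUrn shown above, 10 companies, 5 people per company, output in `./debug_out`).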
|
||||||
|
|
||||||
|
### Handy geoUrn cheatsheet
|
||||||
|
| Location | geoUrn |
|
||||||
|
|----------|--------|
|
||||||
|
| Singapore | **103644278** |
|
||||||
|
| Malaysia | **102713980** |
|
||||||
|
| United States | **103644922** |
|
||||||
|
| United Kingdom | **102221843** |
|
||||||
|
| Australia | **101452733** |
|
||||||
|
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
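
To see how the geoUrn ends up in the request, here is a minimal sketch of the seed URL that Stage 1 builds from `--query` and `--geo`. The function name `company_search_url` is hypothetical; the URL pattern itself mirrors the one used in `c4ai_discover.py`.

```python
from urllib.parse import quote


def company_search_url(query: str, geo_urn: int) -> str:
    """Build the LinkedIn company-search URL for a keyword query and geoUrn."""
    return (
        "https://www.linkedin.com/search/results/companies/"
        f"?keywords={quote(query)}&geoUrn={geo_urn}"
    )


print(company_search_url("health insurance management", 102713980))
# https://www.linkedin.com/search/results/companies/?keywords=health%20insurance%20management&geoUrn=102713980
```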
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3 Insights – embeddings, org‑charts, decision makers
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python c4ai_insights.py \
|
||||||
|
--in ./data \
|
||||||
|
--out ./data \
|
||||||
|
--embed_model all-MiniLM-L6-v2 \
|
||||||
|
--top_k 10 \
|
||||||
|
--openai_model gpt-4.1 \
|
||||||
|
--max_llm_tokens 8024 \
|
||||||
|
--llm_temperature 1.0 \
|
||||||
|
--workers 4
|
||||||
|
```
|
||||||
|
Emits next to the Stage‑1 files:
|
||||||
|
* `company_graph.json` – inter‑company similarity graph
|
||||||
|
* `org_chart_<handle>.json` – one per company
|
||||||
|
* `decision_makers.csv` – hand‑picked ‘who to pitch’ list
|
||||||
|
|
||||||
|
Flags reference (straight from `build_arg_parser()`):
|
||||||
|
| Flag | Default | Purpose |
|
||||||
|
|------|---------|---------|
|
||||||
|
| `--in` | `.` | Stage‑1 output dir |
|
||||||
|
| `--out` | `.` | Destination dir |
|
||||||
|
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
|
||||||
|
| `--top_k` | `10` | Neighbours per company in graph |
|
||||||
|
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
|
||||||
|
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
|
||||||
|
| `--llm_temperature` | `1.0` | Creativity knob |
|
||||||
|
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
|
||||||
|
| `--workers` | `4` | Parallel LLM workers |
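
To make `--top_k` concrete: for each company the graph keeps its `top_k` most similar neighbours by cosine similarity of the description embeddings. The sketch below shows only that selection step in plain Python, with hypothetical helper names; the actual script works on NumPy arrays and also adds small bonuses for matching industry, geoUrn, and follower ratio.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_neighbours(vectors: Sequence[Sequence[float]], top_k: int) -> List[List[int]]:
    """For each row index, return the indices of its top_k most similar other rows."""
    result: List[List[int]] = []
    for i, vi in enumerate(vectors):
        # Score every other row; the row itself is excluded explicitly.
        scored = [(cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        scored.sort(key=lambda s: s[0], reverse=True)
        result.append([j for _, j in scored[:top_k]])
    return result
```

Excluding the row itself by index, rather than dropping the first sorted entry, keeps the result correct even when two companies have identical descriptions.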
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4 Visualise – interactive graph
|
||||||
|
|
||||||
|
After Stage 2 completes, simply open the HTML viewer from the project root:
|
||||||
|
```bash
|
||||||
|
open graph_view_template.html   # or serve the folder: Live Server, or `python -m http.server`
|
||||||
|
```
|
||||||
|
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
|
||||||
|
|
||||||
|
* Left pane → list of companies (clans).
|
||||||
|
* Click a node to load its org‑chart on the right.
|
||||||
|
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5 Common snags
|
||||||
|
|
||||||
|
| Symptom | Fix |
|
||||||
|
|---------|-----|
|
||||||
|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
|
||||||
|
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
|
||||||
|
| Blank graph | Check JSON paths, clear `localStorage` in browser |
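
For the "add delay" advice on 429 responses, one common option is exponential backoff with jitter. This is a generic sketch, not something the demo scripts implement (Stage 1 only sleeps a random 0.5–1 s between companies); the function name and the default values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with jitter."""
    # Double the ceiling on every attempt, but never exceed `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter in the upper half of the window avoids synchronised retries.
    return random.uniform(ceiling / 2, ceiling)
```

A caller would `await asyncio.sleep(backoff_delay(n))` after the n-th consecutive failure and reset `n` once a request succeeds.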
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### TL;DR
|
||||||
|
`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
|
||||||
|
Live long and `import crawl4ai`.
|
||||||
|
|
||||||
440
docs/apps/linkdin/c4ai_discover.py
Normal file
@@ -0,0 +1,440 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
c4ai-discover — Stage‑1 Discovery CLI
|
||||||
|
|
||||||
|
Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
|
||||||
|
JSON files: companies.jsonl and people.jsonl.
|
||||||
|
|
||||||
|
Key design rules
|
||||||
|
----------------
|
||||||
|
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
|
||||||
|
* JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
|
||||||
|
from sample HTML provided by user and then cached under ./schemas/.
|
||||||
|
* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
|
||||||
|
* If executed as a console script (argv > 1), CLI flags win.
|
||||||
|
* Lightweight deps: argparse + Crawl4AI stack.
|
||||||
|
|
||||||
|
Author: Tom @ Kidocode 2025‑04‑26
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import warnings, re
|
||||||
|
warnings.filterwarnings(
|
||||||
|
"ignore",
|
||||||
|
message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
|
||||||
|
category=FutureWarning,
|
||||||
|
module=r"soupsieve"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Imports
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
import argparse
|
||||||
|
import random
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import pathlib
|
||||||
|
import sys
|
||||||
|
# 3rd-party rich for pretty logging
|
||||||
|
from rich.console import Console
|
||||||
|
from rich.logging import RichHandler
|
||||||
|
|
||||||
|
from datetime import datetime, UTC
|
||||||
|
from itertools import cycle
|
||||||
|
from textwrap import dedent
|
||||||
|
from types import SimpleNamespace
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
from urllib.parse import quote
|
||||||
|
from pathlib import Path
|
||||||
|
from glob import glob
|
||||||
|
|
||||||
|
from crawl4ai import (
|
||||||
|
AsyncWebCrawler,
|
||||||
|
BrowserConfig,
|
||||||
|
CacheMode,
|
||||||
|
CrawlerRunConfig,
|
||||||
|
JsonCssExtractionStrategy,
|
||||||
|
BrowserProfiler,
|
||||||
|
LLMConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Constants / paths
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||||
|
SCHEMA_DIR = BASE_DIR / "schemas"
|
||||||
|
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
|
||||||
|
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
|
||||||
|
|
||||||
|
# ---------- deterministic target JSON examples ----------
|
||||||
|
_COMPANY_SCHEMA_EXAMPLE = {
|
||||||
|
"handle": "/company/posify/",
|
||||||
|
"profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
|
||||||
|
"name": "Management Research Services, Inc. (MRS, Inc)",
|
||||||
|
"descriptor": "Insurance • Milwaukee, Wisconsin",
|
||||||
|
"about": "Insurance • Milwaukee, Wisconsin",
|
||||||
|
"followers": 1000
|
||||||
|
}
|
||||||
|
|
||||||
|
_PEOPLE_SCHEMA_EXAMPLE = {
|
||||||
|
"profile_url": "https://www.linkedin.com/in/lily-ng/",
|
||||||
|
"name": "Lily Ng",
|
||||||
|
"headline": "VP Product @ Posify",
|
||||||
|
"followers": 890,
|
||||||
|
"connection_degree": "2nd",
|
||||||
|
"avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
|
||||||
|
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
|
||||||
|
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
|
||||||
|
|
||||||
|
# --------- tighter schema prompts ----------
|
||||||
|
_COMPANY_SCHEMA_QUERY = dedent(
|
||||||
|
"""
|
||||||
|
Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
|
||||||
|
for every card, outputs *exactly* the keys shown in the example JSON below.
|
||||||
|
JSON spec:
|
||||||
|
• handle – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
|
||||||
|
• profile_image – absolute URL of the <img> inside that link
|
||||||
|
• name – text of the <a> inside the <span class*='t-16'>
|
||||||
|
• descriptor – text line with industry • location
|
||||||
|
• about – text of the <div class*='t-normal'> below the name (industry + geo)
|
||||||
|
• followers – integer parsed from the <div> containing 'followers'
|
||||||
|
|
||||||
|
IMPORTANT: Do not target elements by the base64-like (hashed) class names; they are not reliable.
The parent <div> that contains these <li> elements is "div.search-results-container"; you can use it.
The <ul> parent has role="list". These two selectors should be enough to target the <li> elements.
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
_PEOPLE_SCHEMA_QUERY = dedent(
|
||||||
|
"""
|
||||||
|
Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
|
||||||
|
outputs exactly the keys in the example JSON below.
|
||||||
|
Fields:
|
||||||
|
• profile_url – href of the outermost profile link
|
||||||
|
• name – text inside artdeco-entity-lockup__title
|
||||||
|
• headline – inner text of artdeco-entity-lockup__subtitle
|
||||||
|
• followers – integer parsed from the span inside lt-line-clamp--multi-line
|
||||||
|
• connection_degree – '1st', '2nd', etc. from artdeco-entity-lockup__badge
|
||||||
|
• avatar_url – src of the <img> within artdeco-entity-lockup__image
|
||||||
|
|
||||||
|
IMPORTANT: Do not target elements by the base64-like (hashed) class names; they are not reliable.
The parent <div> that contains these <li> elements has the classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Utility helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _load_or_build_schema(
|
||||||
|
path: pathlib.Path,
|
||||||
|
sample_html: str,
|
||||||
|
query: str,
|
||||||
|
example_json: Dict,
|
||||||
|
force = False
|
||||||
|
) -> Dict:
|
||||||
|
"""Load schema from path, else call generate_schema once and persist."""
|
||||||
|
if path.exists() and not force:
|
||||||
|
return json.loads(path.read_text())
|
||||||
|
|
||||||
|
logging.info("[SCHEMA] Generating schema %s", path.name)
|
||||||
|
schema = JsonCssExtractionStrategy.generate_schema(
|
||||||
|
html=sample_html,
|
||||||
|
llm_config=LLMConfig(
|
||||||
|
provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
|
||||||
|
api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
|
||||||
|
),
|
||||||
|
query=query,
|
||||||
|
target_json_example=json.dumps(example_json, indent=2),
|
||||||
|
)
|
||||||
|
path.write_text(json.dumps(schema, indent=2))
|
||||||
|
return schema
|
||||||
|
|
||||||
|
|
||||||
|
def _openai_friendly_number(text: str) -> Optional[int]:
    """Extract the first number from text like '1K followers' (returns 1000)."""
    # `re` is already imported at module level. Only a K/M suffix directly
    # after the digits counts, so an unrelated "m" or "k" elsewhere in the
    # text no longer scales the value; decimals such as "1.2K" are handled.
    m = re.search(r"(\d+(?:\.\d+)?)\s*([kKmM])?\b", text.replace(",", ""))
    if not m:
        return None
    val = float(m.group(1))
    suffix = (m.group(2) or "").lower()
    if suffix == "k":
        val *= 1_000
    elif suffix == "m":
        val *= 1_000_000
    return int(round(val))
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Core async workers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
|
||||||
|
"""Paginate 10-item company search pages until `limit` reached."""
|
||||||
|
extraction = JsonCssExtractionStrategy(schema)
|
||||||
|
cfg = CrawlerRunConfig(
|
||||||
|
extraction_strategy=extraction,
|
||||||
|
cache_mode=CacheMode.BYPASS,
|
||||||
|
wait_for = ".search-marvel-srp",
|
||||||
|
session_id="company_search",
|
||||||
|
delay_before_return_html=1,
|
||||||
|
magic = True,
|
||||||
|
verbose= False,
|
||||||
|
)
|
||||||
|
companies, page = [], 1
|
||||||
|
while len(companies) < max(limit, 10):
|
||||||
|
paged_url = f"{url}&page={page}"
|
||||||
|
res = await crawler.arun(paged_url, config=cfg)
|
||||||
|
batch = json.loads(res[0].extracted_content)
|
||||||
|
if not batch:
|
||||||
|
break
|
||||||
|
for item in batch:
|
||||||
|
name = item.get("name", "").strip()
|
||||||
|
handle = item.get("handle", "").strip()
|
||||||
|
if not handle or not name:
|
||||||
|
continue
|
||||||
|
descriptor = item.get("descriptor")
|
||||||
|
about = item.get("about")
|
||||||
|
followers = _openai_friendly_number(str(item.get("followers", "")))
|
||||||
|
companies.append(
|
||||||
|
{
|
||||||
|
"handle": handle,
|
||||||
|
"name": name,
|
||||||
|
"descriptor": descriptor,
|
||||||
|
"about": about,
|
||||||
|
"followers": followers,
|
||||||
|
"people_url": f"{handle}people/",
|
||||||
|
"captured_at": datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ"),  # isoformat() + "Z" would yield "+00:00Z"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
# Log the page that was just processed, then advance to the next one.
logging.info(
f"[dim]Page {page}[/] - running total: {len(companies)}/{limit} companies"
)
page += 1
|
||||||
|
|
||||||
|
return companies[:max(limit, 10)]
|
||||||
|
|
||||||
|
|
||||||
|
async def crawl_people_page(
|
||||||
|
crawler: AsyncWebCrawler,
|
||||||
|
people_url: str,
|
||||||
|
schema: Dict,
|
||||||
|
limit: int,
|
||||||
|
title_kw: str,
|
||||||
|
) -> List[Dict]:
|
||||||
|
people_u = f"{people_url}?keywords={quote(title_kw)}"
|
||||||
|
extraction = JsonCssExtractionStrategy(schema)
|
||||||
|
cfg = CrawlerRunConfig(
|
||||||
|
extraction_strategy=extraction,
|
||||||
|
# scan_full_page=True,
|
||||||
|
cache_mode=CacheMode.BYPASS,
|
||||||
|
magic=True,
|
||||||
|
wait_for=".org-people-profile-card__card-spacing",
|
||||||
|
delay_before_return_html=1,
|
||||||
|
session_id="people_search",
|
||||||
|
)
|
||||||
|
res = await crawler.arun(people_u, config=cfg)
|
||||||
|
if not res[0].success:
|
||||||
|
return []
|
||||||
|
raw = json.loads(res[0].extracted_content)
|
||||||
|
people = []
|
||||||
|
for p in raw[:limit]:
|
||||||
|
followers = _openai_friendly_number(str(p.get("followers", "")))
|
||||||
|
people.append(
|
||||||
|
{
|
||||||
|
"profile_url": p.get("profile_url"),
|
||||||
|
"name": p.get("name"),
|
||||||
|
"headline": p.get("headline"),
|
||||||
|
"followers": followers,
|
||||||
|
"connection_degree": p.get("connection_degree"),
|
||||||
|
"avatar_url": p.get("avatar_url"),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return people
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# CLI + main
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def build_arg_parser() -> argparse.ArgumentParser:
|
||||||
|
ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
|
||||||
|
sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
|
||||||
|
|
||||||
|
def add_flags(parser: argparse.ArgumentParser):
|
||||||
|
parser.add_argument("--query", required=False, help="query keyword(s)")
|
||||||
|
parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
|
||||||
|
parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
|
||||||
|
parser.add_argument("--max-companies", type=int, default=1000)
|
||||||
|
parser.add_argument("--max-people", type=int, default=500)
|
||||||
|
parser.add_argument("--profile-name", default="profile_linkedin_uc", help="browser profile created with `crwl profiler`")  # async_main reads opts.profile_name
|
||||||
|
parser.add_argument("--outdir", default="./output")
|
||||||
|
parser.add_argument("--concurrency", type=int, default=4)
|
||||||
|
parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
|
||||||
|
|
||||||
|
add_flags(sub.add_parser("full"))
|
||||||
|
add_flags(sub.add_parser("companies"))
|
||||||
|
add_flags(sub.add_parser("people"))
|
||||||
|
|
||||||
|
# global flags
|
||||||
|
ap.add_argument(
|
||||||
|
"--debug",
|
||||||
|
action="store_true",
|
||||||
|
help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
|
||||||
|
)
|
||||||
|
return ap
|
||||||
|
|
||||||
|
|
||||||
|
def detect_debug_defaults(force = False) -> SimpleNamespace:
|
||||||
|
if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
|
||||||
|
return SimpleNamespace()
|
||||||
|
# ----- debug‑friendly defaults -----
|
||||||
|
return SimpleNamespace(
|
||||||
|
cmd="full",
|
||||||
|
query="health insurance management",
|
||||||
|
geo=102713980,
|
||||||
|
# title_filters="Product,Engineering",
|
||||||
|
title_filters="",
|
||||||
|
max_companies=10,
|
||||||
|
max_people=5,
|
||||||
|
profile_name="profile_linkedin_uc",
|
||||||
|
outdir="./debug_out",
|
||||||
|
concurrency=2,
|
||||||
|
log_level="debug",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
async def async_main(opts):
|
||||||
|
# ─────────── logging setup ───────────
|
||||||
|
console = Console()
|
||||||
|
logging.basicConfig(
|
||||||
|
level=opts.log_level.upper(),
|
||||||
|
format="%(message)s",
|
||||||
|
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||||
|
)
|
||||||
|
|
||||||
|
# -------------------------------------------------------------------
|
||||||
|
# Load or build schemas (one‑time LLM call each)
|
||||||
|
# -------------------------------------------------------------------
|
||||||
|
company_schema = _load_or_build_schema(
|
||||||
|
COMPANY_SCHEMA_PATH,
|
||||||
|
_SAMPLE_COMPANY_HTML,
|
||||||
|
_COMPANY_SCHEMA_QUERY,
|
||||||
|
_COMPANY_SCHEMA_EXAMPLE,
|
||||||
|
# True
|
||||||
|
)
|
||||||
|
people_schema = _load_or_build_schema(
|
||||||
|
PEOPLE_SCHEMA_PATH,
|
||||||
|
_SAMPLE_PEOPLE_HTML,
|
||||||
|
_PEOPLE_SCHEMA_QUERY,
|
||||||
|
_PEOPLE_SCHEMA_EXAMPLE,
|
||||||
|
# True
|
||||||
|
)
|
||||||
|
|
||||||
|
outdir = BASE_DIR / pathlib.Path(opts.outdir)
|
||||||
|
outdir.mkdir(parents=True, exist_ok=True)
|
||||||
|
f_companies = (BASE_DIR / outdir / "companies.jsonl").open("a", encoding="utf-8")
|
||||||
|
f_people = (BASE_DIR / outdir / "people.jsonl").open("a", encoding="utf-8")
|
||||||
|
|
||||||
|
# -------------------------------------------------------------------
|
||||||
|
# Prepare crawler with cookie pool rotation
|
||||||
|
# -------------------------------------------------------------------
|
||||||
|
profiler = BrowserProfiler()
|
||||||
|
path = profiler.get_profile_path(opts.profile_name)
|
||||||
|
bc = BrowserConfig(
|
||||||
|
headless=False,
|
||||||
|
verbose=False,
|
||||||
|
user_data_dir=path,
|
||||||
|
use_managed_browser=True,
|
||||||
|
user_agent_mode = "random",
|
||||||
|
user_agent_generator_config= {
|
||||||
|
"platforms": "mobile",
|
||||||
|
"os": "Android"
|
||||||
|
},
|
||||||
|
)
|
||||||
|
crawler = AsyncWebCrawler(config=bc)
|
||||||
|
|
||||||
|
await crawler.start()
|
||||||
|
|
||||||
|
# Single worker for simplicity; concurrency can be scaled by arun_many if needed.
|
||||||
|
# crawler = await next_crawler().start()
|
||||||
|
try:
|
||||||
|
# Build LinkedIn search URL
|
||||||
|
search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
|
||||||
|
logging.info("Seed URL => %s", search_url)
|
||||||
|
|
||||||
|
companies: List[Dict] = []
|
||||||
|
if opts.cmd in ("companies", "full"):
|
||||||
|
companies = await crawl_company_search(
|
||||||
|
crawler, search_url, company_schema, opts.max_companies
|
||||||
|
)
|
||||||
|
for c in companies:
|
||||||
|
f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
|
||||||
|
logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
|
||||||
|
|
||||||
|
if opts.cmd in ("people", "full"):
|
||||||
|
if not companies:
|
||||||
|
# load from previous run
|
||||||
|
src = outdir / "companies.jsonl"
|
||||||
|
if not src.exists():
|
||||||
|
logging.error("companies.jsonl missing — run companies/full first")
|
||||||
|
return 10
|
||||||
|
companies = [json.loads(l) for l in src.read_text().splitlines()]
|
||||||
|
total_people = 0
|
||||||
|
title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
|
||||||
|
for comp in companies:
|
||||||
|
people = await crawl_people_page(
|
||||||
|
crawler,
|
||||||
|
comp["people_url"],
|
||||||
|
people_schema,
|
||||||
|
opts.max_people,
|
||||||
|
title_kw,
|
||||||
|
)
|
||||||
|
for p in people:
|
||||||
|
rec = p | {
|
||||||
|
"company_handle": comp["handle"],
|
||||||
|
"captured_at": datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ"),  # isoformat() + "Z" would yield "+00:00Z"
|
||||||
|
}
|
||||||
|
f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
|
||||||
|
total_people += len(people)
|
||||||
|
logging.info(
|
||||||
|
f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
|
||||||
|
)
|
||||||
|
await asyncio.sleep(random.uniform(0.5, 1))
|
||||||
|
logging.info("Total people scraped: %d", total_people)
|
||||||
|
finally:
|
||||||
|
await crawler.close()
|
||||||
|
f_companies.close()
|
||||||
|
f_people.close()
|
||||||
|
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = build_arg_parser()
|
||||||
|
cli_opts = parser.parse_args()
|
||||||
|
|
||||||
|
# decide on debug defaults
|
||||||
|
if cli_opts.debug:
|
||||||
|
opts = detect_debug_defaults(force=True)
|
||||||
|
else:
|
||||||
|
env_defaults = detect_debug_defaults()
# An empty SimpleNamespace is still truthy, so check its attributes explicitly.
opts = env_defaults if vars(env_defaults) else cli_opts
|
||||||
|
|
||||||
|
if not getattr(opts, "cmd", None):
|
||||||
|
opts.cmd = "full"
|
||||||
|
|
||||||
|
exit_code = asyncio.run(async_main(opts))
|
||||||
|
sys.exit(exit_code)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
372
docs/apps/linkdin/c4ai_insights.py
Normal file
@@ -0,0 +1,372 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Stage-2 Insights builder
|
||||||
|
------------------------
|
||||||
|
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
|
||||||
|
• company_graph.json
|
||||||
|
• org_chart_<handle>.json (one per company)
|
||||||
|
• decision_makers.csv
|
||||||
|
• graph_view.html (interactive visualisation)
|
||||||
|
|
||||||
|
Run:
|
||||||
|
python c4ai_insights.py --in ./stage1_out --out ./stage2_out
|
||||||
|
|
||||||
|
Author : Tom @ Kidocode, 2025-04-28
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Imports & Third-party
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
import argparse, asyncio, json, os, sys, pathlib, random, time, csv
|
||||||
|
from datetime import datetime, UTC
|
||||||
|
from types import SimpleNamespace
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
# Pretty CLI UX
|
||||||
|
from rich.console import Console
|
||||||
|
from rich.logging import RichHandler
|
||||||
|
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
|
||||||
|
import logging
|
||||||
|
from jinja2 import Environment, FileSystemLoader, select_autoescape
|
||||||
|
|
||||||
|
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# 3rd-party deps
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
import numpy as np
|
||||||
|
# from sentence_transformers import SentenceTransformer
|
||||||
|
# from sklearn.metrics.pairwise import cosine_similarity
|
||||||
|
import pandas as pd
|
||||||
|
import hashlib
|
||||||
|
|
||||||
|
from openai import OpenAI # same SDK you pre-loaded
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Utils
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
|
||||||
|
with open(path, "r", encoding="utf-8") as f:
|
||||||
|
return [json.loads(l) for l in f]
|
||||||
|
|
||||||
|
def dump_json(obj, path: Path):
|
||||||
|
with open(path, "w", encoding="utf-8") as f:
|
||||||
|
json.dump(obj, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Constants
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Debug defaults (mirrors Stage-1 trick)
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def dev_defaults() -> SimpleNamespace:
|
||||||
|
return SimpleNamespace(
|
||||||
|
in_dir="./debug_out",
|
||||||
|
out_dir="./insights_debug",
|
||||||
|
embed_model="all-MiniLM-L6-v2",
|
||||||
|
top_k=10,
|
||||||
|
openai_model="gpt-4.1",
|
||||||
|
max_llm_tokens=8000,
|
||||||
|
llm_temperature=1.0,
|
||||||
|
workers=4, # parallel processing
|
||||||
|
stub=False, # manual
|
||||||
|
)
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Graph builders
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
|
||||||
|
logging.debug(f"Using embedding model: {model_name}")
|
||||||
|
cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
|
||||||
|
cache = {}
|
||||||
|
if cache_path.exists():
|
||||||
|
with open(cache_path) as f:
|
||||||
|
cache = json.load(f)
|
||||||
|
# flush cache if model differs
|
||||||
|
if cache.get("_model") != model_name:
|
||||||
|
cache = {}
|
||||||
|
|
||||||
|
model = SentenceTransformer(model_name)
|
||||||
|
new_texts, new_indices = [], []
|
||||||
|
vectors = np.zeros((len(companies), 384), dtype=np.float32)
|
||||||
|
|
||||||
|
for idx, comp in enumerate(companies):
|
||||||
|
text = comp.get("about") or comp.get("descriptor","")
|
||||||
|
h = hashlib.sha1(text.encode("utf-8")).hexdigest()
|
||||||
|
cached = cache.get(comp["handle"])
|
||||||
|
if cached and cached["hash"] == h:
|
||||||
|
vectors[idx] = np.array(cached["vector"], dtype=np.float32)
|
||||||
|
else:
|
||||||
|
new_texts.append(text)
|
||||||
|
new_indices.append((idx, comp["handle"], h))
|
||||||
|
|
||||||
|
if new_texts:
|
||||||
|
embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
|
||||||
|
for vec, (idx, handle, h) in zip(embeds, new_indices):
|
||||||
|
vectors[idx] = vec
|
||||||
|
cache[handle] = {"hash": h, "vector": vec.tolist()}
|
||||||
|
cache["_model"] = model_name
|
||||||
|
with open(cache_path, "w") as f:
|
||||||
|
json.dump(cache, f)
|
||||||
|
|
||||||
|
return vectors
|
||||||
|
|
||||||
|
def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
|
||||||
|
from sklearn.metrics.pairwise import cosine_similarity
|
||||||
|
sims = cosine_similarity(embeds)
|
||||||
|
nodes, edges = [], []
|
||||||
|
idx_of = {c["handle"]: i for i,c in enumerate(companies)}
|
||||||
|
for i,c in enumerate(companies):
|
||||||
|
node = dict(
|
||||||
|
id=c["handle"].strip("/"),
|
||||||
|
name=c["name"],
|
||||||
|
handle=c["handle"],
|
||||||
|
about=c.get("about",""),
|
||||||
|
people_url=c.get("people_url",""),
|
||||||
|
industry=c.get("descriptor","").split("•")[0].strip(),
|
||||||
|
geoUrn=c.get("geoUrn"),
|
||||||
|
followers=c.get("followers",0),
|
||||||
|
# desc_embed=embeds[i].tolist(),
|
||||||
|
desc_embed=[],
|
||||||
|
)
|
||||||
|
nodes.append(node)
|
||||||
|
# pick top-k most similar except itself
|
||||||
|
top_idx = np.argsort(sims[i])[::-1][1:top_k+1]
|
||||||
|
for j in top_idx:
|
||||||
|
tgt = companies[j]
|
||||||
|
weight = float(sims[i,j])
|
||||||
|
if node["industry"] == tgt.get("descriptor","").split("•")[0].strip():
|
||||||
|
weight += 0.10
|
||||||
|
if node["geoUrn"] == tgt.get("geoUrn"):
|
||||||
|
weight += 0.05
|
||||||
|
tgt['followers'] = tgt.get("followers", None) or 1
|
||||||
|
node["followers"] = node.get("followers", None) or 1
|
||||||
|
follower_ratio = min(node["followers"], tgt.get("followers",1)) / max(node["followers"] or 1, tgt.get("followers",1))
|
||||||
|
weight += 0.05 * follower_ratio
|
||||||
|
edges.append(dict(
|
||||||
|
source=node["id"],
|
||||||
|
target=tgt["handle"].strip("/"),
|
||||||
|
weight=round(weight,4),
|
||||||
|
drivers=dict(
|
||||||
|
embed_sim=round(float(sims[i,j]),4),
|
||||||
|
industry_match=0.10 if node["industry"] == tgt.get("descriptor","").split("•")[0].strip() else 0,
|
||||||
|
geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
|
||||||
|
)
|
||||||
|
))
|
||||||
|
return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Org-chart via LLM
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
async def infer_org_chart_llm(company, people, client:OpenAI, model_name:str, max_tokens:int, temperature:float, stub:bool):
|
||||||
|
if stub:
|
||||||
|
# Tiny fake org-chart when debugging offline
|
||||||
|
chief = random.choice(people)
|
||||||
|
nodes = [{
|
||||||
|
"id": chief["profile_url"],
|
||||||
|
"name": chief["name"],
|
||||||
|
"title": chief["headline"],
|
||||||
|
"dept": (chief["headline"].split() or ["Unknown"])[0],  # first word of the headline; falls back when the headline is empty
|
||||||
|
"yoe_total": 8,
|
||||||
|
"yoe_current": 2,
|
||||||
|
"seniority_score": 0.8,
|
||||||
|
"decision_score": 0.9,
|
||||||
|
"avatar_url": chief.get("avatar_url")
|
||||||
|
}]
|
||||||
|
return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
|
||||||
|
|
||||||
|
prompt = [
|
||||||
|
{"role":"system","content":"You are an expert B2B org-chart reasoner."},
|
||||||
|
{"role":"user","content":f"""Here is the company description:
|
||||||
|
|
||||||
|
<company>
|
||||||
|
{json.dumps(company, ensure_ascii=False)}
|
||||||
|
</company>
|
||||||
|
|
||||||
|
Here is a JSON list of employees:
|
||||||
|
<employees>
|
||||||
|
{json.dumps(people, ensure_ascii=False)}
|
||||||
|
</employees>
|
||||||
|
|
||||||
|
1) Build a reporting tree (manager -> direct reports)
|
||||||
|
2) For each person output a decision_score 0-1 for buying new software
|
||||||
|
|
||||||
|
Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
|
||||||
|
"""}
|
||||||
|
]
|
||||||
|
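resp = await asyncio.to_thread(  # the SDK call blocks; run it in a worker thread so the semaphore-bounded tasks can overlap
client.chat.completions.create,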
|
||||||
|
model=model_name,
|
||||||
|
messages=prompt,
|
||||||
|
max_tokens=max_tokens,
|
||||||
|
temperature=temperature,
|
||||||
|
response_format={"type":"json_object"}
|
||||||
|
)
|
||||||
|
chart = json.loads(resp.choices[0].message.content)
|
||||||
|
chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
|
||||||
|
return chart
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# CSV flatten
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
|
||||||
|
rows=[]
|
||||||
|
for p in charts_dir.glob("org_chart_*.json"):
|
||||||
|
data=json.loads(p.read_text())
|
||||||
|
comp = p.stem.split("org_chart_")[1]
|
||||||
|
for n in data.get("nodes",[]):
|
||||||
|
if n.get("decision_score",0)>=threshold:
|
||||||
|
rows.append(dict(
|
||||||
|
company=comp,
|
||||||
|
person=n["name"],
|
||||||
|
title=n["title"],
|
||||||
|
decision_score=n["decision_score"],
|
||||||
|
profile_url=n["id"]
|
||||||
|
))
|
||||||
|
pd.DataFrame(rows).to_csv(csv_path,index=False)
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# HTML rendering
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def render_html(out:Path, template_dir:Path):
|
||||||
|
# Copy graph_view_template.html (saved as graph_view.html) and ai.js from the template folder into the output folder
|
||||||
|
import shutil
|
||||||
|
shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
|
||||||
|
shutil.copy(template_dir/"ai.js", out)
|
||||||
|
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Main async pipeline
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
async def run(opts):
|
||||||
|
# ── silence SDK noise ──────────────────────────────────────────────────────
|
||||||
|
for noisy in ("openai", "httpx", "httpcore"):
|
||||||
|
lg = logging.getLogger(noisy)
|
||||||
|
lg.setLevel(logging.WARNING) # or ERROR if you want total silence
|
||||||
|
lg.propagate = False # optional: stop them reaching root
|
||||||
|
|
||||||
|
# ────────────── logging bootstrap ──────────────
|
||||||
|
console = Console()
|
||||||
|
logging.basicConfig(
|
||||||
|
level="INFO",
|
||||||
|
format="%(message)s",
|
||||||
|
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||||
|
)
|
||||||
|
|
||||||
|
in_dir = BASE_DIR / Path(opts.in_dir)
|
||||||
|
out_dir = BASE_DIR / Path(opts.out_dir)
|
||||||
|
out_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
companies = load_jsonl(in_dir/"companies.jsonl")
|
||||||
|
people = load_jsonl(in_dir/"people.jsonl")
|
||||||
|
|
||||||
|
logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
|
||||||
|
|
||||||
|
logging.info("[bold]⇢[/] Embedding company descriptions…")
|
||||||
|
# embeds = embed_descriptions(companies, opts.embed_model, opts)
|
||||||
|
|
||||||
|
logging.info("[bold]⇢[/] Building similarity graph")
|
||||||
|
# company_graph = build_company_graph(companies, embeds, opts.top_k)
|
||||||
|
# dump_json(company_graph, out_dir/"company_graph.json")
|
||||||
|
|
||||||
|
# OpenAI client (only built if not debugging)
|
||||||
|
stub = bool(opts.stub)
|
||||||
|
client = OpenAI() if not stub else None
|
||||||
|
|
||||||
|
# Filter companies that need processing
|
||||||
|
to_process = []
|
||||||
|
for comp in companies:
|
||||||
|
handle = comp["handle"].strip("/").replace("/","_")
|
||||||
|
out_file = out_dir/f"org_chart_{handle}.json"
|
||||||
|
if out_file.exists() and False:
|
||||||
|
logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
|
||||||
|
continue
|
||||||
|
to_process.append(comp)
|
||||||
|
|
||||||
|
|
||||||
|
if not to_process:
|
||||||
|
logging.info("[yellow]All companies already processed[/]")
|
||||||
|
else:
|
||||||
|
workers = getattr(opts, 'workers', 1)
|
||||||
|
parallel = workers > 1
|
||||||
|
|
||||||
|
logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
|
||||||
|
|
||||||
|
with Progress(
|
||||||
|
SpinnerColumn(),
|
||||||
|
BarColumn(),
|
||||||
|
TextColumn("[progress.description]{task.description}"),
|
||||||
|
TimeElapsedColumn(),
|
||||||
|
console=console,
|
||||||
|
) as progress:
|
||||||
|
task = progress.add_task("Org charts", total=len(to_process))
|
||||||
|
|
||||||
|
async def process_one(comp):
|
||||||
|
handle = comp["handle"].strip("/").replace("/","_")
|
||||||
|
persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
|
||||||
|
|
||||||
|
chart = await infer_org_chart_llm(
|
||||||
|
comp, persons,
|
||||||
|
client=client if client else OpenAI(api_key="sk-debug"),
|
||||||
|
model_name=opts.openai_model,
|
||||||
|
max_tokens=opts.max_llm_tokens,
|
||||||
|
temperature=opts.llm_temperature,
|
||||||
|
stub=stub,
|
||||||
|
)
|
||||||
|
chart["meta"]["company"] = comp["name"]
|
||||||
|
|
||||||
|
# Save the result immediately
|
||||||
|
dump_json(chart, out_dir/f"org_chart_{handle}.json")
|
||||||
|
|
||||||
|
progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
|
||||||
|
|
||||||
|
# Create tasks for all companies
|
||||||
|
tasks = [process_one(comp) for comp in to_process]
|
||||||
|
|
||||||
|
# Cap the number of concurrently running LLM calls at the worker count
|
||||||
|
semaphore = asyncio.Semaphore(workers)
|
||||||
|
|
||||||
|
async def bounded_process(coro):
|
||||||
|
async with semaphore:
|
||||||
|
return await coro
|
||||||
|
|
||||||
|
# Run with concurrency control
|
||||||
|
await asyncio.gather(*(bounded_process(task) for task in tasks))
|
||||||
|
|
||||||
|
logging.info("[bold]⇢[/] Flattening decision-makers CSV")
|
||||||
|
export_decision_makers(out_dir, out_dir/"decision_makers.csv")
|
||||||
|
|
||||||
|
render_html(out_dir, template_dir=BASE_DIR/"templates")
|
||||||
|
logging.success = lambda msg, **k: console.print(f"[bold green]✓[/] {msg}", **k)
|
||||||
|
logging.success(f"Stage-2 artefacts written to {out_dir}")
|
||||||
|
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
# CLI
|
||||||
|
# ───────────────────────────────────────────────────────────────────────────────
|
||||||
|
def build_arg_parser():
|
||||||
|
p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
|
||||||
|
p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
|
||||||
|
p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
|
||||||
|
p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
|
||||||
|
p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
|
||||||
|
p.add_argument("--openai_model", default="gpt-4.1")
|
||||||
|
p.add_argument("--max_llm_tokens", type=int, default=8024)
|
||||||
|
p.add_argument("--llm_temperature", type=float, default=1.0)
|
||||||
|
p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
|
||||||
|
p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
|
||||||
|
return p
|
||||||
|
|
||||||
|
def main():
|
||||||
|
dbg = dev_defaults()
|
||||||
|
opts = dbg if True else build_arg_parser().parse_args()
|
||||||
|
asyncio.run(run(opts))
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
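Review note: the edge weight assembled in `build_company_graph` above is the cosine similarity plus three small bonuses. The sketch below restates that arithmetic in isolation so it can be checked by hand. The helper name `edge_weight` and its argument names are illustrative and do not appear in the diff, and follower counts are assumed to be numeric (or missing).

```python
def edge_weight(embed_sim, same_industry, same_geo, followers_a, followers_b):
    """Recompute the edge weight from build_company_graph (illustrative helper)."""
    # Missing or zero follower counts are treated as 1, as in the diff.
    a = followers_a or 1
    b = followers_b or 1
    weight = embed_sim
    if same_industry:
        weight += 0.10  # industry-match bonus
    if same_geo:
        weight += 0.05  # shared-geoUrn bonus
    weight += 0.05 * (min(a, b) / max(a, b))  # follower-ratio bonus, at most 0.05
    return round(weight, 4)


if __name__ == "__main__":
    print(edge_weight(0.5, True, False, 1000, 500))
```

Two properties follow from the code in the diff: the weight can exceed 1.0 (up to 1.2 for identical embeddings with every bonus), and the `geoUrn` comparison also succeeds when both companies have no `geoUrn` at all, since `None == None`.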
39
docs/apps/linkdin/schemas/company_card.json
Normal file
@@ -0,0 +1,39 @@
{
  "name": "LinkedIn Company Card",
  "baseSelector": "div.search-results-container ul[role='list'] > li",
  "fields": [
    {
      "name": "handle",
      "selector": "a[href*='/company/']",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "profile_image",
      "selector": "a[href*='/company/'] img",
      "type": "attribute",
      "attribute": "src"
    },
    {
      "name": "name",
      "selector": "span[class*='t-16'] a",
      "type": "text"
    },
    {
      "name": "descriptor",
      "selector": "div[class*='t-black t-normal']",
      "type": "text"
    },
    {
      "name": "about",
      "selector": "p[class*='entity-result__summary--2-lines']",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": "div:contains('followers')",
      "type": "regex",
      "pattern": "(\\d+)\\s*followers"
    }
  ]
}
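Review note: the `followers` pattern in this schema, `(\d+)\s*followers`, only matches when digits sit directly before the word "followers". Abbreviated counts such as the "1K followers" text in `snippets/company.html` would not match it. The sketch below shows one way a later stage could normalise such strings. `parse_followers` is a hypothetical helper, and the K/M suffix handling is an assumption about how counts are abbreviated.

```python
import re

# Digits (optionally with , or .), an optional K/M suffix, then "followers".
_FOLLOWERS_RE = re.compile(r"(\d[\d.,]*)\s*([KkMm]?)\s*followers")
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_followers(text):
    """Return the follower count found in `text` as an int, or None."""
    match = _FOLLOWERS_RE.search(text or "")
    if not match:
        return None
    try:
        # Commas are assumed to be thousands separators.
        number = float(match.group(1).replace(",", ""))
    except ValueError:
        return None
    return int(round(number * _MULTIPLIER[match.group(2).lower()]))


if __name__ == "__main__":
    for sample in ("1K followers", "727 followers", "Follow"):
        print(sample, "->", parse_followers(sample))
```

Returning an integer (or `None`) keeps the follower-ratio arithmetic in `build_company_graph` on numeric inputs.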
38
docs/apps/linkdin/schemas/people_card.json
Normal file
@@ -0,0 +1,38 @@
{
  "name": "LinkedIn People Card",
  "baseSelector": "li.org-people-profile-card__profile-card-spacing",
  "fields": [
    {
      "name": "profile_url",
      "selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "name",
      "selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
      "type": "text"
    },
    {
      "name": "headline",
      "selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": ".lt-line-clamp--multi-line.t-12",
      "type": "text"
    },
    {
      "name": "connection_degree",
      "selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
      "type": "text"
    },
    {
      "name": "avatar_url",
      "selector": ".artdeco-entity-lockup__image img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
}
143
docs/apps/linkdin/snippets/company.html
Normal file
@@ -0,0 +1,143 @@
|
|||||||
|
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
|
||||||
|
<!----><!---->
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
|
||||||
|
data-view-name="search-entity-result-universal-template">
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<div class="linked-area flex-1
|
||||||
|
cursor-pointer">
|
||||||
|
|
||||||
|
<div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
|
||||||
|
<div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
|
||||||
|
<div class="display-flex align-items-center">
|
||||||
|
<!---->
|
||||||
|
|
||||||
|
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down " aria-hidden="true"
|
||||||
|
tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||||
|
data-test-app-aware-link="">
|
||||||
|
|
||||||
|
<div class="ivm-image-view-model ">
|
||||||
|
|
||||||
|
<div class="ivm-view-attr__img-wrapper
|
||||||
|
|
||||||
|
">
|
||||||
|
<!---->
|
||||||
|
<!----> <img width="48"
|
||||||
|
src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&v=beta&t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
|
||||||
|
loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
|
||||||
|
id="ember28"
|
||||||
|
class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</a>
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div
|
||||||
|
class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
|
||||||
|
<div class="mb1">
|
||||||
|
|
||||||
|
<div class="t-roman t-sans">
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<div class="display-flex">
|
||||||
|
<span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
|
||||||
|
<span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
|
||||||
|
t-16">
|
||||||
|
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||||
|
href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||||
|
data-test-app-aware-link="">
|
||||||
|
<!---->Management Research Services, Inc. (MRS, Inc)<!---->
|
||||||
|
<!----> </a>
|
||||||
|
<!----> </span>
|
||||||
|
</span>
|
||||||
|
<!---->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
|
||||||
|
t-14 t-black t-normal">
|
||||||
|
<!---->Insurance • Milwaukee, Wisconsin<!---->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
|
||||||
|
t-14 t-normal">
|
||||||
|
<!---->1K followers<!---->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!---->
|
||||||
|
<p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
|
||||||
|
entity-result__summary--2-lines
|
||||||
|
t-12 t-black--light
|
||||||
|
">
|
||||||
|
<!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
|
||||||
|
</span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
|
||||||
|
Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
|
||||||
|
class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
|
||||||
|
</span><strong><!---->insurance<!----></strong><span class="white-space-pre">
|
||||||
|
</span>underwriting solutions that efficiently support clients’ workflows. Supported by the
|
||||||
|
Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
|
||||||
|
optimize requirements...<!---->
|
||||||
|
</p>
|
||||||
|
|
||||||
|
<!---->
|
||||||
|
</div>
|
||||||
|
<div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
|
||||||
|
<!---->
|
||||||
|
|
||||||
|
|
||||||
|
<div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
|
||||||
|
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
|
||||||
|
type="button"><!---->
|
||||||
|
<span class="artdeco-button__text">
|
||||||
|
Follow
|
||||||
|
</span></button>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<!---->
|
||||||
|
<!---->
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</li>
|
||||||
94
docs/apps/linkdin/snippets/people.html
Normal file
@@ -0,0 +1,94 @@
|
|||||||
|
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
|
||||||
|
<div>
|
||||||
|
|
||||||
|
|
||||||
|
<section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
|
||||||
|
<!---->
|
||||||
|
|
||||||
|
<img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
|
||||||
|
ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
|
||||||
|
class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
|
||||||
|
|
||||||
|
<div class="org-people-profile-card__profile-info">
|
||||||
|
<div id="ember97"
|
||||||
|
class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
|
||||||
|
<div id="ember98"
|
||||||
|
class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
|
||||||
|
type="circle">
|
||||||
|
|
||||||
|
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||||
|
id="org-people-profile-card__profile-image-0"
|
||||||
|
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||||
|
data-test-app-aware-link="">
|
||||||
|
<img width="104"
|
||||||
|
src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&v=beta&t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
|
||||||
|
loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
|
||||||
|
class="evi-image lazy-image ember-view">
|
||||||
|
</a>
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
<div id="ember100" class="artdeco-entity-lockup__content ember-view">
|
||||||
|
<div id="ember101" class="artdeco-entity-lockup__title ember-view">
|
||||||
|
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
|
||||||
|
aria-label="View Dr. Rayna S.’s profile"
|
||||||
|
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||||
|
data-test-app-aware-link="">
|
||||||
|
<div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
|
||||||
|
t-black" style="">
|
||||||
|
Dr. Rayna S.
|
||||||
|
|
||||||
|
<!---->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</a>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
<div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
|
||||||
|
degree connection</span>
|
||||||
|
<span class="artdeco-entity-lockup__degree" aria-hidden="true">
|
||||||
|
· 3rd
|
||||||
|
</span>
|
||||||
|
<!----><!---->
|
||||||
|
</div>
|
||||||
|
<div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
|
||||||
|
<div class="t-14 t-black--light t-normal">
|
||||||
|
<div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
|
||||||
|
style="-webkit-line-clamp: 2">
|
||||||
|
Leadership and Talent Development Consultant and Professional Speaker
|
||||||
|
|
||||||
|
<!---->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
<span class="text-align-center">
|
||||||
|
<span id="ember110"
|
||||||
|
class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
|
||||||
|
style="-webkit-line-clamp: 3">
|
||||||
|
727 followers
|
||||||
|
|
||||||
|
<!----> </span>
|
||||||
|
|
||||||
|
</span>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<footer class="ph3 pb3">
|
||||||
|
<button aria-label="Follow Dr. Rayna S." id="ember111"
|
||||||
|
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
|
||||||
|
type="button"><!---->
|
||||||
|
<span class="artdeco-button__text">
|
||||||
|
Follow
|
||||||
|
</span></button>
|
||||||
|
</footer>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</li>
|
||||||
50
docs/apps/linkdin/templates/ai.js
Normal file
@@ -0,0 +1,50 @@
// ==== File: ai.js ====

class ApiHandler {
  constructor(apiKey = null) {
    this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
    console.log("ApiHandler ready");
  }

  setApiKey(k) {
    this.apiKey = k.trim();
    if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
  }

  async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
    if (!this.apiKey) throw new Error("OpenAI API key missing");
    // temperature was accepted but never sent; include it in the request
    const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
    const controller = new AbortController();

    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
    const reader = res.body.getReader();
    const dec = new TextDecoder();

    let buf = "";
    while (true) {
      const {done, value} = await reader.read();
      if (done) break;
      buf += dec.decode(value, {stream: true});
      // Split off the complete lines; the last element is a partial line (or "")
      // and stays in the buffer, so nothing is parsed twice or parsed half-read.
      const lines = buf.split("\n");
      buf = lines.pop();
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const data = line.slice(6).trim();
        if (data === "[DONE]") return; // exact sentinel, not a substring match
        const json = JSON.parse(data);
        const delta = json.choices?.[0]?.delta?.content;
        if (delta) yield delta;
      }
    }
  }
}

window.API = new ApiHandler();
1171
docs/apps/linkdin/templates/graph_view_template.html
Normal file
File diff suppressed because it is too large
51
docs/codebase/browser.md
Normal file
@@ -0,0 +1,51 @@
### browser_manager.py

| Function | What it does |
|---|---|
| `ManagedBrowser.build_browser_flags` | Returns baseline Chromium CLI flags, disables GPU and sandbox, plugs locale, timezone, stealth tweaks, and any extras from `BrowserConfig`. |
| `ManagedBrowser.__init__` | Stores config and logger, creates temp dir, preps internal state. |
| `ManagedBrowser.start` | Spawns or connects to the Chromium process, returns its CDP endpoint plus the `subprocess.Popen` handle. |
| `ManagedBrowser._initial_startup_check` | Pings the CDP endpoint once to be sure the browser is alive, raises if not. |
| `ManagedBrowser._monitor_browser_process` | Async-loops on the subprocess, logs exits or crashes, restarts if policy allows. |
| `ManagedBrowser._get_browser_path_WIP` | Old helper that maps OS + browser type to an executable path. |
| `ManagedBrowser._get_browser_path` | Current helper, checks env vars, Playwright cache, and OS defaults for the real executable. |
| `ManagedBrowser._get_browser_args` | Builds the final CLI arg list by merging user flags, stealth flags, and defaults. |
| `ManagedBrowser.cleanup` | Terminates the browser, stops monitors, deletes the temp dir. |
| `ManagedBrowser.create_profile` | Opens a visible browser so a human can log in, then zips the resulting user-data-dir to `~/.crawl4ai/profiles/<name>`. |
| `ManagedBrowser.list_profiles` | Thin wrapper, now forwarded to `BrowserProfiler.list_profiles()`. |
| `ManagedBrowser.delete_profile` | Thin wrapper, now forwarded to `BrowserProfiler.delete_profile()`. |
| `BrowserManager.__init__` | Holds the global Playwright instance, browser handle, config signature cache, session map, and logger. |
| `BrowserManager.start` | Boots the underlying `ManagedBrowser`, then spins up the default Playwright browser context with stealth patches. |
| `BrowserManager._build_browser_args` | Translates `CrawlerRunConfig` (proxy, UA, timezone, headless flag, etc.) into Playwright `launch_args`. |
| `BrowserManager.setup_context` | Applies locale, geolocation, permissions, cookies, and UA overrides on a fresh context. |
| `BrowserManager.create_browser_context` | Internal helper that actually calls `browser.new_context(**options)` after running `setup_context`. |
| `BrowserManager._make_config_signature` | Hashes the non-ephemeral parts of `CrawlerRunConfig` so contexts can be reused safely. |
| `BrowserManager.get_page` | Returns a ready `Page` for a given session id, reusing an existing one or creating a new context/page, injects helper scripts, updates `last_used`. |
| `BrowserManager.kill_session` | Force-closes a context/page for a session and removes it from the session map. |
| `BrowserManager._cleanup_expired_sessions` | Periodic sweep that drops sessions idle longer than `ttl_seconds`. |
| `BrowserManager.close` | Gracefully shuts down all contexts, the browser, Playwright, and background tasks. |
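
The `_make_config_signature` row describes hashing only the non-ephemeral parts of a config so that equivalent configs map to the same browser context. A minimal sketch of that idea follows, assuming the config can be viewed as a plain dict. `config_signature` and `EPHEMERAL_KEYS` are hypothetical names chosen for illustration, and the set of excluded keys is a guess rather than crawl4ai's actual list.

```python
import hashlib
import json

# Hypothetical per-request keys that should not influence context reuse.
EPHEMERAL_KEYS = frozenset({"url", "session_id", "cache_mode"})


def config_signature(config, ephemeral=EPHEMERAL_KEYS):
    """Return a stable SHA-256 hex digest of the non-ephemeral config entries."""
    stable = {k: v for k, v in config.items() if k not in ephemeral}
    # sort_keys makes the digest independent of dict insertion order;
    # default=str is a crude fallback for values json cannot encode.
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = config_signature({"locale": "en-US", "url": "https://a.example"})
    second = config_signature({"url": "https://b.example", "locale": "en-US"})
    print(first == second)
```

Two configs that differ only in ephemeral keys produce the same digest, so a cache keyed by the signature can hand back an existing context.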

---

### browser_profiler.py

| Function | What it does |
|---|---|
| `BrowserProfiler.__init__` | Sets up profile folder paths, async logger, and signal handlers. |
| `BrowserProfiler.create_profile` | Launches a visible browser with a new user-data-dir for manual login, on exit compresses and stores it as a named profile. |
| `BrowserProfiler.cleanup_handler` | General SIGTERM/SIGINT cleanup wrapper that kills child processes. |
| `BrowserProfiler.sigint_handler` | Handles Ctrl-C during an interactive session, makes sure the browser shuts down cleanly. |
| `BrowserProfiler.listen_for_quit_command` | Async REPL that exits when the user types `q`. |
| `BrowserProfiler.list_profiles` | Enumerates `~/.crawl4ai/profiles`, prints profile name, browser type, size, and last modified. |
| `BrowserProfiler.get_profile_path` | Returns the absolute path of a profile given its name, or `None` if missing. |
| `BrowserProfiler.delete_profile` | Removes a profile folder or a direct path from disk, with optional confirmation prompt. |
| `BrowserProfiler.interactive_manager` | Text UI loop for listing, creating, deleting, or launching profiles. |
| `BrowserProfiler.launch_standalone_browser` | Starts a non-headless Chromium with remote debugging enabled and keeps it alive for manual tests. |
| `BrowserProfiler.get_cdp_json` | Pulls `/json/version` from a CDP endpoint and returns the parsed JSON. |
| `BrowserProfiler.launch_builtin_browser` | Spawns a headless Chromium in the background, saves `{wsEndpoint, pid, started_at}` to `~/.crawl4ai/builtin_browser.json`. |
| `BrowserProfiler.get_builtin_browser_info` | Reads that JSON file, verifies the PID, and returns browser status info. |
| `BrowserProfiler._is_browser_running` | Cross-platform helper that checks if a PID is still alive. |
| `BrowserProfiler.kill_builtin_browser` | Terminates the background builtin browser and removes its status file. |
| `BrowserProfiler.get_builtin_browser_status` | Returns `{running: bool, wsEndpoint, pid, started_at}` for quick health checks. |
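
The last few rows describe a small status-file protocol: write `{wsEndpoint, pid, started_at}` to disk, then later read it back and confirm the PID is still alive. The sketch below illustrates that read-and-verify step under stated assumptions. `read_browser_status` and `pid_alive` are hypothetical helpers, the file layout is inferred from the table, and the liveness probe is POSIX-only (the table calls the real helper cross-platform, which this sketch does not attempt).

```python
import json
import os
from pathlib import Path


def pid_alive(pid):
    """POSIX-only liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_browser_status(status_file):
    """Return the stored info plus a `running` flag, or None if unusable."""
    path = Path(status_file)
    if not path.exists():
        return None
    try:
        info = json.loads(path.read_text())
    except json.JSONDecodeError:
        return None
    if not isinstance(info, dict):
        return None
    pid = info.get("pid")
    # Reject non-positive PIDs: os.kill treats 0 and negatives as process groups.
    info["running"] = isinstance(pid, int) and pid > 0 and pid_alive(pid)
    return info
```

A stale file whose PID no longer exists comes back with `running` set to `False`, which is the cue to delete the file and launch a fresh browser.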
40
docs/codebase/cli.md
Normal file
@@ -0,0 +1,40 @@
|
|||||||
|
### `cli.py` command surface
|
||||||
|
|
||||||
|
| Command | Inputs / flags | What it does |
|---|---|---|
| **profiles** | *(none)* | Opens the interactive profile manager, which lets you list, create, and delete saved browser profiles stored in `~/.crawl4ai/profiles`. |
| **browser status** | – | Prints whether the always-on *builtin* browser is running and shows its CDP URL, PID, and start time. |
| **browser stop** | – | Kills the builtin browser and deletes its status file. |
| **browser view** | `--url, -u` URL *(optional)* | Opens a visible window of the builtin browser and navigates to `URL` or `about:blank`. |
| **config list** | – | Dumps every global setting with its current value, default, and description. |
| **config get** | `key` | Prints the value of a single setting, falling back to the default if unset. |
| **config set** | `key value` | Persists a new value in the global config (stored under `~/.crawl4ai/config.yml`). |
| **examples** | – | Prints real-world CLI usage samples. |
| **crawl** | `url` *(positional)*<br>`--browser-config,-B` path<br>`--crawler-config,-C` path<br>`--filter-config,-f` path<br>`--extraction-config,-e` path<br>`--json-extract,-j` [desc]\*<br>`--schema,-s` path<br>`--browser,-b` k=v list<br>`--crawler,-c` k=v list<br>`--output,-o` all,json,markdown,md,markdown-fit,md-fit *(default all)*<br>`--output-file,-O` path<br>`--bypass-cache,-b` *(flag, default true; note that the `-b` short flag is reused)*<br>`--question,-q` str<br>`--verbose,-v` *(flag)*<br>`--profile,-p` profile-name | One-shot crawl + extraction. Builds `BrowserConfig` and `CrawlerRunConfig` from inline flags or separate YAML/JSON files, runs `AsyncWebCrawler.arun()`, and can route through a named saved profile and write the result to stdout or a file. |
| **(default)** | Same flags as **crawl**, plus `--example` | Shortcut so you can type just `crwl https://site.com`. When the first argument is not a known sub-command, it falls through to *crawl*. |

\* `--json-extract/-j` with no value turns on LLM-based JSON extraction using an auto schema; supplying a string lets you prompt-engineer the field descriptions.
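
The fall-through described in the **(default)** row can be sketched in a few lines. This is only an illustration: the `resolve_command` helper is hypothetical, and the set of sub-command names is simply copied from the table above, so the real CLI may route arguments differently.

```python
from typing import List, Tuple

# Sub-command names taken from the table above (an assumption, not the CLI's own registry).
KNOWN_COMMANDS = {"profiles", "browser", "config", "examples", "crawl"}


def resolve_command(argv: List[str]) -> Tuple[str, List[str]]:
    """Return (command, remaining_args) for a crwl-style argument list."""
    if argv and argv[0] in KNOWN_COMMANDS:
        return argv[0], argv[1:]
    # First argument is not a known sub-command: hand everything to `crawl`.
    return "crawl", list(argv)


print(resolve_command(["https://site.com", "-o", "json"]))
print(resolve_command(["config", "get", "key"]))
```

With this shape, `crwl https://site.com -o json` and `crwl crawl https://site.com -o json` end up on the same code path.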

> Quick mental model
> `profiles` = manage identities,
> `browser ...` = control long-running headless Chrome that all crawls can piggy-back on,
> `crawl` = do the actual work,
> `config` = tweak global defaults,
> everything else is sugar.

### Quick-fire “profile” usage cheatsheet

| Scenario | Command (copy-paste ready) | Notes |
|---|---|---|
| **Launch interactive Profile Manager UI** | `crwl profiles` | Opens TUI with options: 1 List, 2 Create, 3 Delete, 4 Use-to-crawl, 5 Exit. |
| **Create a fresh profile** | `crwl profiles` → choose **2** → name it → browser opens → log in → press **q** in terminal | Saves to `~/.crawl4ai/profiles/<name>`. |
| **List saved profiles** | `crwl profiles` → choose **1** | Shows name, browser type, size, last-modified. |
| **Delete a profile** | `crwl profiles` → choose **3** → pick the profile index → confirm | Removes the folder. |
| **Crawl with a profile (default alias)** | `crwl https://site.com/dashboard -p my-profile` | Keeps login cookies, sets `use_managed_browser=true` under the hood. |
| **Crawl + verbose JSON output** | `crwl https://site.com -p my-profile -o json -v` | Any other `crawl` flags work the same. |
| **Crawl with extra browser tweaks** | `crwl https://site.com -p my-profile -b "headless=true,viewport_width=1680"` | CLI overrides go on top of the profile. |
| **Same but via explicit sub-command** | `crwl crawl https://site.com -p my-profile` | Identical to default alias. |
| **Use profile from inside Profile Manager** | `crwl profiles` → choose **4** → pick profile → enter URL → follow prompts | Handy when demo-ing to non-CLI folks. |
| **One-off crawl with a profile folder path (no name lookup)** | `crwl https://site.com -b "user_data_dir=$HOME/.crawl4ai/profiles/my-profile,use_managed_browser=true"` | Bypasses registry, useful for CI scripts. |
| **Launch a dev browser on CDP port with the same identity** | `crwl cdp -d $HOME/.crawl4ai/profiles/my-profile -P 9223` | Lets Puppeteer/Playwright attach for debugging. |
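
Inline overrides such as `-b "headless=true,viewport_width=1680"` are comma-separated `key=value` pairs. A minimal sketch of how such a string could be turned into typed values follows; the `parse_kv_list` helper and its coercion rules are assumptions made for illustration, and the actual CLI parser may behave differently (for example, this sketch does not support values that contain commas).

```python
from typing import Dict, Union

Value = Union[bool, int, float, str]


def _coerce(raw: str) -> Value:
    """Turn 'true'/'false' into bool and numeric strings into int or float."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_kv_list(text: str) -> Dict[str, Value]:
    """Parse 'a=1,b=true' into a dict, splitting each pair on the first '='."""
    result: Dict[str, Value] = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = _coerce(raw.strip())
    return result


print(parse_kv_list("headless=true,viewport_width=1680"))
```

A dict like this could then be merged over the settings loaded from a profile, which matches the note that CLI overrides go on top of the profile.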
@@ -391,12 +391,14 @@ async def main():
     # Process results
     raw_df = pd.DataFrame()
     for result in results:
-        if result.success and result.media["tables"]:
+        # Use the new tables field, falling back to media["tables"] for backward compatibility
+        tables = result.tables if hasattr(result, "tables") and result.tables else result.media.get("tables", [])
+        if result.success and tables:
             # Extract primary market table
             # DataFrame
             raw_df = pd.DataFrame(
-                result.media["tables"][0]["rows"],
-                columns=result.media["tables"][0]["headers"],
+                tables[0]["rows"],
+                columns=tables[0]["headers"],
             )
             break

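
The backward-compatible lookup in the hunk above can be isolated into a small helper. The sketch below assumes only what the diff shows: a result object that may expose a `tables` attribute and a `media` dict with a `"tables"` key. The `get_tables` name and the `SimpleNamespace` stand-ins are hypothetical, used so the snippet runs without Crawl4AI installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def get_tables(result: Any) -> List[Dict[str, Any]]:
    """Prefer result.tables; fall back to result.media['tables'] on older results."""
    tables = getattr(result, "tables", None)
    if tables:
        return tables
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])


# Stand-ins for a newer result (tables field) and an older one (media only).
new_style = SimpleNamespace(tables=[{"headers": ["coin"], "rows": [["BTC"]]}], media={})
old_style = SimpleNamespace(media={"tables": [{"headers": ["coin"], "rows": [["ETH"]]}]})

print(get_tables(new_style)[0]["rows"])
print(get_tables(old_style)[0]["rows"])
```

Using `getattr` with a default also covers results where `media` is missing or `None`, which the one-line expression in the diff does not guard against.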
File diff suppressed because it is too large
@@ -31,7 +31,7 @@ async def example_cdp():


 async def main():
-    browser_config = BrowserConfig(headless=True, verbose=True)
+    browser_config = BrowserConfig(headless=False, verbose=True)
     async with AsyncWebCrawler(config=browser_config) as crawler:
         crawler_config = CrawlerRunConfig(
             cache_mode=CacheMode.BYPASS,
@@ -412,17 +412,41 @@ footer {
     background-color: var(--primary-dimmed-color, #09b5a5);
     color: var(--background-color, #070708);
     border: none;
-    padding: 4px 8px;
+    padding: 6px 10px;
     font-size: 0.8em;
     border-radius: 4px;
     cursor: pointer;
-    box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
-    transition: background-color 0.2s ease;
+    box-shadow: 0 3px 8px rgba(0, 0, 0, 0.3);
+    transition: background-color 0.2s ease, transform 0.15s ease;
     white-space: nowrap;
+    display: flex;
+    align-items: center;
+    font-weight: 500;
+    animation: askAiButtonAppear 0.2s ease-out;
+}
+
+@keyframes askAiButtonAppear {
+    from {
+        opacity: 0;
+        transform: scale(0.9);
+    }
+    to {
+        opacity: 1;
+        transform: scale(1);
+    }
 }

 .ask-ai-selection-button:hover {
     background-color: var(--primary-color, #50ffff);
+    transform: scale(1.05);
+}
+
+/* Mobile styles for Ask AI button */
+@media screen and (max-width: 768px) {
+    .ask-ai-selection-button {
+        padding: 8px 12px; /* Larger touch target on mobile */
+        font-size: 0.9em; /* Slightly larger text */
+    }
 }

 /* ==== File: docs/assets/layout.css (Additions) ==== */
@@ -8,12 +8,32 @@ document.addEventListener('DOMContentLoaded', () => {
         const button = document.createElement('button');
         button.id = 'ask-ai-selection-btn';
         button.className = 'ask-ai-selection-button';
-        button.textContent = 'Ask AI'; // Or use an icon
+
+        // Add icon and text for better visibility
+        button.innerHTML = `
+            <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="12" height="12" fill="currentColor" style="margin-right: 4px; vertical-align: middle;">
+                <path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2z"/>
+            </svg>
+            <span>Ask AI</span>
+        `;
+
+        // Common styles
         button.style.display = 'none'; // Initially hidden
         button.style.position = 'absolute';
         button.style.zIndex = '1500'; // Ensure it's on top
-        document.body.appendChild(button);
+        button.style.boxShadow = '0 3px 8px rgba(0, 0, 0, 0.4)'; // More pronounced shadow
+        button.style.transition = 'transform 0.15s ease, background-color 0.2s ease'; // Smooth hover effect
+
+        // Add transform on hover
+        button.addEventListener('mouseover', () => {
+            button.style.transform = 'scale(1.05)';
+        });
+
+        button.addEventListener('mouseout', () => {
+            button.style.transform = 'scale(1)';
+        });
+
+        document.body.appendChild(button);
         button.addEventListener('click', handleAskAiClick);
         return button;
     }
@@ -43,11 +63,38 @@ document.addEventListener('DOMContentLoaded', () => {
         const range = selection.getRangeAt(0);
         const rect = range.getBoundingClientRect();

-        // Calculate position: top-right of the selection
+        // Get viewport dimensions
+        const viewportWidth = window.innerWidth;
+        const viewportHeight = window.innerHeight;
+
+        // Calculate position based on selection
         const scrollX = window.scrollX;
         const scrollY = window.scrollY;
-        const buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
-        const buttonLeft = rect.right + scrollX + 5; // 5px to the right
+
+        // Default position (top-right of selection)
+        let buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
+        let buttonLeft = rect.right + scrollX + 5; // 5px to the right
+
+        // Check if we're on mobile (which we define as less than 768px)
+        const isMobile = viewportWidth <= 768;
+
+        if (isMobile) {
+            // On mobile, position centered above selection to avoid edge issues
+            buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 10; // 10px above on mobile
+            buttonLeft = rect.left + scrollX + (rect.width / 2) - (askAiButton.offsetWidth / 2); // Centered
+        } else {
+            // For desktop, ensure the button doesn't go off screen
+            // Check right edge
+            if (buttonLeft + askAiButton.offsetWidth > scrollX + viewportWidth) {
+                buttonLeft = scrollX + viewportWidth - askAiButton.offsetWidth - 10; // 10px from right edge
+            }
+        }
+
+        // Check top edge (for all devices)
+        if (buttonTop < scrollY) {
+            // If would go above viewport, position below selection instead
+            buttonTop = rect.bottom + scrollY + 5; // 5px below
+        }

         askAiButton.style.top = `${buttonTop}px`;
         askAiButton.style.left = `${buttonLeft}px`;
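
The positioning rules added in this hunk (centre the button on narrow viewports, clamp it to the right edge on desktop, and flip it below the selection when there is no room above) can be written as a pure function, which makes them easy to check with plain numbers. The Python sketch below mirrors the JavaScript arithmetic; the `Rect` type and the `compute_button_position` name are hypothetical and not part of the repository.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Rect:
    """Selection rectangle in viewport coordinates (like getBoundingClientRect)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left


def compute_button_position(
    rect: Rect,
    button_w: float,
    button_h: float,
    viewport_w: float,
    scroll_x: float = 0.0,
    scroll_y: float = 0.0,
    mobile_breakpoint: float = 768.0,
) -> Tuple[float, float]:
    """Return (top, left) page coordinates for the button."""
    # Default: 5px above and 5px to the right of the selection.
    top = rect.top + scroll_y - button_h - 5
    left = rect.right + scroll_x + 5

    if viewport_w <= mobile_breakpoint:
        # Mobile: 10px above, horizontally centred on the selection.
        top = rect.top + scroll_y - button_h - 10
        left = rect.left + scroll_x + rect.width / 2 - button_w / 2
    elif left + button_w > scroll_x + viewport_w:
        # Desktop: keep the button 10px inside the right edge.
        left = scroll_x + viewport_w - button_w - 10

    if top < scroll_y:
        # No room above the visible area: place the button 5px below instead.
        top = rect.bottom + scroll_y + 5
    return top, left


print(compute_button_position(Rect(100, 200, 300, 220), 60, 24, 1200))
```

Note that, as in the JavaScript, the mobile branch does not clamp the left coordinate, so a selection close to the left edge of a narrow screen can still push the button partly off-screen.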
@@ -77,8 +124,8 @@ document.addEventListener('DOMContentLoaded', () => {

     // --- Event Listeners ---

-    // Show button on mouse up after selection
-    document.addEventListener('mouseup', (event) => {
+    // Function to handle selection events (both mouse and touch)
+    function handleSelectionEvent(event) {
         // Slight delay to ensure selection is registered
         setTimeout(() => {
             const selectedText = getSafeSelectedText();
@@ -86,7 +133,7 @@ document.addEventListener('DOMContentLoaded', () => {
                 if (!askAiButton) {
                     askAiButton = createAskAiButton();
                 }
-                // Don't position if the click was ON the button itself
+                // Don't position if the event was ON the button itself
                 if (event.target !== askAiButton) {
                     positionButton(event);
                 }
@@ -94,16 +141,46 @@ document.addEventListener('DOMContentLoaded', () => {
                 hideButton();
             }
         }, 10); // Small delay
+    }
+
+    // Mouse selection events (desktop)
+    document.addEventListener('mouseup', handleSelectionEvent);
+
+    // Touch selection events (mobile)
+    document.addEventListener('touchend', handleSelectionEvent);
+    document.addEventListener('selectionchange', () => {
+        // This helps with mobile selection which can happen without mouseup/touchend
+        setTimeout(() => {
+            const selectedText = getSafeSelectedText();
+            if (selectedText && askAiButton) {
+                positionButton();
+            }
+        }, 300); // Longer delay for selection change
     });

-    // Hide button on scroll or click elsewhere
+    // Hide button on various events
     document.addEventListener('mousedown', (event) => {
         // Hide if clicking anywhere EXCEPT the button itself
         if (askAiButton && event.target !== askAiButton) {
             hideButton();
         }
     });
+
+    document.addEventListener('touchstart', (event) => {
+        // Same for touch events, but only hide if not on the button
+        if (askAiButton && event.target !== askAiButton) {
+            hideButton();
+        }
+    });
+
     document.addEventListener('scroll', hideButton, true); // Capture scroll events

+    // Also hide when pressing Escape key
+    document.addEventListener('keydown', (event) => {
+        if (event.key === 'Escape') {
+            hideButton();
+        }
+    });
+
     console.log("Selection Ask AI script loaded.");
 });
@@ -4,6 +4,32 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical

 ## Latest Release

+Here’s the blog index entry for **v0.6.0**, written to match the exact tone and structure of your previous entries:
+
+---
+
+### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
+*April 23, 2025*
+
+Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
+
+The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
+
+Other key changes:
+
+* Native support for `result.media["tables"]` to export DataFrames
+* Full network + console logs and MHTML snapshot per crawl
+* Browser pooling and pre-warming for faster cold starts
+* New streaming endpoints via MCP API and Playground
+* Robots.txt support, proxy rotation, and improved session handling
+* Deprecated old markdown names, legacy modules cleaned up
+* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
+
+[Read full release notes →](releases/0.6.0.md)
+
+---
+
+Let me know if you want me to auto-update the actual file or just paste this into the markdown.
+
 ### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)

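
Since the entry above mentions exporting `result.media["tables"]`, here is a dependency-free sketch of turning one extracted table into CSV text. It assumes each table is a dict with `headers` and `rows` keys, as used by the example script elsewhere in this changeset; the `table_to_csv` helper and the sample data are hypothetical.

```python
import csv
import io
from typing import Any, Dict


def table_to_csv(table: Dict[str, Any]) -> str:
    """Serialize a {'headers': [...], 'rows': [[...], ...]} table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Made-up sample; values containing commas are quoted by the csv module.
sample = {"headers": ["name", "price"], "rows": [["BTC", "65,000"], ["ETH", "3,200"]]}
print(table_to_csv(sample))
```

Going through the `csv` module rather than joining strings by hand keeps cells with embedded commas or quotes intact.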
@@ -1,51 +1,143 @@
-# Crawl4AI 0.6.0
+# Crawl4AI v0.6.0 Release Notes

-*Release date: 2025‑04‑22*
+We're excited to announce the release of **Crawl4AI v0.6.0**, our biggest and most feature-rich update yet. This version introduces major architectural upgrades, brand-new capabilities for geo-aware crawling, high-efficiency scraping, and real-time streaming support for scalable deployments.

-0.6.0 is the **biggest jump** since the 0.5 series, packing a smarter browser core, pool‑based crawlers, and a ton of DX candy. Expect faster runs, lower RAM burn, and richer diagnostics.
-
 ---

-## 🚀 Key upgrades
+## Highlights

-| Area | What changed |
-|------|--------------|
-| **Browser** | New **Browser** management with pooling, page pre‑warm, geolocation + locale + timezone switches |
-| **Crawler** | Console and network log capture, MHTML snapshots, safer `get_page` API |
-| **Server & API** | **Crawler Pool Manager** endpoint, MCP socket + SSE support |
-| **Docs** | v2 layout, floating Ask‑AI helper, GitHub stats badge, copy‑code buttons, Docker API demo |
-| **Tests** | Memory + load benchmarks, 90+ new cases covering MCP and Docker |
+### 1. **World-Aware Crawlers**
+Crawl as if you’re anywhere in the world. With v0.6.0, each crawl can simulate:
+- Specific GPS coordinates
+- Browser locale
+- Timezone
+
+Example:
+```python
+CrawlerRunConfig(
+    url="https://browserleaks.com/geo",
+    locale="en-US",
+    timezone_id="America/Los_Angeles",
+    geolocation=GeolocationConfig(
+        latitude=34.0522,
+        longitude=-118.2437,
+        accuracy=10.0
+    )
+)
+```
+Great for accessing region-specific content or testing global behavior.

 ---

-## ⚠️ Breaking changes
-
-1. **`get_page` signature** – returns `(html, metadata)` instead of plain html.
-2. **Docker** – new Chromium base layer, rebuild images.
+### 2. **Native Table Extraction**
+Extract HTML tables directly into usable formats like Pandas DataFrames or CSV with zero parsing hassle. All table data is available under `result.media["tables"]`.
+
+Example:
+```python
+raw_df = pd.DataFrame(
+    result.media["tables"][0]["rows"],
+    columns=result.media["tables"][0]["headers"]
+)
+```
+This makes it ideal for scraping financial data, pricing pages, or anything tabular.

 ---

-## How to upgrade
-
+### 3. **Browser Pooling & Pre-Warming**
+We've overhauled browser management. Now, multiple browser instances can be pooled and pages pre-warmed for ultra-fast launches:
+- Reduces cold-start latency
+- Lowers memory spikes
+- Enhances parallel crawling stability
+
+This powers the new **Docker Playground** experience and streamlines heavy-load crawling.
+
+---
+
+### 4. **Traffic & Snapshot Capture**
+Need full visibility? You can now capture:
+- Full network traffic logs
+- Console output
+- MHTML page snapshots for post-crawl audits and debugging
+
+No more guesswork on what happened during your crawl.
+
+---
+
+### 5. **MCP API and Streaming Support**
+We’re exposing **MCP socket and SSE endpoints**, allowing:
+- Live streaming of crawl results
+- Real-time integration with agents or frontends
+- A new Playground UI for interactive crawling
+
+This is a major step towards making Crawl4AI real-time ready.
+
+---
+
+### 6. **Stress-Test Framework**
+Want to test performance under heavy load? v0.6.0 includes a new memory stress-test suite that supports 1,000+ URL workloads. Ideal for:
+- Load testing
+- Performance benchmarking
+- Validating memory efficiency
+
+---
+
+## Core Improvements
+- Robots.txt compliance
+- Proxy rotation support
+- Improved URL normalization and session reuse
+- Shared data across crawler hooks
+- New page routing logic
+
+---
+
+## Breaking Changes & Deprecations
+- Legacy `crawl4ai/browser/*` modules are removed. Update imports accordingly.
+- `AsyncPlaywrightCrawlerStrategy.get_page` now uses a new function signature.
+- Deprecated markdown generator aliases now point to `DefaultMarkdownGenerator` with warning.
+
+---
+
+## Miscellaneous Updates
+- FastAPI validators replaced custom validation logic
+- Docker build now based on a Chromium layer
+- Repo-wide cleanup: ~36,000 insertions, ~5,000 deletions
+
+---
+
+## New Examples Included
+- Geo-location crawling
+- Network + console log capture
+- Docker MCP API usage
+- Markdown selector usage
+- Crypto project data extraction
+
+---
+
+## Watch the Release Video
+Want a visual walkthrough of all these updates? Watch the video:
+🔗 https://youtu.be/9x7nVcjOZks
+
+If you're new to Crawl4AI, start here:
+🔗 https://www.youtube.com/watch?v=xo3qK6Hg9AA&t=15s
+
+---
+
+## Join the Community
+We’ve just opened up our **Discord** for the public. Join us to:
+- Ask questions
+- Share your projects
+- Get help or contribute
+
+💬 https://discord.gg/wpYFACrHR4
+
+---
+
+## Install or Upgrade
 ```bash
-pip install -U crawl4ai==0.6.0
+pip install -U crawl4ai
 ```

 ---

-## Full changelog
+Live long and import crawl4ai. 🖖

-The diff between `main` and `next` spans **36 k insertions, 4.9 k deletions** over 121 files. Read the [compare view](https://github.com/unclecode/crawl4ai/compare/0.5.0.post8...0.6.0) or see `CHANGELOG.md` for the granular list.
-
----
-
-## Upgrade tips
-
-* Using the Docker API? Pull `unclecode/crawl4ai:0.6.0`, new args are documented in `/deploy/docker/README.md`.
-* Stress‑test your stack with `tests/memory/run_benchmark.py` before production rollout.
-* Markdown generators renamed but aliased, update when convenient, warnings will remind you.
-
----
-
-Happy crawling, ping `@unclecode` on X for questions or memes.
-
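
The release notes above say that deprecated markdown generator aliases now resolve to `DefaultMarkdownGenerator` with a warning. One common way to build that kind of alias in Python is sketched below. Everything here is a stand-in: the `DefaultMarkdownGenerator` class is a stub rather than the real one, and `OldMarkdownGenerator` is a hypothetical legacy name, so this shows the pattern and not the library's actual code.

```python
import warnings


class DefaultMarkdownGenerator:
    """Stub standing in for the real generator class."""

    def generate(self, html: str) -> str:
        return html.strip()


class OldMarkdownGenerator(DefaultMarkdownGenerator):
    """Hypothetical legacy name kept alive as a thin alias that warns on use."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "OldMarkdownGenerator is deprecated; use DefaultMarkdownGenerator",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this __init__
        )
        super().__init__(*args, **kwargs)


# Record warnings so the demo can show that exactly one was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    generator = OldMarkdownGenerator()

print(isinstance(generator, DefaultMarkdownGenerator), len(caught))
```

Code that still imports the old name keeps working, while test suites run with warnings enabled will surface every remaining use.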
@@ -58,7 +58,7 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest release candidate is `0.6.0rc1-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release candidate is `0.6.0-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

 ```bash
 # Pull the release candidate (recommended for latest features)
@@ -124,9 +124,9 @@ docker stop crawl4ai && docker rm crawl4ai
 #### Docker Hub Versioning Explained

 * **Image Name:** `unclecode/crawl4ai`
-* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0rc1-r2`)
+* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r2`)
     * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
-    * `SUFFIX`: Optional tag for release candidates (`rc1`) and revisions (`r1`)
+    * `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
 * **`latest` Tag:** Points to the most recent stable version
 * **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag

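
To make the `LIBRARY_VERSION[-SUFFIX]` convention concrete, here is a small parsing sketch. The regular expression and the `parse_tag` helper are assumptions for illustration, based on the two example tags in the hunk above (`0.6.0rc1-r2` and `0.6.0-r2`), where the `rc1` marker sits inside the version part and the revision follows the hyphen. Docker Hub itself does not enforce this structure, and the `latest` tag is deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class ImageTag(NamedTuple):
    library_version: str
    suffix: Optional[str]


# Version: three dotted numbers with an optional pre-release marker such as rc1.
# Suffix: an optional alphanumeric revision such as r2, separated by a hyphen.
_TAG_RE = re.compile(r"^(?P<version>\d+\.\d+\.\d+(?:rc\d+)?)(?:-(?P<suffix>[A-Za-z0-9]+))?$")


def parse_tag(tag: str) -> ImageTag:
    """Split a tag into library version and optional suffix, or raise ValueError."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return ImageTag(match.group("version"), match.group("suffix"))


print(parse_tag("0.6.0-r2"))
print(parse_tag("0.6.0rc1-r2"))
```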
32
tests/profiler/test_crteate_profile.py
Normal file
@@ -0,0 +1,32 @@
+from crawl4ai import BrowserProfiler
+import asyncio
+
+
+if __name__ == "__main__":
+    # Example usage
+    profiler = BrowserProfiler()
+
+    # Create a new profile
+    import os
+    from pathlib import Path
+    home_dir = Path.home()
+    profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
+
+    print(f"Profile created at: {profile_path}")
+
+
+
+    # # Launch a standalone browser
+    # asyncio.run(profiler.launch_standalone_browser())
+
+    # # List profiles
+    # profiles = profiler.list_profiles()
+    # for profile in profiles:
+    #     print(f"Profile: {profile['name']}, Path: {profile['path']}")
+
+    # # Delete a profile
+    # success = profiler.delete_profile("my-profile")
+    # if success:
+    #     print("Profile deleted successfully")
+    # else:
+    #     print("Failed to delete profile")
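
For context on what listing profiles involves, the sketch below scans a profiles directory with `pathlib` and reports each profile's name, size, and last-modified time. It does not use `BrowserProfiler`; the `scan_profiles` helper is hypothetical and only assumes the `~/.crawl4ai/profiles/<name>` layout described earlier. The demo runs against a temporary directory so nothing under the real home directory is touched.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List


def scan_profiles(root: Path) -> List[Dict[str, Any]]:
    """Return one record per sub-directory of `root`, sorted by name."""
    if not root.is_dir():
        return []
    records = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files next to the profile folders
        size = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
        records.append({
            "name": entry.name,
            "path": str(entry),
            "size_bytes": size,
            "modified": datetime.fromtimestamp(entry.stat().st_mtime),
        })
    return records


# Demo layout: two fake profiles, one of which holds a 5-byte file.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "b-profile").mkdir()
    (demo_root / "a-profile").mkdir()
    (demo_root / "a-profile" / "Cookies").write_bytes(b"12345")
    demo_records = scan_profiles(demo_root)

for record in demo_records:
    print(record["name"], record["size_bytes"], record["modified"])
```

Pointing the same function at `Path.home() / ".crawl4ai" / "profiles"` would list real profiles, assuming they are stored as plain folders there.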