Compare commits

9 Commits

Author SHA1 Message Date
UncleCode
0e5d672763 Merge branch 'pr-971' into merge-pr971 2025-05-01 18:57:28 +08:00
wakaka6
cd2b490b40 refactor(logger): Apply the Enumeration for color 2025-05-01 17:04:44 +08:00
UncleCode
50f0b83fcd feat(linkedin): add prospect-wizard app with scraping and visualization
Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
2025-04-30 19:38:25 +08:00
UncleCode
9499164d3c feat(browser): improve browser profile management and cleanup
Enhance browser profile handling with better process cleanup and documentation:
- Add process cleanup for existing Chromium instances on Windows/Unix
- Fix profile creation by passing complete browser config
- Add comprehensive documentation for browser and CLI components
- Add initial profile creation test
- Bump version to 0.6.3

This change improves reliability when managing browser profiles and provides better documentation for developers.
2025-04-29 23:04:32 +08:00
UncleCode
2140d9aca4 fix(browser): correct headless mode default behavior
Modify BrowserConfig to respect explicit headless parameter setting instead of forcing True. Update version to 0.6.2 and clean up code formatting in examples.

BREAKING CHANGE: BrowserConfig no longer defaults to headless=True when explicitly set to False
2025-04-26 21:09:50 +08:00
UncleCode
ccec40ed17 feat(models): add dedicated tables field to CrawlResult
- Add tables field to CrawlResult model while maintaining backward compatibility
- Update async_webcrawler.py to extract tables from media and pass to tables field
- Update crypto_analysis_example.py to use the new tables field
- Add /config/dump examples to demo_docker_api.py
- Bump version to 0.6.1
2025-04-24 18:36:25 +08:00
UncleCode
ad4dfb21e1 Remove "rc1" 2025-04-23 21:00:00 +08:00
UncleCode
7784b2468e feat(docs): enhance Ask AI button UX and add v0.6.0 release notes
Improve Ask AI button with better mobile support, animations, and positioning:
- Add button animations and hover effects
- Improve mobile responsiveness
- Add icon to button
- Fix positioning logic for different viewport sizes
- Add keyboard (Escape) support

Add comprehensive v0.6.0 release documentation:
- Create detailed release notes
- Update blog index with latest release
- Document all major features and breaking changes

BREAKING CHANGE: Documentation structure updated with new v0.6.0 section
2025-04-23 20:07:03 +08:00
wakaka6
b2f3cb0dfa WIP: logger migrate to rich 2025-04-11 00:44:43 +08:00
34 changed files with 3769 additions and 420 deletions

View File

@@ -5,7 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.6.0rc1] 20250422
## [0.6.1] - 2025-04-24
### Added
- New dedicated `tables` field in `CrawlResult` model for better table extraction handling
- Updated crypto_analysis_example.py to use the new tables field with backward compatibility
### Changed
- Improved playground UI in Docker deployment with better endpoint handling and UI feedback
## [0.6.0] 20250422
### Added
- Browser pooling with page prewarming and finegrained **geolocation, locale, and timezone** controls
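
As an illustration of the new `tables` field described in the 0.6.1 entry, here is a minimal sketch of how a consumer might turn one table entry into CSV text. It assumes each entry is a dict with `headers` and `rows` keys (plus an optional `caption`), as hinted by the model comment in this change set. The helper name `table_to_csv` and the sample data are hypothetical and not part of Crawl4AI.

```python
import csv
import io
from typing import Any, Dict, List


def table_to_csv(table: Dict[str, Any]) -> str:
    """Render one table dict (assumed keys: 'headers', 'rows') as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    headers: List[str] = table.get("headers", [])
    if headers:
        writer.writerow(headers)
    for row in table.get("rows", []):
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample shaped like one entry of CrawlResult.tables
sample_table = {
    "headers": ["Coin", "Price"],
    "rows": [["BTC", "94000"], ["ETH", "1800"]],
    "caption": "Example prices",
}

csv_text = table_to_csv(sample_table)
print(csv_text)
```

Missing keys fall back to empty lists, so an empty dict yields an empty string rather than an error.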

View File

@@ -21,9 +21,9 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.6.0rc1](#-recent-updates)
[✨ Check out latest update v0.6.0](#-recent-updates)
🎉 **Version 0.6.0rc1 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -505,7 +505,7 @@ async def test_news_crawl():
## ✨ Recent Updates
### Version 0.6.0rc1 Release Highlights
### Version 0.6.0 Release Highlights
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
```python
@@ -575,7 +575,7 @@ async def test_news_crawl():
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
Read the full details in our [0.6.0rc1 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
### Previous Version: 0.5.0 Major Release Highlights
@@ -606,7 +606,7 @@ We use different suffixes to indicate development stages:
- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version
- `rc` (0.4.3): Release candidates, potential final version
#### Installation
- Regular installation (stable version):

View File

@@ -1,3 +1,3 @@
# crawl4ai/_version.py
__version__ = "0.6.0"
__version__ = "0.6.3"

View File

@@ -427,7 +427,7 @@ class BrowserConfig:
host: str = "localhost",
):
self.browser_type = browser_type
self.headless = headless or True
self.headless = headless
self.browser_mode = browser_mode
self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
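
The effect of the one-line `headless` change above can be reproduced in isolation. The sketch below uses two hypothetical stand-in functions rather than the real `BrowserConfig`; the `True` default on the second function is an assumption, since the parameter default is not visible in this hunk.

```python
from typing import Optional


def resolve_headless_old(headless: Optional[bool]) -> bool:
    # Pre-fix expression: `False or True` evaluates to True,
    # so an explicit headless=False was silently overridden.
    return headless or True


def resolve_headless_new(headless: bool = True) -> bool:
    # Post-fix behaviour: the caller's value is stored unchanged.
    # (The default of True is assumed here for illustration.)
    return headless


old_result = resolve_headless_old(False)
new_result = resolve_headless_new(False)
print(old_result, new_result)
```

If `None` should still mean "use the default", a check such as `headless if headless is not None else True` would keep that behaviour without overriding an explicit `False`.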

View File

@@ -171,7 +171,10 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
message="{error}",
tag="ERROR",
params={"error": str(error_message)},
boxes=["error"],
)
raise
@@ -189,7 +192,10 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
message="{error}",
tag="ERROR",
params={"error": str(error_message)},
boxes=["error"],
)
raise
finally:

View File

@@ -1,10 +1,12 @@
from abc import ABC, abstractmethod
from enum import Enum
from typing import Optional, Dict, Any
from colorama import Fore, Style, init
from typing import Optional, Dict, Any, List
import os
from datetime import datetime
from urllib.parse import unquote
from rich.console import Console
from rich.text import Text
from .utils import create_box_message
class LogLevel(Enum):
@@ -21,6 +23,26 @@ class LogLevel(Enum):
FATAL = 10
def __str__(self):
return self.name.lower()
class LogColor(str, Enum):
"""Enum for log colors."""
DEBUG = "lightblack"
INFO = "cyan"
SUCCESS = "green"
WARNING = "yellow"
ERROR = "red"
CYAN = "cyan"
GREEN = "green"
YELLOW = "yellow"
MAGENTA = "magenta"
DIM_MAGENTA = "dim magenta"
def __str__(self):
"""Automatically convert rich color to string."""
return self.value
class AsyncLoggerBase(ABC):
@@ -52,6 +74,7 @@ class AsyncLoggerBase(ABC):
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
pass
class AsyncLogger(AsyncLoggerBase):
"""
Asynchronous logger with support for colored console output and file logging.
@@ -79,17 +102,11 @@ class AsyncLogger(AsyncLoggerBase):
}
DEFAULT_COLORS = {
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
LogLevel.INFO: Fore.CYAN,
LogLevel.SUCCESS: Fore.GREEN,
LogLevel.WARNING: Fore.YELLOW,
LogLevel.ERROR: Fore.RED,
LogLevel.CRITICAL: Fore.RED + Style.BRIGHT,
LogLevel.ALERT: Fore.RED + Style.BRIGHT,
LogLevel.NOTICE: Fore.BLUE,
LogLevel.EXCEPTION: Fore.RED + Style.BRIGHT,
LogLevel.FATAL: Fore.RED + Style.BRIGHT,
LogLevel.DEFAULT: Fore.WHITE,
LogLevel.DEBUG: LogColor.DEBUG,
LogLevel.INFO: LogColor.INFO,
LogLevel.SUCCESS: LogColor.SUCCESS,
LogLevel.WARNING: LogColor.WARNING,
LogLevel.ERROR: LogColor.ERROR,
}
def __init__(
@@ -98,7 +115,7 @@ class AsyncLogger(AsyncLoggerBase):
log_level: LogLevel = LogLevel.DEBUG,
tag_width: int = 10,
icons: Optional[Dict[str, str]] = None,
colors: Optional[Dict[LogLevel, str]] = None,
colors: Optional[Dict[LogLevel, LogColor]] = None,
verbose: bool = True,
):
"""
@@ -112,13 +129,13 @@ class AsyncLogger(AsyncLoggerBase):
colors: Custom colors for different log levels
verbose: Whether to output to console
"""
init() # Initialize colorama
self.log_file = log_file
self.log_level = log_level
self.tag_width = tag_width
self.icons = icons or self.DEFAULT_ICONS
self.colors = colors or self.DEFAULT_COLORS
self.verbose = verbose
self.console = Console()
# Create log file directory if needed
if log_file:
@@ -143,16 +160,11 @@ class AsyncLogger(AsyncLoggerBase):
def _write_to_file(self, message: str):
"""Write a message to the log file if configured."""
if self.log_file:
text = Text.from_markup(message)
plain_text = text.plain
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
with open(self.log_file, "a", encoding="utf-8") as f:
# Strip ANSI color codes for file output
clean_message = message.replace(Fore.RESET, "").replace(
Style.RESET_ALL, ""
)
for color in vars(Fore).values():
if isinstance(color, str):
clean_message = clean_message.replace(color, "")
f.write(f"[{timestamp}] {clean_message}\n")
f.write(f"[{timestamp}] {plain_text}\n")
def _log(
self,
@@ -160,8 +172,9 @@ class AsyncLogger(AsyncLoggerBase):
message: str,
tag: str,
params: Optional[Dict[str, Any]] = None,
colors: Optional[Dict[str, str]] = None,
base_color: Optional[str] = None,
colors: Optional[Dict[str, LogColor]] = None,
boxes: Optional[List[str]] = None,
base_color: Optional[LogColor] = None,
**kwargs,
):
"""
@@ -173,55 +186,44 @@ class AsyncLogger(AsyncLoggerBase):
tag: Tag for the message
params: Parameters to format into the message
colors: Color overrides for specific parameters
boxes: Box overrides for specific parameters
base_color: Base color for the entire message
"""
if level.value < self.log_level.value:
return
# Format the message with parameters if provided
# avoid conflict with rich formatting
parsed_message = message.replace("[", "[[").replace("]", "]]")
if params:
try:
# First format the message with raw parameters
formatted_message = message.format(**params)
# FIXME: If there are formatting strings in floating point format,
# this may result in colors and boxes not being applied properly.
# such as {value:.2f}, the value is 0.23333 format it to 0.23,
# but we replace("0.23333", "[color]0.23333[/color]")
formatted_message = parsed_message.format(**params)
for key, value in params.items():
# value_str may discard `[` and `]`, so we need to replace it.
value_str = str(value).replace("[", "[[").replace("]", "]]")
# check is need apply color
if colors and key in colors:
color_str = f"[{colors[key]}]{value_str}[/{colors[key]}]"
formatted_message = formatted_message.replace(value_str, color_str)
value_str = color_str
# Then apply colors if specified
color_map = {
"green": Fore.GREEN,
"red": Fore.RED,
"yellow": Fore.YELLOW,
"blue": Fore.BLUE,
"cyan": Fore.CYAN,
"magenta": Fore.MAGENTA,
"white": Fore.WHITE,
"black": Fore.BLACK,
"reset": Style.RESET_ALL,
}
if colors:
for key, color in colors.items():
# Find the formatted value in the message and wrap it with color
if color in color_map:
color = color_map[color]
if key in params:
value_str = str(params[key])
formatted_message = formatted_message.replace(
value_str, f"{color}{value_str}{Style.RESET_ALL}"
)
# check is need apply box
if boxes and key in boxes:
formatted_message = formatted_message.replace(value_str,
create_box_message(value_str, type=str(level)))
except KeyError as e:
formatted_message = (
f"LOGGING ERROR: Missing parameter {e} in message template"
)
level = LogLevel.ERROR
else:
formatted_message = message
formatted_message = parsed_message
# Construct the full log line
color = base_color or self.colors[level]
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
color: LogColor = base_color or self.colors[level]
log_line = f"[{color}]{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message} [/{color}]"
# Output to console if verbose
if self.verbose or kwargs.get("force_verbose", False):
print(log_line)
self.console.print(log_line)
# Write to file if configured
self._write_to_file(log_line)
@@ -292,8 +294,8 @@ class AsyncLogger(AsyncLoggerBase):
"timing": timing,
},
colors={
"status": Fore.GREEN if success else Fore.RED,
"timing": Fore.YELLOW,
"status": LogColor.SUCCESS if success else LogColor.ERROR,
"timing": LogColor.WARNING,
},
)
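
The FIXME in `_log` above notes that replacing `str(value)` in an already formatted message can miss values rendered through a format spec such as `{time:.2f}`. The sketch below first reproduces that mismatch, then shows one possible workaround: wrapping each colored parameter in an object whose `__format__` applies the spec before adding the markup. The `Colored` class and `render` function are hypothetical and are not what this commit implements; bracket escaping is deliberately left out.

```python
from typing import Any, Dict


class Colored:
    """Wraps a value so the format spec is applied before the color markup."""

    def __init__(self, value: Any, color: str) -> None:
        self.value = value
        self.color = color

    def __format__(self, spec: str) -> str:
        rendered = format(self.value, spec)
        return f"[{self.color}]{rendered}[/{self.color}]"


def render(template: str, params: Dict[str, Any], colors: Dict[str, str]) -> str:
    """Format the template, coloring only the parameters listed in `colors`."""
    wrapped = {
        key: Colored(value, colors[key]) if key in colors else value
        for key, value in params.items()
    }
    return template.format(**wrapped)


# Replace-based approach: the raw value never appears in the formatted text,
# so the replacement is a no-op and the color is lost.
formatted = "Completed in {time:.2f}s".format(time=0.23333)
replaced = formatted.replace(str(0.23333), "[yellow]0.23333[/yellow]")

# Wrapper-based approach: the spec and the markup are applied together.
colored = render("Completed in {time:.2f}s", {"time": 0.23333}, {"time": "yellow"})
print(replaced)
print(colored)
```

Because the markup is attached at formatting time, this variant also avoids coloring an unrelated substring that happens to equal a parameter's string form.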

View File

@@ -2,7 +2,6 @@ from .__version__ import __version__ as crawl4ai_version
import os
import sys
import time
from colorama import Fore
from pathlib import Path
from typing import Optional, List
import json
@@ -44,7 +43,6 @@ from .utils import (
sanitize_input_encode,
InvalidCSSSelectorError,
fast_format_html,
create_box_message,
get_error_context,
RobotsParser,
preprocess_html_for_schema,
@@ -419,7 +417,7 @@ class AsyncWebCrawler:
self.logger.error_status(
url=url,
error=create_box_message(error_message, type="error"),
error=error_message,
tag="ERROR",
)
@@ -496,11 +494,13 @@ class AsyncWebCrawler:
cleaned_html = sanitize_input_encode(
result.get("cleaned_html", ""))
media = result.get("media", {})
tables = media.pop("tables", []) if isinstance(media, dict) else []
links = result.get("links", {})
metadata = result.get("metadata", {})
else:
cleaned_html = sanitize_input_encode(result.cleaned_html)
media = result.media.model_dump()
tables = media.pop("tables", [])
links = result.links.model_dump()
metadata = result.metadata
@@ -627,6 +627,7 @@ class AsyncWebCrawler:
cleaned_html=cleaned_html,
markdown=markdown_result,
media=media,
tables=tables, # NEW
links=links,
metadata=metadata,
screenshot=screenshot_data,
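
The table extraction in this hunk relies on `dict.pop` with a default. A minimal sketch of that step, using a hypothetical `split_tables` helper and made-up media data, might look like this:

```python
from typing import Any, Dict, List, Tuple


def split_tables(media: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Move the 'tables' entry out of a media dict, tolerating other inputs."""
    # pop() mutates the dict, so 'tables' no longer appears under media.
    tables = media.pop("tables", []) if isinstance(media, dict) else []
    return media, tables


# Hypothetical scraper output
media_in = {"images": [{"src": "a.png"}], "tables": [{"headers": ["x"], "rows": [["1"]]}]}
media_out, tables_out = split_tables(media_in)
print(media_out, tables_out)
```

Since `pop` removes the key, code that still reads tables from `media` would find nothing there after this step; whether compatibility is preserved elsewhere is not visible in this hunk.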

View File

@@ -5,7 +5,10 @@ import os
import sys
import shutil
import tempfile
import psutil
import signal
import subprocess
import shlex
from playwright.async_api import BrowserContext
import hashlib
from .js_snippet import load_js_script
@@ -193,6 +196,45 @@ class ManagedBrowser:
if self.browser_config.extra_args:
args.extend(self.browser_config.extra_args)
# ── make sure no old Chromium instance is owning the same port/profile ──
try:
if sys.platform == "win32":
if psutil is None:
raise RuntimeError("psutil not available, cannot clean old browser")
for p in psutil.process_iter(["pid", "name", "cmdline"]):
cl = " ".join(p.info.get("cmdline") or [])
if (
f"--remote-debugging-port={self.debugging_port}" in cl
and f"--user-data-dir={self.user_data_dir}" in cl
):
p.kill()
p.wait(timeout=5)
else: # macOS / Linux
# kill any process listening on the same debugging port
pids = (
subprocess.check_output(shlex.split(f"lsof -t -i:{self.debugging_port}"))
.decode()
.strip()
.splitlines()
)
for pid in pids:
try:
os.kill(int(pid), signal.SIGTERM)
except ProcessLookupError:
pass
# remove Chromium singleton locks, or new launch exits with
# “Opening in existing browser session.”
for f in ("SingletonLock", "SingletonSocket", "SingletonCookie"):
fp = os.path.join(self.user_data_dir, f)
if os.path.exists(fp):
os.remove(fp)
except Exception as _e:
# non-fatal — we'll try to start anyway, but log what happened
self.logger.warning(f"pre-launch cleanup failed: {_e}", tag="BROWSER")
# Start browser process
try:
@@ -922,7 +964,7 @@ class BrowserManager:
pages = context.pages
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
if not page:
page = await context.new_page()
page = context.pages[0] # await context.new_page()
else:
# Otherwise, check if we have an existing context for this config
config_signature = self._make_config_signature(crawlerRunConfig)
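
The singleton-lock removal in the pre-launch cleanup above can be sketched as a small standalone helper. The function name `remove_singleton_locks` is hypothetical. One deliberate deviation from the diff: it uses `os.path.lexists` instead of `os.path.exists`, on the assumption that a lock may be a dangling symlink, which `exists` would report as absent.

```python
import os
import tempfile
from typing import List

# File names taken from the cleanup loop in the diff above
SINGLETON_FILES = ("SingletonLock", "SingletonSocket", "SingletonCookie")


def remove_singleton_locks(user_data_dir: str) -> List[str]:
    """Delete Chromium singleton files in a profile directory; return removed names."""
    removed: List[str] = []
    for name in SINGLETON_FILES:
        path = os.path.join(user_data_dir, name)
        # lexists() is also true for a symlink whose target is missing
        if os.path.lexists(path):
            os.remove(path)
            removed.append(name)
    return removed


# Demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as profile_dir:
    open(os.path.join(profile_dir, "SingletonLock"), "w").close()
    open(os.path.join(profile_dir, "Preferences"), "w").close()
    removed_names = remove_singleton_locks(profile_dir)
    remaining = sorted(os.listdir(profile_dir))

print(removed_names, remaining)
```

Only the three named files are touched, so the rest of the profile stays intact.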

View File

@@ -15,12 +15,12 @@ import shutil
import json
import subprocess
import time
from typing import List, Dict, Optional, Any, Tuple
from colorama import Fore, Style, init
from typing import List, Dict, Optional, Any
from rich.console import Console
from .async_configs import BrowserConfig
from .browser_manager import ManagedBrowser
from .async_logger import AsyncLogger, AsyncLoggerBase
from .async_logger import AsyncLogger, AsyncLoggerBase, LogColor
from .utils import get_home_folder
@@ -45,8 +45,8 @@ class BrowserProfiler:
logger (AsyncLoggerBase, optional): Logger for outputting messages.
If None, a default AsyncLogger will be created.
"""
# Initialize colorama for colorful terminal output
init()
# Initialize rich console for colorful input prompts
self.console = Console()
# Create a logger if not provided
if logger is None:
@@ -127,26 +127,30 @@ class BrowserProfiler:
profile_path = os.path.join(self.profiles_dir, profile_name)
os.makedirs(profile_path, exist_ok=True)
# Print instructions for the user with colorama formatting
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
self.logger.info(f"\n{border}", tag="PROFILE")
self.logger.info(f"Creating browser profile: {Fore.GREEN}{profile_name}{Style.RESET_ALL}", tag="PROFILE")
self.logger.info(f"Profile directory: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
# Print instructions for the user with rich formatting
border = "{'='*80}"
self.logger.info("{border}", tag="PROFILE", params={"border": f"\n{border}"}, colors={"border": LogColor.CYAN})
self.logger.info("Creating browser profile: {profile_name}", tag="PROFILE", params={"profile_name": profile_name}, colors={"profile_name": LogColor.GREEN})
self.logger.info("Profile directory: {profile_path}", tag="PROFILE", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
self.logger.info("\nInstructions:", tag="PROFILE")
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
self.logger.info(f"2. {Fore.CYAN}Log in to websites{Style.RESET_ALL}, configure settings, etc. as needed.", tag="PROFILE")
self.logger.info(f"3. When you're done, {Fore.YELLOW}press 'q' in this terminal{Style.RESET_ALL} to close the browser.", tag="PROFILE")
self.logger.info("{segment}, configure settings, etc. as needed.", tag="PROFILE", params={"segment": "2. Log in to websites"}, colors={"segment": LogColor.CYAN})
self.logger.info("3. When you're done, {segment} to close the browser.", tag="PROFILE", params={"segment": "press 'q' in this terminal"}, colors={"segment": LogColor.YELLOW})
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
self.logger.info(f"{border}\n", tag="PROFILE")
self.logger.info("{border}", tag="PROFILE", params={"border": f"{border}\n"}, colors={"border": LogColor.CYAN})
browser_config.headless = False
browser_config.user_data_dir = profile_path
# Create managed browser instance
managed_browser = ManagedBrowser(
browser_type=browser_config.browser_type,
user_data_dir=profile_path,
headless=False, # Must be visible
browser_config=browser_config,
# user_data_dir=profile_path,
# headless=False, # Must be visible
logger=self.logger,
debugging_port=browser_config.debugging_port
# debugging_port=browser_config.debugging_port
)
# Set up signal handlers to ensure cleanup on interrupt
@@ -181,7 +185,7 @@ class BrowserProfiler:
import select
# First output the prompt
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' when you've finished using the browser...{Style.RESET_ALL}", tag="PROFILE")
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
# Save original terminal settings
fd = sys.stdin.fileno()
@@ -197,7 +201,7 @@ class BrowserProfiler:
if readable:
key = sys.stdin.read(1)
if key.lower() == 'q':
self.logger.info(f"{Fore.GREEN}Closing browser and saving profile...{Style.RESET_ALL}", tag="PROFILE")
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
user_done_event.set()
return
@@ -223,7 +227,7 @@ class BrowserProfiler:
self.logger.error("Failed to start browser process.", tag="PROFILE")
return None
self.logger.info(f"Browser launched. {Fore.CYAN}Waiting for you to finish...{Style.RESET_ALL}", tag="PROFILE")
self.logger.info("Browser launched. Waiting for you to finish...", tag="PROFILE")
# Start listening for keyboard input
listener_task = asyncio.create_task(listen_for_quit_command())
@@ -245,10 +249,10 @@ class BrowserProfiler:
self.logger.info("Terminating browser process...", tag="PROFILE")
await managed_browser.cleanup()
self.logger.success(f"Browser closed. Profile saved at: {Fore.GREEN}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
self.logger.success(f"Browser closed. Profile saved at: {profile_path}", tag="PROFILE")
except Exception as e:
self.logger.error(f"Error creating profile: {str(e)}", tag="PROFILE")
self.logger.error(f"Error creating profile: {e!s}", tag="PROFILE")
await managed_browser.cleanup()
return None
finally:
@@ -440,25 +444,27 @@ class BrowserProfiler:
```
"""
while True:
self.logger.info(f"\n{Fore.CYAN}Profile Management Options:{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"1. {Fore.GREEN}Create a new profile{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"2. {Fore.YELLOW}List available profiles{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"3. {Fore.RED}Delete a profile{Style.RESET_ALL}", tag="MENU")
self.logger.info("\nProfile Management Options:", tag="MENU")
self.logger.info("1. Create a new profile", tag="MENU", base_color=LogColor.GREEN)
self.logger.info("2. List available profiles", tag="MENU", base_color=LogColor.YELLOW)
self.logger.info("3. Delete a profile", tag="MENU", base_color=LogColor.RED)
# Only show crawl option if callback provided
if crawl_callback:
self.logger.info(f"4. {Fore.CYAN}Use a profile to crawl a website{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"5. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
self.logger.info("4. Use a profile to crawl a website", tag="MENU", base_color=LogColor.CYAN)
self.logger.info("5. Exit", tag="MENU", base_color=LogColor.MAGENTA)
exit_option = "5"
else:
self.logger.info(f"4. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
exit_option = "4"
choice = input(f"\n{Fore.CYAN}Enter your choice (1-{exit_option}): {Style.RESET_ALL}")
self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
choice = input()
if choice == "1":
# Create new profile
name = input(f"{Fore.GREEN}Enter a name for the new profile (or press Enter for auto-generated name): {Style.RESET_ALL}")
self.console.print("[green]Enter a name for the new profile (or press Enter for auto-generated name): [/green]", end="")
name = input()
await self.create_profile(name or None)
elif choice == "2":
@@ -472,8 +478,8 @@ class BrowserProfiler:
# Print profile information with colorama formatting
self.logger.info("\nAvailable profiles:", tag="PROFILES")
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {Fore.CYAN}{profile['name']}{Style.RESET_ALL}", tag="PROFILES")
self.logger.info(f" Path: {Fore.YELLOW}{profile['path']}{Style.RESET_ALL}", tag="PROFILES")
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
self.logger.info(f" Path: {profile['path']}", tag="PROFILES", base_color=LogColor.YELLOW)
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
self.logger.info("", tag="PROFILES") # Empty line for spacing
@@ -486,12 +492,13 @@ class BrowserProfiler:
continue
# Display numbered list
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
# Get profile to delete
profile_idx = input(f"{Fore.RED}Enter the number of the profile to delete (or 'c' to cancel): {Style.RESET_ALL}")
self.console.print("[red]Enter the number of the profile to delete (or 'c' to cancel): [/red]", end="")
profile_idx = input()
if profile_idx.lower() == 'c':
continue
@@ -499,17 +506,18 @@ class BrowserProfiler:
idx = int(profile_idx) - 1
if 0 <= idx < len(profiles):
profile_name = profiles[idx]["name"]
self.logger.info(f"Deleting profile: {Fore.YELLOW}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
self.logger.info(f"Deleting profile: [yellow]{profile_name}[/yellow]", tag="PROFILES")
# Confirm deletion
confirm = input(f"{Fore.RED}Are you sure you want to delete this profile? (y/n): {Style.RESET_ALL}")
self.console.print("[red]Are you sure you want to delete this profile? (y/n): [/red]", end="")
confirm = input()
if confirm.lower() == 'y':
success = self.delete_profile(profiles[idx]["path"])
if success:
self.logger.success(f"Profile {Fore.GREEN}{profile_name}{Style.RESET_ALL} deleted successfully", tag="PROFILES")
self.logger.success(f"Profile {profile_name} deleted successfully", tag="PROFILES")
else:
self.logger.error(f"Failed to delete profile {Fore.RED}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
self.logger.error(f"Failed to delete profile {profile_name}", tag="PROFILES")
else:
self.logger.error("Invalid profile number", tag="PROFILES")
except ValueError:
@@ -523,12 +531,13 @@ class BrowserProfiler:
continue
# Display numbered list
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
# Get profile to use
profile_idx = input(f"{Fore.CYAN}Enter the number of the profile to use (or 'c' to cancel): {Style.RESET_ALL}")
self.console.print("[cyan]Enter the number of the profile to use (or 'c' to cancel): [/cyan]", end="")
profile_idx = input()
if profile_idx.lower() == 'c':
continue
@@ -536,7 +545,8 @@ class BrowserProfiler:
idx = int(profile_idx) - 1
if 0 <= idx < len(profiles):
profile_path = profiles[idx]["path"]
url = input(f"{Fore.CYAN}Enter the URL to crawl: {Style.RESET_ALL}")
self.console.print("[cyan]Enter the URL to crawl: [/cyan]", end="")
url = input()
if url:
# Call the provided crawl callback
await crawl_callback(profile_path, url)
@@ -599,11 +609,11 @@ class BrowserProfiler:
# Print initial information
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
self.logger.info(f"\n{border}", tag="CDP")
self.logger.info(f"Launching standalone browser with CDP debugging", tag="CDP")
self.logger.info(f"Browser type: {Fore.GREEN}{browser_type}{Style.RESET_ALL}", tag="CDP")
self.logger.info(f"Profile path: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="CDP")
self.logger.info(f"Debugging port: {Fore.CYAN}{debugging_port}{Style.RESET_ALL}", tag="CDP")
self.logger.info(f"Headless mode: {Fore.CYAN}{headless}{Style.RESET_ALL}", tag="CDP")
self.logger.info("Launching standalone browser with CDP debugging", tag="CDP")
self.logger.info("Browser type: {browser_type}", tag="CDP", params={"browser_type": browser_type}, colors={"browser_type": LogColor.CYAN})
self.logger.info("Profile path: {profile_path}", tag="CDP", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
self.logger.info(f"Headless mode: {headless}", tag="CDP")
# Create managed browser instance
managed_browser = ManagedBrowser(
@@ -646,7 +656,7 @@ class BrowserProfiler:
import select
# First output the prompt
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' to stop the browser and exit...{Style.RESET_ALL}", tag="CDP")
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
# Save original terminal settings
fd = sys.stdin.fileno()
@@ -662,7 +672,7 @@ class BrowserProfiler:
if readable:
key = sys.stdin.read(1)
if key.lower() == 'q':
self.logger.info(f"{Fore.GREEN}Closing browser...{Style.RESET_ALL}", tag="CDP")
self.logger.info("Closing browser...", tag="CDP")
user_done_event.set()
return
@@ -716,20 +726,20 @@ class BrowserProfiler:
self.logger.error("Failed to start browser process.", tag="CDP")
return None
self.logger.info(f"Browser launched successfully. Retrieving CDP information...", tag="CDP")
self.logger.info("Browser launched successfully. Retrieving CDP information...", tag="CDP")
# Get CDP URL and JSON config
cdp_url, config_json = await get_cdp_json(debugging_port)
if cdp_url:
self.logger.success(f"CDP URL: {Fore.GREEN}{cdp_url}{Style.RESET_ALL}", tag="CDP")
self.logger.success(f"CDP URL: {cdp_url}", tag="CDP")
if config_json:
# Display relevant CDP information
self.logger.info(f"Browser: {Fore.CYAN}{config_json.get('Browser', 'Unknown')}{Style.RESET_ALL}", tag="CDP")
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP")
self.logger.info(f"Browser: {config_json.get('Browser', 'Unknown')}", tag="CDP", colors={"Browser": LogColor.CYAN})
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP", colors={"Protocol-Version": LogColor.CYAN})
if 'webSocketDebuggerUrl' in config_json:
self.logger.info(f"WebSocket URL: {Fore.GREEN}{config_json['webSocketDebuggerUrl']}{Style.RESET_ALL}", tag="CDP")
self.logger.info("WebSocket URL: {webSocketDebuggerUrl}", tag="CDP", params={"webSocketDebuggerUrl": config_json['webSocketDebuggerUrl']}, colors={"webSocketDebuggerUrl": LogColor.GREEN})
else:
self.logger.warning("Could not retrieve CDP configuration JSON", tag="CDP")
else:
@@ -757,7 +767,7 @@ class BrowserProfiler:
self.logger.info("Terminating browser process...", tag="CDP")
await managed_browser.cleanup()
self.logger.success(f"Browser closed.", tag="CDP")
self.logger.success("Browser closed.", tag="CDP")
except Exception as e:
self.logger.error(f"Error launching standalone browser: {str(e)}", tag="CDP")
@@ -972,3 +982,30 @@ class BrowserProfiler:
'info': browser_info
}
if __name__ == "__main__":
# Example usage
profiler = BrowserProfiler()
# Create a new profile
import os
from pathlib import Path
home_dir = Path.home()
profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
# Launch a standalone browser
asyncio.run(profiler.launch_standalone_browser())
# List profiles
profiles = profiler.list_profiles()
for profile in profiles:
print(f"Profile: {profile['name']}, Path: {profile['path']}")
# Delete a profile
success = profiler.delete_profile("my-profile")
if success:
print("Profile deleted successfully")
else:
print("Failed to delete profile")

View File

@@ -27,8 +27,7 @@ import json
import hashlib
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from .async_logger import AsyncLogger, LogLevel
from colorama import Fore, Style
from .async_logger import AsyncLogger, LogLevel, LogColor
class RelevantContentFilter(ABC):
@@ -846,8 +845,7 @@ class LLMContentFilter(RelevantContentFilter):
},
colors={
**AsyncLogger.DEFAULT_COLORS,
LogLevel.INFO: Fore.MAGENTA
+ Style.DIM, # Dimmed purple for LLM ops
LogLevel.INFO: LogColor.DIM_MAGENTA # Dimmed purple for LLM ops
},
)
else:
@@ -892,7 +890,7 @@ class LLMContentFilter(RelevantContentFilter):
"Starting LLM markdown content filtering process",
tag="LLM",
params={"provider": self.llm_config.provider},
colors={"provider": Fore.CYAN},
colors={"provider": LogColor.CYAN},
)
# Cache handling
@@ -929,7 +927,7 @@ class LLMContentFilter(RelevantContentFilter):
"LLM markdown: Split content into {chunk_count} chunks",
tag="CHUNK",
params={"chunk_count": len(html_chunks)},
colors={"chunk_count": Fore.YELLOW},
colors={"chunk_count": LogColor.YELLOW},
)
start_time = time.time()
@@ -1038,7 +1036,7 @@ class LLMContentFilter(RelevantContentFilter):
"LLM markdown: Completed processing in {time:.2f}s",
tag="LLM",
params={"time": end_time - start_time},
colors={"time": Fore.YELLOW},
colors={"time": LogColor.YELLOW},
)
result = ordered_results if ordered_results else []
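
The hunks above swap colorama constants (`Fore.CYAN`, `Fore.MAGENTA + Style.DIM`) for members of a `LogColor` enumeration. The sketch below illustrates the idea only: the `LogColor` class, its values, and the `render` helper are hypothetical stand-ins written for this note, not the actual `async_logger` implementation.

```python
from enum import Enum


class LogColor(str, Enum):
    """Hypothetical stand-in for the LogColor enum referenced in the diff."""
    GREEN = "green"
    CYAN = "cyan"
    YELLOW = "yellow"
    DIM_MAGENTA = "dim magenta"


def render(message: str, params: dict, colors: dict) -> str:
    """Format `message`, wrapping coloured params in markup-style tags.

    Only plain "{name}" placeholders are supported in this sketch.
    """
    styled = {}
    for key, value in params.items():
        color = colors.get(key)
        if color is None:
            styled[key] = str(value)
        else:
            styled[key] = f"[{color.value}]{value}[/{color.value}]"
    return message.format(**styled)


line = render(
    "LLM markdown: Split content into {chunk_count} chunks",
    params={"chunk_count": 7},
    colors={"chunk_count": LogColor.YELLOW},
)
print(line)
```

Keeping colours in one enum means call sites no longer depend on a specific terminal library, so the rendering backend can change without touching every log call.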

View File

@@ -1,4 +1,4 @@
from pydantic import BaseModel, HttpUrl, PrivateAttr
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from typing import AsyncGenerator
from typing import Generic, TypeVar
@@ -150,6 +150,7 @@ class CrawlResult(BaseModel):
redirected_url: Optional[str] = None
network_requests: Optional[List[Dict[str, Any]]] = None
console_messages: Optional[List[Dict[str, Any]]] = None
tables: List[Dict] = Field(default_factory=list) # NEW [{headers,rows,caption,summary}]
class Config:
arbitrary_types_allowed = True
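
The new `tables` field defaults to an empty list via `default_factory`, so callers can iterate over it without a `None` check. The sketch below shows the same pattern with a stdlib dataclass; `ResultSketch` and `build_result` are illustrative names, and the assumption that tables arrive under `media["tables"]` comes from the commit message, not from code shown here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ResultSketch:
    """Illustrative stand-in for CrawlResult; not the real pydantic model."""
    url: str
    media: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)
    # Each instance gets its own fresh list, never a shared default.
    tables: List[Dict[str, Any]] = field(default_factory=list)


def build_result(url: str, media: Dict[str, List[Dict[str, Any]]]) -> ResultSketch:
    # Assumption: tables are nested under media["tables"]; expose them
    # at the top level while leaving `media` untouched for old callers.
    return ResultSketch(url=url, media=media, tables=list(media.get("tables", [])))


first = ResultSketch(url="https://example.com/a")
second = ResultSketch(url="https://example.com/b")
first.tables.append({"headers": ["x"], "rows": [], "caption": "", "summary": ""})

result = build_result(
    "https://example.com",
    {"tables": [{"headers": ["a", "b"], "rows": [["1", "2"]], "caption": "", "summary": ""}]},
)
for table in result.tables:  # safe even when no tables were found
    print(table["headers"])
```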

View File

@@ -20,7 +20,6 @@ from urllib.parse import urljoin
import requests
from requests.exceptions import InvalidSchema
import xxhash
from colorama import Fore, Style, init
import textwrap
import cProfile
import pstats
@@ -441,14 +440,13 @@ def create_box_message(
str: A formatted string containing the styled message box.
"""
init()
# Define border and text colors for different types
styles = {
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, ""),
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, ""),
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, ""),
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
"warning": ("yellow", "bright_yellow", ""),
"info": ("blue", "bright_blue", ""),
"debug": ("lightblack", "bright_black", ""),
"success": ("green", "bright_green", ""),
"error": ("red", "bright_red", "×"),
}
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
@@ -480,12 +478,12 @@ def create_box_message(
# Create the box with colored borders and lighter text
horizontal_line = h_line * (width - 1)
box = [
f"{border_color}{tl}{horizontal_line}{tr}",
f"[{border_color}]{tl}{horizontal_line}{tr}[/{border_color}]",
*[
f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}"
f"[{border_color}]{v_line}[{text_color}] {line:<{width-2}}[/{text_color}][{border_color}]{v_line}[/{border_color}]"
for line in formatted_lines
],
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}",
f"[{border_color}]{bl}{horizontal_line}{br}[/{border_color}]",
]
result = "\n".join(box)
@@ -2778,4 +2776,3 @@ def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_thre
# Fallback for parsing errors
return html_content[:max_size] if len(html_content) > max_size else html_content
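
The `create_box_message` hunk replaces ANSI escape prefixes with markup tags such as `[yellow]...[/yellow]`. A minimal sketch of that construction follows; `box_message` and its defaults are hypothetical and simplified (no text wrapping or prefix handling), and the output is only a markup string that a markup-aware console would have to render.

```python
def box_message(lines, border="yellow", text="bright_yellow", width=20):
    """Build a box whose borders and text are wrapped in markup tags."""
    tl, tr, bl, br, h_line, v_line = "╭", "╮", "╰", "╯", "─", "│"
    horizontal = h_line * (width - 1)
    rows = [f"[{border}]{tl}{horizontal}{tr}[/{border}]"]
    for line in lines:
        # Pad the text so the right border lines up with the corners.
        rows.append(
            f"[{border}]{v_line}[/{border}]"
            f"[{text}] {line:<{width - 2}}[/{text}]"
            f"[{border}]{v_line}[/{border}]"
        )
    rows.append(f"[{border}]{bl}{horizontal}{br}[/{border}]")
    return "\n".join(rows)


box = box_message(["hello"])
print(box)
```

Closing every tag explicitly, as done here, keeps each row self-contained, so a row can be printed or truncated on its own without leaking styles into the next one.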

View File

@@ -58,7 +58,7 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0rc1-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
```bash
# Pull the release candidate (recommended for latest features)
@@ -124,9 +124,9 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0rc1-r1`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (`rc1`) and revisions (`r1`)
* `SUFFIX`: Optional tag for release candidates and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag

View File

@@ -193,7 +193,48 @@
<textarea id="urls" class="w-full bg-dark border border-border rounded p-2 h-32 text-sm mb-4"
spellcheck="false">https://example.com</textarea>
<details class="mb-4">
<!-- Specific options for /md endpoint -->
<details id="md-options" class="mb-4 hidden">
<summary class="text-sm text-secondary cursor-pointer">/md Options</summary>
<div class="mt-2 space-y-3 p-2 border border-border rounded">
<div>
<label for="md-filter" class="block text-xs text-secondary mb-1">Filter Type</label>
<select id="md-filter" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
<option value="fit">fit - Adaptive content filtering</option>
<option value="raw">raw - No filtering</option>
<option value="bm25">bm25 - BM25 keyword relevance</option>
<option value="llm">llm - LLM-based filtering</option>
</select>
</div>
<div>
<label for="md-query" class="block text-xs text-secondary mb-1">Query (for BM25/LLM filters)</label>
<input id="md-query" type="text" placeholder="Enter search terms or instructions"
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
</div>
<div>
<label for="md-cache" class="block text-xs text-secondary mb-1">Cache Mode</label>
<select id="md-cache" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
<option value="0">Write-Only (0)</option>
<option value="1">Enabled (1)</option>
</select>
</div>
</div>
</details>
<!-- Specific options for /llm endpoint -->
<details id="llm-options" class="mb-4 hidden">
<summary class="text-sm text-secondary cursor-pointer">/llm Options</summary>
<div class="mt-2 space-y-3 p-2 border border-border rounded">
<div>
<label for="llm-question" class="block text-xs text-secondary mb-1">Question</label>
<input id="llm-question" type="text" value="What is this page about?"
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
</div>
</div>
</details>
<!-- Advanced config for /crawl endpoints -->
<details id="adv-config" class="mb-4">
<summary class="text-sm text-secondary cursor-pointer">Advanced Config <span
class="text-xs text-primary">(Python → autoJSON)</span></summary>
@@ -437,6 +478,33 @@
cm.setValue(TEMPLATES[e.target.value]);
document.getElementById('cfg-status').textContent = '';
});
// Handle endpoint selection change to show appropriate options
document.getElementById('endpoint').addEventListener('change', function(e) {
const endpoint = e.target.value;
const mdOptions = document.getElementById('md-options');
const llmOptions = document.getElementById('llm-options');
const advConfig = document.getElementById('adv-config');
// Hide all option sections first
mdOptions.classList.add('hidden');
llmOptions.classList.add('hidden');
advConfig.classList.add('hidden');
// Show the appropriate section based on endpoint
if (endpoint === 'md') {
mdOptions.classList.remove('hidden');
// Auto-open the /md options
mdOptions.setAttribute('open', '');
} else if (endpoint === 'llm') {
llmOptions.classList.remove('hidden');
// Auto-open the /llm options
llmOptions.setAttribute('open', '');
} else {
// For /crawl endpoints, show the advanced config
advConfig.classList.remove('hidden');
}
});
async function pyConfigToJson() {
const code = cm.getValue().trim();
@@ -494,10 +562,18 @@
}
// Generate code snippets
function generateSnippets(api, payload) {
function generateSnippets(api, payload, method = 'POST') {
// Python snippet
const pyCodeEl = document.querySelector('#python-content code');
const pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
let pySnippet;
if (method === 'GET') {
// GET request (for /llm endpoint)
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.get(\n "${window.location.origin}${api}"\n )\n return response.json()`;
} else {
// POST request (for /crawl and /md endpoints)
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
}
pyCodeEl.textContent = pySnippet;
pyCodeEl.className = 'python hljs'; // Reset classes
@@ -505,7 +581,15 @@
// cURL snippet
const curlCodeEl = document.querySelector('#curl-content code');
const curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
let curlSnippet;
if (method === 'GET') {
// GET request (for /llm endpoint)
curlSnippet = `curl -X GET "${window.location.origin}${api}"`;
} else {
// POST request (for /crawl and /md endpoints)
curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
}
curlCodeEl.textContent = curlSnippet;
curlCodeEl.className = 'bash hljs'; // Reset classes
@@ -536,20 +620,39 @@
const endpointMap = {
crawl: '/crawl',
};
/*const endpointMap = {
crawl: '/crawl',
crawl_stream: '/crawl/stream',
// crawl_stream: '/crawl/stream',
md: '/md',
llm: '/llm'
};*/
};
const api = endpointMap[endpoint];
const payload = {
urls,
...advConfig
};
let payload;
// Create appropriate payload based on endpoint type
if (endpoint === 'md') {
// Get values from the /md specific inputs
const filterType = document.getElementById('md-filter').value;
const query = document.getElementById('md-query').value.trim();
const cache = document.getElementById('md-cache').value;
// MD endpoint expects: { url, f, q, c }
payload = {
url: urls[0], // Take first URL
f: filterType, // Lowercase filter type as required by server
q: query || null, // Use the query if provided, otherwise null
c: cache
};
} else if (endpoint === 'llm') {
// LLM endpoint has a different URL pattern and uses query params
// This will be handled directly in the fetch below
payload = null;
} else {
// Default payload for /crawl and /crawl/stream
payload = {
urls,
...advConfig
};
}
updateStatus('processing');
@@ -557,7 +660,18 @@
const startTime = performance.now();
let response, responseData;
if (endpoint === 'crawl_stream') {
if (endpoint === 'llm') {
// Special handling for LLM endpoint which uses URL pattern: /llm/{encoded_url}?q={query}
const url = urls[0];
const encodedUrl = encodeURIComponent(url);
// Get the question from the LLM-specific input
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
response = await fetch(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, {
method: 'GET',
headers: { 'Accept': 'application/json' }
});
} else if (endpoint === 'crawl_stream') {
// Stream processing
response = await fetch(api, {
method: 'POST',
@@ -597,7 +711,7 @@
document.querySelector('#response-content code').className = 'json hljs'; // Reset classes
forceHighlightElement(document.querySelector('#response-content code'));
} else {
// Regular request
// Regular request (handles /crawl and /md)
response = await fetch(api, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
@@ -625,7 +739,16 @@
}
forceHighlightElement(document.querySelector('#response-content code'));
generateSnippets(api, payload);
// For generateSnippets, handle the LLM case specially
if (endpoint === 'llm') {
const url = urls[0];
const encodedUrl = encodeURIComponent(url);
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
generateSnippets(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, null, 'GET');
} else {
generateSnippets(api, payload);
}
} catch (error) {
console.error('Error:', error);
updateStatus('error');
@@ -807,9 +930,24 @@
});
});
}
// Function to initialize UI based on selected endpoint
function initUI() {
// Trigger the endpoint change handler to set initial UI state
const endpointSelect = document.getElementById('endpoint');
const event = new Event('change');
endpointSelect.dispatchEvent(event);
// Initialize copy buttons
initCopyButtons();
}
// Call this in your DOMContentLoaded or initialization
initCopyButtons();
// Initialize on page load
document.addEventListener('DOMContentLoaded', initUI);
// Also call it immediately in case the script runs after DOM is already loaded
if (document.readyState !== 'loading') {
initUI();
}
</script>
</body>
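
The playground changes above send `/md` as a POST with a `{url, f, q, c}` body and `/llm` as a GET to `/llm/{encoded_url}?q={query}`. The sketch below rebuilds both requests in Python without sending them. The helper names are hypothetical, and `quote(..., safe="")` is only a close approximation of JavaScript's `encodeURIComponent` (it additionally escapes characters such as `!`, `*`, `'`, `(` and `)`).

```python
from urllib.parse import quote


def build_llm_url(page_url: str, question: str, api: str = "/llm") -> str:
    """Return the GET path for the /llm endpoint: /llm/{encoded_url}?q={query}."""
    fallback = "What is this page about?"  # same default as the UI input
    q = question.strip() or fallback
    return f"{api}/{quote(page_url, safe='')}?q={quote(q, safe='')}"


def build_md_payload(urls, filter_type="fit", query="", cache="0"):
    """Return the POST body for /md; only the first URL is used."""
    return {
        "url": urls[0],
        "f": filter_type,            # fit | raw | bm25 | llm
        "q": query.strip() or None,  # null when no query is given
        "c": cache,
    }


llm_url = build_llm_url("https://example.com", "")
md_payload = build_md_payload(["https://example.com", "https://example.org"], "bm25", " pricing ")
print(llm_url)
print(md_payload)
```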

126
docs/apps/linkdin/README.md Normal file
View File

@@ -0,0 +1,126 @@
# Crawl4AI Prospect-Wizard: step-by-step guide
A three-stage demo that goes from **LinkedIn scraping** → **LLM reasoning** → **graph visualisation**.
```
prospect-wizard/
├─ c4ai_discover.py # Stage 1: scrape companies + people
├─ c4ai_insights.py # Stage 2: embeddings, org-charts, scores
├─ graph_view_template.html # Stage 3: graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
```
---
## 1  Install & boot a LinkedIn profile (one-time)
### 1.1  Install dependencies
```bash
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
```
### 1.2  Create / warm a LinkedIn browser profile
```bash
crwl profiler
```
1. The interactive shell shows **New profile**; hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens; log in to LinkedIn, solve any CAPTCHA, then close the window.
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
---
## 2  Discovery: scrape companies & people
```bash
# --geo 102713980 is the Malaysia geoUrn; --title-filters also accepts e.g. "Product,Engineering".
# --max-companies / --max-people are kept small here for workshops.
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug
```
**Outputs** in `./data/`:
* `companies.jsonl`: one JSON object per company
* `people.jsonl`: one JSON object per employee
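
As a rough illustration of consuming these files, the sketch below parses JSONL lines and groups people by the `company_handle` field that Stage 1 writes into each record. The helper names and the inline sample are made up for this example; in practice the lines would come from `./data/people.jsonl`.

```python
import json
from collections import defaultdict


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_company(people):
    """Group people records by their `company_handle` field."""
    grouped = defaultdict(list)
    for person in people:
        grouped[person.get("company_handle")].append(person)
    return dict(grouped)


# Inline sample standing in for the contents of ./data/people.jsonl
sample = [
    '{"name": "Lily Ng", "headline": "VP Product @ Posify", "company_handle": "/company/posify/"}',
    "",
    '{"name": "A. Example", "headline": "Engineer", "company_handle": "/company/posify/"}',
]
people = parse_jsonl(sample)
by_company = group_by_company(people)
print({handle: len(members) for handle, members in by_company.items()})
```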
🛠️ **Dry-run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
### Handy geoUrn cheat-sheet
| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| United States | **103644922** |
| United Kingdom | **102221843** |
| Australia | **101452733** |
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX>; the number after `geoUrn=` is what you need._
---
## 3  Insights: embeddings, org-charts, decision makers
```bash
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed_model all-MiniLM-L6-v2 \
--top_k 10 \
--openai_model gpt-4.1 \
--max_llm_tokens 8024 \
--llm_temperature 1.0 \
--workers 4
```
Emits next to the Stage-1 files:
* `company_graph.json`: inter-company similarity graph
* `org_chart_<handle>.json`: one per company
* `decision_makers.csv`: hand-picked "who to pitch" list
Flags reference (straight from `build_arg_parser()`):
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage-1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | SentenceTransformer model |
| `--top_k` | `10` | Neighbours per company in graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Creativity knob |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
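
To make `--top_k` concrete: each company keeps edges to its `k` most similar peers. The sketch below does this with plain cosine similarity over toy vectors. It is an assumption-laden illustration, since the real script's similarity code is not shown here; the function names and the two-dimensional vectors are invented for the example, whereas the actual embeddings come from the SentenceTransformer model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_neighbours(vectors, k):
    """Map each company handle to its k most similar other companies."""
    neighbours = {}
    for name, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != name
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbours[name] = [other for other, _ in scored[:k]]
    return neighbours


toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
graph = top_k_neighbours(toy_vectors, k=1)
print(graph)
```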
---
## 4  Visualise: interactive graph
After Stage 2 completes, simply open the HTML viewer from the project root:
```bash
open graph_view_template.html # or serve the folder via Live Server / python -m http.server
```
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
* Left pane → list of companies (clans).
* Click a node to load its org-chart on the right.
* Chat drawer lets you ask follow-up questions; context is pulled from `people.jsonl`.
---
## 5  Common snags
| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
| Blank graph | Check JSON paths, clear `localStorage` in browser |
---
### TL;DR
`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
Live long and `import crawl4ai`.

View File

@@ -0,0 +1,440 @@
#!/usr/bin/env python3
"""
c4ai-discover: Stage-1 Discovery CLI
Scrapes LinkedIn company search + their people pages and dumps two newline-delimited
JSON files: companies.jsonl and people.jsonl.
Key design rules
----------------
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
* JsonCssExtractionStrategy for structured scraping; schema auto-generated once
from sample HTML provided by user and then cached under ./schemas/.
* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
* If executed as a console script (argv > 1), CLI flags win.
* Lightweight deps: argparse + Crawl4AI stack.
Author: Tom @ Kidocode 2025-04-26
"""
from __future__ import annotations
import warnings, re
warnings.filterwarnings(
"ignore",
message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
category=FutureWarning,
module=r"soupsieve"
)
# ───────────────────────────────────────────────────────────────────────────────
# Imports
# ───────────────────────────────────────────────────────────────────────────────
import argparse
import random
import asyncio
import json
import logging
import os
import pathlib
import sys
# 3rd-party rich for pretty logging
from rich.console import Console
from rich.logging import RichHandler
from datetime import datetime, UTC
from itertools import cycle
from textwrap import dedent
from types import SimpleNamespace
from typing import Dict, List, Optional
from urllib.parse import quote
from pathlib import Path
from glob import glob
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CacheMode,
CrawlerRunConfig,
JsonCssExtractionStrategy,
BrowserProfiler,
LLMConfig,
)
# ───────────────────────────────────────────────────────────────────────────────
# Constants / paths
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
SCHEMA_DIR = BASE_DIR / "schemas"
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
# ---------- deterministic target JSON examples ----------
_COMPANY_SCHEMA_EXAMPLE = {
"handle": "/company/posify/",
"profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
"name": "Management Research Services, Inc. (MRS, Inc)",
"descriptor": "Insurance • Milwaukee, Wisconsin",
"about": "Insurance • Milwaukee, Wisconsin",
"followers": 1000
}
_PEOPLE_SCHEMA_EXAMPLE = {
"profile_url": "https://www.linkedin.com/in/lily-ng/",
"name": "Lily Ng",
"headline": "VP Product @ Posify",
"followers": 890,
"connection_degree": "2nd",
"avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
}
# Provided sample HTML snippets (trimmed), used exactly once to cold-generate schema.
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
# --------- tighter schema prompts ----------
_COMPANY_SCHEMA_QUERY = dedent(
"""
Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
for every card, outputs *exactly* the keys shown in the example JSON below.
JSON spec:
• handle href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
• profile_image absolute URL of the <img> inside that link
• name text of the <a> inside the <span class*='t-16'>
• descriptor text line with industry • location
• about text of the <div class*='t-normal'> below the name (industry + geo)
• followers integer parsed from the <div> containing 'followers'
IMPORTANT: Do not target elements by their base64-style (hashed) class names; they are not reliable.
The <li> elements live inside "div.search-results-container", so you can anchor on that.
Their parent <ul> has role="list". These two hooks should be enough to target the <li> elements.
"""
)
_PEOPLE_SCHEMA_QUERY = dedent(
"""
Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
outputs exactly the keys in the example JSON below.
Fields:
• profile_url href of the outermost profile link
• name text inside artdeco-entity-lockup__title
• headline inner text of artdeco-entity-lockup__subtitle
• followers integer parsed from the span inside lt-line-clamp--multi-line
• connection_degree '1st', '2nd', etc. from artdeco-entity-lockup__badge
• avatar_url src of the <img> within artdeco-entity-lockup__image
IMPORTANT: Do not target elements by their base64-style (hashed) class names; they are not reliable.
The <li> elements live inside a <div> with the classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
"""
)
# ---------------------------------------------------------------------------
# Utility helpers
# ---------------------------------------------------------------------------
def _load_or_build_schema(
path: pathlib.Path,
sample_html: str,
query: str,
example_json: Dict,
force = False
) -> Dict:
"""Load schema from path, else call generate_schema once and persist."""
if path.exists() and not force:
return json.loads(path.read_text())
logging.info("[SCHEMA] Generating schema %s", path.name)
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
llm_config=LLMConfig(
provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
),
query=query,
target_json_example=json.dumps(example_json, indent=2),
)
path.write_text(json.dumps(schema, indent=2))
return schema
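def _openai_friendly_number(text: str) -> Optional[int]:
    """Parse a count such as '1K followers' (returns 1000) or '1,234 followers'."""
    # `re` is already imported at module level, so no local import is needed.
    # The suffix must directly follow the number and end at a word boundary,
    # so a stray "k" or "m" elsewhere in the text does not scale the value.
    m = re.search(r"(\d+(?:\.\d+)?)\s*([km])?\b", text.replace(",", ""), re.IGNORECASE)
    if not m:
        return None
    val = float(m.group(1))
    suffix = (m.group(2) or "").lower()
    if suffix == "k":
        val *= 1_000
    elif suffix == "m":
        val *= 1_000_000
    return int(round(val))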
# ---------------------------------------------------------------------------
# Core async workers
# ---------------------------------------------------------------------------
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
"""Paginate 10-item company search pages until `limit` reached."""
extraction = JsonCssExtractionStrategy(schema)
cfg = CrawlerRunConfig(
extraction_strategy=extraction,
cache_mode=CacheMode.BYPASS,
wait_for = ".search-marvel-srp",
session_id="company_search",
delay_before_return_html=1,
magic = True,
verbose= False,
)
companies, page = [], 1
while len(companies) < max(limit, 10):
paged_url = f"{url}&page={page}"
res = await crawler.arun(paged_url, config=cfg)
batch = json.loads(res[0].extracted_content)
if not batch:
break
for item in batch:
name = item.get("name", "").strip()
handle = item.get("handle", "").strip()
if not handle or not name:
continue
descriptor = item.get("descriptor")
about = item.get("about")
followers = _openai_friendly_number(str(item.get("followers", "")))
companies.append(
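{
"handle": handle,
"name": name,
"descriptor": descriptor,
"about": about,
"followers": followers,
"people_url": f"{handle}people/",
# isoformat() on an aware datetime already ends in "+00:00"; swap it for "Z" instead of appending.
"captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
}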
)
page += 1
logging.info(
f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
)
return companies[:max(limit, 10)]
async def crawl_people_page(
crawler: AsyncWebCrawler,
people_url: str,
schema: Dict,
limit: int,
title_kw: str,
) -> List[Dict]:
people_u = f"{people_url}?keywords={quote(title_kw)}"
extraction = JsonCssExtractionStrategy(schema)
cfg = CrawlerRunConfig(
extraction_strategy=extraction,
# scan_full_page=True,
cache_mode=CacheMode.BYPASS,
magic=True,
wait_for=".org-people-profile-card__card-spacing",
delay_before_return_html=1,
session_id="people_search",
)
res = await crawler.arun(people_u, config=cfg)
if not res[0].success:
return []
raw = json.loads(res[0].extracted_content)
people = []
for p in raw[:limit]:
followers = _openai_friendly_number(str(p.get("followers", "")))
people.append(
{
"profile_url": p.get("profile_url"),
"name": p.get("name"),
"headline": p.get("headline"),
"followers": followers,
"connection_degree": p.get("connection_degree"),
"avatar_url": p.get("avatar_url"),
}
)
return people
# ---------------------------------------------------------------------------
# CLI + main
# ---------------------------------------------------------------------------
def build_arg_parser() -> argparse.ArgumentParser:
ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
def add_flags(parser: argparse.ArgumentParser):
parser.add_argument("--query", required=False, help="query keyword(s)")
parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
parser.add_argument("--max-companies", type=int, default=1000)
parser.add_argument("--max-people", type=int, default=500)
parser.add_argument("--profile-name", default="profile_linkedin_uc", help="name of the crawl4ai browser profile")
parser.add_argument("--outdir", default="./output")
parser.add_argument("--concurrency", type=int, default=4)
parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
add_flags(sub.add_parser("full"))
add_flags(sub.add_parser("companies"))
add_flags(sub.add_parser("people"))
# global flags
ap.add_argument(
"--debug",
action="store_true",
help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
)
return ap
def detect_debug_defaults(force = False) -> SimpleNamespace:
if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
return SimpleNamespace()
# ----- debug-friendly defaults -----
return SimpleNamespace(
cmd="full",
query="health insurance management",
geo=102713980,
# title_filters="Product,Engineering",
title_filters="",
max_companies=10,
max_people=5,
profile_name="profile_linkedin_uc",
outdir="./debug_out",
concurrency=2,
log_level="debug",
)
async def async_main(opts):
# ─────────── logging setup ───────────
console = Console()
logging.basicConfig(
level=opts.log_level.upper(),
format="%(message)s",
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
)
# -------------------------------------------------------------------
# Load or build schemas (one-time LLM call each)
# -------------------------------------------------------------------
company_schema = _load_or_build_schema(
COMPANY_SCHEMA_PATH,
_SAMPLE_COMPANY_HTML,
_COMPANY_SCHEMA_QUERY,
_COMPANY_SCHEMA_EXAMPLE,
# True
)
people_schema = _load_or_build_schema(
PEOPLE_SCHEMA_PATH,
_SAMPLE_PEOPLE_HTML,
_PEOPLE_SCHEMA_QUERY,
_PEOPLE_SCHEMA_EXAMPLE,
# True
)
outdir = BASE_DIR / pathlib.Path(opts.outdir)
outdir.mkdir(parents=True, exist_ok=True)
f_companies = (BASE_DIR / outdir / "companies.jsonl").open("a", encoding="utf-8")
f_people = (BASE_DIR / outdir / "people.jsonl").open("a", encoding="utf-8")
# -------------------------------------------------------------------
# Prepare crawler with cookie pool rotation
# -------------------------------------------------------------------
profiler = BrowserProfiler()
path = profiler.get_profile_path(opts.profile_name)
bc = BrowserConfig(
headless=False,
verbose=False,
user_data_dir=path,
use_managed_browser=True,
user_agent_mode = "random",
user_agent_generator_config= {
"platforms": "mobile",
"os": "Android"
},
)
crawler = AsyncWebCrawler(config=bc)
await crawler.start()
# Single worker for simplicity; concurrency can be scaled by arun_many if needed.
# crawler = await next_crawler().start()
try:
# Build LinkedIn search URL
search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
logging.info("Seed URL => %s", search_url)
companies: List[Dict] = []
if opts.cmd in ("companies", "full"):
companies = await crawl_company_search(
crawler, search_url, company_schema, opts.max_companies
)
for c in companies:
f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
if opts.cmd in ("people", "full"):
if not companies:
# load from previous run
src = outdir / "companies.jsonl"
if not src.exists():
logging.error("companies.jsonl missing — run companies/full first")
return 10
companies = [json.loads(l) for l in src.read_text().splitlines()]
total_people = 0
title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
for comp in companies:
people = await crawl_people_page(
crawler,
comp["people_url"],
people_schema,
opts.max_people,
title_kw,
)
for p in people:
rec = p | {
"company_handle": comp["handle"],
# isoformat() on an aware datetime already ends in "+00:00"; swap it for "Z" instead of appending.
"captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
}
f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
total_people += len(people)
logging.info(
f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
)
await asyncio.sleep(random.uniform(0.5, 1))
logging.info("Total people scraped: %d", total_people)
finally:
await crawler.close()
f_companies.close()
f_people.close()
return 0
def main():
    parser = build_arg_parser()
    cli_opts = parser.parse_args()
    # Decide on debug defaults: --debug forces them; otherwise they apply only
    # under a debugger or when C4AI_DEMO_DEBUG is set.
    if cli_opts.debug:
        opts = detect_debug_defaults(force=True)
    else:
        env_defaults = detect_debug_defaults()
        # An empty SimpleNamespace is still truthy, so test its attributes instead.
        opts = env_defaults if vars(env_defaults) else cli_opts
    if not getattr(opts, "cmd", None):
        opts.cmd = "full"
    exit_code = asyncio.run(async_main(opts))
    sys.exit(exit_code)
if __name__ == "__main__":
main()
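
For reference, the seed URL and pagination used by `crawl_company_search` can be reproduced in isolation. The sketch below mirrors the f-strings in the script (`keywords`, `geoUrn`, then `&page=N`); `company_search_url` is a hypothetical helper name, and nothing here performs a request.

```python
from urllib.parse import quote

SEARCH_BASE = "https://www.linkedin.com/search/results/companies/"


def company_search_url(query: str, geo: int, page: int = 1) -> str:
    """Build the paginated company-search URL the crawler walks through."""
    return f"{SEARCH_BASE}?keywords={quote(query)}&geoUrn={geo}&page={page}"


# Malaysia geoUrn, as in the debug defaults above
first_page = company_search_url("health insurance management", 102713980)
second_page = company_search_url("health insurance management", 102713980, page=2)
print(first_page)
print(second_page)
```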

View File

@@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Stage-2 Insights builder
------------------------
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
• company_graph.json
• org_chart_<handle>.json (one per company)
• decision_makers.csv
• graph_view.html (interactive visualisation)
Run:
python c4ai_insights.py --in ./stage1_out --out ./stage2_out
Author : Tom @ Kidocode, 2025-04-28
"""
from __future__ import annotations
# ───────────────────────────────────────────────────────────────────────────────
# Imports & Third-party
# ───────────────────────────────────────────────────────────────────────────────
import argparse, asyncio, json, os, sys, pathlib, random, time, csv
from datetime import datetime, UTC
from types import SimpleNamespace
from pathlib import Path
from typing import List, Dict, Any
# Pretty CLI UX
from rich.console import Console
from rich.logging import RichHandler
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
import logging
from jinja2 import Environment, FileSystemLoader, select_autoescape
BASE_DIR = pathlib.Path(__file__).resolve().parent
# ───────────────────────────────────────────────────────────────────────────────
# 3rd-party deps
# ───────────────────────────────────────────────────────────────────────────────
import numpy as np
# from sentence_transformers import SentenceTransformer
# from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import hashlib
from openai import OpenAI # same SDK you pre-loaded
# ───────────────────────────────────────────────────────────────────────────────
# Utils
# ───────────────────────────────────────────────────────────────────────────────
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
with open(path, "r", encoding="utf-8") as f:
return [json.loads(l) for l in f]
def dump_json(obj, path: Path):
with open(path, "w", encoding="utf-8") as f:
json.dump(obj, f, ensure_ascii=False, indent=2)
# ───────────────────────────────────────────────────────────────────────────────
# Constants
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
# ───────────────────────────────────────────────────────────────────────────────
# Debug defaults (mirrors Stage-1 trick)
# ───────────────────────────────────────────────────────────────────────────────
def dev_defaults() -> SimpleNamespace:
return SimpleNamespace(
in_dir="./debug_out",
out_dir="./insights_debug",
embed_model="all-MiniLM-L6-v2",
top_k=10,
openai_model="gpt-4.1",
max_llm_tokens=8000,
llm_temperature=1.0,
workers=4, # parallel processing
stub=False, # manual
)
# ───────────────────────────────────────────────────────────────────────────────
# Graph builders
# ───────────────────────────────────────────────────────────────────────────────
def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
from sentence_transformers import SentenceTransformer
logging.debug(f"Using embedding model: {model_name}")
cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
cache = {}
if cache_path.exists():
with open(cache_path) as f:
cache = json.load(f)
# flush cache if model differs
if cache.get("_model") != model_name:
cache = {}
model = SentenceTransformer(model_name)
new_texts, new_indices = [], []
vectors = np.zeros((len(companies), 384), dtype=np.float32)
for idx, comp in enumerate(companies):
text = comp.get("about") or comp.get("descriptor","")
h = hashlib.sha1(text.encode("utf-8")).hexdigest()
cached = cache.get(comp["handle"])
if cached and cached["hash"] == h:
vectors[idx] = np.array(cached["vector"], dtype=np.float32)
else:
new_texts.append(text)
new_indices.append((idx, comp["handle"], h))
if new_texts:
embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
for vec, (idx, handle, h) in zip(embeds, new_indices):
vectors[idx] = vec
cache[handle] = {"hash": h, "vector": vec.tolist()}
cache["_model"] = model_name
with open(cache_path, "w") as f:
json.dump(cache, f)
return vectors
def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeds)
nodes, edges = [], []
idx_of = {c["handle"]: i for i,c in enumerate(companies)}
for i,c in enumerate(companies):
node = dict(
id=c["handle"].strip("/"),
name=c["name"],
handle=c["handle"],
about=c.get("about",""),
people_url=c.get("people_url",""),
industry=c.get("descriptor","").split("•")[0].strip(),
geoUrn=c.get("geoUrn"),
followers=c.get("followers",0),
# desc_embed=embeds[i].tolist(),
desc_embed=[],
)
nodes.append(node)
# pick top-k most similar except itself
top_idx = np.argsort(sims[i])[::-1][1:top_k+1]
for j in top_idx:
tgt = companies[j]
weight = float(sims[i,j])
if node["industry"] == tgt.get("descriptor","").split("•")[0].strip():
weight += 0.10
if node["geoUrn"] == tgt.get("geoUrn"):
weight += 0.05
tgt['followers'] = tgt.get("followers", None) or 1
node["followers"] = node.get("followers", None) or 1
follower_ratio = min(node["followers"], tgt.get("followers",1)) / max(node["followers"] or 1, tgt.get("followers",1))
weight += 0.05 * follower_ratio
edges.append(dict(
source=node["id"],
target=tgt["handle"].strip("/"),
weight=round(weight,4),
drivers=dict(
embed_sim=round(float(sims[i,j]),4),
industry_match=0.10 if node["industry"] == tgt.get("descriptor","").split("•")[0].strip() else 0,
geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
)
))
return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
# ───────────────────────────────────────────────────────────────────────────────
# Org-chart via LLM
# ───────────────────────────────────────────────────────────────────────────────
async def infer_org_chart_llm(company, people, client:OpenAI, model_name:str, max_tokens:int, temperature:float, stub:bool):
if stub:
# Tiny fake org-chart when debugging offline
chief = random.choice(people)
nodes = [{
"id": chief["profile_url"],
"name": chief["name"],
"title": chief["headline"],
"dept": chief["headline"].split()[:1][0],
"yoe_total": 8,
"yoe_current": 2,
"seniority_score": 0.8,
"decision_score": 0.9,
"avatar_url": chief.get("avatar_url")
}]
return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
prompt = [
{"role":"system","content":"You are an expert B2B org-chart reasoner."},
{"role":"user","content":f"""Here is the company description:
<company>
{json.dumps(company, ensure_ascii=False)}
</company>
Here is a JSON list of employees:
<employees>
{json.dumps(people, ensure_ascii=False)}
</employees>
1) Build a reporting tree (manager -> direct reports)
2) For each person output a decision_score 0-1 for buying new software
Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
"""}
]
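# The OpenAI client used here is synchronous; run the call in a worker thread
# so the semaphore-bounded tasks in run() can actually overlap.
resp = await asyncio.to_thread(
client.chat.completions.create,
model=model_name,
messages=prompt,
max_tokens=max_tokens,
temperature=temperature,
response_format={"type":"json_object"},
)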
chart = json.loads(resp.choices[0].message.content)
chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
return chart
# ───────────────────────────────────────────────────────────────────────────────
# CSV flatten
# ───────────────────────────────────────────────────────────────────────────────
def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
rows=[]
for p in charts_dir.glob("org_chart_*.json"):
data=json.loads(p.read_text())
comp = p.stem.split("org_chart_")[1]
for n in data.get("nodes",[]):
if n.get("decision_score",0)>=threshold:
rows.append(dict(
company=comp,
person=n["name"],
title=n["title"],
decision_score=n["decision_score"],
profile_url=n["id"]
))
pd.DataFrame(rows).to_csv(csv_path,index=False)
# ───────────────────────────────────────────────────────────────────────────────
# HTML rendering
# ───────────────────────────────────────────────────────────────────────────────
def render_html(out:Path, template_dir:Path):
# From template folder cp graph_view.html and ai.js in out folder
import shutil
shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
shutil.copy(template_dir/"ai.js", out)
# ───────────────────────────────────────────────────────────────────────────────
# Main async pipeline
# ───────────────────────────────────────────────────────────────────────────────
async def run(opts):
# ── silence SDK noise ──────────────────────────────────────────────────────
for noisy in ("openai", "httpx", "httpcore"):
lg = logging.getLogger(noisy)
lg.setLevel(logging.WARNING) # or ERROR if you want total silence
lg.propagate = False # optional: stop them reaching root
# ────────────── logging bootstrap ──────────────
console = Console()
logging.basicConfig(
level="INFO",
format="%(message)s",
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
)
in_dir = BASE_DIR / Path(opts.in_dir)
out_dir = BASE_DIR / Path(opts.out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
companies = load_jsonl(in_dir/"companies.jsonl")
people = load_jsonl(in_dir/"people.jsonl")
logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
logging.info("[bold]⇢[/] Embedding company descriptions…")
# embeds = embed_descriptions(companies, opts.embed_model, opts)
logging.info("[bold]⇢[/] Building similarity graph")
# company_graph = build_company_graph(companies, embeds, opts.top_k)
# dump_json(company_graph, out_dir/"company_graph.json")
# OpenAI client (only built if not debugging)
stub = bool(opts.stub)
client = OpenAI() if not stub else None
# Filter companies that need processing
to_process = []
for comp in companies:
handle = comp["handle"].strip("/").replace("/","_")
out_file = out_dir/f"org_chart_{handle}.json"
if out_file.exists() and False:
logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
continue
to_process.append(comp)
if not to_process:
logging.info("[yellow]All companies already processed[/]")
else:
workers = getattr(opts, 'workers', 1)
parallel = workers > 1
logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
with Progress(
SpinnerColumn(),
BarColumn(),
TextColumn("[progress.description]{task.description}"),
TimeElapsedColumn(),
console=console,
) as progress:
task = progress.add_task("Org charts", total=len(to_process))
async def process_one(comp):
handle = comp["handle"].strip("/").replace("/","_")
persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
chart = await infer_org_chart_llm(
comp, persons,
client=client if client else OpenAI(api_key="sk-debug"),
model_name=opts.openai_model,
max_tokens=opts.max_llm_tokens,
temperature=opts.llm_temperature,
stub=stub,
)
chart["meta"]["company"] = comp["name"]
# Save the result immediately
dump_json(chart, out_dir/f"org_chart_{handle}.json")
progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
# Create tasks for all companies
tasks = [process_one(comp) for comp in to_process]
# Process in batches based on worker count
semaphore = asyncio.Semaphore(workers)
async def bounded_process(coro):
async with semaphore:
return await coro
# Run with concurrency control
await asyncio.gather(*(bounded_process(task) for task in tasks))
logging.info("[bold]⇢[/] Flattening decision-makers CSV")
export_decision_makers(out_dir, out_dir/"decision_makers.csv")
render_html(out_dir, template_dir=BASE_DIR/"templates")
logging.success = lambda msg, **k: console.print(f"[bold green]✓[/] {msg}", **k)
logging.success(f"Stage-2 artefacts written to {out_dir}")
# ───────────────────────────────────────────────────────────────────────────────
# CLI
# ───────────────────────────────────────────────────────────────────────────────
def build_arg_parser():
p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
p.add_argument("--openai_model", default="gpt-4.1")
p.add_argument("--max_llm_tokens", type=int, default=8024)
p.add_argument("--llm_temperature", type=float, default=1.0)
p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
return p
def main():
dbg = dev_defaults()
opts = dbg if True else build_arg_parser().parse_args()
asyncio.run(run(opts))
if __name__ == "__main__":
main()
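
The edge weighting in `build_company_graph` can be traced on a single pair of companies. The following is a minimal sketch, not part of the script: `edge_weight` is a hypothetical helper that repeats the same arithmetic (embedding similarity, plus 0.10 for a matching industry, plus 0.05 for a matching `geoUrn`, plus up to 0.05 scaled by the follower ratio), and the input values are made up for illustration.

```python
def edge_weight(embed_sim, same_industry, same_geo, followers_a, followers_b):
    """Recompute one edge weight the way build_company_graph does (hypothetical helper)."""
    weight = float(embed_sim)
    if same_industry:
        weight += 0.10
    if same_geo:
        weight += 0.05
    # Missing or zero follower counts are treated as 1, as in the script.
    a = followers_a or 1
    b = followers_b or 1
    weight += 0.05 * (min(a, b) / max(a, b))
    return round(weight, 4)


# Illustrative inputs: similarity 0.62, same industry, different region,
# 1000 vs 4000 followers (ratio 0.25).
print(edge_weight(0.62, True, False, 1000, 4000))
```

Note that in the script the `geoUrn` comparison is a plain equality check, so two companies that both lack a `geoUrn` also receive the 0.05 bonus.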


@@ -0,0 +1,39 @@
{
"name": "LinkedIn Company Card",
"baseSelector": "div.search-results-container ul[role='list'] > li",
"fields": [
{
"name": "handle",
"selector": "a[href*='/company/']",
"type": "attribute",
"attribute": "href"
},
{
"name": "profile_image",
"selector": "a[href*='/company/'] img",
"type": "attribute",
"attribute": "src"
},
{
"name": "name",
"selector": "span[class*='t-16'] a",
"type": "text"
},
{
"name": "descriptor",
"selector": "div[class*='t-black t-normal']",
"type": "text"
},
{
"name": "about",
"selector": "p[class*='entity-result__summary--2-lines']",
"type": "text"
},
{
"name": "followers",
"selector": "div:contains('followers')",
"type": "regex",
"pattern": "(\\d+)\\s*followers"
}
]
}
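
The `followers` pattern above expects digits immediately before the word "followers", while the sample company card in this diff renders the count as "1K followers". The following is a sketch of a more tolerant parser, assuming counts may carry a `K` or `M` suffix; `parse_followers` is a hypothetical helper and not part of crawl4ai or of the schema.

```python
import re

# Digits with optional thousands separators and decimals, then an optional K/M suffix.
_FOLLOWERS_RE = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([KkMm]?)\s*followers")
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_followers(text):
    """Return the follower count as an int, or None if no count is found."""
    match = _FOLLOWERS_RE.search(text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    return int(round(number * _MULTIPLIER[match.group(2).lower()]))


# The schema's original pattern does not match the abbreviated form.
print(re.search(r"(\d+)\s*followers", "1K followers"))
print(parse_followers("1K followers"), parse_followers("727 followers"))
```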


@@ -0,0 +1,38 @@
{
"name": "LinkedIn People Card",
"baseSelector": "li.org-people-profile-card__profile-card-spacing",
"fields": [
{
"name": "profile_url",
"selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo",
"type": "attribute",
"attribute": "href"
},
{
"name": "name",
"selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
"type": "text"
},
{
"name": "headline",
"selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
"type": "text"
},
{
"name": "followers",
"selector": ".lt-line-clamp--multi-line.t-12",
"type": "text"
},
{
"name": "connection_degree",
"selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
"type": "text"
},
{
"name": "avatar_url",
"selector": ".artdeco-entity-lockup__image img",
"type": "attribute",
"attribute": "src"
}
]
}


@@ -0,0 +1,143 @@
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
<!----><!---->
<div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
data-view-name="search-entity-result-universal-template">
<div class="linked-area flex-1
cursor-pointer">
<div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
<div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
<div class="display-flex align-items-center">
<!---->
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down " aria-hidden="true"
tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
data-test-app-aware-link="">
<div class="ivm-image-view-model ">
<div class="ivm-view-attr__img-wrapper
">
<!---->
<!----> <img width="48"
src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&amp;v=beta&amp;t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
id="ember28"
class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
</div>
</div>
</a>
</div>
</div>
<div
class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
<div class="mb1">
<div class="t-roman t-sans">
<div class="display-flex">
<span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
<span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
t-16">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
href="https://www.linkedin.com/company/managment-research-services-inc./"
data-test-app-aware-link="">
<!---->Management Research Services, Inc. (MRS, Inc)<!---->
<!----> </a>
<!----> </span>
</span>
<!---->
</div>
</div>
<div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
t-14 t-black t-normal">
<!---->Insurance • Milwaukee, Wisconsin<!---->
</div>
<div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
t-14 t-normal">
<!---->1K followers<!---->
</div>
</div>
<!---->
<p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
entity-result__summary--2-lines
t-12 t-black--light
">
<!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
</span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
</span><strong><!---->insurance<!----></strong><span class="white-space-pre">
</span>underwriting solutions that efficiently support clients workflows. Supported by the
Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
optimize requirements...<!---->
</p>
<!---->
</div>
<div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
<!---->
<div>
<button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
type="button"><!---->
<span class="artdeco-button__text">
Follow
</span></button>
<!---->
<!---->
</div>
</div>
</div>
</div>
</div>
</li>


@@ -0,0 +1,94 @@
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
<div>
<section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
<!---->
<img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
<div class="org-people-profile-card__profile-info">
<div id="ember97"
class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
<div id="ember98"
class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
type="circle">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
id="org-people-profile-card__profile-image-0"
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
data-test-app-aware-link="">
<img width="104"
src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&amp;v=beta&amp;t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
class="evi-image lazy-image ember-view">
</a>
</div>
<div id="ember100" class="artdeco-entity-lockup__content ember-view">
<div id="ember101" class="artdeco-entity-lockup__title ember-view">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
aria-label="View Dr. Rayna S.s profile"
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
data-test-app-aware-link="">
<div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
t-black" style="">
Dr. Rayna S.
<!---->
</div>
</a>
</div>
<div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
degree connection</span>
<span class="artdeco-entity-lockup__degree" aria-hidden="true">
·&nbsp;3rd
</span>
<!----><!---->
</div>
<div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
<div class="t-14 t-black--light t-normal">
<div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
style="-webkit-line-clamp: 2">
Leadership and Talent Development Consultant and Professional Speaker
<!---->
</div>
</div>
</div>
<div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
</div>
</div>
<span class="text-align-center">
<span id="ember110"
class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
style="-webkit-line-clamp: 3">
727 followers
<!----> </span>
</span>
</div>
<footer class="ph3 pb3">
<button aria-label="Follow Dr. Rayna S." id="ember111"
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
type="button"><!---->
<span class="artdeco-button__text">
Follow
</span></button>
</footer>
</section>
</div>
</li>


@@ -0,0 +1,50 @@
// ==== File: ai.js ====
class ApiHandler {
constructor(apiKey = null) {
this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
console.log("ApiHandler ready");
}
setApiKey(k) {
this.apiKey = k.trim();
if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
}
async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
if (!this.apiKey) throw new Error("OpenAI API key missing");
const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
const controller = new AbortController();
const res = await fetch("https://api.openai.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
},
body: JSON.stringify(payload),
signal: controller.signal,
});
if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";
while (true) {
const {done, value} = await reader.read();
if (done) break;
buf += dec.decode(value, {stream: true});
const lines = buf.split("\n");
buf = lines.pop(); // keep the trailing partial line for the next chunk
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
if (line.includes("[DONE]")) return;
const json = JSON.parse(line.slice(6));
const delta = json.choices?.[0]?.delta?.content;
if (delta) yield delta;
}
}
}
}
window.API = new ApiHandler();
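
Streamed responses arrive in arbitrary chunks, so a parser has to hold back the unfinished last line until more data arrives and must never parse the same complete line twice. The following is a minimal Python sketch of that buffering for `data: ...` lines in the chat-completions streaming format; `iter_sse_deltas` is a hypothetical name, and the input is a plain iterable of text chunks rather than a real HTTP response.

```python
import json


def iter_sse_deltas(chunks):
    """Yield content deltas from text chunks of a `data: {...}` event stream."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Everything before the last newline is complete; the tail may be partial.
        *lines, buf = buf.split("\n")
        for line in lines:
            line = line.strip()
            if not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data == "[DONE]":
                return
            delta = json.loads(data)["choices"][0]["delta"].get("content")
            if delta:
                yield delta


# A JSON payload split across two chunks is still parsed exactly once.
chunks = [
    'data: {"choices":[{"delta":{"content":"Hel',
    'lo"}}]}\n\ndata: {"choices":[{"delta":{"content":" world"}}]}\n',
    "data: [DONE]\n",
]
print("".join(iter_sse_deltas(chunks)))
```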

File diff suppressed because it is too large

docs/codebase/browser.md

@@ -0,0 +1,51 @@
### browser_manager.py
| Function | What it does |
|---|---|
| `ManagedBrowser.build_browser_flags` | Returns baseline Chromium CLI flags, disables GPU and sandbox, plugs locale, timezone, stealth tweaks, and any extras from `BrowserConfig`. |
| `ManagedBrowser.__init__` | Stores config and logger, creates temp dir, preps internal state. |
| `ManagedBrowser.start` | Spawns or connects to the Chromium process, returns its CDP endpoint plus the `subprocess.Popen` handle. |
| `ManagedBrowser._initial_startup_check` | Pings the CDP endpoint once to be sure the browser is alive, raises if not. |
| `ManagedBrowser._monitor_browser_process` | Async-loops on the subprocess, logs exits or crashes, restarts if policy allows. |
| `ManagedBrowser._get_browser_path_WIP` | Old helper that maps OS + browser type to an executable path. |
| `ManagedBrowser._get_browser_path` | Current helper, checks env vars, Playwright cache, and OS defaults for the real executable. |
| `ManagedBrowser._get_browser_args` | Builds the final CLI arg list by merging user flags, stealth flags, and defaults. |
| `ManagedBrowser.cleanup` | Terminates the browser, stops monitors, deletes the temp dir. |
| `ManagedBrowser.create_profile` | Opens a visible browser so a human can log in, then zips the resulting user-data-dir to `~/.crawl4ai/profiles/<name>`. |
| `ManagedBrowser.list_profiles` | Thin wrapper, now forwarded to `BrowserProfiler.list_profiles()`. |
| `ManagedBrowser.delete_profile` | Thin wrapper, now forwarded to `BrowserProfiler.delete_profile()`. |
| `BrowserManager.__init__` | Holds the global Playwright instance, browser handle, config signature cache, session map, and logger. |
| `BrowserManager.start` | Boots the underlying `ManagedBrowser`, then spins up the default Playwright browser context with stealth patches. |
| `BrowserManager._build_browser_args` | Translates `CrawlerRunConfig` (proxy, UA, timezone, headless flag, etc.) into Playwright `launch_args`. |
| `BrowserManager.setup_context` | Applies locale, geolocation, permissions, cookies, and UA overrides on a fresh context. |
| `BrowserManager.create_browser_context` | Internal helper that actually calls `browser.new_context(**options)` after running `setup_context`. |
| `BrowserManager._make_config_signature` | Hashes the non-ephemeral parts of `CrawlerRunConfig` so contexts can be reused safely. |
| `BrowserManager.get_page` | Returns a ready `Page` for a given session id, reusing an existing one or creating a new context/page, injects helper scripts, updates `last_used`. |
| `BrowserManager.kill_session` | Force-closes a context/page for a session and removes it from the session map. |
| `BrowserManager._cleanup_expired_sessions` | Periodic sweep that drops sessions idle longer than `ttl_seconds`. |
| `BrowserManager.close` | Gracefully shuts down all contexts, the browser, Playwright, and background tasks. |
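
The table describes `_make_config_signature` as hashing the non-ephemeral parts of the run config so that contexts can be reused. The following is a minimal sketch of that idea, assuming the config is available as a plain dict; the function name `config_signature` and the contents of `EPHEMERAL_KEYS` are hypothetical and do not reflect the actual crawl4ai implementation.

```python
import hashlib
import json

# Hypothetical: keys that vary per request and should not affect context reuse.
EPHEMERAL_KEYS = {"url", "session_id", "screenshot"}


def config_signature(config: dict) -> str:
    """Return a stable hash of the settings that matter for context reuse."""
    stable = {k: v for k, v in config.items() if k not in EPHEMERAL_KEYS}
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = config_signature({"locale": "en-US", "url": "https://a.example"})
b = config_signature({"url": "https://b.example", "locale": "en-US"})
print(a == b)  # same stable settings, so the same context could be reused
```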
---
### browser_profiler.py
| Function | What it does |
|---|---|
| `BrowserProfiler.__init__` | Sets up profile folder paths, async logger, and signal handlers. |
| `BrowserProfiler.create_profile` | Launches a visible browser with a new user-data-dir for manual login, on exit compresses and stores it as a named profile. |
| `BrowserProfiler.cleanup_handler` | General SIGTERM/SIGINT cleanup wrapper that kills child processes. |
| `BrowserProfiler.sigint_handler` | Handles Ctrl-C during an interactive session, makes sure the browser shuts down cleanly. |
| `BrowserProfiler.listen_for_quit_command` | Async REPL that exits when the user types `q`. |
| `BrowserProfiler.list_profiles` | Enumerates `~/.crawl4ai/profiles`, prints profile name, browser type, size, and last modified. |
| `BrowserProfiler.get_profile_path` | Returns the absolute path of a profile given its name, or `None` if missing. |
| `BrowserProfiler.delete_profile` | Removes a profile folder or a direct path from disk, with optional confirmation prompt. |
| `BrowserProfiler.interactive_manager` | Text UI loop for listing, creating, deleting, or launching profiles. |
| `BrowserProfiler.launch_standalone_browser` | Starts a non-headless Chromium with remote debugging enabled and keeps it alive for manual tests. |
| `BrowserProfiler.get_cdp_json` | Pulls `/json/version` from a CDP endpoint and returns the parsed JSON. |
| `BrowserProfiler.launch_builtin_browser` | Spawns a headless Chromium in the background, saves `{wsEndpoint, pid, started_at}` to `~/.crawl4ai/builtin_browser.json`. |
| `BrowserProfiler.get_builtin_browser_info` | Reads that JSON file, verifies the PID, and returns browser status info. |
| `BrowserProfiler._is_browser_running` | Cross-platform helper that checks if a PID is still alive. |
| `BrowserProfiler.kill_builtin_browser` | Terminates the background builtin browser and removes its status file. |
| `BrowserProfiler.get_builtin_browser_status` | Returns `{running: bool, wsEndpoint, pid, started_at}` for quick health checks. |
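
A PID liveness check like the one `_is_browser_running` is described as performing can be sketched as follows. This is an assumption-laden, POSIX-only illustration: `is_pid_alive` is a hypothetical name, and on Windows signal `0` has a different meaning, so a different mechanism would be needed there.

```python
import os
import subprocess
import sys


def is_pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX-only sketch)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Usage: compare the current process with a child that has already exited.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
print(is_pid_alive(os.getpid()), is_pid_alive(child.pid))
```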

docs/codebase/cli.md

@@ -0,0 +1,40 @@
### `cli.py` command surface
| Command | Inputs / flags | What it does |
|---|---|---|
| **profiles** | *(none)* | Opens the interactive profile manager, lets you list, create, delete saved browser profiles that live in `~/.crawl4ai/profiles`. |
| **browser status** | | Prints whether the always-on *builtin* browser is running, shows its CDP URL, PID, start time. |
| **browser stop** | | Kills the builtin browser and deletes its status file. |
| **browser view** | `--url, -u` URL *(optional)* | Pops a visible window of the builtin browser, navigates to `URL` or `about:blank`. |
| **config list** | | Dumps every global setting, showing current value, default, and description. |
| **config get** | `key` | Prints the value of a single setting, falls back to default if unset. |
| **config set** | `key value` | Persists a new value in the global config (stored under `~/.crawl4ai/config.yml`). |
| **examples** | | Just spits out real-world CLI usage samples. |
| **crawl** | `url` *(positional)*<br>`--browser-config,-B` path<br>`--crawler-config,-C` path<br>`--filter-config,-f` path<br>`--extraction-config,-e` path<br>`--json-extract,-j` [desc]\*<br>`--schema,-s` path<br>`--browser,-b` k=v list<br>`--crawler,-c` k=v list<br>`--output,-o` all,json,markdown,md,markdown-fit,md-fit *(default all)*<br>`--output-file,-O` path<br>`--bypass-cache,-b` *(flag, default true; note that `-b` is also the short form of `--browser`)*<br>`--question,-q` str<br>`--verbose,-v` *(flag)*<br>`--profile,-p` profile-name | One-shot crawl + extraction. Builds `BrowserConfig` and `CrawlerRunConfig` from inline flags or separate YAML/JSON files, runs `AsyncWebCrawler.arun()`, can route through a named saved profile and pipe the result to stdout or a file. |
| **(default)** | Same flags as **crawl**, plus `--example` | Shortcut so you can type just `crwl https://site.com`. When first arg is not a known sub-command, it falls through to *crawl*. |
\* `--json-extract/-j` with no value turns on LLM-based JSON extraction using an auto schema, supplying a string lets you prompt-engineer the field descriptions.
> Quick mental model
> `profiles` = manage identities,
> `browser ...` = control long-running headless Chrome that all crawls can piggy-back on,
> `crawl` = do the actual work,
> `config` = tweak global defaults,
> everything else is sugar.
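
The fall-through described for the **(default)** row can be pictured as a small rewrite of the argument list before dispatch. The following is a sketch under assumptions: `route_argv` and `KNOWN_COMMANDS` are hypothetical names, the command set is copied from the tables in this file, and leaving option-style first arguments untouched is a guess rather than documented behaviour.

```python
# Sub-commands named in the command table, plus `cdp` from the profile cheatsheet.
KNOWN_COMMANDS = {"profiles", "browser", "config", "examples", "crawl", "cdp"}


def route_argv(argv):
    """Prepend 'crawl' when the first argument is not a known sub-command."""
    if argv and argv[0] not in KNOWN_COMMANDS and not argv[0].startswith("-"):
        return ["crawl", *argv]
    return list(argv)


print(route_argv(["https://site.com", "-p", "my-profile"]))
print(route_argv(["profiles"]))
```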
### Quick-fire “profile” usage cheatsheet
| Scenario | Command (copy-paste ready) | Notes |
|---|---|---|
| **Launch interactive Profile Manager UI** | `crwl profiles` | Opens TUI with options: 1 List, 2 Create, 3 Delete, 4 Use-to-crawl, 5 Exit. |
| **Create a fresh profile** | `crwl profiles` → choose **2** → name it → browser opens → log in → press **q** in terminal | Saves to `~/.crawl4ai/profiles/<name>`. |
| **List saved profiles** | `crwl profiles` → choose **1** | Shows name, browser type, size, last-modified. |
| **Delete a profile** | `crwl profiles` → choose **3** → pick the profile index → confirm | Removes the folder. |
| **Crawl with a profile (default alias)** | `crwl https://site.com/dashboard -p my-profile` | Keeps login cookies, sets `use_managed_browser=true` under the hood. |
| **Crawl + verbose JSON output** | `crwl https://site.com -p my-profile -o json -v` | Any other `crawl` flags work the same. |
| **Crawl with extra browser tweaks** | `crwl https://site.com -p my-profile -b "headless=true,viewport_width=1680"` | CLI overrides go on top of the profile. |
| **Same but via explicit sub-command** | `crwl crawl https://site.com -p my-profile` | Identical to default alias. |
| **Use profile from inside Profile Manager** | `crwl profiles` → choose **4** → pick profile → enter URL → follow prompts | Handy when demo-ing to non-CLI folks. |
| **One-off crawl with a profile folder path (no name lookup)** | `crwl https://site.com -b "user_data_dir=$HOME/.crawl4ai/profiles/my-profile,use_managed_browser=true"` | Bypasses registry, useful for CI scripts. |
| **Launch a dev browser on CDP port with the same identity** | `crwl cdp -d $HOME/.crawl4ai/profiles/my-profile -P 9223` | Lets Puppeteer/Playwright attach for debugging. |
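
Several rows above pass browser overrides as a `k=v` list, for example `-b "headless=true,viewport_width=1680"`. The following is a minimal sketch of how such a string could be turned into typed settings; `parse_kv_list` and `_coerce` are hypothetical helpers, and the coercion rules (booleans, then integers, then floats, otherwise strings) are an assumption rather than the CLI's documented behaviour.

```python
def _coerce(raw: str):
    """Convert 'true'/'false' and numeric strings; leave everything else as text."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_kv_list(spec: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict with coerced values."""
    result = {}
    for item in spec.split(","):
        if not item.strip():
            continue
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = _coerce(raw.strip())
    return result


print(parse_kv_list("headless=true,viewport_width=1680"))
```

A value that itself contains a comma would need quoting or a different separator; this sketch does not handle that case.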


@@ -391,12 +391,14 @@ async def main():
# Process results
raw_df = pd.DataFrame()
for result in results:
-if result.success and result.media["tables"]:
+# Use the new tables field, falling back to media["tables"] for backward compatibility
+tables = result.tables if hasattr(result, "tables") and result.tables else result.media.get("tables", [])
+if result.success and tables:
# Extract primary market table
# DataFrame
raw_df = pd.DataFrame(
-result.media["tables"][0]["rows"],
-columns=result.media["tables"][0]["headers"],
+tables[0]["rows"],
+columns=tables[0]["headers"],
)
break
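
The fallback in this hunk (prefer the dedicated `tables` field, otherwise read `media["tables"]`) can be isolated into a small accessor. The following is a sketch only: `get_tables` is a hypothetical helper, and `SimpleNamespace` objects stand in for real crawl results.

```python
from types import SimpleNamespace


def get_tables(result):
    """Return result.tables if present and non-empty, else media['tables'], else []."""
    tables = getattr(result, "tables", None)
    if tables:
        return tables
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])


# Stand-ins for an older result (tables only under media) and a newer one.
old_style = SimpleNamespace(media={"tables": [{"headers": ["a"], "rows": [[1]]}]})
new_style = SimpleNamespace(tables=[{"headers": ["b"], "rows": [[2]]}], media={})
print(get_tables(old_style)[0]["headers"], get_tables(new_style)[0]["headers"])
```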

File diff suppressed because it is too large

@@ -31,7 +31,7 @@ async def example_cdp():
async def main():
-browser_config = BrowserConfig(headless=True, verbose=True)
+browser_config = BrowserConfig(headless=False, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,


@@ -412,17 +412,41 @@ footer {
background-color: var(--primary-dimmed-color, #09b5a5);
color: var(--background-color, #070708);
border: none;
padding: 4px 8px;
padding: 6px 10px;
font-size: 0.8em;
border-radius: 4px;
cursor: pointer;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
transition: background-color 0.2s ease;
box-shadow: 0 3px 8px rgba(0, 0, 0, 0.3);
transition: background-color 0.2s ease, transform 0.15s ease;
white-space: nowrap;
display: flex;
align-items: center;
font-weight: 500;
animation: askAiButtonAppear 0.2s ease-out;
}
@keyframes askAiButtonAppear {
from {
opacity: 0;
transform: scale(0.9);
}
to {
opacity: 1;
transform: scale(1);
}
}
.ask-ai-selection-button:hover {
background-color: var(--primary-color, #50ffff);
transform: scale(1.05);
}
/* Mobile styles for Ask AI button */
@media screen and (max-width: 768px) {
.ask-ai-selection-button {
padding: 8px 12px; /* Larger touch target on mobile */
font-size: 0.9em; /* Slightly larger text */
}
}
/* ==== File: docs/assets/layout.css (Additions) ==== */

View File

@@ -8,12 +8,32 @@ document.addEventListener('DOMContentLoaded', () => {
const button = document.createElement('button');
button.id = 'ask-ai-selection-btn';
button.className = 'ask-ai-selection-button';
button.textContent = 'Ask AI'; // Or use an icon
// Add icon and text for better visibility
button.innerHTML = `
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="12" height="12" fill="currentColor" style="margin-right: 4px; vertical-align: middle;">
<path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2z"/>
</svg>
<span>Ask AI</span>
`;
// Common styles
button.style.display = 'none'; // Initially hidden
button.style.position = 'absolute';
button.style.zIndex = '1500'; // Ensure it's on top
button.style.boxShadow = '0 3px 8px rgba(0, 0, 0, 0.4)'; // More pronounced shadow
button.style.transition = 'transform 0.15s ease, background-color 0.2s ease'; // Smooth hover effect
// Add transform on hover
button.addEventListener('mouseover', () => {
button.style.transform = 'scale(1.05)';
});
button.addEventListener('mouseout', () => {
button.style.transform = 'scale(1)';
});
document.body.appendChild(button);
button.addEventListener('click', handleAskAiClick);
return button;
}
@@ -43,11 +63,38 @@ document.addEventListener('DOMContentLoaded', () => {
const range = selection.getRangeAt(0);
const rect = range.getBoundingClientRect();
// Calculate position: top-right of the selection
// Get viewport dimensions
const viewportWidth = window.innerWidth;
const viewportHeight = window.innerHeight;
// Calculate position based on selection
const scrollX = window.scrollX;
const scrollY = window.scrollY;
const buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
const buttonLeft = rect.right + scrollX + 5; // 5px to the right
// Default position (top-right of selection)
let buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
let buttonLeft = rect.right + scrollX + 5; // 5px to the right
// Check if we're on mobile (which we define as less than 768px)
const isMobile = viewportWidth <= 768;
if (isMobile) {
// On mobile, position centered above selection to avoid edge issues
buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 10; // 10px above on mobile
buttonLeft = rect.left + scrollX + (rect.width / 2) - (askAiButton.offsetWidth / 2); // Centered
} else {
// For desktop, ensure the button doesn't go off screen
// Check right edge
if (buttonLeft + askAiButton.offsetWidth > scrollX + viewportWidth) {
buttonLeft = scrollX + viewportWidth - askAiButton.offsetWidth - 10; // 10px from right edge
}
}
// Check top edge (for all devices)
if (buttonTop < scrollY) {
// If would go above viewport, position below selection instead
buttonTop = rect.bottom + scrollY + 5; // 5px below
}
askAiButton.style.top = `${buttonTop}px`;
askAiButton.style.left = `${buttonLeft}px`;
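
The positioning arithmetic in this hunk is easier to check as a pure function. The sketch below restates it in Python for illustration only; `position_button` and its parameter names are hypothetical, and the rectangle is assumed to be a plain dict carrying the same fields that `getBoundingClientRect()` returns.

```python
def position_button(rect, button_w, button_h, scroll_x, scroll_y,
                    viewport_w, mobile_breakpoint=768):
    """Return (top, left) in page coordinates for the floating button."""
    # Default: 5px above and 5px to the right of the selection.
    top = rect["top"] + scroll_y - button_h - 5
    left = rect["right"] + scroll_x + 5

    if viewport_w <= mobile_breakpoint:
        # Mobile: centered horizontally, 10px above the selection.
        top = rect["top"] + scroll_y - button_h - 10
        left = rect["left"] + scroll_x + rect["width"] / 2 - button_w / 2
    elif left + button_w > scroll_x + viewport_w:
        # Desktop: keep the button 10px inside the right viewport edge.
        left = scroll_x + viewport_w - button_w - 10

    if top < scroll_y:
        # Would land above the visible area: place it 5px below instead.
        top = rect["bottom"] + scroll_y + 5
    return top, left


selection = {"top": 100, "bottom": 120, "left": 200, "right": 300, "width": 100}
print(position_button(selection, 60, 20, 0, 0, 1024))
```

As in the JavaScript, the right-edge clamp applies only on desktop; the mobile branch centers the button and does not clamp it horizontally.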
@@ -77,8 +124,8 @@ document.addEventListener('DOMContentLoaded', () => {
// --- Event Listeners ---
// Show button on mouse up after selection
document.addEventListener('mouseup', (event) => {
// Function to handle selection events (both mouse and touch)
function handleSelectionEvent(event) {
// Slight delay to ensure selection is registered
setTimeout(() => {
const selectedText = getSafeSelectedText();
@@ -86,7 +133,7 @@ document.addEventListener('DOMContentLoaded', () => {
if (!askAiButton) {
askAiButton = createAskAiButton();
}
// Don't position if the click was ON the button itself
// Don't position if the event was ON the button itself
if (event.target !== askAiButton) {
positionButton(event);
}
@@ -94,16 +141,46 @@ document.addEventListener('DOMContentLoaded', () => {
hideButton();
}
}, 10); // Small delay
}
// Mouse selection events (desktop)
document.addEventListener('mouseup', handleSelectionEvent);
// Touch selection events (mobile)
document.addEventListener('touchend', handleSelectionEvent);
document.addEventListener('selectionchange', () => {
// This helps with mobile selection which can happen without mouseup/touchend
setTimeout(() => {
const selectedText = getSafeSelectedText();
if (selectedText && askAiButton) {
positionButton();
}
}, 300); // Longer delay for selection change
});
// Hide button on scroll or click elsewhere
// Hide button on various events
document.addEventListener('mousedown', (event) => {
// Hide if clicking anywhere EXCEPT the button itself
if (askAiButton && event.target !== askAiButton) {
hideButton();
}
});
document.addEventListener('touchstart', (event) => {
// Same for touch events, but only hide if not on the button
if (askAiButton && event.target !== askAiButton) {
hideButton();
}
});
document.addEventListener('scroll', hideButton, true); // Capture scroll events
// Also hide when pressing Escape key
document.addEventListener('keydown', (event) => {
if (event.key === 'Escape') {
hideButton();
}
});
console.log("Selection Ask AI script loaded.");
});

View File

@@ -4,6 +4,32 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical
## Latest Release
Here's the blog index entry for **v0.6.0**, written to match the exact tone and structure of your previous entries:
---
### [Crawl4AI v0.6.0: World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
*April 23, 2025*
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
Other key changes:
* Native support for `result.media["tables"]` to export DataFrames
* Full network + console logs and MHTML snapshot per crawl
* Browser pooling and pre-warming for faster cold starts
* New streaming endpoints via MCP API and Playground
* Robots.txt support, proxy rotation, and improved session handling
* Deprecated old markdown names, legacy modules cleaned up
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
[Read full release notes →](releases/0.6.0.md)
---
Let me know if you want me to auto-update the actual file or just paste this into the markdown.
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)

View File

@@ -1,51 +1,143 @@
# Crawl4AI 0.6.0
# Crawl4AI v0.6.0 Release Notes
*Release date: 2025-04-22*
0.6.0 is the **biggest jump** since the 0.5 series, packing a smarter browser core, pool-based crawlers, and a ton of DX candy. Expect faster runs, lower RAM burn, and richer diagnostics.
We're excited to announce the release of **Crawl4AI v0.6.0**, our biggest and most feature-rich update yet. This version introduces major architectural upgrades, brand-new capabilities for geo-aware crawling, high-efficiency scraping, and real-time streaming support for scalable deployments.
---
## 🚀 Key upgrades
## Highlights
| Area | What changed |
|------|--------------|
| **Browser** | New **Browser** management with pooling, page pre-warm, geolocation + locale + timezone switches |
| **Crawler** | Console and network log capture, MHTML snapshots, safer `get_page` API |
| **Server & API** | **Crawler Pool Manager** endpoint, MCP socket + SSE support |
| **Docs** | v2 layout, floating Ask-AI helper, GitHub stats badge, copy-code buttons, Docker API demo |
| **Tests** | Memory + load benchmarks, 90+ new cases covering MCP and Docker |
### 1. **World-Aware Crawlers**
Crawl as if you're anywhere in the world. With v0.6.0, each crawl can simulate:
- Specific GPS coordinates
- Browser locale
- Timezone
Example:
```python
CrawlerRunConfig(
url="https://browserleaks.com/geo",
locale="en-US",
timezone_id="America/Los_Angeles",
geolocation=GeolocationConfig(
latitude=34.0522,
longitude=-118.2437,
accuracy=10.0
)
)
```
Great for accessing region-specific content or testing global behavior.
---
## Breaking changes
### 2. **Native Table Extraction**
Extract HTML tables directly into usable formats like Pandas DataFrames or CSV with zero parsing hassle. All table data is available under `result.media["tables"]`.
1. **`get_page` signature**: returns `(html, metadata)` instead of plain HTML.
2. **Docker**: new Chromium base layer; rebuild images.
Example:
```python
raw_df = pd.DataFrame(
result.media["tables"][0]["rows"],
columns=result.media["tables"][0]["headers"]
)
```
This makes it ideal for scraping financial data, pricing pages, or anything tabular.
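
For the CSV side of this claim, a standard-library sketch is enough. It assumes each table entry is a dict with `headers` and `rows` keys, as in the DataFrame example; `table_to_csv` and the sample values are hypothetical and not part of Crawl4AI.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (assumed keys: headers, rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(table["headers"])   # header row first
    writer.writerows(table["rows"])     # then one line per data row
    return buf.getvalue()


# Illustrative data only; a real run would pass an entry of result.media["tables"].
sample_table = {
    "headers": ["Coin", "Price"],
    "rows": [["BTC", "93000"], ["ETH", "1800"]],
}
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Going through `csv.writer` rather than joining strings by hand keeps quoting correct when a cell contains commas or quotes.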
---
## How to upgrade
### 3. **Browser Pooling & Pre-Warming**
We've overhauled browser management. Now, multiple browser instances can be pooled and pages pre-warmed for ultra-fast launches:
- Reduces cold-start latency
- Lowers memory spikes
- Enhances parallel crawling stability
This powers the new **Docker Playground** experience and streamlines heavy-load crawling.
---
### 4. **Traffic & Snapshot Capture**
Need full visibility? You can now capture:
- Full network traffic logs
- Console output
- MHTML page snapshots for post-crawl audits and debugging
No more guesswork on what happened during your crawl.
---
### 5. **MCP API and Streaming Support**
We're exposing **MCP socket and SSE endpoints**, allowing:
- Live streaming of crawl results
- Real-time integration with agents or frontends
- A new Playground UI for interactive crawling
This is a major step towards making Crawl4AI real-time ready.
---
### 6. **Stress-Test Framework**
Want to test performance under heavy load? v0.6.0 includes a new memory stress-test suite that supports 1,000+ URL workloads. Ideal for:
- Load testing
- Performance benchmarking
- Validating memory efficiency
---
## Core Improvements
- Robots.txt compliance
- Proxy rotation support
- Improved URL normalization and session reuse
- Shared data across crawler hooks
- New page routing logic
---
## Breaking Changes & Deprecations
- Legacy `crawl4ai/browser/*` modules are removed. Update imports accordingly.
- `AsyncPlaywrightCrawlerStrategy.get_page` now uses a new function signature.
- Deprecated markdown generator aliases now point to `DefaultMarkdownGenerator` with warning.
---
## Miscellaneous Updates
- FastAPI validators replaced custom validation logic
- Docker build now based on a Chromium layer
- Repo-wide cleanup: ~36,000 insertions, ~5,000 deletions
---
## New Examples Included
- Geo-location crawling
- Network + console log capture
- Docker MCP API usage
- Markdown selector usage
- Crypto project data extraction
---
## Watch the Release Video
Want a visual walkthrough of all these updates? Watch the video:
🔗 https://youtu.be/9x7nVcjOZks
If you're new to Crawl4AI, start here:
🔗 https://www.youtube.com/watch?v=xo3qK6Hg9AA&t=15s
---
## Join the Community
We've just opened up our **Discord** for the public. Join us to:
- Ask questions
- Share your projects
- Get help or contribute
💬 https://discord.gg/wpYFACrHR4
---
## Install or Upgrade
```bash
pip install -U crawl4ai==0.6.0
pip install -U crawl4ai
```
---
## Full changelog
The diff between `main` and `next` spans **36k insertions, 4.9k deletions** over 121 files. Read the [compare view](https://github.com/unclecode/crawl4ai/compare/0.5.0.post8...0.6.0) or see `CHANGELOG.md` for the granular list.
---
## Upgrade tips
* Using the Docker API? Pull `unclecode/crawl4ai:0.6.0`; new args are documented in `/deploy/docker/README.md`.
* Stress-test your stack with `tests/memory/run_benchmark.py` before production rollout.
* Markdown generators are renamed but aliased; update when convenient, and warnings will remind you.
---
Happy crawling, ping `@unclecode` on X for questions or memes.
Live long and import crawl4ai. 🖖

View File

@@ -58,7 +58,7 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0rc1-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.6.0-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
```bash
# Pull the release candidate (recommended for latest features)
@@ -124,9 +124,9 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0rc1-r2`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r2`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (`rc1`) and revisions (`r1`)
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag
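
The `LIBRARY_VERSION[-SUFFIX]` convention can be split mechanically. The sketch below is an illustration under one assumption: the first hyphen separates the library version from the suffix. `parse_image_tag` is a hypothetical helper, not something shipped with the project.

```python
def parse_image_tag(tag):
    """Split a tag of the form LIBRARY_VERSION[-SUFFIX] into its two parts."""
    # Assumption: everything before the first hyphen is the library version.
    version, _, suffix = tag.partition("-")
    return version, (suffix or None)


for tag in ("0.6.0-r2", "0.6.0rc1-r2", "0.6.0"):
    print(tag, parse_image_tag(tag))
```

A tag without a suffix yields `None` for the second element, so callers can distinguish a bare version from a revisioned build.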

View File

@@ -0,0 +1,32 @@
from crawl4ai import BrowserProfiler
import asyncio

if __name__ == "__main__":
    # Example usage
    profiler = BrowserProfiler()

    # Create a new profile
    import os
    from pathlib import Path
    home_dir = Path.home()
    profile_path = asyncio.run(profiler.create_profile(str(home_dir / ".crawl4ai/profiles/test-profile")))
    print(f"Profile created at: {profile_path}")

    # # Launch a standalone browser
    # asyncio.run(profiler.launch_standalone_browser())

    # # List profiles
    # profiles = profiler.list_profiles()
    # for profile in profiles:
    #     print(f"Profile: {profile['name']}, Path: {profile['path']}")

    # # Delete a profile
    # success = profiler.delete_profile("my-profile")
    # if success:
    #     print("Profile deleted successfully")
    # else:
    #     print("Failed to delete profile")