Compare commits: v0.5.5...merge-pr97 (16 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 0e5d672763 | | |
| | cd2b490b40 | | |
| | 50f0b83fcd | | |
| | 9499164d3c | | |
| | 2140d9aca4 | | |
| | ccec40ed17 | | |
| | ad4dfb21e1 | | |
| | 7784b2468e | | |
| | 146f9d415f | | |
| | 37fd80e4b9 | | |
| | 949a93982e | | |
| | c4f5651199 | | |
| | b0aa8bc9f7 | | |
| | c98ffe2130 | | |
| | 4812f08a73 | | |
| | b2f3cb0dfa | | |
56
CHANGELOG.md
@@ -5,6 +5,62 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.1] - 2025-04-24

### Added
- New dedicated `tables` field in `CrawlResult` model for better table extraction handling
- Updated crypto_analysis_example.py to use the new tables field with backward compatibility
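
Reading tables in a way that works both before and after the new field was added could look roughly like the sketch below. It is an illustration, not code from this changeset: `get_tables` is a hypothetical helper, the `SimpleNamespace` objects only stand in for `CrawlResult`, and the `headers`/`rows` keys are assumed from the README example in this compare.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def get_tables(result: Any) -> List[Dict[str, Any]]:
    """Return extracted tables, preferring the new field over the legacy location."""
    # Newer results are assumed to expose a top-level `tables` attribute.
    tables = getattr(result, "tables", None)
    if tables:
        return list(tables)
    # Older results are assumed to keep tables under media["tables"].
    media = getattr(result, "media", None)
    if isinstance(media, dict):
        return list(media.get("tables", []))
    return []


# Stand-ins for CrawlResult objects (hypothetical data, for illustration only).
table = {"headers": ["Name", "Price"], "rows": [["BTC", "1"], ["ETH", "2"]]}
new_style = SimpleNamespace(tables=[table], media={})
old_style = SimpleNamespace(media={"tables": [table]})

# Both shapes yield the same list of table dictionaries.
print(get_tables(new_style) == get_tables(old_style))
```

Using `.get("tables", [])` rather than indexing keeps the fallback safe when the key is absent from `media`.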

### Changed
- Improved playground UI in Docker deployment with better endpoint handling and UI feedback

## [0.6.0] ‑ 2025‑04‑22

### Added
- Browser pooling with page pre‑warming and fine‑grained **geolocation, locale, and timezone** controls
- Crawler pool manager (SDK + Docker API) for smarter resource allocation
- Network & console log capture plus MHTML snapshot export
- **Table extractor**: turn HTML `<table>`s into DataFrames or CSV with one flag
- High‑volume stress‑test framework in `tests/memory` and API load scripts
- MCP protocol endpoints with socket & SSE support; playground UI scaffold
- Docs v2 revamp: TOC, GitHub badge, copy‑code buttons, Docker API demo
- “Ask AI” helper button *(work‑in‑progress, shipping soon)*
- New examples: geo‑location usage, network/console capture, Docker API, markdown source selection, crypto analysis
- Expanded automated test suites for browser, Docker, MCP and memory benchmarks

### Changed
- Consolidated and renamed browser strategies; legacy docker strategy modules removed
- `ProxyConfig` moved to `async_configs`
- Server migrated to pool‑based crawler management
- FastAPI validators replace custom query validation
- Docker build now uses Chromium base image
- Large‑scale repo tidy‑up (≈36 k insertions, ≈5 k deletions)

### Fixed
- Async crawler session leak, duplicate‑visit handling, URL normalisation
- Target‑element regressions in scraping strategies
- Logged‑URL readability, encoded‑URL decoding, middle truncation for long URLs
- Closed issues: #701, #733, #756, #774, #804, #822, #839, #841, #842, #843, #867, #902, #911

### Removed
- Obsolete modules under `crawl4ai/browser/*` superseded by the new pooled browser layer

### Deprecated
- Old markdown generator names now alias `DefaultMarkdownGenerator` and emit warnings

---

#### Upgrade notes
1. Update any direct imports from `crawl4ai/browser/*` to the new pooled browser modules
2. If you override `AsyncPlaywrightCrawlerStrategy.get_page`, adopt the new signature
3. Rebuild Docker images to pull the new Chromium layer
4. Switch to `DefaultMarkdownGenerator` (or silence the deprecation warning)

---

`121 files changed, ≈36 223 insertions, ≈4 975 deletions`


### [Feature] 2025-04-21
- Implemented MCP protocol for machine-to-machine communication
- Added WebSocket and SSE transport for MCP server
12
Dockerfile
@@ -1,4 +1,9 @@
FROM python:3.10-slim
FROM python:3.12-slim-bookworm AS build

# C4ai version
ARG C4AI_VER=0.6.0
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER

# Set build arguments
ARG APP_HOME=/app
@@ -17,7 +22,7 @@ ENV PYTHONFAULTHANDLER=1 \
REDIS_HOST=localhost \
REDIS_PORT=6379

ARG PYTHON_VERSION=3.10
ARG PYTHON_VERSION=3.12
ARG INSTALL_TYPE=default
ARG ENABLE_GPU=false
ARG TARGETARCH
@@ -66,6 +71,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get dist-upgrade -y \
&& rm -rf /var/lib/apt/lists/*

RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETARCH" = "amd64" ] ; then \
apt-get update && apt-get install -y --no-install-recommends \
nvidia-cuda-toolkit \
138
README.md
@@ -21,9 +21,9 @@
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.5.0](#-recent-updates)
|
||||
[✨ Check out latest update v0.6.0](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
||||
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
|
||||
|
||||
<details>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
@@ -253,24 +253,29 @@ pip install -e ".[all]" # Install all optional features
|
||||
<details>
|
||||
<summary>🐳 <strong>Docker Deployment</strong></summary>
|
||||
|
||||
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
|
||||
> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
|
||||
|
||||
### Current Docker Support
|
||||
### New Docker Features
|
||||
|
||||
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
|
||||
The new Docker implementation includes:
|
||||
- **Browser pooling** with page pre-warming for faster response times
|
||||
- **Interactive playground** to test and generate request code
|
||||
- **MCP integration** for direct connection to AI tools like Claude Code
|
||||
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
|
||||
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
|
||||
- **Optimized resources** with improved memory management
|
||||
|
||||
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
|
||||
- ⚠️ Note: This setup will be replaced in the next major release
|
||||
### Getting Started
|
||||
|
||||
### What's Coming Next?
|
||||
```bash
|
||||
# Pull and run the latest release candidate
|
||||
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
|
||||
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
|
||||
|
||||
Our new Docker implementation will bring:
|
||||
- Improved performance and resource efficiency
|
||||
- Streamlined deployment process
|
||||
- Better integration with Crawl4AI features
|
||||
- Enhanced scalability options
|
||||
# Visit the playground at http://localhost:11235/playground
|
||||
```
|
||||
|
||||
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
|
||||
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
@@ -500,31 +505,92 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
### Version 0.5.0 Major Release Highlights
|
||||
### Version 0.6.0 Release Highlights
|
||||
|
||||
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
|
||||
- **BFS Strategy**: Breadth-first search explores websites level by level
|
||||
- **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
|
||||
- **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
|
||||
- **Page Limiting**: Control the maximum number of pages to crawl with `max_pages` parameter
|
||||
- **Score Thresholds**: Filter URLs based on relevance scores
|
||||
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
|
||||
- **🔄 Multiple Crawling Strategies**:
|
||||
- **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default)
|
||||
- **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
|
||||
- **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints
|
||||
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
|
||||
- **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
|
||||
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant to answer your question for Crawl4ai, and generate proper code for crawling.
|
||||
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
|
||||
- **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
|
||||
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
|
||||
```python
|
||||
crun_cfg = CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo", # test page that shows your location
|
||||
locale="en-US", # Accept-Language & UI locale
|
||||
timezone_id="America/Los_Angeles", # JS Date()/Intl timezone
|
||||
geolocation=GeolocationConfig( # override GPS coords
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
|
||||
```python
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
|
||||
try:
|
||||
# Set up scraping parameters
|
||||
crawl_config = CrawlerRunConfig(
|
||||
table_score_threshold=8, # Strict table detection
|
||||
)
|
||||
|
||||
# Execute market data extraction
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
)
|
||||
|
||||
# Process results
|
||||
raw_df = pd.DataFrame()
|
||||
for result in results:
|
||||
if result.success and result.media["tables"]:
|
||||
raw_df = pd.DataFrame(
|
||||
result.media["tables"][0]["rows"],
|
||||
columns=result.media["tables"][0]["headers"],
|
||||
)
|
||||
break
|
||||
print(raw_df.head())
|
||||
|
||||
finally:
|
||||
await crawler.stop()
|
||||
```
|
||||
|
||||
- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
|
||||
|
||||
- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
|
||||
```python
|
||||
crawler_config = CrawlerRunConfig(
|
||||
capture_network=True,
|
||||
capture_console=True,
|
||||
mhtml=True
|
||||
)
|
||||
```
|
||||
|
||||
- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
|
||||
```bash
|
||||
# Add Crawl4AI to Claude Code
|
||||
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
|
||||
```
|
||||
|
||||
- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `http://localhost:11235/playground`
|
||||
|
||||
- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
|
||||
|
||||
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
|
||||
|
||||
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
### Previous Version: 0.5.0 Major Release Highlights
|
||||
|
||||
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
|
||||
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
|
||||
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
|
||||
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
|
||||
- **👤 Browser Profiler**: Create and manage persistent browser profiles
|
||||
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
|
||||
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
|
||||
- **🌐 Proxy Rotation**: Built-in support for proxy switching
|
||||
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
|
||||
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
|
||||
- **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
|
||||
- **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
|
||||
- **🔍 robots.txt Compliance**: Respect website crawling rules
|
||||
|
||||
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
|
||||
|
||||
## Version Numbering in Crawl4AI
|
||||
|
||||
@@ -540,7 +606,7 @@ We use different suffixes to indicate development stages:
|
||||
- `dev` (0.4.3dev1): Development versions, unstable
|
||||
- `a` (0.4.3a1): Alpha releases, experimental features
|
||||
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
|
||||
- `rc` (0.4.3rc1): Release candidates, potential final version
|
||||
- `rc` (0.4.3): Release candidates, potential final version
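
The suffix scheme listed above can be made concrete with a small classifier. This is a minimal sketch under stated assumptions: `release_stage` and its regular expression are hypothetical and not part of Crawl4AI, and only the `dev`, `a`, `b`, and `rc` suffixes named in the list are recognised (other forms, such as `.postN`, are rejected).

```python
import re

# Map each suffix from the list to a human-readable development stage.
_STAGE_BY_SUFFIX = {
    "dev": "development",
    "a": "alpha",
    "b": "beta",
    "rc": "release candidate",
}

# MAJOR.MINOR.PATCH, then an optional suffix and an optional suffix number.
_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)(dev|a|b|rc)?(\d+)?$")


def release_stage(version: str) -> str:
    """Classify a version string by its pre-release suffix."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    suffix = match.group(2)
    # No suffix means a regular (stable) release.
    return _STAGE_BY_SUFFIX[suffix] if suffix else "stable"


for example in ("0.4.3dev1", "0.4.3a1", "0.4.3b1", "0.4.3rc1", "0.4.3"):
    print(example, "->", release_stage(example))
```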
|
||||
|
||||
#### Installation
|
||||
- Regular installation (stable version):
|
||||
|
||||
@@ -1,2 +1,3 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.5.0.post8"
|
||||
__version__ = "0.6.3"
|
||||
|
||||
|
||||
@@ -427,7 +427,7 @@ class BrowserConfig:
|
||||
host: str = "localhost",
|
||||
):
|
||||
self.browser_type = browser_type
|
||||
self.headless = headless or True
|
||||
self.headless = headless
|
||||
self.browser_mode = browser_mode
|
||||
self.use_managed_browser = use_managed_browser
|
||||
self.cdp_url = cdp_url
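
The `headless` change in this hunk matters because `x or True` evaluates to `True` for every value of `x`, so an explicit `headless=False` was ignored. The sketch below isolates the two expressions; the function names are hypothetical, and it assumes the `headless or True` line is the removed one and the plain assignment is its replacement.

```python
def resolve_headless_old(headless: bool = True) -> bool:
    # Previous expression: `or True` yields True regardless of the input.
    return headless or True


def resolve_headless_new(headless: bool = True) -> bool:
    # Updated expression: the caller's value is kept unchanged.
    return headless


# prints: True False
print(resolve_headless_old(False), resolve_headless_new(False))
```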
|
||||
|
||||
@@ -171,7 +171,10 @@ class AsyncDatabaseManager:
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message=create_box_message(error_message, type="error"),
|
||||
message="{error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(error_message)},
|
||||
boxes=["error"],
|
||||
)
|
||||
|
||||
raise
|
||||
@@ -189,7 +192,10 @@ class AsyncDatabaseManager:
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message=create_box_message(error_message, type="error"),
|
||||
message="{error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(error_message)},
|
||||
boxes=["error"],
|
||||
)
|
||||
raise
|
||||
finally:
|
||||
|
||||
@@ -1,10 +1,12 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from enum import Enum
|
||||
from typing import Optional, Dict, Any
|
||||
from colorama import Fore, Style, init
|
||||
from typing import Optional, Dict, Any, List
|
||||
import os
|
||||
from datetime import datetime
|
||||
from urllib.parse import unquote
|
||||
from rich.console import Console
|
||||
from rich.text import Text
|
||||
from .utils import create_box_message
|
||||
|
||||
|
||||
class LogLevel(Enum):
|
||||
@@ -21,6 +23,26 @@ class LogLevel(Enum):
|
||||
FATAL = 10
|
||||
|
||||
|
||||
def __str__(self):
|
||||
return self.name.lower()
|
||||
|
||||
class LogColor(str, Enum):
|
||||
"""Enum for log colors."""
|
||||
|
||||
DEBUG = "lightblack"
|
||||
INFO = "cyan"
|
||||
SUCCESS = "green"
|
||||
WARNING = "yellow"
|
||||
ERROR = "red"
|
||||
CYAN = "cyan"
|
||||
GREEN = "green"
|
||||
YELLOW = "yellow"
|
||||
MAGENTA = "magenta"
|
||||
DIM_MAGENTA = "dim magenta"
|
||||
|
||||
def __str__(self):
|
||||
"""Automatically convert rich color to string."""
|
||||
return self.value
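
Overriding `__str__` on the enum is what lets a member be interpolated directly into a markup tag in an f-string. The following standalone sketch reproduces only two members of the enum; `wrap` is a hypothetical helper added for illustration, and no `rich` import is needed to see the resulting string.

```python
from enum import Enum


class LogColor(str, Enum):
    """Reduced copy of the enum above, limited to two members."""

    CYAN = "cyan"
    GREEN = "green"

    def __str__(self) -> str:
        # Return the raw color name so f-strings produce "green", not "LogColor.GREEN".
        return self.value


def wrap(text: str, color: LogColor) -> str:
    # Build a markup-style tag pair around the text.
    return f"[{color}]{text}[/{color}]"


# prints: [green]profile saved[/green]
print(wrap("profile saved", LogColor.GREEN))
```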
|
||||
|
||||
|
||||
class AsyncLoggerBase(ABC):
|
||||
@@ -52,6 +74,7 @@ class AsyncLoggerBase(ABC):
|
||||
def error_status(self, url: str, error: str, tag: str = "ERROR", url_length: int = 100):
|
||||
pass
|
||||
|
||||
|
||||
class AsyncLogger(AsyncLoggerBase):
|
||||
"""
|
||||
Asynchronous logger with support for colored console output and file logging.
|
||||
@@ -79,17 +102,11 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
}
|
||||
|
||||
DEFAULT_COLORS = {
|
||||
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
|
||||
LogLevel.INFO: Fore.CYAN,
|
||||
LogLevel.SUCCESS: Fore.GREEN,
|
||||
LogLevel.WARNING: Fore.YELLOW,
|
||||
LogLevel.ERROR: Fore.RED,
|
||||
LogLevel.CRITICAL: Fore.RED + Style.BRIGHT,
|
||||
LogLevel.ALERT: Fore.RED + Style.BRIGHT,
|
||||
LogLevel.NOTICE: Fore.BLUE,
|
||||
LogLevel.EXCEPTION: Fore.RED + Style.BRIGHT,
|
||||
LogLevel.FATAL: Fore.RED + Style.BRIGHT,
|
||||
LogLevel.DEFAULT: Fore.WHITE,
|
||||
LogLevel.DEBUG: LogColor.DEBUG,
|
||||
LogLevel.INFO: LogColor.INFO,
|
||||
LogLevel.SUCCESS: LogColor.SUCCESS,
|
||||
LogLevel.WARNING: LogColor.WARNING,
|
||||
LogLevel.ERROR: LogColor.ERROR,
|
||||
}
|
||||
|
||||
def __init__(
|
||||
@@ -98,7 +115,7 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
log_level: LogLevel = LogLevel.DEBUG,
|
||||
tag_width: int = 10,
|
||||
icons: Optional[Dict[str, str]] = None,
|
||||
colors: Optional[Dict[LogLevel, str]] = None,
|
||||
colors: Optional[Dict[LogLevel, LogColor]] = None,
|
||||
verbose: bool = True,
|
||||
):
|
||||
"""
|
||||
@@ -112,13 +129,13 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
colors: Custom colors for different log levels
|
||||
verbose: Whether to output to console
|
||||
"""
|
||||
init() # Initialize colorama
|
||||
self.log_file = log_file
|
||||
self.log_level = log_level
|
||||
self.tag_width = tag_width
|
||||
self.icons = icons or self.DEFAULT_ICONS
|
||||
self.colors = colors or self.DEFAULT_COLORS
|
||||
self.verbose = verbose
|
||||
self.console = Console()
|
||||
|
||||
# Create log file directory if needed
|
||||
if log_file:
|
||||
@@ -143,16 +160,11 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
def _write_to_file(self, message: str):
|
||||
"""Write a message to the log file if configured."""
|
||||
if self.log_file:
|
||||
text = Text.from_markup(message)
|
||||
plain_text = text.plain
|
||||
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
|
||||
with open(self.log_file, "a", encoding="utf-8") as f:
|
||||
# Strip ANSI color codes for file output
|
||||
clean_message = message.replace(Fore.RESET, "").replace(
|
||||
Style.RESET_ALL, ""
|
||||
)
|
||||
for color in vars(Fore).values():
|
||||
if isinstance(color, str):
|
||||
clean_message = clean_message.replace(color, "")
|
||||
f.write(f"[{timestamp}] {clean_message}\n")
|
||||
f.write(f"[{timestamp}] {plain_text}\n")
|
||||
|
||||
def _log(
|
||||
self,
|
||||
@@ -160,8 +172,9 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
message: str,
|
||||
tag: str,
|
||||
params: Optional[Dict[str, Any]] = None,
|
||||
colors: Optional[Dict[str, str]] = None,
|
||||
base_color: Optional[str] = None,
|
||||
colors: Optional[Dict[str, LogColor]] = None,
|
||||
boxes: Optional[List[str]] = None,
|
||||
base_color: Optional[LogColor] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
@@ -173,55 +186,44 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
tag: Tag for the message
|
||||
params: Parameters to format into the message
|
||||
colors: Color overrides for specific parameters
|
||||
boxes: Box overrides for specific parameters
|
||||
base_color: Base color for the entire message
|
||||
"""
|
||||
if level.value < self.log_level.value:
|
||||
return
|
||||
|
||||
# Format the message with parameters if provided
|
||||
# avoid conflict with rich formatting
|
||||
parsed_message = message.replace("[", "[[").replace("]", "]]")
|
||||
if params:
|
||||
try:
|
||||
# First format the message with raw parameters
|
||||
formatted_message = message.format(**params)
|
||||
# FIXME: If there are formatting strings in floating point format,
|
||||
# this may result in colors and boxes not being applied properly.
|
||||
# such as {value:.2f}, the value is 0.23333 format it to 0.23,
|
||||
# but we replace("0.23333", "[color]0.23333[/color]")
|
||||
formatted_message = parsed_message.format(**params)
|
||||
for key, value in params.items():
|
||||
# value_str may discard `[` and `]`, so we need to replace it.
|
||||
value_str = str(value).replace("[", "[[").replace("]", "]]")
|
||||
# check whether a color needs to be applied
|
||||
if colors and key in colors:
|
||||
color_str = f"[{colors[key]}]{value_str}[/{colors[key]}]"
|
||||
formatted_message = formatted_message.replace(value_str, color_str)
|
||||
value_str = color_str
|
||||
|
||||
# Then apply colors if specified
|
||||
color_map = {
|
||||
"green": Fore.GREEN,
|
||||
"red": Fore.RED,
|
||||
"yellow": Fore.YELLOW,
|
||||
"blue": Fore.BLUE,
|
||||
"cyan": Fore.CYAN,
|
||||
"magenta": Fore.MAGENTA,
|
||||
"white": Fore.WHITE,
|
||||
"black": Fore.BLACK,
|
||||
"reset": Style.RESET_ALL,
|
||||
}
|
||||
if colors:
|
||||
for key, color in colors.items():
|
||||
# Find the formatted value in the message and wrap it with color
|
||||
if color in color_map:
|
||||
color = color_map[color]
|
||||
if key in params:
|
||||
value_str = str(params[key])
|
||||
formatted_message = formatted_message.replace(
|
||||
value_str, f"{color}{value_str}{Style.RESET_ALL}"
|
||||
)
|
||||
# check whether a box needs to be applied
|
||||
if boxes and key in boxes:
|
||||
formatted_message = formatted_message.replace(value_str,
|
||||
create_box_message(value_str, type=str(level)))
|
||||
|
||||
except KeyError as e:
|
||||
formatted_message = (
|
||||
f"LOGGING ERROR: Missing parameter {e} in message template"
|
||||
)
|
||||
level = LogLevel.ERROR
|
||||
else:
|
||||
formatted_message = message
|
||||
formatted_message = parsed_message
|
||||
|
||||
# Construct the full log line
|
||||
color = base_color or self.colors[level]
|
||||
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
|
||||
color: LogColor = base_color or self.colors[level]
|
||||
log_line = f"[{color}]{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message} [/{color}]"
|
||||
|
||||
# Output to console if verbose
|
||||
if self.verbose or kwargs.get("force_verbose", False):
|
||||
print(log_line)
|
||||
self.console.print(log_line)
|
||||
|
||||
# Write to file if configured
|
||||
self._write_to_file(log_line)
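
The limitation described in the FIXME comment can be reproduced in isolation. This sketch uses a simplified, hypothetical `colorize` function rather than the logger itself: it formats the template first and then searches for `str(value)`, which is the approach the comment describes. When a format spec such as `:.2f` shortens the value, the search string no longer occurs in the message and the color is silently skipped.

```python
from typing import Any, Dict


def colorize(template: str, params: Dict[str, Any], colors: Dict[str, str]) -> str:
    """Format the template, then wrap each colored parameter's text in markup tags."""
    formatted = template.format(**params)
    for key, value in params.items():
        value_str = str(value)
        if key in colors:
            tagged = f"[{colors[key]}]{value_str}[/{colors[key]}]"
            # Replacement only succeeds if str(value) appears verbatim in the message.
            formatted = formatted.replace(value_str, tagged)
    return formatted


# Without a format spec the value is found and colored.
print(colorize("took {t}s", {"t": 0.23333}, {"t": "yellow"}))
# prints: took [yellow]0.23333[/yellow]s

# With ":.2f" the message contains "0.23", so "0.23333" is never found.
print(colorize("took {t:.2f}s", {"t": 0.23333}, {"t": "yellow"}))
# prints: took 0.23s
```

One way around this, if it ever becomes a problem, would be to pre-format values before passing them in `params`, so that the template contains only plain `{name}` placeholders.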
|
||||
@@ -292,8 +294,8 @@ class AsyncLogger(AsyncLoggerBase):
|
||||
"timing": timing,
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if success else Fore.RED,
|
||||
"timing": Fore.YELLOW,
|
||||
"status": LogColor.SUCCESS if success else LogColor.ERROR,
|
||||
"timing": LogColor.WARNING,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
@@ -2,7 +2,6 @@ from .__version__ import __version__ as crawl4ai_version
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from colorama import Fore
|
||||
from pathlib import Path
|
||||
from typing import Optional, List
|
||||
import json
|
||||
@@ -44,7 +43,6 @@ from .utils import (
|
||||
sanitize_input_encode,
|
||||
InvalidCSSSelectorError,
|
||||
fast_format_html,
|
||||
create_box_message,
|
||||
get_error_context,
|
||||
RobotsParser,
|
||||
preprocess_html_for_schema,
|
||||
@@ -419,7 +417,7 @@ class AsyncWebCrawler:
|
||||
|
||||
self.logger.error_status(
|
||||
url=url,
|
||||
error=create_box_message(error_message, type="error"),
|
||||
error=error_message,
|
||||
tag="ERROR",
|
||||
)
|
||||
|
||||
@@ -496,11 +494,13 @@ class AsyncWebCrawler:
|
||||
cleaned_html = sanitize_input_encode(
|
||||
result.get("cleaned_html", ""))
|
||||
media = result.get("media", {})
|
||||
tables = media.pop("tables", []) if isinstance(media, dict) else []
|
||||
links = result.get("links", {})
|
||||
metadata = result.get("metadata", {})
|
||||
else:
|
||||
cleaned_html = sanitize_input_encode(result.cleaned_html)
|
||||
media = result.media.model_dump()
|
||||
tables = media.pop("tables", [])
|
||||
links = result.links.model_dump()
|
||||
metadata = result.metadata
|
||||
|
||||
@@ -627,6 +627,7 @@ class AsyncWebCrawler:
|
||||
cleaned_html=cleaned_html,
|
||||
markdown=markdown_result,
|
||||
media=media,
|
||||
tables=tables, # NEW
|
||||
links=links,
|
||||
metadata=metadata,
|
||||
screenshot=screenshot_data,
|
||||
|
||||
@@ -5,7 +5,10 @@ import os
|
||||
import sys
|
||||
import shutil
|
||||
import tempfile
|
||||
import psutil
|
||||
import signal
|
||||
import subprocess
|
||||
import shlex
|
||||
from playwright.async_api import BrowserContext
|
||||
import hashlib
|
||||
from .js_snippet import load_js_script
|
||||
@@ -193,6 +196,45 @@ class ManagedBrowser:
|
||||
|
||||
if self.browser_config.extra_args:
|
||||
args.extend(self.browser_config.extra_args)
|
||||
|
||||
|
||||
# ── make sure no old Chromium instance is owning the same port/profile ──
|
||||
try:
|
||||
if sys.platform == "win32":
|
||||
if psutil is None:
|
||||
raise RuntimeError("psutil not available, cannot clean old browser")
|
||||
for p in psutil.process_iter(["pid", "name", "cmdline"]):
|
||||
cl = " ".join(p.info.get("cmdline") or [])
|
||||
if (
|
||||
f"--remote-debugging-port={self.debugging_port}" in cl
|
||||
and f"--user-data-dir={self.user_data_dir}" in cl
|
||||
):
|
||||
p.kill()
|
||||
p.wait(timeout=5)
|
||||
else: # macOS / Linux
|
||||
# kill any process listening on the same debugging port
|
||||
pids = (
|
||||
subprocess.check_output(shlex.split(f"lsof -t -i:{self.debugging_port}"))
|
||||
.decode()
|
||||
.strip()
|
||||
.splitlines()
|
||||
)
|
||||
for pid in pids:
|
||||
try:
|
||||
os.kill(int(pid), signal.SIGTERM)
|
||||
except ProcessLookupError:
|
||||
pass
|
||||
|
||||
# remove Chromium singleton locks, or new launch exits with
|
||||
# “Opening in existing browser session.”
|
||||
for f in ("SingletonLock", "SingletonSocket", "SingletonCookie"):
|
||||
fp = os.path.join(self.user_data_dir, f)
|
||||
if os.path.exists(fp):
|
||||
os.remove(fp)
|
||||
except Exception as _e:
|
||||
# non-fatal — we'll try to start anyway, but log what happened
|
||||
self.logger.warning(f"pre-launch cleanup failed: {_e}", tag="BROWSER")
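
The lock-removal step of the cleanup above can be exercised on its own. The sketch below is not the code from this changeset: `remove_singleton_locks` is a hypothetical helper, and it uses `os.path.lexists` instead of `os.path.exists` on the assumption that these entries are commonly symbolic links on Linux, where `exists` reports `False` for a link whose target is gone.

```python
import os
import tempfile
from typing import List

# File names taken from the cleanup loop in the diff above.
SINGLETON_FILES = ("SingletonLock", "SingletonSocket", "SingletonCookie")


def remove_singleton_locks(user_data_dir: str) -> List[str]:
    """Delete leftover singleton entries and return the names that were removed."""
    removed = []
    for name in SINGLETON_FILES:
        path = os.path.join(user_data_dir, name)
        # lexists is also True for dangling symlinks, unlike exists.
        if os.path.lexists(path):
            os.remove(path)
            removed.append(name)
    return removed


# Demonstration against a throwaway directory with one fake lock file.
with tempfile.TemporaryDirectory() as profile_dir:
    open(os.path.join(profile_dir, "SingletonLock"), "w").close()
    print(remove_singleton_locks(profile_dir))  # first call removes the fake lock
    print(remove_singleton_locks(profile_dir))  # second call finds nothing to remove
```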
|
||||
|
||||
|
||||
# Start browser process
|
||||
try:
|
||||
@@ -922,7 +964,7 @@ class BrowserManager:
|
||||
pages = context.pages
|
||||
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
||||
if not page:
|
||||
page = await context.new_page()
|
||||
page = context.pages[0] # await context.new_page()
|
||||
else:
|
||||
# Otherwise, check if we have an existing context for this config
|
||||
config_signature = self._make_config_signature(crawlerRunConfig)
|
||||
|
||||
@@ -15,12 +15,12 @@ import shutil
|
||||
import json
|
||||
import subprocess
|
||||
import time
|
||||
from typing import List, Dict, Optional, Any, Tuple
|
||||
from colorama import Fore, Style, init
|
||||
from typing import List, Dict, Optional, Any
|
||||
from rich.console import Console
|
||||
|
||||
from .async_configs import BrowserConfig
|
||||
from .browser_manager import ManagedBrowser
|
||||
from .async_logger import AsyncLogger, AsyncLoggerBase
|
||||
from .async_logger import AsyncLogger, AsyncLoggerBase, LogColor
|
||||
from .utils import get_home_folder
|
||||
|
||||
|
||||
@@ -45,8 +45,8 @@ class BrowserProfiler:
|
||||
logger (AsyncLoggerBase, optional): Logger for outputting messages.
|
||||
If None, a default AsyncLogger will be created.
|
||||
"""
|
||||
# Initialize colorama for colorful terminal output
|
||||
init()
|
||||
# Initialize rich console for colorful input prompts
|
||||
self.console = Console()
|
||||
|
||||
# Create a logger if not provided
|
||||
if logger is None:
|
||||
@@ -127,26 +127,30 @@ class BrowserProfiler:
|
||||
profile_path = os.path.join(self.profiles_dir, profile_name)
|
||||
os.makedirs(profile_path, exist_ok=True)
|
||||
|
||||
# Print instructions for the user with colorama formatting
|
||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
||||
self.logger.info(f"\n{border}", tag="PROFILE")
|
||||
self.logger.info(f"Creating browser profile: {Fore.GREEN}{profile_name}{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info(f"Profile directory: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
||||
# Print instructions for the user with rich formatting
|
||||
border = "=" * 80  # plain string repetition; the unprefixed "{'='*80}" literal was never evaluated
|
||||
self.logger.info("{border}", tag="PROFILE", params={"border": f"\n{border}"}, colors={"border": LogColor.CYAN})
|
||||
self.logger.info("Creating browser profile: {profile_name}", tag="PROFILE", params={"profile_name": profile_name}, colors={"profile_name": LogColor.GREEN})
|
||||
self.logger.info("Profile directory: {profile_path}", tag="PROFILE", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||
|
||||
self.logger.info("\nInstructions:", tag="PROFILE")
|
||||
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
|
||||
self.logger.info(f"2. {Fore.CYAN}Log in to websites{Style.RESET_ALL}, configure settings, etc. as needed.", tag="PROFILE")
|
||||
self.logger.info(f"3. When you're done, {Fore.YELLOW}press 'q' in this terminal{Style.RESET_ALL} to close the browser.", tag="PROFILE")
|
||||
self.logger.info("{segment}, configure settings, etc. as needed.", tag="PROFILE", params={"segment": "2. Log in to websites"}, colors={"segment": LogColor.CYAN})
|
||||
self.logger.info("3. When you're done, {segment} to close the browser.", tag="PROFILE", params={"segment": "press 'q' in this terminal"}, colors={"segment": LogColor.YELLOW})
|
||||
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
|
||||
self.logger.info(f"{border}\n", tag="PROFILE")
|
||||
self.logger.info("{border}", tag="PROFILE", params={"border": f"{border}\n"}, colors={"border": LogColor.CYAN})
|
||||
|
||||
browser_config.headless = False
|
||||
browser_config.user_data_dir = profile_path
|
||||
|
||||
|
||||
# Create managed browser instance
|
||||
managed_browser = ManagedBrowser(
|
||||
browser_type=browser_config.browser_type,
|
||||
user_data_dir=profile_path,
|
||||
headless=False, # Must be visible
|
||||
browser_config=browser_config,
|
||||
# user_data_dir=profile_path,
|
||||
# headless=False, # Must be visible
|
||||
logger=self.logger,
|
||||
debugging_port=browser_config.debugging_port
|
||||
# debugging_port=browser_config.debugging_port
|
||||
)
|
||||
|
||||
# Set up signal handlers to ensure cleanup on interrupt
|
||||
@@ -181,7 +185,7 @@ class BrowserProfiler:
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' when you've finished using the browser...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
@@ -197,7 +201,7 @@ class BrowserProfiler:
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info(f"{Fore.GREEN}Closing browser and saving profile...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
@@ -223,7 +227,7 @@ class BrowserProfiler:
|
||||
self.logger.error("Failed to start browser process.", tag="PROFILE")
|
||||
return None
|
||||
|
||||
self.logger.info(f"Browser launched. {Fore.CYAN}Waiting for you to finish...{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.info("Browser launched. Waiting for you to finish...", tag="PROFILE")
|
||||
|
||||
# Start listening for keyboard input
|
||||
listener_task = asyncio.create_task(listen_for_quit_command())
|
||||
@@ -245,10 +249,10 @@ class BrowserProfiler:
|
||||
self.logger.info("Terminating browser process...", tag="PROFILE")
|
||||
await managed_browser.cleanup()
|
||||
|
||||
self.logger.success(f"Browser closed. Profile saved at: {Fore.GREEN}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
|
||||
self.logger.success(f"Browser closed. Profile saved at: {profile_path}", tag="PROFILE")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error creating profile: {str(e)}", tag="PROFILE")
|
||||
self.logger.error(f"Error creating profile: {e!s}", tag="PROFILE")
|
||||
await managed_browser.cleanup()
|
||||
return None
|
||||
finally:
|
||||
@@ -440,25 +444,27 @@ class BrowserProfiler:
|
||||
```
|
||||
"""
|
||||
while True:
|
||||
self.logger.info(f"\n{Fore.CYAN}Profile Management Options:{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"1. {Fore.GREEN}Create a new profile{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"2. {Fore.YELLOW}List available profiles{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"3. {Fore.RED}Delete a profile{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("\nProfile Management Options:", tag="MENU")
|
||||
self.logger.info("1. Create a new profile", tag="MENU", base_color=LogColor.GREEN)
|
||||
self.logger.info("2. List available profiles", tag="MENU", base_color=LogColor.YELLOW)
|
||||
self.logger.info("3. Delete a profile", tag="MENU", base_color=LogColor.RED)
|
||||
|
||||
# Only show crawl option if callback provided
|
||||
if crawl_callback:
|
||||
self.logger.info(f"4. {Fore.CYAN}Use a profile to crawl a website{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info(f"5. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("4. Use a profile to crawl a website", tag="MENU", base_color=LogColor.CYAN)
|
||||
self.logger.info("5. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||
exit_option = "5"
|
||||
else:
|
||||
self.logger.info(f"4. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
|
||||
self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
|
||||
exit_option = "4"
|
||||
|
||||
choice = input(f"\n{Fore.CYAN}Enter your choice (1-{exit_option}): {Style.RESET_ALL}")
|
||||
self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
|
||||
choice = input()
|
||||
|
||||
if choice == "1":
|
||||
# Create new profile
|
||||
name = input(f"{Fore.GREEN}Enter a name for the new profile (or press Enter for auto-generated name): {Style.RESET_ALL}")
|
||||
self.console.print("[green]Enter a name for the new profile (or press Enter for auto-generated name): [/green]", end="")
|
||||
name = input()
|
||||
await self.create_profile(name or None)
|
||||
|
||||
elif choice == "2":
|
||||
@@ -472,8 +478,8 @@ class BrowserProfiler:
|
||||
# Print profile information with colorama formatting
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES")
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {Fore.CYAN}{profile['name']}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f" Path: {Fore.YELLOW}{profile['path']}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
self.logger.info(f" Path: {profile['path']}", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
|
||||
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
|
||||
self.logger.info("", tag="PROFILES") # Empty line for spacing
|
||||
@@ -486,12 +492,13 @@ class BrowserProfiler:
|
||||
continue
|
||||
|
||||
# Display numbered list
|
||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
|
||||
# Get profile to delete
|
||||
profile_idx = input(f"{Fore.RED}Enter the number of the profile to delete (or 'c' to cancel): {Style.RESET_ALL}")
|
||||
self.console.print("[red]Enter the number of the profile to delete (or 'c' to cancel): [/red]", end="")
|
||||
profile_idx = input()
|
||||
if profile_idx.lower() == 'c':
|
||||
continue
|
||||
|
||||
@@ -499,17 +506,18 @@ class BrowserProfiler:
|
||||
idx = int(profile_idx) - 1
|
||||
if 0 <= idx < len(profiles):
|
||||
profile_name = profiles[idx]["name"]
|
||||
self.logger.info(f"Deleting profile: {Fore.YELLOW}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info(f"Deleting profile: [yellow]{profile_name}[/yellow]", tag="PROFILES")
|
||||
|
||||
# Confirm deletion
|
||||
confirm = input(f"{Fore.RED}Are you sure you want to delete this profile? (y/n): {Style.RESET_ALL}")
|
||||
self.console.print("[red]Are you sure you want to delete this profile? (y/n): [/red]", end="")
|
||||
confirm = input()
|
||||
if confirm.lower() == 'y':
|
||||
success = self.delete_profile(profiles[idx]["path"])
|
||||
|
||||
if success:
|
||||
self.logger.success(f"Profile {Fore.GREEN}{profile_name}{Style.RESET_ALL} deleted successfully", tag="PROFILES")
|
||||
self.logger.success(f"Profile {profile_name} deleted successfully", tag="PROFILES")
|
||||
else:
|
||||
self.logger.error(f"Failed to delete profile {Fore.RED}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.error(f"Failed to delete profile {profile_name}", tag="PROFILES")
|
||||
else:
|
||||
self.logger.error("Invalid profile number", tag="PROFILES")
|
||||
except ValueError:
|
||||
@@ -523,12 +531,13 @@ class BrowserProfiler:
|
||||
continue
|
||||
|
||||
# Display numbered list
|
||||
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
|
||||
self.logger.info("\nAvailable profiles:", tag="PROFILES", base_color=LogColor.YELLOW)
|
||||
for i, profile in enumerate(profiles):
|
||||
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
|
||||
|
||||
# Get profile to use
|
||||
profile_idx = input(f"{Fore.CYAN}Enter the number of the profile to use (or 'c' to cancel): {Style.RESET_ALL}")
|
||||
self.console.print("[cyan]Enter the number of the profile to use (or 'c' to cancel): [/cyan]", end="")
|
||||
profile_idx = input()
|
||||
if profile_idx.lower() == 'c':
|
||||
continue
|
||||
|
||||
@@ -536,7 +545,8 @@ class BrowserProfiler:
|
||||
idx = int(profile_idx) - 1
|
||||
if 0 <= idx < len(profiles):
|
||||
profile_path = profiles[idx]["path"]
|
||||
url = input(f"{Fore.CYAN}Enter the URL to crawl: {Style.RESET_ALL}")
|
||||
self.console.print("[cyan]Enter the URL to crawl: [/cyan]", end="")
|
||||
url = input()
|
||||
if url:
|
||||
# Call the provided crawl callback
|
||||
await crawl_callback(profile_path, url)
|
||||
@@ -599,11 +609,11 @@ class BrowserProfiler:
|
||||
# Print initial information
|
||||
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
|
||||
self.logger.info(f"\n{border}", tag="CDP")
|
||||
self.logger.info(f"Launching standalone browser with CDP debugging", tag="CDP")
|
||||
self.logger.info(f"Browser type: {Fore.GREEN}{browser_type}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Profile path: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Debugging port: {Fore.CYAN}{debugging_port}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Headless mode: {Fore.CYAN}{headless}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("Launching standalone browser with CDP debugging", tag="CDP")
|
||||
self.logger.info("Browser type: {browser_type}", tag="CDP", params={"browser_type": browser_type}, colors={"browser_type": LogColor.CYAN})
|
||||
self.logger.info("Profile path: {profile_path}", tag="CDP", params={"profile_path": profile_path}, colors={"profile_path": LogColor.YELLOW})
|
||||
self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
|
||||
self.logger.info(f"Headless mode: {headless}", tag="CDP")
|
||||
|
||||
# Create managed browser instance
|
||||
managed_browser = ManagedBrowser(
|
||||
@@ -646,7 +656,7 @@ class BrowserProfiler:
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' to stop the browser and exit...{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
@@ -662,7 +672,7 @@ class BrowserProfiler:
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info(f"{Fore.GREEN}Closing browser...{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("Closing browser...", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
@@ -716,20 +726,20 @@ class BrowserProfiler:
|
||||
self.logger.error("Failed to start browser process.", tag="CDP")
|
||||
return None
|
||||
|
||||
self.logger.info(f"Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
||||
self.logger.info("Browser launched successfully. Retrieving CDP information...", tag="CDP")
|
||||
|
||||
# Get CDP URL and JSON config
|
||||
cdp_url, config_json = await get_cdp_json(debugging_port)
|
||||
|
||||
if cdp_url:
|
||||
self.logger.success(f"CDP URL: {Fore.GREEN}{cdp_url}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.success(f"CDP URL: {cdp_url}", tag="CDP")
|
||||
|
||||
if config_json:
|
||||
# Display relevant CDP information
|
||||
self.logger.info(f"Browser: {Fore.CYAN}{config_json.get('Browser', 'Unknown')}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP")
|
||||
self.logger.info(f"Browser: {config_json.get('Browser', 'Unknown')}", tag="CDP", colors={"Browser": LogColor.CYAN})
|
||||
self.logger.info(f"Protocol Version: {config_json.get('Protocol-Version', 'Unknown')}", tag="CDP", colors={"Protocol-Version": LogColor.CYAN})
|
||||
if 'webSocketDebuggerUrl' in config_json:
|
||||
self.logger.info(f"WebSocket URL: {Fore.GREEN}{config_json['webSocketDebuggerUrl']}{Style.RESET_ALL}", tag="CDP")
|
||||
self.logger.info("WebSocket URL: {webSocketDebuggerUrl}", tag="CDP", params={"webSocketDebuggerUrl": config_json['webSocketDebuggerUrl']}, colors={"webSocketDebuggerUrl": LogColor.GREEN})
|
||||
else:
|
||||
self.logger.warning("Could not retrieve CDP configuration JSON", tag="CDP")
|
||||
else:
|
||||
@@ -757,7 +767,7 @@ class BrowserProfiler:
|
||||
self.logger.info("Terminating browser process...", tag="CDP")
|
||||
await managed_browser.cleanup()
|
||||
|
||||
self.logger.success(f"Browser closed.", tag="CDP")
|
||||
self.logger.success("Browser closed.", tag="CDP")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error launching standalone browser: {str(e)}", tag="CDP")
|
||||
@@ -972,3 +982,30 @@ class BrowserProfiler:
|
||||
'info': browser_info
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example usage
|
||||
profiler = BrowserProfiler()
|
||||
|
||||
# Create a new profile
|
||||
import os
|
||||
from pathlib import Path
|
||||
home_dir = Path.home()
|
||||
profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
|
||||
|
||||
|
||||
|
||||
# Launch a standalone browser
|
||||
asyncio.run(profiler.launch_standalone_browser())
|
||||
|
||||
# List profiles
|
||||
profiles = profiler.list_profiles()
|
||||
for profile in profiles:
|
||||
print(f"Profile: {profile['name']}, Path: {profile['path']}")
|
||||
|
||||
# Delete a profile
|
||||
success = profiler.delete_profile("my-profile")
|
||||
if success:
|
||||
print("Profile deleted successfully")
|
||||
else:
|
||||
print("Failed to delete profile")
|
||||
@@ -27,8 +27,7 @@ import json
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .async_logger import AsyncLogger, LogLevel
|
||||
from colorama import Fore, Style
|
||||
from .async_logger import AsyncLogger, LogLevel, LogColor
|
||||
|
||||
|
||||
class RelevantContentFilter(ABC):
|
||||
@@ -846,8 +845,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
},
|
||||
colors={
|
||||
**AsyncLogger.DEFAULT_COLORS,
|
||||
LogLevel.INFO: Fore.MAGENTA
|
||||
+ Style.DIM, # Dimmed purple for LLM ops
|
||||
LogLevel.INFO: LogColor.DIM_MAGENTA # Dimmed purple for LLM ops
|
||||
},
|
||||
)
|
||||
else:
|
||||
@@ -892,7 +890,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"Starting LLM markdown content filtering process",
|
||||
tag="LLM",
|
||||
params={"provider": self.llm_config.provider},
|
||||
colors={"provider": Fore.CYAN},
|
||||
colors={"provider": LogColor.CYAN},
|
||||
)
|
||||
|
||||
# Cache handling
|
||||
@@ -929,7 +927,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"LLM markdown: Split content into {chunk_count} chunks",
|
||||
tag="CHUNK",
|
||||
params={"chunk_count": len(html_chunks)},
|
||||
colors={"chunk_count": Fore.YELLOW},
|
||||
colors={"chunk_count": LogColor.YELLOW},
|
||||
)
|
||||
|
||||
start_time = time.time()
|
||||
@@ -1038,7 +1036,7 @@ class LLMContentFilter(RelevantContentFilter):
|
||||
"LLM markdown: Completed processing in {time:.2f}s",
|
||||
tag="LLM",
|
||||
params={"time": end_time - start_time},
|
||||
colors={"time": Fore.YELLOW},
|
||||
colors={"time": LogColor.YELLOW},
|
||||
)
|
||||
|
||||
result = ordered_results if ordered_results else []
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from pydantic import BaseModel, HttpUrl, PrivateAttr
|
||||
from pydantic import BaseModel, HttpUrl, PrivateAttr, Field
|
||||
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
||||
from typing import AsyncGenerator
|
||||
from typing import Generic, TypeVar
|
||||
@@ -150,6 +150,7 @@ class CrawlResult(BaseModel):
|
||||
redirected_url: Optional[str] = None
|
||||
network_requests: Optional[List[Dict[str, Any]]] = None
|
||||
console_messages: Optional[List[Dict[str, Any]]] = None
|
||||
tables: List[Dict] = Field(default_factory=list) # NEW – [{headers,rows,caption,summary}]
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
@@ -20,7 +20,6 @@ from urllib.parse import urljoin
|
||||
import requests
|
||||
from requests.exceptions import InvalidSchema
|
||||
import xxhash
|
||||
from colorama import Fore, Style, init
|
||||
import textwrap
|
||||
import cProfile
|
||||
import pstats
|
||||
@@ -441,14 +440,13 @@ def create_box_message(
|
||||
str: A formatted string containing the styled message box.
|
||||
"""
|
||||
|
||||
init()
|
||||
|
||||
# Define border and text colors for different types
|
||||
styles = {
|
||||
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, "⚠"),
|
||||
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, "ℹ"),
|
||||
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, "✓"),
|
||||
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
|
||||
"warning": ("yellow", "bright_yellow", "⚠"),
|
||||
"info": ("blue", "bright_blue", "ℹ"),
|
||||
"debug": ("lightblack", "bright_black", "⋯"),
|
||||
"success": ("green", "bright_green", "✓"),
|
||||
"error": ("red", "bright_red", "×"),
|
||||
}
|
||||
|
||||
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
|
||||
@@ -480,12 +478,12 @@ def create_box_message(
|
||||
# Create the box with colored borders and lighter text
|
||||
horizontal_line = h_line * (width - 1)
|
||||
box = [
|
||||
f"{border_color}{tl}{horizontal_line}{tr}",
|
||||
f"[{border_color}]{tl}{horizontal_line}{tr}[/{border_color}]",
|
||||
*[
|
||||
f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}"
|
||||
f"[{border_color}]{v_line}[{text_color}] {line:<{width-2}}[/{text_color}][{border_color}]{v_line}[/{border_color}]"
|
||||
for line in formatted_lines
|
||||
],
|
||||
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}",
|
||||
f"[{border_color}]{bl}{horizontal_line}{br}[/{border_color}]",
|
||||
]
|
||||
|
||||
result = "\n".join(box)
|
||||
@@ -2778,4 +2776,3 @@ def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_thre
|
||||
# Fallback for parsing errors
|
||||
return html_content[:max_size] if len(html_content) > max_size else html_content
|
||||
|
||||
|
||||
|
||||
@@ -1,644 +0,0 @@
|
||||
# Crawl4AI Docker Guide 🐳
|
||||
|
||||
## Table of Contents
|
||||
- [Prerequisites](#prerequisites)
|
||||
- [Installation](#installation)
|
||||
- [Option 1: Using Docker Compose (Recommended)](#option-1-using-docker-compose-recommended)
|
||||
- [Option 2: Manual Local Build & Run](#option-2-manual-local-build--run)
|
||||
- [Option 3: Using Pre-built Docker Hub Images](#option-3-using-pre-built-docker-hub-images)
|
||||
- [Dockerfile Parameters](#dockerfile-parameters)
|
||||
- [Using the API](#using-the-api)
|
||||
- [Understanding Request Schema](#understanding-request-schema)
|
||||
- [REST API Examples](#rest-api-examples)
|
||||
- [Python SDK](#python-sdk)
|
||||
- [Metrics & Monitoring](#metrics--monitoring)
|
||||
- [Deployment Scenarios](#deployment-scenarios)
|
||||
- [Complete Examples](#complete-examples)
|
||||
- [Server Configuration](#server-configuration)
|
||||
- [Understanding config.yml](#understanding-configyml)
|
||||
- [JWT Authentication](#jwt-authentication)
|
||||
- [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
|
||||
- [Customizing Your Configuration](#customizing-your-configuration)
|
||||
- [Configuration Recommendations](#configuration-recommendations)
|
||||
- [Getting Help](#getting-help)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before we dive in, make sure you have:
|
||||
- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop).
|
||||
- `git` for cloning the repository.
|
||||
- At least 4GB of RAM available for the container (more recommended for heavy use).
|
||||
- Python 3.10+ (if using the Python SDK).
|
||||
- Node.js 16+ (if using the Node.js examples).
|
||||
|
||||
> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
|
||||
|
||||
## Installation
|
||||
|
||||
We offer several ways to get the Crawl4AI server running. Docker Compose is the easiest way to manage local builds and runs.
|
||||
|
||||
### Option 1: Using Docker Compose (Recommended)
|
||||
|
||||
Docker Compose simplifies building and running the service, especially for local development and testing across different platforms.
|
||||
|
||||
#### 1. Clone Repository
|
||||
|
||||
```bash
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
```
|
||||
|
||||
#### 2. Environment Setup (API Keys)
|
||||
|
||||
If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**.
|
||||
|
||||
```bash
|
||||
# Make sure you are in the 'crawl4ai' root directory
|
||||
cp deploy/docker/.llm.env.example .llm.env
|
||||
|
||||
# Now edit .llm.env and add your API keys
|
||||
# Example content:
|
||||
# OPENAI_API_KEY=sk-your-key
|
||||
# ANTHROPIC_API_KEY=your-anthropic-key
|
||||
# ...
|
||||
```
|
||||
> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.
|
||||
|
||||
#### 3. Build and Run with Compose
|
||||
|
||||
The `docker-compose.yml` file in the project root defines services for different scenarios using **profiles**.
|
||||
|
||||
* **Build and Run Locally (AMD64):**
|
||||
```bash
|
||||
# Builds the image locally using Dockerfile and runs it
|
||||
docker compose --profile local-amd64 up --build -d
|
||||
```
|
||||
|
||||
* **Build and Run Locally (ARM64):**
|
||||
```bash
|
||||
# Builds the image locally using Dockerfile and runs it
|
||||
docker compose --profile local-arm64 up --build -d
|
||||
```
|
||||
|
||||
* **Run Pre-built Image from Docker Hub (AMD64):**
|
||||
```bash
|
||||
# Pulls and runs the specified AMD64 image from Docker Hub
|
||||
# (Set VERSION env var for specific tags, e.g., VERSION=0.5.1-d1)
|
||||
docker compose --profile hub-amd64 up -d
|
||||
```
|
||||
|
||||
* **Run Pre-built Image from Docker Hub (ARM64):**
|
||||
```bash
|
||||
# Pulls and runs the specified ARM64 image from Docker Hub
|
||||
docker compose --profile hub-arm64 up -d
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
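
After `docker compose ... up -d` returns, the container may still need a moment before it accepts connections. The following is a minimal sketch of a readiness check that polls the TCP port. The helper name `wait_for_port` and its defaults are illustrative assumptions, not part of Crawl4AI, and the check only confirms that something is listening on the port, not that the API is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; it is not provided by Crawl4AI.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly, then retry.
            time.sleep(interval)
    return False
```

For the default setup described above, a call such as `wait_for_port("localhost", 11235)` would block until the port opens or 30 seconds pass.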
|
||||
|
||||
#### 4. Stopping Compose Services
|
||||
|
||||
```bash
|
||||
# Stop the service(s) associated with a profile (e.g., local-amd64)
|
||||
docker compose --profile local-amd64 down
|
||||
```
|
||||
|
||||
### Option 2: Manual Local Build & Run
|
||||
|
||||
If you prefer not to use Docker Compose for local builds, you can build and run the image manually with `docker buildx` and `docker run`.
|
||||
|
||||
#### 1. Clone Repository & Setup Environment
|
||||
|
||||
Follow steps 1 and 2 from the Docker Compose section above (clone repo, `cd crawl4ai`, create `.llm.env` in the root).
|
||||
|
||||
#### 2. Build the Image (Multi-Arch)
|
||||
|
||||
Use `docker buildx` to build the image. This example builds for multiple platforms and loads the image matching your host architecture into the local Docker daemon.
|
||||
|
||||
```bash
|
||||
# Make sure you are in the 'crawl4ai' root directory
|
||||
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load .
|
||||
```
|
||||
|
||||
#### 3. Run the Container
|
||||
|
||||
* **Basic run (no LLM support):**
|
||||
```bash
|
||||
# Replace --platform if your host is ARM64
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-standalone \
|
||||
--shm-size=1g \
|
||||
--platform linux/amd64 \
|
||||
crawl4ai-local:latest
|
||||
```
|
||||
|
||||
* **With LLM support:**
|
||||
```bash
|
||||
# Make sure .llm.env is in the current directory (project root)
|
||||
# Replace --platform if your host is ARM64
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-standalone \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
--platform linux/amd64 \
|
||||
crawl4ai-local:latest
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
|
||||
|
||||
#### 4. Stopping the Manual Container
|
||||
|
||||
```bash
|
||||
docker stop crawl4ai-standalone && docker rm crawl4ai-standalone
|
||||
```
|
||||
|
||||
### Option 3: Using Pre-built Docker Hub Images
|
||||
|
||||
Pull and run images directly from Docker Hub without building locally.
|
||||
|
||||
#### 1. Pull the Image
|
||||
|
||||
We use a versioning scheme like `LIBRARY_VERSION-dREVISION` (e.g., `0.5.1-d1`). The `latest` tag points to the most recent stable release. Images are built with multi-arch manifests, so Docker usually pulls the correct version for your system automatically.
|
||||
|
||||
```bash
|
||||
# Pull a specific version (recommended for stability)
|
||||
docker pull unclecode/crawl4ai:0.5.1-d1
|
||||
|
||||
# Or pull the latest stable version
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
#### 2. Setup Environment (API Keys)
|
||||
|
||||
If using LLMs, create the `.llm.env` file in a directory of your choice, similar to Step 2 in the Compose section.
|
||||
|
||||
#### 3. Run the Container
|
||||
|
||||
* **Basic run:**
|
||||
```bash
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-hub \
|
||||
--shm-size=1g \
|
||||
unclecode/crawl4ai:0.5.1-d1 # Or use :latest
|
||||
```
|
||||
|
||||
* **With LLM support:**
|
||||
```bash
|
||||
# Make sure .llm.env is in the current directory you are running docker from
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--name crawl4ai-hub \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
unclecode/crawl4ai:0.5.1-d1 # Or use :latest
|
||||
```
|
||||
|
||||
> The server will be available at `http://localhost:11235`.
|
||||
|
||||
#### 4. Stopping the Hub Container
|
||||
|
||||
```bash
|
||||
docker stop crawl4ai-hub && docker rm crawl4ai-hub
|
||||
```
|
||||
|
||||
#### Docker Hub Versioning Explained
|
||||
|
||||
* **Image Name:** `unclecode/crawl4ai`
|
||||
* **Tag Format:** `LIBRARY_VERSION-dREVISION`
|
||||
* `LIBRARY_VERSION`: The Semantic Version of the core `crawl4ai` Python library included (e.g., `0.5.1`).
|
||||
* `dREVISION`: An incrementing number (starting at `d1`) for Docker build changes made *without* changing the library version (e.g., base image updates, dependency fixes). Resets to `d1` for each new `LIBRARY_VERSION`.
|
||||
* **Example:** `unclecode/crawl4ai:0.5.1-d1`
|
||||
* **`latest` Tag:** Points to the most recent stable `LIBRARY_VERSION-dREVISION`.
|
||||
* **Multi-Arch:** Images support `linux/amd64` and `linux/arm64`. Docker automatically selects the correct architecture.
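
The tag format above can be parsed mechanically, for example to pick the newest image from a list of tags. This is a rough sketch only: `ImageTag`, `parse_image_tag`, and `sort_key` are hypothetical names, and the regular expression assumes a plain three-part numeric `LIBRARY_VERSION` (no pre-release suffixes).

```python
import re
from typing import NamedTuple, Tuple


class ImageTag(NamedTuple):
    library_version: str  # e.g. "0.5.1"
    revision: int         # the number after "-d", e.g. 1


# Assumption: LIBRARY_VERSION is MAJOR.MINOR.PATCH with numeric parts only.
_TAG_RE = re.compile(r"^(?P<version>\d+\.\d+\.\d+)-d(?P<revision>\d+)$")


def parse_image_tag(tag: str) -> ImageTag:
    """Split a tag such as '0.5.1-d1' into library version and Docker revision."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"Tag does not follow LIBRARY_VERSION-dREVISION: {tag!r}")
    return ImageTag(match["version"], int(match["revision"]))


def sort_key(tag: str) -> Tuple[Tuple[int, ...], int]:
    """Order tags by numeric library version first, then by Docker revision."""
    parsed = parse_image_tag(tag)
    version_parts = tuple(int(part) for part in parsed.library_version.split("."))
    return version_parts, parsed.revision
```

With these helpers, `max(tags, key=sort_key)` selects the newest tag from a list. Tags such as `latest` do not match the pattern and raise `ValueError`.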
|
||||
|
||||
---
|
||||
|
||||
*(Rest of the document remains largely the same, but with key updates below)*
|
||||
|
||||
---
|
||||
|
||||
## Dockerfile Parameters
|
||||
|
||||
You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file.
|
||||
|
||||
```bash
|
||||
# Example: Build with 'all' features using buildx
|
||||
docker buildx build \
|
||||
--platform linux/amd64,linux/arm64 \
|
||||
--build-arg INSTALL_TYPE=all \
|
||||
-t yourname/crawl4ai-all:latest \
|
||||
--load \
|
||||
. # Build from root context
|
||||
```
|
||||
|
||||
### Build Arguments Explained
|
||||
|
||||
| Argument      | Description                              | Default            | Options                                  |
| :------------ | :--------------------------------------- | :----------------- | :--------------------------------------- |
| INSTALL_TYPE  | Feature set                              | `default`          | `default`, `all`, `torch`, `transformer` |
| ENABLE_GPU    | GPU support (CUDA for AMD64)             | `false`            | `true`, `false`                          |
| APP_HOME      | Install path inside container (advanced) | `/app`             | any valid path                           |
| USE_LOCAL     | Install library from local source        | `true`             | `true`, `false`                          |
| GITHUB_REPO   | Git repo to clone if USE_LOCAL=false     | *(see Dockerfile)* | any git URL                              |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false   | `main`             | any branch name                          |
|
||||
|
||||
*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile)*
|
||||
|
||||
### Build Best Practices
|
||||
|
||||
1. **Choose the Right Install Type**
    * `default`: Basic installation with the smallest image size. Suitable for most standard web scraping and markdown generation.
    * `all`: Full feature set, including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). The image is significantly larger, so choose it only if you need these extras.
2. **Platform Considerations**
    * Use `buildx` for building multi-architecture images, especially for pushing to registries.
    * Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds.
3. **Performance Optimization**
    * The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).
|
||||
|
||||
---
|
||||
|
||||
## Using the API
|
||||
|
||||
Communicate with the running Docker server via its REST API (defaulting to `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests.
|
||||
|
||||
### Python SDK
|
||||
|
||||
Install the SDK: `pip install crawl4ai`
|
||||
|
||||
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode  # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com")  # See Server Configuration section

        # Example Non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),  # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results:  # client.crawl returns None on failure
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")

        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl(  # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")

        # Example Get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}")  # Print whether schema was received

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
*(SDK parameters like timeout, verify_ssl etc. remain the same)*
|
||||
|
||||
### Second Approach: Direct API Calls
|
||||
|
||||
Crucially, when sending configurations directly via JSON, they **must** follow the `{"type": "ClassName", "params": {...}}` structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as `{"type": "dict", "value": {...}}`.
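
As a rough illustration of this wrapping rule, the sketch below builds a payload from two small helpers. `wrap_object` and `wrap_dict` are hypothetical names introduced here for illustration; they are not part of the Crawl4AI SDK, and the class names are passed as plain strings.

```python
import json
from typing import Any, Dict


def wrap_object(class_name: str, **params: Any) -> Dict[str, Any]:
    """Wrap a non-primitive value as {"type": "ClassName", "params": {...}}."""
    return {"type": class_name, "params": params}


def wrap_dict(value: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a plain dictionary as {"type": "dict", "value": {...}}."""
    return {"type": "dict", "value": value}


# A plain dictionary (the CSS extraction schema) must use the dict wrapper.
schema = {
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {"name": "content", "selector": ".content", "type": "html"},
    ],
}

# Config objects and strategies use the type/params wrapper, nested as needed.
crawler_config = wrap_object(
    "CrawlerRunConfig",
    extraction_strategy=wrap_object("JsonCssExtractionStrategy", schema=wrap_dict(schema)),
)

payload_json = json.dumps({"crawler_config": crawler_config}, indent=2)
```

Primitive values such as booleans, numbers, and strings are passed through unchanged inside `params`; only objects and dictionaries need a wrapper.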
|
||||
|
||||
*(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)*
|
||||
|
||||
#### More Examples *(Ensure Schema example uses type/value wrapper)*
|
||||
|
||||
**Advanced Crawler Configuration**
|
||||
*(Keep example, ensure cache_mode uses valid enum value like "bypass")*
|
||||
|
||||
**Extraction Strategy**
|
||||
```json
{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {
                        "type": "dict",
                        "value": {
                            "baseSelector": "article.post",
                            "fields": [
                                {"name": "title", "selector": "h1", "type": "text"},
                                {"name": "content", "selector": ".content", "type": "html"}
                            ]
                        }
                    }
                }
            }
        }
    }
}
```
|
||||
|
||||
**LLM Extraction Strategy** *(Keep example, ensure schema uses type/value wrapper)*
|
||||
*(Keep Deep Crawler Example)*
|
||||
|
||||
### REST API Examples
|
||||
|
||||
Update URLs to use port `11235`.
|
||||
|
||||
#### Simple Crawl
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
# Configuration objects converted to the required JSON structure
|
||||
browser_config_payload = {
|
||||
"type": "BrowserConfig",
|
||||
"params": {"headless": True}
|
||||
}
|
||||
crawler_config_payload = {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum
|
||||
}
|
||||
|
||||
crawl_payload = {
|
||||
"urls": ["https://httpbin.org/html"],
|
||||
"browser_config": browser_config_payload,
|
||||
"crawler_config": crawler_config_payload
|
||||
}
|
||||
response = requests.post(
|
||||
"http://localhost:11235/crawl", # Updated port
|
||||
# headers={"Authorization": f"Bearer {token}"}, # If JWT is enabled
|
||||
json=crawl_payload
|
||||
)
|
||||
print(f"Status Code: {response.status_code}")
|
||||
if response.ok:
|
||||
print(response.json())
|
||||
else:
|
||||
print(f"Error: {response.text}")
|
||||
|
||||
```
|
||||
|
||||
#### Streaming Results
|
||||
|
||||
```python
|
||||
import json
|
||||
import httpx # Use httpx for async streaming example
|
||||
|
||||
async def test_stream_crawl(token: str = None): # Made token optional
|
||||
"""Test the /crawl/stream endpoint with multiple URLs."""
|
||||
url = "http://localhost:11235/crawl/stream" # Updated port
|
||||
payload = {
|
||||
"urls": [
|
||||
"https://httpbin.org/html",
|
||||
"https://httpbin.org/links/5/0",
|
||||
],
|
||||
"browser_config": {
|
||||
"type": "BrowserConfig",
|
||||
"params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict
|
||||
},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {"stream": True, "cache_mode": "bypass"}
|
||||
}
|
||||
}
|
||||
|
||||
headers = {}
|
||||
# if token:
|
||||
# headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled
|
||||
|
||||
try:
|
||||
async with httpx.AsyncClient() as client:
|
||||
async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
|
||||
print(f"Status: {response.status_code} (Expected: 200)")
|
||||
response.raise_for_status() # Raise exception for bad status codes
|
||||
|
||||
# Read streaming response line-by-line (NDJSON)
|
||||
async for line in response.aiter_lines():
|
||||
if line:
|
||||
try:
|
||||
data = json.loads(line)
|
||||
# Check for completion marker
|
||||
if data.get("status") == "completed":
|
||||
print("Stream completed.")
|
||||
break
|
||||
print(f"Streamed Result: {json.dumps(data, indent=2)}")
|
||||
except json.JSONDecodeError:
|
||||
print(f"Warning: Could not decode JSON line: {line}")
|
||||
|
||||
except httpx.HTTPStatusError as e:
|
||||
print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
|
||||
except Exception as e:
|
||||
print(f"Error in streaming crawl test: {str(e)}")
|
||||
|
||||
# To run this example:
|
||||
# import asyncio
|
||||
# asyncio.run(test_stream_crawl())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Metrics & Monitoring
|
||||
|
||||
Keep an eye on your crawler with these endpoints:
|
||||
|
||||
- `/health` - Quick health check
|
||||
- `/metrics` - Detailed Prometheus metrics
|
||||
- `/schema` - Full API schema
|
||||
|
||||
Example health check:
|
||||
```bash
|
||||
curl http://localhost:11235/health
|
||||
```
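The same check can be scripted. The sketch below uses only the standard library and assumes the server listens on `http://localhost:11235`, as in the examples in this guide. The function name `is_healthy` and its `opener` parameter are hypothetical, and it only inspects the HTTP status, since the shape of the `/health` response body is not described here.

```python
import urllib.request
from typing import Callable


def is_healthy(base_url: str = "http://localhost:11235",
               opener: Callable = urllib.request.urlopen,
               timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False
```

Calling `is_healthy()` from a cron job or a deployment script gives a simple up/down signal; the `opener` argument exists only so the function can be exercised without a running server.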
|
||||
|
||||
---
|
||||
|
||||
*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)*
|
||||
|
||||
---
|
||||
|
||||
## Server Configuration
|
||||
|
||||
The server's behavior can be customized through the `config.yml` file.
|
||||
|
||||
### Understanding config.yml
|
||||
|
||||
The configuration file is loaded from `/app/config.yml` inside the container. By default, the file from `deploy/docker/config.yml` in the repository is copied there during the build.
|
||||
|
||||
Here's a detailed breakdown of the configuration options (using defaults from `deploy/docker/config.yml`):
|
||||
|
||||
```yaml
|
||||
# Application Configuration
|
||||
app:
|
||||
title: "Crawl4AI API"
|
||||
version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1"
|
||||
host: "0.0.0.0"
|
||||
port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
|
||||
reload: False # Default set to False - suitable for production
|
||||
timeout_keep_alive: 300
|
||||
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini"
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
|
||||
# Redis Configuration (Used by internal Redis server managed by supervisord)
|
||||
redis:
|
||||
host: "localhost"
|
||||
port: 6379
|
||||
db: 0
|
||||
password: ""
|
||||
# ... other redis options ...
|
||||
|
||||
# Rate Limiting Configuration
|
||||
rate_limiting:
|
||||
enabled: True
|
||||
default_limit: "1000/minute"
|
||||
trusted_proxies: []
|
||||
storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits
|
||||
|
||||
# Security Configuration
|
||||
security:
|
||||
enabled: false # Master toggle for security features
|
||||
jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
|
||||
https_redirect: false # Force HTTPS (requires security.enabled=true)
|
||||
trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
|
||||
headers: # Security headers (applied if security.enabled=true)
|
||||
x_content_type_options: "nosniff"
|
||||
x_frame_options: "DENY"
|
||||
content_security_policy: "default-src 'self'"
|
||||
strict_transport_security: "max-age=63072000; includeSubDomains"
|
||||
|
||||
# Crawler Configuration
|
||||
crawler:
|
||||
memory_threshold_percent: 95.0
|
||||
rate_limiter:
|
||||
base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher
|
||||
timeouts:
|
||||
stream_init: 30.0 # Timeout for stream initialization
|
||||
batch_process: 300.0 # Timeout for non-streaming /crawl processing
|
||||
|
||||
# Logging Configuration
|
||||
logging:
|
||||
level: "INFO"
|
||||
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
||||
|
||||
# Observability Configuration
|
||||
observability:
|
||||
prometheus:
|
||||
enabled: True
|
||||
endpoint: "/metrics"
|
||||
health_check:
|
||||
endpoint: "/health"
|
||||
```
|
||||
|
||||
*(JWT Authentication section remains the same, just note the default port is now 11235 for requests)*
|
||||
|
||||
*(Configuration Tips and Best Practices remain the same)*
|
||||
|
||||
### Customizing Your Configuration
|
||||
|
||||
You can override the default `config.yml`.
|
||||
|
||||
#### Method 1: Modify Before Build
|
||||
|
||||
1. Edit the `deploy/docker/config.yml` file in your local repository clone.
|
||||
2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image.
|
||||
|
||||
#### Method 2: Runtime Mount (Recommended for Custom Deploys)
|
||||
|
||||
1. Create your custom configuration file, e.g., `my-custom-config.yml` locally. Ensure it contains all necessary sections.
|
||||
2. Mount it when running the container:
|
||||
|
||||
* **Using `docker run`:**
|
||||
```bash
|
||||
# Assumes my-custom-config.yml is in the current directory
|
||||
docker run -d -p 11235:11235 \
|
||||
--name crawl4ai-custom-config \
|
||||
--env-file .llm.env \
|
||||
--shm-size=1g \
|
||||
-v $(pwd)/my-custom-config.yml:/app/config.yml \
|
||||
unclecode/crawl4ai:latest # Or your specific tag
|
||||
```
|
||||
|
||||
* **Using `docker-compose.yml`:** Add a `volumes` section to the service definition:
|
||||
```yaml
|
||||
services:
|
||||
crawl4ai-hub-amd64: # Or your chosen service
|
||||
image: unclecode/crawl4ai:latest
|
||||
profiles: ["hub-amd64"]
|
||||
<<: *base-config
|
||||
volumes:
|
||||
# Mount local custom config over the default one in the container
|
||||
- ./my-custom-config.yml:/app/config.yml
|
||||
# Keep the shared memory volume from base-config
|
||||
- /dev/shm:/dev/shm
|
||||
```
|
||||
*(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`)*
|
||||
|
||||
> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
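Because the mounted file replaces the default entirely, a quick pre-flight check can catch a forgotten section. The sketch below compares a parsed config against the top-level sections of the default `config.yml` listed earlier in this section. Whether the server tolerates a missing section is not stated here, so treat this as a conservative check; `missing_sections` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Top-level sections present in the default config.yml.
REQUIRED_SECTIONS = (
    "app", "llm", "redis", "rate_limiting",
    "security", "crawler", "logging", "observability",
)


def missing_sections(config: Dict[str, Any]) -> List[str]:
    """Return the default top-level sections absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# In practice `config` would come from a YAML parser (for example PyYAML's
# yaml.safe_load); a literal dict stands in for it here.
custom = {"app": {"port": 11235}, "llm": {"provider": "openai/gpt-4o-mini"}}
print(missing_sections(custom))
```

An empty list means every default section is present; anything else names what still has to be copied over from `deploy/docker/config.yml`.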
|
||||
|
||||
### Configuration Recommendations
|
||||
|
||||
1. **Security First** 🔒
|
||||
- Always enable security in production
|
||||
- Use specific trusted_hosts instead of wildcards
|
||||
- Set up proper rate limiting to protect your server
|
||||
- Consider your environment before enabling HTTPS redirect
|
||||
|
||||
2. **Resource Management** 💻
|
||||
- Adjust memory_threshold_percent based on available RAM
|
||||
- Set timeouts according to your content size and network conditions
|
||||
- Use Redis for rate limiting in multi-container setups
|
||||
|
||||
3. **Monitoring** 📊
|
||||
- Enable Prometheus if you need metrics
|
||||
- Set DEBUG logging in development, INFO in production
|
||||
- Regular health check monitoring is crucial
|
||||
|
||||
4. **Performance Tuning** ⚡
|
||||
- Start with conservative rate limiter delays
|
||||
- Increase batch_process timeout for large content
|
||||
- Adjust stream_init timeout based on initial response times
|
||||
|
||||
## Getting Help
|
||||
|
||||
We're here to help you succeed with Crawl4AI! Here's how to get support:
|
||||
|
||||
- 📖 Check our [full documentation](https://docs.crawl4ai.com)
|
||||
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
|
||||
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
|
||||
- ⭐ Star us on GitHub to show support!
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
|
||||
- Building and running the Docker container
|
||||
- Configuring the environment
|
||||
- Making API requests with proper typing
|
||||
- Using the Python SDK
|
||||
- Monitoring your deployment
|
||||
|
||||
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
|
||||
|
||||
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
|
||||
|
||||
Happy crawling! 🕷️
|
||||
@@ -3,9 +3,9 @@ app:
|
||||
title: "Crawl4AI API"
|
||||
version: "1.0.0"
|
||||
host: "0.0.0.0"
|
||||
port: 8020
|
||||
port: 11235
|
||||
reload: False
|
||||
workers: 4
|
||||
workers: 1
|
||||
timeout_keep_alive: 300
|
||||
|
||||
# Default LLM Configuration
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
fastapi==0.115.12
|
||||
uvicorn==0.34.2
|
||||
fastapi>=0.115.12
|
||||
uvicorn>=0.34.2
|
||||
gunicorn>=23.0.0
|
||||
slowapi==0.1.9
|
||||
prometheus-fastapi-instrumentator>=7.1.0
|
||||
@@ -8,8 +8,9 @@ jwt>=1.3.1
|
||||
dnspython>=2.7.0
|
||||
email-validator==2.2.0
|
||||
sse-starlette==2.2.1
|
||||
pydantic==2.11
|
||||
pydantic>=2.11
|
||||
rank-bm25==0.2.2
|
||||
anyio==4.9.0
|
||||
PyJWT==2.10.1
|
||||
|
||||
mcp>=1.6.0
|
||||
websockets>=15.0.1
|
||||
|
||||
@@ -629,6 +629,7 @@ async def get_context(
|
||||
|
||||
|
||||
# attach MCP layer (adds /mcp/ws, /mcp/sse, /mcp/schema)
|
||||
print(f"MCP server running on {config['app']['host']}:{config['app']['port']}")
|
||||
attach_mcp(
|
||||
app,
|
||||
base_url=f"http://{config['app']['host']}:{config['app']['port']}"
|
||||
|
||||
@@ -193,7 +193,48 @@
|
||||
<textarea id="urls" class="w-full bg-dark border border-border rounded p-2 h-32 text-sm mb-4"
|
||||
spellcheck="false">https://example.com</textarea>
|
||||
|
||||
<details class="mb-4">
|
||||
<!-- Specific options for /md endpoint -->
|
||||
<details id="md-options" class="mb-4 hidden">
|
||||
<summary class="text-sm text-secondary cursor-pointer">/md Options</summary>
|
||||
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||
<div>
|
||||
<label for="md-filter" class="block text-xs text-secondary mb-1">Filter Type</label>
|
||||
<select id="md-filter" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
<option value="fit">fit - Adaptive content filtering</option>
|
||||
<option value="raw">raw - No filtering</option>
|
||||
<option value="bm25">bm25 - BM25 keyword relevance</option>
|
||||
<option value="llm">llm - LLM-based filtering</option>
|
||||
</select>
|
||||
</div>
|
||||
<div>
|
||||
<label for="md-query" class="block text-xs text-secondary mb-1">Query (for BM25/LLM filters)</label>
|
||||
<input id="md-query" type="text" placeholder="Enter search terms or instructions"
|
||||
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
</div>
|
||||
<div>
|
||||
<label for="md-cache" class="block text-xs text-secondary mb-1">Cache Mode</label>
|
||||
<select id="md-cache" class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
<option value="0">Write-Only (0)</option>
|
||||
<option value="1">Enabled (1)</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<!-- Specific options for /llm endpoint -->
|
||||
<details id="llm-options" class="mb-4 hidden">
|
||||
<summary class="text-sm text-secondary cursor-pointer">/llm Options</summary>
|
||||
<div class="mt-2 space-y-3 p-2 border border-border rounded">
|
||||
<div>
|
||||
<label for="llm-question" class="block text-xs text-secondary mb-1">Question</label>
|
||||
<input id="llm-question" type="text" value="What is this page about?"
|
||||
class="bg-dark border border-border rounded px-2 py-1 text-sm w-full">
|
||||
</div>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<!-- Advanced config for /crawl endpoints -->
|
||||
<details id="adv-config" class="mb-4">
|
||||
<summary class="text-sm text-secondary cursor-pointer">Advanced Config <span
|
||||
class="text-xs text-primary">(Python → auto‑JSON)</span></summary>
|
||||
|
||||
@@ -437,6 +478,33 @@
|
||||
cm.setValue(TEMPLATES[e.target.value]);
|
||||
document.getElementById('cfg-status').textContent = '';
|
||||
});
|
||||
|
||||
// Handle endpoint selection change to show appropriate options
|
||||
document.getElementById('endpoint').addEventListener('change', function(e) {
|
||||
const endpoint = e.target.value;
|
||||
const mdOptions = document.getElementById('md-options');
|
||||
const llmOptions = document.getElementById('llm-options');
|
||||
const advConfig = document.getElementById('adv-config');
|
||||
|
||||
// Hide all option sections first
|
||||
mdOptions.classList.add('hidden');
|
||||
llmOptions.classList.add('hidden');
|
||||
advConfig.classList.add('hidden');
|
||||
|
||||
// Show the appropriate section based on endpoint
|
||||
if (endpoint === 'md') {
|
||||
mdOptions.classList.remove('hidden');
|
||||
// Auto-open the /md options
|
||||
mdOptions.setAttribute('open', '');
|
||||
} else if (endpoint === 'llm') {
|
||||
llmOptions.classList.remove('hidden');
|
||||
// Auto-open the /llm options
|
||||
llmOptions.setAttribute('open', '');
|
||||
} else {
|
||||
// For /crawl endpoints, show the advanced config
|
||||
advConfig.classList.remove('hidden');
|
||||
}
|
||||
});
|
||||
|
||||
async function pyConfigToJson() {
|
||||
const code = cm.getValue().trim();
|
||||
@@ -494,10 +562,18 @@
|
||||
}
|
||||
|
||||
// Generate code snippets
|
||||
function generateSnippets(api, payload) {
|
||||
function generateSnippets(api, payload, method = 'POST') {
|
||||
// Python snippet
|
||||
const pyCodeEl = document.querySelector('#python-content code');
|
||||
const pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
|
||||
let pySnippet;
|
||||
|
||||
if (method === 'GET') {
|
||||
// GET request (for /llm endpoint)
|
||||
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.get(\n "${window.location.origin}${api}"\n )\n return response.json()`;
|
||||
} else {
|
||||
// POST request (for /crawl and /md endpoints)
|
||||
pySnippet = `import httpx\n\nasync def crawl():\n async with httpx.AsyncClient() as client:\n response = await client.post(\n "${window.location.origin}${api}",\n json=${JSON.stringify(payload, null, 4).replace(/\n/g, '\n ')}\n )\n return response.json()`;
|
||||
}
|
||||
|
||||
pyCodeEl.textContent = pySnippet;
|
||||
pyCodeEl.className = 'python hljs'; // Reset classes
|
||||
@@ -505,7 +581,15 @@
|
||||
|
||||
// cURL snippet
|
||||
const curlCodeEl = document.querySelector('#curl-content code');
|
||||
const curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
|
||||
let curlSnippet;
|
||||
|
||||
if (method === 'GET') {
|
||||
// GET request (for /llm endpoint)
|
||||
curlSnippet = `curl -X GET "${window.location.origin}${api}"`;
|
||||
} else {
|
||||
// POST request (for /crawl and /md endpoints)
|
||||
curlSnippet = `curl -X POST ${window.location.origin}${api} \\\n -H "Content-Type: application/json" \\\n -d '${JSON.stringify(payload)}'`;
|
||||
}
|
||||
|
||||
curlCodeEl.textContent = curlSnippet;
|
||||
curlCodeEl.className = 'bash hljs'; // Reset classes
|
||||
@@ -536,16 +620,39 @@
|
||||
|
||||
const endpointMap = {
|
||||
crawl: '/crawl',
|
||||
crawl_stream: '/crawl/stream',
|
||||
// crawl_stream: '/crawl/stream',
|
||||
md: '/md',
|
||||
llm: '/llm'
|
||||
};
|
||||
|
||||
const api = endpointMap[endpoint];
|
||||
const payload = {
|
||||
urls,
|
||||
...advConfig
|
||||
};
|
||||
let payload;
|
||||
|
||||
// Create appropriate payload based on endpoint type
|
||||
if (endpoint === 'md') {
|
||||
// Get values from the /md specific inputs
|
||||
const filterType = document.getElementById('md-filter').value;
|
||||
const query = document.getElementById('md-query').value.trim();
|
||||
const cache = document.getElementById('md-cache').value;
|
||||
|
||||
// MD endpoint expects: { url, f, q, c }
|
||||
payload = {
|
||||
url: urls[0], // Take first URL
|
||||
f: filterType, // Lowercase filter type as required by server
|
||||
q: query || null, // Use the query if provided, otherwise null
|
||||
c: cache
|
||||
};
|
||||
} else if (endpoint === 'llm') {
|
||||
// LLM endpoint has a different URL pattern and uses query params
|
||||
// This will be handled directly in the fetch below
|
||||
payload = null;
|
||||
} else {
|
||||
// Default payload for /crawl and /crawl/stream
|
||||
payload = {
|
||||
urls,
|
||||
...advConfig
|
||||
};
|
||||
}
|
||||
|
||||
updateStatus('processing');
|
||||
|
||||
@@ -553,7 +660,18 @@
|
||||
const startTime = performance.now();
|
||||
let response, responseData;
|
||||
|
||||
if (endpoint === 'crawl_stream') {
|
||||
if (endpoint === 'llm') {
|
||||
// Special handling for LLM endpoint which uses URL pattern: /llm/{encoded_url}?q={query}
|
||||
const url = urls[0];
|
||||
const encodedUrl = encodeURIComponent(url);
|
||||
// Get the question from the LLM-specific input
|
||||
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||
|
||||
response = await fetch(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, {
|
||||
method: 'GET',
|
||||
headers: { 'Accept': 'application/json' }
|
||||
});
|
||||
} else if (endpoint === 'crawl_stream') {
|
||||
// Stream processing
|
||||
response = await fetch(api, {
|
||||
method: 'POST',
|
||||
@@ -593,7 +711,7 @@
|
||||
document.querySelector('#response-content code').className = 'json hljs'; // Reset classes
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
} else {
|
||||
// Regular request
|
||||
// Regular request (handles /crawl and /md)
|
||||
response = await fetch(api, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
@@ -621,7 +739,16 @@
|
||||
}
|
||||
|
||||
forceHighlightElement(document.querySelector('#response-content code'));
|
||||
generateSnippets(api, payload);
|
||||
|
||||
// For generateSnippets, handle the LLM case specially
|
||||
if (endpoint === 'llm') {
|
||||
const url = urls[0];
|
||||
const encodedUrl = encodeURIComponent(url);
|
||||
const question = document.getElementById('llm-question').value.trim() || "What is this page about?";
|
||||
generateSnippets(`${api}/${encodedUrl}?q=${encodeURIComponent(question)}`, null, 'GET');
|
||||
} else {
|
||||
generateSnippets(api, payload);
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Error:', error);
|
||||
updateStatus('error');
|
||||
@@ -803,9 +930,24 @@
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
// Function to initialize UI based on selected endpoint
|
||||
function initUI() {
|
||||
// Trigger the endpoint change handler to set initial UI state
|
||||
const endpointSelect = document.getElementById('endpoint');
|
||||
const event = new Event('change');
|
||||
endpointSelect.dispatchEvent(event);
|
||||
|
||||
// Initialize copy buttons
|
||||
initCopyButtons();
|
||||
}
|
||||
|
||||
// Call this in your DOMContentLoaded or initialization
|
||||
initCopyButtons();
|
||||
// Initialize on page load
|
||||
document.addEventListener('DOMContentLoaded', initUI);
|
||||
// Also call it immediately in case the script runs after DOM is already loaded
|
||||
if (document.readyState !== 'loading') {
|
||||
initUI();
|
||||
}
|
||||
|
||||
</script>
|
||||
</body>
|
||||
|
||||
@@ -14,7 +14,7 @@ stderr_logfile=/dev/stderr ; Redirect redis stderr to container stderr
|
||||
stderr_logfile_maxbytes=0
|
||||
|
||||
[program:gunicorn]
|
||||
command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 2 --threads 2 --timeout 120 --graceful-timeout 30 --keep-alive 60 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
|
||||
command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 1 --threads 4 --timeout 1800 --graceful-timeout 30 --keep-alive 300 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
|
||||
directory=/app ; Working directory for the app
|
||||
user=appuser ; Run gunicorn as our non-root user
|
||||
autorestart=true
|
||||
|
||||
@@ -1,19 +1,11 @@
|
||||
# docker-compose.yml
|
||||
version: '3.8'
|
||||
|
||||
# Base configuration anchor for reusability
|
||||
# Shared configuration for all environments
|
||||
x-base-config: &base-config
|
||||
ports:
|
||||
# Map host port 11235 to container port 11235 (where Gunicorn will listen)
|
||||
- "11235:11235"
|
||||
# - "8080:8080" # Uncomment if needed
|
||||
|
||||
# Load API keys primarily from .llm.env file
|
||||
# Create .llm.env in the root directory .llm.env.example
|
||||
- "11235:11235" # Gunicorn port
|
||||
env_file:
|
||||
- .llm.env
|
||||
|
||||
# Define environment variables, allowing overrides from host environment
|
||||
# Syntax ${VAR:-} uses host env var 'VAR' if set, otherwise uses value from .llm.env
|
||||
- .llm.env # API keys (create from .llm.env.example)
|
||||
environment:
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
|
||||
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
|
||||
@@ -22,10 +14,8 @@ x-base-config: &base-config
|
||||
- TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
|
||||
- MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
|
||||
- GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}
|
||||
|
||||
volumes:
|
||||
# Mount /dev/shm for Chromium/Playwright performance
|
||||
- /dev/shm:/dev/shm
|
||||
- /dev/shm:/dev/shm # Chromium performance
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
@@ -34,47 +24,26 @@ x-base-config: &base-config
|
||||
memory: 1G
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
# IMPORTANT: Ensure Gunicorn binds to 11235 in supervisord.conf
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s # Give the server time to start
|
||||
# Run the container as the non-root user defined in the Dockerfile
|
||||
start_period: 40s
|
||||
user: "appuser"
|
||||
|
||||
services:
|
||||
# --- Local Build Services ---
|
||||
crawl4ai-local-amd64:
|
||||
crawl4ai:
|
||||
# 1. Default: Pull multi-platform test image from Docker Hub
|
||||
# 2. Override with local image via: IMAGE=local-test docker compose up
|
||||
image: ${IMAGE:-unclecode/crawl4ai:${TAG:-latest}}
|
||||
|
||||
# Local build config (used with --build)
|
||||
build:
|
||||
context: . # Build context is the root directory
|
||||
dockerfile: Dockerfile # Dockerfile is in the root directory
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-default}
|
||||
ENABLE_GPU: ${ENABLE_GPU:-false}
|
||||
# PYTHON_VERSION arg is omitted as it's fixed by 'FROM python:3.10-slim' in Dockerfile
|
||||
platform: linux/amd64
|
||||
profiles: ["local-amd64"]
|
||||
<<: *base-config # Inherit base configuration
|
||||
|
||||
crawl4ai-local-arm64:
|
||||
build:
|
||||
context: . # Build context is the root directory
|
||||
dockerfile: Dockerfile # Dockerfile is in the root directory
|
||||
args:
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-default}
|
||||
ENABLE_GPU: ${ENABLE_GPU:-false}
|
||||
platform: linux/arm64
|
||||
profiles: ["local-arm64"]
|
||||
<<: *base-config
|
||||
|
||||
# --- Docker Hub Image Services ---
|
||||
crawl4ai-hub-amd64:
|
||||
image: unclecode/crawl4ai:${VERSION:-latest}-amd64
|
||||
profiles: ["hub-amd64"]
|
||||
<<: *base-config
|
||||
|
||||
crawl4ai-hub-arm64:
|
||||
image: unclecode/crawl4ai:${VERSION:-latest}-arm64
|
||||
profiles: ["hub-arm64"]
|
||||
|
||||
# Inherit shared config
|
||||
<<: *base-config
|
||||
126
docs/apps/linkdin/README.md
Normal file
@@ -0,0 +1,126 @@
|
||||
# Crawl4AI Prospect‑Wizard – step‑by‑step guide
|
||||
|
||||
A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.
|
||||
|
||||
```
|
||||
prospect‑wizard/
├─ c4ai_discover.py          # Stage 1 – scrape companies + people
├─ c4ai_insights.py          # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html  # Stage 3 – graph viewer (static HTML)
└─ data/                     # output lands here (*.jsonl / *.json)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1 Install & boot a LinkedIn profile (one‑time)
|
||||
|
||||
### 1.1 Install dependencies
|
||||
```bash
|
||||
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
|
||||
```
|
||||
|
||||
### 1.2 Create / warm a LinkedIn browser profile
|
||||
```bash
|
||||
crwl profiler
|
||||
```
|
||||
1. The interactive shell shows **New profile** – hit **enter**.
|
||||
2. Choose a name, e.g. `profile_linkedin_uc`.
|
||||
3. A Chromium window opens – log in to LinkedIn, solve any CAPTCHA that appears, then close the window.
|
||||
|
||||
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
|
||||
|
||||
---
|
||||
|
||||
## 2 Discovery – scrape companies & people
|
||||
|
||||
```bash
|
||||
# --geo 102713980 is the Malaysia geoUrn.
# --title_filters may be empty or a list such as "Product,Engineering".
# --max_companies and --max_people default to small values for workshops.
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title_filters "" \
  --max_companies 10 \
  --max_people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log_level debug
|
||||
```
|
||||
**Outputs** in `./data/`:
|
||||
* `companies.jsonl` – one JSON per company
|
||||
* `people.jsonl` – one JSON per employee
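Since both outputs hold one JSON object per line, they can be loaded with a few lines of standard-library Python. This is a minimal sketch; `read_jsonl` is a hypothetical helper, and no field names are assumed because the record layout is not listed here.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `companies = list(read_jsonl(Path("data/companies.jsonl")))` gives a list of dictionaries that can be inspected before running Stage 2.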
|
||||
|
||||
🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
|
||||
|
||||
### Handy geoUrn cheatsheet
|
||||
| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| United States | **103644922** |
| United Kingdom | **102221843** |
| Australia | **101452733** |
|
||||
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
|
||||
|
||||
---
|
||||
|
||||
## 3 Insights – embeddings, org‑charts, decision makers
|
||||
|
||||
```bash
|
||||
python c4ai_insights.py \
|
||||
--in ./data \
|
||||
--out ./data \
|
||||
--embed_model all-MiniLM-L6-v2 \
|
||||
--top_k 10 \
|
||||
--openai_model gpt-4.1 \
|
||||
--max_llm_tokens 8024 \
|
||||
--llm_temperature 1.0 \
|
||||
--workers 4
|
||||
```
|
||||
Emits next to the Stage‑1 files:
|
||||
* `company_graph.json` – inter‑company similarity graph
|
||||
* `org_chart_<handle>.json` – one per company
|
||||
* `decision_makers.csv` – hand‑picked ‘who to pitch’ list
|
||||
|
||||
Flags reference (straight from `build_arg_parser()`):
|
||||
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage‑1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
| `--top_k` | `10` | Neighbours per company in graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Creativity knob |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
|
||||
|
||||
---
|
||||
|
||||
## 4 Visualise – interactive graph
|
||||
|
||||
After Stage 2 completes, simply open the HTML viewer from the project root:
|
||||
```bash
|
||||
open graph_view_template.html   # or serve the folder, e.g. with Live Server or `python -m http.server`
|
||||
```
|
||||
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
|
||||
|
||||
* Left pane → list of companies (clans).
|
||||
* Click a node to load its org‑chart on the right.
|
||||
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
|
||||
|
||||
---
|
||||
|
||||
## 5 Common snags
|
||||
|
||||
| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
| Blank graph | Check JSON paths, clear `localStorage` in browser |
|
||||
|
||||
---
|
||||
|
||||
### TL;DR
|
||||
`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
|
||||
Live long and `import crawl4ai`.
|
||||
|
||||
440
docs/apps/linkdin/c4ai_discover.py
Normal file
@@ -0,0 +1,440 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
c4ai-discover — Stage‑1 Discovery CLI
|
||||
|
||||
Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
|
||||
JSON files: companies.jsonl and people.jsonl.
|
||||
|
||||
Key design rules
|
||||
----------------
|
||||
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
|
||||
* JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
|
||||
from sample HTML provided by user and then cached under ./schemas/.
|
||||
* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
|
||||
* If executed as a console script (argv > 1), CLI flags win.
|
||||
* Lightweight deps: argparse + Crawl4AI stack.
|
||||
|
||||
Author: Tom @ Kidocode 2025‑04‑26
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import warnings, re
|
||||
warnings.filterwarnings(
|
||||
"ignore",
|
||||
message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
|
||||
category=FutureWarning,
|
||||
module=r"soupsieve"
|
||||
)
|
||||
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Imports
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
import argparse
|
||||
import random
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import pathlib
|
||||
import sys
|
||||
# 3rd-party rich for pretty logging
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
|
||||
from datetime import datetime, UTC
|
||||
from itertools import cycle
|
||||
from textwrap import dedent
|
||||
from types import SimpleNamespace
|
||||
from typing import Dict, List, Optional
|
||||
from urllib.parse import quote
|
||||
from pathlib import Path
|
||||
from glob import glob
|
||||
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CacheMode,
|
||||
CrawlerRunConfig,
|
||||
JsonCssExtractionStrategy,
|
||||
BrowserProfiler,
|
||||
LLMConfig,
|
||||
)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Constants / paths
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
SCHEMA_DIR = BASE_DIR / "schemas"
|
||||
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
|
||||
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
|
||||
|
||||
# ---------- deterministic target JSON examples ----------
|
||||
_COMPANY_SCHEMA_EXAMPLE = {
|
||||
"handle": "/company/posify/",
|
||||
"profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
|
||||
"name": "Management Research Services, Inc. (MRS, Inc)",
|
||||
"descriptor": "Insurance • Milwaukee, Wisconsin",
|
||||
"about": "Insurance • Milwaukee, Wisconsin",
|
||||
"followers": 1000
|
||||
}
|
||||
|
||||
_PEOPLE_SCHEMA_EXAMPLE = {
|
||||
"profile_url": "https://www.linkedin.com/in/lily-ng/",
|
||||
"name": "Lily Ng",
|
||||
"headline": "VP Product @ Posify",
|
||||
"followers": 890,
|
||||
"connection_degree": "2nd",
|
||||
"avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
|
||||
}
|
||||
|
||||
# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
|
||||
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
|
||||
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
|
||||
|
||||
# --------- tighter schema prompts ----------
|
||||
_COMPANY_SCHEMA_QUERY = dedent(
|
||||
"""
|
||||
Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
|
||||
for every card, outputs *exactly* the keys shown in the example JSON below.
|
||||
JSON spec:
|
||||
• handle – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
|
||||
• profile_image – absolute URL of the <img> inside that link
|
||||
• name – text of the <a> inside the <span class*='t-16'>
|
||||
• descriptor – text line with industry • location
|
||||
• about – text of the <div class*='t-normal'> below the name (industry + geo)
|
||||
• followers – integer parsed from the <div> containing 'followers'
|
||||
|
||||
IMPORTANT: Do not target elements by hashed (base64-like) class names; they are not reliable.
The parent <div> that contains these <li> elements is "div.search-results-container"; you can use it.
The parent <ul> has role="list". These two selectors should be enough to target the <li> elements.
|
||||
"""
|
||||
)
|
||||
|
||||
_PEOPLE_SCHEMA_QUERY = dedent(
|
||||
"""
|
||||
Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
|
||||
outputs exactly the keys in the example JSON below.
|
||||
Fields:
|
||||
• profile_url – href of the outermost profile link
|
||||
• name – text inside artdeco-entity-lockup__title
|
||||
• headline – inner text of artdeco-entity-lockup__subtitle
|
||||
• followers – integer parsed from the span inside lt-line-clamp--multi-line
|
||||
• connection_degree – '1st', '2nd', etc. from artdeco-entity-lockup__badge
|
||||
• avatar_url – src of the <img> within artdeco-entity-lockup__image
|
||||
|
||||
IMPORTANT: Do not target elements by hashed (base64-like) class names; they are not reliable.
The parent <div> that contains these <li> elements has the classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
|
||||
"""
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Utility helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _load_or_build_schema(
|
||||
path: pathlib.Path,
|
||||
sample_html: str,
|
||||
query: str,
|
||||
example_json: Dict,
|
||||
force = False
|
||||
) -> Dict:
|
||||
"""Load schema from path, else call generate_schema once and persist."""
|
||||
if path.exists() and not force:
|
||||
return json.loads(path.read_text())
|
||||
|
||||
logging.info("[SCHEMA] Generating schema %s", path.name)
|
||||
schema = JsonCssExtractionStrategy.generate_schema(
|
||||
html=sample_html,
|
||||
llm_config=LLMConfig(
|
||||
provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
|
||||
api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
|
||||
),
|
||||
query=query,
|
||||
target_json_example=json.dumps(example_json, indent=2),
|
||||
)
|
||||
path.write_text(json.dumps(schema, indent=2))
|
||||
return schema
|
||||
|
||||
|
||||
def _openai_friendly_number(text: str) -> Optional[int]:
    """Parse a count such as '1K followers' (-> 1000) or '1,234' (-> 1234)."""
    # `re` is already imported at module level; decimals such as '1.5K' are kept.
    m = re.search(r"(\d+(?:\.\d+)?)\s*([kKmM])?\b", text.replace(",", ""))
    if not m:
        return None
    val = float(m.group(1))
    suffix = (m.group(2) or "").lower()
    # Apply a multiplier only when the suffix is attached to the number, so a
    # word such as "members" cannot trigger a spurious "m" multiplier.
    if suffix == "k":
        val *= 1_000
    elif suffix == "m":
        val *= 1_000_000
    return int(round(val))
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core async workers
|
||||
# ---------------------------------------------------------------------------
|
||||
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
    """Paginate 10-item company search pages until `limit` reached."""
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        wait_for=".search-marvel-srp",
        session_id="company_search",
        delay_before_return_html=1,
        magic=True,
        verbose=False,
    )
    companies, page = [], 1
    while len(companies) < max(limit, 10):
        paged_url = f"{url}&page={page}"
        res = await crawler.arun(paged_url, config=cfg)
        # Stop on a failed fetch or empty extraction instead of crashing in json.loads.
        if not res[0].success or not res[0].extracted_content:
            break
        batch = json.loads(res[0].extracted_content)
        if not batch:
            break
        for item in batch:
            name = (item.get("name") or "").strip()
            handle = (item.get("handle") or "").strip()
            if not handle or not name:
                continue
            companies.append(
                {
                    "handle": handle,
                    "name": name,
                    "descriptor": item.get("descriptor"),
                    "about": item.get("about"),
                    "followers": _openai_friendly_number(str(item.get("followers", ""))),
                    "people_url": f"{handle}people/",
                    # strftime avoids the malformed "+00:00Z" suffix that isoformat() + "Z" yields.
                    "captured_at": datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ"),
                }
            )
        # Log before incrementing so the reported page is the one just scraped.
        logging.info(
            f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
        )
        page += 1

    return companies[:max(limit, 10)]
|
||||
|
||||
|
||||
async def crawl_people_page(
|
||||
crawler: AsyncWebCrawler,
|
||||
people_url: str,
|
||||
schema: Dict,
|
||||
limit: int,
|
||||
title_kw: str,
|
||||
) -> List[Dict]:
|
||||
people_u = f"{people_url}?keywords={quote(title_kw)}"
|
||||
extraction = JsonCssExtractionStrategy(schema)
|
||||
cfg = CrawlerRunConfig(
|
||||
extraction_strategy=extraction,
|
||||
# scan_full_page=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
magic=True,
|
||||
wait_for=".org-people-profile-card__card-spacing",
|
||||
delay_before_return_html=1,
|
||||
session_id="people_search",
|
||||
)
|
||||
res = await crawler.arun(people_u, config=cfg)
|
||||
if not res[0].success:
|
||||
return []
|
||||
raw = json.loads(res[0].extracted_content)
|
||||
people = []
|
||||
for p in raw[:limit]:
|
||||
followers = _openai_friendly_number(str(p.get("followers", "")))
|
||||
people.append(
|
||||
{
|
||||
"profile_url": p.get("profile_url"),
|
||||
"name": p.get("name"),
|
||||
"headline": p.get("headline"),
|
||||
"followers": followers,
|
||||
"connection_degree": p.get("connection_degree"),
|
||||
"avatar_url": p.get("avatar_url"),
|
||||
}
|
||||
)
|
||||
return people
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI + main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def build_arg_parser() -> argparse.ArgumentParser:
|
||||
ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
|
||||
sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
|
||||
|
||||
def add_flags(parser: argparse.ArgumentParser):
|
||||
parser.add_argument("--query", required=False, help="query keyword(s)")
|
||||
parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
|
||||
parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
|
||||
parser.add_argument("--max-companies", type=int, default=1000)
|
||||
parser.add_argument("--max-people", type=int, default=500)
|
||||
        # async_main reads opts.profile_name, so the flag must populate that attribute.
        parser.add_argument("--profile-name", default="profile_linkedin_uc", help="name of a saved Crawl4AI browser profile")
|
||||
parser.add_argument("--outdir", default="./output")
|
||||
parser.add_argument("--concurrency", type=int, default=4)
|
||||
parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
|
||||
|
||||
add_flags(sub.add_parser("full"))
|
||||
add_flags(sub.add_parser("companies"))
|
||||
add_flags(sub.add_parser("people"))
|
||||
|
||||
# global flags
|
||||
ap.add_argument(
|
||||
"--debug",
|
||||
action="store_true",
|
||||
help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
|
||||
)
|
||||
return ap
|
||||
|
||||
|
||||
def detect_debug_defaults(force = False) -> SimpleNamespace:
|
||||
if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
|
||||
return SimpleNamespace()
|
||||
# ----- debug‑friendly defaults -----
|
||||
return SimpleNamespace(
|
||||
cmd="full",
|
||||
query="health insurance management",
|
||||
geo=102713980,
|
||||
# title_filters="Product,Engineering",
|
||||
title_filters="",
|
||||
max_companies=10,
|
||||
max_people=5,
|
||||
profile_name="profile_linkedin_uc",
|
||||
outdir="./debug_out",
|
||||
concurrency=2,
|
||||
log_level="debug",
|
||||
)
|
||||
|
||||
|
||||
async def async_main(opts):
|
||||
# ─────────── logging setup ───────────
|
||||
console = Console()
|
||||
logging.basicConfig(
|
||||
level=opts.log_level.upper(),
|
||||
format="%(message)s",
|
||||
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||
)
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
# Load or build schemas (one‑time LLM call each)
|
||||
# -------------------------------------------------------------------
|
||||
company_schema = _load_or_build_schema(
|
||||
COMPANY_SCHEMA_PATH,
|
||||
_SAMPLE_COMPANY_HTML,
|
||||
_COMPANY_SCHEMA_QUERY,
|
||||
_COMPANY_SCHEMA_EXAMPLE,
|
||||
# True
|
||||
)
|
||||
people_schema = _load_or_build_schema(
|
||||
PEOPLE_SCHEMA_PATH,
|
||||
_SAMPLE_PEOPLE_HTML,
|
||||
_PEOPLE_SCHEMA_QUERY,
|
||||
_PEOPLE_SCHEMA_EXAMPLE,
|
||||
# True
|
||||
)
|
||||
|
||||
outdir = BASE_DIR / pathlib.Path(opts.outdir)
|
||||
outdir.mkdir(parents=True, exist_ok=True)
|
||||
f_companies = (BASE_DIR / outdir / "companies.jsonl").open("a", encoding="utf-8")
|
||||
f_people = (BASE_DIR / outdir / "people.jsonl").open("a", encoding="utf-8")
|
||||
|
||||
# -------------------------------------------------------------------
|
||||
    # Prepare crawler with a persistent, logged-in browser profile
|
||||
# -------------------------------------------------------------------
|
||||
profiler = BrowserProfiler()
|
||||
path = profiler.get_profile_path(opts.profile_name)
|
||||
bc = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=False,
|
||||
user_data_dir=path,
|
||||
use_managed_browser=True,
|
||||
user_agent_mode = "random",
|
||||
user_agent_generator_config= {
|
||||
"platforms": "mobile",
|
||||
"os": "Android"
|
||||
        },
|
||||
)
|
||||
crawler = AsyncWebCrawler(config=bc)
|
||||
|
||||
await crawler.start()
|
||||
|
||||
# Single worker for simplicity; concurrency can be scaled by arun_many if needed.
|
||||
# crawler = await next_crawler().start()
|
||||
try:
|
||||
# Build LinkedIn search URL
|
||||
search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
|
||||
logging.info("Seed URL => %s", search_url)
|
||||
|
||||
companies: List[Dict] = []
|
||||
if opts.cmd in ("companies", "full"):
|
||||
companies = await crawl_company_search(
|
||||
crawler, search_url, company_schema, opts.max_companies
|
||||
)
|
||||
for c in companies:
|
||||
f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
|
||||
logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
|
||||
|
||||
if opts.cmd in ("people", "full"):
|
||||
if not companies:
|
||||
# load from previous run
|
||||
src = outdir / "companies.jsonl"
|
||||
if not src.exists():
|
||||
logging.error("companies.jsonl missing — run companies/full first")
|
||||
return 10
|
||||
companies = [json.loads(l) for l in src.read_text().splitlines()]
|
||||
total_people = 0
|
||||
title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
|
||||
for comp in companies:
|
||||
people = await crawl_people_page(
|
||||
crawler,
|
||||
comp["people_url"],
|
||||
people_schema,
|
||||
opts.max_people,
|
||||
title_kw,
|
||||
)
|
||||
for p in people:
|
||||
                    rec = p | {
                        "company_handle": comp["handle"],
                        # strftime avoids the malformed "+00:00Z" suffix that isoformat() + "Z" yields.
                        "captured_at": datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ"),
                    }
|
||||
f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
|
||||
total_people += len(people)
|
||||
logging.info(
|
||||
f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
|
||||
)
|
||||
await asyncio.sleep(random.uniform(0.5, 1))
|
||||
logging.info("Total people scraped: %d", total_people)
|
||||
finally:
|
||||
await crawler.close()
|
||||
f_companies.close()
|
||||
f_people.close()
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def main():
    parser = build_arg_parser()
    cli_opts = parser.parse_args()

    # Decide on debug defaults: --debug forces them; otherwise they apply only
    # under a debugger or when C4AI_DEMO_DEBUG is set.
    if cli_opts.debug:
        opts = detect_debug_defaults(force=True)
    else:
        env_defaults = detect_debug_defaults()
        # An empty SimpleNamespace is still truthy, so test its contents instead.
        opts = env_defaults if vars(env_defaults) else cli_opts

    if not getattr(opts, "cmd", None):
        opts.cmd = "full"

    exit_code = asyncio.run(async_main(opts))
    sys.exit(exit_code)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
372
docs/apps/linkdin/c4ai_insights.py
Normal file
@@ -0,0 +1,372 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Stage-2 Insights builder
|
||||
------------------------
|
||||
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
|
||||
• company_graph.json
|
||||
• org_chart_<handle>.json (one per company)
|
||||
• decision_makers.csv
|
||||
• graph_view.html (interactive visualisation)
|
||||
|
||||
Run:
|
||||
python c4ai_insights.py --in ./stage1_out --out ./stage2_out
|
||||
|
||||
Author : Tom @ Kidocode, 2025-04-28
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Imports & Third-party
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
import argparse, asyncio, json, os, sys, pathlib, random, time, csv
|
||||
from datetime import datetime, UTC
|
||||
from types import SimpleNamespace
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any
|
||||
# Pretty CLI UX
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
|
||||
import logging
|
||||
from jinja2 import Environment, FileSystemLoader, select_autoescape
|
||||
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# 3rd-party deps
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
import numpy as np
|
||||
# from sentence_transformers import SentenceTransformer
|
||||
# from sklearn.metrics.pairwise import cosine_similarity
|
||||
import pandas as pd
|
||||
import hashlib
|
||||
|
||||
from openai import OpenAI # same SDK you pre-loaded
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Utils
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
|
||||
with open(path, "r", encoding="utf-8") as f:
|
||||
return [json.loads(l) for l in f]
|
||||
|
||||
def dump_json(obj, path: Path):
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
json.dump(obj, f, ensure_ascii=False, indent=2)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Constants
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
BASE_DIR = pathlib.Path(__file__).resolve().parent
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Debug defaults (mirrors Stage-1 trick)
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def dev_defaults() -> SimpleNamespace:
|
||||
return SimpleNamespace(
|
||||
in_dir="./debug_out",
|
||||
out_dir="./insights_debug",
|
||||
embed_model="all-MiniLM-L6-v2",
|
||||
top_k=10,
|
||||
openai_model="gpt-4.1",
|
||||
max_llm_tokens=8000,
|
||||
llm_temperature=1.0,
|
||||
workers=4, # parallel processing
|
||||
stub=False, # manual
|
||||
)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Graph builders
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
logging.debug(f"Using embedding model: {model_name}")
|
||||
cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
|
||||
cache = {}
|
||||
if cache_path.exists():
|
||||
with open(cache_path) as f:
|
||||
cache = json.load(f)
|
||||
# flush cache if model differs
|
||||
if cache.get("_model") != model_name:
|
||||
cache = {}
|
||||
|
||||
model = SentenceTransformer(model_name)
|
||||
new_texts, new_indices = [], []
|
||||
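    # Size the matrix from the loaded model rather than hard-coding 384 (the MiniLM width).
    vectors = np.zeros((len(companies), model.get_sentence_embedding_dimension()), dtype=np.float32)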
|
||||
|
||||
for idx, comp in enumerate(companies):
|
||||
text = comp.get("about") or comp.get("descriptor","")
|
||||
h = hashlib.sha1(text.encode("utf-8")).hexdigest()
|
||||
cached = cache.get(comp["handle"])
|
||||
if cached and cached["hash"] == h:
|
||||
vectors[idx] = np.array(cached["vector"], dtype=np.float32)
|
||||
else:
|
||||
new_texts.append(text)
|
||||
new_indices.append((idx, comp["handle"], h))
|
||||
|
||||
if new_texts:
|
||||
embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
|
||||
for vec, (idx, handle, h) in zip(embeds, new_indices):
|
||||
vectors[idx] = vec
|
||||
cache[handle] = {"hash": h, "vector": vec.tolist()}
|
||||
cache["_model"] = model_name
|
||||
with open(cache_path, "w") as f:
|
||||
json.dump(cache, f)
|
||||
|
||||
return vectors
|
||||
|
||||
def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
sims = cosine_similarity(embeds)
|
||||
nodes, edges = [], []
|
||||
idx_of = {c["handle"]: i for i,c in enumerate(companies)}
|
||||
for i,c in enumerate(companies):
|
||||
node = dict(
|
||||
id=c["handle"].strip("/"),
|
||||
name=c["name"],
|
||||
handle=c["handle"],
|
||||
about=c.get("about",""),
|
||||
people_url=c.get("people_url",""),
|
||||
industry=c.get("descriptor","").split("•")[0].strip(),
|
||||
geoUrn=c.get("geoUrn"),
|
||||
followers=c.get("followers",0),
|
||||
# desc_embed=embeds[i].tolist(),
|
||||
desc_embed=[],
|
||||
)
|
||||
nodes.append(node)
|
||||
# pick top-k most similar except itself
|
||||
top_idx = np.argsort(sims[i])[::-1][1:top_k+1]
|
||||
for j in top_idx:
|
||||
tgt = companies[j]
|
||||
weight = float(sims[i,j])
|
||||
if node["industry"] == tgt.get("descriptor","").split("•")[0].strip():
|
||||
weight += 0.10
|
||||
if node["geoUrn"] == tgt.get("geoUrn"):
|
||||
weight += 0.05
|
||||
tgt['followers'] = tgt.get("followers", None) or 1
|
||||
node["followers"] = node.get("followers", None) or 1
|
||||
follower_ratio = min(node["followers"], tgt.get("followers",1)) / max(node["followers"] or 1, tgt.get("followers",1))
|
||||
weight += 0.05 * follower_ratio
|
||||
edges.append(dict(
|
||||
source=node["id"],
|
||||
target=tgt["handle"].strip("/"),
|
||||
weight=round(weight,4),
|
||||
drivers=dict(
|
||||
embed_sim=round(float(sims[i,j]),4),
|
||||
industry_match=0.10 if node["industry"] == tgt.get("descriptor","").split("•")[0].strip() else 0,
|
||||
geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
|
||||
)
|
||||
))
|
||||
# return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
|
||||
return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Org-chart via LLM
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
async def infer_org_chart_llm(company, people, client:OpenAI, model_name:str, max_tokens:int, temperature:float, stub:bool):
|
||||
if stub:
|
||||
# Tiny fake org-chart when debugging offline
|
||||
chief = random.choice(people)
|
||||
nodes = [{
|
||||
"id": chief["profile_url"],
|
||||
"name": chief["name"],
|
||||
"title": chief["headline"],
|
||||
"dept": chief["headline"].split()[:1][0],
|
||||
"yoe_total": 8,
|
||||
"yoe_current": 2,
|
||||
"seniority_score": 0.8,
|
||||
"decision_score": 0.9,
|
||||
"avatar_url": chief.get("avatar_url")
|
||||
}]
|
||||
return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
|
||||
|
||||
prompt = [
|
||||
{"role":"system","content":"You are an expert B2B org-chart reasoner."},
|
||||
{"role":"user","content":f"""Here is the company description:
|
||||
|
||||
<company>
|
||||
{json.dumps(company, ensure_ascii=False)}
|
||||
</company>
|
||||
|
||||
Here is a JSON list of employees:
|
||||
<employees>
|
||||
{json.dumps(people, ensure_ascii=False)}
|
||||
</employees>
|
||||
|
||||
1) Build a reporting tree (manager -> direct reports)
|
||||
2) For each person output a decision_score 0-1 for buying new software
|
||||
|
||||
Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
|
||||
"""}
|
||||
]
|
||||
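    # The OpenAI client used here is synchronous; run the call in a worker thread
    # so that concurrent org-chart requests do not block the event loop.
    resp = await asyncio.to_thread(
        client.chat.completions.create,
        model=model_name,
        messages=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        response_format={"type": "json_object"},
    )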
|
||||
chart = json.loads(resp.choices[0].message.content)
|
||||
chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
|
||||
return chart
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# CSV flatten
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
|
||||
rows=[]
|
||||
for p in charts_dir.glob("org_chart_*.json"):
|
||||
data=json.loads(p.read_text())
|
||||
comp = p.stem.split("org_chart_")[1]
|
||||
for n in data.get("nodes",[]):
|
||||
if n.get("decision_score",0)>=threshold:
|
||||
rows.append(dict(
|
||||
company=comp,
|
||||
person=n["name"],
|
||||
title=n["title"],
|
||||
decision_score=n["decision_score"],
|
||||
profile_url=n["id"]
|
||||
))
|
||||
pd.DataFrame(rows).to_csv(csv_path,index=False)
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# HTML rendering
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def render_html(out:Path, template_dir:Path):
|
||||
    # Copy graph_view_template.html (as graph_view.html) and ai.js from the template folder into the output folder
|
||||
import shutil
|
||||
shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
|
||||
shutil.copy(template_dir/"ai.js", out)
|
||||
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# Main async pipeline
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
async def run(opts):
|
||||
# ── silence SDK noise ──────────────────────────────────────────────────────
|
||||
for noisy in ("openai", "httpx", "httpcore"):
|
||||
lg = logging.getLogger(noisy)
|
||||
lg.setLevel(logging.WARNING) # or ERROR if you want total silence
|
||||
lg.propagate = False # optional: stop them reaching root
|
||||
|
||||
# ────────────── logging bootstrap ──────────────
|
||||
console = Console()
|
||||
logging.basicConfig(
|
||||
level="INFO",
|
||||
format="%(message)s",
|
||||
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
|
||||
)
|
||||
|
||||
in_dir = BASE_DIR / Path(opts.in_dir)
|
||||
out_dir = BASE_DIR / Path(opts.out_dir)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
companies = load_jsonl(in_dir/"companies.jsonl")
|
||||
people = load_jsonl(in_dir/"people.jsonl")
|
||||
|
||||
logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
|
||||
|
||||
logging.info("[bold]⇢[/] Embedding company descriptions…")
|
||||
# embeds = embed_descriptions(companies, opts.embed_model, opts)
|
||||
|
||||
logging.info("[bold]⇢[/] Building similarity graph")
|
||||
# company_graph = build_company_graph(companies, embeds, opts.top_k)
|
||||
# dump_json(company_graph, out_dir/"company_graph.json")
|
||||
|
||||
# OpenAI client (only built if not debugging)
|
||||
stub = bool(opts.stub)
|
||||
client = OpenAI() if not stub else None
|
||||
|
||||
# Filter companies that need processing
|
||||
to_process = []
|
||||
for comp in companies:
|
||||
handle = comp["handle"].strip("/").replace("/","_")
|
||||
out_file = out_dir/f"org_chart_{handle}.json"
|
||||
if out_file.exists() and False:
|
||||
logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
|
||||
continue
|
||||
to_process.append(comp)
|
||||
|
||||
|
||||
if not to_process:
|
||||
logging.info("[yellow]All companies already processed[/]")
|
||||
else:
|
||||
workers = getattr(opts, 'workers', 1)
|
||||
parallel = workers > 1
|
||||
|
||||
logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
|
||||
|
||||
with Progress(
|
||||
SpinnerColumn(),
|
||||
BarColumn(),
|
||||
TextColumn("[progress.description]{task.description}"),
|
||||
TimeElapsedColumn(),
|
||||
console=console,
|
||||
) as progress:
|
||||
task = progress.add_task("Org charts", total=len(to_process))
|
||||
|
||||
async def process_one(comp):
|
||||
handle = comp["handle"].strip("/").replace("/","_")
|
||||
persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
|
||||
|
||||
chart = await infer_org_chart_llm(
|
||||
comp, persons,
|
||||
client=client if client else OpenAI(api_key="sk-debug"),
|
||||
model_name=opts.openai_model,
|
||||
max_tokens=opts.max_llm_tokens,
|
||||
temperature=opts.llm_temperature,
|
||||
stub=stub,
|
||||
)
|
||||
chart["meta"]["company"] = comp["name"]
|
||||
|
||||
# Save the result immediately
|
||||
dump_json(chart, out_dir/f"org_chart_{handle}.json")
|
||||
|
||||
progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
|
||||
|
||||
# Create tasks for all companies
|
||||
tasks = [process_one(comp) for comp in to_process]
|
||||
|
||||
# Process in batches based on worker count
|
||||
semaphore = asyncio.Semaphore(workers)
|
||||
|
||||
async def bounded_process(coro):
|
||||
async with semaphore:
|
||||
return await coro
|
||||
|
||||
# Run with concurrency control
|
||||
await asyncio.gather(*(bounded_process(task) for task in tasks))
|
||||
|
||||
logging.info("[bold]⇢[/] Flattening decision-makers CSV")
|
||||
export_decision_makers(out_dir, out_dir/"decision_makers.csv")
|
||||
|
||||
render_html(out_dir, template_dir=BASE_DIR/"templates")
|
||||
logging.success = lambda msg, **k: console.print(f"[bold green]✓[/] {msg}", **k)
|
||||
logging.success(f"Stage-2 artefacts written to {out_dir}")
|
||||
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
# CLI
|
||||
# ───────────────────────────────────────────────────────────────────────────────
|
||||
def build_arg_parser():
|
||||
p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
|
||||
p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
|
||||
p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
|
||||
p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
|
||||
p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
|
||||
p.add_argument("--openai_model", default="gpt-4.1")
|
||||
p.add_argument("--max_llm_tokens", type=int, default=8024)
|
||||
p.add_argument("--llm_temperature", type=float, default=1.0)
|
||||
p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
|
||||
p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
|
||||
return p
|
||||
|
||||
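def main():
    # CLI flags win when any are given; otherwise fall back to the debug
    # defaults so the file still runs inside a debugger without arguments.
    opts = build_arg_parser().parse_args() if len(sys.argv) > 1 else dev_defaults()
    asyncio.run(run(opts))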
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
39
docs/apps/linkdin/schemas/company_card.json
Normal file
@@ -0,0 +1,39 @@
|
||||
{
|
||||
"name": "LinkedIn Company Card",
|
||||
"baseSelector": "div.search-results-container ul[role='list'] > li",
|
||||
"fields": [
|
||||
{
|
||||
"name": "handle",
|
||||
"selector": "a[href*='/company/']",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "profile_image",
|
||||
"selector": "a[href*='/company/'] img",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "name",
|
||||
"selector": "span[class*='t-16'] a",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "descriptor",
|
||||
"selector": "div[class*='t-black t-normal']",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "about",
|
||||
"selector": "p[class*='entity-result__summary--2-lines']",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "followers",
|
||||
"selector": "div:contains('followers')",
|
||||
"type": "regex",
"pattern": "(\\d[\\d.,]*[KkMm]?)\\s*followers"
|
||||
}
|
||||
]
|
||||
}
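
Follower counts on these cards arrive as text such as `1K followers` or `727 followers`, so downstream code usually needs them as integers. The following is a minimal sketch of that normalisation, assuming the only suffixes are `K` and `M`; the helper name `parse_followers` is hypothetical and not part of the schema or of Crawl4AI.

```python
import re
from typing import Optional

# Matches "727 followers", "1K followers", "1.2M followers", "1,234 followers".
_FOLLOWERS_RE = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([KkMm]?)\s*followers")
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_followers(text: str) -> Optional[int]:
    """Convert a follower label to an integer, or None when no count is found."""
    match = _FOLLOWERS_RE.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return int(round(number * _MULTIPLIER[match.group(2).lower()]))
```

Counts with a suffix are rounded by the site, so the result is an approximation rather than an exact figure.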
|
||||
38
docs/apps/linkdin/schemas/people_card.json
Normal file
@@ -0,0 +1,38 @@
|
||||
{
|
||||
"name": "LinkedIn People Card",
|
||||
"baseSelector": "li.org-people-profile-card__profile-card-spacing",
|
||||
"fields": [
|
||||
{
|
||||
"name": "profile_url",
"selector": "a[href*='/in/']",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "name",
|
||||
"selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "headline",
|
||||
"selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "followers",
|
||||
"selector": ".lt-line-clamp--multi-line.t-12",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "connection_degree",
|
||||
"selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "avatar_url",
|
||||
"selector": ".artdeco-entity-lockup__image img",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
]
|
||||
}
|
||||
143
docs/apps/linkdin/snippets/company.html
Normal file
@@ -0,0 +1,143 @@
|
||||
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
|
||||
<!----><!---->
|
||||
|
||||
|
||||
|
||||
<div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
|
||||
data-view-name="search-entity-result-universal-template">
|
||||
|
||||
|
||||
|
||||
|
||||
<div class="linked-area flex-1
|
||||
cursor-pointer">
|
||||
|
||||
<div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
|
||||
<div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
|
||||
<div class="display-flex align-items-center">
|
||||
<!---->
|
||||
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down " aria-hidden="true"
|
||||
tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||
data-test-app-aware-link="">
|
||||
|
||||
<div class="ivm-image-view-model ">
|
||||
|
||||
<div class="ivm-view-attr__img-wrapper
|
||||
|
||||
">
|
||||
<!---->
|
||||
<!----> <img width="48"
|
||||
src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&v=beta&t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
|
||||
loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
|
||||
id="ember28"
|
||||
class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<div
|
||||
class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
|
||||
<div class="mb1">
|
||||
|
||||
<div class="t-roman t-sans">
|
||||
|
||||
|
||||
|
||||
<div class="display-flex">
|
||||
<span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
|
||||
<span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
|
||||
t-16">
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||
href="https://www.linkedin.com/company/managment-research-services-inc./"
|
||||
data-test-app-aware-link="">
|
||||
<!---->Management Research Services, Inc. (MRS, Inc)<!---->
|
||||
<!----> </a>
|
||||
<!----> </span>
|
||||
</span>
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
<div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
|
||||
t-14 t-black t-normal">
|
||||
<!---->Insurance • Milwaukee, Wisconsin<!---->
|
||||
</div>
|
||||
|
||||
<div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
|
||||
t-14 t-normal">
|
||||
<!---->1K followers<!---->
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
<!---->
|
||||
<p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
|
||||
entity-result__summary--2-lines
|
||||
t-12 t-black--light
|
||||
">
|
||||
<!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
|
||||
</span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
|
||||
Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
|
||||
class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
|
||||
</span><strong><!---->insurance<!----></strong><span class="white-space-pre">
|
||||
</span>underwriting solutions that efficiently support clients’ workflows. Supported by the
|
||||
Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
|
||||
optimize requirements...<!---->
|
||||
</p>
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
<div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
|
||||
<!---->
|
||||
|
||||
|
||||
<div>
|
||||
|
||||
|
||||
|
||||
|
||||
<button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
|
||||
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
|
||||
type="button"><!---->
|
||||
<span class="artdeco-button__text">
|
||||
Follow
|
||||
</span></button>
|
||||
|
||||
|
||||
|
||||
<!---->
|
||||
<!---->
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</li>
|
||||
94
docs/apps/linkdin/snippets/people.html
Normal file
@@ -0,0 +1,94 @@
|
||||
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
|
||||
<div>
|
||||
|
||||
|
||||
<section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
|
||||
<!---->
|
||||
|
||||
<img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
|
||||
ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
|
||||
class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
|
||||
|
||||
<div class="org-people-profile-card__profile-info">
|
||||
<div id="ember97"
|
||||
class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
|
||||
<div id="ember98"
|
||||
class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
|
||||
type="circle">
|
||||
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
|
||||
id="org-people-profile-card__profile-image-0"
|
||||
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||
data-test-app-aware-link="">
|
||||
<img width="104"
|
||||
src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&v=beta&t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
|
||||
loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
|
||||
class="evi-image lazy-image ember-view">
|
||||
</a>
|
||||
|
||||
|
||||
</div>
|
||||
<div id="ember100" class="artdeco-entity-lockup__content ember-view">
|
||||
<div id="ember101" class="artdeco-entity-lockup__title ember-view">
|
||||
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
|
||||
aria-label="View Dr. Rayna S.’s profile"
|
||||
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
|
||||
data-test-app-aware-link="">
|
||||
<div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
|
||||
t-black" style="">
|
||||
Dr. Rayna S.
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
</a>
|
||||
|
||||
</div>
|
||||
<div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
|
||||
degree connection</span>
|
||||
<span class="artdeco-entity-lockup__degree" aria-hidden="true">
|
||||
· 3rd
|
||||
</span>
|
||||
<!----><!---->
|
||||
</div>
|
||||
<div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
|
||||
<div class="t-14 t-black--light t-normal">
|
||||
<div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
|
||||
style="-webkit-line-clamp: 2">
|
||||
Leadership and Talent Development Consultant and Professional Speaker
|
||||
|
||||
<!---->
|
||||
</div>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
<span class="text-align-center">
|
||||
<span id="ember110"
|
||||
class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
|
||||
style="-webkit-line-clamp: 3">
|
||||
727 followers
|
||||
|
||||
<!----> </span>
|
||||
|
||||
</span>
|
||||
</div>
|
||||
|
||||
<footer class="ph3 pb3">
|
||||
<button aria-label="Follow Dr. Rayna S." id="ember111"
|
||||
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
|
||||
type="button"><!---->
|
||||
<span class="artdeco-button__text">
|
||||
Follow
|
||||
</span></button>
|
||||
</footer>
|
||||
|
||||
</section>
|
||||
|
||||
|
||||
</div>
|
||||
|
||||
</li>
|
||||
50
docs/apps/linkdin/templates/ai.js
Normal file
@@ -0,0 +1,50 @@
|
||||
// ==== File: ai.js ====
|
||||
|
||||
class ApiHandler {
|
||||
constructor(apiKey = null) {
|
||||
this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
|
||||
console.log("ApiHandler ready");
|
||||
}
|
||||
|
||||
setApiKey(k) {
|
||||
this.apiKey = k.trim();
|
||||
if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
|
||||
}
|
||||
|
||||
async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
|
||||
if (!this.apiKey) throw new Error("OpenAI API key missing");
|
||||
const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
|
||||
const controller = new AbortController();
|
||||
|
||||
const res = await fetch("https://api.openai.com/v1/chat/completions", {
|
||||
method: "POST",
|
||||
headers: {
|
||||
"Content-Type": "application/json",
|
||||
Authorization: `Bearer ${this.apiKey}`,
|
||||
},
|
||||
body: JSON.stringify(payload),
|
||||
signal: controller.signal,
|
||||
});
|
||||
if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
|
||||
const reader = res.body.getReader();
|
||||
const dec = new TextDecoder();
|
||||
|
||||
let buf = "";
|
||||
while (true) {
|
||||
const {done, value} = await reader.read();
|
||||
if (done) break;
|
||||
buf += dec.decode(value, {stream: true});
|
||||
// Split off the complete lines; the last element is a partial line (or "") kept for the next chunk
const lines = buf.split("\n");
buf = lines.pop();
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
if (line.includes("[DONE]")) return;
const json = JSON.parse(line.slice(6));
const delta = json.choices?.[0]?.delta?.content;
if (delta) yield delta;
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
window.API = new ApiHandler();
|
||||
|
||||
1171
docs/apps/linkdin/templates/graph_view_template.html
Normal file
File diff suppressed because it is too large
51
docs/codebase/browser.md
Normal file
@@ -0,0 +1,51 @@
|
||||
### browser_manager.py

| Function | What it does |
|---|---|
| `ManagedBrowser.build_browser_flags` | Returns baseline Chromium CLI flags, disables GPU and sandbox, plugs locale, timezone, stealth tweaks, and any extras from `BrowserConfig`. |
| `ManagedBrowser.__init__` | Stores config and logger, creates temp dir, preps internal state. |
| `ManagedBrowser.start` | Spawns or connects to the Chromium process, returns its CDP endpoint plus the `subprocess.Popen` handle. |
| `ManagedBrowser._initial_startup_check` | Pings the CDP endpoint once to be sure the browser is alive, raises if not. |
| `ManagedBrowser._monitor_browser_process` | Async-loops on the subprocess, logs exits or crashes, restarts if policy allows. |
| `ManagedBrowser._get_browser_path_WIP` | Old helper that maps OS + browser type to an executable path. |
| `ManagedBrowser._get_browser_path` | Current helper, checks env vars, Playwright cache, and OS defaults for the real executable. |
| `ManagedBrowser._get_browser_args` | Builds the final CLI arg list by merging user flags, stealth flags, and defaults. |
| `ManagedBrowser.cleanup` | Terminates the browser, stops monitors, deletes the temp dir. |
| `ManagedBrowser.create_profile` | Opens a visible browser so a human can log in, then zips the resulting user-data-dir to `~/.crawl4ai/profiles/<name>`. |
| `ManagedBrowser.list_profiles` | Thin wrapper, now forwarded to `BrowserProfiler.list_profiles()`. |
| `ManagedBrowser.delete_profile` | Thin wrapper, now forwarded to `BrowserProfiler.delete_profile()`. |
| `BrowserManager.__init__` | Holds the global Playwright instance, browser handle, config signature cache, session map, and logger. |
| `BrowserManager.start` | Boots the underlying `ManagedBrowser`, then spins up the default Playwright browser context with stealth patches. |
| `BrowserManager._build_browser_args` | Translates `CrawlerRunConfig` (proxy, UA, timezone, headless flag, etc.) into Playwright `launch_args`. |
| `BrowserManager.setup_context` | Applies locale, geolocation, permissions, cookies, and UA overrides on a fresh context. |
| `BrowserManager.create_browser_context` | Internal helper that actually calls `browser.new_context(**options)` after running `setup_context`. |
| `BrowserManager._make_config_signature` | Hashes the non-ephemeral parts of `CrawlerRunConfig` so contexts can be reused safely. |
| `BrowserManager.get_page` | Returns a ready `Page` for a given session id, reusing an existing one or creating a new context/page, injects helper scripts, updates `last_used`. |
| `BrowserManager.kill_session` | Force-closes a context/page for a session and removes it from the session map. |
| `BrowserManager._cleanup_expired_sessions` | Periodic sweep that drops sessions idle longer than `ttl_seconds`. |
| `BrowserManager.close` | Gracefully shuts down all contexts, the browser, Playwright, and background tasks. |
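
The `_make_config_signature` row describes hashing only the non-ephemeral parts of a run config so that browser contexts can be reused. A minimal sketch of that idea follows. `DemoRunConfig`, its field names, and the choice of which fields count as ephemeral are all assumptions made for illustration; the real `CrawlerRunConfig` and the real hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical stand-in for CrawlerRunConfig; field names are illustrative only.
@dataclass
class DemoRunConfig:
    locale: Optional[str] = None
    timezone_id: Optional[str] = None
    user_agent: Optional[str] = None
    proxy: Optional[str] = None
    # Per-request values that should not force a new browser context
    url: str = ""
    session_id: Optional[str] = None


# Assumed set of fields that vary per request and are excluded from the hash.
EPHEMERAL_FIELDS = {"url", "session_id"}


def make_config_signature(config: DemoRunConfig) -> str:
    """Hash the stable fields so equal context settings map to the same key."""
    stable = {k: v for k, v in asdict(config).items() if k not in EPHEMERAL_FIELDS}
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two configs that differ only in `url` produce the same signature, so a cache keyed on it could hand back the existing context, while a change of `locale` yields a new key.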
|
||||
|
||||
---
|
||||
|
||||
### browser_profiler.py

| Function | What it does |
|---|---|
| `BrowserProfiler.__init__` | Sets up profile folder paths, async logger, and signal handlers. |
| `BrowserProfiler.create_profile` | Launches a visible browser with a new user-data-dir for manual login, on exit compresses and stores it as a named profile. |
| `BrowserProfiler.cleanup_handler` | General SIGTERM/SIGINT cleanup wrapper that kills child processes. |
| `BrowserProfiler.sigint_handler` | Handles Ctrl-C during an interactive session, makes sure the browser shuts down cleanly. |
| `BrowserProfiler.listen_for_quit_command` | Async REPL that exits when the user types `q`. |
| `BrowserProfiler.list_profiles` | Enumerates `~/.crawl4ai/profiles`, prints profile name, browser type, size, and last modified. |
| `BrowserProfiler.get_profile_path` | Returns the absolute path of a profile given its name, or `None` if missing. |
| `BrowserProfiler.delete_profile` | Removes a profile folder or a direct path from disk, with optional confirmation prompt. |
| `BrowserProfiler.interactive_manager` | Text UI loop for listing, creating, deleting, or launching profiles. |
| `BrowserProfiler.launch_standalone_browser` | Starts a non-headless Chromium with remote debugging enabled and keeps it alive for manual tests. |
| `BrowserProfiler.get_cdp_json` | Pulls `/json/version` from a CDP endpoint and returns the parsed JSON. |
| `BrowserProfiler.launch_builtin_browser` | Spawns a headless Chromium in the background, saves `{wsEndpoint, pid, started_at}` to `~/.crawl4ai/builtin_browser.json`. |
| `BrowserProfiler.get_builtin_browser_info` | Reads that JSON file, verifies the PID, and returns browser status info. |
| `BrowserProfiler._is_browser_running` | Cross-platform helper that checks if a PID is still alive. |
| `BrowserProfiler.kill_builtin_browser` | Terminates the background builtin browser and removes its status file. |
| `BrowserProfiler.get_builtin_browser_status` | Returns `{running: bool, wsEndpoint, pid, started_at}` for quick health checks. |
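
Several rows above depend on checking whether a saved PID is still alive. The sketch below shows one common way to do that on POSIX systems with signal 0. It is an assumption about the general technique, not the code of `_is_browser_running`, which is described as cross-platform and may work differently; the function name `is_pid_alive` is hypothetical.

```python
import os


def is_pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks existence and permissions without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No process with this PID exists
        return False
    except PermissionError:
        # The process exists but belongs to another user
        return True
    return True
```

This should not be used unchanged on Windows, where `os.kill` treats its second argument differently.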
|
||||
|
||||
|
||||
40
docs/codebase/cli.md
Normal file
@@ -0,0 +1,40 @@
|
||||
### `cli.py` command surface

| Command | Inputs / flags | What it does |
|---|---|---|
| **profiles** | *(none)* | Opens the interactive profile manager, which lets you list, create, and delete saved browser profiles that live in `~/.crawl4ai/profiles`. |
| **browser status** | *(none)* | Prints whether the always-on *builtin* browser is running, and shows its CDP URL, PID, and start time. |
| **browser stop** | *(none)* | Kills the builtin browser and deletes its status file. |
| **browser view** | `--url, -u` URL *(optional)* | Pops a visible window of the builtin browser, navigates to `URL` or `about:blank`. |
| **config list** | *(none)* | Dumps every global setting, showing current value, default, and description. |
| **config get** | `key` | Prints the value of a single setting, falls back to default if unset. |
| **config set** | `key value` | Persists a new value in the global config (stored under `~/.crawl4ai/config.yml`). |
| **examples** | *(none)* | Prints real-world CLI usage samples. |
| **crawl** | `url` *(positional)*<br>`--browser-config,-B` path<br>`--crawler-config,-C` path<br>`--filter-config,-f` path<br>`--extraction-config,-e` path<br>`--json-extract,-j` [desc]\*<br>`--schema,-s` path<br>`--browser,-b` k=v list<br>`--crawler,-c` k=v list<br>`--output,-o` all,json,markdown,md,markdown-fit,md-fit *(default all)*<br>`--output-file,-O` path<br>`--bypass-cache,-b` *(flag, default true; note that `-b` is also the short form of `--browser`)*<br>`--question,-q` str<br>`--verbose,-v` *(flag)*<br>`--profile,-p` profile-name | One-shot crawl + extraction. Builds `BrowserConfig` and `CrawlerRunConfig` from inline flags or separate YAML/JSON files, runs `AsyncWebCrawler.arun()`, can route through a named saved profile and pipe the result to stdout or a file. |
| **(default)** | Same flags as **crawl**, plus `--example` | Shortcut so you can type just `crwl https://site.com`. When the first arg is not a known sub-command, it falls through to *crawl*. |

\* `--json-extract/-j` with no value turns on LLM-based JSON extraction using an auto schema; supplying a string lets you prompt-engineer the field descriptions.
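
The **(default)** row says that an unknown first argument falls through to `crawl`. A minimal sketch of that dispatch rule is below. The function `resolve_command` is hypothetical, and the set of known sub-commands is simply copied from the table above; the real CLI may register more commands and implement the fallback differently.

```python
from typing import List, Tuple

# Top-level sub-commands listed in the table above (an assumption, not the CLI's registry).
KNOWN_COMMANDS = {"profiles", "browser", "config", "examples", "crawl"}


def resolve_command(argv: List[str]) -> Tuple[str, List[str]]:
    """Return (sub-command, remaining args), defaulting to `crawl` for unknown first args."""
    if argv and argv[0] in KNOWN_COMMANDS:
        return argv[0], argv[1:]
    # Anything else, typically a URL, is treated as arguments to `crawl`
    return "crawl", list(argv)
```

Under this rule `crwl https://site.com -o json` and `crwl crawl https://site.com -o json` resolve to the same command and arguments.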
|
||||
|
||||
> Quick mental model
> `profiles` = manage identities,
> `browser ...` = control long-running headless Chrome that all crawls can piggy-back on,
> `crawl` = do the actual work,
> `config` = tweak global defaults,
> everything else is sugar.
|
||||
|
||||
### Quick-fire “profile” usage cheatsheet

| Scenario | Command (copy-paste ready) | Notes |
|---|---|---|
| **Launch interactive Profile Manager UI** | `crwl profiles` | Opens TUI with options: 1 List, 2 Create, 3 Delete, 4 Use-to-crawl, 5 Exit. |
| **Create a fresh profile** | `crwl profiles` → choose **2** → name it → browser opens → log in → press **q** in terminal | Saves to `~/.crawl4ai/profiles/<name>`. |
| **List saved profiles** | `crwl profiles` → choose **1** | Shows name, browser type, size, last-modified. |
| **Delete a profile** | `crwl profiles` → choose **3** → pick the profile index → confirm | Removes the folder. |
| **Crawl with a profile (default alias)** | `crwl https://site.com/dashboard -p my-profile` | Keeps login cookies, sets `use_managed_browser=true` under the hood. |
| **Crawl + verbose JSON output** | `crwl https://site.com -p my-profile -o json -v` | Any other `crawl` flags work the same. |
| **Crawl with extra browser tweaks** | `crwl https://site.com -p my-profile -b "headless=true,viewport_width=1680"` | CLI overrides go on top of the profile. |
| **Same but via explicit sub-command** | `crwl crawl https://site.com -p my-profile` | Identical to default alias. |
| **Use profile from inside Profile Manager** | `crwl profiles` → choose **4** → pick profile → enter URL → follow prompts | Handy when demo-ing to non-CLI folks. |
| **One-off crawl with a profile folder path (no name lookup)** | `crwl https://site.com -b "user_data_dir=$HOME/.crawl4ai/profiles/my-profile,use_managed_browser=true"` | Bypasses registry, useful for CI scripts. |
| **Launch a dev browser on CDP port with the same identity** | `crwl cdp -d $HOME/.crawl4ai/profiles/my-profile -P 9223` | Lets Puppeteer/Playwright attach for debugging. |
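
Several commands above pass browser overrides as a `k=v` list, for example `-b "headless=true,viewport_width=1680"`. The sketch below shows how such a string could be turned into typed values. `parse_key_values` and `_coerce` are hypothetical helpers written for this illustration; the coercion rules (booleans, then integers, then floats, then plain strings) are an assumption and may not match what the CLI does.

```python
from typing import Dict, Union

Value = Union[bool, int, float, str]


def _coerce(raw: str) -> Value:
    """Best-effort typing: true/false, then int, then float, else the raw string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_key_values(spec: str) -> Dict[str, Value]:
    """Parse 'a=1,b=true' into a dict; values containing commas are not supported."""
    result: Dict[str, Value] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = _coerce(raw.strip())
    return result
```

Applied to the cheatsheet example, this yields a boolean `headless` and an integer `viewport_width`, which could then be layered on top of the values stored in a profile.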
|
||||
|
||||
@@ -383,29 +383,31 @@ async def main():
|
||||
scroll_delay=0.2,
|
||||
)
|
||||
|
||||
# # Execute market data extraction
|
||||
# results: List[CrawlResult] = await crawler.arun(
|
||||
# url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
# )
|
||||
# Execute market data extraction
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
)
|
||||
|
||||
# # Process results
|
||||
# raw_df = pd.DataFrame()
|
||||
# for result in results:
|
||||
# if result.success and result.media["tables"]:
|
||||
# # Extract primary market table
|
||||
# # DataFrame
|
||||
# raw_df = pd.DataFrame(
|
||||
# result.media["tables"][0]["rows"],
|
||||
# columns=result.media["tables"][0]["headers"],
|
||||
# )
|
||||
# break
|
||||
# Process results
|
||||
raw_df = pd.DataFrame()
|
||||
for result in results:
|
||||
# Use the new tables field, falling back to media["tables"] for backward compatibility
|
||||
tables = result.tables if hasattr(result, "tables") and result.tables else result.media.get("tables", [])
|
||||
if result.success and tables:
|
||||
# Extract primary market table
|
||||
# DataFrame
|
||||
raw_df = pd.DataFrame(
|
||||
tables[0]["rows"],
|
||||
columns=tables[0]["headers"],
|
||||
)
|
||||
break
|
||||
|
||||
|
||||
# This is for debugging only
|
||||
# ////// Remove this in production from here..
|
||||
# Save raw data for debugging
|
||||
# raw_df.to_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv", index=False)
|
||||
# print("🔍 Raw data saved to 'raw_crypto_data.csv'")
|
||||
raw_df.to_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv", index=False)
|
||||
print("🔍 Raw data saved to 'raw_crypto_data.csv'")
|
||||
|
||||
# Read from file for debugging
|
||||
raw_df = pd.read_csv(f"{__current_dir__}/tmp/raw_crypto_data.csv")
|
||||
|
||||
File diff suppressed because it is too large
@@ -31,7 +31,7 @@ async def example_cdp():
|
||||
|
||||
|
||||
async def main():
|
||||
-browser_config = BrowserConfig(headless=True, verbose=True)
+browser_config = BrowserConfig(headless=False, verbose=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
|
||||
@@ -361,8 +361,10 @@ A code snippet: \`crawler.run()\`. Check the [quickstart](/core/quickstart).`;
|
||||
chatMessages.innerHTML = ""; // Start with clean slate for query
|
||||
if (!isFromQuery) {
|
||||
// Show welcome only if manually started
|
||||
+// chatMessages.innerHTML =
+// '<div class="message ai-message welcome-message">Started a new chat! Ask me anything about Crawl4AI.</div>';
 chatMessages.innerHTML =
-'<div class="message ai-message welcome-message">Started a new chat! Ask me anything about Crawl4AI.</div>';
+'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
|
||||
}
|
||||
addCitations([]); // Clear citations
|
||||
updateCitationsDisplay(); // Clear UI
|
||||
@@ -504,8 +506,10 @@ A code snippet: \`crawler.run()\`. Check the [quickstart](/core/quickstart).`;
|
||||
addMessageToChat(message, false);
|
||||
});
|
||||
if (messages.length === 0) {
|
||||
+// chatMessages.innerHTML =
+// '<div class="message ai-message welcome-message">Chat history loaded. Ask a question!</div>';
 chatMessages.innerHTML =
-'<div class="message ai-message welcome-message">Chat history loaded. Ask a question!</div>';
+'<div class="message ai-message welcome-message">We will launch this feature very soon.</div>';
|
||||
}
|
||||
// Scroll to bottom after loading messages
|
||||
scrollToBottom();
|
||||
|
||||
@@ -36,7 +36,7 @@
|
||||
<div id="chat-input-area">
|
||||
<!-- Loading indicator for general waiting (optional) -->
|
||||
<!-- <div class="loading-indicator" style="display: none;">Thinking...</div> -->
-<textarea id="chat-input" placeholder="Ask about Crawl4AI..." rows="2"></textarea>
+<textarea id="chat-input" placeholder="We will roll out this feature very soon." rows="2" disabled></textarea>
|
||||
<button id="send-button">Send</button>
|
||||
</div>
|
||||
</main>
|
||||
|
||||
@@ -64,7 +64,7 @@ body {
|
||||
/* Apply side padding within the centered block */
|
||||
padding-left: calc(var(--global-space) * 2);
|
||||
padding-right: calc(var(--global-space) * 2);
|
||||
-/* Add margin-left to clear the fixed sidebar */
+/* Add margin-left to clear the fixed sidebar - ONLY ON DESKTOP */
|
||||
margin-left: var(--sidebar-width);
|
||||
}
|
||||
|
||||
@@ -81,7 +81,7 @@ body {
|
||||
z-index: 900;
|
||||
padding: 1em calc(var(--global-space) * 2);
|
||||
padding-bottom: 2em;
|
||||
-/* transition: left var(--layout-transition-speed) ease-in-out; */
+transition: left var(--layout-transition-speed) ease-in-out;
|
||||
}
|
||||
|
||||
/* --- 2. Main Content Area (Within Centered Grid) --- */
|
||||
@@ -188,21 +188,133 @@ footer {
|
||||
}
|
||||
}
|
||||
|
||||
/* --- Mobile Menu Styles --- */
|
||||
.mobile-menu-toggle {
|
||||
display: none; /* Hidden by default, shown in mobile */
|
||||
background: none;
|
||||
border: none;
|
||||
padding: 10px;
|
||||
cursor: pointer;
|
||||
z-index: 1200;
|
||||
margin-right: 10px;
|
||||
position: absolute;
|
||||
left: 10px;
|
||||
top: 50%;
|
||||
transform: translateY(-50%);
|
||||
/* Make sure it doesn't get moved */
|
||||
min-width: 30px;
|
||||
min-height: 30px;
|
||||
}
|
||||
|
||||
.hamburger-line {
|
||||
display: block;
|
||||
width: 22px;
|
||||
height: 2px;
|
||||
margin: 5px 0;
|
||||
background-color: var(--font-color);
|
||||
transition: transform 0.3s, opacity 0.3s;
|
||||
}
|
||||
|
||||
/* Hamburger animation */
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(1) {
|
||||
transform: translateY(7px) rotate(45deg);
|
||||
}
|
||||
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(2) {
|
||||
opacity: 0;
|
||||
}
|
||||
|
||||
.mobile-menu-toggle.is-active .hamburger-line:nth-child(3) {
|
||||
transform: translateY(-7px) rotate(-45deg);
|
||||
}
|
||||
|
||||
.mobile-menu-close {
|
||||
display: none; /* Hidden by default, shown in mobile */
|
||||
position: absolute;
|
||||
top: 10px;
|
||||
right: 10px;
|
||||
background: none;
|
||||
border: none;
|
||||
color: var(--font-color);
|
||||
font-size: 24px;
|
||||
cursor: pointer;
|
||||
z-index: 1200;
|
||||
padding: 5px 10px;
|
||||
}
|
||||
|
||||
.mobile-menu-backdrop {
|
||||
position: fixed;
|
||||
top: 0;
|
||||
left: 0;
|
||||
right: 0;
|
||||
bottom: 0;
|
||||
background-color: rgba(0, 0, 0, 0.7);
|
||||
z-index: 1050;
|
||||
}
|
||||
|
||||
/* --- Small screens: Hide left sidebar, full width content & footer --- */
|
||||
@media screen and (max-width: 768px) {
|
||||
/* Hide the terminal-menu from theme */
|
||||
.terminal-menu {
|
||||
display: none !important;
|
||||
}
|
||||
|
||||
/* Add padding to site name to prevent hamburger overlap */
|
||||
.terminal-mkdocs-site-name,
|
||||
.terminal-logo a,
|
||||
.terminal-nav .logo {
|
||||
padding-left: 40px !important;
|
||||
white-space: nowrap;
|
||||
overflow: hidden;
|
||||
text-overflow: ellipsis;
|
||||
}
|
||||
|
||||
/* Show mobile menu toggle button */
|
||||
.mobile-menu-toggle {
|
||||
display: block;
|
||||
}
|
||||
|
||||
/* Show mobile menu close button */
|
||||
.mobile-menu-close {
|
||||
display: block;
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel {
|
||||
-left: calc(-1 * var(--sidebar-width));
+left: -100%; /* Hide completely off-screen */
|
||||
z-index: 1100;
|
||||
box-shadow: 2px 0 10px rgba(0,0,0,0.3);
|
||||
top: 0; /* Start from top edge */
|
||||
height: 100%; /* Full height */
|
||||
transition: left 0.3s ease-in-out;
|
||||
padding-top: 50px; /* Space for close button */
|
||||
overflow-y: auto;
|
||||
width: 85%; /* Wider on mobile */
|
||||
max-width: 320px; /* Maximum width */
|
||||
background-color: var(--background-color); /* Ensure solid background */
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel.sidebar-visible {
|
||||
left: 0;
|
||||
}
|
||||
|
||||
/* Make navigation links more touch-friendly */
|
||||
#terminal-mkdocs-side-panel a {
|
||||
padding: 6px 15px;
|
||||
display: block;
|
||||
/* No border as requested */
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel ul {
|
||||
padding-left: 0;
|
||||
}
|
||||
|
||||
#terminal-mkdocs-side-panel ul ul a {
|
||||
padding-left: 10px;
|
||||
}
|
||||
|
||||
.terminal-mkdocs-main-grid {
|
||||
/* Grid now takes full width (minus body padding) */
|
||||
-margin-left: 0; /* Override sidebar margin */
+margin-left: 0 !important; /* Override sidebar margin with !important */
|
||||
margin-right: 0; /* Override auto margin */
|
||||
max-width: 100%; /* Allow full width */
|
||||
padding-left: var(--global-space); /* Reduce padding */
|
||||
@@ -224,7 +336,6 @@ footer {
|
||||
text-align: center;
|
||||
gap: 0.5em;
|
||||
}
|
||||
-/* Remember JS for toggle button & overlay */
|
||||
}
|
||||
|
||||
|
||||
@@ -301,17 +412,41 @@ footer {
|
||||
background-color: var(--primary-dimmed-color, #09b5a5);
|
||||
color: var(--background-color, #070708);
|
||||
border: none;
|
||||
-padding: 4px 8px;
+padding: 6px 10px;
|
||||
font-size: 0.8em;
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
-box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
-transition: background-color 0.2s ease;
+box-shadow: 0 3px 8px rgba(0, 0, 0, 0.3);
+transition: background-color 0.2s ease, transform 0.15s ease;
|
||||
white-space: nowrap;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
font-weight: 500;
|
||||
animation: askAiButtonAppear 0.2s ease-out;
|
||||
}
|
||||
|
||||
@keyframes askAiButtonAppear {
|
||||
from {
|
||||
opacity: 0;
|
||||
transform: scale(0.9);
|
||||
}
|
||||
to {
|
||||
opacity: 1;
|
||||
transform: scale(1);
|
||||
}
|
||||
}
|
||||
|
||||
.ask-ai-selection-button:hover {
|
||||
background-color: var(--primary-color, #50ffff);
|
||||
transform: scale(1.05);
|
||||
}
|
||||
|
||||
/* Mobile styles for Ask AI button */
|
||||
@media screen and (max-width: 768px) {
|
||||
.ask-ai-selection-button {
|
||||
padding: 8px 12px; /* Larger touch target on mobile */
|
||||
font-size: 0.9em; /* Slightly larger text */
|
||||
}
|
||||
}
|
||||
|
||||
/* ==== File: docs/assets/layout.css (Additions) ==== */
|
||||
|
||||
106
docs/md_v2/assets/mobile_menu.js
Normal file
@@ -0,0 +1,106 @@
|
||||
// mobile_menu.js - Hamburger menu for mobile view
|
||||
document.addEventListener('DOMContentLoaded', () => {
|
||||
// Get references to key elements
|
||||
const sidePanel = document.getElementById('terminal-mkdocs-side-panel');
|
||||
const mainHeader = document.querySelector('.terminal .container:first-child');
|
||||
|
||||
if (!sidePanel || !mainHeader) {
|
||||
console.warn('Mobile menu: Required elements not found');
|
||||
return;
|
||||
}
|
||||
|
||||
// Force hide sidebar on mobile
|
||||
const checkMobile = () => {
|
||||
if (window.innerWidth <= 768) {
|
||||
// Force with !important-like priority
|
||||
sidePanel.style.setProperty('left', '-100%', 'important');
|
||||
// Also hide terminal-menu from the theme
|
||||
const terminalMenu = document.querySelector('.terminal-menu');
|
||||
if (terminalMenu) {
|
||||
terminalMenu.style.setProperty('display', 'none', 'important');
|
||||
}
|
||||
} else {
|
||||
sidePanel.style.removeProperty('left');
|
||||
// Restore terminal-menu if it exists
|
||||
const terminalMenu = document.querySelector('.terminal-menu');
|
||||
if (terminalMenu) {
|
||||
terminalMenu.style.removeProperty('display');
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
// Run on initial load
|
||||
checkMobile();
|
||||
|
||||
// Also run on resize
|
||||
window.addEventListener('resize', checkMobile);
|
||||
|
||||
// Create hamburger button
|
||||
const hamburgerBtn = document.createElement('button');
|
||||
hamburgerBtn.className = 'mobile-menu-toggle';
|
||||
hamburgerBtn.setAttribute('aria-label', 'Toggle navigation menu');
|
||||
hamburgerBtn.innerHTML = `
|
||||
<span class="hamburger-line"></span>
|
||||
<span class="hamburger-line"></span>
|
||||
<span class="hamburger-line"></span>
|
||||
`;
|
||||
|
||||
// Create backdrop overlay
|
||||
const menuBackdrop = document.createElement('div');
|
||||
menuBackdrop.className = 'mobile-menu-backdrop';
|
||||
menuBackdrop.style.display = 'none';
|
||||
document.body.appendChild(menuBackdrop);
|
||||
|
||||
// Make sure it's properly hidden on page load
|
||||
if (window.innerWidth <= 768) {
|
||||
menuBackdrop.style.display = 'none';
|
||||
}
|
||||
|
||||
// Insert hamburger button into header
|
||||
mainHeader.insertBefore(hamburgerBtn, mainHeader.firstChild);
|
||||
|
||||
// Add menu close button to side panel
|
||||
const closeBtn = document.createElement('button');
|
||||
closeBtn.className = 'mobile-menu-close';
|
||||
closeBtn.setAttribute('aria-label', 'Close navigation menu');
|
||||
closeBtn.innerHTML = `×`;
|
||||
sidePanel.insertBefore(closeBtn, sidePanel.firstChild);
|
||||
|
||||
// Toggle function
|
||||
function toggleMobileMenu() {
|
||||
const isOpen = sidePanel.classList.toggle('sidebar-visible');
|
||||
|
||||
// Toggle backdrop
|
||||
menuBackdrop.style.display = isOpen ? 'block' : 'none';
|
||||
|
||||
// Toggle aria-expanded
|
||||
hamburgerBtn.setAttribute('aria-expanded', isOpen ? 'true' : 'false');
|
||||
|
||||
// Toggle hamburger animation class
|
||||
hamburgerBtn.classList.toggle('is-active');
|
||||
|
||||
// Force sidebar visibility setting
|
||||
if (isOpen) {
|
||||
sidePanel.style.setProperty('left', '0', 'important');
|
||||
} else {
|
||||
sidePanel.style.setProperty('left', '-100%', 'important');
|
||||
}
|
||||
|
||||
// Prevent body scrolling when menu is open
|
||||
document.body.style.overflow = isOpen ? 'hidden' : '';
|
||||
}
|
||||
|
||||
// Event listeners
|
||||
hamburgerBtn.addEventListener('click', toggleMobileMenu);
|
||||
closeBtn.addEventListener('click', toggleMobileMenu);
|
||||
menuBackdrop.addEventListener('click', toggleMobileMenu);
|
||||
|
||||
// Close menu on window resize to desktop
|
||||
window.addEventListener('resize', () => {
|
||||
if (window.innerWidth > 768 && sidePanel.classList.contains('sidebar-visible')) {
|
||||
toggleMobileMenu();
|
||||
}
|
||||
});
|
||||
|
||||
console.log('Mobile menu initialized');
|
||||
});
|
||||
@@ -8,12 +8,32 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
const button = document.createElement('button');
|
||||
button.id = 'ask-ai-selection-btn';
|
||||
button.className = 'ask-ai-selection-button';
|
||||
button.textContent = 'Ask AI'; // Or use an icon
|
||||
|
||||
// Add icon and text for better visibility
|
||||
button.innerHTML = `
|
||||
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="12" height="12" fill="currentColor" style="margin-right: 4px; vertical-align: middle;">
|
||||
<path d="M20 2H4c-1.1 0-2 .9-2 2v12c0 1.1.9 2 2 2h14l4 4V4c0-1.1-.9-2-2-2z"/>
|
||||
</svg>
|
||||
<span>Ask AI</span>
|
||||
`;
|
||||
|
||||
// Common styles
|
||||
button.style.display = 'none'; // Initially hidden
|
||||
button.style.position = 'absolute';
|
||||
button.style.zIndex = '1500'; // Ensure it's on top
|
||||
button.style.boxShadow = '0 3px 8px rgba(0, 0, 0, 0.4)'; // More pronounced shadow
|
||||
button.style.transition = 'transform 0.15s ease, background-color 0.2s ease'; // Smooth hover effect
|
||||
|
||||
// Add transform on hover
|
||||
button.addEventListener('mouseover', () => {
|
||||
button.style.transform = 'scale(1.05)';
|
||||
});
|
||||
|
||||
button.addEventListener('mouseout', () => {
|
||||
button.style.transform = 'scale(1)';
|
||||
});
|
||||
|
||||
document.body.appendChild(button);
|
||||
|
||||
button.addEventListener('click', handleAskAiClick);
|
||||
return button;
|
||||
}
|
||||
@@ -43,11 +63,38 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
const range = selection.getRangeAt(0);
|
||||
const rect = range.getBoundingClientRect();
|
||||
|
||||
// Calculate position: top-right of the selection
|
||||
// Get viewport dimensions
|
||||
const viewportWidth = window.innerWidth;
|
||||
const viewportHeight = window.innerHeight;
|
||||
|
||||
// Calculate position based on selection
|
||||
const scrollX = window.scrollX;
|
||||
const scrollY = window.scrollY;
|
||||
const buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
|
||||
const buttonLeft = rect.right + scrollX + 5; // 5px to the right
|
||||
|
||||
// Default position (top-right of selection)
|
||||
let buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 5; // 5px above
|
||||
let buttonLeft = rect.right + scrollX + 5; // 5px to the right
|
||||
|
||||
// Check if we're on mobile (which we define as a viewport 768px wide or narrower)
|
||||
const isMobile = viewportWidth <= 768;
|
||||
|
||||
if (isMobile) {
|
||||
// On mobile, position centered above selection to avoid edge issues
|
||||
buttonTop = rect.top + scrollY - askAiButton.offsetHeight - 10; // 10px above on mobile
|
||||
buttonLeft = rect.left + scrollX + (rect.width / 2) - (askAiButton.offsetWidth / 2); // Centered
|
||||
} else {
|
||||
// For desktop, ensure the button doesn't go off screen
|
||||
// Check right edge
|
||||
if (buttonLeft + askAiButton.offsetWidth > scrollX + viewportWidth) {
|
||||
buttonLeft = scrollX + viewportWidth - askAiButton.offsetWidth - 10; // 10px from right edge
|
||||
}
|
||||
}
|
||||
|
||||
// Check top edge (for all devices)
|
||||
if (buttonTop < scrollY) {
|
||||
// If would go above viewport, position below selection instead
|
||||
buttonTop = rect.bottom + scrollY + 5; // 5px below
|
||||
}
|
||||
|
||||
askAiButton.style.top = `${buttonTop}px`;
|
||||
askAiButton.style.left = `${buttonLeft}px`;
|
||||
@@ -77,8 +124,8 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
|
||||
// --- Event Listeners ---
|
||||
|
||||
// Show button on mouse up after selection
|
||||
document.addEventListener('mouseup', (event) => {
|
||||
// Function to handle selection events (both mouse and touch)
|
||||
function handleSelectionEvent(event) {
|
||||
// Slight delay to ensure selection is registered
|
||||
setTimeout(() => {
|
||||
const selectedText = getSafeSelectedText();
|
||||
@@ -86,7 +133,7 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
if (!askAiButton) {
|
||||
askAiButton = createAskAiButton();
|
||||
}
|
||||
// Don't position if the click was ON the button itself
|
||||
// Don't position if the event was ON the button itself
|
||||
if (event.target !== askAiButton) {
|
||||
positionButton(event);
|
||||
}
|
||||
@@ -94,16 +141,46 @@ document.addEventListener('DOMContentLoaded', () => {
|
||||
hideButton();
|
||||
}
|
||||
}, 10); // Small delay
|
||||
}
|
||||
|
||||
// Mouse selection events (desktop)
|
||||
document.addEventListener('mouseup', handleSelectionEvent);
|
||||
|
||||
// Touch selection events (mobile)
|
||||
document.addEventListener('touchend', handleSelectionEvent);
|
||||
document.addEventListener('selectionchange', () => {
|
||||
// This helps with mobile selection which can happen without mouseup/touchend
|
||||
setTimeout(() => {
|
||||
const selectedText = getSafeSelectedText();
|
||||
if (selectedText && askAiButton) {
|
||||
positionButton();
|
||||
}
|
||||
}, 300); // Longer delay for selection change
|
||||
});
|
||||
|
||||
// Hide button on scroll or click elsewhere
|
||||
// Hide button on various events
|
||||
document.addEventListener('mousedown', (event) => {
|
||||
// Hide if clicking anywhere EXCEPT the button itself
|
||||
if (askAiButton && event.target !== askAiButton) {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
document.addEventListener('touchstart', (event) => {
|
||||
// Same for touch events, but only hide if not on the button
|
||||
if (askAiButton && event.target !== askAiButton) {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
document.addEventListener('scroll', hideButton, true); // Capture scroll events
|
||||
|
||||
// Also hide when pressing Escape key
|
||||
document.addEventListener('keydown', (event) => {
|
||||
if (event.key === 'Escape') {
|
||||
hideButton();
|
||||
}
|
||||
});
|
||||
|
||||
console.log("Selection Ask AI script loaded.");
|
||||
});
|
||||
@@ -268,3 +268,6 @@ div.badges a > img {
|
||||
}
|
||||
|
||||
|
||||
table td, table th {
|
||||
border: 1px solid var(--code-bg-color) !important;
|
||||
}
|
||||
@@ -4,6 +4,32 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical
|
||||
|
||||
## Latest Release
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
|
||||
*April 23, 2025*
|
||||
|
||||
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
|
||||
|
||||
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
|
||||
|
||||
Other key changes:
|
||||
|
||||
* Native support for `result.media["tables"]` to export DataFrames
|
||||
* Full network + console logs and MHTML snapshot per crawl
|
||||
* Browser pooling and pre-warming for faster cold starts
|
||||
* New streaming endpoints via MCP API and Playground
|
||||
* Robots.txt support, proxy rotation, and improved session handling
|
||||
* Deprecated old markdown names, legacy modules cleaned up
|
||||
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
|
||||
|
||||
[Read full release notes →](releases/0.6.0.md)
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
|
||||
|
||||
|
||||
143
docs/md_v2/blog/releases/0.6.0.md
Normal file
@@ -0,0 +1,143 @@
|
||||
# Crawl4AI v0.6.0 Release Notes
|
||||
|
||||
We're excited to announce the release of **Crawl4AI v0.6.0**, our biggest and most feature-rich update yet. This version introduces major architectural upgrades, brand-new capabilities for geo-aware crawling, high-efficiency scraping, and real-time streaming support for scalable deployments.
|
||||
|
||||
---
|
||||
|
||||
## Highlights
|
||||
|
||||
### 1. **World-Aware Crawlers**
|
||||
Crawl as if you’re anywhere in the world. With v0.6.0, each crawl can simulate:
|
||||
- Specific GPS coordinates
|
||||
- Browser locale
|
||||
- Timezone
|
||||
|
||||
Example:
|
||||
```python
|
||||
CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo",
|
||||
locale="en-US",
|
||||
timezone_id="America/Los_Angeles",
|
||||
geolocation=GeolocationConfig(
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0
|
||||
)
|
||||
)
|
||||
```
|
||||
Great for accessing region-specific content or testing global behavior.
|
||||
|
||||
---
|
||||
|
||||
### 2. **Native Table Extraction**
|
||||
Extract HTML tables directly into usable formats like Pandas DataFrames or CSV with zero parsing hassle. All table data is available under `result.media["tables"]`.
|
||||
|
||||
Example:
|
||||
```python
|
||||
raw_df = pd.DataFrame(
|
||||
result.media["tables"][0]["rows"],
|
||||
columns=result.media["tables"][0]["headers"]
|
||||
)
|
||||
```
|
||||
This makes it ideal for scraping financial data, pricing pages, or anything tabular.
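As a rough illustration of the table shape used above, the sketch below converts one table entry to CSV text and to per-row dictionaries using only the standard library. The `sample_tables` data is invented for the example; it only assumes that each entry carries `headers` and `rows` keys, as in the snippet above. The helper names `table_to_csv` and `table_to_records` are hypothetical and not part of Crawl4AI.

```python
import csv
import io

# Hypothetical sample shaped like one entry of result.media["tables"]:
# a dict with "headers" (column names) and "rows" (lists of cell values).
sample_tables = [
    {
        "headers": ["Coin", "Price", "Change"],
        "rows": [
            ["BTC", "93,000", "+1.2%"],
            ["ETH", "1,800", "-0.4%"],
        ],
    }
]


def table_to_csv(table: dict) -> str:
    """Serialize one table dict (headers + rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


def table_to_records(table: dict) -> list:
    """Pair each row with the headers, giving one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


csv_text = table_to_csv(sample_tables[0])
records = table_to_records(sample_tables[0])
print(csv_text)
```

Cells that contain commas (such as `93,000`) are quoted by the `csv` writer, so the output stays parseable without extra handling.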
|
||||
|
||||
---
|
||||
|
||||
### 3. **Browser Pooling & Pre-Warming**
|
||||
We've overhauled browser management. Now, multiple browser instances can be pooled and pages pre-warmed for ultra-fast launches:
|
||||
- Reduces cold-start latency
|
||||
- Lowers memory spikes
|
||||
- Enhances parallel crawling stability
|
||||
|
||||
This powers the new **Docker Playground** experience and streamlines heavy-load crawling.
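To make the idea of pooling and pre-warming concrete, here is a conceptual sketch built on `asyncio.Queue`. It is not Crawl4AI's implementation: `WarmPool`, `fake_page`, and the page names are hypothetical stand-ins, and the "page" is just a string created after a short sleep that imitates launch cost.

```python
import asyncio


class WarmPool:
    """Toy pool: builds resources up front, then lends them out."""

    def __init__(self, factory, size: int):
        self._factory = factory  # async callable that builds one resource
        self._size = size
        self._idle = asyncio.Queue()

    async def warm_up(self) -> None:
        # Pay the creation cost once, before any crawl asks for a page.
        for _ in range(self._size):
            await self._idle.put(await self._factory())

    async def acquire(self):
        # Waits if every resource is currently in use.
        return await self._idle.get()

    async def release(self, resource) -> None:
        await self._idle.put(resource)


async def demo() -> list:
    created = []

    async def fake_page():
        # Stand-in for an expensive browser page launch.
        await asyncio.sleep(0.01)
        created.append(f"page-{len(created)}")
        return created[-1]

    pool = WarmPool(fake_page, size=2)
    await pool.warm_up()

    used = []
    for _ in range(5):  # five "crawls" share two pre-warmed pages
        page = await pool.acquire()
        used.append(page)
        await pool.release(page)
    return [len(created), sorted(set(used))]


result = asyncio.run(demo())
print(result)
```

The point of the sketch is that five requests are served by two resources created ahead of time, which is where the reduced cold-start latency comes from.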
|
||||
|
||||
---
|
||||
|
||||
### 4. **Traffic & Snapshot Capture**
|
||||
Need full visibility? You can now capture:
|
||||
- Full network traffic logs
|
||||
- Console output
|
||||
- MHTML page snapshots for post-crawl audits and debugging
|
||||
|
||||
No more guesswork on what happened during your crawl.
|
||||
|
||||
---
|
||||
|
||||
### 5. **MCP API and Streaming Support**
|
||||
We’re exposing **MCP socket and SSE endpoints**, allowing:
|
||||
- Live streaming of crawl results
|
||||
- Real-time integration with agents or frontends
|
||||
- A new Playground UI for interactive crawling
|
||||
|
||||
This is a major step towards making Crawl4AI real-time ready.
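For readers unfamiliar with SSE, the following sketch shows how a client might split a Server-Sent Events text stream into events. It follows the general SSE line format (`event:` and `data:` fields, a blank line ending each event, `:` for comments) and does not reflect the actual payloads of the Crawl4AI endpoints; the event names and JSON bodies in `sample_stream` are invented.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":  # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):  # comment / keep-alive line
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single optional space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Made-up stream: event names and payloads are illustrative only.
sample_stream = [
    ": keep-alive",
    "event: result",
    'data: {"url": "https://example.com", "status": "done"}',
    "",
    "data: plain text update",
    "",
]

events = list(parse_sse(sample_stream))
first_payload = json.loads(events[0][1])
print(events)
```

An event without an explicit `event:` field falls back to the default name `message`, which is why the second entry is reported that way.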
|
||||
|
||||
---
|
||||
|
||||
### 6. **Stress-Test Framework**
|
||||
Want to test performance under heavy load? v0.6.0 includes a new memory stress-test suite that supports 1,000+ URL workloads. Ideal for:
|
||||
- Load testing
|
||||
- Performance benchmarking
|
||||
- Validating memory efficiency
|
||||
|
||||
---
|
||||
|
||||
## Core Improvements
|
||||
- Robots.txt compliance
|
||||
- Proxy rotation support
|
||||
- Improved URL normalization and session reuse
|
||||
- Shared data across crawler hooks
|
||||
- New page routing logic
|
||||
|
||||
---
|
||||
|
||||
## Breaking Changes & Deprecations
|
||||
- Legacy `crawl4ai/browser/*` modules are removed. Update imports accordingly.
|
||||
- `AsyncPlaywrightCrawlerStrategy.get_page` now uses a new function signature.
|
||||
- Deprecated markdown generator aliases now point to `DefaultMarkdownGenerator` and emit a warning.
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous Updates
|
||||
- FastAPI validators replaced custom validation logic
|
||||
- Docker build now based on a Chromium layer
|
||||
- Repo-wide cleanup: ~36,000 insertions, ~5,000 deletions
|
||||
|
||||
---
|
||||
|
||||
## New Examples Included
|
||||
- Geo-location crawling
|
||||
- Network + console log capture
|
||||
- Docker MCP API usage
|
||||
- Markdown selector usage
|
||||
- Crypto project data extraction
|
||||
|
||||
---
|
||||
|
||||
## Watch the Release Video
|
||||
Want a visual walkthrough of all these updates? Watch the video:
|
||||
🔗 https://youtu.be/9x7nVcjOZks
|
||||
|
||||
If you're new to Crawl4AI, start here:
|
||||
🔗 https://www.youtube.com/watch?v=xo3qK6Hg9AA&t=15s
|
||||
|
||||
---
|
||||
|
||||
## Join the Community
|
||||
We’ve just opened up our **Discord** for the public. Join us to:
|
||||
- Ask questions
|
||||
- Share your projects
|
||||
- Get help or contribute
|
||||
|
||||
💬 https://discord.gg/wpYFACrHR4
|
||||
|
||||
---
|
||||
|
||||
## Install or Upgrade
|
||||
```bash
|
||||
pip install -U crawl4ai
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
Live long and import crawl4ai. 🖖
|
||||
|
||||
115
docs/md_v2/core/examples.md
Normal file
@@ -0,0 +1,115 @@
|
||||
# Code Examples
|
||||
|
||||
This page provides a comprehensive list of example scripts that demonstrate various features and capabilities of Crawl4AI. Each example is designed to showcase specific functionality, making it easier for you to understand how to implement these features in your own projects.
|
||||
|
||||
## Getting Started Examples
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Hello World | A simple introductory example demonstrating basic usage of AsyncWebCrawler with JavaScript execution and content filtering. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world.py) |
|
||||
| Quickstart | A comprehensive collection of examples showcasing various features including basic crawling, content cleaning, link analysis, JavaScript execution, CSS selectors, media handling, custom hooks, proxy configuration, screenshots, and multiple extraction strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py) |
|
||||
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
|
||||
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
|
||||
|
||||
## Browser & Crawling Features
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Built-in Browser | Demonstrates how to use the built-in browser capabilities. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/builtin_browser_example.py) |
|
||||
| Browser Optimization | Focuses on browser performance optimization techniques. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/browser_optimization_example.py) |
|
||||
| arun vs arun_many | Compares the `arun` and `arun_many` methods for single vs. multiple URL crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/arun_vs_arun_many.py) |
|
||||
| Multiple URLs | Shows how to crawl multiple URLs asynchronously. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/async_webcrawler_multiple_urls_example.py) |
|
||||
| Page Interaction | Guide on interacting with dynamic elements through clicks. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/tutorial_dynamic_clicks.md) |
|
||||
| Crawler Monitor | Shows how to monitor the crawler's activities and status. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawler_monitor_example.py) |
|
||||
| Full Page Screenshot & PDF | Guide on capturing full-page screenshots and PDFs from massive webpages. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/full_page_screenshot_and_pdf_export.md) |
|
||||
|
||||
## Advanced Crawling & Deep Crawling
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
|
||||
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
|
||||
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
|
||||
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
|
||||
|
||||
## Extraction Strategies
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Extraction Strategies | Demonstrates different extraction strategies with various input formats (markdown, HTML, fit_markdown) and JSON-based extractors (CSS and XPath). | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/extraction_strategies_examples.py) |
|
||||
| Scraping Strategies | Compares the performance of different scraping strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/scraping_strategies_performance.py) |
|
||||
| LLM Extraction | Demonstrates LLM-based extraction specifically for OpenAI pricing data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_extraction_openai_pricing.py) |
|
||||
| LLM Markdown | Shows how to use LLMs to generate markdown from crawled content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_markdown_generator.py) |
|
||||
| Summarize Page | Shows how to summarize web page content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/summarize_page.py) |
|
||||
|
||||
## E-commerce & Specialized Crawling
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Amazon Product Extraction | Demonstrates how to extract structured product data from Amazon search results using CSS selectors. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_direct_url.py) |
|
||||
| Amazon with Hooks | Shows how to use hooks with Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_hooks.py) |
|
||||
| Amazon with JavaScript | Demonstrates using custom JavaScript for Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_use_javascript.py) |
|
||||
| Crypto Analysis | Demonstrates how to crawl and analyze cryptocurrency data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crypto_analysis_example.py) |
|
||||
| SERP API | Demonstrates using Crawl4AI with search engine result pages. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/serp_api_project_11_feb.py) |
|
||||
|
||||
## Customization & Security
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Hooks | Illustrates how to use hooks at different stages of the crawling process for advanced customization. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hooks_example.py) |
|
||||
| Identity-Based Browsing | Illustrates identity-based browsing configurations for authentic browsing experiences. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/identity_based_browsing.py) |
|
||||
| Proxy Rotation | Shows how to use proxy rotation for web scraping and avoiding IP blocks. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/proxy_rotation_demo.py) |
|
||||
| SSL Certificate | Illustrates SSL certificate handling and verification. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/ssl_example.py) |
|
||||
| Language Support | Shows how to handle different languages during crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/language_support_example.py) |
|
||||
| Geolocation | Demonstrates how to use geolocation features. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/use_geo_location.py) |
|
||||
|
||||
## Docker & Deployment
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Docker Config | Demonstrates how to create and use Docker configuration objects. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_config_obj.py) |
|
||||
| Docker Basic | A test suite for Docker deployment, showcasing various functionalities through the Docker API. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py) |
|
||||
| Docker REST API | Shows how to interact with Crawl4AI Docker using REST API calls. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api.py) |
|
||||
| Docker SDK | Demonstrates using the Python SDK for Crawl4AI Docker. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk.py) |
|
||||
|
||||
## Application Examples
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Research Assistant | Demonstrates how to build a research assistant using Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/research_assistant.py) |
|
||||
| REST Call | Shows how to make REST API calls with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/rest_call.py) |
|
||||
| Chainlit Integration | Shows how to integrate Crawl4AI with Chainlit. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/chainlit.md) |
|
||||
| Crawl4AI vs FireCrawl | Compares Crawl4AI with the FireCrawl library. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawlai_vs_firecrawl.py) |
|
||||
|
||||
## Content Generation & Markdown
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Content Source | Demonstrates how to work with different content sources in markdown generation. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_example.py) |
|
||||
| Content Source (Short) | A simplified version of content source usage. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_short_example.py) |
|
||||
| Built-in Browser Guide | Guide for using the built-in browser capabilities. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/README_BUILTIN_BROWSER.md) |
|
||||
|
||||
## Running the Examples
|
||||
|
||||
To run any of these examples, you'll need to have Crawl4AI installed:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
```
|
||||
|
||||
Then, from the root of a cloned copy of the repository, you can run an example script like this:
|
||||
|
||||
```bash
|
||||
python -m docs.examples.hello_world
|
||||
```
|
||||
|
||||
For examples that require additional dependencies or environment variables, refer to the comments at the top of each file.
|
||||
|
||||
Some examples may require:
|
||||
- API keys (for LLM-based examples)
|
||||
- Docker setup (for Docker-related examples)
|
||||
- Additional dependencies (specified in the example files)
|
||||
|
||||
## Contributing New Examples
|
||||
|
||||
If you've created an interesting example that demonstrates a unique use case or feature of Crawl4AI, we encourage you to contribute it to our examples collection. Please see our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
|
||||
@@ -72,6 +72,14 @@ asyncio.run(main())
|
||||
|
||||
---
|
||||
|
||||
## Video Tutorial
|
||||
|
||||
<div align="center">
|
||||
<iframe width="560" height="315" src="https://www.youtube.com/embed/xo3qK6Hg9AA?start=15" title="Crawl4AI Tutorial" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## What Does Crawl4AI Do?
|
||||
|
||||
Crawl4AI is a feature-rich crawler and scraper that aims to:
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
site_name: Crawl4AI Documentation (v0.5.x)
|
||||
site_name: Crawl4AI Documentation (v0.6.x)
|
||||
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
|
||||
site_url: https://docs.crawl4ai.com
|
||||
repo_url: https://github.com/unclecode/crawl4ai
|
||||
@@ -9,6 +9,7 @@ nav:
|
||||
- Home: 'index.md'
|
||||
- "Ask AI": "core/ask-ai.md"
|
||||
- "Quick Start": "core/quickstart.md"
|
||||
- "Code Examples": "core/examples.md"
|
||||
- Setup & Installation:
|
||||
- "Installation": "core/installation.md"
|
||||
- "Docker Deployment": "core/docker-deployment.md"
|
||||
@@ -90,4 +91,5 @@ extra_javascript:
|
||||
- assets/github_stats.js
|
||||
- assets/selection_ask_ai.js
|
||||
- assets/copy_code.js
|
||||
- assets/floating_ask_ai_button.js
|
||||
- assets/floating_ask_ai_button.js
|
||||
- assets/mobile_menu.js
|
||||
@@ -8,7 +8,7 @@ dynamic = ["version"]
|
||||
description = "🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.9"
|
||||
license = {text = "MIT"}
|
||||
license = "Apache-2.0"
|
||||
authors = [
|
||||
{name = "Unclecode", email = "unclecode@kidocode.com"}
|
||||
]
|
||||
@@ -48,7 +48,6 @@ dependencies = [
|
||||
classifiers = [
|
||||
"Development Status :: 4 - Beta",
|
||||
"Intended Audience :: Developers",
|
||||
"License :: OSI Approved :: Apache Software License",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
|
||||
3
setup.py
@@ -49,13 +49,12 @@ setup(
|
||||
url="https://github.com/unclecode/crawl4ai",
|
||||
author="Unclecode",
|
||||
author_email="unclecode@kidocode.com",
|
||||
license="MIT",
|
||||
license="Apache-2.0",
|
||||
packages=find_packages(),
|
||||
package_data={"crawl4ai": ["js_snippet/*.js"]},
|
||||
classifiers=[
|
||||
"Development Status :: 3 - Alpha",
|
||||
"Intended Audience :: Developers",
|
||||
"License :: OSI Approved :: Apache Software License",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
|
||||
@@ -101,19 +101,19 @@ async def test_context(s: ClientSession):
|
||||
|
||||
|
||||
async def main() -> None:
|
||||
async with websocket_client("ws://localhost:8020/mcp/ws") as (r, w):
|
||||
async with websocket_client("ws://localhost:11235/mcp/ws") as (r, w):
|
||||
async with ClientSession(r, w) as s:
|
||||
await s.initialize() # handshake
|
||||
tools = (await s.list_tools()).tools
|
||||
print("tools:", [t.name for t in tools])
|
||||
|
||||
# await test_list()
|
||||
# await test_crawl(s)
|
||||
# await test_md(s)
|
||||
# await test_screenshot(s)
|
||||
# await test_pdf(s)
|
||||
# await test_execute_js(s)
|
||||
# await test_html(s)
|
||||
await test_crawl(s)
|
||||
await test_md(s)
|
||||
await test_screenshot(s)
|
||||
await test_pdf(s)
|
||||
await test_execute_js(s)
|
||||
await test_html(s)
|
||||
await test_context(s)
|
||||
|
||||
anyio.run(main)
|
||||
|
||||
32
tests/profiler/test_crteate_profile.py
Normal file
@@ -0,0 +1,32 @@
|
||||
from crawl4ai import BrowserProfiler
|
||||
import asyncio
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example usage
|
||||
profiler = BrowserProfiler()
|
||||
|
||||
# Create a new profile
|
||||
import os
|
||||
from pathlib import Path
|
||||
home_dir = Path.home()
|
||||
profile_path = asyncio.run(profiler.create_profile( str(home_dir / ".crawl4ai/profiles/test-profile")))
|
||||
|
||||
print(f"Profile created at: {profile_path}")
|
||||
|
||||
|
||||
|
||||
# # Launch a standalone browser
|
||||
# asyncio.run(profiler.launch_standalone_browser())
|
||||
|
||||
# # List profiles
|
||||
# profiles = profiler.list_profiles()
|
||||
# for profile in profiles:
|
||||
# print(f"Profile: {profile['name']}, Path: {profile['path']}")
|
||||
|
||||
# # Delete a profile
|
||||
# success = profiler.delete_profile("my-profile")
|
||||
# if success:
|
||||
# print("Profile deleted successfully")
|
||||
# else:
|
||||
# print("Failed to delete profile")
|
||||