feat(linkedin): add prospect-wizard app with scraping and visualization

Add a new LinkedIn prospect discovery tool with three main components:

- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:

- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
This commit is contained in:

126 docs/apps/linkdin/README.md (new file)
@@ -0,0 +1,126 @@
# Crawl4AI Prospect‑Wizard – step‑by‑step guide

A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.

```
prospect‑wizard/
├─ c4ai_discover.py           # Stage 1 – scrape companies + people
├─ c4ai_insights.py           # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html   # Stage 3 – graph viewer (static HTML)
└─ data/                      # output lands here (*.jsonl / *.json)
```

---

## 1 Install & boot a LinkedIn profile (one‑time)

### 1.1 Install dependencies

```bash
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
```

### 1.2 Create / warm a LinkedIn browser profile

```bash
crwl profiler
```

1. The interactive shell shows **New profile** – hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens – log in to LinkedIn, solve any CAPTCHA, then close the window.

> Remember the **profile name**. All future runs take `--profile-name <your_name>`.

---
## 2 Discovery – scrape companies & people

```bash
# geoUrn 102713980 = Malaysia
# --title-filters "" scrapes everyone; use e.g. "Product,Engineering" to narrow roles
# --max-companies / --max-people defaults are kept small for workshops
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug
```

**Outputs** in `./data/`:

* `companies.jsonl` – one JSON object per company
* `people.jsonl` – one JSON object per employee

🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
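Both outputs are newline-delimited JSON, so they reload with a few lines of stdlib Python. A minimal sketch (`load_jsonl` here is an illustrative helper, not part of the CLI):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    return [
        json.loads(line)
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

# companies = load_jsonl("data/companies.jsonl")
```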
### Handy geoUrn cheatsheet

| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| United States | **103644922** |
| United Kingdom | **102221843** |
| Australia | **101452733** |

_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
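A geoUrn combines with the query keywords exactly the way Stage 1 builds its seed URL; a minimal sketch of that construction:

```python
from urllib.parse import quote

def company_search_url(query: str, geo_urn: int) -> str:
    """Build the LinkedIn company-search seed URL the way Stage 1 does."""
    return (
        "https://www.linkedin.com/search/results/companies/"
        f"?keywords={quote(query)}&geoUrn={geo_urn}"
    )

print(company_search_url("health insurance management", 102713980))
# → https://www.linkedin.com/search/results/companies/?keywords=health%20insurance%20management&geoUrn=102713980
```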
---

## 3 Insights – embeddings, org‑charts, decision makers

```bash
python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed_model all-MiniLM-L6-v2 \
  --top_k 10 \
  --openai_model gpt-4.1 \
  --max_llm_tokens 8024 \
  --llm_temperature 1.0 \
  --workers 4
```

Emits next to the Stage‑1 files:

* `company_graph.json` – inter‑company similarity graph
* `org_chart_<handle>.json` – one per company
* `decision_makers.csv` – ranked ‘who to pitch’ list

Flags reference (straight from `build_arg_parser()`):
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage‑1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
| `--top_k` | `10` | Neighbours per company in the graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Sampling temperature |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
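Once `decision_makers.csv` exists, pandas makes triage quick. A sketch with stand-in data (the real header names come from your Stage‑2 run; `decision_score` mirrors the field in the org-chart JSON):

```python
import io

import pandas as pd

# Stand-in for data/decision_makers.csv; column names are assumptions
csv_text = io.StringIO(
    "name,title,decision_score\n"
    "Lily Ng,VP Product,0.9\n"
    "Sam Ali,Engineer,0.3\n"
)
df = pd.read_csv(csv_text)
top = df.sort_values("decision_score", ascending=False).head(5)
print(top.iloc[0]["name"])  # highest-scoring prospect
```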
---

## 4 Visualise – interactive graph

After Stage 2 completes, open the HTML viewer from the project root:

```bash
open graph_view_template.html   # macOS; or VS Code Live Server / `python -m http.server`
```

The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.

* Left pane → list of companies (clans).
* Click a node to load its org‑chart on the right.
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
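If `open` isn't available, or your browser blocks `file://` fetches of the JSON, any static server run from the project root works; a stdlib sketch:

```python
import http.server
import threading

def serve_project_root(port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Serve the current directory so graph_view_template.html can fetch data/*.json."""
    httpd = http.server.ThreadingHTTPServer(
        ("127.0.0.1", port), http.server.SimpleHTTPRequestHandler
    )
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd

# httpd = serve_project_root()  # then browse http://127.0.0.1:8000/graph_view_template.html
```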
---

## 5 Common snags

| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profiles, add a delay |
| Blank graph | Check the JSON paths, clear `localStorage` in the browser |
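On 429s, widening the jitter Stage 1 already applies between companies (`asyncio.sleep(random.uniform(0.5, 1))`) is the cheapest fix; a sketch of a reusable version (the helper name is illustrative):

```python
import asyncio
import random

async def polite_pause(lo: float = 1.0, hi: float = 3.0) -> float:
    """Sleep a random interval to avoid bursty, bot-like request timing."""
    delay = random.uniform(lo, hi)
    await asyncio.sleep(delay)
    return delay

# await polite_pause() between company pages when you start seeing 429s
```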
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### TL;DR
|
||||||
|
`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
|
||||||
|
Live long and `import crawl4ai`.
|
||||||
|
|
||||||
440 docs/apps/linkdin/c4ai_discover.py (new file)
@@ -0,0 +1,440 @@
#!/usr/bin/env python3
"""
c4ai-discover — Stage‑1 Discovery CLI

Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
JSON files: companies.jsonl and people.jsonl.

Key design rules
----------------
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
* JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
  from sample HTML provided by the user and then cached under ./schemas/.
* Defaults are embedded so the file runs inside the VS Code debugger without CLI args.
* If executed as a console script (argv > 1), CLI flags win.
* Lightweight deps: argparse + Crawl4AI stack.

Author: Tom @ Kidocode 2025‑04‑26
"""
from __future__ import annotations

import warnings

warnings.filterwarnings(
    "ignore",
    message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
    category=FutureWarning,
    module=r"soupsieve",
)

# ───────────────────────────────────────────────────────────────────────────────
# Imports
# ───────────────────────────────────────────────────────────────────────────────
import argparse
import asyncio
import json
import logging
import os
import pathlib
import random
import re
import sys

from datetime import datetime, UTC
from itertools import cycle
from pathlib import Path
from textwrap import dedent
from types import SimpleNamespace
from typing import Dict, List, Optional
from urllib.parse import quote

# 3rd-party rich for pretty logging
from rich.console import Console
from rich.logging import RichHandler

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    BrowserProfiler,
    CacheMode,
    CrawlerRunConfig,
    JsonCssExtractionStrategy,
    LLMConfig,
)
# ───────────────────────────────────────────────────────────────────────────────
# Constants / paths
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
SCHEMA_DIR = BASE_DIR / "schemas"
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"

# ---------- deterministic target JSON examples ----------
_COMPANY_SCHEMA_EXAMPLE = {
    "handle": "/company/posify/",
    "profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
    "name": "Management Research Services, Inc. (MRS, Inc)",
    "descriptor": "Insurance • Milwaukee, Wisconsin",
    "about": "Insurance • Milwaukee, Wisconsin",
    "followers": 1000,
}

_PEOPLE_SCHEMA_EXAMPLE = {
    "profile_url": "https://www.linkedin.com/in/lily-ng/",
    "name": "Lily Ng",
    "headline": "VP Product @ Posify",
    "followers": 890,
    "connection_degree": "2nd",
    "avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg",
}

# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate each schema.
_SAMPLE_COMPANY_HTML = (BASE_DIR / "snippets/company.html").read_text()
_SAMPLE_PEOPLE_HTML = (BASE_DIR / "snippets/people.html").read_text()

# --------- tighter schema prompts ----------
_COMPANY_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
    for every card, outputs *exactly* the keys shown in the example JSON below.
    JSON spec:
      • handle        – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
      • profile_image – absolute URL of the <img> inside that link
      • name          – text of the <a> inside the <span class*='t-16'>
      • descriptor    – text line with industry • location
      • about         – text of the <div class*='t-normal'> below the name (industry + geo)
      • followers     – integer parsed from the <div> containing 'followers'

    IMPORTANT: Do not use the base64-looking classes to target elements; they are not stable.
    The parent <div> containing these <li> elements is "div.search-results-container", and the
    parent <ul> has role="list". Using these two should be enough to target the <li> elements.
    """
)

_PEOPLE_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
    outputs exactly the keys in the example JSON below.
    Fields:
      • profile_url       – href of the outermost profile link
      • name              – text inside artdeco-entity-lockup__title
      • headline          – inner text of artdeco-entity-lockup__subtitle
      • followers         – integer parsed from the span inside lt-line-clamp--multi-line
      • connection_degree – '1st', '2nd', etc. from artdeco-entity-lockup__badge
      • avatar_url        – src of the <img> within artdeco-entity-lockup__image

    IMPORTANT: Do not use the base64-looking classes to target elements; they are not stable.
    The parent <div> containing these <li> elements has the classes
    "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
    """
)
# ---------------------------------------------------------------------------
# Utility helpers
# ---------------------------------------------------------------------------

def _load_or_build_schema(
    path: pathlib.Path,
    sample_html: str,
    query: str,
    example_json: Dict,
    force: bool = False,
) -> Dict:
    """Load schema from path, else call generate_schema once and persist."""
    if path.exists() and not force:
        return json.loads(path.read_text())

    logging.info("[SCHEMA] Generating schema %s", path.name)
    schema = JsonCssExtractionStrategy.generate_schema(
        html=sample_html,
        llm_config=LLMConfig(
            provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
            api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
        ),
        query=query,
        target_json_example=json.dumps(example_json, indent=2),
    )
    path.write_text(json.dumps(schema, indent=2))
    return schema


def _openai_friendly_number(text: str) -> Optional[int]:
    """Extract the first int from text like '1K followers' (returns 1000).

    Examples (illustrative):
        '1K followers'  -> 1000
        '2M followers'  -> 2000000
        'no digits'     -> None
    """
    m = re.search(r"(\d[\d,]*)", text.replace(",", ""))
    if not m:
        return None
    val = int(m.group(1))
    if "k" in text.lower():
        val *= 1000
    if "m" in text.lower():
        val *= 1_000_000
    return val
# ---------------------------------------------------------------------------
# Core async workers
# ---------------------------------------------------------------------------
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
    """Paginate 10-item company search pages until `limit` is reached."""
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        wait_for=".search-marvel-srp",
        session_id="company_search",
        delay_before_return_html=1,
        magic=True,
        verbose=False,
    )
    companies, page = [], 1
    while len(companies) < max(limit, 10):
        paged_url = f"{url}&page={page}"
        res = await crawler.arun(paged_url, config=cfg)
        batch = json.loads(res[0].extracted_content)
        if not batch:
            break
        for item in batch:
            name = item.get("name", "").strip()
            handle = item.get("handle", "").strip()
            if not handle or not name:
                continue
            descriptor = item.get("descriptor")
            about = item.get("about")
            followers = _openai_friendly_number(str(item.get("followers", "")))
            companies.append(
                {
                    "handle": handle,
                    "name": name,
                    "descriptor": descriptor,
                    "about": about,
                    "followers": followers,
                    "people_url": f"{handle}people/",
                    "captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
                }
            )
        logging.info(
            f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
        )
        page += 1

    return companies[: max(limit, 10)]


async def crawl_people_page(
    crawler: AsyncWebCrawler,
    people_url: str,
    schema: Dict,
    limit: int,
    title_kw: str,
) -> List[Dict]:
    people_u = f"{people_url}?keywords={quote(title_kw)}"
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        magic=True,
        wait_for=".org-people-profile-card__card-spacing",
        delay_before_return_html=1,
        session_id="people_search",
    )
    res = await crawler.arun(people_u, config=cfg)
    if not res[0].success:
        return []
    raw = json.loads(res[0].extracted_content)
    people = []
    for p in raw[:limit]:
        followers = _openai_friendly_number(str(p.get("followers", "")))
        people.append(
            {
                "profile_url": p.get("profile_url"),
                "name": p.get("name"),
                "headline": p.get("headline"),
                "followers": followers,
                "connection_degree": p.get("connection_degree"),
                "avatar_url": p.get("avatar_url"),
            }
        )
    return people
# ---------------------------------------------------------------------------
# CLI + main
# ---------------------------------------------------------------------------

def build_arg_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
    sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")

    def add_flags(parser: argparse.ArgumentParser):
        parser.add_argument("--query", required=False, help="query keyword(s)")
        parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
        parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
        parser.add_argument("--max-companies", type=int, default=1000)
        parser.add_argument("--max-people", type=int, default=500)
        parser.add_argument("--profile-name", default="profile_linkedin_uc", help="browser profile created via `crwl profiler`")
        parser.add_argument("--outdir", default="./output")
        parser.add_argument("--concurrency", type=int, default=4)
        parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])

    add_flags(sub.add_parser("full"))
    add_flags(sub.add_parser("companies"))
    add_flags(sub.add_parser("people"))

    # global flags
    ap.add_argument(
        "--debug",
        action="store_true",
        help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
    )
    return ap


def detect_debug_defaults(force: bool = False) -> SimpleNamespace:
    if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
        return SimpleNamespace()
    # ----- debug‑friendly defaults -----
    return SimpleNamespace(
        cmd="full",
        query="health insurance management",
        geo=102713980,
        title_filters="",  # or "Product,Engineering"
        max_companies=10,
        max_people=5,
        profile_name="profile_linkedin_uc",
        outdir="./debug_out",
        concurrency=2,
        log_level="debug",
    )
async def async_main(opts):
    # ─────────── logging setup ───────────
    console = Console()
    logging.basicConfig(
        level=opts.log_level.upper(),
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )

    # -------------------------------------------------------------------
    # Load or build schemas (one‑time LLM call each)
    # -------------------------------------------------------------------
    company_schema = _load_or_build_schema(
        COMPANY_SCHEMA_PATH,
        _SAMPLE_COMPANY_HTML,
        _COMPANY_SCHEMA_QUERY,
        _COMPANY_SCHEMA_EXAMPLE,
    )
    people_schema = _load_or_build_schema(
        PEOPLE_SCHEMA_PATH,
        _SAMPLE_PEOPLE_HTML,
        _PEOPLE_SCHEMA_QUERY,
        _PEOPLE_SCHEMA_EXAMPLE,
    )

    outdir = BASE_DIR / pathlib.Path(opts.outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    f_companies = (outdir / "companies.jsonl").open("a", encoding="utf-8")
    f_people = (outdir / "people.jsonl").open("a", encoding="utf-8")

    # -------------------------------------------------------------------
    # Prepare crawler with the persistent LinkedIn profile
    # -------------------------------------------------------------------
    profiler = BrowserProfiler()
    path = profiler.get_profile_path(opts.profile_name)
    bc = BrowserConfig(
        headless=False,
        verbose=False,
        user_data_dir=path,
        use_managed_browser=True,
        user_agent_mode="random",
        user_agent_generator_config={
            "platforms": "mobile",
            "os": "Android",
        },
    )
    crawler = AsyncWebCrawler(config=bc)

    await crawler.start()

    # Single worker for simplicity; concurrency can be scaled with arun_many if needed.
    # crawler = await next_crawler().start()
    try:
        # Build LinkedIn search URL
        search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
        logging.info("Seed URL => %s", search_url)

        companies: List[Dict] = []
        if opts.cmd in ("companies", "full"):
            companies = await crawl_company_search(
                crawler, search_url, company_schema, opts.max_companies
            )
            for c in companies:
                f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
            logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")

        if opts.cmd in ("people", "full"):
            if not companies:
                # load from a previous run
                src = outdir / "companies.jsonl"
                if not src.exists():
                    logging.error("companies.jsonl missing — run companies/full first")
                    return 10
                companies = [json.loads(l) for l in src.read_text().splitlines()]
            total_people = 0
            title_kw = " ".join(t.strip() for t in opts.title_filters.split(",") if t.strip()) if opts.title_filters else ""
            for comp in companies:
                people = await crawl_people_page(
                    crawler,
                    comp["people_url"],
                    people_schema,
                    opts.max_people,
                    title_kw,
                )
                for p in people:
                    rec = p | {
                        "company_handle": comp["handle"],
                        "captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
                    }
                    f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
                total_people += len(people)
                logging.info(
                    f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
                )
                await asyncio.sleep(random.uniform(0.5, 1))
            logging.info("Total people scraped: %d", total_people)
    finally:
        await crawler.close()
        f_companies.close()
        f_people.close()

    return 0


def main():
    parser = build_arg_parser()
    cli_opts = parser.parse_args()

    # decide on debug defaults
    if cli_opts.debug:
        opts = detect_debug_defaults(force=True)
    else:
        env_defaults = detect_debug_defaults()
        # an empty SimpleNamespace means "no debug defaults" — fall back to CLI flags
        opts = env_defaults if vars(env_defaults) else cli_opts

    if not getattr(opts, "cmd", None):
        opts.cmd = "full"

    exit_code = asyncio.run(async_main(opts))
    sys.exit(exit_code)


if __name__ == "__main__":
    main()
372 docs/apps/linkdin/c4ai_insights.py (new file)
@@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Stage-2 Insights builder
------------------------
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
  • company_graph.json
  • org_chart_<handle>.json   (one per company)
  • decision_makers.csv
  • graph_view.html           (interactive visualisation)

Run:
    python c4ai_insights.py --in ./stage1_out --out ./stage2_out

Author : Tom @ Kidocode, 2025-04-28
"""

from __future__ import annotations

# ───────────────────────────────────────────────────────────────────────────────
# Imports & third-party
# ───────────────────────────────────────────────────────────────────────────────
import argparse, asyncio, json, os, sys, pathlib, random, time, csv
import hashlib
import logging
from datetime import datetime, UTC
from pathlib import Path
from types import SimpleNamespace
from typing import List, Dict, Any

# Pretty CLI UX
from rich.console import Console
from rich.logging import RichHandler
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
from jinja2 import Environment, FileSystemLoader, select_autoescape

import numpy as np
import pandas as pd
# Heavy deps (sentence_transformers, sklearn) are imported lazily inside the
# functions that need them, to keep start-up fast.

from openai import OpenAI  # same SDK pre-loaded in the workshop

BASE_DIR = pathlib.Path(__file__).resolve().parent
# ───────────────────────────────────────────────────────────────────────────────
# Utils
# ───────────────────────────────────────────────────────────────────────────────
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(l) for l in f]


def dump_json(obj, path: Path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


# ───────────────────────────────────────────────────────────────────────────────
# Debug defaults (mirrors the Stage-1 trick)
# ───────────────────────────────────────────────────────────────────────────────
def dev_defaults() -> SimpleNamespace:
    return SimpleNamespace(
        in_dir="./debug_out",
        out_dir="./insights_debug",
        embed_model="all-MiniLM-L6-v2",
        top_k=10,
        openai_model="gpt-4.1",
        max_llm_tokens=8000,
        llm_temperature=1.0,
        workers=4,   # parallel LLM workers
        stub=False,  # set True to skip OpenAI calls
    )
# ───────────────────────────────────────────────────────────────────────────────
# Graph builders
# ───────────────────────────────────────────────────────────────────────────────
def embed_descriptions(companies, model_name: str, opts) -> np.ndarray:
    from sentence_transformers import SentenceTransformer

    logging.debug(f"Using embedding model: {model_name}")
    cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
    cache = {}
    if cache_path.exists():
        with open(cache_path) as f:
            cache = json.load(f)
        # flush the cache if the model differs
        if cache.get("_model") != model_name:
            cache = {}

    model = SentenceTransformer(model_name)
    new_texts, new_indices = [], []
    # all-MiniLM-L6-v2 produces 384-dim embeddings
    vectors = np.zeros((len(companies), 384), dtype=np.float32)

    for idx, comp in enumerate(companies):
        text = comp.get("about") or comp.get("descriptor", "")
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        cached = cache.get(comp["handle"])
        if cached and cached["hash"] == h:
            vectors[idx] = np.array(cached["vector"], dtype=np.float32)
        else:
            new_texts.append(text)
            new_indices.append((idx, comp["handle"], h))

    if new_texts:
        embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
        for vec, (idx, handle, h) in zip(embeds, new_indices):
            vectors[idx] = vec
            cache[handle] = {"hash": h, "vector": vec.tolist()}
    cache["_model"] = model_name
    with open(cache_path, "w") as f:
        json.dump(cache, f)

    return vectors
def build_company_graph(companies, embeds: np.ndarray, top_k: int) -> Dict[str, Any]:
    from sklearn.metrics.pairwise import cosine_similarity

    sims = cosine_similarity(embeds)
    nodes, edges = [], []
    for i, c in enumerate(companies):
        node = dict(
            id=c["handle"].strip("/"),
            name=c["name"],
            handle=c["handle"],
            about=c.get("about", ""),
            people_url=c.get("people_url", ""),
            industry=c.get("descriptor", "").split("•")[0].strip(),
            geoUrn=c.get("geoUrn"),
            followers=c.get("followers", 0),
            desc_embed=[],  # raw vectors omitted to keep the JSON small
        )
        nodes.append(node)
        # pick the top-k most similar companies, excluding itself
        top_idx = np.argsort(sims[i])[::-1][1 : top_k + 1]
        for j in top_idx:
            tgt = companies[j]
            weight = float(sims[i, j])
            if node["industry"] == tgt.get("descriptor", "").split("•")[0].strip():
                weight += 0.10
            if node["geoUrn"] == tgt.get("geoUrn"):
                weight += 0.05
            tgt["followers"] = tgt.get("followers", None) or 1
            node["followers"] = node.get("followers", None) or 1
            follower_ratio = min(node["followers"], tgt["followers"]) / max(node["followers"], tgt["followers"])
            weight += 0.05 * follower_ratio
            edges.append(dict(
                source=node["id"],
                target=tgt["handle"].strip("/"),
                weight=round(weight, 4),
                drivers=dict(
                    embed_sim=round(float(sims[i, j]), 4),
                    industry_match=0.10 if node["industry"] == tgt.get("descriptor", "").split("•")[0].strip() else 0,
                    geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
                ),
            ))
    return {"nodes": nodes, "edges": edges, "meta": {"generated_at": datetime.now(UTC).isoformat()}}
||||||
|
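The edge weight blends embedding similarity with fixed industry and geography bonuses plus a follower-ratio term. A minimal sketch of that blend in isolation (`blend_weight` is a hypothetical helper, not part of the module):

```python
# Illustrative re-statement of the edge-weight blend used in build_company_graph.
def blend_weight(embed_sim, same_industry, same_geo, followers_a, followers_b):
    weight = embed_sim
    if same_industry:
        weight += 0.10           # fixed industry-match bonus
    if same_geo:
        weight += 0.05           # fixed geo-overlap bonus
    followers_a, followers_b = followers_a or 1, followers_b or 1
    # companies of similar audience size score higher (ratio in (0, 1])
    weight += 0.05 * (min(followers_a, followers_b) / max(followers_a, followers_b))
    return round(weight, 4)

print(blend_weight(0.82, True, False, 1000, 500))  # 0.945
```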
# ───────────────────────────────────────────────────────────────────────────────
# Org-chart via LLM
# ───────────────────────────────────────────────────────────────────────────────
async def infer_org_chart_llm(company, people, client: OpenAI, model_name: str, max_tokens: int, temperature: float, stub: bool):
    if stub:
        # Tiny fake org-chart when debugging offline
        chief = random.choice(people)
        nodes = [{
            "id": chief["profile_url"],
            "name": chief["name"],
            "title": chief["headline"],
            "dept": chief["headline"].split()[0],
            "yoe_total": 8,
            "yoe_current": 2,
            "seniority_score": 0.8,
            "decision_score": 0.9,
            "avatar_url": chief.get("avatar_url"),
        }]
        return {"nodes": nodes, "edges": [], "meta": {"debug_stub": True, "generated_at": datetime.now(UTC).isoformat()}}

    prompt = [
        {"role": "system", "content": "You are an expert B2B org-chart reasoner."},
        {"role": "user", "content": f"""Here is the company description:

<company>
{json.dumps(company, ensure_ascii=False)}
</company>

Here is a JSON list of employees:
<employees>
{json.dumps(people, ensure_ascii=False)}
</employees>

1) Build a reporting tree (manager -> direct reports)
2) For each person output a decision_score 0-1 for buying new software

Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
"""},
    ]
    resp = client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        response_format={"type": "json_object"},
    )
    chart = json.loads(resp.choices[0].message.content)
    chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
    return chart

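Because the chart comes back from an LLM, it is worth validating its shape before writing it to disk. `validate_chart` below is a hypothetical sketch of such a check (not part of the module), using the field names the prompt requests:

```python
# Hypothetical sanity check for an LLM-produced org chart.
def validate_chart(chart):
    assert isinstance(chart.get("nodes"), list) and isinstance(chart.get("edges"), list)
    node_ids = {n["id"] for n in chart["nodes"]}
    for n in chart["nodes"]:
        # the prompt asks for a decision_score in [0, 1]
        assert 0.0 <= n.get("decision_score", 0) <= 1.0
    for e in chart["edges"]:
        # every reporting edge must reference known people
        assert e["source"] in node_ids and e["target"] in node_ids
    return True

chart = {
    "nodes": [{"id": "u1", "decision_score": 0.9}, {"id": "u2", "decision_score": 0.3}],
    "edges": [{"source": "u1", "target": "u2"}],
}
print(validate_chart(chart))  # True
```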
# ───────────────────────────────────────────────────────────────────────────────
# CSV flatten
# ───────────────────────────────────────────────────────────────────────────────
def export_decision_makers(charts_dir: Path, csv_path: Path, threshold: float = 0.5):
    rows = []
    for p in charts_dir.glob("org_chart_*.json"):
        data = json.loads(p.read_text())
        comp = p.stem.split("org_chart_")[1]
        for n in data.get("nodes", []):
            if n.get("decision_score", 0) >= threshold:
                rows.append(dict(
                    company=comp,
                    person=n["name"],
                    title=n["title"],
                    decision_score=n["decision_score"],
                    profile_url=n["id"],
                ))
    pd.DataFrame(rows).to_csv(csv_path, index=False)

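A stdlib-only sketch of the same threshold filter (the real helper uses pandas; `decision_makers` here is illustrative, not part of the module):

```python
import json, tempfile
from pathlib import Path

# Stand-alone sketch of the decision-maker threshold filter.
def decision_makers(charts_dir: Path, threshold: float = 0.5):
    rows = []
    for p in sorted(charts_dir.glob("org_chart_*.json")):
        company = p.stem.split("org_chart_")[1]
        for n in json.loads(p.read_text()).get("nodes", []):
            if n.get("decision_score", 0) >= threshold:
                rows.append((company, n["name"], n["decision_score"]))
    return rows

with tempfile.TemporaryDirectory() as d:
    chart = {"nodes": [{"name": "A", "decision_score": 0.9},
                       {"name": "B", "decision_score": 0.2}]}
    (Path(d) / "org_chart_acme.json").write_text(json.dumps(chart))
    result = decision_makers(Path(d))

print(result)  # [('acme', 'A', 0.9)]
```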
# ───────────────────────────────────────────────────────────────────────────────
# HTML rendering
# ───────────────────────────────────────────────────────────────────────────────
def render_html(out: Path, template_dir: Path):
    # Copy graph_view_template.html (renamed to graph_view.html) and ai.js into the output folder
    import shutil
    shutil.copy(template_dir / "graph_view_template.html", out / "graph_view.html")
    shutil.copy(template_dir / "ai.js", out)


# ───────────────────────────────────────────────────────────────────────────────
# Main async pipeline
# ───────────────────────────────────────────────────────────────────────────────
async def run(opts):
    # ── silence SDK noise ──────────────────────────────────────────────────────
    for noisy in ("openai", "httpx", "httpcore"):
        lg = logging.getLogger(noisy)
        lg.setLevel(logging.WARNING)  # or ERROR for total silence
        lg.propagate = False          # optional: stop them reaching the root logger

    # ────────────── logging bootstrap ──────────────
    console = Console()
    logging.basicConfig(
        level="INFO",
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )

    in_dir = BASE_DIR / Path(opts.in_dir)
    out_dir = BASE_DIR / Path(opts.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    companies = load_jsonl(in_dir / "companies.jsonl")
    people = load_jsonl(in_dir / "people.jsonl")

    logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")

    logging.info("[bold]⇢[/] Embedding company descriptions…")
    embeds = embed_descriptions(companies, opts.embed_model, opts)

    logging.info("[bold]⇢[/] Building similarity graph")
    company_graph = build_company_graph(companies, embeds, opts.top_k)
    dump_json(company_graph, out_dir / "company_graph.json")

    # OpenAI client (only built when not running with --stub)
    stub = bool(opts.stub)
    client = OpenAI() if not stub else None

    # Filter companies that still need processing
    to_process = []
    for comp in companies:
        handle = comp["handle"].strip("/").replace("/", "_")
        out_file = out_dir / f"org_chart_{handle}.json"
        if out_file.exists():
            logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
            continue
        to_process.append(comp)

    if not to_process:
        logging.info("[yellow]All companies already processed[/]")
    else:
        workers = getattr(opts, "workers", 1)
        parallel = workers > 1

        logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")

        with Progress(
            SpinnerColumn(),
            BarColumn(),
            TextColumn("[progress.description]{task.description}"),
            TimeElapsedColumn(),
            console=console,
        ) as progress:
            task = progress.add_task("Org charts", total=len(to_process))

            async def process_one(comp):
                handle = comp["handle"].strip("/").replace("/", "_")
                persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]

                chart = await infer_org_chart_llm(
                    comp, persons,
                    client=client if client else OpenAI(api_key="sk-debug"),
                    model_name=opts.openai_model,
                    max_tokens=opts.max_llm_tokens,
                    temperature=opts.llm_temperature,
                    stub=stub,
                )
                chart["meta"]["company"] = comp["name"]

                # Save the result immediately
                dump_json(chart, out_dir / f"org_chart_{handle}.json")

                progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")

            # Create tasks for all companies
            tasks = [process_one(comp) for comp in to_process]

            # Bound concurrency with a semaphore sized to the worker count
            semaphore = asyncio.Semaphore(workers)

            async def bounded_process(coro):
                async with semaphore:
                    return await coro

            # Run with concurrency control
            await asyncio.gather(*(bounded_process(t) for t in tasks))

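The semaphore-plus-gather pattern above generalizes to any batch of coroutines: at most `workers` run at once, regardless of how many are scheduled. A self-contained sketch (`bounded_gather` is an illustrative helper, not part of the pipeline):

```python
import asyncio

# Run a batch of coroutines with bounded concurrency; results keep input order.
async def bounded_gather(coros, workers=4):
    sem = asyncio.Semaphore(workers)

    async def bound(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bound(c) for c in coros))

async def demo():
    async def job(i):
        await asyncio.sleep(0)  # stand-in for an LLM call
        return i * i
    return await bounded_gather([job(i) for i in range(5)], workers=2)

print(asyncio.run(demo()))  # [0, 1, 4, 9, 16]
```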
    logging.info("[bold]⇢[/] Flattening decision-makers CSV")
    export_decision_makers(out_dir, out_dir / "decision_makers.csv")

    render_html(out_dir, template_dir=BASE_DIR / "templates")
    console.print(f"[bold green]✓[/] Stage-2 artefacts written to {out_dir}")


# ───────────────────────────────────────────────────────────────────────────────
# CLI
# ───────────────────────────────────────────────────────────────────────────────
def build_arg_parser():
    p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
    p.add_argument("--in", dest="in_dir", default=".", help="Stage-1 output dir")
    p.add_argument("--out", dest="out_dir", default=".", help="Destination dir")
    p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
    p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
    p.add_argument("--openai_model", default="gpt-4.1")
    p.add_argument("--max_llm_tokens", type=int, default=8024)
    p.add_argument("--llm_temperature", type=float, default=1.0)
    p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
    p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
    return p


def main():
    opts = build_arg_parser().parse_args()
    asyncio.run(run(opts))


if __name__ == "__main__":
    main()
docs/apps/linkdin/schemas/company_card.json (new file, 39 lines)
{
  "name": "LinkedIn Company Card",
  "baseSelector": "div.search-results-container ul[role='list'] > li",
  "fields": [
    {
      "name": "handle",
      "selector": "a[href*='/company/']",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "profile_image",
      "selector": "a[href*='/company/'] img",
      "type": "attribute",
      "attribute": "src"
    },
    {
      "name": "name",
      "selector": "span[class*='t-16'] a",
      "type": "text"
    },
    {
      "name": "descriptor",
      "selector": "div[class*='t-black t-normal']",
      "type": "text"
    },
    {
      "name": "about",
      "selector": "p[class*='entity-result__summary--2-lines']",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": "div:contains('followers')",
      "type": "regex",
      "pattern": "(\\d+)\\s*followers"
    }
  ]
}
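The `followers` field relies on the pattern `(\d+)\s*followers`. A quick check shows it matches exact counts but not LinkedIn's abbreviated form ("1K followers", as in the HTML snippet below); the `tolerant` variant is a hypothetical alternative, not part of the shipped schema:

```python
import re

# Pattern used by the schema's "followers" field (re-checked in isolation).
pattern = re.compile(r"(\d+)\s*followers")
print(pattern.search("727 followers").group(1))  # 727

# Abbreviated counts do not match plain \d+:
print(pattern.search("1K followers"))  # None

# A more tolerant variant (hypothetical):
tolerant = re.compile(r"([\d.,]+[KMB]?)\s*followers", re.IGNORECASE)
print(tolerant.search("1K followers").group(1))  # 1K
```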
docs/apps/linkdin/schemas/people_card.json (new file, 38 lines)
{
  "name": "LinkedIn People Card",
  "baseSelector": "li.org-people-profile-card__profile-card-spacing",
  "fields": [
    {
      "name": "profile_url",
      "selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "name",
      "selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
      "type": "text"
    },
    {
      "name": "headline",
      "selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": ".lt-line-clamp--multi-line.t-12",
      "type": "text"
    },
    {
      "name": "connection_degree",
      "selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
      "type": "text"
    },
    {
      "name": "avatar_url",
      "selector": ".artdeco-entity-lockup__image img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
}
docs/apps/linkdin/snippets/company.html (new file, 143 lines)
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
  <!----><!---->
  <div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
       data-view-name="search-entity-result-universal-template">
    <div class="linked-area flex-1 cursor-pointer">
      <div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
        <div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
          <div class="display-flex align-items-center">
            <!---->
            <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down" aria-hidden="true"
               tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
               data-test-app-aware-link="">
              <div class="ivm-image-view-model">
                <div class="ivm-view-attr__img-wrapper">
                  <!----><!----> <img width="48"
                    src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&v=beta&t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
                    loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
                    id="ember28"
                    class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
                </div>
              </div>
            </a>
          </div>
        </div>
        <div class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
          <div class="mb1">
            <div class="t-roman t-sans">
              <div class="display-flex">
                <span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI">
                  <span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo t-16">
                    <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo"
                       href="https://www.linkedin.com/company/managment-research-services-inc./"
                       data-test-app-aware-link="">
                      <!---->Management Research Services, Inc. (MRS, Inc)<!---->
                    <!----></a>
                  <!----></span>
                </span>
                <!---->
              </div>
            </div>
            <div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii t-14 t-black t-normal">
              <!---->Insurance • Milwaukee, Wisconsin<!---->
            </div>
            <div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ t-14 t-normal">
              <!---->1K followers<!---->
            </div>
          </div>
          <!---->
          <p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg entity-result__summary--2-lines t-12 t-black--light">
            <!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre"> </span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
            Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span class="white-space-pre"> </span>Industry with customized<span class="white-space-pre"> </span><strong><!---->insurance<!----></strong><span class="white-space-pre"> </span>underwriting solutions that efficiently support clients’ workflows. Supported by the
            Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
            optimize requirements...<!---->
          </p>
          <!---->
        </div>
        <div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
          <!---->
          <div>
            <button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
                    class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
                    type="button"><!---->
              <span class="artdeco-button__text">
                Follow
              </span></button>
            <!---->
            <!---->
          </div>
        </div>
      </div>
    </div>
  </div>
</li>
docs/apps/linkdin/snippets/people.html (new file, 94 lines)
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
  <div>
    <section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
      <!---->
      <img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
           ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
           class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
      <div class="org-people-profile-card__profile-info">
        <div id="ember97"
             class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
          <div id="ember98"
               class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
               type="circle">
            <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo"
               id="org-people-profile-card__profile-image-0"
               href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
               data-test-app-aware-link="">
              <img width="104"
                   src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&v=beta&t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
                   loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
                   class="evi-image lazy-image ember-view">
            </a>
          </div>
          <div id="ember100" class="artdeco-entity-lockup__content ember-view">
            <div id="ember101" class="artdeco-entity-lockup__title ember-view">
              <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
                 aria-label="View Dr. Rayna S.’s profile"
                 href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
                 data-test-app-aware-link="">
                <div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY t-black" style="">
                  Dr. Rayna S.
                  <!---->
                </div>
              </a>
            </div>
            <div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
                degree connection</span>
              <span class="artdeco-entity-lockup__degree" aria-hidden="true">
                · 3rd
              </span>
              <!----><!---->
            </div>
            <div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
              <div class="t-14 t-black--light t-normal">
                <div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
                     style="-webkit-line-clamp: 2">
                  Leadership and Talent Development Consultant and Professional Speaker
                  <!---->
                </div>
              </div>
            </div>
            <div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
          </div>
        </div>
        <span class="text-align-center">
          <span id="ember110"
                class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
                style="-webkit-line-clamp: 3">
            727 followers
            <!----> </span>
        </span>
      </div>
      <footer class="ph3 pb3">
        <button aria-label="Follow Dr. Rayna S." id="ember111"
                class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
                type="button"><!---->
          <span class="artdeco-button__text">
            Follow
          </span></button>
      </footer>
    </section>
  </div>
</li>
docs/apps/linkdin/templates/ai.js (new file, 50 lines)
// ==== File: ai.js ====

class ApiHandler {
  constructor(apiKey = null) {
    this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
    console.log("ApiHandler ready");
  }

  setApiKey(k) {
    this.apiKey = k.trim();
    if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
  }

  async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
    if (!this.apiKey) throw new Error("OpenAI API key missing");
    const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
    const controller = new AbortController();

    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
    const reader = res.body.getReader();
    const dec = new TextDecoder();

    let buf = "";
    while (true) {
      const {done, value} = await reader.read();
      if (done) break;
      buf += dec.decode(value, {stream: true});
      // Process only complete SSE lines; keep any trailing partial line buffered
      const lines = buf.split("\n");
      buf = lines.pop();
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        if (line.includes("[DONE]")) return;
        const json = JSON.parse(line.slice(6));
        const delta = json.choices?.[0]?.delta?.content;
        if (delta) yield delta;
      }
    }
  }
}

window.API = new ApiHandler();

docs/apps/linkdin/templates/graph_view_template.html (new file, 1171 lines; diff suppressed because it is too large)