feat(linkedin): add prospect-wizard app with scraping and visualization

Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
This commit is contained in:
UncleCode
2025-04-30 19:38:25 +08:00
parent 9499164d3c
commit 50f0b83fcd
9 changed files with 2473 additions and 0 deletions

126
docs/apps/linkdin/README.md Normal file
View File

@@ -0,0 +1,126 @@
# Crawl4AIProspectWizard stepbystep guide
A threestage demo that goes from **LinkedIn scraping****LLM reasoning****graph visualisation**.
```
prospectwizard/
├─ c4ai_discover.py # Stage 1 scrape companies + people
├─ c4ai_insights.py # Stage 2 embeddings, orgcharts, scores
├─ graph_view_template.html # Stage 3 graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
```
---
## 1  Install & boot a LinkedIn profile (onetime)
### 1.1  Install dependencies
```bash
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
```
### 1.2  Create / warm a LinkedIn browser profile
```bash
crwl profiler
```
1. The interactive shell shows **New profile** hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens log in to LinkedIn, solve whatever CAPTCHA, then close.
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
---
## 2  Discovery scrape companies & people
```bash
python c4ai_discover.py full \
--query "health insurance management" \
--geo 102713980 \ # Malaysia geoUrn
--title_filters "" \ # or "Product,Engineering"
--max_companies 10 \ # default set small for workshops
--max_people 20 \ # \^ same
--profile-name profile_linkedin_uc \
--outdir ./data \
--concurrency 2 \
--log_level debug
```
**Outputs** in `./data/`:
* `companies.jsonl` one JSON per company
* `people.jsonl` one JSON per employee
🛠️ **Dryrun:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
### Handy geoUrn cheatsheet
| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| UnitedStates | **103644922** |
| UnitedKingdom | **102221843** |
| Australia | **101452733** |
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> the number after `geoUrn=` is what you need._
---
## 3  Insights embeddings, orgcharts, decision makers
```bash
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed_model all-MiniLM-L6-v2 \
--top_k 10 \
--openai_model gpt-4.1 \
--max_llm_tokens 8024 \
--llm_temperature 1.0 \
--workers 4
```
Emits next to the Stage1 files:
* `company_graph.json` intercompany similarity graph
* `org_chart_<handle>.json` one per company
* `decision_makers.csv` handpicked who to pitch list
Flags reference (straight from `build_arg_parser()`):
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | SentenceTransformer model |
| `--top_k` | `10` | Neighbours per company in graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Creativity knob |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
---
## 4  Visualise interactive graph
After Stage 2 completes, simply open the HTML viewer from the project root:
```bash
open graph_view_template.html # or Live Server / Python -http
```
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
* Left pane → list of companies (clans).
* Click a node to load its orgchart on the right.
* Chat drawer lets you ask followup questions; context is pulled from `people.jsonl`.
---
## 5  Common snags
| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
| Blank graph | Check JSON paths, clear `localStorage` in browser |
---
### TL;DR
`crwl profiler``c4ai_discover.py``c4ai_insights.py` → open `graph_view_template.html`.
Live long and `import crawl4ai`.

View File

@@ -0,0 +1,440 @@
#!/usr/bin/env python3
"""
c4ai-discover — Stage1 Discovery CLI
Scrapes LinkedIn company search + their people pages and dumps two newlinedelimited
JSON files: companies.jsonl and people.jsonl.
Key design rules
----------------
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
* JsonCssExtractionStrategy for structured scraping; schema autogenerated once
from sample HTML provided by user and then cached under ./schemas/.
* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
* If executed as a console script (argv > 1), CLI flags win.
* Lightweight deps: argparse + Crawl4AI stack.
Author: Tom @ Kidocode 20250426
"""
from __future__ import annotations
import warnings, re
warnings.filterwarnings(
"ignore",
message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
category=FutureWarning,
module=r"soupsieve"
)
# ───────────────────────────────────────────────────────────────────────────────
# Imports
# ───────────────────────────────────────────────────────────────────────────────
import argparse
import random
import asyncio
import json
import logging
import os
import pathlib
import sys
# 3rd-party rich for pretty logging
from rich.console import Console
from rich.logging import RichHandler
from datetime import datetime, UTC
from itertools import cycle
from textwrap import dedent
from types import SimpleNamespace
from typing import Dict, List, Optional
from urllib.parse import quote
from pathlib import Path
from glob import glob
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CacheMode,
CrawlerRunConfig,
JsonCssExtractionStrategy,
BrowserProfiler,
LLMConfig,
)
# ───────────────────────────────────────────────────────────────────────────────
# Constants / paths
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
SCHEMA_DIR = BASE_DIR / "schemas"
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
# ---------- deterministic target JSON examples ----------
_COMPANY_SCHEMA_EXAMPLE = {
"handle": "/company/posify/",
"profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
"name": "Management Research Services, Inc. (MRS, Inc)",
"descriptor": "Insurance • Milwaukee, Wisconsin",
"about": "Insurance • Milwaukee, Wisconsin",
"followers": 1000
}
_PEOPLE_SCHEMA_EXAMPLE = {
"profile_url": "https://www.linkedin.com/in/lily-ng/",
"name": "Lily Ng",
"headline": "VP Product @ Posify",
"followers": 890,
"connection_degree": "2nd",
"avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
}
# Provided sample HTML snippets (trimmed) — used exactly once to coldgenerate schema.
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
# --------- tighter schema prompts ----------
_COMPANY_SCHEMA_QUERY = dedent(
"""
Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
for every card, outputs *exactly* the keys shown in the example JSON below.
JSON spec:
• handle href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
• profile_image absolute URL of the <img> inside that link
• name text of the <a> inside the <span class*='t-16'>
• descriptor text line with industry • location
• about text of the <div class*='t-normal'> below the name (industry + geo)
• followers integer parsed from the <div> containing 'followers'
IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
The main div parent contains these li element is "div.search-results-container" you can use this.
The <ul> parent has "role" equal to "list". Using these two should be enough to target the <li> elements."
"""
)
_PEOPLE_SCHEMA_QUERY = dedent(
"""
Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
outputs exactly the keys in the example JSON below.
Fields:
• profile_url href of the outermost profile link
• name text inside artdeco-entity-lockup__title
• headline inner text of artdeco-entity-lockup__subtitle
• followers integer parsed from the span inside lt-line-clamp--multi-line
• connection_degree '1st', '2nd', etc. from artdeco-entity-lockup__badge
• avatar_url src of the <img> within artdeco-entity-lockup__image
IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
The main div parent contains these li element is a "div" has these classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
"""
)
# ---------------------------------------------------------------------------
# Utility helpers
# ---------------------------------------------------------------------------
def _load_or_build_schema(
path: pathlib.Path,
sample_html: str,
query: str,
example_json: Dict,
force = False
) -> Dict:
"""Load schema from path, else call generate_schema once and persist."""
if path.exists() and not force:
return json.loads(path.read_text())
logging.info("[SCHEMA] Generating schema %s", path.name)
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
llm_config=LLMConfig(
provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
),
query=query,
target_json_example=json.dumps(example_json, indent=2),
)
path.write_text(json.dumps(schema, indent=2))
return schema
def _openai_friendly_number(text: str) -> Optional[int]:
"""Extract first int from text like '1K followers' (returns 1000)."""
import re
m = re.search(r"(\d[\d,]*)", text.replace(",", ""))
if not m:
return None
val = int(m.group(1))
if "k" in text.lower():
val *= 1000
if "m" in text.lower():
val *= 1_000_000
return val
# ---------------------------------------------------------------------------
# Core async workers
# ---------------------------------------------------------------------------
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
"""Paginate 10-item company search pages until `limit` reached."""
extraction = JsonCssExtractionStrategy(schema)
cfg = CrawlerRunConfig(
extraction_strategy=extraction,
cache_mode=CacheMode.BYPASS,
wait_for = ".search-marvel-srp",
session_id="company_search",
delay_before_return_html=1,
magic = True,
verbose= False,
)
companies, page = [], 1
while len(companies) < max(limit, 10):
paged_url = f"{url}&page={page}"
res = await crawler.arun(paged_url, config=cfg)
batch = json.loads(res[0].extracted_content)
if not batch:
break
for item in batch:
name = item.get("name", "").strip()
handle = item.get("handle", "").strip()
if not handle or not name:
continue
descriptor = item.get("descriptor")
about = item.get("about")
followers = _openai_friendly_number(str(item.get("followers", "")))
companies.append(
{
"handle": handle,
"name": name,
"descriptor": descriptor,
"about": about,
"followers": followers,
"people_url": f"{handle}people/",
"captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
}
)
page += 1
logging.info(
f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
)
return companies[:max(limit, 10)]
async def crawl_people_page(
crawler: AsyncWebCrawler,
people_url: str,
schema: Dict,
limit: int,
title_kw: str,
) -> List[Dict]:
people_u = f"{people_url}?keywords={quote(title_kw)}"
extraction = JsonCssExtractionStrategy(schema)
cfg = CrawlerRunConfig(
extraction_strategy=extraction,
# scan_full_page=True,
cache_mode=CacheMode.BYPASS,
magic=True,
wait_for=".org-people-profile-card__card-spacing",
delay_before_return_html=1,
session_id="people_search",
)
res = await crawler.arun(people_u, config=cfg)
if not res[0].success:
return []
raw = json.loads(res[0].extracted_content)
people = []
for p in raw[:limit]:
followers = _openai_friendly_number(str(p.get("followers", "")))
people.append(
{
"profile_url": p.get("profile_url"),
"name": p.get("name"),
"headline": p.get("headline"),
"followers": followers,
"connection_degree": p.get("connection_degree"),
"avatar_url": p.get("avatar_url"),
}
)
return people
# ---------------------------------------------------------------------------
# CLI + main
# ---------------------------------------------------------------------------
def build_arg_parser() -> argparse.ArgumentParser:
ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
def add_flags(parser: argparse.ArgumentParser):
parser.add_argument("--query", required=False, help="query keyword(s)")
parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
parser.add_argument("--max-companies", type=int, default=1000)
parser.add_argument("--max-people", type=int, default=500)
parser.add_argument("--profile-path", default=str(pathlib.Path.home() / ".crawl4ai/profiles/profile_linkedin_uc"))
parser.add_argument("--outdir", default="./output")
parser.add_argument("--concurrency", type=int, default=4)
parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
add_flags(sub.add_parser("full"))
add_flags(sub.add_parser("companies"))
add_flags(sub.add_parser("people"))
# global flags
ap.add_argument(
"--debug",
action="store_true",
help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
)
return ap
def detect_debug_defaults(force = False) -> SimpleNamespace:
if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
return SimpleNamespace()
# ----- debugfriendly defaults -----
return SimpleNamespace(
cmd="full",
query="health insurance management",
geo=102713980,
# title_filters="Product,Engineering",
title_filters="",
max_companies=10,
max_people=5,
profile_name="profile_linkedin_uc",
outdir="./debug_out",
concurrency=2,
log_level="debug",
)
async def async_main(opts):
# ─────────── logging setup ───────────
console = Console()
logging.basicConfig(
level=opts.log_level.upper(),
format="%(message)s",
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
)
# -------------------------------------------------------------------
# Load or build schemas (onetime LLM call each)
# -------------------------------------------------------------------
company_schema = _load_or_build_schema(
COMPANY_SCHEMA_PATH,
_SAMPLE_COMPANY_HTML,
_COMPANY_SCHEMA_QUERY,
_COMPANY_SCHEMA_EXAMPLE,
# True
)
people_schema = _load_or_build_schema(
PEOPLE_SCHEMA_PATH,
_SAMPLE_PEOPLE_HTML,
_PEOPLE_SCHEMA_QUERY,
_PEOPLE_SCHEMA_EXAMPLE,
# True
)
outdir = BASE_DIR / pathlib.Path(opts.outdir)
outdir.mkdir(parents=True, exist_ok=True)
f_companies = (BASE_DIR / outdir / "companies.jsonl").open("a", encoding="utf-8")
f_people = (BASE_DIR / outdir / "people.jsonl").open("a", encoding="utf-8")
# -------------------------------------------------------------------
# Prepare crawler with cookie pool rotation
# -------------------------------------------------------------------
profiler = BrowserProfiler()
path = profiler.get_profile_path(opts.profile_name)
bc = BrowserConfig(
headless=False,
verbose=False,
user_data_dir=path,
use_managed_browser=True,
user_agent_mode = "random",
user_agent_generator_config= {
"platforms": "mobile",
"os": "Android"
},
verbose=False,
)
crawler = AsyncWebCrawler(config=bc)
await crawler.start()
# Single worker for simplicity; concurrency can be scaled by arun_many if needed.
# crawler = await next_crawler().start()
try:
# Build LinkedIn search URL
search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
logging.info("Seed URL => %s", search_url)
companies: List[Dict] = []
if opts.cmd in ("companies", "full"):
companies = await crawl_company_search(
crawler, search_url, company_schema, opts.max_companies
)
for c in companies:
f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
if opts.cmd in ("people", "full"):
if not companies:
# load from previous run
src = outdir / "companies.jsonl"
if not src.exists():
logging.error("companies.jsonl missing — run companies/full first")
return 10
companies = [json.loads(l) for l in src.read_text().splitlines()]
total_people = 0
title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
for comp in companies:
people = await crawl_people_page(
crawler,
comp["people_url"],
people_schema,
opts.max_people,
title_kw,
)
for p in people:
rec = p | {
"company_handle": comp["handle"],
# "captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
"captured_at": datetime.now(UTC).isoformat(timespec="seconds") + "Z",
}
f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
total_people += len(people)
logging.info(
f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
)
await asyncio.sleep(random.uniform(0.5, 1))
logging.info("Total people scraped: %d", total_people)
finally:
await crawler.close()
f_companies.close()
f_people.close()
return 0
def main():
parser = build_arg_parser()
cli_opts = parser.parse_args()
# decide on debug defaults
if cli_opts.debug:
opts = detect_debug_defaults(force=True)
else:
env_defaults = detect_debug_defaults()
env_defaults = detect_debug_defaults()
opts = env_defaults if env_defaults else cli_opts
if not getattr(opts, "cmd", None):
opts.cmd = "full"
exit_code = asyncio.run(async_main(opts))
sys.exit(exit_code)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Stage-2 Insights builder
------------------------
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
• company_graph.json
• org_chart_<handle>.json (one per company)
• decision_makers.csv
• graph_view.html (interactive visualisation)
Run:
python c4ai_insights.py --in ./stage1_out --out ./stage2_out
Author : Tom @ Kidocode, 2025-04-28
"""
from __future__ import annotations
# ───────────────────────────────────────────────────────────────────────────────
# Imports & Third-party
# ───────────────────────────────────────────────────────────────────────────────
import argparse, asyncio, json, os, sys, pathlib, random, time, csv
from datetime import datetime, UTC
from types import SimpleNamespace
from pathlib import Path
from typing import List, Dict, Any
# Pretty CLI UX
from rich.console import Console
from rich.logging import RichHandler
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
import logging
from jinja2 import Environment, FileSystemLoader, select_autoescape
BASE_DIR = pathlib.Path(__file__).resolve().parent
# ───────────────────────────────────────────────────────────────────────────────
# 3rd-party deps
# ───────────────────────────────────────────────────────────────────────────────
import numpy as np
# from sentence_transformers import SentenceTransformer
# from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import hashlib
from openai import OpenAI # same SDK you pre-loaded
# ───────────────────────────────────────────────────────────────────────────────
# Utils
# ───────────────────────────────────────────────────────────────────────────────
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
with open(path, "r", encoding="utf-8") as f:
return [json.loads(l) for l in f]
def dump_json(obj, path: Path):
with open(path, "w", encoding="utf-8") as f:
json.dump(obj, f, ensure_ascii=False, indent=2)
# ───────────────────────────────────────────────────────────────────────────────
# Constants
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
# ───────────────────────────────────────────────────────────────────────────────
# Debug defaults (mirrors Stage-1 trick)
# ───────────────────────────────────────────────────────────────────────────────
def dev_defaults() -> SimpleNamespace:
return SimpleNamespace(
in_dir="./debug_out",
out_dir="./insights_debug",
embed_model="all-MiniLM-L6-v2",
top_k=10,
openai_model="gpt-4.1",
max_llm_tokens=8000,
llm_temperature=1.0,
workers=4, # parallel processing
stub=False, # manual
)
# ───────────────────────────────────────────────────────────────────────────────
# Graph builders
# ───────────────────────────────────────────────────────────────────────────────
def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
from sentence_transformers import SentenceTransformer
logging.debug(f"Using embedding model: {model_name}")
cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
cache = {}
if cache_path.exists():
with open(cache_path) as f:
cache = json.load(f)
# flush cache if model differs
if cache.get("_model") != model_name:
cache = {}
model = SentenceTransformer(model_name)
new_texts, new_indices = [], []
vectors = np.zeros((len(companies), 384), dtype=np.float32)
for idx, comp in enumerate(companies):
text = comp.get("about") or comp.get("descriptor","")
h = hashlib.sha1(text.encode("utf-8")).hexdigest()
cached = cache.get(comp["handle"])
if cached and cached["hash"] == h:
vectors[idx] = np.array(cached["vector"], dtype=np.float32)
else:
new_texts.append(text)
new_indices.append((idx, comp["handle"], h))
if new_texts:
embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
for vec, (idx, handle, h) in zip(embeds, new_indices):
vectors[idx] = vec
cache[handle] = {"hash": h, "vector": vec.tolist()}
cache["_model"] = model_name
with open(cache_path, "w") as f:
json.dump(cache, f)
return vectors
def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeds)
nodes, edges = [], []
idx_of = {c["handle"]: i for i,c in enumerate(companies)}
for i,c in enumerate(companies):
node = dict(
id=c["handle"].strip("/"),
name=c["name"],
handle=c["handle"],
about=c.get("about",""),
people_url=c.get("people_url",""),
industry=c.get("descriptor","").split("")[0].strip(),
geoUrn=c.get("geoUrn"),
followers=c.get("followers",0),
# desc_embed=embeds[i].tolist(),
desc_embed=[],
)
nodes.append(node)
# pick top-k most similar except itself
top_idx = np.argsort(sims[i])[::-1][1:top_k+1]
for j in top_idx:
tgt = companies[j]
weight = float(sims[i,j])
if node["industry"] == tgt.get("descriptor","").split("")[0].strip():
weight += 0.10
if node["geoUrn"] == tgt.get("geoUrn"):
weight += 0.05
tgt['followers'] = tgt.get("followers", None) or 1
node["followers"] = node.get("followers", None) or 1
follower_ratio = min(node["followers"], tgt.get("followers",1)) / max(node["followers"] or 1, tgt.get("followers",1))
weight += 0.05 * follower_ratio
edges.append(dict(
source=node["id"],
target=tgt["handle"].strip("/"),
weight=round(weight,4),
drivers=dict(
embed_sim=round(float(sims[i,j]),4),
industry_match=0.10 if node["industry"] == tgt.get("descriptor","").split("")[0].strip() else 0,
geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
)
))
# return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
# ───────────────────────────────────────────────────────────────────────────────
# Org-chart via LLM
# ───────────────────────────────────────────────────────────────────────────────
async def infer_org_chart_llm(company, people, client:OpenAI, model_name:str, max_tokens:int, temperature:float, stub:bool):
if stub:
# Tiny fake org-chart when debugging offline
chief = random.choice(people)
nodes = [{
"id": chief["profile_url"],
"name": chief["name"],
"title": chief["headline"],
"dept": chief["headline"].split()[:1][0],
"yoe_total": 8,
"yoe_current": 2,
"seniority_score": 0.8,
"decision_score": 0.9,
"avatar_url": chief.get("avatar_url")
}]
return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
prompt = [
{"role":"system","content":"You are an expert B2B org-chart reasoner."},
{"role":"user","content":f"""Here is the company description:
<company>
{json.dumps(company, ensure_ascii=False)}
</company>
Here is a JSON list of employees:
<employees>
{json.dumps(people, ensure_ascii=False)}
</employees>
1) Build a reporting tree (manager -> direct reports)
2) For each person output a decision_score 0-1 for buying new software
Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
"""}
]
resp = client.chat.completions.create(
model=model_name,
messages=prompt,
max_tokens=max_tokens,
temperature=temperature,
response_format={"type":"json_object"}
)
chart = json.loads(resp.choices[0].message.content)
chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
return chart
# ───────────────────────────────────────────────────────────────────────────────
# CSV flatten
# ───────────────────────────────────────────────────────────────────────────────
def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
rows=[]
for p in charts_dir.glob("org_chart_*.json"):
data=json.loads(p.read_text())
comp = p.stem.split("org_chart_")[1]
for n in data.get("nodes",[]):
if n.get("decision_score",0)>=threshold:
rows.append(dict(
company=comp,
person=n["name"],
title=n["title"],
decision_score=n["decision_score"],
profile_url=n["id"]
))
pd.DataFrame(rows).to_csv(csv_path,index=False)
# ───────────────────────────────────────────────────────────────────────────────
# HTML rendering
# ───────────────────────────────────────────────────────────────────────────────
def render_html(out:Path, template_dir:Path):
# From template folder cp graph_view.html and ai.js in out folder
import shutil
shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
shutil.copy(template_dir/"ai.js", out)
# ───────────────────────────────────────────────────────────────────────────────
# Main async pipeline
# ───────────────────────────────────────────────────────────────────────────────
async def run(opts):
# ── silence SDK noise ──────────────────────────────────────────────────────
for noisy in ("openai", "httpx", "httpcore"):
lg = logging.getLogger(noisy)
lg.setLevel(logging.WARNING) # or ERROR if you want total silence
lg.propagate = False # optional: stop them reaching root
# ────────────── logging bootstrap ──────────────
console = Console()
logging.basicConfig(
level="INFO",
format="%(message)s",
handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
)
in_dir = BASE_DIR / Path(opts.in_dir)
out_dir = BASE_DIR / Path(opts.out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
companies = load_jsonl(in_dir/"companies.jsonl")
people = load_jsonl(in_dir/"people.jsonl")
logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
logging.info("[bold]⇢[/] Embedding company descriptions…")
# embeds = embed_descriptions(companies, opts.embed_model, opts)
logging.info("[bold]⇢[/] Building similarity graph")
# company_graph = build_company_graph(companies, embeds, opts.top_k)
# dump_json(company_graph, out_dir/"company_graph.json")
# OpenAI client (only built if not debugging)
stub = bool(opts.stub)
client = OpenAI() if not stub else None
# Filter companies that need processing
to_process = []
for comp in companies:
handle = comp["handle"].strip("/").replace("/","_")
out_file = out_dir/f"org_chart_{handle}.json"
if out_file.exists() and False:
logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
continue
to_process.append(comp)
if not to_process:
logging.info("[yellow]All companies already processed[/]")
else:
workers = getattr(opts, 'workers', 1)
parallel = workers > 1
logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
with Progress(
SpinnerColumn(),
BarColumn(),
TextColumn("[progress.description]{task.description}"),
TimeElapsedColumn(),
console=console,
) as progress:
task = progress.add_task("Org charts", total=len(to_process))
async def process_one(comp):
handle = comp["handle"].strip("/").replace("/","_")
persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
chart = await infer_org_chart_llm(
comp, persons,
client=client if client else OpenAI(api_key="sk-debug"),
model_name=opts.openai_model,
max_tokens=opts.max_llm_tokens,
temperature=opts.llm_temperature,
stub=stub,
)
chart["meta"]["company"] = comp["name"]
# Save the result immediately
dump_json(chart, out_dir/f"org_chart_{handle}.json")
progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
# Create tasks for all companies
tasks = [process_one(comp) for comp in to_process]
# Process in batches based on worker count
semaphore = asyncio.Semaphore(workers)
async def bounded_process(coro):
async with semaphore:
return await coro
# Run with concurrency control
await asyncio.gather(*(bounded_process(task) for task in tasks))
logging.info("[bold]⇢[/] Flattening decision-makers CSV")
export_decision_makers(out_dir, out_dir/"decision_makers.csv")
render_html(out_dir, template_dir=BASE_DIR/"templates")
logging.success = lambda msg, **k: console.print(f"[bold green]✓[/] {msg}", **k)
logging.success(f"Stage-2 artefacts written to {out_dir}")
# ───────────────────────────────────────────────────────────────────────────────
# CLI
# ───────────────────────────────────────────────────────────────────────────────
def build_arg_parser():
p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
p.add_argument("--openai_model", default="gpt-4.1")
p.add_argument("--max_llm_tokens", type=int, default=8024)
p.add_argument("--llm_temperature", type=float, default=1.0)
p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
return p
def main():
dbg = dev_defaults()
opts = dbg if True else build_arg_parser().parse_args()
asyncio.run(run(opts))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,39 @@
{
"name": "LinkedIn Company Card",
"baseSelector": "div.search-results-container ul[role='list'] > li",
"fields": [
{
"name": "handle",
"selector": "a[href*='/company/']",
"type": "attribute",
"attribute": "href"
},
{
"name": "profile_image",
"selector": "a[href*='/company/'] img",
"type": "attribute",
"attribute": "src"
},
{
"name": "name",
"selector": "span[class*='t-16'] a",
"type": "text"
},
{
"name": "descriptor",
"selector": "div[class*='t-black t-normal']",
"type": "text"
},
{
"name": "about",
"selector": "p[class*='entity-result__summary--2-lines']",
"type": "text"
},
{
"name": "followers",
"selector": "div:contains('followers')",
"type": "regex",
"pattern": "(\\d+)\\s*followers"
}
]
}

View File

@@ -0,0 +1,38 @@
{
"name": "LinkedIn People Card",
"baseSelector": "li.org-people-profile-card__profile-card-spacing",
"fields": [
{
"name": "profile_url",
"selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo",
"type": "attribute",
"attribute": "href"
},
{
"name": "name",
"selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
"type": "text"
},
{
"name": "headline",
"selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
"type": "text"
},
{
"name": "followers",
"selector": ".lt-line-clamp--multi-line.t-12",
"type": "text"
},
{
"name": "connection_degree",
"selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
"type": "text"
},
{
"name": "avatar_url",
"selector": ".artdeco-entity-lockup__image img",
"type": "attribute",
"attribute": "src"
}
]
}

View File

@@ -0,0 +1,143 @@
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
<!----><!---->
<div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
data-view-name="search-entity-result-universal-template">
<div class="linked-area flex-1
cursor-pointer">
<div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
<div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
<div class="display-flex align-items-center">
<!---->
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down " aria-hidden="true"
tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
data-test-app-aware-link="">
<div class="ivm-image-view-model ">
<div class="ivm-view-attr__img-wrapper
">
<!---->
<!----> <img width="48"
src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&amp;v=beta&amp;t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
id="ember28"
class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
</div>
</div>
</a>
</div>
</div>
<div
class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
<div class="mb1">
<div class="t-roman t-sans">
<div class="display-flex">
<span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
<span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
t-16">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
href="https://www.linkedin.com/company/managment-research-services-inc./"
data-test-app-aware-link="">
<!---->Management Research Services, Inc. (MRS, Inc)<!---->
<!----> </a>
<!----> </span>
</span>
<!---->
</div>
</div>
<div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
t-14 t-black t-normal">
<!---->Insurance • Milwaukee, Wisconsin<!---->
</div>
<div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
t-14 t-normal">
<!---->1K followers<!---->
</div>
</div>
<!---->
<p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
entity-result__summary--2-lines
t-12 t-black--light
">
<!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
</span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
</span><strong><!---->insurance<!----></strong><span class="white-space-pre">
</span>underwriting solutions that efficiently support clients workflows. Supported by the
Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
optimize requirements...<!---->
</p>
<!---->
</div>
<div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
<!---->
<div>
<button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
type="button"><!---->
<span class="artdeco-button__text">
Follow
</span></button>
<!---->
<!---->
</div>
</div>
</div>
</div>
</div>
</li>

View File

@@ -0,0 +1,94 @@
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
<div>
<section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
<!---->
<img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
<div class="org-people-profile-card__profile-info">
<div id="ember97"
class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
<div id="ember98"
class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
type="circle">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
id="org-people-profile-card__profile-image-0"
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
data-test-app-aware-link="">
<img width="104"
src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&amp;v=beta&amp;t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
class="evi-image lazy-image ember-view">
</a>
</div>
<div id="ember100" class="artdeco-entity-lockup__content ember-view">
<div id="ember101" class="artdeco-entity-lockup__title ember-view">
<a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
aria-label="View Dr. Rayna S.s profile"
href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
data-test-app-aware-link="">
<div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
t-black" style="">
Dr. Rayna S.
<!---->
</div>
</a>
</div>
<div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
degree connection</span>
<span class="artdeco-entity-lockup__degree" aria-hidden="true">
·&nbsp;3rd
</span>
<!----><!---->
</div>
<div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
<div class="t-14 t-black--light t-normal">
<div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
style="-webkit-line-clamp: 2">
Leadership and Talent Development Consultant and Professional Speaker
<!---->
</div>
</div>
</div>
<div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
</div>
</div>
<span class="text-align-center">
<span id="ember110"
class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
style="-webkit-line-clamp: 3">
727 followers
<!----> </span>
</span>
</div>
<footer class="ph3 pb3">
<button aria-label="Follow Dr. Rayna S." id="ember111"
class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
type="button"><!---->
<span class="artdeco-button__text">
Follow
</span></button>
</footer>
</section>
</div>
</li>

View File

@@ -0,0 +1,50 @@
// ==== File: ai.js ====
class ApiHandler {
constructor(apiKey = null) {
this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
console.log("ApiHandler ready");
}
setApiKey(k) {
this.apiKey = k.trim();
if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
}
async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
if (!this.apiKey) throw new Error("OpenAI API key missing");
const payload = {model, messages, stream: true, max_tokens: 1024};
const controller = new AbortController();
const res = await fetch("https://api.openai.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
},
body: JSON.stringify(payload),
signal: controller.signal,
});
if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";
while (true) {
const {done, value} = await reader.read();
if (done) break;
buf += dec.decode(value, {stream: true});
for (const line of buf.split("\n")) {
if (!line.startsWith("data: ")) continue;
if (line.includes("[DONE]")) return;
const json = JSON.parse(line.slice(6));
const delta = json.choices?.[0]?.delta?.content;
if (delta) yield delta;
}
buf = buf.endsWith("\n") ? "" : buf; // keep partial line
}
}
}
window.API = new ApiHandler();

File diff suppressed because it is too large Load Diff