feat(linkedin): add prospect-wizard app with scraping and visualization

Add a new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
2025-04-30 19:38:25 +08:00
parent 9499164d3c
commit 50f0b83fcd
9 changed files with 2473 additions and 0 deletions
--- a/docs/apps/linkdin/README.md
+++ b/docs/apps/linkdin/README.md
@@ -0,0 +1,126 @@
 # Crawl4AI Prospect‑Wizard – step‑by‑step guide
 A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.
 ```
 prospect‑wizard/
 ├─ c4ai_discover.py         # Stage 1 – scrape companies + people
 ├─ c4ai_insights.py         # Stage 2 – embeddings, org‑charts, scores
 ├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
 └─ data/                    # output lands here (*.jsonl / *.json)
 ```
 ---
 ## 1  Install & boot a LinkedIn profile (one‑time)
 ### 1.1  Install dependencies
 ```bash
 pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich scikit-learn jinja2
 ```
 ### 1.2  Create / warm a LinkedIn browser profile
 ```bash
 crwl profiler
 ```
 1. The interactive shell shows **New profile** – hit **enter**.
 2. Choose a name, e.g. `profile_linkedin_uc`.
 3. A Chromium window opens. Log in to LinkedIn, solve any CAPTCHA that appears, then close the window.
 > Remember the **profile name**. All future runs take `--profile-name <your_name>`.
 ---
 ## 2  Discovery – scrape companies & people
 ```bash
 # --geo 102713980 is the Malaysia geoUrn; pass --title-filters "Product,Engineering" to narrow roles.
 # --max-companies and --max-people are kept small here for workshops.
 python c4ai_discover.py full \
   --query "health insurance management" \
   --geo 102713980 \
   --title-filters "" \
   --max-companies 10 \
   --max-people 20 \
   --profile-name profile_linkedin_uc \
   --outdir ./data \
   --concurrency 2 \
   --log-level debug
 ```
 **Outputs** in `./data/`:
 * `companies.jsonl` – one JSON per company
 * `people.jsonl` – one JSON per employee
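A minimal sketch of consuming these two files, assuming each people record carries the `company_handle` field that Stage 1 writes. The helper names `read_jsonl` and `group_people_by_company` are illustrative and not part of the repo.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def group_people_by_company(people):
    """Map company handle (surrounding slashes stripped) to its people records."""
    grouped = defaultdict(list)
    for person in people:
        handle = person.get("company_handle", "").strip("/")
        grouped[handle].append(person)
    return dict(grouped)
```

Calling `group_people_by_company(read_jsonl("./data/people.jsonl"))` would give a dict keyed by handles such as `company/posify`, which is the same join Stage 2 performs before building each org chart.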
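 🛠️  **Debug defaults:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py` (or the `--debug` flag) ignores the CLI flags and runs the built-in demo settings: the sample query above, 10 companies, 5 people per company, output in `./debug_out`. It still crawls LinkedIn.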
 ### Handy geoUrn cheatsheet
 | Location | geoUrn |
 |----------|--------|
 | Singapore | **103644278** |
 | Malaysia | **102713980** |
 | United States | **103644922** |
 | United Kingdom | **102221843** |
 | Australia | **101452733** |
 _See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> – the number after `geoUrn=` is what you need._
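As a rough illustration of how the query and the geoUrn end up in the seed URL, the sketch below mirrors the f-string used in `c4ai_discover.py`. The function name `company_search_url` is hypothetical.

```python
from urllib.parse import quote


def company_search_url(query, geo_urn):
    """Build a LinkedIn company-search URL for a keyword query and a geoUrn id."""
    return (
        "https://www.linkedin.com/search/results/companies/"
        f"?keywords={quote(query)}&geoUrn={geo_urn}"
    )


# Example: the Malaysia search used earlier in this guide.
print(company_search_url("health insurance management", 102713980))
```

The scraper appends `&page=N` to this URL while paginating, so opening it in a logged-in browser is a quick way to check a geoUrn before a full run.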
 ---
 ## 3  Insights – embeddings, org‑charts, decision makers
 ```bash
 python c4ai_insights.py \
   --in  ./data \
   --out ./data \
   --embed_model all-MiniLM-L6-v2 \
   --top_k 10 \
   --openai_model gpt-4.1 \
   --max_llm_tokens 8024 \
   --llm_temperature 1.0 \
   --workers 4
 ```
 Emits next to the Stage‑1 files:
 * `company_graph.json` – inter‑company similarity graph
 * `org_chart_<handle>.json` – one per company
 * `decision_makers.csv` – hand‑picked ‘who to pitch’ list
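To show how an org chart turns into a pitch list, here is a small sketch that filters chart nodes by `decision_score`. It assumes the node shape requested from the LLM in `c4ai_insights.py` and the 0.5 threshold that `export_decision_makers()` uses by default. The function name `top_decision_makers` is hypothetical.

```python
def top_decision_makers(chart, threshold=0.5):
    """Return (name, title, score) tuples at or above threshold, best first."""
    picked = []
    for node in chart.get("nodes", []):
        score = node.get("decision_score", 0)
        if score >= threshold:
            picked.append((node.get("name"), node.get("title"), score))
    return sorted(picked, key=lambda row: row[2], reverse=True)
```

Pass it the parsed content of any `org_chart_<handle>.json` file; raising the threshold shortens the list to the most senior buyers.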
 Flags reference (straight from `build_arg_parser()`):
 | Flag | Default | Purpose |
 |------|---------|---------|
 | `--in` | `.` | Stage‑1 output dir |
 | `--out` | `.` | Destination dir |
 | `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
 | `--top_k` | `10` | Neighbours per company in graph |
 | `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
 | `--max_llm_tokens` | `8024` | Token budget per LLM call |
 | `--llm_temperature` | `1.0` | Creativity knob |
 | `--stub` | off | Skip OpenAI and fabricate tiny charts |
 | `--workers` | `4` | Parallel LLM workers |
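Each edge in `company_graph.json` carries a weight built from the embedding similarity plus a few small bonuses. The sketch below restates that arithmetic as it appears in `build_company_graph()`; the standalone function `edge_weight` and its boolean parameters are a simplification for illustration.

```python
def edge_weight(embed_sim, same_industry, same_geo, followers_a, followers_b):
    """Combine cosine similarity with small bonuses, as Stage 2 does per edge."""
    a = followers_a or 1  # missing or zero follower counts are treated as 1
    b = followers_b or 1
    weight = embed_sim
    if same_industry:
        weight += 0.10  # industry taken from the text before the bullet in `descriptor`
    if same_geo:
        weight += 0.05
    weight += 0.05 * (min(a, b) / max(a, b))  # closer audience sizes score higher
    return round(weight, 4)
```

With `--top_k 10`, every company gets up to ten such edges, one for each of its most similar neighbours.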
 ---
 ## 4  Visualise – interactive graph
 After Stage 2 completes, simply open the HTML viewer from the project root:
 ```bash
 open graph_view_template.html   # or serve the folder: Live Server / python -m http.server
 ```
 The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
 * Left pane → list of companies (clans).
 * Click a node to load its org‑chart on the right.
 * Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
 ---
 ## 5  Common snags
 | Symptom | Fix |
 |---------|-----|
 | Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
 | 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
 | Blank graph | Check JSON paths, clear `localStorage` in browser |
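For the "add delay" remedy, one common pattern is exponential backoff with jitter. The sketch below is generic Python and does not rely on any Crawl4AI retry feature; `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

Wrapping a single page fetch in such a helper spaces out repeated requests instead of hammering the site after a 429; the `sleep` parameter exists only so the delay can be swapped out in tests.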
 ---
 ### TL;DR
 `crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.  
 Live long and `import crawl4ai`.
--- a/docs/apps/linkdin/c4ai_discover.py
+++ b/docs/apps/linkdin/c4ai_discover.py
@@ -0,0 +1,440 @@
 #!/usr/bin/env python3
 """
 c4ai-discover — Stage‑1 Discovery CLI
 Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
 JSON files: companies.jsonl and people.jsonl.
 Key design rules
 ----------------
 * No BeautifulSoup — Crawl4AI only for network + HTML fetch.
 * JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
  from sample HTML provided by user and then cached under ./schemas/.
 * Defaults are embedded so the file runs inside VS Code debugger without CLI args.
 * If executed as a console script (argv > 1), CLI flags win.
 * Lightweight deps: argparse + Crawl4AI stack.
 Author: Tom @ Kidocode 2025‑04‑26
 """
 from __future__ import annotations
 import warnings, re
 warnings.filterwarnings(
    "ignore",
    message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
    category=FutureWarning,
    module=r"soupsieve"
 )
 # ───────────────────────────────────────────────────────────────────────────────
 # Imports
 # ───────────────────────────────────────────────────────────────────────────────
 import argparse
 import random
 import asyncio
 import json
 import logging
 import os
 import pathlib
 import sys
 # 3rd-party rich for pretty logging
 from rich.console import Console
 from rich.logging import RichHandler
 from datetime import datetime, UTC
 from itertools import cycle
 from textwrap import dedent
 from types import SimpleNamespace
 from typing import Dict, List, Optional
 from urllib.parse import quote
 from pathlib import Path
 from glob import glob
 from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    JsonCssExtractionStrategy,
    BrowserProfiler,
    LLMConfig,
 )
 # ───────────────────────────────────────────────────────────────────────────────
 # Constants / paths
 # ───────────────────────────────────────────────────────────────────────────────
 BASE_DIR = pathlib.Path(__file__).resolve().parent
 SCHEMA_DIR = BASE_DIR / "schemas"
 SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
 COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
 PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
 # ---------- deterministic target JSON examples ----------
 _COMPANY_SCHEMA_EXAMPLE = {
    "handle": "/company/posify/",
    "profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
    "name": "Management Research Services, Inc. (MRS, Inc)",
    "descriptor": "Insurance • Milwaukee, Wisconsin",
    "about": "Insurance • Milwaukee, Wisconsin",
    "followers": 1000
 }
 _PEOPLE_SCHEMA_EXAMPLE = {
    "profile_url": "https://www.linkedin.com/in/lily-ng/",
    "name": "Lily Ng",
    "headline": "VP Product @ Posify",
    "followers": 890,
    "connection_degree": "2nd",
    "avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
 }
 # Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
 _SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
 _SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
 # --------- tighter schema prompts ----------
 _COMPANY_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
    for every card, outputs *exactly* the keys shown in the example JSON below.
    JSON spec:
      • handle        – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
      • profile_image – absolute URL of the <img> inside that link
      • name          – text of the <a> inside the <span class*='t-16'>
      • descriptor    – text line with industry • location
      • about         – text of the <div class*='t-normal'> below the name (industry + geo)
      • followers     – integer parsed from the <div> containing 'followers'
    IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
    The main div parent contains these li element is "div.search-results-container" you can use this.
    The <ul> parent has "role" equal to "list". Using these two should be enough to target the <li> elements."
    """
 )
 _PEOPLE_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
    outputs exactly the keys in the example JSON below.
    Fields:
      • profile_url        – href of the outermost profile link
      • name               – text inside artdeco-entity-lockup__title
      • headline           – inner text of artdeco-entity-lockup__subtitle
      • followers          – integer parsed from the span inside lt-line-clamp--multi-line
      • connection_degree  – '1st', '2nd', etc. from artdeco-entity-lockup__badge
      • avatar_url         – src of the <img> within artdeco-entity-lockup__image
    IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
    The main div parent contains these li element is a "div" has these classes "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
    """
 )
 # ---------------------------------------------------------------------------
 # Utility helpers
 # ---------------------------------------------------------------------------
 def _load_or_build_schema(
    path: pathlib.Path, 
    sample_html: str, 
    query: str, 
    example_json: Dict,
    force = False
 ) -> Dict:
    """Load schema from path, else call generate_schema once and persist."""
    if path.exists() and not force:
        return json.loads(path.read_text())
    logging.info("[SCHEMA] Generating schema %s", path.name)
    schema = JsonCssExtractionStrategy.generate_schema(
        html=sample_html,
        llm_config=LLMConfig(
            provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
            api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
        ),
        query=query,
        target_json_example=json.dumps(example_json, indent=2),
    )
    path.write_text(json.dumps(schema, indent=2))
    return schema
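 def _openai_friendly_number(text: str) -> Optional[int]:
    """Extract the first number from text like '1.2K followers' (returns 1200)."""
    # Accept a K/M suffix only when it directly follows the number, so a stray
    # 'k' or 'm' elsewhere in the text (e.g. "members") is not treated as a multiplier.
    m = re.search(r"(\d+(?:\.\d+)?)\s*([kKmM](?![A-Za-z]))?", text.replace(",", ""))
    if not m:
        return None
    val = float(m.group(1))
    suffix = (m.group(2) or "").lower()
    if suffix == "k":
        val *= 1_000
    elif suffix == "m":
        val *= 1_000_000
    return int(val)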
 # ---------------------------------------------------------------------------
 # Core async workers
 # ---------------------------------------------------------------------------
 async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
    """Paginate 10-item company search pages until `limit` reached."""
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        wait_for = ".search-marvel-srp",
        session_id="company_search",
        delay_before_return_html=1,
        magic = True,
        verbose= False,
    )
    companies, page = [], 1
    while len(companies) < max(limit, 10):
        paged_url = f"{url}&page={page}"
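         res = await crawler.arun(paged_url, config=cfg)
         if not res[0].success:
             logging.warning("Company search page %d failed; stopping pagination", page)
             break
         # extracted_content can be empty when nothing matched the schema
         batch = json.loads(res[0].extracted_content or "[]")
         if not batch:
             break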
        for item in batch:
            name = item.get("name", "").strip()
            handle = item.get("handle", "").strip()
            if not handle or not name:
                continue
            descriptor = item.get("descriptor")
            about = item.get("about")
            followers = _openai_friendly_number(str(item.get("followers", "")))
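             companies.append(
                 {
                     "handle": handle,
                     "name": name,
                     "descriptor": descriptor,
                     "about": about,
                     "followers": followers,
                     "people_url": f"{handle}people/",
                     # isoformat() already emits "+00:00"; swap it for "Z" instead of appending a second zone marker
                     "captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
                 }
             )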
         # log the page that was just scraped, then advance
         logging.info(
             f"[dim]Page {page}[/] - running total: {len(companies)}/{limit} companies"
         )
         page += 1
    return companies[:max(limit, 10)]
 async def crawl_people_page(
    crawler: AsyncWebCrawler,
    people_url: str,
    schema: Dict,
    limit: int,
    title_kw: str,
 ) -> List[Dict]:
    people_u = f"{people_url}?keywords={quote(title_kw)}"
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        # scan_full_page=True,
        cache_mode=CacheMode.BYPASS,
        magic=True,
        wait_for=".org-people-profile-card__card-spacing",
        delay_before_return_html=1,
        session_id="people_search",
    )
    res = await crawler.arun(people_u, config=cfg)
    if not res[0].success:
        return []
    raw = json.loads(res[0].extracted_content)
    people = []
    for p in raw[:limit]:
        followers = _openai_friendly_number(str(p.get("followers", "")))
        people.append(
            {
                "profile_url": p.get("profile_url"),
                "name": p.get("name"),
                "headline": p.get("headline"),
                "followers": followers,
                "connection_degree": p.get("connection_degree"),
                "avatar_url": p.get("avatar_url"),
            }
        )
    return people
 # ---------------------------------------------------------------------------
 # CLI + main
 # ---------------------------------------------------------------------------
 def build_arg_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
    sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")
    def add_flags(parser: argparse.ArgumentParser):
        parser.add_argument("--query", required=False, help="query keyword(s)")
        parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
        parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
        parser.add_argument("--max-companies", type=int, default=1000)
        parser.add_argument("--max-people", type=int, default=500)
         parser.add_argument("--profile-name", default="profile_linkedin_uc", help="browser profile created with `crwl profiler`")
        parser.add_argument("--outdir", default="./output")
        parser.add_argument("--concurrency", type=int, default=4)
        parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])
    add_flags(sub.add_parser("full"))
    add_flags(sub.add_parser("companies"))
    add_flags(sub.add_parser("people"))
    # global flags
    ap.add_argument(
        "--debug",
        action="store_true",
        help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
    )
    return ap
 def detect_debug_defaults(force = False) -> SimpleNamespace:
    if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
        return SimpleNamespace()
    # ----- debug‑friendly defaults -----
    return SimpleNamespace(
        cmd="full",
        query="health insurance management",
        geo=102713980,
        # title_filters="Product,Engineering",
        title_filters="",
        max_companies=10,
        max_people=5,
        profile_name="profile_linkedin_uc",
        outdir="./debug_out",
        concurrency=2,
        log_level="debug",
    )
 async def async_main(opts):
    # ─────────── logging setup ───────────
    console = Console()
    logging.basicConfig(
        level=opts.log_level.upper(),
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )
    # -------------------------------------------------------------------
    # Load or build schemas (one‑time LLM call each)
    # -------------------------------------------------------------------
    company_schema = _load_or_build_schema(
        COMPANY_SCHEMA_PATH,
        _SAMPLE_COMPANY_HTML,
        _COMPANY_SCHEMA_QUERY,
        _COMPANY_SCHEMA_EXAMPLE,
        # True
    )
    people_schema = _load_or_build_schema(
        PEOPLE_SCHEMA_PATH,
        _SAMPLE_PEOPLE_HTML,
        _PEOPLE_SCHEMA_QUERY,
        _PEOPLE_SCHEMA_EXAMPLE,
        # True
    )
     outdir = BASE_DIR / pathlib.Path(opts.outdir)
     outdir.mkdir(parents=True, exist_ok=True)
     f_companies = (outdir / "companies.jsonl").open("a", encoding="utf-8")
     f_people = (outdir / "people.jsonl").open("a", encoding="utf-8")
    # -------------------------------------------------------------------
     # Prepare crawler using the saved LinkedIn browser profile
     # -------------------------------------------------------------------
     profiler = BrowserProfiler()
     path = profiler.get_profile_path(opts.profile_name)
     bc = BrowserConfig(
         headless=False,
         verbose=False,  # passed once; a repeated keyword argument is a SyntaxError
         user_data_dir=path,
         use_managed_browser=True,
         user_agent_mode="random",
         user_agent_generator_config={
             "platforms": "mobile",
             "os": "Android",
         },
     )
    crawler = AsyncWebCrawler(config=bc)
    await crawler.start()
    # Single worker for simplicity; concurrency can be scaled by arun_many if needed.
    # crawler = await next_crawler().start()
    try:
        # Build LinkedIn search URL
        search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
        logging.info("Seed URL => %s", search_url)
        companies: List[Dict] = []
        if opts.cmd in ("companies", "full"):
            companies = await crawl_company_search(
                crawler, search_url, company_schema, opts.max_companies
            )
            for c in companies:
                f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
            logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")
        if opts.cmd in ("people", "full"):
            if not companies:
                # load from previous run
                src = outdir / "companies.jsonl"
                if not src.exists():
                    logging.error("companies.jsonl missing — run companies/full first")
                    return 10
                 companies = [json.loads(l) for l in src.read_text().splitlines() if l.strip()]
            total_people = 0
            title_kw = " ".join([t.strip() for t in opts.title_filters.split(",") if t.strip()]) if opts.title_filters else ""
            for comp in companies:
                people = await crawl_people_page(
                    crawler,
                    comp["people_url"],
                    people_schema,
                    opts.max_people,
                    title_kw,
                )
                for p in people:
                     rec = p | {
                         "company_handle": comp["handle"],
                         "captured_at": datetime.now(UTC).isoformat(timespec="seconds").replace("+00:00", "Z"),
                     }
                    f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
                total_people += len(people)
                logging.info(
                    f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
                )
                await asyncio.sleep(random.uniform(0.5, 1))
            logging.info("Total people scraped: %d", total_people)
    finally:
        await crawler.close()
        f_companies.close()
        f_people.close()
    return 0
 def main():
    parser = build_arg_parser()
    cli_opts = parser.parse_args()
    # decide on debug defaults
    if cli_opts.debug:
        opts = detect_debug_defaults(force=True)
    else:
         env_defaults = detect_debug_defaults()
         # An empty SimpleNamespace is still truthy, so check its attributes instead.
         opts = env_defaults if vars(env_defaults) else cli_opts
    if not getattr(opts, "cmd", None):
        opts.cmd = "full"
    exit_code = asyncio.run(async_main(opts))
    sys.exit(exit_code)
 if __name__ == "__main__":
    main()
--- a/docs/apps/linkdin/c4ai_insights.py
+++ b/docs/apps/linkdin/c4ai_insights.py
@@ -0,0 +1,372 @@
 #!/usr/bin/env python3
 """
 Stage-2 Insights builder
 ------------------------
 Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
 • company_graph.json
 • org_chart_<handle>.json  (one per company)
 • decision_makers.csv
 • graph_view.html          (interactive visualisation)
 Run:
    python c4ai_insights.py --in ./stage1_out --out ./stage2_out
 Author : Tom @ Kidocode, 2025-04-28
 """
 from __future__ import annotations
 # ───────────────────────────────────────────────────────────────────────────────
 # Imports & Third-party
 # ───────────────────────────────────────────────────────────────────────────────
 import argparse, asyncio, json, os, sys, pathlib, random, time, csv
 from datetime import datetime, UTC
 from types import SimpleNamespace
 from pathlib import Path
 from typing import List, Dict, Any
 # Pretty CLI UX
 from rich.console import Console
 from rich.logging import RichHandler
 from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
 import logging
 from jinja2 import Environment, FileSystemLoader, select_autoescape
 BASE_DIR = pathlib.Path(__file__).resolve().parent
 # ───────────────────────────────────────────────────────────────────────────────
 # 3rd-party deps
 # ───────────────────────────────────────────────────────────────────────────────
 import numpy as np
 # from sentence_transformers import SentenceTransformer
 # from sklearn.metrics.pairwise import cosine_similarity
 import pandas as pd
 import hashlib
 from openai import OpenAI                    # same SDK you pre-loaded
 # ───────────────────────────────────────────────────────────────────────────────
 # Utils
 # ───────────────────────────────────────────────────────────────────────────────
 def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(l) for l in f]
 def dump_json(obj, path: Path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
 # ───────────────────────────────────────────────────────────────────────────────
 # Debug defaults   (mirrors Stage-1 trick)
 # ───────────────────────────────────────────────────────────────────────────────
 def dev_defaults() -> SimpleNamespace:
    return SimpleNamespace(
        in_dir="./debug_out",          
        out_dir="./insights_debug",
        embed_model="all-MiniLM-L6-v2",
        top_k=10,
        openai_model="gpt-4.1",
        max_llm_tokens=8000,
        llm_temperature=1.0,
        workers=4,           # parallel processing
        stub=False,          # manual
    )
 # ───────────────────────────────────────────────────────────────────────────────
 # Graph builders
 # ───────────────────────────────────────────────────────────────────────────────
 def embed_descriptions(companies, model_name:str, opts) -> np.ndarray:
    from sentence_transformers import SentenceTransformer
    logging.debug(f"Using embedding model: {model_name}")
    cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
    cache = {}
    if cache_path.exists():
        with open(cache_path) as f:
            cache = json.load(f)
        # flush cache if model differs
        if cache.get("_model") != model_name:
            cache = {}
    model = SentenceTransformer(model_name)
    new_texts, new_indices = [], []
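    # size the matrix from the model instead of hard-coding MiniLM's 384 dimensions
    vectors = np.zeros((len(companies), model.get_sentence_embedding_dimension()), dtype=np.float32)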
    for idx, comp in enumerate(companies):
        text = comp.get("about") or comp.get("descriptor","")
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        cached = cache.get(comp["handle"])
        if cached and cached["hash"] == h:
            vectors[idx] = np.array(cached["vector"], dtype=np.float32)
        else:
            new_texts.append(text)
            new_indices.append((idx, comp["handle"], h))
    if new_texts:
        embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
        for vec, (idx, handle, h) in zip(embeds, new_indices):
            vectors[idx] = vec
            cache[handle] = {"hash": h, "vector": vec.tolist()}
        cache["_model"] = model_name
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return vectors
 def build_company_graph(companies, embeds:np.ndarray, top_k:int) -> Dict[str,Any]:
    from sklearn.metrics.pairwise import cosine_similarity
    sims = cosine_similarity(embeds)
    nodes, edges = [], []
    idx_of = {c["handle"]: i for i,c in enumerate(companies)}
    for i,c in enumerate(companies):
        node = dict(
            id=c["handle"].strip("/"),
            name=c["name"],
            handle=c["handle"],
            about=c.get("about",""),
            people_url=c.get("people_url",""),
            industry=c.get("descriptor","").split("•")[0].strip(),
            geoUrn=c.get("geoUrn"),
            followers=c.get("followers",0),
            # desc_embed=embeds[i].tolist(),
            desc_embed=[],
        )
        nodes.append(node)
         # pick top-k most similar except itself
         top_idx = [j for j in np.argsort(sims[i])[::-1] if j != i][:top_k]
         for j in top_idx:
             tgt = companies[j]
             embed_sim = float(sims[i,j])
             # bonuses: same industry, same (known) geo, comparable audience size
             industry_match = 0.10 if node["industry"] == tgt.get("descriptor","").split("•")[0].strip() else 0
             geo_overlap = 0.05 if node["geoUrn"] is not None and node["geoUrn"] == tgt.get("geoUrn") else 0
             node["followers"] = node.get("followers") or 1
             tgt_followers = tgt.get("followers") or 1
             follower_ratio = min(node["followers"], tgt_followers) / max(node["followers"], tgt_followers)
             weight = embed_sim + industry_match + geo_overlap + 0.05 * follower_ratio
             edges.append(dict(
                 source=node["id"],
                 target=tgt["handle"].strip("/"),
                 weight=round(weight,4),
                 drivers=dict(
                     embed_sim=round(embed_sim,4),
                     industry_match=industry_match,
                     geo_overlap=geo_overlap,
                 )
             ))
    return {"nodes":nodes,"edges":edges,"meta":{"generated_at":datetime.now(UTC).isoformat()}}
 # ───────────────────────────────────────────────────────────────────────────────
 # Org-chart via LLM
 # ───────────────────────────────────────────────────────────────────────────────
 async def infer_org_chart_llm(company, people, client:OpenAI, model_name:str, max_tokens:int, temperature:float, stub:bool):
    if stub:
        # Tiny fake org-chart when debugging offline
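         if not people:
             # nothing scraped for this company: return an empty stub chart
             return {"nodes":[],"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
         chief = random.choice(people)
         nodes = [{
             "id": chief["profile_url"],
             "name": chief["name"],
             "title": chief["headline"],
             "dept": ((chief.get("headline") or "").split() or ["Unknown"])[0],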
            "yoe_total": 8,
            "yoe_current": 2,
            "seniority_score": 0.8,
            "decision_score": 0.9,
            "avatar_url": chief.get("avatar_url")
        }]
        return {"nodes":nodes,"edges":[],"meta":{"debug_stub":True,"generated_at":datetime.now(UTC).isoformat()}}
    prompt = [
        {"role":"system","content":"You are an expert B2B org-chart reasoner."},
        {"role":"user","content":f"""Here is the company description:
 <company>
 {json.dumps(company, ensure_ascii=False)}
 </company>
 Here is a JSON list of employees:
 <employees>
 {json.dumps(people, ensure_ascii=False)}
 </employees>
 1) Build a reporting tree (manager -> direct reports)
 2) For each person output a decision_score 0-1 for buying new software
 Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
 """}
    ]
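     # The OpenAI client here is synchronous; run the call in a worker thread so
     # parallel workers do not block the event loop.
     resp = await asyncio.to_thread(
         client.chat.completions.create,
         model=model_name,
         messages=prompt,
         max_tokens=max_tokens,
         temperature=temperature,
         response_format={"type":"json_object"},
     )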
    chart = json.loads(resp.choices[0].message.content)
    chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
    return chart
 # ───────────────────────────────────────────────────────────────────────────────
 # CSV flatten
 # ───────────────────────────────────────────────────────────────────────────────
 def export_decision_makers(charts_dir:Path, csv_path:Path, threshold:float=0.5):
    rows=[]
    for p in charts_dir.glob("org_chart_*.json"):
        data=json.loads(p.read_text())
        comp = p.stem.split("org_chart_")[1]
        for n in data.get("nodes",[]):
            if n.get("decision_score",0)>=threshold:
                rows.append(dict(
                    company=comp,
                    person=n["name"],
                    title=n["title"],
                    decision_score=n["decision_score"],
                    profile_url=n["id"]
                ))
    pd.DataFrame(rows).to_csv(csv_path,index=False)
 # ───────────────────────────────────────────────────────────────────────────────
 # HTML rendering
 # ───────────────────────────────────────────────────────────────────────────────
 def render_html(out:Path, template_dir:Path):
    # From template folder cp graph_view.html and ai.js in out folder
    import shutil
    shutil.copy(template_dir/"graph_view_template.html", out / "graph_view.html")
    shutil.copy(template_dir/"ai.js", out)
 # ───────────────────────────────────────────────────────────────────────────────
 # Main async pipeline
 # ───────────────────────────────────────────────────────────────────────────────
 async def run(opts):
    # ── silence SDK noise ──────────────────────────────────────────────────────
    for noisy in ("openai", "httpx", "httpcore"):
        lg = logging.getLogger(noisy)
        lg.setLevel(logging.WARNING)     # or ERROR if you want total silence
        lg.propagate = False             # optional: stop them reaching root
    # ────────────── logging bootstrap ──────────────
    console = Console()
    logging.basicConfig(
        level="INFO",
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )
    in_dir  = BASE_DIR / Path(opts.in_dir)
    out_dir = BASE_DIR / Path(opts.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    companies = load_jsonl(in_dir/"companies.jsonl")
    people    = load_jsonl(in_dir/"people.jsonl")
    logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")
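    # NOTE: the embedding and similarity-graph steps are commented out below, so
    # the two log lines still print but this function writes no company_graph.json.
    logging.info("[bold]⇢[/] Embedding company descriptions…")
    # embeds = embed_descriptions(companies, opts.embed_model, opts)
    logging.info("[bold]⇢[/] Building similarity graph")
    # company_graph = build_company_graph(companies, embeds, opts.top_k)
    # dump_json(company_graph, out_dir/"company_graph.json")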
    # OpenAI client (only built if not debugging)
    stub = bool(opts.stub)
    client = OpenAI() if not stub else None
    # Filter companies that need processing
    to_process = []
    for comp in companies:
        handle = comp["handle"].strip("/").replace("/","_")
        out_file = out_dir/f"org_chart_{handle}.json"
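        # NOTE: `and False` disables this skip, so every company is re-processed
        # on each run; drop it to reuse existing org_chart_*.json files.
        if out_file.exists() and False: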
            logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
            continue
        to_process.append(comp)
    if not to_process:
        logging.info("[yellow]All companies already processed[/]")
    else:
        workers = getattr(opts, 'workers', 1)
        parallel = workers > 1
        logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")
        with Progress(
            SpinnerColumn(),
            BarColumn(),
            TextColumn("[progress.description]{task.description}"),
            TimeElapsedColumn(),
            console=console,
        ) as progress:
            task = progress.add_task("Org charts", total=len(to_process))
            async def process_one(comp):
                handle = comp["handle"].strip("/").replace("/","_")
                persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]
                chart = await infer_org_chart_llm(
                    comp, persons,
                    client=client if client else OpenAI(api_key="sk-debug"),
                    model_name=opts.openai_model,
                    max_tokens=opts.max_llm_tokens,
                    temperature=opts.llm_temperature,
                    stub=stub,
                )
                chart["meta"]["company"] = comp["name"]
                # Save the result immediately
                dump_json(chart, out_dir/f"org_chart_{handle}.json")
                progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")
            # Create one coroutine per company (nothing runs until it is awaited)
            coros = [process_one(comp) for comp in to_process]
            # Cap the number of coroutines in flight at the worker count
            semaphore = asyncio.Semaphore(workers)
            async def bounded_process(coro):
                async with semaphore:
                    return await coro
            # Run with concurrency control; `coro` avoids shadowing the progress `task`
            await asyncio.gather(*(bounded_process(coro) for coro in coros))
    logging.info("[bold]⇢[/] Flattening decision-makers CSV")
    export_decision_makers(out_dir, out_dir/"decision_makers.csv")
    render_html(out_dir, template_dir=BASE_DIR/"templates")
    # Print the success line directly instead of monkey-patching the logging module
    console.print(f"[bold green]✓[/] Stage-2 artefacts written to {out_dir}")
 # ───────────────────────────────────────────────────────────────────────────────
 # CLI
 # ───────────────────────────────────────────────────────────────────────────────
 def build_arg_parser():
    p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
    p.add_argument("--in",       dest="in_dir",  required=False, help="Stage-1 output dir", default=".")
    p.add_argument("--out",      dest="out_dir", required=False, help="Destination dir",   default=".")
    p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
    p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
    p.add_argument("--openai_model", default="gpt-4.1")
    p.add_argument("--max_llm_tokens", type=int, default=8024)
    p.add_argument("--llm_temperature", type=float, default=1.0)
    p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
    p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
    return p
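 def main():
    # NOTE: the hard-coded `if True` means dev_defaults() is always used and
    # command-line flags are ignored; flip the condition to enable the CLI parser.
    dbg = dev_defaults()
    opts = dbg if True else build_arg_parser().parse_args()
    asyncio.run(run(opts))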
 if __name__ == "__main__":
    main()
--- a/docs/apps/linkdin/schemas/company_card.json
+++ b/docs/apps/linkdin/schemas/company_card.json
@@ -0,0 +1,39 @@
 {
  "name": "LinkedIn Company Card",
  "baseSelector": "div.search-results-container ul[role='list'] > li",
  "fields": [
    {
      "name": "handle",
      "selector": "a[href*='/company/']",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "profile_image",
      "selector": "a[href*='/company/'] img",
      "type": "attribute",
      "attribute": "src"
    },
    {
      "name": "name",
      "selector": "span[class*='t-16'] a",
      "type": "text"
    },
    {
      "name": "descriptor",
      "selector": "div[class*='t-black t-normal']",
      "type": "text"
    },
    {
      "name": "about",
      "selector": "p[class*='entity-result__summary--2-lines']",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": "div:contains('followers')",
      "type": "regex",
      "pattern": "(\\d+)\\s*followers"
    }
  ]
 }
--- a/docs/apps/linkdin/schemas/people_card.json
+++ b/docs/apps/linkdin/schemas/people_card.json
@@ -0,0 +1,38 @@
 {
  "name": "LinkedIn People Card",
  "baseSelector": "li.org-people-profile-card__profile-card-spacing",
  "fields": [
    {
      "name": "profile_url",
      "selector": "a[href*='/in/']",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "name",
      "selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line",
      "type": "text"
    },
    {
      "name": "headline",
      "selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line",
      "type": "text"
    },
    {
      "name": "followers",
      "selector": ".lt-line-clamp--multi-line.t-12",
      "type": "text"
    },
    {
      "name": "connection_degree",
      "selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree",
      "type": "text"
    },
    {
      "name": "avatar_url",
      "selector": ".artdeco-entity-lockup__image img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
 }
--- a/docs/apps/linkdin/snippets/company.html
+++ b/docs/apps/linkdin/snippets/company.html
@@ -0,0 +1,143 @@
 <li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
    <!----><!---->
    <div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
        data-view-name="search-entity-result-universal-template">
        <div class="linked-area flex-1
              cursor-pointer">
            <div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
                <div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
                    <div class="display-flex align-items-center">
                        <!---->
                        <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo  scale-down " aria-hidden="true"
                            tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
                            data-test-app-aware-link="">
                            <div class="ivm-image-view-model   ">
                                <div class="ivm-view-attr__img-wrapper
            ">
                                    <!---->
                                    <!----> <img width="48"
                                        src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&amp;v=beta&amp;t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
                                        loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
                                        id="ember28"
                                        class="ivm-view-attr__img--centered EntityPhoto-square-3   evi-image lazy-image ember-view">
                                </div>
                            </div>
                        </a>
                    </div>
                </div>
                <div
                    class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
                    <div class="mb1">
                        <div class="t-roman t-sans">
                            <div class="display-flex">
                                <span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI ">
                                    <span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo
                t-16">
                                        <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
                                            href="https://www.linkedin.com/company/managment-research-services-inc./"
                                            data-test-app-aware-link="">
                                            <!---->Management Research Services, Inc. (MRS, Inc)<!---->
                                            <!----> </a>
                                        <!----> </span>
                                </span>
                                <!---->
                            </div>
                        </div>
                        <div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii
              t-14 t-black t-normal">
                            <!---->Insurance • Milwaukee, Wisconsin<!---->
                        </div>
                        <div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ
              t-14 t-normal">
                            <!---->1K followers<!---->
                        </div>
                    </div>
                    <!---->
                    <p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg
                    entity-result__summary--2-lines
                    t-12 t-black--light
                    ">
                        <!---->MRS combines 30 years of experience supporting the Life,<span class="white-space-pre">
                        </span><strong><!---->Health<!----></strong><span class="white-space-pre"> </span>and
                        Annuities<span class="white-space-pre"> </span><strong><!---->Insurance<!----></strong><span
                            class="white-space-pre"> </span>Industry with customized<span class="white-space-pre">
                        </span><strong><!---->insurance<!----></strong><span class="white-space-pre">
                        </span>underwriting solutions that efficiently support clients’ workflows. Supported by the
                        Agenium Platform (www.agenium.ai) our innovative underwriting solutions are guaranteed to
                        optimize requirements...<!---->
                    </p>
                    <!---->
                </div>
                <div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
                    <!---->
                    <div>
                        <button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
                            class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
                            type="button"><!---->
                            <span class="artdeco-button__text">
                                Follow
                            </span></button>
                        <!---->
                        <!---->
                    </div>
                </div>
            </div>
        </div>
    </div>
 </li>
--- a/docs/apps/linkdin/snippets/people.html
+++ b/docs/apps/linkdin/snippets/people.html
@@ -0,0 +1,94 @@
 <li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
    <div>
        <section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
            <!---->
            <img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
                ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
                class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
            <div class="org-people-profile-card__profile-info">
                <div id="ember97"
                    class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
                    <div id="ember98"
                        class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
                        type="circle">
                        <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo "
                            id="org-people-profile-card__profile-image-0"
                            href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
                            data-test-app-aware-link="">
                            <img width="104"
                                src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&amp;v=beta&amp;t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
                                loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
                                class="evi-image lazy-image ember-view">
                        </a>
                    </div>
                    <div id="ember100" class="artdeco-entity-lockup__content ember-view">
                        <div id="ember101" class="artdeco-entity-lockup__title ember-view">
                            <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo  link-without-visited-state"
                                aria-label="View Dr. Rayna S.’s profile"
                                href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
                                data-test-app-aware-link="">
                                <div id="ember103" class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY
          t-black" style="">
                                    Dr. Rayna S.
                                    <!---->
                                </div>
                            </a>
                        </div>
                        <div id="ember104" class="artdeco-entity-lockup__badge ember-view"> <span class="a11y-text">3rd+
                                degree connection</span>
                            <span class="artdeco-entity-lockup__degree" aria-hidden="true">
                                ·&nbsp;3rd
                            </span>
                            <!----><!---->
                        </div>
                        <div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
                            <div class="t-14 t-black--light t-normal">
                                <div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
                                    style="-webkit-line-clamp: 2">
                                    Leadership and Talent Development Consultant and Professional Speaker
                                    <!---->
                                </div>
                            </div>
                        </div>
                        <div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
                    </div>
                </div>
                <span class="text-align-center">
                    <span id="ember110"
                        class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
                        style="-webkit-line-clamp: 3">
                        727 followers
                        <!----> </span>
                </span>
            </div>
            <footer class="ph3 pb3">
                <button aria-label="Follow Dr. Rayna S." id="ember111"
                    class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
                    type="button"><!---->
                    <span class="artdeco-button__text">
                        Follow
                    </span></button>
            </footer>
        </section>
    </div>
 </li>
--- a/docs/apps/linkdin/templates/ai.js
+++ b/docs/apps/linkdin/templates/ai.js
@@ -0,0 +1,50 @@
 // ==== File: ai.js ====
 class ApiHandler {
    constructor(apiKey = null) {
      this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
      console.log("ApiHandler ready");
    }
    setApiKey(k) {
      this.apiKey = k.trim();
      if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
    }
    async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
      if (!this.apiKey) throw new Error("OpenAI API key missing");
      // Forward the caller's temperature; it was previously accepted but never sent
      const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
      const controller = new AbortController();
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${this.apiKey}`,
        },
        body: JSON.stringify(payload),
        signal: controller.signal,
      });
      if (!res.ok) throw new Error(`OpenAI: ${res.status} ${res.statusText}`);
      const reader = res.body.getReader();
      const dec = new TextDecoder();
      let buf = "";
      while (true) {
        const {done, value} = await reader.read();
        if (done) break;
        buf += dec.decode(value, {stream: true});
        // Process only complete lines; keep the trailing partial line for the next chunk
        const lines = buf.split("\n");
        buf = lines.pop();
        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const data = line.slice(6).trim();
          if (data === "[DONE]") return;
          const json = JSON.parse(data);
          const delta = json.choices?.[0]?.delta?.content;
          if (delta) yield delta;
        }
      }
    }
  }
  window.API = new ApiHandler();
--- a/docs/apps/linkdin/templates/graph_view_template.html
+++ b/docs/apps/linkdin/templates/graph_view_template.html