From 50f0b83fcd4e951b7109b653d14bc3a04ca604a8 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Wed, 30 Apr 2025 19:38:25 +0800
Subject: [PATCH] feat(linkedin): add prospect-wizard app with scraping and visualization

Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
---
 docs/apps/linkdin/README.md                   |  126 ++
 docs/apps/linkdin/c4ai_discover.py            |  440 +++++++
 docs/apps/linkdin/c4ai_insights.py            |  372 ++++++
 docs/apps/linkdin/schemas/company_card.json   |   39 +
 docs/apps/linkdin/schemas/people_card.json    |   38 +
 docs/apps/linkdin/snippets/company.html       |  143 ++
 docs/apps/linkdin/snippets/people.html        |   94 ++
 docs/apps/linkdin/templates/ai.js             |   50 +
 .../templates/graph_view_template.html        | 1171 +++++++++++++++++
 9 files changed, 2473 insertions(+)
 create mode 100644 docs/apps/linkdin/README.md
 create mode 100644 docs/apps/linkdin/c4ai_discover.py
 create mode 100644 docs/apps/linkdin/c4ai_insights.py
 create mode 100644 docs/apps/linkdin/schemas/company_card.json
 create mode 100644 docs/apps/linkdin/schemas/people_card.json
 create mode 100644 docs/apps/linkdin/snippets/company.html
 create mode 100644 docs/apps/linkdin/snippets/people.html
 create mode 100644 docs/apps/linkdin/templates/ai.js
 create mode 100644 docs/apps/linkdin/templates/graph_view_template.html

diff --git a/docs/apps/linkdin/README.md b/docs/apps/linkdin/README.md
new file mode 100644
index 00000000..cce244ac
--- /dev/null
+++ b/docs/apps/linkdin/README.md
@@ -0,0 +1,126 @@
+# Crawl4AI Prospect‑Wizard – step‑by‑step guide
+
+A three‑stage demo
that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.
+
+```
+prospect‑wizard/
+├─ c4ai_discover.py         # Stage 1 – scrape companies + people
+├─ c4ai_insights.py         # Stage 2 – embeddings, org‑charts, scores
+├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
+└─ data/                    # output lands here (*.jsonl / *.json)
+```
+
+---
+
+## 1  Install & boot a LinkedIn profile (one‑time)
+
+### 1.1  Install dependencies
+```bash
+pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
+```
+
+### 1.2  Create / warm a LinkedIn browser profile
+```bash
+crwl profiler
+```
+1. The interactive shell shows **New profile** – hit **enter**.
+2. Choose a name, e.g. `profile_linkedin_uc`.
+3. A Chromium window opens – log in to LinkedIn, solve whatever CAPTCHA, then close.
+
+> Remember the **profile name**. All future runs take `--profile-name `.
+
+---
+
+## 2  Discovery – scrape companies & people
+
+```bash
+python c4ai_discover.py full \
+  --query "health insurance management" \
+  --geo 102713980 \ # Malaysia geoUrn
+  --title_filters "" \ # or "Product,Engineering"
+  --max_companies 10 \ # default set small for workshops
+  --max_people 20 \ # \^ same
+  --profile-name profile_linkedin_uc \
+  --outdir ./data \
+  --concurrency 2 \
+  --log_level debug
+```
+**Outputs** in `./data/`:
+* `companies.jsonl` – one JSON per company
+* `people.jsonl` – one JSON per employee
+
+🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
+
+### Handy geoUrn cheatsheet
+| Location | geoUrn |
+|----------|--------|
+| Singapore | **103644278** |
+| Malaysia | **102713980** |
+| United States | **103644922** |
+| United Kingdom | **102221843** |
+| Australia | **101452733** |
+_See more: – the number after `geoUrn=` is what you need._
+
+---
+
+## 3  Insights – embeddings, org‑charts, decision makers
+
+```bash
+python c4ai_insights.py \
+  --in ./data \
+  --out ./data \
+  --embed_model all-MiniLM-L6-v2 \
+  --top_k 10 \
+  --openai_model gpt-4.1 \
+  --max_llm_tokens 8024 \
+  --llm_temperature 1.0 \
+  --workers 4
+```
+Emits next to the Stage‑1 files:
+* `company_graph.json` – inter‑company similarity graph
+* `org_chart_.json` – one per company
+* `decision_makers.csv` – hand‑picked ‘who to pitch’ list
+
+Flags reference (straight from `build_arg_parser()`):
+| Flag | Default | Purpose |
+|------|---------|---------|
+| `--in` | `.` | Stage‑1 output dir |
+| `--out` | `.` | Destination dir |
+| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
+| `--top_k` | `10` | Neighbours per company in graph |
+| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
+| `--max_llm_tokens` | `8024` | Token budget per LLM call |
+| `--llm_temperature` | `1.0` | Creativity knob |
+| `--stub` | off | Skip OpenAI and fabricate tiny charts |
+| `--workers` | `4` | Parallel LLM workers |
+
+---
+
+## 4  Visualise – interactive graph
+
+After Stage 2 completes, simply open the HTML viewer from the project root:
+```bash
+open graph_view_template.html # or Live Server / Python -http
+```
+The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
+
+* Left pane → list of companies (clans).
+* Click a node to load its org‑chart on the right.
+* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.
+
+---
+
+## 5  Common snags
+
+| Symptom | Fix |
+|---------|-----|
+| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
+| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
+| Blank graph | Check JSON paths, clear `localStorage` in browser |
+
+---
+
+### TL;DR
+`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.
+Live long and `import crawl4ai`.
+
diff --git a/docs/apps/linkdin/c4ai_discover.py b/docs/apps/linkdin/c4ai_discover.py
new file mode 100644
index 00000000..82874568
--- /dev/null
+++ b/docs/apps/linkdin/c4ai_discover.py
@@ -0,0 +1,440 @@
+#!/usr/bin/env python3
+"""
+c4ai-discover — Stage‑1 Discovery CLI
+
+Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
+JSON files: companies.jsonl and people.jsonl.
+
+Key design rules
+----------------
+* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
+* JsonCssExtractionStrategy for structured scraping; schema auto‑generated once
+  from sample HTML provided by user and then cached under ./schemas/.
+* Defaults are embedded so the file runs inside VS Code debugger without CLI args.
+* If executed as a console script (argv > 1), CLI flags win.
+* Lightweight deps: argparse + Crawl4AI stack.
+
+Author: Tom @ Kidocode 2025‑04‑26
+"""
+from __future__ import annotations
+
+import warnings, re
+warnings.filterwarnings(
+    "ignore",
+    message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
+    category=FutureWarning,
+    module=r"soupsieve"
+)
+
+
+# ───────────────────────────────────────────────────────────────────────────────
+# Imports
+# ───────────────────────────────────────────────────────────────────────────────
+import argparse
+import random
+import asyncio
+import json
+import logging
+import os
+import pathlib
+import sys
+# 3rd-party rich for pretty logging
+from rich.console import Console
+from rich.logging import RichHandler
+
+from datetime import datetime, UTC
+from itertools import cycle
+from textwrap import dedent
+from types import SimpleNamespace
+from typing import Dict, List, Optional
+from urllib.parse import quote
+from pathlib import Path
+from glob import glob
+
+from crawl4ai import (
+    AsyncWebCrawler,
+    BrowserConfig,
+    CacheMode,
+    CrawlerRunConfig,
+    JsonCssExtractionStrategy,
+    BrowserProfiler,
+    LLMConfig,
+)
+
+# ───────────────────────────────────────────────────────────────────────────────
+# Constants / paths
+# ───────────────────────────────────────────────────────────────────────────────
+BASE_DIR = pathlib.Path(__file__).resolve().parent
+SCHEMA_DIR = BASE_DIR / "schemas"
+SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
+COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
+PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"
+
+# ---------- deterministic target JSON examples ----------
+_COMPANY_SCHEMA_EXAMPLE = {
+    "handle": "/company/posify/",
+    "profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
+    "name": "Management Research Services, Inc.
(MRS, Inc)",
+    "descriptor": "Insurance • Milwaukee, Wisconsin",
+    "about": "Insurance • Milwaukee, Wisconsin",
+    "followers": 1000
+}
+
+_PEOPLE_SCHEMA_EXAMPLE = {
+    "profile_url": "https://www.linkedin.com/in/lily-ng/",
+    "name": "Lily Ng",
+    "headline": "VP Product @ Posify",
+    "followers": 890,
+    "connection_degree": "2nd",
+    "avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg"
+}
+
+# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
+_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
+_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()
+
+# --------- tighter schema prompts ----------
+_COMPANY_SCHEMA_QUERY = dedent(
+    """
+    Using the supplied
  • company-card HTML, build a JsonCssExtractionStrategy schema that,
+    for every card, outputs *exactly* the keys shown in the example JSON below.
+    JSON spec:
+    • handle – href of the outermost that wraps the logo/title, e.g. "/company/posify/"
+    • profile_image – absolute URL of the inside that link
+    • name – text of the inside the
+    • descriptor – text line with industry • location
+    • about – text of the
    below the name (industry + geo)
+    • followers – integer parsed from the
    containing 'followers'
+
+    IMPORTANT: Do not use the base64 kind of classes to target element. It's not reliable.
+    The main div parent contains these li element is "div.search-results-container" you can use this.
+    The