Crawl4AI Prospect‑Wizard – step‑by‑step guide
A three‑stage demo that goes from LinkedIn scraping ➜ LLM reasoning ➜ graph visualisation.
prospect‑wizard/
├─ c4ai_discover.py # Stage 1 – scrape companies + people
├─ c4ai_insights.py # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
1 Install & boot a LinkedIn profile (one‑time)
1.1 Install dependencies
pip install crawl4ai litellm sentence-transformers pandas rich
1.2 Create / warm a LinkedIn browser profile
crwl profiles
- The interactive shell shows New profile – hit enter.
- Choose a name, e.g.
profile_linkedin_uc. - A Chromium window opens – log in to LinkedIn, solve whatever CAPTCHA, then close.
Remember the profile name. All future runs take
--profile-name <your_name>.
2 Discovery – scrape companies & people
python c4ai_discover.py full \
--query "health insurance management" \
--geo 102713980 \ # Malaysia geoUrn
--title-filters "" \ # or "Product,Engineering"
--max-companies 10 \ # default set small for workshops
--max-people 20 \ # \^ same
--profile-name profile_linkedin_uc \
--outdir ./data \
--concurrency 2 \
--log-level debug
Outputs in ./data/:
companies.jsonl– one JSON per companypeople.jsonl– one JSON per employee
🛠️ Dry‑run: C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee uses bundled HTML snippets, no network.
Handy geoUrn cheatsheet
| Location | geoUrn |
|---|---|
| Singapore | 103644278 |
| Malaysia | 102713980 |
| United States | 103644922 |
| United Kingdom | 102221843 |
| Australia | 101452733 |
See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX – the number after geoUrn= is what you need. |
3 Insights – embeddings, org‑charts, decision makers
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed-model all-MiniLM-L6-v2 \
--llm-provider gemini/gemini-2.0-flash \
--llm-api-key "" \
--top-k 10 \
--max-llm-tokens 8024 \
--llm-temperature 1.0 \
--workers 4
Emits next to the Stage‑1 files:
company_graph.json– inter‑company similarity graphorg_chart_<handle>.json– one per companydecision_makers.csv– hand‑picked ‘who to pitch’ list
Flags reference (straight from build_arg_parser()):
| Flag | Default | Purpose |
|---|---|---|
--in |
. |
Stage‑1 output dir |
--out |
. |
Destination dir |
--embed_model |
all-MiniLM-L6-v2 |
Sentence‑Transformer model |
--top_k |
10 |
Neighbours per company in graph |
--openai_model |
gpt-4.1 |
LLM for scoring decision makers |
--max_llm_tokens |
8024 |
Token budget per LLM call |
--llm_temperature |
1.0 |
Creativity knob |
--stub |
off | Skip OpenAI and fabricate tiny charts |
--workers |
4 |
Parallel LLM workers |
4 Visualise – interactive graph
After Stage 2 completes, simply open the HTML viewer from the project root:
open graph_view_template.html # or Live Server / Python -http
The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.
- Left pane → list of companies (clans).
- Click a node to load its org‑chart on the right.
- Chat drawer lets you ask follow‑up questions; context is pulled from
people.jsonl.
5 Common snags
| Symptom | Fix |
|---|---|
| Infinite CAPTCHA | Use a residential proxy: --proxy http://user:pass@ip:port |
| 429 Too Many Requests | Lower --concurrency, rotate profile, add delay |
| Blank graph | Check JSON paths, clear localStorage in browser |
TL;DR
crwl profiles → c4ai_discover.py → c4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.