This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity. The changes include: - Added and for comprehensive LinkedIn data discovery. - Removed to eliminate outdated content. - Updated to reflect new data visualization requirements. - Introduced and to facilitate easier access to URL seeding techniques. - Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples. These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
Crawl4AI Prospect‑Wizard – step‑by‑step guide
A three‑stage demo that goes from LinkedIn scraping ➜ LLM reasoning ➜ graph visualisation.
Try it in Google Colab! Click the badge above to run this demo in a cloud environment with zero setup required.
prospect‑wizard/
├─ c4ai_discover.py # Stage 1 – scrape companies + people
├─ c4ai_insights.py # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
1 Install & boot a LinkedIn profile (one‑time)
1.1 Install dependencies
pip install crawl4ai litellm sentence-transformers pandas rich
1.2 Create / warm a LinkedIn browser profile
crwl profiles
- The interactive shell shows New profile – hit enter.
- Choose a name, e.g.
profile_linkedin_uc. - A Chromium window opens – log in to LinkedIn, solve whatever CAPTCHA, then close.
Remember the profile name. All future runs take
--profile-name <your_name>.
2 Discovery – scrape companies & people
python c4ai_discover.py full \
--query "health insurance management" \
--geo 102713980 \ # Malaysia geoUrn
--title-filters "" \ # or "Product,Engineering"
--max-companies 10 \ # default set small for workshops
--max-people 20 \ # \^ same
--profile-name profile_linkedin_uc \
--outdir ./data \
--concurrency 2 \
--log-level debug
Outputs in ./data/:
companies.jsonl– one JSON per companypeople.jsonl– one JSON per employee
🛠️ Dry‑run: C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee uses bundled HTML snippets, no network.
Handy geoUrn cheatsheet
| Location | geoUrn |
|---|---|
| Singapore | 103644278 |
| Malaysia | 102713980 |
| United States | 103644922 |
| United Kingdom | 102221843 |
| Australia | 101452733 |
See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX – the number after geoUrn= is what you need. |
3 Insights – embeddings, org‑charts, decision makers
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed-model all-MiniLM-L6-v2 \
--llm-provider gemini/gemini-2.0-flash \
--llm-api-key "" \
--top-k 10 \
--max-llm-tokens 8024 \
--llm-temperature 1.0 \
--workers 4
Emits next to the Stage‑1 files:
company_graph.json– inter‑company similarity graphorg_chart_<handle>.json– one per companydecision_makers.csv– hand‑picked ‘who to pitch’ list
Flags reference (straight from build_arg_parser()):
| Flag | Default | Purpose |
|---|---|---|
--in |
. |
Stage‑1 output dir |
--out |
. |
Destination dir |
--embed_model |
all-MiniLM-L6-v2 |
Sentence‑Transformer model |
--top_k |
10 |
Neighbours per company in graph |
--openai_model |
gpt-4.1 |
LLM for scoring decision makers |
--max_llm_tokens |
8024 |
Token budget per LLM call |
--llm_temperature |
1.0 |
Creativity knob |
--stub |
off | Skip OpenAI and fabricate tiny charts |
--workers |
4 |
Parallel LLM workers |
4 Visualise – interactive graph
After Stage 2 completes, simply open the HTML viewer from the project root:
open graph_view_template.html # or Live Server / Python -http
The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.
- Left pane → list of companies (clans).
- Click a node to load its org‑chart on the right.
- Chat drawer lets you ask follow‑up questions; context is pulled from
people.jsonl.
5 Common snags
| Symptom | Fix |
|---|---|
| Infinite CAPTCHA | Use a residential proxy: --proxy http://user:pass@ip:port |
| 429 Too Many Requests | Lower --concurrency, rotate profile, add delay |
| Blank graph | Check JSON paths, clear localStorage in browser |
TL;DR
crwl profiles → c4ai_discover.py → c4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.