crawl4ai/docs/apps/linkdin
UncleCode c6fc5c0518 docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.

The changes include:
- Added  and  for comprehensive LinkedIn data discovery.
- Removed  to eliminate outdated content.
- Updated  to reflect new data visualization requirements.
- Introduced  and  to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
2025-06-05 15:06:25 +08:00

Crawl4AI Prospect-Wizard: step-by-step guide

Open In Colab

A three-stage demo that goes from LinkedIn scraping → LLM reasoning → graph visualisation.

Try it in Google Colab! Click the badge above to run this demo in a cloud environment with zero setup required.

prospectwizard/
├─ c4ai_discover.py         # Stage 1: scrape companies + people
├─ c4ai_insights.py         # Stage 2: embeddings, org-charts, scores
├─ graph_view_template.html # Stage 3: graph viewer (static HTML)
└─ data/                    # output lands here (*.jsonl / *.json)

1  Install & boot a LinkedIn profile (one-time)

1.1  Install dependencies

pip install crawl4ai litellm sentence-transformers pandas rich
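
To confirm the install worked before moving on, a small check like the one below can help. It is only a sketch: the helper names are made up for this guide, and the module list assumes the usual import names for the packages above (sentence-transformers is imported as sentence_transformers).

```python
from importlib.util import find_spec

# Import names assumed for the pip packages listed above; note that the
# sentence-transformers package is imported as sentence_transformers.
REQUIRED_MODULES = ["crawl4ai", "litellm", "sentence_transformers", "pandas", "rich"]


def check_modules(names=REQUIRED_MODULES):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


def missing_modules(names=REQUIRED_MODULES):
    """List the modules that are not importable in this environment."""
    return [name for name, ok in check_modules(names).items() if not ok]
```

Calling print(missing_modules()) should print an empty list once every dependency is installed.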

1.2  Create / warm a LinkedIn browser profile

crwl profiles
  1. The interactive shell shows "New profile"; hit Enter.
  2. Choose a name, e.g. profile_linkedin_uc.
  3. A Chromium window opens; log in to LinkedIn, solve any CAPTCHA, then close the window.

Remember the profile name. All future runs take --profile-name <your_name>.


2  Discovery: scrape companies & people

# --geo 102713980 is the Malaysia geoUrn.
# --title-filters "" disables filtering; or pass e.g. "Product,Engineering".
# --max-companies / --max-people are set small here for workshops.
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug
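
If you would rather launch Stage 1 from Python (for example from a notebook), the same command can be assembled as an argument list. This is a minimal sketch: build_discover_argv and run_discover are hypothetical helpers, not part of the demo scripts, and they only use the flags shown in the command above.

```python
import subprocess
import sys


def build_discover_argv(query, geo, profile_name, outdir="./data",
                        title_filters="", max_companies=10, max_people=20,
                        concurrency=2, log_level="debug"):
    """Assemble the argument list for `c4ai_discover.py full`."""
    return [
        sys.executable, "c4ai_discover.py", "full",
        "--query", query,
        "--geo", str(geo),
        "--title-filters", title_filters,
        "--max-companies", str(max_companies),
        "--max-people", str(max_people),
        "--profile-name", profile_name,
        "--outdir", outdir,
        "--concurrency", str(concurrency),
        "--log-level", log_level,
    ]


def run_discover(**kwargs):
    """Run Stage 1 and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_discover_argv(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting issues with multi-word queries.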

Outputs in ./data/:

  • companies.jsonl: one JSON object per company
  • people.jsonl: one JSON object per employee
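
Because both files hold one JSON object per line, they can be read with the standard library alone. The sketch below assumes nothing about the field names inside each record; read_jsonl and load_stage1 are illustrative helper names.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_stage1(data_dir="./data"):
    """Load both Stage-1 files into lists of dicts."""
    data_dir = Path(data_dir)
    companies = list(read_jsonl(data_dir / "companies.jsonl"))
    people = list(read_jsonl(data_dir / "people.jsonl"))
    return companies, people
```

A quick len(companies), len(people) after load_stage1() is a cheap sanity check that the scrape produced data.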

🛠️ Dry-run: C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee uses bundled HTML snippets and makes no network requests.

Handy geoUrn cheat-sheet

Location         geoUrn
Singapore        103644278
Malaysia         102713980
United States    103644922
United Kingdom   102221843
Australia        101452733
See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX (the number after geoUrn= is what you need).
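
Pulling that number out of a copied search URL can be scripted. The sketch below uses only urllib.parse and a regular expression; the function names are hypothetical, and keeping only the digits is an assumption made so that values wrapped in brackets or quotes still yield the bare number.

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse

SEARCH_BASE = "https://www.linkedin.com/search/results/companies/"


def extract_geo_urn(url):
    """Return the digits of the geoUrn query value, or None if absent."""
    values = parse_qs(urlparse(url).query).get("geoUrn")
    if not values:
        return None
    match = re.search(r"\d+", values[0])
    return match.group(0) if match else None


def company_search_url(geo_urn):
    """Build the company-search URL for a given geoUrn number."""
    return SEARCH_BASE + "?" + urlencode({"geoUrn": geo_urn})
```

The returned string can be passed directly as the --geo value in Stage 1.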

3  Insights: embeddings, org-charts, decision makers

python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed-model all-MiniLM-L6-v2 \
  --llm-provider gemini/gemini-2.0-flash \
  --llm-api-key "" \
  --top-k 10 \
  --max-llm-tokens 8024 \
  --llm-temperature 1.0 \
  --workers 4

Emits next to the Stage-1 files:

  • company_graph.json: inter-company similarity graph
  • org_chart_<handle>.json: one per company
  • decision_makers.csv: hand-picked "who to pitch" list
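
To make the idea of a top-k similarity graph concrete, here is a toy sketch in pure Python. It is not the code in c4ai_insights.py: cosine similarity is an assumption about how embeddings are compared, and the vectors in the usage note are hand-made rather than model embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_neighbours(vectors, k=10):
    """Map each key to its k most similar other keys, best match first."""
    graph = {}
    for name, vec in vectors.items():
        scored = [(other, cosine(vec, other_vec))
                  for other, other_vec in vectors.items() if other != name]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[name] = scored[:k]
    return graph
```

For example, top_k_neighbours({"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1]}, k=1) links "a" to "b", since those two vectors point in nearly the same direction.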

Flags reference (straight from build_arg_parser()):

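Flag               Default            Purpose
--in               .                  Stage-1 output dir
--out              .                  Destination dir
--embed_model      all-MiniLM-L6-v2   SentenceTransformer model
--top_k            10                 Neighbours per company in graph
--openai_model     gpt-4.1            LLM for scoring decision makers
--max_llm_tokens   8024               Token budget per LLM call
--llm_temperature  1.0                Creativity knob
--stub             off                Skip OpenAI and fabricate tiny charts
--workers          4                  Parallel LLM workers

Note: the example command above spells these flags with hyphens (e.g. --embed-model, --top-k) and passes --llm-provider / --llm-api-key rather than --openai_model. Run python c4ai_insights.py --help to confirm the exact names in your copy.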
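
For readers unfamiliar with argparse, a parser carrying the defaults listed in this table could look roughly like the sketch below. It is a hypothetical reconstruction from the table only, not the real build_arg_parser(); the function name and the in_dir / out_dir attribute names are assumptions.

```python
import argparse


def sketch_arg_parser():
    """Hypothetical reconstruction of the table; not the script's real parser."""
    p = argparse.ArgumentParser(description="Stage 2 flags (illustrative)")
    # "in" is a Python keyword, so store the value under a different name.
    p.add_argument("--in", dest="in_dir", default=".", help="Stage-1 output dir")
    p.add_argument("--out", dest="out_dir", default=".", help="Destination dir")
    p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
    p.add_argument("--top_k", type=int, default=10)
    p.add_argument("--openai_model", default="gpt-4.1")
    p.add_argument("--max_llm_tokens", type=int, default=8024)
    p.add_argument("--llm_temperature", type=float, default=1.0)
    # "off" by default: the flag is absent unless passed explicitly.
    p.add_argument("--stub", action="store_true")
    p.add_argument("--workers", type=int, default=4)
    return p
```

Parsing an empty argument list with sketch_arg_parser().parse_args([]) yields exactly the defaults from the table.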

4  Visualise: interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:

open graph_view_template.html   # or Live Server / python -m http.server

The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.

  • Left pane → list of companies (clans).
  • Click a node to load its org-chart on the right.
  • Chat drawer lets you ask follow-up questions; context is pulled from people.jsonl.
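
If you prefer to serve the viewer over HTTP rather than open the file directly, Python's standard library is enough. The sketch below is one way to do it; make_viewer_server is a hypothetical helper, and port 8000 is an arbitrary choice.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_viewer_server(root=".", port=8000):
    """Create (but do not start) a static file server rooted at `root`."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(root))
    # Bind to localhost only, so the data folder is not exposed on the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Run make_viewer_server().serve_forever() from the project root, then browse to http://127.0.0.1:8000/graph_view_template.html; stop it with Ctrl+C.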

5  Common snags

Symptom                 Fix
Infinite CAPTCHA        Use a residential proxy: --proxy http://user:pass@ip:port
429 Too Many Requests   Lower --concurrency, rotate profile, add delay
Blank graph             Check JSON paths, clear localStorage in browser
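
The "add delay" advice for 429 responses is often implemented as exponential backoff. The sketch below shows the general pattern only; it is not part of crawl4ai or the demo scripts, and every name and default in it is an assumption.

```python
import random
import time


def backoff_delays(retries=5, base=1.0, factor=2.0, cap=60.0):
    """Yield capped exponential delays: base, base*factor, ... (seconds)."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(fn, is_rate_limited, retries=5, base=1.0,
                      jitter=0.0, sleep=time.sleep):
    """Call fn(); while is_rate_limited(result) is true, wait and retry."""
    result = fn()
    for delay in backoff_delays(retries, base):
        if not is_rate_limited(result):
            break
        # Optional random jitter spreads out retries from parallel workers.
        sleep(delay + random.uniform(0, jitter))
        result = fn()
    return result
```

Here fn would be whatever performs one request, and is_rate_limited a predicate such as lambda status: status == 429. If all retries are used up, the last result is returned as-is.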

TL;DR

crwl profiles → c4ai_discover.py → c4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.