
Crawl4AI Prospect-Wizard: step-by-step guide

A three-stage demo that goes from LinkedIn scraping → LLM reasoning → graph visualisation.

prospect-wizard/
├─ c4ai_discover.py         # Stage 1: scrape companies + people
├─ c4ai_insights.py         # Stage 2: embeddings, org-charts, scores
├─ graph_view_template.html # Stage 3: graph viewer (static HTML)
└─ data/                    # output lands here (*.jsonl / *.json)

1  Install & boot a LinkedIn profile (one-time)

1.1  Install dependencies

pip install crawl4ai litellm sentence-transformers pandas rich

1.2  Create / warm a LinkedIn browser profile

crwl profiles
  1. The interactive shell shows "New profile"; press Enter.
  2. Choose a name, e.g. profile_linkedin_uc.
  3. A Chromium window opens; log in to LinkedIn, solve any CAPTCHA, then close the window.

Remember the profile name. All future runs take --profile-name <your_name>.


2  Discovery: scrape companies & people

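# --geo 102713980 is the Malaysia geoUrn.
# --title-filters takes "" (no filter) or a list such as "Product,Engineering".
# --max-companies and --max-people are kept small here for workshops.
# Comments sit above the command: a comment after a trailing "\" breaks the line continuation.
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug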

Outputs in ./data/:

  • companies.jsonl: one JSON object per company
  • people.jsonl: one JSON object per employee
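
To inspect the Stage-1 output before moving on, the JSONL files can be loaded one line at a time. The helper below is a minimal sketch: read_jsonl is a hypothetical name, not part of crawl4ai, and it assumes nothing about the fields inside each record.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, len(read_jsonl("data/companies.jsonl")) gives the number of scraped companies once Stage 1 has run.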

🛠️ Dry-run: running C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee uses bundled HTML snippets and needs no network access.

Handy geoUrn cheat-sheet

Location         geoUrn
Singapore        103644278
Malaysia         102713980
United States    103644922
United Kingdom   102221843
Australia        101452733

See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX (the number after geoUrn= is what you need).
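
Copying the number out of a search URL by hand is error-prone, so a small parser can help. This is a sketch only: extract_geo_urn is a hypothetical helper, and handling a bracketed, percent-encoded value (such as %5B%22103644278%22%5D) is an assumption about how the address bar may encode the parameter.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_geo_urn(url):
    """Return the numeric geoUrn from a LinkedIn search URL, or None."""
    query = parse_qs(urlparse(url).query)  # also percent-decodes the values
    values = query.get("geoUrn")
    if not values:
        return None
    # Take the first run of digits, so both 102713980 and ["102713980"] work.
    match = re.search(r"\d+", values[0])
    return match.group(0) if match else None
```

The returned string can be passed directly as the --geo value in the Stage-1 command.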

3  Insights: embeddings, org-charts, decision makers

python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed-model all-MiniLM-L6-v2 \
  --llm-provider gemini/gemini-2.0-flash \
  --llm-api-key "" \
  --top-k 10 \
  --max-llm-tokens 8024 \
  --llm-temperature 1.0 \
  --workers 4

Emits next to the Stage-1 files:

  • company_graph.json: inter-company similarity graph
  • org_chart_<handle>.json: one per company
  • decision_makers.csv: hand-picked "who to pitch" list

Flags reference (straight from build_arg_parser()):

Flag               Default            Purpose
--in               .                  Stage-1 output dir
--out              .                  Destination dir
--embed_model      all-MiniLM-L6-v2   SentenceTransformer model
--top_k            10                 Neighbours per company in graph
--openai_model     gpt-4.1            LLM for scoring decision makers
--max_llm_tokens   8024               Token budget per LLM call
--llm_temperature  1.0                Creativity knob
--stub             off                Skip OpenAI and fabricate tiny charts
--workers          4                  Parallel LLM workers
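
To make the "neighbours per company" setting concrete, here is one way a top-k similarity graph can be derived from embedding vectors. It is an illustration under stated assumptions, not the code in c4ai_insights.py: the function names are hypothetical, cosine similarity is assumed as the metric, and toy 2-D vectors stand in for real SentenceTransformer embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbours(embeddings, k):
    """Map each name to its k most similar other names, best first."""
    result = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name  # a company is not its own neighbour
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[name] = [other for _, other in scored[:k]]
    return result


# Toy vectors, purely illustrative.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbours = top_k_neighbours(toy, k=1)
```

A larger k yields a denser graph; a smaller k keeps only the strongest links.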

4  Visualise: interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:

open graph_view_template.html   # or serve the folder with Live Server / python -m http.server

The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.

  • Left pane → list of companies (clans).
  • Click a node to load its org-chart on the right.
  • Chat drawer lets you ask follow-up questions; context is pulled from people.jsonl.
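
Because the viewer expects data/ to sit beside the HTML file, a quick layout check before opening the page can save a debugging round. The sketch below uses only the file names mentioned in this guide; missing_viewer_inputs is a hypothetical helper, not something shipped with the demo.

```python
from pathlib import Path


def missing_viewer_inputs(root="."):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(root)
    data = root / "data"
    problems = []
    if not (root / "graph_view_template.html").is_file():
        problems.append("graph_view_template.html not found")
    if not (data / "company_graph.json").is_file():
        problems.append("data/company_graph.json not found")
    # At least one per-company org chart should exist next to the graph file.
    if not data.is_dir() or not any(data.glob("org_chart_*.json")):
        problems.append("no data/org_chart_*.json files found")
    return problems
```

Calling missing_viewer_inputs(".") from the project root and printing the result shows at a glance which Stage-2 outputs are absent.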

5  Common snags

Symptom                 Fix
Infinite CAPTCHA        Use a residential proxy: --proxy http://user:pass@ip:port
429 Too Many Requests   Lower --concurrency, rotate profile, add delay
Blank graph             Check JSON paths, clear localStorage in browser
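
The "add delay" advice for 429 responses is commonly implemented as exponential backoff. The following is a generic sketch, not a crawl4ai feature: backoff_delays and call_with_backoff are hypothetical helpers, and the base, factor and cap values are arbitrary illustrations.

```python
import time


def backoff_delays(base=1.0, factor=2.0, retries=5, cap=60.0):
    """Return the wait times (seconds) for successive retries, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def call_with_backoff(fn, is_rate_limited, sleep=time.sleep, **kwargs):
    """Call fn(); while is_rate_limited(result) holds, wait and retry.

    Gives up after the delays from backoff_delays(**kwargs) are exhausted
    and returns the last result either way.
    """
    result = fn()
    for delay in backoff_delays(**kwargs):
        if not is_rate_limited(result):
            break
        sleep(delay)
        result = fn()
    return result
```

Passing a custom sleep callable makes the helper easy to test without real waiting; in practice the default time.sleep would be used.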

TL;DR

crwl profiles → c4ai_discover.py → c4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.