Files

UncleCode c6fc5c0518 docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation

This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.

The changes include:
- Added  and  for comprehensive LinkedIn data discovery.
- Removed  to eliminate outdated content.
- Updated  to reflect new data visualization requirements.
- Introduced  and  to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.

2025-06-05 15:06:25 +08:00

samples

Fix temperature typo and enhance LinkedIn extraction with Colab support

2025-05-25 16:47:12 +08:00

schemas

Fix temperature typo and enhance LinkedIn extraction with Colab support

2025-05-25 16:47:12 +08:00

snippets

Fix temperature typo and enhance LinkedIn extraction with Colab support

2025-05-25 16:47:12 +08:00

templates

docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation

2025-06-05 15:06:25 +08:00

c4ai_discover.py

Fix temperature typo and enhance LinkedIn extraction with Colab support

2025-05-25 16:47:12 +08:00

c4ai_insights.py

Fix temperature typo and enhance LinkedIn extraction with Colab support

2025-05-25 16:47:12 +08:00

Crawl4ai_Linkedin_Data_Discovery_Part_1.ipynb

docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation

2025-06-05 15:06:25 +08:00

Crawl4ai_Linkedin_Data_Discovery_Part_2.ipynb

docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation

2025-06-05 15:06:25 +08:00

README.md

Add Google Colab button to LinkedIn Prospect Wizard README

2025-05-26 14:35:06 +08:00

README.md

Crawl4AI Prospect‑Wizard – step‑by‑step guide

A three‑stage demo that goes from LinkedIn scraping ➜ LLM reasoning ➜ graph visualisation.

Try it in Google Colab! Click the badge above to run this demo in a cloud environment with zero setup required.

prospect‑wizard/
├─ c4ai_discover.py         # Stage 1 – scrape companies + people
├─ c4ai_insights.py         # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
└─ data/                    # output lands here (*.jsonl / *.json)

1 Install & boot a LinkedIn profile (one‑time)

1.1 Install dependencies

pip install crawl4ai litellm sentence-transformers pandas rich

1.2 Create / warm a LinkedIn browser profile

crwl profiles

The interactive shell shows New profile – hit enter.
Choose a name, e.g. profile_linkedin_uc.
A Chromium window opens – log in to LinkedIn, solve any CAPTCHA that appears, then close the window.

Remember the profile name. All future runs take --profile-name <your_name>.

2 Discovery – scrape companies & people

# --geo: 102713980 is the Malaysia geoUrn
# --title-filters: leave empty, or pass e.g. "Product,Engineering"
# --max-companies / --max-people: defaults are kept small for workshops
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug

Outputs in ./data/:

companies.jsonl – one JSON per company
people.jsonl – one JSON per employee
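
To inspect the Stage‑1 output, a minimal sketch for reading the JSONL files follows. It assumes only what is stated above, namely that each non-empty line holds one JSON object. The helper name `read_jsonl` is hypothetical, and no particular field names are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    data_dir = Path("./data")
    for name in ("companies.jsonl", "people.jsonl"):
        file_path = data_dir / name
        if file_path.exists():
            print(name, len(read_jsonl(file_path)), "records")
        else:
            print(name, "not found - run Stage 1 first")
```

Reading line by line keeps a single malformed record easy to locate, since `json.loads` raises on the exact line that fails.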

🛠️ Dry‑run: `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses the bundled HTML snippets and makes no network requests.

Handy geoUrn cheatsheet

Location	geoUrn
Singapore	103644278
Malaysia	102713980
United States	103644922
United Kingdom	102221843
Australia	101452733
See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX – the number after `geoUrn=` is what you need.
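
If you copy a search URL from the browser, the geoUrn can be pulled out programmatically. The sketch below is an illustration only: `extract_geo_urn` is a hypothetical helper, and it assumes the value is either a bare number or a number wrapped in URL-encoded brackets and quotes.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_geo_urn(url):
    """Return the numeric geoUrn from a search URL as a string, or None."""
    query = parse_qs(urlparse(url).query)  # parse_qs also percent-decodes values
    values = query.get("geoUrn", [])
    if not values:
        return None
    match = re.search(r"\d+", values[0])  # first run of digits in the value
    return match.group(0) if match else None


if __name__ == "__main__":
    url = "https://www.linkedin.com/search/results/companies/?geoUrn=102713980"
    print(extract_geo_urn(url))
```

The returned string can be passed directly to `--geo`.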

3 Insights – embeddings, org‑charts, decision makers

python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed-model all-MiniLM-L6-v2 \
  --llm-provider gemini/gemini-2.0-flash \
  --llm-api-key "" \
  --top-k 10 \
  --max-llm-tokens 8024 \
  --llm-temperature 1.0 \
  --workers 4

Emits next to the Stage‑1 files:

company_graph.json – inter‑company similarity graph
org_chart_<handle>.json – one per company
decision_makers.csv – hand‑picked ‘who to pitch’ list
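
To give a feel for what an inter‑company similarity graph with `--top-k` neighbours means, here is a minimal pure-Python sketch. It is not taken from `c4ai_insights.py`: the function names are hypothetical, and it assumes each company is already represented by an embedding vector and that similarity is measured by cosine.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_neighbours(vectors, k):
    """Map each company to its k most similar other companies, best first."""
    graph = {}
    for name, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != name  # a company is not its own neighbour
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        graph[name] = [other for _, other in scored[:k]]
    return graph


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(top_k_neighbours(toy, k=1))
```

This brute-force version compares every pair, which is fine for workshop-sized inputs such as the 10 companies used above.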

Flags reference (straight from build_arg_parser()):

Flag	Default	Purpose
`--in`	`.`	Stage‑1 output dir
`--out`	`.`	Destination dir
`--embed_model`	`all-MiniLM-L6-v2`	Sentence‑Transformer model
`--top_k`	`10`	Neighbours per company in graph
`--openai_model`	`gpt-4.1`	LLM for scoring decision makers
`--max_llm_tokens`	`8024`	Token budget per LLM call
`--llm_temperature`	`1.0`	Creativity knob
`--stub`	off	Skip OpenAI and fabricate tiny charts
`--workers`	`4`	Parallel LLM workers

4 Visualise – interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:

open graph_view_template.html   # or Live Server / python -m http.server

The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.

Left pane → list of companies (clans).
Click a node to load its org‑chart on the right.
Chat drawer lets you ask follow‑up questions; context is pulled from people.jsonl.
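
Before opening the viewer, it can help to confirm that the files it fetches are where it expects them. The check below is a sketch based only on the layout described above (`data/` beside the HTML file); `check_viewer_inputs` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def check_viewer_inputs(project_root="."):
    """Return a list of problems; an empty list means all expected files exist."""
    root = Path(project_root)
    data = root / "data"
    problems = []
    if not (root / "graph_view_template.html").is_file():
        problems.append("graph_view_template.html is missing")
    if not (data / "company_graph.json").is_file():
        problems.append("data/company_graph.json is missing")
    if not list(data.glob("org_chart_*.json")):
        problems.append("no data/org_chart_*.json files found")
    return problems


if __name__ == "__main__":
    issues = check_viewer_inputs(".")
    print("ready to open the viewer" if not issues else "\n".join(issues))
```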

5 Common snags

Symptom	Fix
Infinite CAPTCHA	Use a residential proxy: `--proxy http://user:pass@ip:port`
429 Too Many Requests	Lower `--concurrency`, rotate profile, add delay
Blank graph	Check JSON paths, clear `localStorage` in browser
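
For the "add delay" advice on 429 responses, one common pattern is exponential backoff with jitter. The sketch below is a generic illustration, not something crawl4ai or the scripts here are known to expose; `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(retries, base=1.0, cap=60.0, jitter=True, rng=random):
    """Yield one sleep duration in seconds per retry: base * 2**attempt, capped."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        # Full jitter spreads retries out so parallel workers do not sync up.
        yield rng.uniform(0, delay) if jitter else delay


if __name__ == "__main__":
    # Deterministic schedule with a 5-second cap, for illustration.
    print(list(backoff_delays(4, base=1.0, cap=5.0, jitter=False)))
```

A caller would `time.sleep()` for each yielded value between attempts and stop as soon as a request succeeds.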

TL;DR

crwl profiles → c4ai_discover.py → c4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.
