Files
crawl4ai/docs/apps/linkdin/README.md
UncleCode 3b766e1aac Add Google Colab button to LinkedIn Prospect Wizard README
- Added Colab badge linking to the demo notebook
- Added call-to-action encouraging users to try the demo in Colab
- Provides zero-setup cloud environment for testing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-26 14:35:06 +08:00

132 lines
4.4 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Crawl4AIProspectWizard stepbystep guide
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10nRCwmfxPjVrRUHyJsYlX7BH5bvPoGpx?usp=sharing)
A threestage demo that goes from **LinkedIn scraping****LLM reasoning****graph visualisation**.
**Try it in Google Colab!** Click the badge above to run this demo in a cloud environment with zero setup required.
```
prospectwizard/
├─ c4ai_discover.py # Stage 1 scrape companies + people
├─ c4ai_insights.py # Stage 2 embeddings, orgcharts, scores
├─ graph_view_template.html # Stage 3 graph viewer (static HTML)
└─ data/ # output lands here (*.jsonl / *.json)
```
---
## 1  Install & boot a LinkedIn profile (onetime)
### 1.1  Install dependencies
```bash
pip install crawl4ai litellm sentence-transformers pandas rich
```
### 1.2  Create / warm a LinkedIn browser profile
```bash
crwl profiles
```
1. The interactive shell shows **New profile** hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens log in to LinkedIn, solve whatever CAPTCHA, then close.
> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
---
## 2  Discovery scrape companies & people
```bash
python c4ai_discover.py full \
--query "health insurance management" \
--geo 102713980 \ # Malaysia geoUrn
--title-filters "" \ # or "Product,Engineering"
--max-companies 10 \ # default set small for workshops
--max-people 20 \ # \^ same
--profile-name profile_linkedin_uc \
--outdir ./data \
--concurrency 2 \
--log-level debug
```
**Outputs** in `./data/`:
* `companies.jsonl` one JSON per company
* `people.jsonl` one JSON per employee
🛠️ **Dryrun:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
### Handy geoUrn cheatsheet
| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| UnitedStates | **103644922** |
| UnitedKingdom | **102221843** |
| Australia | **101452733** |
_See more: <https://www.linkedin.com/search/results/companies/?geoUrn=XXX> the number after `geoUrn=` is what you need._
---
## 3  Insights embeddings, orgcharts, decision makers
```bash
python c4ai_insights.py \
--in ./data \
--out ./data \
--embed-model all-MiniLM-L6-v2 \
--llm-provider gemini/gemini-2.0-flash \
--llm-api-key "" \
--top-k 10 \
--max-llm-tokens 8024 \
--llm-temperature 1.0 \
--workers 4
```
Emits next to the Stage1 files:
* `company_graph.json` intercompany similarity graph
* `org_chart_<handle>.json` one per company
* `decision_makers.csv` handpicked who to pitch list
Flags reference (straight from `build_arg_parser()`):
| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | SentenceTransformer model |
| `--top_k` | `10` | Neighbours per company in graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Creativity knob |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
---
## 4  Visualise interactive graph
After Stage 2 completes, simply open the HTML viewer from the project root:
```bash
open graph_view_template.html # or Live Server / Python -http
```
The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.
* Left pane → list of companies (clans).
* Click a node to load its orgchart on the right.
* Chat drawer lets you ask followup questions; context is pulled from `people.jsonl`.
---
## 5  Common snags
| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profile, add delay |
| Blank graph | Check JSON paths, clear `localStorage` in browser |
---
### TL;DR
`crwl profiles``c4ai_discover.py``c4ai_insights.py` → open `graph_view_template.html`.
Live long and `import crawl4ai`.