Merge branch 'next' into 2025-APR-1
126
docs/apps/linkdin/README.md
Normal file
@@ -0,0 +1,126 @@
# Crawl4AI Prospect‑Wizard – step‑by‑step guide

A three‑stage demo that goes from **LinkedIn scraping** ➜ **LLM reasoning** ➜ **graph visualisation**.

```
prospect‑wizard/
├─ c4ai_discover.py          # Stage 1 – scrape companies + people
├─ c4ai_insights.py          # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html  # Stage 3 – graph viewer (static HTML)
└─ data/                     # output lands here (*.jsonl / *.json)
```
---

## 1 Install & boot a LinkedIn profile (one‑time)

### 1.1 Install dependencies

```bash
pip install crawl4ai openai sentence-transformers networkx pandas vis-network rich
```

### 1.2 Create / warm a LinkedIn browser profile

```bash
crwl profiler
```

1. The interactive shell shows **New profile** – hit **enter**.
2. Choose a name, e.g. `profile_linkedin_uc`.
3. A Chromium window opens – log in to LinkedIn, solve any CAPTCHA, then close the window.

> Remember the **profile name**. All future runs take `--profile-name <your_name>`.
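If you ever need to find that profile on disk, the discover script resolves it from a fixed directory layout. A minimal sketch, assuming Crawl4AI's default profile location (`~/.crawl4ai/profiles/<name>`, the same default the discover script uses):

```python
from pathlib import Path

def profile_dir(name: str) -> Path:
    """Default location of a named Crawl4AI browser profile."""
    return Path.home() / ".crawl4ai" / "profiles" / name

# e.g. check where the warmed-up profile lives:
print(profile_dir("profile_linkedin_uc"))
```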
---

## 2 Discovery – scrape companies & people

```bash
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \
  --title-filters "" \
  --max-companies 10 \
  --max-people 20 \
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug
```

`102713980` is the Malaysia geoUrn (cheatsheet below). `--title-filters` takes a comma list such as `"Product,Engineering"`; the small `--max-companies` / `--max-people` values keep workshop runs short.

**Outputs** in `./data/`:

* `companies.jsonl` – one JSON object per company
* `people.jsonl` – one JSON object per employee

🛠️ **Dry‑run:** `C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee` uses bundled HTML snippets, no network.
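Both files are newline-delimited JSON, so a quick post-run sanity check needs only the standard library. A minimal sketch (the `data/` paths assume the command above):

```python
import json
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line, the Stage-1 output format."""
    return [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

# e.g. after a run:
#   companies = load_jsonl(Path("data/companies.jsonl"))
#   print(len(companies), companies[0]["name"])
```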
### Handy geoUrn cheatsheet

| Location | geoUrn |
|----------|--------|
| Singapore | **103644278** |
| Malaysia | **102713980** |
| United States | **103644922** |
| United Kingdom | **102221843** |
| Australia | **101452733** |

_To find others, run a location-filtered company search on LinkedIn and copy the number after `geoUrn=` from the URL, e.g. <https://www.linkedin.com/search/results/companies/?geoUrn=XXX>._
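The discovery script simply interpolates the query and geoUrn into a company-search URL. A sketch of the same construction, mirroring the seed URL the script builds:

```python
from urllib.parse import quote

def company_search_url(query: str, geo_urn: int) -> str:
    """LinkedIn company-search URL for a keyword + geoUrn (the shape the script crawls)."""
    return (
        "https://www.linkedin.com/search/results/companies/"
        f"?keywords={quote(query)}&geoUrn={geo_urn}"
    )

print(company_search_url("health insurance management", 102713980))
```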
---

## 3 Insights – embeddings, org‑charts, decision makers

```bash
python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed_model all-MiniLM-L6-v2 \
  --top_k 10 \
  --openai_model gpt-4.1 \
  --max_llm_tokens 8024 \
  --llm_temperature 1.0 \
  --workers 4
```

Emits next to the Stage‑1 files:

* `company_graph.json` – inter‑company similarity graph
* `org_chart_<handle>.json` – one per company
* `decision_makers.csv` – a filtered ‘who to pitch’ list (people whose decision score clears a threshold)

Flags reference (straight from `build_arg_parser()`):

| Flag | Default | Purpose |
|------|---------|---------|
| `--in` | `.` | Stage‑1 output dir |
| `--out` | `.` | Destination dir |
| `--embed_model` | `all-MiniLM-L6-v2` | Sentence‑Transformer model |
| `--top_k` | `10` | Neighbours per company in the graph |
| `--openai_model` | `gpt-4.1` | LLM for scoring decision makers |
| `--max_llm_tokens` | `8024` | Token budget per LLM call |
| `--llm_temperature` | `1.0` | Creativity knob |
| `--stub` | off | Skip OpenAI and fabricate tiny charts |
| `--workers` | `4` | Parallel LLM workers |
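Once `decision_makers.csv` exists you can rank contacts without re-running Stage 2. A minimal sketch using the columns Stage 2 writes (`company`, `person`, `title`, `decision_score`, `profile_url`):

```python
import csv
from pathlib import Path

def top_decision_makers(csv_path: Path, n: int = 5) -> list[dict]:
    """Return the n highest-scored contacts from decision_makers.csv."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # decision_score is serialized as text, so convert before sorting
    return sorted(rows, key=lambda r: float(r["decision_score"]), reverse=True)[:n]
```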
---

## 4 Visualise – interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:

```bash
open graph_view_template.html   # macOS — or use VS Code Live Server / `python -m http.server`
```

The page fetches `data/company_graph.json` and the `org_chart_*.json` files automatically; keep the `data/` folder beside the HTML file.

* Left pane → list of companies (clusters).
* Click a node to load its org‑chart on the right.
* Chat drawer lets you ask follow‑up questions; context is pulled from `people.jsonl`.

---

## 5 Common snags

| Symptom | Fix |
|---------|-----|
| Infinite CAPTCHA | Use a residential proxy: `--proxy http://user:pass@ip:port` |
| 429 Too Many Requests | Lower `--concurrency`, rotate profiles, add a delay |
| Blank graph | Check the JSON paths; clear `localStorage` in the browser |
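For the 429 case, the discover script already sleeps a random interval between people pages; the same idea as a small reusable helper (a sketch, tune the bounds to taste):

```python
import asyncio
import random

async def polite_sleep(lo: float = 0.5, hi: float = 1.5) -> float:
    """Sleep a jittered interval between requests to soften rate limits; return the delay used."""
    delay = random.uniform(lo, hi)
    await asyncio.sleep(delay)
    return delay
```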
---

### TL;DR

`crwl profiler` → `c4ai_discover.py` → `c4ai_insights.py` → open `graph_view_template.html`.

Live long and `import crawl4ai`.
440
docs/apps/linkdin/c4ai_discover.py
Normal file
@@ -0,0 +1,440 @@
#!/usr/bin/env python3
"""
c4ai-discover — Stage‑1 Discovery CLI

Scrapes LinkedIn company search + their people pages and dumps two newline‑delimited
JSON files: companies.jsonl and people.jsonl.

Key design rules
----------------
* No BeautifulSoup — Crawl4AI only for network + HTML fetch.
* JsonCssExtractionStrategy for structured scraping; the schema is auto‑generated once
  from sample HTML provided by the user and then cached under ./schemas/.
* Defaults are embedded so the file runs inside the VS Code debugger without CLI args.
* If executed as a console script (argv > 1), CLI flags win.
* Lightweight deps: argparse + the Crawl4AI stack.

Author: Tom @ Kidocode 2025‑04‑26
"""
from __future__ import annotations

import warnings, re

warnings.filterwarnings(
    "ignore",
    message=r"The pseudo class ':contains' is deprecated, ':-soup-contains' should be used.*",
    category=FutureWarning,
    module=r"soupsieve",
)


# ───────────────────────────────────────────────────────────────────────────────
# Imports
# ───────────────────────────────────────────────────────────────────────────────
import argparse
import random
import asyncio
import json
import logging
import os
import pathlib
import sys

# 3rd-party rich for pretty logging
from rich.console import Console
from rich.logging import RichHandler

from datetime import datetime, UTC
from itertools import cycle
from textwrap import dedent
from types import SimpleNamespace
from typing import Dict, List, Optional
from urllib.parse import quote
from pathlib import Path
from glob import glob

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    JsonCssExtractionStrategy,
    BrowserProfiler,
    LLMConfig,
)

# ───────────────────────────────────────────────────────────────────────────────
# Constants / paths
# ───────────────────────────────────────────────────────────────────────────────
BASE_DIR = pathlib.Path(__file__).resolve().parent
SCHEMA_DIR = BASE_DIR / "schemas"
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)
COMPANY_SCHEMA_PATH = SCHEMA_DIR / "company_card.json"
PEOPLE_SCHEMA_PATH = SCHEMA_DIR / "people_card.json"

# ---------- deterministic target JSON examples ----------
_COMPANY_SCHEMA_EXAMPLE = {
    "handle": "/company/posify/",
    "profile_image": "https://media.licdn.com/dms/image/v2/.../logo.jpg",
    "name": "Management Research Services, Inc. (MRS, Inc)",
    "descriptor": "Insurance • Milwaukee, Wisconsin",
    "about": "Insurance • Milwaukee, Wisconsin",
    "followers": 1000,
}

_PEOPLE_SCHEMA_EXAMPLE = {
    "profile_url": "https://www.linkedin.com/in/lily-ng/",
    "name": "Lily Ng",
    "headline": "VP Product @ Posify",
    "followers": 890,
    "connection_degree": "2nd",
    "avatar_url": "https://media.licdn.com/dms/image/v2/.../lily.jpg",
}

# Provided sample HTML snippets (trimmed) — used exactly once to cold‑generate schema.
_SAMPLE_COMPANY_HTML = (Path(__file__).resolve().parent / "snippets/company.html").read_text()
_SAMPLE_PEOPLE_HTML = (Path(__file__).resolve().parent / "snippets/people.html").read_text()

# --------- tighter schema prompts ----------
_COMPANY_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> company-card HTML, build a JsonCssExtractionStrategy schema that,
    for every card, outputs *exactly* the keys shown in the example JSON below.
    JSON spec:
      • handle – href of the outermost <a> that wraps the logo/title, e.g. "/company/posify/"
      • profile_image – absolute URL of the <img> inside that link
      • name – text of the <a> inside the <span class*='t-16'>
      • descriptor – text line with industry • location
      • about – text of the <div class*='t-normal'> below the name (industry + geo)
      • followers – integer parsed from the <div> containing 'followers'

    IMPORTANT: Do not use the base64-looking classes to target elements; they are not reliable.
    The main parent <div> that contains these <li> elements is "div.search-results-container".
    The parent <ul> has "role" equal to "list". Using these two should be enough to target the <li> elements.
    """
)

_PEOPLE_SCHEMA_QUERY = dedent(
    """
    Using the supplied <li> people-card HTML, build a JsonCssExtractionStrategy schema that
    outputs exactly the keys in the example JSON below.
    Fields:
      • profile_url – href of the outermost profile link
      • name – text inside artdeco-entity-lockup__title
      • headline – inner text of artdeco-entity-lockup__subtitle
      • followers – integer parsed from the span inside lt-line-clamp--multi-line
      • connection_degree – '1st', '2nd', etc. from artdeco-entity-lockup__badge
      • avatar_url – src of the <img> within artdeco-entity-lockup__image

    IMPORTANT: Do not use the base64-looking classes to target elements; they are not reliable.
    The main parent <div> that contains these <li> elements has the classes
    "artdeco-card org-people-profile-card__card-spacing org-people__card-margin-bottom".
    """
)

# ---------------------------------------------------------------------------
# Utility helpers
# ---------------------------------------------------------------------------

def _load_or_build_schema(
    path: pathlib.Path,
    sample_html: str,
    query: str,
    example_json: Dict,
    force: bool = False,
) -> Dict:
    """Load schema from path, else call generate_schema once and persist."""
    if path.exists() and not force:
        return json.loads(path.read_text())

    logging.info("[SCHEMA] Generating schema %s", path.name)
    schema = JsonCssExtractionStrategy.generate_schema(
        html=sample_html,
        llm_config=LLMConfig(
            provider=os.getenv("C4AI_SCHEMA_PROVIDER", "openai/gpt-4o"),
            api_token=os.getenv("OPENAI_API_KEY", "env:OPENAI_API_KEY"),
        ),
        query=query,
        target_json_example=json.dumps(example_json, indent=2),
    )
    path.write_text(json.dumps(schema, indent=2))
    return schema


def _openai_friendly_number(text: str) -> Optional[int]:
    """Extract the first int from text like '1K followers' (returns 1000)."""
    m = re.search(r"(\d[\d,]*)", text.replace(",", ""))
    if not m:
        return None
    val = int(m.group(1))
    if "k" in text.lower():
        val *= 1000
    if "m" in text.lower():
        val *= 1_000_000
    return val

# ---------------------------------------------------------------------------
# Core async workers
# ---------------------------------------------------------------------------
async def crawl_company_search(crawler: AsyncWebCrawler, url: str, schema: Dict, limit: int) -> List[Dict]:
    """Paginate 10-item company search pages until `limit` is reached."""
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        wait_for=".search-marvel-srp",
        session_id="company_search",
        delay_before_return_html=1,
        magic=True,
        verbose=False,
    )
    companies, page = [], 1
    while len(companies) < max(limit, 10):
        paged_url = f"{url}&page={page}"
        res = await crawler.arun(paged_url, config=cfg)
        batch = json.loads(res[0].extracted_content)
        if not batch:
            break
        for item in batch:
            name = item.get("name", "").strip()
            handle = item.get("handle", "").strip()
            if not handle or not name:
                continue
            descriptor = item.get("descriptor")
            about = item.get("about")
            followers = _openai_friendly_number(str(item.get("followers", "")))
            companies.append(
                {
                    "handle": handle,
                    "name": name,
                    "descriptor": descriptor,
                    "about": about,
                    "followers": followers,
                    "people_url": f"{handle}people/",
                    "captured_at": datetime.now(UTC).isoformat(timespec="seconds"),
                }
            )
        page += 1
        logging.info(
            f"[dim]Page {page}[/] — running total: {len(companies)}/{limit} companies"
        )

    return companies[: max(limit, 10)]


async def crawl_people_page(
    crawler: AsyncWebCrawler,
    people_url: str,
    schema: Dict,
    limit: int,
    title_kw: str,
) -> List[Dict]:
    people_u = f"{people_url}?keywords={quote(title_kw)}"
    extraction = JsonCssExtractionStrategy(schema)
    cfg = CrawlerRunConfig(
        extraction_strategy=extraction,
        # scan_full_page=True,
        cache_mode=CacheMode.BYPASS,
        magic=True,
        wait_for=".org-people-profile-card__card-spacing",
        delay_before_return_html=1,
        session_id="people_search",
    )
    res = await crawler.arun(people_u, config=cfg)
    if not res[0].success:
        return []
    raw = json.loads(res[0].extracted_content)
    people = []
    for p in raw[:limit]:
        followers = _openai_friendly_number(str(p.get("followers", "")))
        people.append(
            {
                "profile_url": p.get("profile_url"),
                "name": p.get("name"),
                "headline": p.get("headline"),
                "followers": followers,
                "connection_degree": p.get("connection_degree"),
                "avatar_url": p.get("avatar_url"),
            }
        )
    return people

# ---------------------------------------------------------------------------
# CLI + main
# ---------------------------------------------------------------------------

def build_arg_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser("c4ai-discover — Crawl4AI LinkedIn discovery")
    sub = ap.add_subparsers(dest="cmd", required=False, help="run scope")

    def add_flags(parser: argparse.ArgumentParser):
        parser.add_argument("--query", required=False, help="query keyword(s)")
        parser.add_argument("--geo", required=False, type=int, help="LinkedIn geoUrn")
        parser.add_argument("--title-filters", default="Product,Engineering", help="comma list of job keywords")
        parser.add_argument("--max-companies", type=int, default=1000)
        parser.add_argument("--max-people", type=int, default=500)
        parser.add_argument("--profile-name", default="profile_linkedin_uc",
                            help="name of the warmed-up browser profile (see `crwl profiler`)")
        parser.add_argument("--outdir", default="./output")
        parser.add_argument("--concurrency", type=int, default=4)
        parser.add_argument("--log-level", default="info", choices=["debug", "info", "warn", "error"])

    add_flags(sub.add_parser("full"))
    add_flags(sub.add_parser("companies"))
    add_flags(sub.add_parser("people"))

    # global flags
    ap.add_argument(
        "--debug",
        action="store_true",
        help="Use built-in demo defaults (same as C4AI_DEMO_DEBUG=1)",
    )
    return ap


def detect_debug_defaults(force=False) -> SimpleNamespace:
    if not force and sys.gettrace() is None and not os.getenv("C4AI_DEMO_DEBUG"):
        return SimpleNamespace()
    # ----- debug‑friendly defaults -----
    return SimpleNamespace(
        cmd="full",
        query="health insurance management",
        geo=102713980,
        # title_filters="Product,Engineering",
        title_filters="",
        max_companies=10,
        max_people=5,
        profile_name="profile_linkedin_uc",
        outdir="./debug_out",
        concurrency=2,
        log_level="debug",
    )


async def async_main(opts):
    # ─────────── logging setup ───────────
    console = Console()
    logging.basicConfig(
        level=opts.log_level.upper(),
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )

    # -------------------------------------------------------------------
    # Load or build schemas (one‑time LLM call each)
    # -------------------------------------------------------------------
    company_schema = _load_or_build_schema(
        COMPANY_SCHEMA_PATH,
        _SAMPLE_COMPANY_HTML,
        _COMPANY_SCHEMA_QUERY,
        _COMPANY_SCHEMA_EXAMPLE,
        # force=True  # uncomment to regenerate the schema
    )
    people_schema = _load_or_build_schema(
        PEOPLE_SCHEMA_PATH,
        _SAMPLE_PEOPLE_HTML,
        _PEOPLE_SCHEMA_QUERY,
        _PEOPLE_SCHEMA_EXAMPLE,
        # force=True  # uncomment to regenerate the schema
    )

    outdir = BASE_DIR / pathlib.Path(opts.outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    f_companies = (outdir / "companies.jsonl").open("a", encoding="utf-8")
    f_people = (outdir / "people.jsonl").open("a", encoding="utf-8")

    # -------------------------------------------------------------------
    # Prepare crawler with cookie pool rotation
    # -------------------------------------------------------------------
    profiler = BrowserProfiler()
    path = profiler.get_profile_path(opts.profile_name)
    bc = BrowserConfig(
        headless=False,
        verbose=False,
        user_data_dir=path,
        use_managed_browser=True,
        user_agent_mode="random",
        user_agent_generator_config={
            "platforms": "mobile",
            "os": "Android",
        },
    )
    crawler = AsyncWebCrawler(config=bc)

    await crawler.start()

    # Single worker for simplicity; concurrency can be scaled via arun_many if needed.
    try:
        # Build LinkedIn search URL
        search_url = f"https://www.linkedin.com/search/results/companies/?keywords={quote(opts.query)}&geoUrn={opts.geo}"
        logging.info("Seed URL => %s", search_url)

        companies: List[Dict] = []
        if opts.cmd in ("companies", "full"):
            companies = await crawl_company_search(
                crawler, search_url, company_schema, opts.max_companies
            )
            for c in companies:
                f_companies.write(json.dumps(c, ensure_ascii=False) + "\n")
            logging.info(f"[bold green]✓[/] Companies scraped so far: {len(companies)}")

        if opts.cmd in ("people", "full"):
            if not companies:
                # load from a previous run
                src = outdir / "companies.jsonl"
                if not src.exists():
                    logging.error("companies.jsonl missing — run companies/full first")
                    return 10
                companies = [json.loads(l) for l in src.read_text().splitlines()]
            total_people = 0
            title_kw = " ".join(t.strip() for t in opts.title_filters.split(",") if t.strip()) if opts.title_filters else ""
            for comp in companies:
                people = await crawl_people_page(
                    crawler,
                    comp["people_url"],
                    people_schema,
                    opts.max_people,
                    title_kw,
                )
                for p in people:
                    rec = p | {
                        "company_handle": comp["handle"],
                        "captured_at": datetime.now(UTC).isoformat(timespec="seconds"),
                    }
                    f_people.write(json.dumps(rec, ensure_ascii=False) + "\n")
                total_people += len(people)
                logging.info(
                    f"{comp['name']} — [cyan]{len(people)}[/] people extracted"
                )
                await asyncio.sleep(random.uniform(0.5, 1))
            logging.info("Total people scraped: %d", total_people)
    finally:
        await crawler.close()
        f_companies.close()
        f_people.close()

    return 0


def main():
    parser = build_arg_parser()
    cli_opts = parser.parse_args()

    # decide on debug defaults
    if cli_opts.debug:
        opts = detect_debug_defaults(force=True)
    else:
        env_defaults = detect_debug_defaults()
        opts = env_defaults if vars(env_defaults) else cli_opts

    if not getattr(opts, "cmd", None):
        opts.cmd = "full"

    exit_code = asyncio.run(async_main(opts))
    sys.exit(exit_code)


if __name__ == "__main__":
    main()
372
docs/apps/linkdin/c4ai_insights.py
Normal file
@@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Stage-2 Insights builder
------------------------
Reads companies.jsonl & people.jsonl (Stage-1 output) and produces:
  • company_graph.json
  • org_chart_<handle>.json   (one per company)
  • decision_makers.csv
  • graph_view.html           (interactive visualisation)

Run:
    python c4ai_insights.py --in ./stage1_out --out ./stage2_out

Author : Tom @ Kidocode, 2025-04-28
"""

from __future__ import annotations

# ───────────────────────────────────────────────────────────────────────────────
# Imports & Third-party
# ───────────────────────────────────────────────────────────────────────────────

import argparse, asyncio, json, os, sys, pathlib, random, time, csv
from datetime import datetime, UTC
from types import SimpleNamespace
from pathlib import Path
from typing import List, Dict, Any

# Pretty CLI UX
from rich.console import Console
from rich.logging import RichHandler
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn
import logging

from jinja2 import Environment, FileSystemLoader, select_autoescape

BASE_DIR = pathlib.Path(__file__).resolve().parent

# ───────────────────────────────────────────────────────────────────────────────
# 3rd-party deps
# ───────────────────────────────────────────────────────────────────────────────
import numpy as np
# from sentence_transformers import SentenceTransformer    (imported lazily below)
# from sklearn.metrics.pairwise import cosine_similarity   (imported lazily below)
import pandas as pd
import hashlib

from openai import OpenAI  # same SDK you pre-loaded

# ───────────────────────────────────────────────────────────────────────────────
# Utils
# ───────────────────────────────────────────────────────────────────────────────
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(l) for l in f]

def dump_json(obj, path: Path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)

# ───────────────────────────────────────────────────────────────────────────────
# Debug defaults (mirrors Stage-1 trick)
# ───────────────────────────────────────────────────────────────────────────────
def dev_defaults() -> SimpleNamespace:
    return SimpleNamespace(
        in_dir="./debug_out",
        out_dir="./insights_debug",
        embed_model="all-MiniLM-L6-v2",
        top_k=10,
        openai_model="gpt-4.1",
        max_llm_tokens=8000,
        llm_temperature=1.0,
        workers=4,    # parallel processing
        stub=False,   # manual
    )
# ───────────────────────────────────────────────────────────────────────────────
# Graph builders
# ───────────────────────────────────────────────────────────────────────────────
def embed_descriptions(companies, model_name: str, opts) -> np.ndarray:
    from sentence_transformers import SentenceTransformer

    logging.debug(f"Using embedding model: {model_name}")
    cache_path = BASE_DIR / Path(opts.out_dir) / "embeds_cache.json"
    cache = {}
    if cache_path.exists():
        with open(cache_path) as f:
            cache = json.load(f)
        # flush cache if model differs
        if cache.get("_model") != model_name:
            cache = {}

    model = SentenceTransformer(model_name)
    new_texts, new_indices = [], []
    vectors = np.zeros((len(companies), 384), dtype=np.float32)  # 384 = all-MiniLM-L6-v2 dim

    for idx, comp in enumerate(companies):
        text = comp.get("about") or comp.get("descriptor", "")
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        cached = cache.get(comp["handle"])
        if cached and cached["hash"] == h:
            vectors[idx] = np.array(cached["vector"], dtype=np.float32)
        else:
            new_texts.append(text)
            new_indices.append((idx, comp["handle"], h))

    if new_texts:
        embeds = model.encode(new_texts, show_progress_bar=False, convert_to_numpy=True)
        for vec, (idx, handle, h) in zip(embeds, new_indices):
            vectors[idx] = vec
            cache[handle] = {"hash": h, "vector": vec.tolist()}
    cache["_model"] = model_name
    with open(cache_path, "w") as f:
        json.dump(cache, f)

    return vectors

def build_company_graph(companies, embeds: np.ndarray, top_k: int) -> Dict[str, Any]:
    from sklearn.metrics.pairwise import cosine_similarity
    sims = cosine_similarity(embeds)
    nodes, edges = [], []
    for i, c in enumerate(companies):
        node = dict(
            id=c["handle"].strip("/"),
            name=c["name"],
            handle=c["handle"],
            about=c.get("about", ""),
            people_url=c.get("people_url", ""),
            industry=c.get("descriptor", "").split("•")[0].strip(),
            geoUrn=c.get("geoUrn"),
            followers=c.get("followers", 0),
            # desc_embed=embeds[i].tolist(),  # omitted to keep the JSON small
            desc_embed=[],
        )
        nodes.append(node)
        # pick the top-k most similar companies, excluding itself
        top_idx = np.argsort(sims[i])[::-1][1 : top_k + 1]
        for j in top_idx:
            tgt = companies[j]
            weight = float(sims[i, j])
            if node["industry"] == tgt.get("descriptor", "").split("•")[0].strip():
                weight += 0.10
            if node["geoUrn"] == tgt.get("geoUrn"):
                weight += 0.05
            tgt["followers"] = tgt.get("followers", None) or 1
            node["followers"] = node.get("followers", None) or 1
            follower_ratio = min(node["followers"], tgt["followers"]) / max(node["followers"], tgt["followers"])
            weight += 0.05 * follower_ratio
            edges.append(dict(
                source=node["id"],
                target=tgt["handle"].strip("/"),
                weight=round(weight, 4),
                drivers=dict(
                    embed_sim=round(float(sims[i, j]), 4),
                    industry_match=0.10 if node["industry"] == tgt.get("descriptor", "").split("•")[0].strip() else 0,
                    geo_overlap=0.05 if node["geoUrn"] == tgt.get("geoUrn") else 0,
                ),
            ))
    return {"nodes": nodes, "edges": edges, "meta": {"generated_at": datetime.now(UTC).isoformat()}}

# ───────────────────────────────────────────────────────────────────────────────
# Org-chart via LLM
# ───────────────────────────────────────────────────────────────────────────────
async def infer_org_chart_llm(company, people, client: OpenAI, model_name: str, max_tokens: int, temperature: float, stub: bool):
    if stub:
        # Tiny fake org-chart when debugging offline
        chief = random.choice(people)
        nodes = [{
            "id": chief["profile_url"],
            "name": chief["name"],
            "title": chief["headline"],
            "dept": (chief["headline"].split() or [""])[0],
            "yoe_total": 8,
            "yoe_current": 2,
            "seniority_score": 0.8,
            "decision_score": 0.9,
            "avatar_url": chief.get("avatar_url"),
        }]
        return {"nodes": nodes, "edges": [], "meta": {"debug_stub": True, "generated_at": datetime.now(UTC).isoformat()}}

    prompt = [
        {"role": "system", "content": "You are an expert B2B org-chart reasoner."},
        {"role": "user", "content": f"""Here is the company description:

<company>
{json.dumps(company, ensure_ascii=False)}
</company>

Here is a JSON list of employees:
<employees>
{json.dumps(people, ensure_ascii=False)}
</employees>

1) Build a reporting tree (manager -> direct reports)
2) For each person output a decision_score 0-1 for buying new software

Return JSON: {{ "nodes":[{{id,name,title,dept,yoe_total,yoe_current,seniority_score,decision_score,avatar_url,profile_url}}], "edges":[{{source,target,type,confidence}}] }}
"""}
    ]
    resp = client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        response_format={"type": "json_object"},
    )
    chart = json.loads(resp.choices[0].message.content)
    chart["meta"] = dict(model=model_name, generated_at=datetime.now(UTC).isoformat())
    return chart

# ───────────────────────────────────────────────────────────────────────────────
# CSV flatten
# ───────────────────────────────────────────────────────────────────────────────
def export_decision_makers(charts_dir: Path, csv_path: Path, threshold: float = 0.5):
    rows = []
    for p in charts_dir.glob("org_chart_*.json"):
        data = json.loads(p.read_text())
        comp = p.stem.split("org_chart_")[1]
        for n in data.get("nodes", []):
            if n.get("decision_score", 0) >= threshold:
                rows.append(dict(
                    company=comp,
                    person=n["name"],
                    title=n["title"],
                    decision_score=n["decision_score"],
                    profile_url=n["id"],
                ))
    pd.DataFrame(rows).to_csv(csv_path, index=False)

# ───────────────────────────────────────────────────────────────────────────────
# HTML rendering
# ───────────────────────────────────────────────────────────────────────────────
def render_html(out: Path, template_dir: Path):
    # Copy graph_view_template.html (as graph_view.html) and ai.js into the out folder
    import shutil
    shutil.copy(template_dir / "graph_view_template.html", out / "graph_view.html")
    shutil.copy(template_dir / "ai.js", out)

# ───────────────────────────────────────────────────────────────────────────────
# Main async pipeline
# ───────────────────────────────────────────────────────────────────────────────
async def run(opts):
    # ── silence SDK noise ──────────────────────────────────────────────────────
    for noisy in ("openai", "httpx", "httpcore"):
        lg = logging.getLogger(noisy)
        lg.setLevel(logging.WARNING)  # or ERROR if you want total silence
        lg.propagate = False          # optional: stop them reaching root

    # ────────────── logging bootstrap ──────────────
    console = Console()
    logging.basicConfig(
        level="INFO",
        format="%(message)s",
        handlers=[RichHandler(console=console, markup=True, rich_tracebacks=True)],
    )

    in_dir = BASE_DIR / Path(opts.in_dir)
    out_dir = BASE_DIR / Path(opts.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    companies = load_jsonl(in_dir / "companies.jsonl")
    people = load_jsonl(in_dir / "people.jsonl")

    logging.info(f"[bold cyan]Loaded[/] {len(companies)} companies, {len(people)} people")

    logging.info("[bold]⇢[/] Embedding company descriptions…")
    # embeds = embed_descriptions(companies, opts.embed_model, opts)

    logging.info("[bold]⇢[/] Building similarity graph")
    # company_graph = build_company_graph(companies, embeds, opts.top_k)
    # dump_json(company_graph, out_dir/"company_graph.json")

    # OpenAI client (only built when not stubbing)
    stub = bool(opts.stub)
    client = OpenAI() if not stub else None

    # Filter companies that still need processing
    to_process = []
    for comp in companies:
        handle = comp["handle"].strip("/").replace("/", "_")
        out_file = out_dir / f"org_chart_{handle}.json"
        if out_file.exists():
            logging.info(f"[green]✓[/] Skipping existing {comp['name']}")
            continue
        to_process.append(comp)

    if not to_process:
        logging.info("[yellow]All companies already processed[/]")
    else:
        workers = getattr(opts, "workers", 1)
        parallel = workers > 1

        logging.info(f"[bold]⇢[/] Inferring org-charts via LLM {f'(parallel={workers} workers)' if parallel else ''}")

        with Progress(
            SpinnerColumn(),
            BarColumn(),
            TextColumn("[progress.description]{task.description}"),
            TimeElapsedColumn(),
            console=console,
        ) as progress:
            task = progress.add_task("Org charts", total=len(to_process))

            async def process_one(comp):
                handle = comp["handle"].strip("/").replace("/", "_")
                persons = [p for p in people if p["company_handle"].strip("/") == comp["handle"].strip("/")]

                chart = await infer_org_chart_llm(
                    comp, persons,
                    client=client if client else OpenAI(api_key="sk-debug"),
                    model_name=opts.openai_model,
                    max_tokens=opts.max_llm_tokens,
                    temperature=opts.llm_temperature,
                    stub=stub,
                )
                chart["meta"]["company"] = comp["name"]

                # Save the result immediately
                dump_json(chart, out_dir / f"org_chart_{handle}.json")

                progress.update(task, advance=1, description=f"{comp['name']} ({len(persons)} ppl)")

            # Create tasks for all companies
            tasks = [process_one(comp) for comp in to_process]

            # Cap concurrency at the worker count
            semaphore = asyncio.Semaphore(workers)

            async def bounded_process(coro):
                async with semaphore:
                    return await coro

            # Run with concurrency control
            await asyncio.gather(*(bounded_process(t) for t in tasks))

    logging.info("[bold]⇢[/] Flattening decision-makers CSV")
    export_decision_makers(out_dir, out_dir / "decision_makers.csv")

    render_html(out_dir, template_dir=BASE_DIR / "templates")
    console.print(f"[bold green]✓[/] Stage-2 artefacts written to {out_dir}")

# ───────────────────────────────────────────────────────────────────────────────
# CLI
# ───────────────────────────────────────────────────────────────────────────────
def build_arg_parser():
    p = argparse.ArgumentParser(description="Build graphs & visualisation from Stage-1 output")
    p.add_argument("--in", dest="in_dir", required=False, help="Stage-1 output dir", default=".")
    p.add_argument("--out", dest="out_dir", required=False, help="Destination dir", default=".")
    p.add_argument("--embed_model", default="all-MiniLM-L6-v2")
    p.add_argument("--top_k", type=int, default=10, help="Top-k neighbours per company")
    p.add_argument("--openai_model", default="gpt-4.1")
    p.add_argument("--max_llm_tokens", type=int, default=8024)
    p.add_argument("--llm_temperature", type=float, default=1.0)
    p.add_argument("--stub", action="store_true", help="Skip OpenAI call and generate tiny fake org charts")
    p.add_argument("--workers", type=int, default=4, help="Number of parallel workers for LLM inference")
    return p


def main():
    opts = build_arg_parser().parse_args()
    asyncio.run(run(opts))


if __name__ == "__main__":
    main()
39
docs/apps/linkdin/schemas/company_card.json
Normal file
@@ -0,0 +1,39 @@
{
  "name": "LinkedIn Company Card",
  "baseSelector": "div.search-results-container ul[role='list'] > li",
  "fields": [
    { "name": "handle", "selector": "a[href*='/company/']", "type": "attribute", "attribute": "href" },
    { "name": "profile_image", "selector": "a[href*='/company/'] img", "type": "attribute", "attribute": "src" },
    { "name": "name", "selector": "span[class*='t-16'] a", "type": "text" },
    { "name": "descriptor", "selector": "div[class*='t-black t-normal']", "type": "text" },
    { "name": "about", "selector": "p[class*='entity-result__summary--2-lines']", "type": "text" },
    { "name": "followers", "selector": "div:contains('followers')", "type": "regex", "pattern": "(\\d+)\\s*followers" }
  ]
}
38
docs/apps/linkdin/schemas/people_card.json
Normal file
@@ -0,0 +1,38 @@
{
  "name": "LinkedIn People Card",
  "baseSelector": "li.org-people-profile-card__profile-card-spacing",
  "fields": [
    { "name": "profile_url", "selector": "a.eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo", "type": "attribute", "attribute": "href" },
    { "name": "name", "selector": ".artdeco-entity-lockup__title .lt-line-clamp--single-line", "type": "text" },
    { "name": "headline", "selector": ".artdeco-entity-lockup__subtitle .lt-line-clamp--multi-line", "type": "text" },
    { "name": "followers", "selector": ".lt-line-clamp--multi-line.t-12", "type": "text" },
    { "name": "connection_degree", "selector": ".artdeco-entity-lockup__badge .artdeco-entity-lockup__degree", "type": "text" },
    { "name": "avatar_url", "selector": ".artdeco-entity-lockup__image img", "type": "attribute", "attribute": "src" }
  ]
}
143
docs/apps/linkdin/snippets/company.html
Normal file
@@ -0,0 +1,143 @@
<li class="yCLWzruNprmIzaZzFFonVFBtMrbaVYnuDFA">
  <div class="IxlEPbRZwQYrRltKPvHAyjBmCdIWTAoYo" data-chameleon-result-urn="urn:li:company:362492"
       data-view-name="search-entity-result-universal-template">
    <div class="linked-area flex-1 cursor-pointer">
      <div class="BAEgVqVuxosMJZodcelsgPoyRcrkiqgVCGHXNQ">
        <div class="afcvrbGzNuyRlhPPQWrWirJtUdHAAtUlqxwvVA">
          <div class="display-flex align-items-center">
            <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo scale-down" aria-hidden="true"
               tabindex="-1" href="https://www.linkedin.com/company/managment-research-services-inc./"
               data-test-app-aware-link="">
              <div class="ivm-image-view-model">
                <div class="ivm-view-attr__img-wrapper">
                  <img width="48"
                       src="https://media.licdn.com/dms/image/v2/C560BAQFWpusEOgW-ww/company-logo_100_100/company-logo_100_100/0/1630583697877/managment_research_services_inc_logo?e=1750896000&v=beta&t=Ch9vyEZdfng-1D1m_XqP5kjNpVXUBKkk9cNhMZUhx0E"
                       loading="lazy" height="48" alt="Management Research Services, Inc. (MRS, Inc)"
                       id="ember28"
                       class="ivm-view-attr__img--centered EntityPhoto-square-3 evi-image lazy-image ember-view">
                </div>
              </div>
            </a>
          </div>
        </div>
        <div class="wympnVuDByXHvafWrMGJLZuchDmCRqLmWPwg MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA pt3 pb3 t-12 t-black--light">
          <div class="mb1">
            <div class="t-roman t-sans">
              <div class="display-flex">
                <span class="TikBXjihYvcNUoIzkslUaEjfIuLmYxfs OoHEyXgsiIqGADjcOtTmfdpoYVXrLKTvkwI">
                  <span class="CgaWLOzmXNuKbRIRARSErqCJcBPYudEKo t-16">
                    <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo"
                       href="https://www.linkedin.com/company/managment-research-services-inc./"
                       data-test-app-aware-link="">Management Research Services, Inc. (MRS, Inc)</a>
                  </span>
                </span>
              </div>
            </div>
          </div>
          <div class="LjmdKCEqKITHihFOiQsBAQylkdnsWhqZii t-14 t-black t-normal">
            Insurance • Milwaukee, Wisconsin
          </div>
          <div class="cTPhJiHyNLmxdQYFlsEOutjznmqrVHUByZwZ t-14 t-normal">
            1K followers
          </div>
        </div>
        <p class="yWzlqwKNlvCWVNoKqmzoDDEnBMUuyynaLg entity-result__summary--2-lines t-12 t-black--light">
          MRS combines 30 years of experience supporting the Life,
          <strong>Health</strong> and Annuities <strong>Insurance</strong> Industry with customized
          <strong>insurance</strong> underwriting solutions that efficiently support clients’ workflows.
          Supported by the Agenium Platform (www.agenium.ai) our innovative underwriting solutions are
          guaranteed to optimize requirements...
        </p>
      </div>
      <div class="qXxdnXtzRVFTnTnetmNpssucBwQBsWlUuk MmzCPRicJimZvjJhvqTzDcDbdHhWPzspERzA">
        <div>
          <button aria-label="Follow Management Research Services, Inc. (MRS, Inc)" id="ember61"
                  class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view"
                  type="button">
            <span class="artdeco-button__text">Follow</span>
          </button>
        </div>
      </div>
    </div>
  </div>
</li>
94
docs/apps/linkdin/snippets/people.html
Normal file
@@ -0,0 +1,94 @@
<li class="grid grid__col--lg-8 block org-people-profile-card__profile-card-spacing">
  <div>
    <section class="artdeco-card full-width qQdPErXQkSAbwApNgNfuxukTIPPykttCcZGOHk">
      <img width="210" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
           ariarole="presentation" loading="lazy" height="210" alt="" id="ember96"
           class="evi-image lazy-image ghost-default ember-view org-people-profile-card__cover-photo org-people-profile-card__cover-photo--people">
      <div class="org-people-profile-card__profile-info">
        <div id="ember97"
             class="artdeco-entity-lockup artdeco-entity-lockup--stacked-center artdeco-entity-lockup--size-7 ember-view">
          <div id="ember98"
               class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-circle ember-view"
               type="circle">
            <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo"
               id="org-people-profile-card__profile-image-0"
               href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
               data-test-app-aware-link="">
              <img width="104"
                   src="https://media.licdn.com/dms/image/v2/D5603AQGs2Vyju4xZ7A/profile-displayphoto-shrink_100_100/profile-displayphoto-shrink_100_100/0/1681741067031?e=1750896000&v=beta&t=Hvj--IrrmpVIH7pec7-l_PQok8vsS__CGeUqBWOw7co"
                   loading="lazy" height="104" alt="Dr. Rayna S." id="ember99"
                   class="evi-image lazy-image ember-view">
            </a>
          </div>
          <div id="ember100" class="artdeco-entity-lockup__content ember-view">
            <div id="ember101" class="artdeco-entity-lockup__title ember-view">
              <a class="eETATgYTipaVsmrBChiBJJvFsdPhNpulhPZUVLHLo link-without-visited-state"
                 aria-label="View Dr. Rayna S.’s profile"
                 href="https://www.linkedin.com/in/speakerrayna?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABsqUBoBr5x071PuGGpNtK3NlvSARiVXPIs"
                 data-test-app-aware-link="">
                <div id="ember103"
                     class="ember-view lt-line-clamp lt-line-clamp--single-line AGabuksChUpCmjWshSnaZryLKSthOKkwclxY t-black">
                  Dr. Rayna S.
                </div>
              </a>
            </div>
            <div id="ember104" class="artdeco-entity-lockup__badge ember-view">
              <span class="a11y-text">3rd+ degree connection</span>
              <span class="artdeco-entity-lockup__degree" aria-hidden="true">· 3rd</span>
            </div>
            <div id="ember105" class="artdeco-entity-lockup__subtitle ember-view">
              <div class="t-14 t-black--light t-normal">
                <div id="ember107" class="ember-view lt-line-clamp lt-line-clamp--multi-line"
                     style="-webkit-line-clamp: 2">
                  Leadership and Talent Development Consultant and Professional Speaker
                </div>
              </div>
            </div>
            <div id="ember108" class="artdeco-entity-lockup__caption ember-view"></div>
          </div>
        </div>
        <span class="text-align-center">
          <span id="ember110"
                class="ember-view lt-line-clamp lt-line-clamp--multi-line t-12 t-black--light mt2"
                style="-webkit-line-clamp: 3">
            727 followers
          </span>
        </span>
      </div>
      <footer class="ph3 pb3">
        <button aria-label="Follow Dr. Rayna S." id="ember111"
                class="artdeco-button artdeco-button--2 artdeco-button--secondary ember-view full-width"
                type="button">
          <span class="artdeco-button__text">Follow</span>
        </button>
      </footer>
    </section>
  </div>
</li>
50
docs/apps/linkdin/templates/ai.js
Normal file
@@ -0,0 +1,50 @@
// ==== File: ai.js ====

class ApiHandler {
  constructor(apiKey = null) {
    this.apiKey = apiKey || localStorage.getItem("openai_api_key") || "";
    console.log("ApiHandler ready");
  }

  setApiKey(k) {
    this.apiKey = k.trim();
    if (this.apiKey) localStorage.setItem("openai_api_key", this.apiKey);
  }

  async *chatStream(messages, {model = "gpt-4o", temperature = 0.7} = {}) {
    if (!this.apiKey) throw new Error("OpenAI API key missing");
    const payload = {model, messages, temperature, stream: true, max_tokens: 1024};
    const controller = new AbortController();

    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`OpenAI: ${res.statusText}`);
    const reader = res.body.getReader();
    const dec = new TextDecoder();

    let buf = "";
    while (true) {
      const {done, value} = await reader.read();
      if (done) break;
      buf += dec.decode(value, {stream: true});
      // Process only complete lines; keep any trailing partial line in the buffer
      const lines = buf.split("\n");
      buf = lines.pop();
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        if (line.includes("[DONE]")) return;
        const json = JSON.parse(line.slice(6));
        const delta = json.choices?.[0]?.delta?.content;
        if (delta) yield delta;
      }
    }
  }
}

window.API = new ApiHandler();
1171
docs/apps/linkdin/templates/graph_view_template.html
Normal file
File diff suppressed because it is too large
51
docs/codebase/browser.md
Normal file
@@ -0,0 +1,51 @@
### browser_manager.py

| Function | What it does |
|---|---|
| `ManagedBrowser.build_browser_flags` | Returns baseline Chromium CLI flags; disables GPU and sandbox, plugs in locale, timezone, stealth tweaks, and any extras from `BrowserConfig`. |
| `ManagedBrowser.__init__` | Stores config and logger, creates a temp dir, preps internal state. |
| `ManagedBrowser.start` | Spawns or connects to the Chromium process; returns its CDP endpoint plus the `subprocess.Popen` handle. |
| `ManagedBrowser._initial_startup_check` | Pings the CDP endpoint once to be sure the browser is alive; raises if not. |
| `ManagedBrowser._monitor_browser_process` | Async-loops on the subprocess, logs exits or crashes, restarts if policy allows. |
| `ManagedBrowser._get_browser_path_WIP` | Old helper that maps OS + browser type to an executable path. |
| `ManagedBrowser._get_browser_path` | Current helper; checks env vars, the Playwright cache, and OS defaults for the real executable. |
| `ManagedBrowser._get_browser_args` | Builds the final CLI arg list by merging user flags, stealth flags, and defaults. |
| `ManagedBrowser.cleanup` | Terminates the browser, stops monitors, deletes the temp dir. |
| `ManagedBrowser.create_profile` | Opens a visible browser so a human can log in, then zips the resulting user-data-dir to `~/.crawl4ai/profiles/<name>`. |
| `ManagedBrowser.list_profiles` | Thin wrapper, now forwarded to `BrowserProfiler.list_profiles()`. |
| `ManagedBrowser.delete_profile` | Thin wrapper, now forwarded to `BrowserProfiler.delete_profile()`. |
| `BrowserManager.__init__` | Holds the global Playwright instance, browser handle, config-signature cache, session map, and logger. |
| `BrowserManager.start` | Boots the underlying `ManagedBrowser`, then spins up the default Playwright browser context with stealth patches. |
| `BrowserManager._build_browser_args` | Translates `CrawlerRunConfig` (proxy, UA, timezone, headless flag, etc.) into Playwright `launch_args`. |
| `BrowserManager.setup_context` | Applies locale, geolocation, permissions, cookies, and UA overrides on a fresh context. |
| `BrowserManager.create_browser_context` | Internal helper that actually calls `browser.new_context(**options)` after running `setup_context`. |
| `BrowserManager._make_config_signature` | Hashes the non-ephemeral parts of `CrawlerRunConfig` so contexts can be reused safely. |
| `BrowserManager.get_page` | Returns a ready `Page` for a given session id, reusing an existing one or creating a new context/page; injects helper scripts and updates `last_used`. |
| `BrowserManager.kill_session` | Force-closes a context/page for a session and removes it from the session map. |
| `BrowserManager._cleanup_expired_sessions` | Periodic sweep that drops sessions idle longer than `ttl_seconds`. |
| `BrowserManager.close` | Gracefully shuts down all contexts, the browser, Playwright, and background tasks. |

---
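The context-reuse trick behind `_make_config_signature` can be sketched in plain Python: serialise only the stable parts of a config and hash the result, so two runs that differ only in ephemeral fields map to the same context. The field names and the excluded keys below are illustrative assumptions, not the actual `CrawlerRunConfig` attributes.

```python
import hashlib
import json

# Assumption: these keys vary per run and must not affect context reuse.
EPHEMERAL_KEYS = {"session_id", "cache_mode"}

def make_config_signature(config: dict) -> str:
    # Drop ephemeral fields, then serialise deterministically before hashing
    stable = {k: v for k, v in config.items() if k not in EPHEMERAL_KEYS}
    blob = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()

a = make_config_signature({"locale": "en-US", "session_id": "s1"})
b = make_config_signature({"session_id": "s2", "locale": "en-US"})
print(a == b)  # True: configs differ only in an ephemeral field
```

Because the hash is deterministic, it doubles as a dictionary key for a context cache.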
### browser_profiler.py

| Function | What it does |
|---|---|
| `BrowserProfiler.__init__` | Sets up profile folder paths, async logger, and signal handlers. |
| `BrowserProfiler.create_profile` | Launches a visible browser with a new user-data-dir for manual login; on exit compresses and stores it as a named profile. |
| `BrowserProfiler.cleanup_handler` | General SIGTERM/SIGINT cleanup wrapper that kills child processes. |
| `BrowserProfiler.sigint_handler` | Handles Ctrl-C during an interactive session, making sure the browser shuts down cleanly. |
| `BrowserProfiler.listen_for_quit_command` | Async REPL that exits when the user types `q`. |
| `BrowserProfiler.list_profiles` | Enumerates `~/.crawl4ai/profiles`, printing profile name, browser type, size, and last modified. |
| `BrowserProfiler.get_profile_path` | Returns the absolute path of a profile given its name, or `None` if missing. |
| `BrowserProfiler.delete_profile` | Removes a profile folder or a direct path from disk, with an optional confirmation prompt. |
| `BrowserProfiler.interactive_manager` | Text UI loop for listing, creating, deleting, or launching profiles. |
| `BrowserProfiler.launch_standalone_browser` | Starts a non-headless Chromium with remote debugging enabled and keeps it alive for manual tests. |
| `BrowserProfiler.get_cdp_json` | Pulls `/json/version` from a CDP endpoint and returns the parsed JSON. |
| `BrowserProfiler.launch_builtin_browser` | Spawns a headless Chromium in the background, saving `{wsEndpoint, pid, started_at}` to `~/.crawl4ai/builtin_browser.json`. |
| `BrowserProfiler.get_builtin_browser_info` | Reads that JSON file, verifies the PID, and returns browser status info. |
| `BrowserProfiler._is_browser_running` | Cross-platform helper that checks if a PID is still alive. |
| `BrowserProfiler.kill_builtin_browser` | Terminates the background builtin browser and removes its status file. |
| `BrowserProfiler.get_builtin_browser_status` | Returns `{running: bool, wsEndpoint, pid, started_at}` for quick health checks. |
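
The liveness probe behind `_is_browser_running` boils down to signalling the PID with signal 0, which performs the existence and permission checks without delivering anything. This is a POSIX-only sketch of the idea; the real helper is cross-platform.

```python
import os

def is_pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX sketch)."""
    try:
        os.kill(pid, 0)  # signal 0: check only, nothing is delivered
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # exists, but owned by another user
    return True

print(is_pid_alive(os.getpid()))  # True: the current process is alive
```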
40
docs/codebase/cli.md
Normal file
@@ -0,0 +1,40 @@
### `cli.py` command surface

| Command | Inputs / flags | What it does |
|---|---|---|
| **profiles** | *(none)* | Opens the interactive profile manager; lets you list, create, and delete saved browser profiles that live in `~/.crawl4ai/profiles`. |
| **browser status** | – | Prints whether the always-on *builtin* browser is running, showing its CDP URL, PID, and start time. |
| **browser stop** | – | Kills the builtin browser and deletes its status file. |
| **browser view** | `--url, -u` URL *(optional)* | Pops a visible window of the builtin browser, navigating to `URL` or `about:blank`. |
| **config list** | – | Dumps every global setting, showing current value, default, and description. |
| **config get** | `key` | Prints the value of a single setting, falling back to the default if unset. |
| **config set** | `key value` | Persists a new value in the global config (stored under `~/.crawl4ai/config.yml`). |
| **examples** | – | Prints real-world CLI usage samples. |
| **crawl** | `url` *(positional)*<br>`--browser-config,-B` path<br>`--crawler-config,-C` path<br>`--filter-config,-f` path<br>`--extraction-config,-e` path<br>`--json-extract,-j` [desc]\*<br>`--schema,-s` path<br>`--browser,-b` k=v list<br>`--crawler,-c` k=v list<br>`--output,-o` all,json,markdown,md,markdown-fit,md-fit *(default all)*<br>`--output-file,-O` path<br>`--bypass-cache,-b` *(flag, default true; note the `-b` flag reuse)*<br>`--question,-q` str<br>`--verbose,-v` *(flag)*<br>`--profile,-p` profile-name | One-shot crawl + extraction. Builds `BrowserConfig` and `CrawlerRunConfig` from inline flags or separate YAML/JSON files, runs `AsyncWebCrawler.run()`, can route through a named saved profile, and pipes the result to stdout or a file. |
| **(default)** | Same flags as **crawl**, plus `--example` | Shortcut so you can type just `crwl https://site.com`. When the first arg is not a known sub-command, it falls through to *crawl*. |

\* `--json-extract/-j` with no value turns on LLM-based JSON extraction with an auto-generated schema; supplying a string lets you prompt-engineer the field descriptions.

> Quick mental model
> `profiles` = manage identities,
> `browser ...` = control a long-running headless Chrome that all crawls can piggy-back on,
> `crawl` = do the actual work,
> `config` = tweak global defaults,
> everything else is sugar.
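
The default-alias fallthrough can be sketched as a tiny dispatcher: if the first token is a known sub-command, route to it; otherwise hand the whole argv to `crawl`. The set of sub-command names below is taken from the table above and is illustrative, not the actual `cli.py` registry.

```python
# Assumed sub-command names, mirroring the table above.
KNOWN = {"profiles", "browser", "config", "examples", "crawl", "cdp"}

def dispatch(argv: list[str]) -> tuple[str, list[str]]:
    """Pick a sub-command, falling through to `crawl` for bare URLs."""
    if argv and argv[0] in KNOWN:
        return argv[0], argv[1:]
    return "crawl", argv  # `crwl https://site.com` == `crwl crawl https://site.com`

print(dispatch(["https://site.com", "-p", "my-profile"]))
# -> ('crawl', ['https://site.com', '-p', 'my-profile'])
```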

### Quick-fire “profile” usage cheatsheet

| Scenario | Command (copy-paste ready) | Notes |
|---|---|---|
| **Launch interactive Profile Manager UI** | `crwl profiles` | Opens TUI with options: 1 List, 2 Create, 3 Delete, 4 Use-to-crawl, 5 Exit. |
| **Create a fresh profile** | `crwl profiles` → choose **2** → name it → browser opens → log in → press **q** in terminal | Saves to `~/.crawl4ai/profiles/<name>`. |
| **List saved profiles** | `crwl profiles` → choose **1** | Shows name, browser type, size, last-modified. |
| **Delete a profile** | `crwl profiles` → choose **3** → pick the profile index → confirm | Removes the folder. |
| **Crawl with a profile (default alias)** | `crwl https://site.com/dashboard -p my-profile` | Keeps login cookies, sets `use_managed_browser=true` under the hood. |
| **Crawl + verbose JSON output** | `crwl https://site.com -p my-profile -o json -v` | Any other `crawl` flags work the same. |
| **Crawl with extra browser tweaks** | `crwl https://site.com -p my-profile -b "headless=true,viewport_width=1680"` | CLI overrides go on top of the profile. |
| **Same but via explicit sub-command** | `crwl crawl https://site.com -p my-profile` | Identical to the default alias. |
| **Use profile from inside Profile Manager** | `crwl profiles` → choose **4** → pick profile → enter URL → follow prompts | Handy when demo-ing to non-CLI folks. |
| **One-off crawl with a profile folder path (no name lookup)** | `crwl https://site.com -b "user_data_dir=$HOME/.crawl4ai/profiles/my-profile,use_managed_browser=true"` | Bypasses the registry, useful for CI scripts. |
| **Launch a dev browser on CDP port with the same identity** | `crwl cdp -d $HOME/.crawl4ai/profiles/my-profile -P 9223` | Lets Puppeteer/Playwright attach for debugging. |
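
Resolving a profile name to its folder (what `-p my-profile` does via `BrowserProfiler.get_profile_path`) is, in essence, a directory lookup. This sketch assumes the `~/.crawl4ai/profiles/<name>` layout described above and takes the root as a parameter so it can be pointed anywhere; it is an illustration, not the library's implementation.

```python
from pathlib import Path

def get_profile_path(name: str,
                     root: Path = Path.home() / ".crawl4ai" / "profiles"):
    """Return the profile folder for `name`, or None if it doesn't exist."""
    candidate = root / name
    return candidate if candidate.is_dir() else None
```

A missing profile returning `None` (rather than raising) lets the CLI fall back to a clear "profile not found" message.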
@@ -391,12 +391,14 @@ async def main():
     # Process results
     raw_df = pd.DataFrame()
     for result in results:
-        if result.success and result.media["tables"]:
+        # Use the new tables field, falling back to media["tables"] for backward compatibility
+        tables = result.tables if hasattr(result, "tables") and result.tables else result.media.get("tables", [])
+        if result.success and tables:
             # Extract primary market table into a DataFrame
             raw_df = pd.DataFrame(
-                result.media["tables"][0]["rows"],
-                columns=result.media["tables"][0]["headers"],
+                tables[0]["rows"],
+                columns=tables[0]["headers"],
             )
             break
File diff suppressed because it is too large

149
docs/examples/docker/demo_docker_polling.py
Normal file
@@ -0,0 +1,149 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
demo_docker_polling.py
|
||||
Quick sanity-check for the asynchronous crawl job endpoints:
|
||||
|
||||
• POST /crawl/job – enqueue work, get task_id
|
||||
• GET /crawl/job/{id} – poll status / fetch result
|
||||
|
||||
The style matches demo_docker_api.py (console.rule banners, helper
|
||||
functions, coloured status lines). Adjust BASE_URL as needed.
|
||||
|
||||
Run: python demo_docker_polling.py
|
||||
"""
|
||||
|
||||
import asyncio, json, os, time, urllib.parse
|
||||
from typing import Dict, List
|
||||
|
||||
import httpx
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
from rich.syntax import Syntax
|
||||
|
||||
console = Console()
|
||||
BASE_URL = os.getenv("BASE_URL", "http://localhost:11234")
|
||||
SIMPLE_URL = "https://example.org"
|
||||
LINKS_URL = "https://httpbin.org/links/10/1"
|
||||
|
||||
# --- helpers --------------------------------------------------------------
|
||||
|
||||
|
||||
def print_payload(payload: Dict):
|
||||
console.print(Panel(Syntax(json.dumps(payload, indent=2),
|
||||
"json", theme="monokai", line_numbers=False),
|
||||
title="Payload", border_style="cyan", expand=False))
|
||||
|
||||
|
||||
async def check_server_health(client: httpx.AsyncClient) -> bool:
|
||||
try:
|
||||
resp = await client.get("/health")
|
||||
if resp.is_success:
|
||||
console.print("[green]Server healthy[/]")
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
console.print("[bold red]Server is not responding on /health[/]")
|
||||
return False
|
||||
|
||||
|
||||
async def poll_for_result(client: httpx.AsyncClient, task_id: str,
|
||||
poll_interval: float = 1.5, timeout: float = 90.0):
|
||||
"""Hit /crawl/job/{id} until COMPLETED/FAILED or timeout."""
|
||||
start = time.time()
|
||||
while True:
|
||||
resp = await client.get(f"/crawl/job/{task_id}")
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
status = data.get("status")
|
||||
if status.upper() in ("COMPLETED", "FAILED"):
|
||||
return data
|
||||
if time.time() - start > timeout:
|
||||
raise TimeoutError(f"Task {task_id} did not finish in {timeout}s")
|
||||
await asyncio.sleep(poll_interval)
|
||||
|
||||
|
||||
# --- demo functions -------------------------------------------------------


async def demo_poll_single_url(client: httpx.AsyncClient):
    payload = {
        "urls": [SIMPLE_URL],
        "browser_config": {"type": "BrowserConfig",
                           "params": {"headless": True}},
        "crawler_config": {"type": "CrawlerRunConfig",
                           "params": {"cache_mode": "BYPASS"}}
    }

    console.rule("[bold blue]Demo A: /crawl/job Single URL[/]", style="blue")
    print_payload(payload)

    # enqueue
    resp = await client.post("/crawl/job", json=payload)
    console.print(f"Enqueue status: [bold]{resp.status_code}[/]")
    resp.raise_for_status()
    task_id = resp.json()["task_id"]
    console.print(f"Task ID: [yellow]{task_id}[/]")

    # poll
    console.print("Polling…")
    result = await poll_for_result(client, task_id)
    console.print(Panel(Syntax(json.dumps(result, indent=2),
                               "json", theme="fruity"),
                        title="Final result", border_style="green"))
    if result["status"] == "COMPLETED":
        console.print("[green]✅ Crawl succeeded[/]")
    else:
        console.print("[red]❌ Crawl failed[/]")
async def demo_poll_multi_url(client: httpx.AsyncClient):
    payload = {
        "urls": [SIMPLE_URL, LINKS_URL],
        "browser_config": {"type": "BrowserConfig",
                           "params": {"headless": True}},
        "crawler_config": {"type": "CrawlerRunConfig",
                           "params": {"cache_mode": "BYPASS"}}
    }

    console.rule("[bold magenta]Demo B: /crawl/job Multi-URL[/]",
                 style="magenta")
    print_payload(payload)

    resp = await client.post("/crawl/job", json=payload)
    console.print(f"Enqueue status: [bold]{resp.status_code}[/]")
    resp.raise_for_status()
    task_id = resp.json()["task_id"]
    console.print(f"Task ID: [yellow]{task_id}[/]")

    console.print("Polling…")
    result = await poll_for_result(client, task_id)
    console.print(Panel(Syntax(json.dumps(result, indent=2),
                               "json", theme="fruity"),
                        title="Final result", border_style="green"))
    if result["status"] == "COMPLETED":
        console.print(
            f"[green]✅ {len(json.loads(result['result'])['results'])} URLs crawled[/]")
    else:
        console.print("[red]❌ Crawl failed[/]")
# --- main runner ----------------------------------------------------------


async def main_demo():
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=300.0) as client:
        if not await check_server_health(client):
            return
        await demo_poll_single_url(client)
        await demo_poll_multi_url(client)
        console.rule("[bold green]Polling demos complete[/]", style="green")


if __name__ == "__main__":
    try:
        asyncio.run(main_demo())
    except KeyboardInterrupt:
        console.print("\n[yellow]Interrupted by user[/]")
    except Exception:
        console.print_exception(show_locals=False)
@@ -3,42 +3,19 @@ from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)


async def example_cdp():
    browser_conf = BrowserConfig(
        headless=False,
        cdp_url="http://localhost:9223"
    )
    crawler_config = CrawlerRunConfig(
        session_id="test",
        js_code="""(() => { return {"result": "Hello World!"} })()""",
        js_only=True
    )
    async with AsyncWebCrawler(
        config=browser_conf,
        verbose=True,
    ) as crawler:
        result: CrawlResult = await crawler.arun(
            url="https://www.helloworld.org",
            config=crawler_config,
        )
        print(result.js_execution_result)


async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()
            ),
        )
        result: CrawlResult = await crawler.arun(
143
docs/examples/regex_extraction_quickstart.py
Normal file
@@ -0,0 +1,143 @@
# == File: regex_extraction_quickstart.py ==
"""
Mini quick-start for RegexExtractionStrategy
────────────────────────────────────────────
3 bite-sized demos that parallel the style of *quickstart_examples_set_1.py*:

1. **Default catalog** – scrape a page and pull out e-mails / phones / URLs, etc.
2. **Custom pattern** – add your own regex at instantiation time.
3. **LLM-assisted schema** – ask the model to write a pattern, cache it, then
   run extraction _without_ further LLM calls.

Run the whole thing with::

    python regex_extraction_quickstart.py
"""

import os
import json
import asyncio
from pathlib import Path
from typing import List

from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    CrawlResult,
    RegexExtractionStrategy,
    LLMConfig,
)
# ────────────────────────────────────────────────────────────────────────────
# 1. Default-catalog extraction
# ────────────────────────────────────────────────────────────────────────────
async def demo_regex_default() -> None:
    print("\n=== 1. Regex extraction – default patterns ===")

    url = "https://www.iana.org/domains/example"  # has e-mail + URLs
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result: CrawlResult = await crawler.arun(url, config=config)

    print(f"Fetched {url} - success={result.success}")
    if result.success:
        data = json.loads(result.extracted_content)
        for d in data[:10]:
            print(f"  {d['label']:<12} {d['value']}")
        print(f"... total matches: {len(data)}")
    else:
        print("  !!! crawl failed")
# ────────────────────────────────────────────────────────────────────────────
# 2. Custom pattern override / extension
# ────────────────────────────────────────────────────────────────────────────
async def demo_regex_custom() -> None:
    print("\n=== 2. Regex extraction – custom price pattern ===")

    url = "https://www.apple.com/shop/buy-mac/macbook-pro"
    price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}

    strategy = RegexExtractionStrategy(custom=price_pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result: CrawlResult = await crawler.arun(url, config=config)

    if result.success:
        data = json.loads(result.extracted_content)
        for d in data:
            print(f"  {d['value']}")
        if not data:
            print("  (No prices found - page layout may have changed)")
    else:
        print("  !!! crawl failed")
# ────────────────────────────────────────────────────────────────────────────
# 3. One-shot LLM pattern generation, then fast extraction
# ────────────────────────────────────────────────────────────────────────────
async def demo_regex_generate_pattern() -> None:
    print("\n=== 3. generate_pattern → regex extraction ===")

    cache_dir = Path(__file__).parent / "tmp"
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "price_pattern.json"

    url = "https://www.lazada.sg/tag/smartphone/"

    # ── 3-A. build or load the cached pattern
    if pattern_file.exists():
        pattern = json.loads(pattern_file.read_text(encoding="utf-8"))
        print("Loaded cached pattern:", pattern)
    else:
        print("Generating pattern via LLM…")

        llm_cfg = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY",
        )

        # pull one sample page as HTML context
        async with AsyncWebCrawler() as crawler:
            html = (await crawler.arun(url)).fit_html

        pattern = RegexExtractionStrategy.generate_pattern(
            label="price",
            html=html,
            query="Prices in Malaysian Ringgit (e.g. RM1,299.00 or RM200)",
            llm_config=llm_cfg,
        )

        pattern_file.write_text(json.dumps(pattern, indent=2), encoding="utf-8")
        print("Saved pattern:", pattern_file)

    # ── 3-B. extraction pass – zero LLM calls
    strategy = RegexExtractionStrategy(custom=pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy, delay_before_return_html=3)

    async with AsyncWebCrawler() as crawler:
        result: CrawlResult = await crawler.arun(url, config=config)

    if result.success:
        data = json.loads(result.extracted_content)
        for d in data[:15]:
            print(f"  {d['value']}")
        print(f"... total matches: {len(data)}")
    else:
        print("  !!! crawl failed")
# ────────────────────────────────────────────────────────────────────────────
# Entrypoint
# ────────────────────────────────────────────────────────────────────────────
async def main() -> None:
    # await demo_regex_default()
    # await demo_regex_custom()
    await demo_regex_generate_pattern()


if __name__ == "__main__":
    asyncio.run(main())
@@ -10,6 +10,7 @@ class CrawlResult(BaseModel):
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
@@ -50,7 +51,7 @@ if not result.success:
```

### 1.3 **`status_code`** *(Optional[int])*
**What**: The page's HTTP status code (e.g., 200, 404).
**Usage**:
```python
if result.status_code == 404:
@@ -82,7 +83,7 @@ if result.response_headers:
```

### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc.
**Usage**:
```python
@@ -109,14 +110,6 @@ print(len(result.html))
print(result.cleaned_html[:500])  # Show a snippet
```
### 2.3 **`fit_html`** *(Optional[str])*
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the "fit" or post-filter version.
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.markdown.fit_html:
    print("High-value HTML content:", result.markdown.fit_html[:300])
```

---
@@ -135,7 +128,7 @@ Crawl4AI can convert HTML→Markdown, optionally including:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.

**Usage**:
@@ -157,7 +150,7 @@ print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
```
**Important**: "Fit" content (in `fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.

---
@@ -169,7 +162,7 @@ print(result.markdown.fit_html)

- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text

**Usage**:
@@ -263,7 +256,7 @@ A `DispatchResult` object providing additional concurrency and resource usage in

- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
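As a rough sketch of the shape these fields take, here is a stand-in dataclass that mirrors the documented fields (it is illustrative, not the library's actual `DispatchResult` class):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DispatchResultSketch:
    """Illustrative stand-in mirroring the documented DispatchResult fields."""
    task_id: str
    memory_usage: float   # MB used at completion
    peak_memory: float    # peak MB during the task
    start_time: datetime
    end_time: datetime
    error_message: str = ""


d = DispatchResultSketch(
    task_id="abc123",
    memory_usage=210.5,
    peak_memory=260.0,
    start_time=datetime(2025, 4, 1, 12, 0, 0),
    end_time=datetime(2025, 4, 1, 12, 0, 9),
)
print(f"{d.task_id}: peak {d.peak_memory} MB over "
      f"{(d.end_time - d.start_time).total_seconds():.0f}s")
```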
@@ -358,7 +351,7 @@ async def handle_result(result: CrawlResult):
    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
@@ -36,6 +36,45 @@ LLMExtractionStrategy(
)
```

### RegexExtractionStrategy

Used for fast pattern-based extraction of common entities using regular expressions.

```python
RegexExtractionStrategy(
    # Pattern Configuration
    pattern: IntFlag = RegexExtractionStrategy.Nothing,  # Bit flags of built-in patterns to use
    custom: Optional[Dict[str, str]] = None,             # Custom pattern dictionary {label: regex}

    # Input Format
    input_format: str = "fit_html",                      # "html", "markdown", "text" or "fit_html"
)

# Built-in Patterns as Bit Flags
RegexExtractionStrategy.Email          # Email addresses
RegexExtractionStrategy.PhoneIntl      # International phone numbers
RegexExtractionStrategy.PhoneUS        # US-format phone numbers
RegexExtractionStrategy.Url            # HTTP/HTTPS URLs
RegexExtractionStrategy.IPv4           # IPv4 addresses
RegexExtractionStrategy.IPv6           # IPv6 addresses
RegexExtractionStrategy.Uuid           # UUIDs
RegexExtractionStrategy.Currency       # Currency values (USD, EUR, etc.)
RegexExtractionStrategy.Percentage     # Percentage values
RegexExtractionStrategy.Number         # Numeric values
RegexExtractionStrategy.DateIso        # ISO format dates
RegexExtractionStrategy.DateUS         # US format dates
RegexExtractionStrategy.Time24h        # 24-hour format times
RegexExtractionStrategy.PostalUS       # US postal codes
RegexExtractionStrategy.PostalUK       # UK postal codes
RegexExtractionStrategy.HexColor       # HTML hex color codes
RegexExtractionStrategy.TwitterHandle  # Twitter handles
RegexExtractionStrategy.Hashtag        # Hashtags
RegexExtractionStrategy.MacAddr        # MAC addresses
RegexExtractionStrategy.Iban           # International bank account numbers
RegexExtractionStrategy.CreditCard     # Credit card numbers
RegexExtractionStrategy.All            # All available patterns
```

### CosineStrategy

Used for content similarity-based extraction and clustering.
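The `pattern` flags combine with bitwise OR like any `enum.IntFlag`. A minimal stand-in (illustrative names and values, not the library's internals) shows the mechanics:

```python
from enum import IntFlag, auto


class Pattern(IntFlag):
    # Stand-in for RegexExtractionStrategy's built-in pattern flags
    Nothing = 0
    Email = auto()
    PhoneUS = auto()
    Url = auto()


# Combine with `|`; test membership with `in`
selected = Pattern.Email | Pattern.Url
print(Pattern.Email in selected)
print(Pattern.PhoneUS in selected)
```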
@@ -156,6 +195,55 @@ result = await crawler.arun(
data = json.loads(result.extracted_content)
```

### Regex Extraction

```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy

# Method 1: Use built-in patterns
strategy = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
)

# Method 2: Use custom patterns
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
strategy = RegexExtractionStrategy(custom=price_pattern)

# Method 3: Generate pattern with LLM assistance (one-time)
from crawl4ai import LLMConfig

async with AsyncWebCrawler() as crawler:
    # Get sample HTML first
    sample_result = await crawler.arun("https://example.com/products")
    html = sample_result.fit_html

    # Generate regex pattern once
    pattern = RegexExtractionStrategy.generate_pattern(
        label="price",
        html=html,
        query="Product prices in USD format",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini")
    )

    # Save pattern for reuse
    with open("price_pattern.json", "w") as f:
        json.dump(pattern, f)

    # Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    result = await crawler.arun(
        url="https://example.com/products",
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )

    # Process results
    data = json.loads(result.extracted_content)
    for item in data:
        print(f"{item['label']}: {item['value']}")
```
### CSS Extraction

```python
@@ -220,12 +308,28 @@ result = await crawler.arun(
```

## Best Practices

1. **Choose the Right Strategy**
   - Use `RegexExtractionStrategy` for common data types like emails, phones, URLs, dates
   - Use `JsonCssExtractionStrategy` for well-structured HTML with consistent patterns
   - Use `LLMExtractionStrategy` for complex, unstructured content requiring reasoning
   - Use `CosineStrategy` for content similarity and clustering

2. **Strategy Selection Guide**
   ```
   Is the target data a common type (email/phone/date/URL)?
   → RegexExtractionStrategy

   Does the page have consistent HTML structure?
   → JsonCssExtractionStrategy or JsonXPathExtractionStrategy

   Is the data semantically complex or unstructured?
   → LLMExtractionStrategy

   Need to find content similar to a specific topic?
   → CosineStrategy
   ```

3. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
@@ -234,7 +338,26 @@ result = await crawler.arun(
   )
   ```

4. **Combine Strategies for Best Performance**
   ```python
   # First pass: Extract structure with CSS
   css_strategy = JsonCssExtractionStrategy(product_schema)
   css_result = await crawler.arun(url, config=CrawlerRunConfig(extraction_strategy=css_strategy))
   product_data = json.loads(css_result.extracted_content)

   # Second pass: Extract specific fields with regex
   descriptions = [product["description"] for product in product_data]
   regex_strategy = RegexExtractionStrategy(
       pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS,
       custom={"dimension": r"\d+x\d+x\d+ (?:cm|in)"}
   )

   # Process descriptions with regex
   for text in descriptions:
       matches = regex_strategy.extract("", text)  # Direct extraction
   ```

5. **Handle Errors**
   ```python
   try:
       result = await crawler.arun(
@@ -247,11 +370,31 @@ result = await crawler.arun(
   print(f"Extraction failed: {e}")
   ```

6. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,              # Enable logging
       word_count_threshold=20,   # Filter short content
       top_k=5                    # Limit results
   )
   ```

7. **Cache Generated Patterns**
   ```python
   # For RegexExtractionStrategy pattern generation
   import json
   from pathlib import Path

   cache_dir = Path("./pattern_cache")
   cache_dir.mkdir(exist_ok=True)
   pattern_file = cache_dir / "product_pattern.json"

   if pattern_file.exists():
       with open(pattern_file) as f:
           pattern = json.load(f)
   else:
       # Generate once with LLM
       pattern = RegexExtractionStrategy.generate_pattern(...)
       with open(pattern_file, "w") as f:
           json.dump(pattern, f)
   ```
@@ -1,15 +1,20 @@
# Extracting JSON (No LLM)

One of Crawl4AI's **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. Crawl4AI offers several strategies for LLM-free extraction:

1. **Schema-based extraction** with CSS or XPath selectors via `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
2. **Regular expression extraction** with `RegexExtractionStrategy` for fast pattern matching

These approaches let you extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.

**Why avoid LLM for basic extractions?**

1. **Faster & Cheaper**: No API calls or GPU overhead.
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. Pattern-based extraction is practically carbon-free.
3. **Precise & Repeatable**: CSS/XPath selectors and regex patterns do exactly what you specify. LLM outputs can vary or hallucinate.
4. **Scales Readily**: For thousands of pages, pattern-based extraction runs quickly and in parallel.

Below, we'll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We'll also highlight advanced features like **nested fields** and **base element attributes**.

---
@@ -17,17 +22,17 @@ Below, we’ll explore how to craft these schemas and use them with **JsonCssExt

A schema defines:

1. A **base selector** that identifies each "container" element on the page (e.g., a product row, a blog post card).
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
3. **Nested** or **list** types for repeated or hierarchical structures.

For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.

---
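To make those three parts concrete, here is a minimal schema sketch (the names and selectors are illustrative; a real, runnable example follows in the next section):

```python
# A minimal schema: base selector + two fields (selectors are illustrative).
schema = {
    "name": "Example Items",
    "baseSelector": "div.item",  # 1. one match per "container" element
    "fields": [                  # 2. what to capture inside each container
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
    # 3. nested or list field types would go here for hierarchical data
}
print([f["name"] for f in schema["fields"]])
```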
## 2. Simple Example: Crypto Prices

Let's begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don't** call any LLM:

```python
import json
@@ -87,7 +92,7 @@ asyncio.run(extract_crypto_prices())
```
**Highlights**:

- **`baseSelector`**: Tells us where each "item" (crypto row) is.
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).
@@ -97,7 +102,7 @@ No LLM is needed, and the performance is **near-instant** for hundreds or thousa

### **XPath Example with `raw://` HTML**

Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We'll pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.

```python
import json
@@ -168,12 +173,12 @@ asyncio.run(extract_crypto_prices_xpath())
```
**Key Points**:

1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
2. **`baseSelector`** and each field's `"selector"` use **XPath** instead of CSS.
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.

That's how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.

---
@@ -187,7 +192,7 @@ We have a **sample e-commerce** HTML file on GitHub (example):
```
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
```
This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure **without LLM**.

```python
schema = {
@@ -333,24 +338,253 @@ async def extract_ecommerce_data():
asyncio.run(extract_ecommerce_data())
```

If all goes well, you get a **structured** JSON array with each "category," containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.

---
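Since the diff elides the schema body above, here is a compressed sketch of how such nesting can look (the selectors are illustrative, and the `"list"` type name is an assumption based on the nested/list feature described earlier):

```python
# Sketch: each category contains a repeated list of products.
schema = {
    "name": "Categories",
    "baseSelector": "div.category",
    "fields": [
        {"name": "category_name", "selector": "h2", "type": "text"},
        {
            "name": "products",
            "selector": "div.product",
            "type": "list",  # assumed type name for repeated sub-items
            "fields": [
                {"name": "title", "selector": "h3", "type": "text"},
                {"name": "price", "selector": "span.price", "type": "text"},
            ],
        },
    ],
}
print(schema["fields"][1]["name"], len(schema["fields"][1]["fields"]))
```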
## 4. RegexExtractionStrategy - Fast Pattern-Based Extraction

Crawl4AI now offers a powerful new zero-LLM extraction strategy: `RegexExtractionStrategy`. This strategy provides lightning-fast extraction of common data types like emails, phone numbers, URLs, dates, and more using pre-compiled regular expressions.

### Key Features

- **Zero LLM Dependency**: Extracts data without any AI model calls
- **Blazing Fast**: Uses pre-compiled regex patterns for maximum performance
- **Built-in Patterns**: Includes ready-to-use patterns for common data types
- **Custom Patterns**: Add your own regex patterns for domain-specific extraction
- **LLM-Assisted Pattern Generation**: Optionally use an LLM once to generate optimized patterns, then reuse them without further LLM calls

### Simple Example: Extracting Common Entities

The easiest way to start is by using the built-in pattern catalog:
```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_with_regex():
    # Create a strategy using built-in patterns for URLs and currencies
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:5]:  # Show first 5 matches
                print(f"{item['label']}: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_regex())
```

### Available Built-in Patterns

`RegexExtractionStrategy` provides these common patterns as IntFlag attributes for easy combining:

```python
# Use individual patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)

# Combine multiple patterns
strategy = RegexExtractionStrategy(
    pattern=(
        RegexExtractionStrategy.Email |
        RegexExtractionStrategy.PhoneUS |
        RegexExtractionStrategy.Url
    )
)

# Use all available patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
```

Available patterns include:

- `Email` - Email addresses
- `PhoneIntl` - International phone numbers
- `PhoneUS` - US-format phone numbers
- `Url` - HTTP/HTTPS URLs
- `IPv4` - IPv4 addresses
- `IPv6` - IPv6 addresses
- `Uuid` - UUIDs
- `Currency` - Currency values (USD, EUR, etc.)
- `Percentage` - Percentage values
- `Number` - Numeric values
- `DateIso` - ISO format dates
- `DateUS` - US format dates
- `Time24h` - 24-hour format times
- `PostalUS` - US postal codes
- `PostalUK` - UK postal codes
- `HexColor` - HTML hex color codes
- `TwitterHandle` - Twitter handles
- `Hashtag` - Hashtags
- `MacAddr` - MAC addresses
- `Iban` - International bank account numbers
- `CreditCard` - Credit card numbers

### Custom Pattern Example

For more targeted extraction, you can provide custom patterns:

```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_prices():
    # Define a custom pattern for US Dollar prices
    price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}

    # Create strategy with custom pattern
    strategy = RegexExtractionStrategy(custom=price_pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"Found price: {item['value']}")

asyncio.run(extract_prices())
```

### LLM-Assisted Pattern Generation

For complex or site-specific patterns, you can use an LLM once to generate an optimized pattern, then save and reuse it without further LLM calls:

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy,
    LLMConfig
)

async def extract_with_generated_pattern():
    cache_dir = Path("./pattern_cache")
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "price_pattern.json"

    # 1. Generate or load pattern
    if pattern_file.exists():
        pattern = json.load(pattern_file.open())
        print(f"Using cached pattern: {pattern}")
    else:
        print("Generating pattern via LLM...")

        # Configure LLM
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY",
        )

        # Get sample HTML for context
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/products")
            html = result.fit_html

        # Generate pattern (one-time LLM usage)
        pattern = RegexExtractionStrategy.generate_pattern(
            label="price",
            html=html,
            query="Product prices in USD format",
            llm_config=llm_config,
        )

        # Cache pattern for future use
        json.dump(pattern, pattern_file.open("w"), indent=2)

    # 2. Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:10]:
                print(f"Extracted: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_generated_pattern())
```

This approach allows you to:

1. Use an LLM once to generate a highly optimized regex for your specific site
2. Save the pattern to disk for reuse
3. Extract data using only regex (no further LLM calls) in production

### Extraction Results Format

The `RegexExtractionStrategy` returns results in a consistent format:

```json
[
  {
    "url": "https://example.com",
    "label": "email",
    "value": "contact@example.com",
    "span": [145, 163]
  },
  {
    "url": "https://example.com",
    "label": "url",
    "value": "https://support.example.com",
    "span": [210, 235]
  }
]
```

Each match includes:

- `url`: The source URL
- `label`: The pattern name that matched (e.g., "email", "phone_us")
- `value`: The extracted text
- `span`: The start and end positions in the source content

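Because every match carries a `label`, the flat result list is easy to regroup by pattern name for downstream use. A small stdlib-only sketch, using the sample output above:

```python
import json
from collections import defaultdict

# Sample extracted_content in the format shown above
extracted = json.dumps([
    {"url": "https://example.com", "label": "email",
     "value": "contact@example.com", "span": [145, 163]},
    {"url": "https://example.com", "label": "url",
     "value": "https://support.example.com", "span": [210, 235]},
])

# Group match values by their pattern label
by_label = defaultdict(list)
for item in json.loads(extracted):
    by_label[item["label"]].append(item["value"])

print(dict(by_label))
# {'email': ['contact@example.com'], 'url': ['https://support.example.com']}
```

The `span` offsets are preserved in each item, so you can also map matches back to their position in the source content.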
---

## 5. Why "No LLM" Is Often Better

1. **Zero Hallucination**: Pattern-based extraction doesn't guess text; it either finds a match or it doesn't.
2. **Guaranteed Structure**: The same schema or regex yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema or regex, not re-tuning a model.

**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema or regex approach first for repeated or consistent data patterns.

---

## 6. Base Element Attributes & Additional Fields

It's easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements. You can define them in **`baseFields`** (extracted from the main container element) or in each field's sub-lists. This is especially helpful if you need an item's link or ID stored in the parent `<div>`.

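As a sketch of what such a schema can look like (the selectors and field names here are hypothetical, not from a real site):

```python
# Hypothetical schema sketch: attributes can come from the container
# element (baseFields) or from nested elements (fields with type "attribute").
schema = {
    "name": "Product Cards",
    "baseSelector": "div.card",  # the container element
    "baseFields": [
        # Attribute read from the container <div> itself
        {"name": "id", "type": "attribute", "attribute": "data-id"},
    ],
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        # Attribute read from a nested element
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

print(schema["baseFields"][0]["attribute"])
# data-id
```

Pass a schema like this to `JsonCssExtractionStrategy(schema)` as shown in the earlier sections.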
---

## 7. Putting It All Together: Larger Example

Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:

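A sketch of such a schema (all class names and selectors are hypothetical):

```python
# Hypothetical blog-card schema; selectors are illustrative only.
schema = {
    "name": "Blog Posts",
    "baseSelector": "div.post-card",
    "baseFields": [
        # The post URL lives on the card's wrapping element
        {"name": "url", "type": "attribute", "attribute": "href"},
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text"},
        {"name": "date", "selector": "time.post-date", "type": "text"},
        {"name": "summary", "selector": "p.post-summary", "type": "text"},
        {"name": "author", "selector": "span.post-author", "type": "text"},
    ],
}

print([f["name"] for f in schema["fields"]])
# ['title', 'date', 'summary', 'author']
```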
Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects.

---

## 8. Tips & Best Practices

1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
3. **Test** your schema on partial HTML or a test page before a big crawl.
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.

---

## 9. Schema Generation Utility

While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using an LLM. This is particularly useful when:

- Use OpenAI for production-quality schemas
- Use Ollama for development, testing, or when you need a self-hosted solution

---

## 10. Conclusion

With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:

- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or pattern-based extraction.
- Scale to thousands of pages quickly and reliably.

**Choosing the Right Strategy**:

- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phone numbers, URLs, dates, etc.
- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns.
- If you need both: first extract structured data with the JSON strategies, then use regex on specific fields.

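The two-pass idea can be sketched without the crawler at all: the first pass yields structured records (as a CSS/XPath schema would), and the second applies a regex to one field. Here the data is illustrative and the pattern is the USD price regex from earlier:

```python
import re

# Records as a CSS/XPath schema pass might return them (illustrative data)
posts = [
    {"title": "Widget A", "blurb": "Now only $1,299.99 while stocks last"},
    {"title": "Widget B", "blurb": "Free shipping on orders over $50"},
]

# Second pass: regex over one field (same USD pattern as the custom-pattern example)
usd = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
for post in posts:
    m = usd.search(post["blurb"])
    post["price"] = m.group(0) if m else None

print([p["price"] for p in posts])
# ['$1,299.99', '$50']
```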
**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.

**Last Updated**: 2025-05-02

---

That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!