---
name: daily-news-report
description: Scrapes content based on a preset URL list, filters high-quality technical information, and generates daily Markdown reports.
argument-hint: [optional: date]
disable-model-invocation: false
user-invocable: true
allowed-tools: Task, WebFetch, Read, Write, Bash(mkdir*), Bash(date*), Bash(ls*), mcp__chrome-devtools__*
---

# Daily News Report v3.0

> **Architecture Upgrade**: Main Agent Orchestration + SubAgent Execution + Browser Scraping + Smart Caching

## Core Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                       Main Agent (Orchestrator)                       │
│    Role: Scheduling, Monitoring, Evaluation, Decision, Aggregation    │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│ ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
│ │   1. Init   │ → │ 2. Dispatch │ → │ 3. Monitor  │ → │ 4. Evaluate │ │
│ │ Read Config │   │ Assign Tasks│   │ Collect Res │   │ Filter/Sort │ │
│ └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘ │
│        │                 │                 │                 │        │
│        ▼                 ▼                 ▼                 ▼        │
│ ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
│ │ 5. Decision │ ← │ Enough 20?  │   │ 6. Generate │ → │  7. Update  │ │
│ │  Cont/Stop  │   │     Y/N     │   │ Report File │   │ Cache Stats │ │
│ └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘ │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
                    ↓ Dispatch              ↑ Return Results
┌───────────────────────────────────────────────────────────────────────┐
│                       SubAgent Execution Layer                        │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│        ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │
│        │  Worker A   │     │  Worker B   │     │   Browser   │        │
│        │ (WebFetch)  │     │ (WebFetch)  │     │ (Headless)  │        │
│        │ Tier1 Batch │     │ Tier2 Batch │     │  JS Render  │        │
│        └─────────────┘     └─────────────┘     └─────────────┘        │
│               ↓                   ↓                   ↓               │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │                     Structured Result Return                      │ │
│ │      { status, data: [...], errors: [...], metadata: {...} }      │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

## Configuration Files

This skill uses the following configuration files:

| File | Purpose |
|------|---------|
| `sources.json` | Source configuration, priorities, scrape methods |
| `cache.json` | Cached data, historical stats, deduplication fingerprints |

## Execution Process Details

### Phase 1: Initialization

```yaml
Steps:
  1. Determine date (user argument or current date)
  2. Read sources.json for source configurations
  3. Read cache.json for historical data
  4. Create output directory NewsReport/
  5. Check if a partial report exists for today (append mode)
```
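
The date, directory, and append-mode steps can be sketched in Python as follows. This is a minimal illustration rather than the skill's actual implementation: the helper names `resolve_report_date`, `report_path`, and `init_run` are hypothetical, and the assumption that the optional argument is an ISO `YYYY-MM-DD` string is inferred from the report filename used in Phase 7.

```python
from datetime import date
from pathlib import Path
from typing import Optional

REPORT_DIR = Path("NewsReport")  # output directory named in step 4


def resolve_report_date(arg: Optional[str], today: Optional[date] = None) -> date:
    """Step 1: use the user-supplied date if present, otherwise the current date.

    Assumes the argument is an ISO date string (YYYY-MM-DD).
    """
    if arg:
        return date.fromisoformat(arg)
    return today or date.today()


def report_path(report_date: date, base: Path = REPORT_DIR) -> Path:
    """Build the report filename from Phase 7: YYYY-MM-DD-news-report.md."""
    return base / f"{report_date.isoformat()}-news-report.md"


def init_run(arg: Optional[str], base: Path = REPORT_DIR) -> dict:
    """Steps 1, 4 and 5: resolve the date, create the directory, detect append mode."""
    report_date = resolve_report_date(arg)
    base.mkdir(parents=True, exist_ok=True)
    path = report_path(report_date, base)
    # An existing file for this date means a partial report should be appended to.
    return {"date": report_date, "path": path, "append": path.exists()}
```

Reading `sources.json` and `cache.json` (steps 2 and 3) is left out because their schemas are not specified here.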

### Phase 2: Dispatch SubAgents

**Strategy**: Parallel dispatch, batch execution, early stopping mechanism

```yaml
Wave 1 (Parallel):
  - Worker A: Tier1 Batch A (HN, HuggingFace Papers)
  - Worker B: Tier1 Batch B (OneUsefulThing, Paul Graham)

Wait for results → Evaluate count

If < 15 high-quality items:
  Wave 2 (Parallel):
    - Worker C: Tier2 Batch A (James Clear, FS Blog)
    - Worker D: Tier2 Batch B (HackerNoon, Scott Young)

If still < 20 items:
  Wave 3 (Browser):
    - Browser Worker: ProductHunt, Latent Space (Require JS rendering)
```
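
The wave logic above can be illustrated with a short Python sketch. It rests on several assumptions: the skill itself dispatches SubAgents through the Task tool, so the thread pool here only stands in for parallel dispatch; the source ids are hypothetical labels; a "high-quality" item is assumed to mean `quality_score >= 4`; and both thresholds are applied to the high-quality count, with each wave checked independently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Batch = List[str]

# (batches, threshold): a wave runs only while the high-quality count is below
# its threshold; None means "always run". Source ids are hypothetical labels.
WAVES: List[Tuple[List[Batch], Optional[int]]] = [
    ([["hn", "hf_papers"], ["oneusefulthing", "paul_graham"]], None),
    ([["james_clear", "fs_blog"], ["hackernoon", "scott_young"]], 15),
    ([["producthunt", "latent_space"]], 20),
]


def count_high_quality(items: List[dict], min_score: int = 4) -> int:
    """Count items at or above min_score (the cut-off of 4 is an assumption)."""
    return sum(1 for item in items if item.get("quality_score", 0) >= min_score)


def run_waves(fetch_batch: Callable[[Batch], List[dict]]) -> List[dict]:
    """Run each wave's batches in parallel, skipping waves that are not needed."""
    collected: List[dict] = []
    for batches, threshold in WAVES:
        if threshold is not None and count_high_quality(collected) >= threshold:
            continue  # early stop: this wave's threshold is already met
        with ThreadPoolExecutor(max_workers=len(batches)) as pool:
            # Batches within a wave run concurrently; waves run one after another.
            for result in pool.map(fetch_batch, batches):
                collected.extend(result)
    return collected
```

`fetch_batch` is whatever callable performs the scrape for one batch and returns a list of item dicts; passing it in keeps the scheduling logic testable without network access.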

### Phase 3: SubAgent Task Format

Task format received by each SubAgent:

```yaml
task: fetch_and_extract
sources:
  - id: hn
    url: https://news.ycombinator.com
    extract: top_10
  - id: hf_papers
    url: https://huggingface.co/papers
    extract: top_voted

output_schema:
  items:
    - source_id: string      # Source Identifier
      title: string          # Title
      summary: string        # 2-4 sentence summary
      key_points: string[]   # Max 3 key points
      url: string            # Original URL
      keywords: string[]     # Keywords
      quality_score: 1-5     # Quality Score

constraints:
  filter: "Cutting-edge Tech/Deep Tech/Productivity/Practical Info"
  exclude: "General Science/Marketing Puff/Overly Academic/Job Posts"
  max_items_per_source: 10
  skip_on_error: true

return_format: JSON
```
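
To make the `output_schema` concrete, here is a hedged Python sketch of how the Main Agent could check a returned item against it. The function name `validate_item` and the wording of the problem messages are hypothetical; only the field names, the 3-point limit, and the 1-5 score range come from the schema above.

```python
from typing import List

# Field names and types taken from output_schema above.
REQUIRED_FIELDS = {
    "source_id": str,
    "title": str,
    "summary": str,
    "key_points": list,
    "url": str,
    "keywords": list,
    "quality_score": int,
}


def validate_item(item: dict) -> List[str]:
    """Return a list of problems; an empty list means the item matches the schema."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in item:
            problems.append(f"missing field: {field}")
        elif not isinstance(item[field], expected):
            problems.append(f"wrong type for {field}")
    # "Max 3 key points" from the schema comment.
    if isinstance(item.get("key_points"), list) and len(item["key_points"]) > 3:
        problems.append("more than 3 key_points")
    # quality_score must fall in the 1-5 range.
    score = item.get("quality_score")
    if isinstance(score, int) and not 1 <= score <= 5:
        problems.append("quality_score outside 1-5")
    return problems
```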

### Phase 4: Main Agent Monitoring & Feedback

Main Agent Responsibilities:

```yaml
Monitoring:
  - Check SubAgent return status (success/partial/failed)
  - Count collected items
  - Record success rate per source

Feedback Loop:
  - If a SubAgent fails, decide whether to retry or skip
  - If a source fails persistently, mark as disabled
  - Dynamically adjust source selection for subsequent batches

Decision:
  - Items >= 25 AND HighQuality >= 20 → Stop scraping
  - Items < 15 → Continue to next batch
  - All batches done but < 20 → Generate with available content (Quality over Quantity)
```
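
The decision rules can be written as a small pure function. This is a sketch under one explicit assumption: the rules above do not say what happens when the counts fall between the "continue" and "stop" thresholds while batches remain, so the sketch treats that case as "continue". The names `Decision` and `decide` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    STOP = "stop"                          # enough content, stop scraping
    CONTINUE = "continue"                  # dispatch the next batch
    GENERATE_PARTIAL = "generate_partial"  # out of batches, use what exists


def decide(total_items: int, high_quality: int, batches_remaining: int) -> Decision:
    """Apply the Phase 4 decision rules to the current counts."""
    if total_items >= 25 and high_quality >= 20:
        return Decision.STOP
    if batches_remaining == 0:
        # All batches done but the target was not reached: quality over quantity.
        return Decision.GENERATE_PARTIAL
    # Covers "Items < 15" and, by assumption, the unspecified middle range.
    return Decision.CONTINUE
```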

### Phase 5: Evaluation & Filtering

```yaml
Deduplication:
  - Exact URL match
  - Title similarity (>80% considered duplicate)
  - Check cache.json to avoid repeating items from previous reports

Score Calibration:
  - Unify scoring standards across SubAgents
  - Adjust weights based on source credibility
  - Bonus points for manually curated high-quality sources

Sorting:
  - Descending order by quality_score
  - Sort by source priority if scores are equal
  - Take Top 20
```
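
A possible shape for the deduplication and sorting steps is sketched below. The similarity measure is an assumption: the text only says ">80%", so this example uses the ratio from Python's standard `difflib.SequenceMatcher`. The function names, the `source_priority` mapping (lower number means higher priority), and the choice to rank before deduplicating (so the best-scored copy survives) are also illustrative. Score calibration is not shown.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set


def is_duplicate_title(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two titles as duplicates when their similarity ratio exceeds 80%."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold


def dedupe_and_rank(
    items: List[dict],
    seen_urls: Set[str],
    source_priority: Dict[str, int],
    limit: int = 20,
) -> List[dict]:
    """Rank by score (then source priority), drop duplicates, keep the top items."""
    ranked = sorted(
        items,
        key=lambda it: (-it["quality_score"], source_priority.get(it["source_id"], 99)),
    )
    seen = set(seen_urls)  # URLs from cache.json plus URLs kept in this run
    kept: List[dict] = []
    for item in ranked:
        if item["url"] in seen:
            continue  # exact URL match
        if any(is_duplicate_title(item["title"], k["title"]) for k in kept):
            continue  # near-identical title already kept
        seen.add(item["url"])
        kept.append(item)
        if len(kept) == limit:
            break
    return kept
```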

### Phase 6: Browser Scraping (MCP Chrome DevTools)

For pages requiring JS rendering, use a headless browser:

```yaml
Process:
  1. Call mcp__chrome-devtools__new_page to open the page
  2. Call mcp__chrome-devtools__wait_for to wait for the content to load
  3. Call mcp__chrome-devtools__take_snapshot to get the page structure
  4. Parse the snapshot to extract the required content
  5. Call mcp__chrome-devtools__close_page to close the page

Applicable Scenarios:
  - ProductHunt (403 on WebFetch)
  - Latent Space (Substack JS rendering)
  - Other single-page applications (SPAs)
```

### Phase 7: Generate Report

```yaml
Output:
  - Directory: NewsReport/
  - Filename: YYYY-MM-DD-news-report.md
  - Format: Standard Markdown

Content Structure:
  - Title + Date
  - Statistical Summary (Source count, items collected)
  - 20 High-Quality Items (Template based)
  - Generation Info (Version, Timestamps)
```

### Phase 8: Update Cache

```yaml
Update cache.json:
  - last_run: Record this run info
  - source_stats: Update stats per source
  - url_cache: Add processed URLs
  - content_hashes: Add content fingerprints
  - article_history: Record included articles
```
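
As an illustration of the cache update, the sketch below fills the five top-level keys listed above. Everything beneath those keys is an assumption, since the layout of `cache.json` is not specified here: the `included` counter, the shape of `last_run`, and the use of a SHA-256 hash over the normalized title and summary as the "content fingerprint" are all hypothetical choices.

```python
import hashlib
from typing import List


def content_fingerprint(title: str, summary: str) -> str:
    """Hash the case-folded, whitespace-normalized text (assumed fingerprint scheme)."""
    normalized = " ".join(f"{title} {summary}".lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def update_cache(cache: dict, items: List[dict], run_date: str) -> dict:
    """Record one run in the cache dict and return it."""
    cache["last_run"] = {"date": run_date, "items": len(items)}
    stats = cache.setdefault("source_stats", {})
    urls = cache.setdefault("url_cache", [])
    hashes = cache.setdefault("content_hashes", [])
    history = cache.setdefault("article_history", [])
    for item in items:
        # Per-source counter of articles that made it into a report.
        stats.setdefault(item["source_id"], {"included": 0})["included"] += 1
        if item["url"] not in urls:
            urls.append(item["url"])
        digest = content_fingerprint(item["title"], item["summary"])
        if digest not in hashes:
            hashes.append(digest)
        history.append({"date": run_date, "title": item["title"], "url": item["url"]})
    return cache
```

The returned dict would then be serialized back to `cache.json`, for example with `json.dump`.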

## SubAgent Call Examples

### Using general-purpose Agent

Because custom agents are only discovered after a session restart, use the `general-purpose` agent and inject the worker prompt:

```
Task Call:
  subagent_type: general-purpose
  model: haiku
  prompt: |
    You are a stateless execution unit. Only do the assigned task and return structured JSON.

    Task: Scrape the following URLs and extract content

    URLs:
    - https://news.ycombinator.com (Extract Top 10)
    - https://huggingface.co/papers (Extract top voted papers)

    Output Format:
    {
      "status": "success" | "partial" | "failed",
      "data": [
        {
          "source_id": "hn",
          "title": "...",
          "summary": "...",
          "key_points": ["...", "...", "..."],
          "url": "...",
          "keywords": ["...", "..."],
          "quality_score": 4
        }
      ],
      "errors": [],
      "metadata": { "processed": 2, "failed": 0 }
    }

    Filter Criteria:
    - Keep: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
    - Exclude: General Science/Marketing Puff/Overly Academic/Job Posts

    Return JSON directly, no explanation.
```

### Using worker Agent (Requires session restart)

```
Task Call:
  subagent_type: worker
  prompt: |
    task: fetch_and_extract
    input:
      urls:
        - https://news.ycombinator.com
        - https://huggingface.co/papers
    output_schema:
      - source_id: string
      - title: string
      - summary: string
      - key_points: string[]
      - url: string
      - keywords: string[]
      - quality_score: 1-5
    constraints:
      filter: Cutting-edge Tech/Deep Tech/Productivity/Practical Info
      exclude: General Science/Marketing Puff/Overly Academic/Job Posts
```

## Output Template

```markdown
# Daily News Report (YYYY-MM-DD)

> Curated from N sources today, containing 20 high-quality items
> Generation Time: X min | Version: v3.0
>
> **Warning**: Sub-agent 'worker' not detected. Running in generic mode (Serial Execution). Performance might be degraded.

---

## 1. Title

- **Summary**: 2-4 sentence overview
- **Key Points**:
  1. Point one
  2. Point two
  3. Point three
- **Source**: [Link](URL)
- **Keywords**: `keyword1` `keyword2` `keyword3`
- **Score**: ⭐⭐⭐⭐⭐ (5/5)

---

## 2. Title
...

---

*Generated by Daily News Report v3.0*
*Sources: HN, HuggingFace, OneUsefulThing, ...*
```
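
For illustration, one item entry of the template could be rendered from a SubAgent result like this. The function name `render_item` is hypothetical; the field names follow the Phase 3 output schema, and the layout mirrors the template above.

```python
def render_item(index: int, item: dict) -> str:
    """Render one report entry in the layout of the output template."""
    # Numbered key points, indented under the "Key Points" bullet.
    points = "\n".join(f"  {n}. {p}" for n, p in enumerate(item["key_points"], 1))
    keywords = " ".join(f"`{k}`" for k in item["keywords"])
    score = item["quality_score"]
    return (
        f"## {index}. {item['title']}\n\n"
        f"- **Summary**: {item['summary']}\n"
        f"- **Key Points**:\n{points}\n"
        f"- **Source**: [Link]({item['url']})\n"
        f"- **Keywords**: {keywords}\n"
        f"- **Score**: {'⭐' * score} ({score}/5)\n"
    )
```

Joining the rendered entries with `---` separators between the header and the footer gives the full report body.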

## Constraints & Principles

1. **Quality over Quantity**: Low-quality content does not enter the report.
2. **Early Stop**: Stop scraping once 20 high-quality items are reached.
3. **Parallel First**: SubAgents in the same batch execute in parallel.
4. **Fault Tolerance**: Failure of a single source does not affect the whole process.
5. **Cache Reuse**: Avoid re-scraping the same content.
6. **Main Agent Control**: All decisions are made by the Main Agent.
7. **Fallback Awareness**: Detect sub-agent availability, gracefully degrade if unavailable.

## Expected Performance

| Scenario | Expected Time | Note |
|---|---|---|
| Optimal | ~2 mins | Tier1 sufficient, no browser needed |
| Normal | ~3-4 mins | Requires Tier2 supplement |
| Browser Needed | ~5-6 mins | Includes JS rendered pages |

## Error Handling

| Error Type | Handling |
|---|---|
| SubAgent Timeout | Log error, continue to next |
| Source 403/404 | Mark disabled, update sources.json |
| Extraction Failed | Return raw content, Main Agent decides |
| Browser Crash | Skip source, log entry |

## Compatibility & Fallback

To ensure usability across different Agent environments, the following checks must be performed:

1. **Environment Check**:
   - During Phase 1 initialization, attempt to detect whether the `worker` sub-agent exists.
   - If it does not exist (or the plugin is not installed), automatically switch to **Serial Execution Mode**.

2. **Serial Execution Mode**:
   - Do not use parallel dispatch.
   - Main Agent executes scraping tasks for each source sequentially.
   - Slower, but guarantees basic functionality.

3. **User Alert**:
   - MUST include a clear warning in the generated report header indicating the current degraded mode.
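
The fallback behaviour can be sketched as below. How the list of available sub-agents is obtained is not described here, so `available_agents` is a hypothetical input, as are the function names. The warning text is copied from the output template.

```python
from typing import List

# Warning line taken from the output template.
DEGRADED_WARNING = (
    "> **Warning**: Sub-agent 'worker' not detected. "
    "Running in generic mode (Serial Execution). Performance might be degraded."
)


def choose_mode(available_agents: List[str]) -> str:
    """Environment check: parallel only when the `worker` sub-agent is present."""
    return "parallel" if "worker" in available_agents else "serial"


def report_header(report_date: str, mode: str) -> str:
    """Build the report header, adding the warning in degraded (serial) mode."""
    lines = [f"# Daily News Report ({report_date})", ""]
    if mode == "serial":
        lines.append(DEGRADED_WARNING)
    return "\n".join(lines)
```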