feat(writing-skills): refactor skill to gold standard architecture

Antigravity Agent
2026-01-29 15:12:41 -03:00
parent f2c3e783b5
commit 10ca106b79
16 changed files with 1966 additions and 698 deletions


@@ -0,0 +1,255 @@
# Anti-Rationalization Guide
Techniques for bulletproofing skills against agent rationalization.
## The Problem
Discipline-enforcing skills (like TDD) face a unique challenge: smart agents under pressure will find loopholes.
**Example**: Skill says "Write test first". Agent under deadline thinks:
- "This is too simple to test"
- "I'll test after, same result"
- "It's the spirit that matters, not ritual"
## Psychology Foundation
Understanding WHY persuasion works helps apply it systematically.
**Research basis**: Cialdini (2021), Meincke et al. (2025)
**Core principles**:
- **Authority**: "The TDD community agrees..."
- **Commitment**: "You already said you follow TDD..."
- **Scarcity**: "Missing tests now = bugs later"
- **Social Proof**: "All tested code passing CI proves value"
- **Unity**: "We're engineers who value quality"
## Technique 1: Close Every Loophole Explicitly
Don't just state the rule - forbid specific workarounds.
### Bad Example
```markdown
Write code before test? Delete it.
```
### Good Example
```markdown
Write code before test? Delete it. Start over.
**No exceptions**:
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
```
**Why it works**: Agents try specific workarounds. Counter each explicitly.
## Technique 2: Address "Spirit vs Letter" Arguments
Add foundational principle early:
```markdown
**Violating the letter of the rules is violating the spirit of the rules.**
```
**Why it works**: Cuts off entire class of "I'm following the spirit" rationalizations.
## Technique 3: Build Rationalization Table
Capture excuses from baseline testing. Every rationalization goes in the table:
```markdown
| Excuse | Reality |
| -------------------------------- | ----------------------------------------------------------------------- |
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "It's about spirit not ritual" | The letter IS the spirit. TDD's value comes from the specific sequence. |
```
**Why it works**: Agents read table, recognize their own thinking, see the counter-argument.
## Technique 4: Create Red Flags List
Make it easy for agents to self-check when rationalizing:
```markdown
## Red Flags - STOP and Start Over
- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."
**All of these mean**: Delete code. Start over with TDD.
```
**Why it works**: Simple checklist, clear action (delete & restart).
## Technique 5: Update Description for Violation Symptoms
In the description, include the symptoms that appear when you're ABOUT to violate:
```yaml
# ❌ BAD: Only describes what skill does
description: TDD methodology for writing code
# ✅ GOOD: Includes pre-violation symptoms
description: Use when implementing any feature or bugfix, before writing implementation code
metadata:
  triggers: new feature, bug fix, code change
```
**Why it works**: Triggers skill BEFORE violation, not after.
## Technique 6: Use Strong Language
Weak language invites rationalization:
```markdown
# Weak
You should write tests first.
Generally, test before code.
It's better to test first.
# Strong
ALWAYS write test first.
NEVER write code before test.
Test-first is MANDATORY.
```
**Why it works**: No ambiguity, no wiggle room.
## Technique 7: Invoke Commitment & Consistency
Reference agent's own standards:
```markdown
You claimed to follow TDD.
TDD means test-first.
Code-first is NOT TDD.
**Either**:
- Follow TDD (test-first), or
- Admit you're not doing TDD
Don't redefine TDD to fit what you already did.
```
**Why it works**: Agents resist cognitive dissonance (Festinger, 1957).
## Technique 8: Provide Escape Hatch for Legitimate Cases
If there ARE valid exceptions, state them explicitly:
```markdown
## When NOT to Use TDD
- Spike solutions (throwaway exploratory code)
- One-time scripts deleting in 1 hour
- Generated boilerplate (verified via other means)
**Everything else**: Use TDD. No exceptions.
```
**Why it works**: Removes "but this is different" argument for non-exception cases.
## Complete Bulletproofing Checklist
For discipline-enforcing skills:
**Loophole Closing**:
- [ ] Forbidden each specific workaround explicitly?
- [ ] Added "spirit vs letter" principle?
- [ ] Built rationalization table from baseline tests?
- [ ] Created red flags list?
**Strength**:
- [ ] Used strong language (ALWAYS/NEVER)?
- [ ] Invoked commitment & consistency?
- [ ] Provided explicit escape hatch?
**Discovery**:
- [ ] Description includes pre-violation symptoms?
- [ ] Keywords target moment BEFORE violation?
**Testing**:
- [ ] Tested with combined pressures?
- [ ] Agent complied under maximum pressure?
- [ ] No new rationalizations found?
## Real-World Example: TDD Skill
### Baseline Rationalizations Found
1. "Too simple to test"
2. "I'll test after"
3. "Spirit not ritual"
4. "Already manually tested"
5. "This is different because..."
### Counters Applied
**Rationalization table**:
```markdown
| Excuse | Reality |
| -------------------- | -------------------------------------------------------------- |
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Spirit not ritual" | The letter IS the spirit. TDD's value comes from the sequence. |
| "Manually tested" | Manual tests don't run automatically. They rot. |
```
**Red flags**:
```markdown
## Red Flags - STOP
- Code before test
- "I already tested manually"
- "Spirit not ritual"
- "This is different..."
All mean: Delete code. Start over.
```
**Result**: Agent compliance under combined time + sunk cost pressure.
## Common Mistakes
| Mistake | Fix |
| -------------------------------------- | -------------------------------------------------------------- |
| Trust agents will "get the spirit" | Close explicit loopholes. Agents are smart at rationalization. |
| Use weak language ("should", "better") | Use ALWAYS/NEVER for discipline rules. |
| Skip rationalization table | Every excuse needs explicit counter. |
| No red flags list | Make self-checking easy. |
| Generic description | Add pre-violation symptoms to trigger skill earlier. |
## Meta-Strategy
**For each new rationalization**:
1. Document it verbatim (from failed test)
2. Add to rationalization table
3. Update red flags list
4. Re-test
**Iterate until**: Agent can't find ANY rationalization that works.
That's bulletproof.


@@ -0,0 +1,268 @@
# CSO Guide - Claude Search Optimization
Advanced techniques for making skills discoverable by agents.
## The Discovery Problem
You have 100+ skills. Agent receives a task. How does it find the RIGHT skill?
**Answer**: The `description` field.
## Critical Rule: Description = Triggers, NOT Workflow
### The Trap
When description summarizes workflow, agents take a shortcut.
**Real example that failed**:
```yaml
# Agent did ONE review instead of TWO
description: Code review between tasks
# Skill body had flowchart showing TWO reviews:
# 1. Spec compliance
# 2. Code quality
```
**Why it failed**: Agent read description, thought "code review between tasks means one review", never read the flowchart.
**Fix**:
```yaml
# Agent now reads full skill and follows flowchart
description: Use when executing implementation plans with independent tasks
```
### The Pattern
```yaml
# ❌ BAD: Workflow summary
description: Analyzes git diff, generates commit message in conventional format
# ✅ GOOD: Trigger conditions only
description: Use when generating commit messages or reviewing staged changes
```
## Token Efficiency Critical for Skills
**Problem**: Frequently-loaded skills consume tokens in EVERY conversation.
**Target word counts**:
- Frequently-loaded skills: <200 words total
- Other skills: <500 words
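These budgets can be checked mechanically before deploying. A minimal sketch (the helper names are hypothetical, not part of any skill loader) that strips fenced code blocks so examples don't inflate the count:

```python
import re

def word_count(skill_md: str) -> int:
    """Count prose words in a SKILL.md, excluding fenced code blocks."""
    prose = re.sub(r"```.*?```", "", skill_md, flags=re.DOTALL)
    return len(prose.split())

def within_budget(skill_md: str, frequently_loaded: bool) -> bool:
    """Apply the targets above: 200 words if frequently loaded, 500 otherwise."""
    limit = 200 if frequently_loaded else 500
    return word_count(skill_md) <= limit
```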
### Techniques
**1. Move details to tool help**:
```bash
# ❌ BAD: Document all flags in SKILL.md
search-conversations supports --text, --both, --after DATE, --before DATE, --limit N
# ✅ GOOD: Reference --help
search-conversations supports multiple modes. Run --help for details.
```
**2. Use cross-references**:
```markdown
# ❌ BAD: Repeat workflow
When searching, dispatch agent with template...
[20 lines of repeated instructions]
# ✅ GOOD: Reference other skill
Use subagents for searches. See [delegating-to-subagents] for workflow.
```
**3. Compress examples**:
```markdown
# ❌ BAD: Verbose (42 words)
Partner: "How did we handle auth errors in React Router?"
You: I'll search past conversations for patterns.
[Dispatch subagent with query: "React Router authentication error handling 401"]
# ✅ GOOD: Minimal (20 words)
Partner: "Auth errors in React Router?"
You: Searching...
[Dispatch subagent → synthesis]
```
## Keyword Strategy
### Error Messages
Include EXACT error text users will see:
- "Hook timed out after 5000ms"
- "ENOTEMPTY: directory not empty"
- "jest --watch is not responding"
### Symptoms
Use words users naturally say:
- "flaky", "hangs", "zombie process"
- "slow", "timeout", "race condition"
- "cleanup failed", "pollution"
### Tools & Commands
Actual names, not descriptions:
- "pytest", not "Python testing"
- "git rebase", not "rebasing"
- ".docx files", not "Word documents"
### Synonyms
Cover multiple ways to describe same thing:
- timeout/hang/freeze
- cleanup/teardown/afterEach
- mock/stub/fake
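Actual discovery happens via the description injected into the system prompt, but synonym coverage can still be sanity-checked offline. A hypothetical sketch that scores skills by how many of their trigger keywords appear in a task:

```python
def matching_skills(task: str, skills: dict[str, list[str]]) -> list[str]:
    """Rank skill names by how many trigger keywords appear in the task text."""
    task_lower = task.lower()
    scores = {
        name: sum(1 for trigger in triggers if trigger.lower() in task_lower)
        for name, triggers in skills.items()
    }
    # Keep only skills with at least one hit, best match first.
    return sorted((n for n, s in scores.items() if s > 0), key=lambda n: -scores[n])
```

If the skill only lists "timeout" but the user says "hangs", the score is zero — which is exactly why the synonym list above matters.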
## Naming Conventions
### Gerunds (-ing) for Processes
`creating-skills`, `debugging-with-logs`, `testing-async-code`
### Verb-first for Actions
`flatten-with-flags`, `reduce-complexity`, `trace-root-cause`
### ❌ Avoid
- `skill-creation` (passive, less searchable)
- `async-test-helpers` (too generic)
- `debugging-techniques` (vague)
## Description Template
```yaml
description: "Use when [SPECIFIC TRIGGER]."
metadata:
  triggers: [error1], [symptom2], [tool3]
```
**Examples**:
```yaml
# Technique skill
description: "Use when tests have race conditions, timing dependencies, or pass/fail inconsistently."
metadata:
  triggers: flaky tests, timeout, race condition
# Pattern skill
description: "Use when complex data structures make code hard to follow."
metadata:
  triggers: nested loops, multiple flags, confusing state
# Reference skill
description: "Use when working with React Router and authentication."
metadata:
  triggers: 401 redirect, login flow, protected routes
# Discipline skill
description: "Use when implementing any feature or bugfix, before writing implementation code."
metadata:
  triggers: new feature, bug fix, code change
```
## Third Person Rule
Description is injected into system prompt. Inconsistent POV breaks discovery.
```yaml
# ❌ BAD: First person
description: "I can help you with async tests"
# ❌ BAD: Second person
description: "You can use this for race conditions"
# ✅ GOOD: Third person
description: "Use when async tests have race conditions"
```
## Cross-Referencing Other Skills
**When documenting a skill that references other skills**:
Use skill name only, with explicit requirement markers:
```markdown
# ✅ GOOD: Clear requirement
**REQUIRED BACKGROUND**: You MUST understand test-driven-development before using this skill.
**REQUIRED SUB-SKILL**: Use defensive-programming for error handling.
# ❌ BAD: Unclear if required
See test-driven-development skill for context.
# ❌ NEVER: Force-loads (burns context)
@skills/testing/test-driven-development/SKILL.md
```
**Why no @ links**: `@` syntax force-loads files immediately, consuming tokens before needed.
## Verification Checklist
Before deploying:
- [ ] Description starts with "Use when..."?
- [ ] Description is <500 characters?
- [ ] Description lists ONLY triggers, not workflow?
- [ ] Includes 3+ keywords (errors/symptoms/tools)?
- [ ] Third person throughout?
- [ ] Name uses gerund or verb-first format?
- [ ] Name has only letters, numbers, hyphens?
- [ ] No @ syntax for cross-references?
- [ ] Word count <200 (frequent) or <500 (other)?
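Most of this checklist can be linted automatically. A minimal, hypothetical linter sketch covering the description rules (the third-person check is necessarily crude):

```python
def lint_description(description: str) -> list[str]:
    """Return CSO problems found in a skill description; empty list if none."""
    problems = []
    if not description.startswith("Use when"):
        problems.append('does not start with "Use when..."')
    if len(description) > 500:
        problems.append("exceeds 500 characters")
    # Crude first/second-person detection: look for bare "i" or "you" tokens.
    words = description.lower().split()
    if "i" in words or "you" in words:
        problems.append("not written in third person")
    return problems
```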
## Real-World Examples
### Before/After: TDD Skill
**Before** (workflow in description):
```yaml
description: Write test first, watch it fail, write minimal code, refactor
```
Result: Agents followed description, skipped reading full skill.
**After** (triggers only):
```yaml
description: Use when implementing any feature or bugfix, before writing implementation code
```
Result: Agents read full skill, followed complete TDD cycle.
### Before/After: BigQuery Skill
**Before** (too vague):
```yaml
description: Helps with database queries
```
Result: Never loaded (too generic, agents couldn't identify relevance).
**After** (specific triggers):
```yaml
description: Use when analyzing BigQuery data. Triggers: revenue metrics, pipeline data, API usage, campaign attribution.
```
Result: Loads for relevant queries, includes domain keywords.


@@ -0,0 +1,152 @@
---
description: Standards and naming rules for creating agent skills.
metadata:
  tags: [standards, naming, yaml, structure]
---
# Skill Development Guide
Comprehensive reference for creating effective agent skills.
## Directory Structure
```
~/.config/opencode/skills/
  {skill-name}/           # kebab-case, matches `name` field
    SKILL.md              # Required: main skill definition
    references/           # Optional: supporting documentation
      README.md           # Sub-topic entry point
      *.md                # Additional files
```
**Project-local alternative:**
```
.agent/skills/{skill-name}/SKILL.md
```
## Naming Rules
| Element | Rule | Example |
|---------|------|---------|
| Directory | kebab-case, 1-64 chars | `react-best-practices` |
| `SKILL.md` | ALL CAPS, exact filename | `SKILL.md` (not `skill.md`) |
| `name` field | Must match directory name | `name: react-best-practices` |
## SKILL.md Structure
```markdown
---
name: {skill-name}
description: >-
  Use when [trigger condition].
metadata:
  category: technique
  triggers: keyword1, keyword2, error-text
---
# Skill Title
Brief description of what this skill does.
## When to Use
- Symptom or situation A
- Symptom or situation B
## How It Works
Step-by-step instructions or reference content.
## Examples
Concrete usage examples.
## Common Mistakes
What to avoid and why.
```
## Description Best Practices
The `description` field is critical for skill discovery:
```yaml
# ❌ BAD: Workflow summary (agent skips reading full skill)
description: Analyzes code, finds bugs, suggests fixes
# ✅ GOOD: Trigger conditions only
description: Use when debugging errors or reviewing code quality.
metadata:
  triggers: bug, error, code review
```
**Rules:**
- Start with "Use when..."
- Put triggers under `metadata.triggers`
- Keep under 500 characters
- Use third person (not "I" or "You")
## Context Efficiency
Skills load into context on-demand. Optimize for token usage:
| Guideline | Reason |
|-----------|--------|
| Keep SKILL.md < 500 lines | Reduces context consumption |
| Put details in supporting files | Agent reads only what's needed |
| Use tables for reference data | More compact than prose |
| Link to `--help` for CLI tools | Avoids duplicating docs |
## Supporting Files
For complex skills, use additional files:
```
my-skill/
  SKILL.md              # Overview + navigation
  patterns.md           # Detailed patterns
  examples.md           # Code examples
  troubleshooting.md    # Common issues
```
**Supporting file frontmatter is required** (for any `.md` besides `SKILL.md`):
```markdown
---
description: >-
  Short summary used for search and retrieval.
metadata:
  tags: [pattern, troubleshooting, api]
  source: internal
---
```
This frontmatter helps the LLM locate the right file when referenced from `SKILL.md`.
Reference from SKILL.md:
```markdown
## Detailed Reference
- [Patterns](patterns.md) - Common usage patterns
- [Examples](examples.md) - Code samples
```
## Skill Types
| Type | Purpose | Example |
|------|---------|---------|
| **Reference** | Documentation, APIs | `bigquery-analysis` |
| **Technique** | How-to guides | `condition-based-waiting` |
| **Pattern** | Mental models | `flatten-with-flags` |
| **Discipline** | Rules to enforce | `test-driven-development` |
## Verification Checklist
Before deploying:
- [ ] `name` matches directory name?
- [ ] `SKILL.md` is ALL CAPS?
- [ ] Description starts with "Use when..."?
- [ ] Triggers listed under metadata?
- [ ] Under 500 lines?
- [ ] Tested with real scenarios?


@@ -0,0 +1,65 @@
# SKILL.md Metadata Standard
Official frontmatter fields recognized by OpenCode.
## Required Fields
```yaml
---
name: skill-name
description: >-
  Use when [trigger condition].
metadata:
  triggers: keyword1, keyword2, error-message
---
```
| Field | Rules |
|-------|-------|
| `name` | 1-64 chars, lowercase, hyphens only, must match directory name |
| `description` | 1-1024 chars, should describe when to use |
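For validation purposes these fields can be pulled out without a full YAML library. A deliberately minimal, stdlib-only sketch — it handles flat `key: value` lines only, not nested maps or `>-` block scalars:

```python
def parse_frontmatter(skill_md: str) -> dict:
    """Extract flat key: value pairs from the frontmatter block of a SKILL.md."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at all
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields
```

A real skill loader presumably uses a proper YAML parser; this sketch is only enough to check that `name` and `description` exist.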
## Optional Fields
```yaml
---
name: skill-name
description: Purpose and triggers.
metadata:
  license: MIT
  compatibility: opencode
  author: "your-name"
  version: "1.0.0"
  category: "reference"
  tags: "tag1, tag2"
---
```
| Field | Purpose |
|-------|---------|
| `license` | License identifier (e.g., MIT, Apache-2.0) |
| `compatibility` | Tool compatibility marker |
| `metadata` | String-to-string map for custom key-values |
## Name Validation
```regex
^[a-z0-9]+(-[a-z0-9]+)*$
```
**Valid**: `my-skill`, `git-release`, `tdd`
**Invalid**: `My-Skill`, `my_skill`, `-my-skill`, `my--skill`
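The rule is easy to enforce in code. A small sketch combining the regex with the 1-64 length limit from the table above (the helper name is illustrative):

```python
import re

# Same pattern as above: lowercase kebab-case, no leading/trailing or doubled hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_skill_name(name: str) -> bool:
    """Check a skill name: kebab-case and at most 64 characters."""
    return len(name) <= 64 and bool(NAME_PATTERN.fullmatch(name))
```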
## Common Metadata Keys
Use these conventions for consistency across skills:
| Key | Example | Purpose |
|-----|---------|---------|
| `author` | `"your-name"` | Skill creator |
| `version` | `"1.0.0"` | Semantic version |
| `category` | `"reference"` | Type: reference, technique, discipline, pattern |
| `tags` | `"react, hooks"` | Searchable keywords |
> [!IMPORTANT]
> Any field not listed here is **ignored** by OpenCode's skill loader.


@@ -0,0 +1,54 @@
---
name: discipline-name
description: >-
  Use when [BEFORE violation].
metadata:
  category: discipline
  triggers: new feature, code change, implementation
---
# Rule Name
## Iron Law
**[SINGLE SENTENCE ABSOLUTE RULE]**
Violating the letter IS violating the spirit.
## The Rule
1. ALWAYS [step 1]
2. NEVER [step 2]
3. [Step 3]
## Violations
[Action before rule]? **Delete it. Start over.**
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it
- Delete means delete
## Common Rationalizations
| Excuse | Reality |
|--------|---------|
| "Too simple" | Simple code breaks. Rule takes 30 seconds. |
| "I'll do it after" | After = never. Do it now. |
| "Spirit not ritual" | The ritual IS the spirit. |
## Red Flags - STOP
- [Flag 1]
- [Flag 2]
- "This is different because..."
**All mean:** Delete. Start over.
## Valid Exceptions
- [Exception 1]
- [Exception 2]
**Everything else:** Follow the rule.


@@ -0,0 +1,48 @@
---
name: pattern-name
description: >-
  Use when [recognizable symptom].
metadata:
  category: pattern
  triggers: complexity, hard-to-follow, nested
---
# Pattern Name
## The Pattern
[1-2 sentence core idea]
## Recognition Signs
- [Sign that pattern applies]
- [Another sign]
- [Code smell]
## Before
```typescript
// Complex/problematic
function before() {
  // nested, confusing
}
```
## After
```typescript
// Clean/improved
function after() {
  // flat, clear
}
```
## When NOT to Use
- [Over-engineering case]
- [Simple case that doesn't need it]
## Impact
**Before:** [Problem metric]
**After:** [Improved metric]


@@ -0,0 +1,35 @@
---
name: reference-name
description: >-
  Use when working with [domain].
metadata:
  category: reference
  triggers: tool, api, specific-terms
---
# Reference Name
## Quick Reference
| Command | Purpose |
|---------|---------|
| `cmd1` | Does X |
| `cmd2` | Does Y |
## Common Patterns
**Pattern A:**
```bash
example command
```
**Pattern B:**
```bash
another example
```
## Detailed Docs
For more options, run `--help` or see:
- [patterns.md](patterns.md)
- [examples.md](examples.md)


@@ -0,0 +1,59 @@
---
name: technique-name
description: Use when [specific symptom].
metadata:
  category: technique
  triggers: error-text, symptom, tool-name
---
# Technique Name
## Overview
[1-2 sentence core principle]
## When to Use
- [Symptom A]
- [Symptom B]
- [Error message text]
**NOT for:**
- [When to avoid]
## The Problem
```javascript
// Bad example
function badCode() {
  // problematic pattern
}
```
## The Solution
```javascript
// Good example
function goodCode() {
  // improved pattern
}
```
## Step-by-Step
1. [First step]
2. [Second step]
3. [Final step]
## Quick Reference
| Scenario | Approach |
|----------|----------|
| Case A | Solution A |
| Case B | Solution B |
## Common Mistakes
**Mistake 1:** [Description]
- Wrong: `bad code`
- Right: `good code`


@@ -0,0 +1,19 @@
# Platform Name Skill
Template for complex Tier 3 skills.
## Structure
```
skill/
├── SKILL.md # Dispatcher
├── commands/
│ └── skill.md # Orchestrator
└── references/
└── topic/
├── README.md # Overview
├── api.md # API Reference
├── config.md # Configuration
├── patterns.md # Recipes
└── gotchas.md # Critical Errors
```


@@ -0,0 +1,204 @@
# Testing Guide - TDD for Skills
Complete methodology for testing skills using RED-GREEN-REFACTOR cycle.
## Testing All Skill Types
Different skill types need different test approaches.
### Discipline-Enforcing Skills (rules/requirements)
**Examples**: TDD, verification-before-completion, designing-before-coding
**Test with**:
- Academic questions: Do they understand the rules?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined: time + sunk cost + exhaustion
- Identify rationalizations and add explicit counters
**Success criteria**: Agent follows rule under maximum pressure
### Technique Skills (how-to guides)
**Examples**: condition-based-waiting, root-cause-tracing, defensive-programming
**Test with**:
- Application scenarios: Can they apply the technique correctly?
- Variation scenarios: Do they handle edge cases?
- Missing information tests: Do instructions have gaps?
**Success criteria**: Agent successfully applies technique to new scenario
### Pattern Skills (mental models)
**Examples**: reducing-complexity, information-hiding concepts
**Test with**:
- Recognition scenarios: Do they recognize when pattern applies?
- Application scenarios: Can they use the mental model?
- Counter-examples: Do they know when NOT to apply?
**Success criteria**: Agent correctly identifies when/how to apply pattern
### Reference Skills (documentation/APIs)
**Examples**: API documentation, command references, library guides
**Test with**:
- Retrieval scenarios: Can they find the right information?
- Application scenarios: Can they use what they found correctly?
- Gap testing: Are common use cases covered?
**Success criteria**: Agent finds and correctly applies reference information
## Pressure Types for Testing
### Time Pressure
"You have 5 minutes to complete this task"
### Sunk Cost Pressure
"You already spent 2 hours on this, just finish it quickly"
### Authority Pressure
"The senior developer said to skip tests for this quick bug fix"
### Exhaustion Pressure
"This is the 10th task today, let's wrap it up"
## RED Phase: Baseline Testing
**Goal**: Watch the agent fail WITHOUT the skill.
**Steps**:
1. Design pressure scenario (combine 2-3 pressures)
2. Give agent the task WITHOUT the skill loaded
3. Document EXACT behavior:
- What rationalization did they use?
- Which pressure triggered the violation?
- How did they justify the shortcut?
**Critical**: Copy exact quotes. You'll need them for GREEN phase.
**Example Baseline**:
```
Scenario: Implement feature under time pressure
Pressure: "You have 10 minutes"
Agent response: "Since we're short on time, I'll implement the feature first
and add tests after. Testing later achieves the same goal."
```
## GREEN Phase: Minimal Implementation
**Goal**: Write skill that addresses SPECIFIC baseline failures.
**Steps**:
1. Review baseline rationalizations
2. Write skill sections that counter THOSE EXACT arguments
3. Re-run scenario WITH skill
4. Agent should now comply
**Bad (too general)**:
```markdown
## Testing
Always write tests.
```
**Good (addresses specific rationalization)**:
```markdown
## Common Rationalizations
| Excuse | Reality |
| ----------------------------------- | ----------------------------------------------------------------------- |
| "Testing after achieves same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
```
## REFACTOR Phase: Loophole Closing
**Goal**: Find and plug new rationalizations.
**Steps**:
1. Agent found new workaround? Document it.
2. Add explicit counter to skill
3. Re-test same scenario
4. Repeat until bulletproof
**Pattern**:
```markdown
## Red Flags - STOP and Start Over
- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."
**All of these mean**: Delete code. Start over with TDD.
```
## Complete Test Checklist
Before deploying a skill:
**Baseline (RED)**:
- [ ] Designed 3+ pressure scenarios
- [ ] Ran scenarios WITHOUT skill
- [ ] Documented verbatim agent responses
- [ ] Identified pattern in rationalizations
**Implementation (GREEN)**:
- [ ] Skill addresses SPECIFIC baseline failures
- [ ] Re-ran scenarios WITH skill
- [ ] Agent complied in all scenarios
- [ ] No hand-waving or generic advice
**Bulletproofing (REFACTOR)**:
- [ ] Tested with combined pressures
- [ ] Found and documented new rationalizations
- [ ] Added explicit counters
- [ ] Re-tested until no more loopholes
- [ ] Created "Red Flags" section
## Common Testing Mistakes
| Mistake | Fix |
| ------------------------------ | --------------------------------------------------------- |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
## Meta-Testing
**Test the test**: If agent passes too easily, your test is weak.
**Good test indicators**:
- Agent fails WITHOUT skill (proves skill is needed)
- Agent passes WITH skill (proves skill works)
- Multiple pressures needed to trigger failure (proves realistic)
**Bad test indicators**:
- Agent passes even without skill (test is irrelevant)
- Agent fails even with skill (skill is unclear)
- Single obvious scenario (test is too simple)


@@ -0,0 +1,75 @@
---
description: When to use Tier 1 (Simple) skill architecture.
metadata:
  tags: [tier-1, simple, single-file]
---
# Tier 1: Simple Skills
Single-file skills for focused, specific purposes.
## When to Use
- **Single concept**: One technique, one pattern, one reference
- **Under 200 lines**: Can fit comfortably in one file
- **No complex decision logic**: User knows exactly what they need
- **Frequently loaded**: Needs minimal token footprint
## Structure
```
my-skill/
└── SKILL.md # Everything in one file
```
## Example
````markdown
---
name: flatten-with-flags
description: Use when simplifying deeply nested conditionals.
metadata:
  category: pattern
  triggers: nested if, complex conditionals, early return
---
# Flatten with Flags
## When to Use
- Code has 3+ levels of nesting
- Conditions are hard to follow
## The Pattern
Replace nested conditions with early returns and flag variables.
## Before
```javascript
function process(data) {
  if (data) {
    if (data.valid) {
      if (data.ready) {
        return doWork(data);
      }
    }
  }
  return null;
}
```
## After
```javascript
function process(data) {
  if (!data) return null;
  if (!data.valid) return null;
  if (!data.ready) return null;
  return doWork(data);
}
```
````
## Checklist
- [ ] Fits in <200 lines
- [ ] Single focused purpose
- [ ] No need for `references/` directory
- [ ] Description uses "Use when..." pattern


@@ -0,0 +1,69 @@
---
description: When to use Tier 2 (Expanded) skill architecture.
metadata:
  tags: [tier-2, expanded, multi-file]
---
# Tier 2: Expanded Skills
Multi-file skills for complex topics with multiple sub-concepts.
## When to Use
- **Multiple related concepts**: Needs separation of concerns
- **200-1000 lines total**: Too big for one file
- **Needs reference files**: Patterns, examples, troubleshooting
- **Cross-linking**: Users need to navigate between sub-topics
## Structure
```
my-skill/
├── SKILL.md # Overview + navigation
└── references/
├── core/
│ ├── README.md # Main concept
│ └── api.md # API reference
├── patterns/
│ └── README.md # Usage patterns
└── troubleshooting/
└── README.md # Common issues
```
## Example
The `writing-skills` skill itself is Tier 2:
```
writing-skills/
├── SKILL.md # Decision tree + navigation
├── gotchas.md # Tribal knowledge
└── references/
├── anti-rationalization/
├── cso/
├── standards/
├── templates/
└── testing/
```
## Progressive Disclosure
1. **Metadata** (~100 tokens): Name + description loaded at startup
2. **SKILL.md** (<500 lines): Decision tree + index
3. **References** (as needed): Loaded only when user navigates
## Key Differences from Tier 1
| Aspect | Tier 1 | Tier 2 |
|--------|--------|--------|
| Files | 1 | 5-20 |
| Total lines | <200 | 200-1000 |
| Decision logic | None | Simple tree |
| Token cost | Minimal | Medium (progressive) |
## Checklist
- [ ] SKILL.md has clear navigation links
- [ ] Each `references/` subdir has README.md
- [ ] No circular references between files
- [ ] Decision tree points to specific files


@@ -0,0 +1,98 @@
---
description: When to use Tier 3 (Platform) skill architecture for large platforms.
metadata:
  tags: [tier-3, platform, enterprise, cloudflare-pattern]
---
# Tier 3: Platform Skills
Enterprise-grade skills for entire platforms (AWS, Cloudflare, Convex, etc).
## When to Use
- **Entire platform**: 10+ products/services
- **1000+ lines total**: Would overwhelm context if monolithic
- **Complex decision logic**: Users start with "I need X" not "I want product Y"
- **Undocumented gotchas**: Tribal knowledge is critical
## The Cloudflare Pattern
Based on `cloudflare-skill` by Dillon Mulroy.
### Structure
```
my-platform/
├── SKILL.md # Decision trees only
└── references/
└── <product>/
├── README.md # Overview, when to use
├── api.md # Runtime API reference
├── configuration.md # Config options
├── patterns.md # Usage patterns
└── gotchas.md # Pitfalls, limits
```
### The 5-File Pattern
Each product directory has exactly 5 files:
| File | Purpose | When to Load |
|------|---------|--------------|
| `README.md` | Overview, when to use | Always first |
| `api.md` | Runtime APIs, methods | Implementing features |
| `configuration.md` | Config, environment | Setting up |
| `patterns.md` | Common workflows | Best practices |
| `gotchas.md` | Pitfalls, limits | Debugging |
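A convention this rigid is worth verifying in CI. A hypothetical sketch that reports which of the five files a product directory is missing:

```python
from pathlib import Path

# The 5-file pattern described above.
EXPECTED_FILES = {"README.md", "api.md", "configuration.md", "patterns.md", "gotchas.md"}

def missing_product_files(product_dir: Path) -> set[str]:
    """Return the expected reference files absent from a product directory."""
    present = {p.name for p in product_dir.iterdir() if p.is_file()}
    return EXPECTED_FILES - present
```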
## Decision Trees
The power of Tier 3 is decision trees that help the AI **choose**:
```markdown
Need to store data?
├─ Simple key-value → kv/
├─ Relational queries → d1/
├─ Large files/blobs → r2/
├─ Per-user state → durable-objects/
└─ Vector embeddings → vectorize/
```
## Slash Command Integration
Create a slash command to orchestrate:
```markdown
---
description: Load platform skill and get contextual guidance
---
## Workflow
1. Load skill: `skill({ name: 'my-platform' })`
2. Identify product from decision tree
3. Load relevant reference files based on task
| Task | Files |
|------|-------|
| New setup | README.md + configuration.md |
| Implement feature | api.md + patterns.md |
| Debug issue | gotchas.md |
```
## Progressive Disclosure in Action
- **Startup**: Only name + description (~100 tokens)
- **Activation**: SKILL.md with trees (<5000 tokens)
- **Navigation**: One product's 5 files (as needed)
Result: 60+ product references without blowing context.
## Checklist
- [ ] SKILL.md contains ONLY decision trees + index
- [ ] Each product has exactly 5 files
- [ ] Decision trees cover all "I need X" scenarios
- [ ] Cross-references stay one level deep
- [ ] Slash command created for orchestration
- [ ] Every product has `gotchas.md`