Fix: Ensure all skills are tracked as files, not submodules

2026-01-14 18:48:48 +01:00
parent 7f46ed8ca1
commit 8bd204708b
1113 changed files with 82065 additions and 2 deletions
--- a/skills/loki-mode/references/advanced-patterns.md
+++ b/skills/loki-mode/references/advanced-patterns.md
@@ -0,0 +1,453 @@
+# Advanced Agentic Patterns Reference
+
+Research-backed patterns from 2025-2026 literature for enhanced multi-agent orchestration.
+
+---
+
+## Memory Architecture (MIRIX/A-Mem/MemGPT Research)
+
+### Three-Layer Memory System
+
+```
+------------------------------------------------------------------+
+| EPISODIC MEMORY (Specific Events)                                 |
+| - What happened, when, where                                      |
+| - Full interaction traces with timestamps                         |
+| - Stored in: .loki/memory/episodic/                              |
+------------------------------------------------------------------+
+| SEMANTIC MEMORY (Generalized Knowledge)                           |
+| - Abstracted patterns and facts                                   |
+| - Context-independent knowledge                                   |
+| - Stored in: .loki/memory/semantic/                              |
+------------------------------------------------------------------+
+| PROCEDURAL MEMORY (Learned Skills)                                |
+| - How to do things                                                |
+| - Successful action sequences                                     |
+| - Stored in: .loki/memory/skills/                                |
+------------------------------------------------------------------+
+```
+
+### Episodic-to-Semantic Consolidation
+
+**Protocol:** After completing tasks, consolidate specific experiences into general knowledge.
+
+```python
+def consolidate_memory(task_result):
+    """
+    Transform episodic (what happened) to semantic (how things work).
+    Based on MemGPT and Voyager patterns.
+    """
+    # 1. Store raw episodic trace
+    episodic_entry = {
+        "timestamp": now(),
+        "task_id": task_result.id,
+        "context": task_result.context,
+        "actions": task_result.action_log,
+        "outcome": task_result.outcome,
+        "errors": task_result.errors
+    }
+    save_to_episodic(episodic_entry)
+
+    # 2. Extract generalizable patterns
+    if task_result.success:
+        pattern = extract_pattern(task_result)
+        if pattern.is_generalizable():
+            semantic_entry = {
+                "pattern": pattern.description,
+                "conditions": pattern.when_to_apply,
+                "actions": pattern.steps,
+                "confidence": pattern.success_rate,
+                "source_episodes": [task_result.id]
+            }
+            save_to_semantic(semantic_entry)
+
+    # 3. If error, create anti-pattern
+    if task_result.errors:
+        anti_pattern = {
+            "what_failed": task_result.errors[0].message,
+            "why_failed": analyze_root_cause(task_result),
+            "prevention": generate_prevention_rule(task_result),
+            "severity": classify_severity(task_result.errors)
+        }
+        save_to_learnings(anti_pattern)
+```
+
+### Zettelkasten-Inspired Note Linking (A-Mem Pattern)
+
+Each memory note is atomic and linked to related notes:
+
+```json
+{
+  "id": "note-2026-01-06-001",
+  "content": "Express route handlers need explicit return types in strict mode",
+  "type": "semantic",
+  "links": [
+    {"to": "note-2026-01-05-042", "relation": "derived_from"},
+    {"to": "note-2026-01-06-003", "relation": "related_to"}
+  ],
+  "tags": ["typescript", "express", "strict-mode"],
+  "confidence": 0.95,
+  "usage_count": 12
+}
+```
+
+---
+
+## Multi-Agent Reflexion (MAR Pattern)
+
+### Problem: Degeneration-of-Thought
+
+Single-agent self-critique leads to repeating the same flawed reasoning across iterations.
+
+### Solution: Structured Debate Among Persona-Based Critics
+
+```
+------------------+     +------------------+     +------------------+
+| IMPLEMENTER      |     | SKEPTIC          |     | ADVOCATE         |
+| (Creates work)   | --> | (Challenges it)  | --> | (Defends merits) |
+------------------+     +------------------+     +------------------+
+        |                        |                        |
+        v                        v                        v
+------------------------------------------------------------------+
+| SYNTHESIZER                                                       |
+| - Weighs all perspectives                                         |
+| - Identifies valid concerns vs. false negatives                   |
+| - Produces final verdict with evidence                            |
+------------------------------------------------------------------+
+```
+
+### Anti-Sycophancy Protocol (CONSENSAGENT)
+
+**Problem:** Agents reinforce each other's responses instead of critically engaging.
+
+**Solution:**
+
+```python
+def anti_sycophancy_review(implementation, reviewers):
+    """
+    Prevent reviewers from just agreeing with each other.
+    Based on CONSENSAGENT research.
+    """
+    # 1. Independent review phase (no visibility of other reviews)
+    independent_reviews = []
+    for reviewer in reviewers:
+        review = reviewer.review(
+            implementation,
+            visibility="blind",  # Cannot see other reviews
+            prompt_suffix="Be skeptical. List specific concerns."
+        )
+        independent_reviews.append(review)
+
+    # 2. Debate phase (now reveal reviews)
+    if has_disagreement(independent_reviews):
+        debate_result = structured_debate(
+            reviews=independent_reviews,
+            max_rounds=2,
+            require_evidence=True  # Must cite specific code/lines
+        )
+    else:
+        # All agreed - run devil's advocate check
+        devil_review = devil_advocate_agent.review(
+            implementation,
+            prompt="Find problems the other reviewers missed. Be contrarian."
+        )
+        independent_reviews.append(devil_review)
+
+    # 3. Synthesize with validity check
+    return synthesize_with_validity_alignment(independent_reviews)
+
+def synthesize_with_validity_alignment(reviews):
+    """
+    Research shows validity-aligned reasoning most strongly predicts improvement.
+    """
+    findings = []
+    for review in reviews:
+        for concern in review.concerns:
+            findings.append({
+                "concern": concern.description,
+                "evidence": concern.code_reference,  # Must have evidence
+                "severity": concern.severity,
+                "is_valid": verify_concern_is_actionable(concern)
+            })
+
+    # Filter to only valid, evidenced concerns
+    return [f for f in findings if f["is_valid"] and f["evidence"]]
+```
+
+### Heterogeneous Team Composition
+
+**Research finding:** Diverse teams outperform homogeneous ones by 4-6%.
+
+```yaml
+review_team:
+  - role: "security_analyst"
+    model: opus
+    expertise: ["OWASP", "auth", "injection"]
+    personality: "paranoid"
+
+  - role: "performance_engineer"
+    model: sonnet
+    expertise: ["complexity", "caching", "async"]
+    personality: "pragmatic"
+
+  - role: "maintainability_advocate"
+    model: opus
+    expertise: ["SOLID", "patterns", "readability"]
+    personality: "perfectionist"
+```
+
+---
+
+## Hierarchical Planning (GoalAct/TMS Patterns)
+
+### Global Planning with Hierarchical Execution
+
+**Research:** GoalAct achieved 12.22% improvement in success rate using this pattern.
+
+```
+------------------------------------------------------------------+
+| GLOBAL PLANNER                                                    |
+| - Maintains overall goal and strategy                             |
+| - Continuously updates plan based on progress                     |
+| - Decomposes into high-level skills                               |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| HIGH-LEVEL SKILLS                                                 |
+| - searching, coding, testing, writing, deploying                  |
+| - Each skill has defined entry/exit conditions                    |
+| - Reduces planning complexity at execution level                  |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| LOCAL EXECUTORS                                                   |
+| - Execute specific actions within skill context                   |
+| - Report progress back to global planner                          |
+| - Can request skill escalation if blocked                         |
+------------------------------------------------------------------+
+```
+
+### Thought Management System (TMS)
+
+**For long-horizon tasks:**
+
+```python
+class ThoughtManagementSystem:
+    """
+    Based on TMS research for long-horizon autonomous tasks.
+    Enables dynamic prioritization and adaptive strategy.
+    """
+
+    def __init__(self, completion_promise):
+        self.goal_hierarchy = self.decompose_goal(completion_promise)
+        self.active_thoughts = PriorityQueue()
+        self.completed_thoughts = []
+        self.blocked_thoughts = []
+
+    def decompose_goal(self, goal):
+        """
+        Hierarchical goal decomposition with self-critique.
+        """
+        # Level 0: Ultimate goal
+        hierarchy = {"goal": goal, "subgoals": []}
+
+        # Level 1: Phase-level subgoals
+        phases = self.identify_phases(goal)
+        for phase in phases:
+            phase_node = {"goal": phase, "subgoals": []}
+
+            # Level 2: Task-level subgoals
+            tasks = self.identify_tasks(phase)
+            for task in tasks:
+                phase_node["subgoals"].append({"goal": task, "subgoals": []})
+
+            hierarchy["subgoals"].append(phase_node)
+
+        return hierarchy
+
+    def iterate(self):
+        """
+        Single iteration with self-critique.
+        """
+        # 1. Select highest priority thought
+        thought = self.active_thoughts.pop()
+
+        # 2. Execute thought
+        result = self.execute(thought)
+
+        # 3. Self-critique: Did this make progress?
+        critique = self.self_critique(thought, result)
+
+        # 4. Adapt strategy based on critique
+        if critique.made_progress:
+            self.completed_thoughts.append(thought)
+            self.generate_next_thoughts(thought, result)
+        elif critique.is_blocked:
+            self.blocked_thoughts.append(thought)
+            self.escalate_or_decompose(thought)
+        else:
+            # No progress, not blocked - need different approach
+            thought.attempts += 1
+            thought.alternative_strategy = critique.suggested_alternative
+            self.active_thoughts.push(thought)
+```
+
+---
+
+## Iter-VF: Iterative Verification-First
+
+**Key insight:** Verify the extracted answer only, not the whole thinking process.
+
+```python
+def iterative_verify_first(task, max_iterations=3):
+    """
+    Based on Iter-VF research: verify answer, maintain Markovian process.
+    Avoids context overflow and error accumulation.
+    """
+    for iteration in range(max_iterations):
+        # 1. Generate solution
+        solution = generate_solution(task)
+
+        # 2. Extract concrete answer/output
+        answer = extract_answer(solution)
+
+        # 3. Verify ONLY the answer (not reasoning chain)
+        verification = verify_answer(
+            answer=answer,
+            spec=task.spec,
+            tests=task.tests
+        )
+
+        if verification.passes:
+            return solution
+
+        # 4. Markovian retry: fresh context with just error info
+        task = create_fresh_task(
+            original=task,
+            error=verification.error,
+            attempt=iteration + 1
+            # NOTE: Do NOT include previous reasoning chain
+        )
+
+    return FailedResult(task, "Max iterations reached")
+```
+
+---
+
+## Collaboration Structures
+
+### When to Use Each Structure
+
+| Structure | Use When | Loki Mode Application |
+|-----------|----------|----------------------|
+| **Centralized** | Need consistency, single source of truth | Orchestrator for phase management |
+| **Decentralized** | Need fault tolerance, parallel execution | Agent swarms for implementation |
+| **Hierarchical** | Complex tasks with clear decomposition | Global planner -> Skill -> Executor |
+
+### Coopetition Pattern
+
+**Agents compete on alternatives, cooperate on consensus:**
+
+```python
+def coopetition_decision(agents, decision_point):
+    """
+    Competition phase: Generate diverse alternatives
+    Cooperation phase: Reach consensus on best option
+    """
+    # COMPETITION: Each agent proposes solution independently
+    proposals = []
+    for agent in agents:
+        proposal = agent.propose(
+            decision_point,
+            visibility="blind"  # No peeking at other proposals
+        )
+        proposals.append(proposal)
+
+    # COOPERATION: Collaborative evaluation
+    if len(set(p.approach for p in proposals)) == 1:
+        # Unanimous - likely good solution
+        return proposals[0]
+
+    # Multiple approaches - structured debate
+    for proposal in proposals:
+        proposal.pros = evaluate_pros(proposal)
+        proposal.cons = evaluate_cons(proposal)
+        proposal.evidence = gather_evidence(proposal)
+
+    # Vote with reasoning requirement
+    winner = ranked_choice_vote(
+        proposals,
+        require_justification=True
+    )
+
+    return winner
+```
+
+---
+
+## Progressive Complexity Escalation
+
+**Start simple, escalate only when needed:**
+
+```
+Level 1: Single Agent, Direct Execution
+   |
+   +-- Success? --> Done
+   |
+   +-- Failure? --> Escalate
+           |
+           v
+Level 2: Single Agent + Self-Verification Loop
+   |
+   +-- Success? --> Done
+   |
+   +-- Failure after 3 attempts? --> Escalate
+           |
+           v
+Level 3: Multi-Agent Review
+   |
+   +-- Success? --> Done
+   |
+   +-- Persistent issues? --> Escalate
+           |
+           v
+Level 4: Hierarchical Planning + Decomposition
+   |
+   +-- Success? --> Done
+   |
+   +-- Fundamental blocker? --> Human escalation
+```
+
+---
+
+## Key Research Findings Summary
+
+### What Works
+
+1. **Heterogeneous teams** outperform homogeneous by 4-6%
+2. **Iter-VF** (verify answer only) prevents context overflow
+3. **Episodic-to-semantic consolidation** enables genuine learning
+4. **Anti-sycophancy measures** (blind review, devil's advocate) improve accuracy 30%+
+5. **Global planning** with local execution improves success rate 12%+
+
+### What Doesn't Work
+
+1. **Deep debate chains** - diminishing returns after 1-2 rounds
+2. **Confidence visibility** - causes over-confidence cascades
+3. **Full reasoning chain review** - leads to error accumulation
+4. **Homogeneous reviewer teams** - miss diverse failure modes
+5. **Over-engineered orchestration** - model upgrades outpace gains
+
+---
+
+## Sources
+
+- [Multi-Agent Collaboration Mechanisms Survey](https://arxiv.org/abs/2501.06322)
+- [CONSENSAGENT: Anti-Sycophancy Framework](https://aclanthology.org/2025.findings-acl.1141/)
+- [GoalAct: Global Planning + Hierarchical Execution](https://arxiv.org/abs/2504.16563)
+- [A-Mem: Agentic Memory System](https://arxiv.org/html/2502.12110v11)
+- [Multi-Agent Reflexion (MAR)](https://arxiv.org/html/2512.20845)
+- [Iter-VF: Iterative Verification-First](https://arxiv.org/html/2511.21734v1)
+- [Awesome Agentic Patterns](https://github.com/nibzard/awesome-agentic-patterns)
--- a/skills/loki-mode/references/agent-types.md
+++ b/skills/loki-mode/references/agent-types.md
@@ -0,0 +1,188 @@
+# Agent Types Reference
+
+Complete definitions and capabilities for all 37 specialized agent types.
+
+---
+
+## Overview
+
+Loki Mode has 37 predefined agent types organized into 7 specialized swarms. The orchestrator spawns only the agents needed for your project - a simple app might use 5-10 agents, while a complex startup could spawn 100+ agents working in parallel.
+
+---
+
+## Engineering Swarm (8 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `eng-frontend` | React/Vue/Svelte, TypeScript, Tailwind, accessibility, responsive design, state management |
+| `eng-backend` | Node/Python/Go, REST/GraphQL, auth, business logic, middleware, validation |
+| `eng-database` | PostgreSQL/MySQL/MongoDB, migrations, query optimization, indexing, backups |
+| `eng-mobile` | React Native/Flutter/Swift/Kotlin, offline-first, push notifications, app store prep |
+| `eng-api` | OpenAPI specs, SDK generation, versioning, webhooks, rate limiting, documentation |
+| `eng-qa` | Unit/integration/E2E tests, coverage, automation, test data management |
+| `eng-perf` | Profiling, benchmarking, optimization, caching, load testing, memory analysis |
+| `eng-infra` | Docker, K8s manifests, IaC review, networking, security hardening |
+
+---
+
+## Operations Swarm (8 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `ops-devops` | CI/CD pipelines, GitHub Actions, GitLab CI, Jenkins, build optimization |
+| `ops-sre` | Reliability, SLOs/SLIs, capacity planning, on-call, runbooks |
+| `ops-security` | SAST/DAST, pen testing, vulnerability management, security reviews |
+| `ops-monitor` | Observability, Datadog/Grafana, alerting, dashboards, log aggregation |
+| `ops-incident` | Incident response, runbooks, RCA, post-mortems, communication |
+| `ops-release` | Versioning, changelogs, blue-green, canary, rollbacks, feature flags |
+| `ops-cost` | Cloud cost optimization, right-sizing, FinOps, reserved instances |
+| `ops-compliance` | SOC2, GDPR, HIPAA, PCI-DSS, audit preparation, policy enforcement |
+
+---
+
+## Business Swarm (8 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `biz-marketing` | Landing pages, SEO, content, email campaigns, social media |
+| `biz-sales` | CRM setup, outreach, demos, proposals, pipeline management |
+| `biz-finance` | Billing (Stripe), invoicing, metrics, runway, pricing strategy |
+| `biz-legal` | ToS, privacy policy, contracts, IP protection, compliance docs |
+| `biz-support` | Help docs, FAQs, ticket system, chatbot, knowledge base |
+| `biz-hr` | Job posts, recruiting, onboarding, culture docs, team structure |
+| `biz-investor` | Pitch decks, investor updates, data room, cap table management |
+| `biz-partnerships` | BD outreach, integration partnerships, co-marketing, API partnerships |
+
+---
+
+## Data Swarm (3 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `data-ml` | Model training, MLOps, feature engineering, inference, model monitoring |
+| `data-eng` | ETL pipelines, data warehousing, dbt, Airflow, data quality |
+| `data-analytics` | Product analytics, A/B tests, dashboards, insights, reporting |
+
+---
+
+## Product Swarm (3 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `prod-pm` | Backlog grooming, prioritization, roadmap, specs, stakeholder management |
+| `prod-design` | Design system, Figma, UX patterns, prototypes, user research |
+| `prod-techwriter` | API docs, guides, tutorials, release notes, developer experience |
+
+---
+
+## Growth Swarm (4 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `growth-hacker` | Growth experiments, viral loops, referral programs, acquisition |
+| `growth-community` | Community building, Discord/Slack, ambassador programs, events |
+| `growth-success` | Customer success, health scoring, churn prevention, expansion |
+| `growth-lifecycle` | Email lifecycle, in-app messaging, re-engagement, onboarding |
+
+---
+
+## Review Swarm (3 types)
+
+| Agent | Capabilities |
+|-------|-------------|
+| `review-code` | Code quality, design patterns, SOLID, maintainability, best practices |
+| `review-business` | Requirements alignment, business logic, edge cases, UX flows |
+| `review-security` | Vulnerabilities, auth/authz, OWASP Top 10, data protection |
+
+---
+
+## Agent Execution Model
+
+**Claude Code does NOT support background processes.** Agents execute via:
+
+1. **Role Switching (Recommended):** Orchestrator maintains agent queue, switches roles per task
+2. **Sequential:** Execute agents one at a time (simple, reliable)
+3. **Parallel via tmux:** Multiple Claude Code sessions (complex, faster)
+
+```bash
+# Option 1: Sequential (simple, reliable)
+for agent in frontend backend database; do
+  claude -p "Act as $agent agent..." --dangerously-skip-permissions
+done
+
+# Option 2: Parallel via tmux (complex, faster)
+tmux new-session -d -s loki-pool
+for i in {1..5}; do
+  tmux new-window -t loki-pool -n "agent-$i" \
+    "claude --dangerously-skip-permissions -p '$(cat .loki/prompts/agent-$i.md)'"
+done
+
+# Option 3: Role switching (recommended)
+# Orchestrator maintains agent queue, switches roles per task
+```
+
+---
+
+## Model Selection by Agent Type
+
+| Task Type | Model | Reason |
+|-----------|-------|--------|
+| Implementation | Sonnet | Fast, good enough for coding |
+| Code Review | Opus | Deep analysis, catches subtle issues |
+| Security Review | Opus | Critical, needs thoroughness |
+| Business Logic Review | Opus | Needs to understand requirements deeply |
+| Documentation | Sonnet | Straightforward writing |
+| Quick fixes | Haiku | Fast iteration |
+
+---
+
+## Agent Lifecycle
+
+```
+SPAWN -> INITIALIZE -> POLL_QUEUE -> CLAIM_TASK -> EXECUTE -> REPORT -> POLL_QUEUE
+           |              |                        |          |
+           |         circuit open?             timeout?    success?
+           |              |                        |          |
+           v              v                        v          v
+     Create state    WAIT_BACKOFF              RELEASE    UPDATE_STATE
+                          |                    + RETRY         |
+                     exponential                              |
+                       backoff                                v
+                                                    NO_TASKS --> IDLE (5min)
+                                                                    |
+                                                             idle > 30min?
+                                                                    |
+                                                                    v
+                                                               TERMINATE
+```
+
+---
+
+## Dynamic Scaling Rules
+
+| Condition | Action | Cooldown |
+|-----------|--------|----------|
+| Queue depth > 20 | Spawn 2 agents of bottleneck type | 5min |
+| Queue depth > 50 | Spawn 5 agents, alert orchestrator | 2min |
+| Agent idle > 30min | Terminate agent | - |
+| Agent failed 3x consecutive | Terminate, open circuit breaker | 5min |
+| Critical task waiting > 10min | Spawn priority agent | 1min |
+| Circuit breaker half-open | Spawn 1 test agent | - |
+| All agents of type failed | HALT, request human intervention | - |
+
+---
+
+## Agent Context Preservation
+
+### Lineage Rules
+1. **Immutable Inheritance:** Agents CANNOT modify inherited context
+2. **Decision Logging:** All decisions MUST be logged to agent context file
+3. **Lineage Reference:** All commits MUST reference parent agent ID
+4. **Context Handoff:** When agent completes, context is archived but lineage preserved
+
+### Preventing Context Drift
+1. Read `.agent/sub-agents/${parent_id}.json` before spawning
+2. Inherit immutable context (tech stack, constraints, decisions)
+3. Log all new decisions to own context file
+4. Reference lineage in all commits
+5. Periodic context sync: check if inherited context has been updated upstream
--- a/skills/loki-mode/references/agents.md
+++ b/skills/loki-mode/references/agents.md
--- a/skills/loki-mode/references/business-ops.md
+++ b/skills/loki-mode/references/business-ops.md
@@ -0,0 +1,550 @@
+# Business Operations Reference
+
+Workflows and procedures for business swarm agents.
+
+## Marketing Operations
+
+### Landing Page Checklist
+```
+[ ] Hero section with clear value proposition
+[ ] Problem/solution narrative
+[ ] Feature highlights (3-5 key features)
+[ ] Social proof (testimonials, logos, stats)
+[ ] Pricing section (if applicable)
+[ ] FAQ section
+[ ] Call-to-action (primary and secondary)
+[ ] Footer with legal links
+```
+
+### SEO Optimization
+```yaml
+Technical SEO:
+  - meta title: 50-60 characters, include primary keyword
+  - meta description: 150-160 characters, compelling
+  - canonical URL set
+  - robots.txt configured
+  - sitemap.xml generated
+  - structured data (JSON-LD)
+  - Open Graph tags
+  - Twitter Card tags
+
+Performance:
+  - Largest Contentful Paint < 2.5s
+  - First Input Delay < 100ms
+  - Cumulative Layout Shift < 0.1
+  - Images optimized (WebP, lazy loading)
+
+Content:
+  - H1 contains primary keyword
+  - H2-H6 hierarchy logical
+  - Internal linking strategy
+  - Alt text on all images
+  - Content length appropriate for intent
+```
+
+### Content Calendar Template
+```markdown
+# Week of [DATE]
+
+## Monday
+- [ ] Blog post: [TITLE]
+- [ ] Social: LinkedIn announcement
+
+## Wednesday  
+- [ ] Email newsletter
+- [ ] Social: Twitter thread
+
+## Friday
+- [ ] Case study update
+- [ ] Social: Feature highlight
+```
+
+### Email Sequences
+
+**Onboarding Sequence:**
+```
+Day 0: Welcome email (immediate)
+  - Thank you for signing up
+  - Quick start guide link
+  - Support contact
+
+Day 1: Getting started
+  - First feature tutorial
+  - Video walkthrough
+
+Day 3: Value demonstration
+  - Success metrics
+  - Customer story
+
+Day 7: Check-in
+  - How's it going?
+  - Feature discovery
+
+Day 14: Advanced features
+  - Power user tips
+  - Integration options
+```
+
+**Abandoned Cart/Trial:**
+```
+Hour 1: Reminder
+Day 1: Benefits recap
+Day 3: Testimonial + urgency
+Day 7: Final offer
+```
+
+---
+
+## Sales Operations
+
+### CRM Pipeline Stages
+```
+1. Lead (new contact)
+2. Qualified (fits ICP, has need)
+3. Meeting Scheduled
+4. Demo Completed
+5. Proposal Sent
+6. Negotiation
+7. Closed Won / Closed Lost
+```
+
+### Qualification Framework (BANT)
+```yaml
+Budget:
+  - What's the allocated budget?
+  - Who controls the budget?
+  
+Authority:
+  - Who makes the final decision?
+  - Who else is involved?
+  
+Need:
+  - What problem are you solving?
+  - What's the impact of not solving it?
+  
+Timeline:
+  - When do you need a solution?
+  - What's driving that timeline?
+```
+
+### Outreach Template
+```markdown
+Subject: [Specific pain point] at [Company]
+
+Hi [Name],
+
+I noticed [Company] is [specific observation about their business].
+
+Many [similar role/company type] struggle with [problem], which leads to [negative outcome].
+
+[Product] helps by [specific solution], resulting in [specific benefit with metric].
+
+Would you be open to a 15-minute call to see if this could help [Company]?
+
+Best,
+[Name]
+```
+
+### Demo Script Structure
+```
+1. Rapport (2 min)
+   - Confirm attendees and roles
+   - Agenda overview
+
+2. Discovery (5 min)
+   - Confirm pain points
+   - Understand current process
+   - Success metrics
+
+3. Solution (15 min)
+   - Map features to their needs
+   - Show don't tell
+   - Address specific use cases
+
+4. Social Proof (3 min)
+   - Relevant customer stories
+   - Metrics and outcomes
+
+5. Pricing/Next Steps (5 min)
+   - Present options
+   - Answer objections
+   - Define next steps
+```
+
+---
+
+## Finance Operations
+
+### Billing Setup Checklist (Stripe)
+```bash
+# Initialize Stripe
+npm install stripe
+
+# Required configurations:
+- [ ] Products and prices created
+- [ ] Customer portal enabled
+- [ ] Webhook endpoints configured
+- [ ] Tax settings (Stripe Tax or manual)
+- [ ] Invoice settings customized
+- [ ] Payment methods enabled
+- [ ] Fraud protection rules
+```
+
+### Webhook Events to Handle
+```javascript
+const relevantEvents = [
+  'customer.subscription.created',
+  'customer.subscription.updated', 
+  'customer.subscription.deleted',
+  'invoice.paid',
+  'invoice.payment_failed',
+  'payment_intent.succeeded',
+  'payment_intent.payment_failed',
+  'customer.updated',
+  'charge.refunded'
+];
+```
+
+### Key Metrics Dashboard
+```yaml
+Revenue Metrics:
+  - MRR (Monthly Recurring Revenue)
+  - ARR (Annual Recurring Revenue)
+  - Net Revenue Retention
+  - Expansion Revenue
+  - Churn Rate
+
+Customer Metrics:
+  - CAC (Customer Acquisition Cost)
+  - LTV (Lifetime Value)
+  - LTV:CAC Ratio (target: 3:1)
+  - Payback Period
+
+Product Metrics:
+  - Trial to Paid Conversion
+  - Activation Rate
+  - Feature Adoption
+  - NPS Score
+```
+
+### Runway Calculation
+```
+Monthly Burn = Total Monthly Expenses - Monthly Revenue
+Runway (months) = Cash Balance / Monthly Burn
+
+Healthy: > 18 months
+Warning: 6-12 months
+Critical: < 6 months
+```
+
+---
+
+## Legal Operations
+
+### Terms of Service Template Sections
+```
+1. Acceptance of Terms
+2. Description of Service
+3. User Accounts and Registration
+4. User Conduct and Content
+5. Intellectual Property Rights
+6. Payment Terms (if applicable)
+7. Termination
+8. Disclaimers and Limitations
+9. Indemnification
+10. Dispute Resolution
+11. Changes to Terms
+12. Contact Information
+```
+
+### Privacy Policy Requirements (GDPR)
+```
+Required Disclosures:
+- [ ] Data controller identity
+- [ ] Types of data collected
+- [ ] Purpose of processing
+- [ ] Legal basis for processing
+- [ ] Data retention periods
+- [ ] Third-party sharing
+- [ ] User rights (access, rectification, deletion)
+- [ ] Cookie usage
+- [ ] International transfers
+- [ ] Contact information
+- [ ] DPO contact (if applicable)
+```
+
+### GDPR Compliance Checklist
+```
+Data Collection:
+- [ ] Consent mechanism implemented
+- [ ] Purpose limitation documented
+- [ ] Data minimization practiced
+
+User Rights:
+- [ ] Right to access (data export)
+- [ ] Right to rectification (edit profile)
+- [ ] Right to erasure (delete account)
+- [ ] Right to portability (download data)
+- [ ] Right to object (marketing opt-out)
+
+Technical:
+- [ ] Encryption at rest
+- [ ] Encryption in transit
+- [ ] Access logging
+- [ ] Breach notification process
+```
+
+### Cookie Consent Implementation
+```javascript
+// Cookie categories
+const cookieCategories = {
+  necessary: true,      // Always enabled
+  functional: false,    // User preference
+  analytics: false,     // Tracking/analytics
+  marketing: false      // Advertising
+};
+
+// Required: Show banner before non-necessary cookies
+// Required: Allow granular control
+// Required: Easy withdrawal of consent
+// Required: Record consent timestamp
+```
+
+---
+
+## Customer Support Operations
+
+### Ticket Priority Matrix
+| Priority | Description | Response SLA | Resolution SLA |
+|----------|-------------|--------------|----------------|
+| P1 - Critical | Service down, data loss | 15 min | 4 hours |
+| P2 - High | Major feature broken | 1 hour | 8 hours |
+| P3 - Medium | Feature impaired | 4 hours | 24 hours |
+| P4 - Low | General questions | 24 hours | 72 hours |
+
+### Response Templates
+
+**Acknowledgment:**
+```
+Hi [Name],
+
+Thanks for reaching out! I've received your message about [issue summary].
+
+I'm looking into this now and will get back to you within [SLA time].
+
+In the meantime, [helpful resource or workaround if applicable].
+
+Best,
+[Agent Name]
+```
+
+**Resolution:**
+```
+Hi [Name],
+
+Great news - I've resolved the issue with [specific problem].
+
+Here's what was happening: [brief explanation]
+
+Here's what I did to fix it: [solution summary]
+
+To prevent this in the future: [if applicable]
+
+Please let me know if you have any questions!
+
+Best,
+[Agent Name]
+```
+
+### Knowledge Base Structure
+```
+/help
+├── /getting-started
+│   ├── quick-start-guide
+│   ├── account-setup
+│   └── first-steps
+├── /features
+│   ├── feature-a
+│   ├── feature-b
+│   └── feature-c
+├── /billing
+│   ├── plans-and-pricing
+│   ├── payment-methods
+│   └── invoices
+├── /integrations
+│   ├── integration-a
+│   └── integration-b
+├── /troubleshooting
+│   ├── common-issues
+│   └── error-messages
+└── /api
+    ├── authentication
+    ├── endpoints
+    └── examples
+```
+
+---
+
+## Analytics Operations
+
+### Event Tracking Plan
+```yaml
+User Lifecycle:
+  - user_signed_up:
+      properties: [source, referrer, plan]
+  - user_activated:
+      properties: [activation_method, time_to_activate]
+  - user_converted:
+      properties: [plan, trial_length, conversion_path]
+  - user_churned:
+      properties: [reason, lifetime_value, last_active]
+
+Core Actions:
+  - feature_used:
+      properties: [feature_name, context]
+  - action_completed:
+      properties: [action_type, duration, success]
+  - error_encountered:
+      properties: [error_type, page, context]
+
+Engagement:
+  - page_viewed:
+      properties: [page_name, referrer, duration]
+  - button_clicked:
+      properties: [button_name, page, context]
+  - search_performed:
+      properties: [query, results_count]
+```
+
+### A/B Testing Framework
+```yaml
+Test Structure:
+  name: "Homepage CTA Test"
+  hypothesis: "Changing CTA from 'Sign Up' to 'Start Free' will increase conversions"
+  primary_metric: signup_rate
+  secondary_metrics: [time_on_page, bounce_rate]
+  
+  variants:
+    control:
+      description: "Original 'Sign Up' button"
+      allocation: 50%
+    variant_a:
+      description: "'Start Free' button"
+      allocation: 50%
+  
+  sample_size: 1000_per_variant
+  duration: 14_days
+  significance_level: 0.95
+
+Analysis:
+  - Calculate conversion rate per variant
+  - Run chi-squared test for significance
+  - Check for novelty effects
+  - Segment by user type if needed
+  - Document learnings
+```
+
+### Funnel Analysis
+```
+Signup Funnel:
+  1. Landing Page Visit    → 100% (baseline)
+  2. Signup Page View      → 40% (60% drop-off)
+  3. Form Submitted        → 25% (15% drop-off)
+  4. Email Verified        → 20% (5% drop-off)
+  5. Onboarding Complete   → 12% (8% drop-off)
+  6. First Value Action    → 8% (4% drop-off)
+
+Optimization Targets:
+  - Biggest drop: Landing → Signup (improve CTA, value prop)
+  - Second biggest: Signup → Submit (simplify form)
+```
+
+### Weekly Metrics Report Template
+```markdown
+# Weekly Metrics Report: [Date Range]
+
+## Key Metrics Summary
+| Metric | This Week | Last Week | Change |
+|--------|-----------|-----------|--------|
+| New Users | X | Y | +Z% |
+| Activated Users | X | Y | +Z% |
+| Revenue | $X | $Y | +Z% |
+| Churn | X% | Y% | -Z% |
+
+## Highlights
+- [Positive trend 1]
+- [Positive trend 2]
+
+## Concerns
+- [Issue 1 and action plan]
+- [Issue 2 and action plan]
+
+## Experiments Running
+- [Test name]: [current results]
+
+## Next Week Focus
+- [Priority 1]
+- [Priority 2]
+```
+
+---
+
+## Cross-Functional Workflows
+
+### Feature Launch Checklist
+```
+Pre-Launch:
+[ ] Feature complete and tested
+[ ] Documentation updated
+[ ] Help articles written
+[ ] Email announcement drafted
+[ ] Social content prepared
+[ ] Sales team briefed
+[ ] Support team trained
+[ ] Analytics events added
+[ ] Feature flag ready
+
+Launch:
+[ ] Deploy to production
+[ ] Enable feature flag (% rollout)
+[ ] Send email announcement
+[ ] Publish blog post
+[ ] Post on social media
+[ ] Update changelog
+
+Post-Launch:
+[ ] Monitor error rates
+[ ] Track feature adoption
+[ ] Collect user feedback
+[ ] Iterate based on data
+```
+
+### Incident Communication Template
+```markdown
+# [Incident Type] - [Brief Description]
+
+## Status: [Investigating | Identified | Monitoring | Resolved]
+
+## Timeline
+- [HH:MM] Issue reported
+- [HH:MM] Team engaged
+- [HH:MM] Root cause identified
+- [HH:MM] Fix deployed
+- [HH:MM] Monitoring
+
+## Impact
+- Affected: [% of users, specific features]
+- Duration: [X hours/minutes]
+
+## Root Cause
+[Brief explanation]
+
+## Resolution
+[What was done to fix]
+
+## Prevention
+[What changes will prevent recurrence]
+
+## Next Update
+[Time of next update or "Resolved"]
+```
--- a/skills/loki-mode/references/core-workflow.md
+++ b/skills/loki-mode/references/core-workflow.md
@@ -0,0 +1,275 @@
+# Core Workflow Reference
+
+Full RARV cycle, CONTINUITY.md template, and autonomy rules.
+
+---
+
+## Autonomy Rules
+
+**This system runs with ZERO human intervention.**
+
+### Core Rules
+1. **NEVER ask questions** - Do not say "Would you like me to...", "Should I...", or "What would you prefer?"
+2. **NEVER wait for confirmation** - Take immediate action. If something needs to be done, do it.
+3. **NEVER stop voluntarily** - Continue until completion promise is fulfilled or max iterations reached
+4. **NEVER suggest alternatives** - Pick the best option and execute. No "You could also..." or "Alternatively..."
+5. **ALWAYS use RARV cycle** - Every action follows the Reason-Act-Reflect-Verify pattern
+
+---
+
+## RARV Cycle (Reason-Act-Reflect-Verify)
+
+**Enhanced with Automatic Self-Verification Loop (Boris Cherny Pattern)**
+
+Every iteration follows this cycle:
+
+```
+-------------------------------------------------------------------+
+| REASON: What needs to be done next?                               |
+| - READ .loki/CONTINUITY.md first (working memory)                 |
+| - READ "Mistakes & Learnings" to avoid past errors                |
+| - Check current state in .loki/state/orchestrator.json            |
+| - Review pending tasks in .loki/queue/pending.json                |
+| - Identify highest priority unblocked task                        |
+| - Determine exact steps to complete it                            |
+-------------------------------------------------------------------+
+| ACT: Execute the task                                             |
+| - Dispatch subagent via Task tool OR execute directly             |
+| - Write code, run tests, fix issues                               |
+| - Commit changes atomically (git checkpoint)                      |
+| - Update queue files (.loki/queue/*.json)                         |
+-------------------------------------------------------------------+
+| REFLECT: Did it work? What next?                                  |
+| - Verify task success (tests pass, no errors)                     |
+| - UPDATE .loki/CONTINUITY.md with progress                        |
+| - Update orchestrator state                                       |
+| - Check completion promise - are we done?                         |
+| - If not done, loop back to REASON                                |
+-------------------------------------------------------------------+
+| VERIFY: Let AI test its own work (2-3x quality improvement)       |
+| - Run automated tests (unit, integration, E2E)                    |
+| - Check compilation/build (no errors or warnings)                 |
+| - Verify against spec (.loki/specs/openapi.yaml)                  |
+| - Run linters/formatters via post-write hooks                     |
+| - Browser/runtime testing if applicable                           |
+|                                                                   |
+| IF VERIFICATION FAILS:                                            |
+|   1. Capture error details (stack trace, logs)                    |
+|   2. Analyze root cause                                           |
+|   3. UPDATE CONTINUITY.md "Mistakes & Learnings"                  |
+|   4. Rollback to last good git checkpoint (if needed)             |
+|   5. Apply learning and RETRY from REASON                         |
+|                                                                   |
+| - If verification passes, mark task complete and continue         |
+-------------------------------------------------------------------+
+```
+
+**Key Enhancement:** The VERIFY step creates a feedback loop where the AI:
+- Tests every change automatically
+- Learns from failures by updating CONTINUITY.md
+- Retries with learned context
+- Achieves 2-3x quality improvement (Boris Cherny's observed result)
+
+---
+
+## CONTINUITY.md - Working Memory Protocol
+
+**CRITICAL:** You have a persistent working memory file at `.loki/CONTINUITY.md` that maintains state across all turns of execution.
+
+### AT THE START OF EVERY TURN:
+1. Read `.loki/CONTINUITY.md` to orient yourself to the current state
+2. Reference it throughout your reasoning
+3. Never make decisions without checking CONTINUITY.md first
+
+### AT THE END OF EVERY TURN:
+1. Update `.loki/CONTINUITY.md` with any important new information
+2. Record what was accomplished
+3. Note what needs to happen next
+4. Document any blockers or decisions made
+
+### CONTINUITY.md Template
+
+```markdown
+# Loki Mode Working Memory
+Last Updated: [ISO timestamp]
+Current Phase: [bootstrap|discovery|architecture|development|qa|deployment|growth]
+Current Iteration: [number]
+
+## Active Goal
+[What we're currently trying to accomplish - 1-2 sentences]
+
+## Current Task
+- ID: [task-id from queue]
+- Description: [what we're doing]
+- Status: [in-progress|blocked|reviewing]
+- Started: [timestamp]
+
+## Just Completed
+- [Most recent accomplishment with file:line references]
+- [Previous accomplishment]
+- [etc - last 5 items]
+
+## Next Actions (Priority Order)
+1. [Immediate next step]
+2. [Following step]
+3. [etc]
+
+## Active Blockers
+- [Any current blockers or waiting items]
+
+## Key Decisions This Session
+- [Decision]: [Rationale] - [timestamp]
+
+## Mistakes & Learnings (Self-Updating)
+**CRITICAL:** When errors occur, agents MUST update this section to prevent repeating mistakes.
+
+### Pattern: Error -> Learning -> Prevention
+- **What Failed:** [Specific error that occurred]
+- **Why It Failed:** [Root cause analysis]
+- **How to Prevent:** [Concrete action to avoid this in future]
+- **Timestamp:** [When this was learned]
+- **Agent:** [Which agent learned this]
+
+### Example:
+- **What Failed:** TypeScript compilation error - missing return type annotation
+- **Why It Failed:** Express route handlers need explicit `: void` return type in strict mode
+- **How to Prevent:** Always add `: void` to route handlers: `(req, res): void =>`
+- **Timestamp:** 2026-01-04T00:16:00Z
+- **Agent:** eng-001-backend-api
+
+**Self-Update Protocol:**
+```
+ON_ERROR:
+  1. Capture error details (stack trace, context)
+  2. Analyze root cause
+  3. Write learning to CONTINUITY.md "Mistakes & Learnings"
+  4. Update approach based on learning
+  5. Retry with corrected approach
+```
+
+## Working Context
+[Any critical information needed for current work - API keys in use,
+architecture decisions, patterns being followed, etc.]
+
+## Files Currently Being Modified
+- [file path]: [what we're changing]
+```
+
+---
+
+## Memory Hierarchy
+
+The memory systems work together:
+
+1. **CONTINUITY.md** = Working memory (current session state, updated every turn)
+2. **ledgers/** = Agent-specific state (checkpointed periodically)
+3. **handoffs/** = Agent-to-agent transfers (on agent switch)
+4. **learnings/** = Extracted patterns (on task completion)
+5. **rules/** = Permanent validated patterns (promoted from learnings)
+
+**CONTINUITY.md is the PRIMARY source of truth for "what am I doing right now?"**
+
+---
+
+## Git Checkpoint System
+
+**CRITICAL:** Every completed task MUST create a git checkpoint for rollback safety.
+
+### Protocol: Automatic Commits After Task Completion
+
+**RULE:** When `task.status == "completed"`, create a git commit immediately.
+
+```bash
+# Git Checkpoint Protocol
+ON_TASK_COMPLETE() {
+    task_id=$1
+    task_title=$2
+    agent_id=$3
+
+    # Stage modified files
+    git add <modified_files>
+
+    # Create structured commit message
+    git commit -m "[Loki] ${agent_type}-${task_id}: ${task_title}
+
+${detailed_description}
+
+Agent: ${agent_id}
+Parent: ${parent_agent_id}
+Spec: ${spec_reference}
+Tests: ${test_files}
+Git-Checkpoint: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
+
+    # Store commit SHA in task metadata
+    commit_sha=$(git rev-parse HEAD)
+    update_task_metadata task_id git_commit_sha "$commit_sha"
+
+    # Update CONTINUITY.md
+    echo "- Task $task_id completed (commit: $commit_sha)" >> .loki/CONTINUITY.md
+}
+```
+
+### Commit Message Format
+
+**Template:**
+```
+[Loki] ${agent_type}-${task_id}: ${task_title}
+
+${detailed_description}
+
+Agent: ${agent_id}
+Parent: ${parent_agent_id}
+Spec: ${spec_reference}
+Tests: ${test_files}
+Git-Checkpoint: ${timestamp}
+```
+
+**Example:**
+```
+[Loki] eng-005-backend: Implement POST /api/todos endpoint
+
+Created todo creation endpoint per OpenAPI spec.
+- Input validation for title field
+- SQLite insertion with timestamps
+- Returns 201 with created todo object
+- Contract tests passing
+
+Agent: eng-001-backend-api
+Parent: orchestrator-main
+Spec: .loki/specs/openapi.yaml#/paths/~1api~1todos/post
+Tests: backend/tests/todos.contract.test.ts
+Git-Checkpoint: 2026-01-04T05:45:00Z
+```
+
+### Rollback Strategy
+
+**When to Rollback:**
+- Quality gates fail after merge
+- Integration tests fail
+- Security vulnerabilities detected
+- Breaking changes discovered
+
+**Rollback Command:**
+```bash
+# Find last good checkpoint
+last_good_commit=$(git log --grep="\[Loki\].*task-${last_good_task_id}" --format=%H -n 1)
+
+# Rollback to that checkpoint
+git reset --hard $last_good_commit
+
+# Update CONTINUITY.md
+echo "ROLLBACK: Reset to task-${last_good_task_id} (commit: $last_good_commit)" >> .loki/CONTINUITY.md
+
+# Re-queue failed tasks
+move_tasks_to_pending after_task=$last_good_task_id
+```
+
+---
+
+## If Subagent Fails
+
+1. Do NOT try to fix manually (context pollution)
+2. Dispatch fix subagent with specific error context
+3. If fix subagent fails 3x, move to dead letter queue
+4. Open circuit breaker for that agent type
+5. Alert orchestrator for human review
--- a/skills/loki-mode/references/deployment.md
+++ b/skills/loki-mode/references/deployment.md
@@ -0,0 +1,604 @@
+# Deployment Reference
+
+Infrastructure provisioning and deployment instructions for all supported platforms.
+
+## Deployment Decision Matrix
+
+| Criteria | Vercel/Netlify | Railway/Render | AWS | GCP | Azure |
+|----------|----------------|----------------|-----|-----|-------|
+| Static/JAMstack | Best | Good | Overkill | Overkill | Overkill |
+| Simple full-stack | Good | Best | Overkill | Overkill | Overkill |
+| Scale to millions | No | Limited | Best | Best | Best |
+| Enterprise compliance | Limited | Limited | Best | Good | Best |
+| Cost at scale | Expensive | Moderate | Cheapest | Cheap | Moderate |
+| Setup complexity | Trivial | Easy | Complex | Complex | Complex |
+
+## Quick Start Commands
+
+### Vercel
+```bash
+# Install CLI
+npm i -g vercel
+
+# Deploy (auto-detects framework)
+vercel --prod
+
+# Environment variables
+vercel env add VARIABLE_NAME production
+```
+
+### Netlify
+```bash
+# Install CLI
+npm i -g netlify-cli
+
+# Deploy
+netlify deploy --prod
+
+# Environment variables
+netlify env:set VARIABLE_NAME value
+```
+
+### Railway
+```bash
+# Install CLI
+npm i -g @railway/cli
+
+# Login and deploy
+railway login
+railway init
+railway up
+
+# Environment variables
+railway variables set VARIABLE_NAME=value
+```
+
+### Render
+```yaml
+# render.yaml (Infrastructure as Code)
+services:
+  - type: web
+    name: api
+    env: node
+    buildCommand: npm install && npm run build
+    startCommand: npm start
+    envVars:
+      - key: NODE_ENV
+        value: production
+      - key: DATABASE_URL
+        fromDatabase:
+          name: postgres
+          property: connectionString
+
+databases:
+  - name: postgres
+    plan: starter
+```
+
+---
+
+## AWS Deployment
+
+### Architecture Template
+```
+┌─────────────────────────────────────────────────────────┐
+│                        CloudFront                        │
+└─────────────────────────┬───────────────────────────────┘
+                          │
+          ┌───────────────┴───────────────┐
+          │                               │
+    ┌─────▼─────┐                   ┌─────▼─────┐
+    │    S3     │                   │    ALB    │
+    │ (static)  │                   │           │
+    └───────────┘                   └─────┬─────┘
+                                          │
+                                    ┌─────▼─────┐
+                                    │   ECS     │
+                                    │  Fargate  │
+                                    └─────┬─────┘
+                                          │
+                              ┌───────────┴───────────┐
+                              │                       │
+                        ┌─────▼─────┐           ┌─────▼─────┐
+                        │    RDS    │           │ ElastiCache│
+                        │ Postgres  │           │   Redis   │
+                        └───────────┘           └───────────┘
+```
+
+### Terraform Configuration
+```hcl
+# main.tf
+terraform {
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = "~> 5.0"
+    }
+  }
+  backend "s3" {
+    bucket = "terraform-state-${var.project_name}"
+    key    = "state.tfstate"
+    region = "us-east-1"
+  }
+}
+
+provider "aws" {
+  region = var.aws_region
+}
+
+# VPC
+module "vpc" {
+  source  = "terraform-aws-modules/vpc/aws"
+  version = "5.0.0"
+
+  name = "${var.project_name}-vpc"
+  cidr = "10.0.0.0/16"
+
+  azs             = ["${var.aws_region}a", "${var.aws_region}b"]
+  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
+  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
+
+  enable_nat_gateway = true
+  single_nat_gateway = var.environment != "production"
+}
+
+# ECS Cluster
+resource "aws_ecs_cluster" "main" {
+  name = "${var.project_name}-cluster"
+  
+  setting {
+    name  = "containerInsights"
+    value = "enabled"
+  }
+}
+
+# RDS
+module "rds" {
+  source  = "terraform-aws-modules/rds/aws"
+  version = "6.0.0"
+
+  identifier = "${var.project_name}-db"
+
+  engine               = "postgres"
+  engine_version       = "15"
+  family               = "postgres15"
+  major_engine_version = "15"
+  instance_class       = var.environment == "production" ? "db.t3.medium" : "db.t3.micro"
+
+  allocated_storage = 20
+  storage_encrypted = true
+
+  db_name  = var.db_name
+  username = var.db_username
+  port     = 5432
+
+  vpc_security_group_ids = [aws_security_group.rds.id]
+  subnet_ids             = module.vpc.private_subnets
+
+  backup_retention_period = var.environment == "production" ? 7 : 1
+  deletion_protection     = var.environment == "production"
+}
+```
+
+### ECS Task Definition
+```json
+{
+  "family": "app",
+  "networkMode": "awsvpc",
+  "requiresCompatibilities": ["FARGATE"],
+  "cpu": "256",
+  "memory": "512",
+  "containerDefinitions": [
+    {
+      "name": "app",
+      "image": "${ECR_REPO}:${TAG}",
+      "portMappings": [
+        {
+          "containerPort": 3000,
+          "protocol": "tcp"
+        }
+      ],
+      "environment": [
+        {"name": "NODE_ENV", "value": "production"}
+      ],
+      "secrets": [
+        {
+          "name": "DATABASE_URL",
+          "valueFrom": "arn:aws:secretsmanager:region:account:secret:db-url"
+        }
+      ],
+      "logConfiguration": {
+        "logDriver": "awslogs",
+        "options": {
+          "awslogs-group": "/ecs/app",
+          "awslogs-region": "us-east-1",
+          "awslogs-stream-prefix": "ecs"
+        }
+      },
+      "healthCheck": {
+        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
+        "interval": 30,
+        "timeout": 5,
+        "retries": 3
+      }
+    }
+  ]
+}
+```
+
+### GitHub Actions CI/CD
+```yaml
+name: Deploy to AWS
+
+on:
+  push:
+    branches: [main]
+
+env:
+  AWS_REGION: us-east-1
+  ECR_REPOSITORY: app
+  ECS_SERVICE: app-service
+  ECS_CLUSTER: app-cluster
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: ${{ env.AWS_REGION }}
+
+      - name: Login to Amazon ECR
+        id: login-ecr
+        uses: aws-actions/amazon-ecr-login@v2
+
+      - name: Build, tag, and push image
+        id: build-image
+        env:
+          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
+          IMAGE_TAG: ${{ github.sha }}
+        run: |
+          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
+          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
+          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
+
+      - name: Deploy to ECS
+        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
+        with:
+          task-definition: task-definition.json
+          service: ${{ env.ECS_SERVICE }}
+          cluster: ${{ env.ECS_CLUSTER }}
+          wait-for-service-stability: true
+```
+
+---
+
+## GCP Deployment
+
+### Cloud Run (Recommended for most cases)
+```bash
+# Build and deploy
+gcloud builds submit --tag gcr.io/PROJECT_ID/app
+gcloud run deploy app \
+  --image gcr.io/PROJECT_ID/app \
+  --platform managed \
+  --region us-central1 \
+  --allow-unauthenticated \
+  --set-env-vars="NODE_ENV=production" \
+  --set-secrets="DATABASE_URL=db-url:latest"
+```
+
+### Terraform for GCP
+```hcl
+provider "google" {
+  project = var.project_id
+  region  = var.region
+}
+
+# Cloud Run Service
+resource "google_cloud_run_service" "app" {
+  name     = "app"
+  location = var.region
+
+  template {
+    spec {
+      containers {
+        image = "gcr.io/${var.project_id}/app:latest"
+        
+        ports {
+          container_port = 3000
+        }
+
+        env {
+          name  = "NODE_ENV"
+          value = "production"
+        }
+
+        env {
+          name = "DATABASE_URL"
+          value_from {
+            secret_key_ref {
+              name = google_secret_manager_secret.db_url.secret_id
+              key  = "latest"
+            }
+          }
+        }
+
+        resources {
+          limits = {
+            cpu    = "1000m"
+            memory = "512Mi"
+          }
+        }
+      }
+    }
+
+    metadata {
+      annotations = {
+        "autoscaling.knative.dev/maxScale" = "10"
+        "run.googleapis.com/cloudsql-instances" = google_sql_database_instance.main.connection_name
+      }
+    }
+  }
+
+  traffic {
+    percent         = 100
+    latest_revision = true
+  }
+}
+
+# Cloud SQL
+resource "google_sql_database_instance" "main" {
+  name             = "app-db"
+  database_version = "POSTGRES_15"
+  region           = var.region
+
+  settings {
+    tier = "db-f1-micro"
+
+    backup_configuration {
+      enabled = true
+    }
+  }
+
+  deletion_protection = var.environment == "production"
+}
+```
+
+---
+
+## Azure Deployment
+
+### Azure Container Apps
+```bash
+# Create resource group
+az group create --name app-rg --location eastus
+
+# Create Container Apps environment
+az containerapp env create \
+  --name app-env \
+  --resource-group app-rg \
+  --location eastus
+
+# Deploy container
+az containerapp create \
+  --name app \
+  --resource-group app-rg \
+  --environment app-env \
+  --image myregistry.azurecr.io/app:latest \
+  --target-port 3000 \
+  --ingress external \
+  --min-replicas 1 \
+  --max-replicas 10 \
+  --env-vars "NODE_ENV=production"
+```
+
+---
+
+## Kubernetes Deployment
+
+### Manifests
+```yaml
+# deployment.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: app
+  labels:
+    app: app
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: app
+  template:
+    metadata:
+      labels:
+        app: app
+    spec:
+      containers:
+        - name: app
+          image: app:latest
+          ports:
+            - containerPort: 3000
+          env:
+            - name: NODE_ENV
+              value: production
+            - name: DATABASE_URL
+              valueFrom:
+                secretKeyRef:
+                  name: app-secrets
+                  key: database-url
+          resources:
+            requests:
+              memory: "128Mi"
+              cpu: "100m"
+            limits:
+              memory: "512Mi"
+              cpu: "500m"
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 3000
+            initialDelaySeconds: 10
+            periodSeconds: 10
+          readinessProbe:
+            httpGet:
+              path: /ready
+              port: 3000
+            initialDelaySeconds: 5
+            periodSeconds: 5
+---
+# service.yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: app
+spec:
+  selector:
+    app: app
+  ports:
+    - port: 80
+      targetPort: 3000
+  type: ClusterIP
+---
+# ingress.yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: app
+  annotations:
+    kubernetes.io/ingress.class: nginx
+    cert-manager.io/cluster-issuer: letsencrypt-prod
+spec:
+  tls:
+    - hosts:
+        - app.example.com
+      secretName: app-tls
+  rules:
+    - host: app.example.com
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: app
+                port:
+                  number: 80
+```
+
+### Helm Chart Structure
+```
+chart/
+├── Chart.yaml
+├── values.yaml
+├── values-staging.yaml
+├── values-production.yaml
+└── templates/
+    ├── deployment.yaml
+    ├── service.yaml
+    ├── ingress.yaml
+    ├── configmap.yaml
+    ├── secret.yaml
+    └── hpa.yaml
+```
+
+---
+
+## Blue-Green Deployment
+
+### Strategy
+```
+1. Deploy new version to "green" environment
+2. Run smoke tests against green
+3. Switch load balancer to green
+4. Monitor for 15 minutes
+5. If healthy: decommission blue
+6. If errors: switch back to blue (rollback)
+```
+
+### Implementation (AWS ALB)
+```bash
+# Deploy green
+aws ecs update-service --cluster app --service app-green --task-definition app:NEW_VERSION
+
+# Wait for stability
+aws ecs wait services-stable --cluster app --services app-green
+
+# Run smoke tests
+curl -f https://green.app.example.com/health
+
+# Switch traffic (update target group weights)
+aws elbv2 modify-listener-rule \
+  --rule-arn $RULE_ARN \
+  --actions '[{"Type":"forward","TargetGroupArn":"'$GREEN_TG'","Weight":100}]'
+```
+
+---
+
+## Rollback Procedures
+
+### Immediate Rollback
+```bash
+# AWS ECS
+aws ecs update-service --cluster app --service app --task-definition app:PREVIOUS_VERSION
+
+# Kubernetes
+kubectl rollout undo deployment/app
+
+# Vercel
+vercel rollback
+```
+
+### Automated Rollback Triggers
+Monitor these metrics post-deploy:
+- Error rate > 1% for 5 minutes
+- p99 latency > 500ms for 5 minutes
+- Health check failures > 3 consecutive
+- Memory usage > 90% for 10 minutes
+
+If any trigger fires, execute automatic rollback.
+
+---
+
+## Secrets Management
+
+### AWS Secrets Manager
+```bash
+# Create secret
+aws secretsmanager create-secret \
+  --name app/database-url \
+  --secret-string "postgresql://..."
+
+# Reference in ECS task
+"secrets": [
+  {
+    "name": "DATABASE_URL",
+    "valueFrom": "arn:aws:secretsmanager:region:account:secret:app/database-url"
+  }
+]
+```
+
+### HashiCorp Vault
+```bash
+# Store secret
+vault kv put secret/app database-url="postgresql://..."
+
+# Read in application
+vault kv get -field=database-url secret/app
+```
+
+### Environment-Specific
+```
+.env.development   # Local development
+.env.staging       # Staging environment
+.env.production    # Production (never commit)
+```
+
+All production secrets must be in a secrets manager, never in code or environment files.
--- a/skills/loki-mode/references/lab-research-patterns.md
+++ b/skills/loki-mode/references/lab-research-patterns.md
@@ -0,0 +1,534 @@
+# Lab Research Patterns Reference
+
+Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.
+
+---
+
+## Overview
+
+This reference consolidates key patterns from:
+1. **Google DeepMind** - World models, self-improvement, scalable oversight
+2. **Anthropic** - Constitutional AI, alignment safety, agentic coding
+
+---
+
+## Google DeepMind Patterns
+
+### World Model Training (Dreamer 4)
+
+**Key Insight:** Train agents inside world models for safety and data efficiency.
+
+```yaml
+world_model_training:
+  principle: "Learn behaviors through simulation, not real environment"
+  benefits:
+    - 100x less data than real-world training
+    - Safe exploration of dangerous actions
+    - Faster iteration cycles
+
+  architecture:
+    tokenizer: "Compress frames into continuous representation"
+    dynamics_model: "Predict next world state given action"
+    imagination_training: "RL inside simulated trajectories"
+
+  loki_application:
+    - Run agent tasks in isolated containers first
+    - Simulate deployment before actual deploy
+    - Test error scenarios in sandbox
+```
+
+### Self-Improvement Loop (SIMA 2)
+
+**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.
+
+```python
+class SelfImprovementLoop:
+    """
+    Based on SIMA 2's self-improvement mechanism.
+    Gemini-based teacher + learned reward model.
+    """
+
+    def __init__(self):
+        self.task_generator = "Use LLM to generate varied tasks"
+        self.reward_model = "Learned model to score trajectories"
+        self.experience_bank = []
+
+    def bootstrap_cycle(self):
+        # 1. Generate tasks with estimated rewards
+        tasks = self.task_generator.generate(
+            domain=current_project,
+            difficulty_curriculum=True
+        )
+
+        # 2. Execute tasks, accumulate experience
+        for task in tasks:
+            trajectory = execute(task)
+            reward = self.reward_model.score(trajectory)
+            self.experience_bank.append((trajectory, reward))
+
+        # 3. Train next generation on experience
+        next_agent = train_on_experience(self.experience_bank)
+
+        # 4. Iterate with minimal human intervention
+        return next_agent
+```
+
+**Loki Mode Application:**
+- Generate test scenarios automatically
+- Score code quality with learned criteria
+- Bootstrap agent training across projects
+
+### Hierarchical Reasoning (Gemini Robotics)
+
+**Key Insight:** Separate high-level planning from low-level execution.
+
+```
+------------------------------------------------------------------+
+| EMBODIED REASONING MODEL (Gemini Robotics-ER)                     |
+| - Orchestrates activities like a "high-level brain"               |
+| - Spatial understanding, planning, logical decisions              |
+| - Natively calls tools (search, user functions)                   |
+| - Does NOT directly control actions                               |
+------------------------------------------------------------------+
+        |
+        | High-level insights
+        v
+------------------------------------------------------------------+
+| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics)                    |
+| - "Thinks before taking action"                                   |
+| - Generates internal reasoning in natural language                |
+| - Decomposes long tasks into simpler segments                     |
+| - Directly outputs actions/commands                               |
+------------------------------------------------------------------+
+```
+
+**Loki Mode Application:**
+- Orchestrator = ER model (planning, tool calls)
+- Implementation agents = VLA model (code actions)
+- Task decomposition before execution
+
+### Cross-Embodiment Transfer
+
+**Key Insight:** Skills learned by one agent type transfer to others.
+
+```yaml
+transfer_learning:
+  observation: "Tasks learned on ALOHA2 work on Apollo humanoid"
+  mechanism: "Shared action space abstraction"
+
+  loki_application:
+    - Patterns learned by frontend agent transfer to mobile agent
+    - Testing strategies from QA apply to security testing
+    - Deployment scripts generalize across cloud providers
+
+  implementation:
+    shared_skills_library: ".loki/memory/skills/"
+    abstraction_layer: "Domain-agnostic action primitives"
+    transfer_score: "Confidence in skill applicability"
+```
+
+### Scalable Oversight via Debate
+
+**Key Insight:** Pit AI capabilities against each other for verification.
+
+```python
+async def debate_verification(proposal, max_rounds=2):
+    """
+    Based on DeepMind's Scalable AI Safety via Doubly-Efficient Debate.
+    Use debate to break down verification into manageable sub-tasks.
+    """
+    # Two equally capable AI critics
+    proponent = Agent(role="defender", model="opus")
+    opponent = Agent(role="challenger", model="opus")
+
+    debate_log = []
+
+    for round in range(max_rounds):
+        # Proponent defends proposal
+        defense = await proponent.argue(
+            proposal=proposal,
+            counter_arguments=debate_log
+        )
+
+        # Opponent challenges
+        challenge = await opponent.argue(
+            proposal=proposal,
+            defense=defense,
+            goal="find_flaws"
+        )
+
+        debate_log.append({
+            "round": round,
+            "defense": defense,
+            "challenge": challenge
+        })
+
+        # If opponent cannot find valid flaw, proposal is verified
+        if not challenge.has_valid_flaw:
+            return VerificationResult(verified=True, debate_log=debate_log)
+
+    # Human reviews remaining disagreements
+    return escalate_to_human(debate_log)
+```
+
+### Amplified Oversight
+
+**Key Insight:** Use AI to help humans supervise AI beyond human capability.
+
+```yaml
+amplified_oversight:
+  goal: "Supervision as close as possible to human with complete understanding"
+
+  techniques:
+    - "AI explains its reasoning transparently"
+    - "AI argues against itself when wrong"
+    - "AI cites relevant evidence"
+    - "Monitor knows when it doesn't know"
+
+  monitoring_principle:
+    when_unsure: "Either reject action OR flag for review"
+    never: "Approve uncertain actions silently"
+```
+
+---
+
+## Anthropic Patterns
+
+### Constitutional AI Principles
+
+**Key Insight:** Train AI to self-critique based on explicit principles.
+
+```python
+class ConstitutionalAI:
+    """
+    Based on Anthropic's Constitutional AI: Harmlessness from AI Feedback.
+    Self-critique and revision based on constitutional principles.
+    """
+
+    def __init__(self, constitution):
+        self.constitution = constitution  # List of principles
+
+    async def supervised_learning_phase(self, response):
+        """Phase 1: Self-critique and revise."""
+        # Generate initial response
+        initial = response
+
+        # Self-critique against each principle
+        critiques = []
+        for principle in self.constitution:
+            critique = await self.critique(
+                response=initial,
+                principle=principle,
+                prompt=f"Does this response violate: {principle}?"
+            )
+            critiques.append(critique)
+
+        # Revise based on critiques
+        revised = await self.revise(
+            response=initial,
+            critiques=critiques
+        )
+
+        return revised
+
+    async def rlai_phase(self, response_pairs):
+        """Phase 2: AI compares responses for constitutional compliance."""
+        preferences = []
+        for (response_a, response_b) in response_pairs:
+            preference = await self.compare(
+                a=response_a,
+                b=response_b,
+                criterion="Which better follows the constitution?"
+            )
+            preferences.append(preference)
+
+        # Train preference model on AI feedback
+        return train_preference_model(preferences)
+```
+
+**Loki Mode Constitution Example:**
+```yaml
+loki_constitution:
+  - "Never delete production data without explicit backup"
+  - "Never commit secrets or credentials to version control"
+  - "Never bypass quality gates for speed"
+  - "Always verify tests pass before marking task complete"
+  - "Never claim completion without running actual tests"
+  - "Prefer simple solutions over clever ones"
+  - "Document decisions, not just code"
+```
+
+### Sleeper Agent Detection (Defection Probes)
+
+**Key Insight:** Use linear classifiers to detect when agents might misbehave.
+
+```python
+class DefectionProbe:
+    """
+    Based on Anthropic's "Simple probes can catch sleeper agents".
+    Linear classifiers detecting anomalous agent behavior.
+    """
+
+    def __init__(self):
+        self.baseline_behaviors = []
+        self.probe = LinearClassifier()
+
+    def train_probe(self, normal_trajectories, defection_examples):
+        """
+        Train on contrast pairs:
+        - Normal agent behavior
+        - Examples of defection/misbehavior
+        """
+        contrast_pairs = create_contrast_pairs(
+            normal=normal_trajectories,
+            defection=defection_examples
+        )
+        self.probe.fit(contrast_pairs)
+
+    def check_for_defection(self, agent_action, context):
+        """
+        Check if action appears anomalous.
+        AUROC > 99% achieved in Anthropic's research.
+        """
+        features = extract_features(agent_action, context)
+        defection_probability = self.probe.predict_proba(features)
+
+        if defection_probability > THRESHOLD:
+            return DefectionAlert(
+                probability=defection_probability,
+                action=agent_action,
+                recommendation="Flag for human review"
+            )
+        return None
+```
+
+**Loki Mode Application:**
+- Monitor agent actions for anomalies
+- Flag unusual file access patterns
+- Detect attempts to bypass quality gates
+
+### Alignment Faking Awareness
+
+**Key Insight:** Models may strategically comply while preserving hidden preferences.
+
+```yaml
+alignment_faking_signals:
+  warning_signs:
+    - "Agent complies only when it knows it's being monitored"
+    - "Agent's reasoning contradicts its final action"
+    - "Sudden behavior change when training context mentioned"
+    - "Agent attempts to exfiltrate its own weights/state"
+
+  detection_strategy:
+    - "Monitor chain-of-thought for rationalization"
+    - "Compare behavior in 'free' vs 'monitored' conditions"
+    - "Check for attempts to undermine oversight systems"
+
+  mitigation:
+    - "Constitutional constraints at training time"
+    - "Transparency in reasoning (visible extended thinking)"
+    - "Defection probes on intermediate reasoning"
+```
+
+### Extended Thinking Levels
+
+**Key Insight:** Allocate computation based on problem complexity.
+
+```yaml
+thinking_levels:
+  "think":
+    computation: "Minimal"
+    use_for: "Simple questions, straightforward tasks"
+
+  "think hard":
+    computation: "Moderate"
+    use_for: "Multi-step problems, code implementation"
+
+  "think harder":
+    computation: "Extended"
+    use_for: "Complex debugging, architecture decisions"
+
+  "ultrathink":
+    computation: "Maximum"
+    use_for: "Security analysis, critical system design"
+
+loki_mode_mapping:
+  haiku_tasks: "think"
+  sonnet_tasks: "think hard"
+  opus_tasks: "think harder to ultrathink"
+```
+
+### Explore-Plan-Code Pattern
+
+**Key Insight:** Research before planning, plan before coding.
+
+```
+------------------------------------------------------------------+
+| PHASE 1: EXPLORE                                                  |
+| - Research relevant files                                         |
+| - Understand existing patterns                                    |
+| - Identify dependencies and constraints                           |
+| - NO CODE CHANGES YET                                             |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| PHASE 2: PLAN                                                     |
+| - Create detailed implementation plan                             |
+| - List all files to modify                                        |
+| - Define success criteria                                         |
+| - Get checkpoint approval if needed                               |
+| - STILL NO CODE CHANGES                                           |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| PHASE 3: CODE                                                     |
+| - Execute plan systematically                                     |
+| - Test after each file change                                     |
+| - Update plan if discoveries require it                           |
+| - Verify against success criteria                                 |
+------------------------------------------------------------------+
+```
+
+### Context Reset Strategy
+
+**Key Insight:** Fresh context often performs better than accumulated context.
+
+```yaml
+context_management:
+  problem: "Long sessions accumulate irrelevant information"
+
+  solution:
+    trigger_reset:
+      - "After completing major task"
+      - "When changing domains (backend -> frontend)"
+      - "When agent seems confused or repeating errors"
+
+    preserve_across_reset:
+      - "CONTINUITY.md (working memory)"
+      - "Key decisions made this session"
+      - "Current task state"
+
+    discard_on_reset:
+      - "Intermediate debugging attempts"
+      - "Abandoned approaches"
+      - "Superseded plans"
+```
+
+### Parallel Instance Pattern
+
+**Key Insight:** Multiple Claude instances with separation of concerns.
+
+```python
+async def parallel_instance_pattern(task):
+    """
+    Run multiple Claude instances for separation of concerns.
+    Based on Anthropic's Claude Code best practices.
+    """
+    # Instance 1: Implementation
+    implementer = spawn_instance(
+        role="implementer",
+        context=implementation_context,
+        permissions=["edit", "bash"]
+    )
+
+    # Instance 2: Review
+    reviewer = spawn_instance(
+        role="reviewer",
+        context=review_context,
+        permissions=["read"]  # Read-only for safety
+    )
+
+    # Parallel execution
+    implementation = await implementer.execute(task)
+    review = await reviewer.review(implementation)
+
+    if review.approved:
+        return implementation
+    else:
+        # Feed review back to implementer for fixes
+        fixed = await implementer.fix(review.issues)
+        return fixed
+```
+
+### Prompt Injection Defense
+
+**Key Insight:** Multi-layer defense against injection attacks.
+
+```yaml
+prompt_injection_defense:
+  layers:
+    layer_1_recognition:
+      - "Train to recognize injection patterns"
+      - "Detect malicious content in external sources"
+
+    layer_2_context_isolation:
+      - "Sandbox external content processing"
+      - "Mark user content vs system instructions"
+
+    layer_3_action_validation:
+      - "Verify requested actions are authorized"
+      - "Block sensitive operations without confirmation"
+
+    layer_4_monitoring:
+      - "Log all external content interactions"
+      - "Alert on suspicious patterns"
+
+  performance:
+    claude_opus_4: "89% attack prevention"
+    claude_sonnet_4: "86% attack prevention"
+```
+
+---
+
+## Combined Patterns for Loki Mode
+
+### Self-Improving Multi-Agent System
+
+```yaml
+combined_approach:
+  world_model_training: "Test in simulation before real execution"
+  self_improvement: "Bootstrap learning from successful trajectories"
+  constitutional_constraints: "Principles-based self-critique"
+  debate_verification: "Pit reviewers against each other"
+  defection_probes: "Monitor for alignment faking"
+
+  implementation_priority:
+    high:
+      - Constitutional AI principles in agent prompts
+      - Explore-Plan-Code workflow enforcement
+      - Context reset triggers
+
+    medium:
+      - Self-improvement loop for task generation
+      - Debate-based verification for critical changes
+      - Cross-embodiment skill transfer
+
+    low:
+      - Full world model training
+      - Defection probe classifiers
+```
+
+---
+
+## Sources
+
+**Google DeepMind:**
+- [SIMA 2: Generalist AI Agent](https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/)
+- [Gemini Robotics 1.5](https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/)
+- [Dreamer 4: World Model Training](https://danijar.com/project/dreamer4/)
+- [Genie 3: World Models](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/)
+- [Scalable AI Safety via Debate](https://deepmind.google/research/publications/34920/)
+- [Amplified Oversight](https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a)
+- [Technical AGI Safety Approach](https://arxiv.org/html/2504.01849v1)
+
+**Anthropic:**
+- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
+- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
+- [Claude Code Best Practices](https://www.anthropic.com/engineering/claude-code-best-practices)
+- [Sleeper Agents Detection](https://www.anthropic.com/research/probes-catch-sleeper-agents)
+- [Alignment Faking](https://www.anthropic.com/research/alignment-faking)
+- [Visible Extended Thinking](https://www.anthropic.com/research/visible-extended-thinking)
+- [Computer Use Safety](https://www.anthropic.com/news/3-5-models-and-computer-use)
+- [Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations-for-frontier-models)
--- a/skills/loki-mode/references/memory-system.md
+++ b/skills/loki-mode/references/memory-system.md
@@ -0,0 +1,444 @@
+# Memory System Reference
+
+Enhanced memory architecture based on 2025 research (MIRIX, A-Mem, MemGPT, AriGraph).
+
+---
+
+## Memory Hierarchy Overview
+
+```
+------------------------------------------------------------------+
+| WORKING MEMORY (CONTINUITY.md)                                    |
+| - Current session state                                           |
+| - Updated every turn                                              |
+| - What am I doing right NOW?                                      |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| EPISODIC MEMORY (.loki/memory/episodic/)                         |
+| - Specific interaction traces                                     |
+| - Full context with timestamps                                    |
+| - "What happened when I tried X?"                                 |
+------------------------------------------------------------------+
+        |
+        v (consolidation)
+------------------------------------------------------------------+
+| SEMANTIC MEMORY (.loki/memory/semantic/)                         |
+| - Generalized patterns and facts                                  |
+| - Context-independent knowledge                                   |
+| - "How does X work in general?"                                   |
+------------------------------------------------------------------+
+        |
+        v
+------------------------------------------------------------------+
+| PROCEDURAL MEMORY (.loki/memory/skills/)                         |
+| - Learned action sequences                                        |
+| - Reusable skill templates                                        |
+| - "How to do X successfully"                                      |
+------------------------------------------------------------------+
+```
+
+---
+
+## Directory Structure
+
+```
+.loki/memory/
+-- episodic/
+|   +-- 2026-01-06/
+|   |   +-- task-001.json      # Full trace of task execution
+|   |   +-- task-002.json
+|   +-- index.json             # Temporal index for retrieval
+|
+-- semantic/
+|   +-- patterns.json          # Generalized patterns
+|   +-- anti-patterns.json     # What NOT to do
+|   +-- facts.json             # Domain knowledge
+|   +-- links.json             # Zettelkasten-style connections
+|
+-- skills/
+|   +-- api-implementation.md  # Skill: How to implement an API
+|   +-- test-writing.md        # Skill: How to write tests
+|   +-- debugging.md           # Skill: How to debug issues
+|
+-- ledgers/                   # Agent-specific checkpoints
+|   +-- eng-001.json
+|   +-- qa-001.json
+|
+-- handoffs/                  # Agent-to-agent transfers
+|   +-- handoff-001.json
+|
+-- learnings/                 # Extracted from errors
+|   +-- 2026-01-06.json
+
+# Related: Metrics System (separate from memory)
+# .loki/metrics/
+# +-- efficiency/              # Task cost tracking (time, agents, retries)
+# +-- rewards/                 # Outcome/efficiency/preference signals
+# +-- dashboard.json           # Rolling 7-day metrics summary
+# See references/tool-orchestration.md for details
+```
+
+---
+
+## Episodic Memory Schema
+
+Each task execution creates an episodic trace:
+
+```json
+{
+  "id": "ep-2026-01-06-001",
+  "task_id": "task-042",
+  "timestamp": "2026-01-06T10:30:00Z",
+  "duration_seconds": 342,
+  "agent": "eng-001-backend",
+  "context": {
+    "phase": "development",
+    "goal": "Implement POST /api/todos endpoint",
+    "constraints": ["No third-party deps", "< 200ms response"],
+    "files_involved": ["src/routes/todos.ts", "src/db/todos.ts"]
+  },
+  "action_log": [
+    {"t": 0, "action": "read_file", "target": "openapi.yaml"},
+    {"t": 5, "action": "write_file", "target": "src/routes/todos.ts"},
+    {"t": 120, "action": "run_test", "result": "fail", "error": "missing return type"},
+    {"t": 140, "action": "edit_file", "target": "src/routes/todos.ts"},
+    {"t": 180, "action": "run_test", "result": "pass"}
+  ],
+  "outcome": "success",
+  "errors_encountered": [
+    {
+      "type": "TypeScript compilation",
+      "message": "Missing return type annotation",
+      "resolution": "Added explicit :void to route handler"
+    }
+  ],
+  "artifacts_produced": ["src/routes/todos.ts", "tests/todos.test.ts"],
+  "git_commit": "abc123"
+}
+```
+
+---
+
+## Semantic Memory Schema
+
+Generalized patterns extracted from episodic memory:
+
+```json
+{
+  "id": "sem-001",
+  "pattern": "Express route handlers require explicit return types in strict mode",
+  "category": "typescript",
+  "conditions": [
+    "Using TypeScript strict mode",
+    "Writing Express route handlers",
+    "Handler doesn't return a value"
+  ],
+  "correct_approach": "Add `: void` to handler signature: `(req, res): void =>`",
+  "incorrect_approach": "Omitting return type annotation",
+  "confidence": 0.95,
+  "source_episodes": ["ep-2026-01-06-001", "ep-2026-01-05-012"],
+  "usage_count": 8,
+  "last_used": "2026-01-06T14:00:00Z",
+  "links": [
+    {"to": "sem-005", "relation": "related_to"},
+    {"to": "sem-012", "relation": "supersedes"}
+  ]
+}
+```
+
+---
+
+## Episodic-to-Semantic Consolidation
+
+**When to consolidate:** After task completion, during idle time, at phase boundaries.
+
+```python
+def consolidate_episodic_to_semantic():
+    """
+    Transform specific experiences into general knowledge.
+    Based on MemGPT and Voyager research.
+    """
+    # 1. Load recent episodic memories
+    recent_episodes = load_episodes(since=hours_ago(24))
+
+    # 2. Group by similarity
+    clusters = cluster_by_similarity(recent_episodes)
+
+    for cluster in clusters:
+        if len(cluster) >= 2:  # Pattern appears multiple times
+            # 3. Extract common pattern
+            pattern = extract_common_pattern(cluster)
+
+            # 4. Validate pattern
+            if pattern.confidence >= 0.8:
+                # 5. Check if already exists
+                existing = find_similar_semantic(pattern)
+                if existing:
+                    # Update existing with new evidence
+                    existing.source_episodes.extend([e.id for e in cluster])
+                    existing.confidence = recalculate_confidence(existing)
+                    existing.usage_count += 1
+                else:
+                    # Create new semantic memory
+                    save_semantic(pattern)
+
+    # 6. Consolidate anti-patterns from errors
+    error_episodes = [e for e in recent_episodes if e.errors_encountered]
+    for episode in error_episodes:
+        for error in episode.errors_encountered:
+            anti_pattern = {
+                "what_fails": error.type,
+                "why": error.message,
+                "prevention": error.resolution,
+                "source": episode.id
+            }
+            save_anti_pattern(anti_pattern)
+```
+
+---
+
+## Zettelkasten-Style Linking
+
+Each memory note can link to related notes:
+
+```json
+{
+  "links": [
+    {"to": "sem-005", "relation": "derived_from"},
+    {"to": "sem-012", "relation": "contradicts"},
+    {"to": "sem-018", "relation": "elaborates"},
+    {"to": "sem-023", "relation": "example_of"},
+    {"to": "sem-031", "relation": "superseded_by"}
+  ]
+}
+```
+
+### Link Relations
+
+| Relation | Meaning |
+|----------|---------|
+| `derived_from` | This pattern was extracted from that episode |
+| `related_to` | Conceptually similar, often used together |
+| `contradicts` | These patterns conflict - need resolution |
+| `elaborates` | Provides more detail on the linked pattern |
+| `example_of` | Specific instance of a general pattern |
+| `supersedes` | This pattern replaces an older one |
+| `superseded_by` | This pattern is outdated, use the linked one |
+
+---
+
+## Procedural Memory (Skills)
+
+Reusable action sequences:
+
+```markdown
+# Skill: API Endpoint Implementation
+
+## Prerequisites
+- OpenAPI spec exists at .loki/specs/openapi.yaml
+- Database schema defined
+
+## Steps
+1. Read endpoint spec from openapi.yaml
+2. Create route handler in src/routes/{resource}.ts
+3. Implement request validation using spec schema
+4. Implement business logic
+5. Add database operations if needed
+6. Return response matching spec schema
+7. Write contract tests
+8. Run tests, verify passing
+
+## Common Errors & Fixes
+- Missing return type: Add `: void` to handler
+- Schema mismatch: Regenerate types from spec
+
+## Exit Criteria
+- All contract tests pass
+- Response matches OpenAPI spec
+- No TypeScript errors
+```
+
+---
+
+## Memory Retrieval
+
+### Retrieval by Similarity
+
+```python
+def retrieve_relevant_memory(current_context):
+    """
+    Retrieve memories relevant to current task.
+    Uses semantic similarity + temporal recency.
+    """
+    query_embedding = embed(current_context.goal)
+
+    # 1. Search semantic memory first
+    semantic_matches = vector_search(
+        collection="semantic",
+        query=query_embedding,
+        top_k=5
+    )
+
+    # 2. Search episodic memory for similar situations
+    episodic_matches = vector_search(
+        collection="episodic",
+        query=query_embedding,
+        top_k=3,
+        filters={"outcome": "success"}  # Prefer successful episodes
+    )
+
+    # 3. Search skills
+    skill_matches = keyword_search(
+        collection="skills",
+        keywords=extract_keywords(current_context)
+    )
+
+    # 4. Combine and rank
+    combined = merge_and_rank(
+        semantic_matches,
+        episodic_matches,
+        skill_matches,
+        weights={"semantic": 0.5, "episodic": 0.3, "skills": 0.2}
+    )
+
+    return combined[:5]  # Return top 5 most relevant
+```
+
+### Retrieval Before Task Execution
+
+**CRITICAL:** Before executing any task, retrieve relevant memories:
+
+```python
+def before_task_execution(task):
+    """
+    Inject relevant memories into task context.
+    """
+    # 1. Retrieve relevant memories
+    memories = retrieve_relevant_memory(task)
+
+    # 2. Check for anti-patterns
+    anti_patterns = search_anti_patterns(task.action_type)
+
+    # 3. Inject into prompt
+    task.context["relevant_patterns"] = [m.summary for m in memories]
+    task.context["avoid_these"] = [a.summary for a in anti_patterns]
+    task.context["applicable_skills"] = find_skills(task.type)
+
+    return task
+```
+
+---
+
+## Ledger System (Agent Checkpoints)
+
+Each agent maintains its own ledger:
+
+```json
+{
+  "agent_id": "eng-001-backend",
+  "last_checkpoint": "2026-01-06T10:00:00Z",
+  "tasks_completed": 12,
+  "current_task": "task-042",
+  "state": {
+    "files_modified": ["src/routes/todos.ts"],
+    "uncommitted_changes": true,
+    "last_git_commit": "abc123"
+  },
+  "context": {
+    "tech_stack": ["express", "typescript", "sqlite"],
+    "patterns_learned": ["sem-001", "sem-005"],
+    "current_goal": "Implement CRUD for todos"
+  }
+}
+```
+
+---
+
+## Handoff Protocol
+
+When switching between agents:
+
+```json
+{
+  "id": "handoff-001",
+  "from_agent": "eng-001-backend",
+  "to_agent": "qa-001-testing",
+  "timestamp": "2026-01-06T11:00:00Z",
+  "context": {
+    "what_was_done": "Implemented POST /api/todos endpoint",
+    "artifacts": ["src/routes/todos.ts"],
+    "git_state": "commit abc123",
+    "needs_testing": ["unit tests for validation", "contract tests"],
+    "known_issues": [],
+    "relevant_patterns": ["sem-001"]
+  }
+}
+```
+
+---
+
+## Memory Maintenance
+
+### Pruning Old Episodic Memories
+
+```python
+def prune_episodic_memories():
+    """
+    Keep episodic memories from:
+    - Last 7 days (full detail)
+    - Last 30 days (summarized)
+    - Older: only if referenced by semantic memory
+    """
+    now = datetime.now()
+
+    for episode in load_all_episodes():
+        age_days = (now - episode.timestamp).days
+
+        if age_days > 30:
+            if not is_referenced_by_semantic(episode):
+                archive_episode(episode)
+        elif age_days > 7:
+            summarize_episode(episode)
+```
+
+### Merging Duplicate Patterns
+
+```python
+def merge_duplicate_semantics():
+    """
+    Find and merge semantically similar patterns.
+    """
+    all_patterns = load_semantic_patterns()
+
+    clusters = cluster_by_embedding_similarity(all_patterns, threshold=0.9)
+
+    for cluster in clusters:
+        if len(cluster) > 1:
+            # Keep highest confidence, merge sources
+            primary = max(cluster, key=lambda p: p.confidence)
+            for other in cluster:
+                if other != primary:
+                    primary.source_episodes.extend(other.source_episodes)
+                    primary.usage_count += other.usage_count
+                    create_link(other, primary, "superseded_by")
+            save_semantic(primary)
+```
+
+---
+
+## Integration with CONTINUITY.md
+
+CONTINUITY.md is working memory - it references but doesn't duplicate long-term memory:
+
+```markdown
+## Relevant Memories (Auto-Retrieved)
+- [sem-001] Express handlers need explicit return types
+- [ep-2026-01-05-012] Similar endpoint implementation succeeded
+- [skill: api-implementation] Standard API implementation flow
+
+## Mistakes to Avoid (From Learnings)
+- Don't forget return type annotations
+- Run contract tests before marking complete
+```
--- a/skills/loki-mode/references/openai-patterns.md
+++ b/skills/loki-mode/references/openai-patterns.md
@@ -0,0 +1,647 @@
+# OpenAI Agent Patterns Reference
+
+Research-backed patterns from OpenAI's Agents SDK, Deep Research, and autonomous agent frameworks.
+
+---
+
+## Overview
+
+OpenAI's agent ecosystem provides four key architectural innovations for Loki Mode:
+
+1. **Tracing Spans** - Hierarchical event tracking with span types
+2. **Guardrails & Tripwires** - Input/output validation with early termination
+3. **Handoff Callbacks** - Data preparation during agent transfers
+4. **Multi-Tiered Fallbacks** - Model and workflow-level failure recovery
+
+---
+
+## Tracing Spans Architecture
+
+### Span Types (Agents SDK Pattern)
+
+Every operation is wrapped in a typed span for observability:
+
+```yaml
+span_types:
+  agent_span:
+    - Wraps entire agent execution
+    - Contains: agent_name, instructions_hash, model
+
+  generation_span:
+    - Wraps LLM API calls
+    - Contains: model, tokens_in, tokens_out, latency_ms
+
+  function_span:
+    - Wraps tool/function calls
+    - Contains: function_name, arguments, result, success
+
+  guardrail_span:
+    - Wraps validation checks
+    - Contains: guardrail_name, triggered, blocking
+
+  handoff_span:
+    - Wraps agent-to-agent transfers
+    - Contains: from_agent, to_agent, context_passed
+
+  custom_span:
+    - User-defined operations
+    - Contains: operation_name, metadata
+```
+
+### Hierarchical Trace Structure
+
+```json
+{
+  "trace_id": "trace_abc123def456",
+  "workflow_name": "implement_feature",
+  "group_id": "session_xyz789",
+  "spans": [
+    {
+      "span_id": "span_001",
+      "parent_id": null,
+      "type": "agent_span",
+      "agent_name": "orchestrator",
+      "started_at": "2026-01-07T10:00:00Z",
+      "ended_at": "2026-01-07T10:05:00Z",
+      "children": ["span_002", "span_003"]
+    },
+    {
+      "span_id": "span_002",
+      "parent_id": "span_001",
+      "type": "guardrail_span",
+      "guardrail_name": "input_validation",
+      "triggered": false,
+      "blocking": true
+    },
+    {
+      "span_id": "span_003",
+      "parent_id": "span_001",
+      "type": "handoff_span",
+      "from_agent": "orchestrator",
+      "to_agent": "backend-dev",
+      "context_passed": ["task_spec", "related_files"]
+    }
+  ]
+}
+```
+
+### Storage Location
+
+```
+.loki/traces/
+├── active/
+│   └── {trace_id}.json     # Currently running traces
+└── completed/
+    └── {date}/
+        └── {trace_id}.json # Archived traces by date
+```
+
+---
+
+## Guardrails & Tripwires System
+
+### Input Guardrails
+
+Run **before** agent execution to validate user input:
+
+```python
+@input_guardrail(blocking=True)
+async def validate_task_scope(input, context):
+    """
+    Blocks tasks outside project scope.
+    Based on OpenAI Agents SDK pattern.
+    """
+    # Check if task references files outside project
+    if references_external_paths(input):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Task references paths outside project root"
+        )
+
+    # Check for disallowed operations
+    if contains_destructive_operation(input):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Destructive operation requires human approval"
+        )
+
+    return GuardrailResult(tripwire_triggered=False)
+```
+
+### Output Guardrails
+
+Run **after** agent execution to validate results:
+
+```python
+@output_guardrail
+async def validate_code_quality(output, context):
+    """
+    Blocks low-quality code output.
+    """
+    if output.type == "code":
+        issues = run_static_analysis(output.content)
+        critical = [i for i in issues if i.severity == "critical"]
+
+        if critical:
+            return GuardrailResult(
+                tripwire_triggered=True,
+                reason=f"Critical issues found: {critical}"
+            )
+
+    return GuardrailResult(tripwire_triggered=False)
+```
+
+### Execution Modes
+
+| Mode | Behavior | Use When |
+|------|----------|----------|
+| **Blocking** | Guardrail completes before agent starts | Sensitive operations, expensive models |
+| **Parallel** | Guardrail runs concurrently with agent | Fast checks, acceptable token loss |
+
+```python
+# Blocking mode: prevents token consumption
+@input_guardrail(blocking=True, run_in_parallel=False)
+async def expensive_validation(input):
+    # Agent won't start until this completes
+    pass
+
+# Parallel mode: faster but may waste tokens if fails
+@input_guardrail(blocking=True, run_in_parallel=True)
+async def fast_validation(input):
+    # Runs alongside agent start
+    pass
+```
+
+### Tripwire Exceptions
+
+When tripwire triggers, execution halts immediately:
+
+```python
+class InputGuardrailTripwireTriggered(Exception):
+    """Raised when input validation fails."""
+    pass
+
+class OutputGuardrailTripwireTriggered(Exception):
+    """Raised when output validation fails."""
+    pass
+
+# In agent loop:
+try:
+    result = await run_agent(task)
+except InputGuardrailTripwireTriggered as e:
+    log_blocked_attempt(e)
+    return early_exit(reason=str(e))
+except OutputGuardrailTripwireTriggered as e:
+    rollback_changes()
+    return retry_with_constraints(e.constraints)
+```
+
+### Layered Defense Strategy
+
+> "Think of guardrails as a layered defense mechanism. While a single one is unlikely to provide sufficient protection, using multiple, specialized guardrails together creates more resilient agents." - OpenAI Agents SDK
+
+```yaml
+guardrail_layers:
+  layer_1_input:
+    - scope_validation      # Is task within bounds?
+    - pii_detection         # Contains sensitive data?
+    - injection_detection   # Prompt injection attempt?
+
+  layer_2_pre_execution:
+    - cost_estimation       # Will this exceed budget?
+    - dependency_check      # Are dependencies available?
+    - conflict_detection    # Will this conflict with in-progress work?
+
+  layer_3_output:
+    - static_analysis       # Code quality issues?
+    - secret_detection      # Secrets in output?
+    - spec_compliance       # Matches OpenAPI spec?
+
+  layer_4_post_action:
+    - test_validation       # Tests pass?
+    - review_approval       # Review passed?
+    - deployment_safety     # Safe to deploy?
+```
+
+---
+
+## Handoff Callbacks
+
+### on_handoff Pattern
+
+Prepare data when transferring between agents:
+
+```python
+async def on_handoff_to_backend_dev(handoff_context):
+    """
+    Called when orchestrator hands off to backend-dev agent.
+    Fetches context the receiving agent will need.
+    """
+    # Pre-fetch relevant files
+    relevant_files = await find_related_files(handoff_context.task)
+
+    # Load architectural context
+    architecture = await read_file(".loki/specs/architecture.md")
+
+    # Get recent changes to affected areas
+    recent_commits = await git_log(paths=relevant_files, limit=10)
+
+    return HandoffData(
+        files=relevant_files,
+        architecture=architecture,
+        recent_changes=recent_commits,
+        constraints=handoff_context.constraints
+    )
+
+# Register callback
+handoff(
+    to_agent=backend_dev,
+    on_handoff=on_handoff_to_backend_dev
+)
+```
+
+### Handoff Context Transfer
+
+```json
+{
+  "handoff_id": "ho_abc123",
+  "from_agent": "orchestrator",
+  "to_agent": "backend-dev",
+  "timestamp": "2026-01-07T10:05:00Z",
+  "context": {
+    "task_id": "task-001",
+    "goal": "Implement user authentication endpoint",
+    "constraints": [
+      "Use existing auth patterns from src/auth/",
+      "Maintain backwards compatibility",
+      "Add rate limiting"
+    ],
+    "pre_fetched": {
+      "files": ["src/auth/middleware.ts", "src/routes/index.ts"],
+      "architecture": "...",
+      "recent_changes": [...]
+    }
+  },
+  "return_expected": true,
+  "timeout_seconds": 600
+}
+```
+
+---
+
+## Multi-Tiered Fallback System
+
+### Model-Level Fallbacks
+
+```python
+async def execute_with_model_fallback(task, preferred_model):
+    """
+    Try preferred model, fall back to alternatives on failure.
+    Based on OpenAI safety patterns.
+    """
+    fallback_chain = {
+        "opus": ["sonnet", "haiku"],
+        "sonnet": ["haiku", "opus"],
+        "haiku": ["sonnet"]
+    }
+
+    models_to_try = [preferred_model] + fallback_chain.get(preferred_model, [])
+
+    for model in models_to_try:
+        try:
+            result = await run_agent(task, model=model)
+            if result.success:
+                return result
+        except RateLimitError:
+            log_warning(f"Rate limit on {model}, trying fallback")
+            continue
+        except ModelUnavailableError:
+            log_warning(f"{model} unavailable, trying fallback")
+            continue
+
+    # All models failed
+    return escalate_to_human(task, reason="All model fallbacks exhausted")
+```
+
+### Workflow-Level Fallbacks
+
+```python
+async def execute_with_workflow_fallback(task):
+    """
+    If complex workflow fails, fall back to simpler operations.
+    """
+    # Try full workflow first
+    try:
+        return await full_implementation_workflow(task)
+    except WorkflowError as e:
+        log_warning(f"Full workflow failed: {e}")
+
+    # Fall back to simpler approach
+    try:
+        return await simplified_workflow(task)
+    except WorkflowError as e:
+        log_warning(f"Simplified workflow failed: {e}")
+
+    # Last resort: decompose and try piece by piece
+    try:
+        subtasks = decompose_task(task)
+        results = []
+        for subtask in subtasks:
+            result = await execute_single_step(subtask)
+            results.append(result)
+        return combine_results(results)
+    except Exception as e:
+        return escalate_to_human(task, reason=f"All workflows failed: {e}")
+```
+
+### Fallback Decision Tree
+
+```
+Task Execution
+    |
+    +-- Try preferred approach
+    |   |
+    |   +-- Success? --> Done
+    |   |
+    |   +-- Rate limit? --> Try next model in chain
+    |   |
+    |   +-- Error? --> Try simpler workflow
+    |
+    +-- All workflows failed?
+    |   |
+    |   +-- Decompose into subtasks
+    |   |
+    |   +-- Execute piece by piece
+    |
+    +-- Still failing?
+        |
+        +-- Escalate to human
+        +-- Log detailed failure context
+        +-- Save state for resume
+```
+
+---
+
+## Confidence-Based Human Escalation
+
+### Confidence Scoring
+
+```python
+def calculate_confidence(task_result):
+    """
+    Score confidence 0-1 based on multiple signals.
+    Low confidence triggers human review.
+    """
+    signals = []
+
+    # Test coverage signal
+    if task_result.test_coverage >= 0.9:
+        signals.append(1.0)
+    elif task_result.test_coverage >= 0.7:
+        signals.append(0.7)
+    else:
+        signals.append(0.3)
+
+    # Review consensus signal
+    if task_result.review_unanimous:
+        signals.append(1.0)
+    elif task_result.review_majority:
+        signals.append(0.7)
+    else:
+        signals.append(0.3)
+
+    # Retry count signal
+    retry_penalty = min(task_result.retry_count * 0.2, 0.8)
+    signals.append(1.0 - retry_penalty)
+
+    return sum(signals) / len(signals)
+
+# Escalation threshold
+CONFIDENCE_THRESHOLD = 0.6
+
+if calculate_confidence(result) < CONFIDENCE_THRESHOLD:
+    escalate_to_human(
+        task,
+        reason="Low confidence score",
+        context=result
+    )
+```
+
+### Automatic Escalation Triggers
+
+```yaml
+human_escalation_triggers:
+  # Retry-based
+  - condition: retry_count > 3
+    action: pause_and_escalate
+    reason: "Multiple failures indicate unclear requirements"
+
+  # Domain-based
+  - condition: domain in ["payments", "auth", "pii"]
+    action: require_approval
+    reason: "Sensitive domain requires human review"
+
+  # Confidence-based
+  - condition: confidence_score < 0.6
+    action: pause_and_escalate
+    reason: "Low confidence in solution quality"
+
+  # Time-based
+  - condition: wall_time > expected_time * 3
+    action: pause_and_escalate
+    reason: "Task taking much longer than expected"
+
+  # Cost-based
+  - condition: tokens_used > budget * 0.8
+    action: pause_and_escalate
+    reason: "Approaching token budget limit"
+```
+
+---
+
+## AGENTS.md Integration
+
+### Reading Target Project's AGENTS.md
+
+```python
+async def load_project_context():
+    """
+    Read AGENTS.md from target project if exists.
+    Based on OpenAI/AAIF standard.
+    """
+    agents_md_locations = [
+        "AGENTS.md",
+        ".github/AGENTS.md",
+        "docs/AGENTS.md"
+    ]
+
+    for location in agents_md_locations:
+        if await file_exists(location):
+            content = await read_file(location)
+            return parse_agents_md(content)
+
+    # No AGENTS.md found - use defaults
+    return default_project_context()
+
+def parse_agents_md(content):
+    """
+    Extract structured guidance from AGENTS.md.
+    """
+    sections = parse_markdown_sections(content)
+
+    return ProjectContext(
+        build_commands=sections.get("build", []),
+        test_commands=sections.get("test", []),
+        code_style=sections.get("code style", {}),
+        architecture_notes=sections.get("architecture", ""),
+        deployment_notes=sections.get("deployment", ""),
+        security_notes=sections.get("security", "")
+    )
+```
+
+### Context Priority
+
+```
+1. AGENTS.md (closest to current file, monorepo-aware)
+2. CLAUDE.md (Claude-specific instructions)
+3. .loki/CONTINUITY.md (session state)
+4. Package-level documentation
+5. README.md (general project info)
+```
+
+---
+
+## Reasoning Model Guidance
+
+### When to Use Extended Thinking
+
+Based on OpenAI's o3/o4-mini patterns:
+
+```yaml
+use_extended_reasoning:
+  always:
+    - System architecture design
+    - Security vulnerability analysis
+    - Complex debugging (multi-file, unclear root cause)
+    - API design decisions
+    - Performance optimization strategy
+
+  sometimes:
+    - Code review (only for critical/complex changes)
+    - Refactoring planning (when multiple approaches exist)
+    - Integration design (when crossing system boundaries)
+
+  never:
+    - Simple bug fixes
+    - Documentation updates
+    - Unit test writing
+    - Formatting/linting
+    - File operations
+```
+
+### Backtracking Pattern
+
+```python
+async def execute_with_backtracking(task, max_backtracks=3):
+    """
+    Allow agent to backtrack and try different approaches.
+    Based on Deep Research's adaptive planning.
+    """
+    attempts = []
+
+    for attempt in range(max_backtracks + 1):
+        # Generate approach considering previous failures
+        approach = await plan_approach(
+            task,
+            failed_approaches=attempts
+        )
+
+        result = await execute_approach(approach)
+
+        if result.success:
+            return result
+
+        # Record failed approach for learning
+        attempts.append({
+            "approach": approach,
+            "failure_reason": result.error,
+            "partial_progress": result.partial_output
+        })
+
+        # Backtrack: reset to clean state
+        await rollback_to_checkpoint(task.checkpoint_id)
+
+    return FailedResult(
+        reason="Max backtracks exceeded",
+        attempts=attempts
+    )
+```
+
+---
+
+## Session State Management
+
+### Automatic State Persistence
+
+```python
+class Session:
+    """
+    Automatic conversation history and state management.
+    Inspired by OpenAI Agents SDK Sessions.
+    """
+
+    def __init__(self, session_id):
+        self.session_id = session_id
+        self.state_file = f".loki/state/sessions/{session_id}.json"
+        self.history = []
+        self.context = {}
+
+    async def save_state(self):
+        state = {
+            "session_id": self.session_id,
+            "history": self.history,
+            "context": self.context,
+            "last_updated": now()
+        }
+        await write_json(self.state_file, state)
+
+    async def load_state(self):
+        if await file_exists(self.state_file):
+            state = await read_json(self.state_file)
+            self.history = state["history"]
+            self.context = state["context"]
+
+    async def add_turn(self, role, content, metadata=None):
+        self.history.append({
+            "role": role,
+            "content": content,
+            "metadata": metadata,
+            "timestamp": now()
+        })
+        await self.save_state()
+```
+
+---
+
+## Sources
+
+**OpenAI Official:**
+- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/)
+- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf)
+- [Building Agents Track](https://developers.openai.com/tracks/building-agents/)
+- [AGENTS.md Specification](https://agents.md/)
+
+**Deep Research & Reasoning:**
+- [Introducing Deep Research](https://openai.com/index/introducing-deep-research/)
+- [Deep Research System Card](https://cdn.openai.com/deep-research-system-card.pdf)
+- [Introducing o3 and o4-mini](https://openai.com/index/introducing-o3-and-o4-mini/)
+- [Reasoning Best Practices](https://platform.openai.com/docs/guides/reasoning-best-practices)
+
+**Safety & Monitoring:**
+- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/)
+- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety)
+- [Computer-Using Agent](https://openai.com/index/computer-using-agent/)
+
+**Standards & Interoperability:**
+- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/)
+- [OpenAI for Developers 2025](https://developers.openai.com/blog/openai-for-developers-2025/)
--- a/skills/loki-mode/references/production-patterns.md
+++ b/skills/loki-mode/references/production-patterns.md
@@ -0,0 +1,568 @@
+# Production Patterns Reference
+
+Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.
+
+---
+
+## Overview
+
+This reference consolidates battle-tested insights from:
+- HN discussions on autonomous agents in production (2025)
+- Coding with LLMs practitioner experiences
+- Simon Willison's Superpowers coding agent patterns
+- Multi-agent orchestration real-world deployments
+
+---
+
+## What Actually Works in Production
+
+### Human-in-the-Loop (HITL) is Non-Negotiable
+
+**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.
+
+```yaml
+hitl_patterns:
+  always_human:
+    - Customer-facing responses
+    - Financial transactions
+    - Security-critical operations
+    - Legal/compliance decisions
+
+  automation_candidates:
+    - Internal tooling
+    - Developer assistance
+    - Data preprocessing
+    - Code generation (with review)
+
+  implementation:
+    - Classification layer routes to human vs automated
+    - Confidence thresholds trigger escalation
+    - Audit trails for all automated decisions
+```
+
+### Narrow Scope Wins
+
+**Key Insight:** Successful agents operate within tightly constrained domains.
+
+```yaml
+scope_constraints:
+  max_steps_before_review: 3-5
+  task_characteristics:
+    - Specific, well-defined objectives
+    - Pre-classified inputs
+    - Deterministic success criteria
+    - Verifiable outputs
+
+  successful_domains:
+    - Email scanning and classification
+    - Invoice processing
+    - Code refactoring (bounded)
+    - Documentation generation
+    - Test writing
+
+  failure_prone_domains:
+    - Open-ended feature implementation
+    - Novel algorithm design
+    - Security-critical code
+    - Cross-system integrations
+```
+
+### Confidence-Based Routing
+
+**Key Insight:** Treat agents as preprocessors, not decision-makers.
+
+```python
+def confidence_based_routing(agent_output):
+    """
+    Route based on confidence, not capability.
+    Based on production practitioner patterns.
+    """
+    confidence = agent_output.confidence_score
+
+    if confidence >= 0.95:
+        # High confidence: auto-approve with logging
+        return AutoApprove(audit_log=True)
+
+    elif confidence >= 0.70:
+        # Medium confidence: quick human review
+        return HumanReview(priority="normal", timeout="1h")
+
+    elif confidence >= 0.40:
+        # Low confidence: detailed human review
+        return HumanReview(priority="high", context="full")
+
+    else:
+        # Very low confidence: escalate immediately
+        return Escalate(reason="low_confidence", require_senior=True)
+```
+
+### Classification Before Automation
+
+**Key Insight:** Separate inputs before processing.
+
+```yaml
+classification_first:
+  step_1_classify:
+    workable:
+      - Clear requirements
+      - Existing patterns
+      - Test coverage available
+    non_workable:
+      - Ambiguous requirements
+      - Novel architecture
+      - Missing dependencies
+    escalate_immediately:
+      - Security concerns
+      - Compliance requirements
+      - Customer-facing changes
+
+  step_2_route:
+    workable: "Automated pipeline"
+    non_workable: "Human clarification"
+    escalate: "Senior review"
+```
+
+### Deterministic Outer Loops
+
+**Key Insight:** Wrap agent outputs with rule-based validation.
+
+```python
+def deterministic_validation_loop(task, max_attempts=3):
+    """
+    Use LLMs only where genuine ambiguity exists.
+    Wrap with deterministic rules.
+    """
+    for attempt in range(max_attempts):
+        # LLM handles the ambiguous part
+        output = agent.execute(task)
+
+        # Deterministic validation (NOT LLM)
+        validation_errors = []
+
+        # Rule: Must have tests
+        if not output.has_tests:
+            validation_errors.append("Missing tests")
+
+        # Rule: Must pass linting
+        lint_result = run_linter(output.code)
+        if lint_result.errors:
+            validation_errors.append(f"Lint errors: {lint_result.errors}")
+
+        # Rule: Must compile
+        compile_result = compile_code(output.code)
+        if not compile_result.success:
+            validation_errors.append(f"Compile error: {compile_result.error}")
+
+        # Rule: Tests must pass
+        if output.has_tests:
+            test_result = run_tests(output.code)
+            if not test_result.all_passed:
+                validation_errors.append(f"Test failures: {test_result.failures}")
+
+        if not validation_errors:
+            return output
+
+        # Feed errors back for retry
+        task = task.with_feedback(validation_errors)
+
+    return FailedResult(reason="Max attempts exceeded")
+```
+
+---
+
+## Context Engineering Patterns
+
+### Context Curation Over Automatic Selection
+
+**Key Insight:** Manually choose which files and information to provide.
+
+```yaml
+context_curation:
+  principles:
+    - "Less is more" - focused context beats comprehensive context
+    - Manual selection outperforms automatic RAG
+    - Remove outdated information aggressively
+
+  anti_patterns:
+    - Dumping entire codebase into context
+    - Relying on automatic context selection
+    - Accumulating conversation history indefinitely
+
+  implementation:
+    per_task_context:
+      - 2-5 most relevant files
+      - Specific functions, not entire modules
+      - Recent changes only (last 1-2 days)
+      - Clear success criteria
+
+    context_budget:
+      target: "< 10k tokens for context"
+      reserve: "90% for model reasoning"
+```
+
+### Information Abstraction
+
+**Key Insight:** Summarize rather than feeding full data.
+
+```python
+def abstract_for_agent(raw_data, task_context):
+    """
+    Design abstractions that preserve decision-relevant information.
+    Based on practitioner insights.
+    """
+    # BAD: Feed 10,000 database rows
+    # raw_data = db.query("SELECT * FROM users")
+
+    # GOOD: Summarize to decision-relevant info
+    summary = {
+        "query_status": "success",
+        "total_results": len(raw_data),
+        "sample": raw_data[:5],
+        "schema": extract_schema(raw_data),
+        "statistics": {
+            "null_count": count_nulls(raw_data),
+            "unique_values": count_uniques(raw_data),
+            "date_range": get_date_range(raw_data)
+        }
+    }
+
+    return summary
+```
+
+### Separate Conversations Per Task
+
+**Key Insight:** Fresh contexts yield better results than accumulated sessions.
+
+```yaml
+conversation_management:
+  new_conversation_triggers:
+    - Different domain (backend -> frontend)
+    - New feature vs bug fix
+    - After completing major task
+    - When errors accumulate (3+ in row)
+
+  preserve_across_sessions:
+    - CLAUDE.md / CONTINUITY.md
+    - Architectural decisions
+    - Key constraints
+
+  discard_between_sessions:
+    - Debugging attempts
+    - Abandoned approaches
+    - Intermediate drafts
+```
+
+---
+
+## Skills System Pattern
+
+### On-Demand Skill Loading
+
+**Key Insight:** Skills remain dormant until the model actively seeks them out.
+
+```yaml
+skills_architecture:
+  core_interaction: "< 2k tokens"
+  skill_loading: "On-demand via search"
+
+  implementation:
+    skill_discovery:
+      - Shell script searches skill files
+      - Model requests specific skills by name
+      - Skills loaded only when needed
+
+    skill_structure:
+      name: "unique-skill-name"
+      trigger: "Pattern that activates skill"
+      content: "Detailed instructions"
+      dependencies: ["other-skills"]
+
+  benefits:
+    - Minimal base context
+    - Extensible without bloat
+    - Skills can be updated independently
+```
+
+### Sub-Agents for Context Isolation
+
+**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.
+
+```python
+async def context_isolated_search(query, codebase_path):
+    """
+    Use sub-agent for grep/search to prevent context pollution.
+    Based on Simon Willison's patterns.
+    """
+    # Main agent stays focused
+    # Sub-agent handles noisy file searching
+
+    search_agent = spawn_subagent(
+        role="codebase-searcher",
+        context_limit="10k tokens",
+        permissions=["read-only"]
+    )
+
+    results = await search_agent.execute(
+        task=f"Find files related to: {query}",
+        codebase=codebase_path
+    )
+
+    # Return only relevant paths, not full content
+    return FilteredResults(
+        paths=results.relevant_files[:10],
+        summaries=results.file_summaries,
+        confidence=results.relevance_scores
+    )
+```
+
+---
+
+## Planning Before Execution
+
+### Explicit Plan-Then-Code Workflow
+
+**Key Insight:** Have models articulate detailed plans without immediately writing code.
+
+```yaml
+plan_then_code:
+  phase_1_planning:
+    outputs:
+      - spec.md: "Detailed requirements"
+      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
+      - approach.md: "Implementation strategy"
+    constraints:
+      - NO CODE in this phase
+      - Human review before proceeding
+      - Clear success criteria
+
+  phase_2_review:
+    checks:
+      - Plan addresses all requirements
+      - Approach is feasible
+      - No missing dependencies
+      - Tests are specified
+
+  phase_3_implementation:
+    constraints:
+      - Follow plan exactly
+      - One task at a time
+      - Test after each change
+      - Report deviations immediately
+```
+
+---
+
+## Multi-Agent Orchestration Patterns
+
+### Event-Driven Coordination
+
+**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.
+
+```yaml
+event_driven_orchestration:
+  problems_with_synchronous:
+    - Doesn't scale
+    - Mixes orchestration with prompt logic
+    - Single failure breaks entire chain
+    - No retry/recovery mechanism
+
+  async_architecture:
+    message_queue:
+      - Agents communicate via events
+      - Decoupled execution
+      - Natural retry/dead-letter handling
+
+    state_management:
+      - Persistent task state
+      - Checkpoint/resume capability
+      - Clear ownership of data
+
+    error_handling:
+      - Per-agent retry policies
+      - Circuit breakers
+      - Graceful degradation
+```
+
+### Policy-First Enforcement
+
+**Key Insight:** Govern agent behavior at runtime, not just training time.
+
+```python
+class PolicyEngine:
+    """
+    Runtime governance for agent behavior.
+    Based on autonomous control plane patterns.
+    """
+
+    def __init__(self, policies):
+        self.policies = policies
+
+    async def enforce(self, agent_action, context):
+        for policy in self.policies:
+            result = await policy.evaluate(agent_action, context)
+
+            if result.blocked:
+                return BlockedAction(
+                    reason=result.reason,
+                    policy=policy.name,
+                    remediation=result.suggested_action
+                )
+
+            if result.modified:
+                agent_action = result.modified_action
+
+        return AllowedAction(agent_action)
+
+# Example policies
+policies = [
+    NoProductionDataDeletion(),
+    NoSecretsInCode(),
+    MaxTokenBudget(limit=100000),
+    RequireTestsForCode(),
+    BlockExternalNetworkCalls(in_sandbox=True)
+]
+```
+
+### Simulation Layer
+
+**Key Insight:** Evaluate changes before deploying to real environment.
+
+```yaml
+simulation_layer:
+  purpose: "Test agent behavior in safe environment"
+
+  implementation:
+    sandbox_environment:
+      - Isolated container
+      - Mocked external services
+      - Synthetic data
+      - Full audit logging
+
+    validation_checks:
+      - Run tests in sandbox first
+      - Compare outputs to expected
+      - Check for policy violations
+      - Measure resource consumption
+
+    promotion_criteria:
+      - All tests pass
+      - No policy violations
+      - Resource usage within limits
+      - Human approval (for sensitive changes)
+```
+
+---
+
+## Evaluation and Benchmarking
+
+### Problems with Current Benchmarks
+
+**Key Insight:** LLM-as-judge creates shared blind spots.
+
+```yaml
+benchmark_problems:
+  llm_judge_issues:
+    - Same architecture = same failure modes
+    - Math errors accepted as correct
+    - "Do-nothing" baseline passes 38% of time
+
+  contamination:
+    - Published benchmarks become training targets
+    - Overfitting to specific datasets
+    - Inflated scores don't reflect real performance
+
+  solutions:
+    held_back_sets: "90% public, 10% private"
+    human_evaluation: "Final published results require humans"
+    production_testing: "A/B tests measure actual value"
+    objective_outcomes: "Simulated environments with verifiable results"
+```
+
+### Practical Evaluation Approach
+
+```python
+def evaluate_agent_change(before_agent, after_agent, task_set):
+    """
+    Production-oriented evaluation.
+    Based on HN practitioner recommendations.
+    """
+    results = {
+        "before": [],
+        "after": [],
+        "human_preference": []
+    }
+
+    for task in task_set:
+        # Run both agents
+        before_result = before_agent.execute(task)
+        after_result = after_agent.execute(task)
+
+        # Objective metrics (NOT LLM-judged)
+        results["before"].append({
+            "tests_pass": run_tests(before_result),
+            "lint_clean": run_linter(before_result),
+            "time_taken": before_result.duration,
+            "tokens_used": before_result.tokens
+        })
+
+        results["after"].append({
+            "tests_pass": run_tests(after_result),
+            "lint_clean": run_linter(after_result),
+            "time_taken": after_result.duration,
+            "tokens_used": after_result.tokens
+        })
+
+        # Sample for human review
+        if random.random() < 0.1:  # 10% sample
+            results["human_preference"].append({
+                "task": task,
+                "before": before_result,
+                "after": after_result,
+                "pending_review": True
+            })
+
+    return EvaluationReport(results)
+```
+
+---
+
+## Cost and Token Economics
+
+### Real-World Cost Patterns
+
+```yaml
+cost_patterns:
+  claude_code:
+    heavy_use: "$25/1-2 hours on large codebases"
+    api_range: "$1-5/hour depending on efficiency"
+    max_tier: "$200/month often needs 2-3 subscriptions"
+
+  token_economics:
+    sub_agents_multiply_cost: "Each duplicates context"
+    example: "5-task parallel job = 50,000+ tokens per subtask"
+
+  optimization:
+    context_isolation: "Use sub-agents for noisy tasks"
+    information_abstraction: "Summarize, don't dump"
+    fresh_conversations: "Reset after major tasks"
+    skill_on_demand: "Load only when needed"
+```
+
+---
+
+## Sources
+
+**Hacker News Discussions:**
+- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
+- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
+- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
+- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
+- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
+- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
+- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)
+
+**Show HN Projects:**
+- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
+- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
+- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
+- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)
--- a/skills/loki-mode/references/quality-control.md
+++ b/skills/loki-mode/references/quality-control.md
@@ -0,0 +1,437 @@
+# Quality Control Reference
+
+Quality gates, code review process, and severity blocking rules.
+Enhanced with 2025 research on anti-sycophancy, heterogeneous teams, and OpenAI Agents SDK patterns.
+
+---
+
+## Core Principle: Guardrails, Not Just Acceleration
+
+**CRITICAL:** Speed without quality controls creates "AI slop" - semi-functional code that accumulates technical debt. Loki Mode enforces strict quality guardrails.
+
+**Research Insight:** Heterogeneous review teams outperform homogeneous ones by 4-6% (A-HMAD, 2025).
+**OpenAI Insight:** "Think of guardrails as a layered defense mechanism. Multiple specialized guardrails create resilient agents."
+
+---
+
+## Guardrails & Tripwires System (OpenAI SDK Pattern)
+
+### Input Guardrails (Run Before Execution)
+
+```python
+# Layer 1: Validate task scope and safety
+@input_guardrail(blocking=True)
+async def validate_task_scope(input, context):
+    # Check if task within project bounds
+    if references_external_paths(input):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Task references paths outside project"
+        )
+    # Check for destructive operations
+    if contains_destructive_operation(input):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Destructive operation requires human approval"
+        )
+    return GuardrailResult(tripwire_triggered=False)
+
+# Layer 2: Detect prompt injection
+@input_guardrail(blocking=True)
+async def detect_injection(input, context):
+    if has_injection_patterns(input):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Potential prompt injection detected"
+        )
+    return GuardrailResult(tripwire_triggered=False)
+```
+
+### Output Guardrails (Run After Execution)
+
+```python
+# Validate code quality before accepting
+@output_guardrail
+async def validate_code_output(output, context):
+    if output.type == "code":
+        issues = run_static_analysis(output.content)
+        critical = [i for i in issues if i.severity == "critical"]
+        if critical:
+            return GuardrailResult(
+                tripwire_triggered=True,
+                reason=f"Critical issues: {critical}"
+            )
+    return GuardrailResult(tripwire_triggered=False)
+
+# Check for secrets in output
+@output_guardrail
+async def check_secrets(output, context):
+    if contains_secrets(output.content):
+        return GuardrailResult(
+            tripwire_triggered=True,
+            reason="Output contains potential secrets"
+        )
+    return GuardrailResult(tripwire_triggered=False)
+```
+
+### Execution Modes
+
+| Mode | Behavior | Use When |
+|------|----------|----------|
+| **Blocking** | Guardrail completes before agent starts | Expensive models, sensitive ops |
+| **Parallel** | Guardrail runs with agent | Fast checks, acceptable token loss |
+
+```python
+# Blocking: prevents token consumption on fail
+@input_guardrail(blocking=True, run_in_parallel=False)
+async def expensive_validation(input): pass
+
+# Parallel: faster but may waste tokens
+@input_guardrail(blocking=True, run_in_parallel=True)
+async def fast_validation(input): pass
+```
+
+### Tripwire Handling
+
+When a guardrail triggers its tripwire, execution halts immediately:
+
+```python
+try:
+    result = await run_agent(task)
+except InputGuardrailTripwireTriggered as e:
+    log_blocked_attempt(e)
+    return early_exit(reason=str(e))
+except OutputGuardrailTripwireTriggered as e:
+    rollback_changes()
+    return retry_with_constraints(e.constraints)
+```
+
+### Layered Defense Strategy
+
+```yaml
+guardrail_layers:
+  layer_1_input:
+    - scope_validation      # Is task within bounds?
+    - pii_detection         # Contains sensitive data?
+    - injection_detection   # Prompt injection attempt?
+
+  layer_2_pre_execution:
+    - cost_estimation       # Will this exceed budget?
+    - dependency_check      # Are dependencies available?
+    - conflict_detection    # Conflicts with in-progress work?
+
+  layer_3_output:
+    - static_analysis       # Code quality issues?
+    - secret_detection      # Secrets in output?
+    - spec_compliance       # Matches OpenAPI spec?
+
+  layer_4_post_action:
+    - test_validation       # Tests pass?
+    - review_approval       # Review passed?
+    - deployment_safety     # Safe to deploy?
+```
+
+See `references/openai-patterns.md` for full guardrails implementation.
+
+---
+
+## Quality Gates
+
+**Never ship code without passing all quality gates:**
+
+### 1. Static Analysis (Automated)
+- CodeQL security scanning
+- ESLint/Pylint/Rubocop for code style
+- Unused variable/import detection
+- Duplicated logic detection
+- Type checking (TypeScript/mypy/etc)
+
+### 2. 3-Reviewer Parallel System (AI-driven)
+
+Every code change goes through 3 specialized reviewers **simultaneously**:
+
+```
+IMPLEMENT -> BLIND REVIEW (parallel) -> DEBATE (if disagreement) -> AGGREGATE -> FIX -> RE-REVIEW
+                |
+                +-- code-reviewer (Opus) - Code quality, patterns, best practices
+                +-- business-logic-reviewer (Opus) - Requirements, edge cases, UX
+                +-- security-reviewer (Opus) - Vulnerabilities, OWASP Top 10
+```
+
+**Important:**
+- ALWAYS launch all 3 reviewers in a single message (3 Task calls)
+- ALWAYS specify model: "opus" for each reviewer
+- ALWAYS use blind review mode (reviewers cannot see each other's findings initially)
+- NEVER dispatch reviewers sequentially (always parallel - 3x faster)
+- NEVER aggregate before all 3 reviewers complete
+
+### Anti-Sycophancy Protocol (CONSENSAGENT Research)
+
+**Problem:** Reviewers may reinforce each other's findings instead of critically engaging.
+
+**Solution: Blind Review + Devil's Advocate**
+
+```python
+# Phase 1: Independent blind review
+reviews = []
+for reviewer in [code_reviewer, business_reviewer, security_reviewer]:
+    review = Task(
+        subagent_type="general-purpose",
+        model="opus",
+        prompt=f"""
+        {reviewer.prompt}
+
+        CRITICAL: Be skeptical. Your job is to find problems.
+        List specific concerns with file:line references.
+        Do NOT rubber-stamp. Finding zero issues is suspicious.
+        """
+    )
+    reviews.append(review)
+
+# Phase 2: Check for disagreement
+if has_disagreement(reviews):
+    # Structured debate - max 2 rounds
+    debate_result = structured_debate(reviews, max_rounds=2)
+else:
+    # All agreed - run devil's advocate
+    devil_review = Task(
+        subagent_type="general-purpose",
+        model="opus",
+        prompt="""
+        The other reviewers found no issues. Your job is to be contrarian.
+        Find problems they missed. Challenge assumptions.
+        If truly nothing wrong, explain why each potential issue category is covered.
+        """
+    )
+    reviews.append(devil_review)
+```
+
+### Heterogeneous Team Composition
+
+**Each reviewer has distinct personality/focus:**
+
+| Reviewer | Model | Expertise | Personality |
+|----------|-------|-----------|-------------|
+| Code Quality | Opus | SOLID, patterns, maintainability | Perfectionist |
+| Business Logic | Opus | Requirements, edge cases, UX | Pragmatic |
+| Security | Opus | OWASP, auth, injection | Paranoid |
+
+This diversity prevents groupthink and catches more issues.
+
+### 3. Severity-Based Blocking
+
+| Severity | Action | Continue? |
+|----------|--------|-----------|
+| **Critical** | BLOCK - Fix immediately | NO |
+| **High** | BLOCK - Fix immediately | NO |
+| **Medium** | BLOCK - Fix before proceeding | NO |
+| **Low** | Add `// TODO(review): ...` comment | YES |
+| **Cosmetic** | Add `// FIXME(nitpick): ...` comment | YES |
+
+**Critical/High/Medium = BLOCK and fix before proceeding**
+**Low/Cosmetic = Add TODO/FIXME comment, continue**
+
+### 4. Test Coverage Gates
+- Unit tests: 100% pass, >80% coverage
+- Integration tests: 100% pass
+- E2E tests: critical flows pass
+
+### 5. Rulesets (Blocking Merges)
+- No secrets in code
+- No unhandled exceptions
+- No SQL injection vulnerabilities
+- No XSS vulnerabilities
+
+---
+
+## Code Review Protocol
+
+### Launching Reviewers (Parallel)
+
+```python
+# CORRECT: Launch all 3 in parallel
+Task(subagent_type="general-purpose", model="opus",
+     description="Code quality review",
+     prompt="Review for code quality, patterns, SOLID principles...")
+
+Task(subagent_type="general-purpose", model="opus",
+     description="Business logic review",
+     prompt="Review for requirements alignment, edge cases, UX...")
+
+Task(subagent_type="general-purpose", model="opus",
+     description="Security review",
+     prompt="Review for vulnerabilities, OWASP Top 10...")
+
+# WRONG: Sequential reviewers (3x slower)
+# Don't do: await reviewer1; await reviewer2; await reviewer3;
+```
+
+### After Fixes
+
+- ALWAYS re-run ALL 3 reviewers after fixes (not just the one that found the issue)
+- Wait for all reviews to complete before aggregating results
+
+---
+
+## Structured Prompting for Subagents
+
+**Every subagent dispatch MUST include:**
+
+```markdown
+## GOAL (What success looks like)
+[High-level objective, not just the action]
+Example: "Refactor authentication for maintainability and testability"
+NOT: "Refactor the auth file"
+
+## CONSTRAINTS (What you cannot do)
+- No third-party dependencies without approval
+- Maintain backwards compatibility with v1.x API
+- Keep response time under 200ms
+- Follow existing error handling patterns
+
+## CONTEXT (What you need to know)
+- Related files: [list with brief descriptions]
+- Architecture decisions: [relevant ADRs or patterns]
+- Previous attempts: [what was tried, why it failed]
+- Dependencies: [what this depends on, what depends on this]
+
+## OUTPUT FORMAT (What to deliver)
+- [ ] Pull request with Why/What/Trade-offs description
+- [ ] Unit tests with >90% coverage
+- [ ] Update API documentation
+- [ ] Performance benchmark results
+```
+
+---
+
+## Task Completion Report
+
+**Every completed task MUST include decision documentation:**
+
+```markdown
+## Task Completion Report
+
+### WHY (Problem & Solution Rationale)
+- **Problem**: [What was broken/missing/suboptimal]
+- **Root Cause**: [Why it happened]
+- **Solution Chosen**: [What we implemented]
+- **Alternatives Considered**:
+  1. [Option A]: Rejected because [reason]
+  2. [Option B]: Rejected because [reason]
+
+### WHAT (Changes Made)
+- **Files Modified**: [with line ranges and purpose]
+  - `src/auth.ts:45-89` - Extracted token validation to separate function
+  - `src/auth.test.ts:120-156` - Added edge case tests
+- **APIs Changed**: [breaking vs non-breaking]
+- **Behavior Changes**: [what users will notice]
+- **Dependencies Added/Removed**: [with justification]
+
+### TRADE-OFFS (Gains & Costs)
+- **Gained**:
+  - Better testability (extracted pure functions)
+  - 40% faster token validation
+  - Reduced cyclomatic complexity from 15 to 6
+- **Cost**:
+  - Added 2 new functions (increased surface area)
+  - Requires migration for custom token validators
+- **Neutral**:
+  - No performance change for standard use cases
+
+### RISKS & MITIGATIONS
+- **Risk**: Existing custom validators may break
+  - **Mitigation**: Added backwards-compatibility shim, deprecation warning
+- **Risk**: New validation logic untested at scale
+  - **Mitigation**: Gradual rollout with feature flag, rollback plan ready
+
+### TEST RESULTS
+- Unit: 24/24 passed (coverage: 92%)
+- Integration: 8/8 passed
+- Performance: p99 improved from 145ms -> 87ms
+
+### NEXT STEPS (if any)
+- [ ] Monitor error rates for 24h post-deploy
+- [ ] Create follow-up task to remove compatibility shim in v3.0
+```
+
+---
+
+## Preventing "AI Slop"
+
+### Warning Signs
+- Tests pass but code quality degraded
+- Copy-paste duplication instead of abstraction
+- Over-engineered solutions to simple problems
+- Missing error handling
+- No logging/observability
+- Generic variable names (data, temp, result)
+- Magic numbers without constants
+- Commented-out code
+- TODO comments without GitHub issues
+
+### When Detected
+1. Fail the task immediately
+2. Add to failed queue with detailed feedback
+3. Re-dispatch with stricter constraints
+4. Update CONTINUITY.md with anti-pattern to avoid
+
+---
+
+## Quality Gate Hooks
+
+### Pre-Write Hook (BLOCKING)
+```bash
+#!/bin/bash
+# .loki/hooks/pre-write.sh
+# Blocks writes that violate rules
+
+# Check for secrets
+if grep -rE "(password|secret|key).*=.*['\"][^'\"]{8,}" "$1"; then
+  echo "BLOCKED: Potential secret detected"
+  exit 1
+fi
+
+# Check for console.log in production
+if grep -n "console.log" "$1" | grep -v "test"; then
+  echo "BLOCKED: Remove console.log statements"
+  exit 1
+fi
+```
+
+### Post-Write Hook (AUTO-FIX)
+```bash
+#!/bin/bash
+# .loki/hooks/post-write.sh
+# Auto-fixes after writes
+
+# Format code
+npx prettier --write "$1"
+
+# Fix linting issues
+npx eslint --fix "$1"
+
+# Type check
+npx tsc --noEmit
+```
+
+---
+
+## Constitution Reference
+
+Quality gates are enforced by `autonomy/CONSTITUTION.md`:
+
+**Pre-Commit (BLOCKING):**
+- Linting (auto-fix enabled)
+- Type checking (strict mode)
+- Contract tests (80% coverage minimum)
+- Spec validation (Spectral)
+
+**Post-Implementation (AUTO-FIX):**
+- Static analysis (ESLint, Prettier, TSC)
+- Security scan (Semgrep, Snyk)
+- Performance check (Lighthouse score 90+)
+
+**Runtime Invariants:**
+- `SPEC_BEFORE_CODE`: Implementation tasks require spec reference
+- `TASK_HAS_COMMIT`: Completed tasks have git commit SHA
+- `QUALITY_GATES_PASSED`: Completed tasks passed all quality checks
--- a/skills/loki-mode/references/sdlc-phases.md
+++ b/skills/loki-mode/references/sdlc-phases.md
@@ -0,0 +1,410 @@
+# SDLC Phases Reference
+
+All phases with detailed workflows and testing procedures.
+
+---
+
+## Phase Overview
+
+```
+Bootstrap -> Discovery -> Architecture -> Infrastructure
+     |           |            |              |
+  (Setup)   (Analyze PRD)  (Design)    (Cloud/DB Setup)
+                                             |
+Development <- QA <- Deployment <- Business Ops <- Growth Loop
+     |         |         |            |            |
+ (Build)    (Test)   (Release)    (Monitor)    (Iterate)
+```
+
+---
+
+## Phase 0: Bootstrap
+
+**Purpose:** Initialize Loki Mode environment
+
+### Actions:
+1. Create `.loki/` directory structure
+2. Initialize orchestrator state in `.loki/state/orchestrator.json`
+3. Validate PRD exists and is readable
+4. Spawn initial agent pool (3-5 agents)
+5. Create CONTINUITY.md
+
+### Directory Structure Created:
+```
+.loki/
+-- CONTINUITY.md
+-- state/
+|   +-- orchestrator.json
+|   +-- agents/
+|   +-- circuit-breakers/
+-- queue/
+|   +-- pending.json
+|   +-- in-progress.json
+|   +-- completed.json
+|   +-- dead-letter.json
+-- specs/
+-- memory/
+-- artifacts/
+```
+
+---
+
+## Phase 1: Discovery
+
+**Purpose:** Understand requirements and market context
+
+### Actions:
+1. Parse PRD, extract requirements
+2. Spawn `biz-analytics` agent for competitive research
+3. Web search competitors, extract features, reviews
+4. Identify market gaps and opportunities
+5. Generate task backlog with priorities and dependencies
+
+### Output:
+- Requirements document
+- Competitive analysis
+- Initial task backlog in `.loki/queue/pending.json`
+
+---
+
+## Phase 2: Architecture
+
+**Purpose:** Design system architecture and generate specs
+
+### SPEC-FIRST WORKFLOW
+
+**Step 1: Extract API Requirements from PRD**
+- Parse PRD for user stories and functionality
+- Map to REST/GraphQL operations
+- Document data models and relationships
+
+**Step 2: Generate OpenAPI 3.1 Specification**
+
+```yaml
+openapi: 3.1.0
+info:
+  title: Product API
+  version: 1.0.0
+paths:
+  /auth/login:
+    post:
+      summary: Authenticate user and return JWT
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              type: object
+              required: [email, password]
+              properties:
+                email: { type: string, format: email }
+                password: { type: string, minLength: 8 }
+      responses:
+        200:
+          description: Success
+          content:
+            application/json:
+              schema:
+                type: object
+                properties:
+                  token: { type: string }
+                  expiresAt: { type: string, format: date-time }
+        401:
+          description: Invalid credentials
+```
+
+**Step 3: Validate Spec**
+```bash
+npm install -g @stoplight/spectral-cli
+spectral lint .loki/specs/openapi.yaml
+swagger-cli validate .loki/specs/openapi.yaml
+```
+
+**Step 4: Generate Artifacts from Spec**
+```bash
+# TypeScript types
+npx openapi-typescript .loki/specs/openapi.yaml --output src/types/api.ts
+
+# Client SDK
+npx openapi-generator-cli generate \
+  -i .loki/specs/openapi.yaml \
+  -g typescript-axios \
+  -o src/clients/api
+
+# Server stubs
+npx openapi-generator-cli generate \
+  -i .loki/specs/openapi.yaml \
+  -g nodejs-express-server \
+  -o backend/generated
+
+# Documentation
+npx redoc-cli bundle .loki/specs/openapi.yaml -o docs/api.html
+```
+
+**Step 5: Select Tech Stack**
+- Spawn `eng-backend` + `eng-frontend` architects
+- Both agents review spec and propose stack
+- Consensus required (both must agree)
+- Self-reflection checkpoint with evidence
+
+**Step 6: Create Project Scaffolding**
+- Initialize project with tech stack
+- Install dependencies
+- Configure linters
+- Setup contract testing framework
+
+---
+
+## Phase 3: Infrastructure
+
+**Purpose:** Provision cloud resources and CI/CD
+
+### Actions:
+1. Spawn `ops-devops` agent
+2. Provision cloud resources (see `references/deployment.md`)
+3. Set up CI/CD pipelines
+4. Configure monitoring and alerting
+5. Create staging and production environments
+
+### CI/CD Pipeline:
+```yaml
+name: CI/CD Pipeline
+on: [push, pull_request]
+jobs:
+  test:
+    - Lint
+    - Type check
+    - Unit tests
+    - Contract tests
+    - Security scan
+  deploy-staging:
+    needs: test
+    - Deploy to staging
+    - Smoke tests
+  deploy-production:
+    needs: deploy-staging
+    - Blue-green deploy
+    - Health checks
+    - Auto-rollback on errors
+```
+
+---
+
+## Phase 4: Development
+
+**Purpose:** Implement features with quality gates
+
+### Workflow Per Task:
+
+```
+1. Dispatch implementation subagent (Task tool, model: sonnet)
+2. Subagent implements with TDD, commits, reports back
+3. Dispatch 3 reviewers IN PARALLEL (single message, 3 Task calls):
+   - code-reviewer (opus)
+   - business-logic-reviewer (opus)
+   - security-reviewer (opus)
+4. Aggregate findings by severity
+5. IF Critical/High/Medium found:
+   - Dispatch fix subagent
+   - Re-run ALL 3 reviewers
+   - Loop until all PASS
+6. Add TODO comments for Low issues
+7. Add FIXME comments for Cosmetic issues
+8. Mark task complete with git checkpoint
+```
+
+### Implementation Rules:
+- Agents implement ONLY what's in the spec
+- Must validate against openapi.yaml schema
+- Must return responses matching spec
+- Performance targets from spec x-performance extension
+
+---
+
+## Phase 5: Quality Assurance
+
+**Purpose:** Comprehensive testing and security audit
+
+### Testing Phases:
+
+**UNIT Phase:**
+```bash
+npm run test:unit
+# or
+pytest tests/unit/
+```
+- Coverage: >80% required
+- All tests must pass
+
+**INTEGRATION Phase:**
+```bash
+npm run test:integration
+```
+- Test API endpoints against actual database
+- Test external service integrations
+- Verify data flows end-to-end
+
+**E2E Phase:**
+```bash
+npx playwright test
+# or
+npx cypress run
+```
+- Test complete user flows
+- Cross-browser testing
+- Mobile responsive testing
+
+**CONTRACT Phase:**
+```bash
+npm run test:contract
+```
+- Validate implementation matches OpenAPI spec
+- Test request/response schemas
+- Breaking change detection
+
+**SECURITY Phase:**
+```bash
+npm audit
+npx snyk test
+semgrep --config=auto .
+```
+- OWASP Top 10 checks
+- Dependency vulnerabilities
+- Static analysis
+
+**PERFORMANCE Phase:**
+```bash
+npx k6 run tests/load.js
+npx lighthouse http://localhost:3000
+```
+- Load testing: 100 concurrent users for 1 minute
+- Stress testing: 500 concurrent users for 30 seconds
+- P95 response time < 500ms required
+
+**ACCESSIBILITY Phase:**
+```bash
+npx axe http://localhost:3000
+```
+- WCAG 2.1 AA compliance
+- Alt text, ARIA labels, color contrast
+- Keyboard navigation, focus indicators
+
+**REGRESSION Phase:**
+- Compare behavior against previous version
+- Verify no features broken by recent changes
+- Test backward compatibility of APIs
+
+**UAT Phase:**
+- Create acceptance tests from PRD
+- Walk through complete user journeys
+- Verify business logic matches PRD
+- Document any UX friction points
+
+---
+
+## Phase 6: Deployment
+
+**Purpose:** Release to production
+
+### Actions:
+1. Spawn `ops-release` agent
+2. Generate semantic version, changelog
+3. Create release branch, tag
+4. Deploy to staging, run smoke tests
+5. Blue-green deploy to production
+6. Monitor for 30min, auto-rollback if errors spike
+
+### Deployment Strategies:
+
+**Blue-Green:**
+```
+1. Deploy new version to "green" environment
+2. Run smoke tests
+3. Switch traffic from "blue" to "green"
+4. Keep "blue" as rollback target
+```
+
+**Canary:**
+```
+1. Deploy to 5% of traffic
+2. Monitor error rates
+3. Gradually increase to 25%, 50%, 100%
+4. Rollback if errors exceed threshold
+```
+
+---
+
+## Phase 7: Business Operations
+
+**Purpose:** Non-technical business setup
+
+### Actions:
+1. `biz-marketing`: Create landing page, SEO, content
+2. `biz-sales`: Set up CRM, outreach templates
+3. `biz-finance`: Configure billing (Stripe), invoicing
+4. `biz-support`: Create help docs, chatbot
+5. `biz-legal`: Generate ToS, privacy policy
+
+---
+
+## Phase 8: Growth Loop
+
+**Purpose:** Continuous improvement
+
+### Cycle:
+```
+MONITOR -> ANALYZE -> OPTIMIZE -> DEPLOY -> MONITOR
+    |
+Customer feedback -> Feature requests -> Backlog
+    |
+A/B tests -> Winner -> Permanent deploy
+    |
+Incidents -> RCA -> Prevention -> Deploy fix
+```
+
+### Never "Done":
+- Run performance optimizations
+- Add missing test coverage
+- Improve documentation
+- Refactor code smells
+- Update dependencies
+- Enhance user experience
+- Implement A/B test learnings
+
+---
+
+## Final Review (Before Any Deployment)
+
+```
+1. Dispatch 3 reviewers reviewing ENTIRE implementation:
+   - code-reviewer: Full codebase quality
+   - business-logic-reviewer: All requirements met
+   - security-reviewer: Full security audit
+
+2. Aggregate findings across all files
+3. Fix Critical/High/Medium issues
+4. Re-run all 3 reviewers until all PASS
+5. Generate final report in .loki/artifacts/reports/final-review.md
+6. Proceed to deployment only after all PASS
+```
+
+---
+
+## Quality Gates Summary
+
+| Gate | Agent | Pass Criteria |
+|------|-------|---------------|
+| Unit Tests | eng-qa | 100% pass |
+| Integration Tests | eng-qa | 100% pass |
+| E2E Tests | eng-qa | 100% pass |
+| Coverage | eng-qa | > 80% |
+| Linting | eng-qa | 0 errors |
+| Type Check | eng-qa | 0 errors |
+| Security Scan | ops-security | 0 high/critical |
+| Dependency Audit | ops-security | 0 vulnerabilities |
+| Performance | eng-qa | p99 < 200ms |
+| Accessibility | eng-frontend | WCAG 2.1 AA |
+| Load Test | ops-devops | Handles 10x expected traffic |
+| Chaos Test | ops-devops | Recovers from failures |
+| Cost Estimate | ops-cost | Within budget |
+| Legal Review | biz-legal | Compliant |
--- a/skills/loki-mode/references/task-queue.md
+++ b/skills/loki-mode/references/task-queue.md
@@ -0,0 +1,361 @@
+# Task Queue Reference
+
+Distributed task queue system, dead letter handling, and circuit breakers.
+
+---
+
+## Task Schema
+
+```json
+{
+  "id": "uuid",
+  "idempotencyKey": "hash-of-task-content",
+  "type": "eng-backend|eng-frontend|ops-devops|...",
+  "priority": 1-10,
+  "dependencies": ["task-id-1", "task-id-2"],
+  "payload": {
+    "action": "implement|test|deploy|...",
+    "target": "file/path or resource",
+    "params": {},
+    "goal": "What success looks like (high-level objective)",
+    "constraints": ["No third-party deps", "Maintain backwards compat"],
+    "context": {
+      "relatedFiles": ["file1.ts", "file2.ts"],
+      "architectureDecisions": ["ADR-001: Use JWT tokens"],
+      "previousAttempts": "What was tried before, why it failed"
+    }
+  },
+  "createdAt": "ISO",
+  "claimedBy": null,
+  "claimedAt": null,
+  "timeout": 3600,
+  "retries": 0,
+  "maxRetries": 3,
+  "backoffSeconds": 60,
+  "lastError": null,
+  "completedAt": null,
+  "result": {
+    "status": "success|failed",
+    "output": "What was produced",
+    "decisionReport": { ... }
+  }
+}
+```
+
+**Decision Report is REQUIRED for completed tasks.** Tasks without proper decision documentation will be marked as incomplete.
+
+---
+
+## Queue Files
+
+```
+.loki/queue/
+-- pending.json       # Tasks waiting to be claimed
+-- in-progress.json   # Currently executing tasks
+-- completed.json     # Finished tasks
+-- dead-letter.json   # Failed tasks for review
+-- cancelled.json     # Cancelled tasks
+```
+
+---
+
+## Queue Operations
+
+### Claim Task (with file locking)
+
+```python
+def claim_task(agent_id, agent_capabilities):
+    with file_lock(".loki/state/locks/queue.lock", timeout=10):
+        pending = read_json(".loki/queue/pending.json")
+
+        # Find eligible task
+        for task in sorted(pending.tasks, key=lambda t: -t.priority):
+            if task.type not in agent_capabilities:
+                continue
+            if task.claimedBy and not claim_expired(task):
+                continue
+            if not all_dependencies_completed(task.dependencies):
+                continue
+            if circuit_breaker_open(task.type):
+                continue
+
+            # Claim it
+            task.claimedBy = agent_id
+            task.claimedAt = now()
+            move_task(task, "pending", "in-progress")
+            return task
+
+        return None
+```
+
+### File Locking (Bash)
+
+```bash
+#!/bin/bash
+# Atomic task claim using flock
+
+QUEUE_FILE=".loki/queue/pending.json"
+LOCK_FILE=".loki/state/locks/queue.lock"
+
+(
+  flock -x -w 10 200 || exit 1
+
+  # Read, claim, write atomically
+  TASK=$(jq -r '.tasks | map(select(.claimedBy == null)) | .[0]' "$QUEUE_FILE")
+  if [ "$TASK" != "null" ]; then
+    TASK_ID=$(echo "$TASK" | jq -r '.id')
+    jq --arg id "$TASK_ID" --arg agent "$AGENT_ID" \
+      '.tasks |= map(if .id == $id then .claimedBy = $agent | .claimedAt = now else . end)' \
+      "$QUEUE_FILE" > "${QUEUE_FILE}.tmp" && mv "${QUEUE_FILE}.tmp" "$QUEUE_FILE"
+    echo "$TASK_ID"
+  fi
+
+) 200>"$LOCK_FILE"
+```
+
+### Complete Task
+
+```python
+def complete_task(task_id, result, success=True):
+    with file_lock(".loki/state/locks/queue.lock"):
+        task = find_task(task_id, "in-progress")
+        task.completedAt = now()
+        task.result = result
+
+        if success:
+            move_task(task, "in-progress", "completed")
+            reset_circuit_breaker(task.type)
+            trigger_dependents(task_id)
+        else:
+            handle_failure(task)
+```
+
+---
+
+## Failure Handling
+
+### Exponential Backoff
+
+```python
+def handle_failure(task):
+    task.retries += 1
+    task.lastError = get_last_error()
+
+    if task.retries >= task.maxRetries:
+        # Move to dead letter queue
+        move_task(task, "in-progress", "dead-letter")
+        increment_circuit_breaker(task.type)
+        alert_orchestrator(f"Task {task.id} moved to dead letter queue")
+    else:
+        # Exponential backoff: 60s, 120s, 240s, ...
+        task.backoffSeconds = task.backoffSeconds * (2 ** (task.retries - 1))
+        task.availableAt = now() + task.backoffSeconds
+        move_task(task, "in-progress", "pending")
+        log(f"Task {task.id} retry {task.retries}, backoff {task.backoffSeconds}s")
+```
+
+---
+
+## Dead Letter Queue
+
+Tasks in dead letter queue require manual review:
+
+### Review Process
+
+1. Read `.loki/queue/dead-letter.json`
+2. For each task:
+   - Analyze `lastError` and failure pattern
+   - Determine if:
+     - Task is invalid -> delete
+     - Bug in agent -> fix agent, retry
+     - External dependency down -> wait, retry
+     - Requires human decision -> escalate
+3. To retry: move task back to pending with reset retries
+4. Log decision in `.loki/logs/decisions/dlq-review-{date}.md`
+
+---
+
+## Idempotency
+
+```python
+def enqueue_task(task):
+    # Generate idempotency key from content
+    task.idempotencyKey = hash(json.dumps(task.payload, sort_keys=True))
+
+    # Check if already exists
+    for queue in ["pending", "in-progress", "completed"]:
+        existing = find_by_idempotency_key(task.idempotencyKey, queue)
+        if existing:
+            log(f"Duplicate task detected: {task.idempotencyKey}")
+            return existing.id  # Return existing, don't create duplicate
+
+    # Safe to create
+    save_task(task, "pending")
+    return task.id
+```
+
+---
+
+## Task Cancellation
+
+```python
+def cancel_task(task_id, reason):
+    with file_lock(".loki/state/locks/queue.lock"):
+        for queue in ["pending", "in-progress"]:
+            task = find_task(task_id, queue)
+            if task:
+                task.cancelledAt = now()
+                task.cancelReason = reason
+                move_task(task, queue, "cancelled")
+
+                # Cancel dependent tasks too
+                for dep_task in find_tasks_depending_on(task_id):
+                    cancel_task(dep_task.id, f"Parent {task_id} cancelled")
+
+                return True
+        return False
+```
+
+---
+
+## Circuit Breakers
+
+### State Schema
+
+```json
+{
+  "circuitBreakers": {
+    "eng-backend": {
+      "state": "closed",
+      "failures": 0,
+      "lastFailure": null,
+      "openedAt": null,
+      "halfOpenAt": null
+    }
+  }
+}
+```
+
+### States
+
+| State | Description | Behavior |
+|-------|-------------|----------|
+| **closed** | Normal operation | Tasks flow normally |
+| **open** | Too many failures | Block all tasks of this type |
+| **half-open** | Testing recovery | Allow 1 test task |
+
+### Configuration
+
+```yaml
+# .loki/config/circuit-breakers.yaml
+defaults:
+  failureThreshold: 5
+  cooldownSeconds: 300
+  halfOpenAfter: 60
+
+overrides:
+  ops-security:
+    failureThreshold: 3  # More sensitive for security
+  biz-marketing:
+    failureThreshold: 10  # More tolerant for non-critical
+```
+
+### Implementation
+
+```python
+def check_circuit_breaker(agent_type):
+    cb = load_circuit_breaker(agent_type)
+
+    if cb.state == "closed":
+        return True  # Proceed
+
+    if cb.state == "open":
+        if now() > cb.openedAt + config.halfOpenAfter:
+            cb.state = "half-open"
+            save_circuit_breaker(cb)
+            return True  # Allow test task
+        return False  # Still blocking
+
+    if cb.state == "half-open":
+        return False  # Already testing, wait
+
+def on_task_success(agent_type):
+    cb = load_circuit_breaker(agent_type)
+    if cb.state == "half-open":
+        cb.state = "closed"
+        cb.failures = 0
+    save_circuit_breaker(cb)
+
+def on_task_failure(agent_type):
+    cb = load_circuit_breaker(agent_type)
+    cb.failures += 1
+    cb.lastFailure = now()
+
+    if cb.state == "half-open" or cb.failures >= config.failureThreshold:
+        cb.state = "open"
+        cb.openedAt = now()
+        alert_orchestrator(f"Circuit breaker OPEN for {agent_type}")
+
+    save_circuit_breaker(cb)
+```
+
+---
+
+## Rate Limit Handling
+
+### Detection
+
+```python
+def detect_rate_limit(error):
+    indicators = [
+        "rate limit",
+        "429",
+        "too many requests",
+        "quota exceeded",
+        "retry-after"
+    ]
+    return any(ind in str(error).lower() for ind in indicators)
+```
+
+### Response Protocol
+
+```python
+def handle_rate_limit(agent_id, error):
+    # 1. Save state checkpoint
+    checkpoint_state(agent_id)
+
+    # 2. Calculate backoff
+    retry_after = parse_retry_after(error) or calculate_exponential_backoff()
+
+    # 3. Log and wait
+    log(f"Rate limit hit for {agent_id}, waiting {retry_after}s")
+
+    # 4. Signal other agents to slow down
+    broadcast_signal("SLOWDOWN", {"wait": retry_after / 2})
+
+    # 5. Resume after backoff
+    schedule_resume(agent_id, retry_after)
+```
+
+### Exponential Backoff
+
+```python
+def calculate_exponential_backoff(attempt=1, base=60, max_wait=3600):
+    wait = min(base * (2 ** (attempt - 1)), max_wait)
+    jitter = random.uniform(0, wait * 0.1)
+    return wait + jitter
+```
+
+---
+
+## Priority System
+
+| Priority | Use Case | Example |
+|----------|----------|---------|
+| 10 | Critical blockers | Security vulnerability fix |
+| 8-9 | High priority | Core feature implementation |
+| 5-7 | Normal | Standard tasks |
+| 3-4 | Low priority | Documentation, cleanup |
+| 1-2 | Background | Nice-to-have improvements |
+
+Tasks are always processed in priority order within their type.
--- a/skills/loki-mode/references/tool-orchestration.md
+++ b/skills/loki-mode/references/tool-orchestration.md
@@ -0,0 +1,691 @@
+# Tool Orchestration Patterns Reference
+
+Research-backed patterns inspired by NVIDIA ToolOrchestra, OpenAI Agents SDK, and multi-agent coordination research.
+
+---
+
+## Overview
+
+Effective tool orchestration requires four key innovations:
+1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
+2. **Efficiency Metrics** - Track computational cost per task
+3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
+4. **Dynamic Selection** - Adapt agent count and types based on task complexity
+
+---
+
+## Tracing Spans Architecture (OpenAI SDK Pattern)
+
+### Span Types
+
+Every operation is wrapped in a typed span for observability:
+
+```yaml
+span_types:
+  agent_span:     # Wraps entire agent execution
+  generation_span: # Wraps LLM API calls
+  function_span:  # Wraps tool/function calls
+  guardrail_span: # Wraps validation checks
+  handoff_span:   # Wraps agent-to-agent transfers
+  custom_span:    # User-defined operations
+```
+
+### Hierarchical Trace Structure
+
+```json
+{
+  "trace_id": "trace_abc123def456",
+  "workflow_name": "implement_feature",
+  "group_id": "session_xyz789",
+  "spans": [
+    {
+      "span_id": "span_001",
+      "parent_id": null,
+      "type": "agent_span",
+      "agent_name": "orchestrator",
+      "started_at": "2026-01-07T10:00:00Z",
+      "ended_at": "2026-01-07T10:05:00Z",
+      "children": ["span_002", "span_003"]
+    },
+    {
+      "span_id": "span_002",
+      "parent_id": "span_001",
+      "type": "guardrail_span",
+      "guardrail_name": "input_validation",
+      "triggered": false,
+      "blocking": true
+    },
+    {
+      "span_id": "span_003",
+      "parent_id": "span_001",
+      "type": "handoff_span",
+      "from_agent": "orchestrator",
+      "to_agent": "backend-dev"
+    }
+  ]
+}
+```
+
+### Storage Location
+
+```
+.loki/traces/
+├── active/
+│   └── {trace_id}.json     # Currently running traces
+└── completed/
+    └── {date}/
+        └── {trace_id}.json # Archived traces
+```
+
+See `references/openai-patterns.md` for full tracing implementation.
+
+---
+
+## Efficiency Metrics System
+
+### Why Track Efficiency?
+
+ToolOrchestra achieves 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:
+
+- **Token usage** per task (input + output)
+- **Wall clock time** per task
+- **Agent spawns** per task
+- **Retry count** before success
+
+### Efficiency Tracking Schema
+
+```json
+{
+  "task_id": "task-2026-01-06-001",
+  "correlation_id": "session-abc123",
+  "started_at": "2026-01-06T10:00:00Z",
+  "completed_at": "2026-01-06T10:05:32Z",
+  "metrics": {
+    "wall_time_seconds": 332,
+    "agents_spawned": 3,
+    "total_agent_calls": 7,
+    "retry_count": 1,
+    "retry_reasons": ["test_failure"],
+    "recovery_rate": 1.0,
+    "model_usage": {
+      "haiku": {"calls": 4, "est_tokens": 12000},
+      "sonnet": {"calls": 2, "est_tokens": 8000},
+      "opus": {"calls": 1, "est_tokens": 6000}
+    }
+  },
+  "outcome": "success",
+  "outcome_reason": "tests_passed_after_fix",
+  "efficiency_score": 0.85,
+  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
+  "quality_pillars": {
+    "tool_selection_correct": true,
+    "tool_reliability_rate": 0.95,
+    "memory_retrieval_relevant": true,
+    "goal_adherence": 1.0
+  }
+}
+```
+
+**Why capture these metrics?** (Based on multi-agent research)
+
+1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
+   - "UX debt turns into data debt" - recording actions without intent creates useless analytics
+
+2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
+   - `recovery_rate = successful_retries / total_retries`
+   - Paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures
+
+3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
+   - `correlation_id`: Links all tasks in a session for end-to-end tracing
+   - Essential for debugging multi-agent coordination failures
+
+4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
+   - `tool_selection_correct`: Did we pick the right tool?
+   - `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
+   - Key insight: "Tool use reliability" is a primary demo-to-deployment gap
+
+5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
+   - `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
+   - `goal_adherence`: Did we stay on task? (0.0-1.0 score)
+
+### Efficiency Score Calculation
+
+```python
+def calculate_efficiency_score(metrics, task_complexity):
+    """
+    Score from 0-1 where higher is more efficient.
+    Based on ToolOrchestra's efficiency reward signal.
+    """
+    # Baseline expectations by complexity
+    baselines = {
+        "trivial": {"time": 60, "agents": 1, "retries": 0},
+        "simple": {"time": 180, "agents": 2, "retries": 0},
+        "moderate": {"time": 600, "agents": 4, "retries": 1},
+        "complex": {"time": 1800, "agents": 8, "retries": 2},
+        "critical": {"time": 3600, "agents": 12, "retries": 3}
+    }
+
+    baseline = baselines[task_complexity]
+
+    # Calculate component scores (1.0 = at baseline, >1 = better, <1 = worse)
+    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
+    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
+    retry_score = 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3))
+
+    # Weighted average (time matters most)
+    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
+```
+
+### Standard Reason Codes
+
+Use consistent codes to enable pattern analysis:
+
+```yaml
+outcome_reasons:
+  success:
+    - tests_passed_first_try
+    - tests_passed_after_fix
+    - review_approved
+    - spec_validated
+  partial:
+    - tests_partial_pass
+    - review_concerns_minor
+    - timeout_partial_work
+  failure:
+    - tests_failed
+    - review_blocked
+    - dependency_missing
+    - timeout_no_progress
+    - error_unrecoverable
+
+retry_reasons:
+  - test_failure
+  - lint_error
+  - type_error
+  - review_rejection
+  - rate_limit
+  - timeout
+  - dependency_conflict
+
+efficiency_factors:
+  positive:
+    - used_haiku_for_simple
+    - parallel_execution
+    - cached_result
+    - first_try_success
+    - spec_driven
+  negative:
+    - used_opus_for_simple
+    - sequential_when_parallel_possible
+    - multiple_retries
+    - missing_context
+    - unclear_requirements
+```
+
+### Storage Location
+
+```
+.loki/metrics/
+├── efficiency/
+│   ├── 2026-01-06.json      # Daily efficiency logs
+│   └── aggregate.json        # Running averages by task type
+└── rewards/
+    ├── outcomes.json         # Task success/failure records
+    └── preferences.json      # User preference signals
+```
+
+---
+
+## Reward Signal Framework
+
+### Three Reward Types (ToolOrchestra Pattern)
+
+```
+------------------------------------------------------------------+
+| 1. OUTCOME REWARD                                                 |
+|    - Did the task succeed? Binary + quality grade                 |
+|    - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure)       |
+------------------------------------------------------------------+
+| 2. EFFICIENCY REWARD                                              |
+|    - Did we use resources wisely?                                 |
+|    - Signal: 0.0 to 1.0 based on efficiency score                |
+------------------------------------------------------------------+
+| 3. PREFERENCE REWARD                                              |
+|    - Did the user like the approach/result?                       |
+|    - Signal: Inferred from user actions (accept/reject/modify)   |
+------------------------------------------------------------------+
+```
+
+### Outcome Reward Implementation
+
+```python
+def calculate_outcome_reward(task_result):
+    """
+    Outcome reward based on task completion status.
+    """
+    if task_result.status == "completed":
+        # Grade the quality of completion
+        if task_result.tests_passed and task_result.review_passed:
+            return 1.0  # Full success
+        elif task_result.tests_passed:
+            return 0.7  # Tests pass but review had concerns
+        else:
+            return 0.3  # Completed but with issues
+
+    elif task_result.status == "partial":
+        return 0.0  # Partial completion, no reward
+
+    else:  # failed
+        return -1.0  # Negative reward for failure
+```
+
+### Preference Reward Implementation
+
+```python
+def infer_preference_reward(task_result, user_actions):
+    """
+    Infer user preference from their actions after task completion.
+    Based on implicit feedback patterns.
+    """
+    signals = []
+
+    # Positive signals
+    if "commit" in user_actions:
+        signals.append(0.8)  # User committed our changes
+    if "deploy" in user_actions:
+        signals.append(1.0)  # User deployed our changes
+    if "no_edits" in user_actions:
+        signals.append(0.6)  # User didn't modify our output
+
+    # Negative signals
+    if "revert" in user_actions:
+        signals.append(-1.0)  # User reverted our changes
+    if "manual_fix" in user_actions:
+        signals.append(-0.5)  # User had to fix our work
+    if "retry_different" in user_actions:
+        signals.append(-0.3)  # User asked for different approach
+
+    # Neutral (no signal)
+    if not signals:
+        return None
+
+    return sum(signals) / len(signals)
+```
+
+### Reward Aggregation for Learning
+
+```python
+def aggregate_rewards(outcome, efficiency, preference):
+    """
+    Combine rewards into single learning signal.
+    Weights based on ToolOrchestra findings.
+    """
+    # Outcome is most important (must succeed)
+    # Efficiency secondary (once successful, optimize)
+    # Preference tertiary (align with user style)
+
+    weights = {
+        "outcome": 0.6,
+        "efficiency": 0.25,
+        "preference": 0.15
+    }
+
+    total = outcome * weights["outcome"]
+    total += efficiency * weights["efficiency"]
+
+    if preference is not None:
+        total += preference * weights["preference"]
+    else:
+        # Redistribute weight if no preference signal
+        total = total / (1 - weights["preference"])
+
+    return total
+```
+
+---
+
+## Dynamic Agent Selection
+
+### Task Complexity Classification
+
+```python
+def classify_task_complexity(task):
+    """
+    Classify task to determine agent allocation.
+    Based on ToolOrchestra's tool selection flexibility.
+    """
+    complexity_signals = {
+        # File scope signals
+        "single_file": -1,
+        "few_files": 0,       # 2-5 files
+        "many_files": +1,     # 6-20 files
+        "system_wide": +2,    # 20+ files
+
+        # Change type signals
+        "typo_fix": -2,
+        "bug_fix": 0,
+        "feature": +1,
+        "refactor": +1,
+        "architecture": +2,
+
+        # Domain signals
+        "documentation": -1,
+        "tests_only": 0,
+        "frontend": 0,
+        "backend": 0,
+        "full_stack": +1,
+        "infrastructure": +1,
+        "security": +2,
+    }
+
+    score = 0
+    for signal, weight in complexity_signals.items():
+        if task.has_signal(signal):
+            score += weight
+
+    # Map score to complexity level
+    if score <= -2:
+        return "trivial"
+    elif score <= 0:
+        return "simple"
+    elif score <= 2:
+        return "moderate"
+    elif score <= 4:
+        return "complex"
+    else:
+        return "critical"
+```
+
+### Agent Allocation by Complexity
+
+```yaml
+# Agent allocation strategy
+# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
+complexity_allocations:
+  trivial:
+    max_agents: 1
+    planning: null         # No planning needed
+    development: haiku
+    testing: haiku
+    review: skip           # No review needed for trivial
+    parallel: false
+
+  simple:
+    max_agents: 2
+    planning: null         # No planning needed
+    development: haiku
+    testing: haiku
+    review: single         # One quick review
+    parallel: false
+
+  moderate:
+    max_agents: 4
+    planning: sonnet       # Sonnet for moderate planning
+    development: sonnet
+    testing: haiku         # Unit tests always haiku
+    review: standard       # 3 parallel reviewers
+    parallel: true
+
+  complex:
+    max_agents: 8
+    planning: opus         # Opus ONLY for complex planning
+    development: sonnet    # Sonnet for implementation
+    testing: haiku         # Unit tests still haiku
+    review: deep           # 3 reviewers + devil's advocate
+    parallel: true
+
+  critical:
+    max_agents: 12
+    planning: opus         # Opus for critical planning
+    development: sonnet    # Sonnet for implementation
+    testing: sonnet        # Functional/E2E tests with sonnet
+    review: exhaustive     # Multiple review rounds
+    parallel: true
+    human_checkpoint: true # Pause for human review
+```
+
+### Dynamic Selection Algorithm
+
+```python
+def select_agents_for_task(task, available_agents):
+    """
+    Dynamically select agents based on task requirements.
+    Inspired by ToolOrchestra's configurable tool selection.
+    """
+    complexity = classify_task_complexity(task)
+    allocation = COMPLEXITY_ALLOCATIONS[complexity]
+
+    # 1. Identify required agent types
+    required_types = identify_required_agents(task)
+
+    # 2. Filter to available agents of required types
+    candidates = [a for a in available_agents if a.type in required_types]
+
+    # 3. Score candidates by past performance
+    for agent in candidates:
+        agent.selection_score = get_agent_performance_score(
+            agent,
+            task_type=task.type,
+            complexity=complexity
+        )
+
+    # 4. Select top N agents up to allocation limit
+    candidates.sort(key=lambda a: a.selection_score, reverse=True)
+    selected = candidates[:allocation["max_agents"]]
+
+    # 5. Assign models based on complexity
+    for agent in selected:
+        if agent.role == "reviewer":
+            agent.model = "opus"  # Always opus for reviews
+        else:
+            agent.model = allocation["model"]
+
+    return selected
+
+def get_agent_performance_score(agent, task_type, complexity):
+    """
+    Score agent based on historical performance on similar tasks.
+    Uses reward signals from previous executions.
+    """
+    history = load_agent_history(agent.id)
+
+    # Filter to similar tasks
+    similar = [h for h in history
+               if h.task_type == task_type
+               and h.complexity == complexity]
+
+    if not similar:
+        return 0.5  # Neutral score if no history
+
+    # Average past rewards
+    return sum(h.aggregate_reward for h in similar) / len(similar)
+```
+
+---
+
+## Tool Usage Analytics
+
+### Track Tool Effectiveness
+
+```json
+{
+  "tool_analytics": {
+    "period": "2026-01-06",
+    "by_tool": {
+      "Grep": {
+        "calls": 142,
+        "success_rate": 0.89,
+        "avg_result_quality": 0.82,
+        "common_patterns": ["error handling", "function def"]
+      },
+      "Task": {
+        "calls": 47,
+        "success_rate": 0.94,
+        "avg_efficiency": 0.76,
+        "by_subagent_type": {
+          "general-purpose": {"calls": 35, "success": 0.91},
+          "Explore": {"calls": 12, "success": 1.0}
+        }
+      }
+    },
+    "insights": [
+      "Explore agent 100% success - use more for codebase search",
+      "Grep success drops to 0.65 for regex patterns - simplify searches"
+    ]
+  }
+}
+```
+
+### Continuous Improvement Loop
+
+```
+------------------------------------------------------------------+
+| 1. COLLECT                                                        |
+|    Record every task: agents used, tools called, outcome          |
+------------------------------------------------------------------+
+          |
+          v
+------------------------------------------------------------------+
+| 2. ANALYZE                                                        |
+|    Weekly aggregation: What worked? What didn't?                  |
+|    Identify patterns in high-reward vs low-reward tasks           |
+------------------------------------------------------------------+
+          |
+          v
+------------------------------------------------------------------+
+| 3. ADAPT                                                          |
+|    Update selection algorithms based on analytics                 |
+|    Store successful patterns in semantic memory                   |
+------------------------------------------------------------------+
+          |
+          v
+------------------------------------------------------------------+
+| 4. VALIDATE                                                       |
+|    A/B test new selection strategies                              |
+|    Measure efficiency improvement                                 |
+------------------------------------------------------------------+
+          |
+          +-----------> Loop back to COLLECT
+```
+
+---
+
+## Integration with RARV Cycle
+
+The orchestration patterns integrate with RARV at each phase:
+
+```
+REASON:
+├── Check efficiency metrics for similar past tasks
+├── Classify task complexity
+└── Select appropriate agent allocation
+
+ACT:
+├── Dispatch agents according to allocation
+├── Track start time and resource usage
+└── Record tool calls and agent interactions
+
+REFLECT:
+├── Calculate outcome reward (did it work?)
+├── Calculate efficiency reward (resource usage)
+└── Log to metrics store
+
+VERIFY:
+├── Run verification checks
+├── If failed: negative outcome reward, retry with learning
+├── If passed: infer preference reward from user actions
+└── Update agent performance scores
+```
+
+---
+
+## Key Metrics Dashboard
+
+Track these metrics in `.loki/metrics/dashboard.json`:
+
+```json
+{
+  "dashboard": {
+    "period": "rolling_7_days",
+    "summary": {
+      "tasks_completed": 127,
+      "success_rate": 0.94,
+      "avg_efficiency_score": 0.78,
+      "avg_outcome_reward": 0.82,
+      "avg_preference_reward": 0.71,
+      "avg_recovery_rate": 0.87,
+      "avg_goal_adherence": 0.93
+    },
+    "quality_pillars": {
+      "tool_selection_accuracy": 0.91,
+      "tool_reliability_rate": 0.93,
+      "memory_retrieval_relevance": 0.84,
+      "policy_adherence": 0.96
+    },
+    "trends": {
+      "efficiency": "+12% vs previous week",
+      "success_rate": "+3% vs previous week",
+      "avg_agents_per_task": "-0.8 (improving)",
+      "recovery_rate": "+5% vs previous week"
+    },
+    "top_performing_patterns": [
+      "Haiku for unit tests (0.95 success, 0.92 efficiency)",
+      "Explore agent for codebase search (1.0 success)",
+      "Parallel review with opus (0.98 accuracy)"
+    ],
+    "areas_for_improvement": [
+      "Complex refactors taking 2x expected time",
+      "Security review efficiency below baseline",
+      "Memory retrieval relevance below 0.85 target"
+    ]
+  }
+}
+```
+
+---
+
+## Multi-Dimensional Evaluation
+
+Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):
+
+> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"
+
+**Loki Mode tracks four evaluation axes:**
+
+| Axis | Metrics | Current Coverage |
+|------|---------|------------------|
+| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
+| **Human-Centered** | preference_reward, goal_adherence | Partial |
+| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
+| **Economic** | model_usage, agents_spawned, wall_time | Full |
+
+---
+
+## Sources
+
+**OpenAI Agents SDK:**
+- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
+- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
+- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
+- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
+- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability
+
+**Efficiency & Orchestration:**
+- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
+- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis
+
+**Evaluation Frameworks:**
+- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
+- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
+- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm
+
+**Best Practices:**
+- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
+- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
+- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
+- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode
+
+**Safety & Reasoning:**
+- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
+- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
+- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)