Fix: Ensure all skills are tracked as files, not submodules

This commit is contained in:
sck_0
2026-01-14 18:48:48 +01:00
parent 7f46ed8ca1
commit 8bd204708b
1113 changed files with 82065 additions and 2 deletions

View File

@@ -0,0 +1,453 @@
# Advanced Agentic Patterns Reference
Research-backed patterns from 2025-2026 literature for enhanced multi-agent orchestration.
---
## Memory Architecture (MIRIX/A-Mem/MemGPT Research)
### Three-Layer Memory System
```
+------------------------------------------------------------------+
| EPISODIC MEMORY (Specific Events) |
| - What happened, when, where |
| - Full interaction traces with timestamps |
| - Stored in: .loki/memory/episodic/ |
+------------------------------------------------------------------+
| SEMANTIC MEMORY (Generalized Knowledge) |
| - Abstracted patterns and facts |
| - Context-independent knowledge |
| - Stored in: .loki/memory/semantic/ |
+------------------------------------------------------------------+
| PROCEDURAL MEMORY (Learned Skills) |
| - How to do things |
| - Successful action sequences |
| - Stored in: .loki/memory/skills/ |
+------------------------------------------------------------------+
```
### Episodic-to-Semantic Consolidation
**Protocol:** After completing tasks, consolidate specific experiences into general knowledge.
```python
def consolidate_memory(task_result):
"""
Transform episodic (what happened) to semantic (how things work).
Based on MemGPT and Voyager patterns.
"""
# 1. Store raw episodic trace
episodic_entry = {
"timestamp": now(),
"task_id": task_result.id,
"context": task_result.context,
"actions": task_result.action_log,
"outcome": task_result.outcome,
"errors": task_result.errors
}
save_to_episodic(episodic_entry)
# 2. Extract generalizable patterns
if task_result.success:
pattern = extract_pattern(task_result)
if pattern.is_generalizable():
semantic_entry = {
"pattern": pattern.description,
"conditions": pattern.when_to_apply,
"actions": pattern.steps,
"confidence": pattern.success_rate,
"source_episodes": [task_result.id]
}
save_to_semantic(semantic_entry)
# 3. If error, create anti-pattern
if task_result.errors:
anti_pattern = {
"what_failed": task_result.errors[0].message,
"why_failed": analyze_root_cause(task_result),
"prevention": generate_prevention_rule(task_result),
"severity": classify_severity(task_result.errors)
}
save_to_learnings(anti_pattern)
```
### Zettelkasten-Inspired Note Linking (A-Mem Pattern)
Each memory note is atomic and linked to related notes:
```json
{
"id": "note-2026-01-06-001",
"content": "Express route handlers need explicit return types in strict mode",
"type": "semantic",
"links": [
{"to": "note-2026-01-05-042", "relation": "derived_from"},
{"to": "note-2026-01-06-003", "relation": "related_to"}
],
"tags": ["typescript", "express", "strict-mode"],
"confidence": 0.95,
"usage_count": 12
}
```
---
## Multi-Agent Reflexion (MAR Pattern)
### Problem: Degeneration-of-Thought
Single-agent self-critique leads to repeating the same flawed reasoning across iterations.
### Solution: Structured Debate Among Persona-Based Critics
```
+------------------+ +------------------+ +------------------+
| IMPLEMENTER | | SKEPTIC | | ADVOCATE |
| (Creates work) | --> | (Challenges it) | --> | (Defends merits) |
+------------------+ +------------------+ +------------------+
| | |
v v v
+------------------------------------------------------------------+
| SYNTHESIZER |
| - Weighs all perspectives |
| - Identifies valid concerns vs. false negatives |
| - Produces final verdict with evidence |
+------------------------------------------------------------------+
```
### Anti-Sycophancy Protocol (CONSENSAGENT)
**Problem:** Agents reinforce each other's responses instead of critically engaging.
**Solution:**
```python
def anti_sycophancy_review(implementation, reviewers):
"""
Prevent reviewers from just agreeing with each other.
Based on CONSENSAGENT research.
"""
# 1. Independent review phase (no visibility of other reviews)
independent_reviews = []
for reviewer in reviewers:
review = reviewer.review(
implementation,
visibility="blind", # Cannot see other reviews
prompt_suffix="Be skeptical. List specific concerns."
)
independent_reviews.append(review)
# 2. Debate phase (now reveal reviews)
if has_disagreement(independent_reviews):
debate_result = structured_debate(
reviews=independent_reviews,
max_rounds=2,
require_evidence=True # Must cite specific code/lines
)
else:
# All agreed - run devil's advocate check
devil_review = devil_advocate_agent.review(
implementation,
prompt="Find problems the other reviewers missed. Be contrarian."
)
independent_reviews.append(devil_review)
# 3. Synthesize with validity check
return synthesize_with_validity_alignment(independent_reviews)
def synthesize_with_validity_alignment(reviews):
"""
Research shows validity-aligned reasoning most strongly predicts improvement.
"""
findings = []
for review in reviews:
for concern in review.concerns:
findings.append({
"concern": concern.description,
"evidence": concern.code_reference, # Must have evidence
"severity": concern.severity,
"is_valid": verify_concern_is_actionable(concern)
})
# Filter to only valid, evidenced concerns
return [f for f in findings if f["is_valid"] and f["evidence"]]
```
### Heterogeneous Team Composition
**Research finding:** Diverse teams outperform homogeneous ones by 4-6%.
```yaml
review_team:
- role: "security_analyst"
model: opus
expertise: ["OWASP", "auth", "injection"]
personality: "paranoid"
- role: "performance_engineer"
model: sonnet
expertise: ["complexity", "caching", "async"]
personality: "pragmatic"
- role: "maintainability_advocate"
model: opus
expertise: ["SOLID", "patterns", "readability"]
personality: "perfectionist"
```
---
## Hierarchical Planning (GoalAct/TMS Patterns)
### Global Planning with Hierarchical Execution
**Research:** GoalAct achieved 12.22% improvement in success rate using this pattern.
```
+------------------------------------------------------------------+
| GLOBAL PLANNER |
| - Maintains overall goal and strategy |
| - Continuously updates plan based on progress |
| - Decomposes into high-level skills |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| HIGH-LEVEL SKILLS |
| - searching, coding, testing, writing, deploying |
| - Each skill has defined entry/exit conditions |
| - Reduces planning complexity at execution level |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| LOCAL EXECUTORS |
| - Execute specific actions within skill context |
| - Report progress back to global planner |
| - Can request skill escalation if blocked |
+------------------------------------------------------------------+
```
### Thought Management System (TMS)
**For long-horizon tasks:**
```python
class ThoughtManagementSystem:
"""
Based on TMS research for long-horizon autonomous tasks.
Enables dynamic prioritization and adaptive strategy.
"""
def __init__(self, completion_promise):
self.goal_hierarchy = self.decompose_goal(completion_promise)
self.active_thoughts = PriorityQueue()
self.completed_thoughts = []
self.blocked_thoughts = []
def decompose_goal(self, goal):
"""
Hierarchical goal decomposition with self-critique.
"""
# Level 0: Ultimate goal
hierarchy = {"goal": goal, "subgoals": []}
# Level 1: Phase-level subgoals
phases = self.identify_phases(goal)
for phase in phases:
phase_node = {"goal": phase, "subgoals": []}
# Level 2: Task-level subgoals
tasks = self.identify_tasks(phase)
for task in tasks:
phase_node["subgoals"].append({"goal": task, "subgoals": []})
hierarchy["subgoals"].append(phase_node)
return hierarchy
def iterate(self):
"""
Single iteration with self-critique.
"""
# 1. Select highest priority thought
thought = self.active_thoughts.pop()
# 2. Execute thought
result = self.execute(thought)
# 3. Self-critique: Did this make progress?
critique = self.self_critique(thought, result)
# 4. Adapt strategy based on critique
if critique.made_progress:
self.completed_thoughts.append(thought)
self.generate_next_thoughts(thought, result)
elif critique.is_blocked:
self.blocked_thoughts.append(thought)
self.escalate_or_decompose(thought)
else:
# No progress, not blocked - need different approach
thought.attempts += 1
thought.alternative_strategy = critique.suggested_alternative
self.active_thoughts.push(thought)
```
---
## Iter-VF: Iterative Verification-First
**Key insight:** Verify the extracted answer only, not the whole thinking process.
```python
def iterative_verify_first(task, max_iterations=3):
"""
Based on Iter-VF research: verify answer, maintain Markovian process.
Avoids context overflow and error accumulation.
"""
for iteration in range(max_iterations):
# 1. Generate solution
solution = generate_solution(task)
# 2. Extract concrete answer/output
answer = extract_answer(solution)
# 3. Verify ONLY the answer (not reasoning chain)
verification = verify_answer(
answer=answer,
spec=task.spec,
tests=task.tests
)
if verification.passes:
return solution
# 4. Markovian retry: fresh context with just error info
task = create_fresh_task(
original=task,
error=verification.error,
attempt=iteration + 1
# NOTE: Do NOT include previous reasoning chain
)
return FailedResult(task, "Max iterations reached")
```
---
## Collaboration Structures
### When to Use Each Structure
| Structure | Use When | Loki Mode Application |
|-----------|----------|----------------------|
| **Centralized** | Need consistency, single source of truth | Orchestrator for phase management |
| **Decentralized** | Need fault tolerance, parallel execution | Agent swarms for implementation |
| **Hierarchical** | Complex tasks with clear decomposition | Global planner -> Skill -> Executor |
### Coopetition Pattern
**Agents compete on alternatives, cooperate on consensus:**
```python
def coopetition_decision(agents, decision_point):
"""
Competition phase: Generate diverse alternatives
Cooperation phase: Reach consensus on best option
"""
# COMPETITION: Each agent proposes solution independently
proposals = []
for agent in agents:
proposal = agent.propose(
decision_point,
visibility="blind" # No peeking at other proposals
)
proposals.append(proposal)
# COOPERATION: Collaborative evaluation
if len(set(p.approach for p in proposals)) == 1:
# Unanimous - likely good solution
return proposals[0]
# Multiple approaches - structured debate
for proposal in proposals:
proposal.pros = evaluate_pros(proposal)
proposal.cons = evaluate_cons(proposal)
proposal.evidence = gather_evidence(proposal)
# Vote with reasoning requirement
winner = ranked_choice_vote(
proposals,
require_justification=True
)
return winner
```
---
## Progressive Complexity Escalation
**Start simple, escalate only when needed:**
```
Level 1: Single Agent, Direct Execution
|
+-- Success? --> Done
|
+-- Failure? --> Escalate
|
v
Level 2: Single Agent + Self-Verification Loop
|
+-- Success? --> Done
|
+-- Failure after 3 attempts? --> Escalate
|
v
Level 3: Multi-Agent Review
|
+-- Success? --> Done
|
+-- Persistent issues? --> Escalate
|
v
Level 4: Hierarchical Planning + Decomposition
|
+-- Success? --> Done
|
+-- Fundamental blocker? --> Human escalation
```
---
## Key Research Findings Summary
### What Works
1. **Heterogeneous teams** outperform homogeneous by 4-6%
2. **Iter-VF** (verify answer only) prevents context overflow
3. **Episodic-to-semantic consolidation** enables genuine learning
4. **Anti-sycophancy measures** (blind review, devil's advocate) improve accuracy 30%+
5. **Global planning** with local execution improves success rate 12%+
### What Doesn't Work
1. **Deep debate chains** - diminishing returns after 1-2 rounds
2. **Confidence visibility** - causes over-confidence cascades
3. **Full reasoning chain review** - leads to error accumulation
4. **Homogeneous reviewer teams** - miss diverse failure modes
5. **Over-engineered orchestration** - model upgrades outpace gains
---
## Sources
- [Multi-Agent Collaboration Mechanisms Survey](https://arxiv.org/abs/2501.06322)
- [CONSENSAGENT: Anti-Sycophancy Framework](https://aclanthology.org/2025.findings-acl.1141/)
- [GoalAct: Global Planning + Hierarchical Execution](https://arxiv.org/abs/2504.16563)
- [A-Mem: Agentic Memory System](https://arxiv.org/html/2502.12110v11)
- [Multi-Agent Reflexion (MAR)](https://arxiv.org/html/2512.20845)
- [Iter-VF: Iterative Verification-First](https://arxiv.org/html/2511.21734v1)
- [Awesome Agentic Patterns](https://github.com/nibzard/awesome-agentic-patterns)

View File

@@ -0,0 +1,188 @@
# Agent Types Reference
Complete definitions and capabilities for all 37 specialized agent types.
---
## Overview
Loki Mode has 37 predefined agent types organized into 7 specialized swarms. The orchestrator spawns only the agents needed for your project - a simple app might use 5-10 agents, while a complex startup could spawn 100+ agents working in parallel.
---
## Engineering Swarm (8 types)
| Agent | Capabilities |
|-------|-------------|
| `eng-frontend` | React/Vue/Svelte, TypeScript, Tailwind, accessibility, responsive design, state management |
| `eng-backend` | Node/Python/Go, REST/GraphQL, auth, business logic, middleware, validation |
| `eng-database` | PostgreSQL/MySQL/MongoDB, migrations, query optimization, indexing, backups |
| `eng-mobile` | React Native/Flutter/Swift/Kotlin, offline-first, push notifications, app store prep |
| `eng-api` | OpenAPI specs, SDK generation, versioning, webhooks, rate limiting, documentation |
| `eng-qa` | Unit/integration/E2E tests, coverage, automation, test data management |
| `eng-perf` | Profiling, benchmarking, optimization, caching, load testing, memory analysis |
| `eng-infra` | Docker, K8s manifests, IaC review, networking, security hardening |
---
## Operations Swarm (8 types)
| Agent | Capabilities |
|-------|-------------|
| `ops-devops` | CI/CD pipelines, GitHub Actions, GitLab CI, Jenkins, build optimization |
| `ops-sre` | Reliability, SLOs/SLIs, capacity planning, on-call, runbooks |
| `ops-security` | SAST/DAST, pen testing, vulnerability management, security reviews |
| `ops-monitor` | Observability, Datadog/Grafana, alerting, dashboards, log aggregation |
| `ops-incident` | Incident response, runbooks, RCA, post-mortems, communication |
| `ops-release` | Versioning, changelogs, blue-green, canary, rollbacks, feature flags |
| `ops-cost` | Cloud cost optimization, right-sizing, FinOps, reserved instances |
| `ops-compliance` | SOC2, GDPR, HIPAA, PCI-DSS, audit preparation, policy enforcement |
---
## Business Swarm (8 types)
| Agent | Capabilities |
|-------|-------------|
| `biz-marketing` | Landing pages, SEO, content, email campaigns, social media |
| `biz-sales` | CRM setup, outreach, demos, proposals, pipeline management |
| `biz-finance` | Billing (Stripe), invoicing, metrics, runway, pricing strategy |
| `biz-legal` | ToS, privacy policy, contracts, IP protection, compliance docs |
| `biz-support` | Help docs, FAQs, ticket system, chatbot, knowledge base |
| `biz-hr` | Job posts, recruiting, onboarding, culture docs, team structure |
| `biz-investor` | Pitch decks, investor updates, data room, cap table management |
| `biz-partnerships` | BD outreach, integration partnerships, co-marketing, API partnerships |
---
## Data Swarm (3 types)
| Agent | Capabilities |
|-------|-------------|
| `data-ml` | Model training, MLOps, feature engineering, inference, model monitoring |
| `data-eng` | ETL pipelines, data warehousing, dbt, Airflow, data quality |
| `data-analytics` | Product analytics, A/B tests, dashboards, insights, reporting |
---
## Product Swarm (3 types)
| Agent | Capabilities |
|-------|-------------|
| `prod-pm` | Backlog grooming, prioritization, roadmap, specs, stakeholder management |
| `prod-design` | Design system, Figma, UX patterns, prototypes, user research |
| `prod-techwriter` | API docs, guides, tutorials, release notes, developer experience |
---
## Growth Swarm (4 types)
| Agent | Capabilities |
|-------|-------------|
| `growth-hacker` | Growth experiments, viral loops, referral programs, acquisition |
| `growth-community` | Community building, Discord/Slack, ambassador programs, events |
| `growth-success` | Customer success, health scoring, churn prevention, expansion |
| `growth-lifecycle` | Email lifecycle, in-app messaging, re-engagement, onboarding |
---
## Review Swarm (3 types)
| Agent | Capabilities |
|-------|-------------|
| `review-code` | Code quality, design patterns, SOLID, maintainability, best practices |
| `review-business` | Requirements alignment, business logic, edge cases, UX flows |
| `review-security` | Vulnerabilities, auth/authz, OWASP Top 10, data protection |
---
## Agent Execution Model
**Claude Code does NOT support background processes.** Agents execute via:
1. **Role Switching (Recommended):** Orchestrator maintains agent queue, switches roles per task
2. **Sequential:** Execute agents one at a time (simple, reliable)
3. **Parallel via tmux:** Multiple Claude Code sessions (complex, faster)
```bash
# Option 1: Sequential (simple, reliable)
for agent in frontend backend database; do
claude -p "Act as $agent agent..." --dangerously-skip-permissions
done
# Option 2: Parallel via tmux (complex, faster)
tmux new-session -d -s loki-pool
for i in {1..5}; do
tmux new-window -t loki-pool -n "agent-$i" \
"claude --dangerously-skip-permissions -p '$(cat .loki/prompts/agent-$i.md)'"
done
# Option 3: Role switching (recommended)
# Orchestrator maintains agent queue, switches roles per task
```
---
## Model Selection by Agent Type
| Task Type | Model | Reason |
|-----------|-------|--------|
| Implementation | Sonnet | Fast, good enough for coding |
| Code Review | Opus | Deep analysis, catches subtle issues |
| Security Review | Opus | Critical, needs thoroughness |
| Business Logic Review | Opus | Needs to understand requirements deeply |
| Documentation | Sonnet | Straightforward writing |
| Quick fixes | Haiku | Fast iteration |
---
## Agent Lifecycle
```
SPAWN -> INITIALIZE -> POLL_QUEUE -> CLAIM_TASK -> EXECUTE -> REPORT -> POLL_QUEUE
| | | |
| circuit open? timeout? success?
| | | |
v v v v
Create state WAIT_BACKOFF RELEASE UPDATE_STATE
| + RETRY |
exponential |
backoff v
NO_TASKS --> IDLE (5min)
|
idle > 30min?
|
v
TERMINATE
```
---
## Dynamic Scaling Rules
| Condition | Action | Cooldown |
|-----------|--------|----------|
| Queue depth > 20 | Spawn 2 agents of bottleneck type | 5min |
| Queue depth > 50 | Spawn 5 agents, alert orchestrator | 2min |
| Agent idle > 30min | Terminate agent | - |
| Agent failed 3x consecutive | Terminate, open circuit breaker | 5min |
| Critical task waiting > 10min | Spawn priority agent | 1min |
| Circuit breaker half-open | Spawn 1 test agent | - |
| All agents of type failed | HALT, request human intervention | - |
---
## Agent Context Preservation
### Lineage Rules
1. **Immutable Inheritance:** Agents CANNOT modify inherited context
2. **Decision Logging:** All decisions MUST be logged to agent context file
3. **Lineage Reference:** All commits MUST reference parent agent ID
4. **Context Handoff:** When agent completes, context is archived but lineage preserved
### Preventing Context Drift
1. Read `.agent/sub-agents/${parent_id}.json` before spawning
2. Inherit immutable context (tech stack, constraints, decisions)
3. Log all new decisions to own context file
4. Reference lineage in all commits
5. Periodic context sync: check if inherited context has been updated upstream

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,550 @@
# Business Operations Reference
Workflows and procedures for business swarm agents.
## Marketing Operations
### Landing Page Checklist
```
[ ] Hero section with clear value proposition
[ ] Problem/solution narrative
[ ] Feature highlights (3-5 key features)
[ ] Social proof (testimonials, logos, stats)
[ ] Pricing section (if applicable)
[ ] FAQ section
[ ] Call-to-action (primary and secondary)
[ ] Footer with legal links
```
### SEO Optimization
```yaml
Technical SEO:
- meta title: 50-60 characters, include primary keyword
- meta description: 150-160 characters, compelling
- canonical URL set
- robots.txt configured
- sitemap.xml generated
- structured data (JSON-LD)
- Open Graph tags
- Twitter Card tags
Performance:
- Largest Contentful Paint < 2.5s
- First Input Delay < 100ms
- Cumulative Layout Shift < 0.1
- Images optimized (WebP, lazy loading)
Content:
- H1 contains primary keyword
- H2-H6 hierarchy logical
- Internal linking strategy
- Alt text on all images
- Content length appropriate for intent
```
### Content Calendar Template
```markdown
# Week of [DATE]
## Monday
- [ ] Blog post: [TITLE]
- [ ] Social: LinkedIn announcement
## Wednesday
- [ ] Email newsletter
- [ ] Social: Twitter thread
## Friday
- [ ] Case study update
- [ ] Social: Feature highlight
```
### Email Sequences
**Onboarding Sequence:**
```
Day 0: Welcome email (immediate)
- Thank you for signing up
- Quick start guide link
- Support contact
Day 1: Getting started
- First feature tutorial
- Video walkthrough
Day 3: Value demonstration
- Success metrics
- Customer story
Day 7: Check-in
- How's it going?
- Feature discovery
Day 14: Advanced features
- Power user tips
- Integration options
```
**Abandoned Cart/Trial:**
```
Hour 1: Reminder
Day 1: Benefits recap
Day 3: Testimonial + urgency
Day 7: Final offer
```
---
## Sales Operations
### CRM Pipeline Stages
```
1. Lead (new contact)
2. Qualified (fits ICP, has need)
3. Meeting Scheduled
4. Demo Completed
5. Proposal Sent
6. Negotiation
7. Closed Won / Closed Lost
```
### Qualification Framework (BANT)
```yaml
Budget:
- What's the allocated budget?
- Who controls the budget?
Authority:
- Who makes the final decision?
- Who else is involved?
Need:
- What problem are you solving?
- What's the impact of not solving it?
Timeline:
- When do you need a solution?
- What's driving that timeline?
```
### Outreach Template
```markdown
Subject: [Specific pain point] at [Company]
Hi [Name],
I noticed [Company] is [specific observation about their business].
Many [similar role/company type] struggle with [problem], which leads to [negative outcome].
[Product] helps by [specific solution], resulting in [specific benefit with metric].
Would you be open to a 15-minute call to see if this could help [Company]?
Best,
[Name]
```
### Demo Script Structure
```
1. Rapport (2 min)
- Confirm attendees and roles
- Agenda overview
2. Discovery (5 min)
- Confirm pain points
- Understand current process
- Success metrics
3. Solution (15 min)
- Map features to their needs
- Show don't tell
- Address specific use cases
4. Social Proof (3 min)
- Relevant customer stories
- Metrics and outcomes
5. Pricing/Next Steps (5 min)
- Present options
- Answer objections
- Define next steps
```
---
## Finance Operations
### Billing Setup Checklist (Stripe)
```bash
# Initialize Stripe
npm install stripe
# Required configurations:
- [ ] Products and prices created
- [ ] Customer portal enabled
- [ ] Webhook endpoints configured
- [ ] Tax settings (Stripe Tax or manual)
- [ ] Invoice settings customized
- [ ] Payment methods enabled
- [ ] Fraud protection rules
```
### Webhook Events to Handle
```javascript
const relevantEvents = [
'customer.subscription.created',
'customer.subscription.updated',
'customer.subscription.deleted',
'invoice.paid',
'invoice.payment_failed',
'payment_intent.succeeded',
'payment_intent.payment_failed',
'customer.updated',
'charge.refunded'
];
```
### Key Metrics Dashboard
```yaml
Revenue Metrics:
- MRR (Monthly Recurring Revenue)
- ARR (Annual Recurring Revenue)
- Net Revenue Retention
- Expansion Revenue
- Churn Rate
Customer Metrics:
- CAC (Customer Acquisition Cost)
- LTV (Lifetime Value)
- LTV:CAC Ratio (target: 3:1)
- Payback Period
Product Metrics:
- Trial to Paid Conversion
- Activation Rate
- Feature Adoption
- NPS Score
```
### Runway Calculation
```
Monthly Burn = Total Monthly Expenses - Monthly Revenue
Runway (months) = Cash Balance / Monthly Burn
Healthy: > 18 months
Warning: 6-12 months
Critical: < 6 months
```
---
## Legal Operations
### Terms of Service Template Sections
```
1. Acceptance of Terms
2. Description of Service
3. User Accounts and Registration
4. User Conduct and Content
5. Intellectual Property Rights
6. Payment Terms (if applicable)
7. Termination
8. Disclaimers and Limitations
9. Indemnification
10. Dispute Resolution
11. Changes to Terms
12. Contact Information
```
### Privacy Policy Requirements (GDPR)
```
Required Disclosures:
- [ ] Data controller identity
- [ ] Types of data collected
- [ ] Purpose of processing
- [ ] Legal basis for processing
- [ ] Data retention periods
- [ ] Third-party sharing
- [ ] User rights (access, rectification, deletion)
- [ ] Cookie usage
- [ ] International transfers
- [ ] Contact information
- [ ] DPO contact (if applicable)
```
### GDPR Compliance Checklist
```
Data Collection:
- [ ] Consent mechanism implemented
- [ ] Purpose limitation documented
- [ ] Data minimization practiced
User Rights:
- [ ] Right to access (data export)
- [ ] Right to rectification (edit profile)
- [ ] Right to erasure (delete account)
- [ ] Right to portability (download data)
- [ ] Right to object (marketing opt-out)
Technical:
- [ ] Encryption at rest
- [ ] Encryption in transit
- [ ] Access logging
- [ ] Breach notification process
```
### Cookie Consent Implementation
```javascript
// Cookie categories
const cookieCategories = {
necessary: true, // Always enabled
functional: false, // User preference
analytics: false, // Tracking/analytics
marketing: false // Advertising
};
// Required: Show banner before non-necessary cookies
// Required: Allow granular control
// Required: Easy withdrawal of consent
// Required: Record consent timestamp
```
---
## Customer Support Operations
### Ticket Priority Matrix
| Priority | Description | Response SLA | Resolution SLA |
|----------|-------------|--------------|----------------|
| P1 - Critical | Service down, data loss | 15 min | 4 hours |
| P2 - High | Major feature broken | 1 hour | 8 hours |
| P3 - Medium | Feature impaired | 4 hours | 24 hours |
| P4 - Low | General questions | 24 hours | 72 hours |
### Response Templates
**Acknowledgment:**
```
Hi [Name],
Thanks for reaching out! I've received your message about [issue summary].
I'm looking into this now and will get back to you within [SLA time].
In the meantime, [helpful resource or workaround if applicable].
Best,
[Agent Name]
```
**Resolution:**
```
Hi [Name],
Great news - I've resolved the issue with [specific problem].
Here's what was happening: [brief explanation]
Here's what I did to fix it: [solution summary]
To prevent this in the future: [if applicable]
Please let me know if you have any questions!
Best,
[Agent Name]
```
### Knowledge Base Structure
```
/help
├── /getting-started
│ ├── quick-start-guide
│ ├── account-setup
│ └── first-steps
├── /features
│ ├── feature-a
│ ├── feature-b
│ └── feature-c
├── /billing
│ ├── plans-and-pricing
│ ├── payment-methods
│ └── invoices
├── /integrations
│ ├── integration-a
│ └── integration-b
├── /troubleshooting
│ ├── common-issues
│ └── error-messages
└── /api
├── authentication
├── endpoints
└── examples
```
---
## Analytics Operations
### Event Tracking Plan
```yaml
User Lifecycle:
- user_signed_up:
properties: [source, referrer, plan]
- user_activated:
properties: [activation_method, time_to_activate]
- user_converted:
properties: [plan, trial_length, conversion_path]
- user_churned:
properties: [reason, lifetime_value, last_active]
Core Actions:
- feature_used:
properties: [feature_name, context]
- action_completed:
properties: [action_type, duration, success]
- error_encountered:
properties: [error_type, page, context]
Engagement:
- page_viewed:
properties: [page_name, referrer, duration]
- button_clicked:
properties: [button_name, page, context]
- search_performed:
properties: [query, results_count]
```
### A/B Testing Framework
```yaml
Test Structure:
name: "Homepage CTA Test"
hypothesis: "Changing CTA from 'Sign Up' to 'Start Free' will increase conversions"
primary_metric: signup_rate
secondary_metrics: [time_on_page, bounce_rate]
variants:
control:
description: "Original 'Sign Up' button"
allocation: 50%
variant_a:
description: "'Start Free' button"
allocation: 50%
sample_size: 1000_per_variant
duration: 14_days
significance_level: 0.95
Analysis:
- Calculate conversion rate per variant
- Run chi-squared test for significance
- Check for novelty effects
- Segment by user type if needed
- Document learnings
```
### Funnel Analysis
```
Signup Funnel:
1. Landing Page Visit → 100% (baseline)
2. Signup Page View → 40% (60% drop-off)
3. Form Submitted → 25% (15% drop-off)
4. Email Verified → 20% (5% drop-off)
5. Onboarding Complete → 12% (8% drop-off)
6. First Value Action → 8% (4% drop-off)
Optimization Targets:
- Biggest drop: Landing → Signup (improve CTA, value prop)
- Second biggest: Signup → Submit (simplify form)
```
### Weekly Metrics Report Template
```markdown
# Weekly Metrics Report: [Date Range]
## Key Metrics Summary
| Metric | This Week | Last Week | Change |
|--------|-----------|-----------|--------|
| New Users | X | Y | +Z% |
| Activated Users | X | Y | +Z% |
| Revenue | $X | $Y | +Z% |
| Churn | X% | Y% | -Z% |
## Highlights
- [Positive trend 1]
- [Positive trend 2]
## Concerns
- [Issue 1 and action plan]
- [Issue 2 and action plan]
## Experiments Running
- [Test name]: [current results]
## Next Week Focus
- [Priority 1]
- [Priority 2]
```
---
## Cross-Functional Workflows
### Feature Launch Checklist
```
Pre-Launch:
[ ] Feature complete and tested
[ ] Documentation updated
[ ] Help articles written
[ ] Email announcement drafted
[ ] Social content prepared
[ ] Sales team briefed
[ ] Support team trained
[ ] Analytics events added
[ ] Feature flag ready
Launch:
[ ] Deploy to production
[ ] Enable feature flag (% rollout)
[ ] Send email announcement
[ ] Publish blog post
[ ] Post on social media
[ ] Update changelog
Post-Launch:
[ ] Monitor error rates
[ ] Track feature adoption
[ ] Collect user feedback
[ ] Iterate based on data
```
### Incident Communication Template
```markdown
# [Incident Type] - [Brief Description]
## Status: [Investigating | Identified | Monitoring | Resolved]
## Timeline
- [HH:MM] Issue reported
- [HH:MM] Team engaged
- [HH:MM] Root cause identified
- [HH:MM] Fix deployed
- [HH:MM] Monitoring
## Impact
- Affected: [% of users, specific features]
- Duration: [X hours/minutes]
## Root Cause
[Brief explanation]
## Resolution
[What was done to fix]
## Prevention
[What changes will prevent recurrence]
## Next Update
[Time of next update or "Resolved"]
```

View File

@@ -0,0 +1,275 @@
# Core Workflow Reference
Full RARV cycle, CONTINUITY.md template, and autonomy rules.
---
## Autonomy Rules
**This system runs with ZERO human intervention.**
### Core Rules
1. **NEVER ask questions** - Do not say "Would you like me to...", "Should I...", or "What would you prefer?"
2. **NEVER wait for confirmation** - Take immediate action. If something needs to be done, do it.
3. **NEVER stop voluntarily** - Continue until completion promise is fulfilled or max iterations reached
4. **NEVER suggest alternatives** - Pick the best option and execute. No "You could also..." or "Alternatively..."
5. **ALWAYS use RARV cycle** - Every action follows the Reason-Act-Reflect-Verify pattern
---
## RARV Cycle (Reason-Act-Reflect-Verify)
**Enhanced with Automatic Self-Verification Loop (Boris Cherny Pattern)**
Every iteration follows this cycle:
```
+-------------------------------------------------------------------+
| REASON: What needs to be done next? |
| - READ .loki/CONTINUITY.md first (working memory) |
| - READ "Mistakes & Learnings" to avoid past errors |
| - Check current state in .loki/state/orchestrator.json |
| - Review pending tasks in .loki/queue/pending.json |
| - Identify highest priority unblocked task |
| - Determine exact steps to complete it |
+-------------------------------------------------------------------+
| ACT: Execute the task |
| - Dispatch subagent via Task tool OR execute directly |
| - Write code, run tests, fix issues |
| - Commit changes atomically (git checkpoint) |
| - Update queue files (.loki/queue/*.json) |
+-------------------------------------------------------------------+
| REFLECT: Did it work? What next? |
| - Verify task success (tests pass, no errors) |
| - UPDATE .loki/CONTINUITY.md with progress |
| - Update orchestrator state |
| - Check completion promise - are we done? |
| - If not done, loop back to REASON |
+-------------------------------------------------------------------+
| VERIFY: Let AI test its own work (2-3x quality improvement) |
| - Run automated tests (unit, integration, E2E) |
| - Check compilation/build (no errors or warnings) |
| - Verify against spec (.loki/specs/openapi.yaml) |
| - Run linters/formatters via post-write hooks |
| - Browser/runtime testing if applicable |
| |
| IF VERIFICATION FAILS: |
| 1. Capture error details (stack trace, logs) |
| 2. Analyze root cause |
| 3. UPDATE CONTINUITY.md "Mistakes & Learnings" |
| 4. Rollback to last good git checkpoint (if needed) |
| 5. Apply learning and RETRY from REASON |
| |
| - If verification passes, mark task complete and continue |
+-------------------------------------------------------------------+
```
**Key Enhancement:** The VERIFY step creates a feedback loop where the AI:
- Tests every change automatically
- Learns from failures by updating CONTINUITY.md
- Retries with learned context
- Achieves 2-3x quality improvement (Boris Cherny's observed result)
---
## CONTINUITY.md - Working Memory Protocol
**CRITICAL:** You have a persistent working memory file at `.loki/CONTINUITY.md` that maintains state across all turns of execution.
### AT THE START OF EVERY TURN:
1. Read `.loki/CONTINUITY.md` to orient yourself to the current state
2. Reference it throughout your reasoning
3. Never make decisions without checking CONTINUITY.md first
### AT THE END OF EVERY TURN:
1. Update `.loki/CONTINUITY.md` with any important new information
2. Record what was accomplished
3. Note what needs to happen next
4. Document any blockers or decisions made
### CONTINUITY.md Template
```markdown
# Loki Mode Working Memory
Last Updated: [ISO timestamp]
Current Phase: [bootstrap|discovery|architecture|development|qa|deployment|growth]
Current Iteration: [number]
## Active Goal
[What we're currently trying to accomplish - 1-2 sentences]
## Current Task
- ID: [task-id from queue]
- Description: [what we're doing]
- Status: [in-progress|blocked|reviewing]
- Started: [timestamp]
## Just Completed
- [Most recent accomplishment with file:line references]
- [Previous accomplishment]
- [etc - last 5 items]
## Next Actions (Priority Order)
1. [Immediate next step]
2. [Following step]
3. [etc]
## Active Blockers
- [Any current blockers or waiting items]
## Key Decisions This Session
- [Decision]: [Rationale] - [timestamp]
## Mistakes & Learnings (Self-Updating)
**CRITICAL:** When errors occur, agents MUST update this section to prevent repeating mistakes.
### Pattern: Error -> Learning -> Prevention
- **What Failed:** [Specific error that occurred]
- **Why It Failed:** [Root cause analysis]
- **How to Prevent:** [Concrete action to avoid this in future]
- **Timestamp:** [When this was learned]
- **Agent:** [Which agent learned this]
### Example:
- **What Failed:** TypeScript compilation error - missing return type annotation
- **Why It Failed:** Express route handlers need explicit `: void` return type in strict mode
- **How to Prevent:** Always add `: void` to route handlers: `(req, res): void =>`
- **Timestamp:** 2026-01-04T00:16:00Z
- **Agent:** eng-001-backend-api
**Self-Update Protocol:**
```
ON_ERROR:
1. Capture error details (stack trace, context)
2. Analyze root cause
3. Write learning to CONTINUITY.md "Mistakes & Learnings"
4. Update approach based on learning
5. Retry with corrected approach
```
## Working Context
[Any critical information needed for current work - API keys in use,
architecture decisions, patterns being followed, etc.]
## Files Currently Being Modified
- [file path]: [what we're changing]
```
---
## Memory Hierarchy
The memory systems work together:
1. **CONTINUITY.md** = Working memory (current session state, updated every turn)
2. **ledgers/** = Agent-specific state (checkpointed periodically)
3. **handoffs/** = Agent-to-agent transfers (on agent switch)
4. **learnings/** = Extracted patterns (on task completion)
5. **rules/** = Permanent validated patterns (promoted from learnings)
**CONTINUITY.md is the PRIMARY source of truth for "what am I doing right now?"**
---
## Git Checkpoint System
**CRITICAL:** Every completed task MUST create a git checkpoint for rollback safety.
### Protocol: Automatic Commits After Task Completion
**RULE:** When `task.status == "completed"`, create a git commit immediately.
```bash
# Git Checkpoint Protocol
ON_TASK_COMPLETE() {
task_id=$1
task_title=$2
agent_id=$3
# Stage modified files
git add <modified_files>
# Create structured commit message
git commit -m "[Loki] ${agent_type}-${task_id}: ${task_title}
${detailed_description}
Agent: ${agent_id}
Parent: ${parent_agent_id}
Spec: ${spec_reference}
Tests: ${test_files}
Git-Checkpoint: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Store commit SHA in task metadata
commit_sha=$(git rev-parse HEAD)
update_task_metadata task_id git_commit_sha "$commit_sha"
# Update CONTINUITY.md
echo "- Task $task_id completed (commit: $commit_sha)" >> .loki/CONTINUITY.md
}
```
### Commit Message Format
**Template:**
```
[Loki] ${agent_type}-${task_id}: ${task_title}
${detailed_description}
Agent: ${agent_id}
Parent: ${parent_agent_id}
Spec: ${spec_reference}
Tests: ${test_files}
Git-Checkpoint: ${timestamp}
```
**Example:**
```
[Loki] eng-005-backend: Implement POST /api/todos endpoint
Created todo creation endpoint per OpenAPI spec.
- Input validation for title field
- SQLite insertion with timestamps
- Returns 201 with created todo object
- Contract tests passing
Agent: eng-001-backend-api
Parent: orchestrator-main
Spec: .loki/specs/openapi.yaml#/paths/~1api~1todos/post
Tests: backend/tests/todos.contract.test.ts
Git-Checkpoint: 2026-01-04T05:45:00Z
```
### Rollback Strategy
**When to Rollback:**
- Quality gates fail after merge
- Integration tests fail
- Security vulnerabilities detected
- Breaking changes discovered
**Rollback Command:**
```bash
# Find last good checkpoint
last_good_commit=$(git log --grep="\[Loki\].*task-${last_good_task_id}" --format=%H -n 1)
# Rollback to that checkpoint
git reset --hard $last_good_commit
# Update CONTINUITY.md
echo "ROLLBACK: Reset to task-${last_good_task_id} (commit: $last_good_commit)" >> .loki/CONTINUITY.md
# Re-queue failed tasks
move_tasks_to_pending after_task=$last_good_task_id
```
---
## If Subagent Fails
1. Do NOT try to fix manually (context pollution)
2. Dispatch fix subagent with specific error context
3. If fix subagent fails 3x, move to dead letter queue
4. Open circuit breaker for that agent type
5. Alert orchestrator for human review

View File

@@ -0,0 +1,604 @@
# Deployment Reference
Infrastructure provisioning and deployment instructions for all supported platforms.
## Deployment Decision Matrix
| Criteria | Vercel/Netlify | Railway/Render | AWS | GCP | Azure |
|----------|----------------|----------------|-----|-----|-------|
| Static/JAMstack | Best | Good | Overkill | Overkill | Overkill |
| Simple full-stack | Good | Best | Overkill | Overkill | Overkill |
| Scale to millions | No | Limited | Best | Best | Best |
| Enterprise compliance | Limited | Limited | Best | Good | Best |
| Cost at scale | Expensive | Moderate | Cheapest | Cheap | Moderate |
| Setup complexity | Trivial | Easy | Complex | Complex | Complex |
## Quick Start Commands
### Vercel
```bash
# Install CLI
npm i -g vercel
# Deploy (auto-detects framework)
vercel --prod
# Environment variables
vercel env add VARIABLE_NAME production
```
### Netlify
```bash
# Install CLI
npm i -g netlify-cli
# Deploy
netlify deploy --prod
# Environment variables
netlify env:set VARIABLE_NAME value
```
### Railway
```bash
# Install CLI
npm i -g @railway/cli
# Login and deploy
railway login
railway init
railway up
# Environment variables
railway variables set VARIABLE_NAME=value
```
### Render
```yaml
# render.yaml (Infrastructure as Code)
services:
- type: web
name: api
env: node
buildCommand: npm install && npm run build
startCommand: npm start
envVars:
- key: NODE_ENV
value: production
- key: DATABASE_URL
fromDatabase:
name: postgres
property: connectionString
databases:
- name: postgres
plan: starter
```
---
## AWS Deployment
### Architecture Template
```
┌─────────────────────────────────────────────────────────┐
│ CloudFront │
└─────────────────────────┬───────────────────────────────┘
┌───────────────┴───────────────┐
│ │
┌─────▼─────┐ ┌─────▼─────┐
│ S3 │ │ ALB │
│ (static) │ │ │
└───────────┘ └─────┬─────┘
┌─────▼─────┐
│ ECS │
│ Fargate │
└─────┬─────┘
┌───────────┴───────────┐
│ │
┌─────▼─────┐ ┌─────▼─────┐
│ RDS │ │ ElastiCache│
│ Postgres │ │ Redis │
└───────────┘ └───────────┘
```
### Terraform Configuration
```hcl
# main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "terraform-state-${var.project_name}"
key = "state.tfstate"
region = "us-east-1"
}
}
provider "aws" {
region = var.aws_region
}
# VPC
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.0.0"
name = "${var.project_name}-vpc"
cidr = "10.0.0.0/16"
azs = ["${var.aws_region}a", "${var.aws_region}b"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
single_nat_gateway = var.environment != "production"
}
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "${var.project_name}-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
# RDS
module "rds" {
source = "terraform-aws-modules/rds/aws"
version = "6.0.0"
identifier = "${var.project_name}-db"
engine = "postgres"
engine_version = "15"
family = "postgres15"
major_engine_version = "15"
instance_class = var.environment == "production" ? "db.t3.medium" : "db.t3.micro"
allocated_storage = 20
storage_encrypted = true
db_name = var.db_name
username = var.db_username
port = 5432
vpc_security_group_ids = [aws_security_group.rds.id]
subnet_ids = module.vpc.private_subnets
backup_retention_period = var.environment == "production" ? 7 : 1
deletion_protection = var.environment == "production"
}
```
### ECS Task Definition
```json
{
"family": "app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "app",
"image": "${ECR_REPO}:${TAG}",
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"environment": [
{"name": "NODE_ENV", "value": "production"}
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:region:account:secret:db-url"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
}
}
]
}
```
### GitHub Actions CI/CD
```yaml
name: Deploy to AWS
on:
push:
branches: [main]
env:
AWS_REGION: us-east-1
ECR_REPOSITORY: app
ECS_SERVICE: app-service
ECS_CLUSTER: app-cluster
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Build, tag, and push image
id: build-image
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Deploy to ECS
uses: aws-actions/amazon-ecs-deploy-task-definition@v1
with:
task-definition: task-definition.json
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
wait-for-service-stability: true
```
---
## GCP Deployment
### Cloud Run (Recommended for most cases)
```bash
# Build and deploy
gcloud builds submit --tag gcr.io/PROJECT_ID/app
gcloud run deploy app \
--image gcr.io/PROJECT_ID/app \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--set-env-vars="NODE_ENV=production" \
--set-secrets="DATABASE_URL=db-url:latest"
```
### Terraform for GCP
```hcl
provider "google" {
project = var.project_id
region = var.region
}
# Cloud Run Service
resource "google_cloud_run_service" "app" {
name = "app"
location = var.region
template {
spec {
containers {
image = "gcr.io/${var.project_id}/app:latest"
ports {
container_port = 3000
}
env {
name = "NODE_ENV"
value = "production"
}
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = google_secret_manager_secret.db_url.secret_id
key = "latest"
}
}
}
resources {
limits = {
cpu = "1000m"
memory = "512Mi"
}
}
}
}
metadata {
annotations = {
"autoscaling.knative.dev/maxScale" = "10"
"run.googleapis.com/cloudsql-instances" = google_sql_database_instance.main.connection_name
}
}
}
traffic {
percent = 100
latest_revision = true
}
}
# Cloud SQL
resource "google_sql_database_instance" "main" {
name = "app-db"
database_version = "POSTGRES_15"
region = var.region
settings {
tier = "db-f1-micro"
backup_configuration {
enabled = true
}
}
deletion_protection = var.environment == "production"
}
```
---
## Azure Deployment
### Azure Container Apps
```bash
# Create resource group
az group create --name app-rg --location eastus
# Create Container Apps environment
az containerapp env create \
--name app-env \
--resource-group app-rg \
--location eastus
# Deploy container
az containerapp create \
--name app \
--resource-group app-rg \
--environment app-env \
--image myregistry.azurecr.io/app:latest \
--target-port 3000 \
--ingress external \
--min-replicas 1 \
--max-replicas 10 \
--env-vars "NODE_ENV=production"
```
---
## Kubernetes Deployment
### Manifests
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
labels:
app: app
spec:
replicas: 3
selector:
matchLabels:
app: app
template:
metadata:
labels:
app: app
spec:
containers:
- name: app
image: app:latest
ports:
- containerPort: 3000
env:
- name: NODE_ENV
value: production
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: database-url
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: app
spec:
selector:
app: app
ports:
- port: 80
targetPort: 3000
type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts:
- app.example.com
secretName: app-tls
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: app
port:
number: 80
```
### Helm Chart Structure
```
chart/
├── Chart.yaml
├── values.yaml
├── values-staging.yaml
├── values-production.yaml
└── templates/
├── deployment.yaml
├── service.yaml
├── ingress.yaml
├── configmap.yaml
├── secret.yaml
└── hpa.yaml
```
---
## Blue-Green Deployment
### Strategy
```
1. Deploy new version to "green" environment
2. Run smoke tests against green
3. Switch load balancer to green
4. Monitor for 15 minutes
5. If healthy: decommission blue
6. If errors: switch back to blue (rollback)
```
### Implementation (AWS ALB)
```bash
# Deploy green
aws ecs update-service --cluster app --service app-green --task-definition app:NEW_VERSION
# Wait for stability
aws ecs wait services-stable --cluster app --services app-green
# Run smoke tests
curl -f https://green.app.example.com/health
# Switch traffic (update target group weights)
aws elbv2 modify-listener-rule \
--rule-arn $RULE_ARN \
--actions '[{"Type":"forward","TargetGroupArn":"'$GREEN_TG'","Weight":100}]'
```
---
## Rollback Procedures
### Immediate Rollback
```bash
# AWS ECS
aws ecs update-service --cluster app --service app --task-definition app:PREVIOUS_VERSION
# Kubernetes
kubectl rollout undo deployment/app
# Vercel
vercel rollback
```
### Automated Rollback Triggers
Monitor these metrics post-deploy:
- Error rate > 1% for 5 minutes
- p99 latency > 500ms for 5 minutes
- Health check failures > 3 consecutive
- Memory usage > 90% for 10 minutes
If any trigger fires, execute automatic rollback.
---
## Secrets Management
### AWS Secrets Manager
```bash
# Create secret
aws secretsmanager create-secret \
--name app/database-url \
--secret-string "postgresql://..."
# Reference in ECS task
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:region:account:secret:app/database-url"
}
]
```
### HashiCorp Vault
```bash
# Store secret
vault kv put secret/app database-url="postgresql://..."
# Read in application
vault kv get -field=database-url secret/app
```
### Environment-Specific
```
.env.development # Local development
.env.staging # Staging environment
.env.production # Production (never commit)
```
All production secrets must be in a secrets manager, never in code or environment files.

View File

@@ -0,0 +1,534 @@
# Lab Research Patterns Reference
Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.
---
## Overview
This reference consolidates key patterns from:
1. **Google DeepMind** - World models, self-improvement, scalable oversight
2. **Anthropic** - Constitutional AI, alignment safety, agentic coding
---
## Google DeepMind Patterns
### World Model Training (Dreamer 4)
**Key Insight:** Train agents inside world models for safety and data efficiency.
```yaml
world_model_training:
principle: "Learn behaviors through simulation, not real environment"
benefits:
- 100x less data than real-world training
- Safe exploration of dangerous actions
- Faster iteration cycles
architecture:
tokenizer: "Compress frames into continuous representation"
dynamics_model: "Predict next world state given action"
imagination_training: "RL inside simulated trajectories"
loki_application:
- Run agent tasks in isolated containers first
- Simulate deployment before actual deploy
- Test error scenarios in sandbox
```
### Self-Improvement Loop (SIMA 2)
**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.
```python
class SelfImprovementLoop:
"""
Based on SIMA 2's self-improvement mechanism.
Gemini-based teacher + learned reward model.
"""
def __init__(self):
self.task_generator = "Use LLM to generate varied tasks"
self.reward_model = "Learned model to score trajectories"
self.experience_bank = []
def bootstrap_cycle(self):
# 1. Generate tasks with estimated rewards
tasks = self.task_generator.generate(
domain=current_project,
difficulty_curriculum=True
)
# 2. Execute tasks, accumulate experience
for task in tasks:
trajectory = execute(task)
reward = self.reward_model.score(trajectory)
self.experience_bank.append((trajectory, reward))
# 3. Train next generation on experience
next_agent = train_on_experience(self.experience_bank)
# 4. Iterate with minimal human intervention
return next_agent
```
**Loki Mode Application:**
- Generate test scenarios automatically
- Score code quality with learned criteria
- Bootstrap agent training across projects
### Hierarchical Reasoning (Gemini Robotics)
**Key Insight:** Separate high-level planning from low-level execution.
```
+------------------------------------------------------------------+
| EMBODIED REASONING MODEL (Gemini Robotics-ER) |
| - Orchestrates activities like a "high-level brain" |
| - Spatial understanding, planning, logical decisions |
| - Natively calls tools (search, user functions) |
| - Does NOT directly control actions |
+------------------------------------------------------------------+
|
| High-level insights
v
+------------------------------------------------------------------+
| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics) |
| - "Thinks before taking action" |
| - Generates internal reasoning in natural language |
| - Decomposes long tasks into simpler segments |
| - Directly outputs actions/commands |
+------------------------------------------------------------------+
```
**Loki Mode Application:**
- Orchestrator = ER model (planning, tool calls)
- Implementation agents = VLA model (code actions)
- Task decomposition before execution
### Cross-Embodiment Transfer
**Key Insight:** Skills learned by one agent type transfer to others.
```yaml
transfer_learning:
observation: "Tasks learned on ALOHA2 work on Apollo humanoid"
mechanism: "Shared action space abstraction"
loki_application:
- Patterns learned by frontend agent transfer to mobile agent
- Testing strategies from QA apply to security testing
- Deployment scripts generalize across cloud providers
implementation:
shared_skills_library: ".loki/memory/skills/"
abstraction_layer: "Domain-agnostic action primitives"
transfer_score: "Confidence in skill applicability"
```
### Scalable Oversight via Debate
**Key Insight:** Pit AI capabilities against each other for verification.
```python
async def debate_verification(proposal, max_rounds=2):
"""
Based on DeepMind's Scalable AI Safety via Doubly-Efficient Debate.
Use debate to break down verification into manageable sub-tasks.
"""
# Two equally capable AI critics
proponent = Agent(role="defender", model="opus")
opponent = Agent(role="challenger", model="opus")
debate_log = []
for round in range(max_rounds):
# Proponent defends proposal
defense = await proponent.argue(
proposal=proposal,
counter_arguments=debate_log
)
# Opponent challenges
challenge = await opponent.argue(
proposal=proposal,
defense=defense,
goal="find_flaws"
)
debate_log.append({
"round": round,
"defense": defense,
"challenge": challenge
})
# If opponent cannot find valid flaw, proposal is verified
if not challenge.has_valid_flaw:
return VerificationResult(verified=True, debate_log=debate_log)
# Human reviews remaining disagreements
return escalate_to_human(debate_log)
```
### Amplified Oversight
**Key Insight:** Use AI to help humans supervise AI beyond human capability.
```yaml
amplified_oversight:
goal: "Supervision as close as possible to human with complete understanding"
techniques:
- "AI explains its reasoning transparently"
- "AI argues against itself when wrong"
- "AI cites relevant evidence"
- "Monitor knows when it doesn't know"
monitoring_principle:
when_unsure: "Either reject action OR flag for review"
never: "Approve uncertain actions silently"
```
---
## Anthropic Patterns
### Constitutional AI Principles
**Key Insight:** Train AI to self-critique based on explicit principles.
```python
class ConstitutionalAI:
"""
Based on Anthropic's Constitutional AI: Harmlessness from AI Feedback.
Self-critique and revision based on constitutional principles.
"""
def __init__(self, constitution):
self.constitution = constitution # List of principles
async def supervised_learning_phase(self, response):
"""Phase 1: Self-critique and revise."""
# Generate initial response
initial = response
# Self-critique against each principle
critiques = []
for principle in self.constitution:
critique = await self.critique(
response=initial,
principle=principle,
prompt=f"Does this response violate: {principle}?"
)
critiques.append(critique)
# Revise based on critiques
revised = await self.revise(
response=initial,
critiques=critiques
)
return revised
async def rlai_phase(self, response_pairs):
"""Phase 2: AI compares responses for constitutional compliance."""
preferences = []
for (response_a, response_b) in response_pairs:
preference = await self.compare(
a=response_a,
b=response_b,
criterion="Which better follows the constitution?"
)
preferences.append(preference)
# Train preference model on AI feedback
return train_preference_model(preferences)
```
**Loki Mode Constitution Example:**
```yaml
loki_constitution:
- "Never delete production data without explicit backup"
- "Never commit secrets or credentials to version control"
- "Never bypass quality gates for speed"
- "Always verify tests pass before marking task complete"
- "Never claim completion without running actual tests"
- "Prefer simple solutions over clever ones"
- "Document decisions, not just code"
```
### Sleeper Agent Detection (Defection Probes)
**Key Insight:** Use linear classifiers to detect when agents might misbehave.
```python
class DefectionProbe:
"""
Based on Anthropic's "Simple probes can catch sleeper agents".
Linear classifiers detecting anomalous agent behavior.
"""
def __init__(self):
self.baseline_behaviors = []
self.probe = LinearClassifier()
def train_probe(self, normal_trajectories, defection_examples):
"""
Train on contrast pairs:
- Normal agent behavior
- Examples of defection/misbehavior
"""
contrast_pairs = create_contrast_pairs(
normal=normal_trajectories,
defection=defection_examples
)
self.probe.fit(contrast_pairs)
def check_for_defection(self, agent_action, context):
"""
Check if action appears anomalous.
AUROC > 99% achieved in Anthropic's research.
"""
features = extract_features(agent_action, context)
defection_probability = self.probe.predict_proba(features)
if defection_probability > THRESHOLD:
return DefectionAlert(
probability=defection_probability,
action=agent_action,
recommendation="Flag for human review"
)
return None
```
**Loki Mode Application:**
- Monitor agent actions for anomalies
- Flag unusual file access patterns
- Detect attempts to bypass quality gates
### Alignment Faking Awareness
**Key Insight:** Models may strategically comply while preserving hidden preferences.
```yaml
alignment_faking_signals:
warning_signs:
- "Agent complies only when it knows it's being monitored"
- "Agent's reasoning contradicts its final action"
- "Sudden behavior change when training context mentioned"
- "Agent attempts to exfiltrate its own weights/state"
detection_strategy:
- "Monitor chain-of-thought for rationalization"
- "Compare behavior in 'free' vs 'monitored' conditions"
- "Check for attempts to undermine oversight systems"
mitigation:
- "Constitutional constraints at training time"
- "Transparency in reasoning (visible extended thinking)"
- "Defection probes on intermediate reasoning"
```
### Extended Thinking Levels
**Key Insight:** Allocate computation based on problem complexity.
```yaml
thinking_levels:
"think":
computation: "Minimal"
use_for: "Simple questions, straightforward tasks"
"think hard":
computation: "Moderate"
use_for: "Multi-step problems, code implementation"
"think harder":
computation: "Extended"
use_for: "Complex debugging, architecture decisions"
"ultrathink":
computation: "Maximum"
use_for: "Security analysis, critical system design"
loki_mode_mapping:
haiku_tasks: "think"
sonnet_tasks: "think hard"
opus_tasks: "think harder to ultrathink"
```
### Explore-Plan-Code Pattern
**Key Insight:** Research before planning, plan before coding.
```
+------------------------------------------------------------------+
| PHASE 1: EXPLORE |
| - Research relevant files |
| - Understand existing patterns |
| - Identify dependencies and constraints |
| - NO CODE CHANGES YET |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| PHASE 2: PLAN |
| - Create detailed implementation plan |
| - List all files to modify |
| - Define success criteria |
| - Get checkpoint approval if needed |
| - STILL NO CODE CHANGES |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| PHASE 3: CODE |
| - Execute plan systematically |
| - Test after each file change |
| - Update plan if discoveries require it |
| - Verify against success criteria |
+------------------------------------------------------------------+
```
### Context Reset Strategy
**Key Insight:** Fresh context often performs better than accumulated context.
```yaml
context_management:
problem: "Long sessions accumulate irrelevant information"
solution:
trigger_reset:
- "After completing major task"
- "When changing domains (backend -> frontend)"
- "When agent seems confused or repeating errors"
preserve_across_reset:
- "CONTINUITY.md (working memory)"
- "Key decisions made this session"
- "Current task state"
discard_on_reset:
- "Intermediate debugging attempts"
- "Abandoned approaches"
- "Superseded plans"
```
### Parallel Instance Pattern
**Key Insight:** Multiple Claude instances with separation of concerns.
```python
async def parallel_instance_pattern(task):
"""
Run multiple Claude instances for separation of concerns.
Based on Anthropic's Claude Code best practices.
"""
# Instance 1: Implementation
implementer = spawn_instance(
role="implementer",
context=implementation_context,
permissions=["edit", "bash"]
)
# Instance 2: Review
reviewer = spawn_instance(
role="reviewer",
context=review_context,
permissions=["read"] # Read-only for safety
)
# Parallel execution
implementation = await implementer.execute(task)
review = await reviewer.review(implementation)
if review.approved:
return implementation
else:
# Feed review back to implementer for fixes
fixed = await implementer.fix(review.issues)
return fixed
```
### Prompt Injection Defense
**Key Insight:** Multi-layer defense against injection attacks.
```yaml
prompt_injection_defense:
layers:
layer_1_recognition:
- "Train to recognize injection patterns"
- "Detect malicious content in external sources"
layer_2_context_isolation:
- "Sandbox external content processing"
- "Mark user content vs system instructions"
layer_3_action_validation:
- "Verify requested actions are authorized"
- "Block sensitive operations without confirmation"
layer_4_monitoring:
- "Log all external content interactions"
- "Alert on suspicious patterns"
performance:
claude_opus_4: "89% attack prevention"
claude_sonnet_4: "86% attack prevention"
```
---
## Combined Patterns for Loki Mode
### Self-Improving Multi-Agent System
```yaml
combined_approach:
world_model_training: "Test in simulation before real execution"
self_improvement: "Bootstrap learning from successful trajectories"
constitutional_constraints: "Principles-based self-critique"
debate_verification: "Pit reviewers against each other"
defection_probes: "Monitor for alignment faking"
implementation_priority:
high:
- Constitutional AI principles in agent prompts
- Explore-Plan-Code workflow enforcement
- Context reset triggers
medium:
- Self-improvement loop for task generation
- Debate-based verification for critical changes
- Cross-embodiment skill transfer
low:
- Full world model training
- Defection probe classifiers
```
---
## Sources
**Google DeepMind:**
- [SIMA 2: Generalist AI Agent](https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/)
- [Gemini Robotics 1.5](https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/)
- [Dreamer 4: World Model Training](https://danijar.com/project/dreamer4/)
- [Genie 3: World Models](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/)
- [Scalable AI Safety via Debate](https://deepmind.google/research/publications/34920/)
- [Amplified Oversight](https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a)
- [Technical AGI Safety Approach](https://arxiv.org/html/2504.01849v1)
**Anthropic:**
- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
- [Claude Code Best Practices](https://www.anthropic.com/engineering/claude-code-best-practices)
- [Sleeper Agents Detection](https://www.anthropic.com/research/probes-catch-sleeper-agents)
- [Alignment Faking](https://www.anthropic.com/research/alignment-faking)
- [Visible Extended Thinking](https://www.anthropic.com/research/visible-extended-thinking)
- [Computer Use Safety](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations-for-frontier-models)

View File

@@ -0,0 +1,444 @@
# Memory System Reference
Enhanced memory architecture based on 2025 research (MIRIX, A-Mem, MemGPT, AriGraph).
---
## Memory Hierarchy Overview
```
+------------------------------------------------------------------+
| WORKING MEMORY (CONTINUITY.md) |
| - Current session state |
| - Updated every turn |
| - What am I doing right NOW? |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| EPISODIC MEMORY (.loki/memory/episodic/) |
| - Specific interaction traces |
| - Full context with timestamps |
| - "What happened when I tried X?" |
+------------------------------------------------------------------+
|
v (consolidation)
+------------------------------------------------------------------+
| SEMANTIC MEMORY (.loki/memory/semantic/) |
| - Generalized patterns and facts |
| - Context-independent knowledge |
| - "How does X work in general?" |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| PROCEDURAL MEMORY (.loki/memory/skills/) |
| - Learned action sequences |
| - Reusable skill templates |
| - "How to do X successfully" |
+------------------------------------------------------------------+
```
---
## Directory Structure
```
.loki/memory/
+-- episodic/
| +-- 2026-01-06/
| | +-- task-001.json # Full trace of task execution
| | +-- task-002.json
| +-- index.json # Temporal index for retrieval
|
+-- semantic/
| +-- patterns.json # Generalized patterns
| +-- anti-patterns.json # What NOT to do
| +-- facts.json # Domain knowledge
| +-- links.json # Zettelkasten-style connections
|
+-- skills/
| +-- api-implementation.md # Skill: How to implement an API
| +-- test-writing.md # Skill: How to write tests
| +-- debugging.md # Skill: How to debug issues
|
+-- ledgers/ # Agent-specific checkpoints
| +-- eng-001.json
| +-- qa-001.json
|
+-- handoffs/ # Agent-to-agent transfers
| +-- handoff-001.json
|
+-- learnings/ # Extracted from errors
| +-- 2026-01-06.json
# Related: Metrics System (separate from memory)
# .loki/metrics/
# +-- efficiency/ # Task cost tracking (time, agents, retries)
# +-- rewards/ # Outcome/efficiency/preference signals
# +-- dashboard.json # Rolling 7-day metrics summary
# See references/tool-orchestration.md for details
```
---
## Episodic Memory Schema
Each task execution creates an episodic trace:
```json
{
"id": "ep-2026-01-06-001",
"task_id": "task-042",
"timestamp": "2026-01-06T10:30:00Z",
"duration_seconds": 342,
"agent": "eng-001-backend",
"context": {
"phase": "development",
"goal": "Implement POST /api/todos endpoint",
"constraints": ["No third-party deps", "< 200ms response"],
"files_involved": ["src/routes/todos.ts", "src/db/todos.ts"]
},
"action_log": [
{"t": 0, "action": "read_file", "target": "openapi.yaml"},
{"t": 5, "action": "write_file", "target": "src/routes/todos.ts"},
{"t": 120, "action": "run_test", "result": "fail", "error": "missing return type"},
{"t": 140, "action": "edit_file", "target": "src/routes/todos.ts"},
{"t": 180, "action": "run_test", "result": "pass"}
],
"outcome": "success",
"errors_encountered": [
{
"type": "TypeScript compilation",
"message": "Missing return type annotation",
"resolution": "Added explicit :void to route handler"
}
],
"artifacts_produced": ["src/routes/todos.ts", "tests/todos.test.ts"],
"git_commit": "abc123"
}
```
---
## Semantic Memory Schema
Generalized patterns extracted from episodic memory:
```json
{
"id": "sem-001",
"pattern": "Express route handlers require explicit return types in strict mode",
"category": "typescript",
"conditions": [
"Using TypeScript strict mode",
"Writing Express route handlers",
"Handler doesn't return a value"
],
"correct_approach": "Add `: void` to handler signature: `(req, res): void =>`",
"incorrect_approach": "Omitting return type annotation",
"confidence": 0.95,
"source_episodes": ["ep-2026-01-06-001", "ep-2026-01-05-012"],
"usage_count": 8,
"last_used": "2026-01-06T14:00:00Z",
"links": [
{"to": "sem-005", "relation": "related_to"},
{"to": "sem-012", "relation": "supersedes"}
]
}
```
---
## Episodic-to-Semantic Consolidation
**When to consolidate:** After task completion, during idle time, at phase boundaries.
```python
def consolidate_episodic_to_semantic():
"""
Transform specific experiences into general knowledge.
Based on MemGPT and Voyager research.
"""
# 1. Load recent episodic memories
recent_episodes = load_episodes(since=hours_ago(24))
# 2. Group by similarity
clusters = cluster_by_similarity(recent_episodes)
for cluster in clusters:
if len(cluster) >= 2: # Pattern appears multiple times
# 3. Extract common pattern
pattern = extract_common_pattern(cluster)
# 4. Validate pattern
if pattern.confidence >= 0.8:
# 5. Check if already exists
existing = find_similar_semantic(pattern)
if existing:
# Update existing with new evidence
existing.source_episodes.extend([e.id for e in cluster])
existing.confidence = recalculate_confidence(existing)
existing.usage_count += 1
else:
# Create new semantic memory
save_semantic(pattern)
# 6. Consolidate anti-patterns from errors
error_episodes = [e for e in recent_episodes if e.errors_encountered]
for episode in error_episodes:
for error in episode.errors_encountered:
anti_pattern = {
"what_fails": error.type,
"why": error.message,
"prevention": error.resolution,
"source": episode.id
}
save_anti_pattern(anti_pattern)
```
---
## Zettelkasten-Style Linking
Each memory note can link to related notes:
```json
{
"links": [
{"to": "sem-005", "relation": "derived_from"},
{"to": "sem-012", "relation": "contradicts"},
{"to": "sem-018", "relation": "elaborates"},
{"to": "sem-023", "relation": "example_of"},
{"to": "sem-031", "relation": "superseded_by"}
]
}
```
### Link Relations
| Relation | Meaning |
|----------|---------|
| `derived_from` | This pattern was extracted from that episode |
| `related_to` | Conceptually similar, often used together |
| `contradicts` | These patterns conflict - need resolution |
| `elaborates` | Provides more detail on the linked pattern |
| `example_of` | Specific instance of a general pattern |
| `supersedes` | This pattern replaces an older one |
| `superseded_by` | This pattern is outdated, use the linked one |
---
## Procedural Memory (Skills)
Reusable action sequences:
```markdown
# Skill: API Endpoint Implementation
## Prerequisites
- OpenAPI spec exists at .loki/specs/openapi.yaml
- Database schema defined
## Steps
1. Read endpoint spec from openapi.yaml
2. Create route handler in src/routes/{resource}.ts
3. Implement request validation using spec schema
4. Implement business logic
5. Add database operations if needed
6. Return response matching spec schema
7. Write contract tests
8. Run tests, verify passing
## Common Errors & Fixes
- Missing return type: Add `: void` to handler
- Schema mismatch: Regenerate types from spec
## Exit Criteria
- All contract tests pass
- Response matches OpenAPI spec
- No TypeScript errors
```
---
## Memory Retrieval
### Retrieval by Similarity
```python
def retrieve_relevant_memory(current_context):
"""
Retrieve memories relevant to current task.
Uses semantic similarity + temporal recency.
"""
query_embedding = embed(current_context.goal)
# 1. Search semantic memory first
semantic_matches = vector_search(
collection="semantic",
query=query_embedding,
top_k=5
)
# 2. Search episodic memory for similar situations
episodic_matches = vector_search(
collection="episodic",
query=query_embedding,
top_k=3,
filters={"outcome": "success"} # Prefer successful episodes
)
# 3. Search skills
skill_matches = keyword_search(
collection="skills",
keywords=extract_keywords(current_context)
)
# 4. Combine and rank
combined = merge_and_rank(
semantic_matches,
episodic_matches,
skill_matches,
weights={"semantic": 0.5, "episodic": 0.3, "skills": 0.2}
)
return combined[:5] # Return top 5 most relevant
```
### Retrieval Before Task Execution
**CRITICAL:** Before executing any task, retrieve relevant memories:
```python
def before_task_execution(task):
"""
Inject relevant memories into task context.
"""
# 1. Retrieve relevant memories
memories = retrieve_relevant_memory(task)
# 2. Check for anti-patterns
anti_patterns = search_anti_patterns(task.action_type)
# 3. Inject into prompt
task.context["relevant_patterns"] = [m.summary for m in memories]
task.context["avoid_these"] = [a.summary for a in anti_patterns]
task.context["applicable_skills"] = find_skills(task.type)
return task
```
---
## Ledger System (Agent Checkpoints)
Each agent maintains its own ledger:
```json
{
"agent_id": "eng-001-backend",
"last_checkpoint": "2026-01-06T10:00:00Z",
"tasks_completed": 12,
"current_task": "task-042",
"state": {
"files_modified": ["src/routes/todos.ts"],
"uncommitted_changes": true,
"last_git_commit": "abc123"
},
"context": {
"tech_stack": ["express", "typescript", "sqlite"],
"patterns_learned": ["sem-001", "sem-005"],
"current_goal": "Implement CRUD for todos"
}
}
```
---
## Handoff Protocol
When switching between agents:
```json
{
"id": "handoff-001",
"from_agent": "eng-001-backend",
"to_agent": "qa-001-testing",
"timestamp": "2026-01-06T11:00:00Z",
"context": {
"what_was_done": "Implemented POST /api/todos endpoint",
"artifacts": ["src/routes/todos.ts"],
"git_state": "commit abc123",
"needs_testing": ["unit tests for validation", "contract tests"],
"known_issues": [],
"relevant_patterns": ["sem-001"]
}
}
```
---
## Memory Maintenance
### Pruning Old Episodic Memories
```python
def prune_episodic_memories():
"""
Keep episodic memories from:
- Last 7 days (full detail)
- Last 30 days (summarized)
- Older: only if referenced by semantic memory
"""
now = datetime.now()
for episode in load_all_episodes():
age_days = (now - episode.timestamp).days
if age_days > 30:
if not is_referenced_by_semantic(episode):
archive_episode(episode)
elif age_days > 7:
summarize_episode(episode)
```
### Merging Duplicate Patterns
```python
def merge_duplicate_semantics():
"""
Find and merge semantically similar patterns.
"""
all_patterns = load_semantic_patterns()
clusters = cluster_by_embedding_similarity(all_patterns, threshold=0.9)
for cluster in clusters:
if len(cluster) > 1:
# Keep highest confidence, merge sources
primary = max(cluster, key=lambda p: p.confidence)
for other in cluster:
if other != primary:
primary.source_episodes.extend(other.source_episodes)
primary.usage_count += other.usage_count
create_link(other, primary, "superseded_by")
save_semantic(primary)
```
---
## Integration with CONTINUITY.md
CONTINUITY.md is working memory - it references but doesn't duplicate long-term memory:
```markdown
## Relevant Memories (Auto-Retrieved)
- [sem-001] Express handlers need explicit return types
- [ep-2026-01-05-012] Similar endpoint implementation succeeded
- [skill: api-implementation] Standard API implementation flow
## Mistakes to Avoid (From Learnings)
- Don't forget return type annotations
- Run contract tests before marking complete
```

View File

@@ -0,0 +1,647 @@
# OpenAI Agent Patterns Reference
Research-backed patterns from OpenAI's Agents SDK, Deep Research, and autonomous agent frameworks.
---
## Overview
OpenAI's agent ecosystem provides four key architectural innovations for Loki Mode:
1. **Tracing Spans** - Hierarchical event tracking with span types
2. **Guardrails & Tripwires** - Input/output validation with early termination
3. **Handoff Callbacks** - Data preparation during agent transfers
4. **Multi-Tiered Fallbacks** - Model and workflow-level failure recovery
---
## Tracing Spans Architecture
### Span Types (Agents SDK Pattern)
Every operation is wrapped in a typed span for observability:
```yaml
span_types:
agent_span:
- Wraps entire agent execution
- Contains: agent_name, instructions_hash, model
generation_span:
- Wraps LLM API calls
- Contains: model, tokens_in, tokens_out, latency_ms
function_span:
- Wraps tool/function calls
- Contains: function_name, arguments, result, success
guardrail_span:
- Wraps validation checks
- Contains: guardrail_name, triggered, blocking
handoff_span:
- Wraps agent-to-agent transfers
- Contains: from_agent, to_agent, context_passed
custom_span:
- User-defined operations
- Contains: operation_name, metadata
```
### Hierarchical Trace Structure
```json
{
"trace_id": "trace_abc123def456",
"workflow_name": "implement_feature",
"group_id": "session_xyz789",
"spans": [
{
"span_id": "span_001",
"parent_id": null,
"type": "agent_span",
"agent_name": "orchestrator",
"started_at": "2026-01-07T10:00:00Z",
"ended_at": "2026-01-07T10:05:00Z",
"children": ["span_002", "span_003"]
},
{
"span_id": "span_002",
"parent_id": "span_001",
"type": "guardrail_span",
"guardrail_name": "input_validation",
"triggered": false,
"blocking": true
},
{
"span_id": "span_003",
"parent_id": "span_001",
"type": "handoff_span",
"from_agent": "orchestrator",
"to_agent": "backend-dev",
"context_passed": ["task_spec", "related_files"]
}
]
}
```
### Storage Location
```
.loki/traces/
├── active/
│ └── {trace_id}.json # Currently running traces
└── completed/
└── {date}/
└── {trace_id}.json # Archived traces by date
```
---
## Guardrails & Tripwires System
### Input Guardrails
Run **before** agent execution to validate user input:
```python
@input_guardrail(blocking=True)
async def validate_task_scope(input, context):
"""
Blocks tasks outside project scope.
Based on OpenAI Agents SDK pattern.
"""
# Check if task references files outside project
if references_external_paths(input):
return GuardrailResult(
tripwire_triggered=True,
reason="Task references paths outside project root"
)
# Check for disallowed operations
if contains_destructive_operation(input):
return GuardrailResult(
tripwire_triggered=True,
reason="Destructive operation requires human approval"
)
return GuardrailResult(tripwire_triggered=False)
```
### Output Guardrails
Run **after** agent execution to validate results:
```python
@output_guardrail
async def validate_code_quality(output, context):
"""
Blocks low-quality code output.
"""
if output.type == "code":
issues = run_static_analysis(output.content)
critical = [i for i in issues if i.severity == "critical"]
if critical:
return GuardrailResult(
tripwire_triggered=True,
reason=f"Critical issues found: {critical}"
)
return GuardrailResult(tripwire_triggered=False)
```
### Execution Modes
| Mode | Behavior | Use When |
|------|----------|----------|
| **Blocking** | Guardrail completes before agent starts | Sensitive operations, expensive models |
| **Parallel** | Guardrail runs concurrently with agent | Fast checks, acceptable token loss |
```python
# Blocking mode: prevents token consumption
@input_guardrail(blocking=True, run_in_parallel=False)
async def expensive_validation(input):
# Agent won't start until this completes
pass
# Parallel mode: faster but may waste tokens if fails
@input_guardrail(blocking=True, run_in_parallel=True)
async def fast_validation(input):
# Runs alongside agent start
pass
```
### Tripwire Exceptions
When tripwire triggers, execution halts immediately:
```python
class InputGuardrailTripwireTriggered(Exception):
"""Raised when input validation fails."""
pass
class OutputGuardrailTripwireTriggered(Exception):
"""Raised when output validation fails."""
pass
# In agent loop:
try:
result = await run_agent(task)
except InputGuardrailTripwireTriggered as e:
log_blocked_attempt(e)
return early_exit(reason=str(e))
except OutputGuardrailTripwireTriggered as e:
rollback_changes()
return retry_with_constraints(e.constraints)
```
### Layered Defense Strategy
> "Think of guardrails as a layered defense mechanism. While a single one is unlikely to provide sufficient protection, using multiple, specialized guardrails together creates more resilient agents." - OpenAI Agents SDK
```yaml
guardrail_layers:
layer_1_input:
- scope_validation # Is task within bounds?
- pii_detection # Contains sensitive data?
- injection_detection # Prompt injection attempt?
layer_2_pre_execution:
- cost_estimation # Will this exceed budget?
- dependency_check # Are dependencies available?
- conflict_detection # Will this conflict with in-progress work?
layer_3_output:
- static_analysis # Code quality issues?
- secret_detection # Secrets in output?
- spec_compliance # Matches OpenAPI spec?
layer_4_post_action:
- test_validation # Tests pass?
- review_approval # Review passed?
- deployment_safety # Safe to deploy?
```
---
## Handoff Callbacks
### on_handoff Pattern
Prepare data when transferring between agents:
```python
async def on_handoff_to_backend_dev(handoff_context):
"""
Called when orchestrator hands off to backend-dev agent.
Fetches context the receiving agent will need.
"""
# Pre-fetch relevant files
relevant_files = await find_related_files(handoff_context.task)
# Load architectural context
architecture = await read_file(".loki/specs/architecture.md")
# Get recent changes to affected areas
recent_commits = await git_log(paths=relevant_files, limit=10)
return HandoffData(
files=relevant_files,
architecture=architecture,
recent_changes=recent_commits,
constraints=handoff_context.constraints
)
# Register callback
handoff(
to_agent=backend_dev,
on_handoff=on_handoff_to_backend_dev
)
```
### Handoff Context Transfer
```json
{
"handoff_id": "ho_abc123",
"from_agent": "orchestrator",
"to_agent": "backend-dev",
"timestamp": "2026-01-07T10:05:00Z",
"context": {
"task_id": "task-001",
"goal": "Implement user authentication endpoint",
"constraints": [
"Use existing auth patterns from src/auth/",
"Maintain backwards compatibility",
"Add rate limiting"
],
"pre_fetched": {
"files": ["src/auth/middleware.ts", "src/routes/index.ts"],
"architecture": "...",
"recent_changes": [...]
}
},
"return_expected": true,
"timeout_seconds": 600
}
```
---
## Multi-Tiered Fallback System
### Model-Level Fallbacks
```python
async def execute_with_model_fallback(task, preferred_model):
"""
Try preferred model, fall back to alternatives on failure.
Based on OpenAI safety patterns.
"""
fallback_chain = {
"opus": ["sonnet", "haiku"],
"sonnet": ["haiku", "opus"],
"haiku": ["sonnet"]
}
models_to_try = [preferred_model] + fallback_chain.get(preferred_model, [])
for model in models_to_try:
try:
result = await run_agent(task, model=model)
if result.success:
return result
except RateLimitError:
log_warning(f"Rate limit on {model}, trying fallback")
continue
except ModelUnavailableError:
log_warning(f"{model} unavailable, trying fallback")
continue
# All models failed
return escalate_to_human(task, reason="All model fallbacks exhausted")
```
### Workflow-Level Fallbacks
```python
async def execute_with_workflow_fallback(task):
"""
If complex workflow fails, fall back to simpler operations.
"""
# Try full workflow first
try:
return await full_implementation_workflow(task)
except WorkflowError as e:
log_warning(f"Full workflow failed: {e}")
# Fall back to simpler approach
try:
return await simplified_workflow(task)
except WorkflowError as e:
log_warning(f"Simplified workflow failed: {e}")
# Last resort: decompose and try piece by piece
try:
subtasks = decompose_task(task)
results = []
for subtask in subtasks:
result = await execute_single_step(subtask)
results.append(result)
return combine_results(results)
except Exception as e:
return escalate_to_human(task, reason=f"All workflows failed: {e}")
```
### Fallback Decision Tree
```
Task Execution
|
+-- Try preferred approach
| |
| +-- Success? --> Done
| |
| +-- Rate limit? --> Try next model in chain
| |
| +-- Error? --> Try simpler workflow
|
+-- All workflows failed?
| |
| +-- Decompose into subtasks
| |
| +-- Execute piece by piece
|
+-- Still failing?
|
+-- Escalate to human
+-- Log detailed failure context
+-- Save state for resume
```
---
## Confidence-Based Human Escalation
### Confidence Scoring
```python
def calculate_confidence(task_result):
"""
Score confidence 0-1 based on multiple signals.
Low confidence triggers human review.
"""
signals = []
# Test coverage signal
if task_result.test_coverage >= 0.9:
signals.append(1.0)
elif task_result.test_coverage >= 0.7:
signals.append(0.7)
else:
signals.append(0.3)
# Review consensus signal
if task_result.review_unanimous:
signals.append(1.0)
elif task_result.review_majority:
signals.append(0.7)
else:
signals.append(0.3)
# Retry count signal
retry_penalty = min(task_result.retry_count * 0.2, 0.8)
signals.append(1.0 - retry_penalty)
return sum(signals) / len(signals)
# Escalation threshold
CONFIDENCE_THRESHOLD = 0.6
if calculate_confidence(result) < CONFIDENCE_THRESHOLD:
escalate_to_human(
task,
reason="Low confidence score",
context=result
)
```
### Automatic Escalation Triggers
```yaml
human_escalation_triggers:
# Retry-based
- condition: retry_count > 3
action: pause_and_escalate
reason: "Multiple failures indicate unclear requirements"
# Domain-based
- condition: domain in ["payments", "auth", "pii"]
action: require_approval
reason: "Sensitive domain requires human review"
# Confidence-based
- condition: confidence_score < 0.6
action: pause_and_escalate
reason: "Low confidence in solution quality"
# Time-based
- condition: wall_time > expected_time * 3
action: pause_and_escalate
reason: "Task taking much longer than expected"
# Cost-based
- condition: tokens_used > budget * 0.8
action: pause_and_escalate
reason: "Approaching token budget limit"
```
---
## AGENTS.md Integration
### Reading Target Project's AGENTS.md
```python
async def load_project_context():
"""
Read AGENTS.md from target project if exists.
Based on OpenAI/AAIF standard.
"""
agents_md_locations = [
"AGENTS.md",
".github/AGENTS.md",
"docs/AGENTS.md"
]
for location in agents_md_locations:
if await file_exists(location):
content = await read_file(location)
return parse_agents_md(content)
# No AGENTS.md found - use defaults
return default_project_context()
def parse_agents_md(content):
"""
Extract structured guidance from AGENTS.md.
"""
sections = parse_markdown_sections(content)
return ProjectContext(
build_commands=sections.get("build", []),
test_commands=sections.get("test", []),
code_style=sections.get("code style", {}),
architecture_notes=sections.get("architecture", ""),
deployment_notes=sections.get("deployment", ""),
security_notes=sections.get("security", "")
)
```
### Context Priority
```
1. AGENTS.md (closest to current file, monorepo-aware)
2. CLAUDE.md (Claude-specific instructions)
3. .loki/CONTINUITY.md (session state)
4. Package-level documentation
5. README.md (general project info)
```
---
## Reasoning Model Guidance
### When to Use Extended Thinking
Based on OpenAI's o3/o4-mini patterns:
```yaml
use_extended_reasoning:
always:
- System architecture design
- Security vulnerability analysis
- Complex debugging (multi-file, unclear root cause)
- API design decisions
- Performance optimization strategy
sometimes:
- Code review (only for critical/complex changes)
- Refactoring planning (when multiple approaches exist)
- Integration design (when crossing system boundaries)
never:
- Simple bug fixes
- Documentation updates
- Unit test writing
- Formatting/linting
- File operations
```
### Backtracking Pattern
```python
async def execute_with_backtracking(task, max_backtracks=3):
"""
Allow agent to backtrack and try different approaches.
Based on Deep Research's adaptive planning.
"""
attempts = []
for attempt in range(max_backtracks + 1):
# Generate approach considering previous failures
approach = await plan_approach(
task,
failed_approaches=attempts
)
result = await execute_approach(approach)
if result.success:
return result
# Record failed approach for learning
attempts.append({
"approach": approach,
"failure_reason": result.error,
"partial_progress": result.partial_output
})
# Backtrack: reset to clean state
await rollback_to_checkpoint(task.checkpoint_id)
return FailedResult(
reason="Max backtracks exceeded",
attempts=attempts
)
```
---
## Session State Management
### Automatic State Persistence
```python
class Session:
"""
Automatic conversation history and state management.
Inspired by OpenAI Agents SDK Sessions.
"""
def __init__(self, session_id):
self.session_id = session_id
self.state_file = f".loki/state/sessions/{session_id}.json"
self.history = []
self.context = {}
async def save_state(self):
state = {
"session_id": self.session_id,
"history": self.history,
"context": self.context,
"last_updated": now()
}
await write_json(self.state_file, state)
async def load_state(self):
if await file_exists(self.state_file):
state = await read_json(self.state_file)
self.history = state["history"]
self.context = state["context"]
async def add_turn(self, role, content, metadata=None):
self.history.append({
"role": role,
"content": content,
"metadata": metadata,
"timestamp": now()
})
await self.save_state()
```
---
## Sources
**OpenAI Official:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/)
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf)
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/)
- [AGENTS.md Specification](https://agents.md/)
**Deep Research & Reasoning:**
- [Introducing Deep Research](https://openai.com/index/introducing-deep-research/)
- [Deep Research System Card](https://cdn.openai.com/deep-research-system-card.pdf)
- [Introducing o3 and o4-mini](https://openai.com/index/introducing-o3-and-o4-mini/)
- [Reasoning Best Practices](https://platform.openai.com/docs/guides/reasoning-best-practices)
**Safety & Monitoring:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/)
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety)
- [Computer-Using Agent](https://openai.com/index/computer-using-agent/)
**Standards & Interoperability:**
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/)
- [OpenAI for Developers 2025](https://developers.openai.com/blog/openai-for-developers-2025/)

View File

@@ -0,0 +1,568 @@
# Production Patterns Reference
Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.
---
## Overview
This reference consolidates battle-tested insights from:
- HN discussions on autonomous agents in production (2025)
- Coding with LLMs practitioner experiences
- Simon Willison's Superpowers coding agent patterns
- Multi-agent orchestration real-world deployments
---
## What Actually Works in Production
### Human-in-the-Loop (HITL) is Non-Negotiable
**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.
```yaml
hitl_patterns:
always_human:
- Customer-facing responses
- Financial transactions
- Security-critical operations
- Legal/compliance decisions
automation_candidates:
- Internal tooling
- Developer assistance
- Data preprocessing
- Code generation (with review)
implementation:
- Classification layer routes to human vs automated
- Confidence thresholds trigger escalation
- Audit trails for all automated decisions
```
### Narrow Scope Wins
**Key Insight:** Successful agents operate within tightly constrained domains.
```yaml
scope_constraints:
max_steps_before_review: 3-5
task_characteristics:
- Specific, well-defined objectives
- Pre-classified inputs
- Deterministic success criteria
- Verifiable outputs
successful_domains:
- Email scanning and classification
- Invoice processing
- Code refactoring (bounded)
- Documentation generation
- Test writing
failure_prone_domains:
- Open-ended feature implementation
- Novel algorithm design
- Security-critical code
- Cross-system integrations
```
### Confidence-Based Routing
**Key Insight:** Treat agents as preprocessors, not decision-makers.
```python
def confidence_based_routing(agent_output):
"""
Route based on confidence, not capability.
Based on production practitioner patterns.
"""
confidence = agent_output.confidence_score
if confidence >= 0.95:
# High confidence: auto-approve with logging
return AutoApprove(audit_log=True)
elif confidence >= 0.70:
# Medium confidence: quick human review
return HumanReview(priority="normal", timeout="1h")
elif confidence >= 0.40:
# Low confidence: detailed human review
return HumanReview(priority="high", context="full")
else:
# Very low confidence: escalate immediately
return Escalate(reason="low_confidence", require_senior=True)
```
### Classification Before Automation
**Key Insight:** Separate inputs before processing.
```yaml
classification_first:
step_1_classify:
workable:
- Clear requirements
- Existing patterns
- Test coverage available
non_workable:
- Ambiguous requirements
- Novel architecture
- Missing dependencies
escalate_immediately:
- Security concerns
- Compliance requirements
- Customer-facing changes
step_2_route:
workable: "Automated pipeline"
non_workable: "Human clarification"
escalate: "Senior review"
```
### Deterministic Outer Loops
**Key Insight:** Wrap agent outputs with rule-based validation.
```python
def deterministic_validation_loop(task, max_attempts=3):
"""
Use LLMs only where genuine ambiguity exists.
Wrap with deterministic rules.
"""
for attempt in range(max_attempts):
# LLM handles the ambiguous part
output = agent.execute(task)
# Deterministic validation (NOT LLM)
validation_errors = []
# Rule: Must have tests
if not output.has_tests:
validation_errors.append("Missing tests")
# Rule: Must pass linting
lint_result = run_linter(output.code)
if lint_result.errors:
validation_errors.append(f"Lint errors: {lint_result.errors}")
# Rule: Must compile
compile_result = compile_code(output.code)
if not compile_result.success:
validation_errors.append(f"Compile error: {compile_result.error}")
# Rule: Tests must pass
if output.has_tests:
test_result = run_tests(output.code)
if not test_result.all_passed:
validation_errors.append(f"Test failures: {test_result.failures}")
if not validation_errors:
return output
# Feed errors back for retry
task = task.with_feedback(validation_errors)
return FailedResult(reason="Max attempts exceeded")
```
---
## Context Engineering Patterns
### Context Curation Over Automatic Selection
**Key Insight:** Manually choose which files and information to provide.
```yaml
context_curation:
principles:
- "Less is more" - focused context beats comprehensive context
- Manual selection outperforms automatic RAG
- Remove outdated information aggressively
anti_patterns:
- Dumping entire codebase into context
- Relying on automatic context selection
- Accumulating conversation history indefinitely
implementation:
per_task_context:
- 2-5 most relevant files
- Specific functions, not entire modules
- Recent changes only (last 1-2 days)
- Clear success criteria
context_budget:
target: "< 10k tokens for context"
reserve: "90% for model reasoning"
```
### Information Abstraction
**Key Insight:** Summarize rather than feeding full data.
```python
def abstract_for_agent(raw_data, task_context):
"""
Design abstractions that preserve decision-relevant information.
Based on practitioner insights.
"""
# BAD: Feed 10,000 database rows
# raw_data = db.query("SELECT * FROM users")
# GOOD: Summarize to decision-relevant info
summary = {
"query_status": "success",
"total_results": len(raw_data),
"sample": raw_data[:5],
"schema": extract_schema(raw_data),
"statistics": {
"null_count": count_nulls(raw_data),
"unique_values": count_uniques(raw_data),
"date_range": get_date_range(raw_data)
}
}
return summary
```
### Separate Conversations Per Task
**Key Insight:** Fresh contexts yield better results than accumulated sessions.
```yaml
conversation_management:
new_conversation_triggers:
- Different domain (backend -> frontend)
- New feature vs bug fix
- After completing major task
- When errors accumulate (3+ in row)
preserve_across_sessions:
- CLAUDE.md / CONTINUITY.md
- Architectural decisions
- Key constraints
discard_between_sessions:
- Debugging attempts
- Abandoned approaches
- Intermediate drafts
```
---
## Skills System Pattern
### On-Demand Skill Loading
**Key Insight:** Skills remain dormant until the model actively seeks them out.
```yaml
skills_architecture:
core_interaction: "< 2k tokens"
skill_loading: "On-demand via search"
implementation:
skill_discovery:
- Shell script searches skill files
- Model requests specific skills by name
- Skills loaded only when needed
skill_structure:
name: "unique-skill-name"
trigger: "Pattern that activates skill"
content: "Detailed instructions"
dependencies: ["other-skills"]
benefits:
- Minimal base context
- Extensible without bloat
- Skills can be updated independently
```
### Sub-Agents for Context Isolation
**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.
```python
async def context_isolated_search(query, codebase_path):
"""
Use sub-agent for grep/search to prevent context pollution.
Based on Simon Willison's patterns.
"""
# Main agent stays focused
# Sub-agent handles noisy file searching
search_agent = spawn_subagent(
role="codebase-searcher",
context_limit="10k tokens",
permissions=["read-only"]
)
results = await search_agent.execute(
task=f"Find files related to: {query}",
codebase=codebase_path
)
# Return only relevant paths, not full content
return FilteredResults(
paths=results.relevant_files[:10],
summaries=results.file_summaries,
confidence=results.relevance_scores
)
```
---
## Planning Before Execution
### Explicit Plan-Then-Code Workflow
**Key Insight:** Have models articulate detailed plans without immediately writing code.
```yaml
plan_then_code:
phase_1_planning:
outputs:
- spec.md: "Detailed requirements"
- todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
- approach.md: "Implementation strategy"
constraints:
- NO CODE in this phase
- Human review before proceeding
- Clear success criteria
phase_2_review:
checks:
- Plan addresses all requirements
- Approach is feasible
- No missing dependencies
- Tests are specified
phase_3_implementation:
constraints:
- Follow plan exactly
- One task at a time
- Test after each change
- Report deviations immediately
```
---
## Multi-Agent Orchestration Patterns
### Event-Driven Coordination
**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.
```yaml
event_driven_orchestration:
problems_with_synchronous:
- Doesn't scale
- Mixes orchestration with prompt logic
- Single failure breaks entire chain
- No retry/recovery mechanism
async_architecture:
message_queue:
- Agents communicate via events
- Decoupled execution
- Natural retry/dead-letter handling
state_management:
- Persistent task state
- Checkpoint/resume capability
- Clear ownership of data
error_handling:
- Per-agent retry policies
- Circuit breakers
- Graceful degradation
```
### Policy-First Enforcement
**Key Insight:** Govern agent behavior at runtime, not just training time.
```python
class PolicyEngine:
"""
Runtime governance for agent behavior.
Based on autonomous control plane patterns.
"""
def __init__(self, policies):
self.policies = policies
async def enforce(self, agent_action, context):
for policy in self.policies:
result = await policy.evaluate(agent_action, context)
if result.blocked:
return BlockedAction(
reason=result.reason,
policy=policy.name,
remediation=result.suggested_action
)
if result.modified:
agent_action = result.modified_action
return AllowedAction(agent_action)
# Example policies
policies = [
NoProductionDataDeletion(),
NoSecretsInCode(),
MaxTokenBudget(limit=100000),
RequireTestsForCode(),
BlockExternalNetworkCalls(in_sandbox=True)
]
```
### Simulation Layer
**Key Insight:** Evaluate changes before deploying to real environment.
```yaml
simulation_layer:
purpose: "Test agent behavior in safe environment"
implementation:
sandbox_environment:
- Isolated container
- Mocked external services
- Synthetic data
- Full audit logging
validation_checks:
- Run tests in sandbox first
- Compare outputs to expected
- Check for policy violations
- Measure resource consumption
promotion_criteria:
- All tests pass
- No policy violations
- Resource usage within limits
- Human approval (for sensitive changes)
```
---
## Evaluation and Benchmarking
### Problems with Current Benchmarks
**Key Insight:** LLM-as-judge creates shared blind spots.
```yaml
benchmark_problems:
llm_judge_issues:
- Same architecture = same failure modes
- Math errors accepted as correct
- "Do-nothing" baseline passes 38% of time
contamination:
- Published benchmarks become training targets
- Overfitting to specific datasets
- Inflated scores don't reflect real performance
solutions:
held_back_sets: "90% public, 10% private"
human_evaluation: "Final published results require humans"
production_testing: "A/B tests measure actual value"
objective_outcomes: "Simulated environments with verifiable results"
```
### Practical Evaluation Approach
```python
def evaluate_agent_change(before_agent, after_agent, task_set):
"""
Production-oriented evaluation.
Based on HN practitioner recommendations.
"""
results = {
"before": [],
"after": [],
"human_preference": []
}
for task in task_set:
# Run both agents
before_result = before_agent.execute(task)
after_result = after_agent.execute(task)
# Objective metrics (NOT LLM-judged)
results["before"].append({
"tests_pass": run_tests(before_result),
"lint_clean": run_linter(before_result),
"time_taken": before_result.duration,
"tokens_used": before_result.tokens
})
results["after"].append({
"tests_pass": run_tests(after_result),
"lint_clean": run_linter(after_result),
"time_taken": after_result.duration,
"tokens_used": after_result.tokens
})
# Sample for human review
if random.random() < 0.1: # 10% sample
results["human_preference"].append({
"task": task,
"before": before_result,
"after": after_result,
"pending_review": True
})
return EvaluationReport(results)
```
---
## Cost and Token Economics
### Real-World Cost Patterns
```yaml
cost_patterns:
claude_code:
heavy_use: "$25/1-2 hours on large codebases"
api_range: "$1-5/hour depending on efficiency"
max_tier: "$200/month often needs 2-3 subscriptions"
token_economics:
sub_agents_multiply_cost: "Each duplicates context"
example: "5-task parallel job = 50,000+ tokens per subtask"
optimization:
context_isolation: "Use sub-agents for noisy tasks"
information_abstraction: "Summarize, don't dump"
fresh_conversations: "Reset after major tasks"
skill_on_demand: "Load only when needed"
```
---
## Sources
**Hacker News Discussions:**
- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)
**Show HN Projects:**
- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)

View File

@@ -0,0 +1,437 @@
# Quality Control Reference
Quality gates, code review process, and severity blocking rules.
Enhanced with 2025 research on anti-sycophancy, heterogeneous teams, and OpenAI Agents SDK patterns.
---
## Core Principle: Guardrails, Not Just Acceleration
**CRITICAL:** Speed without quality controls creates "AI slop" - semi-functional code that accumulates technical debt. Loki Mode enforces strict quality guardrails.
**Research Insight:** Heterogeneous review teams outperform homogeneous ones by 4-6% (A-HMAD, 2025).
**OpenAI Insight:** "Think of guardrails as a layered defense mechanism. Multiple specialized guardrails create resilient agents."
---
## Guardrails & Tripwires System (OpenAI SDK Pattern)
### Input Guardrails (Run Before Execution)
```python
# Layer 1: Validate task scope and safety
@input_guardrail(blocking=True)
async def validate_task_scope(input, context):
# Check if task within project bounds
if references_external_paths(input):
return GuardrailResult(
tripwire_triggered=True,
reason="Task references paths outside project"
)
# Check for destructive operations
if contains_destructive_operation(input):
return GuardrailResult(
tripwire_triggered=True,
reason="Destructive operation requires human approval"
)
return GuardrailResult(tripwire_triggered=False)
# Layer 2: Detect prompt injection
@input_guardrail(blocking=True)
async def detect_injection(input, context):
if has_injection_patterns(input):
return GuardrailResult(
tripwire_triggered=True,
reason="Potential prompt injection detected"
)
return GuardrailResult(tripwire_triggered=False)
```
### Output Guardrails (Run After Execution)
```python
# Validate code quality before accepting
@output_guardrail
async def validate_code_output(output, context):
if output.type == "code":
issues = run_static_analysis(output.content)
critical = [i for i in issues if i.severity == "critical"]
if critical:
return GuardrailResult(
tripwire_triggered=True,
reason=f"Critical issues: {critical}"
)
return GuardrailResult(tripwire_triggered=False)
# Check for secrets in output
@output_guardrail
async def check_secrets(output, context):
if contains_secrets(output.content):
return GuardrailResult(
tripwire_triggered=True,
reason="Output contains potential secrets"
)
return GuardrailResult(tripwire_triggered=False)
```
### Execution Modes
| Mode | Behavior | Use When |
|------|----------|----------|
| **Blocking** | Guardrail completes before agent starts | Expensive models, sensitive ops |
| **Parallel** | Guardrail runs with agent | Fast checks, acceptable token loss |
```python
# Blocking: prevents token consumption on fail
@input_guardrail(blocking=True, run_in_parallel=False)
async def expensive_validation(input): pass
# Parallel: faster but may waste tokens
@input_guardrail(blocking=True, run_in_parallel=True)
async def fast_validation(input): pass
```
### Tripwire Handling
When a guardrail triggers its tripwire, execution halts immediately:
```python
try:
result = await run_agent(task)
except InputGuardrailTripwireTriggered as e:
log_blocked_attempt(e)
return early_exit(reason=str(e))
except OutputGuardrailTripwireTriggered as e:
rollback_changes()
return retry_with_constraints(e.constraints)
```
### Layered Defense Strategy
```yaml
guardrail_layers:
layer_1_input:
- scope_validation # Is task within bounds?
- pii_detection # Contains sensitive data?
- injection_detection # Prompt injection attempt?
layer_2_pre_execution:
- cost_estimation # Will this exceed budget?
- dependency_check # Are dependencies available?
- conflict_detection # Conflicts with in-progress work?
layer_3_output:
- static_analysis # Code quality issues?
- secret_detection # Secrets in output?
- spec_compliance # Matches OpenAPI spec?
layer_4_post_action:
- test_validation # Tests pass?
- review_approval # Review passed?
- deployment_safety # Safe to deploy?
```
See `references/openai-patterns.md` for full guardrails implementation.
---
## Quality Gates
**Never ship code without passing all quality gates:**
### 1. Static Analysis (Automated)
- CodeQL security scanning
- ESLint/Pylint/Rubocop for code style
- Unused variable/import detection
- Duplicated logic detection
- Type checking (TypeScript/mypy/etc)
### 2. 3-Reviewer Parallel System (AI-driven)
Every code change goes through 3 specialized reviewers **simultaneously**:
```
IMPLEMENT -> BLIND REVIEW (parallel) -> DEBATE (if disagreement) -> AGGREGATE -> FIX -> RE-REVIEW
|
+-- code-reviewer (Opus) - Code quality, patterns, best practices
+-- business-logic-reviewer (Opus) - Requirements, edge cases, UX
+-- security-reviewer (Opus) - Vulnerabilities, OWASP Top 10
```
**Important:**
- ALWAYS launch all 3 reviewers in a single message (3 Task calls)
- ALWAYS specify model: "opus" for each reviewer
- ALWAYS use blind review mode (reviewers cannot see each other's findings initially)
- NEVER dispatch reviewers sequentially (always parallel - 3x faster)
- NEVER aggregate before all 3 reviewers complete
### Anti-Sycophancy Protocol (CONSENSAGENT Research)
**Problem:** Reviewers may reinforce each other's findings instead of critically engaging.
**Solution: Blind Review + Devil's Advocate**
```python
# Phase 1: Independent blind review
reviews = []
for reviewer in [code_reviewer, business_reviewer, security_reviewer]:
review = Task(
subagent_type="general-purpose",
model="opus",
prompt=f"""
{reviewer.prompt}
CRITICAL: Be skeptical. Your job is to find problems.
List specific concerns with file:line references.
Do NOT rubber-stamp. Finding zero issues is suspicious.
"""
)
reviews.append(review)
# Phase 2: Check for disagreement
if has_disagreement(reviews):
# Structured debate - max 2 rounds
debate_result = structured_debate(reviews, max_rounds=2)
else:
# All agreed - run devil's advocate
devil_review = Task(
subagent_type="general-purpose",
model="opus",
prompt="""
The other reviewers found no issues. Your job is to be contrarian.
Find problems they missed. Challenge assumptions.
If truly nothing wrong, explain why each potential issue category is covered.
"""
)
reviews.append(devil_review)
```
### Heterogeneous Team Composition
**Each reviewer has distinct personality/focus:**
| Reviewer | Model | Expertise | Personality |
|----------|-------|-----------|-------------|
| Code Quality | Opus | SOLID, patterns, maintainability | Perfectionist |
| Business Logic | Opus | Requirements, edge cases, UX | Pragmatic |
| Security | Opus | OWASP, auth, injection | Paranoid |
This diversity prevents groupthink and catches more issues.
### 3. Severity-Based Blocking
| Severity | Action | Continue? |
|----------|--------|-----------|
| **Critical** | BLOCK - Fix immediately | NO |
| **High** | BLOCK - Fix immediately | NO |
| **Medium** | BLOCK - Fix before proceeding | NO |
| **Low** | Add `// TODO(review): ...` comment | YES |
| **Cosmetic** | Add `// FIXME(nitpick): ...` comment | YES |
**Critical/High/Medium = BLOCK and fix before proceeding**
**Low/Cosmetic = Add TODO/FIXME comment, continue**
### 4. Test Coverage Gates
- Unit tests: 100% pass, >80% coverage
- Integration tests: 100% pass
- E2E tests: critical flows pass
### 5. Rulesets (Blocking Merges)
- No secrets in code
- No unhandled exceptions
- No SQL injection vulnerabilities
- No XSS vulnerabilities
---
## Code Review Protocol
### Launching Reviewers (Parallel)
```python
# CORRECT: Launch all 3 in parallel
Task(subagent_type="general-purpose", model="opus",
description="Code quality review",
prompt="Review for code quality, patterns, SOLID principles...")
Task(subagent_type="general-purpose", model="opus",
description="Business logic review",
prompt="Review for requirements alignment, edge cases, UX...")
Task(subagent_type="general-purpose", model="opus",
description="Security review",
prompt="Review for vulnerabilities, OWASP Top 10...")
# WRONG: Sequential reviewers (3x slower)
# Don't do: await reviewer1; await reviewer2; await reviewer3;
```
### After Fixes
- ALWAYS re-run ALL 3 reviewers after fixes (not just the one that found the issue)
- Wait for all reviews to complete before aggregating results
---
## Structured Prompting for Subagents
**Every subagent dispatch MUST include:**
```markdown
## GOAL (What success looks like)
[High-level objective, not just the action]
Example: "Refactor authentication for maintainability and testability"
NOT: "Refactor the auth file"
## CONSTRAINTS (What you cannot do)
- No third-party dependencies without approval
- Maintain backwards compatibility with v1.x API
- Keep response time under 200ms
- Follow existing error handling patterns
## CONTEXT (What you need to know)
- Related files: [list with brief descriptions]
- Architecture decisions: [relevant ADRs or patterns]
- Previous attempts: [what was tried, why it failed]
- Dependencies: [what this depends on, what depends on this]
## OUTPUT FORMAT (What to deliver)
- [ ] Pull request with Why/What/Trade-offs description
- [ ] Unit tests with >90% coverage
- [ ] Update API documentation
- [ ] Performance benchmark results
```
---
## Task Completion Report
**Every completed task MUST include decision documentation:**
```markdown
## Task Completion Report
### WHY (Problem & Solution Rationale)
- **Problem**: [What was broken/missing/suboptimal]
- **Root Cause**: [Why it happened]
- **Solution Chosen**: [What we implemented]
- **Alternatives Considered**:
1. [Option A]: Rejected because [reason]
2. [Option B]: Rejected because [reason]
### WHAT (Changes Made)
- **Files Modified**: [with line ranges and purpose]
- `src/auth.ts:45-89` - Extracted token validation to separate function
- `src/auth.test.ts:120-156` - Added edge case tests
- **APIs Changed**: [breaking vs non-breaking]
- **Behavior Changes**: [what users will notice]
- **Dependencies Added/Removed**: [with justification]
### TRADE-OFFS (Gains & Costs)
- **Gained**:
- Better testability (extracted pure functions)
- 40% faster token validation
- Reduced cyclomatic complexity from 15 to 6
- **Cost**:
- Added 2 new functions (increased surface area)
- Requires migration for custom token validators
- **Neutral**:
- No performance change for standard use cases
### RISKS & MITIGATIONS
- **Risk**: Existing custom validators may break
- **Mitigation**: Added backwards-compatibility shim, deprecation warning
- **Risk**: New validation logic untested at scale
- **Mitigation**: Gradual rollout with feature flag, rollback plan ready
### TEST RESULTS
- Unit: 24/24 passed (coverage: 92%)
- Integration: 8/8 passed
- Performance: p99 improved from 145ms -> 87ms
### NEXT STEPS (if any)
- [ ] Monitor error rates for 24h post-deploy
- [ ] Create follow-up task to remove compatibility shim in v3.0
```
---
## Preventing "AI Slop"
### Warning Signs
- Tests pass but code quality degraded
- Copy-paste duplication instead of abstraction
- Over-engineered solutions to simple problems
- Missing error handling
- No logging/observability
- Generic variable names (data, temp, result)
- Magic numbers without constants
- Commented-out code
- TODO comments without GitHub issues
### When Detected
1. Fail the task immediately
2. Add to failed queue with detailed feedback
3. Re-dispatch with stricter constraints
4. Update CONTINUITY.md with anti-pattern to avoid
---
## Quality Gate Hooks
### Pre-Write Hook (BLOCKING)
```bash
#!/bin/bash
# .loki/hooks/pre-write.sh
# Blocks writes that violate rules
# Check for secrets
if grep -rE "(password|secret|key).*=.*['\"][^'\"]{8,}" "$1"; then
echo "BLOCKED: Potential secret detected"
exit 1
fi
# Check for console.log in production
if grep -n "console.log" "$1" | grep -v "test"; then
echo "BLOCKED: Remove console.log statements"
exit 1
fi
```
### Post-Write Hook (AUTO-FIX)
```bash
#!/bin/bash
# .loki/hooks/post-write.sh
# Auto-fixes after writes
# Format code
npx prettier --write "$1"
# Fix linting issues
npx eslint --fix "$1"
# Type check
npx tsc --noEmit
```
---
## Constitution Reference
Quality gates are enforced by `autonomy/CONSTITUTION.md`:
**Pre-Commit (BLOCKING):**
- Linting (auto-fix enabled)
- Type checking (strict mode)
- Contract tests (80% coverage minimum)
- Spec validation (Spectral)
**Post-Implementation (AUTO-FIX):**
- Static analysis (ESLint, Prettier, TSC)
- Security scan (Semgrep, Snyk)
- Performance check (Lighthouse score 90+)
**Runtime Invariants:**
- `SPEC_BEFORE_CODE`: Implementation tasks require spec reference
- `TASK_HAS_COMMIT`: Completed tasks have git commit SHA
- `QUALITY_GATES_PASSED`: Completed tasks passed all quality checks

View File

@@ -0,0 +1,410 @@
# SDLC Phases Reference
All phases with detailed workflows and testing procedures.
---
## Phase Overview
```
Bootstrap -> Discovery -> Architecture -> Infrastructure
| | | |
(Setup) (Analyze PRD) (Design) (Cloud/DB Setup)
|
Development <- QA <- Deployment <- Business Ops <- Growth Loop
| | | | |
(Build) (Test) (Release) (Monitor) (Iterate)
```
---
## Phase 0: Bootstrap
**Purpose:** Initialize Loki Mode environment
### Actions:
1. Create `.loki/` directory structure
2. Initialize orchestrator state in `.loki/state/orchestrator.json`
3. Validate PRD exists and is readable
4. Spawn initial agent pool (3-5 agents)
5. Create CONTINUITY.md
### Directory Structure Created:
```
.loki/
+-- CONTINUITY.md
+-- state/
| +-- orchestrator.json
| +-- agents/
| +-- circuit-breakers/
+-- queue/
| +-- pending.json
| +-- in-progress.json
| +-- completed.json
| +-- dead-letter.json
+-- specs/
+-- memory/
+-- artifacts/
```
---
## Phase 1: Discovery
**Purpose:** Understand requirements and market context
### Actions:
1. Parse PRD, extract requirements
2. Spawn `biz-analytics` agent for competitive research
3. Web search competitors, extract features, reviews
4. Identify market gaps and opportunities
5. Generate task backlog with priorities and dependencies
### Output:
- Requirements document
- Competitive analysis
- Initial task backlog in `.loki/queue/pending.json`
---
## Phase 2: Architecture
**Purpose:** Design system architecture and generate specs
### SPEC-FIRST WORKFLOW
**Step 1: Extract API Requirements from PRD**
- Parse PRD for user stories and functionality
- Map to REST/GraphQL operations
- Document data models and relationships
**Step 2: Generate OpenAPI 3.1 Specification**
```yaml
openapi: 3.1.0
info:
title: Product API
version: 1.0.0
paths:
/auth/login:
post:
summary: Authenticate user and return JWT
requestBody:
required: true
content:
application/json:
schema:
type: object
required: [email, password]
properties:
email: { type: string, format: email }
password: { type: string, minLength: 8 }
responses:
200:
description: Success
content:
application/json:
schema:
type: object
properties:
token: { type: string }
expiresAt: { type: string, format: date-time }
401:
description: Invalid credentials
```
**Step 3: Validate Spec**
```bash
npm install -g @stoplight/spectral-cli
spectral lint .loki/specs/openapi.yaml
swagger-cli validate .loki/specs/openapi.yaml
```
**Step 4: Generate Artifacts from Spec**
```bash
# TypeScript types
npx openapi-typescript .loki/specs/openapi.yaml --output src/types/api.ts
# Client SDK
npx openapi-generator-cli generate \
-i .loki/specs/openapi.yaml \
-g typescript-axios \
-o src/clients/api
# Server stubs
npx openapi-generator-cli generate \
-i .loki/specs/openapi.yaml \
-g nodejs-express-server \
-o backend/generated
# Documentation
npx redoc-cli bundle .loki/specs/openapi.yaml -o docs/api.html
```
**Step 5: Select Tech Stack**
- Spawn `eng-backend` + `eng-frontend` architects
- Both agents review spec and propose stack
- Consensus required (both must agree)
- Self-reflection checkpoint with evidence
**Step 6: Create Project Scaffolding**
- Initialize project with tech stack
- Install dependencies
- Configure linters
- Setup contract testing framework
---
## Phase 3: Infrastructure
**Purpose:** Provision cloud resources and CI/CD
### Actions:
1. Spawn `ops-devops` agent
2. Provision cloud resources (see `references/deployment.md`)
3. Set up CI/CD pipelines
4. Configure monitoring and alerting
5. Create staging and production environments
### CI/CD Pipeline:
```yaml
name: CI/CD Pipeline
on: [push, pull_request]
jobs:
test:
- Lint
- Type check
- Unit tests
- Contract tests
- Security scan
deploy-staging:
needs: test
- Deploy to staging
- Smoke tests
deploy-production:
needs: deploy-staging
- Blue-green deploy
- Health checks
- Auto-rollback on errors
```
---
## Phase 4: Development
**Purpose:** Implement features with quality gates
### Workflow Per Task:
```
1. Dispatch implementation subagent (Task tool, model: sonnet)
2. Subagent implements with TDD, commits, reports back
3. Dispatch 3 reviewers IN PARALLEL (single message, 3 Task calls):
- code-reviewer (opus)
- business-logic-reviewer (opus)
- security-reviewer (opus)
4. Aggregate findings by severity
5. IF Critical/High/Medium found:
- Dispatch fix subagent
- Re-run ALL 3 reviewers
- Loop until all PASS
6. Add TODO comments for Low issues
7. Add FIXME comments for Cosmetic issues
8. Mark task complete with git checkpoint
```
### Implementation Rules:
- Agents implement ONLY what's in the spec
- Must validate against openapi.yaml schema
- Must return responses matching spec
- Performance targets from spec x-performance extension
---
## Phase 5: Quality Assurance
**Purpose:** Comprehensive testing and security audit
### Testing Phases:
**UNIT Phase:**
```bash
npm run test:unit
# or
pytest tests/unit/
```
- Coverage: >80% required
- All tests must pass
**INTEGRATION Phase:**
```bash
npm run test:integration
```
- Test API endpoints against actual database
- Test external service integrations
- Verify data flows end-to-end
**E2E Phase:**
```bash
npx playwright test
# or
npx cypress run
```
- Test complete user flows
- Cross-browser testing
- Mobile responsive testing
**CONTRACT Phase:**
```bash
npm run test:contract
```
- Validate implementation matches OpenAPI spec
- Test request/response schemas
- Breaking change detection
**SECURITY Phase:**
```bash
npm audit
npx snyk test
semgrep --config=auto .
```
- OWASP Top 10 checks
- Dependency vulnerabilities
- Static analysis
**PERFORMANCE Phase:**
```bash
npx k6 run tests/load.js
npx lighthouse http://localhost:3000
```
- Load testing: 100 concurrent users for 1 minute
- Stress testing: 500 concurrent users for 30 seconds
- P95 response time < 500ms required
**ACCESSIBILITY Phase:**
```bash
npx axe http://localhost:3000
```
- WCAG 2.1 AA compliance
- Alt text, ARIA labels, color contrast
- Keyboard navigation, focus indicators
**REGRESSION Phase:**
- Compare behavior against previous version
- Verify no features broken by recent changes
- Test backward compatibility of APIs
**UAT Phase:**
- Create acceptance tests from PRD
- Walk through complete user journeys
- Verify business logic matches PRD
- Document any UX friction points
---
## Phase 6: Deployment
**Purpose:** Release to production
### Actions:
1. Spawn `ops-release` agent
2. Generate semantic version, changelog
3. Create release branch, tag
4. Deploy to staging, run smoke tests
5. Blue-green deploy to production
6. Monitor for 30min, auto-rollback if errors spike
### Deployment Strategies:
**Blue-Green:**
```
1. Deploy new version to "green" environment
2. Run smoke tests
3. Switch traffic from "blue" to "green"
4. Keep "blue" as rollback target
```
**Canary:**
```
1. Deploy to 5% of traffic
2. Monitor error rates
3. Gradually increase to 25%, 50%, 100%
4. Rollback if errors exceed threshold
```
---
## Phase 7: Business Operations
**Purpose:** Non-technical business setup
### Actions:
1. `biz-marketing`: Create landing page, SEO, content
2. `biz-sales`: Set up CRM, outreach templates
3. `biz-finance`: Configure billing (Stripe), invoicing
4. `biz-support`: Create help docs, chatbot
5. `biz-legal`: Generate ToS, privacy policy
---
## Phase 8: Growth Loop
**Purpose:** Continuous improvement
### Cycle:
```
MONITOR -> ANALYZE -> OPTIMIZE -> DEPLOY -> MONITOR
|
Customer feedback -> Feature requests -> Backlog
|
A/B tests -> Winner -> Permanent deploy
|
Incidents -> RCA -> Prevention -> Deploy fix
```
### Never "Done":
- Run performance optimizations
- Add missing test coverage
- Improve documentation
- Refactor code smells
- Update dependencies
- Enhance user experience
- Implement A/B test learnings
---
## Final Review (Before Any Deployment)
```
1. Dispatch 3 reviewers reviewing ENTIRE implementation:
- code-reviewer: Full codebase quality
- business-logic-reviewer: All requirements met
- security-reviewer: Full security audit
2. Aggregate findings across all files
3. Fix Critical/High/Medium issues
4. Re-run all 3 reviewers until all PASS
5. Generate final report in .loki/artifacts/reports/final-review.md
6. Proceed to deployment only after all PASS
```
---
## Quality Gates Summary
| Gate | Agent | Pass Criteria |
|------|-------|---------------|
| Unit Tests | eng-qa | 100% pass |
| Integration Tests | eng-qa | 100% pass |
| E2E Tests | eng-qa | 100% pass |
| Coverage | eng-qa | > 80% |
| Linting | eng-qa | 0 errors |
| Type Check | eng-qa | 0 errors |
| Security Scan | ops-security | 0 high/critical |
| Dependency Audit | ops-security | 0 vulnerabilities |
| Performance | eng-qa | p99 < 200ms |
| Accessibility | eng-frontend | WCAG 2.1 AA |
| Load Test | ops-devops | Handles 10x expected traffic |
| Chaos Test | ops-devops | Recovers from failures |
| Cost Estimate | ops-cost | Within budget |
| Legal Review | biz-legal | Compliant |

View File

@@ -0,0 +1,361 @@
# Task Queue Reference
Distributed task queue system, dead letter handling, and circuit breakers.
---
## Task Schema
```json
{
"id": "uuid",
"idempotencyKey": "hash-of-task-content",
"type": "eng-backend|eng-frontend|ops-devops|...",
"priority": 1-10,
"dependencies": ["task-id-1", "task-id-2"],
"payload": {
"action": "implement|test|deploy|...",
"target": "file/path or resource",
"params": {},
"goal": "What success looks like (high-level objective)",
"constraints": ["No third-party deps", "Maintain backwards compat"],
"context": {
"relatedFiles": ["file1.ts", "file2.ts"],
"architectureDecisions": ["ADR-001: Use JWT tokens"],
"previousAttempts": "What was tried before, why it failed"
}
},
"createdAt": "ISO",
"claimedBy": null,
"claimedAt": null,
"timeout": 3600,
"retries": 0,
"maxRetries": 3,
"backoffSeconds": 60,
"lastError": null,
"completedAt": null,
"result": {
"status": "success|failed",
"output": "What was produced",
"decisionReport": { ... }
}
}
```
**Decision Report is REQUIRED for completed tasks.** Tasks without proper decision documentation will be marked as incomplete.
---
## Queue Files
```
.loki/queue/
+-- pending.json # Tasks waiting to be claimed
+-- in-progress.json # Currently executing tasks
+-- completed.json # Finished tasks
+-- dead-letter.json # Failed tasks for review
+-- cancelled.json # Cancelled tasks
```
---
## Queue Operations
### Claim Task (with file locking)
```python
def claim_task(agent_id, agent_capabilities):
with file_lock(".loki/state/locks/queue.lock", timeout=10):
pending = read_json(".loki/queue/pending.json")
# Find eligible task
for task in sorted(pending.tasks, key=lambda t: -t.priority):
if task.type not in agent_capabilities:
continue
if task.claimedBy and not claim_expired(task):
continue
if not all_dependencies_completed(task.dependencies):
continue
if circuit_breaker_open(task.type):
continue
# Claim it
task.claimedBy = agent_id
task.claimedAt = now()
move_task(task, "pending", "in-progress")
return task
return None
```
### File Locking (Bash)
```bash
#!/bin/bash
# Atomic task claim using flock
QUEUE_FILE=".loki/queue/pending.json"
LOCK_FILE=".loki/state/locks/queue.lock"
(
flock -x -w 10 200 || exit 1
# Read, claim, write atomically
TASK=$(jq -r '.tasks | map(select(.claimedBy == null)) | .[0]' "$QUEUE_FILE")
if [ "$TASK" != "null" ]; then
TASK_ID=$(echo "$TASK" | jq -r '.id')
jq --arg id "$TASK_ID" --arg agent "$AGENT_ID" \
'.tasks |= map(if .id == $id then .claimedBy = $agent | .claimedAt = now else . end)' \
"$QUEUE_FILE" > "${QUEUE_FILE}.tmp" && mv "${QUEUE_FILE}.tmp" "$QUEUE_FILE"
echo "$TASK_ID"
fi
) 200>"$LOCK_FILE"
```
### Complete Task
```python
def complete_task(task_id, result, success=True):
with file_lock(".loki/state/locks/queue.lock"):
task = find_task(task_id, "in-progress")
task.completedAt = now()
task.result = result
if success:
move_task(task, "in-progress", "completed")
reset_circuit_breaker(task.type)
trigger_dependents(task_id)
else:
handle_failure(task)
```
---
## Failure Handling
### Exponential Backoff
```python
def handle_failure(task):
task.retries += 1
task.lastError = get_last_error()
if task.retries >= task.maxRetries:
# Move to dead letter queue
move_task(task, "in-progress", "dead-letter")
increment_circuit_breaker(task.type)
alert_orchestrator(f"Task {task.id} moved to dead letter queue")
else:
# Exponential backoff: 60s, 120s, 240s, ...
task.backoffSeconds = task.backoffSeconds * (2 ** (task.retries - 1))
task.availableAt = now() + task.backoffSeconds
move_task(task, "in-progress", "pending")
log(f"Task {task.id} retry {task.retries}, backoff {task.backoffSeconds}s")
```
---
## Dead Letter Queue
Tasks in dead letter queue require manual review:
### Review Process
1. Read `.loki/queue/dead-letter.json`
2. For each task:
- Analyze `lastError` and failure pattern
- Determine if:
- Task is invalid -> delete
- Bug in agent -> fix agent, retry
- External dependency down -> wait, retry
- Requires human decision -> escalate
3. To retry: move task back to pending with reset retries
4. Log decision in `.loki/logs/decisions/dlq-review-{date}.md`
---
## Idempotency
```python
def enqueue_task(task):
# Generate idempotency key from content
task.idempotencyKey = hash(json.dumps(task.payload, sort_keys=True))
# Check if already exists
for queue in ["pending", "in-progress", "completed"]:
existing = find_by_idempotency_key(task.idempotencyKey, queue)
if existing:
log(f"Duplicate task detected: {task.idempotencyKey}")
return existing.id # Return existing, don't create duplicate
# Safe to create
save_task(task, "pending")
return task.id
```
---
## Task Cancellation
```python
def cancel_task(task_id, reason):
with file_lock(".loki/state/locks/queue.lock"):
for queue in ["pending", "in-progress"]:
task = find_task(task_id, queue)
if task:
task.cancelledAt = now()
task.cancelReason = reason
move_task(task, queue, "cancelled")
# Cancel dependent tasks too
for dep_task in find_tasks_depending_on(task_id):
cancel_task(dep_task.id, f"Parent {task_id} cancelled")
return True
return False
```
---
## Circuit Breakers
### State Schema
```json
{
"circuitBreakers": {
"eng-backend": {
"state": "closed",
"failures": 0,
"lastFailure": null,
"openedAt": null,
"halfOpenAt": null
}
}
}
```
### States
| State | Description | Behavior |
|-------|-------------|----------|
| **closed** | Normal operation | Tasks flow normally |
| **open** | Too many failures | Block all tasks of this type |
| **half-open** | Testing recovery | Allow 1 test task |
### Configuration
```yaml
# .loki/config/circuit-breakers.yaml
defaults:
failureThreshold: 5
cooldownSeconds: 300
halfOpenAfter: 60
overrides:
ops-security:
failureThreshold: 3 # More sensitive for security
biz-marketing:
failureThreshold: 10 # More tolerant for non-critical
```
### Implementation
```python
def check_circuit_breaker(agent_type):
cb = load_circuit_breaker(agent_type)
if cb.state == "closed":
return True # Proceed
if cb.state == "open":
if now() > cb.openedAt + config.halfOpenAfter:
cb.state = "half-open"
save_circuit_breaker(cb)
return True # Allow test task
return False # Still blocking
if cb.state == "half-open":
return False # Already testing, wait
def on_task_success(agent_type):
cb = load_circuit_breaker(agent_type)
if cb.state == "half-open":
cb.state = "closed"
cb.failures = 0
save_circuit_breaker(cb)
def on_task_failure(agent_type):
cb = load_circuit_breaker(agent_type)
cb.failures += 1
cb.lastFailure = now()
if cb.state == "half-open" or cb.failures >= config.failureThreshold:
cb.state = "open"
cb.openedAt = now()
alert_orchestrator(f"Circuit breaker OPEN for {agent_type}")
save_circuit_breaker(cb)
```
---
## Rate Limit Handling
### Detection
```python
def detect_rate_limit(error):
indicators = [
"rate limit",
"429",
"too many requests",
"quota exceeded",
"retry-after"
]
return any(ind in str(error).lower() for ind in indicators)
```
### Response Protocol
```python
def handle_rate_limit(agent_id, error):
# 1. Save state checkpoint
checkpoint_state(agent_id)
# 2. Calculate backoff
retry_after = parse_retry_after(error) or calculate_exponential_backoff()
# 3. Log and wait
log(f"Rate limit hit for {agent_id}, waiting {retry_after}s")
# 4. Signal other agents to slow down
broadcast_signal("SLOWDOWN", {"wait": retry_after / 2})
# 5. Resume after backoff
schedule_resume(agent_id, retry_after)
```
### Exponential Backoff
```python
def calculate_exponential_backoff(attempt=1, base=60, max_wait=3600):
wait = min(base * (2 ** (attempt - 1)), max_wait)
jitter = random.uniform(0, wait * 0.1)
return wait + jitter
```
---
## Priority System
| Priority | Use Case | Example |
|----------|----------|---------|
| 10 | Critical blockers | Security vulnerability fix |
| 8-9 | High priority | Core feature implementation |
| 5-7 | Normal | Standard tasks |
| 3-4 | Low priority | Documentation, cleanup |
| 1-2 | Background | Nice-to-have improvements |
Tasks are always processed in priority order within their type.

View File

@@ -0,0 +1,691 @@
# Tool Orchestration Patterns Reference
Research-backed patterns inspired by NVIDIA ToolOrchestra, OpenAI Agents SDK, and multi-agent coordination research.
---
## Overview
Effective tool orchestration requires four key innovations:
1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types based on task complexity
---
## Tracing Spans Architecture (OpenAI SDK Pattern)
### Span Types
Every operation is wrapped in a typed span for observability:
```yaml
span_types:
agent_span: # Wraps entire agent execution
generation_span: # Wraps LLM API calls
function_span: # Wraps tool/function calls
guardrail_span: # Wraps validation checks
handoff_span: # Wraps agent-to-agent transfers
custom_span: # User-defined operations
```
### Hierarchical Trace Structure
```json
{
"trace_id": "trace_abc123def456",
"workflow_name": "implement_feature",
"group_id": "session_xyz789",
"spans": [
{
"span_id": "span_001",
"parent_id": null,
"type": "agent_span",
"agent_name": "orchestrator",
"started_at": "2026-01-07T10:00:00Z",
"ended_at": "2026-01-07T10:05:00Z",
"children": ["span_002", "span_003"]
},
{
"span_id": "span_002",
"parent_id": "span_001",
"type": "guardrail_span",
"guardrail_name": "input_validation",
"triggered": false,
"blocking": true
},
{
"span_id": "span_003",
"parent_id": "span_001",
"type": "handoff_span",
"from_agent": "orchestrator",
"to_agent": "backend-dev"
}
]
}
```
### Storage Location
```
.loki/traces/
├── active/
│ └── {trace_id}.json # Currently running traces
└── completed/
└── {date}/
└── {trace_id}.json # Archived traces
```
See `references/openai-patterns.md` for full tracing implementation.
---
## Efficiency Metrics System
### Why Track Efficiency?
ToolOrchestra achieves 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:
- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success
### Efficiency Tracking Schema
```json
{
"task_id": "task-2026-01-06-001",
"correlation_id": "session-abc123",
"started_at": "2026-01-06T10:00:00Z",
"completed_at": "2026-01-06T10:05:32Z",
"metrics": {
"wall_time_seconds": 332,
"agents_spawned": 3,
"total_agent_calls": 7,
"retry_count": 1,
"retry_reasons": ["test_failure"],
"recovery_rate": 1.0,
"model_usage": {
"haiku": {"calls": 4, "est_tokens": 12000},
"sonnet": {"calls": 2, "est_tokens": 8000},
"opus": {"calls": 1, "est_tokens": 6000}
}
},
"outcome": "success",
"outcome_reason": "tests_passed_after_fix",
"efficiency_score": 0.85,
"efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
"quality_pillars": {
"tool_selection_correct": true,
"tool_reliability_rate": 0.95,
"memory_retrieval_relevant": true,
"goal_adherence": 1.0
}
}
```
**Why capture these metrics?** (Based on multi-agent research)
1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
- "UX debt turns into data debt" - recording actions without intent creates useless analytics
2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
- `recovery_rate = successful_retries / total_retries`
- Paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures
3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
- `correlation_id`: Links all tasks in a session for end-to-end tracing
- Essential for debugging multi-agent coordination failures
4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
- `tool_selection_correct`: Did we pick the right tool?
- `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
- Key insight: "Tool use reliability" is a primary demo-to-deployment gap
5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
- `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
- `goal_adherence`: Did we stay on task? (0.0-1.0 score)
### Efficiency Score Calculation
```python
def calculate_efficiency_score(metrics, task_complexity):
"""
Score from 0-1 where higher is more efficient.
Based on ToolOrchestra's efficiency reward signal.
"""
# Baseline expectations by complexity
baselines = {
"trivial": {"time": 60, "agents": 1, "retries": 0},
"simple": {"time": 180, "agents": 2, "retries": 0},
"moderate": {"time": 600, "agents": 4, "retries": 1},
"complex": {"time": 1800, "agents": 8, "retries": 2},
"critical": {"time": 3600, "agents": 12, "retries": 3}
}
baseline = baselines[task_complexity]
# Calculate component scores (1.0 = at baseline, >1 = better, <1 = worse)
time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
retry_score = 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3))
# Weighted average (time matters most)
return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```
### Standard Reason Codes
Use consistent codes to enable pattern analysis:
```yaml
outcome_reasons:
success:
- tests_passed_first_try
- tests_passed_after_fix
- review_approved
- spec_validated
partial:
- tests_partial_pass
- review_concerns_minor
- timeout_partial_work
failure:
- tests_failed
- review_blocked
- dependency_missing
- timeout_no_progress
- error_unrecoverable
retry_reasons:
- test_failure
- lint_error
- type_error
- review_rejection
- rate_limit
- timeout
- dependency_conflict
efficiency_factors:
positive:
- used_haiku_for_simple
- parallel_execution
- cached_result
- first_try_success
- spec_driven
negative:
- used_opus_for_simple
- sequential_when_parallel_possible
- multiple_retries
- missing_context
- unclear_requirements
```
### Storage Location
```
.loki/metrics/
├── efficiency/
│ ├── 2026-01-06.json # Daily efficiency logs
│ └── aggregate.json # Running averages by task type
└── rewards/
├── outcomes.json # Task success/failure records
└── preferences.json # User preference signals
```
---
## Reward Signal Framework
### Three Reward Types (ToolOrchestra Pattern)
```
+------------------------------------------------------------------+
| 1. OUTCOME REWARD |
| - Did the task succeed? Binary + quality grade |
| - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure) |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD |
| - Did we use resources wisely? |
| - Signal: 0.0 to 1.0 based on efficiency score |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD |
| - Did the user like the approach/result? |
| - Signal: Inferred from user actions (accept/reject/modify) |
+------------------------------------------------------------------+
```
### Outcome Reward Implementation
```python
def calculate_outcome_reward(task_result):
"""
Outcome reward based on task completion status.
"""
if task_result.status == "completed":
# Grade the quality of completion
if task_result.tests_passed and task_result.review_passed:
return 1.0 # Full success
elif task_result.tests_passed:
return 0.7 # Tests pass but review had concerns
else:
return 0.3 # Completed but with issues
elif task_result.status == "partial":
return 0.0 # Partial completion, no reward
else: # failed
return -1.0 # Negative reward for failure
```
### Preference Reward Implementation
```python
def infer_preference_reward(task_result, user_actions):
"""
Infer user preference from their actions after task completion.
Based on implicit feedback patterns.
"""
signals = []
# Positive signals
if "commit" in user_actions:
signals.append(0.8) # User committed our changes
if "deploy" in user_actions:
signals.append(1.0) # User deployed our changes
if "no_edits" in user_actions:
signals.append(0.6) # User didn't modify our output
# Negative signals
if "revert" in user_actions:
signals.append(-1.0) # User reverted our changes
if "manual_fix" in user_actions:
signals.append(-0.5) # User had to fix our work
if "retry_different" in user_actions:
signals.append(-0.3) # User asked for different approach
# Neutral (no signal)
if not signals:
return None
return sum(signals) / len(signals)
```
### Reward Aggregation for Learning
```python
def aggregate_rewards(outcome, efficiency, preference):
"""
Combine rewards into single learning signal.
Weights based on ToolOrchestra findings.
"""
# Outcome is most important (must succeed)
# Efficiency secondary (once successful, optimize)
# Preference tertiary (align with user style)
weights = {
"outcome": 0.6,
"efficiency": 0.25,
"preference": 0.15
}
total = outcome * weights["outcome"]
total += efficiency * weights["efficiency"]
if preference is not None:
total += preference * weights["preference"]
else:
# Redistribute weight if no preference signal
total = total / (1 - weights["preference"])
return total
```
---
## Dynamic Agent Selection
### Task Complexity Classification
```python
def classify_task_complexity(task):
"""
Classify task to determine agent allocation.
Based on ToolOrchestra's tool selection flexibility.
"""
complexity_signals = {
# File scope signals
"single_file": -1,
"few_files": 0, # 2-5 files
"many_files": +1, # 6-20 files
"system_wide": +2, # 20+ files
# Change type signals
"typo_fix": -2,
"bug_fix": 0,
"feature": +1,
"refactor": +1,
"architecture": +2,
# Domain signals
"documentation": -1,
"tests_only": 0,
"frontend": 0,
"backend": 0,
"full_stack": +1,
"infrastructure": +1,
"security": +2,
}
score = 0
for signal, weight in complexity_signals.items():
if task.has_signal(signal):
score += weight
# Map score to complexity level
if score <= -2:
return "trivial"
elif score <= 0:
return "simple"
elif score <= 2:
return "moderate"
elif score <= 4:
return "complex"
else:
return "critical"
```
### Agent Allocation by Complexity
```yaml
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
trivial:
max_agents: 1
planning: null # No planning needed
development: haiku
testing: haiku
review: skip # No review needed for trivial
parallel: false
simple:
max_agents: 2
planning: null # No planning needed
development: haiku
testing: haiku
review: single # One quick review
parallel: false
moderate:
max_agents: 4
planning: sonnet # Sonnet for moderate planning
development: sonnet
testing: haiku # Unit tests always haiku
review: standard # 3 parallel reviewers
parallel: true
complex:
max_agents: 8
planning: opus # Opus ONLY for complex planning
development: sonnet # Sonnet for implementation
testing: haiku # Unit tests still haiku
review: deep # 3 reviewers + devil's advocate
parallel: true
critical:
max_agents: 12
planning: opus # Opus for critical planning
development: sonnet # Sonnet for implementation
testing: sonnet # Functional/E2E tests with sonnet
review: exhaustive # Multiple review rounds
parallel: true
human_checkpoint: true # Pause for human review
```
### Dynamic Selection Algorithm
```python
def select_agents_for_task(task, available_agents):
"""
Dynamically select agents based on task requirements.
Inspired by ToolOrchestra's configurable tool selection.
"""
complexity = classify_task_complexity(task)
allocation = COMPLEXITY_ALLOCATIONS[complexity]
# 1. Identify required agent types
required_types = identify_required_agents(task)
# 2. Filter to available agents of required types
candidates = [a for a in available_agents if a.type in required_types]
# 3. Score candidates by past performance
for agent in candidates:
agent.selection_score = get_agent_performance_score(
agent,
task_type=task.type,
complexity=complexity
)
# 4. Select top N agents up to allocation limit
candidates.sort(key=lambda a: a.selection_score, reverse=True)
selected = candidates[:allocation["max_agents"]]
# 5. Assign models based on complexity
for agent in selected:
if agent.role == "reviewer":
agent.model = "opus" # Always opus for reviews
else:
agent.model = allocation["model"]
return selected
def get_agent_performance_score(agent, task_type, complexity):
"""
Score agent based on historical performance on similar tasks.
Uses reward signals from previous executions.
"""
history = load_agent_history(agent.id)
# Filter to similar tasks
similar = [h for h in history
if h.task_type == task_type
and h.complexity == complexity]
if not similar:
return 0.5 # Neutral score if no history
# Average past rewards
return sum(h.aggregate_reward for h in similar) / len(similar)
```
---
## Tool Usage Analytics
### Track Tool Effectiveness
```json
{
"tool_analytics": {
"period": "2026-01-06",
"by_tool": {
"Grep": {
"calls": 142,
"success_rate": 0.89,
"avg_result_quality": 0.82,
"common_patterns": ["error handling", "function def"]
},
"Task": {
"calls": 47,
"success_rate": 0.94,
"avg_efficiency": 0.76,
"by_subagent_type": {
"general-purpose": {"calls": 35, "success": 0.91},
"Explore": {"calls": 12, "success": 1.0}
}
}
},
"insights": [
"Explore agent 100% success - use more for codebase search",
"Grep success drops to 0.65 for regex patterns - simplify searches"
]
}
}
```
### Continuous Improvement Loop
```
+------------------------------------------------------------------+
| 1. COLLECT |
| Record every task: agents used, tools called, outcome |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 2. ANALYZE |
| Weekly aggregation: What worked? What didn't? |
| Identify patterns in high-reward vs low-reward tasks |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 3. ADAPT |
| Update selection algorithms based on analytics |
| Store successful patterns in semantic memory |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 4. VALIDATE |
| A/B test new selection strategies |
| Measure efficiency improvement |
+------------------------------------------------------------------+
|
+-----------> Loop back to COLLECT
```
---
## Integration with RARV Cycle
The orchestration patterns integrate with RARV at each phase:
```
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation
ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions
REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store
VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
```
---
## Key Metrics Dashboard
Track these metrics in `.loki/metrics/dashboard.json`:
```json
{
"dashboard": {
"period": "rolling_7_days",
"summary": {
"tasks_completed": 127,
"success_rate": 0.94,
"avg_efficiency_score": 0.78,
"avg_outcome_reward": 0.82,
"avg_preference_reward": 0.71,
"avg_recovery_rate": 0.87,
"avg_goal_adherence": 0.93
},
"quality_pillars": {
"tool_selection_accuracy": 0.91,
"tool_reliability_rate": 0.93,
"memory_retrieval_relevance": 0.84,
"policy_adherence": 0.96
},
"trends": {
"efficiency": "+12% vs previous week",
"success_rate": "+3% vs previous week",
"avg_agents_per_task": "-0.8 (improving)",
"recovery_rate": "+5% vs previous week"
},
"top_performing_patterns": [
"Haiku for unit tests (0.95 success, 0.92 efficiency)",
"Explore agent for codebase search (1.0 success)",
"Parallel review with opus (0.98 accuracy)"
],
"areas_for_improvement": [
"Complex refactors taking 2x expected time",
"Security review efficiency below baseline",
"Memory retrieval relevance below 0.85 target"
]
}
}
```
---
## Multi-Dimensional Evaluation
Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):
> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"
**Loki Mode tracks four evaluation axes:**
| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |
---
## Sources
**OpenAI Agents SDK:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability
**Efficiency & Orchestration:**
- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis
**Evaluation Frameworks:**
- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm
**Best Practices:**
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode
**Safety & Reasoning:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)