Fix: Ensure all skills are tracked as files, not submodules

skills/loki-mode/references/production-patterns.md (new file, 568 lines)

# Production Patterns Reference

Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.

---

## Overview

This reference consolidates battle-tested insights from:
- HN discussions on autonomous agents in production (2025)
- Coding with LLMs practitioner experiences
- Simon Willison's Superpowers coding agent patterns
- Real-world multi-agent orchestration deployments

---

## What Actually Works in Production

### Human-in-the-Loop (HITL) is Non-Negotiable

**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.

```yaml
hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions

  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)

  implementation:
    - Classification layer routes to human vs automated
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
```

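The routing layer above can be sketched as a small function. This is a minimal illustration, not a real API: the category names mirror the YAML, and the threshold value is an assumption.

```python
# Categories that must always go to a human, per the HITL pattern above.
ALWAYS_HUMAN = {
    "customer_facing_response",
    "financial_transaction",
    "security_critical",
    "legal_compliance",
}

def route_request(category: str, confidence: float, threshold: float = 0.9) -> str:
    """Route to a human when the category demands it or confidence is low."""
    if category in ALWAYS_HUMAN:
        return "human"
    if confidence < threshold:
        return "escalate"  # confidence threshold triggers escalation
    return "automated"
```

In a real system every `"automated"` decision would also be written to an audit trail, as the YAML notes.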
### Narrow Scope Wins

**Key Insight:** Successful agents operate within tightly constrained domains.

```yaml
scope_constraints:
  max_steps_before_review: 3-5
  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs

successful_domains:
  - Email scanning and classification
  - Invoice processing
  - Code refactoring (bounded)
  - Documentation generation
  - Test writing

failure_prone_domains:
  - Open-ended feature implementation
  - Novel algorithm design
  - Security-critical code
  - Cross-system integrations
```

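The `max_steps_before_review` constraint can be enforced with a simple step budget. A hypothetical sketch; the class name and review mechanism are illustrative:

```python
class StepBudget:
    """Force a human checkpoint after a bounded number of agent steps."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps_taken = 0

    def record_step(self) -> bool:
        """Return True while the agent may continue without review."""
        self.steps_taken += 1
        return self.steps_taken < self.max_steps

    def reset_after_review(self) -> None:
        """Called once a human has reviewed the work so far."""
        self.steps_taken = 0
```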
### Confidence-Based Routing

**Key Insight:** Treat agents as preprocessors, not decision-makers.

```python
def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score

    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)

    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")

    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")

    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)
```

### Classification Before Automation

**Key Insight:** Separate inputs before processing.

```yaml
classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes

  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
```

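A classify-then-route step can be sketched with simple keyword rules. This is an assumption-heavy toy: real classifiers would be a model or a richer rule engine, and the marker words are invented for illustration.

```python
# Keyword markers standing in for a real classifier.
ESCALATE_MARKERS = {"security", "compliance", "customer-facing"}
AMBIGUOUS_MARKERS = {"tbd", "unclear", "novel"}

def classify(task_description: str) -> str:
    """Bucket a task into workable / non_workable / escalate."""
    words = task_description.lower().split()
    if any(w in ESCALATE_MARKERS for w in words):
        return "escalate"
    if any(w in AMBIGUOUS_MARKERS for w in words):
        return "non_workable"
    return "workable"

# step_2_route from the YAML above, as a lookup table.
ROUTES = {
    "workable": "automated_pipeline",
    "non_workable": "human_clarification",
    "escalate": "senior_review",
}

def route(task_description: str) -> str:
    return ROUTES[classify(task_description)]
```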
### Deterministic Outer Loops

**Key Insight:** Wrap agent outputs with rule-based validation.

```python
def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: Must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: Must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: Must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: Tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")
```

---

## Context Engineering Patterns

### Context Curation Over Automatic Selection

**Key Insight:** Manually choose which files and information to provide.

```yaml
context_curation:
  principles:
    - "Less is more" - focused context beats comprehensive context
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively

  anti_patterns:
    - Dumping the entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely

  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria

    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
```

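The context budget above can be enforced with a rough estimate. A sketch, assuming the common ~4-characters-per-token heuristic; a real implementation would use the model's actual tokenizer.

```python
def within_context_budget(snippets: list[str], max_tokens: int = 10_000) -> bool:
    """Approximate token count (~4 chars/token) and check the budget."""
    approx_tokens = sum(len(s) for s in snippets) // 4
    return approx_tokens <= max_tokens

def curate(snippets: list[str], max_tokens: int = 10_000) -> list[str]:
    """Greedily keep the smallest snippets until the budget is spent."""
    kept, used = [], 0
    for s in sorted(snippets, key=len):
        cost = len(s) // 4
        if used + cost > max_tokens:
            break
        kept.append(s)
        used += cost
    return kept
```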
### Information Abstraction

**Key Insight:** Summarize rather than feeding full data.

```python
def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: Feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: Summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data)
        }
    }

    return summary
```

### Separate Conversations Per Task

**Key Insight:** Fresh contexts yield better results than accumulated sessions.

```yaml
conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs bug fix
    - After completing a major task
    - When errors accumulate (3+ in a row)

  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints

  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
```

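The reset triggers above reduce to a small predicate. A hypothetical helper; the domain strings and the error threshold of 3 come straight from the YAML, the function name is invented.

```python
def should_start_fresh(prev_domain: str, new_domain: str,
                       consecutive_errors: int, task_completed: bool) -> bool:
    """Start a new conversation on domain switch, task completion, or error pileup."""
    if prev_domain != new_domain:
        return True
    if task_completed:
        return True
    return consecutive_errors >= 3
```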
---

## Skills System Pattern

### On-Demand Skill Loading

**Key Insight:** Skills remain dormant until the model actively seeks them out.

```yaml
skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"

  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills loaded only when needed

    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]

  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
```

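The discover/load split can be sketched with an in-memory registry. This is an illustration only: real skills live in files and are found by a shell search, and the skill names and triggers here are invented.

```python
# In-memory stand-in for skill files on disk.
SKILLS = {
    "yaml-refactor": {"trigger": "refactor yaml",
                      "content": "Instructions for refactoring YAML."},
    "test-writer": {"trigger": "write tests",
                    "content": "Instructions for writing tests."},
}

def discover(query: str) -> list[str]:
    """Cheap search step: return only matching skill names, not contents."""
    return [name for name, s in SKILLS.items() if s["trigger"] in query.lower()]

def load(name: str) -> str:
    """Load the full skill content only after the model asks for it by name."""
    return SKILLS[name]["content"]
```

Only `discover` results enter the base context; the token-heavy `content` is paid for only on `load`.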
### Sub-Agents for Context Isolation

**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.

```python
async def context_isolated_search(query, codebase_path):
    """
    Use a sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # Main agent stays focused;
    # sub-agent handles noisy file searching
    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"]
    )

    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path
    )

    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores
    )
```

---

## Planning Before Execution

### Explicit Plan-Then-Code Workflow

**Key Insight:** Have models articulate detailed plans without immediately writing code.

```yaml
plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria

  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified

  phase_3_implementation:
    constraints:
      - Follow plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
```

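The three phases form a gate: nothing advances without sign-off. A minimal sketch, assuming human approval is a single call; real gating would attach reviewers and criteria per phase.

```python
PHASES = ["planning", "review", "implementation"]

class Workflow:
    """Advance through phases only after the current one is approved."""

    def __init__(self):
        self.index = 0

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def approve(self) -> str:
        """Human sign-off moves the workflow to the next phase."""
        if self.index < len(PHASES) - 1:
            self.index += 1
        return self.phase
```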
---

## Multi-Agent Orchestration Patterns

### Event-Driven Coordination

**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.

```yaml
event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - A single failure breaks the entire chain
    - No retry/recovery mechanism

  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling

    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data

    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
```

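The message-queue coordination above can be sketched with an in-process queue; in production this would be a real broker with persistence. The handler names and dead-letter list are illustrative.

```python
from queue import Queue

events: Queue = Queue()
dead_letters: list = []

def publish(event_type: str, payload: dict) -> None:
    """Agents emit events instead of calling each other directly."""
    events.put({"type": event_type, "payload": payload})

def drain(handlers: dict) -> list:
    """Dispatch queued events to per-type handlers; unhandled ones go to dead letters."""
    results = []
    while not events.empty():
        event = events.get()
        handler = handlers.get(event["type"])
        if handler:
            results.append(handler(event["payload"]))
        else:
            dead_letters.append(event)  # natural dead-letter handling
    return results
```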
### Policy-First Enforcement

**Key Insight:** Govern agent behavior at runtime, not just training time.

```python
class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """

    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)

            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action
                )

            if result.modified:
                agent_action = result.modified_action

        return AllowedAction(agent_action)


# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True)
]
```

### Simulation Layer

**Key Insight:** Evaluate changes before deploying to the real environment.

```yaml
simulation_layer:
  purpose: "Test agent behavior in a safe environment"

  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging

    validation_checks:
      - Run tests in sandbox first
      - Compare outputs to expected
      - Check for policy violations
      - Measure resource consumption

    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
```

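The promotion criteria reduce to a single gate over the sandbox report. A sketch with hypothetical field names; the real report shape would come from the sandbox tooling.

```python
def may_promote(report: dict) -> bool:
    """Promote out of the sandbox only when every criterion holds."""
    return (
        report["tests_passed"]                                   # all tests pass
        and report["policy_violations"] == 0                     # no violations
        and report["resource_usage"] <= report["resource_limit"] # within limits
        and (report["human_approved"] or not report["sensitive"])  # approval if sensitive
    )
```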
---

## Evaluation and Benchmarking

### Problems with Current Benchmarks

**Key Insight:** LLM-as-judge creates shared blind spots.

```yaml
benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - "Do-nothing" baseline passes 38% of the time

  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance

  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
```

### Practical Evaluation Approach

```python
import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": []
    }

    for task in task_set:
        # Run both agents
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        results["before"].append({
            "tests_pass": run_tests(before_result),
            "lint_clean": run_linter(before_result),
            "time_taken": before_result.duration,
            "tokens_used": before_result.tokens
        })

        results["after"].append({
            "tests_pass": run_tests(after_result),
            "lint_clean": run_linter(after_result),
            "time_taken": after_result.duration,
            "tokens_used": after_result.tokens
        })

        # Sample for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True
            })

    return EvaluationReport(results)
```

---

## Cost and Token Economics

### Real-World Cost Patterns

```yaml
cost_patterns:
  claude_code:
    heavy_use: "$25 per 1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month; heavy users often need 2-3 subscriptions"

  token_economics:
    sub_agents_multiply_cost: "Each sub-agent duplicates context"
    example: "5-task parallel job = 50,000+ tokens per subtask"

  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skill_on_demand: "Load skills only when needed"
```

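The sub-agent multiplier can be put into a back-of-envelope model. The per-token price here is an illustrative assumption, not current pricing; the context sizes match the example above.

```python
def fanout_token_cost(base_context_tokens: int, num_subtasks: int,
                      per_subtask_tokens: int) -> int:
    """Each sub-agent re-pays the shared context, so fan-out multiplies cost."""
    return num_subtasks * (base_context_tokens + per_subtask_tokens)

def dollars(tokens: int, price_per_million: float = 3.0) -> float:
    """Convert tokens to dollars at an assumed flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million
```

For the 5-task example: a 40k-token shared context plus 10k per subtask gives 50k tokens per subtask, 250k total.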
---

## Sources

**Hacker News Discussions:**
- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)

**Show HN Projects:**
- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)