Fix: Ensure all skills are tracked as files, not submodules

skills/loki-mode/references/production-patterns.md (new file, 568 lines)

# Production Patterns Reference

Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.

---

## Overview

This reference consolidates battle-tested insights from:
- HN discussions on autonomous agents in production (2025)
- Coding with LLMs practitioner experiences
- Simon Willison's Superpowers coding agent patterns
- Real-world multi-agent orchestration deployments

---

## What Actually Works in Production

### Human-in-the-Loop (HITL) is Non-Negotiable

**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.

```yaml
hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions

  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)

  implementation:
    - Classification layer routes to human vs automated
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
```

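The routing layer above can be sketched as a small function. This is a minimal illustration, not a real API: the category names mirror the YAML, and the threshold value is an assumption.

```python
# Categories that must always go to a human, per the HITL pattern above.
ALWAYS_HUMAN = {
    "customer_facing_response",
    "financial_transaction",
    "security_critical",
    "legal_compliance",
}

def route_request(category: str, confidence: float, threshold: float = 0.9) -> str:
    """Route to a human when the category demands it or confidence is low."""
    if category in ALWAYS_HUMAN:
        return "human"
    if confidence < threshold:
        return "escalate"  # confidence threshold triggers escalation
    return "automated"
```

In a real system every `"automated"` decision would also be written to an audit trail, as the YAML notes.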
### Narrow Scope Wins

**Key Insight:** Successful agents operate within tightly constrained domains.

```yaml
scope_constraints:
  max_steps_before_review: 3-5
  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs

successful_domains:
  - Email scanning and classification
  - Invoice processing
  - Code refactoring (bounded)
  - Documentation generation
  - Test writing

failure_prone_domains:
  - Open-ended feature implementation
  - Novel algorithm design
  - Security-critical code
  - Cross-system integrations
```

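The `max_steps_before_review` constraint can be enforced with a simple step budget. A hypothetical sketch; the class name and review mechanism are illustrative:

```python
class StepBudget:
    """Force a human checkpoint after a bounded number of agent steps."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps_taken = 0

    def record_step(self) -> bool:
        """Return True while the agent may continue without review."""
        self.steps_taken += 1
        return self.steps_taken < self.max_steps

    def reset_after_review(self) -> None:
        """Called once a human has reviewed the work so far."""
        self.steps_taken = 0
```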
### Confidence-Based Routing

**Key Insight:** Treat agents as preprocessors, not decision-makers.

```python
def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score

    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)

    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")

    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")

    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)
```

### Classification Before Automation

**Key Insight:** Separate inputs before processing.

```yaml
classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes

  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
```

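A classify-then-route step can be sketched with simple keyword rules. This is an assumption-heavy toy: real classifiers would be a model or a richer rule engine, and the marker words are invented for illustration.

```python
# Keyword markers standing in for a real classifier.
ESCALATE_MARKERS = {"security", "compliance", "customer-facing"}
AMBIGUOUS_MARKERS = {"tbd", "unclear", "novel"}

def classify(task_description: str) -> str:
    """Bucket a task into workable / non_workable / escalate."""
    words = task_description.lower().split()
    if any(w in ESCALATE_MARKERS for w in words):
        return "escalate"
    if any(w in AMBIGUOUS_MARKERS for w in words):
        return "non_workable"
    return "workable"

# step_2_route from the YAML above, as a lookup table.
ROUTES = {
    "workable": "automated_pipeline",
    "non_workable": "human_clarification",
    "escalate": "senior_review",
}

def route(task_description: str) -> str:
    return ROUTES[classify(task_description)]
```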
### Deterministic Outer Loops

**Key Insight:** Wrap agent outputs with rule-based validation.

```python
def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: Must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: Must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: Must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: Tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")
```

---

## Context Engineering Patterns

### Context Curation Over Automatic Selection

**Key Insight:** Manually choose which files and information to provide.

```yaml
context_curation:
  principles:
    - "Less is more" - focused context beats comprehensive context
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively

  anti_patterns:
    - Dumping the entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely

  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria

    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
```

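The context budget above can be enforced with a rough estimate. A sketch, assuming the common ~4-characters-per-token heuristic; a real implementation would use the model's actual tokenizer.

```python
def within_context_budget(snippets: list[str], max_tokens: int = 10_000) -> bool:
    """Approximate token count (~4 chars/token) and check the budget."""
    approx_tokens = sum(len(s) for s in snippets) // 4
    return approx_tokens <= max_tokens

def curate(snippets: list[str], max_tokens: int = 10_000) -> list[str]:
    """Greedily keep the smallest snippets until the budget is spent."""
    kept, used = [], 0
    for s in sorted(snippets, key=len):
        cost = len(s) // 4
        if used + cost > max_tokens:
            break
        kept.append(s)
        used += cost
    return kept
```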
### Information Abstraction

**Key Insight:** Summarize rather than feeding full data.

```python
def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: Feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: Summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data)
        }
    }

    return summary
```

### Separate Conversations Per Task

**Key Insight:** Fresh contexts yield better results than accumulated sessions.

```yaml
conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs bug fix
    - After completing a major task
    - When errors accumulate (3+ in a row)

  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints

  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
```

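The reset triggers above reduce to a small predicate. A hypothetical helper; the domain strings and the error threshold of 3 come straight from the YAML, the function name is invented.

```python
def should_start_fresh(prev_domain: str, new_domain: str,
                       consecutive_errors: int, task_completed: bool) -> bool:
    """Start a new conversation on domain switch, task completion, or error pileup."""
    if prev_domain != new_domain:
        return True
    if task_completed:
        return True
    return consecutive_errors >= 3
```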
---

## Skills System Pattern

### On-Demand Skill Loading

**Key Insight:** Skills remain dormant until the model actively seeks them out.

```yaml
skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"

  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills loaded only when needed

    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]

  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
```

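The discover/load split can be sketched with an in-memory registry. This is an illustration only: real skills live in files and are found by a shell search, and the skill names and triggers here are invented.

```python
# In-memory stand-in for skill files on disk.
SKILLS = {
    "yaml-refactor": {"trigger": "refactor yaml",
                      "content": "Instructions for refactoring YAML."},
    "test-writer": {"trigger": "write tests",
                    "content": "Instructions for writing tests."},
}

def discover(query: str) -> list[str]:
    """Cheap search step: return only matching skill names, not contents."""
    return [name for name, s in SKILLS.items() if s["trigger"] in query.lower()]

def load(name: str) -> str:
    """Load the full skill content only after the model asks for it by name."""
    return SKILLS[name]["content"]
```

Only `discover` results enter the base context; the token-heavy `content` is paid for only on `load`.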
### Sub-Agents for Context Isolation

**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.

```python
async def context_isolated_search(query, codebase_path):
    """
    Use a sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # Main agent stays focused;
    # sub-agent handles noisy file searching
    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"]
    )

    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path
    )

    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores
    )
```

---

## Planning Before Execution

### Explicit Plan-Then-Code Workflow

**Key Insight:** Have models articulate detailed plans without immediately writing code.

```yaml
plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria

  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified

  phase_3_implementation:
    constraints:
      - Follow plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
```

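The three phases form a gate: nothing advances without sign-off. A minimal sketch, assuming human approval is a single call; real gating would attach reviewers and criteria per phase.

```python
PHASES = ["planning", "review", "implementation"]

class Workflow:
    """Advance through phases only after the current one is approved."""

    def __init__(self):
        self.index = 0

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def approve(self) -> str:
        """Human sign-off moves the workflow to the next phase."""
        if self.index < len(PHASES) - 1:
            self.index += 1
        return self.phase
```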
---

## Multi-Agent Orchestration Patterns

### Event-Driven Coordination

**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.

```yaml
event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - A single failure breaks the entire chain
    - No retry/recovery mechanism

  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling

    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data

    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
```

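The message-queue coordination above can be sketched with an in-process queue; in production this would be a real broker with persistence. The handler names and dead-letter list are illustrative.

```python
from queue import Queue

events: Queue = Queue()
dead_letters: list = []

def publish(event_type: str, payload: dict) -> None:
    """Agents emit events instead of calling each other directly."""
    events.put({"type": event_type, "payload": payload})

def drain(handlers: dict) -> list:
    """Dispatch queued events to per-type handlers; unhandled ones go to dead letters."""
    results = []
    while not events.empty():
        event = events.get()
        handler = handlers.get(event["type"])
        if handler:
            results.append(handler(event["payload"]))
        else:
            dead_letters.append(event)  # natural dead-letter handling
    return results
```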
### Policy-First Enforcement

**Key Insight:** Govern agent behavior at runtime, not just training time.

```python
class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """

    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)

            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action
                )

            if result.modified:
                agent_action = result.modified_action

        return AllowedAction(agent_action)


# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True)
]
```

### Simulation Layer

**Key Insight:** Evaluate changes before deploying to the real environment.

```yaml
simulation_layer:
  purpose: "Test agent behavior in a safe environment"

  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging

    validation_checks:
      - Run tests in sandbox first
      - Compare outputs to expected
      - Check for policy violations
      - Measure resource consumption

    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
```

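The promotion criteria reduce to a single gate over the sandbox report. A sketch with hypothetical field names; the real report shape would come from the sandbox tooling.

```python
def may_promote(report: dict) -> bool:
    """Promote out of the sandbox only when every criterion holds."""
    return (
        report["tests_passed"]                                   # all tests pass
        and report["policy_violations"] == 0                     # no violations
        and report["resource_usage"] <= report["resource_limit"] # within limits
        and (report["human_approved"] or not report["sensitive"])  # approval if sensitive
    )
```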
---

## Evaluation and Benchmarking

### Problems with Current Benchmarks

**Key Insight:** LLM-as-judge creates shared blind spots.

```yaml
benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - "Do-nothing" baseline passes 38% of the time

  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance

  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
```

### Practical Evaluation Approach

```python
import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": []
    }

    for task in task_set:
        # Run both agents
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        results["before"].append({
            "tests_pass": run_tests(before_result),
            "lint_clean": run_linter(before_result),
            "time_taken": before_result.duration,
            "tokens_used": before_result.tokens
        })

        results["after"].append({
            "tests_pass": run_tests(after_result),
            "lint_clean": run_linter(after_result),
            "time_taken": after_result.duration,
            "tokens_used": after_result.tokens
        })

        # Sample for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True
            })

    return EvaluationReport(results)
```

---

## Cost and Token Economics

### Real-World Cost Patterns

```yaml
cost_patterns:
  claude_code:
    heavy_use: "$25 per 1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month; heavy users often need 2-3 subscriptions"

  token_economics:
    sub_agents_multiply_cost: "Each sub-agent duplicates context"
    example: "5-task parallel job = 50,000+ tokens per subtask"

  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skill_on_demand: "Load skills only when needed"
```

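The sub-agent multiplier can be put into a back-of-envelope model. The per-token price here is an illustrative assumption, not current pricing; the context sizes match the example above.

```python
def fanout_token_cost(base_context_tokens: int, num_subtasks: int,
                      per_subtask_tokens: int) -> int:
    """Each sub-agent re-pays the shared context, so fan-out multiplies cost."""
    return num_subtasks * (base_context_tokens + per_subtask_tokens)

def dollars(tokens: int, price_per_million: float = 3.0) -> float:
    """Convert tokens to dollars at an assumed flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million
```

For the 5-task example: a 40k-token shared context plus 10k per subtask gives 50k tokens per subtask, 250k total.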
---

## Sources

**Hacker News Discussions:**
- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)

**Show HN Projects:**
- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)