# Loki Mode Benchmark Results ## Overview This directory contains benchmark results for Loki Mode multi-agent system. ## Benchmarks Available ### HumanEval - **Problems:** 164 Python programming problems - **Metric:** Pass@1 (percentage of problems solved on first attempt) - **Competitor Baseline:** MetaGPT achieves 85.9-87.7% ### SWE-bench Lite - **Problems:** 300 real-world GitHub issues - **Metric:** Resolution rate - **Competitor Baseline:** Top agents achieve 45-77% ## Running Benchmarks ```bash # Run all benchmarks ./benchmarks/run-benchmarks.sh all # Run specific benchmark ./benchmarks/run-benchmarks.sh humaneval --execute ./benchmarks/run-benchmarks.sh swebench --execute ``` ## Results Format Results are saved as JSON files with: - Timestamp - Problem count - Pass rate - Individual problem results - Token usage - Execution time ## Methodology Loki Mode uses its multi-agent architecture to solve each problem: 1. **Architect Agent** analyzes the problem 2. **Engineer Agent** implements the solution 3. **QA Agent** validates with test cases 4. **Review Agent** checks code quality This mirrors real-world software development more accurately than single-agent approaches.