Loki Mode Benchmark Results
Overview
This directory contains benchmark results for the Loki Mode multi-agent system.
Benchmarks Available
HumanEval
- Problems: 164 Python programming problems
- Metric: Pass@1 (percentage of problems solved on first attempt)
- Competitor Baseline: MetaGPT achieves 85.9-87.7%
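As a rough illustration of the Pass@1 metric described above, the sketch below scores a list of first-attempt outcomes. The function name `pass_at_1` and the sample counts are made up for the example; they are not measured Loki Mode results and do not come from the benchmark scripts.

```python
def pass_at_1(first_attempts):
    """Return the percentage of problems whose first attempt passed."""
    if not first_attempts:
        raise ValueError("no results to score")
    solved = sum(1 for passed in first_attempts if passed)
    return 100.0 * solved / len(first_attempts)


# Illustrative input only: 123 of 164 problems marked as solved on the first try.
outcomes = [True] * 123 + [False] * 41
print(f"Pass@1: {pass_at_1(outcomes):.1f}%")  # prints "Pass@1: 75.0%"
```

The same calculation applies to any benchmark that reports one pass/fail outcome per problem.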
SWE-bench Lite
- Problems: 300 real-world GitHub issues
- Metric: Resolution rate
- Competitor Baseline: Top agents achieve 45-77%
Running Benchmarks
# Run all benchmarks
./benchmarks/run-benchmarks.sh all
# Run specific benchmark
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh swebench --execute
Results Format
Each run saves its results as a JSON file containing:
- Timestamp
- Problem count
- Pass rate
- Individual problem results
- Token usage
- Execution time
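The fields listed above could be read back with a few lines of Python. This is a sketch only: the README does not specify the schema, so every key name below (`timestamp`, `problem_count`, `pass_rate`, `results`, `token_usage`, `execution_time_seconds`) and every value is an assumption chosen for illustration. Check a real result file for the actual keys.

```python
import json

# Hypothetical result file contents; key names and values are assumptions.
sample = """
{
  "timestamp": "2025-01-01T00:00:00Z",
  "problem_count": 2,
  "pass_rate": 50.0,
  "results": [
    {"task_id": "HumanEval/0", "passed": true},
    {"task_id": "HumanEval/1", "passed": false}
  ],
  "token_usage": 1234,
  "execution_time_seconds": 12.5
}
"""


def summarize(report):
    """Build a one-line summary from a parsed result report."""
    passed = sum(1 for r in report["results"] if r["passed"])
    return (
        f"{passed}/{report['problem_count']} passed "
        f"({report['pass_rate']:.1f}%), "
        f"{report['token_usage']} tokens, "
        f"{report['execution_time_seconds']:.1f}s"
    )


report = json.loads(sample)
print(summarize(report))  # prints "1/2 passed (50.0%), 1234 tokens, 12.5s"
```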
Methodology
Loki Mode uses its multi-agent architecture to solve each problem:
- Architect Agent analyzes the problem
- Engineer Agent implements the solution
- QA Agent validates with test cases
- Review Agent checks code quality
This division of roles is intended to mirror real-world software development more closely than single-agent approaches.
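The four roles above can be pictured as stages applied to one work item. The sketch below is not Loki Mode's implementation: the names `WorkItem`, `architect`, `engineer`, `qa`, `review`, and `solve` are hypothetical, the stages are assumed to run strictly in the listed order, and each stage is a hard-coded stub standing in for a real agent call.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    """State handed from one stage to the next (hypothetical structure)."""
    problem: str
    plan: str = ""
    code: str = ""
    tests_passed: bool = False
    review_notes: list = field(default_factory=list)


def architect(item):
    # Stub: a real Architect Agent would analyze the problem statement.
    item.plan = f"plan for: {item.problem}"
    return item


def engineer(item):
    # Stub: a real Engineer Agent would generate code from the plan.
    item.code = "def add(a, b):\n    return a + b\n"
    return item


def qa(item):
    # Stub: run the candidate code against a single fixed test case.
    namespace = {}
    exec(item.code, namespace)
    item.tests_passed = namespace["add"](2, 3) == 5
    return item


def review(item):
    # Stub: one trivial quality check standing in for a full code review.
    if '"""' not in item.code:
        item.review_notes.append("missing docstring")
    return item


# Assumed ordering: each stage runs once, in the order the roles are listed.
PIPELINE = [architect, engineer, qa, review]


def solve(problem):
    item = WorkItem(problem)
    for stage in PIPELINE:
        item = stage(item)
    return item


result = solve("add two integers")
print(result.tests_passed, result.review_notes)  # prints "True ['missing docstring']"
```

Keeping the stages in a plain list makes the hand-off order explicit and lets a stage be swapped or skipped when testing one role in isolation.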