# Loki Mode Benchmark Results

Generated: 2026-01-05 02:32:40
## Overview

This directory contains benchmark results for the Loki Mode multi-agent system.
## SWE-bench Lite Results

| Metric | Value |
|--------|-------|
| Problems | 50 |
| Patches Generated | 50 |
| Errors | 0 |
| Model | opus |
| Time | 3413.75 s (~57 min) |
**Next step:** run the SWE-bench evaluator to validate the generated patches:

```sh
python -m swebench.harness.run_evaluation \
    --predictions /Users/lokesh/git/loki-mode/benchmarks/results/2026-01-05-01-35-39/swebench-predictions.json \
    --max_workers 4
```
## Methodology

Loki Mode applies its multi-agent architecture to each problem in sequence:

- **Architect Agent** analyzes the problem and produces a plan
- **Engineer Agent** implements the solution
- **QA Agent** validates it with test cases
- **Review Agent** checks code quality

This mirrors real-world software development more closely than single-agent approaches.
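The staged hand-off above can be sketched as a simple sequential pipeline. This is an illustrative sketch only: the function names and the state dictionary are hypothetical stand-ins, not the actual Loki Mode API.

```python
# Hypothetical sketch of the four-stage agent pipeline described above.
# All names and data shapes are illustrative, not Loki Mode's real interfaces.

def architect(problem: str) -> dict:
    """Analyze the problem and produce a plan."""
    return {"problem": problem, "plan": f"plan for: {problem}"}

def engineer(state: dict) -> dict:
    """Implement a candidate patch from the architect's plan."""
    state["patch"] = f"patch implementing {state['plan']}"
    return state

def qa(state: dict) -> dict:
    """Validate the patch against test cases (stubbed as a presence check)."""
    state["tests_passed"] = "patch" in state
    return state

def review(state: dict) -> dict:
    """Gate the final patch on quality review (stubbed on QA result)."""
    state["approved"] = state["tests_passed"]
    return state

def solve(problem: str) -> dict:
    """Run the problem through all four stages in order."""
    state = architect(problem)
    for stage in (engineer, qa, review):
        state = stage(state)
    return state

result = solve("fix off-by-one in pagination")
print(result["approved"])  # → True
```

Each stage only sees the accumulated state from earlier stages, which is what lets a failing QA step block the Review stage from approving a patch.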
## Running Benchmarks

```sh
# Setup only (download datasets)
./benchmarks/run-benchmarks.sh all

# Execute with Claude
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh humaneval --execute --limit 10  # first 10 problems only
./benchmarks/run-benchmarks.sh swebench --execute --limit 5    # first 5 problems only

# Use a different model
./benchmarks/run-benchmarks.sh humaneval --execute --model opus
```