# Loki Mode Benchmark Results

Generated: 2026-01-05 02:32:40
## Overview

This directory contains benchmark results for the Loki Mode multi-agent system.
## SWE-bench Lite Results

| Metric | Value |
|--------|-------|
| Problems | 50 |
| Patches Generated | 50 |
| Errors | 0 |
| Model | opus |
| Time | 3413.75 s (~57 min) |
**Next step:** run the SWE-bench evaluator to validate the generated patches:

```sh
python -m swebench.harness.run_evaluation \
    --predictions /Users/lokesh/git/loki-mode/benchmarks/results/2026-01-05-01-35-39/swebench-predictions.json \
    --max_workers 4
```
## Methodology

Loki Mode applies its multi-agent architecture to each problem in sequence:

- **Architect Agent** analyzes the problem and produces a plan
- **Engineer Agent** implements the solution
- **QA Agent** validates it with test cases
- **Review Agent** checks code quality

This mirrors real-world software development more closely than single-agent approaches.
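The staged hand-off above can be sketched as a simple sequential pipeline. This is an illustrative sketch only: the function names and the state dictionary are hypothetical stand-ins, not the actual Loki Mode API.

```python
# Hypothetical sketch of the four-stage agent pipeline described above.
# All names and data shapes are illustrative, not Loki Mode's real interfaces.

def architect(problem: str) -> dict:
    """Analyze the problem and produce a plan."""
    return {"problem": problem, "plan": f"plan for: {problem}"}

def engineer(state: dict) -> dict:
    """Implement a candidate patch from the architect's plan."""
    state["patch"] = f"patch implementing {state['plan']}"
    return state

def qa(state: dict) -> dict:
    """Validate the patch against test cases (stubbed as a presence check)."""
    state["tests_passed"] = "patch" in state
    return state

def review(state: dict) -> dict:
    """Gate the final patch on quality review (stubbed on QA result)."""
    state["approved"] = state["tests_passed"]
    return state

def solve(problem: str) -> dict:
    """Run the problem through all four stages in order."""
    state = architect(problem)
    for stage in (engineer, qa, review):
        state = stage(state)
    return state

result = solve("fix off-by-one in pagination")
print(result["approved"])  # → True
```

Each stage only sees the accumulated state from earlier stages, which is what lets a failing QA step block the Review stage from approving a patch.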
## Running Benchmarks

```sh
# Setup only (download datasets)
./benchmarks/run-benchmarks.sh all

# Execute with Claude
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh humaneval --execute --limit 10  # first 10 problems only
./benchmarks/run-benchmarks.sh swebench --execute --limit 5    # first 5 problems only

# Use a different model
./benchmarks/run-benchmarks.sh humaneval --execute --model opus
```