Loki Mode Benchmark Results

Overview

This directory contains benchmark results for the Loki Mode multi-agent system.

Benchmarks Available

HumanEval

Problems: 164 Python programming problems
Metric: Pass@1 (percentage of problems solved on the first attempt)
Competitor Baseline: MetaGPT achieves 85.9-87.7%
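
As a rough illustration of the Pass@1 metric, the sketch below computes the percentage of problems whose first generated solution passed its tests. The function name `pass_at_1` and the sample outcomes are made up for this example; they are not part of the benchmark scripts and do not represent Loki Mode results.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Return the percentage of problems solved on the first attempt.

    Each entry is True if the single generated solution for that
    problem passed all of its tests, and False otherwise.
    """
    if not first_attempt_passed:
        raise ValueError("at least one problem result is required")
    solved = sum(1 for passed in first_attempt_passed if passed)
    return 100.0 * solved / len(first_attempt_passed)


# Made-up outcomes for four problems (illustration only, not real results).
outcomes = [True, True, True, False]
print(f"Pass@1: {pass_at_1(outcomes):.1f}%")
```

With one attempt per problem, Pass@1 reduces to a plain pass percentage, which is why no sampling-based estimator is needed here.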

SWE-bench Lite

Problems: 300 real-world GitHub issues
Metric: Resolution rate
Competitor Baseline: Top agents achieve 45-77%

Running Benchmarks

# Run all benchmarks
./benchmarks/run-benchmarks.sh all

# Run specific benchmark
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh swebench --execute

Results Format

Results are saved as JSON files with:

Timestamp
Problem count
Pass rate
Individual problem results
Token usage
Execution time
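
A results file with the fields listed above might be read and summarized along these lines. Every key name (`timestamp`, `problem_count`, `pass_rate`, `problems`, `passed`, `tokens`, `seconds`) and all values are assumptions made for this sketch; check an actual results file for the real schema.

```python
import json

# Hypothetical results document. All key names and values are assumed
# for illustration and may not match the files in this directory.
example_json = json.dumps({
    "timestamp": "2025-01-01T00:00:00Z",
    "problem_count": 2,
    "pass_rate": 50.0,
    "problems": [
        {"id": "example/0", "passed": True, "tokens": 1200, "seconds": 8.5},
        {"id": "example/1", "passed": False, "tokens": 1500, "seconds": 11.0},
    ],
})


def summarize(results: dict) -> dict:
    """Recompute aggregate figures from the per-problem entries."""
    problems = results["problems"]
    passed = sum(1 for p in problems if p["passed"])
    return {
        "problem_count": len(problems),
        "pass_rate": 100.0 * passed / len(problems) if problems else 0.0,
        "total_tokens": sum(p["tokens"] for p in problems),
        "total_seconds": sum(p["seconds"] for p in problems),
    }


summary = summarize(json.loads(example_json))
print(summary)
```

Recomputing the aggregates from the individual problem results is a cheap way to cross-check the stored pass rate and problem count.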

Methodology

Loki Mode uses its multi-agent architecture to solve each problem:

Architect Agent analyzes the problem
Engineer Agent implements the solution
QA Agent validates with test cases
Review Agent checks code quality

This division of roles is intended to mirror a real-world software development workflow more closely than a single-agent approach does.
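
The four-stage flow described above can be pictured as a simple sequential pipeline. This is only a sketch under stated assumptions: the `Task` class, the stage functions, and the hard-coded `add` solution are hypothetical stand-ins, and the real agents are presumably model-driven rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """State handed from one stage to the next (hypothetical structure)."""
    prompt: str
    plan: str = ""
    code: str = ""
    tests_passed: bool = False
    review_notes: List[str] = field(default_factory=list)


def architect(task: Task) -> Task:
    # Stand-in for problem analysis: record a plan.
    task.plan = f"Plan for: {task.prompt}"
    return task


def engineer(task: Task) -> Task:
    # Stand-in for implementation: a fixed solution instead of generated code.
    task.code = "def add(a, b):\n    return a + b\n"
    return task


def qa(task: Task) -> Task:
    # Stand-in for validation: execute the code and check one test case.
    namespace: dict = {}
    exec(task.code, namespace)
    task.tests_passed = namespace["add"](2, 3) == 5
    return task


def review(task: Task) -> Task:
    # Stand-in for code review: flag a missing docstring.
    if '"""' not in task.code:
        task.review_notes.append("missing docstring")
    return task


PIPELINE: List[Callable[[Task], Task]] = [architect, engineer, qa, review]


def run(task: Task) -> Task:
    """Pass the task through each stage in order."""
    for stage in PIPELINE:
        task = stage(task)
    return task


result = run(Task(prompt="add two numbers"))
print(result.tests_passed, result.review_notes)
```

Keeping the stages in an ordered list makes the hand-off explicit; a real system would likely add retries or feedback from QA and review back to the engineer stage.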
