# Loki Mode Benchmark Results

## Overview

This directory contains benchmark results for the Loki Mode multi-agent system.

## Benchmarks Available

### HumanEval

- **Problems:** 164 Python programming problems
- **Metric:** Pass@1 (percentage of problems solved on the first attempt)
- **Competitor Baseline:** MetaGPT achieves 85.9-87.7%

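
As a minimal sketch of how Pass@1 can be computed when one solution is generated per problem: the score is the number of problems whose first attempt passed, divided by the total. The function name `pass_at_1` and the list-of-booleans input are illustrative assumptions, not part of the Loki Mode benchmark runner.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Return Pass@1 as a percentage.

    Hypothetical helper: each element records whether the first (and only)
    generated solution for one problem passed all of its tests.
    """
    if not first_attempt_passed:
        raise ValueError("at least one problem result is required")
    solved = sum(1 for passed in first_attempt_passed if passed)
    return 100.0 * solved / len(first_attempt_passed)


if __name__ == "__main__":
    # Illustrative data only: 3 of 4 problems solved on the first attempt.
    print(f"{pass_at_1([True, True, False, True]):.1f}%")
```

With 164 HumanEval problems, each solved problem contributes about 0.61 percentage points to the score.
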
### SWE-bench Lite

- **Problems:** 300 real-world GitHub issues
- **Metric:** Resolution rate (percentage of issues whose generated patch passes the repository's tests)
- **Competitor Baseline:** Top agents achieve 45-77%

## Running Benchmarks

```bash
# Run all benchmarks
./benchmarks/run-benchmarks.sh all

# Run specific benchmark
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh swebench --execute
```

## Results Format

Results are saved as JSON files containing:

- Timestamp
- Problem count
- Pass rate
- Individual problem results
- Token usage
- Execution time

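
To make the field list above concrete, here is a hedged sketch of what such a results file might look like and how it could be summarized. Every key name (`timestamp`, `problem_count`, `pass_rate`, `problems`, `token_usage`, `execution_time_seconds`) and every value is a made-up placeholder; the actual schema written by the runner may differ.

```python
import json

# Hypothetical results document: key names are assumptions chosen to match
# the fields listed above, not the actual schema written by the runner.
EXAMPLE_RESULTS = """
{
  "timestamp": "2025-01-01T00:00:00Z",
  "problem_count": 2,
  "pass_rate": 50.0,
  "problems": [
    {"id": "example/0", "passed": true},
    {"id": "example/1", "passed": false}
  ],
  "token_usage": {"input": 1000, "output": 500},
  "execution_time_seconds": 12.5
}
"""


def summarize(results: dict) -> str:
    """Build a one-line summary and check the stored totals for consistency."""
    problems = results["problems"]
    if len(problems) != results["problem_count"]:
        raise ValueError("problem_count does not match the number of problem entries")
    if not problems:
        raise ValueError("results contain no problems")
    passed = sum(1 for p in problems if p["passed"])
    recomputed = 100.0 * passed / len(problems)
    if abs(recomputed - results["pass_rate"]) > 1e-6:
        raise ValueError("pass_rate does not match the individual results")
    total_tokens = sum(results["token_usage"].values())
    return (f"{passed}/{len(problems)} passed ({recomputed:.1f}%), "
            f"{total_tokens} tokens, {results['execution_time_seconds']}s")


if __name__ == "__main__":
    print(summarize(json.loads(EXAMPLE_RESULTS)))
```

Recomputing the pass rate from the individual results is a cheap sanity check before comparing runs.
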
## Methodology

Loki Mode uses its multi-agent architecture to solve each problem:

1. **Architect Agent** analyzes the problem
2. **Engineer Agent** implements the solution
3. **QA Agent** validates with test cases
4. **Review Agent** checks code quality

This division of roles mirrors real-world software development more closely than single-agent approaches.

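
The four stages above can be sketched as a simple sequential pipeline. This is an illustration under stated assumptions, not Loki Mode's implementation: the `WorkItem` structure, the stage functions, and the fixed stand-in outputs are all hypothetical, and a real agent would call a language model where the stubs return canned values.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkItem:
    """State handed from one stage to the next (hypothetical structure)."""
    problem: str
    plan: str = ""
    code: str = ""
    tests_passed: bool = False
    review_notes: List[str] = field(default_factory=list)


Stage = Callable[[WorkItem], WorkItem]


def architect(item: WorkItem) -> WorkItem:
    # Stand-in for the Architect Agent: records a plan for the problem.
    item.plan = f"Plan for: {item.problem}"
    return item


def engineer(item: WorkItem) -> WorkItem:
    # Stand-in for the Engineer Agent: returns a fixed solution.
    item.code = "def add(a, b):\n    return a + b\n"
    return item


def qa(item: WorkItem) -> WorkItem:
    # Stand-in for the QA Agent: runs the code against one test case.
    namespace: dict = {}
    exec(item.code, namespace)
    item.tests_passed = namespace["add"](2, 3) == 5
    return item


def review(item: WorkItem) -> WorkItem:
    # Stand-in for the Review Agent: a trivial code-quality check.
    if '"""' not in item.code:
        item.review_notes.append("missing docstring")
    return item


PIPELINE: List[Stage] = [architect, engineer, qa, review]


def solve(problem: str) -> WorkItem:
    """Pass the problem through each stage once, in order."""
    item = WorkItem(problem=problem)
    for stage in PIPELINE:
        item = stage(item)
    return item


if __name__ == "__main__":
    result = solve("Add two numbers")
    print(result.tests_passed, result.review_notes)
```

Whether stages can loop back (for example, QA returning a failure to the Engineer) is not stated here; the sketch assumes a single forward pass.
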