Fix: Ensure all skills are tracked as files, not submodules
@@ -0,0 +1,48 @@
# Loki Mode Benchmark Results

**Generated:** 2026-01-05 02:32:40

## Overview

This directory contains benchmark results for the Loki Mode multi-agent system.

## SWE-bench Lite Results

| Metric | Value |
|--------|-------|
| Problems | 50 |
| Patches Generated | 50 |
| Errors | 0 |
| Model | opus |
| Time | 3413.75s |

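
As a quick reading of the table above, the following sketch derives two summary figures from the reported values: the share of problems that produced a patch (100%) and the mean wall-clock time per problem (roughly 68.3 seconds). The variable names are illustrative only. A generated patch is not necessarily a correct one, so neither figure is a resolve rate; that is only known once the evaluator has run.

```python
# Values copied from the results table above.
problems = 50
patches_generated = 50
total_time_s = 3413.75

# Share of problems for which a patch was produced (not the resolve rate).
generation_rate = patches_generated / problems

# Mean wall-clock time per problem, in seconds.
mean_time_s = total_time_s / problems

print(f"Patch generation rate: {generation_rate:.0%}")
print(f"Mean time per problem: {mean_time_s:.1f}s")
```
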
**Next Step:** Run the SWE-bench evaluator to validate patches:

```bash
python -m swebench.harness.run_evaluation \
  --predictions /Users/lokesh/git/loki-mode/benchmarks/results/2026-01-05-01-35-39/swebench-predictions.json \
  --max_workers 4
```

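
Before starting a long evaluation run, it can be worth checking that the predictions file is well-formed. The sketch below assumes the file is a JSON list of objects carrying the `instance_id`, `model_name_or_path`, and `model_patch` keys that the SWE-bench harness reads. The layout Loki Mode actually writes is not shown in this report, so treat that as an assumption. `check_predictions` is a hypothetical helper name, not part of Loki Mode or SWE-bench.

```python
import json
from pathlib import Path

# Keys assumed to be required in each prediction record (SWE-bench convention).
REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def check_predictions(path):
    """Return a list of human-readable problems found in a predictions file."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        return ["top-level JSON value is not a list"]

    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"entry {index}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        instance_id = record["instance_id"]
        if instance_id in seen_ids:
            problems.append(f"entry {index}: duplicate instance_id {instance_id!r}")
        seen_ids.add(instance_id)
        # An empty patch is reported here so it can be spotted before evaluation.
        if not str(record["model_patch"] or "").strip():
            problems.append(f"entry {index}: empty patch for {instance_id!r}")
    return problems
```

Calling `check_predictions` on the predictions file returns an empty list when nothing looks wrong; for this run, the file would also be expected to hold 50 entries, matching the table.
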
## Methodology

Loki Mode uses its multi-agent architecture to solve each problem:

1. **Architect Agent** analyzes the problem
2. **Engineer Agent** implements the solution
3. **QA Agent** validates with test cases
4. **Review Agent** checks code quality

This division of roles is intended to mirror real-world software development more closely than a single-agent approach.

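
The four-stage flow listed above can be pictured as a simple sequential pipeline. The sketch below is a toy illustration only: the `WorkItem` structure, the stage functions, and `solve` are hypothetical names, each stage is a stub standing in for a model call, and the way Loki Mode actually passes work between agents is not described in this report.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkItem:
    """State handed from one agent to the next (hypothetical structure)."""
    problem: str
    plan: str = ""
    patch: str = ""
    tests_passed: bool = False
    approved: bool = False
    history: List[str] = field(default_factory=list)


def architect(item: WorkItem) -> WorkItem:
    # Stub: a real agent would analyze the problem statement.
    item.plan = f"Plan for: {item.problem}"
    item.history.append("architect")
    return item


def engineer(item: WorkItem) -> WorkItem:
    # Stub: a real agent would turn the plan into a code patch.
    item.patch = f"Patch implementing: {item.plan}"
    item.history.append("engineer")
    return item


def qa(item: WorkItem) -> WorkItem:
    # Stub: a real agent would run test cases against the patch.
    item.tests_passed = bool(item.patch)
    item.history.append("qa")
    return item


def review(item: WorkItem) -> WorkItem:
    # Stub: a real agent would check code quality; here approval follows QA.
    item.approved = item.tests_passed
    item.history.append("review")
    return item


# Stages run in the order given in the methodology list.
PIPELINE: List[Callable[[WorkItem], WorkItem]] = [architect, engineer, qa, review]


def solve(problem: str) -> WorkItem:
    """Pass one problem through every stage in order."""
    item = WorkItem(problem=problem)
    for stage in PIPELINE:
        item = stage(item)
    return item


result = solve("Fix off-by-one error in pagination")
print(result.history)  # ['architect', 'engineer', 'qa', 'review']
```

Keeping the stages in a plain list makes the hand-off order explicit and easy to change, which is the only design point this sketch is meant to show.
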
## Running Benchmarks

```bash
# Setup only (download datasets)
./benchmarks/run-benchmarks.sh all

# Execute with Claude
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh humaneval --execute --limit 10  # First 10 only
./benchmarks/run-benchmarks.sh swebench --execute --limit 5    # First 5 only

# Use different model
./benchmarks/run-benchmarks.sh humaneval --execute --model opus
```