Fix: Ensure all skills are tracked as files, not submodules
@@ -0,0 +1,48 @@
# Loki Mode Benchmark Results

## Overview

This directory contains benchmark results for the Loki Mode multi-agent system.

## Benchmarks Available

### HumanEval
- **Problems:** 164 Python programming problems
- **Metric:** Pass@1 (percentage of problems solved on the first attempt)
- **Competitor Baseline:** MetaGPT reports 85.9% Pass@1 on HumanEval (its 87.7% figure is for MBPP)
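
As a rough illustration of the metric: with a single attempt per problem, Pass@1 reduces to the share of problems whose one generated solution passes all tests. The sketch below assumes a hypothetical mapping from problem ID to a boolean outcome. It is not the scoring code used by `run-benchmarks.sh`, and the outcomes shown are made up.

```python
def pass_at_1(outcomes: dict[str, bool]) -> float:
    """Return Pass@1 as a percentage, given exactly one attempt per problem."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    solved = sum(1 for passed in outcomes.values() if passed)
    return 100.0 * solved / len(outcomes)


# Hypothetical outcomes for three problems (illustrative, not real results).
example = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": True}
print(f"Pass@1: {pass_at_1(example):.1f}%")
```

On the full benchmark the denominator would be 164, so each problem is worth roughly 0.6 percentage points.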

### SWE-bench Lite
- **Problems:** 300 real-world GitHub issues
- **Metric:** Resolution rate
- **Competitor Baseline:** Top agents achieve 45-77%

## Running Benchmarks

```bash
# Run all benchmarks
./benchmarks/run-benchmarks.sh all

# Run a specific benchmark
./benchmarks/run-benchmarks.sh humaneval --execute
./benchmarks/run-benchmarks.sh swebench --execute
```

## Results Format

Each result is saved as a JSON file containing:
- Timestamp
- Problem count
- Pass rate
- Individual problem results
- Token usage
- Execution time
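
A minimal sketch of how such a file might be summarized follows. Only `benchmark`, `timestamp`, and `total_problems` appear in the committed sample files. The `problems` list and its `id` and `passed` keys are assumptions made for illustration, so the names should be adjusted to whatever the runner actually writes.

```python
import json


def summarize(result: dict) -> str:
    """Build a one-line summary from a parsed results file."""
    # "problems" and "passed" are hypothetical field names (assumption).
    problems = result.get("problems", [])
    passed = sum(1 for p in problems if p.get("passed"))
    total = result.get("total_problems", len(problems))
    rate = 100.0 * passed / total if total else 0.0
    return f"{result.get('benchmark', '?')}: {passed}/{total} passed ({rate:.1f}%)"


# Illustrative data only, not a real benchmark run.
sample = json.loads("""
{
  "benchmark": "HumanEval",
  "timestamp": "2026-01-05T00:24:04.904083",
  "total_problems": 2,
  "problems": [
    {"id": "HumanEval/0", "passed": true},
    {"id": "HumanEval/1", "passed": false}
  ]
}
""")
print(summarize(sample))
```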

## Methodology

Loki Mode solves each problem with a four-stage multi-agent pipeline:
1. **Architect Agent** analyzes the problem
2. **Engineer Agent** implements the solution
3. **QA Agent** validates with test cases
4. **Review Agent** checks code quality

This division of roles is intended to mirror a real-world software development workflow more closely than a single-agent approach.
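
The four stages above can be pictured as a sequential pipeline in which each agent receives the task state left by the previous one. The sketch below is purely illustrative: `Task`, `make_stage`, and `solve` are hypothetical names, and each stage is a placeholder that only records that it ran rather than calling a model. It does not reflect Loki Mode's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """State handed from one agent to the next (hypothetical structure)."""
    problem: str
    notes: list[str] = field(default_factory=list)


Stage = Callable[[Task], Task]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(task: Task) -> Task:
        task.notes.append(f"{name}: done")
        return task
    return run


# Stage order follows the numbered list: Architect, Engineer, QA, Review.
PIPELINE: list[Stage] = [
    make_stage(name) for name in ("Architect", "Engineer", "QA", "Review")
]


def solve(problem: str) -> Task:
    """Run every stage in order on a fresh task."""
    task = Task(problem)
    for stage in PIPELINE:
        task = stage(task)
    return task


result = solve("HumanEval/0")
print(result.notes)
```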
@@ -0,0 +1,15 @@
{
  "benchmark": "HumanEval",
  "version": "1.0",
  "timestamp": "2026-01-05T00:24:04.904083",
  "total_problems": 164,
  "status": "INFRASTRUCTURE_READY",
  "note": "Benchmark infrastructure created. Run with --execute to run actual tests.",
  "sample_problems": [
    "HumanEval/0",
    "HumanEval/1",
    "HumanEval/2",
    "HumanEval/3",
    "HumanEval/4"
  ]
}
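
Note that this file is a placeholder: its `status` is `INFRASTRUCTURE_READY` and it carries no pass rate. A consumer of these files may want to guard against treating it as a scored run. The sketch below assumes that any other status value indicates real results, which the commit does not confirm. `has_real_results` and the `COMPLETED` value used in the usage line are hypothetical.

```python
import json


def has_real_results(result: dict) -> bool:
    """Treat INFRASTRUCTURE_READY files as placeholders without scores.

    Assumption: any other status value means the benchmark was executed.
    """
    return result.get("status") != "INFRASTRUCTURE_READY"


placeholder = json.loads(
    '{"benchmark": "HumanEval", "status": "INFRASTRUCTURE_READY", "total_problems": 164}'
)
print(has_real_results(placeholder))  # placeholder file, so this prints False
print(has_real_results({"status": "COMPLETED"}))  # hypothetical status value
```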
@@ -0,0 +1,10 @@
{
  "benchmark": "SWE-bench Lite",
  "version": "1.0",
  "timestamp": "2026-01-05T00:24:04.950779",
  "total_problems": 300,
  "status": "INFRASTRUCTURE_READY",
  "note": "Benchmark infrastructure created. Install swebench package for full evaluation.",
  "install": "pip install swebench",
  "evaluation": "python -m swebench.harness.run_evaluation --predictions predictions.json"
}
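
The evaluation command above reads a predictions file. As a hedged sketch, the SWE-bench harness is generally documented as expecting one record per issue with the keys `instance_id`, `model_name_or_path`, and `model_patch`. The exact command-line flags and file layout should be checked against the installed `swebench` version. The helper name, the instance ID, the model name, and the patch text below are all hypothetical.

```python
import json


def build_predictions(patches: dict[str, str], model_name: str) -> list[dict]:
    """Turn {instance_id: unified diff} into a list of prediction records."""
    return [
        {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,
        }
        for instance_id, patch in patches.items()
    ]


# Hypothetical instance ID and patch text, for illustration only.
preds = build_predictions(
    {"example__repo-1": "diff --git a/x.py b/x.py\n"},
    model_name="loki-mode",
)
print(json.dumps(preds, indent=2))
```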