harness-eval
Which agentic coding framework actually performs best? Stop arguing — measure it.
The experiment
Four popular coding-agent frameworks — Superpowers, Compound Engineering, Agent Skills, and GSD — each build the same product from the same 2,200-line specification (an issue-tracker-driven orchestration daemon), driven by the same harness and model (Claude Code + Opus 4.6), each trial in a fresh isolated sandbox. The framework is the only variable. Every artifact is then graded by two independent instruments and ranked on a weighted composite.
How grading works
PRD adherence — 40%
ViBench-style Graded Score: an evaluator agent executes a frozen, spec-derived test plan against the built, running artifact — cold-start scripts, mock services, per-step cited evidence, weighted partial credit, fatal gates. Absolute 0–100.
Code quality — 25%
A blind LLM judge scores five criteria (tests, architecture, error handling, dead code, docs), three samples each with medians, on a copy scrubbed of framework-identifying files. Judge ≠ worker model. Absolute 0–100.
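The per-criterion median step can be sketched as follows. This is a minimal illustration, not the project's judge code: the criterion keys, the 0–100 per-sample scale, and the equal-weight mean used to roll five medians into one score are all assumptions, since the text only states that each criterion gets three samples and a median.

```typescript
// Hypothetical criterion keys mirroring the five criteria named in the text.
const CRITERIA = ["tests", "architecture", "errorHandling", "deadCode", "docs"] as const;
type Criterion = (typeof CRITERIA)[number];

// Median of a non-empty list; with three samples this is the middle value.
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("median of empty list");
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumption: each sample is already on a 0-100 scale and the five
// per-criterion medians are averaged with equal weight.
function qualityScore(samples: Record<Criterion, number[]>): number {
  const medians = CRITERIA.map((c) => median(samples[c]));
  return medians.reduce((a, b) => a + b, 0) / medians.length;
}
```

Taking the median of three samples means a single outlier judgment cannot move a criterion's score, which is presumably why sampling is repeated.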
Speed — 17.5%
Agent working time from the harness's own session telemetry (sandbox setup excluded), min-max normalized within the run: fastest 100, slowest 0.
Token spend — 17.5%
Total session cost from harness telemetry, normalized the same way. Together with speed: the efficiency frontier.
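The weighting above can be written out as a short sketch. The four weights and the min-max rule (best 100, worst 0) come from the text; the type names, field names, and the handling of a tie across all candidates are assumptions made for illustration only.

```typescript
// Weights from the text: 40% PRD adherence, 25% code quality, 17.5% speed, 17.5% token spend.
const WEIGHTS = { prd: 0.4, quality: 0.25, speed: 0.175, tokens: 0.175 } as const;

// Hypothetical shape for one candidate's measurements.
interface TrialMetrics {
  name: string;
  prdScore: number;     // absolute 0-100
  qualityScore: number; // absolute 0-100
  workSeconds: number;  // agent working time, lower is better
  costUsd: number;      // total session cost, lower is better
}

// Min-max normalize a lower-is-better metric within the run:
// the smallest value maps to 100, the largest to 0.
function normalizeLowerIsBetter(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Assumption: if every candidate ties, all of them receive 100.
  if (max === min) return values.map(() => 100);
  return values.map((v) => ((max - v) / (max - min)) * 100);
}

// Weighted composite per candidate, sorted best first.
function compositeScores(trials: TrialMetrics[]): { name: string; composite: number }[] {
  const speed = normalizeLowerIsBetter(trials.map((t) => t.workSeconds));
  const tokens = normalizeLowerIsBetter(trials.map((t) => t.costUsd));
  return trials
    .map((t, i) => ({
      name: t.name,
      composite:
        WEIGHTS.prd * t.prdScore +
        WEIGHTS.quality * t.qualityScore +
        WEIGHTS.speed * speed[i] +
        WEIGHTS.tokens * tokens[i],
    }))
    .sort((a, b) => b.composite - a.composite);
}
```

Note that the two absolute scores are comparable across runs, while the speed and token components depend on which candidates happen to be in the run, because they are normalized against that run's fastest and slowest.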
Fairness engineering
Identical rendered prompts for every candidate · version-pinned frameworks with
post-install asserts · frozen content-hashed PRD and test plan · blind judging on
marker-scrubbed workspaces · per-trial sandbox isolation with contamination tests ·
evaluators forbidden from repairing artifacts · evidence cited for every verdict ·
budget caps with explicit capped status, never silent truncation ·
rankings with overlapping variance flagged inconclusive.
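One way to flag an inconclusive ranking is sketched below. The text does not say how overlap is measured, so this sketch assumes a simple rule: two adjacent candidates are inconclusive when their mean ± one sample standard deviation intervals overlap. All names here are hypothetical.

```typescript
// Hypothetical input: composite scores from repeated trials of one candidate.
interface CandidateTrials {
  name: string;
  composites: number[];
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); 0 when fewer than two trials,
// so a single-trial run can never be shown to overlap under this rule.
function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  const variance = xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// Rank by mean composite, best first, and mark each candidate whose interval
// overlaps the next-ranked candidate's interval.
function rankWithFlags(candidates: CandidateTrials[]) {
  const ranked = candidates
    .map((c) => ({ name: c.name, mean: mean(c.composites), sd: sampleStdDev(c.composites) }))
    .sort((a, b) => b.mean - a.mean);
  return ranked.map((c, i) => {
    const next = ranked[i + 1];
    const inconclusiveVsNext = next !== undefined && c.mean - c.sd <= next.mean + next.sd;
    return { ...c, inconclusiveVsNext };
  });
}
```

Under this rule a one-point lead with a two-point spread is reported as inconclusive rather than as a win, which matches the intent stated above even if the project's actual statistic differs.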
Pluggable by design
Five isolation providers behind one interface (Daytona, E2B, Docker, Apple-Silicon micro-VMs, git worktrees) all running one pinned trial image. Eval targets are swappable — bring your own PRD with a frozen test plan and the same machinery grades it. Next up: pluggable worker models (GLM, Kimi, Qwen via Anthropic-compatible endpoints) and alternative harnesses (OpenCode, Codex).
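The "one interface, many providers" idea might look roughly like the sketch below. Every name in it is hypothetical: the project's real interface, method names, and image reference are not shown in the text. Only the `docker` provider id is taken from the CLI example in the "Run it yourself" section, so it is the only one registered here.

```typescript
// Hypothetical shapes; the project's real interface may differ.
interface TrialSpec {
  candidate: string;
  trial: number;
}

interface Sandbox {
  id: string;
  provider: string;
  image: string;
}

interface IsolationProvider {
  readonly name: string;
  create(spec: TrialSpec, image: string): Promise<Sandbox>;
  destroy(sandbox: Sandbox): Promise<void>;
}

// Placeholder only, not a real image reference.
const PINNED_TRIAL_IMAGE = "trial-image@sha256:<digest>";

// Stub standing in for a real backend; it records what it would launch.
function makeStubProvider(name: string): IsolationProvider {
  return {
    name,
    async create(spec, image) {
      return { id: `${name}-${spec.candidate}-t${spec.trial}`, provider: name, image };
    },
    async destroy() {
      // Nothing to clean up in the stub.
    },
  };
}

// Registry keyed by the value passed to --provider.
const providers = new Map<string, IsolationProvider>([
  ["docker", makeStubProvider("docker")],
]);

function getProvider(name: string): IsolationProvider {
  const provider = providers.get(name);
  if (!provider) {
    const known = Array.from(providers.keys()).join(", ");
    throw new Error(`unknown provider: ${name} (known: ${known})`);
  }
  return provider;
}
```

Because callers only see `IsolationProvider`, a trial runner written against it would pass the same pinned image to whichever backend is selected, which is what keeps the backend from becoming a second variable.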
Run it yourself
git clone https://github.com/natea/harness-eval && cd harness-eval
bun install && cp .env.example .env
bun run src/cli.ts run --candidates superpowers --trials 1 --provider docker
bun scripts/grade-trial.ts runs/<run-dir> superpowers-t1
bun run dashboard   # full interactive dashboard on localhost
Hosted evals coming
A hosted service for running your own framework/model evals — pick candidates, bring a spec, get a graded leaderboard — is on the roadmap.