harness-eval

Which agentic coding framework actually performs best? Stop arguing — measure it.


The experiment

Four popular coding-agent frameworks (Superpowers, Compound Engineering, Agent Skills, and GSD) each build the same product, an issue-tracker-driven orchestration daemon, from the same 2,200-line specification. All four are driven by the same harness and model (Claude Code + Opus 4.6), and each trial runs in a fresh, isolated sandbox, so the framework is the only variable. Every artifact is then graded by two independent instruments (a PRD-adherence evaluator and a blind code-quality judge), and candidates are ranked on a weighted composite that also counts speed and token spend.

How grading works

PRD adherence — 40%

ViBench-style Graded Score: an evaluator agent executes a frozen, spec-derived test plan against the built, running artifact — cold-start scripts, mock services, per-step cited evidence, weighted partial credit, fatal gates. Absolute 0–100.

Code quality — 25%

A blind LLM judge scores five criteria (tests, architecture, error handling, dead code, docs), three samples each with medians, on a copy scrubbed of framework-identifying files. Judge ≠ worker model. Absolute 0–100.
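
The median-of-three step can be sketched as follows. This is an illustration, not the project's code: the type and function names are hypothetical, and it assumes each judge sample is already on a 0-100 scale and that the five criteria are averaged with equal weight, neither of which is stated above.

```typescript
// Hypothetical names; the real data model is not described in this document.
type Criterion = "tests" | "architecture" | "errorHandling" | "deadCode" | "docs";

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumption: equal-weight mean of the per-criterion medians.
function codeQualityScore(samples: Record<Criterion, number[]>): number {
  const medians = Object.values(samples).map(median);
  return medians.reduce((sum, m) => sum + m, 0) / medians.length;
}

// Three judge samples per criterion (made-up numbers).
const judgeSamples: Record<Criterion, number[]> = {
  tests: [70, 80, 75],
  architecture: [60, 90, 65],
  errorHandling: [50, 55, 95],
  deadCode: [80, 80, 20],
  docs: [40, 45, 50],
};

const qualityScore = codeQualityScore(judgeSamples);
console.log(qualityScore); // medians 75, 65, 55, 80, 45 average to 64
```

Taking the median rather than the mean keeps a single outlier sample (the 95 and the 20 above) from moving the criterion score.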

Speed — 17.5%

Agent working time from the harness's own session telemetry (sandbox setup excluded), min-max normalized within the run: fastest 100, slowest 0.
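
A minimal sketch of that normalization, where a lower raw value is better. The function name and the tie-handling rule are assumptions made for illustration; the document only specifies that the fastest candidate scores 100 and the slowest 0.

```typescript
// Min-max normalization within one run, inverted so that the lowest raw
// value (least time, or least cost) maps to 100 and the highest to 0.
function minMaxLowerIsBetter(raw: Record<string, number>): Record<string, number> {
  const values = Object.values(raw);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scores: Record<string, number> = {};
  for (const [name, value] of Object.entries(raw)) {
    // Assumption: if every candidate ties, all of them score 100.
    scores[name] = max === min ? 100 : (100 * (max - value)) / (max - min);
  }
  return scores;
}

// Agent working time in seconds for three hypothetical candidates.
const speedScores = minMaxLowerIsBetter({ alpha: 600, beta: 900, gamma: 1800 });
console.log(speedScores); // alpha 100, beta 75, gamma 0
```

Because the scale is relative to the run, these scores are not comparable across runs with different candidates, unlike the absolute 0-100 scores for PRD adherence and code quality.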

Token spend — 17.5%

Total session cost from harness telemetry, normalized the same way. Together with speed: the efficiency frontier.
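
With all four weights stated, the composite can be sketched as a plain weighted sum. The interface and field names are hypothetical; only the weights (40%, 25%, 17.5%, 17.5%) come from the sections above.

```typescript
// Hypothetical shape for one candidate's four component scores, each 0-100.
interface ComponentScores {
  prd: number;     // PRD adherence
  quality: number; // code quality
  speed: number;   // normalized agent working time
  cost: number;    // normalized token spend
}

// Weights as stated in this document.
const WEIGHTS: ComponentScores = { prd: 0.4, quality: 0.25, speed: 0.175, cost: 0.175 };

function compositeScore(s: ComponentScores): number {
  return (
    s.prd * WEIGHTS.prd +
    s.quality * WEIGHTS.quality +
    s.speed * WEIGHTS.speed +
    s.cost * WEIGHTS.cost
  );
}

// Made-up component scores for one candidate.
const composite = compositeScore({ prd: 80, quality: 64, speed: 75, cost: 100 });
console.log(composite.toFixed(3)); // 32 + 16 + 13.125 + 17.5 = 78.625
```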

Fairness engineering

Identical rendered prompts for every candidate · version-pinned frameworks with post-install asserts · frozen content-hashed PRD and test plan · blind judging on marker-scrubbed workspaces · per-trial sandbox isolation with contamination tests · evaluators forbidden from repairing artifacts · evidence cited for every verdict · budget caps with explicit capped status, never silent truncation · rankings with overlapping variance flagged inconclusive.
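
One way to read "overlapping variance flagged inconclusive" is sketched below. This is an assumption, not the project's documented rule: it treats two candidates as indistinguishable when their mean composite scores, each widened by one sample standard deviation, overlap. All names are hypothetical.

```typescript
interface TrialStats {
  mean: number;
  stdDev: number;
}

// Mean and sample standard deviation of one candidate's per-trial scores.
function summarize(scores: number[]): TrialStats {
  const n = scores.length;
  if (n === 0) throw new Error("no trials");
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // With a single trial there is no spread estimate; this sketch uses 0.
  const variance =
    n > 1 ? scores.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1) : 0;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Assumed rule: inconclusive when the mean +/- 1 stdDev intervals overlap.
function isInconclusive(a: number[], b: number[]): boolean {
  const sa = summarize(a);
  const sb = summarize(b);
  const lower = sa.mean <= sb.mean ? sa : sb;
  const higher = sa.mean <= sb.mean ? sb : sa;
  return lower.mean + lower.stdDev >= higher.mean - higher.stdDev;
}

console.log(isInconclusive([70, 72, 74], [73, 75, 77])); // true: 72±2 overlaps 75±2
console.log(isInconclusive([70, 72, 74], [90, 91, 92])); // false: 72±2 vs 91±1
```

Under this reading a single-trial run can only be flagged on an exact tie, which is one reason to run several trials per candidate before trusting a ranking.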

Pluggable by design

Five isolation providers sit behind one interface (Daytona, E2B, Docker, Apple-Silicon micro-VMs, git worktrees), all running one pinned trial image. Eval targets are swappable: bring your own PRD with a frozen test plan and the same machinery grades it. Next up: pluggable worker models (GLM, Kimi, Qwen via Anthropic-compatible endpoints) and alternative harnesses (OpenCode, Codex).
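
To make "one interface" concrete, here is a rough sketch of what a provider abstraction could look like. Every name in it (`IsolationProvider`, `Sandbox`, `runTrial`, the `trial:pinned` image tag) is hypothetical; the project's real interface may differ, and the in-memory provider exists only so the sketch runs.

```typescript
// Hypothetical contract that each backend would implement.
interface Sandbox {
  exec(command: string): Promise<{ exitCode: number; stdout: string }>;
  destroy(): Promise<void>;
}

interface IsolationProvider {
  readonly name: string;
  create(image: string): Promise<Sandbox>;
}

// In-memory stand-in; a real provider would start a container or VM here.
class FakeProvider implements IsolationProvider {
  readonly name = "fake";
  destroyed = 0;

  async create(image: string): Promise<Sandbox> {
    return {
      exec: async (command: string) => ({
        exitCode: 0,
        stdout: `[${image}] ${command}`,
      }),
      destroy: async () => {
        this.destroyed += 1;
      },
    };
  }
}

// The trial runner depends only on the interface, never on a backend.
async function runTrial(
  provider: IsolationProvider,
  image: string,
  command: string,
): Promise<string> {
  const sandbox = await provider.create(image);
  try {
    const result = await sandbox.exec(command);
    if (result.exitCode !== 0) throw new Error(`exit code ${result.exitCode}`);
    return result.stdout;
  } finally {
    // Always tear the sandbox down so trials cannot leak into each other.
    await sandbox.destroy();
  }
}

const fake = new FakeProvider();
const trialOutput = runTrial(fake, "trial:pinned", "echo hi");
trialOutput.then((out) => console.log(out)); // [trial:pinned] echo hi
```

Keeping the runner ignorant of the backend is what would let a new provider, or a new eval target, be swapped in without touching the grading path.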

Run it yourself

git clone https://github.com/natea/harness-eval && cd harness-eval
bun install && cp .env.example .env
bun run src/cli.ts run --candidates superpowers --trials 1 --provider docker
bun scripts/grade-trial.ts runs/<run-dir> superpowers-t1
bun run dashboard   # full interactive dashboard on localhost

Hosted evals coming

A hosted service for running your own framework/model evals — pick candidates, bring a spec, get a graded leaderboard — is on the roadmap.