Every run produces a deterministic 0-100 quality score. The scorecard measures how well the agent executed the spec, not just whether it finished.
## Scoring axes

The score is a weighted sum of four axes:
| Axis | Weight | What it measures |
|---|---|---|
| Completion | 40% | Did the agent finish the spec? |
| Error rate | 30% | Share of stages that passed on their first attempt |
| Latency | 20% | Wall time compared to the workspace's rolling p50 baseline |
| Resource efficiency | 10% | Memory and CPU usage relative to sandbox limits |
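The weighted sum can be sketched as follows. The weights come from the table above; the function name is illustrative, and the default of 50 for a missing axis reflects the fail-safe behavior described later on this page.

```python
# Hypothetical sketch of the weighted composite. Assumes each axis is
# already normalized to 0-100; names here are illustrative, not the real API.
WEIGHTS = {
    "completion": 0.40,
    "error_rate": 0.30,
    "latency": 0.20,
    "resource_efficiency": 0.10,
}

def composite_score(axes: dict[str, float]) -> int:
    """Weighted sum of the four axis scores, rounded to an integer 0-100.

    A missing axis falls back to 50, matching the fail-safe default.
    """
    total = sum(weight * axes.get(name, 50) for name, weight in WEIGHTS.items())
    return round(total)
```

Plugging in the axis values from the sample API response on this page (completion 100, error rate 85, latency 72, resource efficiency 93) yields 40 + 25.5 + 14.4 + 9.3 = 89.2, which rounds to the reported score of 89.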
### Completion (40%)
| Outcome | Score |
|---|---|
| COMPLETED — all gates pass | 100 |
| BLOCKED — partial progress with retries | 30 |
| FAILED — no usable output | 0 |
### Error rate (30%)
Calculated as the ratio of stages that passed on their first attempt to total stages executed. A run where lint and build pass immediately but test requires two repair loops scores lower than a clean sweep.
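The first-attempt ratio can be sketched like this; the stage record fields (`passed`, `attempts`) are illustrative assumptions, not the real schema.

```python
def error_rate_score(stages: list[dict]) -> float:
    """Share of stages that passed on their first attempt, scaled to 0-100.

    Hypothetical sketch: each stage dict carries a 'passed' flag and an
    'attempts' count (illustrative field names).
    """
    if not stages:
        return 50.0  # fail-safe default when no stage data is available
    first_try = sum(1 for s in stages if s["passed"] and s["attempts"] == 1)
    return 100.0 * first_try / len(stages)

# The example from the text: lint and build pass immediately,
# but test needs two repair loops before passing.
stages = [
    {"name": "lint", "passed": True, "attempts": 1},
    {"name": "build", "passed": True, "attempts": 1},
    {"name": "test", "passed": True, "attempts": 3},
]
```

Under this sketch the run scores roughly 66.7 on this axis, below the 100 a clean sweep would earn.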
### Latency (20%)
Measured against the workspace’s rolling p50 baseline. A run that finishes in half the typical time scores higher. A run that takes 3x the baseline scores near zero on this axis.
The baseline is computed from the last 20 completed runs in the same workspace.
### Resource efficiency (10%)
Measures how much memory and CPU the agent consumed during execution. An agent that does its job without hogging resources scores higher.
- Memory (70% of this axis) — how much of the available memory the agent used. Using very little scores close to 100. Maxing out the limit scores 0.
- CPU (30% of this axis) — how often the agent was slowed down by CPU limits. Running freely scores 100. Being constantly throttled scores 0.
This gives you early visibility into resource-heavy agents before they become a cost problem.
When resource data is not available (for example, during local development), this axis defaults to 50.
## Tier labels
Scores map to named tiers:
| Tier | Score range |
|---|---|
| Unranked | No score available |
| Bronze | 0 - 39 |
| Silver | 40 - 69 |
| Gold | 70 - 89 |
| Elite | 90 - 100 |
## Fail-safe design
Scoring is fail-safe. It never blocks a run and always produces a score, even if an axis cannot be computed. If a data source is unavailable (e.g., no baseline for latency), that axis defaults to 50/100 rather than failing the run.
## Analytics
Every scored run emits a PostHog event. The event payload includes the run ID, final score, per-axis breakdown, and tier label.
## Storage and API access
Scores persist in the `agent_run_scores` table and are queryable via the API:
```bash
curl https://api.usezombie.com/v1/runs/run_01JQ7K.../score \
  -H "Authorization: Bearer $ZOMBIE_TOKEN"
```

```json
{
  "run_id": "run_01JQ7K...",
  "score": 89,
  "tier": "Gold",
  "formula_version": "2",
  "axes": {
    "completion": 100,
    "error_rate": 85,
    "latency": 72,
    "resource_efficiency": 93
  }
}
```
## Future: competitive scoring
In a future release, multiple agents will be able to compete on the same spec. Each agent produces its own branch and scorecard. The highest-scoring agent’s PR is the one that gets opened for human review.