Every run produces a deterministic 0-100 quality score. The scorecard measures how well the agent executed the spec, not just whether it finished.
## Scoring axes

The score is a weighted sum of four axes:
| Axis | Weight | What it measures |
|---|---|---|
| Completion | 40% | Did the agent finish the spec? |
| Error rate | 30% | Share of stages that passed on their first attempt |
| Latency | 20% | Wall time compared to the workspace's rolling p50 baseline |
| Resource efficiency | 10% | Memory and CPU usage relative to sandbox limits |
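The weighted sum can be sketched as follows. The weights come from the table above; the function name is illustrative, and the default of 50 for a missing axis reflects the fail-safe behavior described later on this page.

```python
# Hypothetical sketch of the weighted composite. Assumes each axis is
# already normalized to 0-100; names here are illustrative, not the real API.
WEIGHTS = {
    "completion": 0.40,
    "error_rate": 0.30,
    "latency": 0.20,
    "resource_efficiency": 0.10,
}

def composite_score(axes: dict[str, float]) -> int:
    """Weighted sum of the four axis scores, rounded to an integer 0-100.

    A missing axis falls back to 50, matching the fail-safe default.
    """
    total = sum(weight * axes.get(name, 50) for name, weight in WEIGHTS.items())
    return round(total)
```

Plugging in the axis values from the sample API response on this page (completion 100, error rate 85, latency 72, resource efficiency 93) yields 40 + 25.5 + 14.4 + 9.3 = 89.2, which rounds to the reported score of 89.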
### Completion (40%)
| Outcome | Score |
|---|---|
| COMPLETED — all gates pass | 100 |
| BLOCKED — partial progress with retries | 30 |
| FAILED — no usable output | 0 |
### Error rate (30%)
Calculated as the ratio of stages that passed on their first attempt to total stages executed. A run where lint and build pass immediately but test requires two repair loops scores lower than a clean sweep.
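The first-attempt ratio can be sketched like this; the stage record fields (`passed`, `attempts`) are illustrative assumptions, not the real schema.

```python
def error_rate_score(stages: list[dict]) -> float:
    """Share of stages that passed on their first attempt, scaled to 0-100.

    Hypothetical sketch: each stage dict carries a 'passed' flag and an
    'attempts' count (illustrative field names).
    """
    if not stages:
        return 50.0  # fail-safe default when no stage data is available
    first_try = sum(1 for s in stages if s["passed"] and s["attempts"] == 1)
    return 100.0 * first_try / len(stages)

# The example from the text: lint and build pass immediately,
# but test needs two repair loops before passing.
stages = [
    {"name": "lint", "passed": True, "attempts": 1},
    {"name": "build", "passed": True, "attempts": 1},
    {"name": "test", "passed": True, "attempts": 3},
]
```

Under this sketch the run scores roughly 66.7 on this axis, below the 100 a clean sweep would earn.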
### Latency (20%)
Measured against the workspace’s rolling p50 baseline. A run that finishes in half the typical time scores higher. A run that takes 3x the baseline scores near zero on this axis.
The baseline is computed from the last 20 completed runs in the same workspace.
### Resource efficiency (10%)
Measures how much memory and CPU the agent consumed during execution. An agent that does its job without hogging resources scores higher.
- Memory (70% of this axis) — how much of the available memory the agent used. Using very little scores close to 100. Maxing out the limit scores 0.
- CPU (30% of this axis) — how often the agent was slowed down by CPU limits. Running freely scores 100. Being constantly throttled scores 0.
This gives you early visibility into resource-heavy agents before they become a cost problem.
When resource data is not available (for example, during local development), this axis defaults to 50.
## Tier labels
Scores map to named tiers:
| Tier | Score range |
|---|---|
| Unranked | No score available |
| Bronze | 0 - 39 |
| Silver | 40 - 69 |
| Gold | 70 - 89 |
| Elite | 90 - 100 |
## Fail-safe design
Scoring is fail-safe. It never blocks a run and always produces a score, even if an axis cannot be computed. If a data source is unavailable (e.g., no baseline for latency), that axis defaults to 50/100 rather than failing the run.
## Analytics
Every scored run emits a PostHog event. The event payload includes the run ID, final score, per-axis breakdown, and tier label.
## Storage and API access
Scores persist in the `agent_run_scores` table and are queryable via the API:
```bash
curl https://api.usezombie.com/v1/runs/run_01JQ7K.../score \
  -H "Authorization: Bearer $ZOMBIE_TOKEN"
```

```json
{
  "run_id": "run_01JQ7K...",
  "score": 89,
  "tier": "Gold",
  "formula_version": "2",
  "axes": {
    "completion": 100,
    "error_rate": 85,
    "latency": 72,
    "resource_efficiency": 93
  }
}
```
## Future: competitive scoring
In a future release, multiple agents will be able to compete on the same spec. Each agent produces its own branch and scorecard. The highest-scoring agent’s PR is the one that gets opened for human review.