Every run produces a deterministic 0-100 quality score. The scorecard measures how well the agent executed the spec, not just whether it finished.

Scoring axes

The score is a weighted sum of four axes:
| Axis | Weight | What it measures |
|---|---|---|
| Completion | 40% | Did the agent finish the spec? |
| Error rate | 30% | Share of stages that passed on their first attempt |
| Latency | 20% | Wall time compared to the workspace's rolling p50 baseline |
| Resource efficiency | 10% | Memory and CPU usage relative to sandbox limits |
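A minimal sketch of the weighted sum, assuming each axis has already been normalized to 0-100 (the function name and dictionary shape are illustrative, not the actual implementation):

```python
# Axis weights from the table above.
WEIGHTS = {
    "completion": 0.40,
    "error_rate": 0.30,
    "latency": 0.20,
    "resource_efficiency": 0.10,
}

def overall_score(axes: dict[str, float]) -> int:
    """Combine per-axis scores (each 0-100) into a single 0-100 score."""
    total = sum(WEIGHTS[name] * axes[name] for name in WEIGHTS)
    return round(total)

# Using the per-axis breakdown from the sample API response below:
overall_score({
    "completion": 100,
    "error_rate": 85,
    "latency": 72,
    "resource_efficiency": 93,
})  # → 89
```

Note that the sample breakdown reproduces the sample's overall score of 89: 0.4·100 + 0.3·85 + 0.2·72 + 0.1·93 = 89.2, rounded to 89.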

Completion (40%)

| Outcome | Score |
|---|---|
| COMPLETED — all gates pass | 100 |
| BLOCKED — partial progress with retries | 30 |
| FAILED — no usable output | 0 |

Error rate (30%)

Calculated as the ratio of stages that passed on their first attempt to total stages executed. A run where lint and build pass immediately but test requires two repair loops scores lower than a clean sweep.
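Under that definition, the axis can be sketched as follows (a hypothetical helper with an assumed stage-record shape, not the actual implementation):

```python
def error_rate_score(stages: list[dict]) -> float:
    """Percentage of stages that passed on their first attempt (0-100)."""
    if not stages:
        return 100.0
    first_try = sum(1 for s in stages if s["passed"] and s["attempts"] == 1)
    return 100.0 * first_try / len(stages)

# lint and build pass immediately; test needs two repair loops (3 attempts):
stages = [
    {"name": "lint",  "passed": True, "attempts": 1},
    {"name": "build", "passed": True, "attempts": 1},
    {"name": "test",  "passed": True, "attempts": 3},
]
error_rate_score(stages)  # ≈ 66.7, versus 100.0 for a clean sweep
```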

Latency (20%)

Measured against the workspace’s rolling p50 baseline. A run that finishes in half the typical time scores higher. A run that takes 3x the baseline scores near zero on this axis. The baseline is computed from the last 20 completed runs in the same workspace.
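The exact mapping is not specified here; one plausible sketch consistent with the stated behavior (faster than baseline scores higher, 3x the baseline scores near zero) is a linear falloff:

```python
def latency_score(wall_time_s: float, baseline_p50_s: float) -> float:
    """Illustrative linear mapping: 100 at zero wall time, 0 at 3x baseline.

    The real formula may differ; this only matches the documented shape.
    """
    ratio = wall_time_s / baseline_p50_s
    return max(0.0, min(100.0, 100.0 * (3.0 - ratio) / 3.0))
```

With this mapping a run at half the baseline scores about 83, a run at the baseline about 67, and anything at 3x the baseline or slower scores 0.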

Resource efficiency (10%)

Measures how much memory and CPU the agent consumed during execution. An agent that does its job without hogging resources scores higher.
  • Memory (70% of this axis) — how much of the available memory the agent used. Using very little scores close to 100. Maxing out the limit scores 0.
  • CPU (30% of this axis) — how often the agent was slowed down by CPU limits. Running freely scores 100. Being constantly throttled scores 0.
This gives you early visibility into resource-heavy agents before they become a cost problem.
When resource data is not available (for example, during local development), this axis defaults to 50.

Tier labels

Scores map to named tiers:
| Tier | Score range |
|---|---|
| Unranked | No score available |
| Bronze | 0-39 |
| Silver | 40-69 |
| Gold | 70-89 |
| Elite | 90-100 |

Fail-safe design

Scoring is fail-safe. It never blocks a run and always produces a score, even if an axis cannot be computed. If a data source is unavailable (e.g., no baseline for latency), that axis defaults to 50/100 rather than failing the run.
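One way to sketch that guarantee is a wrapper around each axis computation (illustrative only; the actual implementation is not shown here):

```python
DEFAULT_AXIS_SCORE = 50.0

def safe_axis(compute, *args) -> float:
    """Never let a missing data source fail the run; fall back to 50/100."""
    try:
        value = compute(*args)
        return DEFAULT_AXIS_SCORE if value is None else value
    except Exception:
        return DEFAULT_AXIS_SCORE

safe_axis(lambda: None)  # → 50.0 (e.g. no latency baseline yet)
safe_axis(lambda: 72.0)  # → 72.0 (axis computed normally)
```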

Analytics

Every scored run emits a PostHog event:
agent.run.scored
The event payload includes the run ID, final score, per-axis breakdown, and tier label.

Storage and API access

Scores persist in the agent_run_scores table and are queryable via the API:
curl https://api.usezombie.com/v1/runs/run_01JQ7K.../score \
  -H "Authorization: Bearer $ZOMBIE_TOKEN"
{
  "run_id": "run_01JQ7K...",
  "score": 89,
  "tier": "Gold",
  "formula_version": "2",
  "axes": {
    "completion": 100,
    "error_rate": 85,
    "latency": 72,
    "resource_efficiency": 93
  }
}

Future: competitive scoring

In a future release, multiple agents will be able to compete on the same spec. Each agent produces its own branch and scorecard. The highest-scoring agent’s PR is the one that gets opened for human review.