Overview

The zombied control plane exposes Prometheus text-format metrics at GET /metrics. All metric names are prefixed with zombie_, with one exception (zombied_run_limit_exceeded_total, noted below). The endpoint is open to any scraper, so put it behind an authenticated gateway if it is publicly reachable.

Labels are kept low-cardinality. The per-workspace token counter is keyed on both workspace_id and zombie_id and is capped at 4096 distinct pairs; overflow routes to workspace_id="_other",zombie_id="_other" and is counted in zombie_workspace_metrics_overflow_total. Do not add workspace or zombie labels to other metrics without going through the per-workspace helper.
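
As a rough illustration of consuming the endpoint, the sketch below parses a Prometheus text-format payload and checks the naming convention described above. The sample payload and its values are invented for illustration, and the parser only handles simple `name{labels} value` lines (no escaped characters, no trailing timestamps). A real scraper should rely on a Prometheus client library instead.

```python
import re

# Hypothetical sample of what GET /metrics might return; values are invented.
SAMPLE = """\
# HELP zombie_triggered_total Inbound zombie webhook triggers.
# TYPE zombie_triggered_total counter
zombie_triggered_total 42
zombied_run_limit_exceeded_total{reason="wall_time"} 3
zombie_agent_tokens_by_workspace_total{workspace_id="_other",zombie_id="_other"} 17
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return {(name, sorted label tuple): float value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


def has_expected_prefix(name):
    """Names use zombie_, except the documented zombied_ run-limit counter."""
    return name.startswith("zombie_") or name == "zombied_run_limit_exceeded_total"


parsed = parse_metrics(SAMPLE)
```

A prefix check of this kind could sit in a smoke test to catch an accidentally misnamed metric before it reaches dashboards.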

Zombie lifecycle

End-to-end counters and the wall-time histogram for inbound event delivery.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_triggered_total | Counter | Inbound zombie webhook triggers that passed signature + dedupe and were enqueued. |
| zombie_completed_total | Counter | Zombie events delivered to completion (exit status OK). |
| zombie_failed_total | Counter | Zombie event delivery failures (runtime error, budget exhaustion, etc.). |
| zombie_tokens_total | Counter | Total LLM tokens consumed across all zombie deliveries. |
| zombie_execution_seconds | Histogram | End-to-end zombie event wall-time in seconds. Buckets range from sub-second through multi-minute; see ZombieDurationBucketsMs in src/observability/metrics_zombie.zig. |

Agent execution

Agent-level token and latency metrics, recorded regardless of which zombie the agent is serving.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_agent_tokens_total | Counter | LLM tokens consumed by direct agent calls. |
| zombie_agent_duration_seconds | Histogram | Wall-clock duration of individual agent calls. Buckets: 1, 3, 5, 10, 30, 60, 120, 300 seconds. |
| zombie_backoff_wait_ms_total | Counter | Cumulative time, in milliseconds, spent in backoff/retry sleeps across all external calls. |
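
To make the bucket list for zombie_agent_duration_seconds concrete, here is a minimal sketch of how observations land in cumulative Prometheus buckets. It follows the standard Prometheus histogram convention (each `le` bucket counts every observation less than or equal to its bound, plus an implicit +Inf bucket); the function name and sample durations are illustrative and not taken from the zombied source.

```python
import bisect

# Bucket upper bounds in seconds, as listed for zombie_agent_duration_seconds.
BOUNDS = [1, 3, 5, 10, 30, 60, 120, 300]


def observe(durations):
    """Return cumulative bucket counts keyed by `le`, plus sum and count."""
    per_bucket = [0] * (len(BOUNDS) + 1)  # last slot is the +Inf bucket
    for d in durations:
        # bisect_left finds the first bound >= d, i.e. the smallest le that fits.
        per_bucket[bisect.bisect_left(BOUNDS, d)] += 1
    cumulative, running = {}, 0
    for bound, n in zip(BOUNDS + [float("inf")], per_bucket):
        running += n
        cumulative[bound] = running
    return cumulative, sum(durations), len(durations)


# Invented sample durations: three fast calls, one slow call, one past 300 s.
buckets, total, count = observe([0.4, 2.5, 3.0, 45.0, 400.0])
```

Calls longer than 300 seconds are only visible in the +Inf bucket, so quantile estimates for very slow agent calls are capped at the highest finite bound.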

External-call retry classification

Every outbound external call the agent makes is wrapped by a classifier that tags retries and terminal failures with a reason. These counters drive the retry-effectiveness dashboards.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_external_retries_total | Counter | Total retry attempts inside external side-effect wrappers. |
| zombie_external_retries_rate_limited_total | Counter | Retries classified as rate-limited. |
| zombie_external_retries_timeout_total | Counter | Retries classified as timeout. |
| zombie_external_retries_context_exhausted_total | Counter | Retries classified as context-window exhausted. |
| zombie_external_retries_auth_total | Counter | Retries classified as auth failure. |
| zombie_external_retries_invalid_request_total | Counter | Retries classified as invalid request. |
| zombie_external_retries_server_error_total | Counter | Retries classified as upstream server error. |
| zombie_external_retries_unknown_total | Counter | Retries that did not match any known class. |
| zombie_external_failures_total | Counter | External calls that exited the wrapper as a classified failure. |
| zombie_external_failures_{rate_limited,timeout,context_exhausted,auth,invalid_request,server_error,unknown}_total | Counter | Per-class terminal failures; same categories as retries. |
| zombie_retry_after_hints_total | Counter | Retry attempts that honored an upstream Retry-After hint rather than the wrapper’s own backoff. |
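
The brace shorthand in the failures row stands for seven separate counters. The sketch below expands it into concrete metric names and computes each class's share of a set of counts; the helper names and the sample counts are illustrative only.

```python
# Retry/failure classes exactly as listed in the table above.
CLASSES = [
    "rate_limited", "timeout", "context_exhausted", "auth",
    "invalid_request", "server_error", "unknown",
]


def per_class_metric_names(kind):
    """Expand zombie_external_<kind>_{...}_total into concrete metric names.

    `kind` is "retries" or "failures".
    """
    if kind not in ("retries", "failures"):
        raise ValueError(f"unknown kind: {kind!r}")
    return [f"zombie_external_{kind}_{cls}_total" for cls in CLASSES]


def class_shares(counts):
    """Return each class's fraction of the summed counts (0.0 if all are zero)."""
    total = sum(counts.values())
    return {cls: (n / total if total else 0.0) for cls, n in counts.items()}


failure_metrics = per_class_metric_names("failures")
shares = class_shares({"timeout": 3, "auth": 1})  # invented counts
```

Generating the names from one list keeps dashboard panels for retries and failures aligned on the same set of classes.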

Execution limits

zombied_run_limit_exceeded_total (note the zombied_ prefix, not zombie_) is a single counter with a reason label.
| Metric | Type | Description |
| --- | --- | --- |
| zombied_run_limit_exceeded_total{reason="token_budget"} | Counter | Events terminated by monthly token-budget exhaustion. |
| zombied_run_limit_exceeded_total{reason="wall_time"} | Counter | Events terminated by wall-time limit. |
| zombied_run_limit_exceeded_total{reason="repair_loops"} | Counter | Events terminated because the repair loop exhausted its max iterations. |

Gate repair

Metrics for the approval-gate repair loops that the worker runs when a gate claim or lease is disputed or stale.

| Metric | Type | Description |
| --- | --- | --- |
| zombie_gate_repair_exhausted_total | Counter | Gate-repair runs that hit the iteration cap without converging. |

Side-effect outbox

Dead-letter counter and liveness gauge for the reconciler daemon.

| Metric | Type | Description |
| --- | --- | --- |
| zombie_side_effect_outbox_dead_letter_total | Counter | Outbox rows dead-lettered by reconciliation (stuck-pending cleanup). |
| zombie_reconcile_running | Gauge | 1 if the reconcile daemon is live; 0 otherwise. |

API server liveness

| Metric | Type | Description |
| --- | --- | --- |
| zombie_api_in_flight_requests | Gauge | Current in-flight API requests under the backpressure guard. |
| zombie_api_backpressure_rejections_total | Counter | Requests rejected by the in-flight-count backpressure guard. |
| zombie_worker_running | Gauge | 1 if a worker process is up; 0 otherwise. |

OTEL export

Health of the outbound OTEL exporter that forwards metrics to an external collector. The OTLP/HTTP JSON exporter (src/observability/otel_export.zig) forwards counters, gauges, and histograms as OTLP data points. The three histograms (zombie_execution_seconds, zombie_agent_duration_seconds, and zombie_executor_agent_duration_seconds) arrive at the collector as OTLP histogram data points with cumulative temporality, explicit bucket bounds, and the sum and count fields, so both Prometheus scrapers and OTLP collectors receive histogram data.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_otel_export_total | Counter | Total OTEL export attempts. |
| zombie_otel_export_failed_total | Counter | Total OTEL export failures. |
| zombie_otel_last_success_at_ms | Gauge | Unix-ms timestamp of the last successful OTEL export. 0 if no success recorded this process lifetime. |
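
Because zombie_otel_last_success_at_ms is a timestamp rather than a duration, a consumer has to turn it into an age and treat 0 as "never succeeded". The following is a minimal sketch of that check; the function names and the 300-second threshold are assumptions chosen for illustration, not documented defaults.

```python
import time


def otel_export_age_seconds(last_success_at_ms, now_ms=None):
    """Seconds since the last successful export, or None if none was recorded.

    A gauge value of 0 means no success was recorded in this process lifetime.
    """
    if last_success_at_ms == 0:
        return None
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so small clock skew never yields a negative age.
    return max(0.0, (now_ms - last_success_at_ms) / 1000.0)


def is_stale(last_success_at_ms, now_ms, max_age_seconds=300):
    """True when exports never succeeded or the last success is too old.

    max_age_seconds=300 is an illustrative threshold, not a documented default.
    """
    age = otel_export_age_seconds(last_success_at_ms, now_ms)
    return age is None or age > max_age_seconds
```

Handling the 0 case explicitly matters: subtracting 0 from the current time would otherwise report an age of several decades rather than "no export yet".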

Executor (NullClaw sessions)

Sandbox session + resource-limit metrics from the executor boundary.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_executor_sessions_created_total | Counter | Executor sessions created. |
| zombie_executor_sessions_active | Gauge | Currently-active executor sessions. |
| zombie_executor_failures_total | Counter | Executor RPC failures (transport-level). |
| zombie_executor_oom_kills_total | Counter | Executor processes killed by the cgroup memory limit. |
| zombie_executor_timeout_kills_total | Counter | Executor processes killed by the wall-time limit. |
| zombie_executor_landlock_denials_total | Counter | Filesystem access attempts denied by Landlock. |
| zombie_executor_resource_kills_total | Counter | Executor processes killed by any resource guard (CPU, disk, etc.). |
| zombie_executor_lease_expired_total | Counter | Executor leases that expired without a renewal (orphan cleanup fired). |
| zombie_executor_cancellations_total | Counter | Executor sessions cancelled (kill-switch, user-initiated). |
| zombie_executor_cpu_throttled_ms_total | Counter | Cumulative cgroup CPU-throttling time, in milliseconds, across executor sessions. |
| zombie_executor_memory_peak_bytes | Gauge | Peak RSS, in bytes, observed across executor sessions. |
| zombie_executor_stages_started_total | Counter | NullClaw stage invocations started inside executor sessions. |
| zombie_executor_stages_completed_total | Counter | NullClaw stage invocations that completed successfully. |
| zombie_executor_stages_failed_total | Counter | NullClaw stage invocations that failed. |
| zombie_executor_agent_tokens_total | Counter | LLM tokens consumed by agent calls inside executor sessions. |
| zombie_executor_agent_duration_seconds | Histogram | Wall-time of agent calls inside executor sessions. |

Signup funnel

Clerk-driven signup delivery counters emitted by POST /v1/webhooks/clerk. The signup_bootstrapped / signup_replayed split lets funnel dashboards distinguish first-delivery signups from idempotent retries.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_signup_bootstrapped_total | Counter | Clerk webhooks that provisioned a fresh personal account (tenant + user + owner membership + default workspace + 0-cent credit state). |
| zombie_signup_replayed_total | Counter | Clerk webhooks that resolved to an existing account via the oidc_subject fast-path: idempotent replay, no new rows. |
| zombie_signup_failed_total{reason="bad_sig"} | Counter | Signup webhooks rejected with 401 UZ-WH-010 for invalid Svix signature. |
| zombie_signup_failed_total{reason="stale_ts"} | Counter | Signup webhooks rejected with 401 UZ-WH-011 for timestamp outside the 5-minute freshness window. |
| zombie_signup_failed_total{reason="missing_email"} | Counter | Signup webhooks rejected with 400 UZ-REQ-001 for malformed JSON or missing primary email. |
| zombie_signup_failed_total{reason="db_error"} | Counter | Signup webhooks that reached the transactional bootstrap and rolled back on DB error. |
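
The failure reasons above map onto HTTP statuses and error codes, and the stale_ts case depends on a 5-minute window. The sketch below collects that mapping in one place and shows one plausible freshness check. Treating the window as symmetric around the server clock is an assumption, the function names are hypothetical, and db_error maps to None because this page does not document a status or code for it.

```python
FRESHNESS_WINDOW_SECONDS = 5 * 60  # the 5-minute window from the table above

# reason label -> (HTTP status, error code), as documented in the table.
# db_error has no documented status or code on this page, so it maps to None.
SIGNUP_FAILURES = {
    "bad_sig": (401, "UZ-WH-010"),
    "stale_ts": (401, "UZ-WH-011"),
    "missing_email": (400, "UZ-REQ-001"),
    "db_error": None,
}


def is_fresh(webhook_ts, now_ts, window=FRESHNESS_WINDOW_SECONDS):
    """Assumption: timestamps too far in the past or the future are both stale."""
    return abs(now_ts - webhook_ts) <= window


def failure_reason_for_timestamp(webhook_ts, now_ts):
    """Return "stale_ts" when the timestamp is outside the window, else None."""
    return None if is_fresh(webhook_ts, now_ts) else "stale_ts"
```

A lookup table like SIGNUP_FAILURES can help when correlating a spike in one reason label with the matching error code in request logs.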

Per-workspace + per-zombie metrics

The token counter is keyed on both workspace_id and zombie_id. The first 4096 distinct (workspace_id, zombie_id) pairs each get their own label set; every subsequent pair routes into workspace_id="_other",zombie_id="_other". zombie_workspace_metrics_overflow_total counts overflow increments — alert on sustained non-zero growth. The counter fires from the live zombie delivery path (src/zombie/event_loop_helpers.zig) on every successful completion.
| Metric | Type | Description |
| --- | --- | --- |
| zombie_agent_tokens_by_workspace_total{workspace_id,zombie_id} | Counter | Tokens consumed per (workspace, zombie) pair. Bounded to 4096 pairs; overflow counted in zombie_workspace_metrics_overflow_total. |
| zombie_workspace_metrics_overflow_total | Counter | Increments routed to _other because the 4096-slot table is full. |
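
The overflow behaviour described above can be pictured as a bounded table. What follows is a minimal sketch of that idea, not the actual helper in the zombied source: the class name and fields are hypothetical, and it assumes the _other series does not itself occupy one of the 4096 slots.

```python
MAX_PAIRS = 4096


class WorkspaceTokenCounter:
    """Token counter keyed on (workspace_id, zombie_id) with a hard slot cap."""

    def __init__(self, max_pairs=MAX_PAIRS):
        self.max_pairs = max_pairs
        self.tokens = {}         # (workspace_id, zombie_id) -> token total
        self.other_tokens = 0    # the workspace_id="_other",zombie_id="_other" series
        self.overflow_total = 0  # mirrors zombie_workspace_metrics_overflow_total

    def add(self, workspace_id, zombie_id, tokens):
        key = (workspace_id, zombie_id)
        if key in self.tokens or len(self.tokens) < self.max_pairs:
            # Known pair, or a free slot is still available for a new pair.
            self.tokens[key] = self.tokens.get(key, 0) + tokens
        else:
            # Table is full: route to _other and count the overflow increment.
            self.other_tokens += tokens
            self.overflow_total += 1


# Tiny cap so the overflow path is easy to see.
counter = WorkspaceTokenCounter(max_pairs=2)
counter.add("ws1", "z1", 10)
counter.add("ws1", "z2", 5)
counter.add("ws2", "z1", 7)  # third distinct pair: overflows into _other
counter.add("ws1", "z1", 1)  # existing pair: still tracked individually
```

Pairs that already hold a slot keep their own series after the table fills; only pairs seen for the first time after that point lose per-pair attribution.
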
Example PromQL:
# Top-10 zombies by token spend over the last day
topk(10, sum by (workspace_id, zombie_id) (rate(zombie_agent_tokens_by_workspace_total[1d])))

# Per-workspace token consumption rate (tokens/second), averaged over 5 minutes
sum by (workspace_id) (rate(zombie_agent_tokens_by_workspace_total[5m]))

Labels

Labels deliberately used across metrics:
| Label | Applied to | Values |
| --- | --- | --- |
| reason | zombied_run_limit_exceeded_total, zombie_signup_failed_total | Lifecycle-specific; see each metric. |
| workspace_id | zombie_agent_tokens_by_workspace_total | UUIDv7 or _other on overflow. |
| zombie_id | zombie_agent_tokens_by_workspace_total | Zombie instance id or _other on overflow. |
| le | Histogram buckets (*_seconds_bucket) | Bucket upper-bound. |

Grafana dashboard queries

Real PromQL queries that run against the emitted metric names:
# Zombie trigger → completion rate, last hour
rate(zombie_completed_total[1h]) / rate(zombie_triggered_total[1h])

# P95 end-to-end zombie event wall-time, last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(zombie_execution_seconds_bucket[5m])))

# External retry exhaustion (any reason)
rate(zombie_external_failures_total[5m])

# OOM kill rate at the executor boundary
rate(zombie_executor_oom_kills_total[5m])

# Signup funnel: fresh vs replay
rate(zombie_signup_bootstrapped_total[1h])
rate(zombie_signup_replayed_total[1h])

# Signup failure mix
sum by (reason) (rate(zombie_signup_failed_total[1h]))

# Run limit terminations by reason
sum by (reason) (rate(zombied_run_limit_exceeded_total[1h]))

# Workspace + zombie cardinality overflow — alert if this is non-zero for long
rate(zombie_workspace_metrics_overflow_total[5m])

# Top-N zombies by token spend over the last hour
topk(10, sum by (workspace_id, zombie_id) (rate(zombie_agent_tokens_by_workspace_total[1h])))
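
For readers wondering what the P95 query above actually computes, here is a simplified sketch of the linear interpolation that histogram_quantile applies to classic cumulative buckets. It is an approximation for intuition only (it ignores edge cases such as negative bounds and non-monotonic counts), and the example bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in +Inf: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Invented cumulative counts: (le, count) pairs ending with +Inf.
EXAMPLE = [(1, 1), (3, 3), (5, 3), (60, 4), (math.inf, 5)]
median = histogram_quantile(0.5, EXAMPLE)
p95 = histogram_quantile(0.95, EXAMPLE)
```

Because the estimate is interpolated within a bucket, its precision is limited by bucket width, and any quantile that lands in the +Inf bucket is reported as the highest finite bound.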

Emission source

The authoritative Prometheus text is rendered by src/observability/metrics_render.zig in the usezombie/usezombie repo. Any time a new appendMetric(...) or histogram emission lands there, update this page in the same PR — CI enforces router ↔ openapi parity but does not yet diff this page against the renderer, so drift is caught only at review time.