Overview
The zombied control plane exposes Prometheus text-format metrics at GET /metrics. All metric names are prefixed with zombie_ (with one exception, zombied_run_limit_exceeded_total, noted below). The exporter is open to any scraper, so put it behind an authenticated gateway if the endpoint is publicly reachable.
Labels are kept low-cardinality. The per-workspace tokens counter is keyed on both workspace_id and zombie_id; overflow (4096 distinct pairs max) routes to workspace_id="_other",zombie_id="_other" and is counted in zombie_workspace_metrics_overflow_total. Do not add workspace or zombie labels to other metrics without going through the per-workspace helper.
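As a rough illustration of consuming the endpoint, the sketch below parses Prometheus text-format lines and lists any metric name that carries neither the zombie_ nor the zombied_ prefix. The helper names (parse_sample, unexpected_prefixes) and the sample payload, including its values and the deliberately foreign process_cpu_seconds_total line, are invented for illustration and are not taken from zombied.

```python
import re

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

EXPECTED_PREFIXES = ("zombie_", "zombied_")


def parse_sample(line):
    """Parse one exposition line into (name, labels, value); None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


def unexpected_prefixes(text):
    """Return the sorted metric names that use neither expected prefix."""
    names = set()
    for line in text.splitlines():
        sample = parse_sample(line)
        if sample is not None and not sample[0].startswith(EXPECTED_PREFIXES):
            names.add(sample[0])
    return sorted(names)


# Hypothetical scrape body; the values are made up.
SAMPLE = """\
# TYPE zombie_triggered_total counter
zombie_triggered_total 12
zombied_run_limit_exceeded_total{reason="wall_time"} 3
process_cpu_seconds_total 1.5
"""

if __name__ == "__main__":
    print(unexpected_prefixes(SAMPLE))
```

In practice the text would come from an HTTP GET against /metrics through whatever authenticated gateway fronts it; the parsing step stays the same.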
Zombie lifecycle
End-to-end counters and the wall-time histogram for inbound event delivery.
| Metric | Type | Description |
|---|---|---|
| zombie_triggered_total | Counter | Inbound zombie webhook triggers that passed signature + dedupe and were enqueued. |
| zombie_completed_total | Counter | Zombie events delivered to completion (exit status OK). |
| zombie_failed_total | Counter | Zombie event delivery failures (runtime error, budget exhaustion, etc.). |
| zombie_tokens_total | Counter | Total LLM tokens consumed across all zombie deliveries. |
| zombie_execution_seconds | Histogram | End-to-end zombie event wall-time in seconds. Buckets range from sub-second through multi-minute; see ZombieDurationBucketsMs in src/observability/metrics_zombie.zig. |
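One way a dashboard might combine the lifecycle counters is a delivery failure ratio over a scrape window. The sketch below assumes every finished event increments exactly one of zombie_completed_total or zombie_failed_total, which this page does not state explicitly; the function names are hypothetical.

```python
def counter_delta(previous, current):
    """Increase between two scrapes; a drop is treated as a counter reset (restart)."""
    return current if current < previous else current - previous


def failure_ratio(completed_delta, failed_delta):
    """Fraction of finished deliveries that failed in the window; None if none finished."""
    finished = completed_delta + failed_delta
    if finished == 0:
        return None
    return failed_delta / finished


if __name__ == "__main__":
    # Made-up scrape values for zombie_completed_total and zombie_failed_total.
    completed = counter_delta(previous=400, current=490)
    failed = counter_delta(previous=20, current=30)
    print(failure_ratio(completed, failed))
```

Working on deltas rather than raw totals keeps the ratio meaningful across process restarts, when the counters start again from zero.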
Agent execution
Agent-level token + latency metrics. Applied regardless of which zombie the agent is serving.
| Metric | Type | Description |
|---|---|---|
| zombie_agent_tokens_total | Counter | LLM tokens consumed by direct agent calls. |
| zombie_agent_duration_seconds | Histogram | Wall-clock duration of individual agent calls. Buckets: 1, 3, 5, 10, 30, 60, 120, 300 seconds. |
| zombie_backoff_wait_ms_total | Counter | Cumulative time spent in backoff/retry sleeps across all external calls. |
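To make the bucket list for zombie_agent_duration_seconds concrete, the following sketch shows how observations would fall into cumulative Prometheus buckets with those upper bounds. It follows the general Prometheus histogram convention (le means less than or equal, plus a +Inf bucket); it is not a copy of the zombied implementation, and cumulative_buckets is a hypothetical name.

```python
from bisect import bisect_left

# Upper bounds listed for zombie_agent_duration_seconds, in seconds.
AGENT_DURATION_BOUNDS = [1, 3, 5, 10, 30, 60, 120, 300]


def cumulative_buckets(observations, bounds=AGENT_DURATION_BOUNDS):
    """Return cumulative counts keyed by the `le` label value, including +Inf."""
    bounds = list(bounds)
    per_bucket = [0] * (len(bounds) + 1)  # last slot catches values above every bound
    for seconds in observations:
        # bisect_left returns the index of the first bound >= seconds,
        # so a value equal to a bound lands in that bound's bucket.
        per_bucket[bisect_left(bounds, seconds)] += 1

    cumulative = {}
    running = 0
    for bound, count in zip(bounds + [float("inf")], per_bucket):
        running += count
        key = "+Inf" if bound == float("inf") else str(bound)
        cumulative[key] = running
    return cumulative


if __name__ == "__main__":
    # Made-up call durations in seconds.
    print(cumulative_buckets([0.4, 2.5, 45, 400]))
```

A call that takes longer than 300 seconds is only visible in the +Inf bucket, so quantile estimates above that bound are not meaningful with this layout.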
External-call retry classification
Every outbound external call the agent makes is wrapped by a classifier that tags retries and terminal failures with a reason. These counters drive the retry-effectiveness dashboards.
| Metric | Type | Description |
|---|---|---|
| zombie_external_retries_total | Counter | Total retry attempts inside external side-effect wrappers. |
| zombie_external_retries_rate_limited_total | Counter | Retries classified as rate-limited. |
| zombie_external_retries_timeout_total | Counter | Retries classified as timeout. |
| zombie_external_retries_context_exhausted_total | Counter | Retries classified as context-window exhausted. |
| zombie_external_retries_auth_total | Counter | Retries classified as auth failure. |
| zombie_external_retries_invalid_request_total | Counter | Retries classified as invalid request. |
| zombie_external_retries_server_error_total | Counter | Retries classified as upstream server error. |
| zombie_external_retries_unknown_total | Counter | Retries that did not match any known class. |
| zombie_external_failures_total | Counter | External calls that exited the wrapper as a classified failure. |
| zombie_external_failures_{rate_limited,timeout,context_exhausted,auth,invalid_request,server_error,unknown}_total | Counter | Per-class terminal failures; same categories as retries. |
| zombie_retry_after_hints_total | Counter | Retry attempts that honored an upstream Retry-After hint rather than the wrapper’s own backoff. |
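A retry-effectiveness panel could break the total down by class. The sketch below computes each class's share of zombie_external_retries_total from a dict of scraped samples. Whether the per-class counters are mutually exclusive and sum to the total is not stated on this page, so the code does not rely on it; retry_class_shares and the sample values are hypothetical.

```python
RETRY_CLASSES = [
    "rate_limited",
    "timeout",
    "context_exhausted",
    "auth",
    "invalid_request",
    "server_error",
    "unknown",
]


def retry_class_shares(samples):
    """Map each retry class to its share of zombie_external_retries_total.

    `samples` maps metric name to value; missing metrics count as 0.
    All shares are 0.0 when the total is 0.
    """
    total = samples.get("zombie_external_retries_total", 0)
    shares = {}
    for cls in RETRY_CLASSES:
        count = samples.get(f"zombie_external_retries_{cls}_total", 0)
        shares[cls] = count / total if total else 0.0
    return shares


if __name__ == "__main__":
    # Made-up scrape values.
    scraped = {
        "zombie_external_retries_total": 20,
        "zombie_external_retries_rate_limited_total": 10,
        "zombie_external_retries_timeout_total": 5,
        "zombie_external_retries_unknown_total": 5,
    }
    for cls, share in retry_class_shares(scraped).items():
        print(f"{cls}: {share:.0%}")
```

The same function works for the terminal-failure counters if the metric-name prefix is swapped to zombie_external_failures_.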
Execution limits
zombied_run_limit_exceeded_total (note the zombied_ prefix, not zombie_) is a single counter with a reason label.
| Metric | Type | Description |
|---|---|---|
| zombied_run_limit_exceeded_total{reason="token_budget"} | Counter | Events terminated by monthly token-budget exhaustion. |
| zombied_run_limit_exceeded_total{reason="wall_time"} | Counter | Events terminated by wall-time limit. |
| zombied_run_limit_exceeded_total{reason="repair_loops"} | Counter | Events terminated because the repair loop exhausted its max iterations. |
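For readers unfamiliar with how a single counter with a reason label appears on the wire, here is a minimal sketch that renders such a family in Prometheus text format. It follows the general exposition format, not the zombied renderer, which may differ in ordering or in whether it emits TYPE lines; render_counter is a hypothetical helper and the values are made up.

```python
def render_counter(name, by_reason):
    """Render one counter family with a `reason` label in Prometheus text format."""
    lines = [f"# TYPE {name} counter"]
    for reason in sorted(by_reason):
        # Label values must escape backslash, double quote, and newline.
        escaped = (
            reason.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'{name}{{reason="{escaped}"}} {by_reason[reason]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_counter(
            "zombied_run_limit_exceeded_total",
            {"token_budget": 7, "wall_time": 2, "repair_loops": 0},
        ),
        end="",
    )
```

Each reason value becomes its own time series, so sum by (reason) or a plain sum over the metric name recovers per-reason or overall totals in queries.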
Gate repair
Approval gate repair loops executed by the worker when a gate claim/lease is disputed or stale.
| Metric | Type | Description |
|---|---|---|
| zombie_gate_repair_exhausted_total | Counter | Gate-repair runs that hit the iteration cap without converging. |
Side-effect outbox
Dead-letter counter for the reconciler daemon.
| Metric | Type | Description |
|---|---|---|
| zombie_side_effect_outbox_dead_letter_total | Counter | Outbox rows dead-lettered by reconciliation (stuck-pending cleanup). |
| zombie_reconcile_running | Gauge | 1 if the reconcile daemon is live; 0 otherwise. |
API server liveness
| Metric | Type | Description |
|---|---|---|
| zombie_api_in_flight_requests | Gauge | Current in-flight API requests under the backpressure guard. |
| zombie_api_backpressure_rejections_total | Counter | Requests rejected by the in-flight-count backpressure guard. |
| zombie_worker_running | Gauge | 1 if a worker process is up; 0 otherwise. |
OTEL export
Health of the outbound OTEL exporter that forwards metrics to an external collector. The OTLP/HTTP JSON exporter (src/observability/otel_export.zig) forwards counters, gauges, and histograms as OTLP data points — zombie_execution_seconds, zombie_agent_duration_seconds, and zombie_executor_agent_duration_seconds arrive at the collector as OTLP histogram data points (cumulative temporality) with explicit bucket bounds and the sum/count fields. Both Prometheus scrape and OTLP collectors receive histogram data.
| Metric | Type | Description |
|---|---|---|
| zombie_otel_export_total | Counter | Total OTEL export attempts. |
| zombie_otel_export_failed_total | Counter | Total OTEL export failures. |
| zombie_otel_last_success_at_ms | Gauge | Unix-ms timestamp of the last successful OTEL export. 0 if no success recorded this process lifetime. |
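Because zombie_otel_last_success_at_ms uses 0 as a "never succeeded" sentinel, a staleness check has to treat that value separately from a genuinely old timestamp. The sketch below shows one way to do that; the function names and the 300-second default threshold are assumptions chosen for illustration, not recommended values from this page.

```python
import time


def otel_export_age_seconds(last_success_at_ms, now_ms=None):
    """Seconds since the last successful export, or None if the gauge is still 0."""
    if last_success_at_ms == 0:
        return None  # no success recorded in this process lifetime
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero in case of small clock skew between scraper and exporter.
    return max(0.0, (now_ms - last_success_at_ms) / 1000.0)


def otel_export_is_stale(last_success_at_ms, max_age_seconds=300, now_ms=None):
    """True if no export ever succeeded or the last success is older than the limit."""
    age = otel_export_age_seconds(last_success_at_ms, now_ms)
    return age is None or age > max_age_seconds


if __name__ == "__main__":
    # Made-up gauge value: last success 60 seconds before "now".
    now = 1_700_000_060_000
    print(otel_export_age_seconds(1_700_000_000_000, now_ms=now))
    print(otel_export_is_stale(0, now_ms=now))
```

Pairing this check with the ratio of zombie_otel_export_failed_total to zombie_otel_export_total helps separate an exporter that is failing from one that has simply not run yet.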
Executor (NullClaw sessions)
Sandbox session + resource-limit metrics from the executor boundary.
| Metric | Type | Description |
|---|---|---|
| zombie_executor_sessions_created_total | Counter | Executor sessions created. |
| zombie_executor_sessions_active | Gauge | Currently-active executor sessions. |
| zombie_executor_failures_total | Counter | Executor RPC failures (transport-level). |
| zombie_executor_oom_kills_total | Counter | Executor processes killed by the cgroup memory limit. |
| zombie_executor_timeout_kills_total | Counter | Executor processes killed by the wall-time limit. |
| zombie_executor_landlock_denials_total | Counter | Filesystem access attempts denied by Landlock. |
| zombie_executor_resource_kills_total | Counter | Executor processes killed by any resource guard (CPU, disk, etc.). |
| zombie_executor_lease_expired_total | Counter | Executor leases that expired without a renewal (orphan cleanup fired). |
| zombie_executor_cancellations_total | Counter | Executor sessions cancelled (kill-switch, user-initiated). |
| zombie_executor_cpu_throttled_ms_total | Counter | Cumulative cgroup CPU-throttling time across executor sessions. |
| zombie_executor_memory_peak_bytes | Gauge | Peak RSS observed across executor sessions. |
| zombie_executor_stages_started_total | Counter | NullClaw stage invocations started inside executor sessions. |
| zombie_executor_stages_completed_total | Counter | NullClaw stage invocations that completed successfully. |
| zombie_executor_stages_failed_total | Counter | NullClaw stage invocations that failed. |
| zombie_executor_agent_tokens_total | Counter | LLM tokens consumed by agent calls inside executor sessions. |
| zombie_executor_agent_duration_seconds | Histogram | Wall-time of agent calls inside executor sessions. |
Signup funnel
Clerk-driven signup delivery counters emitted by POST /v1/webhooks/clerk. The signup_bootstrapped / signup_replayed split lets funnel dashboards distinguish first-delivery signups from idempotent retries.
| Metric | Type | Description |
|---|---|---|
| zombie_signup_bootstrapped_total | Counter | Clerk webhooks that provisioned a fresh personal account (tenant + user + owner membership + default workspace + 0-cent credit state). |
| zombie_signup_replayed_total | Counter | Clerk webhooks that resolved to an existing account via the oidc_subject fast-path: idempotent replay, no new rows. |
| zombie_signup_failed_total{reason="bad_sig"} | Counter | Signup webhooks rejected with 401 UZ-WH-010 for invalid Svix signature. |
| zombie_signup_failed_total{reason="stale_ts"} | Counter | Signup webhooks rejected with 401 UZ-WH-011 for timestamp outside the 5-minute freshness window. |
| zombie_signup_failed_total{reason="missing_email"} | Counter | Signup webhooks rejected with 400 UZ-REQ-001 for malformed JSON or missing primary email. |
| zombie_signup_failed_total{reason="db_error"} | Counter | Signup webhooks that reached the transactional bootstrap and rolled back on DB error. |
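A funnel dashboard might summarize these counters roughly as follows. The sketch assumes each webhook delivery increments exactly one of the bootstrapped, replayed, or failed counters, which this page implies but does not state; signup_funnel and the sample values are hypothetical.

```python
def signup_funnel(bootstrapped, replayed, failed_by_reason):
    """Summarize signup webhook outcomes from counter values (or window deltas).

    `failed_by_reason` maps the `reason` label value to its count.
    """
    failed = sum(failed_by_reason.values())
    delivered = bootstrapped + replayed
    total = delivered + failed
    return {
        "total_webhooks": total,
        # Share of successful deliveries that were idempotent replays.
        "replay_share": replayed / delivered if delivered else None,
        # Share of all webhooks that were rejected or rolled back.
        "failure_share": failed / total if total else None,
        "top_failure_reason": (
            max(failed_by_reason, key=failed_by_reason.get) if failed else None
        ),
    }


if __name__ == "__main__":
    # Made-up counter values.
    summary = signup_funnel(
        bootstrapped=80,
        replayed=20,
        failed_by_reason={"bad_sig": 3, "stale_ts": 1, "missing_email": 0, "db_error": 1},
    )
    for key, value in summary.items():
        print(key, value)
```

A rising replay_share with a flat bootstrapped count usually points at upstream redelivery rather than new signups, which is the distinction the bootstrapped/replayed split is meant to expose.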
Per-workspace + per-zombie metrics
The token counter is keyed on both workspace_id and zombie_id. The first 4096 distinct (workspace_id, zombie_id) pairs each get their own label set; every subsequent pair routes into workspace_id="_other",zombie_id="_other". zombie_workspace_metrics_overflow_total counts overflow increments; alert on sustained non-zero growth. The counter fires from the live zombie delivery path (src/zombie/event_loop_helpers.zig) on every successful completion.
| Metric | Type | Description |
|---|---|---|
| zombie_agent_tokens_by_workspace_total{workspace_id,zombie_id} | Counter | Tokens consumed per (workspace, zombie) pair. Bounded to 4096 pairs; overflow counted in zombie_workspace_metrics_overflow_total. |
| zombie_workspace_metrics_overflow_total | Counter | Increments routed to _other because the 4096-slot table is full. |
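The bounded-table behavior described above can be sketched as a small in-memory counter. This is an illustration of the routing rule only, written in Python rather than the project's Zig; BoundedPairCounter is a hypothetical name, and details such as thread safety or how the real helper stores its slots are not covered.

```python
MAX_PAIRS = 4096
OTHER = ("_other", "_other")


class BoundedPairCounter:
    """Token counter keyed on (workspace_id, zombie_id) with a hard cap on label sets."""

    def __init__(self, max_pairs=MAX_PAIRS):
        self.max_pairs = max_pairs
        self.tokens = {}          # (workspace_id, zombie_id) -> tokens
        self.overflow_total = 0   # increments routed to _other

    def _tracked_pairs(self):
        # The _other slot does not count against the cap.
        return len(self.tokens) - (1 if OTHER in self.tokens else 0)

    def add(self, workspace_id, zombie_id, tokens):
        key = (workspace_id, zombie_id)
        if key not in self.tokens and self._tracked_pairs() >= self.max_pairs:
            # Table is full: route this increment to the shared overflow label set.
            key = OTHER
            self.overflow_total += 1
        self.tokens[key] = self.tokens.get(key, 0) + tokens


if __name__ == "__main__":
    counter = BoundedPairCounter(max_pairs=2)  # tiny cap to make overflow visible
    counter.add("w1", "z1", 10)
    counter.add("w2", "z2", 5)
    counter.add("w3", "z3", 7)   # third distinct pair overflows
    counter.add("w1", "z1", 1)   # existing pair keeps its own slot
    print(counter.tokens, counter.overflow_total)
```

Pairs that already own a slot keep accumulating normally after the table fills; only previously unseen pairs are folded into _other, which is why sustained growth of the overflow counter signals lost per-pair attribution.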
Labels
Labels deliberately used across metrics:
| Label | Applied to | Values |
|---|---|---|
| reason | zombied_run_limit_exceeded_total, zombie_signup_failed_total | Lifecycle-specific; see each metric. |
| workspace_id | zombie_agent_tokens_by_workspace_total | UUIDv7 or _other on overflow. |
| zombie_id | zombie_agent_tokens_by_workspace_total | Zombie instance id or _other on overflow. |
| le | Histogram buckets (*_seconds_bucket) | Bucket upper-bound. |
Grafana dashboard queries
Real PromQL queries that run against the emitted metric names:
Emission source
The authoritative Prometheus text is rendered by src/observability/metrics_render.zig in the usezombie/usezombie repo. Any time a new appendMetric(...) or histogram emission lands there, update this page in the same PR. CI enforces router ↔ openapi parity but does not yet diff this page against the renderer, so drift is caught only at review time.