Overview
Worker fleet management covers how to safely upgrade, scale, and maintain the pool of worker machines. The key constraint is that workers may have in-flight runs that must complete before the worker is shut down.
Drain
Before upgrading a worker, drain it so that no new work is claimed and in-flight runs complete gracefully. To start a drain, send the worker SIGUSR1. The worker responds to SIGUSR1 by entering drain mode:
- Stops claiming new runs from Redis.
- Continues executing in-flight runs to completion.
- Exits cleanly when all runs finish.
If an in-flight run hangs during a drain, the RUN_TIMEOUT_MS limit will eventually force it to terminate.
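The drain behaviour described above could be scripted roughly as follows. This is a minimal sketch, not part of zombied: the function name drain_worker and its timeout argument are hypothetical, and it assumes the worker process exits on its own once its in-flight runs finish.

```shell
# Sketch: drain a worker by PID and wait for it to exit.
# drain_worker and its timeout argument are illustrative names only.
drain_worker() {
  pid="$1"
  timeout_s="${2:-600}"   # hypothetical upper bound on the wait, in seconds

  # Ask the worker to enter drain mode.
  kill -USR1 "$pid" || return 1

  # Poll until the process is gone (kill -0 only tests that the PID exists).
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "worker $pid still running after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "worker $pid drained"
}
```

On a systemd host the PID can usually be read with systemctl show -p MainPID --value zombied-worker and passed as the first argument.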
Rolling restart
To upgrade the entire fleet without downtime:
- Drain worker A (send SIGUSR1, wait for active runs to finish).
- Stop worker A (systemctl stop zombied-worker).
- Deploy new binaries to worker A.
- Restart executor on worker A (systemctl restart zombied-executor).
- Start worker A (systemctl start zombied-worker).
- Verify worker A is healthy (zombied doctor worker).
- Repeat for worker B, C, etc.
The worker unit depends on the executor (Requires=zombied-executor.service), so restart the executor before starting the worker.
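The rolling-restart steps for a single worker can be sketched as a script run on that worker. Several parts are assumptions: deploy_binaries is a hypothetical placeholder for however binaries are shipped, the DRY_RUN switch is illustrative, and the wait loop assumes a drained worker exits by itself so that its unit becomes inactive.

```shell
# Sketch of the per-worker upgrade sequence. Run on the worker being upgraded.
# DRY_RUN=1 prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical placeholder: replace with your own deployment step.
deploy_binaries() {
  echo "deploy_binaries: not implemented" >&2
  return 1
}

wait_for_worker_exit() {
  # Assumption: a drained worker exits on its own, so the unit goes inactive.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ wait until zombied-worker is inactive"
    return 0
  fi
  while systemctl is-active --quiet zombied-worker; do
    sleep 5
  done
}

upgrade_worker() {
  # Drain: signal only the main process of the unit.
  # Newer systemd spells this option --kill-whom=.
  run systemctl kill --signal=SIGUSR1 --kill-who=main zombied-worker || return 1
  wait_for_worker_exit || return 1
  run systemctl stop zombied-worker || return 1
  run deploy_binaries || return 1
  run systemctl restart zombied-executor || return 1
  run systemctl start zombied-worker || return 1
  run zombied doctor worker || return 1
}
```

Running it once with DRY_RUN=1 prints the seven steps in order, which is a cheap way to review the sequence before touching a live worker.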
Canary deploy
For high-risk upgrades, deploy to a single worker first and verify before rolling out to the fleet.
Step 1: Deploy canary
Pick one worker and deploy the new version.
Step 2: Verify canary
Before rolling out to the rest of the fleet, confirm on the canary that:
- All doctor checks pass.
- Runs complete successfully (check the sessions_created_total and failures_total metrics).
- No new error codes appear in logs.
- Executor sandbox enforcement is working (check oom_kills_total and landlock_denials_total).
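The metric checks above could be partly automated by comparing two snapshots of the worker's metrics, one taken before the canary deploy and one after it has handled some runs. This sketch assumes the metrics are available as Prometheus-style text (one "name value" or "name{labels} value" line per series, no timestamps); the source does not say how they are exposed. The helper names metric_value and canary_ok are hypothetical.

```shell
# Sum every series of one metric from Prometheus-style text on stdin.
# Assumes no trailing timestamps and no spaces inside label values.
metric_value() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { sum += $NF; found = 1 }
    END { if (found) print sum; else print 0 }
  '
}

# Compare a baseline snapshot ($1) with a later snapshot ($2).
canary_ok() {
  before="$1"; after="$2"
  runs_before=$(metric_value sessions_created_total < "$before")
  runs_after=$(metric_value sessions_created_total < "$after")
  fail_before=$(metric_value failures_total < "$before")
  fail_after=$(metric_value failures_total < "$after")

  # Printed for manual review; what counts as healthy is deployment-specific.
  echo "oom_kills_total=$(metric_value oom_kills_total < "$after")"
  echo "landlock_denials_total=$(metric_value landlock_denials_total < "$after")"

  if [ "$runs_after" -le "$runs_before" ]; then
    echo "canary: no new runs since baseline"
    return 1
  fi
  if [ "$fail_after" -gt "$fail_before" ]; then
    echo "canary: failures_total increased"
    return 1
  fi
  echo "canary ok"
}
```

The snapshots are plain files, so they can come from any source, for example curl -s against whatever metrics URL your deployment exposes. The log and doctor checks still need to be done separately.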
Systemd ordering
The executor must always start before the worker. This is enforced by the systemd unit dependency: the worker unit declares Requires=zombied-executor.service (with PartOf= for automatic restart propagation).
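One way to confirm the dependency on a running host is to inspect the properties systemd has loaded for the worker unit. The sketch below parses the output of systemctl show. In systemd, Requires= on its own does not order startup, so the check also looks for an After= entry; whether the shipped unit sets After= is an assumption here. The helper name depends_on_executor is hypothetical.

```shell
# Reads `systemctl show` output on stdin and succeeds only if the unit
# both requires zombied-executor.service and is ordered after it.
depends_on_executor() {
  awk -F= '
    $1 == "Requires" && $2 ~ /(^| )zombied-executor\.service( |$)/ { req = 1 }
    $1 == "After"    && $2 ~ /(^| )zombied-executor\.service( |$)/ { aft = 1 }
    END { exit !(req && aft) }
  '
}
```

Typical use would be: systemctl show zombied-worker.service -p Requires -p After | depends_on_executor && echo "ordering ok".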
Scaling
To add a new worker to the fleet:
- Provision a bare-metal machine on OVHCloud.
- Join it to the Tailscale network.
- Deploy the standard directory structure to /opt/zombie/.
- Configure .env from your secret manager.
- Enable and start both systemd services.
- Verify with zombied doctor worker.
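The last steps of bringing up a new worker can be sketched as a small script run on the new machine after provisioning, networking, and deployment are done. It assumes .env sits directly under /opt/zombie/, which the source does not state, and the names preflight, bring_up, and DRY_RUN are hypothetical. The executor is enabled first so that it is running before the worker starts.

```shell
# DRY_RUN=1 prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Check that the deploy directory and a non-empty .env are in place.
preflight() {
  root="${1:-/opt/zombie}"
  [ -d "$root" ] || { echo "missing directory: $root" >&2; return 1; }
  # Assumption: .env lives at the top of the deploy directory.
  [ -s "$root/.env" ] || { echo "missing or empty: $root/.env" >&2; return 1; }
  echo "preflight ok: $root"
}

bring_up() {
  preflight "${1:-/opt/zombie}" || return 1
  # Executor first, then worker, then the health check.
  run systemctl enable --now zombied-executor || return 1
  run systemctl enable --now zombied-worker || return 1
  run zombied doctor worker
}
```

Calling bring_up with no arguments checks /opt/zombie; passing another path is mainly useful for testing the preflight logic.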