
Zero-Downtime Deploys with Kubernetes

PodDisruptionBudgets, readiness probes, and rolling updates done right.


Zero-downtime deployment is one of those things that sounds simple and isn't. The theory is easy: keep old pods running while new ones start, route traffic only to healthy pods, drain old pods gracefully. The practice involves half a dozen moving parts that all need to be configured correctly, or you get 500s.

This post covers the configuration I use for production services.

What Can Go Wrong

Before the solution, the failure modes:

  1. New pods receive traffic before they're ready — the readiness probe isn't configured, so Kubernetes marks pods Running and routes traffic immediately, before your app has finished initializing.
  2. Old pods are killed mid-request — the old pod gets SIGTERM but doesn't finish in-flight requests before shutting down.
  3. All pods are updated simultaneously — maxUnavailable: 100% plus maxSurge: 0 means zero pods for the duration of the update.
  4. PodDisruptionBudget isn't set — cluster autoscaler or node maintenance evicts all replicas of a service at once.

Each of these has a specific fix.

Readiness Probes

A readiness probe tells Kubernetes when a pod is ready to accept traffic. Until the probe passes, the pod is excluded from Service endpoints.

yaml
spec:
  containers:
    - name: api
      image: myapp:v2
      readinessProbe:
        httpGet:
          path: /api/health
          port: 3000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
        successThreshold: 1
      livenessProbe:
        httpGet:
          path: /api/health
          port: 3000
        initialDelaySeconds: 15
        periodSeconds: 10
        failureThreshold: 5

The key distinction:

  • Readiness: gates traffic. A failing readiness probe removes the pod from the load balancer but doesn't restart it.
  • Liveness: gates existence. A failing liveness probe restarts the pod.

Your /health endpoint should return 200 only when the app is fully initialized — database connections established, caches warm, etc.
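A minimal sketch of such a gated health endpoint, using Node's built-in http module to stay dependency-free (the flag names and port are illustrative, not from the original service):

```typescript
import http from 'node:http';

// Illustrative startup flags: flip each one as the corresponding
// dependency finishes initializing.
const state = { dbConnected: false, cacheWarmed: false };

function isReady(): boolean {
  return state.dbConnected && state.cacheWarmed;
}

const server = http.createServer((req, res) => {
  if (req.url === '/api/health') {
    // 200 only once fully initialized; 503 keeps the pod out of
    // Service endpoints without triggering a restart.
    res.writeHead(isReady() ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ ready: isReady() }));
  } else {
    res.writeHead(404);
    res.end();
  }
});

// After async init completes:
// state.dbConnected = true;
// state.cacheWarmed = true;
// server.listen(3000);
```

Until both flags are set, the probe fails and the pod stays out of rotation; once they flip, traffic starts flowing.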

Rolling Update Strategy

yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

With maxUnavailable: 0 and maxSurge: 1, the update proceeds as:

  1. Kubernetes creates 1 new pod (now 4 total, 3 old + 1 new)
  2. Waits for the new pod's readiness probe to pass
  3. Terminates 1 old pod (back to 3, now 2 old + 1 new)
  4. Repeats until all old pods are replaced

At no point are there fewer than 3 healthy pods. Traffic is never interrupted.

Graceful Shutdown

When Kubernetes terminates a pod, it sends SIGTERM. Your app must:

  1. Stop accepting new connections
  2. Finish processing in-flight requests
  3. Close database connections cleanly
  4. Exit with code 0

typescript
// Express graceful shutdown (app is the Express app, db.pool a pg Pool)
const server = app.listen(3000);
 
process.on('SIGTERM', () => {
  console.log('SIGTERM received, shutting down gracefully');
 
  server.close(() => {
    // All connections closed
    db.pool.end(() => {
      process.exit(0);
    });
  });
 
  // Force exit after 30s if shutdown hangs
  setTimeout(() => {
    console.error('Forcing shutdown after timeout');
    process.exit(1);
  }, 30_000);
});

And in your Kubernetes config:

yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM by 5s to let the load balancer drain
            command: ['/bin/sh', '-c', 'sleep 5']

The preStop hook matters because there's a race: Kubernetes sends SIGTERM and updates the Endpoints object at the same time, but kube-proxy (which updates iptables rules) may lag behind. If SIGTERM arrives before iptables is updated, new requests can still be routed to a pod that's shutting down. The 5-second sleep gives kube-proxy time to catch up.

PodDisruptionBudget

A PodDisruptionBudget (PDB) limits how many pods of a deployment can be disrupted simultaneously. "Disruption" includes voluntary evictions: node maintenance, cluster upgrades, autoscaler scale-downs.

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

With minAvailable: 2 and 3 replicas, only 1 pod can be evicted at a time. If two nodes are drained simultaneously, the second drain will block until the first completes.

Both fields accept absolute counts or percentages. Prefer absolute counts for small replica counts: percentages are rounded up to the nearest integer, so the math gets unintuitive fast. And never set minAvailable equal to the replica count — a budget that permits zero disruptions blocks node drains indefinitely.
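For reference, with 3 replicas the same budget can be written from the other direction; minAvailable: 2 and an absolute maxUnavailable: 1 are equivalent:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1   # equivalent to minAvailable: 2 at 3 replicas
  selector:
    matchLabels:
      app: api
```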

What PDB Doesn't Protect Against

PDB only protects against voluntary disruptions. Node failures, OOM kills, and pod crashes are involuntary and bypass PDB. For those, you need multiple replicas spread across availability zones.

yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api

Putting It Together

A deployment that survives rolling updates, node maintenance, and graceful shutdown:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: myapp:v2
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 5']
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 5

The most common mistake I see is teams deploying with all of this configured except the preStop hook. They see zero errors in staging (no load balancer lag) and mysterious 500s in production during deploys.

Set up the hook. Measure with real traffic. Watch your error rates stay flat during deploys.