Don't Take Rolling Upgrades for Granted
When Rolling Upgrades Go Wrong: A Deep Dive into Istio Sidecar Behavior and Pod Disruption Windows
Context
Over the past several months, I’ve been focused on decomposing a decade-old monolith into smaller, more manageable components. Yes, microservices are the buzzword (and yes, there are lots of containers), but the real motivation behind this effort is to improve delivery velocity, enable surgical releases, and support zero-downtime deployments—nobody likes long maintenance windows.
This is the story of a surprising issue we hit during a rolling update of a Kubernetes Deployment: intermittent 503s, even though the new pod was ready and the Deployment had a generous grace period.
The Problem
We broke up a large, monolithic pod (what we like to call a “fat pod”) into several smaller, independently managed pods, each with its own lifecycle and its own Deployment and ReplicaSet.
We used kubectl rollout restart deploy/<deployment> to test rolling upgrades on one of our webserver pods, which powers several pages. To simulate real-world load, we deployed oha—a lightweight CLI tool written in Rust that generates high-volume HTTP requests—in a small pod within the cluster to drive traffic during our tests.
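For reference, the load generator was just a small Pod running oha against the webserver’s Service. A minimal sketch is below; the image, target URL, and flag values are illustrative assumptions rather than our exact configuration:

apiVersion: v1
kind: Pod
metadata:
  name: oha-load-gen                     # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: oha
    image: ghcr.io/hatoo/oha:latest      # assumed image; build or pick your own oha image
    args:
    - --no-tui                           # plain text output instead of the interactive TUI
    - -z                                 # run for a fixed duration
    - 10m
    - -c                                 # number of concurrent connections
    - "20"
    - http://webserver.default.svc.cluster.local/   # hypothetical target Service

oha prints a per-status-code breakdown at the end of a run, which makes a burst of 503s easy to spot.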
As expected, Kubernetes brought up a new pod, waited for it to pass its readiness probe, added its IP to the service endpoint slice, and sent a SIGTERM to the old pod.
However, what we didn’t expect was to see a brief spike in 503s during the upgrade. The new pod was up and ready, so why were we seeing traffic disruption?
We repeated the test several times and found that the disruption window was consistently between 30 and 60 seconds, which seemed unusually long for a standard rolling update.
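For context, the webserver Deployment under test looked roughly like the sketch below; the names, image, port, probe path, and rollout settings shown here are placeholders and assumptions. With a single replica and maxUnavailable: 0, Kubernetes should only terminate the old pod after the new one reports Ready, which is exactly why the 503s were so puzzling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver                        # placeholder name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0                  # never remove the old pod before the new one is Ready
  selector:
    matchLabels:
      app: webserver
  template:
    metadata:
      labels:
        app: webserver
    spec:
      containers:
      - name: webserver
        image: registry.example.com/webserver:latest   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz               # placeholder probe path
            port: 8080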
The Rabbit Hole
At first, I was stumped. I’ve been working with Kubernetes for nearly a decade, and rolling updates had always just “worked”—until now.
The Role of Istio
This time, we were running Istio as our service mesh, handling mTLS, traffic management, and more. Specifically, we were using Sidecar Injection mode, where every pod in the cluster gets a sidecar container that handles all ingress and egress traffic, including the readiness and liveness probes of the main container.
It turns out that Istio’s default connection-draining behavior is not ideal for this scenario. The relevant defaults in the Istio codebase are:
// Default values for drain duration and termination drain duration
DrainDuration: 5 * time.Second,
TerminationDrainDuration: 5 * time.Second,
This means that by default, when the sidecar receives SIGTERM, it:
- Sleeps for 5 seconds
- Then kills the active Envoy process
- Also begins failing its own readiness and liveness probes, which causes the pod to be removed from service endpoints before it’s fully drained
This is a problem when the sidecar is also responsible for routing traffic to the main container, especially when the sidecar has stale network state.
The Sidecar State Problem
Each Istio sidecar maintains an in-memory mapping of service names to pod IPs. When a new version of a pod is deployed, the Istio control plane (istiod) pushes updates to sidecars via the xDS protocol. These updates are triggered by events on Kubernetes endpoint slices.
In a large, dynamic cluster with lots of ephemeral pods, this can cause a lot of churn. To avoid overwhelming the control plane, Istio has debounce settings:
- PILOT_DEBOUNCE_AFTER: default 100ms
- PILOT_DEBOUNCE_MAX: default 1 second
However, in our case, we had overridden these to:
- PILOT_DEBOUNCE_AFTER: 30 seconds
- PILOT_DEBOUNCE_MAX: 60 seconds
This means that it can take up to 60 seconds for new pod IPs to propagate to all sidecars.
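For reference, these overrides are just environment variables on istiod; assuming an operator/istioctl-style install, an override like ours could be expressed along these lines:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane                    # hypothetical name
spec:
  components:
    pilot:
      k8s:
        env:
        - name: PILOT_DEBOUNCE_AFTER     # quiet period istiod waits before pushing xDS updates
          value: "30s"
        - name: PILOT_DEBOUNCE_MAX       # hard cap on how long a push can be debounced
          value: "60s"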
We confirmed this by monitoring the sidecar config of the oha pod using istioctl proxy-config endpoint <pod-name> during a rolling upgrade. The old IPs were still present in the sidecar config for up to 60 seconds.
In our case, since the service had only a single replica, this problem was magnified. A single stale IP in the sidecar meant a single point of failure for traffic routing.
The Solution
The root cause was clear: network state in Istio sidecars can be out of sync, and the default sidecar termination behavior wasn’t draining connections properly.
We needed to address two issues:
- Graceful sidecar termination
- Ensuring that pod IPs are updated in the sidecars before SIGTERM is sent
Step 1: Extend Sidecar Drain Time
We configured the sidecar to drain connections for 60 seconds instead of the default 5 seconds, using the terminationDrainDuration proxy setting. According to the Istio documentation, this parameter controls how long the proxy waits for in-flight connections to complete after receiving a shutdown signal.
Here’s how to add it to a Deployment’s pod template via the proxy.istio.io/config annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app-deployment
  namespace: default
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          terminationDrainDuration: 60s # 60 seconds to drain before termination
For more details on the drain duration parameter, see the Istio documentation.
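The per-pod annotation above is the most surgical option. If you would rather change the drain duration mesh-wide, the same field can be set under meshConfig.defaultConfig; a minimal sketch, again assuming an operator-style install:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      terminationDrainDuration: 60s      # applies to every injected sidecar in the mesh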
Step 2: Delay SIGTERM with PreStop Hooks
To give Istio time to propagate the new endpoint IPs, we added preStop hooks to both the sidecar and the main application container. This is crucial: if the sidecar stays up but the main app shuts down first, requests still routed to the old pod will fail. Both containers have to ride out the delay together; it’s all or nothing.
spec:
  containers:
  - name: main-app
    # ... main app configuration
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]
  - name: istio-proxy
    # ... sidecar configuration
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]
We also increased the terminationGracePeriodSeconds to 120 seconds to accommodate the delay.
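For completeness, the grace period sits on the pod spec next to the containers; a trimmed sketch of the relevant fields:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # must cover the 60s preStop sleep plus the 60s drain
      containers:
      - name: main-app
        # ... as shown above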
For more information on customizing sidecar injection, including how to modify the admission controller’s injection behavior, see the Istio documentation.
The New Flow
Now, the rolling update process looks like this:
- New pod starts.
- Kubernetes waits for the new pod’s readiness probe to succeed.
- Kubernetes adds the new pod IP to the service endpoint slice.
- Kubernetes initiates termination of the old pod.
- PreStop hooks execute—both containers sleep for 60 seconds, giving sidecars time to receive updated endpoints.
- After hooks complete, SIGTERM is sent to the containers.
- Both the sidecar and main app gracefully drain connections for 60 seconds.
- The old pod shuts down. If it doesn’t terminate gracefully, it gets a SIGKILL.
- No more 503s.
Conclusion
The fix was conceptually simple, but getting there required digging into Istio internals and understanding how sidecar state propagation interacts with Kubernetes pod lifecycle. If you’re running Istio in production—especially in clusters with high churn—take a close look at your debounce settings and sidecar termination behavior.
Thanks to the Airbnb engineering blog for key insights that pointed us in the right direction.