Don't Take Rolling Upgrades for Granted
When Rolling Upgrades Go Wrong: A Deep Dive into Istio Sidecar Behavior and Pod Disruption Windows
Context
Over the past several months, I’ve been focused on decomposing a decade-old monolith into smaller, more manageable components. Yes, microservices are the buzzword (and yes, there are lots of containers), but the real motivation behind this effort is to improve delivery velocity, enable surgical releases, and support zero-downtime deployments—nobody likes long maintenance windows.
This is the story of a surprising issue we hit during a rolling update of a Kubernetes Deployment: intermittent 503s, even though the new pod was ready and the Deployment had a generous grace period.
The Problem
We broke up a large, monolithic pod (what we like to call a “fat pod”) into several smaller, independently managed pods, each with its own lifecycle and its own Deployment and ReplicaSet.
We used kubectl rollout restart deploy/<deployment> to test rolling upgrades on one of our webserver pods, which powers several pages. To simulate real-world load, we deployed oha—a lightweight CLI tool written in Rust that generates high-volume HTTP requests—in a small pod within the cluster to drive traffic during our tests.
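For reference, the load generator was just a small Pod running oha against the webserver’s Service. A minimal sketch is below; the image, target URL, and flag values are illustrative assumptions rather than our exact configuration:

apiVersion: v1
kind: Pod
metadata:
  name: oha-load-gen                     # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: oha
    image: ghcr.io/hatoo/oha:latest      # assumed image; build or pick your own oha image
    args:
    - --no-tui                           # plain text output instead of the interactive TUI
    - -z                                 # run for a fixed duration
    - 10m
    - -c                                 # number of concurrent connections
    - "20"
    - http://webserver.default.svc.cluster.local/   # hypothetical target Service

oha prints a per-status-code breakdown at the end of a run, which makes a burst of 503s easy to spot.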
As expected, Kubernetes brought up a new pod, waited for it to pass its readiness probe, added its IP to the service endpoint slice, and sent a SIGTERM to the old pod.
However, what we didn’t expect was to see a brief spike in 503s during the upgrade. The new pod was up and ready, so why were we seeing traffic disruption?
We repeated the test several times and found that the disruption window was consistently between 30 and 60 seconds, which seemed unusually long for a standard rolling update.
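For context, the webserver Deployment under test looked roughly like the sketch below; the names, image, port, probe path, and rollout settings shown here are placeholders and assumptions. With a single replica and maxUnavailable: 0, Kubernetes should only terminate the old pod after the new one reports Ready, which is exactly why the 503s were so puzzling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver                        # placeholder name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0                  # never remove the old pod before the new one is Ready
  selector:
    matchLabels:
      app: webserver
  template:
    metadata:
      labels:
        app: webserver
    spec:
      containers:
      - name: webserver
        image: registry.example.com/webserver:latest   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz               # placeholder probe path
            port: 8080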
The Rabbit Hole
At first, I was stumped. I’ve been working with Kubernetes for nearly a decade, and rolling updates had always just “worked”—until now.
The Role of Istio
This time, we were running Istio as our service mesh, handling mTLS, traffic management, and more. Specifically, we were using Sidecar Injection mode, where every pod in the cluster gets a sidecar container that handles all ingress and egress traffic, including the readiness and liveness probes of the main container.
It turns out that Istio’s default connection-draining behavior is not ideal for this scenario. The relevant defaults in the Istio codebase are:
// Default values for drain duration and termination drain duration
DrainDuration: 5 * time.Second,
TerminationDrainDuration: 5 * time.Second,
This means that by default, when the sidecar receives SIGTERM, it:
- Sleeps for 5 seconds
- Then kills the active Envoy process
- Also begins failing its own readiness and liveness probes, which causes the pod to be removed from service endpoints before it’s fully drained
This is a problem when the sidecar is also responsible for routing traffic to the main container, especially when the sidecar has stale network state.
The Sidecar State Problem
Each Istio sidecar maintains an in-memory mapping of service names to pod IPs. When a new version of a pod is deployed, the Istio control plane (istiod) pushes updates to sidecars via the xDS protocol. These updates are triggered by events on Kubernetes endpoint slices.
In a large, dynamic cluster with lots of ephemeral pods, this can cause a lot of churn. To avoid overwhelming the control plane, Istio has debounce settings:
- PILOT_DEBOUNCE_AFTER: default 100ms
- PILOT_DEBOUNCE_MAX: default 1 second
However, in our case, we had overridden these to:
- PILOT_DEBOUNCE_AFTER: 30 seconds
- PILOT_DEBOUNCE_MAX: 60 seconds
This means that it can take up to 60 seconds for new pod IPs to propagate to all sidecars.
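For reference, these overrides are just environment variables on istiod; assuming an operator/istioctl-style install, an override like ours could be expressed along these lines:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane                    # hypothetical name
spec:
  components:
    pilot:
      k8s:
        env:
        - name: PILOT_DEBOUNCE_AFTER     # quiet period istiod waits before pushing xDS updates
          value: "30s"
        - name: PILOT_DEBOUNCE_MAX       # hard cap on how long a push can be debounced
          value: "60s"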
We confirmed this by monitoring the sidecar config of the oha pod using istioctl proxy-config endpoint <pod-name> during a rolling upgrade. The old IPs were still present in the sidecar config for up to 60 seconds.
In our case, since the service had only a single replica, this problem was magnified. A single stale IP in the sidecar meant a single point of failure for traffic routing.
The Solution
The root cause was clear: network state in Istio sidecars can be out of sync, and the default sidecar termination behavior wasn’t draining connections properly.
We needed to address two issues:
- Graceful sidecar termination
- Ensuring that pod IPs are updated in the sidecars before SIGTERM is sent
Step 1: Extend Sidecar Drain Time
We configured the sidecar to drain connections for 60 seconds instead of the default 5 seconds, using the terminationDrainDuration proxy setting. According to the Istio documentation, this parameter controls how long the proxy waits for in-flight connections to complete after receiving a shutdown signal.
Here’s how to add it to a Deployment’s pod template via the proxy.istio.io/config annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app-deployment
  namespace: default
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          terminationDrainDuration: 60s # 60 seconds to drain before termination
For more details on the drain duration parameter, see the Istio documentation.
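The per-pod annotation above is the most surgical option. If you would rather change the drain duration mesh-wide, the same field can be set under meshConfig.defaultConfig; a minimal sketch, again assuming an operator-style install:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      terminationDrainDuration: 60s      # applies to every injected sidecar in the mesh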
Step 2: Delay SIGTERM with PreStop Hooks
To give Istio time to propagate the new endpoint IPs, we added preStop hooks to both the sidecar and the main application container. This is crucial: if the sidecar stays up but the main app shuts down first, requests still routed to the old pod will fail. Both containers have to ride out the delay together; it’s all or nothing.
spec:
  containers:
  - name: main-app
    # ... main app configuration
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]
  - name: istio-proxy
    # ... sidecar configuration
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]
We also increased the terminationGracePeriodSeconds to 120 seconds to accommodate the delay.
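For completeness, the grace period sits on the pod spec next to the containers; a trimmed sketch of the relevant fields:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # must cover the 60s preStop sleep plus the 60s drain
      containers:
      - name: main-app
        # ... as shown above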
For more information on customizing sidecar injection, including how to modify the admission controller’s injection behavior, see the Istio documentation.
The New Flow
Now, the rolling update process looks like this:
- New pod starts.
- Kubernetes waits for the new pod’s readiness probe to succeed.
- Kubernetes adds the new pod IP to the service endpoint slice.
- Kubernetes initiates termination of the old pod.
- PreStop hooks execute—both containers sleep for 60 seconds, giving sidecars time to receive updated endpoints.
- After hooks complete, SIGTERM is sent to the containers.
- Both the sidecar and main app gracefully drain connections for 60 seconds.
- The old pod shuts down. If it doesn’t terminate gracefully, it gets a SIGKILL.
- No more 503s.
Conclusion
The fix was conceptually simple, but getting there required digging into Istio internals and understanding how sidecar state propagation interacts with Kubernetes pod lifecycle. If you’re running Istio in production—especially in clusters with high churn—take a close look at your debounce settings and sidecar termination behavior.
Thanks to the Airbnb engineering blog for key insights that pointed us in the right direction.