Handle EKS Spot Instance Interruption for High Availability

Running production workloads on AWS EKS using Spot Instances can reduce your cloud compute bill by up to 70%. However, the volatility of spot capacity poses a significant risk to service availability. When AWS needs the capacity back, it issues a two-minute interruption notice. If your application does not respond to this signal by gracefully draining connections, your users will experience 502 Bad Gateway errors and dropped requests.

To maintain high availability, you must implement a proactive draining strategy. This involves detecting the interruption signal via the AWS Node Termination Handler (NTH) or Karpenter, cordoning the affected node to prevent new pods from scheduling, and triggering a graceful shutdown of existing replicas. This guide demonstrates the technical implementation required to turn volatile spot capacity into a reliable production-grade environment.

TL;DR — Install the AWS Node Termination Handler (Queue Processor mode) and define Pod Disruption Budgets (PDB). This setup ensures pods are evacuated within the 120-second window provided by AWS before the instance is forcefully reclaimed.

The 2-Minute Warning Concept

💡 Analogy: Imagine a professional football game where the "Two-Minute Warning" is called. The players know the game is ending soon, so they don't start any long-term plays. Instead, they focus on finishing the current drive. In EKS, the interruption notice is your two-minute warning. You must stop accepting new work and finish existing requests before the "stadium" (the EC2 instance) closes.

AWS provides interruption notices via the Instance Metadata Service (IMDS) at http://169.254.169.254/latest/meta-data/spot/termination-time. However, polling this endpoint manually is inefficient. The modern standard is to use the AWS Node Termination Handler (NTH). It monitors these signals and translates them into Kubernetes events. When a signal is detected, NTH "cordons" the node (marking it unschedulable) and "drains" it (deleting the pods so they move to other nodes).
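The raw signal can be inspected from a shell on the instance itself. A minimal probe, assuming IMDSv2 is enabled (the endpoint returns a 404 until a notice has actually been issued):

```shell
# Fetch an IMDSv2 session token, then query the spot termination endpoint
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/termination-time
```

This is useful for debugging, but in production you should let NTH watch the signal for you.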

By default, Kubernetes does not know an EC2 instance is about to disappear. Without NTH, the node simply goes "NotReady," and the control plane waits for a timeout (usually 5 minutes) before rescheduling pods. On spot instances, the node is gone in 2 minutes, leaving a 3-minute "black hole" where traffic is routed to non-existent pods. Implementing NTH closes this gap.

When to Use Spot Instances in EKS

Not every workload is a candidate for spot instances. You should categorize your applications based on their "Interruption Tolerance." Stateless microservices that can scale horizontally are the primary candidates. These services should have a fast startup time and be able to handle SIGTERM signals correctly. If your pod takes 90 seconds to start and 60 seconds to shut down, you will struggle to fit into the 120-second spot window.

Batch processing jobs, CI/CD runners, and development environments are also ideal. In these cases, an interruption might mean a job retry, but it won't impact customer-facing uptime. For production APIs, you should use a "Split-Node-Group" strategy: maintain a baseline of On-Demand instances for stability and use Spot instances for the burstable capacity. This ensures that even if a specific spot pool is reclaimed, your service remains partially available while the cluster autoscaler finds new capacity.
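The Split-Node-Group strategy can be sketched as an eksctl ClusterConfig; the cluster name, region, and group sizes below are placeholders, and the instance types are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster          # placeholder
  region: us-east-1
managedNodeGroups:
  - name: baseline-on-demand  # stable floor for customer-facing traffic
    instanceTypes: ["m5.large"]
    minSize: 2
    desiredCapacity: 2
    maxSize: 4
  - name: burst-spot          # interruptible burst capacity
    spot: true
    instanceTypes: ["m5.large", "m5d.large", "m4.large", "t3.large"]
    minSize: 0
    desiredCapacity: 2
    maxSize: 10
```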

Avoid using spot instances for stateful workloads such as primary databases, or for legacy applications that require manual intervention to restart. While EKS can move persistent volumes, the detach, attach, and mount cycle often exceeds the 2-minute window, leading to "VolumeInUse" errors that stall your recovery.


Step-by-Step Implementation

Step 1: Install AWS Node Termination Handler

We will use the Queue Processor mode, which is more reliable than the IMDS-only mode because it uses SQS to capture events like Rebalance Recommendations and Scheduled Actions. Ensure you are using Helm v3 and have the necessary IAM permissions for SQS access.

helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSqsTerminationDraining=true \
--set queueURL=https://sqs.us-east-1.amazonaws.com/123456789012/MyNthQueue
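Queue Processor mode assumes the SQS queue already exists and that EventBridge forwards interruption events to it. A rough sketch with the AWS CLI (the account ID and queue name match the placeholder above; the queue also needs an access policy allowing events.amazonaws.com to send messages):

```shell
# Create the queue NTH will poll
aws sqs create-queue --queue-name MyNthQueue

# Route Spot interruption warnings from EventBridge to the queue
aws events put-rule --name nth-spot-interruption \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'
aws events put-targets --rule nth-spot-interruption \
  --targets 'Id=1,Arn=arn:aws:sqs:us-east-1:123456789012:MyNthQueue'
```

Rebalance Recommendations and Scheduled Actions require similar rules with their own detail-types.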

Step 2: Define Pod Disruption Budgets (PDB)

NTH respects Pod Disruption Budgets. A PDB guarantees that a minimum number of replicas stays available during voluntary disruptions. If you have 5 replicas and a PDB with minAvailable: 3, the eviction API refuses to delete a pod whenever doing so would leave fewer than 3 ready replicas, so NTH's drain waits until capacity exists elsewhere. This prevents cascading failures during mass spot interruptions.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
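After applying the budget, you can confirm how many voluntary evictions Kubernetes will currently allow:

```shell
# Shows MIN AVAILABLE and ALLOWED DISRUPTIONS for the budget
kubectl get pdb api-pdb

# Detailed view, including the pods matched by the selector
kubectl describe pdb api-pdb
```

If ALLOWED DISRUPTIONS is 0, a drain will block until more replicas become ready.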

Step 3: Configure Termination Grace Periods

Your application must shut down gracefully when it receives a SIGTERM. By default, Kubernetes gives pods 30 seconds. In a spot environment, you might want to increase this, but keep it well under the 120-second limit. We recommend 60 to 90 seconds for most Java or Python web frameworks.

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: web-app
      image: my-repo/api:v1.2
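Note that the grace period only helps if the process actually honors SIGTERM. If your image wraps the server in a shell entrypoint, the shell swallows the signal unless you forward it; a minimal sketch (my-server is a placeholder for your application binary):

```shell
#!/bin/sh
# Forward SIGTERM from Kubernetes to the real server process
trap 'kill -TERM "$child" 2>/dev/null' TERM
my-server &          # placeholder for your application binary
child=$!
wait "$child"
```

If the entrypoint does nothing else, exec my-server is simpler: the server becomes PID 1 and receives SIGTERM directly.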

Common Pitfalls and Fixes

⚠️ Common Mistake: Ignoring the preStop hook. Many applications (like Nginx) do not stop gracefully by default upon receiving a SIGTERM. They might kill active connections immediately. In EKS, there is a race condition between the pod being removed from the Load Balancer (ELB) target group and the pod shutting down.

To fix this, use a preStop hook to introduce a small delay (e.g., 15 seconds). This allows the ELB to deregister the pod and stop sending it new traffic before the application begins closing its socket listeners. This simple addition eliminates the vast majority of "Connection Refused" errors during a spot drain.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

Another pitfall is "Single Point of Interruption." If you use only one instance type (e.g., m5.large), and AWS needs that specific type back, all your nodes might be reclaimed at once. To prevent this, diversify your Node Groups. Use a mix of m5.large, m5d.large, m4.large, and t3.large. AWS is unlikely to reclaim all these different pools simultaneously.

Pro Tips for FinOps Success

To truly master EKS Spot instances, you should move beyond the standard Cluster Autoscaler and consider Karpenter. Karpenter is an open-source, flexible node provisioner. Unlike the Cluster Autoscaler, which is constrained by AWS Auto Scaling Groups (ASGs), Karpenter can provision any available instance type directly. It has built-in support for spot interruption handling, meaning you don't even need the standalone Node Termination Handler if you configure Karpenter correctly.
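A minimal sketch of a spot-capable NodePool, assuming the Karpenter v1 CRDs and an existing EC2NodeClass named default (both names are placeholders). Native interruption handling is enabled by pointing Karpenter at an SQS queue, e.g. via the Helm value settings.interruptionQueue:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general            # placeholder
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # fall back to on-demand when spot is tight
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default           # placeholder
```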

Monitor your "Interruption Frequency" via AWS Cost Explorer or CloudWatch. If you find a specific region or instance family is being interrupted more than 15% of the time, it’s no longer cost-effective for production due to the "Re-scheduling Overhead" (the CPU/Memory cost of repeatedly starting new pods). At that point, moving to Graviton-based instances (m6g, c6g) often provides a better price-to-performance ratio with higher spot availability.

📌 Key Takeaways:

  • Always use Pod Disruption Budgets to prevent cluster-wide outages.
  • Use preStop hooks to synchronize with ELB connection draining.
  • Diversify instance families to minimize the impact of a single spot pool exhaustion.
  • Transition to Karpenter for faster node provisioning and integrated interruption handling.

Frequently Asked Questions

Q. Is the 2-minute interruption notice guaranteed?

A. Mostly. AWS issues the interruption notice two minutes before reclaiming the instance, but delivery is best-effort rather than a contractual guarantee; in rare cases it can arrive late. Your automation (NTH or Karpenter) must also detect the signal immediately to utilize the full window. If your polling or event processing is delayed, you may only have seconds to drain pods.

Q. Can I use Spot instances for production databases?

A. It is not recommended. While possible with complex operator logic, the risk of data corruption during a hard "kill" after the 2-minute window is high. Use On-Demand or Reserved Instances for databases, and save Spot for the application layer.

Q. How does NTH compare to Karpenter's interruption handling?

A. Karpenter is the newer, more integrated solution. It listens to the same SQS events as NTH but can also automatically provision replacement nodes *before* the current node is terminated. NTH is better for legacy ASG-based clusters, while Karpenter is the standard for modern EKS designs.
