You deploy your application to the cluster, the status briefly shows Running, and then it happens: the dreaded Kubernetes CrashLoopBackOff. You check the application logs, but they show a clean shutdown or even a successful startup sequence. This frustrating cycle often stems not from a code bug, but from aggressive "health checks" that kill healthy containers before they have a chance to breathe. When your liveness and readiness probes are misconfigured, Kubernetes becomes its own worst enemy, terminating pods that are simply busy or still initializing.
Aggressive or poorly configured liveness probes can cause Kubernetes to continuously restart perfectly healthy application pods under heavy load. Properly calibrating initialDelaySeconds, timeoutSeconds, and failureThreshold prevents false-positive health check failures and stabilizes deployments. In this guide, we will diagnose why these probes fail and show how to configure them for high availability without the restart loops.
TL;DR — Most probe-related CrashLoopBackOff issues occur because the livenessProbe starts too early or has a timeout shorter than the application's response time. Use startupProbes for slow-booting apps (like Spring Boot or large Node.js bundles) to give them time to initialize without triggering a kill signal.
Symptoms: Identifying Probe-Induced Restarts
💡 Analogy: Imagine a doctor checking your pulse every 2 seconds while you are performing high-intensity surgery. If you don't answer within 1 second because you are focused on the task, the doctor declares you dead and calls for a replacement. In Kubernetes, the Kubelet is that impatient doctor, and your application is the surgeon.
When a probe fails, Kubernetes does not always provide a loud error. Instead, the container simply exits. The primary way to detect this is by examining the pod events. You will notice that the container starts, passes the "Running" state for a few seconds, and then transitions to CrashLoopBackOff. If you run kubectl describe pod [pod_name], you will see events similar to this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 2m (x3 over 5m) kubelet Started container app
Warning Unhealthy 110s (x9 over 4m) kubelet Liveness probe failed: Get "http://10.1.2.3:8080/health": context deadline exceeded
Normal Killing 110s kubelet Container app failed liveness probe, will be restarted
The "context deadline exceeded" or "connection refused" messages in the events log are the smoking guns. This indicates that the kubelet attempted to ping your defined health endpoint, but the application failed to respond within the timeoutSeconds window. If this happens enough times (defined by failureThreshold), Kubernetes kills the container. Because the application was otherwise fine, it tries to start again, leading to the loop.
In my experience managing production clusters on Kubernetes 1.28, I’ve seen this most frequently during "heavy traffic" events. The application is processing requests at 95% CPU, which slows down the response time for the /health endpoint. The liveness probe times out, the pod is killed, and the remaining pods take on even more load, leading to a cascading failure across the entire namespace.
The Root Causes of Probe Failures
1. The Cold Start Paradox
Modern frameworks like Spring Boot, Quarkus, or heavy Rails apps often take 30–60 seconds to fully initialize. If your livenessProbe is configured with an initialDelaySeconds: 5, Kubernetes will start checking the health of the container long before the web server is actually listening on the port. The probe fails, the container is killed, and it never actually finishes booting. This is the "Cold Start Paradox" where the safety mechanism prevents the very thing it is meant to protect.
2. Resource Contention and Latency
Probes are often left at the default timeoutSeconds: 1. A simple health check should normally respond well within a second, but that budget leaves no margin for "Stop the World" Garbage Collection (GC) pauses in Java apps or event loop lag in Node.js. If your container hits its CPU limit, the kernel throttles the process via its cgroup CPU quota. A throttled process cannot respond to a network request within 1 second, so the Kubelet perceives the pod as dead when it is actually just busy.
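The interaction described above can be sketched in a pod spec fragment. This is illustrative only; the resource values are placeholders, not recommendations:

```yaml
# Sketch: if the container can hit its CPU limit under load, give the
# probe headroom instead of relying on the default 1-second timeout.
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1"              # hitting this limit triggers CFS throttling
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 3       # survives brief GC pauses / throttled periods
  failureThreshold: 3
```

An alternative design choice is to drop the CPU limit entirely (keeping only the request), which avoids throttling altogether at the cost of less predictable node-level CPU sharing.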
3. Dependency Overreach
A common mistake is creating a health check endpoint that checks the health of external dependencies like a database or a third-party API. If your database has a 2-second lag, your application’s health check endpoint might take 3 seconds to return. If your probe timeout is 1 second, your pod is killed because the database is slow. This is a logic error: liveness probes should only reflect the state of the local process, not the entire ecosystem.
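A minimal sketch of the correct separation, assuming hypothetical endpoint paths (/livez and /readyz are illustrative, not required names):

```yaml
# Liveness checks only the local process; readiness may consult dependencies.
livenessProbe:
  httpGet:
    path: /livez          # returns 200 if the process itself is responsive
    port: 8080
  timeoutSeconds: 3
readinessProbe:
  httpGet:
    path: /readyz         # may verify DB/cache connectivity
    port: 8080
  timeoutSeconds: 5       # dependency checks get a larger time budget
```

With this split, a slow database takes the pod out of traffic rotation (readiness fails) without triggering a restart (liveness keeps passing).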
# BROKEN EXAMPLE: A liveness probe that is too aggressive
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 3  # Too short for most apps
  periodSeconds: 5
  timeoutSeconds: 1       # No margin for network/CPU lag
  failureThreshold: 2     # Restarts after only 10 seconds of failure
How to Fix Misconfigured Probes
To resolve probe-induced CrashLoopBackOff, you must decouple the startup phase from the steady-state phase. Kubernetes introduced the startupProbe specifically to solve this. When a startupProbe is present, all other probes (liveness and readiness) are disabled until the startup probe succeeds.
Step 1: Implement a Startup Probe
The startupProbe should be configured to be patient. It allows the container to fail many times before giving up. This is much better than using a massive initialDelaySeconds on a liveness probe, because the liveness probe remains disabled until the app is actually ready.
Step 2: Increase Timeout and Failure Thresholds
For liveness probes, aim for "lazy" checks. You want to give the application enough room to recover from a temporary spike in load without being killed immediately. Increase the failureThreshold to 3 or 5, and set timeoutSeconds to at least 2 or 3 seconds for higher-latency environments.
# CORRECTED EXAMPLE: Using Startup Probes and Relaxed Liveness
containers:
- name: my-app
  image: my-app:latest
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30  # Try 30 times
    periodSeconds: 10     # Every 10 seconds
    # Total time allowed for startup: 300 seconds (5 minutes)
  livenessProbe:
    httpGet:
      path: /healthz/liveness
      port: 8080
    periodSeconds: 20
    timeoutSeconds: 3     # Added buffer for lag
    failureThreshold: 3   # Allows 1 minute of failure before restart
  readinessProbe:
    httpGet:
      path: /healthz/readiness
      port: 8080
    periodSeconds: 10
    successThreshold: 1
⚠️ Common Mistake: Setting readinessProbe and livenessProbe to the same failure threshold. If the readiness probe fails, the pod is removed from service traffic. If the liveness probe fails, the pod is killed. You almost always want your liveness probe to be more "forgiving" than your readiness probe.
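One way to encode that asymmetry (a sketch; the exact periods and thresholds depend on your traffic patterns):

```yaml
# Readiness trips first (pod pulled from the Service); liveness trips
# much later (container restarted) only if the failure persists.
readinessProbe:
  httpGet:
    path: /healthz/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3     # out of traffic rotation after ~15s of failures
livenessProbe:
  httpGet:
    path: /healthz/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 6     # restarted only after ~60s of sustained failures
```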
Verifying the Fix
After applying the updated YAML, you should monitor the pod startup sequence. Use the following command to watch the status and the restarts count:
kubectl get pods -w
If the configuration is correct, the RESTARTS column should remain at 0. You can also verify that the startup probe is working by checking the pod description again. Look for the transition where the startupProbe finishes and the others take over:
# Run this to see the events
kubectl describe pod [pod_name]
# Expected output in events:
# Normal Started 30s kubelet Started container app
# (No "Killing" or "Unhealthy" warnings should appear during startup)
If you still see failures, check your application logs (kubectl logs [pod_name]). Ensure the web server is actually listening on the port and path you specified in the probe. A common typo (e.g., /health vs /healthz) is a frequent cause of "connection refused" errors.
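When auditing for this class of mismatch, it helps to keep the exposed port and the probe definition side by side in the manifest (a sketch; values are illustrative):

```yaml
# Keep the probe's port/path aligned with what the container exposes.
ports:
- containerPort: 8080    # must be the port the server actually binds (on 0.0.0.0)
livenessProbe:
  httpGet:
    path: /healthz       # must match the route the server registers
    port: 8080           # must match containerPort above
```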
Prevention and Best Practices
To prevent future probe-related outages, follow these operational standards:
- Separate Liveness and Readiness logic: The readiness probe should check if dependencies (DB, Cache) are available. The liveness probe should only check if the process is alive (e.g., a simple 200 OK).
- Avoid heavy computation in probes: A health check endpoint should not perform complex database queries or intensive calculations. It should return a cached or lightweight status.
- Monitor Probe Metrics: Use Prometheus to track prober_probe_total. If you see a high rate of failed probes that don't result in restarts, your failureThreshold is doing its job, but you may need to investigate why the app is occasionally slow.
- Use gRPC Probes: If you are using gRPC, use the native grpc probe support (stable in Kubernetes 1.27+) instead of relying on custom exec scripts or HTTP bridges.
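A native gRPC probe is a one-stanza change, sketched below. It assumes your server implements the standard grpc.health.v1.Health service; the port and timings are placeholders:

```yaml
# Native gRPC liveness probe (stable in Kubernetes 1.27+).
livenessProbe:
  grpc:
    port: 9090
    service: ""           # optional: health service name; "" checks overall health
  periodSeconds: 20
  failureThreshold: 3
```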
📌 Key Takeaways
- CrashLoopBackOff is often a symptom of the Kubelet killing your container too early.
- Use startupProbes for any application that takes more than 10 seconds to start.
- Set timeoutSeconds to at least 2–3 seconds to account for CPU throttling or network jitter.
- Ensure the livenessProbe is less aggressive than the readinessProbe.
Frequently Asked Questions
Q. What is the difference between liveness and readiness probes?
A. A liveness probe tells Kubernetes if the container is alive. If it fails, Kubernetes kills the container and restarts it. A readiness probe tells Kubernetes if the container is ready to handle traffic. If it fails, Kubernetes removes the pod from the Service's endpoint list, so no traffic is sent to it, but the container is not killed.
Q. Should I use initialDelaySeconds or startupProbes?
A. You should prefer startupProbes for modern Kubernetes versions (1.16+). initialDelaySeconds is a "blind" wait; if the app starts faster, you waste time. If it starts slower, it crashes. startupProbes poll the app and allow it to start as fast or slow as needed (within your threshold).
Q. How do I debug a probe failing with "Connection Refused"?
A. This usually means the application isn't listening on the specified port or IP yet. Verify the containerPort in your YAML matches the application's config. Also, ensure your application is listening on 0.0.0.0 and not just 127.0.0.1, as the Kubelet probes come from the node's IP space.
For further reading on Kubernetes operational excellence, check out the official Kubernetes documentation.