Few things are more frustrating for a DevOps engineer than intermittent network latency that appears and disappears without a clear pattern. In high-traffic Kubernetes clusters, you might notice your applications occasionally take exactly five seconds to resolve a DNS query. This is not a coincidence; it is a well-documented side effect of how the Linux kernel handles concurrent UDP packets under heavy load. By implementing NodeLocal DNSCache and tuning your networking stack, you can eliminate these spikes and ensure consistent sub-millisecond resolution.
TL;DR — The "5-second delay" is caused by a race condition in the Linux netfilter framework during conntrack entry creation. To fix it, deploy NodeLocal DNSCache to shift DNS traffic from UDP to TCP and avoid the iptables NAT bottleneck entirely.
Table of Contents
- The 5-Second Mystery: Symptoms and Error Messages
- The Root Cause: Netfilter and Conntrack Race Conditions
- The Ultimate Fix: Implementing NodeLocal DNSCache
- Verification: How to Confirm the Fix is Working
- Long-term Prevention and CoreDNS Tuning
- Frequently Asked Questions
The 5-Second Mystery: Symptoms and Error Messages
This issue typically occurs in Kubernetes clusters running kube-proxy in iptables mode. When your application performs a DNS lookup, it often sends two parallel requests: one for the A record (IPv4) and one for the AAAA record (IPv6). If these packets arrive at the kernel's network stack simultaneously, the conntrack module attempts to create two new entries for the same connection flow. One succeeds; the other fails due to a race condition. The kernel drops the failing packet, and the client (your app) waits for a default timeout—usually 5 seconds—before retrying.
You will see error logs in your application similar to these:
# glibc resolver error (surfaces in many runtimes)
getaddrinfo: temporary failure in name resolution
# Go application error
lookup api.service.cluster.local on 10.96.0.10:53: read udp 10.244.1.5:45321->10.96.0.10:53: i/o timeout
# Python/Requests error
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.internal', port=443):
Max retries exceeded with url: /v1/data (Caused by NewConnectionError('<...>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
If you run conntrack -S on a worker node during a traffic spike, you will likely see the insert_failed counter incrementing. This is the "smoking gun" that proves your DNS packets are being dropped by the node's kernel rather than CoreDNS itself.
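To make that check concrete, here is a small shell sketch that sums the per-CPU insert_failed counters. The sample variable holds hypothetical `conntrack -S` output for illustration; on a real worker node you would pipe `conntrack -S` straight into the same awk:

```shell
#!/bin/sh
# Hypothetical `conntrack -S` output (one line per CPU) for demonstration.
# On a real node: total=$(conntrack -S | awk '...') with the same awk body.
sample='cpu=0 found=12 invalid=3 insert=0 insert_failed=17 drop=17 early_drop=0
cpu=1 found=9 invalid=1 insert=0 insert_failed=5 drop=5 early_drop=0'

# Sum insert_failed across all CPUs; a growing total during a traffic
# spike points at the kernel conntrack race rather than CoreDNS itself.
total=$(printf '%s\n' "$sample" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^insert_failed=/) { split($i, kv, "="); sum += kv[2] }
} END { print sum }')
echo "insert_failed total: $total"
```

Run this in a loop (or under `watch`) during a load test; if the total keeps climbing, the kernel is dropping DNS packets.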
The Root Cause: Netfilter and Conntrack Race Conditions
UDP Conntrack Race
When multiple pods send DNS queries to the cluster DNS Service IP (commonly 10.96.0.10, used when pods run with the default ClusterFirst DNS policy), iptables performs Destination Network Address Translation (DNAT) to redirect the traffic to a CoreDNS pod's actual IP. The nf_conntrack module tracks these flows. Because UDP is connectionless, there is no SYN handshake to serialize state creation. When two UDP packets from the same source try to pass through DNAT at once, both are assigned a conntrack entry, but only one insertion can succeed; the kernel has no locking mechanism for this specific race, so the losing packet is dropped.
Parallel A and AAAA Lookups
Modern glibc (used in Ubuntu, Debian, etc.) and musl (used in Alpine) resolvers perform the A and AAAA lookups in parallel. Since both queries originate from the same source port and target the same DNS service IP, they are highly likely to trigger the race condition. This is why Alpine Linux users report DNS issues more often: musl always issues the two lookups in parallel and does not honor glibc-only serialization options such as single-request.
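For glibc-based images, a per-pod stopgap (short of deploying NodeLocal DNSCache) is the resolver's single-request-reopen option, which serializes the A and AAAA lookups so they can no longer race. Note that musl-based images (Alpine) ignore this option. The pod name and image below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: glibc-dns-workaround   # illustrative name
spec:
  containers:
    - name: app
      image: ubuntu:22.04      # glibc-based image; no effect on Alpine/musl
      command: ["sleep", "infinity"]
  dnsConfig:
    options:
      - name: single-request-reopen   # serialize A/AAAA lookups in glibc
```

This mitigates the symptom for individual workloads, but NodeLocal DNSCache remains the cluster-wide fix.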
The Ultimate Fix: Implementing NodeLocal DNSCache
The most effective way to resolve this is to bypass the iptables NAT table for DNS queries. NodeLocal DNSCache runs a small DNS caching agent as a DaemonSet on every node. Instead of talking to a remote Service IP, pods talk to a local link-local IP (like 169.254.20.10). This local agent then forwards requests to the global CoreDNS service using TCP, which is not subject to the same conntrack UDP race.
To deploy NodeLocal DNSCache, you first need to obtain the kube-dns service IP:
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
# Output: 10.96.0.10
Download the official manifest from the Kubernetes GitHub repository and replace the variables __PILLAR__DNS__SERVER__ and __PILLAR__LOCAL__DNS__. Here is a simplified version of the configuration you should apply:
# Simplified NodeLocal DNS ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10
        forward . 10.96.0.10 {
            force_tcp
        }
        prometheus :9253
    }
Setting force_tcp in the forwarder is critical. This ensures that even if the client-to-cache communication is UDP, the cache-to-CoreDNS communication uses TCP, which avoids the conntrack race entirely.
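As a sketch of the placeholder substitution described above: the snippet below rewrites __PILLAR__DNS__SERVER__ and __PILLAR__LOCAL__DNS__ with sed. The file name nodelocaldns.yaml and the two-line stand-in manifest are assumptions for illustration; in practice you would run the sed command against the full manifest downloaded from the Kubernetes GitHub repository.

```shell
#!/bin/sh
kubedns="10.96.0.10"       # from: kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
localdns="169.254.20.10"   # link-local IP the node-local cache will bind to

# Stand-in for the downloaded manifest (the real file comes from the
# kubernetes/kubernetes repository and is much longer).
printf 'forward . __PILLAR__DNS__SERVER__\nbind __PILLAR__LOCAL__DNS__\n' > nodelocaldns.yaml

# Replace the pillar placeholders in place, then apply the result.
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
cat nodelocaldns.yaml
# kubectl apply -f nodelocaldns.yaml
```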
Verification: How to Confirm the Fix is Working
Once NodeLocal DNSCache is deployed, you should verify that your pods are correctly routing traffic through the local agent. You can test this from within a pod using dig or nslookup.
# Run a test pod
kubectl run -it --rm --restart=Never dns-test --image=infoblox/dnstools
# Inside the pod, check your /etc/resolv.conf
cat /etc/resolv.conf
# You should see nameserver 169.254.20.10 (if using the NodeLocal IP)
# Perform a lookup and check the time
time dig google.com
In a healthy environment with NodeLocal DNSCache, the Query time should be 0ms or 1ms for cached entries. More importantly, you should monitor your application logs for the absence of "i/o timeout" or "5s" latency spikes. If you use Prometheus, monitor the coredns_dns_request_duration_seconds_bucket metric. A significant drop in the p99 latency after deployment confirms the fix.
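If you chart this in Prometheus, a query along these lines surfaces the p99 latency mentioned above (the exact label set depends on how your CoreDNS metrics are scraped, so treat this as a starting point):

```promql
# p99 DNS request latency across all CoreDNS instances, 5-minute window
histogram_quantile(0.99,
  sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))
```

A step-change downward in this series after rolling out NodeLocal DNSCache is the clearest cluster-wide signal that the fix took effect.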
A note on kubelet configuration: on some older clusters, you may need to manually update the kubelet's --cluster-dns flag to point to the NodeLocal IP, though most modern manifests handle this transparently by intercepting traffic to the existing kube-dns IP on each node.
Long-term Prevention and CoreDNS Tuning
Beyond NodeLocal DNSCache, you should optimize CoreDNS to handle higher loads. Default CoreDNS installations are often under-provisioned for enterprise-grade traffic.
- Autoscaling: Use the cluster-proportional-autoscaler to increase the number of CoreDNS replicas as your node count grows. A good rule of thumb is one replica per 256 cores or 16 nodes.
- Memory Limits: Ensure CoreDNS has at least 512Mi of memory. Large clusters with thousands of Services require CoreDNS to maintain a large in-memory cache.
- The ndots Problem: By default, Kubernetes sets ndots:5, so any DNS query with fewer than five dots is tried against every local search domain (e.g., svc.cluster.local) before the literal name, multiplying the DNS load several times over. For external services, use a fully qualified name ending in a dot (e.g., google.com.) to bypass the search path.
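If you cannot switch every call site to trailing-dot FQDNs, you can also lower ndots per pod via dnsConfig. The pod name and image below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots-example    # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25
  dnsConfig:
    options:
      - name: ndots
        value: "2"           # names with 2+ dots skip the search-domain walk
```

Be aware that lowering ndots changes how short internal names (e.g., plain Service names) resolve, so test it per workload rather than cluster-wide.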
📌 Key Takeaways
- 5-second DNS delays are caused by nf_conntrack races during parallel UDP DNAT.
- NodeLocal DNSCache is the industry-standard solution for high-traffic clusters.
- Using force_tcp for upstream forwarding eliminates the race condition.
- Check conntrack -S for insert_failed to confirm the kernel is dropping packets.
- Optimize your application by using fully qualified domain names (FQDNs) to reduce DNS queries.
Frequently Asked Questions
Q. Why is the timeout exactly 5 seconds?
A. It is the default resolver timeout in glibc (the timeout option in /etc/resolv.conf, which defaults to 5 seconds). When a UDP packet is dropped due to a conntrack race, the client waits out this 5-second window before attempting a retry.
Q. Does switching kube-proxy to IPVS mode fix this?
A. IPVS reduces the complexity of iptables rules and improves performance, but it does not completely eliminate the conntrack race for UDP packets. NodeLocal DNSCache is still recommended even when using IPVS.
Q. How does NodeLocal DNSCache improve performance?
A. It reduces the hop distance for DNS queries, caches results locally on the worker node, and upgrades the connection to CoreDNS to TCP, which prevents packet loss from NAT race conditions.