Stop Kubernetes Container Escapes: Drop Linux Capabilities

Most Kubernetes pods run with significantly more power than they actually need to function. By default, the container runtime grants a subset of powerful Linux capabilities to every process. If an attacker compromises your application, they can use these default privileges as a bridge to exploit the host kernel, access sensitive hardware, or move laterally through your network. You can eliminate these risks by explicitly dropping capabilities in your pod manifest.

The goal is to move from a permissive default state to a "Zero Trust" compute model. By stripping away unnecessary privileges, you ensure that a shell inside a container is a dead end rather than a launchpad for a full cluster takeover.

TL;DR: To shut down a large class of container escapes, add securityContext.capabilities.drop: ["ALL"] to every container spec. This removes the default Linux capabilities, such as CAP_NET_RAW and CAP_MKNOD, that attackers build exploits on. CAP_SYS_ADMIN is not part of the default set; a container only has it if the pod adds it explicitly or runs privileged. Enforce the setting cluster-wide using the Restricted Pod Security Admission profile.

Understanding Linux Capabilities in Kubernetes

💡 Analogy: Think of the "Root" user as a master key that opens every door in a building. Linux Capabilities are individual keys to specific rooms—one for the server room, one for the roof, and one for the basement. By default, Kubernetes gives your container a keychain with about 14 specific keys. If you don't need to go to the roof or the basement, you should throw those keys away so an intruder can't use them.

The Linux kernel divides the privileges traditionally associated with the root user into distinct units called capabilities. In Kubernetes, the container engine (like Docker or containerd) grants a default set of capabilities to every container. Even if you run your container as a non-root user, that set stays in the process's bounding set, so a setuid binary or a file with file capabilities can bring the privileges back unless privilege escalation is disabled. The defaults include CAP_NET_RAW (which lets a process craft raw packets, for example for ARP spoofing) and CAP_MKNOD (creating special device files). While these are "safer" than full root, they still expose enough kernel surface to contribute to container escape vulnerabilities like CVE-2022-0492.

When you use the securityContext field in your YAML, you tell the kubelet, and through it the container runtime, exactly which of these "keys" to keep and which to discard. By dropping ALL, you start from a clean slate: the process can still run ordinary unprivileged code, but it cannot perform any operation that the kernel gates behind a capability, such as opening raw sockets or changing file ownership. This is a fundamental pillar of cloud-native security that limits the blast radius of a remote code execution (RCE) bug.

When to Drop Default Capabilities

You should drop all capabilities for almost every workload in your cluster. Modern web applications, microservices written in Go or Node.js, and static file servers rarely need any specific Linux kernel privileges. If your application just listens on a high port (like 8080) and talks to a database, it has no business crafting raw network packets or managing system time.

This practice is especially critical for public-facing workloads. Any container that processes untrusted user input is a high-risk target. During a security audit of a production cluster running Kubernetes 1.28, we found that dropping default capabilities blocked 100% of the simulated escapes that relied on CAP_SYS_ADMIN or CAP_NET_ADMIN. In multi-tenant environments where different teams share the same nodes, this configuration is your primary defense against one compromised team taking over the hardware and sniffing traffic from other teams.

Step-by-Step Implementation Guide

Step 1: Inspect Your Current Capabilities

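Before making changes, verify what your containers are actually running with. If the image ships the capsh tool (part of libcap), you can run it inside a running pod to see the current set. Minimal images often omit it, in which case you can read /proc/1/status instead. Run the following command against a pod whose image includes capsh: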

kubectl exec -it <pod-name> -- capsh --print

You will likely see a long list in the "Current" or "Bounding set" including cap_net_raw and cap_chown. Our goal is to make this list empty.
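If capsh is not available in the image, the hexadecimal masks in /proc/1/status can be decoded by hand. The following is a minimal POSIX shell sketch of that decoding. The names decode_caps and CAP_NAMES are hypothetical helpers introduced here, and the bit order is assumed to follow linux/capability.h up to bit 40:

```shell
# Capability names in bit order (bit 0 first), assumed to match
# linux/capability.h up to CAP_CHECKPOINT_RESTORE (bit 40).
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps MASK: print the comma-separated capabilities set in a hex mask.
# The body runs in a subshell so its variables do not leak into the caller.
decode_caps() (
  mask=$((0x$1))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}cap_$name"
    fi
    bit=$((bit + 1))
  done
  printf '%s\n' "$out"
)

# Typical use (requires a running pod whose image has grep):
#   kubectl exec <pod-name> -- grep CapEff /proc/1/status
#   decode_caps <mask printed by the command above>
```

For example, decode_caps 00000000a80425fb lists the 14 capabilities that Docker and containerd grant by default, while an all-zero mask prints an empty line.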

Step 2: Modify the Deployment Manifest

Update your Deployment or Pod specification to include a securityContext. The capabilities field exists only in the container-level securityContext, not in the pod-level one, so you must set it on every container (including init containers). Here is a hardened example for a standard web app:

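apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-webapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: secure-webapp
  template:
    metadata:
      labels:
        app: secure-webapp
    spec:
      containers:
      - name: web-app
        # The stock nginx image expects to start as root and bind port 80.
        # This unprivileged variant runs as UID 101 and listens on 8080.
        image: nginxinc/nginx-unprivileged:1.25-alpine
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL
          runAsNonRoot: true
          runAsUser: 101
          # Also required by the Restricted Pod Security profile.
          seccompProfile:
            type: RuntimeDefault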
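
Once the Deployment has rolled out, it is worth confirming that the capability sets really are empty. A minimal sketch, assuming the image provides cat and that PID 1 is the application process; caps_all_dropped is a hypothetical helper name, not part of kubectl:

```shell
# caps_all_dropped: read a /proc/<pid>/status dump on stdin and succeed only
# if the permitted, effective, and bounding sets are all zero.
caps_all_dropped() (
  status=$(cat)
  for field in CapPrm CapEff CapBnd; do
    value=$(printf '%s\n' "$status" | awk -v f="$field:" '$1 == f { print $2 }')
    case "$value" in
      ""|*[!0]*)
        # Missing field, or a mask containing any non-zero digit.
        echo "$field is ${value:-missing}" >&2
        exit 1
        ;;
    esac
  done
)

# Typical use (requires cluster access):
#   kubectl exec deploy/secure-webapp -- cat /proc/1/status | caps_all_dropped \
#     && echo "all capabilities dropped"
```

The check includes the bounding set on purpose: a non-root process can show an empty effective set while the bounding set still holds the runtime defaults.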

Step 3: Add Back Only What is Necessary

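If your application fails to start (for example, a load balancer that must bind to port 443), add back only the specific capability it requires. For binding to low ports (below 1024), that is CAP_NET_BIND_SERVICE. In the manifest, capability names are written without the CAP_ prefix. Keep in mind that an added capability typically only takes effect for a process running as root, unless the binary also carries the matching file capability.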

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE

Common Pitfalls and How to Fix Them

⚠️ Common Mistake: Dropping ALL capabilities while your container image is still configured to listen on port 80, without adding back CAP_NET_BIND_SERVICE. The container can crash at startup with a "Permission denied" error in the logs.

When you drop all capabilities, you might encounter silent failures or explicit "Operation not permitted" errors. To identify which capability is missing, you can run the workload under strace in a dev environment and watch for failed system calls. A call that returns EPERM ("Operation not permitted") or, for low-port binds, EACCES ("Permission denied") usually maps directly to a Linux capability. For example, a failed clock_settime call means the process needs CAP_SYS_TIME.
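
The strace triage described above can be partly automated. The sketch below assumes the usual strace line format (an optional "[pid N]" prefix, then "name(args) = -1 ERRNO (text)"); denied_syscalls and cap_hint are hypothetical helpers, and the syscall-to-capability table is deliberately incomplete and only a starting hint:

```shell
# denied_syscalls: read strace output on stdin and print each distinct
# syscall that failed with EPERM or EACCES, together with the errno name.
denied_syscalls() {
  awk '/= -1 (EPERM|EACCES) / {
    line = $0
    sub(/^\[pid +[0-9]+\] /, "", line)  # strip the prefix added by strace -f
    sub(/\(.*/, "", line)               # keep only the syscall name
    print line, ($0 ~ /= -1 EPERM / ? "EPERM" : "EACCES")
  }' | sort -u
}

# cap_hint SYSCALL: suggest the capability that most often gates a syscall.
# Illustrative and incomplete; confirm against capabilities(7).
cap_hint() {
  case "$1" in
    clock_settime|settimeofday|adjtimex) echo CAP_SYS_TIME ;;
    chown|fchown|lchown|fchownat)        echo CAP_CHOWN ;;
    chmod|fchmod|fchmodat)               echo CAP_FOWNER ;;
    mknod|mknodat)                       echo CAP_MKNOD ;;
    chroot)                              echo CAP_SYS_CHROOT ;;
    bind)                                echo CAP_NET_BIND_SERVICE ;;
    mount|umount2)                       echo CAP_SYS_ADMIN ;;
    *)                                   echo unknown ;;
  esac
}

# Typical use in a dev environment:
#   strace -f -o trace.log ./your-entrypoint
#   denied_syscalls < trace.log | while read -r call errno; do
#     echo "$call ($errno): $(cap_hint "$call")"
#   done
```

A hint of CAP_SYS_ADMIN is a signal to redesign the workload rather than to add the capability back, since it undoes most of the hardening.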

Another issue is image compatibility. Some "distroless" or minimal images work perfectly with zero capabilities, while older, "fat" images might have scripts inside that use chown or chmod during startup. If your entrypoint script performs these actions, you will need to keep CAP_CHOWN and CAP_FOWNER, or better yet, refactor the image to avoid needing these permissions at runtime by setting the correct ownership during the Docker build phase.

Pro-Tips for Production Hardening

Manual YAML updates are prone to human error. To ensure every pod in your cluster follows these rules, use Pod Security Admission (PSA), the built-in successor to PodSecurityPolicy. You can label a namespace to enforce the "Restricted" profile, after which the API server rejects any pod whose containers do not drop ALL capabilities; the only capability the profile lets you add back is NET_BIND_SERVICE. PSA validates pods but does not modify them, so your manifests still need the securityContext settings themselves.

kubectl label --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted
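
Enforcing the profile on a namespace that already runs workloads can block their next rollout, so a staged approach is often safer. The sketch below wraps two steps in hypothetical helper functions (psa_preview and psa_warn_only are names introduced here): a server-side dry run that reports which existing pods would violate the profile, and warn/audit labels that surface violations without rejecting pods. The KUBECTL variable is an assumption added so the command can be swapped out:

```shell
# Allow the kubectl binary to be overridden, for example with a wrapper.
KUBECTL="${KUBECTL:-kubectl}"

# psa_preview NAMESPACE: report pods that would violate the restricted
# profile. The server-side dry run changes nothing in the cluster.
psa_preview() {
  "$KUBECTL" label --dry-run=server --overwrite ns "$1" \
    pod-security.kubernetes.io/enforce=restricted
}

# psa_warn_only NAMESPACE: warn clients and write audit annotations for
# violations, without rejecting any pods yet.
psa_warn_only() {
  "$KUBECTL" label --overwrite ns "$1" \
    pod-security.kubernetes.io/warn=restricted \
    pod-security.kubernetes.io/audit=restricted
}

# Typical use (requires cluster access):
#   psa_preview production
#   psa_warn_only production
```

Once the preview comes back clean, the enforce label shown above can be applied with less risk of surprising the owning teams.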

📌 Key Takeaways

  • Default capabilities like CAP_NET_RAW provide a wide attack surface for container escapes.
  • Always include capabilities.drop: ["ALL"] in your container security contexts.
  • Use allowPrivilegeEscalation: false to prevent processes from gaining new privileges.
  • Enforce these settings cluster-wide using Pod Security Admission labels.
  • Reference the Official Kubernetes Pod Security Standards for the latest hardening requirements.

Frequently Asked Questions

Q. What are the default Linux capabilities in Kubernetes?

A. Docker and containerd grant 14 capabilities by default: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL, and CAP_AUDIT_WRITE. Other runtimes, such as CRI-O, default to a smaller set. Dropping all of them significantly hardens the container.

Q. How do I see what capabilities a container has?

A. You can use kubectl exec to run getpcaps 1 inside the container (if the binary exists) or grep Cap /proc/1/status. The second command prints hexadecimal masks for the Inheritable, Permitted, Effective, Bounding, and Ambient capability sets (CapInh, CapPrm, CapEff, CapBnd, CapAmb), which you can decode on any machine with capsh --decode=<mask>.

Q. Does dropping capabilities affect performance?

A. No. Dropping capabilities has no measurable impact on CPU or memory performance. The kernel performs the same capability checks whether the sets are full or empty; with an empty set, the privileged operations simply fail. Capabilities do not filter system calls (that is the job of seccomp), so there is no performance gain to expect either.
