How to Prevent Prometheus Metrics Cardinality Explosions

You have likely experienced the silent killer of observability: your Prometheus instance suddenly slows down, consumes all available RAM, and enters a crash loop. This is a Prometheus metrics cardinality explosion. It happens when your application starts exporting too many unique label combinations, forcing the Time-Series Database (TSDB) to track millions of individual streams simultaneously. If you do not control this growth, your monitoring system will fail exactly when you need it most—during a high-traffic event or a production outage.

Managing cardinality is not just a configuration task; it is a fundamental architectural requirement for any scale-out system. By auditing your instrumentation and enforcing strict server-side rules, you can maintain high-resolution visibility without risking infrastructure stability. This guide provides a blueprint for identifying, mitigating, and preventing these explosions in production environments running Prometheus 2.x and later.

TL;DR — Audit labels to remove unique IDs (UUIDs, IPs), apply the labeldrop action in your metric_relabel_configs, and set sample_limit in your scrape configurations to hard-cap the incoming data volume.

Understanding Cardinality in Prometheus

💡 Analogy: Think of Prometheus as a library catalog. Each metric name is a book title, and labels are descriptive tags (Genre, Author, Year). If you use a "Genre" label like "Fiction," the index stays small. But if you create a new tag for every "Specific Page Number" in every book, the catalog grows larger than the library itself. Eventually, the librarian (Prometheus) spends all their time filing cards instead of finding books.

In technical terms, cardinality is the number of unique time series stored in Prometheus. A single metric like http_requests_total becomes a unique time series for every unique combination of its labels. If you have 2 HTTP methods and 3 status codes, you have 6 time series. If you add a user_id label with 100,000 unique users, you suddenly have 600,000 time series for just one metric.

Prometheus stores these series in memory (the Head block) before flushing them to disk. High cardinality causes the memory footprint to swell because Prometheus must maintain an index of every active series. When the memory limit of the pod or VM is reached, the operating system invokes the OOM (Out of Memory) killer, and your monitoring goes dark. This is why unbounded label values are the primary enemy of a reliable observability stack.

When to Audit Your Metrics

You should audit your metrics long before an outage occurs. In a healthy production environment, cardinality growth should correlate with the number of services or pods you deploy, not with the number of users or transactions your system processes. If you notice your Prometheus RAM usage climbing in step with traffic rather than with the size of your deployment, you likely have a cardinality leak.

Check your Prometheus "Status -> TSDB Status" page regularly. This dashboard identifies which metrics and labels contribute most to your series count. If you see labels like email_address, request_id, or source_ip in the "Top 10 labels with high memory usage" section, you are in the danger zone. These are "unbounded" labels because their possible values are infinite. In a production environment, you should only use labels with a finite, predictable set of values (e.g., region, env, service_name, or status_code).
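If you want to go beyond the Status page, the same information can be pulled with ad-hoc PromQL in the expression browser. Note that these queries touch every active series, so they are expensive on large instances; user_id below is just an illustrative suspect label, not a real one from your system:

```promql
# Top 10 metric names ranked by active series count
topk(10, count by (__name__)({__name__=~".+"}))

# Number of distinct values currently stored for a suspect label
count(count by (user_id)({user_id=~".+"}))
```

The inner count by groups series, and the outer count collapses the groups into a single number of distinct values.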

How Cardinality Scales: The Math

The total cardinality of a metric is the product of the number of unique values for each of its labels. This multiplicative effect is what leads to "explosions." Even seemingly small sets of labels can combine to create massive data loads.

Total Series (per metric name) = [Label_1_Values] x [Label_2_Values] x [Label_3_Values] x ...

Consider a standard microservice deployment:

  • Metric: api_latency_seconds_bucket (Prometheus Histograms usually have 10-15 buckets).
  • Label: method (GET, POST, PUT, DELETE) = 4 values.
  • Label: route (/login, /search, /profile) = 3 values.
  • Label: instance (5 pods) = 5 values.
  • Total: 15 (buckets) * 4 * 3 * 5 = 900 series.

Now, imagine a developer adds a user_id label to track latency per user. With 50,000 active users, the calculation becomes 900 * 50,000 = 45,000,000 series. A single Prometheus instance typically handles 1 to 5 million active series comfortably; 45 million from a single metric will crash almost any standard configuration instantly. This is why your architecture must prioritize "Aggregated Visibility" over "Individual Traceability" in metrics.

Step-by-Step Prevention Strategy

Step 1: Audit and Sanitize Instrumentation

The most effective fix happens at the source. Review your application code to ensure no dynamic values are being passed into labels. If you need to track specific user behavior, use distributed tracing (like OpenTelemetry with Jaeger or Tempo) instead of metrics. Metrics are for aggregate trends; traces are for individual requests.

// BAD: High cardinality
httpRequestsTotal.WithLabelValues("200", "/api/v1/user/" + userId).Inc()

// GOOD: Low cardinality
httpRequestsTotal.WithLabelValues("200", "/api/v1/user/:id").Inc()

Step 2: Use relabel_configs to Drop Volatile Labels

If you cannot change the application code immediately, use Prometheus relabel_configs to strip dangerous labels at scrape time. This prevents the data from ever entering the TSDB index. This is a critical safety net for third-party exporters where you don't control the source code.

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      - regex: 'user_id'
        action: labeldrop  # Removes the user_id label entirely; labeldrop matches
                           # label *names* against regex (it takes no source_labels)
      - source_labels: [__name__]
        regex: 'temp_metric_.*'
        action: drop       # Drops the entire metric if its name matches the pattern
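Relabel rules fail silently when mistyped, so it is worth validating the file before reloading Prometheus (the filename here is an assumption; use your actual config path):

```shell
promtool check config prometheus.yml
```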

Step 3: Enforce Scrape Limits

Prometheus allows you to set hard limits on how many samples it will accept from a single target. If a target exceeds this limit, Prometheus will stop scraping it and mark it as "failed" in the UI. This protects the global health of the Prometheus server by sacrificing the visibility of a single misbehaving service.

scrape_configs:
  - job_name: 'my-app'
    sample_limit: 10000  # Scrape fails if the target exposes > 10k samples per scrape
    label_limit: 20      # Limits the number of labels per series
    label_name_length_limit: 64
    label_value_length_limit: 128

⚠️ Common Mistake: Relying solely on sample_limit without monitoring. If a limit is hit, you lose all data for that target. Always set an alert for up == 0 or prometheus_target_scrapes_exceeded_sample_limit_total to know when a service has been silenced by the safety valve.
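A minimal rule for that safety valve might look like the following (the group name and window are illustrative, not prescriptive):

```yaml
groups:
  - name: scrape-safety
    rules:
      - alert: TargetSilencedBySampleLimit
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} exceeded its sample_limit and is no longer being scraped"
```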

Stability vs. Granularity

There is always a tension between wanting deep detail and maintaining a stable system. When designing your metrics architecture, use the following comparison to decide where to place your data.

| Feature      | High Cardinality (Traces/Logs) | Low Cardinality (Metrics)    |
|--------------|--------------------------------|------------------------------|
| Storage Cost | High (linear with traffic)     | Low (fixed per time series)  |
| Query Speed  | Slow (requires scanning logs)  | Instant (pre-indexed series) |
| Use Case     | Debugging specific errors      | Alerting and dashboards      |
| Best Tool    | Loki, Elasticsearch, Jaeger    | Prometheus, VictoriaMetrics  |

In most production scenarios, if you are asking "Which user saw this error?", you should be looking at logs or traces. If you are asking "What is the 99th percentile latency across all users?", you should be looking at Prometheus metrics. Keeping these domains separate ensures that your monitoring system remains fast and cost-effective.

Proactive Monitoring Tips

To keep Prometheus healthy, you must monitor the monitor. Use the internal metrics provided by Prometheus to create an "Early Warning System" for cardinality growth. I have found that monitoring the prometheus_tsdb_head_series metric is the most reliable way to predict memory exhaustion before it happens.

Create a recording rule to track the growth rate of your series. If the number of series increases by more than 20% in an hour without a corresponding increase in your infrastructure size (number of pods), trigger a high-priority alert. This gives your SRE team time to identify the offending service and apply relabel_configs before the instance crashes.
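As a sketch, the 20%-per-hour heuristic can be expressed directly against the head-series gauge (the rule name and window are illustrative; correlating with pod count would require an additional join against kube-state-metrics data, which is omitted here):

```yaml
groups:
  - name: cardinality-watch
    rules:
      - alert: HeadSeriesGrowingTooFast
        expr: prometheus_tsdb_head_series > 1.20 * (prometheus_tsdb_head_series offset 1h)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series grew more than 20% in the last hour"
```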

Additionally, consider using a tool like mimirtool or promtool to analyze your Prometheus WAL (Write-Ahead Log). These tools can parse your storage and tell you exactly which labels are bloating your index. In one production case I managed, we discovered that a misconfigured ingress controller was exporting the full User-Agent string of every request as a label, which accounted for 80% of our total RAM usage. Removing that one label dropped memory consumption from 64GB to 8GB overnight.
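For example, promtool can analyze a TSDB block on disk and report the highest-cardinality metrics and label pairs (the data path below is an assumption; point it at your actual storage directory):

```shell
promtool tsdb analyze /prometheus/data
```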

📌 Key Takeaways

  • Never use unbounded values (UUIDs, IPs, timestamps) as Prometheus labels.
  • Use metric_relabel_configs with labeldrop to sanitize data from services you don't control.
  • Set sample_limit in your scrape configs to prevent a single target from crashing the server.
  • Monitor prometheus_tsdb_head_series to detect cardinality leaks early.
  • Offload high-detail data to distributed tracing and logging systems.

Frequently Asked Questions

Q. How many labels are too many for a Prometheus metric?

A. It is not the number of labels so much as the number of unique label combinations. Because cardinality is multiplicative, 20 labels with 2 values each could in theory yield 2^20 (over a million) combinations, though in practice labels are correlated and the real count stays far lower. A single label with 100,000 unique values, by contrast, is always dangerous. Generally, keep the total series per metric under 10,000 for standard applications.

Q. Does high cardinality affect query performance or just memory?

A. Both. High cardinality forces Prometheus to scan a massive index during queries. If you use a regex selector on a high-cardinality label (e.g., {user_id=~"user-12.*"}), the query will be extremely slow and might time out, potentially locking up the Prometheus engine.

Q. Can I use recording rules to reduce cardinality?

A. Not directly — recording rules create new time series, so on their own they add cardinality. They help indirectly: you can record an aggregated, low-cardinality version of an expensive metric for dashboards and long-term queries, then get rid of the raw high-cardinality series (for example, by filtering it out of remote_write or dropping the offending label at scrape time). Note that Prometheus itself applies a single global retention, not per-metric retention.
