Relying purely on infrastructure metrics like CPU and memory usage leaves dangerous blind spots in your system's health. While your Kubernetes nodes might look "green," your application could be failing to process payments or experiencing silent logic errors. To achieve true observability, you must instrument your microservices with OpenTelemetry (OTel) to expose domain-specific metrics. By exporting these rich, custom business metrics directly into Prometheus and visualizing them in Grafana, you transform a "black box" into a transparent system where every transaction is measurable.
This guide provides a technical walkthrough for configuring a modern observability stack. You will learn how to use the OpenTelemetry SDK to capture custom counters and histograms, route them through the OTel Collector, and store them in Prometheus for long-term analysis. By the end of this tutorial, you will have a functional pipeline that provides granular insights into your application's internal state, allowing for faster debugging and more accurate capacity planning.
TL;DR — Instrument your application using the OpenTelemetry SDK (v1.20+), configure the OTel Collector to receive OTLP data and export it via prometheus_exporter, then point your Grafana instance to Prometheus to build dashboards based on custom business logic attributes.
Understanding the OpenTelemetry Metrics Pipeline
OpenTelemetry is a vendor-agnostic framework designed to standardize the collection of traces, metrics, and logs. In the context of metrics, OTel provides a unified API that allows developers to write instrumentation once and send it to any backend. This decouples your code from specific monitoring tools like Prometheus, Datadog, or New Relic. The OTel Collector acts as a centralized relay that receives data, processes it (e.g., adding environment tags), and exports it to your chosen storage engine. This architecture is essential for scaling observability in distributed systems because it prevents "agent sprawl" and reduces the performance overhead on individual services.
The Prometheus ecosystem has historically relied on the "pull" model, where the Prometheus server scrapes a /metrics endpoint on every target. However, OpenTelemetry introduces more flexibility by supporting the OpenTelemetry Protocol (OTLP). Using an OTel Collector, you can receive OTLP data from your applications via "push" and then allow Prometheus to scrape the Collector's own endpoint. This hybrid approach ensures compatibility with Prometheus's battle-tested storage engine while benefiting from OTel's modern, multi-language SDKs. You can find more details in the official OpenTelemetry documentation.
When to Use Custom Application Metrics
Standard infrastructure metrics tell you if your pods are alive, but custom application metrics tell you if your software is doing its job. You should implement custom metrics when you need to track Service Level Indicators (SLIs) that are unique to your domain. For example, in an e-commerce application, tracking the "number of successful checkouts" or "payment gateway latency" is far more valuable for the business than simply knowing the average memory consumption of the checkout service. These metrics allow you to create alerts that trigger when conversion rates drop, even if the underlying infrastructure appears healthy.
Another critical use case is tracking long-tail latency via Histograms. Infrastructure-level latency often aggregates all requests together, but OTel custom metrics allow you to categorize latency by specific attributes, such as customer_tier or api_version. This granularity helps engineers identify if a performance regression is affecting all users or just a specific segment. Furthermore, custom metrics are vital for auditing and compliance, providing a verifiable record of application events that can be visualized over months or years within Grafana. If you are building a microservices architecture, these metrics are the primary way you verify that inter-service communication adheres to your expected performance contracts.
Step-by-Step Implementation Guide
Step 1: Instrumenting the Application (Node.js Example)
First, you must install the necessary OTel packages. For this example, we use Node.js, but the logic applies to Python, Go, and Java as well. You need the SDK metrics package and the OTLP exporter to send data to the collector.
```bash
npm install @opentelemetry/sdk-metrics \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/api
```
Initialize the Meter Provider and create a custom counter to track orders processed. Ensure you are using the correct OTLP endpoint for your collector instance (usually port 4317 for gRPC).
```javascript
const { MeterProvider, PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

// Export metrics to the Collector over gRPC (OTLP default port 4317)
const metricExporter = new OTLPMetricExporter({
  url: 'http://otel-collector:4317',
});

const meterProvider = new MeterProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
  }),
  readers: [
    new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 10000, // push every 10 seconds
    }),
  ],
});

const meter = meterProvider.getMeter('order-meter');

const orderCounter = meter.createCounter('orders_processed_total', {
  description: 'Total number of orders processed by the system',
});

// Record a metric with low-cardinality attributes
orderCounter.add(1, { order_type: 'digital', region: 'us-east-1' });
```
Step 2: Configuring the OpenTelemetry Collector
The OTel Collector needs a configuration file (config.yaml) to receive the OTLP data and expose it to Prometheus. Use the prometheus exporter, which starts an HTTP server that Prometheus can scrape.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    resource_to_telemetry_conversion:
      enabled: true # Converts OTel resource attributes to Prometheus labels

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Step 3: Configuring Prometheus and Grafana
Update your prometheus.yml to scrape the Collector's metrics endpoint. Prometheus will now pick up any custom metrics sent by your application to the OTel Collector.
```yaml
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```
Finally, open Grafana and add Prometheus as a data source. You can now query your metrics using PromQL, such as sum(orders_processed_total) by (region). This allows you to visualize business performance in real-time alongside your technical infrastructure metrics.
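Keep in mind that `orders_processed_total` is a monotonically increasing counter, so a raw `sum()` shows lifetime totals; for dashboards and alerts you usually want its per-second rate. Here is a sketch of a Prometheus alerting rule built on that idea (the rule name, threshold, and labels are illustrative):

```yaml
groups:
  - name: business-slis
    rules:
      - alert: CheckoutRateDropped
        # Orders per second over the last 5 minutes, broken out by region
        expr: sum(rate(orders_processed_total[5m])) by (region) < 0.1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Order throughput in {{ $labels.region }} dropped below 0.1/s"
```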
Common Pitfalls and Performance Bottlenecks
⚠️ High-cardinality warning — Using unbounded values such as user_id or session_id as labels in your metrics will cause Prometheus memory usage to explode. Each unique combination of label values creates a new time series. Keep labels bounded (e.g., region, status_code, env).
One of the most frequent issues encountered when setting up this pipeline is the "Metric Name Clash." Prometheus and OpenTelemetry have slightly different naming conventions. Prometheus traditionally uses underscores (total_requests_count), while OTel often uses dots in its internal representation. The OTel Collector handles the conversion, but if you have existing Prometheus instrumentation, you must be careful not to duplicate metric names with different schemas, as this will lead to fragmented data and broken dashboards. Always verify your naming strategy early in the development lifecycle.
Another bottleneck is the OTel Collector's processing capacity. If you have hundreds of microservices pushing metrics at high frequency, a single Collector instance may become a bottleneck, leading to dropped data. Monitor the otelcol_processor_batch_batch_send_failures metric within the collector itself. If you see spikes, you likely need to scale the Collector horizontally or increase the batch processor's timeout and size settings. Furthermore, ensure that you are using the memory_limiter processor in your collector configuration to prevent the process from being OOMKilled (Out-of-Memory) during traffic surges.
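A sketch of a Collector configuration with the memory_limiter processor in place; the limits shown are illustrative and should be tuned to your container's memory request, and memory_limiter must run first in the pipeline so it can apply backpressure before other processors buffer data:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500       # hard ceiling, illustrative value
    spike_limit_mib: 300  # headroom for short bursts
  batch:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```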
Pro-Tips for Better Dashboarding
To make your Grafana dashboards truly useful, follow the "RED" pattern for services: Requests, Errors, and Duration. While OTel provides these automatically for many frameworks, custom metrics allow you to add a "B" for Business Value. For every technical dashboard, include one "Business SLI" panel at the top. This provides context for technical failures; a 500-error spike is bad, but a 500-error spike that correlates with a 0% checkout rate is a P0 emergency. Combining these two perspectives in a single Grafana row is a hallmark of elite SRE teams.
Leverage "Exemplars" in your Prometheus and Grafana setup. Exemplars are references to specific trace IDs that are stored alongside your metrics. When you see a latency spike in a Grafana histogram, you can click on the data point to jump directly into the specific trace that caused that spike. This significantly reduces Mean Time to Resolution (MTTR) because it bridges the gap between aggregate metrics and individual request debugging. Ensure your OTel SDK is configured to enable exemplars, and use Prometheus v2.26 or higher for full support.
📌 Key Takeaways
- Use OpenTelemetry SDKs (v1.x) to instrument business-critical logic.
- Route all metrics through an OTel Collector to standardize attributes and reduce backend load.
- Avoid high-cardinality labels (like User IDs) to protect Prometheus performance.
- Implement the RED pattern supplemented by Business SLIs for comprehensive dashboards.
- Integrate Exemplars to link metric spikes directly to distributed traces in Grafana.
Frequently Asked Questions
Q. Should I use OTLP push or Prometheus pull for metrics?
A. Use OTLP "push" from your application to the OTel Collector, and then use Prometheus to "pull" from the Collector. This gives you the benefits of OTel's rich SDKs and unified protocol while maintaining the stability and scraping reliability of the Prometheus ecosystem.
Q. How do I handle high cardinality in OpenTelemetry?
A. You should use the filter or transform processors in the OTel Collector to drop high-cardinality attributes before they reach Prometheus. Alternatively, ensure your application code only uses low-cardinality values (e.g., HTTP methods, status codes, regions) as metric labels.
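As an illustration, a Collector snippet using the attributes processor to strip the hypothetical high-cardinality attributes named earlier before they reach Prometheus (the transform processor with OTTL statements is an alternative for more complex rules):

```yaml
processors:
  attributes/drop-high-cardinality:
    actions:
      - key: user_id
        action: delete
      - key: session_id
        action: delete
```

Remember to add `attributes/drop-high-cardinality` to the metrics pipeline's processor list for it to take effect.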
Q. Can OpenTelemetry replace the Prometheus Node Exporter?
A. While OTel has host metrics receivers that provide similar data, the Prometheus Node Exporter is still the industry standard for deep OS-level metrics. Most teams use both: Node Exporter for infrastructure and OTel for application-level custom metrics.