OpenTelemetry Explained: Unifying Metrics, Logs, and Traces

OpenTelemetry unifies metrics, logs, and traces under one open standard — here is how it works, what it replaces, and how to instrument your first service in 20 minutes.


For years, observability meant picking a vendor, installing their SDK, and being trapped. Datadog agent on every pod. New Relic SDK baked into your application code. When you wanted to switch tools, you rewrote your instrumentation from scratch.

OpenTelemetry (OTel) was built to end this. It is an open standard — backed by Google, Microsoft, Datadog, Grafana, AWS, and 200+ other contributors — for how applications should emit observability data. Instrument once, send anywhere.

In 2026, 47% of teams have increased their OpenTelemetry adoption, and the number is climbing. If you are still running vendor-locked agents, you are accumulating technical debt.

The Three Pillars of Observability

Before OpenTelemetry, teams had three separate tooling problems:

Metrics tell you what is happening in aggregate — request rates, error rates, latency percentiles, CPU usage. Traditionally handled by Prometheus exporters or vendor agents.

Logs tell you what happened to a specific request — error messages, state transitions, debug output. Traditionally unstructured or semi-structured text, collected by Fluentd or Filebeat.

Traces tell you how a specific request flowed through your system: which services it touched, how long each took, where it failed. Traditionally required backend-specific client libraries, such as the Jaeger or Zipkin clients, or a vendor SDK.

The problem: all three were siloed. You couldn't click from a metric spike to the logs from that spike to a trace from that spike — because they came from different systems with no shared context.

OpenTelemetry solves this with a single context model that threads through all three signals.

How OpenTelemetry Works

A typical OpenTelemetry deployment has three parts (the SDK in your application, the Collector, and the backends it exports to):

◈ DIAGRAM

Your Application
      |
      | (OTel SDK - one instrumentation)
      v
OTel Collector
      |
      +-- Prometheus (metrics)
      +-- Loki (logs)
      +-- Tempo / Jaeger (traces)
      +-- Datadog (if you want)
      +-- Any other backend

The SDK instruments your code and produces telemetry. The Collector receives it, processes it, and routes it to whatever backends you choose. The backends are swappable at any time without touching your application code.
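The decoupling described above can be pictured with a toy model. The sketch below is not Collector code: `routeTelemetry`, the pipeline tables, and the backend names are hypothetical and exist only to show that changing a backend means changing a routing table, not the application.

```javascript
// Toy model of Collector-style routing: each signal type has its own list of
// exporters, and the application never references a backend directly.
// All names here are illustrative, not part of any OpenTelemetry API.
function routeTelemetry(items, pipelines) {
  const delivered = {};
  for (const item of items) {
    const exporters = pipelines[item.signal] || [];
    for (const exporter of exporters) {
      if (!delivered[exporter]) delivered[exporter] = [];
      delivered[exporter].push(item.name);
    }
  }
  return delivered;
}

// What the application emits stays the same in both scenarios below.
const telemetry = [
  { signal: 'traces', name: 'POST /charge' },
  { signal: 'metrics', name: 'http.server.duration' },
  { signal: 'logs', name: 'card declined' },
];

// Swapping the trace backend is a change to this table only.
const before = routeTelemetry(telemetry, {
  traces: ['tempo'], metrics: ['prometheus'], logs: ['loki'],
});
const after = routeTelemetry(telemetry, {
  traces: ['jaeger', 'datadog'], metrics: ['prometheus'], logs: ['loki'],
});

console.log(before);
console.log(after);
```

In a real deployment the table corresponds to the Collector's pipeline configuration, and the application only knows the Collector's address.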

The Trace Context Model

The key insight in OpenTelemetry is the trace context. Every request gets a trace_id at the point of entry. Every service that handles that request propagates the trace_id downstream, by default in the W3C traceparent header. Log records emitted during that request carry the same trace_id and span_id, and metrics can attach the trace_id to individual data points as exemplars.
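To make propagation concrete, here is a minimal sketch of reading and writing a W3C `traceparent` header by hand. The OTel SDK does this for you; the helper names (`parseTraceparent`, `startSpanContext`, `toTraceparent`) are hypothetical, and the sketch skips parts of the spec such as rejecting all-zero IDs.

```javascript
const { randomBytes } = require('node:crypto');

// W3C Trace Context header layout: version-traceId-parentSpanId-flags
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming header; returns null when it is missing or malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue the caller's trace if there is one, otherwise start a new trace.
function startSpanContext(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  return {
    traceId: parent ? parent.traceId : randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'), // every service creates its own span ID
    sampled: parent ? parent.sampled : true,
  };
}

// Header value to attach to outgoing requests.
function toTraceparent(ctx) {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const ctx = startSpanContext(incoming);
const outgoing = toTraceparent(ctx);
console.log(outgoing);
```

The trace ID in the outgoing header matches the incoming one while the span ID changes, which is what lets a backend stitch spans from different services into one trace.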

This is what enables the experience teams actually want: you see a latency spike in Grafana, click on it, and are taken directly to the traces from that 5-minute window. Click on one trace, and every log line from every service that touched that request is right there.
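The click-through experience boils down to a join on trace_id. The following sketch shows that join over in-memory data; the record shapes and the `correlateByTrace` helper are assumptions for illustration, and real backends run this lookup as queries against Tempo and Loki.

```javascript
// Group log records and spans by trace_id so one lookup returns everything
// recorded for a request. Field names are illustrative.
function correlateByTrace(spans, logs) {
  const byTrace = new Map();
  const entry = (traceId) => {
    if (!byTrace.has(traceId)) byTrace.set(traceId, { spans: [], logs: [] });
    return byTrace.get(traceId);
  };
  for (const span of spans) entry(span.trace_id).spans.push(span);
  for (const log of logs) {
    // Logs written outside a request (for example at startup) have no trace_id.
    if (log.trace_id) entry(log.trace_id).logs.push(log);
  }
  return byTrace;
}

const spans = [
  { trace_id: 'aaa', service: 'gateway', name: 'POST /checkout', duration_ms: 840 },
  { trace_id: 'aaa', service: 'payment-service', name: 'charge', duration_ms: 790 },
  { trace_id: 'bbb', service: 'gateway', name: 'GET /health', duration_ms: 2 },
];
const logs = [
  { trace_id: 'aaa', service: 'payment-service', message: 'card processor timeout, retrying' },
  { trace_id: 'bbb', service: 'gateway', message: 'health check ok' },
  { service: 'gateway', message: 'startup complete' },
];

const correlated = correlateByTrace(spans, logs);
const slow = correlated.get('aaa');
console.log(slow.spans.map((s) => s.service), slow.logs.map((l) => l.message));
```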

Instrumenting a Node.js Service

Here is a minimal, production-ready OTel setup for a Node.js service:

Bash

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc

JAVASCRIPT

// tracing.js: load this before anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require(
  '@opentelemetry/auto-instrumentations-node'
);
// The OTLP/gRPC trace exporter is published as exporter-trace-otlp-grpc
const { OTLPTraceExporter } = require(
  '@opentelemetry/exporter-trace-otlp-grpc'
);

const sdk = new NodeSDK({
  serviceName: 'payment-service',  // appears in all traces
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',  // your OTel Collector (OTLP over gRPC)
  }),
  instrumentations: [
    getNodeAutoInstrumentations()  // auto-instruments http, db, redis
  ],
});

sdk.start();

// Flush buffered spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

JAVASCRIPT

// server.js — start with tracing loaded
require('./tracing');
const express = require('express');
const app = express();
// ... rest of your app

That's it. Auto-instrumentation patches http, express, pg, redis, mongoose, and 40+ other libraries automatically. You get traces for every incoming request and every database call without touching your business logic.
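If "patches" sounds like magic, the underlying idea is function wrapping. The sketch below is a deliberately simplified illustration, not the code the OTel instrumentation packages use; `patchMethod`, `recordedSpans`, and the fake `db` client are hypothetical.

```javascript
// Simplified illustration of how an instrumentation library wraps a function.
// Real OpenTelemetry instrumentations are more involved; `patchMethod` and
// `recordedSpans` are hypothetical names for this sketch.
const recordedSpans = [];

function patchMethod(target, methodName, spanName) {
  const original = target[methodName];
  target[methodName] = async function patched(...args) {
    const start = process.hrtime.bigint();
    let status = 'OK';
    try {
      return await original.apply(this, args);
    } catch (err) {
      status = 'ERROR';
      throw err; // the caller still sees the original error
    } finally {
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      recordedSpans.push({ name: spanName, status, durationMs });
    }
  };
}

// A stand-in for a database client; its code is not modified.
const db = {
  async query(sql) {
    if (sql.includes('DROP')) throw new Error('not allowed');
    return [{ id: 1 }];
  },
};

patchMethod(db, 'query', 'db.query');

async function main() {
  await db.query('SELECT 1');
  await db.query('DROP TABLE users').catch(() => {});
  console.log(recordedSpans.map((s) => `${s.name} ${s.status}`));
}

const done = main();
```

This is also why tracing.js has to load first: a library can only be wrapped before your application code has taken references to its functions.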

Deploying the OTel Collector

The Collector is what makes OTel production-grade. You run it as a sidecar or as a central deployment, and it handles batching, retry, sampling, and routing to multiple backends.

YAML

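apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 2
  selector:                # required: must match the pod template labels
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          # pin a specific version tag in production instead of latest
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config  # assumed name of a ConfigMap holding config.yaml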

YAML

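receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s   # must be set for memory_limiter to start
    limit_mib: 512
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # Prometheus scrapes here
    enable_open_metrics: true # exemplars are only exposed in OpenMetrics format
  # The dedicated loki exporter is deprecated in collector-contrib; recent Loki
  # versions also accept OTLP directly through the otlphttp exporter.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]  # memory_limiter goes first
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]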

This single Collector config routes traces to Tempo, metrics to Prometheus, and logs to Loki. That is the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) with Prometheus standing in for Mimir, a common open-source backend for OTel in 2026.
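The batch settings in the config above (timeout and send_batch_size) describe a flush-on-size-or-timeout buffer. Here is a minimal sketch of that behavior; `BatchBuffer` and its option names are hypothetical and only mirror the two settings, not the Collector's actual implementation.

```javascript
// Minimal sketch of size-or-timeout batching. `BatchBuffer` and its option
// names are hypothetical; they only mirror the send_batch_size and timeout
// settings of the batch processor.
class BatchBuffer {
  constructor({ sendBatchSize, timeoutMs, exportFn }) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
    this.exportFn = exportFn;
    this.items = [];
    this.timer = null;
  }

  add(item) {
    this.items.push(item);
    if (this.items.length >= this.sendBatchSize) {
      this.flush(); // size threshold reached
    } else if (!this.timer) {
      // Otherwise make sure a partial batch goes out after the timeout.
      this.timer = setTimeout(() => this.flush(), this.timeoutMs);
    }
  }

  flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    this.exportFn(batch);
  }
}

const exported = [];
const buffer = new BatchBuffer({
  sendBatchSize: 3,
  timeoutMs: 1000,
  exportFn: (batch) => exported.push(batch),
});

for (let i = 1; i <= 7; i++) buffer.add(`span-${i}`);
buffer.flush(); // for example on shutdown, so the partial batch is not lost
console.log(exported.map((b) => b.length));
```

Larger batches mean fewer, bigger export requests; the timeout bounds how long a quiet service waits before its data shows up in the backend.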

Correlating Across Signals in Grafana

With OTel feeding the LGTM stack, Grafana can link signals together. A dashboard panel showing error rate for payment-service can be configured to show "Explore Traces" and "Show Logs" links that pre-filter by the same time range and service.

The feature that enables this is exemplars: metric data points that carry a trace_id. When Prometheus scrapes an OTel metric that includes an exemplar, Grafana can render a clickable link from the metric point to the exact trace in Tempo.

YAML

# In your Grafana datasource config
- name: Prometheus
  type: prometheus
  url: http://prometheus:9090
  jsonData:
    exemplarTraceIdDestinations:
      - name: trace_id
        datasourceUid: tempo-uid  # links to your Tempo datasource
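To see what an exemplar actually is, consider this sketch of a latency histogram that remembers one trace_id per bucket. The class and field names are hypothetical, and real SDKs pick exemplars with their own reservoir strategies rather than "most recent wins".

```javascript
// Sketch of a histogram that keeps one exemplar per bucket. The class and
// field names are hypothetical; real SDKs choose exemplars with their own
// reservoir strategies.
class HistogramWithExemplars {
  constructor(bounds) {
    this.bounds = bounds; // bucket upper bounds in ms, ascending
    this.counts = new Array(bounds.length + 1).fill(0);
    this.exemplars = new Array(bounds.length + 1).fill(null);
  }

  observe(valueMs, traceId) {
    let i = this.bounds.findIndex((b) => valueMs <= b);
    if (i === -1) i = this.bounds.length; // overflow (+Inf) bucket
    this.counts[i] += 1;
    if (traceId) {
      // Keep the most recent traced observation as this bucket's exemplar.
      this.exemplars[i] = { value: valueMs, trace_id: traceId };
    }
  }
}

const latency = new HistogramWithExemplars([100, 500, 1000]);
latency.observe(42, 'trace-a');
latency.observe(870, 'trace-b');
latency.observe(4300, 'trace-c');
latency.observe(60); // recorded outside any trace, so no exemplar

const slowest = latency.exemplars[latency.exemplars.length - 1];
console.log(latency.counts, slowest);
```

The aggregate counts answer "how many requests were slow", and the exemplar on the slowest bucket gives Grafana a specific trace to open.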

What OpenTelemetry Does Not Solve

OTel standardizes the collection of telemetry. It does not standardize analysis, alerting, or dashboarding. You still need a backend (Grafana, Datadog, Honeycomb, etc.) to make sense of the data.

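OTel also adds overhead. Auto-instrumentation is convenient but not free: a high-throughput service may need head-based or tail-based sampling to keep trace volume manageable. Tail-based sampling has to run in the Collector, because the decision needs the whole trace. Keeping the sampling policy in the Collector rather than in each SDK also lets you change rates without redeploying services, although spans dropped there have already cost the application the work of creating and exporting them.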

YAML

processors:
  probabilistic_sampler:
    sampling_percentage: 10  # sample 10% of traces in production
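Percentage sampling only works for traces if every span of a trace gets the same verdict, which is why the decision is derived from the trace ID rather than from a fresh random number per span. The sketch below is a simplified stand-in for that idea; `shouldSample` is a hypothetical helper and does not reproduce the Collector's actual hashing scheme.

```javascript
// Deterministic percentage sampling keyed on the trace ID, so every span of a
// trace gets the same decision. This is a simplified stand-in, not the
// Collector's actual hashing scheme.
function shouldSample(traceId, samplingPercentage) {
  // Treat the last 8 hex characters of the trace ID as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket * 100 < samplingPercentage;
}

// Generate synthetic trace IDs for the demo.
const traceIds = [];
for (let i = 0; i < 10000; i++) {
  // Spread the IDs across the 32-bit range with a multiplicative hash constant.
  const low = Math.imul(i, 2654435761) >>> 0;
  traceIds.push('4bf92f3577b34da6a3ce929d' + low.toString(16).padStart(8, '0'));
}

const kept = traceIds.filter((id) => shouldSample(id, 10)).length;
console.log(`kept ${kept} of ${traceIds.length} traces`);
```

Because the decision is a pure function of the trace ID, two Collector replicas configured with the same percentage agree on which traces to keep.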

Trade-offs and Alternatives

Option	Vendor Lock-in	Setup Effort	Full Signal Coverage
OpenTelemetry	None	Medium	Yes
Datadog Agent	High	Low	Yes
Prometheus only	Low	Low	Metrics only
Elastic APM	Medium	Medium	Yes

Use OTel if you care about portability. Use Datadog if you want everything working in two hours and the bill is acceptable. The two are not mutually exclusive — you can instrument with OTel and export to Datadog.

Production Implementation Guidelines

Instrument new services first, legacy services last. Do not try to retrofit OTel across 60 existing services in a sprint — you will burn out and the instrumentation will be inconsistent. Start with new services, establish a standard, then migrate older ones gradually.

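Use the SDK's resource option to tag every trace and metric with your environment, cluster, and team:

JAVASCRIPT

// SDK 2.x exposes resourceFromAttributes; on SDK 1.x use new Resource({...}) instead
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { resourceFromAttributes } = require('@opentelemetry/resources');

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    'service.name': 'payment-service',
    'deployment.environment': 'production',
    'team': 'payments-squad',
    'k8s.cluster.name': 'prod-cluster',
  }),
  // ...
});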

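These attributes become tags in your trace backend and can become labels in Prometheus. With the Collector's Prometheus exporter, resource attributes are only copied onto each metric as labels when resource_to_telemetry_conversion is enabled. They are what makes filtering by team or environment possible months later.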
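Attribute keys such as service.name contain dots, which classic Prometheus label names do not allow, so the keys are rewritten on the way in. The sketch below shows the general shape of that rewrite; `toPrometheusLabels` is a hypothetical helper and its rules are an approximation, not the exporter's exact normalization.

```javascript
// Sketch of turning resource attribute keys into Prometheus-style label names.
// Classic Prometheus label names may only contain letters, digits, and
// underscores, so other characters are replaced. `toPrometheusLabels` is a
// hypothetical helper, not an OpenTelemetry or Prometheus API.
function toPrometheusLabels(attributes) {
  const labels = {};
  for (const [key, value] of Object.entries(attributes)) {
    let name = key.replace(/[^a-zA-Z0-9_]/g, '_');
    if (/^[0-9]/.test(name)) name = `_${name}`; // label names cannot start with a digit
    labels[name] = String(value);
  }
  return labels;
}

const labels = toPrometheusLabels({
  'service.name': 'payment-service',
  'deployment.environment': 'production',
  'team': 'payments-squad',
  'k8s.cluster.name': 'prod-cluster',
});
console.log(labels);
```

When writing PromQL against these metrics, query the rewritten names (for example k8s_cluster_name), not the dotted attribute keys.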

References & Further Reading

* [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - Complete SDK and Collector reference
* [Grafana LGTM Stack](https://grafana.com/about/grafana-stack/) - Loki, Grafana, Tempo, Mimir setup
* [OTel Collector Configuration](https://opentelemetry.io/docs/collector/configuration/) - Receiver, processor, exporter reference
* [Grafana 2026 Observability Survey](https://grafana.com/observability-survey/) - Industry adoption data

Frequently Asked Questions

How do you configure tail-based sampling in the OpenTelemetry Collector to retain all error traces without storing 100% of traffic?

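Use the tail_sampling processor in the OTel Collector with several policies; a trace is kept if any policy matches. Add a status_code policy that keeps traces containing a span with status ERROR, a numeric_attribute policy that keeps traces where http.status_code falls in the 500-599 range, and a probabilistic policy that keeps 5-10% of the remaining healthy traces. The Collector buffers each trace's spans for the configured decision_wait period before deciding, so high-throughput deployments need adequate memory, and all spans of a trace must reach the same Collector instance.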
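The decision logic described in this answer can be sketched in a few lines. In the Collector it is expressed as tail_sampling policies rather than code; `decideTrace`, the span fields, and the injectable `random` function below are hypothetical and exist only to make the keep-or-drop rule explicit.

```javascript
// Sketch of a tail-sampling decision made once a trace's spans have been
// buffered. Function and field names are hypothetical; in the Collector this
// logic is expressed as tail_sampling policies rather than code.
function decideTrace(spans, { healthyKeepRatio, random = Math.random }) {
  const hasError = spans.some(
    (s) => s.status === 'ERROR' || (s.httpStatus >= 500 && s.httpStatus <= 599)
  );
  if (hasError) return { keep: true, reason: 'error' };
  // Healthy traces are kept only with the configured probability.
  return random() < healthyKeepRatio
    ? { keep: true, reason: 'sampled' }
    : { keep: false, reason: 'dropped' };
}

const failedCheckout = [
  { name: 'POST /checkout', status: 'OK', httpStatus: 502 },
  { name: 'charge', status: 'ERROR' },
];
const healthyCheckout = [
  { name: 'POST /checkout', status: 'OK', httpStatus: 200 },
  { name: 'charge', status: 'OK' },
];

const failedDecision = decideTrace(failedCheckout, { healthyKeepRatio: 0.05 });
// A fixed `random` makes the healthy-path outcome repeatable for the demo.
const healthyDecision = decideTrace(healthyCheckout, { healthyKeepRatio: 0.05, random: () => 0.5 });
console.log(failedDecision, healthyDecision);
```

The function needs the full list of spans before it can answer, which is the reason tail sampling requires buffering and cannot be done span by span.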

Why do OpenTelemetry exemplars not appear in Grafana even when the Prometheus exporter is configured correctly?

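Exemplars require Prometheus to be started with the --enable-feature=exemplar-storage flag, and the Grafana Prometheus datasource needs an exemplarTraceIdDestinations entry whose name matches the label that carries the trace ID on the exemplar (trace_id in the datasource config above). The Collector's Prometheus exporter also has to expose exemplars, which requires OpenMetrics output. Finally, the instrumented service must attach exemplars to its histogram observations. The OTel Go SDK generally does this for sampled traces, while older Java agent versions may require explicit exemplar filter configuration, so check the setting for your SDK and version.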
