OpenTelemetry unifies metrics, logs, and traces under one open standard — here is how it works, what it replaces, and how to instrument your first service in 20 minutes.
For years, observability meant picking a vendor, installing their SDK, and being trapped. Datadog agent on every pod. New Relic SDK baked into your application code. When you wanted to switch tools, you rewrote your instrumentation from scratch.
OpenTelemetry (OTel) was built to end this. It is an open standard — backed by Google, Microsoft, Datadog, Grafana, AWS, and 200+ other contributors — for how applications should emit observability data. Instrument once, send anywhere.
In 2026, 47% of teams have increased their OpenTelemetry adoption, and the number is climbing. If you are still running vendor-locked agents, you are accumulating technical debt.
Before OpenTelemetry, teams had three separate tooling problems:
Metrics tell you what is happening in aggregate — request rates, error rates, latency percentiles, CPU usage. Traditionally handled by Prometheus exporters or vendor agents.
Logs tell you what happened to a specific request — error messages, state transitions, debug output. Traditionally unstructured or semi-structured text, collected by Fluentd or Filebeat.
Traces tell you how a specific request flowed through your system: which services it touched, how long each took, where it failed. Traditionally this required tool-specific client libraries such as the Jaeger or Zipkin clients, or a vendor's tracing SDK.
The problem: all three were siloed. You couldn't click from a metric spike to the logs from that spike to a trace from that spike — because they came from different systems with no shared context.
OpenTelemetry solves this with a single context model that threads through all three signals.
OpenTelemetry has three parts:
```
Your Application
      |
      |  (OTel SDK - one instrumentation)
      v
OTel Collector
      |
      +-- Prometheus (metrics)
      +-- Loki (logs)
      +-- Tempo / Jaeger (traces)
      +-- Datadog (if you want)
      +-- Any other backend
```

The SDK instruments your code and produces telemetry. The Collector receives it, processes it, and routes it to whatever backends you choose. The backends are swappable at any time without touching your application code.
The key insight in OpenTelemetry is the trace context. Every request gets a trace_id at the point of entry. Every service that handles that request propagates the trace_id downstream. Log records emitted during that request carry the same trace_id, and metrics can point back to it through exemplars.
This is what enables the experience teams actually want: you see a latency spike in Grafana, click on it, and are taken directly to the traces from that 5-minute window. Click on one trace, and every log line from every service that touched that request is right there.
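To make the propagation step concrete, here is a minimal sketch of how a trace_id can travel between services in a W3C `traceparent` HTTP header, which is the propagation format OTel SDKs use by default. The helper names (`startTrace`, `toTraceparent`, `fromTraceparent`, `childOf`) are made up for this illustration; in a real service the SDK does this for you.

```javascript
const { randomBytes } = require('node:crypto');

// Start a new trace at the point of entry: fresh trace_id and span_id.
function startTrace() {
  return {
    traceId: randomBytes(16).toString('hex'), // 32 hex characters
    spanId: randomBytes(8).toString('hex'),   // 16 hex characters
    sampled: true,
  };
}

// Serialize the context into a W3C `traceparent` header value.
function toTraceparent(ctx) {
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parse an incoming header; returns null if it is malformed.
function fromTraceparent(header) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

// A downstream service keeps the trace_id and mints its own span_id.
function childOf(parent) {
  return { ...parent, spanId: randomBytes(8).toString('hex') };
}

const entry = startTrace();
const header = toTraceparent(entry);                 // sent with the outgoing request
const downstream = childOf(fromTraceparent(header)); // rebuilt on the receiving side
console.log(downstream.traceId === entry.traceId);   // true
```

Because only the span_id changes from hop to hop, every span and log line produced along the way can be joined on the shared trace_id.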
Here is a minimal, production-ready OTel setup for a Node.js service:
```bash
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc
```

```javascript
// tracing.js: load this before anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'payment-service', // appears in all traces
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317', // your OTel Collector
  }),
  instrumentations: [
    getNodeAutoInstrumentations(), // auto-instruments http, db, redis
  ],
});

sdk.start();
```

```javascript
// server.js: start with tracing loaded
require('./tracing');
const express = require('express');
const app = express();
// ... rest of your app
```

That's it. Auto-instrumentation patches http, express, pg, redis, mongoose, and 40+ other libraries automatically. You get traces for every incoming request and every database call without touching your business logic.
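If "patches" sounds like magic, the underlying idea is ordinary function wrapping. The sketch below is a simplified illustration of that technique, not OTel's actual implementation: the `db` object, the `instrument` helper, and the `finishedSpans` array are all hypothetical stand-ins, and it only handles synchronous calls (real instrumentations also deal with promises, callbacks, and context propagation).

```javascript
// Hypothetical stand-in for a library such as a database client.
const db = {
  query(sql) {
    return `rows for: ${sql}`;
  },
};

const finishedSpans = []; // stand-in for an exporter

// Replace obj[method] with a wrapper that records a span around each call.
function instrument(obj, method, spanName) {
  const original = obj[method];
  obj[method] = function (...args) {
    const start = process.hrtime.bigint();
    let status = 'OK';
    try {
      return original.apply(this, args);
    } catch (err) {
      status = 'ERROR';
      throw err;
    } finally {
      // Runs on both success and failure, so every call yields one span.
      const durationNs = process.hrtime.bigint() - start;
      finishedSpans.push({ name: spanName, status, durationNs });
    }
  };
}

instrument(db, 'query', 'db.query');
const rows = db.query('SELECT 1'); // the calling code is unchanged
console.log(rows, finishedSpans.length);
```

The caller still invokes `db.query` exactly as before, which is why instrumentation of this kind needs no changes to business logic.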
The Collector is what makes OTel production-grade. You run it as a sidecar or as a central deployment, and it handles batching, retry, sampling, and routing to multiple backends.
```yaml
# Kubernetes Deployment for the Collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 2
  selector:                # required by apps/v1
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/conf/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config  # assumed name of the ConfigMap holding config.yaml
```

```yaml
# Collector config, mounted at /conf/config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s   # required: how often memory usage is checked
    limit_mib: 512

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # Prometheus scrapes here
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```

This single Collector config routes traces to Tempo, metrics to Prometheus, and logs to Loki. Together these form the Grafana LGTM stack (with Prometheus standing in for Mimir), which is the most common open-source OTel backend in 2026.
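The `batch` processor's two settings, `timeout` and `send_batch_size`, are easier to reason about with a toy model. The following is a rough sketch of size-or-time batching in general, assuming a hypothetical `Batcher` class and `exportFn` callback; it is not the Collector's code and leaves out retries and backpressure.

```javascript
// Simplified batcher: flush when the batch is full or the timeout fires.
class Batcher {
  constructor({ sendBatchSize, timeoutMs, exportFn }) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
    this.exportFn = exportFn;
    this.items = [];
    this.timer = null;
  }

  add(item) {
    this.items.push(item);
    if (this.items.length >= this.sendBatchSize) {
      this.flush(); // size trigger
    } else if (!this.timer) {
      // time trigger: the first item of a batch starts the clock
      this.timer = setTimeout(() => this.flush(), this.timeoutMs);
    }
  }

  flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    this.exportFn(batch);
  }
}

const exported = [];
const batcher = new Batcher({
  sendBatchSize: 3,
  timeoutMs: 1000,
  exportFn: (batch) => exported.push(batch),
});

['a', 'b', 'c', 'd'].forEach((span) => batcher.add(span));
batcher.flush(); // e.g. on shutdown, drain whatever is left
console.log(exported.map((b) => b.length)); // [ 3, 1 ]
```

A larger batch size means fewer, bigger requests to the backend; a shorter timeout means less delay before data shows up. The values in the config above (1024 items, 1s) trade a little latency for far fewer export calls.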
With OTel feeding the LGTM stack, Grafana can link signals together. A dashboard panel showing error rate for payment-service can be configured to show "Explore Traces" and "Show Logs" links that pre-filter by the same time range and service.
The configuration that enables this is exemplars — metric data points that carry a trace_id. When Prometheus receives an OTel metric with an exemplar, Grafana can render a clickable link from the metric point to the exact trace in Tempo.
```yaml
# In your Grafana datasource config
- name: Prometheus
  type: prometheus
  url: http://prometheus:9090
  jsonData:
    exemplarTraceIdDestinations:
      - name: trace_id
        datasourceUid: tempo-uid  # links to your Tempo datasource
```

OTel standardizes the collection of telemetry. It does not standardize analysis, alerting, or dashboarding. You still need a backend (Grafana, Datadog, Honeycomb, etc.) to make sense of the data.
OTel also adds overhead. Auto-instrumentation is convenient but not free — a high-throughput service at scale may need head-based or tail-based sampling to keep trace volume manageable. Configure sampling in the Collector, not in the SDK.
```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # sample 10% of traces in production
```

| Option | Vendor Lock-in | Setup Effort | Full Signal Coverage |
|---|---|---|---|
| OpenTelemetry | None | Medium | Yes |
| Datadog Agent | High | Low | Yes |
| Prometheus only | Low | Low | Metrics only |
| Elastic APM | Medium | Medium | Yes |
Use OTel if you care about portability. Use Datadog if you want everything working in two hours and the bill is acceptable. The two are not mutually exclusive — you can instrument with OTel and export to Datadog.
Instrument new services first, legacy services last. Do not try to retrofit OTel across 60 existing services in a sprint — you will burn out and the instrumentation will be inconsistent. Start with new services, establish a standard, then migrate older ones gradually.
Use the SDK's resource attributes to tag every trace and metric with your environment, cluster, and team:

```javascript
const { Resource } = require('@opentelemetry/resources');
// On @opentelemetry/resources 2.x, build this with resourceFromAttributes({...}) instead.

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'payment-service',
    'deployment.environment': 'production',
    'team': 'payments-squad',
    'k8s.cluster.name': 'prod-cluster',
  }),
  // ...
});
```

These attributes become labels in Prometheus and tags in your trace backend. They are what makes filtering by team or environment possible months later.
**References & Further Reading**

* [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - Complete SDK and Collector reference
* [Grafana LGTM Stack](https://grafana.com/about/grafana-stack/) - Loki, Grafana, Tempo, Mimir setup
* [OTel Collector Configuration](https://opentelemetry.io/docs/collector/configuration/) - Receiver, processor, exporter reference
* [Grafana 2026 Observability Survey](https://grafana.com/observability-survey/) - Industry adoption data
Use the tail_sampling processor in the OTel Collector with a composite policy: always keep traces that contain a span whose status code is ERROR or whose http.status_code is 5xx, and probabilistically sample the remaining healthy traces at 5-10%. The Collector buffers every span of a trace in memory for the configured decision_wait window before making the sampling decision, so high-throughput deployments need adequate memory allocation.
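The decision rule itself is small. Here is a sketch of it as a plain function, assuming a hypothetical span shape of `{ statusCode, attributes }` and an injectable `random` source so the behaviour can be checked deterministically. It illustrates the policy only; the Collector's tail_sampling processor is configured in YAML and is not implemented this way.

```javascript
// Decide whether to keep a fully buffered trace.
// `spans` is an array of { statusCode, attributes }; `random` is injectable for tests.
function shouldKeepTrace(spans, healthySampleRate = 0.1, random = Math.random) {
  const hasError = spans.some((s) => s.statusCode === 'ERROR');
  const has5xx = spans.some((s) => {
    const code = s.attributes?.['http.status_code'];
    return code >= 500 && code <= 599; // false when the attribute is missing
  });
  if (hasError || has5xx) return true;  // always keep failures
  return random() < healthySampleRate;  // sample the healthy remainder
}

const failed = [
  { statusCode: 'OK', attributes: {} },
  { statusCode: 'ERROR', attributes: {} },
];
const gatewayError = [{ statusCode: 'UNSET', attributes: { 'http.status_code': 503 } }];
const healthy = [{ statusCode: 'OK', attributes: { 'http.status_code': 200 } }];

console.log(shouldKeepTrace(failed));                   // true
console.log(shouldKeepTrace(gatewayError));             // true
console.log(shouldKeepTrace(healthy, 0.1, () => 0.05)); // true  (0.05 < 0.1)
console.log(shouldKeepTrace(healthy, 0.1, () => 0.5));  // false
```

Note that the function needs the whole trace before it can answer, which is exactly why tail sampling costs memory: nothing can be dropped until the buffered spans have been inspected.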
Exemplars require Prometheus to be started with the --enable-feature=exemplar-storage flag, and the Grafana datasource must have exemplars enabled in its configuration with a valid traceIdLabelName matching your OTel trace ID label. Additionally, the Prometheus client SDK must emit exemplars on histogram observations — the OTel Go SDK does this automatically but older Java agent versions require explicit exemplar filter configuration.