15 Docker Observability
A focused guide to Docker Observability, connecting core concepts with practical Docker and container operations.
Docker observability is the practice of gaining insight into the behavior, health, and performance of containerized applications and of the Docker runtime itself. It rests on the same three pillars as observability in general: logs, metrics, and traces. Containers add their own considerations, because they isolate processes, constrain resources, and hide some of the visibility that a traditional host-based deployment provides by default.
Why containers complicate observability
A process running directly on a host is visible to standard host tools: ps, top, system logs, and direct filesystem access all work without any special configuration. A process running inside a container is, by design, separated from the host in several of these dimensions. Its output goes to Docker's logging driver rather than to the system logs, its filesystem lives in the container's own layers, and although ps and top on a Linux host still list it, they do not say which container it belongs to. Observability for containerized applications therefore generally requires deliberate instrumentation and tooling rather than relying on what the host already provides for free.
docker stats my-api
docker logs --tail 100 -f my-api
These two commands cover the most basic layer of Docker observability, ad hoc resource usage and log tailing, but neither scales to a production environment running many containers across multiple hosts without additional aggregation and retention infrastructure layered on top.
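For scripted collection rather than interactive viewing, docker stats can print one JSON object per container with --no-stream --format '{{json .}}'. The sketch below shows one way such a line might be turned into numbers. The function name parseStatsLine and the sample line are illustrative assumptions, and the field names (Name, CPUPerc, MemPerc) are assumed to match what the local Docker version emits.

```javascript
// Parse one line of `docker stats --no-stream --format '{{json .}}'` output.
// Assumption: the line carries Name, CPUPerc and MemPerc as percent strings.
function parseStatsLine(line) {
  const raw = JSON.parse(line);
  // "12.50%" -> 12.5; non-numeric values (e.g. "--") become null.
  const toNumber = (pct) => {
    const n = Number.parseFloat(String(pct).replace('%', ''));
    return Number.isNaN(n) ? null : n;
  };
  return {
    name: raw.Name,
    cpuPercent: toNumber(raw.CPUPerc),
    memPercent: toNumber(raw.MemPerc),
  };
}

// Hypothetical sample line, not captured from a real host.
const sample = '{"Name":"my-api","CPUPerc":"12.50%","MemPerc":"3.20%"}';
console.log(parseStatsLine(sample));
```

A script like this still only sees the local daemon, so it complements rather than replaces a proper metrics pipeline.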
Logs
Container logs are the most immediately accessible observability signal, captured automatically from a container's stdout and stderr by Docker's logging driver:
docker run -d --log-driver=json-file --log-opt max-size=10m my-api
For anything beyond a single host, logs need to be shipped to a centralized aggregation system, since docker logs only works against containers running on the local Docker daemon and provides no cross-host search or long-term retention:
services:
  api:
    logging:
      driver: gelf
      options:
        gelf-address: "udp://logging-host:12201"
Structured logging, emitting JSON rather than free-form text, makes logs significantly more useful once aggregated, since a centralized log system can index and query individual fields rather than only performing text search across unstructured lines.
logger.info(JSON.stringify({ event: 'request_completed', path: req.path, duration_ms: elapsed, status: res.statusCode }));
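As a minimal sketch of what such a logger might look like without any logging library, the helper below writes one JSON object per line to stdout, where Docker's logging driver picks it up. The names formatLogLine and logInfo and the ts and level fields are assumptions for illustration, not part of any particular framework.

```javascript
// Minimal structured logger: one JSON object per line on stdout,
// which Docker's logging driver then captures as-is.
function formatLogLine(level, event, fields = {}, now = new Date()) {
  // Fixed keys are spread last so caller-supplied fields cannot overwrite them.
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, event });
}

function logInfo(event, fields) {
  process.stdout.write(formatLogLine('info', event, fields) + '\n');
}

// Hypothetical request data.
logInfo('request_completed', { path: '/orders', duration_ms: 42, status: 200 });
```

Keeping each entry on a single line matters because line-oriented logging drivers treat every newline as a separate message.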
Metrics
Metrics provide a numeric, time-series view of behavior. They suit dashboards, alerting thresholds, and trend analysis better than logs do, even structured ones:
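# The google/cadvisor image on Docker Hub is deprecated; the read-only host mounts let cAdvisor see other containers.
docker run -d -p 8080:8080 -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro gcr.io/cadvisor/cadvisor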
cAdvisor, built specifically for container metrics, exposes per-container CPU, memory, network, and filesystem usage in a format that Prometheus and similar systems can scrape directly, giving infrastructure-level visibility without requiring any instrumentation inside the application containers themselves.
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Application-level metrics, distinct from container resource metrics, require instrumentation inside the application code itself, exposing a metrics endpoint that reflects business and request-level behavior rather than only infrastructure resource consumption:
const promClient = require('prom-client');
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
});
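To make concrete what a duration histogram records, here is a small sketch of cumulative bucket counting in plain JavaScript. It is an illustration only, not prom-client's implementation. The class name SimpleHistogram and the bucket bounds are hypothetical.

```javascript
// Illustrative cumulative histogram, similar in spirit to a Prometheus histogram.
// Not prom-client's implementation; the bucket bounds are hypothetical.
class SimpleHistogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
    this.count = 0; // total observations (what a "+Inf" bucket would hold)
    this.sum = 0;   // sum of all observed values
  }

  observe(value) {
    this.count += 1;
    this.sum += value;
    // Cumulative: every bucket whose upper bound is >= value is incremented.
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
  }
}

const durations = new SimpleHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((seconds) => durations.observe(seconds));
console.log(durations.bucketCounts, durations.count);
```

Because the buckets are cumulative, a backend can estimate percentiles from the counts alone, without storing every individual request duration.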
Distributed tracing
In a multi-service containerized architecture, a single user-facing request often touches several containers in sequence, and understanding where time is spent or where a failure originated requires tracing that follows the request across that entire chain, not just metrics or logs from any one service in isolation:
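const { trace } = require('@opentelemetry/api');
// getTracer returns a no-op tracer unless an OpenTelemetry SDK has been registered at startup.
const tracer = trace.getTracer('my-api');
const span = tracer.startSpan('process-order');
try {
  // ... work happens ...
} finally {
  span.end(); // always end the span, even if the work throws
}
The exporter endpoint can then be supplied through the container's environment:
services:
  api:
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317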
OpenTelemetry has become a common standard for this, providing a vendor-neutral way to instrument an application once and export the resulting trace data to whichever backend (Jaeger, Tempo, a commercial APM tool) an organization chooses, without re-instrumenting the application if that backend choice changes later.
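What lets a trace follow a request from one container to the next is context propagation. By default OpenTelemetry uses the W3C Trace Context traceparent HTTP header for this. The SDK normally handles it automatically, so the sketch below is only meant to show what travels between services. The function names parseTraceparent and childTraceparent and the example IDs are illustrative assumptions.

```javascript
// Sketch of W3C Trace Context `traceparent` handling:
// version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header for an outgoing call: same trace ID, new span ID.
function childTraceparent(parent, newSpanId) {
  return `00-${parent.traceId}-${newSpanId}-${parent.sampled ? '01' : '00'}`;
}

// Example header value; the IDs are placeholders.
const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(childTraceparent(incoming, 'b7ad6b7169203331'));
```

The trace ID stays constant across every hop, which is what allows a backend to stitch spans from different containers into one trace.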
Health checks as a fourth, simpler signal
Beyond the three classic pillars, Docker's built-in health check mechanism provides a simpler, coarse signal tied directly to the container runtime. An orchestrator such as Docker Swarm can act on it for restart and routing decisions, and it is useful even without a full observability stack in place:
# Assumes curl is installed in the image.
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/healthz || exit 1
The current status can then be read from the host:
docker inspect --format='{{.State.Health.Status}}' my-api
This signal is coarser than full metrics or tracing, but it requires no external infrastructure at all, making it a reasonable minimum observability baseline even for the smallest deployments.
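The way --retries turns individual probe results into a status can be sketched as a small state machine. This is a simplified model, assuming no start period and that a single successful probe restores the healthy state. The function name healthStatuses is hypothetical.

```javascript
// Sketch of how consecutive probe results map to a health status.
// Assumptions: the start period is ignored, and one success restores "healthy".
function healthStatuses(probeResults, retries = 3) {
  let status = 'starting';
  let consecutiveFailures = 0;
  return probeResults.map((ok) => {
    if (ok) {
      consecutiveFailures = 0;
      status = 'healthy';
    } else {
      consecutiveFailures += 1;
      // Only a run of `retries` failures in a row flips the status.
      if (consecutiveFailures >= retries) status = 'unhealthy';
    }
    return status;
  });
}

console.log(healthStatuses([true, false, false, true, false, false, false]));
```

Requiring several consecutive failures keeps a single slow response from marking an otherwise working container as unhealthy.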
Correlating signals across the three pillars
The real value of observability comes from being able to move between logs, metrics, and traces for the same request or incident, rather than treating each as an isolated tool. Including a consistent identifier, such as a trace ID, across all three makes this correlation possible:
logger.info(JSON.stringify({ traceId: span.spanContext().traceId, event: 'order_failed', orderId }));
An operator investigating an elevated error rate seen in metrics can then search logs for that specific time window and correlate them with traces for the exact requests that failed, rather than needing to manually reconstruct timing alignment across three disconnected systems.
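As a rough sketch of that correlation step, the function below filters JSON log lines down to those belonging to a given set of trace IDs, for example the IDs of failed requests found in a tracing backend. The log shape, the IDs, and the name logsForTraces are hypothetical. A real aggregation system would run the equivalent as a query rather than in application code.

```javascript
// Join JSON log lines to a set of trace IDs (e.g. failed requests found via tracing).
// The log shape and IDs below are hypothetical.
function logsForTraces(logLines, traceIds) {
  const wanted = new Set(traceIds);
  return logLines
    .map((line) => {
      try {
        return JSON.parse(line);
      } catch {
        return null; // skip lines that are not valid JSON
      }
    })
    .filter((entry) => entry !== null && wanted.has(entry.traceId));
}

const lines = [
  '{"traceId":"aaa","event":"order_failed","orderId":17}',
  'plain text line from a dependency',
  '{"traceId":"bbb","event":"request_completed"}',
];
console.log(logsForTraces(lines, ['aaa']));
```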
Common mistakes
- Relying only on docker logs and docker stats run manually, with no aggregation or retention, leaving no observability once a container is removed or a host is lost.
- Capturing infrastructure-level container metrics but never instrumenting application-level metrics, missing visibility into business and request behavior that resource metrics alone cannot reveal.
- Writing unstructured log lines that are difficult to query at scale once aggregated, losing much of the value centralized log aggregation would otherwise provide.
- Treating distributed tracing as optional in a multi-service architecture, making cross-service failure investigation dramatically slower without it.
- Failing to correlate logs, metrics, and traces with a shared identifier, leaving each signal as an isolated, harder-to-cross-reference data source during an investigation.
Effective Docker observability combines centralized, structured logging, metrics at both the infrastructure and application level, and distributed tracing for multi-service request flows, all correlated through shared identifiers. These layers build on, but do not replace, the basic health check signal that Docker's runtime already provides.