15.2.1 Runtime Resource Metrics
A focused guide to Runtime Resource Metrics, connecting core concepts with practical Docker and container operations.
Runtime resource metrics are measurements produced by an application's own language runtime: heap usage, garbage collection pauses, event loop lag, thread or goroutine counts, and connection pool saturation. They sit between container-level cgroup metrics and business-level application metrics, and they explain the internal behavior of the process running inside a container in a way neither of the other two layers can.
Why container metrics alone are insufficient
Container-level metrics report that a process is using a certain amount of memory or CPU, but they cannot explain why. A container approaching its memory limit could be doing so because of a genuine memory leak, an oversized cache, or simply a garbage collector that has not yet run. These three very different situations look identical from outside the process:
docker stats --no-stream --format "{{.Name}}: {{.MemUsage}}" my-api
my-api: 480MiB / 512MiB
Runtime resource metrics fill exactly this gap, providing the internal breakdown a container-level view cannot.
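As a rough illustration of that internal breakdown, the sketch below splits a Node.js process's memory into the parts the runtime itself can see. `process.memoryUsage()` is a built-in Node.js call; the `memoryBreakdown` helper and the 512 MiB limit are assumptions chosen to mirror the `docker stats` example, and the resident set size only approximately corresponds to the figure Docker reports.

```javascript
// Assumed limit mirroring the docker stats example; not read from the cgroup.
const ASSUMED_LIMIT_BYTES = 512 * 1024 * 1024;

// Hypothetical helper: report what the runtime knows about its own memory.
function memoryBreakdown(limitBytes = ASSUMED_LIMIT_BYTES) {
  const usage = process.memoryUsage();
  return {
    rssBytes: usage.rss,             // resident set size of the whole process
    heapUsedBytes: usage.heapUsed,   // JS objects currently on the V8 heap
    heapTotalBytes: usage.heapTotal, // heap space V8 has reserved
    externalBytes: usage.external,   // memory of C++ objects bound to JS objects
    rssFractionOfLimit: usage.rss / limitBytes,
  };
}

console.log(memoryBreakdown());
```

A large gap between `heapUsedBytes` and `rssBytes` suggests the memory is held outside the JavaScript heap, while a heap close to the resident size points back at the application's own allocations.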
Heap and garbage collection metrics
For garbage-collected languages, heap size, allocation rate, and time spent in garbage collection are some of the most diagnostically valuable runtime metrics, since memory pressure that originates from the application's own allocation pattern looks very different from memory pressure caused by an external factor:
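const v8 = require('v8');
// `metrics` is assumed to be the application's existing metrics client.
setInterval(() => {
  // Sample V8 heap statistics every 10 seconds.
  const stats = v8.getHeapStatistics();
  metrics.gauge('heap_used_bytes', stats.used_heap_size);
  metrics.gauge('heap_total_bytes', stats.total_heap_size);
}, 10000);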
java -Xlog:gc -jar my-api.jar
A heap that keeps growing, with garbage collection time rising alongside it, while request volume stays flat usually points toward an actual memory leak rather than ordinary working-set growth. Legitimate working memory typically plateaus once steady-state traffic is reached, while a leak continues climbing for as long as the process runs.
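One way to make the plateau-versus-climb distinction concrete is to look at the heap's floor, the lowest value seen in each observation window, since that is roughly what remains after collections. The sketch below is a deliberately crude heuristic under that assumption; the function names, the window size, and the sample values are all hypothetical.

```javascript
// Split samples into consecutive windows and return the minimum of each.
function windowFloors(samples, windowSize) {
  const floors = [];
  for (let i = 0; i + windowSize <= samples.length; i += windowSize) {
    floors.push(Math.min(...samples.slice(i, i + windowSize)));
  }
  return floors;
}

// True when every window's floor is strictly higher than the previous one's.
function floorKeepsRising(samples, windowSize) {
  const floors = windowFloors(samples, windowSize);
  if (floors.length < 3) return false; // too little data to call a trend
  return floors.every((floor, i) => i === 0 || floor > floors[i - 1]);
}

// Made-up heap_used samples in MiB: a sawtooth that plateaus, and one that climbs.
const plateau = [100, 140, 105, 150, 102, 145, 104, 148, 101, 150, 103, 149];
const leaking = [100, 140, 120, 160, 135, 175, 150, 190, 170, 210, 185, 225];

console.log(floorKeepsRising(plateau, 4)); // false
console.log(floorKeepsRising(leaking, 4)); // true
```

A real alert would more likely use a slope over a longer window with some tolerance for noise; the strict comparison here only illustrates the idea.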
Garbage collection pause time as a latency factor
Beyond memory accounting, garbage collection pauses directly affect application latency, since most garbage collectors briefly pause some or all application threads while reclaiming memory, which means GC metrics are as relevant to a latency investigation as they are to a memory investigation:
java -Xlog:gc,safepoint -jar my-api.jar
gc_pause_seconds_total: 0.42
A service experiencing intermittent latency spikes that correlate with garbage collection events, rather than with request volume or downstream dependency latency, points toward a GC tuning problem rather than an application logic or infrastructure issue, which is a distinction that container or business metrics alone would not surface.
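For a Node.js service, a comparable pause total can be gathered in-process. The sketch below assumes the built-in `perf_hooks` module, whose `PerformanceObserver` reports `gc` entries with a duration in milliseconds; `createGcPauseTracker` is a hypothetical helper, and the reported duration is only an approximation of time the application was actually stopped.

```javascript
const { PerformanceObserver } = require('node:perf_hooks');

// Hypothetical helper: accumulate GC durations for export as gauges or counters.
function createGcPauseTracker() {
  let count = 0;
  let totalMs = 0;
  let maxMs = 0;

  const record = (durationMs) => {
    count += 1;
    totalMs += durationMs;
    maxMs = Math.max(maxMs, durationMs);
  };

  // Node.js emits one 'gc' performance entry per collection.
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) record(entry.duration);
  });
  observer.observe({ entryTypes: ['gc'] });

  return {
    record,
    snapshot: () => ({ count, totalSeconds: totalMs / 1000, maxMs }),
    stop: () => observer.disconnect(),
  };
}

const tracker = createGcPauseTracker();
console.log(tracker.snapshot()); // read this on every metrics scrape
tracker.stop();
```

Exporting the maximum alongside the total matters for latency work, because a single long pause can hide inside an unremarkable total.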
Event loop lag for single-threaded runtimes
For event-loop-based runtimes such as Node.js, event loop lag (the delay between when a scheduled callback should run and when it actually runs) is a critical runtime metric, since a blocked or overloaded event loop delays every concurrent request being handled by that process:
// Sample the loop every 1000 ms; lag() returns the measured lag in milliseconds.
const lag = require('event-loop-lag')(1000);
setInterval(() => {
  metrics.gauge('event_loop_lag_ms', lag());
}, 5000);
Rising event loop lag under otherwise normal-looking CPU and memory metrics typically indicates a synchronous, blocking operation somewhere in the request path, such as a CPU-intensive computation run without yielding or a blocking I/O call that should have been asynchronous. This is exactly the kind of problem that does not show up clearly in container-level resource metrics.
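The same signal can also be collected without a third-party package. The sketch below assumes Node.js's built-in `perf_hooks.monitorEventLoopDelay`, which records delays in nanoseconds; the `startLoopLagMonitor` wrapper and the 20 ms resolution are illustrative choices rather than recommendations.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Hypothetical wrapper: expose windowed event loop delay in milliseconds.
function startLoopLagMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return {
    // Return the stats for the window since the last read, then reset.
    read() {
      const snapshot = {
        meanMs: histogram.mean / 1e6, // NaN until the first sample arrives
        p99Ms: histogram.percentile(99) / 1e6,
        maxMs: histogram.max / 1e6,
      };
      histogram.reset();
      return snapshot;
    },
    stop() {
      histogram.disable();
    },
  };
}

const monitor = startLoopLagMonitor();
setTimeout(() => {
  console.log(monitor.read());
  monitor.stop();
}, 200);
```

Reporting a high percentile or the maximum per window tends to be more useful than the mean, since a single long blocking call is exactly what the mean smooths away.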
Thread pool and goroutine metrics
For multi-threaded or concurrent runtimes, thread counts, goroutine counts, and worker pool utilization reveal concurrency-related problems that aggregate CPU metrics cannot distinguish on their own:
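// Go: report the current number of goroutines.
import "runtime"
metrics.gauge("goroutine_count", float64(runtime.NumGoroutine()))
// Node.js: report worker pool state (workerPool is the application's own pool object).
metrics.gauge('worker_pool_active', workerPool.activeCount);
metrics.gauge('worker_pool_queued', workerPool.pending);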
A steadily increasing goroutine or thread count, without a corresponding increase in actual workload, is a strong early signal of a goroutine or thread leak: a code path that spawns concurrent work with no mechanism to ensure it always terminates. The count reveals the leak well before it manifests as outright resource exhaustion.
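The `worker_pool_active` and `worker_pool_queued` gauges presuppose a pool that tracks those two numbers. A minimal sketch of such a pool follows; the `WorkerPool` class is hypothetical and limits concurrent promise-returning tasks rather than managing real threads, but it exposes `activeCount` and `pending` in the shape the gauges expect.

```javascript
// Hypothetical bounded task pool that exposes its own utilization.
class WorkerPool {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.activeCount = 0; // tasks currently running
    this.queue = [];      // tasks waiting for a free slot
  }

  get pending() {
    return this.queue.length;
  }

  // Queue a task (a function returning a value or a promise).
  run(task) {
    return new Promise((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      this._drain();
    });
  }

  // Start queued tasks while free slots remain.
  _drain() {
    while (this.activeCount < this.concurrency && this.queue.length > 0) {
      const { task, resolve, reject } = this.queue.shift();
      this.activeCount += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.activeCount -= 1;
          this._drain();
        });
    }
  }
}

const pool = new WorkerPool(2);
for (let i = 0; i < 5; i += 1) {
  pool.run(() => new Promise((resolve) => setTimeout(resolve, 10)));
}
console.log(pool.activeCount, pool.pending); // 2 3
```

A `pending` value that only ever grows while the workload stays flat is the pool-level analogue of a climbing goroutine count.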
Connection pool saturation
Database and external service connection pools have their own internal state (active connections, idle connections, and queued requests waiting for an available connection) that is directly relevant to diagnosing latency and timeout issues but is entirely invisible without explicit instrumentation:
// Assumes node-postgres (pg), whose Pool exposes these three counters.
const { Pool } = require('pg');
const pool = new Pool({ max: 20 });
setInterval(() => {
  metrics.gauge('db_pool_total', pool.totalCount);
  metrics.gauge('db_pool_idle', pool.idleCount);
  metrics.gauge('db_pool_waiting', pool.waitingCount);
}, 5000);
A connection pool with zero idle connections and a growing waiting count under load is a direct signal that requests are queueing for a connection inside the application, so the pool itself, not the database's own capacity, is the bottleneck currently limiting throughput. Database-side metrics alone would not make this distinction obvious.
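The saturation condition described above can be written down as a small check over the three counters. In this sketch `describePool` is a hypothetical helper, the field names follow the gauges in the preceding example, and `max` is the configured pool size.

```javascript
// Hypothetical helper: summarize pool state from its counters.
function describePool({ totalCount, idleCount, waitingCount }, max) {
  const inUse = totalCount - idleCount;
  return {
    inUse,
    utilization: max > 0 ? inUse / max : 0,
    // Saturated: the pool is at its size limit, nothing is idle, callers are queueing.
    saturated: totalCount >= max && idleCount === 0 && waitingCount > 0,
  };
}

console.log(describePool({ totalCount: 20, idleCount: 0, waitingCount: 7 }, 20));
// { inUse: 20, utilization: 1, saturated: true }
console.log(describePool({ totalCount: 12, idleCount: 5, waitingCount: 0 }, 20));
// { inUse: 7, utilization: 0.35, saturated: false }
```

Exporting `utilization` as its own gauge makes it possible to alert before the pool is fully saturated rather than only after callers start waiting.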
Exposing runtime metrics through the same pipeline as application metrics
Runtime resource metrics are most useful when exported through the same metrics pipeline as application-level business metrics, allowing correlation between, for example, a latency spike and a simultaneous garbage collection pause or connection pool exhaustion event on the same dashboard:
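const express = require('express');
const { register, collectDefaultMetrics } = require('prom-client'); // assumed library
collectDefaultMetrics(); // registers common runtime metrics on the default registry
const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});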
Many language ecosystems provide pre-built instrumentation libraries that automatically collect and expose common runtime metrics (heap, GC, event loop lag, and similar) alongside whatever custom business metrics an application defines, removing the need to instrument these specific, well-understood signals manually for each new service.
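To show what such a library does underneath, the sketch below renders runtime and business gauges side by side in the Prometheus text exposition format using only Node.js built-ins. The function names, the `orders_in_flight` metric, and the port in the usage comment are assumptions for illustration; a real service would normally rely on an instrumentation library instead.

```javascript
const http = require('node:http');

// Render a { name: value } map as Prometheus text-format gauges.
function renderGauges(gauges) {
  return Object.entries(gauges)
    .map(([name, value]) => `# TYPE ${name} gauge\n${name} ${value}\n`)
    .join('');
}

// Collect runtime metrics and business metrics in one place.
function collect() {
  const mem = process.memoryUsage();
  return {
    heap_used_bytes: mem.heapUsed,
    heap_total_bytes: mem.heapTotal,
    orders_in_flight: 0, // placeholder for a hypothetical business metric
  };
}

// Serve both kinds of metric from a single /metrics endpoint.
function createMetricsServer() {
  return http.createServer((req, res) => {
    if (req.url !== '/metrics') {
      res.statusCode = 404;
      res.end();
      return;
    }
    res.setHeader('Content-Type', 'text/plain; version=0.0.4; charset=utf-8');
    res.end(renderGauges(collect()));
  });
}

// Usage (port chosen arbitrarily): createMetricsServer().listen(9100);
console.log(renderGauges(collect()));
```

Because both kinds of metric leave the process through one endpoint, they share scrape timestamps, which is what makes them comparable on a single timeline.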
Common mistakes
- Treating container-level memory and CPU metrics as sufficient on their own, without runtime-level metrics to explain the internal cause of a resource trend.
- Not monitoring garbage collection pause time specifically, missing a common and otherwise hard-to-diagnose cause of intermittent latency spikes.
- Overlooking event loop lag for single-threaded, event-loop-based runtimes, where a blocked loop affects every concurrent request simultaneously.
- Failing to instrument connection pool state, leaving pool exhaustion indistinguishable from genuine downstream capacity limits during a latency investigation.
- Instrumenting runtime metrics through a separate pipeline from application metrics, losing the ability to correlate them on a shared timeline during an investigation.
Runtime resource metrics close the gap between what a container's resource consumption looks like from outside and why the process behaves that way from inside. Instrumenting the well-understood signals (heap and GC behavior, event loop lag, thread or goroutine counts, and connection pool state) turns resource investigations from guesswork into something that can usually be diagnosed directly from the data already being collected.