15.1.1.5 Container Log Aggregation

A focused guide to Container Log Aggregation, connecting core concepts with practical Docker and container operations.

Container log aggregation is the storage, indexing, and querying layer that receives log output forwarded from many containers and hosts. It is distinct from the collection agents that gather and forward that output: aggregation is concerned with how the combined volume of log data is organized so that it can be searched, filtered, and correlated once it has arrived in one place.

Choosing an aggregation backend

Several categories of aggregation backend are commonly used with containerized deployments, each with different trade-offs around indexing cost, query flexibility, and operational complexity:

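# Option A: Elasticsearch with Kibana (single node, local evaluation only)
services:
  elasticsearch:
    image: elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      # 8.x enables security and TLS by default; it is disabled here only so
      # this local sketch works over plain HTTP. Do not do this in production.
      - xpack.security.enabled=false
  kibana:
    image: kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"

# Option B: Loki with Grafana (a separate Compose file from Option A)
services:
  loki:
    image: grafana/loki:3.0.0
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana
    ports:
      # Grafana listens on 3000 inside the container; published on 3001 on the host
      - "3001:3000"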

Elasticsearch-based stacks (commonly paired with Kibana) index the full content of every log line, enabling rich free-text search at the cost of higher storage and indexing overhead; Loki, by contrast, indexes only a small set of labels and stores log content itself more cheaply, trading some query flexibility for significantly lower operational cost at scale.

Indexing strategy and its cost implications

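How much of a log line gets indexed directly affects both query capability and infrastructure cost. Indexing every field of every log line enables the most flexible querying but scales poorly once log volume grows substantially. The structured event below is what a fully indexing backend would index field by field; the Loki-style query after it selects the same event by labels and then scans the message content: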

{
  "service": "my-api",
  "level": "error",
  "message": "Payment charge failed",
  "orderId": "12345"
}

# Loki-style: only labels are indexed, message content is stored but not indexed
{service="my-api", level="error"} |= "Payment charge failed"

A common middle ground is indexing a deliberately small set of high-value fields, such as service name, log level, and environment, while leaving the full message content searchable but not separately indexed, which keeps query performance reasonable for the most common filtering patterns without the cost of fully indexing every possible field.
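The middle ground described above can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular backend is implemented: the `LabelIndexedStore` class, the `INDEXED_FIELDS` choice, and the sample records are all hypothetical. It shows the shape of the trade-off: label lookups use an index, while message content is only scanned within the records the labels have already narrowed down.

```python
from collections import defaultdict

# Hypothetical choice of indexed fields; everything else is stored but not indexed.
INDEXED_FIELDS = ("service", "level", "environment")


class LabelIndexedStore:
    """Toy log store: indexes a few labels, scans message content on demand."""

    def __init__(self):
        self._records = []              # every record, in arrival order
        self._index = defaultdict(set)  # (field, value) -> record positions

    def add(self, record):
        position = len(self._records)
        self._records.append(record)
        for field in INDEXED_FIELDS:
            if field in record:
                self._index[(field, record[field])].add(position)

    def query(self, labels, contains=None):
        """Narrow by indexed labels first, then scan the survivors for a substring."""
        candidates = set(range(len(self._records)))
        for field, value in labels.items():
            if field not in INDEXED_FIELDS:
                raise ValueError(f"{field!r} is not an indexed field")
            candidates &= self._index.get((field, value), set())
        hits = [self._records[i] for i in sorted(candidates)]
        if contains is not None:
            hits = [r for r in hits if contains in r.get("message", "")]
        return hits


store = LabelIndexedStore()
store.add({"service": "my-api", "level": "error", "environment": "prod",
           "message": "Payment charge failed", "orderId": "12345"})
store.add({"service": "my-api", "level": "info", "environment": "prod",
           "message": "Payment charge succeeded", "orderId": "12346"})
store.add({"service": "worker", "level": "error", "environment": "prod",
           "message": "Queue connection lost"})

matches = store.query({"service": "my-api", "level": "error"},
                      contains="Payment charge failed")
print([m["orderId"] for m in matches])
```

Filtering on a non-indexed field such as `orderId` raises an error in this sketch, which mirrors the design constraint: anything outside the chosen label set has to be found by scanning content.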

Querying across services

The practical value of aggregation comes from the ability to query across every service at once, using the correlation identifiers and structured fields that well-structured application logs already provide:

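# LogQL (Loki): assumes log lines contain the literal text orderId=12345 (logfmt-style output)
{service=~"api|worker|payment-service"} |= "orderId=12345"

-- SQL-style backend: assumes a logs table with traceId and timestamp columns
SELECT * FROM logs WHERE traceId = 'abc123' ORDER BY timestamp ASC;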

This kind of cross-service query, reconstructing every log line related to a single order or trace ID regardless of which container produced it, is the primary justification for aggregating logs centrally in the first place rather than leaving them scattered across individual hosts and containers.
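What such a query does conceptually can be sketched in Python: gather the lines from every service that carry the same trace ID and order them by time. The `logs_by_service` data, the timestamp format, and the `trace_timeline` helper are assumptions made for illustration; a real backend performs this inside its query engine.

```python
from datetime import datetime

# Hypothetical per-service log output; field names follow the examples above.
logs_by_service = {
    "api": [
        {"timestamp": "2024-01-15T10:00:00.120Z", "traceId": "abc123",
         "message": "POST /orders received"},
        {"timestamp": "2024-01-15T10:00:00.900Z", "traceId": "zzz999",
         "message": "GET /health"},
    ],
    "worker": [
        {"timestamp": "2024-01-15T10:00:00.480Z", "traceId": "abc123",
         "message": "Order job started"},
    ],
    "payment-service": [
        {"timestamp": "2024-01-15T10:00:00.310Z", "traceId": "abc123",
         "message": "Payment charge failed"},
    ],
}


def parse_timestamp(value):
    # Assumes UTC timestamps with fractional seconds and a trailing "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def trace_timeline(logs, trace_id):
    """Collect every line for one trace across all services, oldest first."""
    matching = [
        {**entry, "service": service}
        for service, entries in logs.items()
        for entry in entries
        if entry.get("traceId") == trace_id
    ]
    return sorted(matching, key=lambda e: parse_timestamp(e["timestamp"]))


timeline = trace_timeline(logs_by_service, "abc123")
for entry in timeline:
    print(entry["timestamp"], entry["service"], entry["message"])
```

The merged timeline interleaves lines from three containers in the order the events happened, which is the view that is hard to get from per-container logs.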

Retention tiers

Not all log data needs to remain in the most expensive, fully searchable storage tier indefinitely; a tiered retention policy keeps recent data in fast, queryable storage while moving or discarding older data according to its actual ongoing value:

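# Illustrative policy only; the key names are not a specific product's schema
retention:
  hot: 7d
  warm: 30d
  cold: 90d
  delete: 365d

# Copy one month's archive prefix into a Glacier storage class.
# --recursive is required when the source is a prefix rather than a single object.
aws s3 cp s3://logs-bucket/archive/2024-01/ s3://logs-bucket-glacier/2024-01/ --recursive --storage-class GLACIER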

Logs older than a defined window are often valuable only for compliance retention rather than active querying, which makes a cheaper, less immediately accessible storage tier appropriate for that older data rather than keeping everything in the same expensive, fully indexed store indefinitely.
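One way to read the tiered policy above is as a function from log age to storage tier. The sketch below assumes each value is the age at which data leaves that tier, and that data older than 90 days but younger than 365 days lives only in an archive tier; both are assumptions, since the illustrative policy does not define that gap. The `tier_for` name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Upper age bounds mirroring the illustrative policy (hot 7d, warm 30d, cold 90d).
TIERS = [
    ("hot", timedelta(days=7)),
    ("warm", timedelta(days=30)),
    ("cold", timedelta(days=90)),
]
DELETE_AFTER = timedelta(days=365)


def tier_for(log_time, now):
    """Return the tier a log entry belongs in, given its timestamp."""
    age = now - log_time
    if age >= DELETE_AFTER:
        return "delete"
    for name, limit in TIERS:
        if age < limit:
            return name
    # Assumption: between 90 and 365 days, data is kept only in archive storage.
    return "archive"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days in (1, 10, 45, 120, 400):
    print(days, tier_for(now - timedelta(days=days), now))
```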

Sampling and filtering at scale

For very high-volume services, aggregating every single log line at full fidelity may not be cost-effective relative to the actual value of that volume; sampling a representative fraction of routine, successful events while retaining all error and warning-level output preserves diagnostic value while controlling cost:

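# Fluent Bit throttle filter applied only to the api.debug tag.
# Rate is the allowed number of records per interval and Window is the number
# of intervals averaged over. Records beyond the rate are dropped, so this is
# rate limiting rather than statistical sampling.
[FILTER]
    Name throttle
    Match api.debug
    Rate 10
    Window 60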

Sampling should be applied selectively, to high-volume, low-severity log streams, rather than uniformly across all log levels. That keeps the aggregation system from being overwhelmed by routine traffic while still capturing everything needed to investigate an actual problem.
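Severity-aware sampling can be sketched as a small forwarding decision. This is a minimal illustration, not a Fluent Bit feature: the `should_forward` function, the `SAMPLED_LEVELS` set, and the use of `traceId` as the sampling key are assumptions. Hashing the key, rather than drawing a random number, keeps or drops all lines of one trace together.

```python
import hashlib

# Only these low-severity levels are ever sampled; anything else, including
# records with a missing or unknown level, is always forwarded.
SAMPLED_LEVELS = {"debug", "info"}


def should_forward(record, sample_rate=0.1):
    """Keep all higher-severity records; keep roughly sample_rate of the rest."""
    level = str(record.get("level", "")).lower()
    if level not in SAMPLED_LEVELS:
        return True
    # Deterministic sampling: the same key always gets the same decision.
    key = str(record.get("traceId") or record.get("message", ""))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


records = [
    {"level": "error", "traceId": "abc123", "message": "Payment charge failed"},
    {"level": "info", "traceId": "abc124", "message": "Request completed"},
    {"level": "debug", "traceId": "abc125", "message": "Cache lookup"},
]
forwarded = [r for r in records if should_forward(r)]
print(len(forwarded), "of", len(records), "records forwarded")
```

Treating unknown levels as always-forward is a deliberate fail-safe choice in this sketch: a mislabelled error is kept rather than sampled away.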

Dashboards built on aggregated logs

Once logs are aggregated and indexed consistently, dashboards summarizing error rates, request volume, or specific business events become possible to build directly from log data, often without needing a separate metrics pipeline for the same information:

sum(rate({service="api"} |= "level=error" [5m]))

A dashboard panel built from an aggregated log query like this gives a real-time view of error volume across every replica of a service combined, which would be considerably harder to assemble by checking individual containers' logs one at a time.
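The arithmetic behind a panel like this can be approximated in Python: count the matching lines from all replicas inside the time window and divide by the window length in seconds. The `error_rate_per_second` function and the sample entries are hypothetical, and the sketch ignores how a real backend splits the work across streams before summing.

```python
def error_rate_per_second(entries, now, window_seconds=300):
    """Matching lines per second over the trailing window.

    entries is an iterable of (unix_timestamp, line) pairs gathered from
    every replica of the service.
    """
    start = now - window_seconds
    matching = sum(
        1 for ts, line in entries
        if start < ts <= now and "level=error" in line
    )
    return matching / window_seconds


# Hypothetical lines from two replicas of the same service.
entries = [
    (650, "level=error msg=timeout"),         # outside the window
    (710, "level=error msg=charge_failed"),
    (720, "level=info msg=ok"),
    (900, "level=error msg=timeout"),
    (1000, "level=error msg=charge_failed"),
]
print(error_rate_per_second(entries, now=1000))
```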

Access control over aggregated logs

Because aggregated logs from every service typically end up in one centralized system, access control over who can query that system matters more than it would for any individual container's local logs, since a single point of access now exposes log data across the entire deployment:

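# Elasticsearch file-based role definition (roles.yml), keyed by role name.
# The query field applies document-level security, which may depend on the
# Elasticsearch subscription level in use.
developer:
  indices:
    - names: ["logs-*"]
      privileges: ["read"]
      query: '{"match": {"environment": "staging"}}'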

Restricting which environments or services a given role can query, rather than granting blanket read access to every aggregated log across every environment, limits the exposure if a single set of credentials to the aggregation system is ever compromised.
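The effect of scoping a role to an environment can be illustrated with a small filter. This is a conceptual sketch only: the `ROLE_ENVIRONMENTS` mapping, the `sre` role, and the `visible_logs` function are hypothetical, and in practice the restriction should be enforced by the aggregation backend itself rather than by client code.

```python
# Hypothetical mapping from role name to the environments it may query.
ROLE_ENVIRONMENTS = {
    "developer": {"staging"},
    "sre": {"staging", "production"},
}


def visible_logs(role, records):
    """Return only the records a role is allowed to see."""
    allowed = ROLE_ENVIRONMENTS.get(role, set())  # unknown roles see nothing
    return [r for r in records if r.get("environment") in allowed]


records = [
    {"environment": "staging", "service": "api", "message": "Deploy finished"},
    {"environment": "production", "service": "api", "message": "Payment charge failed"},
]
print(len(visible_logs("developer", records)))
print(len(visible_logs("sre", records)))
```

Defaulting unknown roles to an empty set means a misconfigured or missing role fails closed instead of exposing every environment.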

Common mistakes

Choosing a fully-indexing aggregation backend without accounting for how indexing cost scales as log volume grows, leading to unexpectedly high infrastructure spend.
Indexing every field of every log line by default instead of deliberately choosing a smaller set of high-value indexed fields.
Retaining all log data at the same, most expensive storage tier indefinitely rather than tiering retention based on actual query need versus compliance-only retention.
Sampling log volume uniformly across all severity levels, accidentally discarding error-level events that should always be retained at full fidelity.
Granting broad, unrestricted query access to the centralized aggregation system, creating a single point of exposure across every service's logs.

Container log aggregation turns scattered, per-container log output into a queryable, cross-service resource. That only holds when the indexing strategy, retention tiers, and access controls are deliberately designed around actual query needs and cost constraints, rather than defaulting to indexing and retaining everything at full fidelity indefinitely.