✦ For everyone, free.

Practical knowledge for real and everyday life

Home

14.3.3.3 Blue Green Health Check

A focused guide to Blue Green Health Check, connecting core concepts with practical Docker and container operations.

A blue-green health check is the validation gate applied to the idle environment before the proxy switch sends it any real traffic, and its design directly determines whether blue-green deployment actually catches problems before they reach users or merely confirms that containers started without verifying the application inside them genuinely works.

Why a shallow health check undermines the entire pattern

The core safety promise of blue-green deployment depends entirely on the idle environment being properly validated before the switch. A health check that only confirms a process is running, without exercising the application's actual dependencies and behavior, can pass while masking a real problem, defeating the purpose of having a separate validation phase at all:

HEALTHCHECK CMD curl -f http://localhost:3000/ || exit 1

A bare request to the root path often succeeds even when a critical downstream dependency, such as the database, is unreachable, since many frameworks return a generic response for an unmatched or static route regardless of the application's actual operational state.

Designing a health check that verifies real dependencies

A meaningful health check actively exercises the dependencies the application needs to function, rather than confirming only that the HTTP server itself is listening:

app.get('/healthz', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await cache.ping();
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'degraded', reason: err.message });
  }
});
curl -f http://green.internal:3000/healthz

A health check failing here because the database is unreachable from the green environment specifically, even though blue's identical code is working fine against the same database, would surface a real, environment-specific problem, such as a missing network route or an incorrect connection string, before the switch ever happens.

Layered validation beyond the basic health endpoint

A single health endpoint check is a reasonable first gate, but a more thorough blue-green validation process layers additional checks before considering the idle environment ready for the switch:

curl -f http://green.internal:3000/healthz
docker compose -p myapp-green run --rm api npm run smoke-test
docker compose -p myapp-green run --rm api npm run integration-test -- --target=internal

A smoke test suite that exercises a handful of critical user-facing flows end to end catches problems a simple dependency health check would not, such as a regression in business logic that has nothing to do with infrastructure connectivity but would still be a serious issue to expose to real users.

Validating with synthetic traffic before the real switch

Beyond automated tests, routing a small amount of real or replayed traffic to the idle environment before the full switch, without yet making it the default for ordinary users, provides validation under conditions closer to actual production load:

location /internal-canary-check {
    proxy_pass http://green.internal:3000;
}
for i in $(seq 1 100); do curl -s -o /dev/null -w "%{http_code}\n" http://green.internal:3000/api/products; done | sort | uniq -c

This kind of synthetic load test, run directly against the idle environment before any real customer traffic is routed there, helps surface performance issues, such as a cold cache or an underprovisioned resource limit, that a single health check request would never reveal.

Health checks during the switch itself

The proxy or load balancer's own ongoing health checking, separate from the one-time validation gate before the switch, should continue monitoring the now-live environment after the cutover, since a problem that did not appear during validation can still surface once real production traffic and data volume hit the environment:

http:
  services:
    api:
      loadBalancer:
        healthCheck:
          path: /healthz
          interval: "5s"

This ongoing health check is what would allow an automated or fast manual rollback if the newly live environment starts failing shortly after the switch, even if the pre-switch validation gate passed cleanly.

Avoiding false confidence from validation against stale data

A health check or smoke test run against the idle environment is only meaningful if that environment's data and configuration genuinely reflect what production traffic would encounter; validating against a stale or synthetic dataset that does not resemble real production data can produce a misleadingly clean pass:

docker compose -p myapp-green exec db psql -U postgres -c "SELECT count(*) FROM orders;"

In a shared-database blue-green setup, this is less of a concern since both environments see the same live data; in a setup where each environment has its own separate database, keeping that database's content representative of production is necessary for the validation to mean anything close to what real traffic will actually exercise.

Defining a clear pass/fail gate rather than a judgment call

The decision to proceed with the switch should be based on an explicit, automated pass/fail result from the validation steps, rather than an operator's informal impression that "it looks fine," since an informal judgment call is harder to apply consistently and harder to audit after the fact:

#!/bin/sh
set -e
curl -f http://green.internal:3000/healthz
docker compose -p myapp-green run --rm api npm run smoke-test
echo "Validation passed, ready for switch"
./scripts/validate-environment.sh green && ./scripts/switch-environment.sh green

Chaining the validation script's success directly into the switch script's invocation, so the switch genuinely cannot proceed unless validation passed, removes the possibility of a deployment proceeding to the switch step despite a failed or skipped validation gate.

Common mistakes

  • Using a health check that only confirms the application process is running, without exercising its actual dependencies or business logic.
  • Treating a single, automated health check as the entire validation process, without any smoke testing or synthetic load against the idle environment.
  • Validating the idle environment against stale or unrepresentative data that does not resemble what real production traffic would actually exercise.
  • Allowing the switch to proceed based on an operator's informal judgment rather than an automated, scripted pass/fail gate.
  • Stopping health checking once the switch has occurred, missing problems that only surface once the newly live environment is under real production load.

A blue-green health check earns its place as the deployment's actual safety gate only when it genuinely exercises the dependencies and behavior the application relies on, is layered with smoke testing and synthetic load rather than standing alone, and gates the switch through an explicit, automated pass/fail decision rather than an informal judgment call.