15.3.1.3 Database Health Caution

A focused guide to Database Health Caution, connecting core concepts with practical Docker and container operations.

Database health caution refers to the risks of checking database connectivity directly inside a health check. A poorly designed check can contribute to the very database load or connection exhaustion it was meant to detect, turning a diagnostic tool into an additional source of production risk precisely when the database is under stress and caution matters most.

Why naive database health checks are risky

A health check that opens a new database connection on every invocation, run on a short interval across many replicas, can become a meaningful and unnecessary source of connection pressure on the database. The effect is worst during an incident, when the database is already under stress and least able to absorb additional load:

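// Anti-pattern: every probe opens and closes its own database connection.
app.get('/healthz', async (req, res) => {
  const client = new Client(connectionString); // new connection per invocation
  await client.connect();
  await client.query('SELECT 1');
  await client.end();
  res.status(200).send('ok');
});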

This pattern opens and closes a brand new connection on every health check invocation. Multiplied across many replicas all checking every fifteen seconds, it adds a steady stream of connection churn that serves no purpose beyond the check itself and competes with genuine application traffic for the database's limited connection capacity.
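To put a rough number on that churn, the following sketch multiplies replica count by checks per minute. The fleet sizes are hypothetical and chosen only for illustration; the fifteen-second interval is the one mentioned above, and `newConnectionsPerMinute` is an illustrative helper name.

```javascript
// Rough estimate of connection churn caused by a connect-per-check health endpoint.
// Each replica opens one new connection per check interval.
function newConnectionsPerMinute(replicas, intervalSeconds) {
  return replicas * (60 / intervalSeconds);
}

// Hypothetical fleet sizes; only the 15-second interval comes from the text.
for (const replicas of [10, 100, 500]) {
  const perMinute = newConnectionsPerMinute(replicas, 15);
  console.log(`${replicas} replicas -> ${perMinute} new connections per minute`);
}
```

Under these assumptions, a fleet of 100 replicas would open and tear down 400 connections per minute purely for health checking, before any real traffic is served.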

Reusing the application's existing connection pool

The health check should query through the same connection pool the application already maintains for ordinary request handling, rather than establishing a separate, dedicated connection just for the check:

const pool = new Pool({ max: 20 });

app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1');
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('degraded');
  }
});

Using the existing pool means the health check's query competes for an existing connection slot the same way any other query would, rather than requiring its own connection budget on top of what the application needs for actual traffic. It also reflects more accurately the state the application would encounter: if the pool is exhausted, the health check query experiences that exhaustion too, which is exactly the condition worth surfacing.
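One caveat is that a query against an exhausted pool may wait for a free connection rather than fail quickly, so the probe can hang instead of reporting a problem. A minimal sketch of bounding that wait follows. `withTimeout`, `probe`, and `stuckQuery` are hypothetical names introduced here, and `stuckQuery` merely stands in for a pool query that is slow to return; depending on the driver, a pool-level timeout option (node-postgres has `connectionTimeoutMillis`, for instance) may serve the same purpose.

```javascript
// Bound how long a health check waits on any promise-returning probe.
// `withTimeout` is a hypothetical helper, not part of any library.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for pool.query('SELECT 1') waiting on an exhausted pool.
const stuckQuery = () => new Promise((resolve) => setTimeout(resolve, 200, 'ok'));

// Maps the probe outcome to the HTTP status the health endpoint would send.
async function probe(query, ms) {
  try {
    await withTimeout(query(), ms);
    return 200;
  } catch {
    return 503;
  }
}
```

In a handler this could be used as `res.sendStatus(await probe(() => pool.query('SELECT 1'), 1000))`, where the one-second budget is an arbitrary illustrative value.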

Choosing a genuinely lightweight query

The specific query a health check runs against the database should be the cheapest possible operation that still confirms genuine connectivity and responsiveness, not a query that exercises meaningful database work:

SELECT 1;
SELECT COUNT(*) FROM large_table; -- unnecessarily expensive for a connectivity check

A connectivity check needs to confirm only that the database can accept and respond to a query. It does not need to verify anything about the application's data or business logic; that is a different, more thorough kind of test, better suited to a separate synthetic check that runs less frequently than the health check polled continuously on every container's interval.

Avoiding cascading failure during a database incident

If a database is already struggling under load, every replica's health check hitting it simultaneously can add just enough load to turn a marginal, recoverable situation into a full outage, a self-inflicted amplification of the very problem the health check exists to detect. A short-lived cache of the last result limits how often the database is actually queried:

const cache = { healthy: true, lastChecked: 0 };

app.get('/healthz', async (req, res) => {
  if (Date.now() - cache.lastChecked < 5000) {
    return res.status(cache.healthy ? 200 : 503).send();
  }
  try {
    await pool.query('SELECT 1');
    cache.healthy = true;
  } catch {
    cache.healthy = false;
  }
  cache.lastChecked = Date.now();
  res.status(cache.healthy ? 200 : 503).send();
});

Caching the result of the database check for a short window, rather than performing a fresh check on every invocation regardless of how often the endpoint is polled, reduces the check's load contribution while still providing reasonably current status. The trade is a small amount of detection latency (up to five seconds in this example) for reduced load during exactly the conditions where that load matters most. The cache only helps when the endpoint is polled more often than the cache window, for example when several probers hit the same replica.
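One gap in the handler above is that several requests arriving just after the cache expires would each run their own query. The following sketch, under the assumption that the check is wrapped in a factory, shares a single in-flight check among concurrent callers. `createCachedCheck` and its parameters are hypothetical names; `check` would be something like `() => pool.query('SELECT 1')`, and `now` is injectable only so the behavior can be tested without real timers.

```javascript
// Cached health check with in-flight de-duplication.
// `check` is any function returning a promise that resolves on success
// and rejects on failure. `now` defaults to the real clock.
function createCachedCheck(check, ttlMs, now = Date.now) {
  let healthy = true;
  let lastChecked = -Infinity; // forces a real check on first use
  let inFlight = null;

  return async function isHealthy() {
    if (now() - lastChecked < ttlMs) return healthy; // cached result still fresh
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(() => check())
        .then(() => true, () => false)
        .then((result) => {
          healthy = result;
          lastChecked = now();
          inFlight = null;
          return result;
        });
    }
    return inFlight; // concurrent callers share one database round trip
  };
}
```

A handler could then respond with `res.status((await isHealthy()) ? 200 : 503).send()`.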

Considering a circuit breaker pattern for repeated failures

When the database check fails repeatedly, continuing to attempt it on every cycle instead of backing off can contribute to connection retry storms that make recovery harder once the database starts to come back. The first step is to track consecutive failures:

let consecutiveFailures = 0;

async function checkDatabase() {
  try {
    await pool.query('SELECT 1');
    consecutiveFailures = 0;
    return true;
  } catch {
    consecutiveFailures++;
    return false;
  }
}

A circuit breaker temporarily stops attempting the database query after a threshold of consecutive failures and reports unhealthy directly, without adding connection attempts during the backoff window. This reduces the load the health check contributes during a sustained outage; actual checks resume once a cooldown period has passed.
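The failure counter above does not yet skip any checks. One possible way to add the threshold and cooldown is sketched below. `createBreaker`, `threshold`, and `cooldownMs` are hypothetical names, the default values are illustrative rather than recommendations, and `check` is assumed to be a function such as `() => pool.query('SELECT 1')`.

```javascript
// Minimal circuit breaker around a database probe.
// `now` is injectable so the cooldown can be tested without real timers.
function createBreaker(check, { threshold = 3, cooldownMs = 30000, now = Date.now } = {}) {
  let consecutiveFailures = 0;
  let openedAt = null; // timestamp when the breaker opened, or null while closed

  return async function checkDatabase() {
    if (openedAt !== null && now() - openedAt < cooldownMs) {
      // Open: report unhealthy without touching the database.
      return false;
    }
    // Closed, or cooldown elapsed: attempt a real check.
    try {
      await check();
      consecutiveFailures = 0;
      openedAt = null; // close the breaker on success
      return true;
    } catch {
      consecutiveFailures++;
      if (consecutiveFailures >= threshold) openedAt = now(); // open or re-open
      return false;
    }
  };
}
```

With the defaults shown, three consecutive failures would suppress real queries for thirty seconds. If the first check after the cooldown also fails, the breaker re-opens for another full cooldown.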

Distinguishing the health check's database concern from the application's

It is worth being explicit that a health check verifying database connectivity checks something narrower than the application's full functional correctness against the database. A connectivity check can pass while a specific query path the application relies on is broken for an unrelated reason, such as a missing index causing one query to time out while ordinary connectivity remains fine.

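app.get('/healthz', async (req, res) => {
  try {
    await pool.query('SELECT 1'); // confirms connectivity only, not full functional correctness
    res.status(200).send('ok');
  } catch (err) {
    res.status(503).send('degraded'); // report failure explicitly rather than leaving it to framework error handling
  }
});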

This distinction matters when interpreting a healthy result: it confirms the database is reachable and responsive to a trivial query, not that every feature depending on the database is functioning correctly in every respect.

Common mistakes

  • Opening a new, dedicated database connection on every health check invocation instead of reusing the application's existing connection pool.
  • Running an unnecessarily expensive query as the connectivity check, adding real load for no diagnostic benefit beyond what a trivial query would already provide.
  • Running uncached, fresh checks on every interval across many replicas during a database incident, amplifying load on a database that is already struggling.
  • Retrying the database check aggressively during a sustained outage without any backoff, contributing to connection retry storms that can slow recovery.
  • Treating a passing connectivity check as proof of full functional correctness against the database, rather than recognizing it as a narrower signal.

Database health caution comes down to recognizing that a health check querying a shared, limited resource can itself become a meaningful load contributor. Designing it to reuse existing connections, run the cheapest sufficient query, cache results briefly, and back off during sustained failures keeps the check a useful diagnostic tool rather than an accidental amplifier of the exact problem it exists to detect.