Resilience in Node.js: Managing Uncaught Exceptions and Promise Rejections

TL;DR: When an unhandled error occurs, the Node.js process is in an undefined state. You must log the stack trace synchronously, trigger a graceful shutdown of external resources (DBs, sockets), and force a process.exit(1) to allow your orchestrator (PM2/Kubernetes) to spawn a clean instance.

Node.js applications often fail silently or enter "zombie states" because error boundaries are neglected. Unlike multi-threaded runtimes, where an exception might take down only a single worker thread, an unhandled exception in Node.js can leave the entire event loop in a corrupted state: leaking file descriptors, holding database locks open, or leaving memory allocated but unreachable.

The Anatomy of a Fatal Error

In Node.js, fatal errors typically fall into two categories: uncaughtException and unhandledRejection. Understanding the difference in how the V8 engine treats these is essential for building a stable system.

1. uncaughtException

This occurs when a synchronous error is thrown and not caught by any try/catch block. Because the error has bubbled all the way up to the event loop, the call stack has already unwound, and the application's internal state is no longer guaranteed to be consistent. Continuing execution after this event is dangerous; the official Node.js documentation warns that the process should be restarted.
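A minimal sketch of why try/catch cannot protect you here: the catch block below has already returned by the time the callback runs, so the throw surfaces at the event loop (the process-level handler exists only to observe it in this demo).

```javascript
// Demonstrates why try/catch cannot guard asynchronous callbacks:
// the try block exits long before the callback ever executes.
let caughtBySyncTryCatch = false;

try {
  setTimeout(() => {
    throw new Error('boom'); // surfaces at the event loop, not the catch below
  }, 0);
} catch (err) {
  caughtBySyncTryCatch = true; // never executes
}

// Observer only -- production code should crash intelligently instead.
process.on('uncaughtException', (err) => {
  console.log(`reached process-level handler: ${err.message}`);
  console.log(`caught by try/catch: ${caughtBySyncTryCatch}`); // false
});
```

Without the observer at the bottom, the deferred throw would terminate the process outright.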

2. unhandledRejection

This happens when a Promise is rejected and no .catch() handler is attached. Historically, Node.js would simply emit a warning. Since Node.js 15, however, an unhandled rejection crashes the process with exit code 1 by default, just as an uncaught exception does. This change aligns Promise behavior with synchronous exceptions, emphasizing that an unhandled rejection is a logic bug that threatens application state.
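The most common source in application code is a forgotten await or .catch(). A minimal reproduction (the flaky function and the observer are illustrative only):

```javascript
// An async function that rejects; nothing here is wrong by itself.
async function flaky() {
  throw new Error('async failure');
}

// Calling it without `await` or `.catch()` orphans the rejected promise:
flaky();

// Once the microtask queue drains, Node.js emits 'unhandledRejection'.
// Without this observer, Node.js 15+ terminates the process instead.
process.on('unhandledRejection', (reason) => {
  console.log(`unhandled: ${reason.message}`);
});
```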

Implementing the Global Error Boundary

To handle these gracefully, you need a centralized handler. This shouldn't be used to "swallow" errors and keep running, but to "crash intelligently."

const logger = require('./infra/logger'); // Assume a structured logger like Pino or Winston
const server = require('./server');
const db = require('./infra/db');

function shutdownGracefully(exitCode) {
  // Give the process 10 seconds to clean up, then force exit
  // This prevents the process from hanging forever if a cleanup task fails
  setTimeout(() => {
    console.error('Graceful shutdown timed out; forcing exit.');
    process.exit(exitCode);
  }, 10000).unref();

  server.close(async () => {
    try {
      await db.disconnect();
      logger.info('Closed database connections.');
      process.exit(exitCode);
    } catch (err) {
      logger.error('Error during database disconnection', { err });
      process.exit(1);
    }
  });
}

process.on('uncaughtException', (error) => {
  // Use synchronous logging here if possible to ensure the message hits the disk/stdout
  // before the process dies.
  logger.fatal('UNCAUGHT EXCEPTION! Shutting down...', {
    message: error.message,
    stack: error.stack,
  });
  
  shutdownGracefully(1);
});

process.on('unhandledRejection', (reason, promise) => {
  logger.fatal('UNHANDLED REJECTION! Shutting down...', {
    reason: reason instanceof Error ? reason.stack : reason,
  });
  
  shutdownGracefully(1);
});

Critical Nuance: Synchronous vs. Asynchronous Logging

When the process is crashing, you are in a race against the operating system. If you use an asynchronous logger that buffers output (like a standard Winston configuration sending to an HTTP transport), the process might exit before the log packet is sent.

In high-stakes environments, use fs.writeSync(2, message) or a logger configured for synchronous flushing. This ensures that the post-mortem data is available in your log aggregator (Datadog, ELK, or CloudWatch) even if the network stack is failing. Standard console.error is generally synchronous when the output is a TTY or a file, but can be asynchronous when piped to another process—be wary of your execution environment.

Why We Must Exit the Process

It is a common anti-pattern to try to "recover" from an uncaughtException. This is often motivated by a desire for high availability, but it leads to "state pollution." Consider this scenario:

Step  Event                                    Resulting State
1     Request A starts a DB transaction.       Transaction open.
2     Synchronous error occurs in Request A.   Process catches the error, but the transaction never closes.
3     Process stays alive.                     DB connection pool is now leaked; Request B may time out.

By exiting the process, we allow the operating system to reclaim memory and close sockets, and we allow the orchestration layer (PM2, Kubernetes, or Systemd) to restart the service from a known "clean" configuration.

The Orchestration Layer: PM2 and Kubernetes

Handling errors inside the code is only half the battle. Your deployment strategy must support rapid restarts.

PM2 Integration

If using PM2, ensure you have an exponential backoff restart strategy. This prevents a "crash loop" from overwhelming your CPU or logging storage if the error is caused by a persistent environmental issue (e.g., a missing environment variable or a down database).

// ecosystem.config.js
module.exports = {
  apps: [{
    name: "api-service",
    script: "./index.js",
    exp_backoff_restart_delay: 100,
    max_memory_restart: '1G'
  }]
}

Kubernetes Liveness and Readiness Probes

In a Kubernetes environment, your shutdownGracefully function should make the Readiness Probe fail immediately. This ensures that the Service load balancer stops routing new traffic to the dying pod while it finishes processing in-flight requests. The Liveness Probe will eventually fail, or your process.exit(1) will cause the kubelet to restart the container, entering the CrashLoopBackOff state if the crash repeats.
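One way to wire this up, sketched with a plain Node-style (req, res) handler; the names isShuttingDown, readinessHandler, and markShuttingDown are assumptions, not part of the shutdown code above:

```javascript
// Readiness flag flipped by the shutdown path. Kubernetes stops routing
// traffic as soon as the readiness probe returns a non-2xx status, while
// the liveness probe is left alone so in-flight requests can drain.
let isShuttingDown = false;

function readinessHandler(req, res) {
  if (isShuttingDown) {
    res.statusCode = 503; // non-2xx: take this pod out of rotation
    res.end('shutting down');
  } else {
    res.statusCode = 200;
    res.end('ok');
  }
}

// Call this at the top of shutdownGracefully, before closing the server,
// so the load balancer drains traffic while cleanup proceeds.
function markShuttingDown() {
  isShuttingDown = true;
}
```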

Edge Case: Recursive Errors in the Handler

What happens if your error handler itself throws? Without protection, this produces an infinite loop or an immediate, silent crash. To prevent it, always wrap your shutdown logic in a basic try/catch and use console.error as an absolute fallback.

process.on('uncaughtException', (err) => {
  try {
    // Attempt structured log
    logger.fatal(err);
    // Attempt clean shutdown
    shutdownGracefully(1);
  } catch (secondaryError) {
    // Ultimate fallback: write to stderr and bail
    console.error('Error in uncaughtException handler:', secondaryError);
    process.exit(1);
  }
});

Testing Your Error Boundaries

Do not wait for a production failure to see if your logging works. Use a "chaos endpoint" in your staging environment to trigger these events manually:

// Development-only routes
if (process.env.NODE_ENV !== 'production') {
  app.get('/debug/force-exception', (req, res) => {
    // A synchronous throw here would be intercepted by Express's own error
    // middleware and never reach 'uncaughtException'. Deferring the throw
    // with setImmediate escapes Express's try/catch.
    setImmediate(() => {
      throw new Error('Manual Uncaught Exception');
    });
  });

  app.get('/debug/force-rejection', (req, res) => {
    // Fire-and-forget rejection with no .catch() handler attached
    Promise.reject(new Error('Manual Unhandled Rejection'));
  });
}

Verify that your logs contain the full stack trace, that database connections are terminated, and that the process manager successfully restarts the service. If the process hangs for more than 10 seconds, your timeout logic is failing.

Summary of Best Practices

  • Always crash: Never attempt to continue execution after an uncaught exception.
  • Synchronous logs: Use blocking I/O for the final error log to ensure persistence.
  • Time-limited shutdown: Use a setTimeout to force exit if cleanup takes too long.
  • External Monitor: Rely on PM2 or Kubernetes to handle the process lifecycle.
  • Node.js Versioning: Be aware that Node 15+ changed unhandled rejections to fatal errors; legacy codebases may behave differently upon upgrade.

By treating failure as an expected part of the application lifecycle, you shift from a fragile system to a resilient one that can recover from the unexpected without human intervention.
