Don’t Let Your Containers Become ‘Zombies’: Mastering Docker Healthcheck and Restart Policies

Docker tutorial - IT technology blog

The Problem: Docker Shows Green but the App is ‘Dead’

Docker reports the container status as Up, but accessing the website results in a 502 Bad Gateway error. This is an extremely frustrating situation. You check the logs and find that the application has been frozen for a long time.

The issue is that Docker only monitors the container’s main process. If the app deadlocks, stalls under memory pressure, or loses its DB connection without the process exiting, Docker still reports the container as running. The system then falls into a “vegetative state”: not completely dead, but unable to function.

In my first project, I was overconfident and only used restart: always. When server overload caused the Database to respond slowly, the Node.js app hung on connections, yet the container still showed a bright green status. Customers complained constantly while I remained convinced the system was stable. To handle this properly, you need the duo: Restart Policies and Healthcheck.

1. Self-healing with Restart Policies

Restart Policies help containers stand back up after a power failure or a crash. Docker provides four main options:

  • no: The default. Docker watches the container die and does nothing.
  • always: Restarts the container whenever it exits, whatever the exit code. If you stop it manually with docker stop, it stays down until the Docker daemon restarts (for example after a server reboot), and then it comes back up.
  • on-failure: Only restarts if the exit code is non-zero. Suitable for data processing jobs that need to finish and then stop.
  • unless-stopped: Similar to always, with one difference. If you stop the container with docker stop, it stays down even after the daemon or the server restarts, until you start it again manually.

The configuration in docker-compose.yml is very clean:

services:
  web-app:
    image: nginx:1.25-alpine
    restart: unless-stopped

I usually prefer unless-stopped. It brings the app back after server maintenance, while a container I intentionally stopped for debugging stays stopped instead of reappearing after the next daemon restart or reboot.
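For the on-failure case, a retry cap keeps a job that fails every time from restarting forever. The fragment below is a minimal sketch; the service name report-job and the image my-batch-job:latest are hypothetical, and the limit of 5 is an arbitrary choice.

```yaml
services:
  report-job:                      # hypothetical batch service
    image: my-batch-job:latest     # hypothetical image name
    # Restart only on a non-zero exit code, and give up after 5 attempts.
    # A clean exit (code 0) leaves the container stopped.
    restart: on-failure:5
```

To see how often Docker has restarted a container, docker inspect --format '{{.RestartCount}}' followed by the container name prints the counter.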

2. Healthcheck: A Private “Doctor” for Your Container

While Restart Policies only know if a container is alive or dead, Healthcheck knows if the application is working effectively. It’s like periodically sending a signal to ask: “Hey, are you still responsive?”

Key Parameters You Need to Master

  • test: The check command (usually using curl or pg_isready).
  • interval: Check frequency (e.g., once every 30 seconds).
  • timeout: How long to wait for a response before considering the check a failure.
  • retries: Number of consecutive failures (e.g., 3 times) before labeling it unhealthy.
  • start_period: Grace period for the app to boot. Failed checks during this window do not count toward retries. A Java Spring Boot app might take 45 seconds to start; give it that time before failures start to count.

3. Practical Configuration for a Node.js Application

Suppose you have an application running on port 3000. Don’t just hope for the best; force Docker to check it.

services:
  my-api:
    image: node-app:v1
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]  # curl must be installed in the image
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

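In this configuration, curl -f exits with a non-zero code if the /health endpoint responds with an HTTP error status (400 or above, so a 500 counts), and Docker marks the check as failed if the command does not finish within the 10-second timeout. Thanks to start_period: 40s, failed checks during the first 40 seconds do not count toward the retry limit, so the app has time to finish loading before its health is scored.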
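The curl-based check only works if curl exists inside the image, which is not a given for slim or Alpine-based Node.js images. One possible alternative, assuming the image ships Node.js 18 or newer (where fetch is available globally), is to let Node perform the probe itself. The one-liner below is a sketch, not part of the original setup; it uses 127.0.0.1 rather than localhost to sidestep IPv6 resolution surprises.

```yaml
services:
  my-api:
    image: node-app:v1
    restart: always
    healthcheck:
      # Assumes Node.js 18+ inside the image (global fetch).
      # Exits 0 on a successful (2xx) response, 1 on any other status
      # or on a connection error.
      test:
        - "CMD"
        - "node"
        - "-e"
        - "fetch('http://127.0.0.1:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

The trade-off is that each probe starts a Node.js process, which is heavier than curl, so a longer interval suits this variant better.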

Embedding Healthcheck into the Dockerfile

You can also package this mechanism into the image itself, so that every environment that runs the image gets the check by default:

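FROM node:18-alpine
# curl is not part of the Alpine base image, so install it for the check
RUN apk add --no-cache curl
# ... setup app ...
# Probe the same /health endpoint as the Compose example, with a boot grace period
HEALTHCHECK --interval=1m --timeout=3s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
CMD ["npm", "start"]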
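A check defined in the Dockerfile acts as a default. In Compose, a healthcheck block on the service overrides it, and disable: true switches it off, which can be handy for a one-off debugging session. A small sketch, reusing the node-app:v1 image name from the earlier example and assuming that image contains curl; the service names are hypothetical:

```yaml
services:
  api-tuned:
    image: node-app:v1
    healthcheck:
      # Replaces the image's HEALTHCHECK with environment-specific settings
      test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 3s
      retries: 3
      start_period: 40s

  api-debug:
    image: node-app:v1
    healthcheck:
      disable: true   # ignore the HEALTHCHECK baked into the image
```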

4. Smooth Coordination Between Services

A common mistake is the app starting faster than the Database, leading to connection errors from the start. Instead of using complex wait scripts, use depends_on combined with a health condition:

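services:
  db:
    image: postgres:15
    environment:
      # Placeholder values; the postgres image refuses to start without a password
      POSTGRES_USER: user
      POSTGRES_PASSWORD: change-me
      POSTGRES_DB: mydb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s

  app:
    image: my-node-app
    depends_on:
      db:
        condition: service_healthy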

Now, the app container will patiently wait until the db is truly ready to accept connections. This approach is much more professional and reliable than blindly using sleep 10.
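The same condition can be chained. In the sketch below, a hypothetical proxy service waits for app, which in turn waits for db. It assumes the app image contains curl and exposes a /health endpoint on port 3000, as in the Node.js example; the proxy service and the password value are purely illustrative.

```yaml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: change-me   # placeholder value
      POSTGRES_DB: mydb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s

  app:
    image: my-node-app
    depends_on:
      db:
        condition: service_healthy   # start only after db is healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      start_period: 40s

  proxy:                             # hypothetical reverse proxy in front of app
    image: nginx:1.25-alpine
    depends_on:
      app:
        condition: service_healthy   # start only after app is healthy
```

Keep in mind that depends_on governs startup order only. If db turns unhealthy later, Compose does not stop or restart app because of it.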

5. Resource Considerations: Don’t Over-Check

Healthchecks aren’t free. Each run consumes a small amount of CPU and RAM. If you set interval: 1s for 20 containers, the server will waste resources just checking itself.

A reasonable number is usually 30 seconds to 1 minute for standard services. Prioritize lightweight check commands and avoid heavy SQL queries just to see if the DB is alive.
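The interval also determines how quickly a hang is noticed, so it is worth doing the arithmetic before settling on a value. Docker schedules each check interval seconds after the previous one completes, so for an app that hangs (every probe runs into the timeout), the worst case is roughly retries × (interval + timeout). The fragment below annotates the Node.js settings from earlier with that estimate:

```yaml
services:
  my-api:
    image: node-app:v1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s   # wait between the end of one check and the start of the next
      timeout: 10s    # a hung probe is counted as failed after this long
      retries: 3      # consecutive failures needed for "unhealthy"
      # Worst case for a hung app: 3 x (30s + 10s) = 120s until "unhealthy".
      # If probes fail instantly (connection refused): about 3 x 30s = 90s.
      # A shorter interval lowers detection time but increases the probe count.
```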

Conclusion

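Combining Restart Policies and Healthchecks gives you the foundation for a self-healing system. One caveat: on a standalone Docker Engine or with Docker Compose, an unhealthy status alone does not restart the container, because restart policies only react when the container exits. Docker Swarm replaces unhealthy tasks automatically; elsewhere, pair the healthcheck with monitoring or a watchdog that restarts unhealthy containers. With that in place, you will no longer have to wake up at 2 AM just to type docker restart.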

Three rules of thumb:

  1. Use unless-stopped for most web services.
  2. Always include a start_period so a slow-starting app isn’t marked unhealthy before it has finished booting.
  3. Use service_healthy to manage the execution order of dependent services.

Apply these techniques immediately to make your applications more resilient and stable.
