Advanced Docker Swarm: Rolling Updates, Placement Constraints, and Zero-Downtime Deployments – ITFROMZERO

The first time I used Docker Compose on a real project, I made a lot of basic mistakes that are embarrassing to look back on — I’d finish a deploy and the site would be down for an hour because I forgot to drain traffic before restarting containers. When the team asked “why is it down?”, I had no good answer. That was the starting point that pushed me to go deep on advanced Docker Swarm: controlled Rolling Updates, Placement Constraints, and Docker Config.

This article is for people who already know Swarm basics and want to configure it for real production — not a demo lab, but a system running 24/7 with actual users.

Context: Why Do We Need This Extra Layer of Configuration?

Swarm’s defaults are decent, but the moment you move to production, you’ll run into a handful of common problems:

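Default rolling updates can cause downtime: with the default stop-first order, Swarm stops the old task before starting its replacement. With a single replica, or with parallelism equal to the replica count, that leaves a window where no instance is serving requests and users see a 502. With more replicas, capacity still drops during every batch.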
Uncontrolled container placement: A heavy database might get scheduled on a low-RAM node, or all your API replicas might pile onto a single node that then goes down.
Config and secrets in environment variables: These can easily leak through docker inspect, log aggregation, or the process environment on the host (ps auxe, /proc/<pid>/environ).
No automatic rollback: You deploy, discover a bug, and have to handle it manually while users are seeing real errors.

Three features — Placement Constraints, Docker Config/Secret, and Rolling Updates with order: start-first — directly solve each of these problems.

Setup: Labeling Your Nodes

Labels are the foundation of Placement Constraints. Before writing your stack file, you need to assign labels to each node based on its role and hardware characteristics. This is a step most tutorials skip, which leaves readers confused about why constraints aren’t working:

# Assign role labels to worker nodes
docker node update --label-add role=worker node-1
docker node update --label-add role=worker node-2

# Assign storage type — important for databases
docker node update --label-add storage=ssd node-1
docker node update --label-add storage=hdd node-2

# Assign availability zone if the cluster spans multiple zones
docker node update --label-add zone=az-1 node-1
docker node update --label-add zone=az-2 node-2

# Verify labels were assigned correctly
docker node inspect node-1 --format '{{json .Spec.Labels}}'
docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.ManagerStatus}}'

After assigning labels, double-check with docker node ls -q | xargs docker node inspect --format '{{.Description.Hostname}}: {{.Spec.Labels}}' to make sure all nodes have their labels before deploying the stack.
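
For a cluster with more than a few nodes, reading that output by eye gets error-prone. The sketch below is one way to automate the check; nodes_missing_label is a hypothetical helper name, and it assumes the input lines look like "node-1: map[role:worker storage:ssd]", which is how Go templates print the label map. It does a plain text match, so treat it as a convenience rather than a strict parser.

```shell
# Print the hostname of every node whose label map lacks the given key.
# Input: lines like "node-1: map[role:worker storage:ssd]" on stdin.
nodes_missing_label() {
  key="$1"
  while IFS= read -r line; do
    case "$line" in
      *"[$key:"*|*" $key:"*) ;;            # key is present on this node
      *) printf '%s\n' "${line%%:*}" ;;    # key is absent: report the hostname
    esac
  done
}
```

Piping the inspect command above into nodes_missing_label zone should print nothing when every node carries a zone label, and one hostname per line otherwise.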

Detailed Configuration

Docker Config and Secret — Managing Configuration the Right Way

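Instead of passing config through environment variables, Docker Config stores static configuration files (nginx.conf, app.yaml, etc.) while Docker Secret stores sensitive data (passwords, API keys). Secrets are encrypted at rest in the Swarm Raft log and mounted into the container on an in-memory filesystem. Configs are not encrypted at rest and are mounted directly into the container's filesystem, so keep anything sensitive out of them: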

# Create config from a file
docker config create nginx_conf ./nginx.conf
docker config create app_settings ./app.yaml

# Create secret from a file (avoids typing the value into a command that lands in shell history; delete the file afterwards)
docker secret create db_password ./db_password.txt
docker secret create jwt_secret ./jwt_secret.txt

# Verify
docker config ls
docker secret ls

Declare and mount them into containers in your stack file:

configs:
  nginx_conf:
    external: true
  app_settings:
    external: true

secrets:
  db_password:
    external: true

services:
  nginx:
    image: nginx:1.25-alpine
    configs:
      - source: nginx_conf
        target: /etc/nginx/nginx.conf
        mode: 0440         # Read-only for owner and group, not others

  api:
    image: myapp/api:latest
    configs:
      - source: app_settings
        target: /app/config/settings.yaml
    secrets:
      - source: db_password
        target: db_password
        mode: 0400         # Owner read-only — secrets should be stricter than configs

Secrets are mounted at /run/secrets/<target> inside the container, where the target defaults to the secret's name. Your app reads from this file instead of an environment variable. This is far more secure because the secret never appears in the process environment and won’t be exposed through docker inspect.
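
If the application cannot read the file itself, an entrypoint script can do it. The following is a minimal sketch, not a prescribed pattern: read_secret is a hypothetical helper, and SECRETS_DIR is an assumed override that only exists so the function can be tried outside a container.

```shell
# Print the contents of a mounted secret, or fail loudly if it is missing.
# SECRETS_DIR is a hypothetical override for local testing;
# inside a Swarm task the default /run/secrets applies.
read_secret() {
  secret_file="${SECRETS_DIR:-/run/secrets}/$1"
  if [ ! -r "$secret_file" ]; then
    echo "secret '$1' is not mounted at $secret_file" >&2
    return 1
  fi
  cat "$secret_file"
}
```

In an entrypoint this could be used as DB_PASSWORD="$(read_secret db_password)" || exit 1. The command substitution strips the trailing newline, and leaving the variable unexported keeps it out of the environment of unrelated child processes.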

Placement Constraints — Routing Workloads to the Right Nodes

With the labels assigned above, you can now precisely control which containers run on which nodes. constraints are hard rules (must be satisfied), while preferences are soft rules (satisfied when possible):

services:
  api:
    deploy:
      replicas: 4
      placement:
        constraints:
          - node.role == worker          # API does not run on manager nodes
          - node.labels.role == worker   # Double-check via custom label
        preferences:
          - spread: node.labels.zone     # Spread evenly across AZs, avoid concentration

  database:
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.storage == ssd   # DB only runs on nodes with SSD
          - node.role == worker          # Not on manager nodes

  redis:
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.zone == az-1     # Pin Redis to a specific zone if needed
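
Because constraints are hard rules, a task that no node satisfies is not scheduled anywhere and stays in the Pending state. The sketch below filters a task listing for that case. pending_tasks is a hypothetical helper, and the "|" separator is an assumption of this example: it expects input produced with a format string such as '{{.Name}}|{{.CurrentState}}|{{.Error}}'.

```shell
# List tasks that the scheduler has not been able to place.
# Input: "name|current state|error" lines on stdin.
pending_tasks() {
  awk -F '|' '$2 ~ /^Pending/ {
    print $1 ": " ($3 == "" ? "no error reported yet" : $3)
  }'
}
```

One way to use it is docker service ps myapp_database --format '{{.Name}}|{{.CurrentState}}|{{.Error}}' | pending_tasks. Empty output suggests every task found a node.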

Zero-Downtime Rolling Updates — Detailed Configuration

This is the section most commonly misconfigured. The key parameter is order: start-first — Swarm starts the new replica, waits for the healthcheck to pass, and only then stops the old replica. This is the opposite of the default stop-first behavior, which causes downtime.

But order: start-first only works correctly when paired with a properly configured healthcheck:

services:
  api:
    image: myapp/api:${VERSION:-latest}
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s    # Probe failures during the first 30s do not count toward retries
    deploy:
      replicas: 4
      update_config:
        parallelism: 1            # Update 1 replica at a time — conservative but safe
        delay: 15s                # Wait 15s between update batches
        order: start-first        # START the new replica FIRST, STOP the old one AFTER
        failure_action: rollback  # Automatically rollback everything if the update fails
        monitor: 30s              # Monitor for 30s after each update to catch delayed failures
        max_failure_ratio: 0.3    # Allow up to 30% replica failures before triggering rollback
      rollback_config:
        parallelism: 0            # 0 = rollback all replicas simultaneously
        delay: 0s                 # No delay during rollback — speed matters here
        failure_action: continue  # Continue rolling back even if errors occur
        order: stop-first         # During rollback: stop new version first, restore old after (with parallelism: 0, all tasks being reverted stop together)

start_period in the healthcheck is the parameter I spent the most time tuning. If your app needs 20 seconds to connect to the database, load config, and warm up its cache, set start_period to at least 25s to give yourself a buffer. Without it, every probe that fails during startup counts toward retries, so a slow-starting container can be marked unhealthy and killed before it is ready. The replacement hits the same wall, and the rolling update never completes.
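
To reason about these numbers, it helps to estimate how long a container whose probe never passes survives before it is marked unhealthy. The sketch below is a rough estimate only, under the assumption that one probe runs per interval and that probe run time is ignored. unhealthy_after is a hypothetical helper name.

```shell
# Rough estimate, in seconds, of when a container whose probe never passes
# is marked unhealthy: failures inside start_period are not counted, then
# `retries` consecutive failures are needed, one probe per `interval`.
# Probe run time (up to `timeout` each) comes on top of this.
unhealthy_after() {
  start_period="$1"; interval="$2"; retries="$3"
  echo $(( start_period + interval * retries ))
}

unhealthy_after 30 10 3   # values from the healthcheck above, prints 60
```

With order: start-first the old replica keeps serving during that minute, so a broken image costs time rather than availability.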

Testing and Monitoring

Deploy the Stack and Watch the Rolling Update

# Deploy the stack for the first time
docker stack deploy -c docker-stack.yml myapp

# List all services in the stack
docker stack services myapp

# Update the API to a new version — rolling update runs automatically
VERSION=v2.1.0 docker stack deploy -c docker-stack.yml myapp

# Watch the rolling update in real-time
# Look for: new replica in Running state BEFORE old replica reaches Shutdown
watch -n2 'docker service ps myapp_api --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.DesiredState}}"'

When the update is working correctly, you’ll see a moment where replicas+1 tasks exist simultaneously: the new replica in Running state while the old replica hasn’t yet transitioned to Shutdown. That’s your proof that zero-downtime is actually working.
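
Task states show that the overlap happened, but not that requests actually succeeded during it. A complementary check is to probe the service once per second while the update runs and count the failures afterwards. This is a sketch under assumptions: count_failed_probes is a hypothetical helper, and the probe URL must be one that reaches the service from outside through whatever port your stack publishes.

```shell
# Count responses that are not HTTP 2xx.
# Input: one status code per line on stdin
# (curl prints 000 when the connection itself fails).
count_failed_probes() {
  awk '$1 !~ /^2[0-9][0-9]$/ { bad++ } END { print bad + 0 }'
}
```

In a second terminal, something like while sleep 1; do curl -s -o /dev/null -m 2 -w '%{http_code}\n' <your-service-url>; done > probes.log collects the codes during the update, and count_failed_probes < probes.log should then print 0 for a clean rollout.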

Rollback and Placement Verification

# Roll back a service to the previous version (for manual intervention)
docker service rollback myapp_api

# Verify the database is running on the correct SSD node
docker service ps myapp_database --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}'

# View the distribution of API replicas across nodes
docker service ps myapp_api --filter 'desired-state=running'

# Aggregate logs from all replicas of a service
docker service logs -f --tail 100 myapp_api
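
With failure_action: rollback, a failed update reverts on its own, so a deploy script needs to ask Swarm what actually happened. The sketch below maps the service's update state to a verdict. update_verdict is a hypothetical helper, and grouping rollback_completed under "failed" is a choice of this example: the service is healthy again, but the new version did not ship.

```shell
# Map Swarm's UpdateStatus.State to a verdict a deploy script can act on.
update_verdict() {
  case "$1" in
    completed)                                  echo "ok" ;;
    updating|rollback_started)                  echo "in-progress" ;;
    paused|rollback_paused|rollback_completed)  echo "failed" ;;
    *)                                          echo "unknown" ;;
  esac
}
```

It can be fed with update_verdict "$(docker service inspect --format '{{.UpdateStatus.State}}' myapp_api)". A service that has never been updated may report no state at all, which falls through to "unknown".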

Monitoring Resource Usage

# Resource usage for this service's containers on the current node (docker ps only sees local containers, so run it on each node)
docker stats $(docker ps --filter 'name=myapp_api' -q)

# View configured resource limits
docker service inspect myapp_api --pretty | grep -A 8 Resources

# Health status of all tasks
docker service ps myapp_api --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}'

Once everything is set up, I like to run a “dry” rolling update: retag the same image with a new version number, trigger the update, and follow it with watch docker service ps. If the new replica reaches Running before the old one reaches Shutdown, zero-downtime deployment is working as intended.

These three things — Placement Constraints to route workloads to the right nodes, Docker Config/Secret to secure your configuration, and Rolling Updates with order: start-first for uninterrupted deployments — are what you need from day one when taking Swarm to production. None of it is complicated, but missing any one of the three will eventually cause an incident. I learned that the hard way.