Docker Checkpoint and Restore with CRIU: Live Container Migration Without Service Downtime

Docker tutorial - IT technology blog

The Problem: Migrating a Container Without Stopping the Service

Last month I needed to move a VPS to a more powerful server. The container on it was running a background processing job and was 6 hours into a large batch. Stopping the container meant losing all progress and starting from scratch, but keeping it running meant I couldn’t migrate.

This is a situation many people have run into: Docker has no built-in “pause and move” mechanism. docker pause freezes a container in place, but it can’t carry that frozen state to another host. The usual approaches are:

  • Stop the container → back up data → restore on the new server → restart (causes downtime)
  • Use a persistent volume and restart (loses in-memory state)
  • Use complex replication (resource-intensive, not always feasible)

But there’s a lesser-known approach: CRIU (Checkpoint/Restore In Userspace) — freeze the entire process state, transfer it to another machine, and resume execution as if nothing happened.

Why Containers Can’t Be Migrated “Hot”

At their core, containers are process groups isolated using Linux namespaces and cgroups. While a container is running, it holds:

  • Memory state: heap, stack, variables currently being processed
  • File descriptors: open sockets, pipes, file handles
  • Network connections: established TCP connections
  • Process state: CPU registers, signal handlers, timers

All of this exists only in the current server’s RAM. A Docker image stores only a filesystem snapshot, not runtime state. That’s why running docker commit, pushing the image, and then running docker pull on a new machine won’t recover in-memory state.
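
To get a feel for how much of this state lives outside the image, you can look at a running process through /proc. The sketch below is only illustrative: runtime_summary is a hypothetical helper name, and for a container you would pass the host PID that docker inspect reports in .State.Pid.

```shell
# runtime_summary PID: print a one-line summary of state that exists only
# in the kernel and in RAM. Hypothetical helper, not part of Docker or CRIU.
runtime_summary() {
  pid="$1"
  # Open file descriptors: regular files, sockets, pipes
  fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
  # Resident memory in kB, as reported by the kernel
  rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
  # Number of threads in the process
  threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")
  echo "pid=$pid fds=$fds rss_kb=$rss threads=$threads"
}

# Usage (on the Docker host, usually as root):
#   runtime_summary "$(docker inspect --format '{{.State.Pid}}' myapp)"
```

None of these values survive a docker commit; they are exactly what a checkpoint has to capture.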

Solutions

Option 1: Blue-Green Deployment (Without CRIU)

If the application is stateless or can sync state from a database, blue-green deployment is a much simpler approach:

  • Start a new container on the destination server
  • Switch traffic via Nginx or a load balancer
  • Stop the old container once everything is confirmed
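
As a rough sketch of the traffic switch in the second step, assuming Nginx includes a file such as /etc/nginx/upstreams/active.conf that is a symlink to either blue.conf or green.conf (this layout and the helper name switch_active are assumptions, not something Nginx provides):

```shell
# switch_active DIR COLOR: repoint DIR/active.conf at DIR/COLOR.conf.
# Hypothetical helper; the directory layout is an assumption for this sketch.
switch_active() {
  dir="$1"      # directory holding blue.conf and green.conf
  color="$2"    # "blue" or "green"
  if [ ! -f "$dir/$color.conf" ]; then
    echo "missing $dir/$color.conf" >&2
    return 1
  fi
  # -n replaces the existing symlink itself instead of following it
  ln -sfn "$color.conf" "$dir/active.conf"
}

# Usage: switch, then reload Nginx so it picks up the new upstream
#   switch_active /etc/nginx/upstreams green && sudo nginx -s reload
```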

However, if the container is holding critical in-memory state (a fully loaded ML model, an in-progress batch job, active WebSocket connections, etc.), this approach won’t work.

Option 2: CRIU — Checkpoint/Restore In Userspace

CRIU is a Linux tool that lets you dump the entire process state to disk and then restore it anywhere. Docker integrates CRIU through an experimental feature with two key commands:

  • docker checkpoint create — freezes the container and dumps its state to disk
  • docker start --checkpoint — restores the container from a previously created checkpoint

Hands-On: Migrating a Container with Docker Checkpoint + CRIU

Step 1: Install CRIU on Both Servers

# Ubuntu/Debian
sudo apt-get install -y criu

# Check version (3.14+ required)
criu --version

# Verify the kernel supports the features CRIU needs (requires root;
# the older --ms flag is deprecated in current CRIU releases)
sudo criu check
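
If you want to script the version check on both servers, a minimal sketch could look like the following. The helper names are made up for this example, the 3.14 threshold is simply the one mentioned above, and the comparison relies on GNU sort -V.

```shell
# version_at_least HAVE WANT: succeeds when HAVE >= WANT (needs GNU sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# criu_version: extract the number from `criu --version`, whose first line
# looks like "Version: 3.17.1".
criu_version() {
  criu --version | awk '/^Version:/ {print $2; exit}'
}

# Usage:
#   version_at_least "$(criu_version)" 3.14 || echo "CRIU is too old" >&2
```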

Step 2: Enable Experimental Features for the Docker Daemon

Docker disables checkpoint support by default. You need to enable it in the daemon config:

sudo nano /etc/docker/daemon.json
{
  "experimental": true
}
sudo systemctl restart docker

# Verify experimental mode is enabled
docker info | grep -i experimental
# Experimental: true
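
For scripts, grepping human-readable output is fragile. A small sketch that reads the same setting through docker info's --format flag instead (require_experimental is a hypothetical helper name):

```shell
# require_experimental: fail early when the daemon is not in experimental mode.
# .ExperimentalBuild is the field docker info exposes for this setting.
require_experimental() {
  if [ "$(docker info --format '{{.ExperimentalBuild}}')" = "true" ]; then
    echo "experimental mode: on"
  else
    echo "experimental mode: off (edit /etc/docker/daemon.json, then restart docker)" >&2
    return 1
  fi
}

# Usage: require_experimental || exit 1
```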

Step 3: Run the Container with the runc Runtime

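Docker’s checkpoint support goes through runc, so the container must use the runc runtime. In my case it also failed when the containerd image store (the containerd-snapshotter feature available since Docker 24) was enabled, and I spent a fair amount of time debugging this:

# --runtime=runc makes the runtime explicit (runc is Docker's default)
docker run -d --name myapp \
  --runtime=runc \
  --security-opt seccomp=unconfined \
  nginx:alpine

# Confirm the container is running
docker ps

The --security-opt seccomp=unconfined flag disables the default seccomp profile so CRIU has sufficient access to system calls when dumping state. It also relaxes the container’s sandboxing, so use it only where that trade-off is acceptable.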

Step 4: Create a Checkpoint

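# Create a checkpoint (by default the container is stopped once the dump completes)
docker checkpoint create myapp checkpoint1

# Alternative: keep the container running after the checkpoint
# (run one of the two commands, not both, since checkpoint names must be unique)
docker checkpoint create --leave-running myapp checkpoint1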

# List existing checkpoints
docker checkpoint ls myapp
# CHECKPOINT NAME
# checkpoint1

Checkpoint data is stored by default at:

/var/lib/docker/containers/<container-id>/checkpoints/checkpoint1/
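
The path above assumes Docker’s default data root. A hedged sketch that derives the checkpoint directory from the daemon’s own settings instead (checkpoint_path is a hypothetical helper; the directory layout is the one shown above):

```shell
# checkpoint_path CONTAINER NAME: print the directory where Docker keeps
# the named checkpoint, using the daemon's configured data root.
checkpoint_path() {
  container="$1"
  name="$2"
  root=$(docker info --format '{{.DockerRootDir}}') || return 1
  id=$(docker inspect --format '{{.Id}}' "$container") || return 1
  echo "$root/containers/$id/checkpoints/$name"
}

# Usage: see how much disk the dump takes
#   sudo du -sh "$(checkpoint_path myapp checkpoint1)"
```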

Step 5: Transfer the Checkpoint to the New Server

# Get the container ID
CONTAINER_ID=$(docker inspect --format='{{.Id}}' myapp)

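# Archive the checkpoint directory. -C stores it as "checkpoint1/" rather
# than under the full /var/lib/docker/... path, so it unpacks cleanly later.
sudo tar -czf checkpoint1.tar.gz \
  -C /var/lib/docker/containers/${CONTAINER_ID}/checkpoints checkpoint1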

# Copy to the new server
scp checkpoint1.tar.gz user@new-server:/tmp/

# Export the container filesystem (needed for a proper restore environment)
docker export myapp | gzip > myapp-fs.tar.gz
scp myapp-fs.tar.gz user@new-server:/tmp/
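
A corrupted archive tends to surface only as an obscure restore failure, so it can be worth verifying the copies. A minimal sketch using sha256sum; the helper names and the SHA256SUMS file name are just conventions chosen for this example:

```shell
# make_manifest DIR: record SHA-256 checksums of the archives in DIR.
make_manifest() {
  ( cd "$1" && sha256sum -- *.tar.gz > SHA256SUMS )
}

# verify_manifest DIR: re-check the files in DIR against SHA256SUMS.
# Returns non-zero if any file is missing or differs.
verify_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS )
}

# Usage:
#   make_manifest .                      # on the old server, before scp
#   scp SHA256SUMS user@new-server:/tmp/
#   verify_manifest /tmp                 # on the new server, after scp
```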

Step 6: Restore on the New Server

# --- Run these commands on the new server ---

# 1. Import the container filesystem. docker import drops image metadata,
#    so set a default command or docker create will reject the image.
docker import --change 'CMD ["nginx", "-g", "daemon off;"]' \
  /tmp/myapp-fs.tar.gz myapp-image:restored

# 2. Create the container (don't start it yet) with the same name
docker create --name myapp \
  --runtime=runc \
  --security-opt seccomp=unconfined \
  myapp-image:restored

# 3. Get the new container ID
NEW_CONTAINER_ID=$(docker inspect --format='{{.Id}}' myapp)

# 4. Copy the checkpoint to the correct location (the directory is root-owned)
sudo mkdir -p /var/lib/docker/containers/${NEW_CONTAINER_ID}/checkpoints/
sudo tar -xzf /tmp/checkpoint1.tar.gz \
  -C /var/lib/docker/containers/${NEW_CONTAINER_ID}/checkpoints/

# 5. Restore from the checkpoint
docker start --checkpoint checkpoint1 myapp

Verifying the Restore

# Check logs to confirm the container resumed from where it left off
docker logs myapp

# View the full container state
docker inspect myapp

I usually paste the JSON output of docker inspect into toolcraft.app/en/tools/developer/json-formatter to read it more easily — much faster than installing an extension, especially when you’re SSH’d into a server without browser extensions available.
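
When only one or two fields matter, docker inspect’s --format flag avoids reading the whole JSON document. A small sketch (both helper names are hypothetical):

```shell
# container_status NAME: print the container's state, e.g. "running" or "exited".
container_status() {
  docker inspect --format '{{.State.Status}}' "$1"
}

# assert_running NAME: succeed only if the container is currently running.
assert_running() {
  status=$(container_status "$1") || return 1
  if [ "$status" != "running" ]; then
    echo "$1 is $status, expected running" >&2
    return 1
  fi
}

# Usage, right after the restore:
#   assert_running myapp && echo "restore looks alive"
```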

Best Practices: What You Should Know Before Using CRIU

When to Use CRIU:

  • The container holds critical in-memory state that can’t be recovered from disk
  • You’re running a long-running batch job and don’t want to restart from scratch
  • You need to migrate during a hardware upgrade for CPU/RAM-intensive workloads
  • Debugging production issues: checkpoint to snapshot state at the moment a bug occurs and analyze it offline

When NOT to Use CRIU:

  • The container has active TCP connections — restores often fail or connections get dropped
  • GPU workloads: CRIU cannot capture CUDA/GPU state on its own; that requires device-specific plugins and extra setup
  • Containers using complex user namespaces that may conflict on restore
  • Production environments with strict SLAs — CRIU is still experimental in Docker; don’t attempt it for the first time on a live system
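
To check the first point before attempting a checkpoint, you can count established TCP connections in the container’s network namespace. This sketch parses the /proc/net/tcp table, where the fourth column is the socket state in hex and 01 means ESTABLISHED; count_established is a hypothetical helper name.

```shell
# count_established: read /proc/net/tcp-style text on stdin and print the
# number of sockets in state 01 (ESTABLISHED). Header lines are skipped
# automatically because their 4th column is the literal "st".
count_established() {
  awk '$4 == "01" { n++ } END { print n + 0 }'
}

# Usage (the container's init PID gives access to its network namespace):
#   pid=$(docker inspect --format '{{.State.Pid}}' myapp)
#   sudo cat "/proc/$pid/net/tcp" "/proc/$pid/net/tcp6" | count_established
```

A non-zero count does not guarantee a failed restore, but it is a good reason to test the checkpoint somewhere safe first.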

Periodic Checkpoint Script

A pattern I’ve found effective is checkpointing on a schedule rather than waiting until you actually need to migrate:

#!/bin/bash
# Run via cron every hour: 0 * * * * /opt/scripts/checkpoint.sh
set -euo pipefail

CONTAINER_NAME="myapp"
CHECKPOINT_NAME="auto-$(date +%Y%m%d-%H%M)"

# Remove old checkpoints, keeping only the 3 most recent
# (the timestamped names sort chronologically, so the oldest come first)
OLD=$(docker checkpoint ls "$CONTAINER_NAME" | tail -n +2 | awk '{print $1}' | sort | head -n -3)
for cp in $OLD; do
  docker checkpoint rm "$CONTAINER_NAME" "$cp" 2>/dev/null || true
done

# Create a new checkpoint and let the container keep running afterwards
# (it is still frozen briefly while CRIU dumps its state)
docker checkpoint create --leave-running "$CONTAINER_NAME" "$CHECKPOINT_NAME"
echo "[$(date)] Checkpoint created: $CHECKPOINT_NAME"
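
The retention rule (keep the newest three) is the easiest part of such a script to get wrong, so it can help to isolate it as a filter that you can try on sample names without touching Docker. This is a sketch; stale_checkpoints is a hypothetical helper, and it relies on the auto-YYYYMMDD-HHMM naming sorting chronologically as plain text.

```shell
# stale_checkpoints KEEP: read checkpoint names on stdin (one per line) and
# print every name except the newest KEEP, oldest first.
stale_checkpoints() {
  keep="$1"
  sort | awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }
  '
}

# Usage: dry run showing what would be removed
#   docker checkpoint ls myapp | tail -n +2 | awk '{print $1}' | stale_checkpoints 3
```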

Lessons Learned from Real-World Use

  • Kernel versions must be compatible between source and destination servers — CRIU checkpoints are tied to the kernel ABI, and restoring across significantly different kernel versions will fail
  • Test in a dev environment first — don’t try this for the first time on production when something is already going wrong
  • Dump time depends on memory usage: a container using 8 GB of RAM can take several minutes to checkpoint
  • Docker Swarm and Kubernetes don’t natively support CRIU-based migration. Kubernetes exposes container checkpointing only through a kubelet API, with no built-in restore or migration, so you’ll need third-party tooling. For JVM workloads, consider CRaC (Coordinated Restore at Checkpoint)
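
For the kernel point, a quick pre-flight comparison of the two hosts can catch obvious mismatches. The sketch below compares only the major.minor part of uname -r, which is a rough heuristic chosen for this example and not a guarantee that CRIU will restore successfully; both helper names are hypothetical.

```shell
# kernel_series RELEASE: reduce a `uname -r` string such as
# "5.15.0-105-generic" to its major.minor part ("5.15").
kernel_series() {
  echo "$1" | cut -d. -f1,2
}

# same_kernel_series A B: succeed when both releases share major.minor.
same_kernel_series() {
  [ "$(kernel_series "$1")" = "$(kernel_series "$2")" ]
}

# Usage, from the old server:
#   same_kernel_series "$(uname -r)" "$(ssh user@new-server uname -r)" \
#     || echo "kernel series differ, test the restore before relying on it" >&2
```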

CRIU is a powerful tool, but it requires a clear understanding of its limitations. For stateless services, blue-green deployment is still the safer and simpler choice. But for containers holding complex state that simply can’t be restarted, CRIU is one of the few options that gets the job done.
