Containers run directly on the host kernel — and that’s a problem
If you’re using Docker and have never considered that a container could “escape” and attack the host, this article is for you.
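By nature, Linux containers share the kernel with the host. Unlike VMs, where a hypervisor separates guest from host, containers rely on kernel features such as namespaces, cgroups, capabilities, and seccomp filters for isolation. That means: if an attacker exploits a kernel vulnerability reachable through a syscall inside a container, they can escalate privileges out to the host.
This type of attack is called a container escape. Two well-known examples are runc CVE-2019-5736, where a malicious container could overwrite the host's runc binary, and Dirty Pipe (CVE-2022-0847), a kernel bug that allowed overwriting data in read-only files. The first targets the container runtime and the second the kernel itself, but both start from inside a container and end on the host.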
I run a homelab with Proxmox VE managing 12 VMs and containers — it’s my playground for testing everything before pushing to production. After reading a post-mortem about a container escape on another company’s production Kubernetes cluster, I started taking hardening more seriously instead of just using --read-only or dropping capabilities.
The solution I found: gVisor.
What is gVisor and how it differs from conventional isolation
gVisor is a sandbox runtime for containers, developed and open-sourced by Google. Instead of letting containers make syscalls directly to the host kernel, gVisor places an intermediary layer called Sentry in between.
Sentry is a kernel written in Go that runs in user space. When an app inside a container calls open(), read(), or execve(), Sentry intercepts those syscalls and handles them within the sandbox. Only what is truly necessary gets passed down to the host kernel — through a very restricted set of syscalls.
Think of it like this:
- Regular Docker: App → syscall → host Linux kernel (direct)
- gVisor: App → syscall → Sentry (virtual kernel in user space) → a few safe syscalls → host Linux kernel
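The result: the host kernel’s attack surface is dramatically reduced. Even if an attacker compromises the app in a container, the kernel they can attack is Sentry. Reaching the real host kernel would take a second exploit against Sentry and the restricted set of host syscalls it is allowed to make.
gVisor supports several platforms:
- ptrace: Works anywhere, but slower (the default in older releases)
- systrap: The default in recent releases; needs no virtualization support and is faster than ptrace
- KVM: Requires CPU virtualization support and access to /dev/kvm, and is significantly faster than ptrace (this is what I use in my homelab). It performs best on bare metal, so benchmark it if your Docker host is itself a VM using nested virtualization. If you haven’t set up KVM on Ubuntu yet, do that before enabling this platform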
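To make the platform choice concrete, here is a minimal sketch. It assumes runsc's --platform flag is passed through the runtimeArgs key of Docker's daemon.json, and that systrap is the fallback (the default platform in recent gVisor releases; adjust if you run an older build). The helper names pick_platform and runsc_runtime_entry are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the gVisor platform to request.
# $1 = path of the KVM device node (defaults to /dev/kvm).
# Docker starts runsc as root, so run this check as root too.
pick_platform() {
    dev="${1:-/dev/kvm}"
    # KVM needs a character device the caller can open for read and write.
    if [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
        echo kvm
    else
        echo systrap
    fi
}

# Hypothetical helper: print a "runsc" runtimes entry to merge by hand
# into /etc/docker/daemon.json.
runsc_runtime_entry() {
    printf '"runsc": { "path": "/usr/bin/runsc", "runtimeArgs": ["--platform=%s"] }\n' \
        "$(pick_platform "$1")"
}
```

Calling runsc_runtime_entry with no argument checks /dev/kvm and prints the entry to paste into the runtimes object. The script only prints text; it does not modify any file.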
Installing gVisor on Ubuntu/Debian
Step 1: Add the repository and install gVisor
# Add gVisor's official GPG key and repository
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" \
| sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt-get update && sudo apt-get install -y runsc
After installation, verify the version:
runsc --version
# runsc version release-20240401.0
Step 2: Configure Docker to use the gVisor runtime
Open or create the file /etc/docker/daemon.json:
sudo nano /etc/docker/daemon.json
Add the following content. If the file already has content, merge this in rather than overwriting it (running sudo runsc install can also add the entry for you):
{
  "runtimes": {
    "runsc": {
      "path": "/usr/bin/runsc"
    }
  }
}
Restart Docker to apply the changes:
sudo systemctl restart docker
Confirm the runtime has been recognized:
docker info | grep -i runtime
# Runtimes: io.containerd.runc.v2 runsc runc
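The grep output is easy to misread by eye, so the same check can be made script-friendly. A small sketch, assuming docker info prints a Runtimes: line in the format shown above; has_runtime is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the runtime named in $1 appears on the
# "Runtimes:" line of `docker info` output supplied on stdin.
has_runtime() {
    # Keep only the Runtimes line, split it into one word per line,
    # then look for an exact match on the runtime name.
    grep -i '^ *Runtimes:' | tr -s ' ' '\n' | grep -qx -- "$1"
}

# Usage (requires a running Docker daemon):
#   docker info 2>/dev/null | has_runtime runsc && echo "runsc registered"
```

Matching whole words avoids a false positive from names that merely contain the string, such as io.containerd.runc.v2 when you are looking for runc.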
Step 3: Run a container with gVisor
Simply add the --runtime=runsc flag to your normal Docker command:
# Regular container (uses runc, host kernel)
docker run --rm ubuntu uname -r
# Container with gVisor (uses runsc, Sentry kernel)
docker run --rm --runtime=runsc ubuntu uname -r
The interesting part: uname -r inside a gVisor container returns a completely different kernel version than the host — that’s Sentry’s kernel, not your machine’s actual kernel.
# Example output
# Host kernel: 6.8.0-87-generic
# gVisor kernel: 4.4.0
# (Sentry emulates an older kernel version for compatibility)
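The comparison can be turned into a quick pass/fail check. A minimal sketch: the hypothetical helper compare_kernels takes the two release strings as arguments, so the logic itself does not depend on Docker.

```shell
# Hypothetical helper: compare the host kernel release ($1) with the one
# reported inside the container ($2). Fails if they are identical.
compare_kernels() {
    if [ "$1" = "$2" ]; then
        echo "same kernel ($1): the container is probably not sandboxed"
        return 1
    fi
    echo "host=$1 sandbox=$2: the container sees a different kernel"
}

# Usage (requires Docker with the runsc runtime configured):
#   compare_kernels "$(uname -r)" \
#       "$(docker run --rm --runtime=runsc ubuntu uname -r)"
```

An identical release string usually means the container ran under runc after all, for example because the --runtime flag was dropped somewhere in a wrapper script.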
Step 4: Use gVisor with Docker Compose
In your docker-compose.yml, add runtime to the services you want to protect:
version: '3.8'
services:
  webapp:
    image: nginx:alpine
    runtime: runsc
    ports:
      - "8080:80"
  database:
    image: postgres:15
    runtime: runsc
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
Run as usual:
docker compose up -d
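Once the stack is up, it is worth confirming that each service really landed on the gVisor runtime. A sketch, assuming docker inspect exposes the runtime as .HostConfig.Runtime; not_sandboxed is a hypothetical helper that reads "name runtime" pairs on stdin and prints the names that are not running under runsc.

```shell
# Hypothetical helper: read lines of "<container-name> <runtime>" on stdin
# and print the names whose runtime is anything other than runsc.
not_sandboxed() {
    awk '$2 != "runsc" { print $1 }'
}

# Usage (run from the directory containing docker-compose.yml):
#   docker inspect --format '{{.Name}} {{.HostConfig.Runtime}}' \
#       $(docker compose ps -q) | not_sandboxed
```

Empty output means every container in the project is sandboxed; any name printed is a service still running on the host kernel.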
Step 5: Set gVisor as the default runtime (optional)
Want every container to go through gVisor unless specified otherwise? Update daemon.json:
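{
  "default-runtime": "runsc",
  "runtimes": {
    "runsc": {
      "path": "/usr/bin/runsc"
    }
  }
}
Docker reserves the name runc for its built-in runtime, so it needs no entry here (and the daemon may refuse to start if you redefine it). When you need the original runtime, just specify --runtime=runc.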
Verifying gVisor actually provides isolation
This is the first thing I test after setup — both to confirm the sandbox is working and to see the difference firsthand:
# Check /proc/self/status inside a gVisor container
docker run --rm --runtime=runsc ubuntu cat /proc/self/status
# Compare with the same command under runc: Sentry serves its own emulated /proc, so the fields can differ
# Read kernel information
docker run --rm --runtime=runsc ubuntu cat /proc/version
# Linux version 4.4.0 (#1 SMP ...): this is Sentry, not the real kernel
# Check the sandbox hostname
docker run --rm --runtime=runsc ubuntu hostname
# Prints the container's own hostname (the short container ID by default), confirming the sandbox started
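One more test touches ptrace, a syscall commonly abused in exploit techniques. The stock ubuntu image does not ship strace, so install it inside the container first:
docker run --rm --runtime=runsc ubuntu bash -c \
'apt-get update -qq && apt-get install -y -qq strace >/dev/null && strace -f -e trace=ptrace ls 2>&1 | head -5'
# Whether you get a trace or an error depends on the gVisor release; either way the ptrace calls are handled by Sentry, not by the host kernel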
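Another quick signal is dmesg. gVisor's documentation suggests it as a check, because inside the sandbox it shows Sentry's own boot messages rather than the host's kernel log. A sketch follows; looks_like_gvisor is a hypothetical helper, and the exact message text may vary between releases, so it only looks for the word gVisor.

```shell
# Hypothetical helper: succeed if the dmesg output on stdin mentions gVisor.
looks_like_gvisor() {
    grep -qi 'gvisor'
}

# Usage (requires Docker with the runsc runtime configured):
#   docker run --rm --runtime=runsc ubuntu dmesg | looks_like_gvisor \
#       && echo "running under gVisor"
```

Under plain runc the same dmesg call either prints the host's kernel log or is denied, depending on your Docker and kernel settings, so the helper returns non-zero there.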
Performance considerations and limitations
gVisor is not a silver bullet. Before deploying to production, there are a few trade-offs to understand:
- Syscall overhead: Every syscall must pass through Sentry, so latency is higher than plain runc. As rough figures, I/O-heavy workloads (databases with continuous writes, file processing) can be 20–40% slower, while CPU-bound workloads (compression, encryption) typically add only 2–5%. The exact cost depends on the workload and the platform, so treat these as a starting point rather than a guarantee.
- Compatibility: Sentry does not implement 100% of Linux syscalls. Apps using less common syscalls or newer kernel features may not run — test thoroughly before pushing to production.
- Volume mounts: File I/O on host-backed mounts passes through an extra gVisor process (the Gofer), so it is slower than in regular containers. For directories that need high throughput and do not need to persist, tmpfs avoids that path.
- Not a VM: gVisor is still much lighter than a VM, with millisecond start times. But if you need full hardware-level isolation, a VM with KVM is the right tool.
In my homelab, I use gVisor for containers running untrusted code (CI runners receiving external code) and public-facing services. Internal databases still run on plain runc because of the high write frequency — the 20–40% overhead there isn’t worth the trade-off.
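Figures like these vary a lot between workloads, so it is worth measuring your own before deciding. A rough sketch: overhead_pct is a hypothetical helper that turns two timings in seconds into a percentage, and the dd command in the usage comment is only an example workload, not a rigorous benchmark.

```shell
# Hypothetical helper: print how much slower the sandboxed run was,
# as a whole-number percentage. $1 = runc seconds, $2 = runsc seconds.
overhead_pct() {
    awk -v base="$1" -v sb="$2" 'BEGIN {
        if (base <= 0) exit 1          # avoid dividing by zero
        printf "%.0f\n", (sb - base) / base * 100
    }'
}

# Usage: time the same command under both runtimes, then compare.
#   time docker run --rm ubuntu \
#       sh -c 'dd if=/dev/zero of=/tmp/f bs=1M count=256'
#   time docker run --rm --runtime=runsc ubuntu \
#       sh -c 'dd if=/dev/zero of=/tmp/f bs=1M count=256'
#   overhead_pct <runc-seconds> <runsc-seconds>
```

Note that wall-clock timings of docker run include container startup, so repeat the runs a few times and use a workload long enough that startup cost does not dominate.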
Conclusion
After a few months of running gVisor in my homelab, I think it’s a worthwhile security layer to add to your stack if you’re concerned about container escape. No need to touch your Dockerfile or build pipeline — just --runtime=runsc and your container is running in a sandbox with its own kernel.
Quick recap:
- gVisor intercepts syscalls via Sentry, so the application never talks to the host kernel directly
- Installation takes just 5 minutes with no image or Dockerfile changes required
- Best suited for untrusted workloads, public-facing services, and CI/CD environments
- There are performance trade-offs — I/O-heavy apps should be benchmarked first, not deployed blindly
If you’re using Kubernetes, gVisor also supports it via RuntimeClass — apply it per-pod without affecting the entire cluster. That’s the next step if your infrastructure has reached the orchestration layer.
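For reference, a minimal sketch of what that looks like. It assumes the cluster nodes already have runsc installed and the container runtime configured with a handler named runsc; the RuntimeClass name gvisor and the pod name nginx-gvisor are arbitrary choices for illustration, and gvisor_manifest is a hypothetical helper.

```shell
# Hypothetical helper: print a RuntimeClass plus a pod that opts into it.
gvisor_manifest() {
    cat <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  runtimeClassName: gvisor
  containers:
    - name: nginx
      image: nginx:alpine
EOF
}

# Usage (requires kubectl access to the cluster):
#   gvisor_manifest | kubectl apply -f -
```

Only pods that set runtimeClassName are sandboxed; everything else keeps running on the cluster's default runtime.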
