How to Deploy a Mesh VPN with Nebula: Slack’s Open-Source Tool for Securely Connecting Thousands of Servers – ITFROMZERO

Traditional VPN starts breaking down as infrastructure grows

Manage a few dozen servers spread across multiple cloud providers (AWS in Virginia, Hetzner in Nuremberg, a handful of VPS instances in Singapore) and you’ll quickly realize that OpenVPN or WireGuard in a hub-and-spoke model just doesn’t cut it anymore. All traffic has to pass through a central server, so latency climbs. If that server goes down, the entire network goes dark.

Slack faced exactly this problem at the scale of thousands of nodes. They built and open-sourced Nebula, a mesh VPN overlay where nodes find each other through a lighthouse (a discovery node that does not relay their traffic) and then connect directly, peer-to-peer, authenticating each other with certificates. No central relay sits in the data path, so there is no traffic bottleneck.

I’ve been using Nebula to connect 3 VPS instances across 3 different datacenters and a dev laptop. Everything was up and running within 30 minutes of setup. Latency between two nodes in the same region is about 15ms lower compared to WireGuard through a relay.

How Nebula Works

Each node gets a virtual IP within a range you define (this guide uses 192.168.100.0/24). Traffic between nodes is encrypted using the Noise Protocol Framework, the same foundation WireGuard builds on, but the architecture is completely different. Four things set it apart:

Lighthouse: A node with a public IP that helps other nodes find each other (works like a STUN server). It does not relay traffic — only discovery.
True peer-to-peer: After discovery, two nodes connect directly via UDP hole punching — even when both are behind NAT.
Certificate-based auth: Each node has a certificate signed by your CA. No valid cert means no access to the network. Simple as that.
Firewall embedded in config: Traffic control at the overlay layer — no separate iptables rules needed.

Installing Nebula

Requirements

At least 1 server with a public IP (to act as lighthouse)
Linux/macOS/Windows are all supported
UDP port 4242 open on the lighthouse’s firewall

Download the binary

On each node (including the lighthouse), download the binary from GitHub releases:

# On Linux x86_64
wget https://github.com/slackhq/nebula/releases/latest/download/nebula-linux-amd64.tar.gz
tar -xzf nebula-linux-amd64.tar.gz
sudo mv nebula nebula-cert /usr/local/bin/
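If some of your nodes are not x86_64, the asset name in the URL changes. Here is a minimal sketch that maps the output of uname -m to a release asset. The helper name nebula_asset_url is my own, and it only covers the two Linux architectures listed, so check the releases page for anything else.

```shell
#!/bin/sh
# Map a `uname -m` machine name to the Nebula release asset for Linux.
# Only the two architectures listed here are handled; extend as needed.
nebula_asset_url() {
    case "$1" in
        x86_64)        arch=amd64 ;;
        aarch64|arm64) arch=arm64 ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
    echo "https://github.com/slackhq/nebula/releases/latest/download/nebula-linux-${arch}.tar.gz"
}

# Print the URL for this machine; pass it to wget as in the step above.
nebula_asset_url "$(uname -m)" || true
```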

Create the CA and certificates for each node

This step runs once on the admin machine, then you distribute the certs to each node via scp. Important: ca.key never leaves this machine.

# Create the CA
nebula-cert ca -name "MyInfra CA"
# Outputs: ca.crt and ca.key — keep ca.key EXTREMELY secret

# Create cert for the lighthouse (virtual IP: 192.168.100.1)
nebula-cert sign -name "lighthouse" \
  -ip "192.168.100.1/24" \
  -ca-crt ca.crt -ca-key ca.key
# Outputs: lighthouse.crt, lighthouse.key

# Create cert for app server (virtual IP: 192.168.100.10)
nebula-cert sign -name "app-server" \
  -ip "192.168.100.10/24" \
  -groups "servers" \
  -ca-crt ca.crt -ca-key ca.key

# Create cert for dev laptop (virtual IP: 192.168.100.50)
nebula-cert sign -name "dev-laptop" \
  -ip "192.168.100.50/24" \
  -groups "developers" \
  -ca-crt ca.crt -ca-key ca.key
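Typing these commands by hand gets tedious past a handful of nodes. The sketch below turns a small inventory into the same nebula-cert sign commands. The inventory format (name, ip/prefix, optional groups) and the function name sign_commands are assumptions of this example. It only prints the commands, so you can review them before running anything against ca.key.

```shell
#!/bin/sh
# Print one `nebula-cert sign` command per inventory line read from stdin.
# Line format (an assumption of this sketch): name ip/prefix [groups]
# Nothing is executed; review the output, then pipe it to `sh`.
sign_commands() {
    while read -r name ip groups; do
        # Skip blank lines and comments
        case "$name" in ''|'#'*) continue ;; esac
        cmd="nebula-cert sign -name \"$name\" -ip \"$ip\""
        if [ -n "$groups" ]; then
            cmd="$cmd -groups \"$groups\""
        fi
        echo "$cmd -ca-crt ca.crt -ca-key ca.key"
    done
}

sign_commands <<'EOF'
# name       ip                 groups
lighthouse   192.168.100.1/24
app-server   192.168.100.10/24  servers
dev-laptop   192.168.100.50/24  developers
EOF
```

Run it in the directory that holds ca.crt and ca.key, and pipe the output to sh once it looks right.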

When I need to work out subnets for the overlay IP range, I often use toolcraft.app/en/tools/developer/ip-subnet-calculator — just enter a CIDR and it instantly shows the network range, broadcast address, and maximum host count. Much easier than calculating by hand.
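For a quick sanity check without leaving the terminal, the host count is simple shell arithmetic. This is a small sketch for IPv4 prefixes between 1 and 30, and the function name usable_hosts is made up for the example.

```shell
#!/bin/sh
# Usable host addresses in an IPv4 network with the given prefix length.
# Valid for prefixes 1-30; network and broadcast addresses are subtracted.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24   # 254
usable_hosts 20   # 4094
```

So the /24 used in this guide leaves room for 254 nodes, which is plenty for a small fleet but worth widening before you start issuing certs for a larger one.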

Detailed Configuration

Lighthouse Config

Create /etc/nebula/config.yaml on the lighthouse server:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key

lighthouse:
  am_lighthouse: true

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true

logging:
  level: info

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any

Config for regular nodes (app-server, dev-laptop)

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/app-server.crt
  key: /etc/nebula/app-server.key

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"  # Virtual IP of the lighthouse

static_host_map:
  "192.168.100.1": ["203.0.113.10:4242"]  # Real public IP of the lighthouse

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true
  respond: true

logging:
  level: info

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    # Allow SSH from the developers group
    - port: 22
      proto: tcp
      groups:
        - developers
    # Allow app traffic between servers
    - port: 8080
      proto: tcp
      groups:
        - servers
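The config above points at app-server.crt and app-server.key, so every other node needs its own copy with those two paths changed. A minimal sketch of that substitution with sed follows. The function name config_for_node and the file names in the usage comment are illustrative only.

```shell
#!/bin/sh
# Rewrite the pki cert/key paths of the app-server config for another node.
# Reads the template on stdin, writes the adapted config to stdout.
config_for_node() {
    sed -e "s|/app-server\.crt|/$1.crt|" -e "s|/app-server\.key|/$1.key|"
}

# Example (file names are illustrative):
# config_for_node dev-laptop < app-server.yaml > dev-laptop.yaml
```

This only touches the pki paths. Review the inbound firewall rules per node as well, since a dev laptop probably has no reason to accept traffic on port 8080.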

Deploy certs and do a test run

Copy the certs to each node and test before handing off to systemd:

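# Make sure the config directory exists, then copy to lighthouse
ssh root@<lighthouse-ip> 'mkdir -p /etc/nebula'
scp ca.crt lighthouse.crt lighthouse.key root@<lighthouse-ip>:/etc/nebula/

# Same for app-server
ssh root@<app-server-ip> 'mkdir -p /etc/nebula'
scp ca.crt app-server.crt app-server.key root@<app-server-ip>:/etc/nebula/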

# Run in the foreground to watch logs directly
sudo nebula -config /etc/nebula/config.yaml

When the log shows Handshake message sent and Handshake message received, the two nodes have successfully established a connection. Only then should you switch to systemd:

sudo tee /etc/systemd/system/nebula.service > /dev/null << 'EOF'
[Unit]
Description=Nebula VPN
After=network.target

[Service]
ExecStart=/usr/local/bin/nebula -config /etc/nebula/config.yaml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nebula

Testing Connectivity and Monitoring

Ping over the overlay network

Once Nebula is running on all nodes, verify using the virtual IPs:

# From app-server, ping dev-laptop over the Nebula overlay
ping 192.168.100.50

# Print the certificate info for the current node
nebula-cert print -path /etc/nebula/app-server.crt
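With more than two nodes, pinging each one by hand gets old fast. Here is a small sketch that loops over a list of overlay IPs and reports which ones answer. It relies on the inbound ICMP rule from the configs above, the function name check_peers is my own, and the -W 2 timeout is in seconds on Linux iputils ping (other platforms interpret it differently).

```shell
#!/bin/sh
# Ping each overlay IP once and report which peers answer.
# Returns non-zero if any peer is unreachable.
check_peers() {
    failed=0
    for ip in "$@"; do
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            echo "ok   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    return "$failed"
}

# Usage: check_peers 192.168.100.1 192.168.100.10 192.168.100.50
```

The first ping to a new peer can be slow or lost while the handshake and hole punching complete, so a single FAIL right after startup is worth retrying before you start debugging.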

Check peer status via Prometheus metrics

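Nebula can expose metrics in Prometheus format. Add this to your config (if an app on the same host already listens on port 8080, as in the firewall example above, pick a different port for listen):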

stats:
  type: prometheus
  listen: 127.0.0.1:8080
  path: /metrics
  namespace: nebula
  subsystem: stats
  interval: 10s

# Check active tunnel count and handshake success/failure stats
curl -s http://127.0.0.1:8080/metrics | grep -E "(tunnel|handshake)"
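The plain grep above also matches the # HELP and # TYPE comment lines. A slightly tidier sketch with awk prints only the sample name and value. The function name metric_samples is made up, I am not assuming any particular Nebula metric names (the pattern is whatever you pass in), and it assumes samples carry no trailing timestamp and no spaces inside labels.

```shell
#!/bin/sh
# Read Prometheus text format on stdin; print "name value" for every sample
# whose name matches the extended regex in $1. Comment lines are skipped.
# Assumes no trailing timestamps and no spaces inside label values.
metric_samples() {
    awk -v pat="$1" '!/^#/ && $1 ~ pat { print $1, $NF }'
}

# Usage:
# curl -s http://127.0.0.1:8080/metrics | metric_samples 'tunnel|handshake'
```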

Debug firewall rules

# Enable debug log level to see which packets the firewall drops
# Edit config: logging.level: debug
# Restart and monitor (drops are logged as "dropping inbound packet" / "dropping outbound packet")
journalctl -u nebula -f | grep -iE "(dropping|allow|deny|firewall)"

Test real-world throughput

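Measure bandwidth between two nodes with iperf3 once the network has stabilized. iperf3 listens on TCP port 5201 by default, which the inbound rules above do not allow, so temporarily add an inbound rule for port 5201 on the receiver first: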

# On the receiver node (192.168.100.10)
iperf3 -s -B 192.168.100.10

# On the sender node — 4 parallel streams, 30-second test
iperf3 -c 192.168.100.10 -t 30 -P 4

Things to Keep in Mind for Real-World Operations

During the first few weeks in production, I ran into all sorts of issues. Documenting them here so you don’t have to waste time debugging the same things:

Lighthouse redundancy: Use at least 2 lighthouses in 2 different datacenters. On every node, add the second lighthouse’s virtual IP to lighthouse.hosts and a matching static_host_map entry that maps it to its public IP and port.
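Certificate expiry: By default, nebula-cert gives the CA a one-year lifetime (-duration 8760h), and each host cert expires one second before the CA that signed it. That means every cert on the network expires at the same moment, and a compromised cert stays valid until then. Create the CA with a longer -duration, sign host certs with an explicit shorter one such as -duration 8760h (1 year), and schedule regular rotation. A host cert cannot outlive its CA, so signing fails if the requested duration runs past the CA’s expiry.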
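MTU overhead: Nebula adds about 60 bytes of header, and its tun device already defaults to an MTU of 1300 to leave room for that. App timing out or seeing unexplained packet loss? Check the MTU of the underlying link and try lowering tun: mtu: below 1300 in your config.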
Revoking a node: Need to kick a node off the network? Add its certificate fingerprint (shown by nebula-cert print) to pki.blocklist in every node’s config and reload. Nebula will then refuse connections from that cert.
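To make the lighthouse redundancy point concrete, here is a sketch that prints the lighthouse and static_host_map sections for a regular node with two lighthouses. The argument format (overlay_ip=public_ip:port) and the function name lighthouse_config are assumptions of this example. 203.0.113.10 is the address used earlier in this guide, while 192.168.100.2 and 198.51.100.20 are made-up placeholders for a second lighthouse.

```shell
#!/bin/sh
# Print the lighthouse-related config for a regular node.
# Each argument is "overlay_ip=public_ip:port" (format assumed for this sketch).
lighthouse_config() {
    echo "lighthouse:"
    echo "  am_lighthouse: false"
    echo "  interval: 60"
    echo "  hosts:"
    for pair in "$@"; do
        echo "    - \"${pair%%=*}\""
    done
    echo "static_host_map:"
    for pair in "$@"; do
        echo "  \"${pair%%=*}\": [\"${pair#*=}\"]"
    done
}

# Second lighthouse values below are placeholders, not real hosts.
lighthouse_config "192.168.100.1=203.0.113.10:4242" "192.168.100.2=198.51.100.20:4242"
```

Paste the output over the matching sections of each node’s config. The second lighthouse itself still needs its own cert and a config with am_lighthouse: true.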

Compared to WireGuard, Nebula has a steeper initial setup. But having firewall rules embedded directly in each node’s config saves an enormous amount of access control overhead once you go past 10 nodes. No separate iptables, no manually syncing rules across servers. For growing infrastructure, this is a compelling replacement for the classic OpenVPN hub-and-spoke model.