Network Bonding on Linux: Configure Bandwidth Aggregation and Automatic Failover for Servers

Network tutorial - IT technology blog

Background — Why You Need Network Bonding

It started at 2 AM on a Friday. The database server suddenly lost its network connection. Alerts flooded the screen, and over 50 office employees wouldn’t be able to work the next morning if the database didn’t come back up. When I ran into the server room to check, I found a failed network card — the only card connecting the server to the switch.

That was the first time I truly understood why network bonding exists.

Network bonding (also known as NIC teaming or link aggregation) is a technique that combines multiple physical network cards into a single logical interface. The most obvious benefits:

  • Automatic failover: One card dies, the other takes over immediately — zero downtime
  • Load balancing: Distribute traffic across multiple parallel links
  • Increased bandwidth: 2 × 1Gbps cards → up to 2Gbps aggregated depending on mode

Linux supports 7 bonding modes, but in practice I mainly use these two:

  • mode 1 (active-backup): Only 1 card is active, the other stays on standby. Simple, no special switch required — this is the default choice for production
  • mode 4 (802.3ad/LACP): Requires a switch with LACP support, but in return provides real bandwidth increase — used for database/storage servers that need high throughput
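For reference, all seven modes and the numeric IDs the bonding driver accepts interchangeably in its mode= option. A small helper sketch to translate the numbers you may run into in older configs:

```shell
#!/bin/sh
# Map the bonding driver's numeric mode IDs to their names
# (the driver's mode= option accepts either form).
mode_name() {
  case "$1" in
    0) echo "balance-rr" ;;
    1) echo "active-backup" ;;
    2) echo "balance-xor" ;;
    3) echo "broadcast" ;;
    4) echo "802.3ad" ;;
    5) echo "balance-tlb" ;;
    6) echo "balance-alb" ;;
    *) echo "unknown" ;;
  esac
}

mode_name 1   # active-backup
mode_name 4   # 802.3ad
```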

Installation

Ubuntu 20.04+ bundles the bonding driver in the kernel. Check if it's loaded:

lsmod | grep bonding

If the output is empty, load the module and configure it to auto-load at boot:

modprobe bonding
echo "bonding" >> /etc/modules

You can inspect the driver version and its available options with modinfo bonding.

For Ubuntu 18.04 and earlier, install ifenslave:

apt-get install ifenslave -y

CentOS/RHEL 8+ requires no additional setup — the driver is built into the kernel, just use nmcli to configure it.

Detailed Configuration

Method 1: Using netplan (Ubuntu 20.04+)

This is my go-to method for current Ubuntu servers. First, check the existing config file:

ls /etc/netplan/
# Usually: 00-installer-config.yaml or 01-netcfg.yaml

Back up the old file before making changes:

cp /etc/netplan/00-installer-config.yaml /etc/netplan/00-installer-config.yaml.bak

Full config file for active-backup bonding:

# /etc/netplan/00-installer-config.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      dhcp4: false
      dhcp6: false
    eth1:
      dhcp4: false
      dhcp6: false
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      addresses: [192.168.1.100/24]
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
      parameters:
        mode: active-backup
        primary: eth0
        mii-monitor-interval: 100

One detail worth noting: mii-monitor-interval: 100 is the interval (in ms) at which the kernel checks the card’s link status. I once set it to 200ms and noticed the failover was noticeably slower — 100ms is a good threshold for production.
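If your switch takes a moment to start forwarding after a link comes back (spanning tree convergence, for example), netplan also exposes up-delay and down-delay to damp flapping links. A fragment with example values (not something the config above requires; both delays should be multiples of the monitor interval):

```yaml
# Optional damping for flapping links (values in ms are examples)
      parameters:
        mode: active-backup
        primary: eth0
        mii-monitor-interval: 100
        up-delay: 200     # wait 200ms after link-up before using the slave
        down-delay: 200   # require 200ms of link-down before failing over
```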

Apply the config — always use netplan try first when working on a remote server:

netplan try     # Test for 120 seconds, will auto-rollback if connection is lost
netplan apply   # Confirm and apply permanently

I once nearly locked myself out of a production server by skipping the netplan try step. Never skip it.

Configuring 802.3ad Mode (LACP) — Real Bandwidth Increase

The switch needs to have a port-channel/LAG created with LACP enabled first. Then modify the parameters section in the netplan file:

bonds:
  bond0:
    interfaces: [eth0, eth1]
    addresses: [192.168.1.100/24]
    gateway4: 192.168.1.1
    parameters:
      mode: 802.3ad
      lacp-rate: fast
      transmit-hash-policy: layer3+4
      mii-monitor-interval: 100

transmit-hash-policy: layer3+4 uses both IP and port to hash traffic — more even load distribution compared to the default layer2 which only hashes by MAC address.
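To see why layer3+4 matters: the per-flow hash mixes ports as well as IPs, so two TCP sessions between the same pair of hosts can land on different slaves. A deliberately simplified sketch (not the kernel's exact formula, which also folds in the IP addresses with bit-shifts):

```shell
#!/bin/sh
# Simplified per-flow slave selection: XOR the ports, modulo slave count.
# Illustration only -- the real layer3+4 hash in the kernel also mixes
# source/destination IPs and folds the result before the modulo.
flow_slave() {  # args: src_port dst_port
  echo $(( ($1 ^ $2) % 2 ))   # bond with 2 slaves
}

flow_slave 45000 3306   # one connection to MySQL
flow_slave 45001 3306   # second connection, different source port
```

Two parallel transfers between the same hosts can therefore use both links, which single-flow hashing by MAC (layer2) never achieves.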

Method 2: Using nmcli (CentOS/RHEL/Rocky Linux)

# Create bond interface
nmcli connection add type bond \
  con-name bond0 \
  ifname bond0 \
  bond.options "mode=active-backup,miimon=100"

# Add eth0 and eth1 as slaves
nmcli connection add type ethernet \
  con-name bond0-slave-eth0 \
  ifname eth0 \
  master bond0

nmcli connection add type ethernet \
  con-name bond0-slave-eth1 \
  ifname eth1 \
  master bond0

# Assign static IP to bond0
nmcli connection modify bond0 \
  ipv4.addresses "192.168.1.100/24" \
  ipv4.gateway "192.168.1.1" \
  ipv4.dns "8.8.8.8,1.1.1.1" \
  ipv4.method manual

# Bring up the connection
nmcli connection up bond0

Verify the connection is up after applying:

nmcli connection show
nmcli device status

Verification & Monitoring

Reading Bond Status from the Kernel

The kernel exposes the full bonding status via /proc/net/bonding/bond0 — this is the first place to look when debugging:

cat /proc/net/bonding/bond0

Output when both cards are healthy:

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0

MII Status: up on both slaves and Currently Active Slave pointing to the correct primary card is a sign that everything is working. Monitor Link Failure Count to detect a flaky card — a steadily increasing count is a sign the card is failing, even if it hasn’t fully gone down yet.
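That check is easy to automate. A hypothetical helper that parses the bonding status file and flags any slave with a non-zero failure count (sample data is inlined so the sketch runs anywhere; point it at /proc/net/bonding/bond0 in real use):

```shell
#!/bin/sh
# Flag slaves whose Link Failure Count is non-zero in a bonding status file.
check_failures() {
  awk '
    /^Slave Interface:/              { slave = $3 }
    /^Link Failure Count:/ && $4 > 0 { print slave ": " $4 " failures" }
  ' "$1"
}

# Demo against inlined sample data (eth1 is the flaky card here):
cat > /tmp/bond0.sample <<'EOF'
Slave Interface: eth0
Link Failure Count: 0
Slave Interface: eth1
Link Failure Count: 3
EOF
check_failures /tmp/bond0.sample   # eth1: 3 failures
```

Drop something like this into cron or your monitoring agent and a flaky card announces itself before it fully dies.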

Real-World Failover Testing — A Step You Cannot Skip

Setting up bonding without testing failover is as good as not having it at all. Open 2 terminals (use tmux if you’re connected over SSH):

# Terminal 1: Continuous ping to gateway
ping -i 0.2 192.168.1.1

# Terminal 2: Bring down active card to simulate hardware failure
ip link set eth0 down

# Watch Terminal 1 — ideally no packet loss
# 1-2 packet drops is normal, more than 5 drops means you should revisit mii-monitor-interval

# Bring back up after testing
ip link set eth0 up

With mii-monitor-interval: 100ms, failover typically occurs within 200–300ms — fast enough that TCP connections are not reset.
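To put a number on the failover instead of eyeballing Terminal 1, you can parse ping's summary line. A sketch with the sample line inlined so it runs anywhere; in real use, run ping with a fixed -c count during the test and pipe its output in:

```shell
#!/bin/sh
# Extract the packet-loss percentage from ping's summary line.
loss_pct() {
  awk -F', ' '/packet loss/ { sub(/%.*/, "", $3); print $3 }'
}

# Sample summary line (what `ping -i 0.2 -c 50 192.168.1.1` prints at the end):
echo "50 packets transmitted, 48 received, 4% packet loss, time 9800ms" | loss_pct
```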

Real-Time Monitoring and Useful Commands

# Monitor status continuously
watch -n 1 'cat /proc/net/bonding/bond0 | grep -E "Currently Active|MII Status|Link Failure"'

# Check speed and physical status of each interface
ethtool eth0 | grep -E "Speed|Duplex|Link detected"
ethtool eth1 | grep -E "Speed|Duplex|Link detected"

# TX/RX byte statistics per interface (view traffic distribution)
ip -s link show eth0
ip -s link show eth1

# View kernel log during failover events
journalctl -k | grep -i "bond\|enslaved\|link failure"

Using Prometheus + node_exporter? The metrics node_network_up{device="bond0"} and node_network_up{device="eth0"} automatically track the bond and each slave’s status — no additional configuration required.
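Building on those metrics, a hypothetical alerting rule (device names and timing are examples; adjust the regex to your interface naming):

```yaml
# Hypothetical Prometheus rule: fire when any bond slave stops reporting up.
groups:
  - name: bonding
    rules:
      - alert: BondSlaveDown
        expr: node_network_up{device=~"eth[01]"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Bond slave {{ $labels.device }} on {{ $labels.instance }} is down"
```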

Real-World Issues I’ve Encountered

I manage networking for a 50-person office and a small datacenter, and I’ve run into all kinds of bonding problems — documenting them here so you don’t have to lose a whole night responding to a server crisis like I did:

  1. MAC address conflict with LACP: The switch needs to see the MAC address of bond0, not the individual slave MACs. Older switches that don’t properly support LACP can cause loops or blocked ports. When in doubt about your switch, mode 1 (active-backup) is always the safer choice.
  2. Interface rename after reboot: I once had bonding working fine, but after a reboot eth0/eth1 renamed themselves to eth2/eth3 — because udev changed the card detection order. The configuration broke completely with no clear error to debug. Always use predictable names like enp3s0, never rely on eth0.
  3. VLANs on bonding: Fully supported. Create VLAN interfaces on bond0 as normal — no need to create VLANs on each individual slave.
  4. Bonding inside VMs: For hypervisor servers, bond at the host level and expose a virtual NIC to the VM. Don’t bond inside the guest VM — the hypervisor layer already handles failover; adding bonding inside the guest only introduces unnecessary complexity.
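For point 3, a netplan fragment showing a VLAN stacked on the bond (VLAN ID and address are examples, not from the config above):

```yaml
# Hypothetical netplan fragment: VLAN 20 on top of bond0
network:
  version: 2
  vlans:
    bond0.20:
      id: 20
      link: bond0
      addresses: [10.0.20.10/24]
```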

After that 2 AM incident, every one of my production servers now runs active-backup bonding. Over the past year, there have been 3 network card failures — all 3 failovers happened in complete silence. No one noticed, not a single support ticket was opened. That’s how good infrastructure should work.
