Monitoring Systemd with Prometheus: Don’t Let Your Server Stay ‘Alive’ While Your App Is ‘Dead’ – ITFROMZERO

The Issue: A Perfectly Green Dashboard While Customers Still Complain

Have you ever run into this frustrating situation: Grafana shows perfectly stable CPU and RAM usage, yet customers are calling because they can’t reach the website? I’ve been through it dozens of times. The cause is simple: the server is up, but a core service such as Nginx, MySQL, or a Python bot running as a systemd unit has failed or frozen.

Node Exporter can monitor systemd through its systemd collector, which is disabled by default and has to be switched on with --collector.systemd, but it only works at a surface level. To track the exact uptime of each unit or filter specific critical services, Systemd Exporter is the real deal. It lets you manage service states much more granularly and efficiently.

Deploy Systemd Exporter in 5 Minutes

No fluff: we’ll install the binary on a Linux server and connect it to Prometheus right away.

Step 1: Download and Install the Binary

# Check for the latest version on Prometheus Community GitHub
export VERSION="0.6.0"
wget https://github.com/prometheus-community/systemd_exporter/releases/download/v${VERSION}/systemd_exporter-${VERSION}.linux-amd64.tar.gz
tar -xvf systemd_exporter-${VERSION}.linux-amd64.tar.gz
sudo cp systemd_exporter-${VERSION}.linux-amd64/systemd_exporter /usr/local/bin/

Step 2: Configure the Systemd Service for the Exporter

Running an exporter as root is a security mistake. I always create a dedicated user without login privileges to run these tools.

sudo useradd --no-create-home --shell /bin/false systemd_exporter
sudo nano /etc/systemd/system/systemd_exporter.service

Paste the following configuration content:

[Unit]
Description=Systemd Exporter
After=network-online.target

[Service]
User=systemd_exporter
Group=systemd_exporter
ExecStart=/usr/local/bin/systemd_exporter \
    --systemd.collector.unit-include="(nginx|mysql|docker|ssh).*" \
    --web.listen-address=":9101"

[Install]
WantedBy=multi-user.target

Pro tip: The --systemd.collector.unit-include flag is extremely important. It ensures you only collect metrics for the services you care about, so your Prometheus database is not cluttered with hundreds of obscure system units. The flag has been renamed between releases, so confirm the exact spelling for your version with systemd_exporter --help.
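Before restarting the exporter, it helps to check which unit names the include pattern actually catches. Here is a minimal sketch that tests the pattern from the unit file against a few unit names. The unit_matches helper and the list of names are made up for illustration, and I am assuming the exporter anchors the pattern to the whole unit name. grep -E is not the Go regex engine the exporter uses, but for a simple pattern like this one the two should agree.

```shell
#!/bin/sh
# Include pattern copied from the unit file above.
INCLUDE_REGEX='(nginx|mysql|docker|ssh).*'

# Hypothetical helper: succeeds when the unit name passes the include filter.
# Assumption: the pattern must match the full unit name (anchored with ^...$).
unit_matches() {
  printf '%s\n' "$1" | grep -Eq "^(${INCLUDE_REGEX})\$"
}

# Illustrative unit names; replace them with the units on your own server.
for unit in nginx.service mysql.service docker.socket sshd.service cron.service php8.2-fpm.service; do
  if unit_matches "$unit"; then
    echo "included: $unit"
  else
    echo "excluded: $unit"
  fi
done
```

One thing this makes obvious: a unit named mariadb.service or php8.2-fpm.service does not start with any of the four prefixes, so it would be dropped without any warning. Extend the pattern if your critical services go by other names.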

Step 3: Activation

sudo systemctl daemon-reload
sudo systemctl enable --now systemd_exporter

# Quickly test the results (Systemd Exporter metrics start with systemd_, not node_systemd_)
curl -s http://localhost:9101/metrics | grep systemd_unit_state

Why Systemd Exporter is Worth Your Time

Many might ask: “Node Exporter is already there, why install this and bloat the system?” In reality, Systemd Exporter is extremely lightweight, consuming only about 10-15MB of RAM, but it provides three values that Node Exporter can’t match:

Regex Filtering: You choose exactly which services appear on your dashboard.
Accurate Uptime: Know exactly how many seconds a service has been running. If this number keeps resetting to 0, your app is most likely stuck in a crash loop.
Diverse States: Clearly tell apart states that are easy to confuse, such as activating (still starting up) and failed (dead).

Configure Prometheus to Scrape Data

Add the following configuration snippet to your prometheus.yml file. Remember to replace the IP with your actual address.

scrape_configs:
  - job_name: 'systemd_nodes'
    static_configs:
      - targets: ['192.168.1.10:9101']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'app-server-01'

Real-world Experience: Curing Alert Fatigue

In the past, while managing a 50-node cluster, I made a “newbie” mistake: alerting on everything. Even a log-cleaning timer going inactive would trigger a Telegram notification. Consequently, I started ignoring all alerts because of the noise. To fix this, divide your alerts into two categories:

1. Critical Alerts (Service is completely down)

Only apply this to mission-critical services. If it stays failed for more than 2 minutes, fire an alert immediately.

- alert: ServiceDown
  expr: systemd_unit_state{state="failed"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.name }} is down on {{ $labels.instance }}"
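To see what this rule matches without waiting for Prometheus, you can run the same check against the raw /metrics text. This is a rough sketch, assuming the exporter exposes the metric as systemd_unit_state with name and state labels. The sample lines and their values are invented for illustration, and failed_units is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Assumption: adjust METRIC if your exporter exposes a different metric name.
METRIC="systemd_unit_state"

# Invented sample lines in the Prometheus text exposition format.
sample_metrics() {
  cat <<'EOF'
systemd_unit_state{name="nginx.service",state="active",type="service"} 1
systemd_unit_state{name="nginx.service",state="failed",type="service"} 0
systemd_unit_state{name="mysql.service",state="active",type="service"} 0
systemd_unit_state{name="mysql.service",state="failed",type="service"} 1
EOF
}

# Print the name of every unit whose state="failed" series has the value 1,
# which is the same condition the alert expression tests.
failed_units() {
  awk -v metric="$METRIC" '
    index($0, metric "{") == 1 && /state="failed"/ && $NF == 1 {
      if (match($0, /name="[^"]+"/)) print substr($0, RSTART + 6, RLENGTH - 7)
    }'
}

sample_metrics | failed_units
```

Against a live exporter, swap the sample for curl -s http://localhost:9101/metrics | failed_units. An empty result means nothing would trip the alert right now.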

2. Warning Alerts (Service flapping/restarting continuously)

This is the most annoying “hidden” error. The service reports as active, but restarts every 30 seconds. In this case, use the systemd_unit_start_time_seconds metric to catch it in the act.

- alert: ServiceFlapping
  expr: (time() - systemd_unit_start_time_seconds) < 60
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service {{ $labels.name }} is restarting continuously"
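The arithmetic behind this rule is worth a quick sanity check: the expression is true only while the unit is younger than 60 seconds, and the "for: 5m" clause requires it to be true at every evaluation for five minutes. Below is a simplified model of that logic, not how Prometheus evaluates rules internally. It assumes a 60-second evaluation interval, a first start at t=0, and a unit that restarts on a fixed period. would_fire is a hypothetical helper.

```shell
#!/bin/sh
THRESHOLD=60      # seconds, from the rule expression
EVAL_INTERVAL=60  # assumed rule evaluation interval, in seconds
FOR_WINDOW=300    # the rule's "for: 5m", in seconds

# Succeeds if a unit restarting every $1 seconds keeps the expression true
# at every evaluation across the whole "for" window.
would_fire() {
  period="$1"
  t=0
  while [ "$t" -le "$FOR_WINDOW" ]; do
    age=$(( t % period ))   # seconds since the most recent start
    [ "$age" -lt "$THRESHOLD" ] || return 1
    t=$(( t + EVAL_INTERVAL ))
  done
  return 0
}

for period in 30 45 90; do
  if would_fire "$period"; then
    echo "restart every ${period}s: alert fires"
  else
    echo "restart every ${period}s: alert does not fire"
  fi
done
```

In this model, a service that restarts every 30 or 45 seconds is caught, but one that restarts every 90 seconds slips through because its age reaches 60 seconds between restarts. If your crash loops are slower, raise the 60-second threshold to match.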

Dashboard and Operations

Don’t waste effort building a dashboard from scratch. Use Grafana ID: 7539. It provides a high-level view of the green/red status of your entire system. Look at the uptime graph; if you see a line trending upward that suddenly drops to zero, it’s time to check the logs immediately.

A few final notes: Never expose port 9101 to the internet. Use a firewall to allow only the Prometheus Server to access it. If you use Nginx with PHP-FPM via a socket, don’t forget to monitor the .socket units as well to ensure the processing flow isn’t bottlenecked.
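For the firewall part, here is a small sketch of what the rules could look like with ufw. It only prints the commands so you can review them before running them with sudo. The Prometheus address 192.168.1.5 is a placeholder I made up, print_firewall_rules is a hypothetical helper, and I am assuming ufw is your firewall front end. The allow rule comes first because ufw applies the first rule that matches.

```shell
#!/bin/sh
PROMETHEUS_IP="192.168.1.5"  # placeholder: address of your Prometheus server
EXPORTER_PORT="9101"         # port from --web.listen-address

# Print (do not execute) the ufw rules for the exporter port:
# allow the Prometheus server first, then deny everyone else.
print_firewall_rules() {
  echo "ufw allow from $1 to any port $2 proto tcp"
  echo "ufw deny $2/tcp"
}

print_firewall_rules "$PROMETHEUS_IP" "$EXPORTER_PORT"
```

If you use iptables, nftables, or a cloud security group instead, the same idea applies: one narrow allow for the Prometheus server, then a deny for everything else on that port.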

Monitoring down to the systemd level helps me sleep better. Instead of waiting for customer complaints, I usually fix the issue as soon as a service shows signs of instability. Good luck mastering your systems!