Prometheus Process Exporter Guide: Detailed Monitoring of Linux Process Health

Monitoring tutorial - IT technology blog

The Problem: Dashboard Shows 100% CPU but You Don’t Know Which App Is Responsible

Most SREs know this scenario: Grafana turns red, server CPU spikes to 100%, and RAM starts spilling into swap. You SSH in and run top or htop to find the culprit, which turns out to be a Python worker stuck in an infinite loop or a hung Java application.

But if the issue occurs at 2 AM and resolves itself before you can log in, how do you find out which process caused it? A general dashboard showing only overall server CPU usage isn’t enough to answer that question.

Why Node Exporter Alone Is Not Enough

The monitoring system I operate uses Prometheus + Grafana to track over 20 servers. Initially, I only used Node Exporter. This tool is great for gathering general metrics like system-wide CPU, RAM, and Network usage.

However, Node Exporter has one major drawback: it is completely “blind” at the process level. It might report that the server is consuming 16GB of RAM, but it can’t tell you that your Node.js application has a memory leak taking up 12GB. In a microservices environment, this lack of data makes troubleshooting very time-consuming.

Comparing Current Monitoring Options

To solve this problem, engineers usually consider three main approaches:

  • Custom Scripts: Writing bash scripts to pull data from /proc and push it to Pushgateway. This is manual, hard to maintain, and the scripts themselves can add hidden load to the system.
  • Netdata: Extremely detailed real-time monitoring. However, if you already have a Prometheus stack, installing Netdata just for process metrics can be resource-intensive.
  • Zabbix: Strong support for process monitoring. But if you are moving toward a cloud-native approach, introducing Zabbix into an existing Prometheus setup is an architectural step backward.
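To make the first option concrete, a hand-rolled collector might look roughly like the sketch below. The function name rss_metric, the metric name custom_process_rss_bytes, and the Pushgateway address in the usage comment are all made up for illustration. It covers only one metric for one PID, which hints at how much scripting full coverage would take.

```shell
#!/usr/bin/env bash
# Sketch of the "custom script" approach: read one process's resident memory
# from /proc and print it in Prometheus text format.
# rss_metric and custom_process_rss_bytes are hypothetical names.
rss_metric() {
  local pid="$1" name="$2" kb
  # VmRSS in /proc/<pid>/status is reported in kB.
  kb=$(awk '/^VmRSS:/ {print $2}' "/proc/${pid}/status") || return 1
  # If the field is missing (kernel threads have no VmRSS), kb is empty and this prints 0.
  printf 'custom_process_rss_bytes{name="%s"} %d\n' "$name" "$((kb * 1024))"
}

# Usage (placeholder Pushgateway address, run from cron or a timer):
#   rss_metric "$(pgrep -o nginx)" nginx \
#     | curl --data-binary @- http://pushgateway.example:9091/metrics/job/custom_process
```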

The Optimal Solution: Prometheus Process Exporter

After testing various tools, I chose Process Exporter. It’s lightweight, flexible to configure, and lets you group processes by executable name, executable path, or command line.

Step 1: Installing on a Linux Server

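Download a release from the project’s GitHub releases page. The commands below use v0.7.10 for 64-bit Linux; check the releases page for the current version and adjust the file names accordingly: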

wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
tar -xvf process-exporter-0.7.10.linux-amd64.tar.gz
sudo mv process-exporter-0.7.10.linux-amd64/process-exporter /usr/local/bin/

Step 2: Configuring Process Grouping

Process Exporter doesn’t indiscriminately collect thousands of processes. You need to define the specific application groups you care about to keep the data concise.

Create the directory first (sudo mkdir -p /etc/process-exporter), then create a configuration file at /etc/process-exporter/config.yaml:

process_names:
  - name: "{{.Comm}}"
    cmdline:
    - 'nginx'

  - name: "{{.Comm}}"
    cmdline:
    - 'mysqld'

  - name: "Java_Backend"
    cmdline:
    - 'java'

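The {{.Comm}} template variable expands to the name of the matched executable, so the first two groups are named after the binary itself. Each cmdline entry is a regular expression matched against the process’s command line. You can also set a fixed descriptive name like “Java_Backend” for easier filtering on the Dashboard.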
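One caveat worth knowing: as far as I understand the matcher, a cmdline pattern such as 'nginx' is unanchored, so it can also match an unrelated command that merely mentions nginx in its arguments (an editor opened on an nginx config file, for example). The sketch below drafts a stricter variant using the comm selector, plus a named capture group that splits Java processes into one group per jar. The draft file path and the group-name format java:<jar> are my own choices for illustration, not something the exporter requires.

```shell
#!/usr/bin/env bash
# Draft a stricter config in a scratch file; review it, then copy it to
# /etc/process-exporter/config.yaml. DRAFT is a hypothetical scratch path.
DRAFT="${DRAFT:-./process-exporter-config.draft.yaml}"

cat > "$DRAFT" <<'EOF'
process_names:
  # comm is compared against the executable's short name, so commands that
  # only mention "nginx" or "mysqld" in their arguments are not picked up.
  - comm:
    - nginx
    - mysqld

  # Both selectors must match. The named capture group Jar is exposed to the
  # name template through .Matches, giving one group per jar file.
  - name: "java:{{.Matches.Jar}}"
    comm:
    - java
    cmdline:
    - '-jar\s+(?P<Jar>\S+)'
EOF

echo "wrote $DRAFT"
```

The trade-off is that a per-jar name creates one group (and one set of time series) per distinct jar path, so it suits a handful of services rather than hundreds.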

Step 3: Running as a Systemd Service

To have the exporter start automatically with the server, create a service file:

sudo nano /etc/systemd/system/process-exporter.service

File content:

[Unit]
Description=Process Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/process-exporter -config.path /etc/process-exporter/config.yaml
Restart=always

[Install]
WantedBy=multi-user.target

Reload systemd, then start the service and enable it at boot:

sudo systemctl daemon-reload
sudo systemctl start process-exporter
sudo systemctl enable process-exporter

Verify by opening http://<SERVER_IP>:9256/metrics (9256 is the exporter’s default port). If you see lines starting with namedprocess_namegroup_cpu_seconds_total, the exporter is working.
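If you prefer the terminal to a browser, a small helper can summarize which groups the exporter actually matched. This is a minimal sketch: list_groups is a hypothetical function name, and it relies on the namedprocess_namegroup_num_procs metric, which reports the number of processes in each group.

```shell
#!/usr/bin/env bash
# Read exporter output on stdin and print "<groupname> <process count>" per group.
# list_groups is a hypothetical helper name.
list_groups() {
  grep '^namedprocess_namegroup_num_procs{' \
    | sed -E 's/.*groupname="([^"]*)".*[[:space:]]([^[:space:]]+)$/\1 \2/'
}

# Usage (assumes the exporter from Step 3 is listening on this host):
#   curl -s http://localhost:9256/metrics | list_groups
```

An empty result usually means none of the patterns in config.yaml matched a running process, which is worth fixing before wiring up Prometheus.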

Step 4: Configuring Prometheus Scraping

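Add the following job to your prometheus.yml file (if the file already has a scrape_configs section, append the job to it), replace the target with your server’s address, and reload Prometheus: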

scrape_configs:
  - job_name: 'process-stats'
    static_configs:
      - targets: ['192.168.1.10:9256']

Visualizing Data on Grafana

Don’t build dashboards from scratch. Import a community dashboard from grafana.com instead. I recommend ID 249 (basic) or 8378 (more detailed metrics).

Key metrics you should monitor:

  • CPU usage per process: Instantly identify which app is consuming cores.
  • Resident Set Size (RSS): The actual amount of RAM the process is occupying (crucial for catching memory leaks).
  • Disk I/O: See which service is reading/writing at 50MB/s and causing system bottlenecks.
  • File Descriptors: Early warning for “Too many open files” errors before an application crashes.
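For ad hoc checks outside Grafana, the four items above map onto process-exporter metrics roughly as sketched here. The variable names, the prom_query helper, and the default Prometheus address are assumptions for illustration; the 5m rate window is an arbitrary choice you may want to tune.

```shell
#!/usr/bin/env bash
# Assumed Prometheus address; override with PROM_URL if yours differs.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# CPU cores consumed per group (user + system modes summed).
Q_CPU='sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))'
# Resident memory per group, in bytes.
Q_RSS='namedprocess_namegroup_memory_bytes{memtype="resident"}'
# Combined read + write throughput per group, in bytes per second.
Q_IO='rate(namedprocess_namegroup_read_bytes_total[5m]) + rate(namedprocess_namegroup_write_bytes_total[5m])'
# Highest ratio of open file descriptors to the FD limit within each group.
Q_FD='namedprocess_namegroup_worst_fd_ratio'

# Run an instant query through the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$Q_CPU"
```

The same expressions can be pasted into a Grafana panel or the Prometheus expression browser unchanged.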

Real-world Operational Experience

After deploying this in a production environment, here are three tips for you:

  1. Don’t track too much: Only monitor core services. In my experience, tracking hundreds of minor processes bloats the Prometheus TSDB by about 50-100MB per day per node.
  2. Leverage Regex: If you run multiple application instances with different parameters, use regex in cmdline to consolidate them into a single clean group.
  3. Smart Alerts: Define alert rules, routed through Alertmanager, that fire when a single process group’s resident memory exceeds a threshold such as 80% of the server’s total RAM, rather than waiting for the entire server to hang.
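The third tip can be sketched as a Prometheus rule file. This is an illustration under several assumptions: the alert name, the 10m hold time, and the severity label are placeholders; total RAM comes from Node Exporter’s node_memory_MemTotal_bytes; and both exporters are scraped with instance labels of the form host:port, so the port is stripped to build a shared host label. If your instance labels look different, the join needs adjusting.

```shell
#!/usr/bin/env bash
# Write an example rule file; RULES_FILE is a hypothetical path.
RULES_FILE="${RULES_FILE:-./process-alerts.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: process-exporter
    rules:
      - alert: ProcessGroupHighMemory
        # Process Exporter and Node Exporter listen on different ports, so
        # derive a common "host" label from "instance" before dividing.
        expr: |
          label_replace(namedprocess_namegroup_memory_bytes{memtype="resident"},
                        "host", "$1", "instance", "([^:]+):.*")
            / on (host) group_left
          label_replace(node_memory_MemTotal_bytes,
                        "host", "$1", "instance", "([^:]+):.*")
            > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.groupname }} uses more than 80% of RAM on {{ $labels.host }}"
EOF

echo "wrote $RULES_FILE"
```

Before loading it, validate the file with promtool check rules process-alerts.yml, then list it under rule_files in prometheus.yml.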

Understanding process-level health gives me more confidence when scaling systems. I hope this guide helps you avoid headaches the next time your server CPU spikes.
