The Problem: Dashboard Shows 100% CPU but You Don’t Know Which App Is Responsible
Most SREs know this scenario: Grafana turns red, server CPU spikes to 100%, and RAM starts spilling into swap. You quickly SSH in and run top or htop to find the culprit. It turns out to be a Python worker in an infinite loop or a hung Java application.
But if the issue occurs at 2 AM and resolves itself before you can check, how do you know exactly which process caused the spike? A general dashboard showing only overall server CPU usage isn’t enough to answer that question.
Why Node Exporter Alone Is Not Enough
The monitoring system I operate uses Prometheus + Grafana to track over 20 servers. Initially, I only used Node Exporter. This tool is great for gathering general metrics like system-wide CPU, RAM, and Network usage.
However, Node Exporter has one major drawback: it is completely “blind” at the process level. It might report that the server is consuming 16GB of RAM, but it can’t tell you that your Node.js application has a memory leak taking up 12GB. In a microservices environment, this lack of data makes troubleshooting very time-consuming.
Comparing Current Monitoring Options
To solve this problem, engineers usually consider three main approaches:
- Custom Scripts: Writing bash scripts to pull data from /proc and push it to Pushgateway. This is quite manual, hard to maintain, and can create ghost load on the system.
- Netdata: Extremely detailed real-time monitoring. However, if you already have a Prometheus stack, installing Netdata just for process metrics can be resource-intensive.
- Zabbix: Strong support for process monitoring. But if you are moving toward a cloud-native approach, introducing Zabbix into an existing Prometheus setup is an architectural step backward.
The Optimal Solution: Prometheus Process Exporter
After testing various tools, I chose Process Exporter. It’s lightweight, flexibly configured, and allows for grouping processes by name, path, or command line.
Step 1: Installing on a Linux Server
Download a release from GitHub. The example below uses v0.7.10 on a 64-bit Linux environment; check the project’s releases page for the current version:
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
tar -xvf process-exporter-0.7.10.linux-amd64.tar.gz
sudo mv process-exporter-0.7.10.linux-amd64/process-exporter /usr/local/bin/
Step 2: Configuring Process Grouping
Process Exporter doesn’t indiscriminately collect thousands of processes. You need to define the specific application groups you care about to keep the data concise.
Create a configuration file at /etc/process-exporter/config.yaml (create the /etc/process-exporter directory first if it does not exist):
process_names:
  - name: "{{.Comm}}"
    cmdline:
      - 'nginx'
  - name: "{{.Comm}}"
    cmdline:
      - 'mysqld'
  - name: "Java_Backend"
    cmdline:
      - 'java'
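The {{.Comm}} template variable resolves to the basename of the process’s executable, so the group is named after the program automatically. Each cmdline entry is a regular expression matched against the process’s command line. You can also set descriptive names like “Java_Backend” for easier filtering on the Dashboard.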
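Grouping by command line can go further than a single program name. As a minimal sketch, assuming hypothetical Python workers started as python3 worker.py with varying arguments, the entry below collects all of them into one group. The group name python_worker and the script name worker.py are illustrative and not part of the setup described here.

```yaml
process_names:
  # Hypothetical group: every python3 process whose command line
  # mentions worker.py lands in one "python_worker" group,
  # whatever other arguments it was started with.
  - name: "python_worker"
    comm:
      - python3
    cmdline:
      - 'worker\.py'

  # A process joins the first entry it matches, so more specific
  # entries should be listed before broader ones.
  - name: "{{.Comm}}"
    cmdline:
      - 'nginx'
```

When an entry lists more than one selector (here comm and cmdline), a process has to satisfy all of them to be counted in that group. If your interpreter binary has a different name, such as python3.11, the comm value would need to be adjusted.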
Step 3: Running as a Systemd Service
To have the exporter start automatically with the server, create a service file:
sudo nano /etc/systemd/system/process-exporter.service
File content:
[Unit]
Description=Process Exporter
After=network.target
[Service]
User=root
ExecStart=/usr/local/bin/process-exporter -config.path /etc/process-exporter/config.yaml
Restart=always
[Install]
WantedBy=multi-user.target
Enable the service with these commands:
sudo systemctl daemon-reload
sudo systemctl start process-exporter
sudo systemctl enable process-exporter
Verify at http://<SERVER_IP>:9256/metrics. If you see lines starting with namedprocess_namegroup_cpu_seconds_total, you have succeeded.
Step 4: Configuring Prometheus Scraping
Add the following to your prometheus.yml file:
scrape_configs:
  - job_name: 'process-stats'
    static_configs:
      - targets: ['192.168.1.10:9256']
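With more than one server, the same job can list several targets. The sketch below assumes a second host at 192.168.1.11 (hypothetical). It also assumes you want the instance label to hold only the host address, without the port, so that it can line up with other exporters on the same machine. That relabeling step is optional and is a design choice, not something Process Exporter requires.

```yaml
scrape_configs:
  - job_name: 'process-stats'
    static_configs:
      - targets:
          - '192.168.1.10:9256'
          - '192.168.1.11:9256'   # hypothetical second server
    relabel_configs:
      # Optional: rewrite instance from "host:port" to "host".
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: instance
        replacement: '$1'
```

Reload or restart Prometheus after editing the file, then check the Targets page to confirm that each host is reported as up.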
Visualizing Data on Grafana
Don’t build dashboards from scratch. Use high-quality community dashboards. I recommend ID 249 (basic) or 8378 (advanced metrics).
Key metrics you should monitor:
- CPU usage per process: Instantly identify which app is consuming cores.
- Resident Set Size (RSS): The actual amount of RAM the process is occupying (crucial for catching memory leaks).
- Disk I/O: See which service is reading/writing at 50MB/s and causing system bottlenecks.
- File Descriptors: Early warning for “Too many open files” errors before an application crashes.
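The four metrics above map onto series that Process Exporter exposes under the namedprocess_namegroup_ prefix. As a rough sketch, they could be precomputed with Prometheus recording rules like the ones below. The group name and the record names are hypothetical, and the 5-minute rate window is only an assumption to tune for your scrape interval.

```yaml
groups:
  - name: process_exporter_recording   # hypothetical group name
    rules:
      # CPU cores consumed per process group (user + system time).
      - record: process:cpu_cores_used:rate5m
        expr: sum by (instance, groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))

      # Resident Set Size per process group, in bytes.
      - record: process:memory_resident_bytes
        expr: namedprocess_namegroup_memory_bytes{memtype="resident"}

      # Disk read and write throughput per process group, in bytes/second.
      - record: process:disk_read_bytes:rate5m
        expr: rate(namedprocess_namegroup_read_bytes_total[5m])
      - record: process:disk_write_bytes:rate5m
        expr: rate(namedprocess_namegroup_write_bytes_total[5m])

      # Highest ratio of open file descriptors to the limit within the group.
      - record: process:worst_fd_ratio
        expr: namedprocess_namegroup_worst_fd_ratio
```

A file like this is loaded through the rule_files section of prometheus.yml. The same expr values can also be pasted directly into a Grafana panel if you prefer not to store precomputed series.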
Real-world Operational Experience
After deploying this in a production environment, here are three tips for you:
- Don’t track too much: Only monitor core services. Tracking hundreds of minor processes will bloat the Prometheus TSDB by about 50-100MB per day per node.
- Leverage Regex: If you run multiple application instances with different parameters, use regex in cmdline to consolidate them into a single clean group.
- Smart Alerts: Define Prometheus alerting rules, routed through Alertmanager, that fire when a process’s RAM exceeds a threshold such as 80% of the server’s memory, rather than waiting for the entire server to hang.
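The alerting tip could be sketched as a Prometheus rule file like the one below. It assumes a server with 16 GiB of RAM, so 80% is written as an absolute byte value. The alert names, severity labels, and the 5-minute hold time are hypothetical choices. The second rule covers the file descriptor warning mentioned earlier.

```yaml
groups:
  - name: process_alerts               # hypothetical group name
    rules:
      - alert: ProcessMemoryHigh       # hypothetical alert name
        # Assumption: 16 GiB host, so 80% is about 12.8 GiB of resident memory.
        expr: namedprocess_namegroup_memory_bytes{memtype="resident"} > 0.8 * 16 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.groupname }} on {{ $labels.instance }} holds more than 80% of RAM"

      - alert: ProcessFileDescriptorsHigh   # hypothetical alert name
        # Fires when any process in the group uses over 80% of its FD limit.
        expr: namedprocess_namegroup_worst_fd_ratio > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.groupname }} on {{ $labels.instance }} is close to its open file limit"
```

Prometheus evaluates these rules and hands firing alerts to Alertmanager, which takes care of routing and notifications. A percentage relative to node_memory_MemTotal_bytes from Node Exporter is also possible, but it requires the instance labels of both exporters to match, which is why this sketch uses a fixed value.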
Understanding process-level health gives me more confidence when scaling systems. I hope this guide helps you avoid headaches the next time your server CPU spikes.

