Why does Windows Server need its own setup?
If you’ve read any Prometheus + Grafana guides on this blog, most of them revolve around Linux with node_exporter. Windows Server is a different story: a different file system, different service management, and, most importantly, a whole layer of Windows Services that needs to be monitored in its own right.
I once had a Windows server running IIS where a service hung at 3 AM. Nobody knew until customers started calling. After that incident, I got serious about setting up proper Windows monitoring. This guide focuses exactly on that: CPU, RAM, disk, and service state — nothing more, nothing less.
What are the options for monitoring Windows?
Before installing anything, it’s worth knowing what options are available:
1. Windows Exporter (for Prometheus)
Formerly known as wmi_exporter, this is now the official project maintained by the Prometheus community. It exposes metrics on an HTTP endpoint that Prometheus scrapes on a schedule. Runs as a background Windows Service with a small footprint — only about 15–20MB of RAM.
2. Telegraf + InfluxDB
InfluxData’s general-purpose agent, with a win_perf_counters plugin that uses native Windows Performance Counters. Good Windows support, but heavier (~50MB RAM), requires more configuration, and needs either InfluxDB or forwarding to Prometheus via an output plugin.
3. Zabbix Agent
Zabbix has its own Windows agent with fairly comprehensive built-in templates. But if you’re already using Prometheus as your primary backend, adding Zabbix creates unnecessary complexity — two parallel ecosystems, two places to maintain.
4. SNMP / WMI directly
Pull metrics into Prometheus without installing an agent. Sounds clean, but latency is high, configuration is complex, and it lacks the detail needed for application-level monitoring. Typically only used for network devices where installing an agent isn’t an option.
Quick comparison
| Approach | Pros | Cons |
|---|---|---|
| Windows Exporter | Native Prometheus, lightweight (~15MB RAM), rich metrics | Requires an existing Prometheus setup |
| Telegraf | Versatile, many output plugins | Heavier (~50MB), more configuration needed |
| Zabbix Agent | Rich templates, polished UI | Separate ecosystem, doesn’t integrate with your Prometheus stack |
| SNMP/WMI | No agent installation needed | High latency, lacks app-level detail |
So which one should you choose?
Already using Prometheus + Grafana for Linux servers? The answer is almost always Windows Exporter. Same scrape config, same Alertmanager, same Grafana — just one more target. No new stack to learn.
The only exception: if you already have Zabbix in place and Windows Servers are a minority → use Zabbix Agent for consistency. Otherwise, this guide goes straight to Windows Exporter.
Deployment guide
Step 1: Install Windows Exporter on Windows Server
Download the latest MSI from the GitHub releases of prometheus-community/windows_exporter. At the time of writing, this is v0.29.2.
Install via PowerShell with Administrator privileges:
# Download the MSI
$version = "0.29.2"
$url = "https://github.com/prometheus-community/windows_exporter/releases/download/v$version/windows_exporter-$version-amd64.msi"
Invoke-WebRequest -Uri $url -OutFile "windows_exporter.msi"
# Install with the required collectors
msiexec /i windows_exporter.msi `
ENABLED_COLLECTORS="cpu,memory,logical_disk,net,os,service,system" `
LISTEN_PORT="9182" /quiet
You can also install manually via the GUI and select port 9182 (default). Once installed, the Windows Service named windows_exporter will start automatically.
Verify the service is running:
Get-Service windows_exporter
# Status should be Running
Test the metrics endpoint directly from the server:
Invoke-WebRequest -Uri http://localhost:9182/metrics | Select-Object -ExpandProperty Content | Select-String "windows_cpu"
Step 2: Open the firewall for Prometheus scraping
New-NetFirewallRule `
-DisplayName "Windows Exporter" `
-Direction Inbound `
-Protocol TCP `
-LocalPort 9182 `
-Action Allow
Better to restrict the rule to only the Prometheus server’s IP rather than leaving it wide open — adding -RemoteAddress 192.168.1.50 is all it takes.
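For example, the same rule scoped to a single source address (192.168.1.50 here stands in for your Prometheus server; adjust to your network):

```powershell
# Same rule as above, but only the Prometheus server's IP can reach port 9182
New-NetFirewallRule `
    -DisplayName "Windows Exporter (Prometheus only)" `
    -Direction Inbound `
    -Protocol TCP `
    -LocalPort 9182 `
    -RemoteAddress 192.168.1.50 `
    -Action Allow
```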
Step 3: Add the target to Prometheus
On the Prometheus server (Linux), edit prometheus.yml:
scrape_configs:
# ... existing Linux jobs ...
- job_name: 'windows_servers'
static_configs:
- targets:
- '192.168.1.100:9182' # Windows Server 1
- '192.168.1.101:9182' # Windows Server 2
labels:
env: 'production'
os: 'windows'
Reload Prometheus to apply the changes:
curl -X POST http://localhost:9090/-/reload
# or
systemctl reload prometheus
Go to Prometheus UI → Status → Targets — the windows_servers target should show state UP. If you see DOWN, check the firewall first — that’s the cause 9 times out of 10.
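You can also check target health straight from PromQL, since Prometheus records a synthetic `up` metric for every scrape target:

```promql
# 1 = last scrape succeeded, 0 = target unreachable
up{job="windows_servers"}

# Show only the Windows targets that are currently down
up{job="windows_servers"} == 0
```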
Step 4: Import the Grafana Dashboard
Dashboard ID 14694 (Windows Exporter Node) is the most popular choice in the community. To import it into Grafana:
- Grafana → Dashboards → Import
- Enter ID 14694 → Load
- Select your Prometheus datasource → Import
The dashboard includes panels for CPU usage, RAM, disk I/O, and network throughput. Adjust the instance variable to select the right server. With multiple servers, this variable renders as a dropdown — quite handy.
Step 5: Monitor specific Windows Services
This is the biggest difference from Linux monitoring. The metric to use is windows_service_state:
# Check whether a service is running (1 = running)
windows_service_state{state="running", name="W3SVC"} # IIS
windows_service_state{state="running", name="MSSQLSERVER"} # SQL Server
windows_service_state{state="running", name="wuauserv"} # Windows Update
# Alert when a service stops unexpectedly — simple and effective
windows_service_state{name=~"W3SVC|MSSQLSERVER|SQLSERVERAGENT", state="running"} == 0
Add a panel with the above query to Grafana, using a Stat or Table visualization for a clear at-a-glance view of each service’s state.
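One gotcha: the `name` label must match the actual Windows service name, not the display name (IIS appears in the GUI as "World Wide Web Publishing Service", but its service name is W3SVC). To list the exact names on the server:

```powershell
# Service name (what the exporter uses) vs. display name (what services.msc shows)
Get-Service | Select-Object Name, DisplayName | Sort-Object Name
```

Note that depending on the exporter version, the `name` label value may be lowercased; confirm against your own /metrics output before writing queries.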
Battle-tested Alert Rules
This is where I spent the most time initially. Alert fatigue isn’t an abstract concept; I’ve lived through it. I finished the setup, then alerts fired non-stop at 2 AM because the thresholds didn’t match the real workload, so I started ignoring them and lost confidence in the whole system. It took multiple rounds of tuning before things felt right. If you want alerts delivered directly to your inbox, the guide on setting up email alerts with Grafana covers that end-to-end.
Here’s the windows_alerts.yml file I’m currently using:
groups:
- name: windows_server
rules:
# CPU high for 10 consecutive minutes — brief spikes are normal, don't alert on them
- alert: WindowsCPUHigh
expr: |
100 - (avg by (instance) (
rate(windows_cpu_time_total{mode="idle"}[5m])
) * 100) > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.instance }}"
description: "CPU usage {{ $value | printf \"%.1f\" }}% for 10 minutes"
# Free RAM below 500MB (absolute, not percentage)
- alert: WindowsLowMemory
expr: windows_os_physical_memory_free_bytes < 500 * 1024 * 1024
for: 5m
labels:
severity: critical
annotations:
summary: "Low RAM on {{ $labels.instance }}"
description: "Only {{ $value | humanize1024 }}B of RAM remaining"
# Disk below 10%
- alert: WindowsDiskLow
expr: |
(windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"}
/ windows_logical_disk_size_bytes) * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk on {{ $labels.volume }} on {{ $labels.instance }}"
# Critical service stopped
- alert: WindowsServiceDown
expr: |
windows_service_state{
name=~"W3SVC|MSSQLSERVER|SQLSERVERAGENT",
state="running"
} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.name }} is down on {{ $labels.instance }}"
Load the rules into Prometheus by adding them to prometheus.yml:
rule_files:
- "/etc/prometheus/rules/windows_alerts.yml"
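Before reloading Prometheus, it’s worth validating the rule file with promtool, which ships alongside Prometheus; it catches YAML and PromQL syntax errors before they take down your config reload:

```shell
promtool check rules /etc/prometheus/rules/windows_alerts.yml
```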
Lessons learned from getting paged in the middle of the night
Hard-won experience, not theory:
- Use for: 10m, don’t trigger immediately — a 1-minute CPU spike is normal when a backup runs or Windows Update kicks in. Sustained high CPU for 10 minutes is when you should worry.
- Start with severity warning, not critical — critical should only fire when you genuinely need to wake up at 3 AM to fix something.
- Observe for 1–2 weeks before enabling Alertmanager — understand your server’s real traffic patterns and set thresholds based on actual data, not gut feeling.
- Alert on specific service names, not all stopped services — Windows has dozens of services that don’t need to run continuously; alerting on all of them is an instant noise disaster.
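To make the last point concrete, compare these two queries; the first is a noise generator, the second is what the alert rules above use:

```promql
# Bad: fires for every stopped service, including dozens that are supposed to be stopped
windows_service_state{state="running"} == 0

# Good: an explicit allowlist of the services you actually care about
windows_service_state{name=~"W3SVC|MSSQLSERVER|SQLSERVERAGENT", state="running"} == 0
```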
Verifying the full pipeline
# From the Prometheus server, manually test the connection
curl http://192.168.1.100:9182/metrics | grep -E "windows_(cpu|memory|service)"
# This query should return results in the Prometheus UI:
# windows_os_physical_memory_free_bytes{instance="192.168.1.100:9182"}
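If you want this check scripted (say, as part of a provisioning pipeline), here’s a minimal sketch. The sample payload below is fabricated for illustration; in practice you’d curl the real endpoint into the file instead:

```shell
# Fabricated sample of what /metrics returns (real output has many more lines)
cat > sample_metrics.txt <<'EOF'
windows_cpu_time_total{core="0,0",mode="idle"} 12345.6
windows_os_physical_memory_free_bytes 4.2e+09
windows_service_state{name="W3SVC",state="running"} 1
EOF

# In production, replace the heredoc with:
#   curl -s http://192.168.1.100:9182/metrics > sample_metrics.txt

# Verify every metric family the dashboards and alerts depend on is present
for metric in windows_cpu_time_total windows_os_physical_memory_free_bytes windows_service_state; do
    if grep -q "^${metric}" sample_metrics.txt; then
        echo "OK: ${metric}"
    else
        echo "MISSING: ${metric}" >&2
    fi
done
```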
Once metrics appear in Prometheus and the Grafana dashboard loads data, the pipeline is working end to end. From here, the remaining work is tuning alert thresholds to match the specific characteristics of each server.
