Monitoring Windows Server with Prometheus and Windows Exporter: Tracking CPU, RAM, and Services in Detail

Monitoring tutorial - IT technology blog

Why does Windows Server need its own setup?

If you’ve read any Prometheus + Grafana guides on this blog, most of them revolve around Linux with node_exporter. Windows Server is a different story — different file system, different service management, and most importantly, Windows Services need to be monitored separately.

I once had a Windows server running IIS where a service hung at 3 AM. Nobody knew until customers started calling. After that incident, I got serious about setting up proper Windows monitoring. This guide focuses exactly on that: CPU, RAM, disk, and service state — nothing more, nothing less.

What are the options for monitoring Windows?

Before installing anything, it’s worth knowing what options are available:

1. Windows Exporter (for Prometheus)

Forked from the old wmi_exporter, this is now the official project maintained by the Prometheus community. It exports metrics via an HTTP endpoint that Prometheus scrapes on a schedule. Runs as a background Windows Service with a small footprint — only about 15–20MB of RAM.
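Scraping that endpoint returns plain-text metrics in the standard Prometheus exposition format. A couple of illustrative lines (the values here are made up):

```
windows_cpu_time_total{core="0,0",mode="idle"} 1.2345678e+06
windows_os_physical_memory_free_bytes 4.2949673e+09
```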

2. Telegraf + InfluxDB

InfluxData’s general-purpose agent, with a win_perf_counters plugin that uses native Windows Performance Counters. Good Windows support, but heavier (~50MB RAM), requires more configuration, and needs either InfluxDB or forwarding to Prometheus via an output plugin.
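For reference, a minimal win_perf_counters sketch looks roughly like this (illustrative only, trimmed down from Telegraf's sample configuration):

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Windows Performance Counter object and counters to collect
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = ["% Idle Time", "% Processor Time"]
    Measurement = "win_cpu"
```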

3. Zabbix Agent

Zabbix has its own Windows agent with fairly comprehensive built-in templates. But if you’re already using Prometheus as your primary backend, adding Zabbix creates unnecessary complexity — two parallel ecosystems, two places to maintain.

4. SNMP / WMI directly

Pull metrics into Prometheus without installing an agent. Sounds clean, but latency is high, configuration is complex, and it lacks the detail needed for application-level monitoring. Typically only used for network devices where installing an agent isn’t an option.

Quick comparison

| Approach | Pros | Cons |
| --- | --- | --- |
| Windows Exporter | Native Prometheus, lightweight (~15MB RAM), rich metrics | Requires an existing Prometheus setup |
| Telegraf | Versatile, many output plugins | Heavier (~50MB), more configuration needed |
| Zabbix Agent | Rich templates, polished UI | Separate ecosystem, doesn’t share Prometheus |
| SNMP/WMI | No agent installation needed | High latency, lacks app-level detail |

So which one should you choose?

Already using Prometheus + Grafana for Linux servers? The answer is almost always Windows Exporter. Same scrape config, same Alertmanager, same Grafana — just one more target. No new stack to learn.

The only exception: if you already have Zabbix in place and Windows servers are a minority of your fleet, stick with the Zabbix Agent for consistency. Otherwise, this guide goes straight to Windows Exporter.

Deployment guide

Step 1: Install Windows Exporter on Windows Server

Download the latest MSI from the GitHub releases of prometheus-community/windows_exporter. At the time of writing, this is v0.29.2.

Install via PowerShell with Administrator privileges:

# Download the MSI
$version = "0.29.2"
$url = "https://github.com/prometheus-community/windows_exporter/releases/download/v$version/windows_exporter-$version-amd64.msi"
Invoke-WebRequest -Uri $url -OutFile "windows_exporter.msi"

# Install with the required collectors
msiexec /i windows_exporter.msi `
  ENABLED_COLLECTORS="cpu,memory,logical_disk,net,os,service,system" `
  LISTEN_PORT="9182" /quiet

You can also install manually via the GUI and select port 9182 (default). Once installed, the Windows Service named windows_exporter will start automatically.

Verify the service is running:

Get-Service windows_exporter
# Status should be Running

Test the metrics endpoint directly from the server:

Invoke-WebRequest -Uri http://localhost:9182/metrics | Select-Object -ExpandProperty Content | Select-String "windows_cpu"

Step 2: Open the firewall for Prometheus scraping

New-NetFirewallRule `
  -DisplayName "Windows Exporter" `
  -Direction Inbound `
  -Protocol TCP `
  -LocalPort 9182 `
  -Action Allow

Better to restrict the rule to only the Prometheus server’s IP rather than leaving it wide open — adding -RemoteAddress 192.168.1.50 is all it takes.
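Concretely, the locked-down version of the rule (assuming 192.168.1.50 is your Prometheus server, as above):

```powershell
New-NetFirewallRule `
  -DisplayName "Windows Exporter" `
  -Direction Inbound `
  -Protocol TCP `
  -LocalPort 9182 `
  -RemoteAddress 192.168.1.50 `
  -Action Allow
```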

Step 3: Add the target to Prometheus

On the Prometheus server (Linux), edit prometheus.yml:

scrape_configs:
  # ... existing Linux jobs ...

  - job_name: 'windows_servers'
    static_configs:
      - targets:
          - '192.168.1.100:9182'   # Windows Server 1
          - '192.168.1.101:9182'   # Windows Server 2
        labels:
          env: 'production'
          os: 'windows'

Reload Prometheus to apply the changes:

curl -X POST http://localhost:9090/-/reload
# or
systemctl reload prometheus

Go to Prometheus UI → Status → Targets — the windows_servers target should show state UP. If you see DOWN, check the firewall first — that’s the cause 9 times out of 10.

Step 4: Import the Grafana Dashboard

Dashboard ID 14694 (Windows Exporter Node) is the most popular choice in the community. To import it into Grafana:

  1. Grafana → Dashboards → Import
  2. Enter ID 14694 → Load
  3. Select your Prometheus datasource → Import

The dashboard includes panels for CPU usage, RAM, disk I/O, and network throughput. Adjust the instance variable to select the right server. With multiple servers, this variable renders as a dropdown — quite handy.
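If you ever rebuild that instance variable in your own dashboard, Grafana variables backed by a Prometheus datasource typically use a label_values query. A likely shape, assuming the os collector is enabled (it exposes windows_os_info):

```
label_values(windows_os_info, instance)
```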

Step 5: Monitor specific Windows Services

This is the biggest difference from Linux monitoring. The metric to use is windows_service_state:

# Check whether a service is running (1 = running)
windows_service_state{state="running", name="W3SVC"}       # IIS
windows_service_state{state="running", name="MSSQLSERVER"}  # SQL Server
windows_service_state{state="running", name="wuauserv"}     # Windows Update

# Alert when a service stops unexpectedly — simple and effective
windows_service_state{name=~"W3SVC|MSSQLSERVER|SQLSERVERAGENT", state="running"} == 0

Add a panel with the above query to Grafana, using a Stat or Table visualization for a clear at-a-glance view of each service’s state.

Battle-tested Alert Rules

This is where I spent the most time initially. Alert fatigue isn’t an abstract concept for me: I finished the setup, the alerts fired non-stop at 2 AM because the thresholds didn’t match the real workload, I started ignoring them, and I lost confidence in the whole system. It took multiple rounds of tuning before things felt right. If you want alerts delivered straight to your inbox, the guide on setting up email alerts with Grafana covers that end to end.

Here’s the windows_alerts.yml file I’m currently using:

groups:
  - name: windows_server
    rules:
      # CPU high for 10 consecutive minutes — brief spikes are normal, don't alert on them
      - alert: WindowsCPUHigh
        expr: |
          100 - (avg by (instance) (
            rate(windows_cpu_time_total{mode="idle"}[5m])
          ) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage {{ $value | printf \"%.1f\" }}% for 10 minutes"

      # Free RAM below 500MB (absolute, not percentage)
      - alert: WindowsLowMemory
        expr: windows_os_physical_memory_free_bytes < 500 * 1024 * 1024
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low RAM on {{ $labels.instance }}"
          description: "Only {{ $value | humanize1024 }}B of RAM remaining"

      # Disk below 10%
      - alert: WindowsDiskLow
        expr: |
          (windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"}
          / windows_logical_disk_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk on {{ $labels.volume }} on {{ $labels.instance }}"

      # Critical service stopped
      - alert: WindowsServiceDown
        expr: |
          windows_service_state{
            name=~"W3SVC|MSSQLSERVER|SQLSERVERAGENT",
            state="running"
          } == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.name }} is down on {{ $labels.instance }}"

Load the rules into Prometheus by adding them to prometheus.yml:

rule_files:
  - "/etc/prometheus/rules/windows_alerts.yml"
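A quick way to convince yourself the CPU expression is right: rate() over the idle counter yields the fraction of each second the CPU spent idle, so usage is simply 100 minus the idle percentage. A tiny arithmetic sketch with a made-up idle rate:

```shell
# Suppose rate(windows_cpu_time_total{mode="idle"}[5m]), averaged across
# cores, comes out to 0.12 idle CPU-seconds per second (a made-up value)
idle_rate=0.12

# Mirror the alert expression: usage% = 100 - idle%
usage=$(awk -v i="$idle_rate" 'BEGIN { printf "%.1f", 100 - i * 100 }')
echo "CPU usage: ${usage}%"   # 88.0% here, above the 85% alert threshold
```

If the number that comes out looks wrong for your servers, the threshold is what you tune, not the formula.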

Lessons learned from getting paged in the middle of the night

Hard-won experience, not theory:

  • Use for: 10m, don’t trigger immediately — a 1-minute CPU spike is normal when a backup runs or Windows Update kicks in. Sustained high CPU for 10 minutes is when you should worry.
  • Start with severity warning, not critical — critical should only fire when you genuinely need to wake up at 3 AM to fix something.
  • Observe for 1–2 weeks before enabling Alertmanager — understand your server’s real traffic patterns and set thresholds based on actual data, not gut feeling.
  • Alert on specific service names, not all stopped services — Windows has dozens of services that don’t need to run continuously; alerting on all of them is an instant noise disaster.
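That last point, expressed as queries side by side:

```
# Noisy: fires for every stopped service on the box, including ones
# that are supposed to be stopped
windows_service_state{state="running"} == 0

# Targeted: only the services you would actually restart at 3 AM
windows_service_state{name=~"W3SVC|MSSQLSERVER", state="running"} == 0
```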

Verifying the full pipeline

# From the Prometheus server, manually test the connection
curl http://192.168.1.100:9182/metrics | grep -E "windows_(cpu|memory|service)"

# This query should return results in the Prometheus UI:
# windows_os_physical_memory_free_bytes{instance="192.168.1.100:9182"}

Once metrics appear in Prometheus and the Grafana dashboard loads data, the pipeline is working end to end. From here, the remaining work is tuning alert thresholds to match the specific characteristics of each server.
