Installing VMware Exporter and Grafana: Full Performance Monitoring for vSphere Infrastructure with Prometheus – ITFROMZERO

3 a.m., a production VM is down, and nobody knows why

This is not a textbook example; it is an incident I actually handled. The alert fired only once the service was down, and there was no data to trace the cause. When did CPU on the ESXi host start to spike? Did the memory balloon driver kick in? When did datastore I/O latency begin to climb?

vCenter has built-in performance charts, but the finest historical rollup (5-minute samples) is kept for only 1 day. Need to reconstruct the timeline from 48 hours ago? By then vCenter has only 30-minute averages left, which are too coarse to show a short spike. This blind spot is why incidents often drag on twice as long as they should.

The problem with VMware's default monitoring tools

vCenter Server ships with Performance Charts and Alarms. They are fine for day-to-day use, but their limits show as soon as something breaks:

Short data retention: 5-minute rollups are kept for 1 day, 30-minute rollups for 1 week, and 2-hour rollups for 1 month. That is not enough for long-term trend analysis.
Poor fit with modern alerting stacks: if the team uses PagerDuty, OpsGenie or Alertmanager, vCenter alarms do not integrate with them directly.
No unified dashboard: if you want vSphere next to Kubernetes, Linux hosts and databases on a single Grafana screen, vCenter cannot do that.
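To make the retention limits concrete, here is a minimal Python sketch that returns the finest rollup vCenter still holds for data of a given age. The tiers are the three from the list above; the function name and table layout are made up for illustration, and "1 month" is approximated as 30 days.

```python
from datetime import timedelta

# (sample interval, retention) pairs taken from the list above.
ROLLUP_TIERS = [
    (timedelta(minutes=5), timedelta(days=1)),
    (timedelta(minutes=30), timedelta(weeks=1)),
    (timedelta(hours=2), timedelta(days=30)),  # "1 month" approximated as 30 days
]


def finest_rollup(event_age):
    """Return the smallest sample interval still retained for data of this age,
    or None if every tier listed above has already expired."""
    for interval, retention in ROLLUP_TIERS:
        if event_age <= retention:
            return interval
    return None


# An incident that started 48 hours ago is past the 5-minute tier.
print(finest_rollup(timedelta(hours=48)))
```

For a 48-hour-old event the function falls through to the 30-minute tier, which is exactly the gap an external time-series store is meant to close.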

That is where an exporter comes in: it pulls metrics out of the vSphere API and exposes them for Prometheus to scrape.

Possible approaches

Option 1: Traditional SNMP polling

ESXi supports SNMP, which works with Zabbix or Nagios. The catch: SNMP on VMware is fairly limited. It returns only basic host information and shows neither per-VM CPU ready time nor per-datastore storage latency.

Option 2: vRealize Operations (vROps)

VMware's official solution. Very powerful, but licensing starts at a few thousand USD per CPU per year; for a cluster with 4 to 8 CPUs, that figure is enough for finance to say no.

Option 3: Prometheus + VMware Exporter (recommended)

An open-source stack: vmware_exporter pulls metrics from the vSphere API → Prometheus stores the time series → Grafana visualizes them. Zero license cost, integrates with PagerDuty, OpsGenie and Alertmanager, and keeps data for 90 days or a year depending on configuration.

This is the route the rest of this guide follows.

Architecture overview

vCenter Server
     │  (vSphere API)
     ▼
vmware_exporter :9272   ←── Prometheus scrapes every 60s
     │
     ▼
Prometheus :9090
     │
     ▼
Grafana :3000

vmware_exporter runs as a Python service, connects to vCenter over the API, and exposes metrics in the Prometheus format at the /metrics endpoint.
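If you have not looked at the Prometheus text format before, each sample is one line: a metric name, optional labels in braces, and a numeric value. The sketch below is a deliberately simplified parser, not a full implementation of the format. The two metric names appear later in this guide; the HELP text and label values in the example are invented.

```python
import re

# One sample line: name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples.
    Comment lines (# HELP / # TYPE) and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical exporter output; HELP text and label values are invented.
EXAMPLE = '''\
# HELP vmware_datastore_capacity_size Datastore capacity in bytes
# TYPE vmware_datastore_capacity_size gauge
vmware_datastore_capacity_size{ds_name="ds01"} 1.0e+12
vmware_datastore_freespace_size{ds_name="ds01"} 2.5e+11
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Prometheus does this parsing itself on every scrape; the point here is only to show what the exporter hands over.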

Step-by-step installation

Step 1: Prepare the environment

You need a dedicated Linux server for the monitoring stack; Ubuntu 22.04 or Debian 12 both work. Do not install it directly on an ESXi host.

# Update the system
sudo apt update && sudo apt upgrade -y

# Install Python, pip, venv support and git
sudo apt install -y python3 python3-pip python3-venv git

Step 2: Install vmware_exporter

# Create a dedicated user for the service
sudo useradd -r -s /bin/false vmware_exporter

# Create the directory and virtualenv
sudo mkdir -p /opt/vmware_exporter
sudo python3 -m venv /opt/vmware_exporter/venv

# Install the package
sudo /opt/vmware_exporter/venv/bin/pip install vmware_exporter

Create the configuration file:

sudo nano /opt/vmware_exporter/config.yaml

default:
    vsphere_host: "vcenter.lab.local"      # IP or FQDN of vCenter
    vsphere_user: "[email protected]"         # replace with your read-only monitoring account
    vsphere_password: "YourSecurePassword"
    vsphere_port: 443
    ignore_ssl: true                         # set to false if vCenter has a valid certificate
    collect_only:
        vms: true
        vmguests: true
        datastores: true
        hosts: true
        snapshots: true
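Since config.yaml holds the vCenter password in plain text, it is worth checking that other users on the server cannot read it. The sketch below uses only the standard library; the function name is my own, and the path is the one used in this guide.

```python
import os
import stat


def readable_by_others(path):
    """Return True if the file's mode grants read access to group or others."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))


CONFIG_PATH = "/opt/vmware_exporter/config.yaml"  # path used in this guide

if os.path.exists(CONFIG_PATH) and readable_by_others(CONFIG_PATH):
    print(f"{CONFIG_PATH} is readable by other users; consider chmod 600")
```

If it prints a warning, one fix is to make the vmware_exporter user the owner of the file and set its mode to 600, so the service can still read it.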

Create the systemd service:

sudo nano /etc/systemd/system/vmware_exporter.service

[Unit]
Description=VMware Exporter for Prometheus
After=network.target

[Service]
User=vmware_exporter
ExecStart=/opt/vmware_exporter/venv/bin/vmware_exporter \
    -c /opt/vmware_exporter/config.yaml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now vmware_exporter

# Check that the service is running
sudo systemctl status vmware_exporter

# Test endpoint
curl http://localhost:9272/metrics | head -30
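The curl test shows that the endpoint answers, but not how long a full scrape takes, and that number matters when you pick a scrape timeout later. Here is a rough sketch that times one request and counts the samples returned. The function names are hypothetical, and the URL in the usage comment assumes the default port used above.

```python
import time
import urllib.request


def count_samples(body):
    """Count sample lines, skipping blanks and # HELP / # TYPE comments."""
    return sum(
        1 for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def time_scrape(url, timeout=120.0):
    """Fetch the endpoint once and return (seconds_taken, sample_count)."""
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return time.monotonic() - started, count_samples(body)


# Usage on the monitoring server (not executed here):
#   seconds, samples = time_scrape("http://localhost:9272/metrics")
#   print(f"{samples} samples in {seconds:.1f}s")
```

Run it a few times at different hours; the slowest result is the one to size the timeout against.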

Step 3: Install Prometheus

# Download Prometheus (v2.53.0 shown here; check the releases page for the current version)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvf prometheus-2.53.0.linux-amd64.tar.gz
sudo mv prometheus-2.53.0.linux-amd64 /opt/prometheus

# Configure Prometheus to scrape vmware_exporter
sudo nano /opt/prometheus/prometheus.yml

global:
  scrape_interval: 60s      # the vSphere API is slow, so 60s is reasonable
  evaluation_interval: 60s

scrape_configs:
  - job_name: 'vmware'
    static_configs:
      - targets: ['localhost:9272']
    scrape_timeout: 55s     # must not exceed scrape_interval
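Prometheus refuses to load a configuration whose scrape_timeout is larger than scrape_interval, so it helps to sanity-check the two values before restarting anything. A small sketch of that check follows; the helper names are mine, and the duration parser covers only the simple unit suffixes (ms, s, m, h, d, w, y), treating a year as 365 days.

```python
import re

_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 604800, "y": 31536000,
}
_PART_RE = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_seconds(text):
    """Convert a Prometheus-style duration such as '60s' or '1m30s' to seconds."""
    parts = _PART_RE.findall(text)
    # Reject input that contains anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def timeout_fits(scrape_interval, scrape_timeout):
    """True if the timeout does not exceed the interval."""
    return duration_seconds(scrape_timeout) <= duration_seconds(scrape_interval)


print(timeout_fits("60s", "55s"))
```

With the values from the config above (60s interval, 55s timeout) the check passes; swapping in a 2m timeout would not.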

# Create a systemd service for Prometheus
sudo useradd -r -s /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus

cat <<'EOF' | sudo tee /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --storage.tsdb.retention.time=90d
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
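Before committing to 90 days of retention, it is worth estimating how much disk that needs. The Prometheus documentation gives a rule of thumb: retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it; the 50,000 active series figure is a made-up example, so substitute the count from your own environment.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rule-of-thumb disk estimate: retention * samples/second * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical environment: 50,000 active series, 60s interval, 90 days kept.
estimate = estimate_tsdb_bytes(50_000, 60, 90)
print(f"{estimate / 1e9:.1f} GB")
```

With these assumed numbers the estimate comes to about 13 GB, using the pessimistic 2 bytes per sample. Treat it as a lower bound for planning and leave headroom for the write-ahead log and compaction.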

Step 4: Install Grafana

# Add the Grafana repo (using signed-by instead of the deprecated apt-key)
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server

Open Grafana at http://<server-ip>:3000. Log in with admin/admin and change the password right away; this step is often skipped, which leaves default credentials on prod.

Step 5: Add the data source and import a dashboard

In the Grafana UI:

Go to Configuration → Data Sources → Add data source (in recent Grafana releases the menu is Connections → Data sources)
Choose Prometheus and set the URL to http://localhost:9090
Click Save & Test
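The same three clicks can be scripted through Grafana's HTTP API (POST /api/datasources), which is handy when you rebuild the monitoring server. The sketch below only builds the request; sending it is left as a commented-out line. The function name and the data source name "Prometheus" are my own choices, and basic auth with the admin account is assumed.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend issues the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Usage (not executed here); use the password you set after first login:
#   request = build_datasource_request(
#       "http://localhost:3000", "admin", "<password>", "http://localhost:9090")
#   urllib.request.urlopen(request)
```

For anything long-lived, a service account token or Grafana's file-based provisioning is a better fit than the admin password; this is just the shortest path to the same result as the UI steps.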

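Import a dashboard from Grafana.com: I use ID 8168 (VMware vSphere Overview), which I found to be the most actively community-maintained one. Go to Dashboards → Import and enter ID 8168. Before relying on it, check on its Grafana.com page that the dashboard targets a Prometheus data source and vmware_exporter metric names; panels written for a different collector will show no data.

Lessons from the field and things to watch for

These are the spots where I lost time debugging, written down so they are not repeated: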

Scrape timeout: with many VMs, the vSphere API can take 30 to 50 seconds to respond. Prometheus' default scrape_timeout is 10s, so unless you raise it (while keeping it no larger than scrape_interval), scrapes keep failing with timeout errors.
Read-only user: create a dedicated vSphere user with the Read-Only role and assign it at the vCenter level (not on an ESXi host) so it can see the whole inventory. Do not use administrator for monitoring.
Firewall: open port 443 from the monitoring server to vCenter. This step is easy to forget.
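The firewall point is quick to verify from the monitoring server before you start debugging the exporter itself. A minimal sketch of a TCP reachability check is below; the function name is hypothetical, and the hostname in the usage comment is the example one from config.yaml.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False


# Usage on the monitoring server (not executed here):
#   print(can_connect("vcenter.lab.local", 443))
```

A True result only proves the port is open. Authentication and the read-only role still need to be checked through the exporter's own logs.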

As an aside: when I migrated my personal lab from VMware to Proxmox as an experiment, I noticed something interesting. Proxmox VE Exporter follows a similar model but returns richer data because the API is more open. For enterprise vSphere environments, vmware_exporter is still the most production-ready option at the moment.

Some useful alert rules

Once the data is flowing, add alerts. These are the 3 rules I have used since day one; save them in a rules file and list that file under rule_files in prometheus.yml so Prometheus loads them:

groups:
  - name: vmware_alerts
    rules:
      - alert: HostHighCPUUsage
        expr: vmware_host_cpu_usage_average > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ESXi host {{ $labels.host_name }} CPU > 85%"

      - alert: DatastoreLowFreeSpace
        expr: (vmware_datastore_freespace_size / vmware_datastore_capacity_size) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Datastore {{ $labels.ds_name }} has only {{ $value | printf "%.1f" }}% free space'

      - alert: VMSnapshotTooOld
        expr: vmware_vm_snapshot_timestamp_seconds < (time() - 7 * 24 * 3600)
        labels:
          severity: warning
        annotations:
          summary: "VM {{ $labels.vm_name }} has a snapshot older than 7 days"

The 7-day snapshot alert is the first one I add on every new setup. Snapshots left to pile up quietly eat the datastore without anyone noticing, and they are behind many of the "datastore full for no obvious reason" incidents I have handled.
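If the PromQL expressions look opaque, the arithmetic behind two of the rules is simple enough to write out. The sketch below mirrors the datastore and snapshot conditions in plain Python. The function names and sample numbers are invented; only the thresholds (15% and 7 days) come from the rules above.

```python
SEVEN_DAYS_S = 7 * 24 * 3600


def datastore_free_percent(free_bytes, capacity_bytes):
    """Same ratio as the DatastoreLowFreeSpace expression."""
    return free_bytes / capacity_bytes * 100


def snapshot_too_old(snapshot_ts, now, max_age_s=SEVEN_DAYS_S):
    """Same comparison as VMSnapshotTooOld: created before (now - 7 days)."""
    return snapshot_ts < now - max_age_s


# Invented sample: a 1 TB datastore with 120 GB free is below the 15% threshold.
percent = datastore_free_percent(120e9, 1e12)
print(f"{percent:.1f}% free, alert: {percent < 15}")
```

Prometheus evaluates the same comparison per time series, so each datastore and each snapshot gets its own alert instance.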

The result once you are done

Once setup is complete, you have Prometheus holding 90 days of data and Grafana showing near-real-time metrics for every host, VM, datastore and network. Next time there is a 3 a.m. incident, you open the timeline and see at once that CPU started climbing abnormally at 11 p.m.; no guesswork needed.

Adding a new vSphere cluster? Update config.yaml and restart the service. No extra license, no agents, no change window.