Installing Prometheus and Grafana for Real-Time Server Monitoring – ITFROMZERO

When do you know your server has a problem?

I used to find out a server was slow the hard way: users messaged to complain first. Only then would I SSH in and run top, df -h, and free -m one at a time. With three or four servers that is tolerable; past 10 you no longer know where to start.

After setting up Prometheus + Grafana, things changed completely. Open the dashboard and you see it at once: what time the CPU spiked, where RAM usage stands, how much disk is left, all on one screen. No more SSHing into each server.

This article goes straight to a hands-on installation on Ubuntu 22.04: Node Exporter to collect metrics, Prometheus to scrape and store them, and Grafana to display them.

How the architecture works: a 30-second overview

Prometheus is a time-series database that doubles as a scraper. At a fixed interval (15 seconds in this setup), it sends an HTTP request to each exporter, pulls the metrics, and stores them in its own storage. You query the data with PromQL, Prometheus's own query language; a few hours of learning is enough to start using it.

Node Exporter runs on every server you want to monitor and exposes a /metrics endpoint with roughly 700+ metrics: per-core CPU, RAM, disk I/O, network traffic, open file descriptors, and more.

Grafana acts as the frontend. It connects to Prometheus, queries the data, and draws dashboards. Alerts can go out via email, Slack, or a Telegram webhook.

Data flow:

Node Exporter (port 9100) ← Prometheus scrapes every 15s → stored in TSDB → Grafana queries → Dashboard

Install Node Exporter on each server to be monitored

Create a dedicated system user; running the exporter as root is bad practice:

# Create a system user with no home directory and no login shell
sudo useradd --system --no-create-home --shell /bin/false node_exporter

# Download Node Exporter (check for the latest version at github.com/prometheus/node_exporter)
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

Next, create a systemd service so Node Exporter starts automatically after a reboot:

sudo nano /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Verify (-s hides curl's progress output)
curl -s http://localhost:9100/metrics | head -20

If the output contains lines like node_cpu_seconds_total{...}, Node Exporter is running correctly.
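
To make the /metrics output less mysterious, here is a minimal sketch of how those lines can be read programmatically. It is a simplified parser of the Prometheus text format, not the official client library: it assumes label values contain no escaped quotes, and the sample text is made up for illustration, so your real numbers will differ.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
# Simplified: assumes label values contain no escaped quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical sample, shaped like Node Exporter output.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
node_memory_MemTotal_bytes 8.3e+09
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each line is just a metric name, an optional set of labels, and a number. Prometheus reads exactly this text on every scrape.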

Install Prometheus on the monitoring server

I keep things separate: one server dedicated to Prometheus + Grafana, while the other servers only run Node Exporter. Backing up the data is easier, and monitoring does not get mixed up with production workloads.

sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xvf prometheus-2.52.0.linux-amd64.tar.gz
sudo cp prometheus-2.52.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo cp -r prometheus-2.52.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus

Configure scrape targets

This is the most important part: declaring which servers Prometheus should collect metrics from:

sudo nano /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - '192.168.1.10:9100'   # web server
          - '192.168.1.11:9100'   # db server
          - '192.168.1.12:9100'   # app server
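
Once the target list grows, typing it by hand gets error-prone. As a rough sketch, the node_exporter job could be rendered from a host list like this. The function name render_node_job and the host dictionary are my own illustration, not something Prometheus provides; the output is plain text to paste under scrape_configs.

```python
def render_node_job(hosts, port=9100, job_name="node_exporter"):
    """Render one scrape_configs entry as YAML text.

    hosts maps an address to a short note, e.g. {"192.168.1.10": "web server"}.
    """
    lines = [
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    for address, note in hosts.items():
        lines.append(f"          - '{address}:{port}'   # {note}")
    return "\n".join(lines) + "\n"


# Same three example servers as in the config above.
HOSTS = {
    "192.168.1.10": "web server",
    "192.168.1.11": "db server",
    "192.168.1.12": "app server",
}

print(render_node_job(HOSTS), end="")
```

Whichever way the file is produced, running promtool check config /etc/prometheus/prometheus.yml before restarting Prometheus catches indentation mistakes early.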

Create a systemd service for Prometheus:

sudo nano /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

# Check whether the targets are up (-s hides curl's progress output)
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
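
The grep above only shows the health values without telling you which server they belong to. A small sketch that lists the unhealthy targets by name could look like this. The sample response is hand-written to mirror the fields the targets API returns (activeTargets, labels, health, lastError), so treat the values as hypothetical.

```python
import json


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every target that is not up."""
    targets = payload["data"]["activeTargets"]
    return [
        (
            t["labels"].get("job", "?"),
            t["labels"].get("instance", "?"),
            t.get("lastError", ""),
        )
        for t in targets
        if t.get("health") != "up"
    ]


# Hypothetical response, trimmed to the fields used here.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "192.168.1.10:9100", "job": "node_exporter"},
       "health": "up", "lastError": ""},
      {"labels": {"instance": "192.168.1.11:9100", "job": "node_exporter"},
       "health": "down", "lastError": "connection refused"}
    ],
    "droppedTargets": []
  }
}
""")

for job, instance, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job} {instance} is down: {error}")
```

To try it against a live server, save the curl output to a file and load it with json.load instead of the sample.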

Open http://<monitoring-server>:9090/targets to check the status; a green target is being scraped successfully. A small note: 30 days of retention for 3 servers takes roughly 2–4 GB of disk, so lower --storage.tsdb.retention.time if disk space is tight.
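
For a quick sanity check on disk sizing, the Prometheus storage documentation gives a rule of thumb: retention in seconds × ingested samples per second × bytes per sample, with 1 to 2 bytes per sample as a typical range. Below is a minimal sketch of that arithmetic. The figure of 1500 series per host is purely an assumption for illustration; the real count depends on cores, disks, and network interfaces.

```python
def estimate_tsdb_bytes(hosts, series_per_host, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * samples/second * bytes/sample."""
    samples_per_second = hosts * series_per_host / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed inputs: 3 hosts, 1500 series each (hypothetical), 15s scrape, 30 days.
estimate = estimate_tsdb_bytes(hosts=3, series_per_host=1500,
                               scrape_interval_s=15, retention_days=30)
print(f"{estimate / 1e9:.2f} GB")
```

This is only a rough lower-end estimate, and actual usage can come out higher, so leave headroom when sizing the disk.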

Install Grafana

Grafana has an official APT repo, which is cleaner than downloading files by hand:

sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt-get update
sudo apt-get install grafana
sudo systemctl enable --now grafana-server

The default port is 3000. Go to http://<server-ip>:3000 and log in with admin/admin; change the password right away at the first login.

Connect Grafana to Prometheus

Go to Connections → Data Sources → Add data source
Select Prometheus
URL: http://localhost:9090 (if on the same server) or the IP of the Prometheus server
Click Save & Test; once you see “Successfully queried the Prometheus API”, you are done
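
If you would rather script this step than click through the UI, Grafana also accepts data sources through its HTTP API (POST /api/datasources). The sketch below only builds the request and does not send it. The URLs and the admin/admin credentials are placeholders from this article, and the function name is my own.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # Grafana's backend talks to Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder values; replace the password after your first login.
request = build_datasource_request(
    "http://localhost:3000", "admin", "admin", "http://localhost:9090"
)
print(request.get_method(), request.full_url)
# Not sent here: pass `request` to urllib.request.urlopen to actually create it.
```

For a single server the UI is quicker; a script like this mainly pays off when you rebuild the monitoring server often.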

Import a dashboard in 2 minutes

No need to build one from scratch. Grafana has a community dashboard library at grafana.com/grafana/dashboards. ID 1860 (Node Exporter Full) is among the most downloaded dashboards, with millions of downloads, and covers nearly everything you need:

Dashboards → Import
Enter ID 1860 and click Load
Select the Prometheus data source you just created
Import; CPU, RAM, disk, and network show up right away, with per-core and per-disk breakdowns

Commonly used PromQL

When I need to build my own panels, these are the queries I use almost daily:

# % CPU in use (5-minute average)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# % RAM in use
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Disk usage per mount point
100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100)

# Inbound network traffic (bytes/s)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
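
The CPU query can look odd at first: it measures how fast the idle counter grows and subtracts that from 100. The sketch below walks through the same arithmetic on made-up samples. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges, which this version ignores.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def cpu_used_percent(idle_by_core):
    """Average the idle rate over all cores, then convert to percent used."""
    idle_rates = [simple_rate(samples) for samples in idle_by_core.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle-counter samples over a 5-minute (300 s) window, per core.
IDLE_SAMPLES = {
    "0": [(0, 1000.0), (300, 1240.0)],  # idle for 240 of 300 seconds
    "1": [(0, 1000.0), (300, 1180.0)],  # idle for 180 of 300 seconds
}

print(f"{cpu_used_percent(IDLE_SAMPLES):.1f}% CPU used")
```

Core 0 was idle 80% of the time and core 1 was idle 60%, so the average idle share is 70% and the instance comes out at about 30% CPU used.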

Configure an alert when CPU exceeds a threshold

Grafana can create an alert directly from a panel, so simple cases do not need Alertmanager. Example: alert when CPU stays above 80% for 5 minutes straight:

Open the CPU Usage panel → Edit
Alert tab → New alert rule
Condition: WHEN avg() OF query IS ABOVE 80
For: 5m. The rule waits 5 minutes before triggering, which avoids false alarms when CPU only spikes briefly and then drops
Notification: choose a contact point (email, Slack, Telegram webhook)
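
The For setting is the part people most often misread, so here is a simplified model of it. The sketch assumes one evaluation per minute and only tracks the move from pending to firing; Grafana's real evaluator has more states (no data, error), and the function name is hypothetical.

```python
def first_firing_time(evaluations, threshold=80.0, for_seconds=300):
    """Return the timestamp at which the alert would fire, or None.

    evaluations is a time-ordered list of (timestamp_seconds, value).
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= for_seconds:
                return timestamp           # held long enough: firing
        else:
            pending_since = None           # condition cleared: reset the timer
    return None


# Hypothetical CPU readings, one evaluation every 60 seconds.
short_spike = [(0, 50), (60, 95), (120, 60), (180, 70)]
sustained = [(0, 85), (60, 90), (120, 88), (180, 91), (240, 86), (300, 90)]

print(first_firing_time(short_spike))  # the spike clears before 5 minutes pass
print(first_firing_time(sustained))    # above 80 for the whole window
```

A brief spike resets the timer as soon as the value drops back under the threshold, which is exactly why a 5-minute For cuts down on noisy notifications.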

Summary

I run this setup in production. Getting from a blank server to a full dashboard takes about 30–45 minutes if you are comfortable with Linux.

The best part of Prometheus: data is stored as time series, so when an incident happens I can rewind and see exactly what RAM usage was at 2:37 AM yesterday and when the CPU started climbing. No guessing, no losing the trail.

To go further: Alertmanager handles more complex alerting, such as grouping, silencing, and routing by team. If you run Docker, add cAdvisor to track resource usage per container.