Knowing how your system is performing is one of the biggest challenges of running Kubernetes. Are your pods running stably? Are your nodes running out of resources? Is your application suffering performance problems you're unaware of? Without a proper monitoring system, managing a Kubernetes cluster is like driving in fog: you can't see where you're going.
Previously, when I was responsible for monitoring physical servers or virtual machines, I used to install Prometheus and Grafana manually. I would define the scrape_configs in the prometheus.yml file to collect metrics from Node Exporter or cAdvisor.
That approach was effective for a small number of servers. However, when transitioning to Kubernetes with hundreds, or even thousands, of constantly changing pods, manual configuration becomes impractical. Pods are created, terminated, and IPs change continuously, so how can Prometheus automatically discover them to collect metrics?
That’s where Prometheus Operator emerges as a lifesaver. This article will guide you through deploying Prometheus Operator and Grafana to automatically monitor your Kubernetes cluster, enabling you to proactively detect and resolve issues.
Quick Start: Deploying Prometheus Operator and Grafana in 5 Minutes
To quickly see results, we will install the entire monitoring stack on your cluster using Helm. This stack includes Prometheus Operator, Prometheus, Grafana, Alertmanager, Node Exporter, and Kube-state-metrics. Make sure you have Helm installed and access to your Kubernetes cluster.
1. Add Helm Repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
2. Install kube-prometheus-stack
kube-prometheus-stack is a Helm chart that includes Prometheus Operator and all other necessary components.
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
This command will create a new namespace called monitoring and deploy the entire stack into it. This process may take a few minutes.
3. Access Grafana
Once the pods are running stably, you’ll need to port-forward to access the Grafana dashboard. First, retrieve the Grafana admin password:
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode
Then, port-forward the Grafana service:
kubectl port-forward service/prometheus-grafana 3000:80 -n monitoring
Now, open your browser and go to http://localhost:3000. Log in with the username admin and the password you just retrieved. You’ll find a wide range of pre-built dashboards to monitor your Kubernetes cluster!
Detailed Explanation: Why Prometheus Operator for Kubernetes?
The Problem: Complex Kubernetes Monitoring
Kubernetes is a complex distributed system with many dynamic components: Nodes, Pods, Services, Deployments, StatefulSets, Ingress, Controllers, and more. Each component has its own state and performance characteristics. Manually tracking each one is impractical. Furthermore, pods are constantly created and destroyed, and IPs change, rendering static Prometheus configurations useless.
The Cause: Kubernetes’ Dynamic Nature
The dynamic nature of Kubernetes is the primary reason traditional monitoring struggles. Prometheus needs to know its targets (endpoints providing metrics) to collect data. In a static environment, you simply list IPs. But with Kubernetes, pods can be scaled up/down, moved between nodes, or change IPs at any time.
The Solution: Prometheus Operator and Grafana
This is where Prometheus Operator and Grafana unleash their power.
- Prometheus Operator: A Kubernetes Operator built to automate the deployment, management, and operation of Prometheus on Kubernetes. Instead of configuring Prometheus manually, you define Custom Resources (via CRDs) such as Prometheus, ServiceMonitor, PodMonitor, or Alertmanager. The Operator reads these definitions to automatically create and manage the corresponding Prometheus objects.
- ServiceMonitor and PodMonitor: Two of Prometheus Operator's most important CRDs. They let you define how Prometheus automatically discovers and scrapes metrics from Services or Pods in the cluster based on label selectors. This completely solves the problem of Kubernetes' dynamic nature.
- Grafana: A powerful visualization tool. It connects to Prometheus to query metric data and display it in easy-to-understand charts and tables via dashboards.
Kubernetes Monitoring System Architecture
When using kube-prometheus-stack, your monitoring system will have the following architecture:
- Node Exporter: Deployed as a DaemonSet, it runs on each Node to collect metrics about Node resources (CPU, RAM, Disk, Network).
- Kube-state-metrics: Collects metrics about the state of Kubernetes objects (number of running Pods, failed Deployments, PersistentVolumes in use, etc.).
- Prometheus: The heart of the system, collecting metrics from Node Exporter, Kube-state-metrics, and your applications via ServiceMonitor/PodMonitor.
- Prometheus Operator: Monitors CRDs (Prometheus, ServiceMonitor, PodMonitor, Alertmanager) and ensures Prometheus and Alertmanager instances are correctly configured and running.
- Alertmanager: Processes alerts generated by Prometheus, sending notifications to various channels (email, Slack, Telegram, etc.).
- Grafana: Connects to Prometheus to display data and dashboards.
Advanced: Automated Metrics Collection with ServiceMonitor
The strength of Prometheus Operator lies in its ability to automatically discover Services/Pods for metrics collection. Suppose you have a web application running in Kubernetes that exposes metrics at the path /metrics on port 8080. You can create a ServiceMonitor for Prometheus to automatically collect metrics from that application’s Service.
ServiceMonitor Example
First, your application needs a Service for Prometheus to access:
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
  - name: web
    port: 80
    targetPort: 8080 # Port where the application exposes metrics
Next, create a ServiceMonitor so Prometheus knows how to collect metrics from this Service:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-servicemonitor
  labels:
    app: my-app
    release: prometheus # Important: by default, kube-prometheus-stack's Prometheus only selects ServiceMonitors carrying the Helm release label
spec:
  selector:
    matchLabels:
      app: my-app # Select Services with the label app: my-app
  endpoints:
  - port: web # Port name from the Service above
    path: /metrics # Application's metrics path
  namespaceSelector:
    matchNames:
    - default # Or the namespace containing your application
When you apply this YAML file, Prometheus Operator will detect the new ServiceMonitor. It will automatically configure Prometheus to start scraping metrics from my-app-service at port web and path /metrics. You no longer need to manually edit Prometheus configuration files!
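If your workload has no Service, or you want to scrape pods directly, a PodMonitor works the same way. A minimal sketch, assuming the pods carry the label app: my-app and declare a container port named web that serves /metrics (the names mirror the Service example above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app-podmonitor
  labels:
    release: prometheus # so the stack's Prometheus picks this PodMonitor up
spec:
  selector:
    matchLabels:
      app: my-app # Select Pods with the label app: my-app
  podMetricsEndpoints:
  - port: web # Container port *name* in the pod spec
    path: /metrics
  namespaceSelector:
    matchNames:
    - default
```

The only structural difference from a ServiceMonitor is that endpoints becomes podMetricsEndpoints and the selector matches pod labels instead of Service labels.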
Configure Rules and Alertmanager
kube-prometheus-stack also includes Alertmanager and PrometheusRule CRDs. You can define alert rules (e.g., Pod CPU usage exceeding 80% for 5 minutes) by creating a PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    app: my-app
    release: prometheus # The same release-label selection applies to rules by default
spec:
  groups:
  - name: my-app.rules
    rules:
    - alert: HighCpuUsage
      expr: sum(rate(container_cpu_usage_seconds_total{namespace="default", pod=~"my-app.*"}[5m])) by (pod) > 0.8
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "CPU usage for pod {{ $labels.pod }} is high"
        description: "Pod {{ $labels.pod }} has been using more than 80% of a CPU core for 5 minutes."
Prometheus Operator will automatically load these rules into Prometheus, and when alert conditions are met, Prometheus will send alerts to Alertmanager. Alertmanager will then process and send notifications to the channels you’ve configured (e.g., Slack, email).
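To close the loop, here is a minimal sketch of routing critical alerts to Slack. With kube-prometheus-stack, one way to supply this is under alertmanager.config in the chart's values; the webhook URL and channel below are placeholders you would replace with your own:

```yaml
alertmanager:
  config:
    route:
      receiver: slack-critical # default receiver
      routes:
      - matchers:
        - severity = "critical"
        receiver: slack-critical
    receivers:
    - name: slack-critical
      slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ # placeholder webhook URL
        channel: '#alerts'
        send_resolved: true # also notify when the alert resolves
```

Apply it with helm upgrade and the stack will regenerate the Alertmanager configuration for you.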
Practical Tips and Personal Experience
Monitoring Your Applications
To fully leverage this monitoring system, your applications also need to expose metrics in Prometheus format. Most modern frameworks have supporting libraries: for example, Spring Boot has Actuator, Node.js has prom-client, and Python has prometheus_client. Simply add the dependency, configure metrics exposure at an endpoint (usually /metrics), and create a ServiceMonitor.
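As an illustration, here is a minimal Python sketch using prometheus_client, the library mentioned above. The metric names and the simulated work are made up for the example; in a real application you would call start_http_server(8080) once at startup so Prometheus can scrape /metrics on the port the Service example targets:

```python
from prometheus_client import Counter, Histogram, generate_latest
import time

# Illustrative metric names; rename them for your application
REQUESTS = Counter("myapp_requests_total", "Total requests handled")
LATENCY = Histogram("myapp_request_latency_seconds", "Request latency in seconds")

def handle_request():
    """Stand-in for a real request handler, instrumented with both metrics."""
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # pretend to do some work

for _ in range(3):
    handle_request()

# In a real app, call prometheus_client.start_http_server(8080) instead;
# here we just print the exposition-format text Prometheus would scrape.
print(generate_latest().decode())
```

The printed output is exactly what Prometheus reads from /metrics: each counter and histogram appears with its help text, type, and current value.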
Hard-earned Experience
My current monitoring system, comprising Prometheus + Grafana, tracks about 15 physical servers and a large Kubernetes cluster. This setup has repeatedly helped me detect issues before users even reported them.
One time, an application in K8s started experiencing high latency. I immediately saw on the Grafana dashboard that the number of successful requests dropped, and response times spiked. Thanks to the Alertmanager alert, I was able to promptly scale up the number of pods and identify the root cause before it broadly impacted users. Without monitoring, I would probably have had to wait until users reported the error, by which point it would have been too late!
Long-term Storage
By default, Prometheus stores data on local disk for a limited retention period (15 days out of the box). For long-term trend analysis or regulatory compliance, you may need a long-term storage solution such as Thanos or Cortex, which let Prometheus data be kept in object storage such as S3 or GCS.
Grafana Security
Always change Grafana’s default admin password. Consider integrating Grafana with a centralized authentication system (LDAP/OAuth) if you have multiple users. Ensure Grafana is not unnecessarily exposed to the internet. If exposure is mandatory, place it behind an Ingress Controller with tight security configurations (e.g., HTTPS, basic authentication).
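If you do have to expose Grafana, a hedged sketch of such an Ingress follows. It assumes the ingress-nginx controller, an existing TLS secret (grafana-tls), and an htpasswd secret (grafana-basic-auth) that you create yourself; the hostname is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: grafana-basic-auth # htpasswd secret in the same namespace
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - grafana.example.com
    secretName: grafana-tls # TLS certificate for the hostname
  rules:
  - host: grafana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-grafana # Service created by kube-prometheus-stack
            port:
              number: 80
```

Note that Grafana still presents its own login behind the basic-auth prompt; the extra layer simply keeps unauthenticated traffic away from the Grafana process itself.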
Conclusion
Prometheus Operator and Grafana form a powerful and flexible duo for monitoring the performance and health of your Kubernetes cluster. With their ability to automatically discover and collect metrics, you can easily scale your monitoring system without extensive manual configuration. Deploy them now to gain deep insights into your Kubernetes system!
