Grafana Pyroscope: Hunting Down CPU and RAM “Hogs” Directly on Production

Monitoring tutorial - IT technology blog

The Nightmare of Sudden Resource Spikes

Imagine this: it's 2 AM and the system alarm goes off because the server’s CPU has hit 98%. You rush to your computer, SSH into the server, and run top. Ironically, everything has already dropped back to a normal 10%, as if nothing ever happened. Even more frustrating is a memory leak: RAM silently creeps up a little every day until the OOM Killer terminates the process, leaving behind a pile of logs that say nothing about the cause.

Previously, I used pprof for Go or py-spy for Python to take performance snapshots. However, this method is very hit-or-miss because it only captures a single moment. After 6 months of running Grafana Pyroscope in production, I realized this is the missing piece to complete the Observability puzzle, alongside Metrics, Logs, and Traces.

Comparing Popular Profiling Methods

To see the value of Pyroscope, let’s look back at the three approaches I’ve experienced:

1. Manual Profiling

You have to catch the incident while it is happening, run a command to export a profile file, then download that file to your local machine for analysis. This method is free but extremely labor-intensive, and it’s very easy to miss the “golden window” if the incident only lasts a few dozen seconds.
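To make the manual approach concrete, here is a minimal sketch of a one-off snapshot using Python's built-in cProfile. It is a stand-in for the pprof and py-spy workflow described above, not the exact commands used in production, and process_data is only a placeholder workload.

```python
import cProfile
import io
import pstats


def process_data():
    # Placeholder workload; swap in the code path you suspect.
    return [x ** 2 for x in range(200_000)]


def snapshot(func, top_n=5):
    """Profile a single call of func and return the top_n entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top_n)
    return buffer.getvalue()


report = snapshot(process_data)
print(report)
```

The weakness is visible in the code itself: the snapshot only covers the one call you happened to wrap, at the moment you happened to run it.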

2. Expensive APMs (Datadog, New Relic)

These tools offer very smooth Continuous Profiling features. However, the cost is a massive barrier. For startups with around 20-50 microservices, seeing the end-of-month bill from Datadog is enough to make you want to… just turn the feature off to save your sanity.

3. Grafana Pyroscope: The Balanced Choice

This is the best balance between cost and efficiency. Pyroscope continuously samples your application's stack traces with very low overhead, then compresses and stores the data as it arrives. Thanks to this, you can “travel back in time” to any point within your retention window and see exactly which function was consuming the most resources.
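The idea of continuously collecting stack traces can be illustrated with a toy sampling profiler. This is only a sketch of the general technique in pure Python, not how Pyroscope's agent is implemented; busy_work and the sampling interval are made up for the demonstration.

```python
import collections
import sys
import threading
import time


def busy_work(stop_at):
    # Hypothetical hot function that burns CPU until the deadline.
    total = 0
    while time.monotonic() < stop_at:
        total += sum(i * i for i in range(1000))
    return total


def sample_stacks(thread_id, duration=0.5, interval=0.01):
    """Periodically record one thread's call stack as a folded 'root;...;leaf' string."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)
        names = []
        while frame is not None:          # walk from the leaf frame up to the root
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        if names:
            counts[";".join(reversed(names))] += 1
        time.sleep(interval)
    return counts


worker = threading.Thread(target=busy_work, args=(time.monotonic() + 0.6,))
worker.start()
samples = sample_stacks(worker.ident)
worker.join()

for stack, hits in samples.most_common(3):
    print(hits, stack)
```

Each distinct stack string with its hit count is exactly the kind of data a flame graph is drawn from: the more often a function shows up in the samples, the wider its block.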

Why Pyroscope Deserves a Place in Your Stack

After a long time in the field, I’ve summarized three outstanding advantages:

  • Ultra-lightweight Overhead: In my experience, Pyroscope only consumes about 1-2% of the application’s CPU. You can confidently keep it running 24/7 on Production without worrying about slowing down the system.
  • Intuitive Flame Graphs: Instead of reading dry logs, you’ll look at a “flame graph.” The wider the block, the more resources that function consumes. You’ll know immediately whether the bottleneck lies in your own code or in a third-party library.
  • The Grafana Ecosystem: Bringing Metrics (Prometheus), Logs (Loki), and Profiles (Pyroscope) into the same Dashboard makes incident investigation many times faster.

A small note: Storing Profile data can consume quite a bit of disk space. You should configure a Retention Policy of about 7-14 days to balance investigation needs and storage costs.
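A back-of-the-envelope calculation helps when picking between 7 and 14 days. The sketch below assumes storage grows linearly with the number of services and days kept; the service count and the daily volume per service are hypothetical inputs, so measure your own before relying on the result.

```python
def estimate_storage_gib(services, mib_per_service_per_day, retention_days):
    """Rough disk estimate: services x daily volume x days kept, in GiB."""
    total_mib = services * mib_per_service_per_day * retention_days
    return total_mib / 1024


# Hypothetical inputs: 30 services, 200 MiB of profile data per service per day.
for days in (7, 14):
    print(days, "days:", round(estimate_storage_gib(30, 200, days), 1), "GiB")
```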

Quick Deployment in 3 Steps

Here is how to quickly set up a single Pyroscope instance with Docker so you can try it out.

Step 1: Initialize the Server

Create a docker-compose.yaml file with the following simple content:

version: '3.9'
services:
  pyroscope:
    image: grafana/pyroscope:latest
    ports:
      - "4040:4040"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin

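Run docker-compose up -d (or docker compose up -d on Compose v2) to start. The Pyroscope server will listen on port 4040 and Grafana on port 3000. Note that the Grafana settings above grant anonymous Admin access: convenient for a local trial, but not something to leave enabled on a shared or public host.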

Step 2: Integrate into Code (Python)

Install the agent library:

pip install pyroscope-io

Insert this initialization code into your main execution file:

import pyroscope

pyroscope.configure(
    application_name    = "api-service-prod",
    server_address      = "http://localhost:4040",
    tags = {"env": "production", "version": "1.2.0"}
)

# Simulate a heavy processing function
def process_data():
    return [x**2 for x in range(1000000)]

if __name__ == "__main__":
    while True:
        process_data()
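Hard-coding the server address and tags works for a demo, but in production you will probably want them to come from the environment. Here is one possible sketch: it builds the same keyword arguments used above from environment variables. The variable names (PYROSCOPE_APP_NAME, PYROSCOPE_SERVER, APP_ENV, APP_VERSION) are my own invention, not something the library reads by itself.

```python
import os


def build_pyroscope_config(environ=os.environ):
    """Assemble keyword arguments for pyroscope.configure() from environment variables."""
    return {
        "application_name": environ.get("PYROSCOPE_APP_NAME", "api-service-prod"),
        "server_address": environ.get("PYROSCOPE_SERVER", "http://localhost:4040"),
        "tags": {
            "env": environ.get("APP_ENV", "production"),
            "version": environ.get("APP_VERSION", "1.2.0"),
        },
    }


def start_profiling():
    # Imported lazily so this module still loads where pyroscope-io is not installed.
    import pyroscope

    pyroscope.configure(**build_pyroscope_config())


# Example with explicit values instead of the real environment.
config = build_pyroscope_config({"APP_ENV": "staging", "APP_VERSION": "1.3.0"})
print(config["tags"])
```

Call start_profiling() once at startup, before the main loop. Keeping the version tag accurate pays off later, because it lets you compare profiles from before and after a release.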

Step 3: Enjoy the Results

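Open Grafana at http://localhost:3000 and add Pyroscope as a Data Source with the URL http://pyroscope:4040. The hostname is pyroscope rather than localhost because Grafana reaches the server through the Compose service name. In the Explore menu, you will see orange and red blocks appear. That is the actual resource map of your application.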

Real-world Experience: Reading Flame Graphs Without Getting Confused

When you first start, it’s easy to get overwhelmed by hundreds of colored blocks. Remember two golden rules:

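  1. Look at the width: Don’t worry about how deep or shallow a function is. A block’s width covers the function plus everything it calls, so prioritize the widest blocks, and among those, the ones with few or no child blocks, because that is where the time is actually being spent.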
  2. Use Diff View: This is the most valuable feature. You can compare the system when it’s running normally versus when it’s failing. Pyroscope will highlight in red the code areas where consumption has spiked.
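Both rules boil down to simple counting over stack samples. The sketch below uses made-up sample data in the common folded format ("root;...;leaf" mapped to a sample count) to show what width means and what a diff computes. It illustrates the idea only and is not Pyroscope's own algorithm.

```python
from collections import Counter


def fold(samples):
    """Return (total, self_) sample counts per function from folded stacks."""
    total, self_ = Counter(), Counter()
    for stack, hits in samples.items():
        frames = stack.split(";")
        for name in set(frames):      # count a function once per stack, even if recursive
            total[name] += hits
        self_[frames[-1]] += hits     # the leaf frame is where the time was actually spent
    return total, self_


def diff(baseline, incident):
    """Change in each function's share of all samples, in percentage points."""
    base_total, _ = fold(baseline)
    inc_total, _ = fold(incident)
    base_n, inc_n = sum(baseline.values()), sum(incident.values())
    names = set(base_total) | set(inc_total)
    return {
        n: round(100 * inc_total[n] / inc_n - 100 * base_total[n] / base_n, 1)
        for n in names
    }


# Hypothetical folded stacks: "root;...;leaf" -> number of samples.
baseline = {"main;handle;parse": 60, "main;handle;render": 40}
incident = {
    "main;handle;parse": 50,
    "main;handle;render": 40,
    "main;handle;render;regex": 110,
}

total, self_ = fold(incident)
print("render total:", total["render"], "self:", self_["render"])
print(diff(baseline, incident))
```

In this example render looks wide, but most of its width comes from the regex call beneath it, and the diff shows regex as the function whose share grew the most. That is the block Diff View would paint red.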

Don’t Let Alert Fatigue Bother You

My biggest mistake at first was setting up alerts directly from Pyroscope. As a result, Telegram would alert constantly whenever a Cronjob ran and spiked CPU for a few seconds.

Advice: Use Prometheus to alert on overall CPU/RAM thresholds. When you receive a system overload alert, that’s when you use Pyroscope to “examine the crime scene.” Pyroscope is an in-depth investigation (debugging) tool, not the first layer of alerting.
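Once an alert fires, the first practical question is which two time ranges to compare. Here is a small sketch that derives them from the alert timestamp. The 10-minute margin and the choice of "same time yesterday" as the baseline are assumptions you should tune to your own traffic pattern.

```python
from datetime import datetime, timedelta, timezone


def investigation_windows(alert_time,
                          margin=timedelta(minutes=10),
                          baseline_offset=timedelta(hours=24)):
    """Return (baseline, incident) time ranges to compare in Diff View."""
    incident = (alert_time - margin, alert_time + margin)
    baseline = (incident[0] - baseline_offset, incident[1] - baseline_offset)
    return baseline, incident


alert = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)  # hypothetical 2 AM alert
baseline, incident = investigation_windows(alert)
print("baseline:", baseline[0].isoformat(), "->", baseline[1].isoformat())
print("incident:", incident[0].isoformat(), "->", incident[1].isoformat())
```

Comparing against the same time on the previous day keeps scheduled jobs and daily traffic cycles roughly equal on both sides, so the diff highlights what actually changed.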

Summary

Continuous Profiling is no longer a luxury reserved for the tech giants. With Grafana Pyroscope, you can confidently deploy new code without worrying about hidden performance bugs. Try spending a weekend setting it up; I believe you’ll discover many “illogical” points in your code that logs would never have pointed out.
