How to Use strace to Debug Applications on Linux: Tracing System Calls and Finding Root Causes Effectively

Linux tutorial - IT technology blog

When Do You Actually Need strace?

Some Friday afternoons, a service just dies without logging anything. Restart it and it comes back up, then dies again five minutes later. dmesg shows nothing, journalctl is dead silent — and if you haven’t already mastered effective Linux debugging with journalctl and dmesg, those tools alone won’t get you far. That’s when I reach for strace.

On an old CentOS 7 server at work, I found myself using strace quite a bit to track down issues that normal log levels couldn’t capture — from file permissions going wrong after a deploy, to sockets being blocked by ulimit, to a binary stubbornly reading a config file from an old path even after symlinking it to the new location.

strace is a tool that lets you “eavesdrop” on every system call a process makes — meaning every interaction between the application and the kernel: opening files, reading/writing, network connections, spawning child processes, and so on. No source code needed, no recompilation — just attach to a running PID and you’re good.

Unlike reading logs (which only shows what the developer chose to print), strace shows you everything the process is actually doing at the kernel level.

Installing strace

Most distros ship strace in their official repositories:

# Ubuntu / Debian
sudo apt install strace

# CentOS / RHEL / AlmaLinux
sudo yum install strace
# or dnf for newer versions
sudo dnf install strace

# Arch Linux
sudo pacman -S strace

# Check version
strace --version

No additional configuration needed. strace works via the kernel’s ptrace() syscall — as long as you have sufficient permissions to attach to the process (typically requires root or the same user).
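On Ubuntu and some other distros, the Yama security module adds a further restriction on top of the usual UID rules, so attaching to your own processes can still fail. The sysctl below is the standard knob for it:

```shell
# Yama's ptrace_scope can block attaching even to your own processes
# (0 = classic same-uid rules, 1 = only a parent may attach, 2 = admin-only, 3 = off)
cat /proc/sys/kernel/yama/ptrace_scope

# Temporarily relax it for a debugging session (resets on reboot)
sudo sysctl kernel.yama.ptrace_scope=0
```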

Practical Ways to Use strace

Run a Command Directly Through strace

The simplest approach — strace runs the command and prints every syscall:

strace ls /tmp

The output will be overwhelming. In practice, you’ll want to filter or write it to a file:

# Write to file for later analysis
strace -o /tmp/strace_ls.log ls /tmp

# Include timestamps (very useful)
strace -t -o /tmp/strace_ls.log ls /tmp

# Timestamps accurate to the microsecond
strace -tt -o /tmp/strace_ls.log ls /tmp
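Two related timestamp flags are worth knowing as well — a quick sketch of both:

```shell
# -ttt: epoch timestamps with microseconds — easy to correlate with other logs
strace -ttt -o /tmp/strace_ls.log ls /tmp

# -r: relative timestamp, i.e. the delta since the previous syscall
strace -r -o /tmp/strace_ls.log ls /tmp
```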

Attach to a Running Process

This is the most common real-world case — a service is running but misbehaving, and you can’t restart it. If the misbehaving process has gone fully silent, it’s also worth checking whether it has turned into a zombie process before attaching strace:

# Find the PID first
pgrep -a nginx
# or
ps aux | grep myapp

# Attach to the process
sudo strace -p 12345

# Attach to all threads (important for multi-threaded apps)
sudo strace -p 12345 -f

The -f flag (follow forks) is critical for multi-threaded apps or anything that spawns child processes — without -f you’ll miss all the syscalls from child threads.
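A companion flag to -f: when you're also writing to a file with -o, -ff splits the output per process, which keeps interleaved multi-process traces readable (12345 is a placeholder PID, as above):

```shell
# -ff with -o writes one trace file per process/thread: /tmp/trace.<pid>
sudo strace -ff -o /tmp/trace -p 12345

# Each child then has its own file, far easier to read than interleaved output
ls /tmp/trace.*
```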

Filter for Only the Syscalls You Care About

Use -e trace= to cut down on noise — this is what I use most often:

# Only show file-related syscalls
strace -e trace=file ls /tmp

# Only show network calls
strace -p 12345 -e trace=network

# Only show open/openat, read, write (modern libc usually calls openat, not open)
strace -e trace=open,openat,read,write -p 12345

# Only show errors (syscalls returning -1)
strace -e trace=all -e status=failed -p 12345

The -e status=failed option (available since strace 5.2) is incredibly powerful — it shows only the syscalls that failed and ignores everything that succeeded. Use it to hunt down permission errors or missing files in seconds.
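If your strace is reasonably recent, the syscall classes can also be written with a % prefix — a small sketch (check strace --version if the % form is rejected):

```shell
# %-prefixed syscall classes on newer strace versions
strace -e trace=%file ls /tmp        # same as trace=file
strace -e trace=%network -p 12345    # same as trace=network
strace -e trace=%process ls /tmp     # the fork/exec/exit family
```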

Measure Time per Syscall

# -T: show how long each syscall took
strace -T -p 12345 -e trace=file

# -c: aggregate statistics (very useful for bottleneck analysis)
strace -c ls /tmp

The output from -c looks like this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.13    0.000423          42        10           mmap
 23.41    0.000190          19        10           read
 12.50    0.000102          12         8         1 openat
  5.10    0.000041           5         8           fstat
  ...

The errors column shows which syscalls are failing. The % time column shows where the process spends its time in the kernel — note that by default this is system CPU time, not wall-clock time, so blocking calls can look cheaper than they feel.
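Two companion flags sharpen the -c summary — a sketch, assuming a reasonably recent strace:

```shell
# -w: summarize wall-clock latency instead of system CPU time —
# without it, blocking calls like epoll_wait look deceptively cheap
strace -c -w ls /tmp

# -S calls: sort the summary table by call count instead of % time
strace -c -S calls ls /tmp
```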

Debugging Real-World Problems

Case 1: Application Can’t Find Its Config File

A classic scenario: the app errors out but the log just vaguely says “config not found”.

# Filter file-related syscalls, only capture errors
strace -e trace=openat,open -e status=failed ./myapp 2>&1 | grep ENOENT

The output will point directly to the files the app is trying to read but can’t find:

openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/local/etc/myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)

Now you know exactly where the app is looking for its config, even if the code doesn’t document it clearly.

Case 2: Mysterious Permission Denied

# strace's own -u flag runs the command as appuser — putting sudo *inside*
# the trace doesn't work, because ptrace strips sudo's setuid privileges
sudo strace -u appuser -e trace=file -e status=failed ./myapp 2>&1 | grep EACCES

On that CentOS 7 server at work, I once ran into a service failing with “Permission denied” but ls -la showed the file permissions were clearly fine. It was only after running strace that I discovered SELinux was blocking it — the syscall returned EACCES but the error message didn’t mention SELinux at all. These subtle permission boundaries are also why it’s worth understanding Linux capabilities and fine-grained permissions — sometimes the issue isn’t file ownership at all, but a missing capability bit.

Case 3: Service Is Hanging — No Idea What It’s Waiting For

sudo strace -p $(pgrep myservice) -e trace=network,ipc

If you see the process stuck at:

epoll_wait(5, [], 1, 30000)             = 0  # waiting on network, 30s timeout
# or
futex(0x7f..., FUTEX_WAIT, ...)         # waiting on a lock

You know right away: the first one means it’s waiting on a network connection (possibly a backend timeout), the second means it’s waiting on a mutex lock (possibly a deadlock).
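The raw fd numbers in that output — the 5 in epoll_wait(5, ...) — can be resolved through /proc (12345 again stands in for the real PID):

```shell
# Resolve what fd 5 actually is (regular file, socket, pipe, ...)
ls -l /proc/12345/fd/5

# Or list every open fd with its target in one go
ls -l /proc/12345/fd/
```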

Case 4: Finding I/O Bottlenecks

# Collect aggregate stats over 30 seconds
sudo timeout 30 strace -c -f -p $(pgrep myapp) 2>&1

If you see read or write consuming more than 50% of time with a high usecs/call value — that’s a sign of slow I/O, and you should check your disk or NFS mount. At that point, pairing strace findings with a real-time view from iotop and htop helps confirm whether the bottleneck is process-level or system-wide.
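To see which file or socket those slow reads are actually hitting, the -y flag (present in reasonably recent strace versions) annotates every file descriptor with its path:

```shell
# -y: print the path or socket behind each file descriptor,
# so slow read()/write() calls show *what* they were touching
sudo strace -T -y -e trace=read,write -p $(pgrep myapp)
```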

Inspecting and Monitoring with strace

My Actual Debug Workflow

  1. Start with -c for a high-level overview: which syscalls are called most, which ones are failing.
  2. Filter to specific syscalls based on what step 1 revealed — don’t look at raw output, it’s too noisy.
  3. Use -tt -T when you need precise timing — look for unusually long gaps.
  4. Grep by error code: ENOENT (file not found), EACCES (permission denied), ECONNREFUSED (network), ETIMEDOUT (timeout).

Useful grep Commands When Analyzing strace Logs

# See all files that were successfully opened for reading
grep 'openat.*O_RDONLY' strace.log | grep -v '= -1'

# Find network connections
grep 'connect(' strace.log

# Find writes to stderr (fd=2)
grep 'write(2,' strace.log
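If the log was captured with -T, each line ends with the elapsed time in angle brackets, and a quick awk filter surfaces the slow calls (strace.log is the assumed file name, 0.1s a sample threshold):

```shell
# Print every syscall that took longer than 100ms — -T appends the
# duration as a trailing "<seconds>" field on each line
awk -F'<' '/<[0-9.]+>$/ { d = $NF; sub(/>/, "", d); if (d + 0 > 0.1) print }' strace.log
```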

A Note on Overhead

strace uses ptrace() — every syscall requires pausing the process so the kernel can report it. Overhead can range from 2x to 10x depending on workload. Don’t use it on production under heavy load — only use it while actively debugging, or use -c for aggregate stats instead of watching real-time output.

If you need production tracing with lower overhead, perf trace or bpftrace are better choices — but strace is still the first tool I reach for because it requires zero setup and just works.
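For completeness, a rough sketch of what those alternatives look like — 12345 is a placeholder PID, and the bpftrace one-liner assumes bpftrace is installed and run as root:

```shell
# perf trace: strace-like output with far less overhead (no ptrace stops)
sudo perf trace -p 12345

# bpftrace: count syscalls by name for one PID
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == 12345/ { @[probe] = count(); }'
```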

Wrapping Up

strace isn’t a tool you use every day, but when you need to debug problems that logs can’t explain — especially permission errors, missing files, network timeouts, or deadlocks — it saves a tremendous amount of time compared to reading source code or sprinkling in print statements and redeploying.

My recommended order when debugging with strace: start with -e status=failed to surface errors, then -c to spot bottlenecks, then drill into the details only if needed. Don’t start with raw output — you’ll get overwhelmed immediately.
