Building an Incident Response Process for Linux Servers: Investigation, Containment, and Recovery After an Attack

Security tutorial - IT technology blog

Last March, I got a call at 2 AM — a production server was pegging CPU at 100%, traffic was going out, and nobody knew why. No playbook, no checklist, everyone scrambling. The result: 6 hours to handle an incident that should have taken 90 minutes with a clear process in place.

I have audited more than ten production servers, and nearly all of them share the same blind spot: not technical skill, but process. The tools are there, the knowledge is there, but when a real incident hits, nobody knows what to do first or next. Here’s the process I actually use, with the specific commands you can apply right now.

What Is Incident Response — and Why Does It Need Its Own Process?

Straight to the point: Incident Response (IR) is just an ordered answer to a simple question — when you get hacked, what do you do first? Detect, contain, investigate, recover. Sounds simple, but without a process it’s chaos.

Why a fixed process beats winging it:

  • Under pressure, people skip critical steps — like backing up logs before removing malware
  • No checklist means the team duplicates effort or misses things entirely
  • No documentation means after the incident you can’t tell exactly what happened and can’t improve

The most widely used IR model is PICERL from SANS: Preparation → Identification → Containment → Eradication → Recovery → Lessons Learned. I’ll walk through each phase in order, straight to the practical part.

Step 1: Detect and Confirm the Incident

Before doing anything, confirm this is actually an incident — not just a false alarm from normal traffic spikes.

Check for Abnormal Processes

# List running processes, sorted by CPU usage
ps aux --sort=-%cpu | head -20

# Check for hidden processes: PIDs that exist in /proc but are missing from ps output.
# Short-lived processes (including these commands) also show up, so re-run to confirm.
ls /proc | grep -E '^[0-9]+$' | sort -n > /tmp/proc_pids.txt
ps -e -o pid= | tr -d ' ' | sort -n > /tmp/ps_pids.txt
diff /tmp/proc_pids.txt /tmp/ps_pids.txt

# Show process tree, spot abnormal shell spawns
pstree -a -p
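
Another quick check that fits here is looking for processes whose binary has been deleted from disk, a common trick to keep malware running without leaving a file behind. The sketch below is my own addition, not part of any standard tool: `find_deleted_exes` is a hypothetical helper name, and the optional directory argument exists only so it can be tried against a test directory. Expect some false positives, since legitimate services show up too after a package upgrade until they are restarted.

```shell
# List processes whose on-disk executable has been deleted.
# proc_root defaults to /proc; the parameter is only there so the
# function can be exercised against a fixture directory.
find_deleted_exes() {
    proc_root="${1:-/proc}"
    for exe in "$proc_root"/[0-9]*/exe; do
        # readlink fails for kernel threads and for processes we cannot inspect
        target=$(readlink "$exe" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)")
                pid=${exe%/exe}
                pid=${pid##*/}
                printf '%s\t%s\n' "$pid" "$target"
                ;;
        esac
    done
}
```

Run it as root with no arguments (`find_deleted_exes`) to scan the live system; each output line is a PID followed by the path the kernel still remembers.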

Check Network Connections

# Active connections
ss -tulnp
netstat -antp | grep ESTABLISHED

# Show outbound traffic by process
lsof -i -n -P | grep -v LISTEN

Check Logged-In Users

who
w
last -n 20          # Last 20 logins
lastb -n 20         # Last 20 failed login attempts

If you see unknown processes hogging CPU or bandwidth, connections to unrecognized foreign IPs, or an unfamiliar user online at 3 AM, treat it as a real incident and move to the next step immediately.
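
Spotting "unrecognized" IPs by eye gets unreliable at 2 AM, so here is a minimal sketch that filters established connections against a list of addresses you already trust. Everything named here is an assumption of mine: `unknown_peers` is a hypothetical helper, the allowlist is a plain file with one address per line that you would maintain yourself, and the parsing assumes the column layout of `ss -tn state established` (peer address in the fourth column). It handles IPv4 cleanly; IPv6 peers keep their brackets.

```shell
# Print remote addresses from `ss -tn state established` output (read on
# stdin) that do not appear in an allowlist file (one address per line).
unknown_peers() {
    allowlist="$1"
    # Data rows start with the numeric Recv-Q column; the header row does not
    awk '$1 ~ /^[0-9]+$/ { peer = $4; sub(/:[0-9]+$/, "", peer); print peer }' \
        | sort -u \
        | grep -vxF -f "$allowlist"
}
```

Typical use would be `ss -tn state established | unknown_peers /root/known_peers.txt`, where the file path is just an example.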

Step 2: Isolate the System Immediately

This is the most skipped step — and usually the most costly mistake. Many teams jump straight into investigation while the attacker is still connected, continuing to exfiltrate data. Hard rule: isolate first, investigate second.

# Block all traffic, keep only your IP connected
YOUR_IP="1.2.3.4"  # Replace with your actual IP

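# Save the existing rules before flushing: they may be evidence, and you need them to roll back
iptables-save > /root/iptables-before-isolation.rules
iptables -F && iptables -X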

# Allow your IP
iptables -A INPUT  -s $YOUR_IP -j ACCEPT
iptables -A OUTPUT -d $YOUR_IP -j ACCEPT

# Allow loopback
iptables -A INPUT  -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT

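# Block everything else
iptables -A INPUT   -j DROP
iptables -A OUTPUT  -j DROP
iptables -A FORWARD -j DROP

# The rules above cover IPv4 only; if the host has IPv6 connectivity, drop it too
ip6tables -F
ip6tables -P INPUT DROP
ip6tables -P OUTPUT DROP
ip6tables -P FORWARD DROP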

Why not just shut down the server? RAM holds a lot of valuable evidence: running processes, encryption keys, network state — all gone the moment you power off. Network isolation preserves the evidence while keeping the attacker locked out.

If you absolutely must shut down (for example, ransomware is actively encrypting files), cut the power instead of running shutdown. Reason: attackers can embed cleanup scripts that run during a graceful shutdown.
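
Because the lockdown rules in this step drop everything except one address, a typo in YOUR_IP locks you out along with the attacker. A small guard helps; the sketch below is an assumption-level example (`is_valid_ipv4` is a hypothetical helper, not something iptables provides) that only checks the address is a well-formed IPv4 dotted quad. It cannot tell you whether it is actually your address.

```shell
# Return 0 only if the argument is a dotted-quad IPv4 address with each
# octet in the range 0-255.
is_valid_ipv4() {
    ip="$1"
    # Shape check: four groups of 1-3 digits separated by dots
    printf '%s\n' "$ip" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' || return 1
    # Range check on each octet
    old_ifs=$IFS; IFS=.
    set -- $ip
    IFS=$old_ifs
    for octet in "$@"; do
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Example guard before applying any rules:
#   is_valid_ipv4 "$YOUR_IP" || { echo "Bad admin IP, not touching firewall" >&2; exit 1; }
```

If you are connected over SSH, comparing YOUR_IP with the first field of `$SSH_CLIENT` is a cheap extra sanity check.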

Step 3: Collect Evidence and Investigate

Once the network is isolated, collect evidence in priority order: volatile data (RAM, network state) first, non-volatile (disk, logs) second. Don’t reverse this order.

Collect System Information

# Create an evidence directory with a timestamp
EVIDENCE_DIR="/tmp/ir-$(date +%Y%m%d-%H%M%S)"
mkdir -p $EVIDENCE_DIR

date                > $EVIDENCE_DIR/timestamp.txt
uname -a            > $EVIDENCE_DIR/sysinfo.txt
uptime             >> $EVIDENCE_DIR/sysinfo.txt
ps auxf             > $EVIDENCE_DIR/processes.txt
ss -tulnp           > $EVIDENCE_DIR/network.txt
netstat -rn         > $EVIDENCE_DIR/routes.txt
lsof -n             > $EVIDENCE_DIR/open_files.txt
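
Collected evidence is only useful later if you can show it was not altered. One way to do that, sketched here under my own assumptions, is to hash every file in the evidence directory into a manifest right after collection. `write_evidence_manifest` and the `MANIFEST.sha256` filename are hypothetical names I picked for illustration; the commands rely on GNU find, sort, and xargs.

```shell
# Write a SHA-256 manifest covering every file in an evidence directory.
# The manifest itself is excluded so it can be regenerated safely.
write_evidence_manifest() {
    dir="$1"
    (
        cd "$dir" || exit 1
        find . -type f ! -name MANIFEST.sha256 -print0 \
            | sort -z \
            | xargs -0 -r sha256sum > MANIFEST.sha256
    )
}
```

After `write_evidence_manifest "$EVIDENCE_DIR"`, copy the directory off the server and re-check it at any time with `cd "$EVIDENCE_DIR" && sha256sum -c --quiet MANIFEST.sha256`.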

Find Recently Modified Files

# Files modified in the last 24 hours (skip /proc, /sys, /dev)
find / -not \( -path /proc -prune \) \
       -not \( -path /sys  -prune \) \
       -not \( -path /dev  -prune \) \
       -mtime -1 -type f 2>/dev/null > $EVIDENCE_DIR/recent_files.txt

# Suspicious SUID/SGID files
find / -perm /6000 -type f 2>/dev/null > $EVIDENCE_DIR/suid_files.txt
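
A raw SUID/SGID listing is long, and most of it is legitimate. It becomes much more useful when compared with a known-good list. The sketch below assumes you saved such a list during preparation (the baseline file is hypothetical; nothing earlier in this article creates it) and `new_suid_files` is a helper name of my own.

```shell
# Print SUID/SGID paths present in the current list but absent from a
# known-good list. Both inputs are plain files with one path per line.
new_suid_files() {
    baseline="$1"
    current="$2"
    b_sorted=$(mktemp) && c_sorted=$(mktemp) || return 1
    sort -u "$baseline" > "$b_sorted"
    sort -u "$current"  > "$c_sorted"
    # comm -13 keeps only the lines unique to the second file
    comm -13 "$b_sorted" "$c_sorted"
    rm -f "$b_sorted" "$c_sorted"
}
```

For example: `new_suid_files /var/lib/ir-baseline/suid_files.txt "$EVIDENCE_DIR/suid_files.txt"`. Anything it prints deserves a close look.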

Analyze Logs for Signs of Intrusion

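# SSH brute force attempts and suspicious successful logins
# (take the field after "from": its position shifts on "invalid user" lines)
grep "Failed password" /var/log/auth.log \
    | awk '{for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1)}' \
    | sort | uniq -c | sort -rn | head -20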

grep "Accepted password\|Accepted publickey" /var/log/auth.log | tail -50

# Newly added crontab/at jobs
grep -i "cron\|atd" /var/log/syslog | tail -50
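
The two auth.log checks above are most telling when combined: an IP with many failed passwords followed by an accepted login is a likely successful brute force. Here is a rough sketch of that correlation. `bruteforce_successes` is a hypothetical helper, the default threshold of 5 is an arbitrary starting point, and the parsing assumes the usual OpenSSH message format in which the source address follows the word "from".

```shell
# From an sshd auth log, print source IPs that have at least `threshold`
# failed password attempts AND at least one accepted login.
# Output columns: ip, failed count, accepted count.
bruteforce_successes() {
    logfile="$1"
    threshold="${2:-5}"
    awk -v min="$threshold" '
        # The source address is the field that follows the word "from"
        function src(   i) {
            for (i = 1; i < NF; i++) if ($i == "from") return $(i + 1)
            return ""
        }
        /sshd/ && /Failed password/ { ip = src(); if (ip != "") failed[ip]++ }
        /sshd/ && /Accepted (password|publickey)/ { ip = src(); if (ip != "") ok[ip]++ }
        END {
            for (ip in ok)
                if (failed[ip] >= min) print ip, failed[ip], ok[ip]
        }
    ' "$logfile" | sort
}
```

Usage: `bruteforce_successes /var/log/auth.log 10`. An empty result does not clear the host; it only means this particular pattern is absent from the log you still have.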

Check Persistence Mechanisms

# Crontabs for all users
for user in $(cut -d: -f1 /etc/passwd); do
    echo "=== $user ==="
    crontab -u $user -l 2>/dev/null
done

# Unusual systemd services that are running
systemctl list-units --type=service --state=running

# SSH authorized_keys — look for unfamiliar keys added
find /home /root -name "authorized_keys" \
    -exec echo "=== {} ===" \; -exec cat {} \; 2>/dev/null

# Accounts with a login shell (compare against the users you expect; an unfamiliar one is a red flag)
grep -v "nologin\|false" /etc/passwd
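
One more persistence trick worth checking alongside the account list is a second account with UID 0, which is root under a different name. A minimal sketch, with `extra_uid0_accounts` as a hypothetical helper name and the file argument added only so it can be tested on a copy:

```shell
# Print account names with UID 0 other than root, from a passwd-format file.
extra_uid0_accounts() {
    passwd_file="${1:-/etc/passwd}"
    awk -F: '$3 == 0 && $1 != "root" { print $1 }' "$passwd_file"
}
```

On a typical server this should print nothing at all.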

Step 4: Eradicate the Threat

Only start removing things once you know exactly what the problem is. Don’t delete by gut feeling — every action needs a timestamp so your incident report is actually usable afterward.
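
Timestamping every action by hand rarely survives a stressful night, so it helps to let the shell do it. This is a sketch of one way to do that, not an established tool: `log_and_run` and the `IR_TIMELINE` variable are names I made up, and the log format (UTC time, user, command, tab-separated) is just a reasonable default.

```shell
# Append a UTC-timestamped entry to a timeline file, then run the command.
# IR_TIMELINE is a hypothetical variable; point it at your evidence directory.
IR_TIMELINE="${IR_TIMELINE:-./ir-timeline.log}"

log_and_run() {
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$*" >> "$IR_TIMELINE"
    "$@"
}
```

You would then prefix each eradication command, for example `log_and_run userdel -r suspicious_username`, and end up with a ready-made timeline for the incident report.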

# Remove a fraudulent user
userdel -r suspicious_username

# Remove unknown SSH key from authorized_keys
# Open the file, find and delete the unrecognized key line
nano /root/.ssh/authorized_keys

# Kill malware process by PID
kill -9 <PID>

# Remove crontab for a compromised user
crontab -r -u www-data

# Reinstall binaries that may have been replaced by the attacker
apt-get install --reinstall openssh-server nginx

One mistake I see repeatedly: teams remove malware and then realize they never grabbed a hash or sample. By then there’s nothing left to analyze. Back up the evidence first, delete second — no exceptions.
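
To make "back up first, delete second" hard to skip, wrap it in a helper. The sketch below is an illustration under my own assumptions: `preserve_sample`, the `samples/` subdirectory, and the `SAMPLES.txt` log are hypothetical names, and `stat -c` is the GNU coreutils form.

```shell
# Copy a suspicious file into the evidence directory and record its hash
# and metadata before anything is deleted.
preserve_sample() {
    src="$1"
    evidence_dir="$2"
    mkdir -p "$evidence_dir/samples" || return 1
    dest="$evidence_dir/samples/$(basename "$src").sample"
    # -p keeps the original timestamps and mode on the copy
    cp -p "$src" "$dest" || return 1
    # Strip execute bits so the sample cannot be run by accident
    chmod a-x "$dest"
    {
        sha256sum "$src"
        stat -c 'size=%s mode=%a owner=%U mtime=%y' "$src"
    } >> "$evidence_dir/samples/SAMPLES.txt"
}
```

Only once `preserve_sample /path/to/malware "$EVIDENCE_DIR"` has succeeded would you go ahead and remove the original.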

Step 5: Safely Restore the System

Restoring from backup without understanding the attack vector just buys time. The vulnerability is still there; you’ll repeat the incident in 2–3 weeks — I’ve watched this happen at least 3 times across servers I’ve audited.

Checklist before bringing the server back to production:

  1. Confirm the attack vector has been patched (patch the CVE, rotate leaked credentials, fix the misconfiguration)
  2. Restore from a backup taken before the suspected compromise date
  3. Verify backup integrity using checksums
  4. Reset passwords for all affected user accounts

# Force password change on next login
chage -d 0 username

# Regenerate SSH host keys (important if old keys were exposed)
rm /etc/ssh/ssh_host_*
dpkg-reconfigure openssh-server

# Enable auditd for monitoring after restore
systemctl enable --now auditd
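
Item 2 of the checklist, restoring from a backup taken before the suspected compromise date, is easy to get wrong by grabbing the most recent backup out of habit. Here is a minimal sketch of picking the right one. It assumes backups are named `backup-YYYY-MM-DD.tar.gz`, which is a naming scheme I invented for the example, and `latest_backup_before` is likewise a hypothetical helper; adapt both to whatever your backup tool produces.

```shell
# Print the newest backup file whose embedded date is strictly earlier than
# the suspected compromise date. Assumes names like backup-YYYY-MM-DD.tar.gz
# (a hypothetical scheme), so the dates compare correctly as plain strings.
latest_backup_before() {
    backup_dir="$1"
    cutoff="$2"   # YYYY-MM-DD
    for f in "$backup_dir"/backup-????-??-??.tar.gz; do
        [ -e "$f" ] || continue
        d=${f##*/backup-}
        d=${d%.tar.gz}
        printf '%s %s\n' "$d" "$f"
    done | sort | awk -v c="$cutoff" '
        # Column 1 is the 10-character date; the path starts at character 12
        $1 < c { pick = substr($0, 12) }
        END { if (pick != "") print pick }'
}
```

For example, `latest_backup_before /backups 2024-03-10` prints the newest archive dated 2024-03-09 or earlier, or nothing if no backup is old enough. Be conservative with the cutoff: the compromise usually started earlier than the first symptom.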

Prepare Ahead to Make Incidents Manageable

80% of the complexity in every incident I’ve handled came from a lack of preparation. Three things to do today, not when you need them:

Baseline Your System — Snapshot Normal State

# Run on a schedule (weekly cron), save for later comparison
mkdir -p /var/lib/ir-baseline

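# SHA-256 rather than MD5: MD5 collisions are practical, which makes it a weak integrity check
sha256sum /usr/sbin/* /usr/bin/* > /var/lib/ir-baseline/bin_hashes.txt
ss -tulnp                        > /var/lib/ir-baseline/network.txt
ps auxf                          > /var/lib/ir-baseline/processes.txt
crontab -l                       > /var/lib/ir-baseline/crontab.txt   # crontab of the current user only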
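
A baseline only pays off if comparing against it is a one-liner during an incident. The sketch below shows one way to list binaries that no longer match the recorded hashes. It assumes the hash file is in `sha256sum` format (swap in `md5sum` if that is what your baseline was written with), and `changed_since_baseline` is a hypothetical helper name.

```shell
# Print files whose current hash no longer matches the recorded baseline,
# including files that have gone missing.
# Expects the baseline in sha256sum format; use md5sum here instead if the
# baseline was written with md5sum.
changed_since_baseline() {
    baseline="$1"
    # LC_ALL=C keeps the "FAILED" marker untranslated so sed can match it
    LC_ALL=C sha256sum --quiet -c "$baseline" 2>/dev/null \
        | sed -n 's/: FAILED.*$//p'
}
```

Usage: `changed_since_baseline /var/lib/ir-baseline/bin_hashes.txt`. Package upgrades change hashes too, so refresh the baseline after every patch cycle, and keep a copy off the server where an attacker cannot rewrite it.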

Ship Logs Off-Server Immediately

Attackers typically wipe local logs after gaining access. Start forwarding logs to an external syslog server from day one:

# Add to /etc/rsyslog.conf ("@" forwards over UDP; use "@@" for TCP if dropped messages are not acceptable)
echo "*.* @your-syslog-server:514" >> /etc/rsyslog.conf
systemctl restart rsyslog
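
One practical catch with appending via `echo >>` is that re-running your setup script adds the same forwarding line again. A small idempotent variant is sketched below; `add_forward_rule` is a hypothetical helper and `logs.example.internal` is a placeholder hostname, not a real server.

```shell
# Append a forwarding rule to an rsyslog config file only if the exact
# line is not already present, so re-running setup does not duplicate it.
add_forward_rule() {
    conf="$1"
    target="$2"            # e.g. @@logs.example.internal:514 (placeholder host)
    rule="*.* $target"
    grep -qxF -- "$rule" "$conf" 2>/dev/null || printf '%s\n' "$rule" >> "$conf"
}
```

For example, `add_forward_rule /etc/rsyslog.conf "@@logs.example.internal:514"`, followed by `rsyslogd -N1` to validate the configuration before restarting the service.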

IR Toolkit — Install Before You Need It

apt-get install -y chkrootkit rkhunter tcpdump strace lsof auditd

The single biggest difference-maker after all of this: document every incident, no matter how small. Post-mortems aren’t about finding someone to blame — they’re about handling the next one 50% faster. I keep an incidents.md file with the timeline, attack vector, and lessons learned from each case. After a year, it’s the best training material I have.
