Building an Incident Response Process for Linux Servers: Investigation, Containment, and Recovery After an Attack – ITFROMZERO

Last March, I got a call at 2 AM — a production server was pegging CPU at 100%, traffic was going out, and nobody knew why. No playbook, no checklist, everyone scrambling. The result: 6 hours to handle an incident that should have taken 90 minutes with a clear process in place.

Across the 10+ production servers I’ve audited, nearly all share the same blind spot: process, not technical skill. The tools are there, the knowledge is there, but when a real incident hits, nobody knows what to do first or next. Here’s the process I actually use, with the specific commands you can apply right now.

What Is Incident Response — and Why Does It Need Its Own Process?

Straight to the point: Incident Response (IR) is just an ordered answer to a simple question — when you get hacked, what do you do first? Detect, contain, investigate, recover. Sounds simple, but without a process it’s chaos.

Why a fixed process beats winging it:

Under pressure, people skip critical steps — like backing up logs before removing malware
No checklist means the team duplicates effort or misses things entirely
No documentation means after the incident you can’t tell exactly what happened and can’t improve

The most widely used IR model is PICERL from SANS: Preparation → Identification → Containment → Eradication → Recovery → Lessons Learned. I’ll walk through each phase in order, straight to the practical part.

Step 1: Detect and Confirm the Incident

Before doing anything, confirm this is actually an incident — not just a false alarm from normal traffic spikes.

Check for Abnormal Processes

# List running processes, sorted by CPU usage
ps aux --sort=-%cpu | head -20

# Check for hidden processes: a PID that exists in /proc but is missing from
# ps output is suspicious (short-lived processes cause a few harmless differences)
ls /proc | grep -E '^[0-9]+$' | sort -n > /tmp/proc_pids.txt
ps -e -o pid= | tr -d ' ' | sort -n > /tmp/ps_pids.txt
diff /tmp/proc_pids.txt /tmp/ps_pids.txt

# Show process tree, spot abnormal shell spawns
pstree -a -p
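
One more check that fits here: malware often deletes its own binary after launching, and the kernel marks that by appending " (deleted)" to the /proc/PID/exe link target. Here is a minimal sketch of that check. The function name find_deleted_exes and its optional proc_root argument are my own, and the argument exists only so the function can be tried against a fake directory. Run it as root, otherwise other users' processes are skipped silently.

```shell
# List PIDs whose executable no longer exists on disk.
# proc_root defaults to /proc; it is a parameter only to make testing possible.
find_deleted_exes() {
    proc_root="${1:-/proc}"
    for exe in "$proc_root"/[0-9]*/exe; do
        # Kernel threads and inaccessible processes have no readable link: skip them
        target=$(readlink "$exe" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)")
                pid=${exe%/exe}
                pid=${pid##*/}
                printf '%s %s\n' "$pid" "$target"
                ;;
        esac
    done
}

# Usage on a live host:
#   find_deleted_exes
```

A hit is not proof of compromise (a package upgrade also replaces binaries under running services), but anything running from /tmp or /dev/shm deserves a close look.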

Check Network Connections

# Listening sockets and the processes that own them
ss -tulnp

# Established TCP connections
netstat -antp | grep ESTABLISHED

# Show outbound traffic by process
lsof -i -n -P | grep -v LISTEN
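
When the connection list is long, a per-peer count makes outliers easier to spot. The sketch below is one way to do it. count_peers is a hypothetical helper name, and the usage line assumes column 4 of `ss -Htn state established` is the peer address, which is worth confirming against your ss version.

```shell
# Summarize remote peers from "addr:port" lines on stdin, busiest first.
count_peers() {
    sed 's/:[0-9]*$//' \
        | awk 'NF { seen[$0]++ } END { for (ip in seen) print seen[ip], ip }' \
        | sort -rn
}

# Usage on a live host (column 4 assumed to be the peer address):
#   ss -Htn state established | awk '{print $4}' | count_peers
```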

Check Logged-In Users

who
w
last -n 20          # Last 20 logins
lastb -n 20         # Last 20 failed login attempts

If you see unknown processes hogging CPU or bandwidth, connections to unrecognized foreign IPs, or an unfamiliar user online at 3 AM, treat it as a real incident and move to the next step immediately.

Step 2: Isolate the System Immediately

This is the most skipped step — and usually the most costly mistake. Many teams jump straight into investigation while the attacker is still connected, continuing to exfiltrate data. Hard rule: isolate first, investigate second.

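# Block all traffic, keep only your IP connected
YOUR_IP="1.2.3.4"  # Replace with your actual IP

iptables -F && iptables -X

# Allow your IP
iptables -A INPUT  -s "$YOUR_IP" -j ACCEPT
iptables -A OUTPUT -d "$YOUR_IP" -j ACCEPT

# Allow loopback
iptables -A INPUT  -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT

# Block everything else
iptables -A INPUT   -j DROP
iptables -A OUTPUT  -j DROP
iptables -A FORWARD -j DROP

# The rules above cover IPv4 only. Drop IPv6 as well so it is not a bypass
# (this assumes your own session runs over IPv4)
ip6tables -F
ip6tables -P INPUT   DROP
ip6tables -P OUTPUT  DROP
ip6tables -P FORWARD DROP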
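
Applying isolation rules one command at a time leaves a short window where the ruleset is half-built. One alternative, sketched below, is to generate the whole IPv4 filter table in iptables-restore format and load it in a single step. The function name isolation_ruleset is hypothetical, and the IP check is deliberately crude (digits and dots only), not a full validator.

```shell
# Print an iptables-restore ruleset that drops everything except
# loopback and one admin IPv4 address.
isolation_ruleset() {
    admin_ip="$1"
    # Crude sanity check: digits and dots only. Not a full IPv4 validator.
    case "$admin_ip" in
        ''|*[!0-9.]*)
            echo "usage: isolation_ruleset <admin-ipv4>" >&2
            return 1
            ;;
    esac
    cat <<EOF
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
-A INPUT -s $admin_ip -j ACCEPT
-A OUTPUT -d $admin_ip -j ACCEPT
COMMIT
EOF
}

# Usage (as root): validate first, then apply
#   isolation_ruleset 1.2.3.4 | iptables-restore --test \
#       && isolation_ruleset 1.2.3.4 | iptables-restore
```

Without --noflush, iptables-restore replaces the existing filter table, so the old rules and the new ones never coexist. Other tables (nat, mangle) are left untouched by this sketch.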

Why not just shut down the server? RAM holds a lot of valuable evidence: running processes, encryption keys, network state — all gone the moment you power off. Network isolation preserves the evidence while keeping the attacker locked out.

If you absolutely must shut down (for example, ransomware is actively encrypting files), cut the power hard instead of running shutdown. Reason: attackers often plant cleanup scripts that run during a graceful shutdown, and a hard power-off never gives them the chance.

Step 3: Collect Evidence and Investigate

Once the network is isolated, collect evidence in priority order: volatile data (RAM, network state) first, non-volatile (disk, logs) second. Don’t reverse this order.

Collect System Information

# Create an evidence directory with a timestamp
# (/tmp sits on the compromised disk and may be cleared on reboot,
#  so copy this directory off the host as soon as collection finishes)
EVIDENCE_DIR="/tmp/ir-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$EVIDENCE_DIR"

date                > "$EVIDENCE_DIR/timestamp.txt"
uname -a            > "$EVIDENCE_DIR/sysinfo.txt"
uptime             >> "$EVIDENCE_DIR/sysinfo.txt"
ps auxf             > "$EVIDENCE_DIR/processes.txt"
ss -tulnp           > "$EVIDENCE_DIR/network.txt"
netstat -rn         > "$EVIDENCE_DIR/routes.txt"
lsof -n             > "$EVIDENCE_DIR/open_files.txt"
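
Evidence is only useful later if you can show it has not changed since collection. A minimal sketch of that idea: hash every collected file into a manifest, then re-check the manifest whenever the directory is copied or reopened. The names seal_evidence, verify_evidence, and MANIFEST.sha256 are my own choices, not a standard.

```shell
# Write a SHA-256 manifest covering every file in the evidence directory.
seal_evidence() {
    ( cd "$1" || exit 1
      find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + > MANIFEST.sha256 )
}

# Re-check the manifest; exits non-zero if any file changed or went missing.
verify_evidence() {
    ( cd "$1" || exit 1
      sha256sum -c --quiet MANIFEST.sha256 )
}

# Usage:
#   seal_evidence "$EVIDENCE_DIR"
#   verify_evidence "$EVIDENCE_DIR" && echo "evidence intact"
```

Store a copy of the manifest somewhere the compromised host cannot write to, otherwise the check only proves the files match a manifest the attacker could also have edited.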

Find Recently Modified Files

# Files modified in the last 24 hours (skip /proc, /sys, /dev)
find / -not \( -path /proc -prune \) \
       -not \( -path /sys  -prune \) \
       -not \( -path /dev  -prune \) \
       -mtime -1 -type f 2>/dev/null > $EVIDENCE_DIR/recent_files.txt

# Suspicious SUID/SGID files
find / -perm /6000 -type f 2>/dev/null > $EVIDENCE_DIR/suid_files.txt

Analyze Logs for Signs of Intrusion

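# SSH brute force attempts and suspicious successful logins
# (Debian/Ubuntu path; RHEL-family systems log to /var/log/secure)
# Print the field after "from" so "invalid user" lines still yield the IP
grep "Failed password" /var/log/auth.log \
    | awk '{for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1)}' \
    | sort | uniq -c | sort -rn | head -20

grep "Accepted password\|Accepted publickey" /var/log/auth.log | tail -50

# Recent cron/at activity (job runs and crontab edits)
grep -i "cron\|atd" /var/log/syslog | tail -50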
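
The two SSH greps get more useful when you join them: an IP that failed many times and then logged in successfully is the pattern to chase first. The sketch below does that join in awk. bruteforce_successes is a hypothetical helper, the default threshold of 5 failures is an arbitrary starting point, and it assumes the usual sshd wording where the source IP follows the word "from".

```shell
# Print "<failure count> <ip>" for every IP that has at least MIN failed
# password attempts AND at least one accepted login in the same log.
bruteforce_successes() {
    log="$1"
    min="${2:-5}"
    awk -v min="$min" '
        # Return the field that follows the word "from", or "" if absent
        function src_ip(    i) {
            for (i = 1; i < NF; i++) if ($i == "from") return $(i + 1)
            return ""
        }
        /Failed password/               { ip = src_ip(); if (ip != "") fails[ip]++ }
        /Accepted (password|publickey)/ { ip = src_ip(); if (ip != "") ok[ip] = 1 }
        END { for (ip in ok) if (fails[ip] >= min) print fails[ip], ip }
    ' "$log" | sort -rn
}

# Usage:
#   bruteforce_successes /var/log/auth.log 10
```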

Check Persistence Mechanisms

# Crontabs for all users
for user in $(cut -d: -f1 /etc/passwd); do
    echo "=== $user ==="
    crontab -u $user -l 2>/dev/null
done

# Unusual systemd services that are running
systemctl list-units --type=service --state=running

# SSH authorized_keys — look for unfamiliar keys added
find /home /root -name "authorized_keys" \
    -exec echo "=== {} ===" \; -exec cat {} \; 2>/dev/null

# Accounts with a login shell: look for ones you don't recognize
grep -v "nologin\|false" /etc/passwd
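
Two narrower account checks are worth adding here: any account other than root with UID 0, and regular accounts that can log in. A rough sketch follows. Both function names are hypothetical, and the UID 1000 cutoff assumes the Debian/Ubuntu default for regular users (check UID_MIN in /etc/login.defs on your system).

```shell
# Accounts other than root that have UID 0 (full root privileges).
uid0_accounts() {
    passwd_file="${1:-/etc/passwd}"
    awk -F: '$3 == 0 && $1 != "root" { print $1 }' "$passwd_file"
}

# Regular accounts (UID >= 1000, an assumed cutoff) with a usable login shell.
human_accounts() {
    passwd_file="${1:-/etc/passwd}"
    awk -F: '$3 >= 1000 && $7 !~ /(nologin|false)$/ { print $1 ":" $3 ":" $7 }' "$passwd_file"
}

# Usage:
#   uid0_accounts      # any output at all needs an explanation
#   human_accounts
```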

Step 4: Eradicate the Threat

Only start removing things once you know exactly what the problem is. Don’t delete by gut feeling — every action needs a timestamp so your incident report is actually usable afterward.

# Remove a fraudulent user
# (-r also deletes the home directory, so archive it as evidence first)
userdel -r suspicious_username

# Remove unknown SSH key from authorized_keys
# Open the file, find and delete the unrecognized key line
nano /root/.ssh/authorized_keys

# Kill malware process by PID
kill -9 <PID>

# Remove crontab for a compromised user
crontab -r -u www-data

# Reinstall binaries that may have been replaced by the attacker
apt-get install --reinstall openssh-server nginx
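
To get a timestamp on every eradication action without relying on memory, you can wrap each command in a small logger. This is a sketch, not a hardened audit trail: run_logged and the IR_LOG variable are names I made up, and the default log path is just the current directory.

```shell
# Where actions are recorded; override by exporting IR_LOG beforehand.
IR_LOG="${IR_LOG:-./ir-actions.log}"

# Run a command, recording UTC start time, the command line, and its exit code.
run_logged() {
    printf '%s\tSTART\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$IR_LOG"
    "$@"
    rc=$?
    printf '%s\tEXIT=%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" "$*" >> "$IR_LOG"
    return "$rc"
}

# Usage:
#   run_logged crontab -r -u www-data
#   run_logged apt-get install --reinstall openssh-server
```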

One mistake I see repeatedly: teams remove malware and then realize they never grabbed a hash or sample. By then there’s nothing left to analyze. Back up the evidence first, delete second — no exceptions.
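
Here is a minimal sketch of "hash and sample first". It copies the file into a quarantine directory named by its SHA-256, strips all permissions from the copy, and appends a timestamped line to a log. The function name quarantine_sample, the .sample suffix, and samples.log are all hypothetical choices.

```shell
# Preserve a suspicious file before deleting it.
# $1 = path to the file, $2 = quarantine directory (created if missing)
quarantine_sample() {
    src="$1"
    dest="$2"
    [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    hash=$(sha256sum "$src" | awk '{print $1}')
    # Keep the original timestamps, then make the copy non-executable and unreadable
    cp -p "$src" "$dest/$hash.sample" && chmod 000 "$dest/$hash.sample" || return 1
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$hash" "$src" >> "$dest/samples.log"
}

# Usage:
#   quarantine_sample /tmp/.hidden/miner /root/ir-quarantine && rm /tmp/.hidden/miner
```

The hash alone is often enough to look the sample up in public malware databases, even if you never analyze the binary yourself.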

Step 5: Safely Restore the System

Restoring from backup without understanding the attack vector just buys time. The vulnerability is still there; you’ll repeat the incident in 2–3 weeks — I’ve watched this happen at least 3 times across servers I’ve audited.

Checklist before bringing the server back to production:

Confirm the attack vector has been patched (patch the CVE, rotate leaked credentials, fix the misconfiguration)
Restore from a backup taken before the suspected compromise date
Verify backup integrity using checksums
Reset passwords for all affected user accounts

# Force password change on next login
chage -d 0 username

# Regenerate SSH host keys (important if old keys were exposed)
rm /etc/ssh/ssh_host_*
dpkg-reconfigure openssh-server

# Enable auditd for monitoring after restore
systemctl enable --now auditd
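
Enabling auditd does little until it has rules to enforce. As a starting point, the sketch below writes a few file watches covering the persistence spots checked earlier. The function name, the file name ir-watch.rules, and the key names (identity, privilege, sshd, persistence) are my own picks, so adjust the paths to what matters on your server.

```shell
# Write a small set of audit watch rules to a rules.d file.
write_audit_rules() {
    rules_file="${1:-/etc/audit/rules.d/ir-watch.rules}"
    cat > "$rules_file" <<'EOF'
# Account and authentication configuration: log writes and attribute changes
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k privilege
-w /etc/ssh/sshd_config -p wa -k sshd
# Common persistence locations
-w /etc/crontab -p wa -k persistence
-w /etc/cron.d/ -p wa -k persistence
-w /root/.ssh/ -p wa -k persistence
EOF
}

# Usage (as root):
#   write_audit_rules
#   augenrules --load          # compile rules.d and load into the kernel
#   ausearch -k persistence    # later: show events tagged with that key
```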

Prepare Ahead to Make Incidents Manageable

80% of the complexity in every incident I’ve handled came from a lack of preparation. Three things to do today, not when you need them:

Baseline Your System — Snapshot Normal State

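# Run on a schedule (weekly cron), save for later comparison
# Keep a copy off-server too: an attacker with root can rewrite a local baseline
mkdir -p /var/lib/ir-baseline

# SHA-256 rather than MD5, which is not collision-resistant
sha256sum /usr/sbin/* /usr/bin/* > /var/lib/ir-baseline/bin_hashes.txt 2>/dev/null
ss -tulnp                        > /var/lib/ir-baseline/network.txt
ps auxf                          > /var/lib/ir-baseline/processes.txt
crontab -l                       > /var/lib/ir-baseline/crontab.txt   # current user's crontab only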
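
A baseline only pays off if you compare against it. The sketch below re-checks a saved hash file and prints the paths that no longer match. changed_files is a hypothetical name; it guesses the hash tool from the length of the first hash (32 hex characters for MD5, 64 for SHA-256), so it should work with a baseline produced by either md5sum or sha256sum.

```shell
# Print files whose current hash differs from the baseline (or that are missing).
changed_files() {
    baseline="$1"
    first_hash=$(awk 'NR == 1 { print $1 }' "$baseline")
    case ${#first_hash} in
        32) tool=md5sum ;;
        64) tool=sha256sum ;;
        *)  echo "unrecognized hash format in $baseline" >&2; return 2 ;;
    esac
    # LC_ALL=C keeps the "FAILED" marker untranslated so sed can match it
    LC_ALL=C "$tool" -c --quiet "$baseline" 2>/dev/null | sed -n 's/: FAILED.*$//p'
}

# Usage:
#   changed_files /var/lib/ir-baseline/bin_hashes.txt
```

Expect hits after every package upgrade, so refresh the baseline right after patching and treat changes outside a patch window as the interesting ones.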

Ship Logs Off-Server Immediately

Attackers typically wipe local logs after gaining access. Start forwarding logs to an external syslog server from day one:

# Add to /etc/rsyslog.conf (a single @ forwards over UDP; use @@ for TCP)
echo "*.* @your-syslog-server:514" >> /etc/rsyslog.conf
systemctl restart rsyslog

IR Toolkit — Install Before You Need It

apt-get install -y chkrootkit rkhunter tcpdump strace lsof auditd

The single biggest difference-maker after all of this: document every incident, no matter how small. Post-mortems aren’t about finding someone to blame — they’re about handling the next one 50% faster. I keep an incidents.md file with the timeline, attack vector, and lessons learned from each case. After a year, it’s the best training material I have.
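
If a blank page is what stops you from writing the post-mortem, a tiny helper that appends a skeleton entry can lower the barrier. This is just one possible layout built from the three things mentioned above (timeline, attack vector, lessons learned). The function name and the heading structure are my own invention.

```shell
# Append a dated incident skeleton to a notes file (default: incidents.md).
new_incident_entry() {
    title="$1"
    file="${2:-incidents.md}"
    cat >> "$file" <<EOF

## $(date -u +%Y-%m-%d) - $title

### Timeline (UTC)
- HH:MM detected:
- HH:MM contained:
- HH:MM recovered:

### Attack vector

### Lessons learned
EOF
}

# Usage:
#   new_incident_entry "Cryptominer on web1"
```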