Last March, I got a call at 2 AM: a production server was pegged at 100% CPU, unexplained outbound traffic was leaving the box, and nobody knew why. No playbook, no checklist, everyone scrambling. The result: 6 hours to handle an incident that should have taken 90 minutes with a clear process in place.
Across the 10+ production servers I’ve audited, nearly all share the same blind spot: not technical skill, but process. The tools are there, the knowledge is there, but when a real incident hits, nobody knows what to do first or next. Here’s the process I actually use, with the specific commands you can apply right now.
What Is Incident Response — and Why Does It Need Its Own Process?
Straight to the point: Incident Response (IR) is just an ordered answer to a simple question — when you get hacked, what do you do first? Detect, contain, investigate, recover. Sounds simple, but without a process it’s chaos.
Why a fixed process beats winging it:
- Under pressure, people skip critical steps — like backing up logs before removing malware
- No checklist means the team duplicates effort or misses things entirely
- No documentation means after the incident you can’t tell exactly what happened and can’t improve
One of the most widely used IR models is PICERL from SANS: Preparation → Identification → Containment → Eradication → Recovery → Lessons Learned. I’ll walk through the hands-on phases first, starting with identification, then come back to preparation and lessons learned at the end.
Step 1: Detect and Confirm the Incident
Before doing anything, confirm this is actually an incident — not just a false alarm from normal traffic spikes.
Check for Abnormal Processes
# List running processes, sorted by CPU usage
ps aux --sort=-%cpu | head -20
# Check for hidden processes: PIDs visible in /proc but missing from ps output
# (short-lived processes, including ps itself, can show up as harmless differences)
ls /proc | grep -E '^[0-9]+$' | sort -n > /tmp/proc_pids.txt
ps -e -o pid= | tr -d ' ' | sort -n > /tmp/ps_pids.txt
diff /tmp/proc_pids.txt /tmp/ps_pids.txt
# Show process tree, spot abnormal shell spawns
pstree -a -p
Check Network Connections
# Listening sockets (TCP and UDP) with their owning processes
ss -tulnp
# Established TCP connections
netstat -antp | grep ESTABLISHED
# Show outbound traffic by process
lsof -i -n -P | grep -v LISTEN
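When there are hundreds of connections, a per-IP count makes the odd one out easier to spot. Here is a minimal sketch of that idea; `top_remote_peers` is a hypothetical helper name, and it assumes the peer address:port is the last column of the input, which holds for `ss -Htn` output as long as `-p` is not used.

```shell
# Hypothetical helper: count established TCP connections per remote IP.
# Reads `ss -Htn state established` output on stdin and assumes the
# peer address:port is the last column (true when -p is not used).
top_remote_peers() {
    awk 'NF {
             peer = $NF
             sub(/:[^:]*$/, "", peer)   # strip the trailing :port
             count[peer]++
         }
         END { for (p in count) print count[p], p }' | sort -rn
}

# Usage on a live host:
#   ss -Htn state established | top_remote_peers | head -20
```

A single unfamiliar IP holding many connections, or any peer you cannot map to a known service, is worth checking against the process list before you move on.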
Check Logged-In Users
who
w
last -n 20 # Last 20 logins
lastb -n 20 # Last 20 failed login attempts
If you see unknown processes hogging CPU or bandwidth, connections to unrecognized foreign IPs, or an unfamiliar user online at 3 AM, treat it as a real incident and move to the next step immediately.
Step 2: Isolate the System Immediately
This is the most skipped step — and usually the most costly mistake. Many teams jump straight into investigation while the attacker is still connected, continuing to exfiltrate data. Hard rule: isolate first, investigate second.
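# Block all traffic, keep only your IP connected
YOUR_IP="1.2.3.4" # Replace with your actual IP
# Save the current rules first: they are evidence, and -F wipes them (example path)
iptables-save > /root/iptables-pre-isolation.rules
iptables -F && iptables -X
# Allow your IP
iptables -A INPUT -s "$YOUR_IP" -j ACCEPT
iptables -A OUTPUT -d "$YOUR_IP" -j ACCEPT
# Allow loopback
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
# Block everything else
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP
iptables -A FORWARD -j DROP
# The rules above cover IPv4 only; if the host has IPv6, drop it as well
# (assumes your own session is over IPv4)
ip6tables -P INPUT DROP
ip6tables -P OUTPUT DROP
ip6tables -P FORWARD DROP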
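The easiest way to turn isolation into a self-inflicted outage is to leave the placeholder IP in place or mistype your own address. A small guard can catch that before any rule is applied. This is a sketch, not a guarantee: `current_ssh_ip` and `safe_to_isolate` are hypothetical names, and it assumes you are connected over SSH so that sshd has set `SSH_CLIENT` (the variable may not survive `sudo`).

```shell
# Hypothetical guard: confirm the allowlisted IP matches this SSH session.
# sshd sets SSH_CLIENT as "client_ip client_port server_port".
current_ssh_ip() {
    client="${SSH_CLIENT:-}"
    printf '%s\n' "${client%% *}"   # keep only the first field
}

safe_to_isolate() {
    # $1 = the IP you are about to allowlist (YOUR_IP in the isolation rules)
    [ -n "$1" ] && [ "$1" = "$(current_ssh_ip)" ]
}

# Usage:
#   safe_to_isolate "$YOUR_IP" || echo "YOUR_IP does not match this session" >&2
```

If you reach the server through a bastion or NAT, the address in `SSH_CLIENT` is the one the server sees, which is also the one the firewall rule needs.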
Why not just shut down the server? RAM holds a lot of valuable evidence: running processes, encryption keys, network state — all gone the moment you power off. Network isolation preserves the evidence while keeping the attacker locked out.
If you absolutely must shut down, for example because ransomware is actively encrypting files, cut the power (a hard power-off) instead of running shutdown. Reason: attackers sometimes plant cleanup scripts that run during a graceful shutdown.
Step 3: Collect Evidence and Investigate
Once the network is isolated, collect evidence in priority order: volatile data (RAM, network state) first, non-volatile (disk, logs) second. Don’t reverse this order.
Collect System Information
# Create an evidence directory with a timestamp
# (/tmp lives on the compromised disk; point this at external or remote storage if you can)
EVIDENCE_DIR="/tmp/ir-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$EVIDENCE_DIR"
date > $EVIDENCE_DIR/timestamp.txt
uname -a > $EVIDENCE_DIR/sysinfo.txt
uptime >> $EVIDENCE_DIR/sysinfo.txt
ps auxf > $EVIDENCE_DIR/processes.txt
ss -tulnp > $EVIDENCE_DIR/network.txt
netstat -rn > $EVIDENCE_DIR/routes.txt
lsof -n > $EVIDENCE_DIR/open_files.txt
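Evidence is only useful if you can later show it was not altered. One common approach is to write a checksum manifest as soon as files land in the evidence directory. The sketch below assumes GNU coreutils `sha256sum`; `hash_evidence` and the `MANIFEST.sha256` file name are my own choices, not a standard.

```shell
# Hypothetical helper: write a SHA-256 manifest for an evidence directory.
hash_evidence() {
    # $1 = evidence directory. Paths in the manifest are relative to it.
    (
        cd "$1" || exit 1
        # Hash every file except the manifest itself.
        find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + \
            > MANIFEST.sha256
    )
}

# Later, to confirm nothing changed since collection:
#   ( cd "$EVIDENCE_DIR" && sha256sum -c --quiet MANIFEST.sha256 )
```

Rerun the helper after each collection step, and copy the manifest off the server along with the evidence.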
Find Recently Modified Files
# Files modified in the last 24 hours (skip /proc, /sys, /dev)
find / -not \( -path /proc -prune \) \
-not \( -path /sys -prune \) \
-not \( -path /dev -prune \) \
-mtime -1 -type f 2>/dev/null > $EVIDENCE_DIR/recent_files.txt
# Suspicious SUID/SGID files
find / -perm /6000 -type f 2>/dev/null > $EVIDENCE_DIR/suid_files.txt
Analyze Logs for Signs of Intrusion
# SSH brute force attempts and suspicious successful logins
# (take the field after "from"; a fixed column breaks on "invalid user" lines)
grep "Failed password" /var/log/auth.log \
| awk '{for (i = 1; i < NF; i++) if ($i == "from") print $(i+1)}' \
| sort | uniq -c | sort -rn | head -20
grep "Accepted password\|Accepted publickey" /var/log/auth.log | tail -50
# Recent cron/at activity (job runs, crontab edits)
grep -i "cron\|atd" /var/log/syslog | tail -50
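The most interesting SSH entries are usually the ones where a source IP failed several times and then got in. Here is a rough sketch that correlates the two; `successful_after_failures` is a hypothetical helper, and it assumes the standard OpenSSH log wording ("Failed password ... from IP", "Accepted password/publickey ... from IP") in a Debian-style auth.log.

```shell
# Hypothetical helper: list IPs that have both failed and accepted logins.
# Output columns: failed_count accepted_count ip
successful_after_failures() {
    # $1 = auth log path (defaults to the Debian/Ubuntu location)
    awk '
        /Failed password/ {
            for (i = 1; i < NF; i++) if ($i == "from") failed[$(i + 1)]++
        }
        /Accepted (password|publickey)/ {
            for (i = 1; i < NF; i++) if ($i == "from") ok[$(i + 1)]++
        }
        END {
            for (ip in ok) if (ip in failed) print failed[ip], ok[ip], ip
        }
    ' "${1:-/var/log/auth.log}" | sort -rn
}

# Usage:
#   successful_after_failures /var/log/auth.log
```

A hit here is not proof of compromise (people mistype passwords), but a high failure count followed by a success from an unfamiliar IP deserves a close look.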
Check Persistence Mechanisms
# Crontabs for all users
for user in $(cut -d: -f1 /etc/passwd); do
echo "=== $user ==="
crontab -u $user -l 2>/dev/null
done
# Unusual systemd services that are running
systemctl list-units --type=service --state=running
# SSH authorized_keys — look for unfamiliar keys added
find /home /root -name "authorized_keys" \
-exec echo "=== {} ===" \; -exec cat {} \; 2>/dev/null
# Accounts with a login shell: look for names you don't recognize
grep -v "nologin\|false" /etc/passwd
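Alongside unfamiliar account names, it is worth checking for any account other than root that has UID 0, since that is a quiet way to keep full privileges. A minimal sketch, with `extra_root_accounts` as a hypothetical helper name:

```shell
# Hypothetical helper: print accounts with UID 0 that are not named root.
extra_root_accounts() {
    # $1 = passwd-format file (defaults to /etc/passwd)
    awk -F: '$3 == 0 && $1 != "root" { print $1 }' "${1:-/etc/passwd}"
}

# Usage:
#   extra_root_accounts            # no output is the expected result
```

On most systems this prints nothing. Some platforms do ship a second UID 0 account by design, so treat output as something to verify, not as an automatic verdict.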
Step 4: Eradicate the Threat
Only start removing things once you know exactly what the problem is. Don’t delete by gut feeling — every action needs a timestamp so your incident report is actually usable afterward.
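One low-effort way to keep that timestamp discipline is a tiny logging helper you call before each action. This is a sketch under my own naming: `ir_log` and the `IR_LOG` variable are hypothetical, and the log path should point wherever you are keeping evidence.

```shell
# Hypothetical helper: append a UTC-timestamped note to an action log.
# IR_LOG is an assumed variable; the default writes to the current directory.
IR_LOG="${IR_LOG:-./ir-actions.log}"

ir_log() {
    # All arguments are joined into one log message.
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$IR_LOG"
}

# Usage:
#   ir_log "killing PID 4242 (suspected miner)"
#   ir_log "removed user suspicious_username"
```

UTC avoids confusion when you later line the log up against server logs from different time zones.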
# Remove a fraudulent user
userdel -r suspicious_username
# Remove unknown SSH key from authorized_keys
# Open the file, find and delete the unrecognized key line
nano /root/.ssh/authorized_keys
# Kill malware process by PID
kill -9 <PID>
# Remove crontab for a compromised user
crontab -r -u www-data
# Reinstall binaries that may have been replaced by the attacker
apt-get install --reinstall openssh-server nginx
One mistake I see repeatedly: teams remove malware and then realize they never grabbed a hash or sample. By then there’s nothing left to analyze. Back up the evidence first, delete second — no exceptions.
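To make "back up first" a habit, it helps to wrap it in one command. The sketch below records a hash and keeps a copy of the file before you delete anything; `preserve_sample`, the `samples` subdirectory, and `sample_hashes.txt` are hypothetical names, and it assumes GNU coreutils.

```shell
# Hypothetical helper: hash a suspicious file and keep a copy before deletion.
preserve_sample() {
    # $1 = suspicious file, $2 = evidence directory
    src="$1"
    dest="$2/samples"
    mkdir -p "$dest" || return 1
    # Record the hash first so it exists even if the copy fails.
    sha256sum "$src" >> "$2/sample_hashes.txt" || return 1
    # -p keeps mode, ownership, and timestamps where permitted.
    cp -p "$src" "$dest/$(basename "$src").sample"
}

# Usage:
#   preserve_sample /tmp/.x/miner "$EVIDENCE_DIR" && rm /tmp/.x/miner
```

Chaining the delete with `&&` means the file is only removed once the hash and copy both succeeded.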
Step 5: Safely Restore the System
Restoring from backup without understanding the attack vector just buys time. The vulnerability is still there, and you’ll likely repeat the incident within 2–3 weeks. I’ve watched this happen at least 3 times across servers I’ve audited.
Checklist before bringing the server back to production:
- Confirm the attack vector has been patched (patch the CVE, rotate leaked credentials, fix the misconfiguration)
- Restore from a backup taken before the suspected compromise date
- Verify backup integrity using checksums
- Reset passwords for all affected user accounts
# Force password change on next login
chage -d 0 username
# Regenerate SSH host keys (important if old keys were exposed)
rm /etc/ssh/ssh_host_*
dpkg-reconfigure openssh-server
# Enable auditd for monitoring after restore
systemctl enable --now auditd
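The checklist item on verifying backup integrity can be sketched as a simple checksum comparison. This assumes a SHA-256 checksum was recorded when the backup was taken and stored somewhere the attacker could not reach; `verify_backup` is a hypothetical helper name.

```shell
# Hypothetical helper: compare a backup archive against its recorded SHA-256.
verify_backup() {
    # $1 = backup archive
    # $2 = file whose first field is the expected hash (sha256sum output format)
    expected=$(cut -d' ' -f1 "$2") || return 1
    actual=$(sha256sum "$1" | cut -d' ' -f1) || return 1
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage:
#   verify_backup /backups/site.tar.gz /backups/site.tar.gz.sha256 \
#       || echo "checksum mismatch: do not restore" >&2
```

A matching checksum only proves the archive is the one you recorded. It says nothing about whether the backup predates the compromise, so the date check in the list still applies.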
Prepare Ahead to Make Incidents Manageable
Roughly 80% of the complexity in the incidents I’ve handled came from a lack of preparation. Three things to do today, not when you need them:
Baseline Your System — Snapshot Normal State
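# Run on a schedule (weekly cron), save for later comparison
# Keep a copy off the server too: an attacker with root can rewrite a local baseline
mkdir -p /var/lib/ir-baseline
# SHA-256 instead of MD5, which is no longer collision-resistant
sha256sum /usr/sbin/* /usr/bin/* > /var/lib/ir-baseline/bin_hashes.txt
ss -tulnp > /var/lib/ir-baseline/network.txt
ps auxf > /var/lib/ir-baseline/processes.txt
# crontab -l only covers the user running it (root here)
crontab -l > /var/lib/ir-baseline/crontab.txt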
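A baseline only pays off if you can compare against it quickly. Here is a minimal sketch that prints the lines present in a fresh snapshot but absent from the baseline; `new_since_baseline` is a hypothetical helper, and both files are assumed to come from the same command so their formats match.

```shell
# Hypothetical helper: show lines that are new or changed since the baseline.
new_since_baseline() {
    # $1 = baseline snapshot, $2 = current snapshot (same command, same format)
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -13 "$a" "$b"   # keep only lines unique to the current snapshot
    rm -f "$a" "$b"
}

# Usage:
#   crontab -l > /tmp/crontab.now
#   new_since_baseline /var/lib/ir-baseline/crontab.txt /tmp/crontab.now
```

This works best on stable output such as binary hashes and crontabs. Process listings change constantly (CPU, memory, PIDs), so expect noise there and read those by eye.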
Ship Logs Off-Server Immediately
Attackers typically wipe local logs after gaining access. Start forwarding logs to an external syslog server from day one:
# Add to /etc/rsyslog.conf (a single @ forwards over UDP; use @@ for TCP)
echo "*.* @your-syslog-server:514" >> /etc/rsyslog.conf
systemctl restart rsyslog
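Appending with `echo >>` adds a duplicate line every time the command is rerun. An alternative is to write the rule to its own drop-in file, which is safe to repeat. This is a sketch with assumptions: `write_forward_rule` and the file name `90-ir-forward.conf` are hypothetical, it relies on the Debian/Ubuntu default of including `/etc/rsyslog.d/*.conf`, and it uses `@@` (TCP) rather than UDP.

```shell
# Hypothetical helper: write a single forwarding rule to an rsyslog drop-in file.
write_forward_rule() {
    # $1 = syslog server host
    # $2 = config directory (defaults to /etc/rsyslog.d)
    conf="${2:-/etc/rsyslog.d}/90-ir-forward.conf"
    # Overwrite rather than append, so reruns never duplicate the rule.
    printf '*.* @@%s:514\n' "$1" > "$conf"
}

# Usage (as root):
#   write_forward_rule your-syslog-server
#   rsyslogd -N1 && systemctl restart rsyslog   # validate config, then restart
```

Neither plain UDP nor plain TCP forwarding is encrypted, so on untrusted networks look into rsyslog's TLS support as well.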
IR Toolkit — Install Before You Need It
apt-get install -y chkrootkit rkhunter tcpdump strace lsof auditd
The single biggest difference-maker after all of this: document every incident, no matter how small. Post-mortems aren’t about finding someone to blame — they’re about handling the next one 50% faster. I keep an incidents.md file with the timeline, attack vector, and lessons learned from each case. After a year, it’s the best training material I have.

