top and htop Mastery: Identifying System Bottlenecks
What You'll Learn
- How to identify a CPU, memory, or I/O bottleneck in 30 seconds from `top`'s summary lines
- The meaning and decision thresholds for `%us`, `%sy`, `%wa`, `%si`
- How to use `htop`'s tree view, filters, and F-keys to hunt rogue processes interactively
- How to dig past "load average is high" to the actual root layer
Quick Triage (30-second pattern)
- Look at `top`'s line 3 → high `%us` = CPU-bound, high `%wa` = I/O wait, high `%sy` = kernel/context-switch overhead
- Look at lines 4-5 → tiny `avail Mem` and growing `swap used` = memory pressure
- Use `htop` for the suspect: tree view + sort (P / M / T keys)
Prerequisites
- OS: Ubuntu / RHEL-family Linux
- `htop` may not be preinstalled (`sudo apt install htop` / `sudo dnf install htop`)
- Output examples assume procps-ng `top` (BSD `top` displays differently)
Why Use Both top and htop?
`top` is available almost everywhere (procps-ng ships with nearly every Linux distribution). `htop` excels at interactivity and visuals: tree view, color, mouse support, sorting by any column. `top` is your last line of defense during incidents; `htop` is the daily-driver investigator. They complement each other.
When to reach for which
- Incident response / minimal container / SSH-only → `top` (binary is almost always there)
- Daily monitoring / parent-child analysis / batch process kill → `htop`
How to Read the top Screen
The 5-line summary at the top is what matters. The process list comes second.
top - 14:32:11 up 12 days, 3:45, 2 users, load average: 4.21, 3.85, 2.10
Tasks: 234 total, 2 running, 232 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.3 us, 3.2 sy, 0.0 ni, 78.5 id, 5.8 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 16004.0 total, 412.5 free, 12890.3 used, 2701.2 buff/cache
MiB Swap: 2048.0 total, 102.3 free, 1945.7 used, 1893.4 avail Mem
Line 1: uptime + load average
load average: 4.21, 3.85, 2.10 is the average count of processes in R (runnable) or D (uninterruptible sleep) over the last 1 / 5 / 15 minutes. Compare against CPU core count.
- Example: 4-core box at 4.21 → saturated (about 1.05 runnable or blocked tasks per core)
- Example: 4-core box at 8.50 → overloaded (more than 2 tasks per core)
- 1-min > 15-min → load climbing; 1-min < 15-min → load subsiding
Load average includes D state processes (uninterruptible I/O), not just CPU contention. High %wa inflates load too — "high load = need more CPU" is a common misread.
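The per-core comparison above is simple arithmetic, and a small sketch can make it repeatable. The helper names `load_per_core` and `load_trend` are made up for this example; the live part assumes Linux (`/proc/loadavg`) and a working `getconf`.

```shell
#!/bin/sh
# load_per_core LOAD CORES: print LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# load_trend ONE_MIN FIFTEEN_MIN: print "climbing", "subsiding", or "steady".
load_trend() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a > b) print "climbing"
        else if (a < b) print "subsiding"
        else print "steady"
    }'
}

# Worked example with the sample figures (a 4-core box is assumed).
load_per_core 4.21 4     # prints 1.05
load_trend 4.21 2.10     # prints climbing

# Live values: /proc/loadavg starts with the 1, 5 and 15 minute figures.
if [ -r /proc/loadavg ] && cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null); then
    read -r one five fifteen _rest < /proc/loadavg
    echo "load per core: $(load_per_core "$one" "$cores") ($(load_trend "$one" "$fifteen"))"
fi
```

A result near or above 1.00 per core only says that tasks are queuing; whether they queue for CPU or for I/O still has to come from line 3.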
Line 3: CPU Breakdown (Most Important)
| Field | Meaning | Threshold |
|---|---|---|
| `us` | User-space CPU | Sustained 80%+ = CPU-bound |
| `sy` | Kernel-space CPU | 30%+ = excessive syscalls |
| `ni` | Niced (priority-adjusted) processes | Usually ignorable |
| `id` | Idle | Lower = busier |
| `wa` | I/O wait (disk and other block I/O, including network filesystems) | 20%+ = I/O bottleneck |
| `hi` | Hardware interrupts | Normally < 1% |
| `si` | Software interrupts | 5%+ = network/timer churn |
| `st` | Stolen time (virtualization) | 10%+ = noisy-neighbor VM impact |
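As a rough illustration, the thresholds in the table can be applied mechanically to a `%Cpu(s)` line. This is a sketch only: `cpu_flags` is a hypothetical helper, it assumes the procps-ng layout shown above (value first, then label), and the cut-offs are the rule-of-thumb figures from the table rather than hard limits.

```shell
#!/bin/sh
# cpu_flags: read procps-ng style "%Cpu(s):" lines on stdin and print the
# fields that cross the table's thresholds, or "ok" if none do.
cpu_flags() {
    awk '
        /^%Cpu/ {
            # Drop the "%Cpu(s):" prefix so that $1 is the first value.
            sub(/^%Cpu[^:]*:/, "")
            # Values precede their labels: "12.3 us, 3.2 sy, ...".
            for (i = 2; i <= NF; i++) {
                label = $i
                sub(/,$/, "", label)
                v[label] = $(i - 1) + 0
            }
            out = ""
            if (v["us"] >= 80) out = out " us"
            if (v["sy"] >= 30) out = out " sy"
            if (v["wa"] >= 20) out = out " wa"
            if (v["si"] >= 5)  out = out " si"
            if (v["st"] >= 10) out = out " st"
            sub(/^ /, "", out)
            print (out == "" ? "ok" : out)
        }
    '
}

# The sample line from the screen above stays under every threshold.
echo '%Cpu(s): 12.3 us,  3.2 sy,  0.0 ni, 78.5 id,  5.8 wa,  0.0 hi,  0.2 si,  0.0 st' | cpu_flags   # prints ok
```

On a live system this could be fed with `top -b -n 2 -d 1 | cpu_flags | tail -n 1`; taking the second iteration is often suggested because the first sample may not reflect current activity.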
Lines 4-5: Memory and Swap
- Tiny `free` + large `buff/cache` → healthy (kernel using cache)
- Tiny `free` + small `buff/cache` + growing `swap used` → real memory pressure
- `avail Mem` is the realistic budget for new processes. Trust it more than `free`.
Where Do You Spot a CPU Bottleneck?
When line 3's %us stays above ~80% sustained and the load concentrates on one or two processes, you have a CPU-bound workload.
Investigation Steps
# 1. Snapshot
$ top
# 2. Sort by CPU (press Shift + P while top runs)
#    %CPU column now sorted descending
# 3. A single process at ~100% or more (100% = one full core) → that's the culprit
#    Load spread across many processes → general overload; scale out
Pressing `1` in `top` switches to a per-core breakdown, which makes a single-threaded process saturating one core easy to spot.
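For logs or scripts, the same "who is above one core" check can be run against batch output. The sketch below assumes the default procps-ng column header (a line starting with `PID` that contains `%CPU` and `COMMAND`); `hot_procs` is a hypothetical helper and the rows in the demo are made up for illustration.

```shell
#!/bin/sh
# hot_procs THRESHOLD: read `top -b` output on stdin and print PID, %CPU and
# COMMAND for every task at or above THRESHOLD percent CPU.
hot_procs() {
    awk -v min="$1" '
        # Find the column header and remember where %CPU and COMMAND sit.
        $1 == "PID" {
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") cpu = i
                if ($i == "COMMAND") cmd = i
            }
            next
        }
        # Summary lines come before the header, so cpu is still unset for them.
        cpu && $cpu + 0 >= min { print $1, $cpu, $cmd }
    '
}

# Demo on illustrative rows (not real measurements).
hot_procs 100 <<'EOF'
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4242 app       20   0 2101234 901234  10234 R 187.5   5.5  12:34.56 java
    913 www-data  20   0  123456  23456   3456 S   3.0   0.1   0:12.34 nginx
EOF
# prints: 4242 187.5 java
```

Against a live system the input would be `top -b -n 1 | hot_procs 100`. Only the first word of the command is printed, which is usually enough to decide which PID to inspect next.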
High %us but No Obvious Culprit?
- Short-lived processes may not show in top's 3-second refresh window
- Try `top -d 0.5` for half-second refresh, or `pidstat 1` for per-second granularity
How Do You Detect Memory Pressure?
Low free ≠ memory pressure. Linux uses free RAM as page cache, so free is always small on a busy system.
Real Memory Pressure Criteria
- `avail Mem` (printed at the end of the `MiB Swap` line) drops below 5% of the `MiB Mem` total
- `MiB Swap`'s `used` is continually increasing
- In `top`, press `Shift + M` to sort by `%MEM` (share of resident memory) → one process abnormally large
- `dmesg | grep -i "killed process"` shows OOM Killer activity
Swap usage ≠ instant memory pressure. Linux paging out cold pages is normal. The red flag is swap used growing continuously while %wa also climbs (thrashing).
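The 5% criterion can be checked without reading the screen by eye. A minimal sketch, assuming a Linux kernel that exposes `MemAvailable` in `/proc/meminfo` (the figure procps-ng reports as `avail Mem`); `avail_pct` is a name invented here.

```shell
#!/bin/sh
# avail_pct: read /proc/meminfo-format text on stdin and print MemAvailable
# as a percentage of MemTotal, with one decimal place.
avail_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%.1f\n", 100 * avail / total }
    '
}

# Same arithmetic with the sample figures: 1893.4 MiB available of 16004.0 MiB.
printf 'MemTotal: 16004.0\nMemAvailable: 1893.4\n' | avail_pct   # prints 11.8

# Live value on Linux.
if [ -r /proc/meminfo ]; then
    echo "available memory: $(avail_pct < /proc/meminfo)% of total"
fi
```

At 11.8% the sample box is above the 5% line, so its nearly full swap would only be alarming if `used` kept growing while `%wa` climbed.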
How Do You Pinpoint I/O Wait?
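When `%wa` stays at 20%+, the CPU is idle but tasks aren't progressing: they're blocked on disk or other block I/O. Network filesystems such as NFS count here; plain waits on a socket generally do not show up as `%wa`.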
Drill-down Commands
# Which processes are in D state (uninterruptible sleep)?
$ ps -eo state,pid,cmd | grep "^D"
# Per-device I/O rates
$ iostat -xz 1
# Per-process I/O (iotop needs root)
$ sudo iotop -o
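D state is often brief, so a single `ps` call can miss it. One way to get a feel for how persistent it is, sketched under the assumption of procps `ps` (the `cmd` column name is Linux-specific), is to count D-state tasks once per second. Both function names are hypothetical.

```shell
#!/bin/sh
# count_d_state: read `ps -eo state,pid,cmd` output on stdin and print how
# many processes are in uninterruptible sleep (state starting with D).
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# sample_d_state SAMPLES: print one timestamped count per second.
sample_d_state() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '%s D-state: %s\n' "$(date '+%H:%M:%S')" \
            "$(ps -eo state,pid,cmd | count_d_state)"
        i=$((i + 1))
        sleep 1
    done
}
```

Running `sample_d_state 10` during an incident shows whether the count stays above zero across samples, which points at a stuck device or mount rather than a momentary burst.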
Quick decision table
| %us | %wa | Interpretation |
|---|---|---|
| High | Low | CPU-bound (heavy computation) |
| Low | High | I/O-bound (disk or network-filesystem latency) |
| High | High | Mixed (e.g., expensive DB query) |
| Low | Low | Lock contention / external API / sleeping |
What Are htop's Killer Features?
htop looks like a prettier top, but the real wins are tree view, filters, and multi-select operations.
Default View
0[|||||||||||||||| 45.2%] Tasks: 87, 234 thr; 3 running
1[||||||||| 22.1%] Load average: 1.85 1.42 1.05
2[||||||||||||||||||||||||||||||||||||| 98.7%] Uptime: 12 days, 03:45:11
3[||| 5.3%]
Mem[||||||||||||||||||||| 12.5G/16.0G]
Swp[||||| 1.9G/2.0G]
Per-core utilization is visible as bars. Core 2 at 98.7% screams "single-threaded bottleneck" at a glance.
Common F-keys and Shortcuts
| Key | Function |
|---|---|
| `F2` | Settings (colors, columns, display modes) |
| `F3` / `/` | Incremental search by process name |
| `F4` / `\` | Filter by process name (hides non-matches) |
| `F5` / `t` | Toggle tree view (parent-child relationships) |
| `F6` | Choose sort column |
| `F9` / `k` | Send signal (SIGTERM / SIGKILL / etc.) |
| `Space` | Tag a process (multi-select) |
| `Shift + P` | Sort by %CPU |
| `Shift + M` | Sort by %MEM |
| `Shift + T` | Sort by TIME (cumulative CPU time) |
| `u` | Filter by user |
When Tree View Earns Its Keep
└─ nginx: master process
├─ nginx: worker process
├─ nginx: worker process
└─ nginx: cache manager process
`F5` tree view is unbeatable when you need to find the misbehaving parent of runaway children. If only one nginx worker is CPU-hot, the issue is probably tied to specific requests rather than to the configuration as a whole.
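On a box where `htop` is not installed, a rough stand-in for the tree view can be built from `ps` output. This is only a sketch: `descendants_of` is a hypothetical helper that expects `PID PPID COMMAND` lines, and the PIDs in the demo are invented.

```shell
#!/bin/sh
# descendants_of ROOT_PID: read "PID PPID COMMAND" lines on stdin and print
# ROOT_PID and everything below it, indented two spaces per level.
descendants_of() {
    awk -v root="$1" '
        {
            pid = $1; ppid = $2
            $1 = ""; $2 = ""
            sub(/^ +/, "")
            name[pid] = $0
            kids[ppid] = kids[ppid] " " pid
        }
        function walk(p, depth,    n, list, i, pad) {
            pad = ""
            for (i = 0; i < depth; i++) pad = pad "  "
            print pad p " " name[p]
            n = split(kids[p], list, " ")
            for (i = 1; i <= n; i++) walk(list[i], depth + 1)
        }
        END { if (root in name) walk(root, 0) }
    '
}

# Demo with illustrative PIDs.
descendants_of 100 <<'EOF'
1 0 /sbin/init
100 1 nginx: master process
101 100 nginx: worker process
102 100 nginx: worker process
200 1 sshd
EOF
# prints:
# 100 nginx: master process
#   101 nginx: worker process
#   102 nginx: worker process
```

Live input would come from `ps -eo pid=,ppid=,args= | descendants_of PID`, where the trailing `=` signs suppress the header row.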
Killing Multiple Processes at Once
- Narrow the list with `/` (search) or `F4` (filter)
- Press `Space` on each target to tag (multi-select)
- Press `F9` to choose a signal → it's broadcast to all tagged processes
SIGKILL (9) is a last resort. Try SIGTERM (15) first, then escalate. SIGKILL on databases or processes mid-write can corrupt data.
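Outside `htop`, the same "SIGTERM first, then escalate" rule can be scripted. The sketch below is one possible shape, not a standard tool: `graceful_kill` and `is_running` are names invented here, the 10-second default is an arbitrary assumption, and the liveness check relies on Linux's `/proc/PID/status`.

```shell
#!/bin/sh
# is_running PID: true while PID exists and is not a zombie.
# Assumes Linux, where /proc/PID/status carries a "State:" line.
is_running() {
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# graceful_kill PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed,
# 2 if SIGTERM could not be sent (no such process, or no permission).
graceful_kill() {
    pid=$1
    timeout=${2:-10}
    kill -s TERM "$pid" 2>/dev/null || return 2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        is_running "$pid" || return 0
        sleep 1
        waited=$((waited + 1))
    done
    is_running "$pid" || return 0
    # The process ignored SIGTERM for the whole grace period: escalate.
    kill -s KILL "$pid" 2>/dev/null
    return 1
}
```

A call such as `graceful_kill 4242 30` gives the target up to 30 seconds to flush and exit. For databases, a grace period longer than their normal shutdown time is the safer choice.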
Practical Bottleneck Diagnosis Workflow
The 3-step flow I run during real incidents.
Step 1: Narrow the Layer with top (10 sec)
$ top -b -n 1 | head -5
-b (batch) + -n 1 (one iteration) also makes it loggable.
- `%wa` spike → suspect I/O layer (next: `iostat`)
- `%us` spike → suspect application layer (next: `htop` to find PID)
- `%sy` spike → suspect kernel layer (next: `strace`, `perf`)
- `Swap used` climbing → suspect memory layer (next: `Shift + M` sort)
Step 2: Find the Suspect Process with htop (30 sec)
$ htop
# Shift + P / M / T to sort by your axis of interest
# F5 for tree view to inspect parent-child relationships
Step 3: Drill into Root Cause
- Application layer → `strace -p PID -f`, application logs
- I/O layer → `iotop -o`, `iostat -x 1`
- Kernel layer → `dmesg -T`, `journalctl -k`
- Memory layer → `dmesg | grep -i oom`, app leak profiling
Field template: 30-second triage
# 1. Overall picture
top -b -n 1 | head -20
# 2. Load trend
uptime
# 3. Real memory state
free -h
# 4. I/O wait detail
iostat -xz 1 3
# 5. Find the culprit (interactive)
htop
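The non-interactive steps of the template (1 to 4) can be wrapped so that a missing tool does not abort the run, which helps on minimal hosts. This is a sketch with invented names (`section`, `triage_snapshot`); it only calls the commands already listed in the template.

```shell
#!/bin/sh
# section CMD [ARGS...]: print a header, then run CMD if it is installed.
section() {
    printf '== %s\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(exited with status %s)\n' "$?"
    else
        printf '(%s not found, skipped)\n' "$1"
    fi
}

# triage_snapshot: steps 1-4 of the template in a form that can be logged.
triage_snapshot() {
    printf '== triage at %s\n' "$(date '+%Y-%m-%d %H:%M:%S')"
    # Keep the section header plus the first 20 lines of top's output.
    section top -b -n 1 | sed -n '1,21p'
    section uptime
    section free -h
    section iostat -xz 1 3
}
```

Appending the output to a file, for example `triage_snapshot >> "$HOME/triage.log"` (the path is just an example), leaves a timestamped record to compare against once the incident is over. Step 5 stays manual because `htop` is interactive.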
Common Mistakes to Avoid
Pitfalls when using top / htop
- Treating `load average` alone as "CPU shortage" (could be `%wa`-driven)
- Treating low `free` as "memory shortage" (look at `avail Mem`)
- Trusting a single snapshot (observe for at least 10 seconds to filter transient spikes)
- Reaching for SIGKILL first (denies the process a chance to flush)
- Killing parent processes without checking the tree view (orphan-process risk)
Summary
- `top`'s line 3 (CPU breakdown) and lines 4-5 (memory) are the diagnostic entry points
- Which of `%us` / `%sy` / `%wa` is high tells you which layer to dig into
- `htop`'s tree view, filters, and tagging make process hunting efficient
- The real-world pattern: 30-second triage → identify layer → specialist tools (iostat / strace / etc.) for deep dive