top and htop Mastery: Identifying System Bottlenecks

2026-05-23 Reading time: About 14 min Difficulty: Intermediate

What You'll Learn

How to tell within 30 seconds, from top's summary lines, whether the bottleneck is CPU, memory, or I/O
The meaning and decision thresholds for %us, %sy, %wa, %si
How to use htop's tree view, filters, and F-keys to hunt rogue processes interactively
How to dig past "load average is high" to the actual root layer

Quick Triage (30-second pattern)

Look at top's line 3 → high %us = CPU-bound, high %wa = I/O wait, high %sy = kernel/context-switch overhead
Look at lines 4-5 → tiny avail Mem and growing swap used = memory pressure
Use htop for the suspect: tree view + sort (P / M / T keys)

Prerequisites

OS: Ubuntu / RHEL-family Linux
htop may not be preinstalled (sudo apt install htop / sudo dnf install htop)
Output examples assume procps-ng top (BSD top displays differently)

Why Use Both top and htop?

top is available almost everywhere: it ships with procps-ng, which nearly every Linux distribution installs by default. htop excels at interactivity and visuals: tree view, color, mouse support, multi-column sort. top is your last line of defense during incidents; htop is the daily-driver investigator. They complement each other.

When to reach for which

Incident response / minimal container / SSH-only → top (binary is almost always there)
Daily monitoring / parent-child analysis / batch process kill → htop

How to Read the top Screen

The 5-line summary at the top is what matters. The process list comes second.

top - 14:32:11 up 12 days,  3:45,  2 users,  load average: 4.21, 3.85, 2.10
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.3 us,  3.2 sy,  0.0 ni, 78.5 id,  5.8 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  16004.0 total,    412.5 free,  12890.3 used,   2701.2 buff/cache
MiB Swap:   2048.0 total,    102.3 free,   1945.7 used.   1893.4 avail Mem

Line 1: uptime + load average

load average: 4.21, 3.85, 2.10 is the exponentially damped average number of processes in R (runnable) or D (uninterruptible sleep) state over the last 1 / 5 / 15 minutes. Compare it against the CPU core count (nproc).

Example: 4-core box at 4.21 → near-saturated
Example: 4-core box at 8.50 → chronically overloaded
1-min > 15-min → load climbing; 1-min < 15-min → load subsiding

Load average includes D state processes (uninterruptible I/O), not just CPU contention. High %wa inflates load too — "high load = need more CPU" is a common misread.
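The per-core comparison and the trend rule above can be sketched in a few lines of shell. This is a minimal sketch, not a standard tool: load_per_core and load_trend are hypothetical helper names, and the live usage assumes a Linux /proc/loadavg and the coreutils nproc command.

```shell
# Hypothetical helpers; the names are illustrative, not part of top or procps.

# load_per_core LOAD CORES -> load divided by core count, two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# load_trend ONE_MIN FIVE_MIN FIFTEEN_MIN -> climbing / subsiding / steady,
# comparing the 1-minute figure against the 15-minute figure.
load_trend() {
  awk -v one="$1" -v fifteen="$3" 'BEGIN {
    if (one + 0 > fifteen + 0)      print "climbing"
    else if (one + 0 < fifteen + 0) print "subsiding"
    else                            print "steady"
  }'
}

# Live usage on Linux (left commented so the helpers can be sourced safely):
#   read -r one five fifteen _ < /proc/loadavg
#   echo "per-core: $(load_per_core "$one" "$(nproc)")"
#   echo "trend:    $(load_trend "$one" "$five" "$fifteen")"
```

With the sample header above, load_per_core 4.21 4 prints 1.05, roughly one runnable or blocked task per core. A per-core figure well above 1 still needs the %wa check before concluding that the box is short on CPU.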

Line 3: CPU Breakdown (Most Important)

Field	Meaning	Threshold
`us`	User-space CPU	Sustained 80%+ = CPU-bound
`sy`	Kernel-space CPU	30%+ = excessive syscalls
`ni`	Niced (priority-adjusted) processes	Usually ignorable
`id`	Idle	Lower = busier
`wa`	I/O wait (block devices; network filesystems such as NFS can also show up here)	20%+ = I/O bottleneck
`hi`	Hardware interrupts	Normally < 1%
`si`	Software interrupts	5%+ = network/timer churn
`st`	Steal time (CPU taken from this VM by the hypervisor)	10%+ = noisy-neighbor VM impact
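The thresholds in the table can be applied mechanically to top's batch output. The following is a rough sketch under stated assumptions: cpu_flags is a hypothetical helper, it expects the procps-ng "%Cpu(s):" layout shown earlier, and it assumes a "." decimal separator (hence LC_ALL=C in the live usage). The cut-offs are the rules of thumb from the table, not hard limits.

```shell
# cpu_flags: read top's "%Cpu(s):" line on stdin and print one flag per
# table threshold that is met, or "ok" if none is. Hypothetical helper.
cpu_flags() {
  awk '
    /^%Cpu/ {
      sub(/^%Cpu\(s\): */, "")
      n = split($0, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")      # kv[1] = value, kv[2] = field name
        v[kv[2]] = kv[1] + 0
      }
      flags = ""
      if (v["us"] >= 80) flags = flags " cpu-bound"
      if (v["sy"] >= 30) flags = flags " kernel-heavy"
      if (v["wa"] >= 20) flags = flags " io-wait"
      if (v["si"] >= 5)  flags = flags " softirq"
      if (v["st"] >= 10) flags = flags " steal"
      if (flags == "") flags = " ok"
      print substr(flags, 2)          # drop the leading space
      exit                            # only the first %Cpu line is needed
    }'
}

# Live usage (commented out so the function can be sourced without running top):
#   LC_ALL=C top -b -n 1 | cpu_flags
```

Fed the sample summary above (12.3 us, 3.2 sy, 5.8 wa), the function prints ok because no field reaches its threshold. A single reading is only a hint; the flags are more meaningful when the same one appears across several samples.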

Lines 4-5: Memory and Swap

Tiny free + large buff/cache → healthy (kernel using cache)
Tiny free + small buff/cache + growing swap used → real memory pressure
avail Mem is the realistic budget for new processes. Trust it more than free.

Where Do You Spot a CPU Bottleneck?

When line 3's %us stays above ~80% sustained and the load concentrates on one or two processes, you have a CPU-bound workload.

Investigation Steps

# 1. Snapshot
$ top

# 2. Sort by CPU (press Shift + P while top runs)
# %CPU column now sorted descending

# 3. One process near 100% (= 1 full core; multithreaded processes can exceed 100%) → that's the culprit
# Load spread across many processes → general overload; scale out

The 1 key in top: switches to per-core breakdown. Spot a single-threaded process saturating one core instantly.

High `%us` but No Obvious Culprit?

Short-lived processes may not show in top's 3-second refresh window
Try top -d 0.5 for half-second refresh, or pidstat 1 for per-second granularity

How Do You Detect Memory Pressure?

Low free ≠ memory pressure. Linux uses otherwise idle RAM as page cache, so free is usually small on a busy system.

Real Memory Pressure Criteria

avail Mem (printed at the end of the MiB Swap line) drops below 5% of total memory
MiB Swap's used is continually increasing
In top, press Shift + M to sort by %MEM (each process's share of resident memory, RES) → one process abnormally large
dmesg | grep -i "killed process" shows OOM Killer activity

Swap usage ≠ instant memory pressure. Linux paging out cold pages is normal. The red flag is swap used growing continuously while %wa also climbs (thrashing).
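The 5% criterion and the swap trend can also be checked without top, since top's avail Mem figure comes from the MemAvailable field in /proc/meminfo. A minimal sketch, assuming the standard /proc/meminfo field names; mem_avail_pct and swap_used_kb are hypothetical helper names.

```shell
# mem_avail_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal (one decimal).
mem_avail_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1                     # input did not look like meminfo
    }'
}

# swap_used_kb: same input, prints SwapTotal - SwapFree in kB.
swap_used_kb() {
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { free = $2 }
    END { print total - free }'
}

# Live usage on Linux (commented out so the helpers can be sourced safely):
#   mem_avail_pct < /proc/meminfo
#   swap_used_kb  < /proc/meminfo   # run twice, a minute apart, and compare
```

A value under 5.0 from mem_avail_pct matches the first criterion above. swap_used_kb only becomes informative when sampled repeatedly, because the warning sign is growth over time rather than any single number.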

How Do You Pinpoint I/O Wait?

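When %wa stays at 20%+, the CPU is idle but tasks aren't progressing: they're blocked on storage I/O (local disks, or network filesystems such as NFS). Waits on ordinary network sockets generally do not count toward %wa.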

Drill-down Commands

# Which processes are in D state (uninterruptible sleep)?
$ ps -eo state,pid,cmd | grep "^D"

# Per-device I/O rates
$ iostat -xz 1

# Per-process I/O (iotop needs root)
$ sudo iotop -o
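A single ps snapshot of D-state tasks can be noisy, so it can help to count them per command and repeat the sample a few times. The sketch below assumes the procps ps output-format keywords state and comm (a trailing "=" suppresses the header); d_state_by_command is a hypothetical helper name.

```shell
# d_state_by_command: read "STATE COMMAND" lines on stdin and print how many
# tasks per command are in D state, busiest first.
# Only the first word of the command name is used as the grouping key.
d_state_by_command() {
  awk '$1 ~ /^D/ { count[$2]++ }
       END { for (c in count) print count[c], c }' | sort -k1,1nr -k2,2
}

# Live usage (commented out so the function can be sourced safely):
#   ps -eo state=,comm= | d_state_by_command
```

A command that keeps appearing in this list across consecutive samples is a stronger I/O suspect than one that shows up once.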

Quick decision table

%us	%wa	Interpretation
High	Low	CPU-bound (heavy computation)
Low	High	I/O-bound (storage latency)
High	High	Mixed (e.g., expensive DB query)
Low	Low	Lock contention / external API / sleeping

What Are htop's Killer Features?

htop looks like a prettier top, but the real wins are tree view, filters, and multi-select operations.

Default View

  0[||||||||||||||||                       45.2%]   Tasks: 87, 234 thr; 3 running
  1[|||||||||                              22.1%]   Load average: 1.85 1.42 1.05
  2[|||||||||||||||||||||||||||||||||||||  98.7%]   Uptime: 12 days, 03:45:11
  3[|||                                     5.3%]
  Mem[|||||||||||||||||||||              12.5G/16.0G]
  Swp[|||||                                  1.9G/2.0G]

Per-core utilization is visible as bars. Core 2 at 98.7% screams "single-threaded bottleneck" at a glance.

Common F-keys and Shortcuts

Key	Function
`F2`	Settings (colors, columns, display modes)
`F3` / `/`	Incremental search by process name
`F4` / `\`	Filter by process name (hides non-matches)
`F5` / `t`	Toggle tree view (parent-child relationships)
`F6`	Choose sort column
`F9` / `k`	Send signal (SIGTERM / SIGKILL / etc.)
`Space`	Tag a process (multi-select)
`Shift + P`	Sort by %CPU
`Shift + M`	Sort by %MEM
`Shift + T`	Sort by TIME (cumulative CPU time)
`u`	Filter by user

When Tree View Earns Its Keep

└─ nginx: master process
   ├─ nginx: worker process
   ├─ nginx: worker process
   └─ nginx: cache manager process

F5 tree view is unbeatable when you need to find the misbehaving parent of runaway children. If only one nginx worker is CPU-hot, the issue is request-specific, not config-wide.

Killing Multiple Processes at Once

Narrow the list with / (search) or F4 (filter)
Press Space on each target to tag (multi-select)
Press F9 to choose a signal → it's broadcast to all tagged processes

SIGKILL (9) is a last resort. Try SIGTERM (15) first, then escalate. SIGKILL on databases or processes mid-write can corrupt data.
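Outside htop, the same "SIGTERM first, then escalate" rule can be scripted. This is a minimal sketch rather than a hardened tool: term_then_kill is a hypothetical helper, the 10-second default grace period is an arbitrary assumption, and it polls with kill -0, which only tests whether the PID can still be signalled.

```shell
# term_then_kill PID [GRACE_SECONDS]
# Send SIGTERM, wait up to GRACE_SECONDS (default 10) for the process to
# exit, and only then fall back to SIGKILL. Hypothetical helper.
term_then_kill() {
  pid=$1
  grace=${2:-10}
  # If SIGTERM cannot be delivered, the process is gone or not ours to signal.
  kill -TERM "$pid" 2>/dev/null || return 0
  while [ "$grace" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
    grace=$((grace - 1))
  done
  kill -0 "$pid" 2>/dev/null || return 0
  echo "PID $pid did not exit after SIGTERM; sending SIGKILL" >&2
  kill -KILL "$pid" 2>/dev/null
}

# Usage: term_then_kill 12345 30
```

For databases and other processes that flush on shutdown, a longer grace period is usually the safer choice, since the fallback removes any chance of a clean exit.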

Practical Bottleneck Diagnosis Workflow

The 3-step flow I run during real incidents.

Step 1: Narrow the Layer with top (10 sec)

$ top -b -n 1 | head -5

Batch mode (-b) with a single iteration (-n 1) prints plain text and exits, so the output can be piped or appended to a log.

%wa spike → suspect I/O layer (next: iostat)
%us spike → suspect application layer (next: htop to find PID)
%sy spike → suspect kernel layer (next: strace, perf)
Swap used climbing → suspect memory layer (next: Shift + M sort)

Step 2: Find the Suspect Process with htop (30 sec)

$ htop
# Shift + P / M / T to sort by your axis of interest
# F5 for tree view to inspect parent-child relationships

Step 3: Drill into Root Cause

Application layer → strace -p PID -f, application logs
I/O layer → iotop -o, iostat -x 1
Kernel layer → dmesg -T, journalctl -k
Memory layer → dmesg | grep -i oom, app leak profiling

Field template: 30-second triage

# 1. Overall picture
top -b -n 1 | head -20

# 2. Load trend
uptime

# 3. Real memory state
free -h

# 4. I/O wait detail
iostat -xz 1 3

# 5. Find the culprit (interactive)
htop
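During an incident it can be useful to capture the non-interactive part of this template in one pass. The sketch below is one possible wrapper, not an established tool: run_if_present and triage are hypothetical names, and the wrapper exists because iostat (from the sysstat package) is often not installed.

```shell
# run_if_present CMD [ARGS...]
# Print a header, then run CMD if it is installed; otherwise note the gap
# instead of aborting the whole triage. Hypothetical helper.
run_if_present() {
  echo "=== $* ==="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1
  else
    echo "(skipped: $1 not installed)"
  fi
}

# triage: steps 1-4 of the template, in order. htop is interactive and is
# left for the operator to run afterwards.
triage() {
  run_if_present top -b -n 1 | head -21    # header line + first 20 lines
  run_if_present uptime
  run_if_present free -h
  run_if_present iostat -xz 1 3
}

# Usage (writes a timestamped snapshot to the current directory):
#   triage > "triage-$(date +%Y%m%dT%H%M%S).log"
```

Keeping the snapshot in a file makes it possible to compare the state before and after a fix, which a live top session does not allow.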

Common Mistakes to Avoid

Pitfalls when using top / htop

Treating load average alone as "CPU shortage" (could be %wa-driven)
Treating low free as "memory shortage" (look at avail Mem)
Trusting a single snapshot (observe for at least 10 seconds to filter transient spikes)
Reaching for SIGKILL first (denies the process a chance to flush)
Killing parent processes without checking the tree view (children can be orphaned, re-parented to init, and keep running)

Summary

top's line 3 (CPU breakdown) and lines 4-5 (memory) are the diagnostic entry points
Which of %us / %sy / %wa is high tells you which layer to dig into
htop's tree view, filters, and tagging make process hunting efficient
The real-world pattern: 30-second triage → identify layer → specialist tools (iostat / strace / etc.) for deep dive