top and htop Mastery: Identifying System Bottlenecks

What You'll Learn

  • How to tell within 30 seconds, from top's summary lines, whether the bottleneck is CPU, memory, or I/O
  • The meaning and decision thresholds for %us, %sy, %wa, %si
  • How to use htop's tree view, filters, and F-keys to hunt rogue processes interactively
  • How to dig past "load average is high" to the actual root layer

Quick Triage (30-second pattern)

  1. Look at top's line 3 → high %us = CPU-bound, high %wa = I/O wait, high %sy = kernel/context-switch overhead
  2. Look at lines 4-5 → tiny avail Mem and growing swap used = memory pressure
  3. Use htop for the suspect: tree view + sort (P / M / T keys)

Prerequisites

  • OS: Ubuntu / RHEL-family Linux
  • htop may not be preinstalled (sudo apt install htop / sudo dnf install htop)
  • Output examples assume procps-ng top (BSD top displays differently)

Why Use Both top and htop?

top is available almost everywhere: procps-ng ships with nearly every Linux distribution. htop excels at interactivity and visuals: tree view, color, mouse support, sorting by any column. top is your last line of defense during incidents; htop is the daily-driver investigator. They complement each other.

When to reach for which

  • Incident response / minimal container / SSH-only → top (binary is almost always there)
  • Daily monitoring / parent-child analysis / batch process kill → htop

How to Read the top Screen

The 5-line summary at the top is what matters. The process list comes second.

top - 14:32:11 up 12 days,  3:45,  2 users,  load average: 4.21, 3.85, 2.10
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.3 us,  3.2 sy,  0.0 ni, 78.5 id,  5.8 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  16004.0 total,    412.5 free,  12890.3 used,   2701.2 buff/cache
MiB Swap:   2048.0 total,    102.3 free,   1945.7 used,   1893.4 avail Mem

Line 1: uptime + load average

load average: 4.21, 3.85, 2.10 is the average number of processes in R (running or runnable) or D (uninterruptible sleep) state over the last 1 / 5 / 15 minutes, computed as exponentially damped averages. Compare it against the CPU core count.

  • Example: 4-core box at 4.21 → saturated (just over 1.0 per core)
  • Example: 4-core box at 8.50 → chronically overloaded
  • 1-min > 15-min → load climbing; 1-min < 15-min → load subsiding

Load average includes D state processes (uninterruptible I/O), not just CPU contention. High %wa inflates load too — "high load = need more CPU" is a common misread.
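
The per-core comparison and the 1-minute versus 15-minute trend check can be sketched as two small shell helpers. This is a minimal sketch that assumes awk is available; the function names load_per_core and load_trend are made up for illustration and are not part of any standard tool.

```shell
# load_per_core LOAD CORES: print load divided by core count (2 decimals).
# Values near or above 1.00 mean the box is at or past saturation.
# Assumes CORES is a positive number.
load_per_core() {
  awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# load_trend ONE_MIN FIFTEEN_MIN: compare the 1-minute and 15-minute figures.
load_trend() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a > b)      { print "climbing" }
    else if (a < b) { print "subsiding" }
    else            { print "steady" }
  }'
}
```

On a live Linux system the inputs can come from /proc/loadavg and nproc, for example: read -r one five fifteen rest < /proc/loadavg; load_per_core "$one" "$(nproc)". With the sample header above, load_per_core 4.21 4 prints 1.05 and load_trend 4.21 2.10 prints climbing.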

Line 3: CPU Breakdown (Most Important)

Field  Meaning                               Threshold
us     User-space CPU                        Sustained 80%+ = CPU-bound
sy     Kernel-space CPU                      30%+ = excessive syscalls
ni     Niced (priority-adjusted) processes   Usually ignorable
id     Idle                                  Lower = busier
wa     I/O wait (disk, network filesystems)  20%+ = I/O bottleneck
hi     Hardware interrupts                   Normally < 1%
si     Software interrupts                   5%+ = network/timer churn
st     Stolen time (virtualization)          10%+ = noisy-neighbor VM impact
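
The thresholds in the table can be applied mechanically to top's batch output. The following is a rough sketch, not a tuned monitor: the function name flag_cpu_line and the label strings are invented for this example, and it assumes the procps-ng line format shown earlier with a dot as the decimal separator (run top under LC_ALL=C if your locale uses commas).

```shell
# flag_cpu_line: read top output on stdin and report which fields of each
# %Cpu line cross the rule-of-thumb thresholds from the table.
flag_cpu_line() {
  awk '
    /^%Cpu/ {
      sub(/^%Cpu[^:]*:/, "")                 # drop the "%Cpu(s):" label
      n = split($0, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")             # kv[1] = value, kv[2] = field name
        v[kv[2]] = kv[1] + 0
      }
      out = ""
      if (v["us"] >= 80) out = out " cpu-bound"
      if (v["sy"] >= 30) out = out " kernel-overhead"
      if (v["wa"] >= 20) out = out " io-wait"
      if (v["si"] >= 5)  out = out " softirq-churn"
      if (v["st"] >= 10) out = out " steal"
      if (out == "") out = " ok"
      print substr(out, 2)                   # strip the leading space
    }
  '
}
```

Typical use would be LC_ALL=C top -b -n 1 | flag_cpu_line. For the sample header above (12.3 us, 5.8 wa) no threshold is crossed, so it prints ok.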

Lines 4-5: Memory and Swap

  • Tiny free + large buff/cache → healthy (kernel using cache)
  • Tiny free + small buff/cache + growing swap used → real memory pressure
  • avail Mem is the realistic budget for new processes. top prints it at the end of the Swap line, but it describes RAM. Trust it more than free.

Where Do You Spot a CPU Bottleneck?

When line 3's %us stays above ~80% sustained and the load concentrates on one or two processes, you have a CPU-bound workload.

Investigation Steps

# 1. Snapshot
$ top

# 2. Sort by CPU (press Shift + P while top runs)
# %CPU column now sorted descending

# 3. Single process near or above 100% (100% = 1 full core in top's default mode) → that's the culprit
# Load spread across many processes → general overload; scale out

Pressing 1 in top switches to a per-core breakdown, which makes a single-threaded process saturating one core easy to spot.
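
For logs or scripts, the same "who is above one core" check can be run against top's batch output. This is a sketch under assumptions: hot_procs is a hypothetical helper name, and it expects the default procps-ng column layout (a header row starting with PID that includes %CPU and COMMAND).

```shell
# hot_procs [THRESHOLD]: read "top -b" output on stdin and print
# "PID %CPU COMMAND" for every process at or above THRESHOLD percent
# (default 100, i.e. one full core in top's default mode).
# Only the first word of COMMAND is printed.
hot_procs() {
  threshold="${1:-100}"
  awk -v t="$threshold" '
    $1 == "PID" {
      for (i = 1; i <= NF; i++) col[$i] = i   # map column names to positions
      seen = 1
      next
    }
    seen && NF >= col["COMMAND"] && $(col["%CPU"]) + 0 >= t {
      print $(col["PID"]), $(col["%CPU"]), $(col["COMMAND"])
    }
  '
}
```

Typical use would be LC_ALL=C top -b -n 1 | hot_procs 100. An empty result with a high %us suggests the load is spread across many processes rather than concentrated in one.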

High %us but No Obvious Culprit?

  • Short-lived processes may not show in top's 3-second refresh window
  • Try top -d 0.5 for half-second refresh, or pidstat 1 for per-second granularity

How Do You Detect Memory Pressure?

Low free ≠ memory pressure. Linux uses free RAM as page cache, so free is always small on a busy system.

Real Memory Pressure Criteria

  1. avail Mem (printed at the end of top's MiB Swap line) drops below 5% of total Mem
  2. MiB Swap's used is continually increasing
  3. In top, press Shift + M to sort by RES (resident memory) → one process abnormally large
  4. dmesg | grep -i "killed process" shows OOM Killer activity

Swap usage ≠ instant memory pressure. Linux paging out cold pages is normal. The red flag is swap used growing continuously while %wa also climbs (thrashing).
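
Criterion 1 can be checked without top by reading /proc/meminfo, which exposes the same available-memory estimate as MemAvailable. A minimal sketch follows; the helper names avail_mem_pct and memory_verdict are invented here, and the 5% cutoff is simply the rule of thumb from the list above.

```shell
# avail_mem_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal (1 decimal).
avail_mem_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) { printf "%.1f\n", avail * 100 / total } else { exit 1 }
    }
  '
}

# memory_verdict PCT: apply the 5% rule of thumb.
memory_verdict() {
  awk -v p="$1" 'BEGIN { if (p < 5) { print "pressure" } else { print "ok" } }'
}
```

Typical use would be memory_verdict "$(avail_mem_pct < /proc/meminfo)". For the sample header earlier, 1893.4 of 16004.0 MiB is about 11.8%, so that check alone would not flag pressure; the growing swap figure still deserves watching.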

How Do You Pinpoint I/O Wait?

When %wa stays at 20%+, the CPU is idle but tasks aren't progressing: they're blocked on disk or network-filesystem I/O. Plain socket waits are not counted in %wa.

Drill-down Commands

# Which processes are in D state (uninterruptible sleep)?
$ ps -eo state,pid,cmd | grep "^D"

# Per-device I/O rates
$ iostat -xz 1

# Per-process I/O (iotop needs root)
$ sudo iotop -o
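
D state is often transient, so a grouped count is easier to read than raw grep output, especially when sampled a few times. This is only a sketch: d_state_summary is a hypothetical helper, and it assumes input in the shape of ps -eo state,pid,comm (comm instead of cmd keeps the third column to the bare command name).

```shell
# d_state_summary: read "ps -eo state,pid,comm" output on stdin and print
# how many D-state processes each command name has, busiest first.
d_state_summary() {
  awk '$1 ~ /^D/ { count[$3]++ } END { for (c in count) print count[c], c }' |
    sort -rn
}
```

Typical use would be ps -eo state,pid,comm | d_state_summary, repeated every second or two. A command that keeps showing up is the one to cross-check in iotop.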

Quick decision table

%us    %wa    Interpretation
High   Low    CPU-bound (heavy computation)
Low    High   I/O-bound (disk or network-filesystem latency)
High   High   Mixed (e.g., expensive DB query)
Low    Low    Lock contention / external API / sleeping

What Are htop's Killer Features?

htop looks like a prettier top, but the real wins are tree view, filters, and multi-select operations.

Default View

  0[||||||||||||||||                       45.2%]   Tasks: 87, 234 thr; 3 running
  1[|||||||||                              22.1%]   Load average: 1.85 1.42 1.05
  2[|||||||||||||||||||||||||||||||||||||  98.7%]   Uptime: 12 days, 03:45:11
  3[|||                                     5.3%]
  Mem[|||||||||||||||||||||              12.5G/16.0G]
  Swp[|||||                                  1.9G/2.0G]

Per-core utilization is visible as bars. Core 2 at 98.7% screams "single-threaded bottleneck" at a glance.

Common F-keys and Shortcuts

Key         Function
F2          Settings (colors, columns, display modes)
F3 or /     Incremental search by process name
F4 or \     Filter by process name (hides non-matches)
F5 or t     Toggle tree view (parent-child relationships)
F6          Choose sort column
F9 or k     Send signal (SIGTERM / SIGKILL / etc.)
Space       Tag a process (multi-select)
Shift + P   Sort by %CPU
Shift + M   Sort by %MEM
Shift + T   Sort by TIME (cumulative CPU time)
u           Filter by user

When Tree View Earns Its Keep

└─ nginx: master process
   ├─ nginx: worker process
   ├─ nginx: worker process
   └─ nginx: cache manager process

F5 tree view is unbeatable when you need to find the misbehaving parent of runaway children. If only one nginx worker is CPU-hot, the issue is request-specific, not config-wide.
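
When htop is not available, the parent chain of a suspicious PID can be walked by hand. The sketch below assumes a Linux /proc filesystem; ancestors is an illustrative name, not an existing command.

```shell
# ancestors PID: print "PID NAME" for the process and each of its parents,
# walking PPid entries in /proc/<pid>/status until the chain ends.
# Only the first word of the process name is printed.
ancestors() {
  pid="$1"
  while [ "$pid" -gt 0 ] 2>/dev/null && [ -r "/proc/$pid/status" ]; do
    awk '/^Name:/ { name = $2 } /^Pid:/ { p = $2 } END { print p, name }' \
      "/proc/$pid/status"
    pid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status")
  done
}
```

For example, ancestors 4321 (a placeholder PID) lists the process first and its topmost ancestor last, which gives the same parent information as one branch of the tree view.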

Killing Multiple Processes at Once

  1. Narrow the list with / (search) or F4 (filter)
  2. Press Space on each target to tag (multi-select)
  3. Press F9 to choose a signal → it's broadcast to all tagged processes

SIGKILL (9) is a last resort. Try SIGTERM (15) first, then escalate. SIGKILL on databases or processes mid-write can corrupt data.
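
Outside htop, the same SIGTERM-first escalation can be scripted. This is a minimal sketch, not a hardened tool: graceful_kill is a hypothetical helper, and the default 10-second grace period is an arbitrary choice that should be adjusted to how long the target needs to flush.

```shell
# graceful_kill PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 10) for the process to exit, and only then send SIGKILL.
graceful_kill() {
  pid="$1"
  timeout="${2:-10}"
  kill -TERM "$pid" 2>/dev/null || return 1   # already gone, or not permitted
  i=0
  while [ "$i" -lt "$timeout" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited after SIGTERM
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null               # last resort
}
```

For example, graceful_kill 4321 30 (placeholder PID) gives the process 30 seconds to shut down cleanly before forcing it.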

Practical Bottleneck Diagnosis Workflow

The 3-step flow I run during real incidents.

Step 1: Narrow the Layer with top (10 sec)

$ top -b -n 1 | head -5

-b (batch mode) with -n 1 (one iteration) prints a single snapshot to stdout, so it can be piped or appended to a log.

  • %wa spike → suspect I/O layer (next: iostat)
  • %us spike → suspect application layer (next: htop to find PID)
  • %sy spike → suspect kernel layer (next: strace, perf)
  • Swap used climbing → suspect memory layer (next: Shift + M sort)
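
Because the Step 1 snapshot is plain text, it can be appended to a log with a timestamp so that successive snapshots stay distinguishable. A small sketch follows; stamp is an invented helper name and the log path in the usage note is only a placeholder.

```shell
# stamp: prefix every stdin line with one UTC timestamp, taken when the
# function starts, so all lines of a snapshot share the same time.
stamp() {
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  while IFS= read -r line; do
    printf '%s %s\n' "$ts" "$line"
  done
}
```

Typical use would be top -b -n 1 | head -5 | stamp >> /var/tmp/top-triage.log, run by hand or from a loop during an incident.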

Step 2: Find the Suspect Process with htop (30 sec)

$ htop
# Shift + P / M / T to sort by your axis of interest
# F5 for tree view to inspect parent-child relationships

Step 3: Drill into Root Cause

  • Application layer → strace -p PID -f, application logs
  • I/O layer → iotop -o, iostat -x 1
  • Kernel layer → dmesg -T, journalctl -k
  • Memory layer → dmesg | grep -i oom, app leak profiling

Field template: 30-second triage

# 1. Overall picture
top -b -n 1 | head -20

# 2. Load trend
uptime

# 3. Real memory state
free -h

# 4. I/O wait detail
iostat -xz 1 3

# 5. Find the culprit (interactive)
htop

Common Mistakes to Avoid

Pitfalls when using top / htop

  • Treating load average alone as "CPU shortage" (could be %wa-driven)
  • Treating low free as "memory shortage" (look at avail Mem)
  • Trusting a single snapshot (observe for at least 10 seconds to filter transient spikes)
  • Reaching for SIGKILL first (denies the process a chance to flush)
  • Killing parent processes without checking the tree view (orphan-process risk)
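
To avoid the single-snapshot trap, one option is to average a CPU field over several batch iterations. The sketch below assumes the procps-ng %Cpu(s) line format with a dot decimal separator; avg_cpu_field is a hypothetical helper name.

```shell
# avg_cpu_field [FIELD]: read "top -b" output covering several iterations
# on stdin and print the mean of FIELD (us, sy, wa, ...; default wa)
# across all %Cpu lines, with 1 decimal. Exits non-zero if none are found.
avg_cpu_field() {
  field="${1:-wa}"
  awk -v f="$field" '
    /^%Cpu/ {
      sub(/^%Cpu[^:]*:/, "")
      n = split($0, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")
        if (kv[2] == f) { sum += kv[1]; samples++ }
      }
    }
    END {
      if (samples) { printf "%.1f\n", sum / samples } else { exit 1 }
    }
  '
}
```

Typical use would be LC_ALL=C top -b -n 6 -d 2 | avg_cpu_field wa, which spans roughly 10 seconds. A brief spike that vanishes in the average is usually transient; a high average is worth a drill-down.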

Summary

  • top's line 3 (CPU breakdown) and lines 4-5 (memory) are the diagnostic entry points
  • Which of %us / %sy / %wa is high tells you which layer to dig into
  • htop's tree view, filters, and tagging make process hunting efficient
  • The real-world pattern: 30-second triage → identify layer → specialist tools (iostat / strace / etc.) for deep dive
