Diagnosing High Load Average: CPU-Bound vs I/O-Bound

What You'll Learn

  • How to tell whether a high load average is CPU-bound or I/O-bound
  • What Linux load average actually counts (it includes I/O wait)
  • Exactly where to look in uptime, top, vmstat, and iostat

Quick Summary

  1. Check load average with uptime and core count with nproc. A load above the core count suggests overload, though on Linux the excess may be I/O wait rather than CPU demand.
  2. In vmstat 1, a high r column (runnable) means CPU saturation; a high b column plus wa means I/O wait.
  3. Once the direction is clear, chase CPU with top/ps and I/O with iostat -x.

Assumptions (environment)

  • OS: Ubuntu / general Linux
  • Shell: bash
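  • Tools: vmstat ships in the procps package (normally preinstalled); iostat and mpstat ship in the sysstat package (sudo apt install sysstat if missing); iotop is a separate package (sudo apt install iotop)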

What is load average, and what do the numbers mean?

Conclusion: Load average is an exponentially damped moving average of the number of processes that are running, waiting to run, or blocked in uninterruptible sleep (usually I/O wait); the three values from uptime are the 1-, 5-, and 15-minute averages.

Check it with uptime or cat /proc/loadavg.

$ uptime
 14:23:05  up 10 days,  3:42,  2 users,  load average: 4.85, 3.12, 1.90
$ cat /proc/loadavg
4.85 3.12 1.90 5/812 23145

The three numbers are the 1-, 5-, and 15-minute averages, left to right. Comparing them shows the trend.

  • 1-min > 15-min (e.g. 4.85 vs 1.90) → load is rising
  • 1-min < 15-min → load is falling
  • All three close → load is steadily high

In /proc/loadavg, the 4th field 5/812 is "currently runnable tasks / total tasks" (tasks here means processes and threads), and the 5th is the most recently created PID.
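
The trend comparison above can be sketched as a small shell function. This is a minimal sketch, not a standard tool: the name load_trend and the 10% band used to decide "steady" are assumptions made for illustration.

```shell
# load_trend: classify the trend from a /proc/loadavg-style line.
# The 10% "steady" band is an illustrative assumption, not a standard.
load_trend() {
    printf '%s\n' "$1" | awk '{
        one = $1; fifteen = $3
        band = 0.10 * fifteen              # tolerance around the 15-minute value
        if (one > fifteen + band)      print "rising"
        else if (one < fifteen - band) print "falling"
        else                           print "steady"
    }'
}

# Sample line from above; on a live system pass "$(cat /proc/loadavg)".
load_trend "4.85 3.12 1.90 5/812 23145"    # prints: rising
```

Comparing only the 1- and 15-minute values keeps the check simple; the 5-minute value is useful when you want to see how quickly the change is happening.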

How high is "too high" for load average?

Conclusion: Never judge by the raw number. Compare it to the CPU core count: when load average ÷ cores exceeds 1.0, work is arriving faster than the cores can clear it and processes are waiting.

Load average counts processes that are running as well as those waiting, so more cores raise the acceptable value.

$ nproc
4
  • 4 cores, load average 4.0 → essentially fully used (queue near zero)
  • 4 cores, load average 8.0 → 2x overload (half the processes are queued)

Use load ÷ cores as a rule of thumb.

load ÷ cores   State
< 0.7          Headroom
0.7–1.0        Near full (watch it)
> 1.0          Overloaded (queuing)

This rule of thumb is for the CPU-saturation case. On Linux, I/O wait also counts toward load, so you can be over the core count while the CPU is mostly idle. The next section splits the two.
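
As a rough sketch of the load ÷ cores rule of thumb, the function below takes a load value and a core count and prints the ratio with its state. The name load_state is hypothetical, and the 0.7 and 1.0 cutoffs are simply the rule-of-thumb values from the table, not hard limits.

```shell
# load_state LOAD CORES: print "ratio state" using the rule-of-thumb cutoffs.
load_state() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores + 0 <= 0) { print "error: cores must be positive"; exit 1 }
        ratio = load / cores
        if (ratio > 1.0)       state = "overloaded"
        else if (ratio >= 0.7) state = "near full"
        else                   state = "headroom"
        printf "%.2f %s\n", ratio, state
    }'
}

# Values from the examples above. On a live system:
#   load_state "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
load_state 4.85 4    # prints: 1.21 overloaded
```

Remember that "overloaded" here only says the ratio is above 1.0; it does not yet say whether the CPU or the storage is the bottleneck.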

Why doesn't Linux load average match CPU usage?

Conclusion: Linux load average counts not only runnable processes (TASK_RUNNING) but also uninterruptible I/O waiters (TASK_UNINTERRUPTIBLE, the D state). That's why load can spike even when CPU usage is low.

Where most UNIX systems put only the CPU run-queue length into load, Linux also includes uninterruptible sleep (the D state in ps). The D state mostly comes from waiting on disk I/O or network storage.

This makes seemingly contradictory states possible.

  • load average 8.0 but CPU usage (us+sy) is 10% → the rest is stacked-up I/O wait
  • Slow storage or a hung NFS mount can send load alone soaring

"High load average" does not always mean "busy CPU". CPU saturation and I/O wait have different causes and different fixes, so split them first.
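
To see the two contributors separately, one option is to tally tasks by state. The sketch below counts R (runnable) and D (uninterruptible) entries from a list of STAT values; the function name count_states is an assumption, and the result is an instantaneous snapshot, so it will only roughly track the averaged load figure.

```shell
# count_states: read one STAT value per line and tally R (runnable) and
# D (uninterruptible) tasks, the two states that feed Linux load average.
count_states() {
    awk '
        /^R/ { r++ }
        /^D/ { d++ }
        END  { printf "runnable=%d uninterruptible=%d\n", r, d }
    '
}

# Made-up sample STAT values for illustration. On a live system
# (-L lists threads, stat= drops the header; ps itself shows up as R):
#   ps -eLo stat= | count_states
printf '%s\n' R+ S Ss D D+ Sl R | count_states
# prints: runnable=2 uninterruptible=2
```

A large uninterruptible count next to a small runnable count is the "high load, idle CPU" pattern described above.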

How do you tell CPU wait from I/O wait?

Conclusion: Read the r column (runnable) and b column (blocked on I/O) in vmstat 1, and %wa (iowait) in top. A large r means CPU saturation; a large b and wa mean I/O wait.

Step 1: Get the overall trend with vmstat

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 124560  20480 890120    0    0     8    12  210  430 78 12 10  0  0
 5  0      0 124100  20480 890120    0    0     0     0  198  410 80 11  9  0  0

Two columns matter.

  • r (runnable): processes running or waiting for the CPU. Consistently above the core count → CPU saturation.
  • b (blocked): processes in uninterruptible sleep (I/O, etc.). Large → I/O wait.

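Also read wa (iowait, the share of time waiting for I/O) in the CPU columns. Above, r stays at 5 to 6, which exceeds the 4 cores, and wa=0, so it is CPU-bound. Note that the first row vmstat prints is an average since boot; judge by the rows that follow it.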
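
Eyeballing several rows gets tedious, so here is a minimal sketch that averages r, b, and wa across the samples. The function name vmstat_avg is hypothetical. It finds the columns by their header names and skips the first data row, because vmstat's first report is an average since boot.

```shell
# vmstat_avg: average the r, b and wa columns of vmstat output,
# skipping the first data row (averages since boot).
vmstat_avg() {
    awk '
        $1 == "r" && $2 == "b" {               # column-name header row
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $1 ~ /^[0-9]+$/ && ("wa" in col) {     # a data row after the header
            seen++
            if (seen == 1) next                # drop the since-boot row
            r += $(col["r"]); b += $(col["b"]); wa += $(col["wa"]); n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "r=%.1f b=%.1f wa=%.1f\n", r / n, b / n, wa / n
        }
    '
}

# Sample output from above; on a live system: vmstat 1 5 | vmstat_avg
vmstat_avg <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 124560  20480 890120    0    0     8    12  210  430 78 12 10  0  0
 5  0      0 124100  20480 890120    0    0     0     0  198  410 80 11  9  0  0
EOF
# prints: r=5.0 b=0.0 wa=0.0
```

Compare the averaged r against the core count from nproc, and treat a nonzero b together with a high wa as a pointer toward I/O.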

Step 2: Confirm the breakdown with top

$ top

Read the CPU breakdown in the header.

%Cpu(s): 78.0 us, 12.0 sy,  0.0 ni, 10.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  • High us (user) + sy (system) → CPU saturation. Press P to sort by CPU and find the culprit.
  • High wa (iowait) → I/O wait; the CPU is idling while it waits for disk and friends.
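
If you want the same header figures in a script, top's batch mode can be parsed. The sketch below assumes the "%Cpu(s):" line format shown above and a C locale (decimal points, not commas); the function name cpu_breakdown is made up for illustration.

```shell
# cpu_breakdown: pull busy (us+sy) and wa out of a top "%Cpu(s):" line.
# Assumes the procps top format with "." as the decimal separator.
cpu_breakdown() {
    awk '/^%Cpu/ {
        sub(/^%Cpu[^:]*:/, "")                 # drop the "%Cpu(s):" label
        gsub(/,/, " ")                         # "78.0 us," -> "78.0 us "
        for (i = 2; i <= NF; i++) {
            if ($i == "us") us = $(i - 1)
            if ($i == "sy") sy = $(i - 1)
            if ($i == "wa") wa = $(i - 1)
        }
        printf "busy=%.1f wa=%.1f\n", us + sy, wa
        exit                                   # first matching line only
    }'
}

# Header line from above; on a live system:
#   LC_ALL=C top -bn1 | cpu_breakdown
printf '%s\n' '%Cpu(s): 78.0 us, 12.0 sy,  0.0 ni, 10.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st' | cpu_breakdown
# prints: busy=90.0 wa=0.0
```

A single top iteration is a short snapshot, so treat one reading as a hint and repeat it a few times before drawing conclusions.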

Step 3: Name the D-state processes stuck on I/O

When wa is high, find which processes are blocked on I/O via the ps state column (STAT).

$ ps -eo pid,stat,comm,wchan | awk '$2 ~ /D/'
  1842 D    mysqld          wait_on_page_bits
  1990 D+   dd              balance_dirty_pages

Processes whose STAT is D (uninterruptible sleep) are the I/O waiters. The wchan (the kernel function they sleep in) hints at the cause.
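
When many processes are stuck, grouping them by wchan shows whether they share one cause. The following is a sketch under two assumptions: the function name dstate_by_wchan is hypothetical, and the ps columns are reordered to pid,stat,wchan,comm so that a command name containing spaces cannot shift the wchan field.

```shell
# dstate_by_wchan: count D-state processes per kernel wait function.
# Expects columns in the order pid, stat, wchan, comm.
dstate_by_wchan() {
    awk '$2 ~ /^D/ { count[$3]++ }
         END { for (w in count) printf "%d %s\n", count[w], w }' | sort -rn
}

# Sample rows based on the output above, with one extra mysqld row
# invented for illustration. On a live system:
#   ps -eo pid,stat,wchan:32,comm | dstate_by_wchan
dstate_by_wchan <<'EOF'
  PID STAT WCHAN                COMMAND
 1842 D    wait_on_page_bits    mysqld
 1843 D    wait_on_page_bits    mysqld
 1990 D+   balance_dirty_pages  dd
EOF
```

Many waiters in one function usually point at a single slow device or mount rather than at the individual processes.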

Step 4: Corroborate on the disk side with iostat

$ iostat -x 1 3
Device   r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
sda      8.00  420.00   64.00 51200.0  85.20    9.80   99.6
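  • %util near 100% → the device was busy for almost the whole interval. For a single spinning disk that means it is saturated; SSDs, NVMe drives, and RAID arrays serve requests in parallel and can still have headroom at 100%.
  • Large await (average ms per I/O, including time spent queued) → requests are slow to complete. Some sysstat versions report this split into r_await and w_await.
  • Large aqu-sz (average queue length) → I/O is piling up.

A %util near 100% together with high await and a growing aqu-sz is the clincher for I/O-bound load.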
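
To pick the busy devices out of a long iostat listing, a filter like the one below can help. It is a sketch: the name busy_devices and the default 90% cutoff are assumptions, and it looks up the %util column by its header name because the column set differs between sysstat versions.

```shell
# busy_devices [LIMIT]: print devices whose %util is at or above LIMIT
# (default 90, an arbitrary cutoff chosen for illustration).
busy_devices() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {                       # header row: index columns by name
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("%util" in col) && NF >= col["%util"] &&
        $(col["%util"]) + 0 >= limit + 0 {
            print $1, $(col["%util"])
        }
    '
}

# Sample output from above. On a live system (-y omits the first
# report, which covers the time since boot):
#   iostat -xy 1 3 | busy_devices 90
busy_devices 90 <<'EOF'
Device   r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
sda      8.00  420.00   64.00 51200.0  85.20    9.80   99.6
EOF
# prints: sda 99.6
```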

Decision cheat sheet

  • vmstat r high, wa low → CPU saturation
  • vmstat b high, top wa high, iostat %util high → I/O wait
  • Both high → CPU and I/O load are coupled (e.g. a DB full-table scan). Look at both sides.
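
The cheat sheet can be written down as a small decision function. Treat it as a sketch rather than a definitive rule: classify_load is a hypothetical name, and the cutoffs for "high" (wa at least 20%, %util at least 90%) are assumptions you should tune for your own systems.

```shell
# classify_load R B WA UTIL CORES: apply the cheat sheet to summary numbers.
# r and b come from vmstat, wa from vmstat or top, util from iostat -x.
classify_load() {
    awk -v r="$1" -v b="$2" -v wa="$3" -v util="$4" -v cores="$5" 'BEGIN {
        wa_limit = 20; util_limit = 90         # assumed cutoffs for "high"
        cpu = (r + 0 > cores + 0)
        io  = (b + 0 > 0 && wa + 0 >= wa_limit && util + 0 >= util_limit)
        if (cpu && io) print "both"
        else if (cpu)  print "cpu-bound"
        else if (io)   print "io-bound"
        else           print "unclear"
    }'
}

classify_load 6 0 0 0 4        # prints: cpu-bound
classify_load 1 3 45 99.6 4    # prints: io-bound
```

An "unclear" result is a cue to sample for longer or to look at individual processes, not a sign that nothing is wrong.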

What to do when CPU saturation is the cause?

Conclusion: Find the CPU-hungry process with top/ps. If it's runaway, renice or stop it; if it's chronic, consider more cores, distributing the work, or optimizing the code.

$ ps -eo pid,pcpu,comm --sort=-pcpu | head
  • A one-off runaway → stop the offending process, or lower its priority with renice.
  • Chronically above the core count → scale up (more cores), parallelize/distribute the work, or improve the algorithm.
  • Only at certain times → suspect bunched cron/batch jobs and spread their schedules.
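
To shortlist candidates from the ps listing above, a threshold filter is often enough. The sketch below uses a hypothetical function name, cpu_hogs, and an arbitrary default cutoff of 50%. Keep in mind that the pcpu value from ps is CPU time divided by the process's lifetime, not an instantaneous reading, so short bursts in long-lived processes will be understated.

```shell
# cpu_hogs [LIMIT]: from "ps -eo pid,pcpu,comm" output, print rows whose
# %CPU is at or above LIMIT (default 50, chosen only for illustration).
cpu_hogs() {
    awk -v limit="${1:-50}" 'NR > 1 && $2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Invented sample rows for illustration. On a live system:
#   ps -eo pid,pcpu,comm --sort=-pcpu | cpu_hogs 50
cpu_hogs 50 <<'EOF'
  PID %CPU COMMAND
 4321 97.5 ffmpeg
 1842 12.0 mysqld
EOF
# prints: 4321 97.5 ffmpeg

# To lower the priority of a PID found this way (example PID only):
#   sudo renice -n 10 -p 4321
```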

For the full culprit-hunting workflow, see Diagnosing 100% CPU usage.

What to do when I/O wait is the cause?

Conclusion: Pinpoint the saturated device with iostat, then find the process generating I/O with iotop. Log bloat, heavy swapping, and slow storage are the usual culprits.

$ sudo iotop -o
  • One process writes heavily → suspect excessive logging, a full-table scan, and the like.
  • si/so (vmstat swap columns) are moving → memory pressure is causing swap. See Investigating memory pressure.
  • The storage itself is slow → replace the device, revisit the I/O scheduler, or cut reads/writes.
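
For the swap case in particular, the si/so check can be automated along these lines. This is a sketch with a hypothetical name, swap_active; it reports how many vmstat samples show any swap-in or swap-out and, as with any vmstat reading, skips the first since-boot row.

```shell
# swap_active: count vmstat samples with nonzero si or so,
# skipping the first data row (averages since boot).
swap_active() {
    awk '
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $1 ~ /^[0-9]+$/ && ("si" in col) {
            seen++
            if (seen == 1) next
            if ($(col["si"]) > 0 || $(col["so"]) > 0) hits++
            n++
        }
        END { printf "%d of %d samples show swap traffic\n", hits, n }
    '
}

# Invented sample rows for illustration; live: vmstat 1 10 | swap_active
swap_active <<'EOF'
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  20480  51200  10240 204800    0    0     8    12  210  430  5  2 93  0  0
 1  2  20480  10240  10240 204800  120  340   900  1500  310  620  6  4 40 50  0
 0  3  20992   9216  10240 204800    0  512   200  2100  305  600  5  4 36 55  0
EOF
# prints: 2 of 2 samples show swap traffic
```

Sustained swap traffic means the I/O wait is a symptom of memory pressure, so fixing the disk alone will not help.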

For a deeper disk-I/O dive, see Diagnosing slow disk I/O.

Pitfalls to avoid

  • Judging by the raw load number alone (always pair it with the core count).
  • Ignoring wa and rushing to add CPUs (more cores won't fix I/O wait).
  • Trying to force-kill D-state processes with kill -9 (they often won't die until the I/O completes).

Summary