Diagnosing High Load Average: CPU-Bound vs I/O-Bound
What You'll Learn
- How to tell whether a high load average is CPU-bound or I/O-bound
- What Linux load average actually counts (it includes I/O wait)
- Exactly where to look in uptime, top, vmstat, and iostat
Quick Summary
- Check load average with uptime and core count with nproc. load > cores means overload.
- In vmstat 1, a high r column (runnable) means CPU saturation; a high b column plus wa means I/O wait.
- Once the direction is clear, chase CPU with top/ps and I/O with iostat -x.
Assumptions (environment)
- OS: Ubuntu / general Linux
- Shell: bash
- Tools: vmstat ships in the procps package (installed by default); iostat and mpstat ship in the sysstat package (sudo apt install sysstat if missing)
What is load average, and what do the numbers mean?
Conclusion: Load average is an exponential moving average of the number of processes running, waiting to run, or in I/O wait; the three values from uptime are the 1-, 5-, and 15-minute averages.
Check it with uptime or cat /proc/loadavg.
$ uptime
 14:23:05 up 10 days,  3:42,  2 users,  load average: 4.85, 3.12, 1.90
$ cat /proc/loadavg
4.85 3.12 1.90 5/812 23145
The three numbers are the 1-, 5-, and 15-minute averages, left to right. Comparing them shows the trend.
- 1-min > 15-min (e.g. 4.85 vs 1.90) → load is rising
- 1-min < 15-min → load is falling
- All three close → load is steadily high
In /proc/loadavg, the 4th field 5/812 is "running processes / total processes", and the 5th is the most recently created PID.
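The trend comparison above can be sketched as a small shell helper. This is only an illustration: load_trend is a hypothetical function name, and the 10% tolerance that separates "steady" from rising or falling is an arbitrary assumption, not a standard threshold.

```shell
# load_trend: hypothetical helper that classifies the trend from a
# /proc/loadavg-style line by comparing the 1-minute and 15-minute values.
load_trend() {
    # $1: a line such as "4.85 3.12 1.90 5/812 23145"
    echo "$1" | awk '{
        one = $1 + 0; fifteen = $3 + 0
        tol = 0.10 * fifteen              # assumed tolerance: within 10% counts as steady
        if (one > fifteen + tol)          print "rising"
        else if (one < fifteen - tol)     print "falling"
        else                              print "steady"
    }'
}

# Usage on a live system:
#   load_trend "$(cat /proc/loadavg)"
```

Taking the line as an argument rather than reading /proc/loadavg directly keeps the helper easy to try against saved samples.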
How high is "too high" for load average?
Conclusion: Never judge by the raw number. Compare it to the CPU core count: when load average ÷ cores exceeds 1.0, work is arriving faster than the cores can clear it and processes are waiting.
Load average counts processes that are running as well as those waiting, so more cores raise the acceptable value.
$ nproc
4
- 4 cores, load average 4.0 → essentially fully used (queue near zero)
- 4 cores, load average 8.0 → 2x overload (half the processes are queued)
Use load ÷ cores as a rule of thumb.
| load ÷ cores | State |
|---|---|
| Below 0.7 | Headroom |
| 0.7–1.0 | Near full (watch it) |
| > 1.0 | Overloaded (queuing) |
This rule of thumb is for the CPU-saturation case. On Linux, I/O wait also counts toward load, so you can be over the core count while the CPU is mostly idle. The next section splits the two.
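As a rough sketch, the load ÷ cores rule of thumb from the table can be computed like this. load_per_core is a hypothetical helper name; the cutoffs are the ones in the table, and the result only makes sense for the CPU-saturation case.

```shell
# load_per_core: hypothetical helper applying the rule-of-thumb table.
load_per_core() {
    # $1: load average (e.g. the 1-minute value), $2: core count
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio > 1.0)        state = "overloaded"
        else if (ratio >= 0.7)  state = "near full"
        else                    state = "headroom"
        printf "%.2f %s\n", ratio, state
    }'
}

# Usage on a live system:
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

With the sample values from this article (load 4.85 on 4 cores) the ratio is about 1.21, which falls in the overloaded row.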
Why doesn't Linux load average match CPU usage?
Conclusion: Linux load average counts not only runnable processes (TASK_RUNNING) but also uninterruptible I/O waiters (TASK_UNINTERRUPTIBLE, the D state). That's why load can spike even when CPU usage is low.
Where most UNIX systems put only the CPU run-queue length into load, Linux also includes uninterruptible sleep (the D state in ps). The D state mostly comes from waiting on disk I/O or network storage.
This makes seemingly contradictory states possible.
- load average 8.0 but CPU usage (us+sy) is 10% → the rest is stacked-up I/O wait
- Slow storage or a hung NFS mount can send load alone soaring
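One way to see which of the two states is feeding the load right now is to count tasks in R and D state. The sketch below assumes procps ps, where `ps -eLo stat=` prints one state string per thread with no header; count_load_states is a hypothetical helper name. It is a point-in-time snapshot, so it will not match the averaged load figure exactly, and the ps process itself shows up as one R.

```shell
# count_load_states: hypothetical helper that counts runnable (R) and
# uninterruptible (D) entries from a list of ps state strings on stdin.
count_load_states() {
    awk '{ s = substr($1, 1, 1) }       # first character is the main state
         s == "R" { r++ }
         s == "D" { d++ }
         END { printf "R=%d D=%d\n", r, d }'
}

# Usage on a live system (one line per thread):
#   ps -eLo stat= | count_load_states
```

A large D count next to a small R count points at I/O wait rather than CPU demand.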
"High load average" does not always mean "busy CPU". CPU saturation and I/O wait have different causes and different fixes, so split them first.
How do you tell CPU wait from I/O wait?
Conclusion: Read the r column (runnable) and b column (blocked on I/O) in vmstat 1, and %wa (iowait) in top. A large r means CPU saturation; a large b and wa mean I/O wait.
Step 1: Get the overall trend with vmstat
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 124560  20480 890120    0    0     8    12  210  430 78 12 10  0  0
 5  0      0 124100  20480 890120    0    0     0     0  198  410 80 11  9  0  0
Two columns matter.
- r (runnable): processes running or waiting for the CPU. Consistently above the core count → CPU saturation.
- b (blocked): processes in uninterruptible sleep (I/O, etc.). Large → I/O wait.
Also read wa (iowait, the share of time waiting for I/O) in the CPU columns. Above, r=6 exceeds the 4 cores and wa=0, so it is CPU-bound.
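Reading a single row can mislead, so it may help to average r, b, and wa over several samples. The following is a minimal sketch: vmstat_summary is a hypothetical helper, it locates the columns by their header names, and it skips the first sample because the vmstat man page describes the first report as averages since the last reboot.

```shell
# vmstat_summary: hypothetical helper that averages the r, b and wa columns
# of vmstat output read from stdin, ignoring the first (since-boot) sample.
vmstat_summary() {
    awk '
        $1 == "r" && $2 == "b" {              # header row: map column names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("r" in col) && $1 ~ /^[0-9]+$/ {     # data rows
            n++
            if (n == 1) next                  # first sample is the since-boot average; skip it
            r += $(col["r"]); b += $(col["b"]); wa += $(col["wa"]); rows++
        }
        END {
            if (rows == 0) { print "no samples"; exit 1 }
            printf "avg_r=%.1f avg_b=%.1f avg_wa=%.1f\n", r / rows, b / rows, wa / rows
        }'
}

# Usage on a live system:
#   vmstat 1 6 | vmstat_summary
```

Compare avg_r against the nproc value, and read avg_b together with avg_wa, as described above.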
Step 2: Confirm the breakdown with top
$ top
Read the CPU breakdown in the header.
%Cpu(s): 78.0 us, 12.0 sy, 0.0 ni, 10.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
- High us (user) + sy (system) → CPU saturation. Press P to sort by CPU and find the culprit.
- High wa (iowait) → I/O wait; the CPU is idling while it waits for disk and friends.
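For scripting, the same header line can be pulled from top in batch mode and reduced to the two numbers that matter here. This sketch assumes the procps-ng header layout shown above ("%Cpu(s): 78.0 us, ..."); other top versions or locales may format the line differently. cpu_breakdown is a hypothetical helper name.

```shell
# cpu_breakdown: hypothetical helper that reads top output on stdin and
# prints us+sy as "busy" and wa as "iowait" for each %Cpu(s) header line.
cpu_breakdown() {
    awk -F'[:,]' '/^%Cpu/ {
        for (i = 2; i <= NF; i++) {
            split($i, kv, " ")          # kv[1] = value, kv[2] = label (us, sy, wa, ...)
            v[kv[2]] = kv[1]
        }
        printf "busy=%.1f iowait=%.1f\n", v["us"] + v["sy"], v["wa"]
    }'
}

# Usage on a live system (batch mode, one iteration):
#   top -bn1 | cpu_breakdown
```

The first batch iteration may not reflect the most recent interval, so running more iterations and reading the last line is likely to be more representative.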
Step 3: Name the D-state processes stuck on I/O
When wa is high, find which processes are blocked on I/O via the ps state column (STAT).
$ ps -eo pid,stat,comm,wchan | awk '$2 ~ /D/'
1842 D    mysqld  wait_on_page_bits
1990 D+   dd      balance_dirty_pages
Processes whose STAT is D (uninterruptible sleep) are the I/O waiters. The wchan (the kernel function they sleep in) hints at the cause.
Step 4: Corroborate on the disk side with iostat
$ iostat -x 1 3
Device            r/s     w/s     rkB/s     wkB/s  await aqu-sz  %util
sda              8.00  420.00     64.00   51200.0  85.20   9.80   99.6
- %util near 100% → that device is saturated (it cannot push more I/O).
- Large await (average ms per I/O) → the disk is slow.
- Large aqu-sz (average queue length) → I/O is piling up.
A %util near 100% is the clincher for I/O-bound load.
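To flag saturated devices without reading every row, the %util column can be filtered with awk. A minimal sketch, assuming the header row starts with "Device" and that device rows follow it until a blank line; saturated_devices is a hypothetical helper and the default threshold of 90 is an illustrative choice. The column is found by name because the column set differs between sysstat versions.

```shell
# saturated_devices: hypothetical helper that prints "device %util" for
# every row of iostat -x output (stdin) at or above the given threshold.
saturated_devices() {
    # $1: %util threshold (assumed default: 90)
    awk -v limit="${1:-90}" '
        NF == 0 { u = 0; next }                 # blank line ends a device table
        $1 == "Device" || $1 == "Device:" {     # header row: locate the %util column
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        u && NF >= u && $u + 0 >= limit + 0 { print $1, $u }
    '
}

# Usage on a live system (-y omits the first since-boot report):
#   iostat -xy 1 3 | saturated_devices 90
```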
Decision cheat sheet
- vmstat r high, wa low → CPU saturation
- vmstat b high, top wa high, iostat %util high → I/O wait
- Both high → CPU and I/O are chained (e.g. a DB full-table scan). Look at both sides.
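The cheat sheet can be sketched as a small decision function. classify_load is a hypothetical helper, and the numeric cutoffs (r above the core count, any blocked process, wa of at least 10%) are illustrative assumptions rather than fixed rules; it also leaves out the iostat %util check, which should still be confirmed by hand.

```shell
# classify_load: hypothetical helper turning vmstat readings into a verdict.
classify_load() {
    # $1: vmstat r, $2: vmstat b, $3: wa percent, $4: core count
    awk -v r="$1" -v b="$2" -v wa="$3" -v cores="$4" 'BEGIN {
        cpu = (r + 0 > cores + 0)               # more runnable tasks than cores
        io  = (b + 0 > 0 && wa + 0 >= 10)       # assumed cutoff: 10% iowait
        if (cpu && io)  print "both"
        else if (cpu)   print "cpu-bound"
        else if (io)    print "io-bound"
        else            print "unclear"
    }'
}

# Usage with the vmstat sample from Step 1 (r=6, b=0, wa=0, 4 cores):
#   classify_load 6 0 0 4
```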
What to do when CPU saturation is the cause?
Conclusion: Find the CPU-hungry process with top/ps. If it's runaway, renice or stop it; if it's chronic, consider more cores, distributing the work, or optimizing the code.
$ ps -eo pid,pcpu,comm --sort=-pcpu | head
- A one-off runaway → stop the offending process, or lower its priority with renice.
- Chronically above the core count → scale up (more cores), parallelize/distribute the work, or improve the algorithm.
- Only at certain times → suspect bunched cron/batch jobs and spread their schedules.
For the full culprit-hunting workflow, see Diagnosing 100% CPU usage.
What to do when I/O wait is the cause?
Conclusion: Pinpoint the saturated device with iostat, then find the process generating I/O with iotop. Log bloat, heavy swapping, and slow storage are the usual culprits.
$ sudo iotop -o
- One process writes heavily → suspect excessive logging, a full-table scan, and the like.
- si/so (vmstat swap columns) are moving → memory pressure is causing swap. See Investigating memory pressure.
- The storage itself is slow → replace the device, revisit the I/O scheduler, or cut reads/writes.
For a deeper disk-I/O dive, see Diagnosing slow disk I/O.
Pitfalls to avoid
- Judging by the raw load number alone (always pair it with the core count).
- Ignoring wa and rushing to add CPUs (more cores won't fix I/O wait).
- Trying to force-kill D-state processes with kill -9 (they often won't die until the I/O completes).