How to Troubleshoot Disk I/O on Linux - iostat and vmstat Guide

What You'll Learn

  • How to determine if disk I/O is the bottleneck when the server is slow
  • How to use iostat / vmstat to distinguish CPU wait vs disk congestion
  • How to get past the "I don't know what's happening" stage by reading concrete metrics

Quick Summary

When suspecting disk I/O issues:

  1. CPU idle but slow? -> iostat -xz 1
  2. High I/O wait? -> Check %iowait
  3. Write congestion? -> Check await / %util
  4. Confirm disk is the cause? -> vmstat 1

Prerequisites

  • OS: Ubuntu
  • Target: Server beginners
  • Goal: Isolation and diagnosis

1. What is "Slow Disk I/O"?

Conclusion: Slow disk I/O means %iowait rises even when CPU utilization looks low.

"Slow disk" usually means one of these:

  • Application waiting for disk response
  • Log writes getting backed up
  • DB / Docker / backups consuming I/O
  • CPU is idle but stuck waiting on I/O

Looking at CPU% alone can mislead you: a CPU that is idle because it is waiting on disk still shows low utilization.

2. Installing iostat

Conclusion: Install sysstat via apt to get iostat; it is typically not installed by default on Ubuntu.

$ sudo apt update
$ sudo apt install -y sysstat
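
To confirm the install worked, a check along these lines may help. This is a minimal sketch: have_cmd and check_tools are hypothetical helper names introduced here for illustration, not part of sysstat.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print "found" or "missing" for each tool name given as an argument.
check_tools() {
    for tool in "$@"; do
        if have_cmd "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Usage:
#   check_tools iostat vmstat
```

If iostat is reported as missing after the install, check the apt output for errors before going further.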

3. Using iostat -xz

Conclusion: Run iostat -xz 1 for per-second stats — -x reveals the key I/O metrics.

$ iostat -xz 1

Options:

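  • -x: Extended stats (essential)
  • -z: Omit devices with no activity during the interval (cleaner output)
  • 1: Update every 1 second

The first report shows averages since boot, so judge from the second report onward (or add -y to skip the first report).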

4. Key Metrics (Core Knowledge)

Conclusion: Focus on %iowait (CPU wait), %util (saturation), await (latency) together.

4-1. %iowait (CPU side)

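  • 10%+: Notable I/O wait
  • 20%+: Disk I/O is very likely the bottleneck

%iowait is the share of time the CPU sat idle while an I/O request was outstanding: the CPU wants to work but is stuck waiting on disk. Treat these thresholds as rules of thumb.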

4-2. %util (Disk side)

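  • 70% or less: Plenty of headroom
  • 70-90%: Getting congested
  • Sustained near 100%: Saturated

On devices that serve requests in parallel (RAID arrays, modern SSDs), %util does not reflect the real performance limit, so read it together with await.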

4-3. await (Wait time)

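  • A few ms: Normal
  • 10ms+: Slow
  • 50ms+: Noticeably slow
  • 100ms+: Critical

await is the average time per request in milliseconds, including time spent in the queue. Newer sysstat versions split it into r_await (reads) and w_await (writes).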

4-4. r/s w/s (Read/Write rate)

r/s and w/s are the reads and writes completed per second; comparing them tells you whether reads or writes are the problem. Log bloat and DB writes push w/s up.
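
The thresholds in this section can be applied mechanically. The sketch below reads iostat -x text on standard input and prints devices at or above 80% util or 10 ms await. flag_busy_devices is a hypothetical helper name. It looks columns up by header name because the layout differs between sysstat versions, and it assumes a locale that prints decimals with a dot.

```shell
# flag_busy_devices [UTIL_MAX] [AWAIT_MAX]
# Read `iostat -x` output on stdin and print devices whose %util or
# await-type latency reaches the given thresholds (defaults: 80 and 10).
flag_busy_devices() {
    awk -v util_max="${1:-80}" -v await_max="${2:-10}" '
        $1 ~ /^Device/ {
            # Remember the position of each column name in the header row.
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            in_table = 1
            next
        }
        NF == 0 { in_table = 0; next }   # a blank line ends the device table
        in_table {
            util = $(col["%util"]) + 0
            # Take the worst of await, r_await, w_await, whichever exist.
            worst = 0
            for (name in col)
                if (name ~ /await$/ && $(col[name]) + 0 > worst)
                    worst = $(col[name]) + 0
            if (util >= util_max || worst >= await_max)
                printf "%s util=%.1f%% await=%.1fms\n", $1, util, worst
        }
    '
}

# Usage (five one-second samples, then exits):
#   LC_ALL=C iostat -xz 1 5 | flag_busy_devices
```

A device that shows up in every sample is a stronger signal than one that appears once.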

5. Common Patterns

Conclusion: Three patterns: iowait+util high, CPU+I/O combined, or await high with low util.

Pattern A: CPU low but slow

  • %iowait high
  • %util high

Classic disk I/O bottleneck

Pattern B: Both CPU and I/O high

  • %user / %system high
  • %iowait also high

Heavy processing + heavy disk writes combined

Pattern C: %util low but still slow

  • %util low
  • await high

Storage itself is slow (network storage, EBS, etc.)
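
The three patterns can be written as a small decision function. This is a sketch under stated assumptions: classify_io_pattern is a hypothetical name, the 10% iowait, 80% util, and 10 ms await cutoffs come from section 4, and the 70% cutoff for "CPU high" is an assumption because the text gives no number for it.

```shell
# classify_io_pattern IOWAIT UTIL AWAIT CPU_BUSY
#   IOWAIT   = %iowait, UTIL = %util, AWAIT = await in ms,
#   CPU_BUSY = %user + %system (the 70 cutoff below is an assumption)
classify_io_pattern() {
    awk -v iowait="$1" -v util="$2" -v await="$3" -v busy="$4" 'BEGIN {
        hi_iowait = (iowait >= 10)
        hi_util   = (util >= 80)
        hi_await  = (await >= 10)
        hi_cpu    = (busy >= 70)
        if (hi_cpu && hi_iowait)
            print "B: CPU and I/O both high"
        else if (hi_iowait && hi_util)
            print "A: classic disk I/O bottleneck"
        else if (!hi_util && hi_await)
            print "C: storage itself is slow"
        else
            print "none: no pattern from this section matches"
    }'
}

# Usage, with values read off iostat by hand:
#   classify_io_pattern 35 98 60 10
```

The function checks Pattern B first so that a busy CPU is not mistaken for a pure disk problem.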

6. Using vmstat

Conclusion: Use vmstat 1 — high b with low r column confirms I/O is the bottleneck.

$ vmstat 1

Key columns:

  • r: Runnable processes (running or waiting for CPU)
  • b: Processes blocked waiting on I/O (high = trouble)
  • wa: I/O wait (10%+ = warning)

If r is low but b is high, I/O is the cause, not CPU.
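
The r, b, and wa columns can be pulled out of vmstat output so the comparison is easier to see. The following is a minimal sketch: flag_io_blocked is a hypothetical helper name, and reading "r low, b high" as "b greater than r with wa at 10% or more" is an assumption made for illustration.

```shell
# flag_io_blocked: read `vmstat` output on stdin, print r, b and wa per
# sample, and mark samples where blocked processes outnumber runnable ones.
flag_io_blocked() {
    awk '
        $1 == "r" && $2 == "b" {
            # Header row: remember where each column is.
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data rows start with a number; banner lines do not.
        ("wa" in col) && $1 ~ /^[0-9]+$/ {
            r  = $(col["r"]) + 0
            b  = $(col["b"]) + 0
            wa = $(col["wa"]) + 0
            mark = (b > r && wa >= 10) ? "  <- I/O-bound" : ""
            printf "r=%d b=%d wa=%d%s\n", r, b, wa, mark
        }
    '
}

# Usage (five one-second samples, then exits):
#   vmstat 1 5 | flag_io_blocked
```

The first vmstat line reports averages since boot, so give more weight to the later samples.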

7. Confirming Disk is the Culprit

Conclusion: Disk is confirmed when %iowait, %util, await, and vmstat b are elevated.

If ALL of these are true, disk is almost certainly the cause:

  • %iowait is high
  • %util is high
  • await is spiking
  • vmstat b column is increasing

8. Common Causes

Conclusion: Docker, databases, log bloat, and backups are the most common disk I/O culprits.

  • Docker (logs, layers, overlayfs)
  • Databases (MySQL / PostgreSQL)
  • Log bloat (access.log, error.log)
  • Backups (rsync, tar)
  • Heavy cron jobs

9. What NOT to Do

Conclusion: Three mistakes: CPU-only judgment, rebooting early, and trusting %util alone.

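Mistake 1: Judging "All Clear" by CPU Alone

Low CPU usage does not rule out a problem. Overlooking I/O wait is a classic way to miss the real bottleneck.

Mistake 2: Rebooting Immediately

A reboot wipes out the live I/O evidence. Observe first.

Mistake 3: Trusting %util Alone

await can be very high even when %util looks fine (Pattern C).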
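
One way to "observe first" is to save a few samples to disk before touching anything. The sketch below is one possible approach: save_evidence is a hypothetical helper name, and /tmp/io-evidence in the usage lines is only an example path.

```shell
# save_evidence OUTDIR LABEL CMD [ARGS...]
# Run CMD and store its output in OUTDIR/LABEL.txt under a timestamp line.
save_evidence() {
    outdir="$1"
    label="$2"
    shift 2
    mkdir -p "$outdir" || return 1
    {
        date '+# captured %Y-%m-%d %H:%M:%S'
        "$@" 2>&1
    } > "$outdir/$label.txt"
}

# Usage (each command takes five one-second samples, then exits):
#   save_evidence /tmp/io-evidence iostat iostat -xz 1 5
#   save_evidence /tmp/io-evidence vmstat vmstat 1 5
```

Giving iostat and vmstat a sample count keeps the capture bounded, so it finishes on its own and leaves files you can read after the incident.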

Copy-Paste Template

# Disk details (most important)
iostat -xz 1

# Overall view
vmstat 1

# Disk list
lsblk
df -h

Summary

  • "Slow" does not always mean CPU
  • %iowait, %util, and await are the decision axes
  • iostat and vmstat go together
  • Follow the order: observe → identify cause → act
