Ubuntu Memory Troubleshooting - free, top, ps, and OOM Killer Guide

What You'll Learn

  • How to determine if server slowness/crashes are caused by memory issues
  • How to identify memory-hungry processes
  • How to check whether the OOM Killer (the kernel killing a process when memory runs out) has triggered
  • How to move beyond "just restart" and prevent recurrence

Quick Summary

When memory is suspect, follow this order:

  1. Overview: free -h (memory/swap status)
  2. Find culprit: top (high RES, high CPU)
  3. Top list: ps aux --sort=-%mem | head
  4. OOM evidence: journalctl -k | grep -i oom
  5. Fix: Address the process/service (logs, config, memory limits, scaling, swap)

Prerequisites

  • OS: Ubuntu
  • Target: Server beginners
  • sudo access
  • Goal: Isolation, recovery, and basic prevention

1. Memory Issue Symptoms

Conclusion: Crashes, sluggish SSH, and 502/503 errors are common symptoms of memory pressure, so check memory first.

"Memory issues" can show up in many ways:

  • Server extremely slow (SSH sluggish, commands hang)
  • Apps crash suddenly / 502/503 increase
  • Processes die unexpectedly (errors in logs)
  • Same issue recurs after restart

Important: Even 100% CPU can be a memory problem in disguise (swap thrashing). Don't misdiagnose by looking at CPU alone.

2. free -h for Overview (First Step)

Conclusion: In free -h, low available or growing swap are the clearest danger signals.

$ free -h

Example output:

              total        used        free      shared  buff/cache   available
Mem:           2.0Gi       1.9Gi        30Mi       120Mi        70Mi        80Mi
Swap:          1.0Gi       1.0Gi         0Mi

Key points:

  • available: Estimate of the memory new workloads can use without swapping (small = trouble)
  • Swap: Nearly full swap is a danger sign (swap thrashing causes extreme slowness)

Rough guidelines:

  • available down to a few tens of MB -> Very dangerous (OOM or extreme slowness likely)
  • Swap keeps growing -> Root cause has likely not been addressed
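
The two danger signals above can be checked automatically. The following is a minimal sketch, not a definitive tool: check_free is a hypothetical helper name, and the thresholds (100 MiB available, 80% swap) are illustrative assumptions rather than values from this guide. It assumes the column layout shown in the example output, where available is the 7th field of the Mem: line.

```shell
#!/bin/sh
# Hypothetical thresholds for illustration; tune them for your server.
AVAIL_MIN_MIB=${AVAIL_MIN_MIB:-100}   # warn when "available" is below this
SWAP_MAX_PCT=${SWAP_MAX_PCT:-80}      # warn when swap usage reaches this percent

# Reads `free -m` output on stdin; prints one status line for memory and one for swap.
check_free() {
  awk -v amin="$AVAIL_MIN_MIB" -v smax="$SWAP_MAX_PCT" '
    /^Mem:/  { avail = $7 + 0 }                    # 7th field is the "available" column
    /^Swap:/ { stotal = $2 + 0; sused = $3 + 0 }   # total and used swap in MiB
    END {
      if (avail < amin + 0) {
        printf "WARN available=%dMiB\n", avail
      } else {
        printf "OK available=%dMiB\n", avail
      }
      if (stotal == 0) {
        print "INFO no swap configured"
      } else {
        pct = int(sused * 100 / stotal)
        if (pct >= smax + 0) {
          printf "WARN swap=%d%%\n", pct
        } else {
          printf "OK swap=%d%%\n", pct
        }
      }
    }'
}
```

Usage: LC_ALL=C free -m | check_free (LC_ALL=C keeps the Mem: and Swap: labels in English so the patterns match).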

3. top to Find "Current Culprit" (Look at RES)

Conclusion: In top, press M to sort by memory and check RES, which shows the physical memory each process actually uses.

$ top

What beginners should look at:

  • %MEM: Memory usage percentage
  • RES: Resident memory, the physical RAM the process currently occupies (important)
  • %CPU: CPU usage percentage

top shortcuts:

  • M: Sort by memory
  • P: Sort by CPU
  • q: Quit
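
top can also run non-interactively, which helps when the server is too sluggish for an interactive session. The sketch below assumes the default top column layout (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND); top_rows is a hypothetical helper name, and a customized top configuration would need different field numbers.

```shell
#!/bin/sh
# Reads `top -b -n 1` output on stdin and prints "PID RES %MEM COMMAND" per process.
# Assumes the default 12-column layout; RES is in KiB unless top scales it (e.g. 1.2g).
top_rows() {
  awk '
    seen && NF >= 12 { print $1, $6, $10, $12 }   # rows after the header line
    $1 == "PID"      { seen = 1 }                 # header line marks the start of the table
  '
}
```

Usage: top -b -n 1 -o %MEM | top_rows | head -n 10 takes one batch-mode snapshot sorted by memory and keeps the ten largest rows.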

4. ps for Memory Top List (Create Evidence)

Conclusion: Run ps aux --sort=-%mem | head -n 20 and save the output before any restart.

$ ps aux --sort=-%mem | head -n 20

Also check the top CPU consumers for comparison:

$ ps aux --sort=-%cpu | head -n 20

Common patterns:

  • One giant process (e.g., java, node, php-fpm, python, db)
  • Many processes of the same type (worker runaway, process proliferation)
  • Docker containers multiplying
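
The "many processes of the same type" pattern is easy to miss in a per-process list, because no single row looks large. One possible way to surface it is to total RSS per command name. This is a sketch under assumptions: rss_by_command is a hypothetical helper, and it groups on the first word of the command name only.

```shell
#!/bin/sh
# Reads "<rss_kib> <command>" lines on stdin.
# Prints "<total_mib> <process_count> <command>", largest total first.
rss_by_command() {
  awk '
    { rss[$2] += $1; n[$2]++ }   # accumulate RSS (KiB) and count per command name
    END { for (c in rss) printf "%d %d %s\n", rss[c] / 1024, n[c], c }
  ' | sort -k1,1nr -k3,3
}
```

Usage: ps -eo rss=,comm= | rss_by_command | head -n 10. A command with a modest per-process size but a high count and a large total points at proliferation rather than one giant process.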

5. Check OOM Killer Activity (Often Decisive)

Conclusion: Run journalctl -k | grep -i oom. The OOM Killer log names the killed process.

OOM Killer means "the kernel ran out of memory and force-killed a process". It is a very common cause of sudden process death.

5-1. Search for "oom"

$ sudo journalctl -k | grep -i oom | tail -n 50

5-2. Search for "killed process"

$ sudo journalctl -k | grep -i "killed process" | tail -n 50

5-3. dmesg Also Works

$ sudo dmesg | grep -i oom | tail -n 50

Typical log:

Out of memory: Killed process 1234 (node) total-vm:... anon-rss:...

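The killed process name appears here. That's the main suspect, although the kernel kills the process it scores as the worst offender, which is usually but not always the one that caused the shortage.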
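
To pull just the victims out of a long kernel log, the "Killed process" pattern from the typical log line can be extracted with sed. This is a minimal sketch; oom_victims is a hypothetical helper name, and it relies only on the "Killed process <pid> (<name>)" wording shown above.

```shell
#!/bin/sh
# Reads kernel log lines on stdin and prints "<pid> <process_name>" for each OOM kill.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Usage: sudo journalctl -k | oom_victims. Note that journalctl -k only covers the current boot. If the server was rebooted after the incident, sudo journalctl -k -b -1 | oom_victims reads the previous boot, provided the journal is stored persistently.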

6. Types of Memory Issues (Response Differs)

Conclusion: Classify the issue as a spike, a leak, or proliferation. Each type requires a different fix.

Type A: Temporary Spike (Burst)

  • Batch processing, heavy aggregation, image processing, etc.
  • Fix: Isolate processing, memory limits/worker count tuning, add swap, scale up

Type B: Leak / Keeps Growing (Worsens Over Time)

  • Node/Python/Java apps whose memory keeps growing over a long runtime
  • Fix: Investigate app memory leak, worker/process restart design, set limits

Type C: Process Proliferation (Fork Bomb / Too Many Workers)

  • php-fpm children grow too much, queue/traffic causes worker explosion
  • Fix: Set child process/worker limits, load testing, rate limiting
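
Telling Type A from Type B usually needs more than one measurement. The sketch below samples one process's resident memory over time, so a spike (rises, then falls back) can be told apart from a leak (keeps climbing). rss_kib and sample_rss are hypothetical helper names, and the default interval and count are arbitrary assumptions.

```shell
#!/bin/sh
# Prints the resident set size (VmRSS, in KiB) of one process, read from /proc.
rss_kib() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Samples PID COUNT times, INTERVAL seconds apart.
# Prints "<epoch_seconds> <rss_kib>" per sample. The ( ) body keeps variables local.
sample_rss() (
  pid=$1; interval=${2:-60}; count=${3:-10}
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "$(date +%s) $(rss_kib "$pid")"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
)
```

Usage: sample_rss 1234 60 30 > rss.log records 30 samples one minute apart for PID 1234. For Type C, count processes instead of measuring one.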

7. Quick Recovery Steps (Safety First)

Conclusion: Check logs before restarting. If a restart helps, treat that as a sign the issue will recur.

7-1. If You Know Which Service Is Heavy: Check Logs First

$ sudo journalctl -u nginx -n 200
$ sudo journalctl -u app -n 200

7-2. Service Restart (Last Resort But Sometimes Necessary)

$ sudo systemctl restart <service>
$ sudo systemctl status <service>

Note: If a restart fixes it, the cause is likely the "memory accumulating" type, and it will recur if left alone.

8. Swap Makes It Extremely Slow (Swap Hell)

Conclusion: Heavy swap use creates a vicious cycle: memory shortage forces disk I/O, which slows everything further.

When swap grows, the cycle is memory shortage -> disk I/O -> even more slowness.

$ free -h

If swap is heavily used, that alone is enough to treat memory as the prime suspect.
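
free -h shows how much swap is occupied, not whether the system is swapping right now. One way to see current activity is to read the kernel's cumulative pswpin and pswpout counters from /proc/vmstat twice and compare them. This is a sketch; swap_counters and swap_activity are hypothetical helper names, and the 5-second default interval is an arbitrary assumption.

```shell
#!/bin/sh
# Prints "<pswpin> <pswpout>": cumulative pages swapped in and out since boot.
swap_counters() {
  awk 'BEGIN { i = 0; o = 0 }
       $1 == "pswpin"  { i = $2 }
       $1 == "pswpout" { o = $2 }
       END { print i, o }' "${1:-/proc/vmstat}"
}

# Prints pages swapped in and out during INTERVAL seconds.
# The optional second argument overrides the counter file (default /proc/vmstat).
swap_activity() (
  interval=${1:-5}; src=${2:-/proc/vmstat}
  set -- $(swap_counters "$src"); in1=$1; out1=$2
  sleep "$interval"
  set -- $(swap_counters "$src"); in2=$1; out2=$2
  echo "in=$((in2 - in1)) out=$((out2 - out1))"
)
```

Usage: swap_activity 5. Values that stay well above zero across repeated runs suggest the server is actively thrashing. Values near zero with full swap suggest old pages were swapped out earlier and are not being touched.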

9. Things to Avoid

Conclusion: Avoid restarting without checking the OOM log, treating swap as a permanent fix, and leaving workers unlimited.

Don't: Restart Repeatedly Without Looking at Cause

Restarting without checking logs and OOM evidence just repeats the same accident.

Don't: Think Swap Is a "Silver Bullet"

Swap is temporary relief, not a root fix, and it degrades performance.

Don't: Leave Unlimited Worker/Process Settings Alone

php-fpm and app workers without limits can eat all available memory. Always set limits.
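
A common rule of thumb for choosing a worker limit is to divide the memory you can give the service by the typical size of one worker. The sketch below does only that arithmetic. max_workers is a hypothetical helper, and the numbers in the usage line are made-up examples, not measurements.

```shell
#!/bin/sh
# Prints how many workers fit: floor((budget_mib - reserve_mib) / per_worker_mib).
#   $1 = memory budget in MiB, $2 = typical RSS of one worker in MiB,
#   $3 = optional MiB to keep free for the OS and other services (default 0).
max_workers() (
  budget_mib=$1; per_worker_mib=$2; reserve_mib=${3:-0}
  echo $(( (budget_mib - reserve_mib) / per_worker_mib ))
)
```

Usage: max_workers 2048 64 512 prints 24, meaning a 2 GiB server that keeps 512 MiB in reserve has room for about 24 workers of 64 MiB each. For php-fpm the result would be a starting point for pm.max_children. The per-worker size can be estimated from the RSS column of the ps output in section 4.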

Copy-Paste Template

# 1) Overview
free -h

# 2) Current culprit (by memory)
top

# 3) Memory top list (evidence)
ps aux --sort=-%mem | head -n 20

# 4) OOM evidence (decisive)
sudo journalctl -k | grep -i oom | tail -n 50
sudo journalctl -k | grep -i "killed process" | tail -n 50
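
Since the evidence should be saved before any restart, the template can be wrapped in a small function that writes each command's output to a file. This is a sketch under assumptions: collect_evidence is a hypothetical helper, the directory naming is arbitrary, and it should be run as root (for example from a root shell) so that journalctl can read the kernel log.

```shell
#!/bin/sh
# Saves each command's output to its own file under a timestamped directory.
# $1 = base directory (default /tmp). Prints the directory it created.
collect_evidence() (
  dir="${1:-/tmp}/mem-evidence-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir" || exit 1
  free -h                          > "$dir/free.txt"   2>&1
  top -b -n 1 | head -n 30         > "$dir/top.txt"    2>&1
  ps aux --sort=-%mem | head -n 20 > "$dir/ps-mem.txt" 2>&1
  ps aux --sort=-%cpu | head -n 20 > "$dir/ps-cpu.txt" 2>&1
  # OOM lines from the current boot's kernel log
  journalctl -k | grep -i -e oom -e "killed process" | tail -n 50 > "$dir/oom.txt" 2>&1
  echo "$dir"
)
```

Usage: collect_evidence /var/tmp prints the path of the new directory, which can then be attached to an incident note before the service is restarted.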

Summary

  • Check free -h for available and Swap first
  • Use ps aux --sort=-%mem to identify top consumers
  • Use journalctl -k | grep -i oom to confirm OOM
  • If restart fixes it temporarily, it will almost certainly recur. Classify and address root cause.
