Getting Started with sort and uniq: Sorting Data and Removing Duplicates

What You'll Learn

  • How to sort lines with sort (alphabetical, numeric, reverse)
  • How to remove duplicate lines with uniq — and why it implicitly requires sort
  • How to write the classic frequency ranking pipeline sort | uniq -c | sort -rn
  • Why beginners get stuck on "uniq doesn't remove duplicates" and "numbers come out in a weird order"

Quick Summary

  • Want to sort? → sort
  • Want to sort and dedupe? → sort -u
  • Want to count occurrences? → sort | uniq -c | sort -rn

Environment

  • OS: Ubuntu / typical Linux
  • GNU coreutils sort / uniq (BSD versions on macOS differ in some option details)

1. What Does "Sorting Lines" Mean?

Lina: Senpai, I often want to put logs or lists in alphabetical order. How do I do that?
Linny-senpai: That's exactly what sort is for. sort filename reads the file line by line and prints the sorted result. It doesn't modify the file — it just prints to the screen, so you can experiment safely.
Lina: So the original file stays untouched. That's reassuring.
Linny-senpai: Right. There are three sort orders to remember: alphabetical, numeric, and reversed. Those three cover most everyday cases.

Let's prepare a sample file:

$ cat fruits.txt
banana
apple
cherry
apple
banana
date

1-1. Basic: Alphabetical Order

$ sort fruits.txt
apple
apple
banana
banana
cherry
date

Key points

  • sort defaults to alphabetical (dictionary) order
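  • Uppercase and lowercase are treated as different characters. Their relative order depends on the locale: in the C locale every uppercase letter sorts before every lowercase one, while other locales may place Apple and apple next to each other
  • The original file is not modified; sort only prints to the screen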
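
The locale dependence is easiest to see by pinning the locale for a single command. The following is a minimal sketch with made-up sample words; it assumes only that the C locale is available, which is the case on typical Linux systems.

```shell
# Sample words are made up for illustration.
words='banana
Apple
apple
Banana'

# LC_ALL=C forces plain byte order: every uppercase letter sorts before every lowercase one.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_order"
# Apple
# Banana
# apple
# banana

# Under a locale such as en_US.UTF-8 the cases may be interleaved instead;
# run `locale` to see which one your shell is using.
```

Prefixing a command with LC_ALL=C changes the locale for that command only, so it is a low-risk way to get reproducible ordering in scripts.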

1-2. Reverse (Descending) Order: -r

$ sort -r fruits.txt
date
cherry
banana
banana
apple
apple

-r stands for reverse.

2. The Numeric Sort Trap

Lina: I sorted some numbers, but the order looks wrong...
Linny-senpai: Perfect example. Let's see what happens.

$ cat scores.txt
100
3
25
9
1000
$ sort scores.txt
100
1000
25
3
9

Lina: Wait, 100 comes before 25, and 3 and 9 are at the end. Is this a bug?
Linny-senpai: Not a bug. By default, sort does string comparison character by character from the left, so lines starting with 1 come before lines starting with 2 or 3. To sort as numbers, pass -n.

2-1. Numeric Sort: -n

$ sort -n scores.txt
3
9
25
100
1000

-n stands for numeric.

Beginner pitfall

  • Forgetting -n when sorting sizes, counts, or any numeric column produces the wrong order
  • Rule of thumb: "if the column looks like a number, add -n"
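
The same rule applies when a number is followed by other text, as in the output of many counting tools. A minimal sketch with made-up lines, showing that -n compares the leading number of each line:

```shell
# Made-up sample: a number followed by a label.
lines='10 errors
9 warnings
100 notices'

# Default string comparison: "9 warnings" ends up last, because "9" sorts after "1".
as_text=$(printf '%s\n' "$lines" | sort)

# -n compares the leading number of each line, so 9 < 10 < 100.
as_numbers=$(printf '%s\n' "$lines" | sort -n)
printf '%s\n' "$as_numbers"
# 9 warnings
# 10 errors
# 100 notices
```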

2-2. Numbers in Descending Order

$ sort -nr scores.txt
1000
100
25
9
3

-n and -r combine freely. This combination appears in nearly every ranking task.

3. Sort + Deduplicate in One Shot: sort -u

$ sort -u fruits.txt
apple
banana
cherry
date

Lina: Oh, now apple and banana appear only once each.
Linny-senpai: Yes. -u stands for unique. It sorts and strips duplicates in a single command. When you just want "the unique values, sorted," this one option does it all.

In real work, "give me the unique values" is one of the most common requests. sort -u is the shortcut.
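
A common follow-up is "how many unique values are there?". The sketch below recreates the fruits.txt sample in a temporary file (the temporary path is an assumption of this example, not part of the original walkthrough) and counts the distinct lines.

```shell
# Recreate the fruits.txt sample in a temporary file.
fruits=$(mktemp)
printf '%s\n' banana apple cherry apple banana date > "$fruits"

# sort -u and sort | uniq should produce the same list.
with_u=$(sort -u "$fruits")
with_uniq=$(sort "$fruits" | uniq)

# Count the distinct values; tr strips the padding that some wc versions add.
distinct=$(sort -u "$fruits" | wc -l | tr -d ' ')
echo "$distinct distinct fruits"
# 4 distinct fruits

rm -f "$fruits"
```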

4. uniq: The Deduplication Specialist

4-1. Basics

$ uniq fruits.txt
banana
apple
cherry
apple
banana
date

Lina: Huh, apple and banana are still duplicated!
Linny-senpai: That's the biggest gotcha with uniq. It only removes adjacent duplicates — duplicates separated by other lines are kept.
Lina: So how do I actually remove all duplicates?
Linny-senpai: Pipe through sort first. Once sort puts identical lines next to each other, uniq can collapse them properly.

4-2. The sort | uniq Pattern

$ sort fruits.txt | uniq
apple
banana
cherry
date

Rule of thumb

  • Run sort before uniq unless you already know the input is sorted
  • On unsorted input, uniq only collapses runs of identical adjacent lines
  • If "sort and dedupe" is all you want, sort -u is shorter
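
The adjacency rule can be checked with a tiny made-up input in which one copy of a line is separated from the others. This is only an illustrative sketch:

```shell
# Made-up input: "a" appears three times, but only the first two copies are adjacent.
input='a
a
b
a'

# uniq alone collapses only the adjacent pair, so "a" still shows up twice.
unsorted=$(printf '%s\n' "$input" | uniq)
printf '%s\n' "$unsorted"
# a
# b
# a

# After sort, all copies are adjacent and uniq removes every duplicate.
sorted=$(printf '%s\n' "$input" | sort | uniq)
printf '%s\n' "$sorted"
# a
# b
```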

4-3. Counting Occurrences: uniq -c

$ sort fruits.txt | uniq -c
      2 apple
      2 banana
      1 cherry
      1 date

-c stands for count — each line gets its occurrence count prepended. Extremely useful for aggregation.
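
The padded count column is awkward to feed into other tools. One possible approach, sketched here under the assumption that every value is a single word without spaces, is to let awk reorder the two columns:

```shell
# Recreate the fruits.txt sample in a temporary file.
fruits=$(mktemp)
printf '%s\n' banana apple cherry apple banana date > "$fruits"

# uniq -c pads the count with spaces; awk splits on whitespace,
# so $1 is the count and $2 is the value.
counts=$(sort "$fruits" | uniq -c | awk '{print $2 "," $1}')
printf '%s\n' "$counts"
# apple,2
# banana,2
# cherry,1
# date,1

rm -f "$fruits"
```

Values that contain spaces would need different handling, because awk would split them into several fields.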

4-4. Duplicates Only / Singletons Only

# Show only lines that appear more than once
$ sort fruits.txt | uniq -d
apple
banana
# Show only lines that appear exactly once
$ sort fruits.txt | uniq -u
cherry
date

Option  Meaning                   Use case
-c      Prepend count             Aggregation
-d      Duplicates only           Find duplicated items
-u      Singletons only           Extract values seen exactly once
-i      Case-insensitive compare  Merge case variants
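
The -i option has no example above, so here is a hedged sketch with made-up input. It pairs sort -f (so that case variants end up adjacent) with uniq -i, and prints only the counts, because which spelling represents each group can vary with the locale.

```shell
# Made-up input with case variants of the same word.
names='Apple
banana
apple
APPLE'

# sort -f groups the case variants together; uniq -i then treats them as one value.
# awk keeps only the count column.
counts=$(printf '%s\n' "$names" | sort -f | uniq -ic | awk '{print $1}')
printf '%s\n' "$counts"
# 3
# 1
```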

5. The Real-World Workhorse: Frequency Ranking

Lina: For access logs, I want to know which IP hits the server the most. How do I do that?
Linny-senpai: This is today's highlight. The three-stage pipeline sort | uniq -c | sort -rn is the standard idiom. Memorize it.

Sample log:

$ cat access.log
192.168.1.10
192.168.1.20
192.168.1.10
192.168.1.30
192.168.1.10
192.168.1.20

Frequency ranking:

$ sort access.log | uniq -c | sort -rn
      3 192.168.1.10
      2 192.168.1.20
      1 192.168.1.30

Pipeline breakdown

Stage  Command   What it does
1      sort      Brings identical lines next to each other
2      uniq -c   Collapses adjacent duplicates with a count
3      sort -rn  Sorts by count (numeric) in descending order
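
Real access logs usually carry more than the IP address on each line. The sketch below assumes a hypothetical log format with the client IP in the first whitespace-separated field, and uses awk to extract that field before the three stages run.

```shell
# Hypothetical log: client IP in the first field, followed by method and path.
log=$(mktemp)
cat > "$log" << 'EOF'
192.168.1.10 GET /index.html
192.168.1.20 GET /about.html
192.168.1.10 POST /login
192.168.1.30 GET /index.html
192.168.1.10 GET /index.html
192.168.1.20 GET /index.html
EOF

# Keep only the IP column, then apply the three-stage pipeline.
ranking=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranking"
#       3 192.168.1.10
#       2 192.168.1.20
#       1 192.168.1.30

rm -f "$log"
```

The width of the padding before each count may differ between GNU and BSD uniq; the counts themselves do not.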

5-1. Top N Only

$ sort access.log | uniq -c | sort -rn | head -n 3

head -n 3 keeps the top 3 entries. Combining with head is the everyday pattern.
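
If the pattern is typed often, it can be wrapped in a small shell function. top_n below is a hypothetical helper name chosen for this sketch, not a standard command.

```shell
# top_n is a hypothetical helper, not a standard command.
# Usage: top_n FILE [N]   (N defaults to 10)
top_n() {
    sort -- "$1" | uniq -c | sort -rn | head -n "${2:-10}"
}

# Recreate the access.log sample in a temporary file.
log=$(mktemp)
printf '%s\n' 192.168.1.10 192.168.1.20 192.168.1.10 \
              192.168.1.30 192.168.1.10 192.168.1.20 > "$log"

top2=$(top_n "$log" 2)
printf '%s\n' "$top2"
#       3 192.168.1.10
#       2 192.168.1.20

rm -f "$log"
```

When several entries share the same count at the cutoff, head simply keeps whichever lines come first, so a tie at position N can hide equally frequent entries.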

6. Advanced: Sort by a Specific Column with -k

For CSV or whitespace-separated data, -k chooses which field to sort by.

$ cat sales.txt
apple 120
banana 80
cherry 200
date 50
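# Sort by the 2nd column only (numeric, descending)
$ sort -k2,2nr sales.txt
cherry 200
apple 120
banana 80
date 50

  • -k2,2 restricts the sort key to the second field; plain -k2 means "from field 2 to the end of the line"
  • Append n to the key (or pass -n) whenever the chosen column is numeric
  • To change the field delimiter, pass it with -t, for example -t ',' for comma-separated data or -t ':' for colon-separated data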
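
For comma-separated input, the delimiter and key options combine as follows. This is a minimal sketch with made-up data that mirrors sales.txt; it does not handle quoted CSV fields that themselves contain commas.

```shell
# Made-up comma-separated data: name,amount
csv='apple,120
banana,80
cherry,200
date,50'

# -t, sets the field separator to a comma.
# -k2,2nr sorts on field 2 only, numerically, in descending order.
by_amount=$(printf '%s\n' "$csv" | sort -t, -k2,2nr)
printf '%s\n' "$by_amount"
# cherry,200
# apple,120
# banana,80
# date,50
```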

7. Common Beginner Pitfalls

7-1. uniq Didn't Remove the Duplicates

Cause: forgot to sort first.

# BAD: non-adjacent duplicates are not removed
$ uniq fruits.txt

# GOOD
$ sort fruits.txt | uniq
$ sort -u fruits.txt

7-2. Numbers Came Out in a Weird Order

Cause: forgot -n. sort is doing string comparison.

$ sort -n scores.txt   # Sort as numbers

7-3. The Original File Wasn't Modified

By default, sort writes to standard output and leaves the input file untouched. To save the sorted result, redirect explicitly:

$ sort fruits.txt > fruits-sorted.txt

Never do this

# BAD: this empties the file
$ sort fruits.txt > fruits.txt

The shell handles > first and truncates the destination before sort starts, so sort reads an empty file. To sort in place safely, use sort -o:

# GOOD: -o writes only after reading is finished
$ sort -o fruits.txt fruits.txt
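
To try this without risking a real file, the in-place sort can be rehearsed on a throwaway copy. A minimal sketch, assuming mktemp is available:

```shell
# Work on a throwaway file so nothing real is overwritten.
tmp=$(mktemp)
printf '%s\n' banana apple cherry > "$tmp"

# -o names the output file; sort reads all of its input before writing to it,
# so using the same file for input and output is safe.
sort -o "$tmp" "$tmp"

result=$(cat "$tmp")
printf '%s\n' "$result"
# apple
# banana
# cherry

rm -f "$tmp"
```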

7-4. Upper/Lowercase Are Treated as Different

$ cat names.txt
Alice
bob
Alice
BOB
$ sort -u names.txt
Alice
BOB
bob

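The relative order of BOB and bob shown above is what the C locale produces; other locales may list bob first. To treat case variants as the same value, add -f (fold case):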

$ sort -uf names.txt
Alice
bob
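
With sort -uf, the spelling that survives for each group is simply one of the original variants. When a predictable spelling matters more than preserving the original capitalization, one alternative is to lowercase everything first. This is a sketch of that different technique, not a replacement for -f:

```shell
# Same names as in the sample above.
names='Alice
bob
Alice
BOB'

# Lowercase everything first, then sort and deduplicate.
# Unlike sort -uf, the original capitalization is lost.
normalized=$(printf '%s\n' "$names" | tr '[:upper:]' '[:lower:]' | sort -u)
printf '%s\n' "$normalized"
# alice
# bob
```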

8. Mini Exercises

Lina: I get the theory! I want to try it for real.
Linny-senpai: Here are three exercises. Run them in your terminal.

Exercise 1: Print the unique words from this file.

$ cat << 'EOF' > words.txt
apple
banana
apple
cherry
banana
EOF

Hint:

There's one option that does "sort and dedupe" in a single step.

Answer:

$ sort -u words.txt
apple
banana
cherry

Exercise 2: Count how many times each word appears.

Hint:

Two-stage pipe: sort | uniq -c.

Answer:

$ sort words.txt | uniq -c
      2 apple
      2 banana
      1 cherry

Exercise 3: Sort the counts in descending order and show only the top 2.

Hint:

Sort the count column numerically in reverse → keep 2 lines with head.

Answer:

$ sort words.txt | uniq -c | sort -rn | head -n 2
      2 apple
      2 banana

9. Copy-Paste Templates

Patterns to keep handy

# Sort alphabetically
sort file.txt

# Sort and deduplicate
sort -u file.txt

# Sort numerically (ascending / descending)
sort -n file.txt
sort -nr file.txt

# Count occurrences per line
sort file.txt | uniq -c

# Frequency ranking (most frequent first)
sort file.txt | uniq -c | sort -rn

# Top 10 frequency ranking
sort file.txt | uniq -c | sort -rn | head -n 10

# Sort by 2nd column only, descending numeric
sort -k2,2nr file.txt

# Case-insensitive unique values
sort -uf file.txt

# Sort in place safely (avoids the > self-truncation bug)
sort -o file.txt file.txt

Summary: What to Read Next