Text Stream Filters: cat, sort, uniq, wc, head, tail

What You Will Achieve

  • Build filter pipelines that process text received from standard input
  • Aggregate logs accurately by combining sort / uniq
  • Safely inspect only the needed part of large files with head / tail
  • Extract fields, translate characters, and number lines with cut / tr / nl
  • Handle frequent aggregation tasks with chained one-liners

This is the core of LPIC-1 objective 103.2 "Process text streams using filters". Filters read standard input, transform it, and write to standard output; chaining them with pipes makes them powerful.
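
For example, three single-purpose filters chained with pipes answer a question none of them answers alone, "which login shells are configured, and how often?" (a preview of the steps below; counts vary by system):

cut -d: -f7 /etc/passwd | sort | uniq -c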

Which Filter, When

Goal                     Filter        Key options
Reorder lines            sort          -n numeric / -r reverse / -k key
Deduplicate/aggregate    uniq          -c count / -d duplicates only
Count items              wc            -l lines / -w words / -c bytes
View only head or tail   head / tail   -n count / tail -f follow
Extract columns          cut           -d delimiter / -f field
Replace/delete chars     tr            -d delete / -s squeeze
Add line numbers         nl / cat -n   nl -b a numbers all lines

uniq only collapses adjacent duplicates, so it is almost always combined with sort. This is the most frequent pattern in both the exam and real work.
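
A quick demonstration of the adjacency constraint, using printf to fabricate a three-line stream:

printf 'a\nb\na\n' | uniq -c          # the two a's are not adjacent, so nothing collapses
      1 a
      1 b
      1 a
printf 'a\nb\na\n' | sort | uniq -c   # sorting makes them adjacent first
      2 a
      1 b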

Steps

Step 1: Concatenate files to standard output

cat access.log
cat -n script.sh
cat file1 file2 > merged.txt
     1  #!/bin/bash
     2  echo "start"
     3  exit 0

cat concatenates multiple files. -n adds line numbers, and -A makes invisible characters visible: tabs appear as ^I, carriage returns as ^M, and each line end as $. To merely view a single file, less handles large files better.
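
A minimal check for Windows-style line endings, fabricating a tab and a CR with printf:

printf 'a\tb\r\n' | cat -A
a^Ib^M$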

Step 2: Sort and aggregate duplicates

sort access.log | uniq -c | sort -nr | head -n 5
    143 GET /index.html
     97 GET /login
     61 POST /api/data
     28 GET /favicon.ico
     12 GET /robots.txt

"sort → uniq -c to count → reverse sort by count → top 5" is the standard access-aggregation idiom. uniq -c assumes the previous step already sorted the input.

Step 3: Count lines, words, and bytes

wc -l access.log
wc -lwc README.md
  10234 access.log
   120  856 5421 README.md

-l is lines, -w is words, -c is bytes (-m is characters). Placed at the end of a pipe, it directly yields "how many items matched".
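
For example, piping who into wc -l counts login sessions without reading the output yourself:

who | wc -l   # number of current login sessions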

Step 4: Slice head and tail

head -n 20 large.csv
tail -n 50 syslog
tail -f /var/log/nginx/access.log
2026-05-17 10:01:22 INFO  start
2026-05-17 10:01:23 INFO  ready

tail -f displays appended lines in real time, the foundation of log monitoring. Combining head and tail extracts ranges such as "M lines starting at line N":
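
One way to slice such a range is tail's -n +N form, which starts output at line N:

tail -n +100 large.csv | head -n 20   # 20 lines starting at line 100 (lines 100-119)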

Step 5: Column extraction and character translation

cut -d: -f1,7 /etc/passwd
echo "Hello World" | tr 'a-z' 'A-Z'
cat data.txt | tr -s ' ' | tr -d '\r'
root:/bin/bash
daemon:/usr/sbin/nologin
HELLO WORLD

cut -d: -f1,7 extracts the first and seventh :-delimited fields, here the user name and login shell. tr translates or deletes characters; -s squeezes a run of repeated characters into one and -d deletes the specified characters. tr -d '\r' is the common way to strip Windows-origin carriage returns.
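
nl, listed in the options table above, numbers lines with more control than cat -n; -b a numbers every line, including blank ones:

nl -b a script.sh
     1  #!/bin/bash
     2  echo "start"
     3  exit 0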

Why Chain Filters

Each filter follows the Unix philosophy of "do one thing well". sort only reorders; uniq only handles adjacent duplicates. Being single-purpose is exactly what makes them freely composable through pipes, achieving aggregation and extraction without writing a huge dedicated tool.

uniq handles only adjacent duplicates because it streams line by line, remembering nothing beyond the previous line. To handle duplicates across the whole input, identical lines must first be made adjacent by sorting. Understanding this constraint makes writing sort | uniq reflexive.

Troubleshooting

Symptom: uniq -c does not aggregate duplicates

Cause: The input is not sorted

Check:

sort file | uniq -c

Fix: Always put sort before uniq. Or use sort -u to sort and deduplicate at once (but then you lose the -c counts).
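
The two forms are equivalent whenever the counts are not needed:

sort file | uniq    # sort, then collapse adjacent duplicates
sort -u file        # same result in one step, but no per-line counts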

Symptom: sort -n does not order as expected

Cause: The numeric column contains spaces or unit characters, or the key position is unspecified

Check:

sort -k2 -n data.txt

Fix: Specify the sort field with -k and the delimiter with -t. For human-readable sizes (1K, 2M) use sort -h.
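
For instance, sorting /etc/passwd numerically by its third :-delimited field, the UID:

sort -t: -k3 -n /etc/passwd | head -n 3   # root (UID 0) comes first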

Symptom: tail -f stops following updates

Cause: The log rotated to a different inode

Check:

tail -F /var/log/syslog

Fix: -f (lowercase) keeps reading the original file descriptor, so it goes quiet once the log rotates to a new inode. -F (uppercase) follows the file by name, re-opening it after rotation.

Completion Checklist

  • [ ] Ran the aggregation one-liner sort | uniq -c | sort -nr
  • [ ] Used wc -l at the end of a pipe to count items
  • [ ] Inspected only the needed part of a large file with head / tail
  • [ ] Extracted fields with cut -d -f
  • [ ] Verified character translation and \r removal with tr

Summary

Scenario    Command                     Purpose
Aggregate   sort | uniq -c | sort -nr   Frequency ranking
Count       wc -l                       Line count
Head/tail   head -n / tail -f           Range slice / follow
Column      cut -d: -f1                 Field extraction
Translate   tr 'a-z' 'A-Z'              Per-character replace/delete

Chaining filters is the fundamental text-processing pattern. More complex pattern matching needs regular expressions and grep.

Next Reading