comm and join: Comparing and Joining Files

What You'll Learn

  • How to extract common lines and differences between two files with comm
  • How to join two files horizontally on a shared key with join
  • How to avoid sort-related failures like missing lines or not sorted warnings

Quick Summary

  • Want common lines / differences line by line → comm
  • Want to join two tables on a key column (like SQL JOIN) → join
  • Both require sorted input — this is non-negotiable

Assumptions (environment)

  • GNU coreutils (Ubuntu / most Linux distributions)
  • comm and join ship with coreutils. No extra install needed

What is comm and what does it do?

Conclusion: comm compares two sorted files line by line and prints "left only / right only / common" across three columns.

Each output line lands in one of three tab-indented columns, depending on which file it came from:

  • Column 1: lines only in file1
  • Column 2: lines only in file2
  • Column 3: lines in both (common lines)

Here are two sample files:

$ cat a.txt
apple
banana
cherry

$ cat b.txt
banana
cherry
date

$ comm a.txt b.txt
apple
		banana
		cherry
	date

Column position is shown by tab indentation. apple is left only (column 1), banana and cherry are in both (column 3, two tabs), and date is right only (column 2, one tab).

Why does comm require sorted input?

Conclusion: comm uses a simple merge that reads both files line by line, so unsorted input causes it to miss common lines.

comm walks both files from the top in a single pass, always comparing the current line of each. If either file is out of order, a matching line may already have been passed, so comm reports it as unique to one side instead of common. GNU comm warns when it detects this:

comm: file 1 is not in sorted order

Always run sort first. Process substitution avoids temporary files:

$ comm <(sort a.txt) <(sort b.txt)

Sort order is locale dependent, and comm must compare lines under the same collation that sort used. If the two disagree, comm treats correctly sorted input as unsorted. When in doubt, set LC_ALL=C for both sort and comm to pin the order.
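
As a minimal sketch of the advice above, the script below sorts two deliberately unordered files under LC_ALL=C and runs comm under the same locale. The sample data, file names, and scratch directory are placeholders chosen for illustration, and it writes sorted copies to files so that it also runs in shells without process substitution.

```shell
# Scratch directory with hypothetical, deliberately unsorted sample data.
tmp=$(mktemp -d)
printf 'cherry\napple\nbanana\n' > "$tmp/a.txt"
printf 'date\ncherry\nbanana\n' > "$tmp/b.txt"

# Sort both inputs under the same pinned locale.
LC_ALL=C sort "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/b.txt" > "$tmp/b.sorted"

# Run comm under that locale too, so it agrees with sort on ordering.
result=$(LC_ALL=C comm "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$result"
```

The printed result should match the three-column output shown earlier: apple in column 1, banana and cherry in column 3, date in column 2.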

How do I select specific comm columns?

Conclusion: Use -1, -2, -3 to suppress the matching column. The number refers to the column you remove.

The options specify the column to suppress, not the column to show.

Goal                  Command         Remaining column
Common lines only     comm -12 a b    Column 3
Differences only      comm -3 a b     Columns 1 and 2
Lines only in file1   comm -23 a b    Column 1
Lines only in file2   comm -13 a b    Column 2

$ comm -12 a.txt b.txt
banana
cherry

$ comm -23 a.txt b.txt
apple

Memory aid: the numbers say "drop these columns." To keep only common lines, drop columns 1 and 2 with -12.
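
The two remaining combinations from the table can be checked the same way. This is a small sketch that recreates the sample files in a scratch directory (the paths are placeholders) and captures the output of -13 and -3.

```shell
# Recreate the already-sorted sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/a.txt"
printf 'banana\ncherry\ndate\n' > "$tmp/b.txt"

# Drop columns 1 and 3: only lines unique to b.txt remain.
only_b=$(comm -13 "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$only_b"

# Drop column 3: differences from both sides remain.
# Lines unique to b.txt still carry one leading tab (they are column 2).
diff_only=$(comm -3 "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$diff_only"
```

Note that with -3 the column 2 lines keep their leading tab, because column 1 is still being printed.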

What is join and how is it different from comm?

Conclusion: join merges rows from two files on a shared key column, equivalent to a SQL INNER JOIN.

While comm matches whole lines, join merges rows whose key field matches into a single line. The default key is the first field of each line.

$ cat users.txt
1 alice
2 bob
3 carol

$ cat depts.txt
1 sales
2 engineering
4 marketing

$ join users.txt depts.txt
1 alice sales
2 bob engineering

Keys 1 and 2 exist in both files, so they are joined. 3 carol and 4 marketing have no match and are not printed (inner join).

join also requires both inputs to be sorted on the key field. With unsorted input, GNU join prints a join: ... is not sorted warning, and rows that should have paired up are dropped. As with comm, run sort first.
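
One detail worth sketching: join compares keys as text, not as numbers, so numeric-looking keys need a plain lexical sort on the key field rather than sort -n. The example below uses made-up rows and placeholder file names, and pins LC_ALL=C so that sort and join agree on the order.

```shell
# Hypothetical sample data with multi-digit keys, in arbitrary order.
tmp=$(mktemp -d)
printf '10 judy\n9 ivan\n2 bob\n' > "$tmp/users.txt"
printf '9 ops\n10 legal\n2 engineering\n' > "$tmp/depts.txt"

# Sort lexically on field 1 only (no -n): in the C locale "10" < "2" < "9".
LC_ALL=C sort -k1,1 "$tmp/users.txt" > "$tmp/users.sorted"
LC_ALL=C sort -k1,1 "$tmp/depts.txt" > "$tmp/depts.sorted"

# join under the same locale, so it sees the inputs as correctly ordered.
joined=$(LC_ALL=C join "$tmp/users.sorted" "$tmp/depts.sorted")
printf '%s\n' "$joined"
```

The rows come out in lexical key order (10, 2, 9). If numeric order is needed for display, one option is to pipe the joined result through sort -n afterwards.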

How do I set join fields and the delimiter?

Conclusion: Use -t for the delimiter, -1 / -2 for each file's key column, and -o for the output fields.

Adjust these when the delimiter differs (like CSV) or the key is not the first column.

$ cat users.csv
1,alice,tokyo
2,bob,osaka

$ cat depts.csv
sales,1
engineering,2

In users.csv the key is column 1; in depts.csv the key is column 2. The delimiter is a comma.

$ join -t, -1 1 -2 2 users.csv depts.csv
1,alice,tokyo,sales
2,bob,osaka,engineering

  • -t,: set the delimiter to a comma
  • -1 1: the key in file1 is column 1
  • -2 2: the key in file2 is column 2

Use -o to spell out the output fields. -o 1.1,1.2,2.1 means "columns 1 and 2 of file1, then column 1 of file2." List them as <file>.<field>.
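
To make the -o notation concrete, here is a short sketch that applies the field list quoted above to the same CSV data. The scratch directory is a placeholder.

```shell
# Recreate the CSV samples; depts.csv is already sorted on its key (column 2).
tmp=$(mktemp -d)
printf '1,alice,tokyo\n2,bob,osaka\n' > "$tmp/users.csv"
printf 'sales,1\nengineering,2\n' > "$tmp/depts.csv"

# Keep file1 columns 1 and 2, then file2 column 1; the city column is dropped.
picked=$(join -t, -1 1 -2 2 -o 1.1,1.2,2.1 "$tmp/users.csv" "$tmp/depts.csv")
printf '%s\n' "$picked"
```

Compared with the default output, tokyo and osaka no longer appear because 1.3 is not in the list.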

How do I keep non-matching rows (outer join)?

Conclusion: Use -a to also print unpaired rows. -a 1 is a left outer join; -a 1 -a 2 is a full outer join.

An inner join discards rows with no match. Use -a to keep them.

$ join -a 1 users.txt depts.txt
1 alice sales
2 bob engineering
3 carol

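3 carol has no match in depts.txt, but -a 1 keeps it. The unpaired line is printed with only its own fields, so the department column is simply absent rather than blank. To fill the gap with a placeholder, combine -e and -o.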

$ join -a 1 -e '-' -o '1.1,1.2,2.2' users.txt depts.txt
1 alice sales
2 bob engineering
3 carol -

-e '-' fills missing fields with -, and -o fixes the output columns.
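
The full outer join mentioned in the conclusion can be sketched in the same style. One assumption to flag: the field list here uses 0, which stands for the join field itself, instead of 1.1, so that rows existing only in depts.txt still show their key.

```shell
# Recreate the sample tables in a scratch directory (placeholder paths).
tmp=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$tmp/users.txt"
printf '1 sales\n2 engineering\n4 marketing\n' > "$tmp/depts.txt"

# -a 1 -a 2 keeps unpaired rows from both files.
# -o 0,1.2,2.2 prints the key, the user name, and the department.
# -e '-' fills whichever of those fields is missing.
full=$(join -a 1 -a 2 -e '-' -o 0,1.2,2.2 "$tmp/users.txt" "$tmp/depts.txt")
printf '%s\n' "$full"
```

With 1.1 in place of 0, the key for 4 marketing would be filled with - instead, since that row has no counterpart in users.txt.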

comm vs join: which to use

Conclusion: Use comm for line-set differences and join for key-based merges. Both require sorted input.

Goal                                 Command
Find common lines / differences      comm
Join two tables on a shared key      join
Deduplicate / sort (preprocessing)   sort
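
The preprocessing row matters when a file contains repeated lines. comm pairs duplicates one for one, so an extra copy on one side is reported as a difference. The sketch below, using made-up data and placeholder paths, contrasts plain sort with sort -u, which treats each file as a set.

```shell
# Hypothetical data: a.txt contains "apple" twice, b.txt only once.
tmp=$(mktemp -d)
printf 'apple\nbanana\napple\n' > "$tmp/a.txt"
printf 'banana\napple\n' > "$tmp/b.txt"

# Plain sort keeps duplicates; the surplus "apple" shows up as a.txt-only.
LC_ALL=C sort "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/b.txt" > "$tmp/b.sorted"
with_dups=$(LC_ALL=C comm -23 "$tmp/a.sorted" "$tmp/b.sorted")

# sort -u deduplicates first, so the two files compare as equal sets.
LC_ALL=C sort -u "$tmp/a.txt" > "$tmp/a.uniq"
LC_ALL=C sort -u "$tmp/b.txt" > "$tmp/b.uniq"
as_sets=$(LC_ALL=C comm -23 "$tmp/a.uniq" "$tmp/b.uniq")

printf 'with duplicates: [%s]\n' "$with_dups"
printf 'as sets: [%s]\n' "$as_sets"
```

Which behavior is right depends on the goal: keep duplicates when counts matter, and use sort -u when only membership does.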

Pitfalls to avoid

  • Passing unsorted input to comm / join (rows are dropped or misclassified, and the warning on stderr is easy to miss)
  • Mixing locales between comm and sort (ordering mismatch breaks results)
  • Forgetting the key-column options (-1 / -2) for join and defaulting to column 1