gzip vs bzip2 vs xz vs zstd - Choosing a Compression Format

Which of the Four Should You Pick?

Conclusion: When in doubt, use zstd - the best balance of speed, ratio, and threading. Use gzip for compatibility, xz for maximum ratio; bzip2 has almost no reason for new use.

When compressing files on Linux, gzip, bzip2, xz, and zstd are the four standard choices. All pair with tar, but they differ greatly in compression ratio, compression speed, decompression speed, and parallel support.

Quick reference table

Format | Extension | Ratio    | Compress | Decompress | In one line
gzip   | .gz       | Low      | Fast     | Fast       | The compatibility king
bzip2  | .bz2      | Medium   | Slow     | Slow       | Older gen, fading
xz     | .xz       | High     | Slowest  | Medium     | Ratio-focused
zstd   | .zst      | Med-High | Fast     | Fastest    | The modern default

Assumptions (target environment)

  • A common Linux distribution (Ubuntu / RHEL family, etc.)
  • zstd may not be installed on older systems; install it with apt install zstd or dnf install zstd

What Makes Each Format Different?

Conclusion: gzip uses DEFLATE (fast, ubiquitous), bzip2 uses BWT (medium ratio but slow), xz uses LZMA2 (highest ratio), and zstd is fast with a wide tuning range.

gzip - The Compatibility Standard

gzip uses DEFLATE (LZ77 + Huffman coding). It has existed since 1992 and is installed almost everywhere. Its ratio is the lowest of the four, but it is fast and decompresses anywhere. It is the de facto standard for distribution, including HTTP's Content-Encoding: gzip.

bzip2 - The BWT Old-Timer

bzip2 is block-sorting compression based on the Burrows-Wheeler Transform (BWT). It compresses better than gzip, but both compression and decompression are slow. It used to be the "smaller than gzip" option, yet today it loses to xz and zstd in both ratio and speed, so there is almost no reason to choose it for new work. It is mostly for decompressing existing .bz2 files.

xz - The Ratio Champion

xz uses the LZMA2 algorithm and delivers the highest compression ratio of the four. In exchange, it is the slowest to compress and uses a lot of memory at high levels. It fits "compress once, distribute many times" use cases (kernel sources, distro packages, etc.).

zstd - The Modern Default

zstd (Zstandard) is a relatively new format with an excellent balance of speed and ratio. It reaches a higher ratio than gzip at gzip-like speeds, and at high levels it approaches xz. Its very fast decompression is a major advantage, and it is increasingly adopted by the Linux kernel, btrfs, and various package managers.

How Do Ratio and Speed Compare?

Conclusion: Ratio is xz >= zstd(high level) > bzip2 > gzip. Decompression speed is zstd > gzip > xz > bzip2. zstd alone covers "fast yet reasonably small."

Compression fundamentally runs on the "smaller means slower" trade-off. The general tendencies:

  • Ratio: xz is highest. zstd approaches xz at high levels. bzip2 is medium, gzip is lowest
  • Compression speed: gzip and zstd (low-to-mid levels) are fast. bzip2 is slow, xz is the slowest
  • Decompression speed: zstd is fastest. gzip is also fast. xz is medium, bzip2 is the slowest

In many workflows, decompression speed matters most. You compress once, but decompression runs many times at the destination. If you repeatedly extract on many servers or in CI, zstd's fast decompression pays off directly.

Actual numbers vary widely by data type (text / binary / already-compressed) and CPU. The only correct answer is to benchmark on your own representative data. Measure with time and ls -l.

$ for c in gzip bzip2 xz zstd; do \
    echo "== $c =="; \
    time $c -k -9 -f sample.dat; \
    ls -l sample.dat.* ; rm -f sample.dat.{gz,bz2,xz,zst}; \
  done
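The loop above times compression only. Because decompression is usually the step that repeats, a companion sketch can time it as well. This is a rough illustration under stated assumptions: sample.dat is generated here as a stand-in for your own data, the millisecond timing relies on GNU date supporting %N, and tools that are not installed are skipped.

```shell
# Sketch: measure compressed size and decompression time per installed tool.
# sample.dat is a generated stand-in; substitute your own representative data.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
seq 1 200000 > sample.dat

for c in gzip bzip2 xz zstd; do
  if ! command -v "$c" >/dev/null 2>&1; then
    echo "== $c: not installed, skipping =="
    continue
  fi
  "$c" -c sample.dat > "sample.dat.$c"        # compress to stdout at the default level
  start=$(date +%s%N)                         # nanoseconds (assumes GNU date)
  "$c" -dc < "sample.dat.$c" > /dev/null      # decompress from stdin, discard the output
  end=$(date +%s%N)
  echo "== $c: $(wc -c < "sample.dat.$c") bytes, decompressed in $(( (end - start) / 1000000 )) ms =="
done
```

A single run on a small file is noisy, so repeat the measurement a few times on realistically sized data before drawing conclusions.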

How Should You Choose?

Conclusion: Pick gzip for compatibility, xz when disk savings come first, and zstd for most everything else. Use bzip2 only to decompress existing files.

The decision flow is simple.

  1. Must it decompress reliably on the other side? (old systems, sharing with others) -> gzip (.gz extracts anywhere)
  2. Do you want to shave off every last byte? (archives, long-term storage, many downloads) -> xz (slow to compress but smallest)
  3. Everything else (backups, logs, most daily work) -> zstd (fast, compresses well, fastest to decompress)
  4. You received a .bz2 or have legacy assets -> decompress with bzip2 (do not use it for new compression)
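
The flow above can be written down as a small shell function, which may be handy in a backup or packaging script. This is only a sketch: pick_format and its argument names (compat, smallest, legacy) are hypothetical labels invented for this example, not part of any tool.

```shell
# Sketch of the decision flow as a POSIX shell function.
# pick_format and its argument names are hypothetical, for illustration only.
pick_format() {
  case "$1" in
    compat)   echo gzip ;;   # must extract reliably on old or unknown systems
    smallest) echo xz ;;     # every byte counts; slow compression is acceptable
    legacy)   echo bzip2 ;;  # only for decompressing existing .bz2 files
    *)        echo zstd ;;   # everything else: backups, logs, daily work
  esac
}

pick_format compat     # prints: gzip
pick_format smallest   # prints: xz
pick_format backup     # prints: zstd
```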

One-line guidance

  • When in doubt, zstd
  • "Handing it to someone" -> gzip
  • "As small as possible" -> xz

What Are the Basic Commands?

Conclusion: All four share the same pattern: cmd file to compress, cmd -d file.ext to decompress, and -k to keep the original.

Single-file compression and decompression follow nearly identical conventions across all commands.

# Compress (gzip, bzip2, and xz remove the original)
$ gzip  file.txt        # -> file.txt.gz
$ bzip2 file.txt        # -> file.txt.bz2
$ xz    file.txt        # -> file.txt.xz
$ zstd  file.txt        # -> file.txt.zst (original kept)

# Compress while keeping the original (-k = keep)
$ gzip -k file.txt
$ xz   -k file.txt

# Decompress (-d = decompress)
$ gzip  -d file.txt.gz
$ xz    -d file.txt.xz
$ zstd  -d file.txt.zst

# Dedicated decompression commands exist too
$ gunzip  file.txt.gz
$ bunzip2 file.txt.bz2
$ unxz    file.txt.xz
$ unzstd  file.txt.zst

gzip, bzip2, and xz delete the original by default. Add -k (keep) to retain it. zstd does the opposite - it keeps the original by default, so add --rm if you want it removed. The behavior is reversed, so be careful.
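
The difference is easy to confirm in a scratch directory. The sketch below compresses a throwaway file with each installed tool using default options and reports whether the original survived. The demo-*.txt file names are made up for this example.

```shell
# Sketch: check which tools remove the original file by default.
# Runs in a temporary directory; demo-*.txt names are illustrative only.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

for c in gzip bzip2 xz zstd; do
  command -v "$c" >/dev/null 2>&1 || continue   # skip tools that are not installed
  echo "hello" > "demo-$c.txt"
  "$c" -q "demo-$c.txt"                          # default options, quiet output
  if [ -e "demo-$c.txt" ]; then
    echo "$c: original kept"
  else
    echo "$c: original removed"
  fi
done
```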

To inspect contents without writing a decompressed file, use each command's -dc (decompress to stdout) or the shortcuts zcat / bzcat / xzcat / zstdcat.

$ zcat access.log.gz | grep 500
$ zstdcat backup.tar.zst | tar tf -

What About Levels and Multithreading?

Conclusion: gzip, bzip2, and xz take levels up to -9, while zstd goes up to -19 (higher means smaller but slower). xz and zstd accept -T0 to compress in parallel across all CPU cores, which cuts time substantially on large inputs.

Compression levels

Higher numbers compress more but run slower. Typical defaults:

  • gzip: -1 to -9, default -6
  • bzip2: -1 to -9, default -9 (block size)
  • xz: -0 to -9, default -6
  • zstd: -1 to -19, default -3. Adding --ultra unlocks levels 20 to 22, at the cost of much more memory

$ gzip -9 file        # max compression
$ xz -9 file          # high ratio (slow, more memory)
$ zstd -19 file       # zstd's normal maximum
$ zstd --ultra -22 file   # zstd's absolute maximum
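
To see what a level change buys on your own data, compress to stdout and count bytes instead of writing files. The sketch below is a rough illustration that assumes a generated stand-in file named sample.dat and skips tools that are not installed. The sizes it prints depend entirely on the input.

```shell
# Sketch: compare output size across levels without creating compressed files.
# sample.dat is a generated stand-in; use your own data for meaningful numbers.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
seq 1 200000 > sample.dat

echo "original: $(wc -c < sample.dat) bytes"
for c in gzip xz zstd; do
  command -v "$c" >/dev/null 2>&1 || continue   # skip tools that are not installed
  for level in 1 6 9; do
    size=$("$c" "-$level" -c sample.dat | wc -c)  # -c writes to stdout, input is untouched
    echo "$c -$level: $size bytes"
  done
done
```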

Multithreading (parallelism)

Parallelism helps on large files.

$ xz   -T0 big.tar    # compress using all cores
$ zstd -T0 big.tar    # compress using all cores (0 = auto)

gzip and bzip2 themselves do not support parallelism, but compatible parallel implementations exist. Install pigz (parallel gzip) and pbzip2 / lbzip2 to use all cores while keeping the .gz / .bz2 format.
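
A script can prefer the parallel implementation when it is present and fall back to plain gzip otherwise. The following is a minimal sketch, assuming pigz may or may not be installed and using a generated stand-in file named big.dat. The gz variable is just a local name chosen for this example.

```shell
# Sketch: use pigz when available, otherwise single-threaded gzip.
# big.dat is a generated stand-in; gz is an illustrative variable name.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
seq 1 200000 > big.dat

if command -v pigz >/dev/null 2>&1; then
  gz=pigz
else
  gz=gzip
fi

"$gz" -c big.dat > big.dat.gz    # either tool writes a standard .gz stream

# Plain gzip can read the result regardless of which tool produced it
gzip -dc < big.dat.gz | cmp - big.dat && echo "round trip OK with $gz"
```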

How Do You Combine With tar?

Conclusion: tar has shortcuts -z (gzip), -j (bzip2), and -J (xz). For zstd, use --zstd, or use -a (as in tar caf), which picks the format from the output file's extension.

Bundling multiple files (tar) and compression are separate steps. Specify the compression format via tar options.

# Create (c = create, f = file)
$ tar czf archive.tar.gz   dir/   # gzip
$ tar cjf archive.tar.bz2  dir/   # bzip2
$ tar cJf archive.tar.xz   dir/   # xz
$ tar --zstd -cf archive.tar.zst dir/   # zstd

# On extraction, the format is auto-detected
$ tar xf archive.tar.gz
$ tar xf archive.tar.zst

Auto-detect by extension (-a)

With tar's -a (--auto-compress), tar picks the format from the output file's extension. You can switch formats by changing the extension, without remembering the per-format options (z / j / J).

$ tar caf archive.tar.zst dir/   # .zst -> zstd
$ tar caf archive.tar.xz  dir/   # .xz  -> xz
$ tar caf archive.tar.gz  dir/   # .gz  -> gzip

Older tar (especially non-GNU) may not support --zstd or -a. In that case, use a pipe.

$ tar cf - dir/ | zstd -T0 -o archive.tar.zst
$ zstd -dc archive.tar.zst | tar xf -
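
If a script has to run on machines with different tar versions, it can try --zstd first and fall back to the pipe. The sketch below is one way to do that under stated assumptions: the directory dir/ and file a.txt are created here as stand-ins, and the whole step is skipped when zstd is not installed.

```shell
# Sketch: create a .tar.zst with tar --zstd when supported, else via a pipe.
# dir/ and a.txt are stand-ins created for this example.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p dir
echo "hello" > dir/a.txt

if ! command -v zstd >/dev/null 2>&1; then
  echo "zstd not installed; skipping"
elif tar --zstd -cf archive.tar.zst dir/ 2>/dev/null; then
  echo "created with tar --zstd"
else
  # This tar does not understand --zstd: pipe through zstd instead
  tar cf - dir/ | zstd -q -T0 -f -o archive.tar.zst
  echo "created with a pipe"
fi

# Listing through a pipe works whether or not tar has built-in zstd support
if [ -f archive.tar.zst ]; then
  zstd -dc archive.tar.zst | tar tf -
fi
```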

Summary and Next Steps

  • When in doubt, zstd - the best balance of speed, ratio, decompression, and threading
  • gzip for distribution and compatibility, xz for maximum ratio
  • bzip2 has little reason for new use (decompressing existing .bz2 only)
  • Levels run up to -9 (zstd up to -19, or -22 with --ultra); -T0 enables parallel compression in xz and zstd
  • Numbers are environment-dependent. Benchmark your own representative data before deciding
