split Command: Splitting and Joining Large Files

What If a File Is Too Big?

Lina: Senpai, I tried to copy a 5GB log file to a USB stick and it said "file too large"...
Linny-senpai: That's a job for the split command. It cuts a big file into small pieces that you can join back together later, exactly as they were. Let's walk through it.

What You'll Learn

  • How to split a large file by size, lines, or piece count with split
  • How to join the pieces back into the original file with cat
  • How to use numbered suffixes (part_01 instead of xaa)
  • How to verify the file is intact after splitting and joining

1. What Is the split Command?

Conclusion: split breaks one file into several smaller files; concatenating them with cat restores the original byte-for-byte.

Lina: Wait, does "splitting" damage the original file?
Linny-senpai: No. split only cuts a copy into pieces; the original stays untouched. And when you join the pieces in order, you get the original back without losing a single byte.
Lina: That's reassuring. When would I use it?
Linny-senpai: When you need to fit a file onto size-limited media, transfer a huge file in chunks, or break a giant log into manageable pieces. Let's start with the simplest form.

First, create a file to practice with.

# Create a 50MB dummy file
$ dd if=/dev/zero of=bigfile.dat bs=1M count=50
$ ls -lh bigfile.dat
-rw-r--r-- 1 user user 50M Jun  5 10:00 bigfile.dat

2. How to Split by Size

Conclusion: Use split -b SIZE file prefix. -b 100M makes 100 MiB pieces and -b 10M makes 10 MiB pieces (M counts in units of 1024x1024 bytes).

Lina: Let me try cutting it into 10MB pieces.
Linny-senpai: Use -b (for bytes). The part_ at the end is the prefix added to the front of each output file name.
$ split -b 10M bigfile.dat part_
$ ls -lh part_*
-rw-r--r-- 1 user user 10M Jun  5 10:01 part_aa
-rw-r--r-- 1 user user 10M Jun  5 10:01 part_ab
-rw-r--r-- 1 user user 10M Jun  5 10:01 part_ac
-rw-r--r-- 1 user user 10M Jun  5 10:01 part_ad
-rw-r--r-- 1 user user 10M Jun  5 10:01 part_ae
Lina: So it goes part_aa, part_ab... with letters increasing.
Linny-senpai: Right. If you omit the prefix, you get xaa, xab... The size units are K, M, G. Note that 10M means 10x1024x1024 bytes, while 10MB means 10x1000x1000 bytes.
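
The difference between the 1024-based and 1000-based suffixes is easy to see on a tiny file. The following is a minimal sketch that assumes GNU split; the scratch directory and the names sample.dat, k_ and kb_ are made up for illustration.

```shell
# Sketch: compare the 1024-based K suffix with the 1000-based KB suffix.
# Assumes GNU split; all file names here are illustrative.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# 2048 bytes of zeros, i.e. exactly 2 x 1024 bytes
head -c 2048 /dev/zero > sample.dat

split -b 1K  sample.dat k_     # 1K  = 1024 bytes per piece
split -b 1KB sample.dat kb_    # 1KB = 1000 bytes per piece

count_k=$(ls k_* | wc -l)
count_kb=$(ls kb_* | wc -l)
echo "1K pieces:  $count_k"
echo "1KB pieces: $count_kb"
```

The same 2048-byte file yields 2 pieces with 1K but 3 pieces with 1KB (1000 + 1000 + 48 bytes), which is the kind of surprise that shows up at larger scale with M and MB.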

Handy size guide

  • split -b 700M -> fits on one CD
  • split -b 100M -> easy cloud-upload size
  • split -b 1G -> 1GB per piece

3. How to Split by Line Count

Conclusion: For text and logs, split -l LINES file prefix splits on line boundaries so no line is ever cut in half.

Lina: What about splitting a log file every 1000 lines?
Linny-senpai: Use -l (for lines). Splitting by size can cut a line right in the middle, but -l always breaks at a line boundary. Much safer for CSV and logs.
$ split -l 1000 access.log chunk_
$ wc -l chunk_*
   1000 chunk_aa
   1000 chunk_ab
    342 chunk_ac
   2342 total

Size splitting (-b) cuts mechanically at a byte offset, so in a text file a line may be split across two pieces. When line meaning matters, always use -l.
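
To see the difference concretely, the sketch below splits the same small text file both ways. It assumes GNU coreutils; sample.log and the bytes_ and lines_ prefixes are hypothetical names used only for this demonstration.

```shell
# Sketch: byte splitting vs. line splitting on the same text file.
# Assumes GNU coreutils; all file names here are illustrative.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Ten lines of 10 bytes each ("record-NN" plus a newline), 100 bytes total
for i in $(seq 1 10); do printf 'record-%02d\n' "$i"; done > sample.log

split -b 25 sample.log bytes_    # byte split: ignores line boundaries
split -l 4  sample.log lines_    # line split: breaks only after a newline

# The first byte-split piece stops in the middle of the third line
tail -c 5 bytes_aa; echo

# Every line-split piece holds whole lines only
wc -l lines_*
```

Here bytes_aa ends with the fragment "recor", while the line-split pieces contain 4, 4 and 2 complete lines.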

4. How to Split into a Fixed Number of Pieces

Conclusion: split -n COUNT file prefix divides the file into exactly that many pieces of nearly equal size.

Lina: Sometimes I don't care about "10MB each" - I just want exactly 5 pieces.
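Linny-senpai: Then use -n (for number). It divides the whole file into 5 parts of nearly equal size, so you don't have to calculate sizes. When the file size doesn't divide evenly, the pieces differ by a few bytes. Like -b, plain -n cuts at byte offsets, so for text files GNU split offers -n l/5, which makes 5 pieces without breaking any line.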
$ split -n 5 bigfile.dat group_
$ ls -lh group_*
-rw-r--r-- 1 user user 10M Jun  5 10:05 group_aa
-rw-r--r-- 1 user user 10M Jun  5 10:05 group_ab
-rw-r--r-- 1 user user 10M Jun  5 10:05 group_ac
-rw-r--r-- 1 user user 10M Jun  5 10:05 group_ad
-rw-r--r-- 1 user user 10M Jun  5 10:05 group_ae
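
When the file is text, it can help to compare plain -n with the line-aware form. This is a rough sketch assuming GNU split, whose -n option accepts l/N to split into N pieces without breaking lines; the file and prefix names are invented for the example.

```shell
# Sketch: split into a fixed number of pieces, by bytes and by whole lines.
# Assumes GNU split (the l/N form is a GNU extension); names are illustrative.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Ten 10-byte lines, 100 bytes total (not divisible by 3)
for i in $(seq 1 10); do printf 'record-%02d\n' "$i"; done > sample.log

split -n 3   sample.log bytes_   # 3 pieces by byte count; lines may be cut
split -n l/3 sample.log lines_   # 3 pieces, each ending on a line boundary

# Piece sizes are close to equal but not identical
wc -c bytes_* lines_*

# Either way, concatenating the pieces gives back the original
cat bytes_* | cmp - sample.log && echo "byte pieces rejoin cleanly"
cat lines_* | cmp - sample.log && echo "line pieces rejoin cleanly"
```

The exact byte counts per piece can vary between coreutils versions, so the sketch only relies on the piece count and on the round trip.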

5. How to Join the Pieces Back

Conclusion: No special command is needed. cat prefix* > restored_file concatenates the pieces in order to rebuild the original.

Lina: I split it, but how do I put it back? Is there a "join" command?
Linny-senpai: Good question. There is a join command, but that's for joining table columns - completely different. To reassemble split pieces, you use cat.
Lina: The cat command that displays files?
Linny-senpai: Yes. cat also concatenates multiple files in order. Redirect with > to write the result to a file, and you're done.
$ cat part_* > restored.dat
$ ls -lh restored.dat
-rw-r--r-- 1 user user 50M Jun  5 10:10 restored.dat

Watch the order. The * wildcard in cat part_* expands in alphabetical order, so part_aa -> part_ab -> ... stays correct. But if you name files by hand with unpadded numbers like part_1, part_2, ... part_10, then part_10 sorts before part_2 and the pieces are joined out of order. Use the zero-padded numbering in the next section to stay safe.
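
The ordering problem can be reproduced without split at all. In this sketch the part_N and padded_NN files are hypothetical hand-named pieces that each contain only their own number, so the output of cat shows the join order directly.

```shell
# Sketch: how the shell orders unpadded vs. zero-padded file names.
# The file names are illustrative; each file contains just its own number.
set -eu
export LC_ALL=C    # fix the collation so the glob order is reproducible
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

for i in $(seq 1 10); do
    printf '%s\n' "$i" > "part_$i"                          # part_1 ... part_10
    printf '%s\n' "$i" > "$(printf 'padded_%02d' "$i")"     # padded_01 ... padded_10
done

unpadded_order=$(cat part_* | tr '\n' ' ')
padded_order=$(cat padded_* | tr '\n' ' ')
echo "unpadded: $unpadded_order"
echo "padded:   $padded_order"
```

The unpadded names come out as 1 10 2 3 ... 9, while the padded names come out as 1 2 3 ... 10.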

6. How to Use Numbered Suffixes

Conclusion: -d gives numeric suffixes (00, 01...), -a sets the digit count, and --additional-suffix adds an extension.

Lina: I'd rather have 01, 02 than aa, ab - it's clearer.
Linny-senpai: Add -d (for digits) to get numbers. Set the width with -a, and you can even add an extension like .part with --additional-suffix.
$ split -b 10M -d -a 2 --additional-suffix=.part bigfile.dat backup_
$ ls backup_*
backup_00.part  backup_01.part  backup_02.part  backup_03.part  backup_04.part

With zero-padded numbers (00, 01, ... 10, 11), cat backup_*.part > restored.dat always joins in the correct order. If you expect more than 100 pieces, use -a 3 for three digits.
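
A quick way to try a wider suffix is on a small file. The sketch below assumes GNU split (for -d and --additional-suffix) and uses a 5000-byte random file as a stand-in for a large one; all names are illustrative.

```shell
# Sketch: three-digit numeric suffixes plus an extension, then a round trip.
# Assumes GNU split; sample.dat stands in for a much larger file.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

head -c 5000 /dev/urandom > sample.dat

# 1000-byte pieces named backup_000.part, backup_001.part, ...
split -b 1000 -d -a 3 --additional-suffix=.part sample.dat backup_
ls backup_*

# The zero-padded names expand in numeric order, so the join is safe
cat backup_*.part > restored.dat
cmp sample.dat restored.dat && echo "restored.dat is identical"
```

With -a 3 the suffixes run from 000 to 999, which leaves room for up to 1000 pieces.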

7. How to Verify the File Is Intact

Conclusion: Compare sha256sum hashes before and after. Matching values prove the file was restored byte-for-byte.

Lina: I'm nervous the joined file isn't really identical to the original...
Linny-senpai: That's what sha256sum is for. It's a kind of "fingerprint" computed from the file's contents. If the original and restored files have the same fingerprint, they're identical. It also catches corruption during transfer or copy.
$ sha256sum bigfile.dat restored.dat
<64-hex-digit hash>  bigfile.dat
<same 64-hex-digit hash>  restored.dat
Lina: The long strings on the left match! That's a relief.
Linny-senpai: If they differed, it would mean the join order was wrong or the data got corrupted in transit. In that case, check the order and join again; if the hashes still differ, copy the pieces over again or redo the split.
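
When pieces travel separately, hashing each piece can show which one was damaged, so only that piece has to be copied again. This is a sketch under the assumption that GNU sha256sum is available; parts.sha256 and report.txt are hypothetical file names, and the corruption is simulated by overwriting one piece.

```shell
# Sketch: per-piece checksums to locate a damaged piece.
# Assumes GNU coreutils; file names are illustrative.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

head -c 5000 /dev/urandom > sample.dat
split -b 1000 -d sample.dat part_          # part_00 ... part_04

# Sender side: record a hash for every piece
sha256sum part_* > parts.sha256

# Simulate damage in transit by overwriting one piece
printf 'corrupted' > part_02

# Receiver side: --check re-hashes each listed file and reports OK or FAILED
if sha256sum --check parts.sha256 > report.txt 2>/dev/null; then
    echo "all pieces intact"
else
    grep FAILED report.txt
fi
```

In this run only part_02 is reported as FAILED; the other four pieces pass.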

Mini Exercise

Create a 30MB file called practice.dat, then (1) split it into 7MB pieces, (2) join them with cat, and (3) confirm the hash matches the original.

Hint: dd if=/dev/zero of=practice.dat bs=1M count=30 -> split -b 7M practice.dat p_ -> cat p_* > joined.dat -> sha256sum practice.dat joined.dat

8. Common Pitfalls and Fixes

Conclusion: Most trouble comes from join order, unit confusion, or running out of disk space. Check capacity and units before you split.

Symptom                          | Cause                                   | Fix
Joined file is corrupted         | Wrong join order                        | Use zero-padded -d and cat ...*
More/fewer pieces than expected  | M (1024-based) vs MB (1000-based)       | Use one unit consistently
No space left on device          | Original plus pieces need ~2x the space | Check free space with df -h first
Text lines cut in half           | You used -b (bytes)                     | Re-split with -l (lines)

Don't do this

  • Deleting the original before testing the join
  • Skipping the hash verification
  • Starting a split without checking free space
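
The free-space check can also be scripted so the split never starts on a disk that is too full. The helper below is a hypothetical function named has_room, written for this example only; it assumes a df that supports the POSIX -P and -k options and compares the file size with the space available in the target directory.

```shell
# Sketch: refuse to split when the target directory lacks room for the pieces.
# has_room is a made-up helper for this example, not part of split.
set -eu

# has_room FILE [DIR]: succeed if DIR has at least FILE's size in free bytes.
# (Plain sh has no local variables, so these names are global.)
has_room() {
    file=$1
    target_dir=${2:-.}
    need=$(wc -c < "$file")
    # df -Pk prints one line per filesystem; column 4 is available 1024-byte blocks
    avail_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')
    [ $((avail_kb * 1024)) -ge $((need)) ]
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
head -c 4096 /dev/zero > "$workdir/sample.dat"

if has_room "$workdir/sample.dat" "$workdir"; then
    split -b 1K "$workdir/sample.dat" "$workdir/part_"
    echo "split done"
else
    echo "not enough free space; skipping split" >&2
fi
```

The check is deliberately simple: it ignores filesystem overhead and anything else writing to the disk at the same time, so treat a pass as a minimum requirement rather than a guarantee.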

Summary / Next Reading