Character Encoding: UTF-8 and Mojibake Explained

What You'll Learn

  • What character encoding and UTF-8 actually mean, and how the terms relate
  • Why mojibake (garbled text) happens, from first principles
  • A reliable workflow for checking and converting encodings on Linux (file / iconv / nkf / locale)

Quick Summary

  • An encoding is a mapping between characters and byte sequences. A character set (Unicode) and an encoding form (UTF-8) are two different things.
  • UTF-8 is ASCII-compatible, variable-length (1-4 bytes), and the de facto standard on the web and Linux.
  • Mojibake happens when the encoding used to write differs from the one used to read.
  • Check with file -i / nkf -g; convert with iconv.

What Is Character Encoding?

Conclusion: A character encoding is the rule that maps characters to byte sequences. Computers store only bytes, so any text must be encoded to be saved or transmitted.

A computer can only handle bytes (numbers from 0 to 255). It cannot store the character "あ" or "A" directly. So we need a table that decides "which character is represented by which byte sequence." That is the character encoding.

Separating two layers that are easy to confuse makes mojibake far easier to reason about.

Layer          | Role                                                | Examples
Character set  | Assigns each character a unique number (code point) | Unicode, JIS X 0208
Encoding form  | Turns code points into actual byte sequences        | UTF-8, UTF-16, Shift_JIS, EUC-JP

For example, "あ" has the Unicode code point U+3042. Which bytes represent U+3042 depends on the encoding form.

Character "あ" = Unicode code point U+3042
  UTF-8  => E3 81 82      (3 bytes)
  UTF-16    => 30 42      (2 bytes, big-endian byte order)
  EUC-JP    => A4 A2      (2 bytes)
  Shift_JIS => 82 A0      (2 bytes)

Key point: The same "あ" becomes different bytes under different encodings. So if you misjudge "which encoding wrote this," you can no longer turn the bytes back into the right characters. That is exactly what mojibake is.
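
The byte values above can be reproduced on a typical Linux system. The following is a minimal sketch, assuming iconv and od are installed and that the local iconv recognizes the encoding names used here. bytes_of is a hypothetical helper name introduced only for this illustration. The character is written with octal escapes so the result does not depend on how the script file itself is saved.

```shell
#!/bin/sh
# bytes_of ENCODING: print the bytes of "あ" (U+3042) in ENCODING as hex.
# (bytes_of is an illustrative helper, not a standard command.)
bytes_of() {
  # \343\201\202 is the UTF-8 form of "あ" (E3 81 82) written in octal.
  printf '\343\201\202' |
    iconv -f UTF-8 -t "$1" |
    od -An -tx1 |
    tr -s ' \n' ' ' |
    sed 's/^ *//; s/ *$//'
}

# Same character, four encodings, four different byte sequences.
for enc in UTF-8 UTF-16BE EUC-JP SHIFT_JIS; do
  printf '%-10s => %s\n' "$enc" "$(bytes_of "$enc")"
done
```

UTF-16BE is requested explicitly because plain UTF-16 leaves the byte order to the converter, which may also prepend a BOM.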

What Is UTF-8, and Why Is It Dominant?

Conclusion: UTF-8 is one encoding form for Unicode. Being ASCII-compatible, variable-length, and free of byte-order issues, it has become the de facto standard on the web and Linux.

UTF-8 encodes every Unicode character using a variable length of 1 to 4 bytes. It became dominant because of these properties.

  • ASCII-compatible: U+0000-U+007F (letters, digits, symbols) use a single byte, identical to ASCII. Existing English-centric tools keep working.
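  • Compact via variable length: ASCII takes 1 byte; accented Latin, Greek, and Cyrillic letters take 2; most Japanese characters take 3; emoji and other characters outside the Basic Multilingual Plane take 4. ASCII-heavy text is therefore much smaller than in fixed-width UTF-32, although Japanese text is larger than in Shift_JIS or UTF-16 (typically 2 bytes per character).
  • No byte-order (endianness) problem: UTF-8 is defined as a sequence of single bytes, so unlike UTF-16 it has no big-endian and little-endian variants and needs no BOM (byte order mark) to tell them apart.
  • Self-synchronizing: The leading bits of each byte signal "lead byte vs continuation byte," so character boundaries are easy to recover even mid-stream.

# Count the bytes of a string (in a UTF-8 environment)
echo -n "あ" | wc -c
3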
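
The variable-length and self-synchronizing properties listed above can be observed directly. This is a rough sketch, assuming od, awk, and wc are available; utf8_len and classify_bytes are hypothetical helper names made up for this example. Characters are written as octal escapes so the script works regardless of the encoding of the script file.

```shell
#!/bin/sh
# utf8_len STRING: number of bytes in STRING (wc -c counts bytes, not characters).
utf8_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

# classify_bytes: read bytes on stdin and label each one by its leading bits.
classify_bytes() {
  od -An -v -tu1 | tr -s ' ' '\n' | awk '
    $1 == "" { next }
    $1 < 128 { print $1, "single byte (0xxxxxxx)"; next }
    $1 < 192 { print $1, "continuation (10xxxxxx)"; next }
    $1 < 224 { print $1, "lead of 2-byte sequence (110xxxxx)"; next }
    $1 < 240 { print $1, "lead of 3-byte sequence (1110xxxx)"; next }
             { print $1, "lead of 4-byte sequence (11110xxx)" }'
}

# "A", "é", "あ", and an emoji (U+1F600) take 1, 2, 3, and 4 bytes.
for ch in "$(printf 'A')" "$(printf '\303\251')" \
          "$(printf '\343\201\202')" "$(printf '\360\237\230\200')"; do
  printf '%s: %s byte(s)\n' "$ch" "$(utf8_len "$ch")"
done

# "A" followed by "あ": one single byte, one lead byte, two continuation bytes.
printf 'A\343\201\202' | classify_bytes
```

Because a continuation byte always starts with the bits 10, a reader that lands in the middle of a character can skip forward to the next byte that does not, and resume decoding there.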

UTF-8 and Unicode are not synonyms. Unicode is the standard that assigns numbers to characters; UTF-8 is one way to turn those numbers into bytes. "Save as UTF-8" is precise, but "save as Unicode" is inherently ambiguous.

A Note on UTF-8 with BOM

A UTF-8 file can start with a BOM (the 3 bytes EF BB BF). Some Windows editors add it. When it lands at the start of a shell script, the file no longer begins with the two bytes #!, so the #!/bin/bash line is not recognized and execution fails or misbehaves. On Linux, UTF-8 without BOM is the norm.
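
One way to check for a BOM and write a cleaned copy is sketched below. It assumes head -c, tail -c, od, and mktemp are available (true on common Linux systems); has_bom and strip_bom are hypothetical helper names, and the sample files are created in a temporary directory just for the demonstration.

```shell
#!/bin/sh
# has_bom FILE: succeed if FILE starts with the UTF-8 BOM (EF BB BF).
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: write FILE to stdout without a leading BOM, if it has one.
strip_bom() {
  if has_bom "$1"; then
    tail -c +4 "$1"   # start output at byte 4, skipping the 3 BOM bytes
  else
    cat "$1"
  fi
}

# Demo: build a script that starts with a BOM, then write a cleaned copy.
dir=$(mktemp -d)
printf '\357\273\277#!/bin/sh\necho hello\n' > "$dir/with_bom.sh"
has_bom "$dir/with_bom.sh" && echo "BOM found in with_bom.sh"
strip_bom "$dir/with_bom.sh" > "$dir/without_bom.sh"
has_bom "$dir/without_bom.sh" || echo "without_bom.sh is clean"
```

Writing to a separate file rather than editing in place keeps the original available if the check turns out to be wrong.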

Why Does Mojibake (Garbled Text) Happen?

Conclusion: Mojibake happens when the encoding used to write differs from the one used to read. The bytes are not corrupted; only the interpretation rule is mismatched.

Mojibake (a Japanese word that English has borrowed as is) occurs when bytes are interpreted with the wrong encoding. Three typical cases:

  1. Opening a Shift_JIS file as UTF-8 (and vice versa)
  2. Opening a UTF-8 file as Latin-1 (ISO-8859-1) (for example, "é" shows up as "Ã©")
  3. The terminal locale does not match the file's encoding

The important point: the original bytes are intact. Align the interpretation rule correctly and, in most cases, the text comes back.

Correct:  こんにちは        (UTF-8 bytes read as UTF-8)
Garbled:  ã“ã‚“ã«ã¡ã¯     (UTF-8 bytes read as Windows-1252, which is often labeled Latin-1)

When you see mojibake, before panicking that "the file is broken," first separate "which encoding wrote it" from "which encoding is reading it." A mismatch is the cause most of the time.
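
The claim that the bytes survive can be checked by producing mojibake on purpose and then reversing it. This is an illustrative sketch, assuming iconv and od are installed; hex is a hypothetical helper, and "é" is written as octal escapes.

```shell
#!/bin/sh
# hex: print stdin as a continuous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

original=$(printf '\303\251')   # "é" in UTF-8: C3 A9

# Misread the two UTF-8 bytes as two Latin-1 characters ("Ã" and "©"),
# then store those two characters as UTF-8. This is the garbled form.
garbled=$(printf '%s' "$original" | iconv -f ISO-8859-1 -t UTF-8)

# Undo the mistake: turn the two characters back into their Latin-1 bytes,
# which are exactly the original UTF-8 bytes.
restored=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)

printf 'original: %s\n' "$(printf '%s' "$original" | hex)"
printf 'garbled:  %s\n' "$(printf '%s' "$garbled" | hex)"
printf 'restored: %s\n' "$(printf '%s' "$restored" | hex)"
```

The round trip is lossless here because ISO-8859-1 assigns a character to every byte value. With an encoding that leaves some byte values unassigned, a misreading can drop or replace bytes, and then the reversal may not be clean.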

How Do You Check and Convert Encodings?

Conclusion: Check with file -i or nkf -g, convert with iconv. For terminal display issues, suspect locale and LANG.

Guess a File's Encoding

# Show the charset in MIME form
file -i notes.txt
notes.txt: text/plain; charset=utf-8

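file only guesses from the content, so short or ambiguous files may be reported as unknown-8bit, and a file containing only ASCII is reported as us-ascii (which is also valid UTF-8). For Japanese text, nkf -g (guess) is often more reliable.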

# Detect the Japanese encoding (nkf must be installed)
nkf -g legacy.txt
Shift_JIS
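
Guessing can be complemented by a yes/no validity check. The sketch below assumes an iconv that exits with a non-zero status on malformed input (GNU iconv does); is_valid_utf8 is a hypothetical helper name, and the two sample files are generated only for the demonstration.

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeed if every byte sequence in FILE is well-formed UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but it forces iconv to decode the input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

dir=$(mktemp -d)
printf '\343\201\202\n' > "$dir/utf8.txt"   # "あ" in UTF-8
printf '\202\240\n'     > "$dir/sjis.txt"   # "あ" in Shift_JIS

for f in "$dir/utf8.txt" "$dir/sjis.txt"; do
  if is_valid_utf8 "$f"; then
    echo "$f: valid UTF-8"
  else
    echo "$f: not valid UTF-8"
  fi
done
```

A pass does not prove the author intended UTF-8 (pure ASCII always passes), but a failure is a strong sign that the file is in some other encoding.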

Convert Between Encodings

iconv converts by specifying the source (-f) and target (-t).

# Convert from Shift_JIS to UTF-8 and save to a new file (legacy.txt is left untouched)
iconv -f SHIFT_JIS -t UTF-8 legacy.txt > utf8.txt

# List the available encodings
iconv -l

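If you specify the wrong source encoding, iconv either stops with an "illegal input sequence" error or, worse, succeeds and writes garbled text as perfectly valid UTF-8. Converting that output again stacks a second layer of mojibake, and recovery becomes hard. Always confirm the original encoding with file -i / nkf -g first, and keep the original file.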
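
A cautious conversion can be wrapped in a small function that keeps the original and only overwrites on success. This is a sketch under stated assumptions: convert_to_utf8 and the .orig suffix are hypothetical choices made for this example, and iconv and mktemp are assumed to be available.

```shell
#!/bin/sh
# convert_to_utf8 SRC_ENCODING FILE
# Convert FILE to UTF-8 in place, keeping the original bytes in FILE.orig.
# If iconv rejects the input, FILE is left unchanged.
convert_to_utf8() {
  src_enc=$1
  file=$2
  tmp=$(mktemp) || return 1
  if iconv -f "$src_enc" -t UTF-8 "$file" > "$tmp" 2>/dev/null; then
    cp -p "$file" "$file.orig" &&   # keep the original bytes
      cat "$tmp" > "$file"          # overwrite contents, keeping permissions
    status=$?
  else
    echo "iconv rejected $file as $src_enc; file left unchanged" >&2
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Usage (file name taken from the example above):
#   convert_to_utf8 SHIFT_JIS legacy.txt
```

A successful exit only means the bytes were decodable under the source encoding you named, not that the name was right, so it is still worth opening the result and reading a few lines before deleting the .orig copy.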

Check the Terminal and Locale

If the file itself is UTF-8 but it still looks garbled in the terminal, the locale setting is often the cause.

# Show the current locale
locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
...

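If LANG and LC_CTYPE end in .UTF-8, programs treat their input and output as UTF-8. The terminal emulator itself must also be set to UTF-8, which most modern ones are by default. With C or POSIX, non-ASCII text may garble. Note that LC_ALL, if set, overrides both LC_CTYPE and LANG. To switch for the current shell session only: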

export LANG=en_US.UTF-8
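
To check the effective character encoding without reading every variable by eye, locale charmap can be used. The sketch below assumes the locale command is installed; is_utf8_locale is a hypothetical helper name. The last command runs in a subshell so it does not change your session.

```shell
#!/bin/sh
# is_utf8_locale: succeed if the effective locale's character map is UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
  echo "locale charset: UTF-8"
else
  echo "locale charset: $(locale charmap 2>/dev/null) (non-ASCII text may garble)"
fi

# Precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
# With LC_ALL=C set, a UTF-8 value in LANG has no effect on the charmap.
( LANG=en_US.UTF-8 LC_ALL=C; export LANG LC_ALL; locale charmap )
```

If export LANG=... seems to do nothing, an existing LC_ALL or LC_CTYPE setting is a likely reason.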

Common Problems and Fixes

Conclusion: Separate "is the file correct?" from "is the terminal correct?". Most issues are explained by an encoding mismatch or a locale setting.

Symptom                                | Likely cause                      | Fix
Whole text file is garbled             | Wrong encoding assumed            | Check with file -i, convert with iconv
Japanese garbles only in the terminal  | Locale is not UTF-8               | Check locale, set LANG=...UTF-8
Error at the start of a script         | UTF-8 with BOM                    | Remove the BOM and re-save
File names are garbled                 | Displayed in a different encoding | Match the locale, or convert with convmv

Practical tip: When in doubt, run file -i first. It almost always tells you whether the problem is "on the file side" or "on the terminal side." Creating new files as UTF-8 without BOM causes the fewest accidents.

Summary

A character encoding is a "table mapping characters to bytes," and the key to understanding it is separating the character set (Unicode) from the encoding form (UTF-8 and friends). Mojibake is not corruption but a mismatch in interpretation, so the reliable workflow is: confirm the encoding with file -i / nkf -g, convert with iconv, and suspect locale for terminal-side issues. Standardizing new files as UTF-8 without BOM is the safest default.