Character Encoding: UTF-8 and Mojibake Explained
What You'll Learn
- What character encoding and UTF-8 actually mean, and how the terms relate
- Why mojibake (garbled text) happens, from first principles
- A reliable workflow for checking and converting encodings on Linux (file, iconv, nkf, locale)
Quick Summary
- An encoding is a mapping between characters and byte sequences. A character set (Unicode) and an encoding form (UTF-8) are two different things.
- UTF-8 is ASCII-compatible, variable-length (1-4 bytes), and the de facto standard on the web and Linux.
- Mojibake happens when the encoding used to write differs from the one used to read.
- Check with file -i or nkf -g; convert with iconv.
What Is Character Encoding?
Conclusion: A character encoding is the rule that maps characters to byte sequences. Computers store only bytes, so any text must be encoded to be saved or transmitted.
A computer can only handle bytes (numbers from 0 to 255). It cannot store the character "あ" or "A" directly. So we need a table that decides "which character is represented by which byte sequence." That is the character encoding.
Separating two layers that are easy to confuse makes mojibake far easier to reason about.
| Layer | Role | Examples |
|---|---|---|
| Character set | Assigns each character a unique number (code point) | Unicode, JIS X 0208 |
| Encoding form | Turns code points into actual byte sequences | UTF-8, UTF-16, Shift_JIS, EUC-JP |
For example, "あ" has the Unicode code point U+3042. Which bytes represent U+3042 depends on the encoding form.
    Character "あ" = Unicode code point U+3042
      UTF-8  => E3 81 82 (3 bytes)
      UTF-16 => 30 42    (2 bytes)
      EUC-JP => A4 A2    (2 bytes)
Key point: The same "あ" becomes different bytes under different encodings. So if you misjudge "which encoding wrote this," you can no longer turn the bytes back into the right characters. That is exactly what mojibake is.
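The byte values above can be reproduced on the command line. The following is a minimal sketch, assuming an iconv build that supports UTF-16BE and EUC-JP and a POSIX od; the helper names hex_bytes and a_utf8 are hypothetical. It asks for UTF-16BE explicitly because a plain UTF-16 target may prepend a byte order mark.

```shell
# hex_bytes (hypothetical helper): print stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# The UTF-8 bytes of "あ" (U+3042), written as octal escapes so the
# example does not depend on the encoding this script is saved in.
a_utf8() { printf '\343\201\202'; }

utf8_hex=$(a_utf8 | hex_bytes)
utf16_hex=$(a_utf8 | iconv -f UTF-8 -t UTF-16BE | hex_bytes)
eucjp_hex=$(a_utf8 | iconv -f UTF-8 -t EUC-JP | hex_bytes)

echo "UTF-8    => $utf8_hex"
echo "UTF-16BE => $utf16_hex"
echo "EUC-JP   => $eucjp_hex"
```

One code point goes in and three different byte sequences come out, which is why the reader must know which encoding the writer used.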
What Is UTF-8, and Why Is It Dominant?
Conclusion: UTF-8 is one encoding form for Unicode. Being ASCII-compatible, variable-length, and free of byte-order issues, it has become the de facto standard on the web and Linux.
UTF-8 encodes every Unicode character using a variable length of 1 to 4 bytes. It became dominant because of these properties.
- ASCII-compatible: U+0000-U+007F (letters, digits, symbols) use a single byte, identical to ASCII. Existing English-centric tools keep working.
- Compact via variable length: ASCII is 1 byte, accented Latin, Greek, and Cyrillic letters are 2 bytes, and most Japanese characters are 3 bytes. Text is often smaller than in a fixed-width encoding such as UTF-32.
- No byte-order (endianness) problem: Unlike UTF-16, it does not depend on a BOM (byte order mark).
- Self-synchronizing: The leading bits of each byte signal "lead byte vs continuation byte," so character boundaries are easy to recover even mid-stream.
    # Count the bytes of a string (in a UTF-8 environment)
    echo -n "あ" | wc -c
    3
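To see the full 1 to 4 byte range, the same measurement can be repeated for one character from each length class. This is a small sketch; byte_len is a hypothetical helper, and the sample characters are simply convenient picks. The bytes are spelled out as octal escapes so the counts do not depend on the terminal locale.

```shell
# byte_len (hypothetical helper): count the bytes printf emits for a
# format string made of literal characters and octal escapes.
byte_len() { printf "$1" | wc -c | tr -d ' '; }

len_ascii=$(byte_len 'A')                  # U+0041,  byte 41
len_latin=$(byte_len '\303\251')           # U+00E9,  bytes C3 A9
len_kana=$(byte_len '\343\201\202')        # U+3042,  bytes E3 81 82
len_emoji=$(byte_len '\360\237\230\200')   # U+1F600, bytes F0 9F 98 80

echo "U+0041=$len_ascii U+00E9=$len_latin U+3042=$len_kana U+1F600=$len_emoji"
```

The first byte of each sequence announces the length: a byte below 0x80 stands alone, C3 (110xxxxx) starts a 2-byte sequence, E3 (1110xxxx) a 3-byte one, and F0 (11110xxx) a 4-byte one. Every following byte has the form 10xxxxxx, which is what makes the encoding self-synchronizing.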
UTF-8 and Unicode are not synonyms. Unicode is the standard that assigns numbers to characters; UTF-8 is one way to turn those numbers into bytes. "Save as UTF-8" is precise, but "save as Unicode" is inherently ambiguous.
A Note on UTF-8 with BOM
A UTF-8 file can start with a BOM (the 3 bytes EF BB BF). Some Windows editors add it. When it sneaks into the start of a shell script, the file no longer begins with the bytes #!, so the #!/bin/bash line is not recognized and execution can fail. On Linux, UTF-8 without BOM is the norm.
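A BOM is easy to detect and remove with standard tools. The sketch below uses two hypothetical helpers, has_bom and strip_bom, and a throwaway demo file created with mktemp; it assumes head -c and tail -c are available, as they are on common Linux systems.

```shell
# has_bom (hypothetical helper): succeed if the file starts with EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom (hypothetical helper): write the file to stdout without its
# first 3 bytes. Call it only after has_bom has confirmed the BOM.
strip_bom() {
    tail -c +4 "$1"
}

# Demo: a throwaway script that starts with a BOM.
demo=$(mktemp)
printf '\357\273\277#!/bin/bash\necho ok\n' > "$demo"

if has_bom "$demo"; then
    strip_bom "$demo" > "$demo.nobom"
    echo "BOM removed"
fi
first_line=$(head -n 1 "$demo.nobom")
echo "$first_line"

rm -f "$demo" "$demo.nobom"
```

Writing the cleaned copy to a new file, rather than editing in place, keeps the original available until the result has been checked.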
Why Does Mojibake (Garbled Text) Happen?
Conclusion: Mojibake happens when the encoding used to write differs from the one used to read. The bytes are not corrupted; only the interpretation rule is mismatched.
Mojibake (from the Japanese 文字化け; the term is widely used in English too) occurs when bytes are interpreted with the wrong encoding. Three typical cases:
- Opening a Shift_JIS file as UTF-8 (and vice versa)
- Opening a UTF-8 file as Latin-1 (ISO-8859-1) (for example, "é" shows up as "Ã©")
- The terminal locale does not match the file's encoding
The important point: the original bytes are intact. Align the interpretation rule correctly and, in most cases, the text comes back.
    Correct: こんにちは (UTF-8 bytes read as UTF-8)
    Garbled: ã“ã‚“ã«ã¡ã¯ (UTF-8 bytes read as Latin-1/Windows-1252)
When you see mojibake, before panicking that "the file is broken," first separate "which encoding wrote it" from "which encoding is reading it." A mismatch is the cause most of the time.
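The claim that the bytes stay intact can be checked by producing mojibake on purpose and then undoing it. This is an illustrative sketch, assuming iconv is installed; the function name original is hypothetical. Strict ISO-8859-1 maps bytes 0x80-0x9F to invisible control characters, so the garbled line may show fewer visible symbols than a Windows-1252 viewer would.

```shell
# "こんにちは" as explicit UTF-8 bytes (octal escapes), so the demo does
# not depend on how this script itself is saved.
original() {
    printf '\343\201\223\343\202\223\343\201\253\343\201\241\343\201\257'
}

# Step 1: misread the UTF-8 bytes as Latin-1 and re-encode as UTF-8.
garbled=$(original | iconv -f ISO-8859-1 -t UTF-8)
garbled_len=$(printf '%s' "$garbled" | wc -c | tr -d ' ')

# Step 2: apply the inverse conversion, which restores the original bytes.
recovered=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)

echo "garbled ($garbled_len bytes): $garbled"
echo "recovered: $recovered"
```

The 15 original bytes become 30 after the wrong conversion, yet the inverse conversion returns exactly the original text. Recovery works here because Latin-1 assigns a character to every byte value; encodings with unassigned bytes can lose information in step 1.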
How Do You Check and Convert Encodings?
Conclusion: Check with file -i or nkf -g, convert with iconv. For terminal display issues, suspect locale and LANG.
Guess a File's Encoding
    # Show the charset in MIME form
    file -i notes.txt
    notes.txt: text/plain; charset=utf-8
file only guesses, so short or ambiguous files may report unknown-8bit. For Japanese, nkf -g (guess) is effective.
    # Detect the Japanese encoding (nkf must be installed)
    nkf -g legacy.txt
    Shift_JIS
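When neither tool gives a clear answer, one more check is to ask whether the bytes are at least valid UTF-8. The sketch below relies on iconv exiting with a non-zero status when it meets a sequence that is invalid in the declared source encoding; is_valid_utf8 is a hypothetical helper name and the sample files are throwaway data.

```shell
# is_valid_utf8 (hypothetical helper): succeed if every byte sequence in
# the file is valid UTF-8. The converted output itself is discarded.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Demo: the character "あ" saved once as UTF-8 and once as Shift_JIS.
utf8_file=$(mktemp)
sjis_file=$(mktemp)
printf '\343\201\202\n' > "$utf8_file"   # E3 81 82
printf '\202\240\n'     > "$sjis_file"   # 82 A0

if is_valid_utf8 "$utf8_file"; then utf8_ok=yes; else utf8_ok=no; fi
if is_valid_utf8 "$sjis_file"; then sjis_ok=yes; else sjis_ok=no; fi

echo "UTF-8 sample valid?     $utf8_ok"
echo "Shift_JIS sample valid? $sjis_ok"

rm -f "$utf8_file" "$sjis_file"
```

A failure is strong evidence that the file is not UTF-8. A pass is weaker evidence, since pure ASCII text also passes, so treat this as a complement to file -i and nkf -g rather than a replacement.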
Convert Between Encodings
iconv converts by specifying the source (-f) and target (-t).
    # Convert from Shift_JIS to UTF-8 and save
    iconv -f SHIFT_JIS -t UTF-8 legacy.txt -o utf8.txt
    # List the available encodings
    iconv -l
If you specify the wrong source encoding, iconv either stops with an error or, worse, writes the misread characters out as new bytes, so the mojibake is baked into the output and recovery becomes hard. Always confirm the original encoding with file -i / nkf -g first, and keep the original file.
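One way to follow that advice is to wrap the conversion so the source file is never modified and a failed run leaves no half-written output. This is a sketch under stated assumptions: convert_checked is a hypothetical helper, the file names are demo data, and shell redirection is used instead of the -o option purely for portability.

```shell
# convert_checked (hypothetical helper): convert SRC from encoding FROM to
# UTF-8 at DEST. SRC is never modified; DEST appears only if iconv succeeds.
convert_checked() {
    from=$1 src=$2 dest=$3
    if iconv -f "$from" -t UTF-8 "$src" > "$dest.tmp"; then
        mv "$dest.tmp" "$dest"
    else
        rm -f "$dest.tmp"
        return 1
    fi
}

# Demo input: "あ" in Shift_JIS (bytes 82 A0).
workdir=$(mktemp -d)
printf '\202\240\n' > "$workdir/legacy.txt"

# Correct source encoding: the output holds E3 81 82 plus a newline.
convert_checked SHIFT_JIS "$workdir/legacy.txt" "$workdir/utf8.txt"
converted_hex=$(od -An -tx1 "$workdir/utf8.txt" | tr -d ' \n')
echo "converted bytes: $converted_hex"

# Wrong source encoding: iconv rejects the bytes, so nothing is written.
if convert_checked UTF-8 "$workdir/legacy.txt" "$workdir/bad.txt" 2>/dev/null; then
    wrong_guess=accepted
else
    wrong_guess=rejected
fi
echo "wrong guess was $wrong_guess"

rm -rf "$workdir"
```

A wrong guess is not always rejected. Declaring Latin-1 as the source, for instance, succeeds for any input because every byte is valid in Latin-1, which is exactly why the original file should be kept.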
Check the Terminal and Locale
If the file itself is UTF-8 but it still looks garbled in the terminal, the locale setting is often the cause.
    # Show the current locale
    locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    ...
If LANG and LC_CTYPE end in .UTF-8, programs started from that shell treat text as UTF-8. With C or POSIX, non-ASCII text may garble. To switch for the current shell session only:
export LANG=en_US.UTF-8
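The same check can be scripted, for example at the top of a tool that prints non-ASCII text. The sketch below is illustrative only: is_utf8_locale is a hypothetical helper that inspects the locale name, and it follows the standard precedence in which LC_ALL overrides LC_CTYPE, which overrides LANG. Running locale charmap is a more direct alternative where that command is available.

```shell
# is_utf8_locale (hypothetical helper): succeed if a locale name such as
# "en_US.UTF-8" or "ja_JP.utf8" declares a UTF-8 codeset.
is_utf8_locale() {
    case $1 in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# The value that governs character handling: LC_ALL, else LC_CTYPE, else LANG.
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}

if is_utf8_locale "$effective"; then
    echo "Locale $effective uses UTF-8"
else
    echo "Locale $effective is not UTF-8; non-ASCII text may garble"
fi
```

Because LC_ALL takes precedence, exporting LANG alone has no visible effect while LC_ALL is set to something else.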
Common Problems and Fixes
Conclusion: Separate "is the file correct?" from "is the terminal correct?". Most issues are explained by an encoding mismatch or a locale setting.
| Symptom | Likely cause | Fix |
|---|---|---|
| Whole text file is garbled | Wrong encoding assumed | Check with file -i, convert with iconv |
| Japanese garbles only in the terminal | Locale is not UTF-8 | Check locale, set LANG=...UTF-8 |
| Error at the start of a script | UTF-8 with BOM | Remove the BOM and re-save |
| File names are garbled | Displayed in a different encoding | Match the locale, or convert with convmv |
Practical tip: When in doubt, run file -i first. It almost always tells you whether the problem is "on the file side" or "on the terminal side." Creating new files as UTF-8 without BOM causes the fewest accidents.
To try echo and wc -c hands-on in a browser-based virtual terminal, use the learning terminal and watch the byte counts yourself.