Character encoding is the mechanism that maps characters (letters, digits, symbols) to numeric codes so they can be stored in memory or transmitted over networks.
Computers ultimately operate on binary data, so an encoding defines how human-readable text becomes machine-readable bytes, and how those bytes are converted back into characters.
Encodings specify:
- The character set (which characters are supported)
- The mapping from characters to numbers (code points)
- The byte representation used when storing/transmitting text
If bytes are decoded with a different encoding than the one used to write them, text appears corrupted (mojibake), especially when it mixes languages.
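As a minimal sketch of the round trip described above, the following Python snippet encodes a string to bytes and then decodes it twice: once with the matching encoding and once with a mismatched one. The sample string and variable names are illustrative assumptions.

```python
# Encode text to bytes, then decode with the right and the wrong encoding.
text = "café"

data = text.encode("utf-8")        # "é" becomes the two bytes C3 A9
roundtrip = data.decode("utf-8")   # same encoding on both sides: original text
garbled = data.decode("latin-1")   # wrong encoding: every byte becomes one character

print(data.hex(" "))               # 63 61 66 c3 a9
print(roundtrip == text)           # True
print(len(text), len(garbled))     # 4 5  (the "é" turned into two characters)
```

The garbled result is the classic mojibake pattern: the two UTF-8 bytes of "é" are each read as a separate Latin-1 character.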
Common Character Encodings and Their Usage
1. ASCII (American Standard Code for Information Interchange)
Characteristics:
- Uses 7 bits (128 possible characters)
- Includes English letters, digits, punctuation, control characters
- One of the earliest widely adopted encodings (first standardized in 1963)
Usage scenario:
- Early computing and networking protocols
- Still present in modern systems: ASCII is the first 128 code points of Unicode, and UTF-8 encodes them as the same single bytes
- Suitable for English-only environments
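To illustrate the 7-bit limit, here is a short Python sketch. The helper name `is_ascii_text` is a hypothetical choice for this example; it relies only on the built-in `ascii` codec.

```python
def is_ascii_text(s: str) -> bool:
    """Return True if every character fits in 7-bit ASCII (code points 0-127)."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(ord("A"))                    # 65
print(is_ascii_text("Hello"))      # True
print(is_ascii_text("na\u00efve")) # False: "ï" is code point 239, outside ASCII

# ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)                  # True
```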
2. Extended ASCII / ISO-8859 Series
Examples: ISO-8859-1 (Latin-1), ISO-8859-15, Windows-1252
Characteristics:
- 8-bit encodings (up to 256 characters each)
- Keep ASCII in the lower 128 positions and add accented European characters in the upper 128
- Different regions use different variants, so the same byte can mean different characters
Usage scenario:
- Older Western European systems and documents
- Legacy software
- Email and early Web content before Unicode adoption
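Because these variants share the same byte range, a single byte can decode to different characters depending on which one is assumed. The sketch below uses Python's built-in codecs for ISO-8859-1, ISO-8859-15 and Windows-1252; the variable names are illustrative.

```python
# The same byte decodes differently under different 8-bit encodings.
raw_a4 = b"\xa4"
raw_80 = b"\x80"

a4_latin1 = raw_a4.decode("iso8859_1")    # U+00A4, generic currency sign
a4_latin9 = raw_a4.decode("iso8859_15")   # U+20AC, euro sign
a4_cp1252 = raw_a4.decode("cp1252")       # U+00A4, same as ISO-8859-1 here

x80_cp1252 = raw_80.decode("cp1252")      # U+20AC, euro sign
x80_latin1 = raw_80.decode("iso8859_1")   # U+0080, an invisible control character

for label, ch in [("A4/latin-1", a4_latin1), ("A4/latin-9", a4_latin9),
                  ("80/cp1252", x80_cp1252), ("80/latin-1", x80_latin1)]:
    print(f"{label}: U+{ord(ch):04X}")
```

This is why a euro sign written on a Windows-1252 system can vanish or change when the file is read as ISO-8859-1.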
3. GB2312, GBK, GB18030 (Chinese Encodings)
GB2312
- Early standard for simplified Chinese
- 6,763 Chinese characters (plus several hundred symbols and letters)
GBK
- Extension of GB2312
- Supports simplified + traditional Chinese
GB18030
- The mandatory national encoding standard in China
- Covers all Unicode characters, using 1, 2, or 4 bytes per character
Usage scenarios:
- Windows systems in China historically used GBK
- Modern Chinese applications are required to support GB18030
- Legacy documents and websites in China may still use these encodings
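The growing coverage of the three standards can be checked with Python's built-in `gb2312`, `gbk` and `gb18030` codecs. This is a small sketch; `can_encode` is a hypothetical helper and the sample characters are arbitrary.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "\u4e2d"      # 中, common simplified/traditional character
traditional = "\u9f8d"     # 龍, traditional form of 龙
emoji = "\U0001F600"       # outside the Basic Multilingual Plane

for enc in ("gb2312", "gbk", "gb18030"):
    print(enc, can_encode(simplified, enc),
          can_encode(traditional, enc), can_encode(emoji, enc))
# gb2312 True False False
# gbk True True False
# gb18030 True True True

# The three encodings agree on the bytes of characters they share.
print(simplified.encode("gb2312").hex(" "))   # d6 d0
print(len(emoji.encode("gb18030")))           # 4
```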
4. Shift-JIS, EUC-JP (Japanese Encodings)
Characteristics:
- Designed for Japanese Kanji, Hiragana, Katakana
- Variable-length encodings
Usage scenarios:
- Japanese Windows systems (Shift-JIS)
- Older Unix/Linux in Japan (EUC-JP)
- Legacy Japanese websites and applications
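The variable-length behaviour, and the fact that the two encodings are not interchangeable, can be seen with Python's built-in `shift_jis` and `euc_jp` codecs. The sample strings below are arbitrary choices for this sketch.

```python
text = "\u65e5\u672c\u8a9e"          # 日本語
sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

print(len(sjis), len(euc))           # 6 6  (two bytes per kanji in both)
print(sjis == euc)                   # False: same text, different bytes

# Variable length: ASCII letters take one byte; half-width katakana
# takes one byte in Shift-JIS but two in EUC-JP.
halfwidth_a = "\uff71"               # ｱ
print(len("A".encode("shift_jis")), len("A".encode("euc_jp")))              # 1 1
print(len(halfwidth_a.encode("shift_jis")), len(halfwidth_a.encode("euc_jp")))  # 1 2

# Decoding with the wrong Japanese codec does not recover the text.
wrong = sjis.decode("euc_jp", errors="replace")
print(wrong == text)                 # False
```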
5. Unicode
Unicode is a character set that aims to represent all characters of all languages.
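Unicode assigns each character an abstract number (a code point); how those code points are stored as bytes is defined by the UTF encodings (UTF-8, UTF-16, UTF-32) covered in the following sections.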
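The distinction between a code point and its byte representation can be sketched in Python, where `ord` and `chr` convert between characters and code points without involving any byte encoding. The helper `code_point_label` is a hypothetical name used only here.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))            # U+0041
print(code_point_label("\u20ac"))       # U+20AC  (euro sign)
print(code_point_label("\U0001F600"))   # U+1F600 (an emoji)

# chr() goes the other way: from code point to character.
print(chr(0x4E2D) == "\u4e2d")          # True

# One code point, three different byte representations.
euro = "\u20ac"
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, euro.encode(enc).hex(" "))
# utf-8 e2 82 ac
# utf-16-be 20 ac
# utf-32-be 00 00 20 ac
```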
6. UTF-8 (Unicode Transformation Format – 8 bit)
Characteristics:
- Variable length (1–4 bytes per code point)
- Compatible with ASCII (the first 128 characters use the same single bytes)
- Most popular encoding today
- Can represent every Unicode character; most compact for ASCII-heavy text, while most CJK characters take 3 bytes
Usage scenarios:
- The modern Web (HTML, JSON, XML)
- Source code and string handling in programming languages (Python 3, Go, Rust, etc.)
- Linux systems default to UTF-8
- Cross-platform applications
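The 1 to 4 byte range can be observed directly. The following sketch encodes one sample character from each length class; the characters chosen are arbitrary examples.

```python
samples = {
    "A": "A",                 # ASCII letter
    "e-acute": "\u00e9",      # é, Latin letter with accent
    "zhong": "\u4e2d",        # 中, CJK ideograph
    "emoji": "\U0001F600",    # character outside the Basic Multilingual Plane
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)   # {'A': 1, 'e-acute': 2, 'zhong': 3, 'emoji': 4}

# ASCII compatibility: ASCII-only text is byte-for-byte identical in UTF-8.
print("GET /index.html".encode("utf-8") == b"GET /index.html")   # True
```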
7. UTF-16
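Characteristics:
- Uses 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF
- More compact than UTF-8 for most Asian scripts (2 bytes instead of 3), though UTF-8 is now preferred
- Byte order matters: a BOM (Byte Order Mark) is often used to signal little-endian vs big-endian
Usage scenarios:
- Windows internal encoding (UTF-16 LE)
- Java and JavaScript string representation
- Some large databases and enterprise systems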
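The 2-or-4-byte rule and the role of byte order can be sketched as follows. `surrogate_pair` is a hypothetical helper that applies the standard surrogate arithmetic and is checked against Python's built-in codec.

```python
import codecs


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                       # D83D DE00
print("\U0001F600".encode("utf-16-be").hex(" "))     # d8 3d de 00

# Byte order changes the bytes of even a simple character.
print("A".encode("utf-16-le").hex(" "))              # 41 00
print("A".encode("utf-16-be").hex(" "))              # 00 41

# The plain "utf-16" codec prepends a BOM in the machine's native byte order.
with_bom = "A".encode("utf-16")
print(with_bom.startswith(codecs.BOM_UTF16))         # True
```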
8. UTF-32
Characteristics:
- Fixed length: 4 bytes per code point
- Very simple mapping (1 code point = one 32-bit integer)
Usage scenarios:
- Internal text processing
- Academic or specialized applications where constant-time indexing by code point matters
- Rarely used for storage or transmission due to size
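Fixed width is what makes indexing simple: the n-th code point always starts at byte offset 4n. The sketch below assumes little-endian UTF-32 without a BOM; `code_point_at` is a hypothetical helper.

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` in UTF-32-LE encoded bytes."""
    chunk = data[4 * index : 4 * index + 4]
    if len(chunk) != 4:
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(chunk, "little"))


text = "a\u4e2d\U0001F600"           # 1-, 3- and 4-byte characters in UTF-8
data = text.encode("utf-32-le")

print(len(data))                      # 12: always 4 bytes per code point
print(code_point_at(data, 2) == "\U0001F600")   # True, found by offset alone

# The cost is size, especially for ASCII-heavy text.
ascii_text = "Hello, world"
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-32-le")))  # 12 48
```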
Choosing an Encoding: When to Use What?
| Encoding | Strengths | Use Case |
|---|---|---|
| ASCII | Simple, universal subset | Basic English text, protocols |
| ISO-8859 / Windows-1252 | Compact, legacy support | Western language legacy data |
| GBK / GB18030 | Good for Chinese text | Chinese OS, legacy files |
| Shift-JIS / EUC-JP | Japanese support | Older Japanese systems |
| UTF-8 | Compact, universal, web standard | Modern software, international apps, web |
| UTF-16 | Good for internal operations | Windows internals, Java/JS strings |
| UTF-32 | Simplest mapping | Low-level internal processing |
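The trade-offs in the table can be made concrete by measuring how many bytes a few sample strings need in each encoding, and which encodings cannot represent them at all. This is an illustrative sketch: `encoded_size` and the sample strings are assumptions, and real text will give different ratios.

```python
from typing import Optional


def encoded_size(text: str, encoding: str) -> Optional[int]:
    """Return the encoded length in bytes, or None if the text is not representable."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


samples = {
    "English": "Hello",
    "French": "d\u00e9j\u00e0 vu",                      # déjà vu
    "Chinese": "\u4f60\u597d\u4e16\u754c",              # 你好世界
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",       # こんにちは
}
encodings = ["ascii", "cp1252", "gb18030", "shift_jis",
             "utf-8", "utf-16-le", "utf-32-le"]

for name, text in samples.items():
    sizes = {enc: encoded_size(text, enc) for enc in encodings}
    print(name, sizes)
```

Legacy encodings are compact for the language they target but return `None` for others, while the UTF encodings handle every sample at differing sizes.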