What Is Character Encoding?
Computers store text as numbers. A character encoding defines how letters, digits, and symbols map to numeric values and bytes. Text displays correctly only when sender and receiver use the same encoding; a mismatch produces garbled characters.
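
To make the point about matching encodings concrete, here is a minimal sketch in Python. It encodes a string as UTF-8 and then decodes the same bytes twice: once as UTF-8 and once as Latin-1. Latin-1 is not discussed in this text; it is used here only as an assumed example of a "wrong" receiver encoding.

```python
# Encode once, then decode with a matching and a mismatched encoding.
text = "Kärenlampi"

data = text.encode("utf-8")       # sender writes UTF-8 bytes

same = data.decode("utf-8")       # receiver assumes UTF-8: text survives
wrong = data.decode("latin-1")    # receiver assumes Latin-1 (illustrative choice)

print(same)   # Kärenlampi
print(wrong)  # KÃ¤renlampi
```

The two bytes that encode ä are read as two separate Latin-1 characters, which is why one letter turns into two.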
ASCII
ASCII (American Standard Code for Information Interchange) is a 7-bit encoding with values from 0 to 127. It covers English letters, digits, punctuation, and control codes.
- Usually stored as 1 byte per character; only the lower 7 bits are defined, so the highest bit is 0.
- Great for basic English text and legacy systems.
- Cannot represent characters like ä, ö, å, €, or emoji.
| Character | ASCII Decimal | Binary (7-bit) | Hex |
|---|---|---|---|
| A | 65 | 1000001 | 0x41 |
| a | 97 | 1100001 | 0x61 |
| 0 | 48 | 0110000 | 0x30 |
| Space | 32 | 0100000 | 0x20 |
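
The rows of this table can be reproduced with a few built-in functions. The sketch below assumes Python 3.9 or later; the helper name `ascii_row` is made up for this example.

```python
def ascii_row(ch: str) -> tuple[int, str, str]:
    """Return (decimal, 7-bit binary, hex) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is outside the ASCII range 0-127")
    # "07b" pads the binary form to 7 digits, matching the table column.
    return code, format(code, "07b"), f"0x{code:02X}"


for ch in "Aa0 ":
    print(repr(ch), *ascii_row(ch))
```

Calling `ascii_row("ä")` raises `ValueError`, which mirrors the limitation listed above.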
UTF-8
UTF-8 is a variable-length encoding for Unicode. It can encode every Unicode code point, which covers nearly every writing system in use today, and it uses 1 to 4 bytes per code point.
- ASCII characters keep the same byte values in UTF-8.
- Common Latin characters with accents often use 2 bytes.
- Most emoji use 4 bytes; many other symbols, such as €, use 3 bytes.
- It is the standard encoding for the web.
| Character | Unicode Code Point | UTF-8 Bytes (Hex) | Byte Count |
|---|---|---|---|
| A | U+0041 | 41 | 1 |
| ä | U+00E4 | C3 A4 | 2 |
| € | U+20AC | E2 82 AC | 3 |
| 😀 | U+1F600 | F0 9F 98 80 | 4 |
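
The byte values in this table can be checked with Python's built-in `str.encode`. This is a small sketch, assuming Python 3.8 or later (for the separator argument of `bytes.hex`); `utf8_hex` is a hypothetical helper name.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


for ch in ["A", "ä", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_hex(ch), len(encoded))
```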
UTF-8 on the Binary Level
UTF-8 uses specific high bits to show whether a byte starts a new character or continues one. The first byte tells how many bytes the character uses, and every following byte uses the continuation pattern.
| Byte Pattern | Meaning |
|---|---|
| 0xxxxxxx | Single-byte character (ASCII range). |
| 110xxxxx 10xxxxxx | 2-byte character. |
| 1110xxxx 10xxxxxx 10xxxxxx | 3-byte character. |
| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4-byte character. |
| 10xxxxxx | Continuation byte (never valid as a first byte). |
The prefix 10 marks a byte that continues a character, and the leading byte prefix (0, 110, 1110, or 11110) tells how many total bytes belong to that character.
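
As an illustration of these patterns, the following sketch encodes a single code point by hand using bit shifts and masks. It is not how you would normally encode text (use `str.encode`); the function name `utf8_encode` is hypothetical. The range limits 0x80, 0x800, and 0x10000 are not stated in the table but follow from counting the `x` payload bits in each pattern (7, 11, and 16 bits).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled result with Python's own encoder.
for ch in ["A", "ä", "€", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Each continuation byte carries 6 payload bits (`& 0x3F`) under the `10` prefix (`0x80 |`), and the lead byte carries whatever bits remain.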
Example Breakdown
- A = 01000001 (starts with 0, so 1 byte).
- ä = 11000011 10100100 (first byte starts with 110, next byte starts with 10).
- € = 11100010 10000010 10101100 (first byte starts with 1110, then two continuation bytes).
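
The same breakdown can be generated for any string. The sketch below prints each byte in binary and labels its role based only on its leading bits; `describe_bytes` is a made-up name. Note the simplification: it checks prefixes only, so a few byte values that match a lead pattern but never occur in valid UTF-8 (such as 0xC0 and 0xC1) are not rejected.

```python
def describe_bytes(data: bytes) -> list[tuple[str, str]]:
    """Label each byte by its UTF-8 prefix (prefix check only, no validation)."""
    rows = []
    for b in data:
        if b < 0x80:      # 0xxxxxxx
            role = "single byte (ASCII)"
        elif b < 0xC0:    # 10xxxxxx
            role = "continuation"
        elif b < 0xE0:    # 110xxxxx
            role = "lead of 2-byte sequence"
        elif b < 0xF0:    # 1110xxxx
            role = "lead of 3-byte sequence"
        elif b < 0xF8:    # 11110xxx
            role = "lead of 4-byte sequence"
        else:
            role = "invalid in UTF-8"
        rows.append((format(b, "08b"), role))
    return rows


for bits, role in describe_bytes("Aä€".encode("utf-8")):
    print(bits, role)
```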
ASCII vs UTF-8 in Practice
For plain English text, ASCII and UTF-8 bytes are identical. Differences appear when text includes accents, non-Latin scripts, symbols, or emoji.
| Text | ASCII Possible? | UTF-8 Byte Length |
|---|---|---|
| Hello | Yes | 5 |
| Kärenlampi | No (contains ä) | 11 |
| Price: 10 € | No (contains €) | 13 |
| Hi 😀 | No (contains emoji) | 7 |
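
The two columns of this table can be computed directly. This sketch assumes Python 3.7 or later (for `str.isascii`); the helper name `compare` is hypothetical.

```python
def compare(text: str) -> tuple[bool, int]:
    """Return (representable in ASCII?, length in bytes when encoded as UTF-8)."""
    return text.isascii(), len(text.encode("utf-8"))


for sample in ["Hello", "Kärenlampi", "Price: 10 €", "Hi 😀"]:
    ascii_ok, n_bytes = compare(sample)
    print(f"{sample!r}: ASCII possible={ascii_ok}, UTF-8 bytes={n_bytes}")

# For pure ASCII text the two encodings produce identical bytes.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```

The byte length differs from `len(sample)`, which counts code points: "Kärenlampi" has 10 code points but 11 UTF-8 bytes.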
Interactive Encoder
Enter text below to see per-character encoding details. Characters outside ASCII are shown as not representable in ASCII, while UTF-8 is always provided.
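
Where the interactive encoder is not available, a rough offline equivalent can be sketched in Python. This is an approximation under assumptions, not the page's actual implementation: the function name `encode_table` and the dictionary keys are invented for the example, and Python iterates over code points, so an emoji built from several code points would appear as several rows.

```python
def encode_table(text: str) -> list[dict]:
    """Build one row per code point with its ASCII and UTF-8 byte values."""
    rows = []
    for index, ch in enumerate(text, start=1):
        cp = ord(ch)
        rows.append({
            "#": index,
            "char": ch,
            "code_point": f"U+{cp:04X}",
            # None marks a character that ASCII cannot represent.
            "ascii": f"{cp:02X}" if cp < 128 else None,
            "utf8": ch.encode("utf-8").hex(" ").upper(),
        })
    return rows


for row in encode_table("Hi €"):
    ascii_col = row["ascii"] if row["ascii"] is not None else "not representable"
    print(row["#"], repr(row["char"]), row["code_point"], ascii_col, row["utf8"])
```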