Unicode
You never have to think about Unicode until you do.
I recently read this blog post: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!). It’s a nod to a similarly titled 2003 post by Joel Spolsky. And if you are a Go programmer, you may have encountered this blog post, which explains how Go handles strings (it also references the Spolsky post).
I highly recommend all three - the third even if you aren’t a Go programmer.
The premise is that we (software engineers) often take strings for granted, but there are fundamentals to text encoding that are worth grokking. The majority of the time you don’t need to deal with these details. But, like many topics, eventually the edge cases come and demand a deeper understanding.
Key Take-Aways
Unicode
Unicode is a standard that maps all the characters in the world to “code points”.
Code points are just numbers, usually written as hexadecimal digits with a U+
prefix, e.g.:
Name | Character | Code Point | Decimal Value |
---|---|---|---|
Latin Capital Letter A | A | U+0041 | 65 |
Rightwards arrow | → | U+2192 | 8594 |
Dog face | 🐶 | U+1F436 | 128054 |
Unicode is an evolving standard. Its codespace covers U+0000 to U+10FFFF, about 1.1M code points, of which currently only ~15% are defined. The Unicode Consortium periodically releases new versions which include new character mappings.
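These mappings are easy to poke at from JavaScript (which the examples later in this post use as well); you can hop between characters and code points directly:

```js
// Character to code point...
'🐶'.codePointAt(0); // => 128054
'🐶'.codePointAt(0).toString(16); // => '1f436'

// ...and code point to character
String.fromCodePoint(0x41); // => 'A'
String.fromCodePoint(0x2192); // => '→'
String.fromCodePoint(0x1f436); // => '🐶'
```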
Encodings
Code points are somewhat theoretical: they just map characters to numbers. But there are different ways for computers to store these values in memory.
Character Encodings are the concrete implementations for storing strings in memory.
UTF-8 is the most popular encoding (98% of web pages according to the post). UTF-8 is a variable-length encoding, so characters are represented by sequences of 1 to 4 bytes depending on their value.
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | - | - | - |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | - | - |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | - |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The first 128 code points (U+0000 - U+007F) are stored in a single byte and match ASCII exactly, making UTF-8 backward compatible with ASCII. Higher code points use 2, 3, or 4 bytes according to the table above. The byte prefixes indicate whether a byte is the leading byte of a 1-, 2-, 3-, or 4-byte sequence, or a continuation byte (10xxxxxx).
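You can see these prefixes directly with `TextEncoder` (covered more below). A quick sketch, dumping each character from the earlier table to its UTF-8 bytes in binary:

```js
// Encode a character to UTF-8 and print each byte in binary
const toUtf8Binary = (char) =>
  [...new TextEncoder().encode(char)]
    .map((byte) => byte.toString(2).padStart(8, '0'))
    .join(' ');

toUtf8Binary('A');  // => '01000001' (1 byte, plain ASCII)
toUtf8Binary('→');  // => '11100010 10000110 10010010' (3 bytes)
toUtf8Binary('🐶'); // => '11110000 10011111 10010000 10110110' (4 bytes)
```

Note the leading-byte prefixes (0, 1110, 11110) and the 10 prefix on every continuation byte.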
UTF-16 is another encoding. Despite being less popular than UTF-8, UTF-16 is important for web developers because it’s what JavaScript uses.
UTF-16 encodes strings as sequences of 16-bit “code units”. Because some code points don’t fit in 16 bits, UTF-16 uses a technique called “surrogate pairs”, which, as the name suggests, are pairs of 16-bit code units that together represent a single code point. Thus UTF-8 and UTF-16 are both variable-length, but the implementations differ.
To illustrate with some JavaScript:
```js
const message = 'hi 😍';

message.length; // => 5
message.split(''); // => [ 'h', 'i', ' ', '\ud83d', '\ude0d' ]

// iterate over UTF-16 code units
for (let i = 0; i < message.length; i++) {
  console.log(`${message[i]} ${message.charCodeAt(i).toString(16)}`);
}
// Output:
// h 68
// i 69
//   20
// � d83d
// � de0d

// iterate over Unicode code points
for (const char of message) {
  console.log(`${char} ${char.codePointAt(0).toString(16)}`);
}
// Output:
// h 68
// i 69
//   20
// 😍 1f60d
```
This shows a few things:
- `h`, `i`, and `[space]` correspond to single UTF-16 code units: `\u0068`, `\u0069`, and `\u0020` respectively
- the 😍 emoji is comprised of a surrogate pair, `\ud83d\ude0d`; each value alone is invalid and prints as `�`, but together they are interpreted correctly as the emoji (the pair math is sketched after this list)
- the length of a string in JS corresponds to the number of UTF-16 code units, NOT Unicode code points, and NOT bytes
- when you `.split('')` a string, or access its indexes, it operates on UTF-16 code units
- on the other hand, when you iterate over a string with `for...of`, or anything else that uses the iterable protocol, each loop yields a full Unicode code point, which may span multiple UTF-16 code units
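That pair isn’t arbitrary. The encoding is simple arithmetic: subtract 0x10000 from the code point, then add the top 10 of the remaining 20 bits to 0xD800 (the high surrogate) and the bottom 10 bits to 0xDC00 (the low surrogate). A minimal sketch:

```js
// Encode a supplementary code point (above U+FFFF) as a UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20 bits remain
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x1f60d).map((unit) => unit.toString(16)); // => [ 'd83d', 'de0d' ]
```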
We can also decode and encode UTF-8 in JS:
```js
const bytes = new Uint8Array([
  0x68, // h
  0x69, // i
  0x20, // [space]
  0xf0, 0x9f, 0x98, 0x8d, // 😍
]);

const string = new TextDecoder('utf-8').decode(bytes); // => "hi 😍"
string.length; // => 5

console.log(new TextEncoder().encode('😍')); // => Uint8Array(4) [ 240, 159, 152, 141 ]
```
Note that the emoji is represented by four bytes in UTF-8 rather than by a surrogate pair as in UTF-16, but in both encodings it occupies 32 bits. The ASCII characters have the same numeric values in both encodings, but UTF-16 spends 16 bits on each where UTF-8 spends 8, making UTF-8 more space-efficient for text like this.
Character | Code Point | UTF-8 Hex | UTF-8 Binary | UTF-16 Hex | UTF-16 Binary |
---|---|---|---|---|---|
h | U+0068 | 68 | 01101000 | 00 68 | 00000000 01101000 |
i | U+0069 | 69 | 01101001 | 00 69 | 00000000 01101001 |
[space] | U+0020 | 20 | 00100000 | 00 20 | 00000000 00100000 |
😍 | U+1F60D | F0 9F 98 8D | 11110000 10011111 10011000 10001101 | D8 3D DE 0D | 11011000 00111101 11011110 00001101 |
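The UTF-8 column of that table can be derived by hand from the prefix rules shown earlier. Here’s a rough sketch of an encoder for a single code point (for illustration only; in practice you’d reach for `TextEncoder`):

```js
// Hand-rolled UTF-8 encoding of a single code point, following the prefix table
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12), // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx
      0x80 | (codePoint & 0x3f), // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18), // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f), // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx
    0x80 | (codePoint & 0x3f), // 10xxxxxx
  ];
}

utf8Bytes(0x1f60d).map((b) => b.toString(16)); // => [ 'f0', '9f', '98', '8d' ]
```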
Extended Grapheme Clusters
There is one more layer to this onion, and that is the “grapheme cluster”. Grapheme clusters are certain sequences of Unicode code points that should be displayed as one visual unit.
Common examples are modified emoji, like thumbs up with a skin tone: 👍🏽.
This symbol is actually comprised of two code points: Thumbs Up (U+1F44D) and a skin tone modifier (U+1F3FD). In UTF-16, each of those code points is encoded as a surrogate pair. In UTF-8, each would be a 4-byte sequence.
```js
const grapheme = '👍🏽';

grapheme.length; // => 4
[...grapheme]; // => [ '👍', '🏽' ]

for (const char of grapheme) {
  console.log(`${char} ${char.codePointAt(0).toString(16)}`);
}
// Output:
// 👍 1f44d
// 🏽 1f3fd
```
In Summary
As a programmer, knowing how encodings work is useful when you are slicing, dicing, counting, splitting, or in any way manipulating or parsing strings. A mental model of Unicode and character encoding is a prerequisite to understanding these operations.
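As a concrete example of where this bites, a naive slice can cut a surrogate pair in half:

```js
const message = 'hi 😍';

// .slice() counts UTF-16 code units, so this strands a lone high surrogate
message.slice(0, 4); // => 'hi \ud83d' (renders as 'hi �')

// spreading first slices whole code points instead
[...message].slice(0, 4).join(''); // => 'hi 😍'
```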
As shown in the JS examples above, some functions operate on UTF-16 code units (`String.length`, index access), while others operate on Unicode code points (iterables such as `for...of` loops and `...` spread).
In other languages string methods may operate on bytes.
Handling grapheme clusters is a layer above the encoding, and often languages don’t have a built-in way to work with them. The JavaScript MDN docs simply concede:
> Iterating through grapheme clusters will require some custom code.

(MDN)
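That said, modern JavaScript engines do ship `Intl.Segmenter`, which can iterate grapheme clusters without hand-rolled code:

```js
// Split a string into grapheme clusters with Intl.Segmenter
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

[...segmenter.segment('hi 👍🏽')].map((s) => s.segment);
// => [ 'h', 'i', ' ', '👍🏽' ]
```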