Unicode

Text

Text is a sequence of glyphs. Glyphs are individual marks that contribute to the meaning of what's written.

A character can have many glyphs that look similar
A character can consist of multiple glyphs: accents and combinations
A glyph can represent different characters
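
A rough illustration of the accent case, assuming Python 3 (where `len` counts code points, not glyphs): the same visible "é" can be stored as one precomposed code point or as a base letter plus a combining accent. The variable names are made up for this sketch.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both normally render as the same character, but the stored data differs.
same_before = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# NFC normalization composes the pair into the single code point.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed
```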

Glyph ⇨ Characters ⇨ Code Points ⇨ Binary Encoding

ASCII

American Standard Code for Information Interchange
In total 128 ASCII characters (including null)
All ASCII characters fit in 7 bits, so each one is stored in 1 byte (8 bits)
The leading bit is 0
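
A minimal sketch of the leading-bit rule: a byte string is pure ASCII when no byte has its most significant bit set. The helper name `is_ascii` is only illustrative.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte has its leading (most significant) bit clear."""
    return all((byte & 0x80) == 0 for byte in data)
```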

Unicode Code Points

Each character is given a unique code point
Code Point is defined by an integer value. Ex: 109
- Int: 109
- Hex: 0x6d
- Name: LATIN SMALL LETTER M
- Convention: U+{Hex}, zero-padded to at least 4 digits: U+006D
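
The four views of a code point listed above can be reproduced with Python's standard library. This is a small sketch; `describe` is a hypothetical helper, not part of any library.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return the integer, hex, name, and U+ notation for one character."""
    cp = ord(ch)  # integer value of the code point
    return {
        "int": cp,
        "hex": hex(cp),
        "name": unicodedata.name(ch),
        "notation": f"U+{cp:04X}",  # at least 4 hex digits, uppercase
    }
```

For example, `describe("m")` gives the values from the list above.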

Organization

Standard allows for 17 * 2^16 = 1,114,112 code points (U+0000 to U+10FFFF)
Each group of 2^16 = 65,536 is called a plane
U+DDSSSS
- DD indicates the plane (hex 00 to 10; for plane 00 the DD part is usually omitted, as in U+006D)
- SSSS indicates the point on the plane
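
Since a plane holds 2^16 points, the plane and the position within it can be split off with a shift and a mask. A sketch, with `split_code_point` as an illustrative name:

```python
def split_code_point(cp: int) -> tuple[int, int]:
    """Split a code point into (plane, position within the plane)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    plane = cp >> 16        # the DD part
    position = cp & 0xFFFF  # the SSSS part
    return plane, position
```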

Planes

BMP 00 - Basic Multilingual Plane
SMP 01 - Supplementary Multilingual Plane

UTF-8

Variable length: each code point is encoded as a sequence of 1 to 4 bytes (8 bits each)
Thus a file's size can't be used to determine the number of characters
No endian-ness, because the code unit is a single byte and there is no byte order within one byte
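
A quick check of the size point, assuming Python 3 strings: the number of code points and the number of UTF-8 bytes differ as soon as non-ASCII characters appear.

```python
text = "na\u00efve \U0001F600"  # "naïve" plus a space and an emoji

char_count = len(text)                   # counts code points
byte_count = len(text.encode("utf-8"))   # counts encoded bytes
```

Here ï takes 2 bytes and the emoji takes 4, while the ASCII characters take 1 byte each.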

1-Byte Encoding

0XXX XXXX
First bit will be 0
This overlaps with ASCII characters

Multi-Byte Encoding

The first byte starts with 11
The number of leading 1s shows how many bytes the whole sequence uses (2 to 4)
The run of 1s ends with a 0
Following bytes (continuation bytes): each byte that follows the first byte
- 10XX XXXX
- Start with 10
- Payload is contained in XX XXXX
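
The prefix rules can be sketched as two small checks on a single byte. The function names are illustrative only.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence uses, judged by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0XXX XXXX: single byte, same as ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110X XXXX
    if first_byte >> 4 == 0b1110:
        return 3  # 1110 XXXX
    if first_byte >> 3 == 0b11110:
        return 4  # 1111 0XXX
    raise ValueError("not a valid first byte (possibly a following byte)")

def is_following_byte(byte: int) -> bool:
    """Return True for bytes of the form 10XX XXXX."""
    return byte >> 6 == 0b10
```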

UTF-8 Encoding ➡︎ Unicode Point

Convert to binary
Determine size
Strip the length prefix from the first byte and the 10 marker from each following byte to get the payload bits
Group the payload bits in sets of 4 from the right
Convert to hex
U+{hex}
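
The steps above can be sketched for one encoded character as follows. This is a simplified illustration, not a full decoder: it does not reject overlong forms or surrogate code points, and `decode_one` is a made-up name.

```python
def decode_one(data: bytes) -> str:
    """Decode a single UTF-8 sequence and return it in U+ notation."""
    first = data[0]
    # Determine the size and keep only the payload bits of the first byte.
    if first < 0x80:
        length, payload = 1, first
    elif first >> 5 == 0b110:
        length, payload = 2, first & 0b0001_1111
    elif first >> 4 == 0b1110:
        length, payload = 3, first & 0b0000_1111
    elif first >> 3 == 0b11110:
        length, payload = 4, first & 0b0000_0111
    else:
        raise ValueError("invalid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this sequence")
    # Append the 6 payload bits of each following byte (10XX XXXX).
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid following byte")
        payload = (payload << 6) | (byte & 0b0011_1111)
    return f"U+{payload:04X}"
```

For example, the bytes E2 82 AC carry the payloads 0010, 000010, and 101100, which join to 0010 0000 1010 1100 = 0x20AC.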

Unicode Point ➡︎ UTF-8 Encoding
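
One possible sketch of this direction, running the byte layouts in reverse: pick the sequence length from the size of the code point, then spread its bits over the first byte and the 10XX XXXX following bytes. The name `encode_code_point` is illustrative, and surrogate code points are not rejected here.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 bytes."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # up to 7 bits: 0XXX XXXX
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110X XXXX + 1 following byte
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110 XXXX + 2 following bytes
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # up to 21 bits: 1111 0XXX + 3 following bytes
        return bytes([0b1111_0000 | (cp >> 18),
                      0b1000_0000 | ((cp >> 12) & 0x3F),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    raise ValueError("outside the Unicode code point range")
```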

Other Encodings

UTF-X・UTF-16・UTF-32

Encodes in units of X bits (UTF-16 uses 16-bit units, UTF-32 uses 32-bit units)
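
To compare the unit sizes, the sketch below measures how many bytes one character takes in each encoding, using Python's built-in codecs. `encoded_sizes` is a hypothetical helper; the "-be" codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}
```

Characters outside the BMP need two 16-bit units in UTF-16, so UTF-16 is also a variable-length encoding.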

Endian-ness

0xFEFF: BOM (Byte Order Mark), placed at the start of the data to signal the byte order for UTF-X when X is not 8
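
A small sketch of how byte order shows up in practice, using Python's built-in codecs: the same character has its bytes swapped between the big-endian and little-endian forms, and the plain "utf-16" codec prepends a BOM so a reader can tell which order was used.

```python
import codecs

text = "m"  # U+006D

big = text.encode("utf-16-be")     # most significant byte first, no BOM
little = text.encode("utf-16-le")  # least significant byte first, no BOM
with_bom = text.encode("utf-16")   # BOM first, then the platform's byte order

# The BOM is U+FEFF written in the chosen byte order.
bom = with_bom[:2]
round_trip = with_bom.decode("utf-16")  # the decoder reads the BOM and strips it
```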