Unicode
Text
Text is a sequence of glyphs. Glyphs are individual marks that contribute to the meaning of what's written.
- A character can have many glyphs that look similar
- A character can consist of multiple glyphs: accents and combinations
- A glyph can represent different characters
Glyph ⇨ Characters ⇨ Code Points ⇨ Binary Encoding
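As a small illustration of the chain above, the sketch below writes the same visible character "é" in two ways, once as a single code point and once as a base letter plus a combining accent. The helper name code_points is made up for this example; unicodedata is part of the Python standard library.

```python
import unicodedata

# Two ways to write the same visible character.
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # two code points: "e" + COMBINING ACUTE ACCENT


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(precomposed))        # ['U+00E9']
print(code_points(combined))           # ['U+0065', 'U+0301']

# The binary encoding differs even though the rendered text looks the same.
print(precomposed.encode("utf-8"))     # b'\xc3\xa9'
print(combined.encode("utf-8"))        # b'e\xcc\x81'

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
```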
ASCII
- American Standard Code for Information Interchange
- 128 ASCII characters in total (including null)
- All ASCII characters fit into 1 byte (8 bits)
- The leading bit is 0
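The ASCII properties listed here can be checked directly. This is only a sketch; leading_bit is a hypothetical helper name introduced for the example.

```python
# Build all 128 ASCII characters, including null (code 0).
ascii_chars = [chr(i) for i in range(128)]


def leading_bit(byte):
    """Return the most significant bit of an 8-bit value (hypothetical helper)."""
    return (byte >> 7) & 1


encoded = [ch.encode("ascii") for ch in ascii_chars]
all_one_byte = all(len(b) == 1 for b in encoded)
all_leading_zero = all(leading_bit(b[0]) == 0 for b in encoded)

print(len(ascii_chars), all_one_byte, all_leading_zero)  # 128 True True
print(format(ord("m"), "08b"))                           # 01101101
```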
Unicode Code Points
- Each character is given a unique code point
- A code point is defined by an integer value. Ex: 109
  - Int: 109
  - Hex: 0x6d
  - Name: LATIN SMALL LETTER M
  - Convention: U+{Hex} = U+006D
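The same four views of a code point (integer, hex, name, U+ notation) can be reproduced with Python built-ins. A minimal sketch; u_notation is a hypothetical helper, while ord, hex, chr and unicodedata.name are standard.

```python
import unicodedata


def u_notation(cp):
    """Format an integer code point as U+XXXX, zero-padded to at least 4 hex digits."""
    return f"U+{cp:04X}"


ch = "m"
cp = ord(ch)                  # character -> integer code point

print(cp)                     # 109
print(hex(cp))                # 0x6d
print(unicodedata.name(ch))   # LATIN SMALL LETTER M
print(u_notation(cp))         # U+006D
print(chr(0x6D))              # m  (integer code point -> character)
```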
Organization
- The standard allows for 17 * 2^16 = 1,114,112 code points
- Each group of 2^16 = 65,536 code points is called a plane
- A code point can be read as U+DDSSSS
  - DD indicates the plane
  - SSSS indicates the point on the plane
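Splitting a code point into its plane and its position on that plane comes down to a shift and a mask. The sketch below assumes the 17-plane layout described above; split_code_point, PLANE_SIZE and PLANE_COUNT are names invented for the example.

```python
PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0x00 through 0x10


def split_code_point(cp):
    """Return (plane, point_on_plane) for an integer code point."""
    if not 0 <= cp < PLANE_COUNT * PLANE_SIZE:
        raise ValueError(f"code point out of range: {cp:#x}")
    return cp >> 16, cp & 0xFFFF


print(PLANE_COUNT * PLANE_SIZE)    # 1114112
print(split_code_point(0x006D))    # (0, 109)    plane 00, "m"
print(split_code_point(0x1F600))   # (1, 62976)  plane 01, an emoji
```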
Planes
- BMP 00 - Basic Multilingual Plane
- SMP 01 - Supplementary Multilingual Plane
UTF-8
- Has variable length: each character is stored as a sequence of 1 to 4 bytes (8 bits each)
- Thus a file size can't be used to guess the number of characters
- No endian-ness, because the encoding unit is a single byte, so there is no byte order to choose between big-endian and little-endian
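A quick way to see the variable length is to compare the number of characters in a string with the number of bytes in its UTF-8 encoding. The sample string is an arbitrary choice for this sketch.

```python
# Four characters: "m", "é" (U+00E9), "語" (U+8A9E), and an emoji (U+1F600).
text = "m\u00e9\u8a9e\U0001F600"

# Byte count of each character when encoded on its own.
byte_lengths = [len(ch.encode("utf-8")) for ch in text]

print(len(text))                   # 4   characters
print(byte_lengths)                # [1, 2, 3, 4]
print(len(text.encode("utf-8")))   # 10  bytes in total
```

The byte count alone (10) does not reveal that the text holds 4 characters; the bytes have to be walked through to count them.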
1-Byte Encoding
0XXX XXXX
- First bit will be 0
- This overlaps with ASCII: every ASCII character has the same 1-byte encoding in UTF-8
Multi-Byte Encoding
- First byte: starts with 1
  - The number of leading 1s shows how many bytes are required to store the data
  - The sequence of 1s ends with a 0
- Following byte: each byte that follows the first byte has the form 10XX XXXX
  - Starts with 10
  - Payload is contained in XX XXXX
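The bit patterns above are enough to classify any single byte. The following sketch uses two hypothetical helpers, sequence_length and is_continuation, and assumes sequences of at most 4 bytes.

```python
def sequence_length(first_byte):
    """Return how many bytes the sequence uses, judged from the first byte's leading bits."""
    if first_byte >> 7 == 0b0:        # 0XXX XXXX: 1-byte (ASCII) encoding
        return 1
    if first_byte >> 5 == 0b110:      # 110X XXXX: 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:     # 1110 XXXX: 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:    # 1111 0XXX: 4 bytes
        return 4
    raise ValueError("not a valid first byte (possibly a following byte)")


def is_continuation(byte):
    """Return True for a following byte of the form 10XX XXXX."""
    return byte >> 6 == 0b10


data = "\u00e9".encode("utf-8")           # "é" -> b'\xc3\xa9'
print([format(b, "08b") for b in data])   # ['11000011', '10101001']
print(sequence_length(data[0]))           # 2
print(is_continuation(data[1]))           # True
```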
UTF-8 Encoding ➡︎ Unicode Point
- Convert to binary
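- Determine the size from the leading 1s of the first byte
- Strip the marker bits (the leading 1s and 0 of the first byte, the leading 10 of each following byte) to get the payload
- Group the payload bits in fours from the right
- Convert to hex: U+{hex}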
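The decoding steps can be sketched by hand as follows. decode_utf8_char is a hypothetical helper written for illustration; it does not reject overlong forms or surrogates, so in practice bytes.decode("utf-8") is the safer choice.

```python
def decode_utf8_char(data):
    """Decode exactly one UTF-8 sequence (bytes) into its integer code point."""
    first = data[0]
    # Determine the size and strip the marker bits of the first byte.
    if first >> 7 == 0b0:
        size, payload = 1, first & 0b01111111
    elif first >> 5 == 0b110:
        size, payload = 2, first & 0b00011111
    elif first >> 4 == 0b1110:
        size, payload = 3, first & 0b00001111
    elif first >> 3 == 0b11110:
        size, payload = 4, first & 0b00000111
    else:
        raise ValueError("invalid first byte")
    if len(data) != size:
        raise ValueError("wrong number of bytes for this first byte")
    # Strip the leading 10 of each following byte and append its 6 payload bits.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid following byte")
        payload = (payload << 6) | (byte & 0b00111111)
    return payload


cp = decode_utf8_char(b"\xe2\x82\xac")   # 11100010 10000010 10101100
print(format(cp, "016b"))                # 0010000010101100
print(f"U+{cp:04X}")                     # U+20AC
```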
Unicode Point ➡︎ UTF-8 Encoding
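This direction reverses the decoding steps: choose the sequence length from the size of the code point, then distribute the payload bits over the first byte and the following bytes. A minimal sketch, assuming the standard range boundaries 0x80, 0x800 and 0x10000; encode_utf8_char is a hypothetical helper and does not reject surrogate code points.

```python
def encode_utf8_char(cp):
    """Encode one integer code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:       # up to 7 payload bits: 0XXX XXXX
        return bytes([cp])
    if cp < 0x800:      # up to 11 payload bits: 110X XXXX, 10XX XXXX
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:    # up to 16 payload bits: 1110 XXXX plus 2 following bytes
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # Up to 21 payload bits: 1111 0XXX plus 3 following bytes.
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


print(encode_utf8_char(0x6D))     # b'm'
print(encode_utf8_char(0x20AC))   # b'\xe2\x82\xac'
# Cross-check against Python's built-in encoder.
print(encode_utf8_char(0x1F600) == "\U0001F600".encode("utf-8"))  # True
```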
Other Encodings
UTF-X・UTF-16・UTF-32
- UTF-X groups the data in units of X bits
Endian-ness
- 0xFEFF: the BOM (Byte Order Mark) is placed at the start of the data to signal the byte order for UTF-X when X is not 8
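The effect of byte order is visible when the same character is encoded as UTF-16 in both orders. The sketch below relies on the BOM constants in Python's codecs module; detect_utf16_order is a hypothetical helper that only looks at UTF-16 BOMs and does not try to distinguish UTF-32.

```python
import codecs

text = "m"  # U+006D

# The same 16-bit unit 0x006D, stored in two different byte orders.
print(text.encode("utf-16-be"))   # b'\x00m'
print(text.encode("utf-16-le"))   # b'm\x00'

# The BOM is U+FEFF written in the byte order used by the data.
print(codecs.BOM_UTF16_BE)        # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)        # b'\xff\xfe'


def detect_utf16_order(data):
    """Return 'big' or 'little' from a leading UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None


print(detect_utf16_order(codecs.BOM_UTF16_LE + text.encode("utf-16-le")))  # little
print(detect_utf16_order(text.encode("utf-8")))                            # None
```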