Unicode
Text
Text is a sequence of glyphs. Glyphs are individual marks that contribute to the meaning of what's written.
- A character can have many glyphs that look similar
- A character can consist of multiple glyphs: accents and combinations
- A glyph can represent different characters
Glyph ⇨ Characters ⇨ Code Points ⇨ Binary Encoding
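The point that one character can consist of several parts (a base letter plus an accent) can be seen at the code point level. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same character, but the code point sequences differ.
print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# NFC normalization composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)           # True
```

Comparing text that may contain combining marks is usually safer after normalizing both sides to the same form.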
ASCII
- American Standard Code for Information Interchange
- 128 ASCII characters in total (including null)
- All ASCII characters fit into 1 byte (8 bits)
- The leading bit is 0
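As a quick check of the ASCII properties listed above, this sketch prints a few characters with their integer codes and 8-bit patterns, then confirms that the leading bit is 0 for all 128 values. The sample characters are arbitrary.

```python
# Every ASCII code is below 128, so the leading bit of the byte is always 0.
for ch in ["\0", "A", "m", "~"]:
    code = ord(ch)
    bits = format(code, "08b")   # zero-padded 8-bit binary string
    print(repr(ch), code, bits)

# Check the leading bit for the whole ASCII range.
all_leading_zero = all(format(code, "08b")[0] == "0" for code in range(128))
print(all_leading_zero)  # True
```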
Unicode Code Points
- Each character is given a unique code point
- A code point is defined by an integer value. Ex: 109
  - Int: 109
  - Hex: 0x6d
  - Name: LATIN SMALL LETTER M
  - Convention: U+{Hex} = U+006D
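The four views of the code point for "m" can be reproduced with Python built-ins and the standard `unicodedata` module. This is a small sketch; the variable names are illustrative.

```python
import unicodedata

ch = "m"
code_point = ord(ch)               # integer value of the code point
hex_form = hex(code_point)         # hexadecimal form
name = unicodedata.name(ch)        # official character name
notation = f"U+{code_point:04X}"   # conventional notation, at least 4 hex digits

print(code_point, hex_form, name, notation)
# 109 0x6d LATIN SMALL LETTER M U+006D
```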
Organization
- Standard allows for 17 * 2^16 code points
- Each group of 2^16 = 65,536 is called a plane
- In U+DDSSSS, DD indicates the plane and SSSS indicates the point on the plane
Planes
- BMP 00 - Basic Multilingual Plane
- SMP 01 - Supplementary Multilingual Plane
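The plane layout maps directly onto bit operations: the plane is everything above the low 16 bits, and the position within the plane is the low 16 bits. A minimal sketch, where `split_code_point` is a hypothetical helper name:

```python
def split_code_point(code_point: int) -> tuple[int, int]:
    """Return (plane, offset within the plane) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    plane = code_point >> 16       # the DD part of U+DDSSSS
    offset = code_point & 0xFFFF   # the SSSS part
    return plane, offset

total_code_points = 17 * 2**16     # 17 planes of 65,536 points each

print(split_code_point(0x006D))    # (0, 109): plane 00, the BMP
print(split_code_point(0x1F600))   # (1, 62976): plane 01, the SMP
print(total_code_points)           # 1114112
```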
UTF-8
- Has variable length: each character is encoded as a sequence of 1 to 4 bytes (8 bits each)
- Thus a file size can't be used to guess the number of characters
- No endian-ness because the unit of encoding is a single byte, so there is no byte order within a unit to swap
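The variable length is easy to observe by encoding a few characters and counting bytes. This sketch uses four arbitrary sample characters, written as escapes so the snippet itself stays ASCII.

```python
# m, e-acute, euro sign, grinning face emoji
samples = ["m", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

# Character count and byte count diverge as soon as non-ASCII text appears.
text = "m\u20ac"
print(len(text), len(text.encode("utf-8")))  # 2 4
```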
1-Byte Encoding
- 0XXX XXXX
- First bit will be 0
- This overlaps with ASCII characters
Multi-Byte Encoding
- First byte starts with 1
- The number of leading 1s shows how many bytes are required to store the data
- The sequence of 1s ends with a 0
- Following byte: each byte that follows the first byte
  - 10XX XXXX
  - Starts with 10
  - Payload is contained in XX XXXX
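The marker bits described above can be inspected directly. In this sketch, `utf8_bits` and `leading_ones` are hypothetical helper names; the euro sign (U+20AC) serves as a 3-byte example.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]

def leading_ones(byte: int) -> int:
    """Count the 1 bits at the start of a byte, stopping at the first 0."""
    count = 0
    mask = 0x80
    while byte & mask:
        count += 1
        mask >>= 1
    return count

bits = utf8_bits("\u20ac")
print(bits)                # ['11100010', '10000010', '10101100']
print(leading_ones(0xE2))  # 3: first byte of a 3-byte sequence
print(leading_ones(0x82))  # 1: following byte, prefix 10
```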
UTF-8 Encoding ➡︎ Unicode Point
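- Convert the bytes to binary
- Determine the size from the leading 1s of the first byte
- Strip the marker bits from each byte to get the payload
- Group the payload bits in fours from the right
- Convert each group to hex
- Write the result as U+{hex}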
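The steps above can be sketched as a small decoder for a single character. `decode_utf8_char` is a hypothetical helper, and it deliberately skips the extra validation a real decoder performs (overlong forms, surrogates, range checks), so treat it as an illustration rather than a replacement for `bytes.decode`.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the UTF-8 bytes of one character into its code point."""
    first = data[0]
    # Determine the size and strip the marker bits of the first byte.
    if first < 0x80:                  # 0XXX XXXX
        size, payload = 1, first
    elif first >> 5 == 0b110:         # 110X XXXX
        size, payload = 2, first & 0x1F
    elif first >> 4 == 0b1110:        # 1110 XXXX
        size, payload = 3, first & 0x0F
    elif first >> 3 == 0b11110:       # 1111 0XXX
        size, payload = 4, first & 0x07
    else:
        raise ValueError("invalid first byte")
    if len(data) != size:
        raise ValueError("unexpected number of bytes")
    # Each following byte is 10XX XXXX and contributes 6 payload bits.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid following byte")
        payload = (payload << 6) | (byte & 0x3F)
    return payload

code_point = decode_utf8_char(b"\xe2\x82\xac")
print(f"U+{code_point:04X}")  # U+20AC
```

Formatting the integer with `:04X` covers the "group in fours and convert to hex" steps, since each hex digit corresponds to four bits.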
Unicode Point ➡︎ UTF-8 Encoding
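This direction reverses the decoding steps: pick the size from the magnitude of the code point, then distribute the payload bits across the first byte and the following bytes. A minimal sketch, assuming the standard UTF-8 byte layout; `encode_utf8_char` is a hypothetical helper and does not reject surrogate code points the way Python's own encoder does.

```python
def encode_utf8_char(code_point: int) -> bytes:
    """Encode one code point as UTF-8 bytes."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point < 0x80:         # fits in 7 bits: 0XXX XXXX
        return bytes([code_point])
    if code_point < 0x800:        # fits in 11 bits: 2 bytes
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:      # fits in 16 bits: 3 bytes
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                # up to 21 bits: 4 bytes
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

encoded = encode_utf8_char(0x20AC)
print(encoded.hex(" "))  # e2 82 ac
```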
Other Encodings
UTF-X・UTF-16・UTF-32
- UTF-X groups the encoded data into units of X bits
Endian-ness
- 0xFEFF: the BOM (Byte Order Mark) at the start of the data signals the byte order for UTF-X when X is not 8
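Byte order becomes visible as soon as a code unit spans more than one byte. This sketch encodes "m" in UTF-16 with both byte orders and uses the BOM constants from Python's standard `codecs` module; `detect_utf16_order` is a hypothetical helper that only looks at UTF-16 data.

```python
import codecs

text = "m"
big = text.encode("utf-16-be")       # explicit big-endian, no BOM
little = text.encode("utf-16-le")    # explicit little-endian, no BOM
with_bom = text.encode("utf-16")     # BOM first, then data in native byte order

print(big.hex(" "))     # 00 6d
print(little.hex(" "))  # 6d 00

def detect_utf16_order(data: bytes):
    """Return 'big' or 'little' based on a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes fe ff
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes ff fe
        return "little"
    return None

print(detect_utf16_order(with_bom) in ("big", "little"))  # True
```

Without a BOM, the reader has to know the byte order from some other source, such as the declared encoding name.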