The essence of UTF (Unicode Transformation Format) encoding lies in shortening the fixed 32-bit representation of Unicode code points. UTF-8 is the most widely used encoding today, especially in web applications and information systems. It compresses the encoding of Unicode characters to reduce storage space while remaining compatible with older standards such as ASCII.
UTF-8 is a variable-length encoding that uses 1 to 4 bytes to represent a character (the original design allowed sequences of up to 6 bytes, shown in the table below, but code points above U+10FFFF are no longer valid Unicode). The number of bytes used depends on the value of the Unicode character:
- Characters from the ASCII range (U+0000 to U+007F) are encoded using 1 byte.
- Characters beyond the ASCII range (U+0080 to U+07FF) use 2 bytes.
- Characters from U+0800 upward use 3 bytes (up to U+FFFF) or 4 bytes (U+10000 to U+10FFFF); the 5- and 6-byte forms belong to the original, now obsolete scheme.
Here is the UTF-8 encoding scheme:
| Unicode (bits) | UTF-8 |
|---|---|
| 00000000 00000000 00000000 0xxxxxxx | 0xxxxxxx |
| 00000000 00000000 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx |
| 00000000 00000000 xxxxxxxx xxxxxxxx | 1110xxxx 10xxxxxx 10xxxxxx |
| 00000000 000xxxxx xxxxxxxx xxxxxxxx | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 000000xx xxxxxxxx xxxxxxxx xxxxxxxx | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 0xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
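To make the bit patterns in the table concrete, here is a minimal Python sketch that encodes a single code point by hand (the helper name `encode_utf8` is just for illustration, and it is limited to the 1- to 4-byte forms of modern UTF-8):

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes, following the table above."""
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")

print(encode_utf8(0x00E9).hex())   # c3a9      ('é')
print(encode_utf8(0x1F984).hex())  # f09fa684  ('🦄')
```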
1.) 1-byte encoding: If a character's Unicode value is within the range U+0000 to U+007F (the standard ASCII characters), it is represented in UTF-8 as a single byte. For example, 'A' (U+0041) is encoded as the single byte 01000001 (41 in hexadecimal), exactly as in ASCII.
2.) 2-byte encoding: Characters outside of the ASCII range require more bytes. For example, the Unicode character 'é' has the code point U+00E9:
00000000 11101001
11000011 10101001
(C3 A9 in hexadecimal).
3.) 3-byte encoding: Characters with even larger code points require 3 bytes. For instance, the character 'अ' (from the Devanagari script) has the code point U+0905:
00000000 00001001 00000101
11100000 10100100 10000101
(E0 A4 85 in hexadecimal).
4.) 4-byte encoding: Characters from supplementary planes, such as certain emojis or rare historical scripts, use 4 bytes. For example, the emoji '🦄' (unicorn) has the Unicode code point U+1F984:
00000001 11111001 10000100
11110000 10011111 10100110 10000100
(F0 9F A6 84 in hexadecimal).
As a further worked example, the Unicode code point for 'ó' is U+00F3, which is decimal 243 or hexadecimal 0x00F3. Since this value needs more than 7 bits, it falls under the second UTF-8 rule (2-byte encoding):
- Unicode binary: 00000000 00000000 11110011
- UTF-8 binary: 11000011 10110011
(C3 B3 in hexadecimal).
Thus, 'ó' in UTF-8 is encoded as two bytes: C3 B3.
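Each of the hand-computed examples above can be checked against Python's built-in codec by calling `str.encode('utf-8')` on the character in question:

```python
for ch in ("A", "é", "अ", "🦄", "ó"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()}")
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+0905 -> E0 A4 85
# U+1F984 -> F0 9F A6 84
# U+00F3 -> C3 B3
```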
UTF-16 is another encoding standard, and it uses either 2 or 4 bytes to represent a character. Unlike UTF-8, UTF-16 uses a minimum of 2 bytes per character, which simplifies handling of common characters but is less space-efficient for texts consisting mostly of ASCII characters.
For characters outside the Basic Multilingual Plane (BMP), UTF-16 requires 4 bytes to encode using a technique called surrogate pairs.
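A surrogate pair is derived by subtracting 0x10000 from the code point and splitting the remaining 20 bits into a high surrogate (0xD800 plus the top 10 bits) and a low surrogate (0xDC00 plus the bottom 10 bits). Here is a small sketch of that calculation, reusing the unicorn emoji from the UTF-8 examples (the helper name `to_surrogate_pair` is hypothetical):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000        # 20 bits remain after subtracting the offset
    high = 0xD800 + (offset >> 10)       # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate carries the bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F984)   # '🦄' from the UTF-8 example above
print(f"{high:04X} {low:04X}")           # D83E DD84
```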
One important aspect of UTF-16 is the order of bytes, also known as endianness. UTF-16 files typically start with a Byte Order Mark (BOM) to indicate the byte order. The BOM is represented by the special Unicode character U+FEFF, a "zero-width non-breaking space" that doesn't appear in text but signals the byte order: a file beginning with the bytes FE FF is big-endian (UTF-16BE), while FF FE indicates little-endian (UTF-16LE).
For example, you may encounter this BOM at the start of Windows text files or Microsoft Office documents, especially when opening files in text editors such as Notepad++.
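One way to observe the BOM and the two byte orders is to compare Python's UTF-16 codecs: the generic 'utf-16' codec prepends a BOM in the platform's native byte order, while the endian-specific codecs emit none.

```python
text = "Aó"
print(text.encode("utf-16").hex(" "))     # ff fe 41 00 f3 00 on a little-endian machine (BOM first)
print(text.encode("utf-16-be").hex(" "))  # 00 41 00 f3  (big-endian, no BOM)
print(text.encode("utf-16-le").hex(" "))  # 41 00 f3 00  (little-endian, no BOM)
```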
UTF-8 has become the dominant encoding standard because it is backwards-compatible with ASCII, space-efficient for predominantly ASCII texts, and can represent any character in the Unicode standard. Meanwhile, UTF-16 is commonly used in environments like Windows, which handles 2-byte characters efficiently.
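The space trade-off is easy to measure: mostly-ASCII text is roughly half the size in UTF-8, while text dominated by non-ASCII BMP characters (for example, many CJK strings) can be smaller in UTF-16. The sample strings below are arbitrary, chosen only to illustrate the comparison.

```python
samples = {
    "ASCII-heavy": "Hello, world!" * 100,
    "CJK-heavy": "こんにちは世界" * 100,
}
for label, s in samples.items():
    print(f"{label}: UTF-8 = {len(s.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(s.encode('utf-16-le'))} bytes")
```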
In summary:
- UTF-8 is optimal for web content and multi-language support.
- UTF-16 is preferred in some system environments, particularly in Windows applications.