Differences
| Next revision | Previous revision | ||
| tanszek:oktatas:techcomm:utf-8_encoding [2024/10/07 11:05] – created knehez | tanszek:oktatas:techcomm:utf-8_encoding [2024/11/19 10:59] (current) – [UTF-16 Encoding] knehez | ||
|---|---|---|---|
| Line 11: | Line 11: | ||
| - Characters beyond the ASCII range (U+0080 to U+07FF) use **2 bytes**. | - Characters beyond the ASCII range (U+0080 to U+07FF) use **2 bytes**. | ||
| - | - For larger Unicode values | + | - For larger Unicode values |
| - | + | ||
| - | - Characters in the supplementary planes (U+10000 to U+10FFFF) are encoded using **4 bytes**. | + | |
| Here is the UTF-8 encoding scheme: | Here is the UTF-8 encoding scheme: | ||
| Line 33: | Line 31: | ||
| 2.) **2-byte encoding**: Characters outside of the ASCII range require more bytes. For example, the Unicode character ' | 2.) **2-byte encoding**: Characters outside of the ASCII range require more bytes. For example, the Unicode character ' | ||
| - | * In binary: 00000000 11101001 | + | * In binary: |
| - | * UTF-8 encoding: 11000011 10101001 (C3 A9 in hexadecimal). | + | * UTF-8 encoding: |
| 3.) **3-byte encoding**: Characters with even larger code points require 3 bytes. For instance, the character ' | 3.) **3-byte encoding**: Characters with even larger code points require 3 bytes. For instance, the character ' | ||
| - | * In binary: 00000000 00001001 00000101 | + | * In binary: |
| - | * UTF-8 encoding: | + | * UTF-8 encoding: |
| 4.) **4-byte encoding**: Characters from supplementary planes, such as certain emojis or rare historical scripts, use 4 bytes. For example, the emoji ' | 4.) **4-byte encoding**: Characters from supplementary planes, such as certain emojis or rare historical scripts, use 4 bytes. For example, the emoji ' | ||
| - | * In binary: 00000001 11111001 10000100 | + | * In binary: |
| - | * UTF-8 encoding: 11110000 10011111 | + | * UTF-8 encoding: |
| === Example: Encoding the Character ' | === Example: Encoding the Character ' | ||
| Line 50: | Line 48: | ||
| The Unicode code point for ' | The Unicode code point for ' | ||
| - | - Unicode binary: 00000000 00000000 11110011 | + | - Unicode binary: |
| - | - UTF-8 binary: 11000011 10110011 (C3 B3 in hexadecimal). | + | |
| + | - UTF-8 binary: | ||
| Thus, ' | Thus, ' | ||
| - | ==== UTF-16 Encoding ==== | + | ===== UTF-16 Encoding |
| **UTF-16** is another encoding standard that uses either 2 or 4 bytes to represent characters. Unlike UTF-8, UTF-16 uses **a minimum of 2 bytes** for each character: code points up to U+FFFF fit in a single 16-bit unit, which keeps encoding simple, but the format is less space-efficient for texts containing many ASCII characters. | **UTF-16** is another encoding standard that uses either 2 or 4 bytes to represent characters. Unlike UTF-8, UTF-16 uses **a minimum of 2 bytes** for each character: code points up to U+FFFF fit in a single 16-bit unit, which keeps encoding simple, but the format is less space-efficient for texts containing many ASCII characters. | ||
| Line 68: | Line 67: | ||
| * **FF FE**: Little-endian (least significant byte first) | * **FF FE**: Little-endian (least significant byte first) | ||
| - | For example, in Windows text files or Microsoft Office documents, you may encounter this BOM at the beginning, especially when opening files in text editors like Notepad. | + | For example, |
| - | === Conclusion === | + | ==== Conclusion |
| - | UTF-8 has become the dominant encoding standard because it is backward-compatible with ASCII, space-efficient for texts that are predominantly ASCII, and can represent any character in the Unicode standard. Meanwhile, UTF-16 is commonly used in environments like Windows, | + | **UTF-8** has become the dominant encoding standard because it is backwards-compatible with ASCII, space-efficient for predominantly ASCII texts, and can represent any character in the Unicode standard. Meanwhile, UTF-16 is commonly used in environments like Windows, |
| In summary: | In summary: | ||
| + | |||
| - **UTF-8** is optimal for web content and multi-language support. | - **UTF-8** is optimal for web content and multi-language support. | ||
| + | |||
| - **UTF-16** is preferred in some system environments, | - **UTF-16** is preferred in some system environments, | ||
| - | --- | ||
| - | With this explanation, | ||
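
The byte ranges and the worked examples in the revisions above can be checked with a short Python sketch. It is an illustration only, not part of either page revision: the function names `utf8_encode` and `utf16_byte_order` are hypothetical, and the BOM check assumes the input is UTF-16 (it does not try to distinguish UTF-32, whose little-endian BOM also starts with FF FE).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point into UTF-8 by hand (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by a leading UTF-16 BOM, if any."""
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    return "unknown"


if __name__ == "__main__":
    # U+00F3 has the binary value 11110011 and encodes to C3 B3.
    print(utf8_encode(0xF3).hex())              # c3b3
    # Cross-check the hand-written encoder against Python's built-in codec.
    print(utf8_encode(0xF3) == chr(0xF3).encode("utf-8"))  # True
    print(utf16_byte_order(b"\xff\xfeA\x00"))   # little-endian
```

Comparing the result with `str.encode("utf-8")` is a convenient way to confirm each binary example by hand before trusting it.
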
tanszek/oktatas/techcomm/utf-8_encoding.1728299149.txt.gz · Last modified: 2024/10/07 11:05 by knehez
