tanszek:oktatas:techcomm:utf-8_encoding

Differences

This shows you the differences between two versions of the page.

tanszek:oktatas:techcomm:utf-8_encoding [2024/10/07 11:05] – created knehez
tanszek:oktatas:techcomm:utf-8_encoding [2024/11/19 10:59] (current) – [UTF-16 Encoding] knehez
Line 11: Line 11:
 - Characters beyond the ASCII range (U+0080 to U+07FF) use **2 bytes**.
  
-- For larger Unicode values (U+0800 to U+FFFF), **3 bytes** are used.
-
-- Characters in the supplementary planes (U+10000 to U+10FFFF) are encoded using **4 bytes**.
+- For larger Unicode values (U+0800 and above), **3 bytes** (up to U+FFFF) or **4 bytes** (U+10000 to U+10FFFF) are used.
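The byte-length ranges listed above can be checked with a few lines of Python. This is only a minimal sketch: the function name `utf8_byte_count` is made up for illustration, and the result is cross-checked against Python's built-in encoder.

```python
def utf8_byte_count(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:     # U+0080 to U+07FF
        return 2
    if code_point <= 0xFFFF:    # U+0800 to U+FFFF
        return 3
    return 4                    # U+10000 to U+10FFFF (supplementary planes)


for ch in ("A", "é", "अ", "🦄"):
    # Cross-check the range table against Python's own UTF-8 encoder.
    assert utf8_byte_count(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_byte_count(ord(ch))} byte(s)")
```

The sketch ignores the surrogate range (U+D800 to U+DFFF), which is reserved for UTF-16 and cannot be encoded as a character on its own.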
  
 Here is the UTF-8 encoding scheme:
Line 33: Line 31:
 2.) **2-byte encoding**: Characters outside of the ASCII range require more bytes. For example, the Unicode character 'é' has the code point **U+00E9**:
  
-  * In binary: 00000000 11101001
-  * UTF-8 encoding: 11000011 10101001 (C3 A9 in hexadecimal).
+  * In binary: ''00000000 11101001''
+  * UTF-8 encoding: ''11000011 10101001'' (C3 A9 in hexadecimal).
  
 3.) **3-byte encoding**: Characters with even larger code points require 3 bytes. For instance, the character 'अ' (from the Devanagari script) has the code point **U+0905**:
  
-  * In binary: 00000000 00001001 00000101
-  * UTF-8 encoding: 1110 0000 1010 0001 1000 0101 (E0 A5 85 in hexadecimal).
+  * In binary: ''00000000 00001001 00000101''
+  * UTF-8 encoding: ''11100000 10100100 10000101'' (E0 A4 85 in hexadecimal).
  
 4.) **4-byte encoding**: Characters from supplementary planes, such as certain emojis or rare historical scripts, use 4 bytes. For example, the emoji '🦄' (unicorn) has the Unicode code point **U+1F984**:
  
-  * In binary: 00000001 11111001 10000100
-  * UTF-8 encoding: 11110000 10011111 10011000 10000100 (F0 9F A6 84 in hexadecimal).
+  * In binary: ''00000001 11111001 10000100''
+  * UTF-8 encoding: ''11110000 10011111 10100110 10000100'' (F0 9F A6 84 in hexadecimal).
  
 === Example: Encoding the Character 'ó' ===
Line 50: Line 48:
 The Unicode code point for 'ó' is **U+00F3**, which is decimal 243 or hexadecimal 0x00F3. Since this value needs more than 7 bits (it lies above U+007F), it falls under the second rule for UTF-8 encoding (2-byte encoding).
  
-- Unicode binary: 00000000 00000000 11110011
-- UTF-8 binary: 11000011 10110011 (C3 B3 in hexadecimal).
+- Unicode binary: ''00000000 00000000 11110011''
+
+- UTF-8 binary: ''11000011 10110011'' (C3 B3 in hexadecimal).
  
 Thus, 'ó' in UTF-8 is encoded as two bytes: **C3 B3**.
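The same result can be derived step by step in Python. In this sketch the variable names are chosen purely for illustration. The code point is split into its upper 5 and lower 6 bits, which are then placed behind the `110` and `10` prefixes of the 2-byte pattern.

```python
code_point = 0x00F3                      # 'ó', decimal 243

high_bits = code_point >> 6              # upper 5 bits: 00011
low_bits = code_point & 0b111111         # lower 6 bits: 110011

first_byte = 0b11000000 | high_bits      # 110xxxxx -> 11000011 (0xC3)
second_byte = 0b10000000 | low_bits      # 10xxxxxx -> 10110011 (0xB3)

encoded = bytes([first_byte, second_byte])
assert encoded == "ó".encode("utf-8")    # matches Python's built-in encoder
print(f"{first_byte:08b} {second_byte:08b}")   # 11000011 10110011
```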
  
-==== UTF-16 Encoding ====
+===== UTF-16 Encoding =====
  
 **UTF-16** is another encoding standard that uses either 2 or 4 bytes to represent characters. Unlike UTF-8, UTF-16 uses **a minimum of 2 bytes** for each character, which simplifies the encoding of characters but can be less space-efficient for texts containing many ASCII characters.
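To make the size trade-off concrete, the sketch below compares the encoded length of a few sample strings. The helper `encoded_sizes` and the sample strings are assumptions made for illustration. The `utf-16-le` codec is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII text": "Hello",
    "accented letter": "ó",
    "Devanagari letter": "अ",
    "emoji (supplementary plane)": "🦄",
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For "Hello" this gives 5 bytes in UTF-8 against 10 in UTF-16, while 'अ' takes 3 bytes in UTF-8 but only 2 in UTF-16, and the emoji needs 4 bytes in both.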
Line 68: Line 67:
   * **FF FE**: Little-endian (least significant byte first)
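A BOM check along these lines can be sketched in Python with the constants from the standard `codecs` module. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the data is UTF-16 if it carries a BOM at all (a UTF-32 little-endian file also starts with FF FE and would be misreported).

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'little-endian', 'big-endian', or None if no UTF-16 BOM is present."""
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "big-endian"
    return None


little = codecs.BOM_UTF16_LE + "ó".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "ó".encode("utf-16-be")

print(little.hex(" ").upper(), detect_utf16_byte_order(little))   # FF FE F3 00 little-endian
print(big.hex(" ").upper(), detect_utf16_byte_order(big))         # FE FF 00 F3 big-endian
```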
  
-For example, in Windows text files or Microsoft Office documents, you may encounter this BOM at the beginning, especially when opening files in text editors like Notepad.
+For example, you may encounter this BOM at the beginning of Windows text files or Microsoft Office documents, especially when opening files in text editors like Notepad++.
  
-=== Conclusion ===
+==== Conclusion ====
  
-UTF-8 has become the dominant encoding standard because it is backward-compatible with ASCII, space-efficient for texts that are predominantly ASCII, and can represent any character in the Unicode standard. Meanwhile, UTF-16 is commonly used in environments like Windows, where it handles 2-byte characters efficiently.
+**UTF-8** has become the dominant encoding standard because it is backwards-compatible with ASCII, space-efficient for predominantly ASCII texts, and can represent any character in the Unicode standard. Meanwhile, UTF-16 is commonly used in environments like Windows, which handles 2-byte characters efficiently.
  
 In summary:
 +
 - **UTF-8** is optimal for web content and multi-language support.
 +
 - **UTF-16** is preferred in some system environments, particularly in Windows applications.
  
---- 
  
-With this explanation, your students will gain a clearer understanding of how UTF-8 encoding works and how it differs from UTF-16. If you'd like, I can also add some more specific real-world examples. 
tanszek/oktatas/techcomm/utf-8_encoding.1728299149.txt.gz · Last modified: 2024/10/07 11:05 by knehez