Character Encoding

Creating a unified code for characters was a significant achievement in technical communication. The American Standards Association (ASA, the predecessor of today's ANSI) standardized a 7-bit code in the 1960s, known as ASCII (American Standard Code for Information Interchange). This code includes the 26 uppercase and 26 lowercase letters of the English alphabet, digits, punctuation marks, and so-called control characters. Control characters were used to manage text formatting and to control devices and applications.

ASCII: The First Standard

ASCII was designed for English and includes 128 characters, where:

  • The first 32 codes (0-31) and code 127 (DEL) are control characters (e.g., carriage return, line feed).
  • The codes from 32 to 126 represent printable symbols (space, letters, digits, punctuation).

However, this 7-bit code cannot represent the accented letters used by many other languages, most mathematical symbols, or special graphical symbols. As a result, the character set had to be extended.
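The split between control and printable codes, and the limits of a 7-bit set, can be sketched in a few lines of Python. The helper names `classify_ascii` and `is_ascii_encodable` are hypothetical and chosen only for this illustration; the sketch treats DEL (127) as a control code, as the ASCII standard does.

```python
def classify_ascii(code: int) -> str:
    """Classify a 7-bit code as 'control' or 'printable'."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    # Codes 0-31 and 127 (DEL) are control codes; 32-126 are printable.
    return "control" if code < 32 or code == 127 else "printable"


def is_ascii_encodable(text: str) -> bool:
    """Return True if every character of text fits into 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(classify_ascii(10))            # line feed
print(classify_ascii(ord("A")))      # a letter
print(is_ascii_encodable("Hello"))
print(is_ascii_encodable("t\u0171z"))  # contains the Hungarian letter ű
```

The last call fails the check because ű has no code in the 128-character set.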

ANSI and Extended Encoding

To address the need for more characters, 8-bit extensions of ASCII were developed. These are often loosely called ANSI encodings (after the American National Standards Institute), although the name is informal and covers several different code pages. An 8-bit code provides 256 characters and leaves room for additional language-specific characters. One widely used extension, particularly for text editing, is the IBM Code Page 852, which includes the 18 accented characters (9 lowercase and 9 uppercase) used in the Hungarian language.
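As a small check of this claim, the sketch below uses Python's built-in `cp852` codec to test whether the 18 Hungarian accented letters fit into Code Page 852, and compares it with the original IBM PC code page (`cp437`). The function name `unencodable` is hypothetical.

```python
import unicodedata

# The 9 lowercase and 9 uppercase accented Hungarian letters.
# Normalizing to NFC guarantees one code point per letter.
HUNGARIAN_ACCENTED = unicodedata.normalize("NFC", "áéíóöőúüűÁÉÍÓÖŐÚÜŰ")


def unencodable(text: str, codec: str) -> list[str]:
    """Return the characters of text that the given codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


print(len(HUNGARIAN_ACCENTED))                    # number of letters tested
print(unencodable(HUNGARIAN_ACCENTED, "cp852"))   # letters missing from CP852
print(unencodable(HUNGARIAN_ACCENTED, "cp437"))   # letters missing from CP437
```

In an 8-bit code page each of these letters occupies exactly one byte, so the whole string encodes to 18 bytes in `cp852`.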

ISO Standards

To further standardize character sets globally, the ISO 8859 series was introduced:

  • ISO-8859-1 (Latin-1): Encodes Western European accented characters.
  • ISO-8859-2 (Latin-2): Used for Central and Eastern European languages, including Hungarian, and contains special characters like ű and ő.
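Because both parts of the series are 8-bit codes, the same byte can stand for different characters depending on which part is assumed. The following sketch decodes two bytes with Python's built-in `iso-8859-1` and `iso-8859-2` codecs; the byte values are the Latin-2 codes of ő and ű.

```python
# Two bytes that represent ő and ű in ISO-8859-2.
raw = bytes([0xF5, 0xFB])

as_latin2 = raw.decode("iso-8859-2")
as_latin1 = raw.decode("iso-8859-1")

print(as_latin2)  # the intended Hungarian letters
print(as_latin1)  # the same bytes misread as Western European letters
```

Reading Latin-2 text as Latin-1 is a typical source of garbled Hungarian text, with õ and û appearing in place of ő and ű.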

Unicode and UCS: Universal Character Sets

As digital communication expanded, a more comprehensive character set was required to accommodate all languages and symbols. The ISO-10646 standard defines a Universal Character Set (UCS), designed so that text in any existing character set can be converted to UCS and back without losing information. UCS aims to include the characters of all known languages, both living and extinct, as well as known mathematical and scientific symbols.

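  • UCS was first published in 1993 as ISO-10646-1 and originally defined a 31-bit code space.
  • The range between 0x0000 and 0xFFFF is the 16-bit domain called the Basic Multilingual Plane (BMP). Its last two codes, 0xFFFE and 0xFFFF, are reserved as noncharacters, so assigned characters fall between 0x0000 and 0xFFFD.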
  • In 2001, ISO-10646-2 was introduced to cover characters outside the BMP, and by 2003, both sets were unified under the ISO-10646 standard.
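Whether a character belongs to the BMP can be read directly from its code: the bits above the lowest 16 give the plane number, and plane 0 is the BMP. A minimal sketch, with the hypothetical helper names `plane_of` and `in_bmp`:

```python
def plane_of(ch: str) -> int:
    """Return the plane number of a single character (0 means BMP)."""
    return ord(ch) >> 16  # drop the 16 low bits that index within a plane


def in_bmp(ch: str) -> bool:
    """Return True if the character lies in the Basic Multilingual Plane."""
    return plane_of(ch) == 0


for ch in ("A", "\u0171", "\U0001F600"):  # A, ű, and an emoji outside the BMP
    print(f"U+{ord(ch):04X} plane={plane_of(ch)} in_bmp={in_bmp(ch)}")
```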

Unicode

The Unicode standard is maintained in step with UCS, and the two assign the same codes to the same characters. Each Unicode character is identified by both a number and a name. For example, U+0041 stands for “Latin capital letter A.” Within the Unicode code space:

  • From U+0000 to U+007F is equivalent to 7-bit ASCII.
  • From U+0080 to U+00FF corresponds to Latin-1.
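The number and name of a character, and the two equivalences listed above, can be inspected with Python's standard `unicodedata` module and built-in codecs. The helper `describe` is a hypothetical name used only in this sketch.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the U+XXXX notation and the official name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))
print(describe("\u0171"))

# Every 7-bit ASCII code decodes to the Unicode character with the same number.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))

# Every Latin-1 byte decodes to the Unicode character with the same number.
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))

print(ascii_matches, latin1_matches)
```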

Differences Between ISO 10646 and Unicode

While ISO 10646 defines the code table of characters, Unicode adds further functionality by providing:

  • Character properties and presentation guidelines for different writing systems (like Arabic and Hebrew).
  • A bidirectional algorithm that handles documents mixing left-to-right and right-to-left text.
  • Collation and text comparison algorithms that allow text in different languages to be sorted and compared.
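Two of these additions can be glimpsed from Python's standard library. The sketch below reads the bidirectional class that Unicode records for a few characters, and shows that sorting by raw code number differs from language-aware collation. It does not implement the Unicode collation algorithm itself; it only illustrates why one is needed.

```python
import unicodedata

# Bidirectional class stored in the Unicode character database.
samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
}
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
print(bidi_classes)  # "L" = left-to-right, "R" and "AL" = right-to-left

# Plain sorting compares code numbers, so a word starting with ő (U+0151)
# lands after "zebra", although ő precedes z in the Hungarian alphabet.
words = ["zebra", "\u0151z", "alma"]
by_code_point = sorted(words)
print(by_code_point)
```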

Practical Example

In practice, if you're encoding a text that contains the characters A, ű, and Ω:

  • In ASCII, only A can be represented (code 0x41, matching U+0041).
  • In ISO-8859-2 (Latin-2), both A (byte 0x41) and ű (byte 0xFB, corresponding to U+0171) can be encoded, but Ω cannot.
  • In Unicode, all three characters have a code: A (U+0041), ű (U+0171), and Ω (U+03A9).
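The same comparison can be run with Python's built-in codecs. This is a minimal sketch; `encodable_chars` is a hypothetical helper, and UTF-8 is used here as one possible encoding of Unicode.

```python
TEXT = "A\u0171\u03A9"  # the characters A, ű, Ω


def encodable_chars(text: str, codec: str) -> list[str]:
    """Return the characters of text that the given codec can encode."""
    result = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue  # this character has no code in the codec
        result.append(ch)
    return result


for codec in ("ascii", "iso-8859-2", "utf-8"):
    print(codec, encodable_chars(TEXT, codec))

# The Latin-2 byte for ű:
print("\u0171".encode("iso-8859-2"))
```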

Examples of Encoding Systems

  • ASCII: Limited to 128 characters (0-127), primarily for English.
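  • ANSI (Extended ASCII): An informal name for 8-bit code pages with 256 characters, including language-specific letters.
  • ISO-8859-1 (Latin-1): Encodes characters for Western European languages.
  • ISO-8859-2 (Latin-2): Supports Central and Eastern European languages, including Hungarian.
  • Unicode (commonly stored as UTF-8): Supports virtually all characters from all writing systems worldwide. Unicode defines the character codes, while UTF-8 is one way of storing those codes as bytes.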
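Unlike the single-byte sets above, a Unicode character does not have one fixed byte size: the size depends on the encoding form chosen. The sketch below compares UTF-8 with UTF-16 and UTF-32, two further encoding forms not discussed on this page, using the hypothetical helper `encoded_sizes`. The big-endian codec variants are used so that no byte order mark is added.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the number of bytes a character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for ch in "A\u0171\u03A9\U0001F600":  # A, ű, Ω, and an emoji outside the BMP
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "), encoded_sizes(ch))
```

UTF-8 keeps ASCII text at one byte per character, which is one reason it is a common choice for storage and transmission.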

With Unicode's rise, character encoding has become highly flexible, supporting diverse languages, symbols, and historical texts. This makes Unicode the most widely adopted standard for modern software systems.

tanszek/oktatas/techcomm/character_encoding.txt · Last modified: 2024/11/18 20:15 by knehez