A simple data compression procedure, which involves the following steps:
Example:
Symbol | Probability | Code | Length |
---|---|---|---|
X1 | 0.25 | 00 | 2 |
X2 | 0.25 | 01 | 2 |
X3 | 0.125 | 100 | 3 |
X4 | 0.125 | 101 | 3 |
X5 | 0.0625 | 1100 | 4 |
X6 | 0.0625 | 1101 | 4 |
X7 | 0.0625 | 1110 | 4 |
X8 | 0.0625 | 1111 | 4 |
We use 8 symbols for encoding, where the symbols have the given probabilities in the message to be sent. They are arranged in decreasing order of their frequency of occurrence, with the most frequent ones at the top. The third column shows the assigned variable-length codes, with shorter codes given to more frequent symbols.
The 8 symbols form a complete event system since they are mutually exclusive events, and the sum of their occurrence probabilities is 1.
Let's calculate the entropy of this event system:
$$ H = -\left(\frac{2}{4} \log \frac{1}{4} + \frac{2}{8} \log \frac{1}{8} + \frac{4}{16} \log \frac{1}{16}\right) = 2.75 \, \text{[bit].} $$
What is the average word length?
We can calculate the average word length with the use of this function: \( L = \sum_{i=1}^n x_i \mu_i \)
After substituting:
$$ L = \frac{2}{4} \cdot 2 + \frac{2}{8} \cdot 3 + \frac{4}{16} \cdot 4 = 2.75 \,\text{ [bit]}$$
The original 8 symbols (X1-X8) can be stored with 4 bits. The average length is 2.75 bits, thus the compressed messages will be \(\frac{2.75}{4} \approx 68.75\% \) shorter.