tanszek:oktatas:techcomm:information
$$ I_E = \log_2 \frac{1}{p_E} = -\log_2( p_E ) \ [bit] $$
Shannon used the logarithm to measure information because only the logarithmic function makes the information of independent events additive. If two independent events \( A \) and \( B \) occur, their joint probability is \( p(A,B) = p(A) \cdot p(B) \).

We expect the total information to add up:

$$ I(A,B) = I(A) + I(B) $$

Only the logarithm satisfies this property:

$$ I(p) = -\log p \quad \Rightarrow \quad I(A,B) = -\log(p(A) \cdot p(B)) = I(A) + I(B) $$

If we used \( I(p) = 1/p \) instead, the values would multiply rather than add.

The properties of the logarithm function therefore play an important role in modeling the quantitative properties of a given piece of information.
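As a quick numerical check, the following minimal C sketch (the probabilities 0.5 and 0.25 are arbitrary example values) confirms that the information of a joint event equals the sum of the individual information values:

<sxh c>
#include <stdio.h>
#include <math.h>

int main(void) {
    double pA = 0.5, pB = 0.25;      // example probabilities of two independent events

    double iA = -log2(pA);           // I(A) = 1 bit
    double iB = -log2(pB);           // I(B) = 2 bits
    double iJoint = -log2(pA * pB);  // I(A,B): information of both events occurring

    // additivity of the logarithmic measure: I(A,B) = I(A) + I(B)
    printf("I(A) = %.2f bit, I(B) = %.2f bit, I(A,B) = %.2f bit\n", iA, iB, iJoint);
    return 0;
}
</sxh>

With these values the program prints 1.00, 2.00 and 3.00 bits, matching \( I(A) + I(B) \).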
If an event space consists of two equally probable events \( p(E_1) = p(E_2) = 0.5 \), then
$$ I_{E_1} = I_{E_2} = \log_2 \frac{1}{0.5} = - \log_2 0.5 = 1 \ [bit] $$
So one unit of information (one bit) is the news value of the simplest possible choice: a single decision between two equally likely alternatives.
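More generally, choosing one outcome out of \( 2^k \) equally likely possibilities can be decomposed into \( k \) such binary choices, so its information content is

$$ I = \log_2 2^k = k \ [bit] $$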
The average information content of the set of messages is called the //entropy// of the message set.
$$ H_E = \sum_{i=1}^n p_i \cdot I_{E_i} = \sum_{i=1}^n p_i \cdot \log_2 \frac{1}{p_i} = - \sum_{i=1}^n p_i \cdot \log_2 p_i \ [bit] $$
**Example**:
Entropy can also be viewed as a measure of the information "uncertainty" of a source: the less predictable its messages are, the higher the entropy.

**Example**:

  * If a source always sends the same letter ("AAAAA…"), the entropy is zero, because the next symbol is completely predictable and carries no new information.
  * If every letter occurs with equal probability (e.g., random characters), the entropy is maximal, because each symbol is as unpredictable as possible (a short calculation follows below).
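For a source that always repeats one letter \( p = 1 \), while for, say, 26 equally likely letters \( p_i = 1/26 \) (the alphabet size here is only an illustrative assumption):

$$ H_{const} = - 1 \cdot \log_2 1 = 0 \ [bit], \qquad H_{uniform} = - \sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} = \log_2 26 \approx 4.7 \ [bit] $$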
This concept is crucial in various fields, including //data compression//, where the entropy of a source sets a lower bound on the average number of bits needed per symbol.
==== Redundancy ====
The redundancy of the event system examined above is therefore approximately 16%.
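Redundancy compares the actual entropy \( H \) of a source with the maximum entropy \( H_{max} = \log_2 n \) that \( n \) equally likely symbols could carry, \( R = 1 - H / H_{max} \). The following minimal sketch applies this formula to a hypothetical four-symbol distribution (not the event system from the example above), only to illustrate the computation:

<sxh c>
#include <stdio.h>
#include <math.h>

int main(void) {
    // hypothetical example distribution, NOT the event system from the text above
    double p[] = { 0.5, 0.25, 0.15, 0.10 };
    int n = sizeof(p) / sizeof(p[0]);

    double entropy = 0.0;
    for (int i = 0; i < n; i++) {
        entropy -= p[i] * log2(p[i]);   // H = -sum p_i * log2 p_i
    }

    double hMax = log2((double)n);      // maximum entropy of n equally likely symbols
    double redundancy = 1.0 - entropy / hMax;

    printf("H = %.3f bit, Hmax = %.3f bit, redundancy = %.1f %%\n",
           entropy, hMax, 100.0 * redundancy);
    return 0;
}
</sxh>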
==== Example of Entropy calculation ====

Why should raw password strings not be stored as plain text in compiled code? One reason is that a random, high-entropy string embedded in otherwise natural text stands out when the data is scanned with a sliding entropy window, as the following program illustrates.
<sxh c>
#include <stdio.h>
#include <string.h>
#include <math.h>

float calculateEntropy(unsigned char counts[], int length);

int main(void) {
    // Natural-language text with a random-looking fragment ("yQ%v?") embedded in it;
    // the full embedded string is truncated in the original source.
    const char sample[] =
        "Some poetry types are unique to particular cultures and genres "
        "and respond to yQ%v? characteristics of the language in which the "
        "poet writes. Readers accustomed to identifying poetry with Dante, "
        "Goethe, Mickiewicz, or Rumi may think of it as written in lines "
        "based on rhyme and regular meter. There are, however, traditions, "
        "such as Biblical poetry and alliterative verse, that use other "
        "means to create rhythm and euphony. Much modern poetry reflects "
        "a critique of poetic tradition, testing the principle of euphony "
        "itself or altogether forgoing rhyme or set rhythm.";

    const int windowWidth = 20;
    unsigned char counts[256];

    int sampleLength = (int)strlen(sample);

    // Slide a fixed-width window over the text and measure the entropy of the
    // character distribution inside each window.
    for (int start = 0; start <= sampleLength - windowWidth; start++) {
        memset(counts, 0, sizeof(counts));

        // Count characters in current window
        for (int j = 0; j < windowWidth; j++) {
            unsigned char c = (unsigned char)sample[start + j];
            counts[c]++;
        }

        float entropy = calculateEntropy(counts, windowWidth);
        printf("offset %3d: entropy = %.3f bits\n", start, entropy);
    }

    return 0;
}

// Shannon entropy (in bits) of the character distribution in counts[],
// where length is the total number of characters counted.
float calculateEntropy(unsigned char counts[], int length) {
    float entropy = 0.0f;
    for (int i = 0; i < 256; i++) {
        if (counts[i] > 0) {
            float freq = (float)counts[i] / length;
            entropy -= freq * log2f(freq);
        }
    }
    return entropy;
}
</sxh>
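On Linux the program can be compiled with, for example, ''gcc entropy.c -o entropy -lm'' (the file name is arbitrary; ''-lm'' links the math library needed for ''log2f''). Windows containing only ordinary English text give relatively low entropy values, while windows that overlap the embedded random fragment show a noticeably higher value; this is essentially how scanners locate hard-coded secrets or compressed and encrypted regions in binaries.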