===== LZW coding (compression) =====
LZW (Lempel-Ziv-Welch) is a lossless data compression algorithm that builds a dictionary of substrings during encoding. It compresses data by replacing repeated occurrences of data (patterns) with shorter codes that represent those patterns. The basic idea is to replace substrings with shorter codes as the compression progresses, so the output size becomes smaller.
==== Key Steps in LZW Compression ====
1. **Initialize a dictionary** with all single characters in the input data (e.g., all ASCII characters if the input is text).
2. **Scan the input** string character by character, building longer and longer substrings that exist in the dictionary.
3. **Add new substrings** to the dictionary and output the dictionary index for the current substring when it can no longer be extended by the next character in the input.
----
=== Example of LZW Compression ===
Let’s take the input string:
ABABABA
We will step through the LZW compression process.
==== Step 1: Initialize the Dictionary ====
At the beginning, the dictionary contains all single characters from the input string.
For this example, let’s assume we use indices 0–255 for the ASCII table, so initially, our dictionary contains:
^Index^Character^
| 65 | A |
| 66 | B |
==== Step 2: Start Encoding ====
We start reading the input string character by character and follow these steps:
1. **Current substring**: `"A"`
- It exists in the dictionary (index 65).
- Read the next character: `"B"`.
- The new substring `"AB"` does not exist in the dictionary.
- **Output** the code for `"A"` (65) and add `"AB"` to the dictionary with the next available index (256).
2. **Current substring**: `"B"`
- It exists in the dictionary (index 66).
- Read the next character: `"A"`.
- The new substring `"BA"` does not exist in the dictionary.
- **Output** the code for `"B"` (66) and add `"BA"` to the dictionary with the next available index (257).
3. **Current substring**: `"A"`
- It exists in the dictionary (index 65).
- Read the next character: `"B"`.
- The substring `"AB"` **already exists** in the dictionary (index 256).
- Read the next character: `"A"`, forming `"ABA"`.
- The new substring `"ABA"` does not exist in the dictionary.
- **Output** the code for `"AB"` (256) and add `"ABA"` to the dictionary with the next available index (258).
4. **Current substring**: `"A"`
- It exists in the dictionary (index 65).
- Read the next character: `"B"`.
- The substring `"AB"` **already exists** in the dictionary (index 256).
- Read the next character: `"A"`, forming `"ABA"`.
- No more characters left to read.
- **Output** the code for `"ABA"` (258).
==== Final Dictionary After Compression ====
The final dictionary looks like this:
^Index^Character^
| 65 | A |
| 66 | B |
| 256 | AB |
| 257 | BA |
| 258 | ABA |
==== Encoded Output ====
The final output (in terms of dictionary indices) is:
65 66 256 258
This represents the compressed version of the original string `"ABABABA"`.
---
==== Decoding LZW ====
To decode an LZW-compressed string, you use the same dictionary-based approach but in reverse. The decoder reconstructs the dictionary entries as it processes the encoded output, using the indices to retrieve the corresponding substrings.
Let’s go through the decoding process of the encoded sequence ''65 66 256 65'':
1.) **65**: Look up in the dictionary. It maps to `"A"`.
* **Decoded string so far**: `"A"`
2.) **66**: Look up in the dictionary. It maps to `"B"`.
* **Decoded string so far**: `"AB"`
3.) **256**: Look up in the dictionary. It maps to `"AB"`.
* **Decoded string so far**: `"ABAB"`
4.) **258**: Look up in the dictionary. It maps to `"ABA"`.
* **Decoded string so far**: `"ABABABA"`
The original string ''ABABABA'' is reconstructed from the compressed codes.
---
==== Summary of LZW Compression ====
1. **Dictionary Initialization**: The dictionary starts with all individual characters.
2. **Scanning Input**: Input data is scanned character by character, grouping together characters to form substrings.
3. **Dictionary Expansion**: New substrings are added to the dictionary as the input is processed.
4. **Output**: The dictionary indices corresponding to the substrings are output, resulting in a compressed sequence of codes.
LZW is widely used in formats like **GIF** images and **TIFF** files for lossless compression. It is particularly effective for data that contains repeated patterns or symbols.
C implementation of LZW encoding:
#include
#include
#define MAX_sxhS 100
char sxhTable[MAX_sxhS][10] = {"a","b","c","d","e"};
int tableElements = 5;
int isInsxhTable(char *s)
{
for(int i = 0; i < tableElements; i++)
{
if(strcmp(sxhTable[i], s) == 0)
{
return i;
}
}
return -1;
}
int main()
{
char text[] = "dabbacdabbacdabbacdabbacdee";
char *p = text;
int bufferEnd = 1;
int lastFoundIndex = -1;
while(*p != '\0')
{
char subStr[10];
strncpy(subStr, p, bufferEnd);
subStr[bufferEnd] = '\0';
int foundIndex = isInsxhTable(subStr);
if(foundIndex != -1)
{
bufferEnd++;
lastFoundIndex = foundIndex;
continue;
}
p += strlen(subStr) - 1;
bufferEnd = 1;
strcpy(sxhTable[tableElements++], subStr);
printf("%d,", lastFoundIndex + 1);
}
}