LZW (Lempel-Ziv-Welch) is a lossless data compression algorithm that builds a dictionary of substrings during encoding. It compresses data by replacing repeated occurrences of data (patterns) with shorter codes that represent those patterns. The basic idea is to replace substrings with shorter codes as the compression progresses, so the output size becomes smaller.
1. Initialize a dictionary with all single characters in the input data (e.g., all ASCII characters if the input is text).
2. Scan the input string character by character, building longer and longer substrings that exist in the dictionary.
3. Add new substrings to the dictionary and output the dictionary index for the current substring when it can no longer be extended by the next character in the input.
Let’s take the input string:
ABABABA
We will step through the LZW compression process.
At the beginning, the dictionary contains all single characters from the input string.
For this example, let’s assume we use indices 0–255 for the ASCII table, so initially, our dictionary contains:
Index | Character |
---|---|
65 | A |
66 | B |
We start reading the input string character by character and follow these steps:
1. Current substring: `“A”`
2. Current substring: `“B”`
3. Current substring: `“A”`
4. Current substring: `“A”`
The final dictionary looks like this:
Index | Character |
---|---|
65 | A |
66 | B |
256 | AB |
257 | BA |
258 | ABA |
The final output (in terms of dictionary indices) is:
65 66 256 258
This represents the compressed version of the original string `“ABABABA”`.
—
To decode an LZW-compressed string, you use the same dictionary-based approach but in reverse. The decoder reconstructs the dictionary entries as it processes the encoded output, using the indices to retrieve the corresponding substrings.
Let’s go through the decoding process of the encoded sequence 65 66 256 65
:
1.) 65: Look up in the dictionary. It maps to `“A”`.
2.) 66: Look up in the dictionary. It maps to `“B”`.
3.) 256: Look up in the dictionary. It maps to `“AB”`.
4.) 258: Look up in the dictionary. It maps to `“ABA”`.
The original string ABABABA
is reconstructed from the compressed codes.
—
1. Dictionary Initialization: The dictionary starts with all individual characters.
2. Scanning Input: Input data is scanned character by character, grouping together characters to form substrings.
3. Dictionary Expansion: New substrings are added to the dictionary as the input is processed.
4. Output: The dictionary indices corresponding to the substrings are output, resulting in a compressed sequence of codes.
LZW is widely used in formats like GIF images and TIFF files for lossless compression. It is particularly effective for data that contains repeated patterns or symbols.
C implementation of LZW encoding:
#include <stdio.h> #include <string.h> #define MAX_sxhS 100 char sxhTable[MAX_sxhS][10] = {"a","b","c","d","e"}; int tableElements = 5; int isInsxhTable(char *s) { for(int i = 0; i < tableElements; i++) { if(strcmp(sxhTable[i], s) == 0) { return i; } } return -1; } int main() { char text[] = "dabbacdabbacdabbacdabbacdee"; char *p = text; int bufferEnd = 1; int lastFoundIndex = -1; while(*p != '\0') { char subStr[10]; strncpy(subStr, p, bufferEnd); subStr[bufferEnd] = '\0'; int foundIndex = isInsxhTable(subStr); if(foundIndex != -1) { bufferEnd++; lastFoundIndex = foundIndex; continue; } p += strlen(subStr) - 1; bufferEnd = 1; strcpy(sxhTable[tableElements++], subStr); printf("%d,", lastFoundIndex + 1); } }