Table of Contents
LZW coding (compression)
LZW (Lempel-Ziv-Welch) is a lossless data compression algorithm that builds a dictionary of substrings during encoding. It compresses data by replacing repeated occurrences of data (patterns) with shorter codes that represent those patterns. The basic idea is to replace substrings with shorter codes as the compression progresses, so the output size becomes smaller.
Key Steps in LZW Compression
1. Initialize a dictionary with all single characters in the input data (e.g., all ASCII characters if the input is text).
2. Scan the input string character by character, building longer and longer substrings that exist in the dictionary.
3. Add new substrings to the dictionary and output the dictionary index for the current substring when it can no longer be extended by the next character in the input.
Example of LZW Compression
Let’s take the input string:
ABABABA
We will step through the LZW compression process.
Step 1: Initialize the Dictionary
At the beginning, the dictionary contains all single characters from the input string.
For this example, let’s assume we use indices 0–255 for the ASCII table, so initially, our dictionary contains:
Index | Character |
---|---|
65 | A |
66 | B |
Step 2: Start Encoding
We start reading the input string character by character and follow these steps:
1. Current substring: `“A”`
- It exists in the dictionary (index 65).
- Read the next character: `“B”`.
- The new substring `“AB”` does not exist in the dictionary.
- Output the code for `“A”` (65) and add `“AB”` to the dictionary with the next available index (256).
2. Current substring: `“B”`
- It exists in the dictionary (index 66).
- Read the next character: `“A”`.
- The new substring `“BA”` does not exist in the dictionary.
- Output the code for `“B”` (66) and add `“BA”` to the dictionary with the next available index (257).
3. Current substring: `“A”`
- It exists in the dictionary (index 65).
- Read the next character: `“B”`.
- The substring `“AB”` already exists in the dictionary (index 256).
- Read the next character: `“A”`, forming `“ABA”`.
- The new substring `“ABA”` does not exist in the dictionary.
- Output the code for `“AB”` (256) and add `“ABA”` to the dictionary with the next available index (258).
4. Current substring: `“A”`
- It exists in the dictionary (index 65).
- Read the next character: `“B”`.
- The substring `“AB”` already exists in the dictionary (index 256).
- Read the next character: `“A”`, forming `“ABA”`.
- No more characters left to read.
- Output the code for `“ABA”` (258).
Final Dictionary After Compression
The final dictionary looks like this:
Index | Character |
---|---|
65 | A |
66 | B |
256 | AB |
257 | BA |
258 | ABA |
Encoded Output
The final output (in terms of dictionary indices) is:
65 66 256 258
This represents the compressed version of the original string `“ABABABA”`.
—
Decoding LZW
To decode an LZW-compressed string, you use the same dictionary-based approach but in reverse. The decoder reconstructs the dictionary entries as it processes the encoded output, using the indices to retrieve the corresponding substrings.
Let’s go through the decoding process of the encoded sequence 65 66 256 65
:
1.) 65: Look up in the dictionary. It maps to `“A”`.
- Decoded string so far: `“A”`
2.) 66: Look up in the dictionary. It maps to `“B”`.
- Decoded string so far: `“AB”`
3.) 256: Look up in the dictionary. It maps to `“AB”`.
- Decoded string so far: `“ABAB”`
4.) 258: Look up in the dictionary. It maps to `“ABA”`.
- Decoded string so far: `“ABABABA”`
The original string ABABABA
is reconstructed from the compressed codes.
—
Summary of LZW Compression
1. Dictionary Initialization: The dictionary starts with all individual characters.
2. Scanning Input: Input data is scanned character by character, grouping together characters to form substrings.
3. Dictionary Expansion: New substrings are added to the dictionary as the input is processed.
4. Output: The dictionary indices corresponding to the substrings are output, resulting in a compressed sequence of codes.
LZW is widely used in formats like GIF images and TIFF files for lossless compression. It is particularly effective for data that contains repeated patterns or symbols.
C implementation of LZW encoding:
#include <stdio.h> #include <string.h> #define MAX_sxhS 100 char sxhTable[MAX_sxhS][10] = {"a","b","c","d","e"}; int tableElements = 5; int isInsxhTable(char *s) { for(int i = 0; i < tableElements; i++) { if(strcmp(sxhTable[i], s) == 0) { return i; } } return -1; } int main() { char text[] = "dabbacdabbacdabbacdabbacdee"; char *p = text; int bufferEnd = 1; int lastFoundIndex = -1; while(*p != '\0') { char subStr[10]; strncpy(subStr, p, bufferEnd); subStr[bufferEnd] = '\0'; int foundIndex = isInsxhTable(subStr); if(foundIndex != -1) { bufferEnd++; lastFoundIndex = foundIndex; continue; } p += strlen(subStr) - 1; bufferEnd = 1; strcpy(sxhTable[tableElements++], subStr); printf("%d,", lastFoundIndex + 1); } }