Gautam Ahuja CS-1203: Data Structures Debayan Gupta December 22, 2022
- If a string has repeatition of characters, then note down the frequency and create a string with single character.
- This is the new compressed string
- Now, create a Huffman Tree using the frequencies of each character using min-heap data structure.
- When decompressing, note down the frequencies and produce the original srings.
- If the string has no repeatition, then this compression doesn't work well.
- A C program to compress the user input string using Huffman Coding
- The program takes a string as input and output the compressed string
- The program outputs the compressed string in bits format.
- The porgam also perform Run-Length_Encoding to prevent the problem of Huffman Coding when the input string is too long or have repeatition.
- The Huffman Coding is implemented using a min-heap data structure.
- The Huffman Code works as follows:
- The user is prompted to enter a string.
- The program then counts the frequency of each character in the string.
- The program then creates a Huffman Tree using the frequencies of each character using min-heap data structure.
- The program then creates a Huffman Code for each character using the Huffman Tree.
- The program then encodes the string using RL-encoding and Huffman Coding.
- The program then outputs the encoded string in bits format.
- The program then decodes the encoded string and outputs the decoded string.
- The program outputs the compression ratio.
- Min-Heap = Heapify, Insert, ExtractMin, Delete, DecreaseKey
- Huffman Tree
- Huffman Code
- Strings, pointers, arrays, structs, etc. = Pointers to Pointers (2D arrays/Strings), Pointers to structs, etc.
- Run-Length Encoding
- String Must be less than 256 characters
- String should not contain any numerical characters
- Huffman Coding is a lossless data compression algorithm.
- Huffman Coding is a prefix code.
- The above compression algorithm is most effective for data that is not uniformly distributed. And if there are pattens of repeatition in the data.
- For example, if the input string is "aaaaa", then the compression ratio will be 1/5 = 20%.
- For unique strings (no repeatition), the algorithm might produce a larger string than the original string.
- If: Compression Ratio < 1, then the compression algorithm is effective.
- If: Compression Ratio > 1, then the compression algorithm is not effective.
- Intermixing of concepts mentioned below to create a compression algorithm (Huffman Coding + Run-Length Encoding + Min-Heap)
- Proper implementation of Heaps and Concepts of heaps (Heapify, Insert, ExtractMin, Delete, DecreaseKey)
- Proper implementation of Huffman Tree and Huffman Code
- Proper implementation of Run-Length Encoding
- Proper implementation of Strings, pointers, arrays, structs, etc. = Pointers to Pointers (2D arrays/Strings), Pointers to structs, etc.
- Und3rstanding of the working of the compression algorithm and encoding/decoding algorithms.
- The above code has been tested for approx 50 strings with repeatition.
- The above code has been tested for approx 50 strings with no repeatition.
- The algorithm is most effective for strings with repeatition.
- The algorithm and all its concepts (min-heap, Huffman Tree, Huffman Code, Run-Length Encoding) works for all inputs.
- The code is written in C language and can be compiled on gcc compiler.
- The code can be run on any IDE or on the terminal. Command to run the code on terminal: gcc -o Output_File Gautam_Ahuja_Huffman_Code_Compression.c ./Output_File
- The user is prompted to enter a string.
- Input a string which is less than 256 characters and does not contain any numerical characters.
- The program then outputs the RL-encoded string, string in bits format, compressed string in bits format.
- The program prints the frequencies of each character in the string
- It also outputs the decompressed string along with decoded string.
- It also outputs the compression ratio.