- Overview
- Installation
- Build
- Usage
- Development
- Test
- Algorithmic Analysis
- Hacks and Tricks
- License
- Todo
- Changelog
LZW is an archive format that utilizes power of LZW compression algorithm. LZW compression algorithm is a dictionary-based loseless algorithm. It's an old algorithm suitable for beginner to practice.
Internal algorithm processes byte data. So it's applicable to any file types, besides text file. Although it may not be able to achieve substantial compression rate for some file types that are already compressed efficiently, such as PDF files and MP4 files. It treats data as byte stream, unaware of the text-level pattern, which makes it less compression-efficient compared to other more advanced compression algorithms.
LZW compression algorithm is dynamic. It doesn't collect data statistics before hand. Instead, it learns the data pattern while conducting the compression, building a code table on the fly. The compression ratio approaches maximum after enough time. The algorithmic complexity is strictly linear to the size of the text. A more in-depth algorithmic analysis can be found in the following sections.
An alternative implementation that utilizes more efficient self-made customized bitmap
, dict
and set
to replace C++ builtin general-purpose std::bitset
, std::unordered_map
and std::set
can be found in the branch assignment
. Future enhancement includes customized hash
functions to replace builtin general-purpose std::hash
.
Pre-built binaries (Windows only) are available on the Releases page. Release version conforms to the semantic versioning convention.
Prerequisites: Git, Python3.6+, pip
, and a modern C++ compiler: g++
or cl
.
$ git clone https://github.com/MapleCCC/liblzw.git
$ cd liblzw
# You can optionally create a virtual environment for isolation purpose
$ python -m virtualenv .venv
$ source .venv/Scripts/activate
# Install basic build requirements
$ pip install -r requirements
# if g++ is available:
$ make
# Compiled executable is in build/lzw
# if speed is desirable:
$ make fast
# if on Windows platform and cl.exe is available:
$ mkdir build && cl /Fe"build/lzw.exe" lzw.cpp
# Compiled executable is in build/lzw.exe
# if speed is desirable:
# compile in release mode
$ mkdir build && cl /D "NDEBUG" /O2 /Fe"build/lzw.exe" lzw.cpp
Alternatively, you can build as a static library for embedding in other applications.
# Build static library
$ make build-lib
# Compiled library is in build/liblzw.a
# Build dynamic library
[...]
# Get Help Message
$ lzw --help
'''
Usage:
# Compression
$ lzw compress [-o|--output <ARCHIVE>] <FILES>...
# Decompression
$ lzw decompress <ARCHIVE>
'''
Contribution is welcome.
- When commiting new code, make sure to apply format specified in
.clang-format
config file. - Don't directly modify
README.md
. It's automatically generated from the templateREADME.raw.md
. If you want to add something to README, modify the templateREADME.raw.md
instead. - Also remember to add
scripts/pre-commit.py
to.git/hooks/pre-commit
as pre-commit hook script.
Prerequisites: Git, Python3.6+, pip
, npm
, and a modern C++ compiler: g++
or cl
.
# Clone the repository to local environment
$ git clone https://github.com/MapleCCC/liblzw.git
$ cd liblzw
# You can optionally create a virtual environment for isolation purpose
$ python -m virtualenv .venv
$ source .venv/Scripts/activate
# Install basic build requirements
$ pip install -r requirements.txt
# Install dev requirements
$ pip install -r requirements-dev.txt
$ npm install
# Install pre-commit hook script
$ make install-pre-commit-hook-script
The pre-commit hook script basically does four things:
-
Format staged C/C++ code
-
Format staged Python code
-
Transform
LaTeX
math equation inREADME.raw.md
to image url inREADME.md
-
Append content of
TODO.md
andCHANGELOG.md
toREADME.md
-
Generate TOC for
README.md
Besides relying on the pre-commit hook script, you can manually format code and transform math equations in README.md
.
# Format C/C++ and Python code under the working directory
$ make reformat
# Transform LaTeX math equations to image urls in README.md
$ make transform-eqn
The advantages of pre-commit hook script, compared to manual triggering scripts, is that it's convenient and un-disruptive, as it only introduces changes to staged files, instead of to all the files in the repo.
-
Unit test
$ make unit-test
-
Integrate test
Python3.6+, Pytest and Hypothesis are needed for integrate test
# You can optionally create a virtual environment for isolation purpose $ python -m virtualenv .venv $ source .venv/Scripts/activate $ python -m pip install -U pytest, hypothesis $ make integrate-test # It makes very thorough testing. It currently take about ten to twenty seconds.
Before talking about performance, we first conduct some complexity analysis. This help us understand the internal structure of the problem, and which and how various parts of the task contribute to the performance hotspots.
LZW algorithm treats data as binary stream, regardless of its text encoding. During the algorithm process, a running word is maintained. Before algorithm starts, the running word is empty. Every new byte consumed by the algorithm would be appended to the running word. The running word would be reset to some value under some circumstances.
For each byte consumed by the algorithm, deciding upon whether the running word is in the code dict, the algorithm has two branches to go.
If the new running word formed by appending the new byte at the end is in the code dict, the algorithm proceeds to branch, which basically do nothing, and continue to next iteration. On the other hand, if the new running word constituted by the new byte is not in the code dict, the algorithm goes to branch.
In the branch, three things happen. First, the code in the code dict corresponding to the old running word (before the new byte is appended) is emitted. Then the new running word is entered into the code dict as a new string, and assigned a new code, ready to use in the future. Lastly, the running word is reset to contain only the new byte. The algorithm then proceeds to the next iteration.
The cost model for these two branches are respectively:
Suppose the source text byte length is . Among the bytes consumed by the algorithm, there are bytes for whom the algorithm goes to branch A, and goes to branch for the other bytes.
For simplicity, we assume that , , , , and are fixed cost that don't vary upon different string sizes. This assumption is invalid/broken for large input, but that kind of case is very rare, so we are good with such hurtless simplification, as long as the strings are of reasonable lengths.
The total cost model of compression process can then be summarized as:
For input data that doesn't have many repeated byte pattern, is small compared to (i.e. ). The cost model approximates to:
If the underlying data structure implementation of code dict is hash table, then and are both operations. The total cost is then.
For input data that has many repeated byte pattern, is not negligible compared to . The largest possible comes from input consisting of single byte pattern. In such case, (The detailed deduction process is delegated to the readers). The total cost is still .
We see that the time complexity of compression algorithm is not affected by the byte repetition pattern. It's always linear time. This nice property holds true as long as the underlying implementation of the code dict scales in sublinear factor.
Contrary to the compression algorithm, LZW decompression algorithm consumes a stream of codes, decodes them on the fly, and emitted byte data. A string dict is built and maintained along with the process. Similar to the compression process, a running word is maintained during the process. The running word is empty at the beginning.
Deciding on whether the code can be found in the string dict, the algorithm has two branches to go.
If the code currently consumed can be found in the string dict, the algorithm goes to branch , where basically four things happen. First, the decoded string coresponding to the current code is lookuped in the string dict. The decoded string is emitted as byte data. Then a new string constructed by appending the first byte of the decoded string to the running word is added to the string dict, ready for future use. The running word is then reset to be the decoded string. The algorithm then goes on to the next iteration.
On the other hand, if the code currently consumed is not present in the string dict, a special situation happens and needs special care. Algorithm goes to branch , where three things happen. A new string constructed by appending the first byte of the running word to the running word is added to the string dict for future use. This new string is emitted as byte data. Then the running word is reset to be the new string. The algorithm then goes on to the next iteration.
The cost model for these two branches is then:
Suppose the code stream length is . Among the codes consumed by the algorithm, there are codes for whom the algorithm goes to branch A, and goes to branch for the other codes.
For simplicity, we assume that , , , , and are fixed cost that don't vary upon different string sizes. This assumption is invalid/broken for large input, but that kind of case is very rare, so we are good with such hurtless simplification, as long as the strings are of reasonable lengths.
The probability of going to branch is relatively rare, so the major cost comes from branch . The total cost model for the decompression algorithm is then:
It's the same with that of the compression algorithm! The total cost model for the decompression algorithm turns out to be identical to that of the compression algorithm! They are both linear . (under the precondition that the underlying implementation of string dict and code dict scales in sublinear factor)
The algorithmic detail of LZW compression algorithm doesn't have large parameter tuning space or rich variants for twisting. The only performance improvement we can try to make is concentrated on the underlying data structure implementation of the code dict and the string dict. The exported interface consists of dict.lookup
, dict.add
, and dict.membership_check
. Many data structure supports these three operations. When deciding which data structure to use, we want the cost to be as small as possible, because these three operations is invoked in nearly every iteration of the algorithm process. Three candidates are to be examined in detail:
-
Balanced Tree
Balanced tree has the advantages of being easy to implement, while having reasonable average performance bound for many operations. It's too mediocre for our demanding task.
-
Hash Table
Hash table, if implemented carefully, can exhibit excel performance in various frequent operations. The major drawback of hash table data structure is that it's space demanding, and its performance is sensitive to input data pattern. The good news is that we can tune the hashing function and collision resolution scheme to workaround skewed input data.
-
Trie
Trie is a specialized data structure especially good at handling text related data. Its major drawback is that correct and efficient implementation needs some attention and care.
We break the trade-off and decide to use hash table to implement the code dict and string dict.
For simplicity, we do not customize our own hashing functions. std::hash
is used instead. This could serve as a minor optimization candidate for future.
Ideally, the hash table implemented code dict and string dict exhibit time complexity for all dict.add
, dict.lookup
, and dict.membership_check
operations.
Instead of loading the whole file content into memory, which could lead to large memory footprint and hence resource pressure, our code adopts a more flexible IO method. We load input data in stream style. We read in data when needed and no too much more. This ensures that our code scales well with large input data size. Also our code is more memory space efficient.
A little trick we use here is to reserve sufficient space for the code dict and string dict at initialization. This effectively reduce the cost of resizing hash table, which is expensive, to say the least.
Deprecated.
Deprecated.
details
- Performance improvement. Profile to find performance hotspots. Use more advanced data structure to replace the naive hash table implementation.
- Investigate hash table statistics (e.g., load factor, hit/miss rate, collision count)
- Use more efficient hash algorithm for string data structure
- Use more efficient hash table design algorithm. (cf. TAOCP Vol.3)
- Consider use
.hpp
instead of.h
- Add
const
modifier to member function that doesn't change member variables. - Choose semantic exception type
- Use custom exceptions instead of lazily using
runtime_error
for every error places. - Inline member function
- Add
const
modifier to parameter - Add print to
cerr
before throwing exception - Use Pass-by-reference for some parameters
- Use command
nm
to check object files' symbols. - Add imported name as comment beside include directive
- Remove unused includes.
- Add unit test for all interface. Calculate test coverage with
gcov
. - Build shared/dynamic library
- Add
inline
keyword to some short functions - Acquire empirical statistics of hash table collision condition.
- Use optimization flag to compile before making performance benchmark.
- Remove dead codes, outdated comments. Cleanup.
- Consider separate template declaration and definition, refer to https://stackoverflow.com/questions/495021/why-can-templates-only-be-implemented-in-the-header-file
- Consider use bitmap to implement Bitarray, for improving performance.
- Use pytest-subtesthack.
- Check for memory leak. (e.g. unmatched new-delete, malloc-free) Or use automatic tool, like valgrind.
- Add [SCOPE] before every entry in TODO.md
- Project idea: script to auto-generate Makefile
- Turn out that
g++ -M
can do the job.
- Turn out that
- Restructure project layout.
#include "lzw/utils.h"
instead of#include "utils.h"
- Use
std::copy
,std::hash
,std::move
,std::memcpy
,std::map
,std::swap
. - Implement variants of LZW algorithm. E.g., code size, code table whether to reuse. Encode algorithm parameters to lzwfile and let decode logic automatically inspect it.
- The canonical copy-assignment operator is expected to perform no action on self-assignment. Ref: https://en.cppreference.com/w/cpp/language/operators
- Other overloaded operator also some of them expect no action performed on self-assignment. Ref: https://en.cppreference.com/w/cpp/language/operators
- Use immutable data structure to improve space efficiency.
- Use VSCode C/C++ extension's rename functionality to rename
Code
toint
andBytes
tostring
. - Try to use and master VSCode C/C++ extension's debug functionality.
- Compare speed of LZW.py and liblzw
- Wrap liblzw into CPython extension. Remove subprocess call overhead when testing using Python Hypothesis framework.
- Consider using Cython to fast prototype CPython extension.
- Update CMakeLists.txt
- Setup unit-test workflow. Include CI badge, about test suit passing status, and test coverage percentage.
- Try C++ test frameworks.
- Add GitHub badge about LOC (lines of code), or SLOC.
- Add GitHub badge about "this repo uses semantic versioning".
- State least supported C++ versions in README.
- Consider deploying auto-changelog-hook
- Consider align README.md big title to the center of the page.
- Add instructions to build dynamic library / shared library in README
- In pre-commit hook script, use black and isort as Python moduel instead of invoke as subprocess, to reduce subprocess call overhead.
- Use
std::bitset
to replace self-made bitmap data structure. - Reimplement
Bytes
andBitarray
data strcutre. Use more efficient implementation. - Use C++11 move semantic to improve performance in vector resize process.
- Put short inline-able member functions to header
- clangformat configuration about
using namespace
statement - Add CUHK academic honesty header
- Remeber to use flag
-static-libstdc++
when compiling C++ in MinGW Windows environment. - gcc's -I option "add the directory dir to the list of directories to be searched for header files. Directories named by -I are searched before the standard system include directories.", refer to: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Preprocessor-Options.html#Preprocessor-Options
details
- Multiple inclusion guard
- Template class can't separate declaration and definition into two files.
- Don't use
using namespace std;
in header files - Don't use
malloc
in C++. It can't correctly handle class initialization. - [str_dict] Add
str_cache
to reduce string membership check fromO(n)
toO(1)
- Use
.tpp
as extension for templates - Use
std::string
and specializedstd::hash<std::string>
function object to replace wrapper typeBytes
. - Use int and specialized
std::hash<int>
function object ot replace wrapper typeCode
. - Use
std::unordered_map
to replace self-madedict
data structure. - Use
std::set
to replace self-madeset
data structure. - Use
std::hash
- Outdo lzw_example_win.exe
- Out-of-source build
- Build static library: liblzw.a
- Improve CLI interface to be more user-friendly and intuitive.
- Write README. Elaborate on algorithmic details. Elaborate on analysis of computational complexity.
- Add git tag "submission milestone"
- Customize commit hook: process README.raw.md to README.md with all latex equation transformed to images.
- Add Travis CI pipeline. Automatically compile and test.
- Create script to replace LaTeX math text with image url. For render in GitHub.
- Release prebuild binary (static library, cli executable) and version-tagged source tarball.
- Push to GitHub upstream.
- Remove sensitive information. The first commit.
- Add unit test for bitmap data structure.
- Add GitHub badge about release version.
- Utilize Makefile syntax (order-only prerequisites) to specify that some directory need to pre-exist
- In pre-commit hook script, re-generate README TOC and insert.