Improvements and future direction of the compression algorithm #10
Obviously there is a tradeoff to be made between the speed of compression and the size of the compressed data. For instance, quite recently @lingeringwillx made a "faster compressorizer" by, among other things, lowering the compression ratio of the algorithm used. So the question arises: what is a "reasonable speed" for compression? Different people will have different tolerances for what counts as reasonable performance, but something around that speed could work as a default setting.
As mentioned above, we should probably have a configuration option to control the behavior of the algorithm.
I'm personally not at all worried about the potential deprecation aspect of providing fine control, but I think it would be nice to provide a set of reasonable defaults.
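For what it's worth, here's a rough sketch of what such an options struct with defaults could look like; the names and values are hypothetical placeholders, not anything currently in the library:

```rust
// Hypothetical configuration sketch; names and numbers are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct CompressionOptions {
    /// How many match candidates to examine per position; higher values
    /// trade speed for a better compression ratio.
    pub max_chain_length: usize,
    /// Matches shorter than this are emitted as literals.
    pub min_match_length: usize,
    /// Stop searching early once a match of at least this length is found.
    pub good_enough_length: usize,
}

impl Default for CompressionOptions {
    fn default() -> Self {
        // Placeholder values picked for illustration, not tuned defaults.
        CompressionOptions {
            max_chain_length: 32,
            min_match_length: 3,
            good_enough_length: 128,
        }
    }
}
```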
In addition to the libraries listed in the main README, there is also a set of implementations by, again, @lingeringwillx: https://github.com/lingeringwillx/CrappySims2Compression/tree/main/practice
Maybe it would be possible to find out which algorithms the compressorizer or SimPE currently use.
There might be some good (possibly Rust) DEFLATE or similar libraries that have prefix-search algorithms or other machinery that could be adapted to this library.
Similar to the above, there are good data structure libraries already available in Rust that are made for this use case; think of tries, hashing/hash maps, circular buffers and more.
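As an illustration (hypothetical, not the library's current code), a multi-map match index over 3-byte prefixes can be built from the standard collections alone:

```rust
// Sketch of a multi-map match index: every 3-byte prefix maps to the list of
// positions where it occurs, which a compressor can scan for previous matches.
use std::collections::HashMap;

fn build_index(data: &[u8]) -> HashMap<[u8; 3], Vec<usize>> {
    let mut index: HashMap<[u8; 3], Vec<usize>> = HashMap::new();
    for (pos, window) in data.windows(3).enumerate() {
        // `windows(3)` yields &[u8] slices of length 3; convert to a fixed-size key.
        let key = [window[0], window[1], window[2]];
        index.entry(key).or_default().push(pos);
    }
    index
}
```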
This is an interesting one I haven't seen in other DBPF libraries: since we're only ever working on full chunks of data, it's possible to do several things:
I'm slightly stumped on this one; other libraries seem to take the approach of writing an implementation and accepting whatever results come out in terms of size, but I think it would be nice to have actual regression tests that integrate with the benchmark suite, if at all possible. Maybe it would work to write a custom criterion measurement.
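As a rough sketch of the idea, assuming criterion is the benchmark harness (the `compress` function below is a stand-in, not the library's actual API), a benchmark could report the compressed size alongside the timing so size regressions show up between runs:

```rust
// Minimal criterion benchmark sketch that also records compressed sizes.
use criterion::{criterion_group, criterion_main, Criterion};

fn compress(data: &[u8]) -> Vec<u8> {
    // Placeholder: call the real compression entry point here.
    data.to_vec()
}

fn size_and_speed(c: &mut Criterion) {
    let input = vec![0u8; 64 * 1024];

    // Report the compressed size once, outside the timed loop, so the number
    // can be diffed between runs as a crude size-regression check.
    println!("compressed size: {} bytes", compress(&input).len());

    c.bench_function("compress 64 KiB zeros", |b| {
        b.iter(|| compress(&input))
    });
}

criterion_group!(benches, size_and_speed);
criterion_main!(benches);
```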
I've been able to get some quick results by adding a print statement to the benchmarks.
Random Input Data Increasing:
Repeating Input Data Increasing:
All Zero Input Data Increasing:
Files:
If you're interested in the versions of the algorithm that are out there, I compiled a list of all of them (or at least the ones that include both compression and decompression) as a sort of literature review.

The compressorizer uses the algorithm written by Ben Rudiak-Gould (benrg), which he dropped alongside some other low-level programs before he disappeared off the web. The algorithm is entirely based on zlib. It took me a year to understand it, but it's using a hash chain to keep track of the data, or rather an efficient version of the hash chain that's commonly used in compression libraries: it stores the whole chain in just two fixed-size arrays. The hashing function is a rolling hash that updates the hash value incrementally as it progresses over the file rather than recomputing it for every position.

In my experience, both the rolling hash and lazy matching are a pain to implement, and they don't improve performance by much. On the other hand, hash chaining is a useful technique.

SimPE's compression code is crap. Don't bother with it. It was written all the way back in 2005, before we got better implementations. I think it even corrupts some files.
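To illustrate the hash-chain idea with the two fixed-size arrays (a sketch of the general zlib-style technique, not benrg's actual code; the constants and hash function here are made up):

```rust
// Sketch of a zlib-style hash-chain match finder using two fixed-size arrays.
const WINDOW_SIZE: usize = 1 << 15; // maximum match distance
const HASH_SIZE: usize = 1 << 13;
const MIN_MATCH: usize = 3;

struct MatchFinder {
    head: Vec<i32>, // hash bucket -> most recent position with that hash
    prev: Vec<i32>, // position (mod WINDOW_SIZE) -> earlier position with the same hash
}

impl MatchFinder {
    fn new() -> Self {
        MatchFinder { head: vec![-1; HASH_SIZE], prev: vec![-1; WINDOW_SIZE] }
    }

    fn hash(data: &[u8], pos: usize) -> usize {
        // Simple 3-byte hash; a real implementation would roll this value forward.
        let h = (data[pos] as usize) << 16 | (data[pos + 1] as usize) << 8 | data[pos + 2] as usize;
        h.wrapping_mul(2654435761) % HASH_SIZE
    }

    /// Insert `pos` into the chain and return the best (distance, length) match
    /// found within a fixed number of chain steps.
    fn insert_and_find(&mut self, data: &[u8], pos: usize) -> Option<(usize, usize)> {
        if pos + MIN_MATCH > data.len() {
            return None;
        }
        let h = Self::hash(data, pos);
        let mut candidate = self.head[h];
        self.prev[pos % WINDOW_SIZE] = candidate;
        self.head[h] = pos as i32;

        let mut best: Option<(usize, usize)> = None;
        let mut steps = 32; // chain search limit, trades speed for ratio
        while candidate >= 0 && steps > 0 {
            let cpos = candidate as usize;
            if pos - cpos > WINDOW_SIZE {
                break; // candidate fell out of the window
            }
            // Count how many bytes match between the candidate and the current position.
            let len = data[cpos..].iter().zip(&data[pos..]).take_while(|(a, b)| a == b).count();
            if len >= MIN_MATCH && best.map_or(true, |(_, l)| len > l) {
                best = Some((pos - cpos, len));
            }
            candidate = self.prev[cpos % WINDOW_SIZE];
            steps -= 1;
        }
        best
    }
}
```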
I wanted to get a better overview, so here are the algorithms in that list sorted by implementation type:
None of these libraries use dynamic programming/tree search like in LZMA, so I might try to take a crack at that, but otherwise the best algorithm still seems to be the one devised by benrg.
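For reference, a dynamic-programming ("optimal parse") pass over precomputed matches could look roughly like this; `find_matches` is a hypothetical placeholder and the byte costs are a simplification of the real encoding:

```rust
// Sketch of optimal-parse match selection via dynamic programming.
fn find_matches(_data: &[u8], _pos: usize) -> Vec<(usize, usize)> {
    // Placeholder: return (distance, length) candidates for this position,
    // e.g. from a hash-chain match finder.
    Vec::new()
}

/// Returns, for every position, the cheapest step to take towards the end of
/// `data`: 1 means emit a literal, anything larger is a match length.
fn optimal_parse(data: &[u8]) -> Vec<usize> {
    let n = data.len();
    let mut cost = vec![u32::MAX; n + 1];
    let mut step = vec![1usize; n + 1];
    cost[n] = 0;

    // Walk backwards: the best cost at `i` depends only on later positions.
    for i in (0..n).rev() {
        // Option 1: emit a literal (costs roughly 1 byte).
        cost[i] = cost[i + 1].saturating_add(1);
        step[i] = 1;

        // Option 2: emit one of the candidate matches (roughly 3 bytes each here).
        for (_dist, len) in find_matches(data, i) {
            let c = cost[i + len].saturating_add(3);
            if c < cost[i] {
                cost[i] = c;
                step[i] = len;
            }
        }
    }
    step
}
```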
Here are the compression ratios of a fairly simple multi-map implementation adapted from the existing code:
Sorry, I just want to confirm that I read the thread and appreciate the effort and research. I've been extremely busy in my personal life and sidetracked with other projects in the free time I do have.

Imo, optimizing for compressed size would be the smart goal here as long as it's not cripplingly slow; compression is an infrequent process compared to decompression. Multiple speeds or algorithms are also an option if fast speeds are really desired.

I think it's pretty obvious that I'm not afraid of large API-breaking changes. There's a fairly narrow band of consumers, and considering that the current implementation is "ideal", in that it's about as bug-free and hardened as I ever expect it to be, I don't feel bad about "if you don't want to migrate, just stay on the old version."
There is a limit to how small the compressed size can get due to the limitations of the compression algorithm itself, and when you try to exceed that, you end up wasting a ton of CPU cycles only to eliminate a few more bytes. This is why I tend not to bother with the compression ratio much and focus on the efficiency instead.
Related to #9
While implementing file writing in my DBPF library, I've noticed that file sizes tend to be somewhat larger than the original files (most likely created with SimPE).
Rather than blindly making changes or implementing another algorithm, I'd like to take a moment with this issue to explore what the options are in terms of algorithms and what the end goals are of the compression part of the library as a whole.
So, to start:
I'll try to expand on each of these questions a little in the comments.