Improvements and future direction of the compression algorithm #10

Open
chieltbest opened this issue Aug 8, 2024 · 13 comments

@chieltbest
Contributor

Related to #9

While implementing file writing in my DBPF library I've noticed that the file sizes it produces tend to be somewhat larger than the original files (most likely created with SimPE).
Rather than blindly making changes or implementing another algorithm, I'd like to take a moment with this issue to explore what the options are in terms of algorithms, and what the end goals are for the compression part of the library as a whole.

So, to start:

  • What is the eventual goal of this library in terms of the speed/size tradeoff?
  • Should there be configuration options for things like the maximum window size, and what should they look like (should a configuration struct be portable across algorithms, or do we just accept breaking changes)?
  • What algorithms currently exist for compressing RefPack, and which is most suited for this library?
  • Are there any modern algorithms for LZ77/LZSS that deliver better performance, and can they be adapted/rewritten for use in RefPack?
  • What libraries exist in the rust ecosystem that could be used?
  • Since the entire data is always known before starting compression, is it possible to take advantage of that?
  • How do we measure the improvement in compression ratio in a reliable way?

I'll try to expand on each of these questions a little in the comments

@chieltbest
Contributor Author

What is the eventual goal of this library in terms of the speed/size tradeoff?

Obviously there is a tradeoff to be made between the speed of compression and the size of the compressed data; for instance, quite recently @lingeringwillx made a "faster compressorizer" [1][2] by, among other things, lowering the compression ratio of the algorithm used.

So the question arises: what is a "reasonable speed" for compression?
Let's take a person playing The Sims 2 with a downloads folder of 20-30GB [3] as a baseline. Let's also assume they have a reasonably modern 8-core CPU to do the compressing with, and that they're willing to wait a maximum of around 1 hour to compress all their files:
1 hour is 3600 seconds, so 20GB / 3600s = ~5.5MB/s
5.5MB/s / 8 cores = ~0.7MB/s per core
This would be the minimum speed that could possibly be called "reasonable", and I would personally consider a speed 10x that to be "good".

Obviously different people have different tolerances for "reasonable" performance, but somewhere around that number could work as a default setting.

Footnotes

  1. https://modthesims.info/d/679366/faster-compressorizer-reupload-beta.html

  2. The Compressorizer is a well-used and longstanding program to re-compress entire Sims 2 downloads folders

  3. according to sources this seems to be the size at which The Sims 2 starts to not function properly, so it's kind of a soft limit in a way

@chieltbest
Contributor Author

Should there be configuration options for things like the maximum window size, and what should they look like (should a configuration struct be portable across algorithms, or do we just accept breaking changes)?

As mentioned above, we should probably have configuration options to control the behavior of the algorithm.
This could be done in a few ways:

  • The direct way, meaning all internal algorithm options are exposed and the program author has to make decisions about what the user-facing options are.
  • The zlib way, meaning there is a single "0-9" compression strength option that provides reasonable defaults for each level.
  • A hybrid of the two: an enum that is either a coarse 0-9 level or the full set of direct options.

I'm personally not at all worried about the potential deprecation aspect of providing fine control, but I think it would be nice to give some set of reasonable defaults.
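
For illustration, here is a rough sketch of what the hybrid option could look like. Every name and number in it is hypothetical, not something that exists in the crate today:

```rust
/// Hypothetical compression settings; all names and numbers are illustrative.
pub enum CompressionOptions {
    /// zlib-style preset: 0 = fastest, 9 = smallest output.
    Level(u8),
    /// Direct control over the internal knobs.
    Custom {
        /// How far back matches may reference, bounded by the format's maximum copy offset.
        max_window_size: u32,
        /// Stop searching a hash chain/bucket after this many candidates.
        max_chain_length: u32,
        /// Defer a match if the next position yields a better one.
        lazy_matching: bool,
    },
}

impl CompressionOptions {
    /// Map a preset onto concrete knob values; the numbers are placeholders.
    pub fn resolve(self) -> (u32, u32, bool) {
        match self {
            CompressionOptions::Level(level) => {
                let level = u32::from(level.min(9));
                (
                    1 << (10 + level.min(7)), // window grows with the level
                    8 << level,               // search effort grows with the level
                    level >= 6,               // lazy matching only at higher levels
                )
            }
            CompressionOptions::Custom {
                max_window_size,
                max_chain_length,
                lazy_matching,
            } => (max_window_size, max_chain_length, lazy_matching),
        }
    }
}
```

The nice property of the hybrid is that `Level(n)` can keep working across releases even if the `Custom` knobs change in breaking ways.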

@chieltbest
Contributor Author

What algorithms currently exist for compressing RefPack, and which is most suited for this library?

In addition to the libraries listed in the main README, there is also a set of implementations by, again, @lingeringwillx: https://github.com/lingeringwillx/CrappySims2Compression/tree/main/practice
The library is currently using an algorithm that is pretty much identical to the "map-single" implementation on that page.

Maybe it would be possible to find out what algorithms the compressorizer or SimPE currently use.

@chieltbest
Contributor Author

Are there any modern algorithms for LZ77/LZSS that deliver better performance, and can they be adapted/rewritten for use in RefPack?

There might be some good (possibly Rust) DEFLATE or similar libraries that have prefix-search algorithms or other machinery that could be adapted for use in this library.

@chieltbest
Contributor Author

What libraries exist in the rust ecosystem that could be used?

Similar to the above, there are good data-structure libraries already available in Rust that are made for this kind of use case: think of tries, hashing/hash maps, circular buffers, and more.
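
As a very rough illustration of the kind of structure involved (essentially the "map multi" idea, using only the standard library; the 3-byte minimum match length is an assumption, and none of this is the crate's actual code):

```rust
use std::collections::HashMap;

/// Index every 3-byte prefix of the input so that match candidates for the
/// current position can be looked up directly instead of scanned linearly.
/// A real implementation would also drop positions that fall out of the window.
fn build_prefix_index(data: &[u8]) -> HashMap<[u8; 3], Vec<usize>> {
    let mut index: HashMap<[u8; 3], Vec<usize>> = HashMap::new();
    for pos in 0..data.len().saturating_sub(2) {
        let key = [data[pos], data[pos + 1], data[pos + 2]];
        index.entry(key).or_default().push(pos);
    }
    index
}
```

Candidates for the current position are then `index.get(&prefix)`, filtered to offsets still inside the window; a trie or ring-buffer-backed structure from an external crate could replace the `HashMap` if profiling shows it to be the bottleneck.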

@chieltbest
Contributor Author

Since the entire data is always known before starting compression, is it possible to take advantage of that?

This is an interesting one I haven't seen in other DBPF libraries: since we're only ever working on full chunks of data it's possible to do several things:

  • SIMD vectorize hashing of prefixes (see the sketch after this list)
  • start compression at the largest repetition instead of at the beginning (resulting in a better compression ratio)
  • possibly optimize in other ways
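
On the first point, a hypothetical sketch (not tied to the current code): because the whole buffer is available up front, the prefix hash for every position can be computed in one tight pass, which the compiler can auto-vectorize or which could be rewritten with explicit SIMD.

```rust
/// Hash every 3-byte window of the input in a single pass. The hash function
/// and the 16-bit output size are arbitrary choices for illustration.
fn hash_all_prefixes(data: &[u8]) -> Vec<u16> {
    data.windows(3)
        .map(|w| {
            // Simple mix of the three bytes; a real compressor would tune this.
            let h = ((w[0] as u32) << 10) ^ ((w[1] as u32) << 5) ^ (w[2] as u32);
            (h & 0x7FFF) as u16
        })
        .collect()
}
```

The resulting table can then feed whichever match-search structure ends up being used.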

@chieltbest
Contributor Author

chieltbest commented Aug 8, 2024

How do we measure the improvement in compression ratio in a reliable way?

I'm slightly stumped on this one; other libraries seem to take the approach of writing an implementation and accepting whatever results come out in terms of size, but I think it would be nice to have actual regression tests that integrate with the benchmark suite if at all possible.

Maybe it would work to write a custom criterion Measurement so that we can reuse the entire benchmark suite.
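
A custom criterion Measurement would also need a matching ValueFormatter, so in the meantime a plain regression test that pins the ratio on fixed inputs might be the simpler option. A minimal sketch, where the `compress` closure stands in for the library's real entry point and the baselines are values recorded from a known-good run:

```rust
/// Fail if `compress` produces output worse than `baseline_ratio` (plus a
/// small tolerance) on `input`. Intended to run on fixed corpora in CI.
fn assert_ratio_not_regressed(
    compress: impl Fn(&[u8]) -> Vec<u8>,
    input: &[u8],
    baseline_ratio: f64,
) {
    let ratio = compress(input).len() as f64 / input.len() as f64;
    // Small tolerance so harmless changes don't fail the build.
    let allowed = baseline_ratio * 1.01;
    assert!(
        ratio <= allowed,
        "compression ratio regressed: {ratio:.6} > {allowed:.6}"
    );
}
```

The baseline ratios could simply be the numbers from the results posted below, checked in next to the benchmark corpora.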

@chieltbest
Contributor Author

chieltbest commented Aug 8, 2024

I've been able to get some quick results by adding a print statement to the benchmarks

Random Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 8174 | 1.0096343873517786 |
| 16192 | 16342 | 1.009263833992095 |
| 32384 | 32678 | 1.009078557312253 |
| 64768 | 65350 | 1.008985918972332 |
| 129536 | 130694 | 1.0089395998023716 |
| 259072 | 261382 | 1.0089164402173914 |

Repeating Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 296 | 0.036561264822134384 |
| 16192 | 328 | 0.020256916996047432 |
| 32384 | 392 | 0.012104743083003952 |
| 64768 | 516 | 0.007966897233201582 |
| 129536 | 772 | 0.005959733201581028 |
| 259072 | 1276 | 0.004925271739130435 |

All Zero Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 38 | 0.004693675889328063 |
| 16192 | 70 | 0.004323122529644269 |
| 32384 | 134 | 0.004137845849802372 |
| 64768 | 262 | 0.004045207509881423 |
| 129536 | 514 | 0.003968008893280632 |
| 259072 | 1018 | 0.003929409584980237 |

Files:

| Name | Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- | --- |
| dickens | 10192446 | 6037010 | 0.5923023776628299 |
| mozilla | 51220480 | 23870415 | 0.4660326299167833 |
| mr | 9970564 | 4991210 | 0.5005945501177266 |
| ooffice | 6152192 | 3884461 | 0.6313946313769141 |
| osdb | 10085684 | 4609668 | 0.4570506075740624 |
| reymont | 6627202 | 3026142 | 0.45662437933836936 |
| samba | 21606400 | 7148674 | 0.33085909730450236 |
| sao | 7251944 | 6360656 | 0.8770966791800929 |
| webster | 41458703 | 19516074 | 0.4707352760167148 |
| x-ray | 8474240 | 7618154 | 0.8989778434408278 |
| xml | 5345280 | 1471895 | 0.2753634982638889 |

According to these results the literal encoding might be slightly inefficient (8096 "0" bytes should compress to 37 bytes: 4-byte header + one "0" literal byte + 8 long commands at 4 bytes each).
Nevermind, I forgot about the stop command; its one extra byte gives exactly the 38 bytes measured.

@lingeringwillx
Contributor

What algorithms currently exist for compressing RefPack, and which is most suited for this library?

In addition to the libraries listed in the main README, there is also a set of implementations by, again, @lingeringwillx: https://github.com/lingeringwillx/CrappySims2Compression/tree/main/practice The library is currently using an algorithm that is pretty much identical to the "map-single" implementation on that page.

Maybe it would be possible to find out what algorithms the compressorizer or SimPE currently use.

If you're interested in the versions of the algorithm that are out there, I compiled a list of all of them (or at least the ones that include both compression and decompression) as a sort of literature review:
https://github.com/lingeringwillx/Refpack-QFS-Resources?tab=readme-ov-file#implementations

The Compressorizer uses the algorithm written by Ben Rudiak-Gould (benrg), which he dropped alongside some other low-level programs before he disappeared off the web. The algorithm is entirely based on zlib.

It took me a year to understand it, but it uses a hash chain to keep track of the data, or rather an efficient version of the hash chain that's commonly used in compression libraries. It stores the whole chain in just two fixed-size arrays.

The hashing function is a rolling hash that updates the hash value as it progresses over the file, i.e. hash = ((hash << shift) ^ byte) & mask. There is also a feature called lazy matching, which is basically dropping the match found at the current position in favor of a match at a later position, if the later match is better.

In my experience, both the rolling hash and lazy matching are a pain to implement, and they don't improve the performance by much. On the other hand, hash-chaining is a useful technique.
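
For reference, a minimal sketch of that two-array layout and rolling hash. This is the generic zlib-style technique as described above, not benrg's actual code, and the constants are illustrative:

```rust
/// zlib-style hash chain: `head[h]` is the most recent position whose 3-byte
/// prefix hashed to `h`, and `prev[pos & WINDOW_MASK]` links each position to
/// the previous one with the same hash. Two fixed-size arrays hold the chain.
const HASH_BITS: u32 = 15;
const HASH_SIZE: usize = 1 << HASH_BITS;
const WINDOW_SIZE: usize = 1 << 17; // illustrative figure for the sketch
const WINDOW_MASK: usize = WINDOW_SIZE - 1;
const HASH_SHIFT: u32 = 5;
const HASH_MASK: u32 = (HASH_SIZE as u32) - 1;

struct HashChain {
    head: Vec<i32>, // most recent position per hash bucket, -1 = empty
    prev: Vec<i32>, // previous position with the same hash, indexed by pos & WINDOW_MASK
    hash: u32,      // rolling hash of the current 3-byte prefix
}

impl HashChain {
    fn new() -> Self {
        Self { head: vec![-1; HASH_SIZE], prev: vec![-1; WINDOW_SIZE], hash: 0 }
    }

    /// Roll the next byte into the hash: hash = ((hash << shift) ^ byte) & mask.
    fn update_hash(&mut self, byte: u8) {
        self.hash = ((self.hash << HASH_SHIFT) ^ byte as u32) & HASH_MASK;
    }

    /// Record `pos` in the chain and return the previous candidate position
    /// that shares the same 3-byte hash (the start of the chain to walk).
    fn insert(&mut self, pos: usize) -> i32 {
        let candidate = self.head[self.hash as usize];
        self.prev[pos & WINDOW_MASK] = candidate;
        self.head[self.hash as usize] = pos as i32;
        candidate
    }
}
```

As in zlib, the hash has to be rolled ahead so that by the time `insert(pos)` is called it covers the three bytes starting at `pos`; match search then walks `prev` from the returned candidate until the chain runs out, the offset leaves the window, or a chain-length cutoff is reached.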

SimPE's compression code is crap. Don't bother with it. It was written all the way back in 2005 before we got better implementations. I think it even corrupts some files.

@chieltbest
Contributor Author

I wanted to get a better overview, so here are the algorithms in that list sorted by implementation type:

| Implementation | Part of project | Link |
| --- | --- | --- |
| direct search | s3pi | pljones & Tiger |
| direct search | | gibbed [1] |
| map single | godbpf | marcboudreau [2] |
| map single | | lingeringwillx |
| map single | | lingeringwillx |
| map single | | actioninja |
| map multi | SimPE | ambertation [3] |
| map multi | FreeSO | Afr0 [4] |
| map multi | jDBPFX | java_dwarf [4] |
| map multi | | lingeringwillx [5] |
| map multi | sims2-4k-ui-patch | lah7 |
| hash chaining | | Denis Auroux |
| hash chaining | The Compressorizer | benrg [5][6][7] |
| hash chaining | DBPFSharp | 0xC0000054 [8] |
| hash chaining | | J.M. Pescado [8] |
| hash chaining | Thyme | OmniBlade |
| hash chaining | | KUDr [9] |
| hash chaining | | sebamarynissen [10] |
| hash chaining | | lingeringwillx [5] |
| hash chaining | | lingeringwillx [8] |
| hash chaining | scdbpf | memo33 |

None of these libraries use dynamic programming/tree search like in LZMA, so I might try to take a crack at that (rough sketch after the footnotes below), but otherwise the best algorithm still seems to be the one devised by benrg.

Footnotes

  1. uses some weird "block tracking" data structure (that seems to not be used during search), so this might be a map algorithm

  2. 'perfect' 8 bit hash

  3. 'perfect' 24 bit hash

  4. seems to be a copy of the SimPE algorithm

  5. 'nice' copy length shortcut

  6. running hash

  7. lazy matching

  8. rewrite of benrg's code

  9. 'perfect' 16 bit hash

  10. rewrite of Denis Auroux's code
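
On the dynamic-programming idea: a minimal sketch of an optimal parse over precomputed match candidates. The `matches_at` callback and the flat per-literal cost are stand-ins; real RefPack literal runs and copy commands have more involved cost rules:

```rust
/// For every position, pick either a literal or one of the match candidates so
/// that the total encoded size is minimal. `matches_at` and `literal_cost` are
/// stand-ins for the real RefPack command cost model.
fn optimal_parse(
    len: usize,
    literal_cost: u32,
    matches_at: impl Fn(usize) -> Vec<(usize, u32)>, // (match length, encoded cost in bytes)
) -> Vec<u32> {
    // best[i] = minimal cost in bytes to encode data[i..]
    let mut best = vec![u32::MAX; len + 1];
    best[len] = 0;
    for i in (0..len).rev() {
        // Option 1: emit data[i] as a literal.
        best[i] = best[i + 1].saturating_add(literal_cost);
        // Option 2: emit one of the matches starting at i.
        for (match_len, cost) in matches_at(i) {
            if i + match_len <= len {
                best[i] = best[i].min(best[i + match_len].saturating_add(cost));
            }
        }
    }
    best
}
```

A real implementation would also record which option was chosen at each position so the command stream can be reconstructed by walking forward from position 0; the extra work over a greedy parse is what buys the better ratio.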

@chieltbest
Contributor Author

Here are the compression ratios of a fairly simple map multi implementation adapted from the existing code:

| Name | Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- | --- |
| dickens | 10192446 | 5144698 | 0.5047559731981901 |
| mozilla | 51220480 | 22506612 | 0.4394065030237905 |
| mr | 9970564 | 4732279 | 0.47462500616815656 |
| nci | 33553445 | 3898520 | 0.11618836754318372 |
| ooffice | 6152192 | 3624197 | 0.589090359988765 |
| osdb | 10085684 | 4295582 | 0.42590884267244544 |
| reymont | 6627202 | 2468070 | 0.3724150855821205 |
| samba | 21606400 | 6307239 | 0.2919153121297393 |
| sao | 7251944 | 6153356 | 0.8485112405721831 |
| webster | 41458703 | 15377026 | 0.37089983253938263 |
| x-ray | 8474240 | 7389406 | 0.8719845083452912 |
| xml | 5345280 | 848567 | 0.15875071090756704 |

@actioninja
Owner

Sorry, I just want to confirm that I read the thread and that I appreciate the effort and research. I've been extremely busy in my personal life and have been sidetracked with other projects in the free time I do have.

Imo optimizing for compressed size would be the smart goal here as long as it's not cripplingly slow; compression is an infrequent process compared to decompression. Multiple speeds or algorithms are also an option if fast speeds are really desired. I think it's pretty obvious that I'm not afraid of large API-breaking changes. There's a fairly narrow band of consumers, and considering that the current implementation is "ideal" in that it's about as bug-free and hardened as I ever expect it to be, I don't feel bad about "if you don't want to migrate, just stay on the old version."

@lingeringwillx
Contributor

There is a limit to how small the compressed size can get due to the limitations of the compression algorithm itself, and when you try to push past that you end up wasting a ton of CPU cycles only to eliminate a few more bytes. This is why I tend not to bother with the compression ratio much and focus on efficiency instead.
