Improvements and future direction of the compression algorithm #10

Open
chieltbest opened this issue Aug 8, 2024 · 13 comments

@chieltbest
Contributor

Related to #9

While implementing file writing in my DBPF library I've noticed that the file sizes it produces tend to be somewhat larger than the original files (most likely created with SimPE).
Rather than blindly making changes or implementing another algorithm, I'd like to take a moment with this issue to explore what the options are in terms of algorithms, and what the end goals are for the compression part of the library as a whole.

So, to start:

  • What is the eventual goal of this library in terms of the speed/size tradeoff?
  • Should there be configuration options for things like the maximum window size, and what should they look like (should a configuration struct be portable across algorithms, or do we just accept breaking changes)?
  • What algorithms currently exist for compressing RefPack, and which is most suited for this library?
  • Are there any modern algorithms for LZ77/LZSS that deliver better performance, and can they be adapted/rewritten for use in RefPack?
  • What libraries exist in the rust ecosystem that could be used?
  • Since the entire data is always known before starting compression, is it possible to take advantage of that?
  • How do we measure the improvement in compression ratio in a reliable way?

I'll try to expand on each of these questions a little in the comments

@chieltbest
Contributor Author

What is the eventual goal of this library in terms of the speed/size tradeoff?

Obviously there is a tradeoff to be made between the speed of compression and the size of the compressed data; for instance, quite recently @lingeringwillx made a "faster compressorizer" [1][2] by, among other things, lowering the compression ratio of the algorithm used.

So the question arises: what is a "reasonable speed" for compression?
Let's take a person playing The Sims 2 with a downloads folder of 20-30GB [3] as a baseline. Let's also assume they have a reasonably modern 8-core CPU to do the compressing with, and that they're willing to wait a maximum of around 1 hour to compress all their files:
1 hour is 3600 seconds, so 20GB / 3600s = ~5.5MB/s
5.5MB/s / 8 cores = ~0.7MB/s per core
This would be the minimum speed that could possibly be called "reasonable", and I would personally consider a speed 10x that to be "good".

Obviously different people have different tolerances for "reasonable" performance, but somewhere around that number could work as a default setting.

Footnotes

  1. https://modthesims.info/d/679366/faster-compressorizer-reupload-beta.html

  2. The Compressorizer is a well-used and longstanding program to re-compress entire Sims 2 downloads folders

  3. according to sources this seems to be the size at which The Sims 2 starts to not function properly, so it's kind of a soft limit in a way

@chieltbest
Contributor Author

Should there be configuration options for things like the maximum window size, and what should they look like (should a configuration struct be portable across algorithms, or do we just accept breaking changes)?

As mentioned above, we should probably have configuration options to control the behavior of the algorithm.
This could be done in a few ways:

  • The direct way, meaning all internal algorithm options are exposed and the program author has to make decisions about what the user-facing options are.
  • The zlib way, meaning there is a single "0-9" compression strength option that provides reasonable defaults for each level.
  • A hybrid of the two: an enum that is either a coarse 0-9 level or the full set of direct options.

I'm personally not at all worried about the potential deprecation aspect of providing fine control, but I think it would be nice to give some set of reasonable defaults.
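
For illustration, here is a rough sketch of what the hybrid option could look like. Every name and number in it is hypothetical, not something that exists in the crate today:

```rust
/// Hypothetical compression settings; all names and numbers are illustrative.
pub enum CompressionOptions {
    /// zlib-style preset: 0 = fastest, 9 = smallest output.
    Level(u8),
    /// Direct control over the internal knobs.
    Custom {
        /// How far back matches may reference, bounded by the format's maximum copy offset.
        max_window_size: u32,
        /// Stop searching a hash chain/bucket after this many candidates.
        max_chain_length: u32,
        /// Defer a match if the next position yields a better one.
        lazy_matching: bool,
    },
}

impl CompressionOptions {
    /// Map a preset onto concrete knob values; the numbers are placeholders.
    pub fn resolve(self) -> (u32, u32, bool) {
        match self {
            CompressionOptions::Level(level) => {
                let level = u32::from(level.min(9));
                (
                    1 << (10 + level.min(7)), // window grows with the level
                    8 << level,               // search effort grows with the level
                    level >= 6,               // lazy matching only at higher levels
                )
            }
            CompressionOptions::Custom {
                max_window_size,
                max_chain_length,
                lazy_matching,
            } => (max_window_size, max_chain_length, lazy_matching),
        }
    }
}
```

The nice property of the hybrid is that `Level(n)` can keep working across releases even if the `Custom` knobs change in breaking ways.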

@chieltbest
Contributor Author

What algorithms currently exist for compressing RefPack, and which is most suited for this library?

In addition to the libraries listed in the main README, there is also a set of implementations by, again, @lingeringwillx: https://github.com/lingeringwillx/CrappySims2Compression/tree/main/practice
The library is currently using an algorithm that is pretty much identical to the "map-single" implementation on that page.

Maybe it would be possible to find out what algorithms the compressorizer or SimPE currently use.

@chieltbest
Contributor Author

Are there any modern algorithms for LZ77/LZSS that deliver better performance, and can they be adapted/rewritten for use in RefPack?

There might be some good (possibly Rust) DEFLATE or similar libraries that have prefix-search algorithms or other machinery that could be adapted for use in this library.

@chieltbest
Contributor Author

What libraries exist in the rust ecosystem that could be used?

Similar to the above, there are good data-structure libraries already available in Rust that are made for this kind of use case: think of tries, hashing/hash maps, circular buffers, and more.
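
As a very rough illustration of the kind of structure involved (essentially the "map multi" idea, using only the standard library; the 3-byte minimum match length is an assumption, and none of this is the crate's actual code):

```rust
use std::collections::HashMap;

/// Index every 3-byte prefix of the input so that match candidates for the
/// current position can be looked up directly instead of scanned linearly.
/// A real implementation would also drop positions that fall out of the window.
fn build_prefix_index(data: &[u8]) -> HashMap<[u8; 3], Vec<usize>> {
    let mut index: HashMap<[u8; 3], Vec<usize>> = HashMap::new();
    for pos in 0..data.len().saturating_sub(2) {
        let key = [data[pos], data[pos + 1], data[pos + 2]];
        index.entry(key).or_default().push(pos);
    }
    index
}
```

Candidates for the current position are then `index.get(&prefix)`, filtered to offsets still inside the window; a trie or ring-buffer-backed structure from an external crate could replace the `HashMap` if profiling shows it to be the bottleneck.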

@chieltbest
Contributor Author

Since the entire data is always known before starting compression, is it possible to take advantage of that?

This is an interesting one I haven't seen in other DBPF libraries: since we're only ever working on full chunks of data it's possible to do several things:

  • SIMD vectorize hashing of prefixes (see the sketch after this list)
  • start compression at the largest repetition instead of at the beginning (resulting in a better compression ratio)
  • possibly optimize in other ways
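
On the first point, a hypothetical sketch (not tied to the current code): because the whole buffer is available up front, the prefix hash for every position can be computed in one tight pass, which the compiler can auto-vectorize or which could be rewritten with explicit SIMD.

```rust
/// Hash every 3-byte window of the input in a single pass. The hash function
/// and the 16-bit output size are arbitrary choices for illustration.
fn hash_all_prefixes(data: &[u8]) -> Vec<u16> {
    data.windows(3)
        .map(|w| {
            // Simple mix of the three bytes; a real compressor would tune this.
            let h = ((w[0] as u32) << 10) ^ ((w[1] as u32) << 5) ^ (w[2] as u32);
            (h & 0x7FFF) as u16
        })
        .collect()
}
```

The resulting table can then feed whichever match-search structure ends up being used.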

@chieltbest
Contributor Author

chieltbest commented Aug 8, 2024

How do we measure the improvement in compression ratio in a reliable way?

I'm slightly stumped on this one; other libraries seem to take the approach of writing an implementation and accepting whatever results come out in terms of size, but I think it would be nice to have actual regression tests that integrate with the benchmark suite if at all possible.

Maybe it would work to write a custom criterion Measurement so that we can reuse the entire benchmark suite.
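
A custom criterion Measurement would also need a matching ValueFormatter, so in the meantime a plain regression test that pins the ratio on fixed inputs might be the simpler option. A minimal sketch, where the `compress` closure stands in for the library's real entry point and the baselines are values recorded from a known-good run:

```rust
/// Fail if `compress` produces output worse than `baseline_ratio` (plus a
/// small tolerance) on `input`. Intended to run on fixed corpora in CI.
fn assert_ratio_not_regressed(
    compress: impl Fn(&[u8]) -> Vec<u8>,
    input: &[u8],
    baseline_ratio: f64,
) {
    let ratio = compress(input).len() as f64 / input.len() as f64;
    // Small tolerance so harmless changes don't fail the build.
    let allowed = baseline_ratio * 1.01;
    assert!(
        ratio <= allowed,
        "compression ratio regressed: {ratio:.6} > {allowed:.6}"
    );
}
```

The baseline ratios could simply be the numbers from the results posted below, checked in next to the benchmark corpora.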

@chieltbest
Contributor Author

chieltbest commented Aug 8, 2024

I've been able to get some quick results by adding a print statement to the benchmarks

Random Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 8174 | 1.0096343873517786 |
| 16192 | 16342 | 1.009263833992095 |
| 32384 | 32678 | 1.009078557312253 |
| 64768 | 65350 | 1.008985918972332 |
| 129536 | 130694 | 1.0089395998023716 |
| 259072 | 261382 | 1.0089164402173914 |

Repeating Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 296 | 0.036561264822134384 |
| 16192 | 328 | 0.020256916996047432 |
| 32384 | 392 | 0.012104743083003952 |
| 64768 | 516 | 0.007966897233201582 |
| 129536 | 772 | 0.005959733201581028 |
| 259072 | 1276 | 0.004925271739130435 |

All Zero Input Data Increasing:

| Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- |
| 8096 | 38 | 0.004693675889328063 |
| 16192 | 70 | 0.004323122529644269 |
| 32384 | 134 | 0.004137845849802372 |
| 64768 | 262 | 0.004045207509881423 |
| 129536 | 514 | 0.003968008893280632 |
| 259072 | 1018 | 0.003929409584980237 |

Files:

| Name | Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- | --- |
| dickens | 10192446 | 6037010 | 0.5923023776628299 |
| mozilla | 51220480 | 23870415 | 0.4660326299167833 |
| mr | 9970564 | 4991210 | 0.5005945501177266 |
| ooffice | 6152192 | 3884461 | 0.6313946313769141 |
| osdb | 10085684 | 4609668 | 0.4570506075740624 |
| reymont | 6627202 | 3026142 | 0.45662437933836936 |
| samba | 21606400 | 7148674 | 0.33085909730450236 |
| sao | 7251944 | 6360656 | 0.8770966791800929 |
| webster | 41458703 | 19516074 | 0.4707352760167148 |
| x-ray | 8474240 | 7618154 | 0.8989778434408278 |
| xml | 5345280 | 1471895 | 0.2753634982638889 |

According to these results the literal encoding might be slightly inefficient (8096 "0" bytes should compress to 37 bytes: 4-byte header + one "0" literal byte + 8 long commands at 4 bytes each).
Nevermind, I forgot about the stop command; its one extra byte gives exactly the 38 bytes measured.

@lingeringwillx
Contributor

What algorithms currently exist for compressing RefPack, and which is most suited for this library?

In addition to the libraries listed in the main README, there is also a set of implementations by, again, @lingeringwillx: https://github.com/lingeringwillx/CrappySims2Compression/tree/main/practice The library is currently using an algorithm that is pretty much identical to the "map-single" implementation on that page.

Maybe it would be possible to find out what algorithms the compressorizer or SimPE currently use.

If you're interested in the versions of the algorithm that are out there, I compiled a list of all of them (or at least the ones that include both compression and decompression) as a sort of literature review:
https://github.com/lingeringwillx/Refpack-QFS-Resources?tab=readme-ov-file#implementations

The Compressorizer uses the algorithm written by Ben Rudiak-Gould (benrg), which he dropped alongside some other low-level programs before he disappeared off the web. The algorithm is entirely based on zlib.

It took me a year to understand it, but it uses a hash chain to keep track of the data, or rather an efficient version of the hash chain that's commonly used in compression libraries. It stores the whole chain in just two fixed-size arrays.

The hashing function is a rolling hash that updates the hash value as it progresses over the file, i.e. hash = ((hash << shift) ^ byte) & mask. There is also a feature called lazy matching, which is basically dropping the match found at the current position in favor of a match at a later position, if the later match is better.

In my experience, both the rolling hash and lazy matching are a pain to implement, and they don't improve the performance by much. On the other hand, hash-chaining is a useful technique.
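
For reference, a minimal sketch of that two-array layout and rolling hash. This is the generic zlib-style technique as described above, not benrg's actual code, and the constants are illustrative:

```rust
/// zlib-style hash chain: `head[h]` is the most recent position whose 3-byte
/// prefix hashed to `h`, and `prev[pos & WINDOW_MASK]` links each position to
/// the previous one with the same hash. Two fixed-size arrays hold the chain.
const HASH_BITS: u32 = 15;
const HASH_SIZE: usize = 1 << HASH_BITS;
const WINDOW_SIZE: usize = 1 << 17; // illustrative figure for the sketch
const WINDOW_MASK: usize = WINDOW_SIZE - 1;
const HASH_SHIFT: u32 = 5;
const HASH_MASK: u32 = (HASH_SIZE as u32) - 1;

struct HashChain {
    head: Vec<i32>, // most recent position per hash bucket, -1 = empty
    prev: Vec<i32>, // previous position with the same hash, indexed by pos & WINDOW_MASK
    hash: u32,      // rolling hash of the current 3-byte prefix
}

impl HashChain {
    fn new() -> Self {
        Self { head: vec![-1; HASH_SIZE], prev: vec![-1; WINDOW_SIZE], hash: 0 }
    }

    /// Roll the next byte into the hash: hash = ((hash << shift) ^ byte) & mask.
    fn update_hash(&mut self, byte: u8) {
        self.hash = ((self.hash << HASH_SHIFT) ^ byte as u32) & HASH_MASK;
    }

    /// Record `pos` in the chain and return the previous candidate position
    /// that shares the same 3-byte hash (the start of the chain to walk).
    fn insert(&mut self, pos: usize) -> i32 {
        let candidate = self.head[self.hash as usize];
        self.prev[pos & WINDOW_MASK] = candidate;
        self.head[self.hash as usize] = pos as i32;
        candidate
    }
}
```

As in zlib, the hash has to be rolled ahead so that by the time `insert(pos)` is called it covers the three bytes starting at `pos`; match search then walks `prev` from the returned candidate until the chain runs out, the offset leaves the window, or a chain-length cutoff is reached.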

SimPE's compression code is crap. Don't bother with it. It was written all the way back in 2005 before we got better implementations. I think it even corrupts some files.

@chieltbest
Contributor Author

I wanted to get a better overview, so here are the algorithms in that list sorted by implementation type:

| Implementation | Part of project | Link |
| --- | --- | --- |
| direct search | s3pi | pljones & Tiger |
| direct search | | gibbed [1] |
| map single | godbpf | marcboudreau [2] |
| map single | | lingeringwillx |
| map single | | lingeringwillx |
| map single | | actioninja |
| map multi | SimPE | ambertation [3] |
| map multi | FreeSO | Afr0 [4] |
| map multi | jDBPFX | java_dwarf [4] |
| map multi | | lingeringwillx [5] |
| map multi | sims2-4k-ui-patch | lah7 |
| hash chaining | | Denis Auroux |
| hash chaining | The Compressorizer | benrg [5][6][7] |
| hash chaining | DBPFSharp | 0xC0000054 [8] |
| hash chaining | | J.M. Pescado [8] |
| hash chaining | Thyme | OmniBlade |
| hash chaining | | KUDr [9] |
| hash chaining | | sebamarynissen [10] |
| hash chaining | | lingeringwillx [5] |
| hash chaining | | lingeringwillx [8] |
| hash chaining | scdbpf | memo33 |

None of these libraries use dynamic programming/tree search like in LZMA, so I might try to take a crack at that (rough sketch after the footnotes below), but otherwise the best algorithm still seems to be the one devised by benrg.

Footnotes

  1. uses some weird "block tracking" data structure (that seems to not be used during search), so this might be a map algorithm

  2. 'perfect' 8 bit hash

  3. 'perfect' 24 bit hash

  4. seems to be a copy of the SimPE algorithm

  5. 'nice' copy length shortcut

  6. running hash

  7. lazy matching

  8. rewrite of benrg's code

  9. 'perfect' 16 bit hash

  10. rewrite of Denis Auroux's code
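
On the dynamic-programming idea: a minimal sketch of an optimal parse over precomputed match candidates. The `matches_at` callback and the flat per-literal cost are stand-ins; real RefPack literal runs and copy commands have more involved cost rules:

```rust
/// For every position, pick either a literal or one of the match candidates so
/// that the total encoded size is minimal. `matches_at` and `literal_cost` are
/// stand-ins for the real RefPack command cost model.
fn optimal_parse(
    len: usize,
    literal_cost: u32,
    matches_at: impl Fn(usize) -> Vec<(usize, u32)>, // (match length, encoded cost in bytes)
) -> Vec<u32> {
    // best[i] = minimal cost in bytes to encode data[i..]
    let mut best = vec![u32::MAX; len + 1];
    best[len] = 0;
    for i in (0..len).rev() {
        // Option 1: emit data[i] as a literal.
        best[i] = best[i + 1].saturating_add(literal_cost);
        // Option 2: emit one of the matches starting at i.
        for (match_len, cost) in matches_at(i) {
            if i + match_len <= len {
                best[i] = best[i].min(best[i + match_len].saturating_add(cost));
            }
        }
    }
    best
}
```

A real implementation would also record which option was chosen at each position so the command stream can be reconstructed by walking forward from position 0; the extra work over a greedy parse is what buys the better ratio.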

@chieltbest
Contributor Author

Here are the compression ratios of a fairly simple map multi implementation adapted from the existing code:

| Name | Data Size (bytes) | Compressed (bytes) | Ratio |
| --- | --- | --- | --- |
| dickens | 10192446 | 5144698 | 0.5047559731981901 |
| mozilla | 51220480 | 22506612 | 0.4394065030237905 |
| mr | 9970564 | 4732279 | 0.47462500616815656 |
| nci | 33553445 | 3898520 | 0.11618836754318372 |
| ooffice | 6152192 | 3624197 | 0.589090359988765 |
| osdb | 10085684 | 4295582 | 0.42590884267244544 |
| reymont | 6627202 | 2468070 | 0.3724150855821205 |
| samba | 21606400 | 6307239 | 0.2919153121297393 |
| sao | 7251944 | 6153356 | 0.8485112405721831 |
| webster | 41458703 | 15377026 | 0.37089983253938263 |
| x-ray | 8474240 | 7389406 | 0.8719845083452912 |
| xml | 5345280 | 848567 | 0.15875071090756704 |

@actioninja
Owner

Sorry, I just want to confirm that I read the thread and that I appreciate the effort and research. I've been extremely busy in my personal life and have been sidetracked with other projects in the free time I do have.

Imo optimizing for compressed size would be the smart goal here as long as it's not cripplingly slow; compression is an infrequent process compared to decompression. Multiple speeds or algorithms are also an option if fast speeds are really desired. I think it's pretty obvious that I'm not afraid of large API-breaking changes. There's a fairly narrow band of consumers, and considering that the current implementation is "ideal" in that it's about as bug-free and hardened as I ever expect it to be, I don't feel bad about "if you don't want to migrate, just stay on the old version."

@lingeringwillx
Contributor

There is a limit to how small the compressed size can get due to the limitations of the compression algorithm itself, and when you try to push past that you end up wasting a ton of CPU cycles only to eliminate a few more bytes. This is why I tend not to bother with the compression ratio much and focus on efficiency instead.
