
Faster short string compression with terminator byte and AVX512 #48

Open
XiangpengHao opened this issue Oct 26, 2024 · 2 comments

@XiangpengHao (Contributor) commented Oct 26, 2024

Hi SpiralDB, I've had a great experience using the fsst lib and vortex -- thank you for building them.

I'm trying to make fsst even faster. Currently, fsst compresses the vortex varbin array by iterating over each string and compressing it individually. This works well for long strings but not for short ones, especially strings shorter than 8 bytes -- in that case, every string compression falls back to the slow path.

The original paper suggests copying the short strings into a new buffer with a terminator byte between them (Section 5.2). With that single long buffer, we can compress faster even with scalar code (and potentially auto-vectorized code). The long buffer also lets us compress with AVX512.
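To make the idea concrete, here is a minimal sketch (not the fsst crate's actual API) of the packing step: concatenate many short strings into one contiguous buffer separated by a terminator byte, and record offsets so the boundaries can be recovered after decompression. The choice of `0xFF` as the terminator is an assumption for illustration; FSST actually reserves a dedicated terminator symbol in its table.

```rust
/// Pack short strings into one buffer, separated by `terminator`, so the
/// compressor can make a single fast-path pass over a long input instead
/// of one slow-path invocation per short string.
/// Returns the packed buffer plus offsets (offsets[i]..offsets[i+1]-1 is
/// string i, with its trailing terminator at offsets[i+1]-1).
fn pack_with_terminator(strings: &[&[u8]], terminator: u8) -> (Vec<u8>, Vec<usize>) {
    let total: usize = strings.iter().map(|s| s.len() + 1).sum();
    let mut buf = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(strings.len() + 1);
    for s in strings {
        offsets.push(buf.len());
        buf.extend_from_slice(s);
        buf.push(terminator); // marks the string boundary
    }
    offsets.push(buf.len());
    (buf, offsets)
}

fn main() {
    let inputs: &[&[u8]] = &[b"cat", b"dog"];
    let (buf, offsets) = pack_with_terminator(inputs, 0xFF);
    assert_eq!(buf, vec![b'c', b'a', b't', 0xFF, b'd', b'o', b'g', 0xFF]);
    assert_eq!(offsets, vec![0, 4, 8]);
    println!("packed {} bytes, {} strings", buf.len(), offsets.len() - 1);
}
```

The terminator only works if that byte value can never appear inside a compressed symbol, which is why it has to be reserved in the symbol table first.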

I guess the first step is to add a terminator byte to the symbol table, as the C++ implementation does.

Do you have plans to implement this, or any thoughts on the approach? I'd love to hear them.

@a10y (Contributor) commented Oct 27, 2024

Hey @XiangpengHao , thanks for broaching this.

As you noted, the implementation right now is entirely based on the Section 5.1 "Predicated Scalar Compression" scheme. This corresponds to the compressBulk C++ impl, except that, as you point out, we don't split the input into 511-byte segments plus a terminator byte.

My reasoning at the time was that (a) comments in the C++ code indicated the 511-byte segmentation in the scalar fallback exists mainly to keep the output format consistent with the AVX512 variant, so that data compresses the same way across architectures, and (b) it seemed complicated.

I think the first step would be some benchmarking to show that the speedup from avoiding the "slow path" loop is not outweighed by the cost of memcpying and preparing 511-byte chunks from the input.
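For reference, the chunk-preparation cost being weighed here looks roughly like this sketch (a hypothetical helper, not part of the fsst crate): split the input into segments of at most 511 bytes, each followed by a terminator byte, mirroring the segmentation the C++ compressBulk path uses to keep scalar and AVX512 output identical.

```rust
/// Split `input` into 511-byte segments, appending a terminator byte
/// after each segment. This is the extra memcpy work whose cost would
/// need to be benchmarked against the fast-path speedup.
fn segment(input: &[u8], terminator: u8) -> Vec<u8> {
    const SEG: usize = 511;
    let mut out = Vec::with_capacity(input.len() + input.len() / SEG + 1);
    for chunk in input.chunks(SEG) {
        out.extend_from_slice(chunk);
        out.push(terminator); // one terminator per segment
    }
    out
}

fn main() {
    // 1024 bytes -> segments of 511, 511, and 2 bytes, plus 3 terminators.
    let input = vec![0u8; 1024];
    let out = segment(&input, 0xFF);
    assert_eq!(out.len(), 1024 + 3);
    assert_eq!(out[511], 0xFF);
    println!("segmented {} -> {} bytes", input.len(), out.len());
}
```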

I think if we do this, it'd end up greatly changing the structure of the Compressor, so I'd just wanna prove the idea out a bit more before we go that route. As you mentioned, it is a bridge we would need to cross anyway if we were to implement the AVX512 "meat grinder" compressor.

@XiangpengHao (Contributor, Author) commented:

Thank you for the comments @a10y. They make perfect sense to me!
