Skip to content
This repository has been archived by the owner on Aug 20, 2021. It is now read-only.

Use zstd instead of zip #398

Merged
merged 6 commits into from
Apr 16, 2020
Merged

Use zstd instead of zip #398

merged 6 commits into from
Apr 16, 2020

Conversation

asymmetric
Copy link
Contributor

@asymmetric asymmetric commented Apr 13, 2020

Logs can sometimes be Gigabytes large, so transferring and decompressing them
can take a while.

zstd is very fast at all operations (de/compression, extraction) out of the box.

It's possible there are parameters we can tweak for our specific usecase, but I
haven't looked into that.

For the simple Vat_subui_pass_rough proof in examples/heal it's 40KB using zstd vs 3.8MB using zip.

On a 1.1GB zipped proof file I found on buildbot, unzip needed around 3 minutes.

zstd compressed the contents of the same file to 30MB (!!), and decompressed it in 1m36 seconds.

Since we might want to change the compression method.
@asymmetric asymmetric requested a review from a team April 13, 2020 18:10
"bin_runtime.k"
\( -name "*$spec_hash*" \
-o -name "*$spec_name*" \
-o -name config.tmp.json \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this from config.json to config.tmp.json, because the former is not in $KLAB_OUT while the latter is, and this made the code cleaner (since I only have to search in one dir).

AFAICT the former is a subset of the latter, so everything should be fine.

@mhhf true story?

@asymmetric
Copy link
Contributor Author

asymmetric commented Apr 13, 2020

I tested this manually by doing

klab compress $PROOF_HASH
mkdir foo && tar xf $KLAB_OUT/log/$PROOF_HASH.tar -C foo
KLAB_OUT=foo klab debug $PROOF_HASH

-o -name bin_runtime.k \) \
-a \! -name "*.tar" \
-a \! -name "*.zip" \
-exec tar --create --zstd --file "./log/$spec_hash.tar" {} +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we could pass the --threads=N argument to zstd, to control how many threads are used for compression, in case we deemed it necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a rough idea of what kind of speedup we could get from this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the test archive on buildbot, using -T0, 29 vs 33 seconds, so negligible. Maybe there are other bottlenecks.

shell.nix Show resolved Hide resolved
@asymmetric
Copy link
Contributor Author

This is backwards-incompatible as is. Previously generated zip files are not fetchable anymore. Not sure if we should deal with that.

d-xo
d-xo previously approved these changes Apr 14, 2020
Copy link
Contributor

@d-xo d-xo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general lgtm. One small question:

I remember previously @mhhf mentioning in chat that zipping was additive, and so it should be pretty easy to start running klab zip every X minutes during prove-all to allow fetching of running proofs. Is this also the case for zstd?

@d-xo
Copy link
Contributor

d-xo commented Apr 14, 2020

This is backwards-incompatible as is. Previously generated zip files are not fetchable anymore. Not sure if we should deal with that.

It seems like it would be fairly easy to add some conditional logic in klab fetch that can handle both?

@asymmetric
Copy link
Contributor Author

@xwvvvvwx

I remember previously @mhhf mentioning in chat that zipping was additive, and so it should be pretty easy to start running klab zip every X minutes during prove-all to allow fetching of running proofs. Is this also the case for zstd?

No, it's not the case with tar, as it doesn't allow adding files to compressed tarballs ex post.

I'm still not sure if that solution you mention above is the best though - I think it would make sense to have logic that, when receiving the HTTP GET from klab fetch, zips up the file if it's not already there (and also if it's there but the proof is still running)

@d-xo
Copy link
Contributor

d-xo commented Apr 14, 2020

I'm still not sure if that solution you mention above is the best though - I think it would make sense to have logic that, when receiving the HTTP GET from klab fetch, zips up the file if it's not already there (and also if it's there but the proof is still running)

Not really sure if this issue is the best place to bikeshed this potential feature, but I'm not sure how I feel about adding a long running server process to our CI infrastructure, feels like quite a lot of moving parts.

@asymmetric
Copy link
Contributor Author

asymmetric commented Apr 14, 2020

Not really sure if this issue is the best place to bikeshed this potential feature, but I'm not sure how I feel about adding a long running server process to our CI infrastructure, feels like quite a lot of moving parts.

Thanks for bringing this up in this issue, as it's relevant.

The way I see it, we wouldn't be adding any moving parts. The long-running process is already there, it's nginx. When a request arrives at a certain endpoint, it starts a (Lua or njs) script that does the tarballing and returns the tarball.

@d-xo
Copy link
Contributor

d-xo commented Apr 14, 2020

The long-running process is already there, it's nginx. When a request arrives at a certain endpoint, it starts a (Lua or njs) script that does the tarballing and returns the tarball.

Didn't know that you could use nginx like that. I like it. It should even allow us to klab fetch accepted proofs, which is something I have often wanted.

I guess there is also the complication of having different branches / projects potentially using different versions of klab, which could cause some issues if we modify the proof serialization format between klab versions.

@asymmetric
Copy link
Contributor Author

It seems like it would be fairly easy to add some conditional logic in klab fetch that can handle both?

@xwvvvvwx yes, we could do that if needed, it's true. Do we want to?

@d-xo
Copy link
Contributor

d-xo commented Apr 15, 2020

yes, we could do that if needed, it's true. Do we want to?

On balance I think I'm fine with leaving it out and adding it back if it starts to become annoying.

@asymmetric
Copy link
Contributor Author

yes, we could do that if needed, it's true. Do we want to?

On balance I think I'm fine with leaving it out and adding it back if it starts to become annoying.

I circled back on this, I think it would be much nicer if it would transparently work with zip files without users having to dig back old versions of klab. So I'll implement that in this PR.

@asymmetric
Copy link
Contributor Author

OK, should be done.

@asymmetric asymmetric requested a review from a team April 15, 2020 14:51
Copy link
Contributor

@d-xo d-xo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one trivial nit. I'm fine to merge either way. lgtm 💖

@asymmetric asymmetric merged commit 9a42e21 into master Apr 16, 2020
@asymmetric asymmetric deleted the zstd branch April 16, 2020 08:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants