Decrease grid file size #12

Closed
cschwan opened this issue May 27, 2020 · 3 comments

cschwan commented May 27, 2020

The grids that are currently produced are rather large (CMSDY2D11 is 800 MB), but their size can be reduced using several strategies:

  • many of our grids have an essentially static scale, so we should be able to remove the interpolation in Q2,
  • inside the non-zero Q2 grids many bins are probably empty, which can be exploited when writing to disk by using a simple compression algorithm.
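
As an illustration of the second point, here is a minimal sketch of one possible "simple compression algorithm": run-length encoding of consecutive empty bins. It assumes the bin contents are available as a flat slice of `f64` values; the names `Run`, `encode` and `decode` are hypothetical and are not taken from PineAPPL.

```rust
/// One element of the encoded stream: a run of empty bins or a single value.
#[derive(Debug, PartialEq)]
enum Run {
    /// Number of consecutive bins that are exactly zero.
    Zeros(usize),
    /// A single non-zero bin value, stored as-is.
    Value(f64),
}

/// Collapses every run of consecutive zero bins into a single `Run::Zeros`.
fn encode(bins: &[f64]) -> Vec<Run> {
    let mut runs = Vec::new();
    for &x in bins {
        if x == 0.0 {
            // Extend the current run of zeros if there is one, else start a new run.
            if let Some(Run::Zeros(n)) = runs.last_mut() {
                *n += 1;
            } else {
                runs.push(Run::Zeros(1));
            }
        } else {
            runs.push(Run::Value(x));
        }
    }
    runs
}

/// Expands the runs back into the original flat list of bins.
fn decode(runs: &[Run]) -> Vec<f64> {
    let mut bins = Vec::new();
    for run in runs {
        match run {
            Run::Zeros(n) => bins.extend(std::iter::repeat(0.0).take(*n)),
            Run::Value(x) => bins.push(*x),
        }
    }
    bins
}
```

The more empty bins sit next to each other, the fewer entries the encoded form needs; a grid without empty bins gains nothing from this scheme.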

cschwan commented Jun 6, 2020

A simple strategy is to compress the grids. I've tried the following compression algorithms:

  1. without compression the file is 894 MB according to du --apparent-size,
  2. lzop --fast: 139 MB, lzop --best: 135 MB,
  3. lz4 --fast: 137 MB, lz4 --best: 133 MB,
  4. gzip --fast: 133 MB, gzip --best: 130 MB,
  5. bzip2 --fast: 127 MB, bzip2 --best: 126 MB,
  6. 7z a -t7z -mx=1: 127 MB, 7z a -t7z -mx=9: 125 MB,
  7. lzip --fast: 126 MB, lzip --best: 124 MB,
  8. xz --fast: 125 MB, xz --best: 125 MB.
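
To put the sizes above on a common scale, they can be converted into compression ratios, which range from roughly 6.4 (lzop --fast) to roughly 7.2 (lzip --best). The following is a small sketch of that arithmetic; the function names are made up for illustration.

```rust
/// Uncompressed size in MB, as reported by `du --apparent-size` above.
const UNCOMPRESSED_MB: f64 = 894.0;

/// Ratio of uncompressed to compressed size; larger means better compression.
fn compression_ratio(compressed_mb: f64) -> f64 {
    UNCOMPRESSED_MB / compressed_mb
}

/// Share of the original file size that is saved, in percent.
fn space_saving_percent(compressed_mb: f64) -> f64 {
    100.0 * (1.0 - compressed_mb / UNCOMPRESSED_MB)
}
```

For example, `compression_ratio(133.0)` evaluates the `lz4 --best` result, and `space_saving_percent(133.0)` gives the corresponding saving of about 85 %.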

A disadvantage is that compressing the grids (which we only need to do once) adds a decompression penalty every time we read them. Therefore the following decompression timings are relevant (each for a grid compressed with the fastest setting of the respective algorithm, measured on dom):

  1. lzop -d: real 0m0.806s, user 0m0.413s, sys 0m0.393s,
  2. lz4 -d: real 0m0.640s, user 0m0.203s, sys 0m0.436s,
  3. gzip -d: real 0m2.776s, user 0m2.335s, sys 0m0.440s,
  4. bzip2 -d: real 0m7.026s, user 0m6.550s, sys 0m0.473s,
  5. 7z e: not available on dom,
  6. lzip -d: not available on dom,
  7. xz: real 0m7.550s, user 0m7.050s, sys 0m0.490s.

Here are the corresponding timings for grids compressed with the best compression setting:

  1. lzop -d: real 0m1.136s, user 0m0.759s, sys 0m0.369s,
  2. lz4 -d: real 0m0.629s, user 0m0.203s, sys 0m0.425s,
  3. gzip -d: real 0m3.335s, user 0m2.947s, sys 0m0.387s,
  4. bzip2 -d: real 0m8.619s, user 0m8.120s, sys 0m0.493s,
  5. 7z e: not available on dom,
  6. lzip -d: not available on dom,
  7. xz: real 0m7.571s, user 0m7.061s, sys 0m0.494s.

cschwan commented Jun 6, 2020

Commit 1f4df99 implements reading from an LZ4-compressed stream, using the lz-fear crate. The decompression penalty is roughly 0.5 seconds for the use case shown above, which is what the tests predicted. If we decide that the grids are too large, we can simply compress them; if not, we leave them as they are. PineAPPL can now read both.
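
One way a reader can accept both compressed and uncompressed files is to look at the first four bytes: every LZ4 frame starts with the magic number 0x184D2204, stored in little-endian byte order. The sketch below shows only this detection step using the standard library; it is an assumption about how such a check could look, not a copy of what the commit does, and `sniff_lz4` is a hypothetical name. The actual decompression would be handed to an LZ4 library and is left out here.

```rust
use std::io::{self, Cursor, Read};

/// Magic number at the start of every LZ4 frame, in on-disk (little-endian) order.
const LZ4_FRAME_MAGIC: [u8; 4] = [0x04, 0x22, 0x4D, 0x18];

/// Reads up to four bytes from `reader` and reports whether they match the
/// LZ4 frame magic number. The returned reader replays those bytes first, so
/// the caller still sees the complete stream from its beginning.
fn sniff_lz4<R: Read>(mut reader: R) -> io::Result<(bool, impl Read)> {
    let mut head = Vec::with_capacity(4);
    // `take(4)` stops after four bytes; shorter inputs simply yield fewer bytes.
    reader.by_ref().take(4).read_to_end(&mut head)?;
    let is_lz4 = head == LZ4_FRAME_MAGIC;
    Ok((is_lz4, Cursor::new(head).chain(reader)))
}
```

A caller would wrap the returned reader in an LZ4 frame decoder when the flag is `true` and pass it to the deserializer unchanged otherwise. Because the consumed bytes are replayed, this also works for streams that cannot seek.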

@cschwan cschwan self-assigned this Jun 6, 2020
@cschwan cschwan added the enhancement New feature or request label Jun 6, 2020

cschwan commented Nov 25, 2020

PR #48 implements another file-size optimisation, and Issue #45 lists more possibilities.

@cschwan cschwan closed this as completed Nov 25, 2020