Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rANS (rans4x8) encoding support for CRAM block compression #330

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

athos
Copy link
Member

@athos athos commented Dec 24, 2024

This PR implements the rANS (rans4x8) encoder to allow the CRAM writer to use it for block compression.

With this PR, the default compressor for the following data series will be updated to rans4x8:

Data Series Default Compressor
BF rans4x8 (Order 1)
CF rans4x8 (Order 1)
RI rans4x8 (Order 0)
RL rans4x8 (Order 1)
AP rans4x8 (Order 0)
RG rans4x8 (Order 1)
NS rans4x8 (Order 1)
TS rans4x8 (Order 1)
BA rans4x8 (Order 1)
QS rans4x8 (Order 1)

You can override the compressor for specific data series or tags by specifying :r4x8-o0 (Order 0) or :r4x8-o1 (Order 1) in the :ds-compressor-overrides / :tag-compressor-overrides options.

For example, to set the compressor of the FN data series to rANS (Order 0) and the TL data series to rANS (Order 1), you can configure the writer as follows:

(require '[cljam.io.cram :as cram])

(with-open [w (cram/writer "path/to/cram/file"
                           { ...
                            :ds-compressor-overrides {:FN :r4x8-o0
                                                      :TL :r4x8-o1}
                             ... })]
  ... )

@athos athos self-assigned this Dec 24, 2024
Copy link

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 85.61873% with 43 lines in your changes missing coverage. Please review.

Project coverage is 89.83%. Comparing base (18f8cca) to head (46b2f73).

Files with missing lines Patch % Lines
src/cljam/io/cram/codecs/rans4x8.clj 84.83% 4 Missing and 38 partials ⚠️
src/cljam/io/cram/encode/structure.clj 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #330      +/-   ##
==========================================
- Coverage   89.97%   89.83%   -0.15%     
==========================================
  Files         104      104              
  Lines        9341     9623     +282     
  Branches      490      528      +38     
==========================================
+ Hits         8405     8645     +240     
- Misses        446      450       +4     
- Partials      490      528      +38     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@athos athos force-pushed the feature/rans4x8-encoder branch from 70fea79 to 11e3225 Compare December 24, 2024 05:04
@athos athos marked this pull request as ready for review December 24, 2024 05:27
@athos athos requested review from alumi and a team as code owners December 24, 2024 05:27
@athos athos requested review from niyarin and removed request for a team December 24, 2024 05:27
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing the encoder👍
I think the implementation looks great overall. Thanks to the addition of some macros, it’s a lot easier to read!

Apologies, but I noticed one existing issue🙇.
In the current :dev profile of project.clj, the *unchecked-math* under :global-vars is set to :warn-on-boxed. However, if this is disabled, an IllegalArgumentException occurs, causing the tests to fail. I guess some occurrences of byte fn call in this namespace need to be replaced with unchecked-byte. Since *unchecked-math* was enabled during CI tests, this issue went unnoticed until now. You can verify it with the following command:

lein update-in : assoc :global-vars '^:replace {*warn-on-reflection* nil, *unchecked-math* nil}' -- test

On a side note, when nrepl/nrepl >= 1.3.0 is included as a dependency, running lein repl seems to overwrite the :global-vars settings with falsy values after loading the libraries, which ended up revealing the issue.

`(aget ~(with-meta `(aget ~syms ~i) {:tag 'objects}) ~j))

(defn- encode-payload1 ^long [^ByteBuffer in ^objects syms ^ByteBuffer out]
(let [raw-size (.remaining in)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec says We do not permit Order-1 encoding of data streams smaller than 4 bytes but it seems that the current implementation actually handles short input data well.
Do you think we should warn if raw-size is less than 4?

(->> (.getBytes "ab")
     bb/make-lsb-byte-buffer
     (rans/encode 1)
     bb/make-lsb-byte-buffer
     rans/decode
     String.
     (= "ab"))
;; => true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing that out! I had overlooked that part of the CRAM specification 🙇

It seems that in htslib and htsjdk, inputs smaller than 4 bytes are encoded as Order 0, even if Order 1 encoding is specified. In practice, this implicit fallback to Order 0 for such small inputs doesn’t seem to cause significant issues, so it appears to be a reasonable approach. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking into the behavior of other tools!
As you mentioned, I think implicit falling back to Order-0 works just fine! 👍👍

@athos
Copy link
Member Author

athos commented Dec 26, 2024

Thank you for the comment!

I hadn’t noticed the overflow issue with the byte fn because I didn’t disable *unchecked-math* during development. Commit 46b2f73 addressed this issue for the cljam.io.cram.codecs.rans4x8 ns (including the decoder), but it seems like many other namespaces also have similar issues, most of which are likely due to cljam.io.util.byte-buffer and cljam.io.util.lsb.*. So, I would like to handle those in a separate PR, if possible.

Failed tests
$ lein test
...
Ran 372 tests containing 3830 assertions.
1 failures, 17 errors.
$ lein test | grep 'lein test :only'
lein test :only cljam.algo.cram-indexer-test/cram-without-alignments-test
lein test :only cljam.io.cram-test/writer-with-reference-embedding-test
lein test :only cljam.io.cram-test/writer-test
lein test :only cljam.io.cram-test/writer-index-options-test
lein test :only cljam.io.cram.data-series-test/build-tag-encoders-test
lein test :only cljam.io.cram.data-series-test/build-tag-decoders-test
lein test :only cljam.io.cram.encode.record-test/encode-slice-records-test
lein test :only cljam.io.cram.encode.structure-test/encode-block-test
lein test :only cljam.io.cram.encode.structure-test/encode-slice-header-block-test
lein test :only cljam.io.cram.encode.structure-test/encode-eof-container-test
lein test :only cljam.io.cram.encode.structure-test/encode-cram-header-container-test
lein test :only cljam.io.cram.encode.structure-test/encode-compression-header-block-test
lein test :only cljam.io.cram.itf8-test/encode-itf8-test
lein test :only cljam.io.util-test/writer-predicates-test
lein test :only cljam.io.util.byte-buffer-test/allocate-msb-byte-buffer-test
lein test :only cljam.io.util.byte-buffer-test/allocate-lsb-byte-buffer-test
lein test :only cljam.io.util.byte-buffer-test/allocate-lsb-byte-buffer-test
lein test :only cljam.io.util.lsb.io-stream-test/about-data-output-stream
$

@alumi
Copy link
Member

alumi commented Dec 26, 2024

Thank you for updating! The commit 46b2f73 perfectly fixes the issue! 👍

many other namespaces also have similar issues, most of which are likely due to cljam.io.util.byte-buffer and cljam.io.util.lsb.*. So, I would like to handle those in a separate PR, if possible.

I really appreciate you taking care of this in a separate PR! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants