-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rANS (rans4x8) encoding support for CRAM block compression #330
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #330 +/- ##
==========================================
- Coverage 89.97% 89.83% -0.15%
==========================================
Files 104 104
Lines 9341 9623 +282
Branches 490 528 +38
==========================================
+ Hits 8405 8645 +240
- Misses 446 450 +4
- Partials 490 528 +38 ☔ View full report in Codecov by Sentry. |
70fea79
to
11e3225
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing the encoder👍
I think the implementation looks great overall. Thanks to the addition of some macros, it’s a lot easier to read!
Apologies, but I noticed one existing issue🙇.
In the current :dev
profile of project.clj
, the *unchecked-math*
under :global-vars
is set to :warn-on-boxed
. However, if this is disabled, an IllegalArgumentException
occurs, causing the tests to fail. I guess some occurrences of byte
fn call in this namespace need to be replaced with unchecked-byte
. Since *unchecked-math*
was enabled during CI tests, this issue went unnoticed until now. You can verify it with the following command:
lein update-in : assoc :global-vars '^:replace {*warn-on-reflection* nil, *unchecked-math* nil}' -- test
On a side note, when nrepl/nrepl
>= 1.3.0
is included as a dependency, running lein repl
seems to overwrite the :global-vars
settings with falsy values after loading the libraries, which ended up revealing the issue.
`(aget ~(with-meta `(aget ~syms ~i) {:tag 'objects}) ~j)) | ||
|
||
(defn- encode-payload1 ^long [^ByteBuffer in ^objects syms ^ByteBuffer out] | ||
(let [raw-size (.remaining in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec says We do not permit Order-1 encoding of data streams smaller than 4 bytes
but it seems that the current implementation actually handles short input data well.
Do you think we should warn if raw-size
is less than 4?
(->> (.getBytes "ab")
bb/make-lsb-byte-buffer
(rans/encode 1)
bb/make-lsb-byte-buffer
rans/decode
String.
(= "ab"))
;; => true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing that out! I had overlooked that part of the CRAM specification 🙇
It seems that in htslib and htsjdk, inputs smaller than 4 bytes are encoded as Order 0, even if Order 1 encoding is specified. In practice, this implicit fallback to Order 0 for such small inputs doesn’t seem to cause significant issues, so it appears to be a reasonable approach. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking into the behavior of other tools!
As you mentioned, I think implicit falling back to Order-0 works just fine! 👍👍
Thank you for the comment! I hadn’t noticed the overflow issue with the Failed tests$ lein test
...
Ran 372 tests containing 3830 assertions.
1 failures, 17 errors.
$ lein test | grep 'lein test :only'
lein test :only cljam.algo.cram-indexer-test/cram-without-alignments-test
lein test :only cljam.io.cram-test/writer-with-reference-embedding-test
lein test :only cljam.io.cram-test/writer-test
lein test :only cljam.io.cram-test/writer-index-options-test
lein test :only cljam.io.cram.data-series-test/build-tag-encoders-test
lein test :only cljam.io.cram.data-series-test/build-tag-decoders-test
lein test :only cljam.io.cram.encode.record-test/encode-slice-records-test
lein test :only cljam.io.cram.encode.structure-test/encode-block-test
lein test :only cljam.io.cram.encode.structure-test/encode-slice-header-block-test
lein test :only cljam.io.cram.encode.structure-test/encode-eof-container-test
lein test :only cljam.io.cram.encode.structure-test/encode-cram-header-container-test
lein test :only cljam.io.cram.encode.structure-test/encode-compression-header-block-test
lein test :only cljam.io.cram.itf8-test/encode-itf8-test
lein test :only cljam.io.util-test/writer-predicates-test
lein test :only cljam.io.util.byte-buffer-test/allocate-msb-byte-buffer-test
lein test :only cljam.io.util.byte-buffer-test/allocate-lsb-byte-buffer-test
lein test :only cljam.io.util.byte-buffer-test/allocate-lsb-byte-buffer-test
lein test :only cljam.io.util.lsb.io-stream-test/about-data-output-stream
$ |
Thank you for updating! The commit 46b2f73 perfectly fixes the issue! 👍
I really appreciate you taking care of this in a separate PR! ❤️ |
This PR implements the rANS (rans4x8) encoder to allow the CRAM writer to use it for block compression.
With this PR, the default compressor for the following data series will be updated to rans4x8:
BF
CF
RI
RL
AP
RG
NS
TS
BA
QS
You can override the compressor for specific data series or tags by specifying
:r4x8-o0
(Order 0) or:r4x8-o1
(Order 1) in the:ds-compressor-overrides
/:tag-compressor-overrides
options.For example, to set the compressor of the
FN
data series to rANS (Order 0) and theTL
data series to rANS (Order 1), you can configure the writer as follows: