-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476
Conversation
I'm not convinced by "memory", although I'm struggling to think of a better term. Memory seems weak, as I may (will!) have well over 4Gb spare to play with. Perhaps limited? I also don't like having some fields with their maximum sizes being displayed in the Value column, which for now is for fixed values "BAM\1" or default values "[-1]", given we have many other fields where the limits are written in the description. e.g. "Reference sequence ID, −1 ≤ refID < n ref" or "−1 ≤ next refID < n ref". We ought to be consistent here. I think it's also actually useful to be able to state for example that Why is If I had to choose, I think for brevity sake and that an extra column may be unwieldy, I'd put the two 2^{31} limits into the description field. Note this is also formatted poorly as the exponent is overflowing the cell bounding slightly (although the same was already happening in footnotes). |
It's memory marked because you're gonna load it into memory. The thinking was: headers are 1-set-per-file so mostly anything goes, but for read records once you've got a hundred or more in memory at once, that's starting to be a lot of memory. However most things do mostly just stream one at a time… But you're quite right that the memory constraint is the minor consideration; the main thing for these field sizes is simply the reality of the data represented. I'll carry on thinking about how to express that (and improve this overall, as discussed), even if it's just “Note A” in the table with the explanation in the note. |
6df82fa
to
2b863b3
Compare
Changed memory to limited, and removed the lifting of |
Thanks for the update. I like it and am happy for this to go in as is. |
…amtools#476) These fields are by definition unsigned, but add notes on their usual ranges -- which mostly do not extend beyond the int32_t maximum and mostly are much smaller. Co-authored-by: John Marshall <[email protected]>
@yfarjoun? I'd like to merge this this week if possible. |
sorry for the holdup 👍 from me. |
Thanks all; merged now (and SAMv1.pdf rebuilt with this and I have the 4 GiB |
Revisit PR #460 (see conversation there), adding notes in the existing Value table columns indicating length/count fields whose maximum values are still (well) within the
int32_t
range convenient for e.g. Java arrays. This clarifies that, while the fields are naturally described asuint32_t
, they are not required to be implemented using such a type if that is inconvenient in your implementation language.OTOH this PR explicitly raises thel_text
limit from 2GiB to 4GiB:[Decided during April 2nd meeting to leave
l_text
as is for this PR, perhaps to be revisited separately later.]HTSlib already handles 4GiB of BAM header text.
HTSJDK represents SAM/BAM/etc headers as a collection of separate headers so no representation change is needed to deal with this
l_text
range increase. However BAMFileReader/Writer.java'sreadHeader()
andwriteHeader()
currently operate on the full header text as aString
so would require some rearranging internally to deal with files with such large headers.