Ensure discussion on VCF position is correct #180

percyfal · 2024-11-21T10:53:18Z

When converting the VCF spruce data to Zarr I noticed that even the longer chromosomes (>1.2Gbp) are encoded as 32-bit integers. In the text I wrote that the maximum value of VCF POS is ~537Mbp (God knows where I got that from), but the maximum number for a signed 32-bit integer is 2,147,483,647. The reason we split up chromosomes may therefore be another than an inherent limitation of VCF.

jeromekelleher · 2024-11-21T11:14:55Z

I see - so 1.2Gb is still well within the signed 32 bit limit. Good catch!

percyfal · 2024-11-21T12:35:54Z

In the current iteration, there is a draft introductory paragraph to the VCF POS problem. I could edit it but still describe how we convert the coordinate system and provide a notebook, or is this outside the scope of the paper as this is more related to the processing of Zarr files rather than conversion?

jeromekelleher · 2024-11-21T13:12:50Z

Could you store all of the variants you have in a single VCF file? I.e., will things break when the total number of variants is > 2**31? It would be nice to make the point about these limitations but it's less concrete when we don't have any actual example.

jeromekelleher · 2024-11-21T13:14:08Z

I'm not sure if there is a documented maximum of these things anywhere, we'd probably need to go poking in the htslib source.

percyfal · 2024-11-21T15:22:34Z

Could you store all of the variants you have in a single VCF file? I.e., will things break when the total number of variants is > 2**31? It would be nice to make the point about these limitations but it's less concrete when we don't have any actual example.

I could try to concatenate all the files, either with or without converting the coordinates. I'll launch a job and see where it takes me.

percyfal · 2024-11-22T09:54:47Z

To speed things up, I selected two samples from each VCF and concatenated into one big VCF (without coordinate conversion). Works fine and running bcftools stats shows that there are 3745170452 variants - so that doesn't seem to be an issue. I'll do a similar test by updating coordinates, but I doubt that will change anything. I'm communicating with my colleague who did the chunking to see what his motivation was to do so (could be motivated by some downstream application).

jeromekelleher · 2024-11-22T14:57:57Z

Nice, thanks for chasing this one up @percyfal

percyfal · 2024-11-25T08:41:18Z

@jeromekelleher my colleague got back to me and the issue was the limitations imposed by the bam index. From the BAM specification, 5.1.1:

In the BAI format, each bin may span 229, 226, 223, 220, 217 or 214 bp. Bin 0 spans a 512Mbp region, bins
1–8 span 64Mbp, 9–72 8Mbp, 73–584 1Mbp, 585–4680 128kbp, and bins...

However, it also says

The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as
are supported by SAM and BAM

I don't have any good answer why BAI was preferred over CSI.

jeromekelleher · 2024-11-25T09:01:44Z

Aha, very good.

Do you know of any species that definitely do overflow the 32 bit limit, and maybe chase up some reference on them? We could modify the discussion a bit to discuss them, after mentioning that the chrs were chunked up in this way because of the bai index.

percyfal · 2024-11-25T09:14:26Z

Well, at least the mistletoe should have some chromosomes that are 9-10Gbp, clearly larger than the 2Gbp limit:

https://www.darwintreeoflife.org/news_item/2022-the-year-we-built-the-biggest-genome-in-britain-and-ireland/

percyfal · 2024-11-25T09:22:21Z

Actually, the 32-bit limit is pretty obvious from the NCBI assembly stats; see Chromosomes section in https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_963277665.1/

And, the ToL article has the following quote:

For example, although genomes as large as 100 Gbp can be uploaded, the databases cannot take individual continuous sequences of DNA larger than 2.14 Gbp. Since the mistletoe chromosomes are so large, they will need to be split into six pieces each, but still all be part of the same genome submission.

Still haven't found a reference though, not sure the paper has been published yet

jeromekelleher · 2024-11-25T09:41:28Z

Very nice! I can't track down the genome note paper for it (presumably would be linked here?) but we can just cite that web page if we need to.

jeromekelleher · 2024-11-25T09:44:46Z

So we'd say something like

While the chromosomes of this species is large enough to pose problems for older indexing schemes, some species have genomes that are too large for 32 bit numbers which bcftools requires. For example, Misteltoe has .... [cite DTOL website].

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure discussion on VCF position is correct #180

Ensure discussion on VCF position is correct #180

percyfal commented Nov 21, 2024

jeromekelleher commented Nov 21, 2024

percyfal commented Nov 21, 2024 •

edited

Loading

jeromekelleher commented Nov 21, 2024

jeromekelleher commented Nov 21, 2024

percyfal commented Nov 21, 2024

percyfal commented Nov 22, 2024

jeromekelleher commented Nov 22, 2024

percyfal commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024

percyfal commented Nov 25, 2024

percyfal commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024 •

edited

Loading

Ensure discussion on VCF position is correct #180

Ensure discussion on VCF position is correct #180

Comments

percyfal commented Nov 21, 2024

jeromekelleher commented Nov 21, 2024

percyfal commented Nov 21, 2024 • edited Loading

jeromekelleher commented Nov 21, 2024

jeromekelleher commented Nov 21, 2024

percyfal commented Nov 21, 2024

percyfal commented Nov 22, 2024

jeromekelleher commented Nov 22, 2024

percyfal commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024

percyfal commented Nov 25, 2024

percyfal commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024

jeromekelleher commented Nov 25, 2024 • edited Loading

percyfal commented Nov 21, 2024 •

edited

Loading

jeromekelleher commented Nov 25, 2024 •

edited

Loading