Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure discussion on VCF position is correct #180

Open
percyfal opened this issue Nov 21, 2024 · 13 comments
Open

Ensure discussion on VCF position is correct #180

percyfal opened this issue Nov 21, 2024 · 13 comments

Comments

@percyfal
Copy link
Contributor

When converting the VCF spruce data to Zarr I noticed that even the longer chromosomes (>1.2Gbp) are encoded as 32-bit integers. In the text I wrote that the maximum value of VCF POS is ~537Mbp (God knows where I got that from), but the maximum number for a signed 32-bit integer is 2,147,483,647. The reason we split up chromosomes may therefore be another than an inherent limitation of VCF.

@jeromekelleher
Copy link
Contributor

I see - so 1.2Gb is still well within the signed 32 bit limit. Good catch!

@percyfal
Copy link
Contributor Author

percyfal commented Nov 21, 2024

In the current iteration, there is a draft introductory paragraph to the VCF POS problem. I could edit it but still describe how we convert the coordinate system and provide a notebook, or is this outside the scope of the paper as this is more related to the processing of Zarr files rather than conversion?

@jeromekelleher
Copy link
Contributor

Could you store all of the variants you have in a single VCF file? I.e., will things break when the total number of variants is > 2**31? It would be nice to make the point about these limitations but it's less concrete when we don't have any actual example.

@jeromekelleher
Copy link
Contributor

I'm not sure if there is a documented maximum of these things anywhere, we'd probably need to go poking in the htslib source.

@percyfal
Copy link
Contributor Author

Could you store all of the variants you have in a single VCF file? I.e., will things break when the total number of variants is > 2**31? It would be nice to make the point about these limitations but it's less concrete when we don't have any actual example.

I could try to concatenate all the files, either with or without converting the coordinates. I'll launch a job and see where it takes me.

@percyfal
Copy link
Contributor Author

To speed things up, I selected two samples from each VCF and concatenated into one big VCF (without coordinate conversion). Works fine and running bcftools stats shows that there are 3745170452 variants - so that doesn't seem to be an issue. I'll do a similar test by updating coordinates, but I doubt that will change anything. I'm communicating with my colleague who did the chunking to see what his motivation was to do so (could be motivated by some downstream application).

@jeromekelleher
Copy link
Contributor

Nice, thanks for chasing this one up @percyfal

@percyfal
Copy link
Contributor Author

@jeromekelleher my colleague got back to me and the issue was the limitations imposed by the bam index. From the BAM specification, 5.1.1:

In the BAI format, each bin may span 229, 226, 223, 220, 217 or 214 bp. Bin 0 spans a 512Mbp region, bins
1–8 span 64Mbp, 9–72 8Mbp, 73–584 1Mbp, 585–4680 128kbp, and bins...

However, it also says

The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as
are supported by SAM and BAM

I don't have any good answer why BAI was preferred over CSI.

@jeromekelleher
Copy link
Contributor

Aha, very good.

Do you know of any species that definitely do overflow the 32 bit limit, and maybe chase up some reference on them? We could modify the discussion a bit to discuss them, after mentioning that the chrs were chunked up in this way because of the bai index.

@percyfal
Copy link
Contributor Author

Well, at least the mistletoe should have some chromosomes that are 9-10Gbp, clearly larger than the 2Gbp limit:

https://www.darwintreeoflife.org/news_item/2022-the-year-we-built-the-biggest-genome-in-britain-and-ireland/

@percyfal
Copy link
Contributor Author

Actually, the 32-bit limit is pretty obvious from the NCBI assembly stats; see Chromosomes section in https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_963277665.1/

And, the ToL article has the following quote:

For example, although genomes as large as 100 Gbp can be uploaded, the databases cannot take individual continuous sequences of DNA larger than 2.14 Gbp. Since the mistletoe chromosomes are so large, they will need to be split into six pieces each, but still all be part of the same genome submission.

Still haven't found a reference though, not sure the paper has been published yet

@jeromekelleher
Copy link
Contributor

Very nice! I can't track down the genome note paper for it (presumably would be linked here?) but we can just cite that web page if we need to.

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Nov 25, 2024

So we'd say something like

While the chromosomes of this species is large enough to pose problems for older indexing schemes, some species have genomes that are too large for 32 bit numbers which bcftools requires. For example, Misteltoe has .... [cite DTOL website].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants