Pysam .csi index support & how to deal with large references #1278

Open
RNieuwenhuis opened this issue Apr 10, 2024 · 0 comments

Comments

@RNieuwenhuis

Hi,

CrossMap is my favourite liftover tool because of its versatility, and lately I've tried to use it to lift over some BAM files aligned to a large reference genome. The largest chromosome is 2,446,554,542 bp long.

I got this error:

Traceback (most recent call last):
  File "/home/WUR/nieuw133/miniconda3/envs/Crossmap/bin/CrossMap.py", line 281, in <module>
    crossmap_bam_file(mapping = mapTree, chainfile = chain_file, infile = in_file, outfile_prefix = out_file, chrom_size = targetChromSizes, IS_size = args.insert_size, IS_std = args.insert_size_stdev, fold = args.insert_size_fold, addtag = args.add_tags, cstyle = args.cstyle)
  File "/home/WUR/nieuw133/miniconda3/envs/Crossmap/lib/python3.10/site-packages/cmmodule/mapbam.py", line 406, in crossmap_bam_file
    new_alignment.next_reference_start =  read2_maps[1][1]
  File "pysam/libcalignedsegment.pyx", line 1346, in pysam.libcalignedsegment.AlignedSegment.next_reference_start.__set__
OverflowError: value too large to convert to int32_t
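
As a minimal sketch of why this setter fails (it uses only Python's standard `struct` module, not pysam itself, and the helper name `fits` is made up for illustration): a position near the end of the largest chromosome does not fit in a signed 32-bit field, though it would fit in an unsigned one.

```python
import struct

# Length of the largest chromosome reported above; a mate position near the
# end of that chromosome is of the same magnitude.
value = 2_446_554_542


def fits(fmt: str, number: int) -> bool:
    """Return True if `number` can be packed with the given struct format."""
    try:
        struct.pack(fmt, number)
        return True
    except struct.error:
        return False


fits_signed = fits("<i", value)    # little-endian signed 32-bit (int32_t)
fits_unsigned = fits("<I", value)  # little-endian unsigned 32-bit (uint32_t)

print("fits int32_t: ", fits_signed)
print("fits uint32_t:", fits_unsigned)
```

The error message suggests the limit is hit in the int32_t field itself, before any index is written.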

Searching the pysam history, it seems that reading .csi-indexed BAM files was implemented at some point, but writing was not?

In general, I am confused by the different limits involved. The default .bai index supports a maximum reference (SQ) length of 2^29, which can be circumvented by using a .csi index, for which I could not find a specified maximum. However, my largest chromosome also overflows the int32_t maximum of 2,147,483,647, which seems to exceed the limit in the SAM/BAM specification itself. It seems many tools will be stuck with this limit because changing it would break backwards compatibility? Is there a specific reason why an unsigned int32 was not used?
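
To put the numbers side by side, here is a small sketch of the arithmetic. It assumes, as I understand the CSI format, that an index with parameters `min_shift` and `depth` spans coordinates up to 2^(min_shift + 3*depth), and that .bai corresponds to the fixed case min_shift=14, depth=5. The function names are made up for illustration.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647
UINT32_MAX = 2**32 - 1  # 4,294,967,295
BAI_MAX = 2**29         # 536,870,912


def csi_span(min_shift: int, depth: int) -> int:
    """Coordinate span of a binning index (assumed formula, see lead-in)."""
    return 2 ** (min_shift + 3 * depth)


def min_csi_depth(length: int, min_shift: int = 14) -> int:
    """Smallest depth whose span covers a reference of the given length."""
    depth = 0
    while csi_span(min_shift, depth) < length:
        depth += 1
    return depth


largest = 2_446_554_542  # largest chromosome in my reference

print("exceeds .bai limit:  ", largest > BAI_MAX)
print("exceeds int32_t max: ", largest > INT32_MAX)
print("fits in uint32_t:    ", largest <= UINT32_MAX)
print("CSI depth needed:    ", min_csi_depth(largest))
```

If that formula is right, the index itself is not the bottleneck here; the 32-bit signed position fields are.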

What is the general way to deal with such large genomes? Is there a standardized approach? Are there ongoing discussions about this? Any references are much appreciated.
