Support for haplotagged.bam? #10

Open
ymcki opened this issue Jul 25, 2023 · 1 comment
Labels
enhancement New feature or request

ymcki commented Jul 25, 2023

I am currently comparing different STR callers to see how well they perform on ONT's latest Q20+ data for HG002.
https://labs.epi2me.io/giab-2023.05/

I am using the HG002 T2T assembly v0.7, together with IGV, to establish a truth set.
https://github.com/marbl/HG002

I noticed that while NanoRepeat works quite well in most cases, it often calls only one STR allele when the two alleles of a heterozygous site differ by just one or two repeats. For example, for the GIPC2 CCG repeat:

$ more pass.cram.chr19-14496041-14496074-CCG.summary.txt
Summary_file=pass.cram.chr19-14496041-14496074-CCG.summary.txt Repeat_Region=chr19-14496041-14496074-CCG Method=GMM Num_Alleles=1 Num_Removed_Reads=0 Allele1_Num_Reads=22 Allele1_Repeat_Size=12
$ more pass.cram.chr19-14496041-14496074-CCG.repeat_size.txt
##Repeat_Region=chr19-14496041-14496074-CCG
#Read_Name Repeat_Size
5a621e4a-464b-4397-bb52-dd6868344ed8 12.0
32c058b8-463a-4963-98de-ed151215c897 12.0
5e640f40-af78-4c37-b75f-2a489e6d287d 10.0
07ba1eca-eed6-41b7-9657-041cf9478b30 10.0
09feeb1a-bacc-43a5-9246-51adbec0d4ad 12.0
f7e95eed-e085-480e-badf-2ff3dafb75c6 12.0
ce13d292-e06c-4216-88e2-01900709ffb4 12.0
f2c19bf6-8875-45c6-bfef-c1cc49c74e82 12.0
7993c22d-00f7-4da4-a39e-dd7836d214f4 10.0
36b5fa47-1ae5-4c95-9d7b-2f4209d2ce57 12.0
1e96d19d-0281-4225-9395-bf9a001d1f18 10.0
d2a355a4-e30e-4f83-96b6-731467528d6c 10.0
be856e49-b1dd-4b2a-bc05-d129a5dfbff9 10.0
622e9d4c-3ab6-4560-9396-65cb9bd73696 10.0
1daea057-60bb-449a-8027-26a8e536909b 10.0
8505aa3e-d5b3-4544-b98c-e492122ccd27 12.0
1c702495-d4f0-4f41-9622-fb9555950713 12.0
1655f006-34f8-43ed-8800-eef18aa9a22e 12.0
4dc8881f-bf77-4c2e-94c7-fcb6955872c5 12.0
b8c0f0d9-377e-4f31-9440-46a2926d4ed4 12.0
c0b6597d-e401-4c5d-8eec-a7e49a43dc39 11.0
8646a71c-6f96-49c3-8250-e20186acfbdf 10.0

The per-read sizes show that the call should be 10/12, but I suppose NanoRepeat plays it safe and calls 12/12 most of the time.
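To make the two clusters in the per-read sizes explicit, here is a minimal Python sketch that tallies reads per repeat size. It assumes only the two-column layout of the repeat_size.txt file shown above (header lines starting with "#", then read name and size); the function name tally_repeat_sizes is made up for illustration and is not part of NanoRepeat.

```python
from collections import Counter


def tally_repeat_sizes(lines):
    """Count reads per repeat size from repeat_size.txt-style lines.

    Assumed layout: '#'-prefixed header lines, then
    '<read_name> <repeat_size>' separated by whitespace.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        _read_name, size = line.split()[:2]
        counts[float(size)] += 1
    return counts


if __name__ == "__main__":
    # File name taken from the example above.
    with open("pass.cram.chr19-14496041-14496074-CCG.repeat_size.txt") as fh:
        for size, n_reads in sorted(tally_repeat_sizes(fh).items()):
            print(f"{size:g} repeats: {n_reads} reads")
```

On the 22 reads listed above, this gives 12 reads at 12 repeats, 9 reads at 10 repeats, and 1 read at 11 repeats.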

I think many of these calls could be resolved if the reads are haplotagged.

A haplotagged BAM can be created by running "whatshap haplotag". The idea is to use heterozygous sites to assign each read to one of the two possible haplotypes of a chromosome. whatshap adds an HP tag with a value of 1 or 2 to each aligned read it can assign. In the GIPC2 example above, the reads with 10 and 12 repeats are clearly assigned to different haplotypes.
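A rough sketch of how the HP tag could be combined with the per-read repeat sizes, in Python. This is only an illustration of the idea, not how NanoRepeat works or would implement it: the function names are hypothetical, the per-haplotype call is simply the most common size among that haplotype's reads, and reading the tags assumes pysam is installed and the input is a BAM (a CRAM would need mode "rc" and possibly a reference).

```python
from collections import Counter, defaultdict


def read_hp_tags(bam_path):
    """Map read name -> HP tag value from a haplotagged BAM.

    Reads without an HP tag (not assigned by whatshap) are skipped.
    """
    import pysam  # imported lazily so the function below runs without pysam

    tags = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.has_tag("HP"):
                tags[read.query_name] = read.get_tag("HP")
    return tags


def call_per_haplotype(repeat_sizes, hp_tags):
    """Return {haplotype: most common repeat size} over tagged reads.

    repeat_sizes: dict of read name -> repeat size
    hp_tags:      dict of read name -> haplotype (1 or 2)
    """
    by_hp = defaultdict(Counter)
    for name, size in repeat_sizes.items():
        hp = hp_tags.get(name)
        if hp is not None:  # untagged reads do not vote
            by_hp[hp][size] += 1
    return {hp: c.most_common(1)[0][0] for hp, c in sorted(by_hp.items())}


if __name__ == "__main__":
    # Toy input with made-up read names and haplotype assignments.
    sizes = {"r1": 12.0, "r2": 12.0, "r3": 10.0, "r4": 10.0, "r5": 11.0}
    tags = {"r1": 1, "r2": 1, "r3": 2, "r4": 2}  # r5 is untagged
    print(call_per_haplotype(sizes, tags))
```

With reads split this way, a one-off size such as the single 11.0 read only competes with reads from its own haplotype instead of blurring the two clusters together.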

Therefore, I think it would be great if NanoRepeat could also support haplotagged BAM files so that it can make better calls in these situations.

@fangli80 fangli80 added the enhancement New feature or request label Jul 25, 2023
fangli80 (Collaborator) commented

Thanks! I can add this function in the next release.
