
VG STATS: 2000 Properly Paired Inquiry #4434

Open
AbdelF6 opened this issue Nov 4, 2024 · 9 comments


AbdelF6 commented Nov 4, 2024

1. What were you trying to do?
I ran vg stats on 3 files after aligning them with vg giraffe against the T2T-CHM13 reference.

2. What did you want to happen?
I wanted vg giraffe to use paired-end mapping and the properly paired values to be accurate.

3. What actually happened?
Got this error from vg giraffe:

```
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
warning[vg::giraffe]: Falling back on single-end mapping
```

For vg stats, all 3 files had a properly paired value of 2000.

After removing duplicates from the files using samtools markdup (sketched at the end of this report), no errors about clustering reads came up. I was wondering what the 2000 value means (is it a default output when single-end mapping occurs?). Also, is there a command in vg that can remove duplicates, or notify the user of duplicates in a file?

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen?
Files can be shared for download, and the command can be tested on them.

6. What does running vg version say?
vg version v1.60.0 "Annicco"

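For reference, here is a sketch of the usual samtools markdup sequence mentioned under question 3 (not necessarily the exact commands used here; file names are placeholders, and it assumes the alignments are already in BAM form):

```sh
# Name-sort so mates are adjacent, let fixmate add the mate-score
# tags that markdup needs, re-sort by position, then drop duplicates.
samtools sort -n -o namesorted.bam input.bam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o possorted.bam fixmate.bam
samtools markdup -r possorted.bam dedup.bam
```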
adamnovak (Member) commented

I think the feature request for a vg tool for marking/removing duplicates might already be tracked in #2662? The request is also mentioned in #3283, so people definitely want it.

Giraffe insists that your paired-end fragment distribution (which you can set manually with --fragment-mean and --fragment-stdev) produce a maximum fragment size (of mean + 2 standard deviations) that is longer than the first read's length + 50 bp, and also longer than the clustering --distance-limit (default 200 bp). If you have oddly long reads, or oddly short fragments (so that e.g. your reads are always sequencing the whole fragment), this might not be the case.
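Written out numerically, the check looks roughly like this (a sketch of the arithmetic, not vg's actual code; the example means, stdevs, and read lengths are hypothetical):

```sh
# Constraint (approximately):
#   mean + 2*stdev  >  max(first read length + 50, --distance-limit)
# A typical short-read library with 150 bp reads passes:
echo '350 + 2*60' | bc   # 470 > max(150 + 50, 200) = 200 -> paired mode
# A very tight, short fragment distribution fails:
echo '114 + 2*33' | bc   # 180 < 200 -> falls back to single-end mapping
```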

It sounds like the fragment length distribution, as learned from the non-deduplicated input reads, didn't satisfy this constraint, which is why Giraffe produced that Cannot cluster reads with a fragment distance smaller than read distance message. I'm not sure why that problem would go away when you removed duplicate read pairs from the input; maybe that also changed the order of reads in the file and made the statistics come out in a way Giraffe can handle? (Are you sure the correct pairs of reads are being fed in as pairs?)

What was the full warning message from Giraffe? It should have printed a couple more lines giving statistics about the fragment length distribution it observed: Fragment length distribution: mean=XXX, stdev=XXX and Fragment distance limit: XXX, read distance limit: XXX.

What is the fragment length distribution for this sequencing run supposed to be? If you actually are trying to run data where the fragments are meant to be very short relative to the reads, either we have to change the clusterer to not have this constraint, or you need to lie to Giraffe about your fragment length distribution with --fragment-mean and --fragment-stdev.
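Overriding the learned distribution might look something like this (a sketch; the index and read file names are placeholders, and the numbers would need to match your actual library):

```sh
vg giraffe -Z graph.gbz -m graph.min -d graph.dist \
    -f reads_1.fq.gz -f reads_2.fq.gz \
    --fragment-mean 300 --fragment-stdev 75 > mapped.gam
```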

As for the statistics showing 2000 reads marked properly paired, I think what is happening is that the initial 1000 pairs we map with loose constraints, to try to learn the fragment length distribution, are being marked properly paired. But all the reads mapped after that fall back to single-ended mapping because of the problem the warning indicates, and none of those are marked properly paired. So the output really does contain exactly 2000 reads marked properly paired.
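One way to check this on the output (a sketch; mapped.gam is a placeholder name):

```sh
# vg stats -a prints alignment statistics for a GAM, including a
# properly-paired count; with the single-end fallback described
# above, that count would stay stuck at the ~1000 calibration
# pairs, i.e. 2000 reads.
vg stats -a mapped.gam
```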

AbdelF6 (Author) commented Nov 4, 2024

@adamnovak Hi, thank you for your response!

The full warning message was:

```
Cannot cluster reads with a fragment distance smaller than read distance
Fragment length distribution: mean: 114.141, stdev = 33.3627
Fragment distance limit: 180, read distance limit: 200
Falling back on single-end mapping
```

I am not setting a fragment length distribution for this sequencing run, just using the default. Thank you for this suggestion!

adamnovak (Member) commented

@xchang1 I feel like we might be foggy on whether the "fragment length" is measured between the inner or outer ends of the reads in the fragment, because it makes sense to have 150 bp reads on a fragment with 114 bp between their inner ends, and it's weird that the clusterer wouldn't like that.

xchang1 (Contributor) commented Nov 6, 2024

The fragment length should always be the distance between the outer ends.

The problem might be that we find the minimum distance between the outer ends, rather than the minimum distance between the inner ends plus the lengths of the reads. So the topologically minimum distance might end up being smaller than the lengths of the reads if there's a big deletion edge. I wouldn't expect that to happen a lot though.
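A toy example of how that could happen (a hypothetical graph, not the HPRC one):

```sh
# Segment 2 (30 bp) sits between segments 1 and 3 (10 bp each),
# and the edge 1->3 is a deletion that bypasses it.
cat > toy.gfa <<'EOF'
H	VN:Z:1.0
S	1	AAAAAAAAAA
S	2	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
S	3	GGGGGGGGGG
L	1	+	2	+	0M
L	2	+	3	+	0M
L	1	+	3	+	0M
EOF
# For a read pair anchored on segments 1 and 3, the outer ends are
# 50 bp apart along the path 1->2->3, but the minimum topological
# distance through the deletion edge is only 20 bp, so fragment
# lengths estimated through it come out implausibly small.
```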

@AbdelF6 what graph are you using? Is it just the CHM13 reference?

AbdelF6 (Author) commented Nov 6, 2024

Hi @xchang1, yes, I am using the T2T-CHM13 reference with vg. I downloaded the indexes from this GitHub link: https://github.com/human-pangenomics/hpp_pangenome_resources/blob/main/hprc-v1.0-mc.md.

xchang1 (Contributor) commented Nov 10, 2024

Ah, OK. I wouldn't expect that problem on the HPRC minigraph-cactus graph.

Are you shuffling your reads? Since we use the first thousand reads to figure out the fragment length distribution, those need to be representative of the distribution we expect, but some sequencers put the bad reads first and mess up the distribution finder. Usually, if that's the problem, Giraffe will complain about bad read mappings, and it's pretty weird that it found such a tight distribution, but maybe your reads are in a weird order.

If that's not the problem, are you able to share your reads with me? Just the first thousand should be enough for me to try to diagnose the problem.
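If shuffling is needed, one common idiom that keeps mates in sync looks like this (a sketch; assumes bash with GNU coreutils, and uncompressed reads_1.fq / reads_2.fq are placeholder names):

```sh
# Flatten each 4-line FASTQ record to one line, pair the two files
# side by side, shuffle the pairs together, then split them back out.
paste <(paste - - - - < reads_1.fq) <(paste - - - - < reads_2.fq) \
  | shuf \
  | awk -F'\t' '{OFS="\n";
                 print $1,$2,$3,$4 > "shuffled_1.fq";
                 print $5,$6,$7,$8 > "shuffled_2.fq"}'
```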

xchang1 (Contributor) commented Nov 11, 2024

Oh, also: how long are your reads? Maybe they're just short and --distance-limit 200 is too big.

AbdelF6 (Author) commented Nov 11, 2024

Hi @xchang1, the reads are paired-end, around 75 nucleotides each. But we have run vg giraffe on other files, all with the same read lengths, and we did not get this 2000 properly paired statistic or the single-end mapping warning for those runs.

If the read length is around 75, what would you recommend we set the distance limit to?

xchang1 (Contributor) commented Nov 11, 2024

Maybe try 100?

I would still recommend shuffling your reads if they aren't already shuffled. It's a bit weird that the fragment length it's finding is smaller than the sum of the read lengths. I guess the reads could be overlapping, though.
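Concretely, that would look something like this (a sketch; the index and read file names are placeholders):

```sh
vg giraffe -Z graph.gbz -m graph.min -d graph.dist \
    -f reads_1.fq.gz -f reads_2.fq.gz \
    --distance-limit 100 > mapped.gam
```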
