GroupReadsByUmi and optical duplicates #1013

Open
miwalter opened this issue Oct 14, 2024 · 2 comments · May be fixed by #1029
@miwalter

Hi.

In a recent experiment we sequenced the same libraries on a MiSeq (random flow cell, FC) and a NovaSeq (patterned FC). The number of reads was similar, but the NovaSeq run had a 10x higher number of duplicate reads. So I'm wondering whether there is a way to deal with optical duplicates (ODs) on Illumina patterned flow cells when creating the UMI groups.

If I understand the documentation correctly, all reads with the same coordinates and UMI sequence are grouped, regardless of whether they are PCR or optical duplicates, and are later used to create a consensus call. In the attached example, there is a tag family with 14 read pairs. However, looking at their locations on the flow cell, several copies lie within a pixel distance of 2500 of each other, which is considered to indicate ODs on a patterned FC. Some OD clusters have 3-4 copies, while other members of the same UMI family have no ODs. This skews the representation of PCR/library prep errors, and the overall size of the UMI family is overestimated (after accounting for ODs, only 7 unique copies of the same UMI are left). Or do I need to remove optical duplicates first (e.g. with Picard) and then create my UMI consensus?
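To make the idea concrete, here is a rough sketch of how reads within one UMI family could be collapsed into optical-duplicate clusters. This is not how fgbio or Picard implement it. It assumes Illumina-style read names ending in `lane:tile:x:y`, treats two reads as ODs when they share lane and tile and differ by at most 2500 pixels on each axis, and links reads transitively. The function names `parse_location` and `optical_clusters` are made up for illustration.

```python
from collections import defaultdict


def parse_location(read_name):
    """Return (lane, tile, x, y) from a read name.

    Assumption: the name is colon-separated and ends with lane:tile:x:y,
    e.g. instrument:run:flowcell:lane:tile:x:y.
    """
    lane, tile, x, y = read_name.split(":")[-4:]
    return int(lane), int(tile), int(x), int(y)


def optical_clusters(read_names, max_dist=2500):
    """Group the reads of one UMI family into optical-duplicate clusters.

    Two reads are linked if they are on the same lane and tile and their
    x and y coordinates each differ by at most max_dist (an assumed rule).
    Links are transitive (single linkage via union-find).
    """
    locs = [parse_location(name) for name in read_names]
    parent = list(range(len(locs)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(locs)):
        for j in range(i + 1, len(locs)):
            lane_i, tile_i, xi, yi = locs[i]
            lane_j, tile_j, xj, yj = locs[j]
            same_tile = (lane_i, tile_i) == (lane_j, tile_j)
            if same_tile and abs(xi - xj) <= max_dist and abs(yi - yj) <= max_dist:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, name in enumerate(read_names):
        clusters[find(i)].append(name)
    return list(clusters.values())
```

The number of clusters returned would then be the OD-corrected family size, i.e. the count of unique copies of that UMI.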

Thank you very much for your comments.

image

@miwalter
Author

Here's the same UMI family after accounting for ODs:

image

@yfarjoun
Contributor

This is a nice idea that could also be used for other counts of UMI molecules, not only for library size estimation.

Questions:

  1. Let's assume that the counts in the histogram are produced by ignoring the optical duplicates. Is there an easy way to turn that histogram into a library size? I'm not aware of a tool that takes in a duplicate-set-size histogram and produces a library size (not that the calculation is especially complicated...)
  2. Are there other points in consensus calling where one might want to ignore the counts of optical duplicates? For example: the template counts for filtering?
  3. Should there be an option to mark reads as optical duplicates (like in Picard's MarkDuplicates)?
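On question 1, one possible calculation is sketched below. It is not an existing fgbio tool. It assumes the histogram maps family size (with optical duplicates already excluded) to the number of families of that size, and it solves the Lander-Waterman style relation C/X = 1 - exp(-N/X) for the library size X, where N is the total number of read pairs and C the number of distinct molecules observed. To my knowledge this is the same relation that Picard's library size estimate is based on, but treat that as an assumption. The function name `estimate_library_size` is hypothetical.

```python
import math


def estimate_library_size(histogram):
    """Estimate the number of distinct molecules in the library.

    histogram: dict mapping family size -> number of families of that size
    (assumed to be counted after excluding optical duplicates).
    Returns None when no duplicates were observed, because the relation
    C/X = 1 - exp(-N/X) then has no finite solution.
    """
    n_reads = sum(size * count for size, count in histogram.items())  # N
    n_unique = sum(histogram.values())                                # C
    if n_unique == 0 or n_reads <= n_unique:
        return None

    def f(x):
        # Expected distinct molecules seen for library size x, minus observed.
        return x * (1.0 - math.exp(-n_reads / x)) - n_unique

    # f is negative at x = C and increases towards N - C > 0, so bracket the root.
    lo = hi = float(n_unique)
    while f(hi) < 0:
        hi *= 2.0

    # Plain bisection; 100 halvings are ample for double precision.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `estimate_library_size({1: 50, 2: 30, 3: 20})` uses N = 170 read pairs and C = 100 distinct molecules and returns an estimate larger than 100.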
