markdups fails with Range [0, -1) out of bounds error #99

Closed
bounlu opened this issue Sep 11, 2024 · 5 comments

bounlu commented Sep 11, 2024

Referring to #37 (comment)

Nextflow run:

#!/bin/bash

nextflow run nf-core/oncoanalyser \
-latest \
-profile docker \
--mode 'targeted' \
--genome 'GRCh38_hmf' \
--panel 'tso500' \
--input 'samplesheet_oncoanalyser.csv' \
--outdir 'oncoanalyser/results/' \
-work-dir 'oncoanalyser/work/' \
-c 'custom_local.config' \
-r master \
-resume
Unable to find image 'quay.io/biocontainers/hmftools-mark-dups:1.1.7--hdfd78af_0' locally
1.1.7--hdfd78af_0: Pulling from biocontainers/hmftools-mark-dups
ca7680d1025d: Already exists
bd9ddc54bea9: Already exists
b5e822314cc7: Pulling fs layer
b5e822314cc7: Verifying Checksum
b5e822314cc7: Download complete
b5e822314cc7: Pull complete
Digest: sha256:9040ac4af8fb438148a50e03659407a3ccc1de88dfa6d424cb18eae38a330b2a
Status: Downloaded newer image for quay.io/biocontainers/hmftools-mark-dups:1.1.7--hdfd78af_0
/usr/local/bin/markdups: line 6: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
06:00:24.766 [main] [INFO ] MarkDups version 1.1.7
06:00:24.826 [main] [INFO ] output(./)
06:00:25.171 [main] [INFO ] loaded 80309 unmapping regions from unmap_regions.38.tsv
06:00:25.172 [main] [INFO ] duplicate logic: UMIs
06:00:25.174 [main] [INFO ] sample(220024466) starting mark duplicates
06:00:26.293 [Thread-0] [ERROR] read(id(F350034942L2C002R05300375863) coords(chr1:45099-45194) cigar(4S96M) mate(chr1:45147) flags(99)) exception: java.lang.StringIndexOutOfBoundsException: Range [0, -1) out of bounds for length 28
java.lang.StringIndexOutOfBoundsException: Range [0, -1) out of bounds for length 28
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckFromToIndex(Preconditions.java:112)
	at java.base/jdk.internal.util.Preconditions.checkFromToIndex(Preconditions.java:349)
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:4914)
	at java.base/java.lang.String.substring(String.java:2876)
	at com.hartwig.hmftools.markdups.umi.UmiGroupBuilder.splitUmi(UmiGroupBuilder.java:505)
	at com.hartwig.hmftools.markdups.umi.UmiGroupBuilder.hasDuplexUmiMatch(UmiGroupBuilder.java:493)
	at com.hartwig.hmftools.markdups.umi.UmiGroupBuilder.collapseCoordinateGroup(UmiGroupBuilder.java:444)
	at com.hartwig.hmftools.markdups.umi.UmiGroupBuilder.processUmiGroups(UmiGroupBuilder.java:289)
	at com.hartwig.hmftools.markdups.common.DuplicateGroupBuilder.processDuplicateGroups(DuplicateGroupBuilder.java:248)
	at com.hartwig.hmftools.markdups.PartitionReader.accept(PartitionReader.java:357)
	at com.hartwig.hmftools.markdups.PartitionReader.accept(PartitionReader.java:43)
	at com.hartwig.hmftools.markdups.ReadPositionsCache.checkFlush(ReadPositionsCache.java:305)
	at com.hartwig.hmftools.markdups.ReadPositionsCache.storeInitialRead(ReadPositionsCache.java:166)
	at com.hartwig.hmftools.markdups.ReadPositionsCache.processRead(ReadPositionsCache.java:141)
	at com.hartwig.hmftools.markdups.PartitionReader.processSamRecord(PartitionReader.java:216)
	at com.hartwig.hmftools.markdups.BamReader.sliceRegion(BamReader.java:83)
	at com.hartwig.hmftools.markdups.PartitionReader.processRegion(PartitionReader.java:125)
	at com.hartwig.hmftools.markdups.PartitionThread.run(PartitionThread.java:61)
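
The exception reports a substring range of [0, -1) on a string of length 28, which is also the length of the read ID F350034942L2C002R05300375863. That is consistent with a delimiter search returning -1 because no UMI delimiter is present in the read name, although the trace alone does not prove it. The Python sketch below is a pre-flight check that could be run on a few read names before enabling the UMI flags. It is not MarkDups code: the function names are hypothetical, and it assumes the UMI is the last ':'-separated field of the read name, with the two duplex halves joined by '+'.

```python
def split_duplex_umi(read_name, field_delim=":", duplex_delim="+"):
    """Return (umi1, umi2) parsed from a read name, or None if absent.

    Assumption (not confirmed by the thread): the UMI is the last
    field_delim-separated field, and its halves are joined by duplex_delim.
    """
    umi = read_name.rsplit(field_delim, 1)[-1]
    idx = umi.find(duplex_delim)  # -1 when the delimiter is missing
    if idx <= 0 or idx == len(umi) - 1:
        return None
    return umi[:idx], umi[idx + 1:]


def find_reads_missing_umi(read_names, **kwargs):
    """List the read names from which no duplex UMI could be parsed."""
    return [n for n in read_names if split_duplex_umi(n, **kwargs) is None]


if __name__ == "__main__":
    names = [
        "F350034942L2C002R05300375863",              # read ID from the log
        "F350034942L2C002R05300375863:ACGTA+TTGCA",  # hypothetical tagged name
    ]
    # Only the untagged read ID is reported as missing a UMI.
    print(find_reads_missing_umi(names))
```

Any read name returned by find_reads_missing_umi would need its UMI added upstream before duplex UMI matching could be expected to work.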

I would suggest using a more established tool such as Picard MarkDuplicates to avoid such cases.

Collaborator

scwatts commented Sep 11, 2024

Hi @bounlu, the Hartwig MarkDups tool is preferred since it uses an algorithm optimised for the Hartwig toolkit and runs additional routines beyond duplicate marking. I understand your perspective, however.

To help investigate this issue, would you be able to share the .nextflow.log associated with the above error and a couple of reads (or alignments) that show your UMI data?

Author

bounlu commented Sep 16, 2024

I can confirm that the error is due to the UMIs in my data: the process works when I run it without the UMI flags '-umi_enabled -umi_duplex -umi_duplex_delim +'. The UMI sequences are not included in the read ID, so markdups cannot parse them.

However, it works fine when I provide the UMI pattern to the sarek pipeline as below:
--umi_read_structure '5M2S+T 5M2S+T'

I see that oncoanalyser has no such parameter for providing the UMI pattern as input, so I am not sure how markdups handles the UMIs. sarek handles them in multiple steps using fgbio in this subworkflow.
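
For readers unfamiliar with the notation, the read structure '5M2S+T' follows the fgbio convention: 5 molecular barcode (UMI) bases, then 2 skipped bases, then the remaining bases as template. The two space-separated structures apply to read 1 and read 2. The Python sketch below illustrates that interpretation on a single sequence. It is not fgbio or sarek code, the function names are hypothetical, and only the M, S and T segment types are given a meaning.

```python
import re

# One segment is a length (digits, or '+' for "all remaining") plus a type letter.
_SEGMENT = re.compile(r"(\d+|\+)([A-Z])")


def parse_read_structure(structure):
    """Split a read structure such as '5M2S+T' into (length, kind) pairs.

    A length of None stands for '+', i.e. all remaining bases.
    """
    segments, pos = [], 0
    for match in _SEGMENT.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"unparseable read structure: {structure!r}")
        length, kind = match.groups()
        segments.append((None if length == "+" else int(length), kind))
        pos = match.end()
    if pos != len(structure) or not segments:
        raise ValueError(f"unparseable read structure: {structure!r}")
    return segments


def apply_read_structure(sequence, structure):
    """Return (umi, template) for one read under the given structure."""
    umi, template, offset = "", "", 0
    for length, kind in parse_read_structure(structure):
        end = len(sequence) if length is None else offset + length
        chunk = sequence[offset:end]
        if kind == "M":    # molecular barcode (UMI) bases
            umi += chunk
        elif kind == "T":  # template bases that go on to alignment
            template += chunk
        # Other kinds (e.g. 'S' for skip) are discarded in this sketch.
        offset = end
    return umi, template


if __name__ == "__main__":
    # Hypothetical 15-base read: 5 UMI bases, 2 skipped bases, 8 template bases.
    print(apply_read_structure("ACGTAGGTTTTCCCC", "5M2S+T"))
```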

Collaborator

scwatts commented Sep 16, 2024

The current processing of UMIs in oncoanalyser is:

  1. extract UMIs from FASTQ read sequence and place into read name with fastp
  2. align reads using bwa-mem2
  3. call consensus and mark duplicates with MarkDups

I'm working on better support for various UMI structures, which will be facilitated by the fastp options. Further details on the MarkDups algorithm are available here.
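
Step 1 above can be pictured with the following Python sketch, which trims a leading UMI from each mate of a read pair and appends both UMIs to the read names. It illustrates the idea only and is not fastp or oncoanalyser code. The function names are hypothetical, and the ':' and '+' delimiters are assumptions ('+' is borrowed from the -umi_duplex_delim flag quoted earlier in the thread), so the exact read-name layout produced by the pipeline may differ.

```python
def trim_umi(seq, qual, umi_len, skip=0):
    """Return (umi, trimmed_seq, trimmed_qual) for one read."""
    if len(seq) < umi_len + skip:
        raise ValueError("read is shorter than the UMI plus skipped bases")
    start = umi_len + skip
    return seq[:umi_len], seq[start:], qual[start:]


def _add_suffix(name, suffix):
    """Append suffix to the read ID, keeping any comment after the first space."""
    read_id, sep, comment = name.partition(" ")
    return f"{read_id}{suffix}{sep}{comment}"


def tag_pair(r1, r2, umi_len, skip=0, name_delim=":", duplex_delim="+"):
    """Move the UMIs of a read pair from the sequences into both read names.

    Each record is a (name, sequence, quality) tuple. The delimiters are
    assumptions made for illustration.
    """
    (name1, seq1, qual1), (name2, seq2, qual2) = r1, r2
    umi1, seq1, qual1 = trim_umi(seq1, qual1, umi_len, skip)
    umi2, seq2, qual2 = trim_umi(seq2, qual2, umi_len, skip)
    suffix = f"{name_delim}{umi1}{duplex_delim}{umi2}"
    return (
        (_add_suffix(name1, suffix), seq1, qual1),
        (_add_suffix(name2, suffix), seq2, qual2),
    )


if __name__ == "__main__":
    # Hypothetical pair: 5 UMI bases and 2 skipped bases lead each mate.
    r1 = ("read1 1:N:0", "ACGTAGGTTTT", "IIIIIIIIIII")
    r2 = ("read1 2:N:0", "TTGCACCAAAA", "IIIIIIIIIII")
    for record in tag_pair(r1, r2, umi_len=5, skip=2):
        print(record)
```

Once the UMIs are carried in the read name they survive alignment unchanged, which is what lets a later step group reads by UMI.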

scwatts closed this as completed Sep 16, 2024

Author

bounlu commented Sep 17, 2024

For fastp, I need to use the below parameters:

fastp -U --umi_loc=per_read --umi_len=7

Could you please point me to where I can specify these options in the pipeline?

Collaborator

scwatts commented Sep 17, 2024

In the 0.5.0 release you can achieve that configuration by setting --umi_length 7 for the oncoanalyser command.

I'm working on an enhancement that will allow more flexible configuration of UMI processing in fastp and MarkDups; the above will change slightly once those adjustments have been made in the dev branch.
