Fix Segmentation Fault Under Certain Conditions #1

evolvedmicrobe · 2018-12-05T08:17:50Z

This PR was motivated by a SIGSEGV that STAR produced when aligning a particular read against the Rabbit genome. Explanation of the problem and fix below.

The issue was as follows. In trying to map the read, the aligner routine within ReadAlign::maxMappableLength2strands was trying to find exact matches for the end of the read. The first step in this process is looking up a given prefix length in the suffix array index. This index contains minimum (and maximum) locations for prefixes up to size L_max (14 in this case). For this particular read segment, no prefix of size 14 was found, so the program re-attempted to search the index with a prefix of size 13, which it did find. Because a prefix of size 13 and not 14 was found, the program knew there was no match of size 14 and so tried to set the end of the search range in the suffix array (indStartEnd[1]) using a short-cut that avoids the binary search. In particular, it set the End of the bounded range to the value of one minus the next element in the index.

However, it appears not all elements prior to the start of the next prefix are guaranteed to match the current prefix. In this particular case, that next element pointed to a position 10 bases before the end of the genome, which meant it could not produce an alignment 13 bases in length. Later on inside ReadAlign::stitchPieces and its call to sjSplitAlign this led to the SIGSEGV when while calculating the alignment start position it inferred to be 3 basepairs past the end of the genome, giving an offset position of -3 which when converted to an unsigned integer has value 18446744073709551613, and consequentally the calculated value for isj in the code line P->sjDstart[isj] referred to unmapped memory, leading to the SIGSEGV.

I believe this can happen as the suffix array can be sorted as follows:

ACGTACGTACGTAA # Matching prefix of size 13
ACGTACGTACGTAN # Suffix array element with padded-bases/Ns, but not a valid 13-mer alignment
ACGTACGTACGTCA # Next prefix of size 13

Since it seemed the short-cut avoiding the binary-search was not reliable, I removed it with this commit. Since it also appeared that this was a general problem where one could not guarantee the initial range guessed held matching prefixes up to the length tested (but rather only the length tested minus 1), I also changed the binary search to account for this. Testing showed for this case it restored the correct behavior in the inferred bounds, such that the alignment length was consistent across the start and end of the array element range set.

This PR was motivated by a SIGSEGV that STAR produced when aligning a particular read against the Rabbit genome. Explanation of the problem and fix below. The issue was as follows. In trying to map the read, the aligner routine within `ReadAlign::maxMappableLength2strands` was trying to find exact matches for the end of the read. The first step in this process is looking up a given prefix length in the suffix array index. This index contains minimum (and maximum) locations for prefixes up to size L_max (14 in this case). For this particular read segment, no prefix of size 14 was found, so the program re-attempted to search the index with a prefix of size 13, which it did find. Because a prefix of size 13 and not 14 was found, the program knew there was no match of size 14 and so tried to set the end of the search range in the suffix array (`indStartEnd[1]`) using a short-cut that avoids the binary search. In particular, it set the End of the bounded range to the value of one minus the next element in the index. However, it appears not all elements prior to the start of the next prefix are guaranteed to match the current prefix. In this particular case, that next element pointed to a position 10 bases before the end of the genome, which meant it could not produce an alignment 13 bases in length. Later on inside `ReadAlign::stitchPieces` and its call to `sjSplitAlign` this led to the SIGSEGV when while calculating the alignment start position it inferred to be 3 basepairs past the end of the genome, giving an offset position of `-3` which when converted to an unsigned integer has value `18446744073709551613`, and consequentally the calculated value for `isj` in the code line `P->sjDstart[isj]` referred to unmapped memory, leading to the SIGSEGV. I believe this can happen as the suffix array can be sorted as follows: ACGTACGTACGTAA # Matching prefix of size 13 ACGTACGTACGTAN # Suffix array element with padded-bases/Ns, but not a valid 13-mer alignment ACGTACGTACGTCA # Next prefix of size 13 Since it seemed the short-cut avoiding the binary-search was not reliable, I removed it with this commit. Since it also appeared that this was a general problem where one could not guarantee the initial range guessed held matching prefixes up to the length tested (but rather only the length tested minus 1), I also changed the binary search to account for this. Testing showed for this case it restored the correct behavior in the inferred bounds, such that the alignment length was consistent across the start and end of the array element range set.

evolvedmicrobe · 2018-12-05T08:21:37Z

@pmarks I am going to try and upstream this change once I get feedback on whether I can share data to reproduce the problem. No rush on the review for this, but since you were familiar with the STAR internals I thought to send it your way for any flags before sending on.

evolvedmicrobe · 2018-12-05T08:45:24Z

Changed to WIP, realized edge case can likely mean >= 1 BP differ, so will need to refine logic.

evolvedmicrobe · 2018-12-05T09:06:44Z

Fixed WIP issue

For sparse suffix arrays, presumably it could be more.

evolvedmicrobe · 2018-12-12T20:31:24Z

Note: A version of this change modified to match the latest STAR was also sent upstream, alexdobin#535

Bring Dockerfile up to date with current release

evolvedmicrobe requested a review from pmarks December 5, 2018 08:18

evolvedmicrobe changed the title ~~Fix Segmentation Fault Under Certain Conditions~~ [WIP] Fix Segmentation Fault Under Certain Conditions Dec 5, 2018

evolvedmicrobe changed the title ~~[WIP] Fix Segmentation Fault Under Certain Conditions~~ Fix Segmentation Fault Under Certain Conditions Dec 5, 2018

Fix assumption that difference was only 1

927e0c7

For sparse suffix arrays, presumably it could be more.

evolvedmicrobe force-pushed the nd/cr-1688 branch from afe22d0 to 927e0c7 Compare December 5, 2018 09:21

pmarks pushed a commit that referenced this pull request Aug 23, 2019

Merge pull request #1 from vpresnyak/vpresnyak-patch-1

5ef1bec

Bring Dockerfile up to date with current release

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Segmentation Fault Under Certain Conditions #1

Fix Segmentation Fault Under Certain Conditions #1

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 5, 2018 •

edited

Loading

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 12, 2018 •

edited

Loading

Fix Segmentation Fault Under Certain Conditions #1

Are you sure you want to change the base?

Fix Segmentation Fault Under Certain Conditions #1

Conversation

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 5, 2018 • edited Loading

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 5, 2018

evolvedmicrobe commented Dec 12, 2018 • edited Loading

evolvedmicrobe commented Dec 5, 2018 •

edited

Loading

evolvedmicrobe commented Dec 12, 2018 •

edited

Loading