Mean QScore calculation can be much quicker. #336

rhpvorderman · 2023-03-24T06:14:44Z

Hi,

I had a look at the QScore calculation as defined in the code

Line 124 in 21df7d5

def mean_qscore_from_qstring(qstring):

I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).

I found that it was much quicker to create a lookup table:

import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]

def mean_qscore_from_qstring(qstring):
    """
    Convert qstring into a mean qscore
    """
    length = len(qstring)
    if len == 0: return 0.0
    qscore_values = qstring.encode('ASCII')  # This returns a bytes object which when iterated over returns integers
    sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
    return -10 * math.log10(sum_error / length)

In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...} in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33 in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33] .

If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean suffices.

No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.

The text was updated successfully, but these errors were encountered:

rhpvorderman mentioned this issue Mar 24, 2023

FASTQ average quality score calculation can be faster wdecoster/nanoget#28

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mean QScore calculation can be much quicker. #336

Mean QScore calculation can be much quicker. #336

rhpvorderman commented Mar 24, 2023 •

edited

Loading

Mean QScore calculation can be much quicker. #336

Mean QScore calculation can be much quicker. #336

Comments

rhpvorderman commented Mar 24, 2023 • edited Loading

rhpvorderman commented Mar 24, 2023 •

edited

Loading