Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mean QScore calculation can be much quicker. #336

Open
rhpvorderman opened this issue Mar 24, 2023 · 0 comments
Open

Mean QScore calculation can be much quicker. #336

rhpvorderman opened this issue Mar 24, 2023 · 0 comments

Comments

@rhpvorderman
Copy link

rhpvorderman commented Mar 24, 2023

Hi,

I had a look at the QScore calculation as defined in the code

def mean_qscore_from_qstring(qstring):

I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).

I found that it was much quicker to create a lookup table:

import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]

def mean_qscore_from_qstring(qstring):
    """
    Convert qstring into a mean qscore
    """
    length = len(qstring)
    if len == 0: return 0.0
    qscore_values = qstring.encode('ASCII')  # This returns a bytes object which when iterated over returns integers
    sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
    return -10 * math.log10(sum_error / length)

In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...} in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33 in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33] .

If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean suffices.

No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant