You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).
I found that it was much quicker to create a lookup table:
import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]
def mean_qscore_from_qstring(qstring):
"""
Convert qstring into a mean qscore
"""
length = len(qstring)
if len == 0: return 0.0
qscore_values = qstring.encode('ASCII') # This returns a bytes object which when iterated over returns integers
sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
return -10 * math.log10(sum_error / length)
In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...} in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33 in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33] .
If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean suffices.
No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.
The text was updated successfully, but these errors were encountered:
Hi,
I had a look at the QScore calculation as defined in the code
bonito/bonito/util.py
Line 124 in 21df7d5
I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).
I found that it was much quicker to create a lookup table:
In C this is even quicker of course. You can simply create a
const double QSCORE_LOOKUP[128] = {...}
in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup asq-33
in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I doQSCORE_LOOKUP[q-33]
.If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a
from fastq_filter import qualmean
suffices.No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.
The text was updated successfully, but these errors were encountered: