A pitch determination algorithm based on short-time autocorrelation and shortest-distance search.
```shell
git clone https://github.com/MorrisXu-Driving/Pitch_Determiation_for_Speech_Signal.git
```
- Create a new project in your Python IDE and set `mainvoid.py` as the script path in the run configuration.
- Make sure the test input wav file `tone4_w.wav` is in the same directory as `mainvoid.py`.
In this algorithm we have:

- Parameters for input preprocessing

```python
wlen = int(0.03 * fs)  # frame length: 0.03 s in the time domain, i.e. 30 ms
inc = int(0.01 * fs)   # frame shift: 0.01 s in the time domain, i.e. 10 ms
lf = 60                # Hz, lower passband edge of the bandpass denoising filter
hf = 500               # Hz, upper passband edge of the bandpass denoising filter
```
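The `lf`/`hf` pair defines the 60-500 Hz band kept during denoising. A minimal sketch of such a bandpass step, assuming SciPy is available; the filter order and the use of zero-phase `filtfilt` are my choices for illustration, not necessarily what `mainvoid.py` does:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_denoise(x, fs, lf=60, hf=500, order=4):
    """Keep only the lf-hf band, where speech F0 and its first harmonic live."""
    nyq = fs / 2
    b, a = butter(order, [lf / nyq, hf / nyq], btype="bandpass")
    return filtfilt(b, a, x)  # zero-phase filtering: no group delay

# Demo: a 200 Hz tone (inside the band) plus a 3000 Hz tone (outside it)
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
y = bandpass_denoise(x, fs)  # the 3000 Hz component is strongly attenuated
```

After filtering, only frequencies in the plausible pitch range survive, which makes the later autocorrelation peaks much cleaner.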
- Parameters for pitch determination

```python
IS = 0.8         # non-speech time (in seconds) at the start of the input; read it off the
                 # waveform of the input audio in the diagram above
r1 = 0.03        # threshold coefficient for the energy threshold T1 (shown in the diagram)
                 # that judges speech segments, namely T1 = np.mean(H[:NIS]) * r1,
                 # where H[:NIS] is the energy of the signal between 0 and IS
r2 = 0.26        # threshold coefficient for judging main bodies in a speech segment;
                 # each speech segment has a different T2 (shown in the diagram)
ThrC = [10, 15]  # max difference in F0 between adjacent frames during the
                 # shortest-distance search, to avoid unnatural changes in the result
miniL = 10       # minimum length of a speech segment
mnlong = 3       # minimum length of a main body within a speech segment
```
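The threshold formula `T1 = np.mean(H[:NIS]) * r1` quoted above can be sketched end to end: frame the signal, compute short-time energy, and scale the mean energy of the leading non-speech portion. The `enframe` helper name and the synthetic signal are illustrative, not taken from the repository:

```python
import numpy as np

def enframe(x, wlen, inc):
    """Split a signal into overlapping frames of length wlen with shift inc."""
    n = 1 + (len(x) - wlen) // inc
    return np.stack([x[i * inc : i * inc + wlen] for i in range(n)])

fs = 8000
wlen, inc = int(0.03 * fs), int(0.01 * fs)   # 30 ms frames, 10 ms shift
IS, r1 = 0.8, 0.03                           # leading non-speech time, threshold coefficient

# Synthetic input: IS seconds of low-level noise, then a 200 Hz tone
rng = np.random.default_rng(0)
x = np.concatenate([0.01 * rng.standard_normal(int(IS * fs)),
                    np.sin(2 * np.pi * 200 * np.arange(fs) / fs)])

frames = enframe(x, wlen, inc)
H = np.sum(frames ** 2, axis=1)              # short-time energy per frame
NIS = int((IS * fs - wlen) / inc) + 1        # number of leading non-speech frames
T1 = np.mean(H[:NIS]) * r1                   # energy threshold for speech detection
```

Frames whose energy exceeds `T1` are candidate speech; `r2` then plays the analogous role inside each detected segment to locate its main body.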
The above diagram shows the spectrogram of the input audio together with the pitch extracted from the input file. The extracted pitch (white line) is highly correlated with the first harmonic shown in the STFT spectrogram, which indicates that the algorithm is working properly.
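The per-frame pitch behind that white line comes from short-time autocorrelation. A minimal sketch of the idea, searching the autocorrelation peak only within the lag range implied by `lf` and `hf`; this illustrates the technique named in the title, not the exact implementation in `mainvoid.py`:

```python
import numpy as np

def acf_f0(frame, fs, lf=60, hf=500):
    """Estimate F0 of one frame from its autocorrelation peak in the lf-hf lag range."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]  # non-negative lags
    lmin, lmax = int(fs / hf), int(fs / lf)   # lag bounds from the passband edges
    lag = lmin + np.argmax(r[lmin : lmax + 1])  # lag of strongest periodicity
    return fs / lag

# Demo: one 30 ms frame of a 200 Hz tone should yield F0 close to 200 Hz
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 200 * t)
f0 = acf_f0(frame, fs)
```

Running this per frame gives a raw F0 track; the shortest-distance search with `ThrC` then smooths it by forbidding large F0 jumps between adjacent frames.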
RMSE (Hz) of the results on the wav files in `speech_signal_for_test/`:
- The algorithm does not adapt to different types of audio signals.
- Inputs with low SNR (i.e. the background energy between 0 and IS is already high) need a lower r1.
- Inputs with low energy in each speech segment need a lower r2, so that the extended parts around each main body are still recognized.
- Adaptive parameter setting is needed for a better user experience, since too many parameters must currently be tuned by hand to achieve good performance across different types of speech audio.
- Future Work
- Extracting the pitch alone is not convenient for downstream research; it should be combined with forced alignment at the character level and the word level.