Audio range out of (-1,+1) #1254
Comments
The -1 to 1 check in icefall is to avoid the issue when users pass samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi. I think it is safe to enlarge the range as long as we can achieve the same goal.
- some AudioTransform classes produce audio signals out of range [-1,+1]
- Resample produced 1.0079
- The range [-10,+10] was chosen to still be able to reliably distinguish from the [-32k,+32k] signal...
- this is related to : lhotse-speech/lhotse#1254
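
The relaxed check described in this commit message could look roughly like the sketch below. It is not the actual icefall code: the function name `looks_like_int16_scale` and the `limit` parameter are hypothetical, and only the [-10,+10] bound comes from the message above.

```python
import numpy as np


def looks_like_int16_scale(samples: np.ndarray, limit: float = 10.0) -> bool:
    """Return True if the peak magnitude exceeds `limit`.

    Hypothetical helper: a peak above 10 suggests the samples are still in
    the [-32768, 32767] integer range rather than normalized floats, while
    small overshoots such as 1.0079 from resampling are tolerated.
    """
    if samples.size == 0:
        return False
    return float(np.max(np.abs(samples))) > limit


# Slight overshoot after resampling: accepted.
resampled = np.array([0.5, -0.3, 1.0079])
# Kaldi-style samples that were never scaled to floats: flagged.
kaldi_style = np.array([-32768.0, 120.0, 32767.0])

print(looks_like_int16_scale(resampled))
print(looks_like_int16_scale(kaldi_style))
```

The bound only needs to sit far enough from both ranges to separate them, so the exact value is a judgment call.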
Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.
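
A minimal sketch of the conditional rescaling mentioned in this comment, assuming the audio is a NumPy float array. The function name `rescale_if_needed` is hypothetical and is not part of lhotse.

```python
import numpy as np


def rescale_if_needed(audio: np.ndarray) -> np.ndarray:
    """Divide by the peak magnitude only when it exceeds 1.0.

    Hypothetical helper illustrating the idea from the comment; in-range
    audio is returned unchanged, so the gain is only modified when needed.
    """
    if audio.size == 0:
        return audio
    maxabs = np.max(np.abs(audio))
    if maxabs > 1.0:
        return audio / maxabs
    return audio


overshoot = np.array([0.5, -0.25, 1.0079])
fixed = rescale_if_needed(overshoot)
print(np.max(np.abs(fixed)))
```

The runtime cost raised in the comment is the extra full pass over the samples to find the peak, which is paid for every recording even when no rescaling happens.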
…1448) - some AudioTransform classes produce audio signals out of range [-1,+1] - Resample produced 1.0079 - The range [-10,+10] was chosen to still be able to reliably distinguish from the [-32k,+32k] signal... - this is related to : lhotse-speech/lhotse#1254
Hello @pzelasko, @csukuangfj,
I just identified an open question related to the audio transforms.
In lhotse, there is the `Resample` class wrapping the `torchaudio.transforms.Resample()`.
When resampling 32kHz->16kHz common_voice_cs_26209290, the `audio.max()` becomes 1.0079.
In streaming_decode.py in Icefall, there is a check that the max audio sample must be `s<=1.0`.
What would be the cleanest solution to this?
a) Stop checking for `audio.abs().max()<=1.0` in Icefall.
b) Introduce an audio clipping `AudioTransform` to lhotse.
c) Introduce an audio Limiter `AudioTransform` (sth. like: `if audio.abs().max() > 0.99: rescale_to_099(...)`) to lhotse.
d) Try to add a check to `torchaudio`, in `torchaudio.transforms.Resample()`.
I guess a similar issue would appear also for volume perturbation, but I did not check that specifically.
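
To make the difference between options b) and c) concrete, here is a rough sketch on plain NumPy arrays. Both function names are hypothetical, neither is an existing lhotse `AudioTransform`, and the rescaling function is only a peak rescale, not a true dynamic limiter.

```python
import numpy as np


def clip_audio(audio: np.ndarray, limit: float = 1.0) -> np.ndarray:
    """Option b): hard-clip samples to [-limit, +limit].

    Only the offending samples change, at the cost of some distortion.
    """
    return np.clip(audio, -limit, limit)


def rescale_to_peak(audio: np.ndarray, threshold: float = 0.99) -> np.ndarray:
    """Option c): scale the whole signal so its peak equals `threshold`.

    Applied only when the peak exceeds `threshold`; the waveform shape is
    preserved, but the overall gain of the recording changes.
    """
    if audio.size == 0:
        return audio
    peak = np.max(np.abs(audio))
    if peak > threshold:
        return audio * (threshold / peak)
    return audio


audio = np.array([0.5, 1.0079, -0.2])
print(clip_audio(audio))
print(rescale_to_peak(audio))
```

Clipping touches only the overshooting samples, while rescaling changes every sample by the same factor.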
Best regards
Karel
// Ps: All the best in the new "western" year !!