Problem

Amplitudes are min-max normalized for each audio example loaded from the dataset.

This is bad for three reasons:

First reason: DC offset. The normalization subtracts the example's minimum and then divides by its (new) maximum. If the positive and negative peaks have different magnitudes, silence no longer sits at the middle value, so the round trip introduces a DC offset into the audio.

Second reason: each example has different peaks, so each example ends up with a different quantization level for silence.

Third reason: dynamics. If part of my dataset is soft, part is loud, and part is the transitions between soft and loud, every example gets normalized up to loud, and SampleRNN will struggle to learn those transitions. If some [8-second] example is nearly silent, it now becomes super loud.

I think the only acceptable amplitude normalization would be over the entire dataset, and you could do that [with ffmpeg] when creating the dataset.
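To make the first two reasons concrete, here is a minimal sketch in plain PyTorch (my own illustration, not the repo's code) of per-example min-max normalization applied to an example whose peaks are not symmetric around zero:

```python
import torch

q_levels = 256

# A made-up example with asymmetric peaks: +0.8 and -0.2, with true silence at 0.0.
samples = torch.tensor([0.0, 0.8, -0.2, 0.0])

# Per-example min-max normalization, written out directly
# (this mirrors the behaviour described above, not the repo's exact code).
normalized = (samples - samples.min()) / (samples.max() - samples.min())
quantized = (normalized * (q_levels - 1)).long()
print(quantized)    # tensor([ 51, 255,   0,  51]) -> silence lands at level 51, not 128,
                    # and an example with different peaks would put silence at yet another level

# Mapping the levels back to [-1, 1] (one common convention):
dequantized = quantized.float() / q_levels * 2 - 1
print(dequantized)  # silence comes back at about -0.60 instead of 0.0 -> a DC offset
```

The snippet below shows the same round trip using the repo's own `linear_quantize` / `linear_dequantize` on a silent example: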
```python
# quantize the wav amplitude into 256 levels
q_levels = 256

# Plot original wav samples
plot(samples)
# samples = tensor([ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000])

# Linearly quantize the samples
lq = linear_quantize(samples, q_levels)
plot(lq)
# lq = tensor([ 133, 133, 133, ..., 133, 133, 133])  # note: silence should be 128

# Unquantize the samples
ldq = linear_dequantize(lq, q_levels)
plot(ldq)
# ldq = tensor([ 0.0391, 0.0391, 0.0391, ..., 0.0391, 0.0391, 0.0391])
# introduction of DC offset; instead, this should be silent: 0.0000, 0.0000, 0.0000, ...
```
The normalization happens in `linear_quantize`; the audio is normalized as it is loaded:

(Example) `linear_dequantize(linear_quantize(samples)) != samples`
Solution
Don't normalize per example inside `linear_quantize`.
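A possible shape for the fix (my own sketch, not a patch from the repo): quantize against the fixed range [-1, 1] rather than per-example statistics, and if any amplitude normalization is wanted at all, do it once over the whole dataset when it is created (e.g. with ffmpeg, as suggested above). With a fixed mapping, silence quantizes to the same level (`q_levels // 2`) in every example, and the round trip introduces no DC offset:

```python
import torch

Q_LEVELS = 256

def linear_quantize_fixed(samples, q_levels=Q_LEVELS):
    # Hypothetical replacement: a mid-tread quantizer over the fixed range [-1, 1].
    # No per-example statistics are used, so silence (0.0) always maps to q_levels // 2.
    levels = torch.round(samples * (q_levels // 2)) + q_levels // 2
    return levels.clamp(0, q_levels - 1).long()

def linear_dequantize_fixed(levels, q_levels=Q_LEVELS):
    # Inverse mapping: level q_levels // 2 comes back as exactly 0.0.
    return (levels.float() - q_levels // 2) / (q_levels // 2)

silence = torch.zeros(5)
print(linear_quantize_fixed(silence))                           # tensor([128, 128, 128, 128, 128])
print(linear_dequantize_fixed(linear_quantize_fixed(silence)))  # tensor([0., 0., 0., 0., 0.])
```

The `_fixed` names are placeholders for illustration; the point is only that no per-example min/max enters the mapping, so the quantization range and the silence level are the same for every example in the dataset.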