Add RASR compatible feature extraction #44
Conversation
Just one comment, comparing to what I know from the usual RASR feature flows. Otherwise this looks good to me.
I cannot really judge just from the code whether it is correct. I assume you tested that you get the same output? Then I would say all is fine.
If you want, a test could be added where you feed some dummy data through RASR, get the output, and compare it with this module's output. You can store the RASR output directly in the test case, so there is no need to run RASR in the test case. But it is not really necessary now.
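The suggested test layout could look roughly like this sketch. Note that `extract_features` is a hypothetical stand-in for the module's entry point, and the reference numbers are placeholders, not real RASR output; the point is only that the RASR dump is stored inline, so the test never runs RASR itself.

```python
import numpy as np

# Hypothetical test sketch: `extract_features` stands in for the module under
# review, and RASR_REFERENCE holds placeholder numbers, NOT real RASR output.
def extract_features(samples: np.ndarray) -> np.ndarray:
    # placeholder; the real module would compute the actual features
    return samples

# Values dumped once from RASR for the same fixed dummy input (placeholders):
RASR_REFERENCE = np.array([0.0, 0.0, 0.0, 0.0])

def test_matches_rasr_reference():
    dummy_input = np.zeros(4, dtype=np.float64)
    np.testing.assert_allclose(extract_features(dummy_input), RASR_REFERENCE, atol=1e-5)

test_matches_rasr_reference()
print("ok")
```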
The difference is at the 3rd decimal position; that is more than just numeric precision, so there is still something different?
Not sure if it is worth investigating further. I don't think our TensorFlow features and RASR features match 100%.
I'm checking this currently. What was the RASR flow file used to get these? And what were the corresponding settings?
I cannot tell you what was used to create the numbers above, and I think those might be from a version before preemphasis and zero padding were added. Here is the config/flow that I have used recently:
Just by copy-pasting I see that the epsilon defined in the config and the flow does not match… 🙈
I still wonder about the shift (off by ~4.515): I figured out now that, before the log10, this is exactly a factor 2^15 = 32,768.

But I think I used the same flow as you. (Well, actually, I did not: @curufinwe gave me some generated feature cache, and he said it was generated with this flow network.) So I'm not sure; maybe this here is actually the wrong flow network?

I also still don't exactly understand what we mean by "RASR compatible feature extraction" here. Does this mean it is supposed to do the same as this specific flow network? Then shouldn't this reference flow network be documented somehow? Otherwise, "RASR compatible feature extraction" can mean anything or nothing?
RASR reads the WAV samples as a short (int16) and then just casts them to float. If the other code reads the samples as float
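The two sample-reading conventions discussed here can be sketched as follows (an illustration, not actual RASR code): casting int16 to float keeps values in [-32768, 32767], while many float-based loaders normalize to [-1, 1), and the two differ by exactly the factor 2^15 that shows up as the ~4.515 shift after log10.

```python
import numpy as np

# RASR-style: interpret raw 16-bit PCM as int16 and cast directly to float.
raw = np.array([0, 16384, -32768, 32767], dtype=np.int16)
rasr_style = raw.astype(np.float64)

# Common float convention: normalize to [-1, 1).
float_style = raw.astype(np.float64) / 32768.0

# The two differ exactly by 2**15 = 32768, which after log10 becomes an
# additive shift of log10(2**15) ~= 4.515 -- matching the shift seen above.
assert np.allclose(rasr_style, float_style * 2**15)
print(round(float(np.log10(2**15)), 3))  # 4.515
```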
As I said, this is the flow that I used. It is based on existing flow implementations, but I changed it to be more similar to this PR. It could be that @curufinwe by accident (or by virtue of the obvious implementation) arrived at the same flow as me, but I would not guarantee it.
I agree with this question. I think as a first goal, "RASR compatible" would mean something that can be exactly reproduced as a RASR flow network, although not necessarily the same as "the one flow that we always use".

But then, when I tested the version of this PR without preemphasis (and also without preemphasis in the RASR flow) to train a model, and compared to the current version of this PR (and the flow I posted above), I found an improvement in WER from 16.1 to 15.7.

So maybe as a stretch goal we should aim to make the implementation as similar as possible to "the one flow we always use" (which is different from the one I posted above). But then nobody would argue against using different features if the final WER is the same (or better) and the structure is simpler.
I'm on this now. For this, I created a test case where I can test the flow file step by step. I found many more differences:
Now, up to that point in the flow network, it matches. But there are still remaining differences: the FFT seems different. I need to understand this better.
I replicated the RASR C++ FFT code in Python. Now I still get some difference in the amplitude-spectrum:
I'm not sure if this is because I still (by mistake) do something a little bit different from the corresponding C++ code, or because some of the math just behaves differently in PyTorch.

I have now already spent quite a bit of time on this, and I'm questioning a bit whether this is really worth the effort. I also have the feeling that, to get really close results, we would probably need a lot of very custom implementations, our own custom FFT, etc. But on the other side, if we don't do this, it looks like we will still have quite large differences, too large to just use one as a drop-in for the other...
Note, as a next step, to verify even further whether the FFT produces exactly the same results or not, I would probably copy the RASR C++ code into some standalone test tool, and then verify step by step where it differs from the Python output. Maybe I still find some things I can fix on the Python side to make it really the same. If not, that will at least give more insight into where it becomes different and where errors are accumulating.
If you think it would be easier to change/augment the RASR implementation, we could try to go that route instead.
I'm really not an expert when it comes to FFT implementations. I know there are a couple of different ways to implement a Fourier transform, but even when looking at the same algorithm, there are so many details that might result in such differences. First, we should maybe better understand whether there is still some bug, or some difference we can fix. I don't know if there is. I also don't know whether it is expected that the numerical differences between FFT implementations can be this large (at least it was unexpected to me, but I don't know enough about it).

On changing RASR: I'm not sure if this is really simpler. Maybe. For that, we should better understand what the Torch FFT is actually doing. I also wonder how different Torch FFT is on CPU vs GPU. E.g. on GPU, it uses cuFFT, so again a very specific implementation, and also closed source (I think), so we cannot even really check the implementation. Of course, you could simply copy the Torch CPU code.

But also, what do you mean by "change the RASR implementation"? Change the flow network? So change the flow network such that it simply returns raw audio samples and doesn't do anything? Then it would be trivial to make it the same. So we are done?
I ran your test cases and found that with a loosened tolerance they pass. Is your implementation of the FFT usable in practice, e.g. fast enough?
What test case exactly? But as you see from my output, the absolute error was quite high, so I would need atol=1e-2 or so for it to pass.
But the output is already the best I could get, including
I did not measure it, but I would expect it is way too slow. There are several nestings of Python loops. Already the C++ code was a bit too slow when done on-the-fly, and I expect this Python code will be a factor of 10 or 100 slower. I don't think this is an option.
The definition of the closeness criterion according to https://pytorch.org/docs/stable/testing.html#torch.testing.assert_allclose is |actual - expected| <= atol + rtol * |expected|,
so if the value at the specific position in
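That criterion can be checked by hand. The sketch below uses the feature values quoted further down in this thread (position 13 of each row) and the documented float32 defaults of `assert_allclose`; with those defaults a ~2e-3 absolute difference fails, while the atol=1e-2 mentioned above passes.

```python
# Hand-rolled version of the closeness criterion from the torch.testing docs:
# a pair passes iff |actual - expected| <= atol + rtol * |expected|.
def is_close(actual: float, expected: float,
             rtol: float = 1.3e-6, atol: float = 1e-5) -> bool:
    # rtol/atol defaults are the documented float32 defaults of assert_allclose
    return abs(actual - expected) <= atol + rtol * abs(expected)

# Torch vs. RASR value from the tables below; they differ by ~1.8e-3:
print(is_close(2.2317, 2.233468))             # False with the defaults
print(is_close(2.2317, 2.233468, atol=1e-2))  # True with the looser atol=1e-2
```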
Running the test case: are those still not close enough?
3% is not a lot, but it is also not equal. We could do an experiment to see how much this changes the posteriors of the NN and, finally, the WER.
I re-approve the changes. I would deem the remaining difference small enough to not influence the WER.
smoothed = windowed[:, :-1] * self.window[None, None, :]  # [B, T'-1, W]
# The last window might be shorter. Will use a shorter Hanning window then. Need to fix that.
last_win = torch.hann_window(last_win_size, periodic=False, dtype=torch.float64).to(
Is this correct if you have sequences of different lengths in the batch? Maybe it's necessary to do this as a loop over all sequences in the batch and modify some parts of the output with this?
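The per-sequence loop suggested here could look roughly like this sketch (hypothetical names, not the PR's actual variables; NumPy's `np.hanning` is used here, which matches `torch.hann_window(..., periodic=False)` for the symmetric case): with different sequence lengths in the batch, the size of the last, shorter window differs per sequence, so each sequence's last frame is patched individually instead of applying one shared window to the whole batch.

```python
import numpy as np

def window_last_frames(frames: np.ndarray, last_sizes: list) -> np.ndarray:
    """frames: [B, W] last frame of each sequence, zero-padded to full width W.
    last_sizes: per-sequence number of valid samples in that last frame."""
    out = frames.copy()
    for b, n in enumerate(last_sizes):
        # shorter symmetric Hann window for this sequence's tail
        out[b, :n] = frames[b, :n] * np.hanning(n)
        out[b, n:] = 0.0  # samples past the sequence end stay zero
    return out

frames = np.ones((2, 8))
res = window_last_frames(frames, [8, 5])
print(res.shape, float(res[1, 5:].sum()))  # (2, 8) 0.0
```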
One frame of RASR features:
0.234030 0.731747 1.082119 1.292996 0.890861 1.056064 1.390038 1.597611 1.604475 1.586611 1.939575 2.216516 2.233468 1.941573 1.820577 2.262333 2.291739 2.248550 2.066255 2.148235 2.549040 2.568474 2.229928 2.418101 2.622112 2.478587 2.042983 2.188321 2.437973 2.706243 2.692690 2.617883 2.669614 2.743750 2.857951 2.730998 2.775461 2.747587 2.674153 3.180011 2.949069 3.098550 3.110722 2.988153 2.890325 3.153568 3.039258 3.129505 3.048017 3.061022 3.084099 3.244599 3.241169 3.604517 3.651453 3.282753 2.902287 3.145018 3.163275 2.913236
The same frame of Torch features:
0.2297, 0.7334, 1.0803, 1.2995, 0.9143, 1.0722, 1.3999, 1.6022, 1.6045, 1.5851, 1.9403, 2.2169,
2.2317, 1.9365, 1.8187, 2.2637, 2.2955, 2.2538, 2.0692, 2.1491, 2.5498, 2.5663, 2.2258, 2.4154,
2.6246, 2.4799, 2.0473, 2.1879, 2.4400, 2.7045, 2.6909, 2.6192, 2.6692, 2.7444, 2.8586, 2.7310,
2.7798, 2.7500, 2.6750, 3.1814, 2.9498, 3.1026, 3.1138, 2.9888, 2.8930, 3.1551, 3.0379, 3.1301,
3.0457, 3.0599, 3.0900, 3.2458, 3.2438, 3.6075, 3.6512, 3.2803, 2.9031, 3.1472, 3.1635, 2.9173
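The gap between the two dumps can be quantified directly; the snippet below compares the first twelve values of each row (copied from the tables above; the remaining values behave similarly) and shows the largest difference sits around the 2nd-3rd decimal, consistent with the "3rd decimal position" observation earlier in the thread.

```python
import numpy as np

# First twelve values of each feature row, copied from the tables above:
rasr = np.array([0.234030, 0.731747, 1.082119, 1.292996, 0.890861, 1.056064,
                 1.390038, 1.597611, 1.604475, 1.586611, 1.939575, 2.216516])
torch_feats = np.array([0.2297, 0.7334, 1.0803, 1.2995, 0.9143, 1.0722,
                        1.3999, 1.6022, 1.6045, 1.5851, 1.9403, 2.2169])

diff = np.abs(rasr - torch_feats)
print(round(float(diff.max()), 4))  # 0.0234
```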