
Add RASR compatible feature extraction #44

Merged

Conversation

Collaborator

@kuacakuaca kuacakuaca commented Dec 15, 2023

one frame rasr features
0.234030 0.731747 1.082119 1.292996 0.890861 1.056064 1.390038 1.597611 1.604475 1.586611 1.939575 2.216516 2.233468 1.941573 1.820577 2.262333 2.291739 2.248550 2.066255 2.148235 2.549040 2.568474 2.229928 2.418101 2.622112 2.478587 2.042983 2.188321 2.437973 2.706243 2.692690 2.617883 2.669614 2.743750 2.857951 2.730998 2.775461 2.747587 2.674153 3.180011 2.949069 3.098550 3.110722 2.988153 2.890325 3.153568 3.039258 3.129505 3.048017 3.061022 3.084099 3.244599 3.241169 3.604517 3.651453 3.282753 2.902287 3.145018 3.163275 2.913236
same frame torch features
0.2297, 0.7334, 1.0803, 1.2995, 0.9143, 1.0722, 1.3999, 1.6022, 1.6045, 1.5851, 1.9403, 2.2169, 2.2317, 1.9365, 1.8187, 2.2637, 2.2955, 2.2538, 2.0692, 2.1491, 2.5498, 2.5663, 2.2258, 2.4154, 2.6246, 2.4799, 2.0473, 2.1879, 2.4400, 2.7045, 2.6909, 2.6192, 2.6692, 2.7444, 2.8586, 2.7310, 2.7798, 2.7500, 2.6750, 3.1814, 2.9498, 3.1026, 3.1138, 2.9888, 2.8930, 3.1551, 3.0379, 3.1301, 3.0457, 3.0599, 3.0900, 3.2458, 3.2438, 3.6075, 3.6512, 3.2803, 2.9031, 3.1472, 3.1635, 2.9173

Contributor

@michelwi michelwi left a comment


Just one comment when comparing to what I know from the usual RASR feature flows.

Otherwise, looks good to me.

i6_models/primitives/feature_extraction.py (review thread, resolved)
Member

@albertz albertz left a comment


I cannot really judge from the code alone whether it is correct. I assume you tested that you get the same output? In that case, I would say all is fine.

If you want, maybe a test could be added where you feed some dummy data through RASR, get the output, and compare it with this module's output. You can store the RASR output directly in the test case, so there is no need to run RASR in the test itself. But it's not really necessary now.
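The suggested test pattern could be sketched like this (illustrative only: `extract` is a trivial stand-in, not the actual feature-extraction module, and the "reference" literals here are just computed for the dummy input, where the real test would paste numbers from a RASR feature cache):

```python
import math

# Sketch of the suggested test pattern: RASR is run once offline on a fixed
# dummy input, its output is pasted into the test as literals, and the module
# output is compared against those literals, so RASR is never run at test time.
# `extract` is a trivial stand-in, NOT the actual module.
def extract(samples):
    return [math.log10(abs(s) + 1e-6) for s in samples]

dummy_input = [0.25, -0.5, 0.75]
# In the real test these literals would come from a RASR feature cache:
stored_reference = [-0.60206, -0.30103, -0.12494]

for got, ref in zip(extract(dummy_input), stored_reference):
    assert math.isclose(got, ref, rel_tol=1e-3)
```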

i6_models/primitives/feature_extraction.py (review thread, resolved)
@michelwi
Contributor

michelwi commented Feb 1, 2024

one frame rasr features: 0.234030 0.731747 1.082119
same frame torch features: 0.2297, 0.7334, 1.0803,

the difference is in the 3rd decimal place, which is more than just numeric precision, so there is still something different?

i6_models/primitives/feature_extraction.py (review thread, resolved)
i6_models/primitives/feature_extraction.py (review thread, resolved)
@kuacakuaca
Collaborator Author

one frame rasr features: 0.234030 0.731747 1.082119
same frame torch features: 0.2297, 0.7334, 1.0803,

the difference is on the 3rd decimal position, that is more than just numeric precision so there is still something different?

Not sure if it is worth investigating further. I don't think our TensorFlow features and RASR features match 100% either.

i6_models/primitives/feature_extraction.py (review thread, resolved)
@albertz
Member

albertz commented Feb 9, 2024

one frame rasr features
same frame torch features

I'm checking this currently. What was the RASR flow file to get these? And what were the corresponding settings for RasrCompatibleLogMelFeatureExtractionV1Config? So far, I always get very different results (off by some shift of ~4.515).

@michelwi
Contributor

I'm checking this currently. What was the RASR flow file to get these? And what were the corresponding settings for RasrCompatibleLogMelFeatureExtractionV1Config? So far, I always get very different results (off by some shift of ~4.515).

I cannot tell you what was used to create the numbers above; I think those might be from a version before preemphasis and zero padding were added. Here are the config and flow that I have used recently:

feat_cfg = RasrCompatibleLogMelFeatureExtractionV1Config(
    sample_rate=sampling_rate,
    win_size=0.025,
    hop_size=0.01,
    min_amp=1e-6,
    num_filters=num_channels,
)
<?xml version="1.0" ?>
<network name="network">
  <out name="features"/>
  <param name="end-time"/>
  <param name="input-file"/>
  <param name="start-time"/>
  <param name="track"/>
  <node end-time="$(end-time)" file="$(input-file).wav" filter="audio-input-file-wav" name="samples" start-time="$(start-time)"/>
  <node filter="generic-vector-s16-demultiplex" name="demultiplex" track="$(track)"/>
  <link from="samples" to="demultiplex"/>
  <node filter="generic-convert-vector-s16-to-vector-f32" name="convert"/>
  <link from="demultiplex" to="convert"/>
  <node alpha="1.0" filter="signal-preemphasis" name="preemphasis"/>
  <link from="convert" to="preemphasis"/>
  <node filter="signal-window" length="0.025" name="window" shift="0.01" type="hanning"/>
  <link from="preemphasis" to="window"/>
  <node filter="signal-real-fast-fourier-transform" maximum-input-size="0.025" name="fft"/>
  <link from="window" to="fft"/>
  <node filter="generic-vector-f32-multiplication" name="scaling" value="16000"/>
  <link from="fft" to="scaling"/>
  <node filter="signal-vector-alternating-complex-f32-amplitude" name="amplitude-spectrum"/>
  <link from="scaling" to="amplitude-spectrum"/>
  <node filter="signal-filterbank" filter-width="70.12402584464985" name="filterbank" warp-differential-unit="false" warping-function="mel"/>
  <link from="amplitude-spectrum" to="filterbank"/>
  <node filter="generic-vector-f32-log-plus" name="nonlinear" value="1.175494e-38"/>
  <link from="filterbank" to="nonlinear"/>
  <link from="nonlinear" to="network:features"/>
</network>

Just by copy-pasting I see that the epsilon defined in the config (min_amp=1e-6) and the one in the flow (1.175494e-38) do not match… 🙈

@albertz
Member

albertz commented Feb 12, 2024

I still wonder about the shift (off by ~4.515): I figured out that, before the log10, this is exactly a factor of 2^15 = 32,768. But I think I used the same flow as you. (Well, actually, I did not: @curufinwe gave me a generated feature cache, and he said it was generated with this flow network.) So I'm not sure; maybe this here is actually the wrong flow network?

I also still don't exactly understand what we mean by "RASR compatible feature extraction" here. Does it mean it is supposed to do the same as this specific flow network? Then shouldn't this reference flow network be documented somewhere? Otherwise, "RASR compatible feature extraction" can mean anything or nothing.

@michelwi
Contributor

I still wonder about the shift (off by some shift of ~4.515): So, I figured out now, this means, before the log10, this is exactly a factor 2^15 = 32,768.

RASR reads the wav samples as short and then just casts them to float. If the other code reads the samples as float in [-1, 1] directly, this would explain a factor of 2^15.
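The two sample conventions can be sketched in a few lines (assumed behavior for illustration, not actual RASR code):

```python
import math

# RASR reads int16 samples and casts them directly to float, giving values
# in [-32768, 32767]; a float-based loader yields the same samples in [-1, 1].
raw_int16 = [16384, -8192, 32767]
rasr_style = [float(s) for s in raw_int16]     # e.g. 16384.0
normalized = [s / 32768.0 for s in raw_int16]  # e.g. 0.5

# The factor between the two is exactly 2**15, which after log10 turns into
# a constant shift of log10(2**15) ~ 4.515, matching the observed offset.
assert all(math.isclose(r, n * 2 ** 15) for r, n in zip(rasr_style, normalized))
assert math.isclose(math.log10(2 ** 15), 4.515, abs_tol=1e-3)
```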

But I think I used the same flow as you. (Well, actually, I did not, @curufinwe gave me some generated feature cache, and he said it was generated with this flow network.) So, I'm not sure, maybe this here is actually the wrong flow network?

As I said, this is the flow that I used, it is based on existing flow implementations but I changed it to be more similar to this PR. It could be that @curufinwe by accident (or virtue of obvious implementation) arrived at the same flow as me, but I would not guarantee it.

I also still don't exactly understand what we mean by "rasr compatible feature extraction" here. So this means it supposed to do the same as this specific flow network? Then shouldn't this reference flow network be documented somehow? Otherwise, "rasr compatible feature extraction" can mean anything or nothing?

I agree with this question. I think, as a first goal, "RASR compatible" would mean something that can be exactly reproduced as a RASR flow network, although not necessarily the same as "the one flow that we always use".

But then, when I trained a model with the version of this PR without preemphasis (and also without preemphasis in the RASR flow) and compared it to the current version of this PR (and the flow I posted above), I found an improvement in WER from 16.1 to 15.7.

So maybe as a stretch goal we should aim to make the implementation as similar as "the one flow we always use" (which is different from the one I posted above) as possible.

But then nobody would argue against using different features if the final WER is the same (or better) and the structure is easier.

@albertz
Member

albertz commented Feb 14, 2024

2. find the cause of the remaining difference

I'm on this now. For this, I created a test case where I can test the flow file step by step (see test_rasr_compatible and test_rasr_compatible_*).

I found many more differences:

  • Preemphasis was wrong in the first frame; it should be 0. (f41a60c)
  • The Hanning window was slightly off. RASR creates it in float64 and only afterwards converts it to float32, which makes a difference for cos. I do the same now. (44aba81) (Note, this caused a difference of more than 1e-05.)
  • The Hanning window in the last output frame was wrong. RASR applies a shorter Hanning window for the last output frame, just of the size of the remaining frames. (d845650)

Now, up to that point in the flow network (window node), the outputs match almost perfectly (at least 1e-30 absolute and relative).
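The fixes above can be sketched in pure Python (illustrative only; the actual module works on float32 tensors, and RASR builds the window in float64 before casting to float32, which is why the cast order matters for cos):

```python
import math

def hanning(size):
    # Symmetric Hanning window; RASR computes this in float64 first and only
    # then casts to float32 (fix 44aba81).
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (size - 1)) for n in range(size)]

def preemphasis(x, alpha=1.0):
    # The first sample has no predecessor, so it is set to 0 (fix f41a60c).
    return [0.0] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

win = hanning(400)
assert win[0] == 0.0 and abs(win[-1]) < 1e-12  # zero endpoints

# The last, partial output frame gets a shorter Hanning window of exactly
# the remaining length (fix d845650); e.g. for 173 remaining samples:
short_win = hanning(173)
assert len(short_win) == 173

assert preemphasis([1.0, 2.0, 3.0]) == [0.0, 1.0, 1.0]
```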

And there are still remaining differences:

The FFT seems different. I need to understand this better. But looking at amplitude-spectrum, there is again some difference. Via test_rasr_compatible_fft:

>       torch.testing.assert_close(amplitude_spectrum, rasr_feat, rtol=1e-30, atol=1e-30)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 126540 / 128243 (98.7%)
E       Greatest absolute difference: 0.01171875 at index (245, 5) (up to 1e-30 allowed)
E       Greatest relative difference: 0.028590014204382896 at index (206, 99) (up to 1e-30 allowed)

@albertz
Member

albertz commented Feb 14, 2024

I replicated the RASR C++ FFT code in Python. Now I still get some difference in the amplitude-spectrum:

>       torch.testing.assert_close(amplitude_spectrum, rasr_feat, rtol=1e-30, atol=1e-30)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 72884 / 128243 (56.8%)
E       Greatest absolute difference: 0.0078125 at index (5, 6) (up to 1e-30 allowed)
E       Greatest relative difference: 2.3795448100827343e-07 at index (53, 75) (up to 1e-30 allowed)

I'm not sure if this is because I still (by mistake) do something slightly different than the corresponding C++ code, or because some of the math just behaves differently in PyTorch (e.g. torch.sin, or maybe the order of multiplications). I think RASR was also compiled with fastmath, so it could have reordered the formulas, which would also result in differences.

I have now spent quite a bit of time on this, and I'm starting to question whether this is really worth the effort. I also have the feeling that, to get really close results, we would probably need a lot of very custom implementations, our own custom FFT, etc. But on the other side, if we don't do this, it looks like we will still have quite large differences, too large to just use one as a drop-in for the other...
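For reference, the amplitude-spectrum definition itself is simple; a naive pure-Python DFT sketch is below (this is NOT the replicated RASR C++ algorithm, and it is far too slow for real use, which is exactly the performance concern raised here):

```python
import cmath

def dft(x):
    # Naive O(n^2) discrete Fourier transform.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def amplitude_spectrum(x):
    # For real input, keep only the first n//2 + 1 bins.
    return [abs(c) for c in dft(x)[: len(x) // 2 + 1]]

# A pure cosine at bin 1 puts all its energy there:
amps = amplitude_spectrum([1.0, 0.0, -1.0, 0.0])
assert abs(amps[1] - 2.0) < 1e-9
assert abs(amps[0]) < 1e-9 and abs(amps[2]) < 1e-9
```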

@albertz
Member

albertz commented Feb 14, 2024

Note, as a next step, to verify it even more whether the FFT produces exactly the same results or not, I would probably copy the RASR C++ code into some standalone test tool, and then verify step by step where it differs from the Python output. Maybe I still find some things I can fix on the Python side to make it really the same. Or if not, that will at least give more insights where it becomes different, where errors are accumulating.

@michelwi
Contributor

we probably need to have a lot of very custom implementations, our own custom FFT, etc.

if you think it would be easier to change/augment the rasr implementation, we could try to go that route instead.

@albertz
Member

albertz commented Feb 14, 2024

if you think it would be easier to change/augment the rasr implementation, we could try to go that route instead.

I'm really not an expert when it comes to FFT implementations. I know there are a couple of different ways to implement a Fourier transform, but even within the same algorithm there are so many details that might result in such differences.

First, we should maybe better understand whether there is still some bug, or some difference we can fix. I don't know if there is. I also don't know if it is expected that the numerical differences of FFT implementations can be so large (at least it was unexpected to me, but I don't know enough about it).

On changing RASR: I'm not sure if this is really simpler. Maybe. For that, we should better understand what the Torch FFT is actually doing. I also wonder how different Torch FFT is on CPU vs GPU. E.g. on GPU, this uses CuFFT, so again a very specific implementation, and also closed source (I think), so we cannot even really check the implementation. Of course, you could simply copy the Torch CPU code.

But also, what do you mean by "change the RASR implementation"? Change the flow network? So change the flow network that it simply returns raw audio samples and doesn't do anything? Then it would be trivial to make it the same. So we are done?

@michelwi
Contributor

michelwi commented Mar 5, 2024

I replicated the RASR C++ FFT code in Python. Now I still get some difference in the amplitude-spectrum:

Greatest absolute difference: 0.0078125 at index (5, 6) (up to 1e-30 allowed)
Greatest relative difference: 2.3795448100827343e-07 at index (53, 75) (up to 1e-30 allowed)

I ran your test cases and found that with rtol=1e-6, atol=1e-30 there are no differences found. Given that the implementation uses float32 I would say the deviation by 2e-7 relative is acceptable.
If now applying the filterbanks also leaves the deviation below 1e-6 relative, then I would say we are done.

Is your implementation of my_fft "sufficiently fast" in practice so that we can use it in the RasrCompatibleLogMelFeatureExtraction?

@albertz
Member

albertz commented Mar 5, 2024

I ran your test cases and found that with rtol=1e-6, atol=1e-30 there are no differences found. Given that the implementation uses float32 I would say the deviation by 2e-7 relative is acceptable.

What test case exactly? But as you see from my output, the absolute error was quite high, so I would need atol=1e-2 or so that it passes.

If now applying the filterbanks also leaves the deviation below 1e-6 relative, then I would say we are done.

But the output is already the best I could get, including my_fft. (Right? I don't remember exactly.)

Is your implementation of my_fft "sufficiently fast" in practice so that we can use it in the RasrCompatibleLogMelFeatureExtraction?

I did not measure it, but I would expect it is way too slow. There are several nested Python loops. Already the C++ code was a bit too slow when done on-the-fly, and I expect this Python code to be a factor of 10 or 100 slower. I don't think this is an option.

@michelwi
Contributor

michelwi commented Mar 5, 2024

What test case exactly?

test_rasr_compatible_fft and test_rasr_compatible_amplitude_spectrum as implemented in this PR, where I changed rtol=1e-6 and atol=1e-30

the absolute error was quite high, so I would need atol=1e-2 or so that it passes.

The definition of the closeness criterion according to https://pytorch.org/docs/stable/testing.html#torch.testing.assert_allclose is

|actual − expected| ≤ atol + rtol · |expected|

so if the value at that position in expected is large, then a large absolute difference can still be covered by rtol * |expected|. My guess is that this is what is happening.
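The effect can be checked numerically, using magnitudes like those in the mismatch reports above (plain-Python restatement of the criterion, not torch code):

```python
# The closeness criterion of torch.testing.assert_close, in plain Python:
#   |actual - expected| <= atol + rtol * |expected|
def is_close(actual, expected, rtol=1e-6, atol=1e-30):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# An absolute difference of 0.01 passes when the expected value is large,
# because rtol * 77318.94 ~ 0.077 dominates:
assert is_close(77318.93, 77318.94)
# ...but the same absolute magnitude of error fails for a small expected value:
assert not is_close(0.005, 0.013)
```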

@kuacakuaca
Collaborator Author

running the test case test_rasr_compatible_amplitude_spectrum using torch.fft.rfftn yields the following mismatch:

RASR: torch.Size([499, 257]) x∈[0.005153656471520662, 77318.9375] μ=425.7894592285156 σ=3692.595947265625 torch.float32
i6_models torch.Size([499, 257]) x∈[0.005153656005859375, 77318.9296875] μ=425.7894287109375 σ=3692.595703125 torch.float32

Mismatched elements: 60230 / 128243 (47.0%)
Greatest absolute difference: 0.00764918327331543 at index (248, 250) (up to 1e-05 allowed)
Greatest relative difference: 0.03186587807087847 at index (253, 28) (up to 1e-05 allowed)

are those still not close enough?

@michelwi
Contributor

michelwi commented Mar 7, 2024

relative difference: 0.03186587807087847
are those still not close enough?

3% is not a lot, but it is also not equal. We could do an experiment to see how much this changes the posteriors of the NN and finally the WER.

Contributor

@michelwi michelwi left a comment


I re-approve the changes. I would deem the remaining difference small enough to not influence the WER.

tests/test_feature_extraction.py (review thread, resolved)
smoothed = windowed[:, :-1] * self.window[None, None, :] # [B, T'-1, W]

# The last window might be shorter. Will use a shorter Hanning window then. Need to fix that.
last_win = torch.hann_window(last_win_size, periodic=False, dtype=torch.float64).to(
Contributor


Is this correct if you have sequences of different lengths in the batch? Maybe it's necessary to do this as a loop over all sequences in the batch and modify some parts of the output with this?

@curufinwe curufinwe changed the title Add rasr compatible feature extraction class and config Add RASR compatible feature extraction Mar 19, 2024
@curufinwe curufinwe marked this pull request as ready for review March 19, 2024 12:54
@curufinwe curufinwe merged commit a15dad4 into main Mar 19, 2024
2 checks passed