Add RASR compatible feature extraction #44
Conversation
Just one comment, comparing to what I know from the usual RASR feature flows. Otherwise this looks good to me.
I cannot really judge just from the code whether it is correct. I assume you tested that you get the same output? Then I would say all is fine.
If you want, a test could be added where you feed some dummy data through RASR, get the output, and compare it with this module's output. You can store the RASR output directly in the test case, so there is no need to run RASR in the test case. But it is not really necessary now.
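The suggested test layout could look roughly like this sketch. Note that `extract_features` is a hypothetical stand-in for the module's entry point, and the reference numbers are placeholders, not real RASR output; the point is only that the RASR dump is stored inline, so the test never runs RASR itself.

```python
import numpy as np

# Hypothetical test sketch: `extract_features` stands in for the module under
# review, and RASR_REFERENCE holds placeholder numbers, NOT real RASR output.
def extract_features(samples: np.ndarray) -> np.ndarray:
    # placeholder; the real module would compute the actual features
    return samples

# Values dumped once from RASR for the same fixed dummy input (placeholders):
RASR_REFERENCE = np.array([0.0, 0.0, 0.0, 0.0])

def test_matches_rasr_reference():
    dummy_input = np.zeros(4, dtype=np.float64)
    np.testing.assert_allclose(extract_features(dummy_input), RASR_REFERENCE, atol=1e-5)

test_matches_rasr_reference()
print("ok")
```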
The difference is at the 3rd decimal position; that is more than just numeric precision, so there is still something different?
Not sure if it is worth investigating further. I don't think our TensorFlow features and RASR features match 100%.
I'm checking this currently. What was the RASR flow file used to get these? And what were the corresponding settings?
I cannot tell you what was used to create the numbers above, and I think those might be from a version before preemphasis and zero padding were added. Here is the config/flow that I have used recently:
Just by copy-pasting I see that the epsilon defined in the config and the flow does not match… 🙈
I still wonder about the shift (off by ~4.515): I figured out now that, before the log10, this is exactly a factor 2^15 = 32,768.

But I think I used the same flow as you. (Well, actually, I did not: @curufinwe gave me some generated feature cache, and he said it was generated with this flow network.) So I'm not sure; maybe this here is actually the wrong flow network?

I also still don't exactly understand what we mean by "RASR compatible feature extraction" here. Does this mean it is supposed to do the same as this specific flow network? Then shouldn't this reference flow network be documented somehow? Otherwise, "RASR compatible feature extraction" can mean anything or nothing?
RASR reads the WAV samples as a short (int16) and then just casts them to float. If the other code reads the samples as float
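The two sample-reading conventions discussed here can be sketched as follows (an illustration, not actual RASR code): casting int16 to float keeps values in [-32768, 32767], while many float-based loaders normalize to [-1, 1), and the two differ by exactly the factor 2^15 that shows up as the ~4.515 shift after log10.

```python
import numpy as np

# RASR-style: interpret raw 16-bit PCM as int16 and cast directly to float.
raw = np.array([0, 16384, -32768, 32767], dtype=np.int16)
rasr_style = raw.astype(np.float64)

# Common float convention: normalize to [-1, 1).
float_style = raw.astype(np.float64) / 32768.0

# The two differ exactly by 2**15 = 32768, which after log10 becomes an
# additive shift of log10(2**15) ~= 4.515 -- matching the shift seen above.
assert np.allclose(rasr_style, float_style * 2**15)
print(round(float(np.log10(2**15)), 3))  # 4.515
```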
As I said, this is the flow that I used. It is based on existing flow implementations, but I changed it to be more similar to this PR. It could be that @curufinwe by accident (or by virtue of the obvious implementation) arrived at the same flow as me, but I would not guarantee it.
I agree with this question. I think as a first goal, "RASR compatible" would mean something that can be exactly reproduced as a RASR flow network, although not necessarily the same as "the one flow that we always use".

But then, when I tested the version of this PR without preemphasis (and also without preemphasis in the RASR flow) to train a model, and compared to the current version of this PR (and the flow I posted above), I found an improvement in WER from 16.1 to 15.7.

So maybe as a stretch goal we should aim to make the implementation as similar as possible to "the one flow we always use" (which is different from the one I posted above). But then nobody would argue against using different features if the final WER is the same (or better) and the structure is simpler.
I'm on this now. For this, I created a test case where I can test the flow file step by step. I found many more differences:
Now, up to that point in the flow network, it matches. But there are still remaining differences: the FFT seems different. I need to understand this better.
I replicated the RASR C++ FFT code in Python. Now I still get some difference in the amplitude-spectrum:
I'm not sure if this is because I still (by mistake) do something a little bit different from the corresponding C++ code, or because some of the math just behaves differently in PyTorch.

I have now already spent quite a bit of time on this, and I'm questioning a bit whether this is really worth the effort. I also have the feeling that, to get really close results, we would probably need a lot of very custom implementations, our own custom FFT, etc. But on the other side, if we don't do this, it looks like we will still have quite large differences, too large to just use one as a drop-in for the other...
Note, as a next step, to verify even further whether the FFT produces exactly the same results or not, I would probably copy the RASR C++ code into some standalone test tool, and then verify step by step where it differs from the Python output. Maybe I still find some things I can fix on the Python side to make it really the same. If not, that will at least give more insight into where it becomes different and where errors are accumulating.
If you think it would be easier to change/augment the RASR implementation, we could try to go that route instead.
I'm really not an expert when it comes to FFT implementations. I know there are a couple of different ways to implement a Fourier transform, but even when looking at the same algorithm, there are so many details that might result in such differences. First, we should maybe better understand whether there is still some bug, or some difference we can fix. I don't know if there is. I also don't know whether it is expected that the numerical differences between FFT implementations can be this large (at least it was unexpected to me, but I don't know enough about it).

On changing RASR: I'm not sure if this is really simpler. Maybe. For that, we should better understand what the Torch FFT is actually doing. I also wonder how different Torch FFT is on CPU vs GPU. E.g. on GPU, it uses cuFFT, so again a very specific implementation, and also closed source (I think), so we cannot even really check the implementation. Of course, you could simply copy the Torch CPU code.

But also, what do you mean by "change the RASR implementation"? Change the flow network? So change the flow network such that it simply returns raw audio samples and doesn't do anything? Then it would be trivial to make it the same. So we are done?
I ran your test cases and found that with a loosened tolerance they pass. Is your implementation of the FFT usable in practice, e.g. fast enough?
What test case exactly? But as you see from my output, the absolute error was quite high, so I would need atol=1e-2 or so for it to pass.
But the output is already the best I could get, including
I did not measure it, but I would expect it is way too slow. There are several nestings of Python loops. Already the C++ code was a bit too slow when done on-the-fly, and I expect this Python code will be a factor of 10 or 100 slower. I don't think this is an option.
The definition of the closeness criterion according to https://pytorch.org/docs/stable/testing.html#torch.testing.assert_allclose is |actual - expected| <= atol + rtol * |expected|,
so if the value at the specific position in
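That criterion can be checked by hand. The sketch below uses the feature values quoted further down in this thread (position 13 of each row) and the documented float32 defaults of `assert_allclose`; with those defaults a ~2e-3 absolute difference fails, while the atol=1e-2 mentioned above passes.

```python
# Hand-rolled version of the closeness criterion from the torch.testing docs:
# a pair passes iff |actual - expected| <= atol + rtol * |expected|.
def is_close(actual: float, expected: float,
             rtol: float = 1.3e-6, atol: float = 1e-5) -> bool:
    # rtol/atol defaults are the documented float32 defaults of assert_allclose
    return abs(actual - expected) <= atol + rtol * abs(expected)

# Torch vs. RASR value from the tables below; they differ by ~1.8e-3:
print(is_close(2.2317, 2.233468))             # False with the defaults
print(is_close(2.2317, 2.233468, atol=1e-2))  # True with the looser atol=1e-2
```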
Running the test case: are those still not close enough?
3% is not a lot, but it is also not equal. We could do an experiment to see how much this changes the posteriors of the NN and, finally, the WER.
I re-approve the changes. I would deem the remaining difference small enough to not influence the WER.
smoothed = windowed[:, :-1] * self.window[None, None, :]  # [B, T'-1, W]
# The last window might be shorter. Will use a shorter Hanning window then. Need to fix that.
last_win = torch.hann_window(last_win_size, periodic=False, dtype=torch.float64).to(
Is this correct if you have sequences of different lengths in the batch? Maybe it's necessary to do this as a loop over all sequences in the batch and modify some parts of the output with this?
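The per-sequence loop suggested here could look roughly like this sketch (hypothetical names, not the PR's actual variables; NumPy's `np.hanning` is used here, which matches `torch.hann_window(..., periodic=False)` for the symmetric case): with different sequence lengths in the batch, the size of the last, shorter window differs per sequence, so each sequence's last frame is patched individually instead of applying one shared window to the whole batch.

```python
import numpy as np

def window_last_frames(frames: np.ndarray, last_sizes: list) -> np.ndarray:
    """frames: [B, W] last frame of each sequence, zero-padded to full width W.
    last_sizes: per-sequence number of valid samples in that last frame."""
    out = frames.copy()
    for b, n in enumerate(last_sizes):
        # shorter symmetric Hann window for this sequence's tail
        out[b, :n] = frames[b, :n] * np.hanning(n)
        out[b, n:] = 0.0  # samples past the sequence end stay zero
    return out

frames = np.ones((2, 8))
res = window_last_frames(frames, [8, 5])
print(res.shape, float(res[1, 5:].sum()))  # (2, 8) 0.0
```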
One frame of RASR features:
0.234030 0.731747 1.082119 1.292996 0.890861 1.056064 1.390038 1.597611 1.604475 1.586611 1.939575 2.216516 2.233468 1.941573 1.820577 2.262333 2.291739 2.248550 2.066255 2.148235 2.549040 2.568474 2.229928 2.418101 2.622112 2.478587 2.042983 2.188321 2.437973 2.706243 2.692690 2.617883 2.669614 2.743750 2.857951 2.730998 2.775461 2.747587 2.674153 3.180011 2.949069 3.098550 3.110722 2.988153 2.890325 3.153568 3.039258 3.129505 3.048017 3.061022 3.084099 3.244599 3.241169 3.604517 3.651453 3.282753 2.902287 3.145018 3.163275 2.913236
The same frame of Torch features:
0.2297, 0.7334, 1.0803, 1.2995, 0.9143, 1.0722, 1.3999, 1.6022, 1.6045, 1.5851, 1.9403, 2.2169,
2.2317, 1.9365, 1.8187, 2.2637, 2.2955, 2.2538, 2.0692, 2.1491, 2.5498, 2.5663, 2.2258, 2.4154,
2.6246, 2.4799, 2.0473, 2.1879, 2.4400, 2.7045, 2.6909, 2.6192, 2.6692, 2.7444, 2.8586, 2.7310,
2.7798, 2.7500, 2.6750, 3.1814, 2.9498, 3.1026, 3.1138, 2.9888, 2.8930, 3.1551, 3.0379, 3.1301,
3.0457, 3.0599, 3.0900, 3.2458, 3.2438, 3.6075, 3.6512, 3.2803, 2.9031, 3.1472, 3.1635, 2.9173
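The gap between the two dumps can be quantified directly; the snippet below compares the first twelve values of each row (copied from the tables above; the remaining values behave similarly) and shows the largest difference sits around the 2nd-3rd decimal, consistent with the "3rd decimal position" observation earlier in the thread.

```python
import numpy as np

# First twelve values of each feature row, copied from the tables above:
rasr = np.array([0.234030, 0.731747, 1.082119, 1.292996, 0.890861, 1.056064,
                 1.390038, 1.597611, 1.604475, 1.586611, 1.939575, 2.216516])
torch_feats = np.array([0.2297, 0.7334, 1.0803, 1.2995, 0.9143, 1.0722,
                        1.3999, 1.6022, 1.6045, 1.5851, 1.9403, 2.2169])

diff = np.abs(rasr - torch_feats)
print(round(float(diff.max()), 4))  # 0.0234
```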