Big gap in WER between online and offline CTC decoding #1194
There are examples in Sherpa of real-time/streaming/online decoding; I think that might be a better starting point? |
Can you please specify which example it is? I did look into the sherpa repo but did not find any examples of CTC-based streaming.
I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big. |
Sorry, there is no CTC HLG streaming decoding in Sherpa, only one example in k2/torch/bin (I think it is the online_decode.cu you used).
We normally test the streaming decoding method with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method. |
But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice). |
@pkufool @danpovey How I tested was that I read the audio file and evaluated the nnet output for the entire audio. Then I used that output to simulate streaming as in online_decode.cu and used the final text result to compute the WER. I did the test twice, using the conformer ctc model from icefall and my own conformer ctc model (trained with wenet). However, in both cases the results were not as good as offline decoding. |
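For reference, a minimal sketch of this kind of simulated streaming using k2's Python API. The `OnlineDenseIntersecter`/`DecodeStateInfo` names follow the API exercised by online_decode.py in #1218; treat the exact signatures and the beam/active-state values here as assumptions, not the thread's actual test code:

```python
import torch
import k2

def simulate_streaming(nnet_output: torch.Tensor,
                       HLG: k2.Fsa,
                       chunk_size: int = 16) -> k2.Fsa:
    """Run the nnet once over the whole utterance, then feed its output
    to the online intersecter chunk by chunk, as online_decode.cu does.

    nnet_output: (1, T, C) log-probs for a single utterance.
    """
    intersecter = k2.OnlineDenseIntersecter(
        decoding_graph=HLG,
        num_streams=1,
        search_beam=20.0,
        output_beam=8.0,
        min_active_states=30,
        max_active_states=10000,
    )
    decode_states = [k2.DecodeStateInfo()]  # one state per stream
    lattice = None
    T = nnet_output.shape[1]
    for start in range(0, T, chunk_size):
        chunk = nnet_output[:, start:start + chunk_size, :]
        # Each supervision row is [fsa_index, start_frame, num_frames].
        supervision = torch.tensor([[0, 0, chunk.shape[1]]],
                                   dtype=torch.int32)
        dense = k2.DenseFsaVec(chunk, supervision)
        lattice, decode_states = intersecter.decode(dense, decode_states)
    # After the last chunk, `lattice` covers the decoded frames;
    # k2.shortest_path(lattice, use_double_scores=True) gives the hypothesis.
    return lattice
```

In principle, with a non-streaming model and the full nnet output, this should match the offline result, which is what makes the reported gap suspicious.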
hm, how did it differ? |
I am using the latest version of k2. |
Yes, I think there could be some bugs. I will look into the code. |
I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using |
OK, I am debugging it. |
Any updates @pkufool? |
Sorry, I did not fix it that day and forgot about it; I will return to it. |
@svandiekendialpad @chiendb97 Does the difference only happen when using |
Hi @pkufool, I just ran tests again using librispeech conformer ctc; here is the result:
So I think there is still a significant difference between online and offline implementations regardless of using n-gram (though the gap is smaller). |
I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code. |
@binhtranmcs I think #1218 solves some problems, but there are still differences between the lattices generated by online and offline modes. Now I know it relates to the pruning; I am trying to fix it. |
@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass in the previous chunk. This is the only difference I found after reading the code carefully. |
Does the backward pass start with -(forward score) on all active states?
That's how it is supposed to work.
|
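To make the asymmetry @pkufool describes concrete, here is a schematic of the two orders of operations under discussion. `expand_arcs` and `backward_prune` are hypothetical stand-ins, not actual k2 functions:

```python
# Schematic of the pruning order discussed above; not real k2 code.

def offline_decode(num_frames: int) -> None:
    # Offline: the forward pass expands every frame first...
    for t in range(num_frames):
        expand_arcs(t)              # frames[t] is still unpruned here
    # ...and only then does the backward pass prune the whole lattice.
    backward_prune(0, num_frames)

def online_decode(chunk_sizes: list) -> None:
    # Online: each chunk is expanded AND backward-pruned before the next
    # chunk arrives, so when expansion resumes at a chunk boundary t,
    # frames[t] has already been pruned by the previous chunk's
    # backward pass.
    t = 0
    for size in chunk_sizes:
        for _ in range(size):
            expand_arcs(t)
            t += 1
        backward_prune(t - size, t)
```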
Hi @danpovey, as I want to understand the code, can you please point me to some references for the online/offline decoding algorithm implemented here? Since I am pretty new to this, it would really help a lot. Thanks in advance. |
I think it is described in my paper about exact lattices, or at least mentioned there. It is pruned Viterbi beam search with some extensions to store a lattice.
The guys have discovered the problem but IDK if they have made the fix public yet.
|
@binhtranmcs @svandiekendialpad @chiendb97 I think #1218 can fix this issue, you can try it on your dataset. |
@pkufool, I just tested again with librispeech conformer ctc, using
WER for online HLG decoding did decrease (from 18% down to 12%) but it is not as good as offline decoding (3.49%). I think there are still problems here. |
For me it went up from 33% to 45%, whereas 14% should be normal. Should I have used |
@binhtranmcs @svandiekendialpad OK, I just tested some bad cases, will test the full test datasets. |
Hi @pkufool, are there any updates on this??? |
I think #1218 may be relevant to this. Not merged yet but says it is ready. |
I did some experiments on librispeech test-clean; here are the results:
For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the tails of sentences.
I believe this is the issue of pruning at the boundary frames (as I mentioned above). When I set the
I need to revisit the implementation carefully to figure out the fixes for this issue; for now I think you can try using the same [edit:] BTW, I added the Python test code in #1218 |
@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using "output_beam". I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final-state is not correct. |
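A small sketch of the invariant Dan describes, with hypothetical names (schematic rather than k2's actual implementation): at an intermediate prune, the backward score of every active state on the frontier is seeded with the negated forward score, so every active state has total score 0 and pruning against `output_beam` is safe.

```python
# Schematic of intermediate pruning with backward scores seeded as
# backward[s] = -forward[s]; the names here are hypothetical, not k2's.

def prune_frontier_arcs(frontier_states, forward, arcs_into_frontier,
                        output_beam):
    # Seed the frontier (the last decoded frame's active states):
    # forward[s] + backward[s] == 0, so the best total score is exactly 0.
    backward = {s: -forward[s] for s in frontier_states}
    best_total = 0.0
    # An arc (u -> v, score) into the frontier survives only if the best
    # path through it stays within output_beam of the best total score.
    # (In the real implementation these backward scores would then be
    # propagated to earlier frames to prune them the same way.)
    return [(u, v, score) for (u, v, score) in arcs_into_frontier
            if forward[u] + score + backward[v] >= best_total - output_beam]
```

If the seed is missing or overwritten, as Dan suspects, paths near chunk boundaries would look artificially bad and get pruned away, which matches the deletions observed at the tails of sentences.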
@binhtranmcs @svandiekendialpad @chiendb97 I updated #1218; I think this time it should be able to fix your issue. |
@pkufool I'm trying to replicate your results; for now I still have a very high error rate due to deletions. I am therefore investigating whether my custom decoder implementation has a bug. However, could you send me a short code snippet showing how you set the decoding graph scores to 0.0? I just set |
I think you can simply do |
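If the elided suggestion is about zeroing the graph scores, the usual k2 idiom is a one-liner on the `k2.Fsa`; a sketch, assuming `HLG` is the decoding graph:

```python
import torch

# Zero out all arc scores on the decoding graph so that decoding is
# driven purely by the acoustic scores; HLG is assumed to be a k2.Fsa.
HLG.scores = torch.zeros_like(HLG.scores)
```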
For me |
@svandiekendialpad I did test the fixes on test-clean with the model librispeech conformer ctc, and I got 2.73% for online decoding (with online_decode.py). Can you try your test set with my script (i.e. online_decode.py in #1218)? Let me know if you run into any trouble, thanks! |
Hi @pkufool, thanks for your effort on resolving this issue. I've downloaded librispeech conformer ctc and the latest librispeech zipformer, trained with both ctc and rnnt losses. I decoded
Results before the fix (HLG decoding presented with different
Results after the fix:
So, online decoding works well now. I also went through the code with @svandiekendialpad and we sorted things out; everything works as expected. @pkufool can we consider merging #1218 to master, as this is a really important fix? I see you were asked in #1218 to add the |
@videodanchik Thanks very much for the testing! Yes, I will have a look at the failed CI tests and merge it.
Actually, I have not started this work yet; I will make a separate PR later. |
I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. Here is the result:
(The WER for online decoding is much larger than for offline decoding; both use the same AM output, and online decoding uses a chunk size of 16.)
Could you please tell me the difference between offline decoding and online decoding? In addition, could you tell us the expected results for the two kinds of decoding?
Thanks!
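For context, the offline path (hlg_decode.cu) amounts to a single pruned intersection over the whole utterance, followed by a shortest-path search. A minimal sketch with k2's offline Python API; the beam and active-state values are illustrative, and `nnet_output`/`HLG` are assumed inputs:

```python
import torch
import k2

# Assumed inputs: `nnet_output` of shape (1, T, C) log-probs for one
# utterance, and the decoding graph `HLG` (a k2.Fsa).
T = nnet_output.shape[1]
# Each supervision row is [fsa_index, start_frame, num_frames].
supervision = torch.tensor([[0, 0, T]], dtype=torch.int32)
dense = k2.DenseFsaVec(nnet_output, supervision)

lattice = k2.intersect_dense_pruned(
    HLG,
    dense,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
# The best path's aux_labels carry the output symbols (e.g. word IDs).
best_path = k2.shortest_path(lattice, use_double_scores=True)
```

The online path covers the same frames but intersects them chunk by chunk, carrying decode state across chunks (see the simulated-streaming sketch earlier in the thread), which is why any pruning asymmetry at chunk boundaries shows up as a WER gap.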