
Big gap in WER between online and offline CTC decoding #1194

Open
chiendb97 opened this issue May 11, 2023 · 36 comments

@chiendb97

I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. And here is the result:

  • For model librispeech conformer ctc: offline decoding: 3.49% WER, online decoding: 19.08% WER
  • For our model: offline decoding: ~3% WER, online decoding: ~18% WER
    (Online WER is much larger than offline, even though both use the same AM output; online decoding uses chunk size 16.)

Could you please tell me the difference between offline decoding and online decoding? In addition, could you tell us your results for the two kinds of decoding?
Thanks!

@danpovey
Collaborator

There are examples in Sherpa of real-time/streaming/online decoding; I think that might be a better starting point?
Normally you need to use a model that has been trained with streaming in mind.

@chiendb97
Author

> There are examples in Sherpa of real-time/streaming/online decoding

Can you please specify which example it is? I did look into the sherpa repo but did not find any examples of CTC-based streaming.

> Normally you need to use a model that has been trained with streaming in mind.

I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

@pkufool
Collaborator

pkufool commented May 12, 2023

> Can you please specify which example it is? I did look into the sherpa repo but did not find any examples of CTC-based streaming.

Sorry, there is no CTC HLG streaming decoding in Sherpa, only one example in k2/torch/bin (I think it is the online_decode.cu you used).

> I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

We normally test streaming decoding methods with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method.

@danpovey
Collaborator

But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice).

@chiendb97
Author

@pkufool @danpovey The way I tested was that I read the audio file and evaluated the nnet output for the entire audio. Then I used that output to simulate streaming as in online_decode.cu, and used the final text result to compute the WER. I did the test twice, using the conformer ctc model from icefall and my own conformer ctc model (trained with wenet). However, in both cases the results were not as good as offline decoding.
I tried printing out the lattice (lattice.fsa.values) of the online decoder and noticed that the first few lattices are much the same as the offline decoder's, but after that they start to differ.
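
In sketch form, that test loop looks roughly like the following (a sketch only: `nnet_output`, `decoding_graph`, the beam/state values, and the exact DecodeStateInfo handling are assumptions modeled on k2's Python API; the actual test used online_decode.cu):

```python
import k2
import torch

# Assumptions: `nnet_output` is a (1, T, C) tensor of log-probs computed
# once over the whole utterance; `decoding_graph` is a k2.Fsa such as HLG.
chunk_size = 16  # frames per chunk, as in the experiment above

intersecter = k2.OnlineDenseIntersecter(
    decoding_graph=decoding_graph,
    num_streams=1,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)

# One decode-state object per stream; the initialization API may differ
# slightly (see online_decode.py in #1218 for the authoritative version).
decode_states = [k2.DecodeStateInfo()]

num_frames = nnet_output.shape[1]
for start in range(0, num_frames, chunk_size):
    chunk = nnet_output[:, start:start + chunk_size, :]
    # supervision rows are [stream_index, start_frame, num_frames]
    supervision = torch.tensor([[0, 0, chunk.shape[1]]], dtype=torch.int32)
    dense_fsa_vec = k2.DenseFsaVec(chunk, supervision)
    lattice, decode_states = intersecter.decode(dense_fsa_vec, decode_states)

# After the final chunk, the lattice covers the whole utterance and the
# best path can be extracted as in offline decoding.
best_path = k2.shortest_path(lattice, use_double_scores=True)
```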

@danpovey
Collaborator

hm, how did it differ?
@pkufool do you think there is possibly a bug that is affecting him?
@chiendb97 what version of k2 are you using? see if a newer version helps.

@chiendb97
Author

> what version of k2 are you using? see if a newer version helps.

I am using the latest version of k2.

@pkufool
Collaborator

pkufool commented May 15, 2023

> @pkufool do you think there is possibly a bug that is affecting him?

Yes, I think there could be some bugs. I will look into the code.

@svandiekendialpad

I am currently experiencing the same issue. Offline decoding is fine, but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount, with almost all new errors coming from deletions.

@pkufool
Collaborator

pkufool commented May 17, 2023

> I am currently experiencing the same issue. Offline decoding is fine, but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount, with almost all new errors coming from deletions.

OK, I am debugging it.

@svandiekendialpad

Any updates @pkufool?

@pkufool
Collaborator

pkufool commented Jun 28, 2023

> Any updates @pkufool?

Sorry, I did not fix it that day and then forgot about it; I will return to it.

@pkufool
Collaborator

pkufool commented Jul 4, 2023

@svandiekendialpad @chiendb97 Does the difference only happen when using --use_ctc_decoding=false (i.e. decoding with an n-gram)?

@binhtranmcs

Hi @pkufool, I just ran tests again using librispeech conformer ctc; here are the results:

  • Using --use_ctc_decoding=true, I got WER=7.3%.
  • Using offline ctc decoding in ctc_decode.cu, I got WER=2.6%.

So I think there is still a significant difference between the online and offline implementations regardless of whether an n-gram is used (though the gap is smaller).

@svandiekendialpad

I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code.

@pkufool
Collaborator

pkufool commented Jul 11, 2023

@binhtranmcs I think #1218 solves some problems, but there are still differences between the lattices generated by online and offline modes. I now know it relates to the pruning, and I am trying to fix it.

@pkufool
Collaborator

pkufool commented Jul 11, 2023

@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass in the previous chunk. This is the only difference I found after reading the code carefully.
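
Schematically, the difference is (pseudocode, not the actual k2 source):

```python
# Offline: the forward pass expands every frame before any backward pruning.
for t in range(T):
    expand_arcs(frames[t])              # frames[t] is still unpruned here
prune_backward(frames, output_beam)     # pruning happens once, at the end

# Online: at each chunk boundary, the first frame we expand from has
# already been pruned by the previous chunk's backward pass.
for chunk in chunks:
    for t in range(chunk.start, chunk.end):
        expand_arcs(frames[t])          # frames[chunk.start] was pruned
    prune_backward(frames[:chunk.end], output_beam)
```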

@danpovey
Collaborator

danpovey commented Jul 11, 2023 via email

@binhtranmcs

Hi @danpovey, as I want to understand the code, could you please give me some references for the online/offline decoding algorithm implemented here? Since I am pretty new to this, it would really help a lot. Thanks in advance.

@danpovey
Collaborator

danpovey commented Jul 16, 2023 via email

@pkufool
Collaborator

pkufool commented Jul 19, 2023

@binhtranmcs @svandiekendialpad @chiendb97 I think #1218 can fix this issue; you can try it on your dataset.

@binhtranmcs

@pkufool, I just tested again with librispeech conformer ctc, using online_decode.cu:

  • With --use_ctc_decoding=true, WER=7.3%.
  • With --use_ctc_decoding=false, WER=12.2%.

WER for online HLG decoding did decrease (from 18% down to 12%), but it is still not as good as offline decoding (3.49%). I think there are still problems here.

@svandiekendialpad

For me it went up from 33% to 45%, when 14% would be normal. Should I have used allow_partial anywhere? I just left it at its default (true in OnlineDenseIntersecter).

@pkufool
Collaborator

pkufool commented Jul 20, 2023

@binhtranmcs @svandiekendialpad OK, I have just tested some bad cases; I will test the full test datasets.

@binhtranmcs

Hi @pkufool, are there any updates on this?

@danpovey
Collaborator

danpovey commented Aug 1, 2023

I think #1218 may be relevant to this. Not merged yet but says it is ready.

@pkufool
Collaborator

pkufool commented Aug 1, 2023

> I think #1218 may be relevant to this. Not merged yet but says it is ready.

It's a pity that the fixes in #1218 cannot fix the whole issue; I am still debugging it.

@pkufool
Collaborator

pkufool commented Aug 3, 2023

I did some experiments on librispeech test-clean; here are the results.

For CTC decoding (decoding with a CTC topology), after applying the fixes in #1218 I can get almost the same WERs for online and offline:

|              | Offline | Online (chunk=10) |
|--------------|---------|-------------------|
| CTC decoding | 2.99    | 2.92              |

For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the tails of sentences:

|              | Offline | Online (chunk=10) | Online (chunk=30) | Online (chunk=50) | Online (chunk=30), decoding_graph.scores = 0.0 |
|--------------|---------|-------------------|-------------------|-------------------|------------------------------------------------|
| HLG decoding | 2.77    | 19.06             | 6.93              | 5.13              | 3.02                                           |

I believe this is an issue with pruning at the boundary frames (as I mentioned above). When I set the output_beam (used in backward pruning) to the same value as the search_beam (used in forward pruning), I get the same results:

|              | Offline | Online (chunk=10) | Online (chunk=10), output_beam = search_beam |
|--------------|---------|-------------------|----------------------------------------------|
| HLG decoding | 2.77    | 19.06             | 2.73                                         |

I need to revisit the implementation carefully to figure out a proper fix for this issue; for now I think you can try using the same output_beam and search_beam.
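
Concretely, the workaround is just to construct the intersecter with matching beams. A minimal sketch, assuming the parameter names of the Python OnlineDenseIntersecter wrapper (HLG and the numeric values are placeholders):

```python
import k2

beam = 20.0
intersecter = k2.OnlineDenseIntersecter(
    decoding_graph=HLG,    # your decoding graph
    num_streams=1,
    search_beam=beam,      # forward-pruning beam
    output_beam=beam,      # backward-pruning beam; matching the two avoids
                           # over-pruning at chunk boundaries, at some cost
                           # in speed and memory
    min_active_states=30,
    max_active_states=10000,
)
```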

[edit:] BTW, I added Python test code in #1218 (online_decode.py and hlg_decode.py) which accepts a wav scp; you can then use simple-wer to calculate the WERs.

@danpovey
Collaborator

danpovey commented Aug 3, 2023

@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using output_beam. I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final state is not correct.
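
In other words, the invariant that intermediate pruning relies on is roughly the following (schematic pseudocode, not the actual k2 source):

```python
# At an intermediate pruning step (not the end of the file), each active
# state s on the frontier should have its backward score initialized as
#
#     backward_score[s] = -forward_score[s]
#
# so that forward_score[s] + backward_score[s] == 0 on the frontier, and
# pruning with output_beam keeps exactly the states whose best total score
# is within output_beam of the current best path.
for s in active_states:
    backward_score[s] = -forward_score[s]
```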

@pkufool
Collaborator

pkufool commented Aug 7, 2023

@binhtranmcs @svandiekendialpad @chiendb97 I updated #1218; I think this time it should be able to fix your issue.

@svandiekendialpad

@pkufool I'm trying to replicate your results; for now I still have a very high error rate due to deletions. I am therefore investigating whether my custom decoder implementation has a bug.

However, could you send me a short code snippet showing how you set the decoding graph scores to 0.0? I just set HLG.scores = torch.zeros(HLG.scores.shape), and it leads to an AssertionError in parse_timestamps_and_texts, where I end up with fewer index_pairs than words/tokens. This doesn't happen when the scores aren't zero.

@desh2608
Contributor

I think you can simply do HLG.scores *= 0. I guess HLG.scores is a RaggedTensor and so its shape attribute actually refers to an underlying RaggedShape (and not a torch Tensor).

@svandiekendialpad

For me, HLG.scores is a torch.Tensor.
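
For reference, a minimal sketch of zeroing the scores in place (assuming HLG is a k2.Fsa whose scores attribute is a torch.Tensor, as described above; this mirrors the suggestion in this thread rather than a confirmed fix):

```python
# Zero the graph scores in place instead of assigning a fresh tensor;
# the in-place update keeps the same underlying storage.
HLG.scores *= 0
# An equivalent in-place form:
# HLG.scores.zero_()
```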

@pkufool
Collaborator

pkufool commented Aug 11, 2023

@svandiekendialpad I did test the fixes on test-clean with the librispeech conformer ctc model, and I got 2.73% for online decoding (with online_decode.py). Can you try your test set with my script (i.e. online_decode.py in #1218)? Let me know if you run into any trouble, thanks!

@videodanchik
Contributor

videodanchik commented Sep 20, 2023

Hi @pkufool, thanks for your effort in resolving this issue. I've downloaded librispeech conformer ctc and the latest librispeech zipformer, trained with both ctc and rnnt losses. I decoded test-clean and test-other with both models, online (chunk = 15) and offline, before and after the fix from #1218.

Results before the fix (HLG decoding shown with different acoustic_model_weight = 1 / lm_model_weight):

| decoding type           | test_clean (conformer / zipformer) | test_other (conformer / zipformer) |
|-------------------------|------------------------------------|------------------------------------|
| H online                | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG online, am scale 1  | 20.87 / 20.90                      | 21.09 / 20.62                      |
| HLG online, am scale 2  | 23.14 / 23.89                      | 23.17 / 23.16                      |
| HLG online, am scale 3  | 23.70 / 24.69                      | 23.68 / 23.83                      |
| HLG online, am scale 4  | 23.97 / 25.13                      | 23.89 / 24.18                      |
| H offline               | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG offline, am scale 1 | 2.68 / 2.60                        | 6.43 / 5.42                        |
| HLG offline, am scale 2 | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG offline, am scale 3 | 2.71 / 2.39                        | 6.47 / 5.14                        |
| HLG offline, am scale 4 | 2.73 / 2.40                        | 6.54 / 5.16                        |

Results after the fix:

| decoding type           | test_clean (conformer / zipformer) | test_other (conformer / zipformer) |
|-------------------------|------------------------------------|------------------------------------|
| H online                | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG online, am scale 1  | 2.68 / 2.59                        | 6.43 / 5.42                        |
| HLG online, am scale 2  | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG online, am scale 3  | 2.72 / 2.39                        | 6.47 / 5.15                        |
| HLG online, am scale 4  | 2.74 / 2.40                        | 6.56 / 5.17                        |
| HLG online, am scale 5  | 2.74 / 2.40                        | 6.61 / 5.21                        |
| H offline               | 2.86 / 2.35                        | 7.46 / 5.67                        |
| HLG offline, am scale 1 | 2.68 / 2.59                        | 6.43 / 5.42                        |
| HLG offline, am scale 2 | 2.70 / 2.39                        | 6.36 / 5.12                        |
| HLG offline, am scale 3 | 2.71 / 2.39                        | 6.47 / 5.14                        |
| HLG offline, am scale 4 | 2.73 / 2.40                        | 6.55 / 5.16                        |
| HLG offline, am scale 5 | 2.74 / 2.40                        | 6.58 / 5.19                        |

So, online decoding works well now. I also went through the code with @svandiekendialpad and we sorted things out; everything works as expected. @pkufool can we consider merging #1218 to master, as this is a really important fix? I see you were asked in #1218 to add the allow-partial option for k2.intersect and k2.intersect_device; is it possible to elaborate on this, or to merge it as is?

@pkufool
Collaborator

pkufool commented Sep 22, 2023

@videodanchik Thanks very much for the testing! Yes, I will have a look at the failed CI tests and merge it.

> I see you were asked in #1218 to add the allow-partial option for k2.intersect and k2.intersect_device; is it possible to elaborate on this, or to merge it as is?

Actually, I have not started this work yet; I will make a separate PR later.
