Big gap in WER between online and offline CTC decoding #1194
There are examples in Sherpa of real-time/streaming/online decoding; I think that might be a better starting point? |
Can you please specify which example it is? I did look into the sherpa repo but did not find any examples of CTC-based streaming.
I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big. |
Sorry, there is no CTC HLG streaming decoding in Sherpa, only one example in k2/torch/bin (I think it is the online_decode.cu you used).
We normally test the streaming decoding method with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method. |
But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice). |
@pkufool @danpovey How I tested was that I read the audio file and evaluated the nnet output for the entire audio. Then I used that output to simulate streaming as in online_decode.cu and used the final text result to compute the WER. I did the test twice, using the conformer ctc model from icefall and my own conformer ctc model (trained with wenet). However, in both cases the results were not as good as offline decoding. |
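For reference, a minimal sketch of this kind of simulated streaming using k2's Python API. The `OnlineDenseIntersecter`/`DecodeStateInfo` names follow the API exercised by online_decode.py in #1218; treat the exact signatures and the beam/active-state values here as assumptions, not the thread's actual test code:

```python
import torch
import k2

def simulate_streaming(nnet_output: torch.Tensor,
                       HLG: k2.Fsa,
                       chunk_size: int = 16) -> k2.Fsa:
    """Run the nnet once over the whole utterance, then feed its output
    to the online intersecter chunk by chunk, as online_decode.cu does.

    nnet_output: (1, T, C) log-probs for a single utterance.
    """
    intersecter = k2.OnlineDenseIntersecter(
        decoding_graph=HLG,
        num_streams=1,
        search_beam=20.0,
        output_beam=8.0,
        min_active_states=30,
        max_active_states=10000,
    )
    decode_states = [k2.DecodeStateInfo()]  # one state per stream
    lattice = None
    T = nnet_output.shape[1]
    for start in range(0, T, chunk_size):
        chunk = nnet_output[:, start:start + chunk_size, :]
        # Each supervision row is [fsa_index, start_frame, num_frames].
        supervision = torch.tensor([[0, 0, chunk.shape[1]]],
                                   dtype=torch.int32)
        dense = k2.DenseFsaVec(chunk, supervision)
        lattice, decode_states = intersecter.decode(dense, decode_states)
    # After the last chunk, `lattice` covers the decoded frames;
    # k2.shortest_path(lattice, use_double_scores=True) gives the hypothesis.
    return lattice
```

In principle, with a non-streaming model and the full nnet output, this should match the offline result, which is what makes the reported gap suspicious.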
hm, how did it differ? |
I am using the latest version of k2. |
Yes, I think there could be some bugs. I will look into the code. |
I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using |
OK, I am debugging it. |
Any updates @pkufool? |
Sorry, I did not fix it that day and forgot about it; I will return to it. |
@svandiekendialpad @chiendb97 Does the difference only happen when using |
Hi @pkufool, I just ran tests again using librispeech conformer ctc; here is the result:
So I think there is still a significant difference between online and offline implementations regardless of using n-gram (though the gap is smaller). |
I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code. |
@binhtranmcs I think #1218 solves some problems, but there are still differences between the lattices generated by online and offline modes. Now I know it relates to the pruning; I am trying to fix it. |
@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass in the previous chunk. This is the only difference I found after reading the code carefully. |
Does the backward pass start with -(forward score) on all active states?
That's how it is supposed to work.
|
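To make the asymmetry @pkufool describes concrete, here is a schematic of the two orders of operations under discussion. `expand_arcs` and `backward_prune` are hypothetical stand-ins, not actual k2 functions:

```python
# Schematic of the pruning order discussed above; not real k2 code.

def offline_decode(num_frames: int) -> None:
    # Offline: the forward pass expands every frame first...
    for t in range(num_frames):
        expand_arcs(t)              # frames[t] is still unpruned here
    # ...and only then does the backward pass prune the whole lattice.
    backward_prune(0, num_frames)

def online_decode(chunk_sizes: list) -> None:
    # Online: each chunk is expanded AND backward-pruned before the next
    # chunk arrives, so when expansion resumes at a chunk boundary t,
    # frames[t] has already been pruned by the previous chunk's
    # backward pass.
    t = 0
    for size in chunk_sizes:
        for _ in range(size):
            expand_arcs(t)
            t += 1
        backward_prune(t - size, t)
```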
Hi @danpovey, as I want to understand the code, can you please point me to some references for the online/offline decoding algorithm implemented here? Since I am pretty new to this, it would really help a lot. Thanks in advance. |
I think it is described in my paper about exact lattices, or at least mentioned there. It is pruned Viterbi beam search with some extensions to store a lattice.
The guys have discovered the problem but IDK if they have made the fix public yet.
|
@binhtranmcs @svandiekendialpad @chiendb97 I think #1218 can fix this issue, you can try it on your dataset. |
@pkufool, I just tested again with librispeech conformer ctc, using
WER for online HLG decoding did decrease (from 18% down to 12%) but it is not as good as offline decoding (3.49%). I think there are still problems here. |
For me it went up from 33% to 45%, whereas 14% should be normal. Should I have used |
@binhtranmcs @svandiekendialpad OK, I just tested some bad cases, will test the full test datasets. |
Hi @pkufool, are there any updates on this??? |
I think #1218 may be relevant to this. Not merged yet but says it is ready. |
I did some experiments on librispeech test-clean; here are the results:
For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the tails of sentences.
I believe this is the issue of pruning at the boundary frames (as I mentioned above). When I set the
I need to revisit the implementation carefully to figure out the fixes for this issue; for now I think you can try using the same [edit:] BTW, I added the Python test code in #1218 |
@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using "output_beam". I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final-state is not correct. |
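A small sketch of the invariant Dan describes, with hypothetical names (schematic rather than k2's actual implementation): at an intermediate prune, the backward score of every active state on the frontier is seeded with the negated forward score, so every active state has total score 0 and pruning against `output_beam` is safe.

```python
# Schematic of intermediate pruning with backward scores seeded as
# backward[s] = -forward[s]; the names here are hypothetical, not k2's.

def prune_frontier_arcs(frontier_states, forward, arcs_into_frontier,
                        output_beam):
    # Seed the frontier (the last decoded frame's active states):
    # forward[s] + backward[s] == 0, so the best total score is exactly 0.
    backward = {s: -forward[s] for s in frontier_states}
    best_total = 0.0
    # An arc (u -> v, score) into the frontier survives only if the best
    # path through it stays within output_beam of the best total score.
    # (In the real implementation these backward scores would then be
    # propagated to earlier frames to prune them the same way.)
    return [(u, v, score) for (u, v, score) in arcs_into_frontier
            if forward[u] + score + backward[v] >= best_total - output_beam]
```

If the seed is missing or overwritten, as Dan suspects, paths near chunk boundaries would look artificially bad and get pruned away, which matches the deletions observed at the tails of sentences.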
@binhtranmcs @svandiekendialpad @chiendb97 I updated #1218; I think this time it should be able to fix your issue. |
@pkufool I'm trying to replicate your results; for now I still have a very high error rate due to deletions. I am therefore investigating whether my custom decoder implementation has a bug. However, could you send me a short code snippet showing how you set the decoding graph scores to 0.0? I just set |
I think you can simply do |
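If the elided suggestion is about zeroing the graph scores, the usual k2 idiom is a one-liner on the `k2.Fsa`; a sketch, assuming `HLG` is the decoding graph:

```python
import torch

# Zero out all arc scores on the decoding graph so that decoding is
# driven purely by the acoustic scores; HLG is assumed to be a k2.Fsa.
HLG.scores = torch.zeros_like(HLG.scores)
```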
For me |
@svandiekendialpad I did test the fixes on test-clean with the model librispeech conformer ctc, and I got 2.73% for online decoding (with online_decode.py). Can you try your test set with my script (i.e. online_decode.py in #1218)? Let me know if you run into any trouble, thanks! |
Hi @pkufool, thanks for your effort on resolving this issue. I've downloaded librispeech conformer ctc and the latest librispeech zipformer, trained with both ctc and rnnt losses. I decoded
Results before the fix (HLG decoding presented with different
Results after the fix:
So, online decoding works well now. I also went through the code with @svandiekendialpad and we sorted things out; everything works as expected. @pkufool can we consider merging #1218 to master, as this is a really important fix? I see you were asked in #1218 to add the |
@videodanchik Thanks very much for the testing! Yes, I will have a look at the failed CI tests and merge it.
Actually, I have not started this work yet; I will make a separate PR later. |
I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. Here is the result:
(The WER for online decoding is much larger than for offline decoding; both use the same AM output, and online decoding uses a chunk size of 16.)
Could you please tell me the difference between offline decoding and online decoding? In addition, could you tell us the expected results for the two kinds of decoding?
Thanks!
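For context, the offline path (hlg_decode.cu) amounts to a single pruned intersection over the whole utterance, followed by a shortest-path search. A minimal sketch with k2's offline Python API; the beam and active-state values are illustrative, and `nnet_output`/`HLG` are assumed inputs:

```python
import torch
import k2

# Assumed inputs: `nnet_output` of shape (1, T, C) log-probs for one
# utterance, and the decoding graph `HLG` (a k2.Fsa).
T = nnet_output.shape[1]
# Each supervision row is [fsa_index, start_frame, num_frames].
supervision = torch.tensor([[0, 0, T]], dtype=torch.int32)
dense = k2.DenseFsaVec(nnet_output, supervision)

lattice = k2.intersect_dense_pruned(
    HLG,
    dense,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
# The best path's aux_labels carry the output symbols (e.g. word IDs).
best_path = k2.shortest_path(lattice, use_double_scores=True)
```

The online path covers the same frames but intersects them chunk by chunk, carrying decode state across chunks (see the simulated-streaming sketch earlier in the thread), which is why any pruning asymmetry at chunk boundaries shows up as a WER gap.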