Streaming server outputs only phonemes even with LG #499

Open

kasidis-kanwat opened this issue Nov 7, 2023 · 10 comments

@kasidis-kanwat

I'm using a phone-based zipformer, but I cannot get the server to output graphemes even though I'm providing an LG graph to both the C++ API and the Python API.

Here is what I tried:

sherpa-online-websocket-server \
  --decoding-method=fast_beam_search \
  --nn-model=../model_v1/jit_script_chunk_64_left_128.pt \
  --lg=/workdir/Desktop/sherpa/model_v1/lang_phone2/LG.pt \
  --tokens=../model_v1/lang_phone2/tokens.txt \
  --port=5051 \
  --decode-chunk-size=32 \
  --decode-left-context=128 \
  --doc-root=./sherpa/bin/web \
  --ngram-lm-scale=0.3

python3 ./sherpa/bin/streaming_server.py \
  --port=5051 \
  --decoding-method=fast_beam_search \
  --LG=../model_v1/lang_phone2/LG.pt \
  --nn-model=../model_v1/jit_script_chunk_64_left_128.pt \
  --tokens=../model_v1/lang_phone2/tokens.txt \
  --ngram-lm-scale=0.3
@csukuangfj
Collaborator

Could you post what the above commands output?

@kasidis-kanwat
Author

There are actually no errors. Everything works fine except that the predicted text is in phonemes.

For example, here is the output from sherpa-online-websocket-server:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:39:54.760 sherpa-online-websocket-server --decoding-method=fast_beam_search --nn-model=../model_v1/jit_script_chunk_64_left_128.pt --lg=/workdir/Desktop/sherpa/model_v1/lang_phone2/LG.pt --tokens=../model_v1/lang_phone2/tokens.txt --port=5051 --decode-chunk-size=32 --decode-left-context=128 --doc-root=./sherpa/bin/web --ngram-lm-scale=0.3

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:498:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.314 WarmUp begins
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:521:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.388 WarmUp ended
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:81:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Listening on: 5051

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:83:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Number of work threads: 5

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:119:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560
Please access the HTTP server using the following address:

http://localhost:5051

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:272:void sherpa::OnlineWebsocketServer::OnOpen(connection_hdl) 2023-11-07 08:40:17.534 New connection: 127.0.0.1:47978. Number of active connections: 1.

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:279:void sherpa::OnlineWebsocketServer::OnClose(connection_hdl) 2023-11-07 08:40:17.952 Number of active connections: 0

Output from sherpa-online-websocket-client:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:40:17.531 sherpa-online-websocket-client --server-port=5051 processed.wav 

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:223:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Starting to send audio
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:261:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Sent Done Signal
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.866 {"final":false,"segment":0,"start_time":0.0,"text":"pqq1t^","timestamps":[0.9599999785423279,1.0,1.0799999237060547],"tokens":["p","qq1","t^"]}
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,"text":"pqq1t^fa0j^duua2j^","timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}
processed.wavpqq1t^fa0j^duua2j^
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:103:Client::Client(asio::io_context&, const string&, int16_t, const string&, float, int32_t, std::string)::<lambda(websocketpp::connection_hdl)> 2023-11-07 08:40:17.952 Disconnected
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:337:int32_t main(int32_t, char**) 2023-11-07 08:40:17.952 Done!

@csukuangfj
Collaborator

Can you check that your LG is correct?

Have you tested the LG and the pre-trained models we provide in the documentation?

@kasidis-kanwat
Author

kasidis-kanwat commented Nov 7, 2023

Can you check that your LG is correct?

I used this LG graph to replace trivial_graph in zipformer/decode.py and it correctly outputs graphemes. Is there another preferred way to verify its correctness?

fast_beam_search_nbest
00d0fb48e21f4b0086ac94eb7723f150-27530:	ref=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']
00d0fb48e21f4b0086ac94eb7723f150-27530:	hyp=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']

fast_beam_search_nbest_LG
00d0fb48e21f4b0086ac94eb7723f150-27530:	ref=['ภาพ', 'ใหญ่', 'คือ']
00d0fb48e21f4b0086ac94eb7723f150-27530:	hyp=['ภาพ', 'ใหญ่', 'คือ']
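
For reference, below is a minimal sketch (not an official verification tool; the file path is an assumption) of how one might load the LG graph with k2 and check that it carries word-level aux_labels, which is what the LG-based decoding paths use to recover words:

# Minimal sketch, assuming an icefall-style LG.pt: load it with k2 and
# confirm it has aux_labels (word IDs), which LG-based decoding relies on.
import torch
import k2

lg = k2.Fsa.from_dict(torch.load("model_v1/lang_phone2/LG.pt", map_location="cpu"))
print("num arcs:", lg.num_arcs)
print("has aux_labels:", hasattr(lg, "aux_labels"))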

Have you tested the LG and the pre-trained models we provide in the documentation?

I tested two models and they seem to be working.

icefall-asr-librispeech-streaming-zipformer-2023-05-17

./test_wavs/1221-135766-0002.wav                                                                                                                                                                            
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION                                                                                                                                 
{"final":true,"segment":0,"start_time":0.0,"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":[0.6399999856948853,0.7199999690055847,0.9599999785423279,1.03$9999618530273,1.1999999284744263,1.399999976158142,1.5999999046325684,1.6799999475479126,1.71999990940094,1.7999999523162842,1.8399999141693115,2.0399999618530273,2.119999885559082,2.2799999713897705,2.4$0000057220459,2.5199999809265137,2.6399998664855957,2.679999828338623,2.919999837875366,2.9600000381469727,3.240000009536743,3.4800000190734863,3.6399998664855957,3.879999876022339,4.159999847412109,4.27$999732971191,4.319999694824219,4.519999980926514,4.599999904632568,4.679999828338623,4.759999752044678],"tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N$,"NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}

icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming

./test_wavs/DEV_T0000000000.wav
对我介绍我想
{"final":true,"segment":0,"start_time":0.0,"text":"对我介绍我想","timestamps":[0.47999998927116394,0.5999999642372131,1.0399999618530273,1.159999966621399,2.2799999713897705,2.3999998569488525],"tokens":["对","我","介","绍","我","想"]}

./test_wavs/DEV_T0000000001.wav
重点三个问题首先表现
{"final":true,"segment":0,"start_time":0.0,"text":"重点三个问题首先表现","timestamps":[0.35999998450279236,0.4399999976158142,1.0799999237060547,1.2400000095367432,1.399999976158142,1.6399999856948853,2.319999933242798,2.4800000190734863,4.679999828338623,4.880000114440918],"tokens":["重","点","三","个","问","题","首","先","表","现"]}

./test_wavs/DEV_T0000000002.wav
分析这一次全球进动脑
{"final":true,"segment":0,"start_time":0.0,"text":"分析这一次全球进动脑","timestamps":[1.1200000047683716,1.399999976158142,1.7999999523162842,2.0,2.240000009536743,2.759999990463257,2.879999876022339,3.0799999237060547,3.2799999713897705,3.4800000190734863],"tokens":["分","析","这","一","次","全","球","进","动","脑"]}

@csukuangfj
Collaborator

For the following output:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void 
Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,
"text":"pqq1t^fa0j^duua2j^",
"timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],
"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}

what are your expected text and tokens?

@kasidis-kanwat
Author

I would expect the text to be the graphemes of "pqq1t^fa0j^duua2j^", i.e., "เปิด ไฟ ด้วย", since my lexicon looks something like this:

...
เปิด p qq1 t^
ไฟ f a0 j^
ด้วย d uua2 j^
...

As for the tokens, I'm not certain, but they should probably stay as phonemes, since the model was trained to predict phonemes.
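
To illustrate the mapping I have in mind, here is a hypothetical sketch (the lexicon dict and helper function are made up for illustration; a greedy lexicon lookup cannot resolve ambiguous phoneme sequences, so the real fix should rely on the LG graph instead):

# Hypothetical sketch: recover words from the decoded phoneme tokens using
# the lexicon above (greedy longest match). This is only an illustration;
# the proper fix uses the word IDs from the LG graph instead.
lexicon = {
    ("p", "qq1", "t^"): "เปิด",
    ("f", "a0", "j^"): "ไฟ",
    ("d", "uua2", "j^"): "ด้วย",
}

def tokens_to_words(tokens, lexicon, max_len=4):
    words, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            piece = tuple(tokens[i:i + n])
            if piece in lexicon:
                words.append(lexicon[piece])
                i += n
                break
        else:
            words.append(tokens[i])  # no lexicon entry matched; keep the raw phoneme
            i += 1
    return " ".join(words)

print(tokens_to_words(["p", "qq1", "t^", "f", "a0", "j^", "d", "uua2", "j^"], lexicon))
# -> เปิด ไฟ ด้วย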

@kasidis-kanwat
Author

@csukuangfj may I inquire about the status of this issue? Thank you.

@csukuangfj
Collaborator

I'm sorry for not getting back to you sooner.

I see the problem now.

During decoding, we save only the decoded tokens.

This is not a problem for BPE-based models, since we can get the correct words by simply concatenating all the BPE tokens and then replacing ▁ with a space.

For non-BPE models, we also need to save the word IDs (word_ids) during decoding and reconstruct the text from them.

We also need to pass words.txt so that we can map the word IDs to strings.
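
As a rough illustration (not the actual implementation), a minimal sketch of the words.txt lookup, assuming the standard one-"word id"-pair-per-line format; the path and the example word IDs are made up and would in practice come from the lattice's aux_labels:

# Minimal sketch, assuming a words.txt with one "<word> <id>" pair per line.
def load_words(words_txt):
    id2word = {}
    with open(words_txt, encoding="utf-8") as f:
        for line in f:
            word, idx = line.split()
            id2word[int(idx)] = word
    return id2word

def word_ids_to_text(word_ids, id2word):
    # Skip epsilon (ID 0) and any IDs not present in the table.
    return " ".join(id2word[i] for i in word_ids if i != 0 and i in id2word)

id2word = load_words("model_v1/lang_phone2/words.txt")
word_ids = [12, 7, 42]  # hypothetical output of LG-based decoding
print(word_ids_to_text(word_ids, id2word))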

@kasidis-kanwat
Author

Thank you for responding so quickly. I will attempt to implement the fix as soon as I have the time.

@kerolos

kerolos commented May 7, 2024

Hello @kasidis-kanwat @csukuangfj:
I would really appreciate it if you could answer my questions:

  1. Which Zipformer model works well with a phone-based lexicon (with results comparable to BPE)? Which model and recipe do you recommend, and are any changes to the model parameters needed? (The tiny model in egs/librispeech/ASR/tiny_transducer_ctc does not seem to achieve a good CER.)
  2. Can it be converted to an int8 ONNX model for sherpa?
  3. Is it possible to decode with an LM?
  4. Does it support contextual biasing (hotwords) for new words?
  5. Does it support multiple pronunciation variants per word?

Thanks in advance,
