-
Hi everybody, NeMo is giving me really great results for audio file transcription 👍 (Conformer CTC medium). I've seen the speech_to_text_buffered_infer_ctc.py example, the Online_ASR_Microphone_Demo.ipynb, and transcribe_speech.py, but I'm having trouble isolating the relevant parts since there is a lot going on. It seems there is no dedicated function as in other toolkits (Coqui or Vosk, for example); instead you have to wrap another class around the original 'ASRModel' (in my case 'EncDecCTCModel') and modify what is called the preprocessor. Maybe you can help me finish the following example code. Please note: I don't need any data evaluation, WER calculation, sample-rate conversion, etc.; I just want to feed in the audio and get a text result ^^:
I'd be happy about any additional info, hints, or code examples 🙂
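To make the idea concrete, here is a rough sketch of the direction I'm heading in. It is not finished NeMo code, just an illustration under assumptions: `transcribe()` is called with a list of file paths, so I write the raw buffer to a temporary 16 kHz mono WAV first. The helper names `buffer_to_wav` and `transcribe_buffer` are mine, not part of NeMo:

```python
import os
import struct
import tempfile
import wave

def buffer_to_wav(samples, path, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono PCM WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit PCM
        wf.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

def transcribe_buffer(model, samples, sample_rate=16000):
    """Round-trip a raw audio buffer through a temp WAV and the model's transcribe().

    `model` is assumed to be a loaded ASR model (e.g. EncDecCTCModel) whose
    transcribe() accepts a list of audio file paths and returns a list of texts.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        buffer_to_wav(samples, path, sample_rate)
        return model.transcribe([path])[0]
    finally:
        os.remove(path)
```

The temp-file detour obviously isn't real streaming, but it at least gives a single "buffer in, text out" call while staying on the supported file-based API.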
-
So, as you've already guessed, NeMo ASR models are complex under the hood. I will spend some time next week to see whether a minimal script based on yours works. Before that, take a look at https://huggingface.co/spaces/smajumdar/nemo_conformer_rnnt_large_streaming. There are no special tricks applied there: it's the most inefficient inference method, running on chunks, and it actually works fine. These models in NeMo are not true streaming models but offline models. We can make them work in streaming mode in multiple ways; the buffered inference script above is one of those ways. It's a more accurate form of the simple "predict the full chunk, every chunk, and concat the results" method I used in the demo above. We do provide a transcribe() method that takes in a file and outputs text, running the entire pipeline internally. The drawback is that we don't support raw audio stream input. Since there have recently been requests for this on both HF and from other users, we will look into defining a transcribe_buffer() or something similar that accepts tensors and outputs text, but that will take time. In the meantime, I will try a minimal script for transcribing a buffer based on your code and respond next week.
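To illustrate the "predict the full chunk, every chunk, and concat the results" method mentioned above, here is a toolkit-agnostic sketch. The chunking and joining logic is the whole trick; `transcribe_fn` is a stand-in (an assumption, not a NeMo API) for whatever actually runs the model on one chunk:

```python
def chunk_audio(samples, chunk_len):
    """Split a sample buffer into consecutive fixed-size chunks (last may be shorter)."""
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def naive_streaming_transcribe(samples, transcribe_fn, chunk_len):
    """Transcribe each chunk independently and concatenate the partial texts.

    This is the simple (and least accurate) streaming approximation: words cut
    at chunk boundaries can be mangled, which is why the buffered inference
    script keeps overlapping context between chunks instead.
    """
    parts = (transcribe_fn(chunk) for chunk in chunk_audio(samples, chunk_len))
    return " ".join(p for p in parts if p).strip()
```

Each per-chunk result can also be surfaced as a "partial" hypothesis as it arrives, with the concatenated string serving as the final result once the stream ends.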
-
Any updated code on getting partial and final results, @fquirin? Thanks