-
Hi everybody, NeMo is giving me really great results for audio file transcription 👍 (Conformer CTC medium). I've seen the speech_to_text_buffered_infer_ctc.py example, the Online_ASR_Microphone_Demo.ipynb, and transcribe_speech.py, but I'm having trouble isolating the relevant parts since there is a lot going on. It seems there is no dedicated function as in other toolkits (Coqui or Vosk, for example); instead you have to wrap another class around the original 'ASRModel' (in my case 'EncDecCTCModel') and modify what is called the preprocessor. Maybe you can help me finish the following example code. Please note: I don't need any data evaluation, WER calculation, sample-rate conversion, etc.; I just want to feed in the audio and get a text result ^^:
I'd be happy about any additional info, hints, or code examples 🙂
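To make the idea concrete, here is a rough sketch of the direction I'm heading in. It is not finished NeMo code, just an illustration under assumptions: `transcribe()` is called with a list of file paths, so I write the raw buffer to a temporary 16 kHz mono WAV first. The helper names `buffer_to_wav` and `transcribe_buffer` are mine, not part of NeMo:

```python
import os
import struct
import tempfile
import wave

def buffer_to_wav(samples, path, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono PCM WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit PCM
        wf.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

def transcribe_buffer(model, samples, sample_rate=16000):
    """Round-trip a raw audio buffer through a temp WAV and the model's transcribe().

    `model` is assumed to be a loaded ASR model (e.g. EncDecCTCModel) whose
    transcribe() accepts a list of audio file paths and returns a list of texts.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        buffer_to_wav(samples, path, sample_rate)
        return model.transcribe([path])[0]
    finally:
        os.remove(path)
```

The temp-file detour obviously isn't real streaming, but it at least gives a single "buffer in, text out" call while staying on the supported file-based API.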
-
So, as you've already guessed, NeMo ASR models are complex under the hood. I will spend some time next week to see whether a minimal script based on yours works. Before that, take a look at https://huggingface.co/spaces/smajumdar/nemo_conformer_rnnt_large_streaming. There are no special tricks applied there: it's the most inefficient inference method, running on chunks, and it actually works fine. These models in NeMo are not true streaming models but offline models. We can make them work in streaming mode in multiple ways; the buffered inference script above is one of those ways. It's a more accurate form of the simple "predict the full chunk, every chunk, and concat the results" method I used in the demo above. We do provide a transcribe() method that takes in a file and outputs text, running the entire pipeline internally. The drawback is that we don't support raw audio stream input. Since there have recently been requests for this on both HF and from other users, we will look into defining a transcribe_buffer() or something similar that accepts tensors and outputs text, but that will take time. In the meantime, I will try a minimal script for transcribing a buffer based on your code and respond next week.
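To illustrate the "predict the full chunk, every chunk, and concat the results" method mentioned above, here is a toolkit-agnostic sketch. The chunking and joining logic is the whole trick; `transcribe_fn` is a stand-in (an assumption, not a NeMo API) for whatever actually runs the model on one chunk:

```python
def chunk_audio(samples, chunk_len):
    """Split a sample buffer into consecutive fixed-size chunks (last may be shorter)."""
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def naive_streaming_transcribe(samples, transcribe_fn, chunk_len):
    """Transcribe each chunk independently and concatenate the partial texts.

    This is the simple (and least accurate) streaming approximation: words cut
    at chunk boundaries can be mangled, which is why the buffered inference
    script keeps overlapping context between chunks instead.
    """
    parts = (transcribe_fn(chunk) for chunk in chunk_audio(samples, chunk_len))
    return " ".join(p for p in parts if p).strip()
```

Each per-chunk result can also be surfaced as a "partial" hypothesis as it arrives, with the concatenated string serving as the final result once the stream ends.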
-
Any updated code on getting partial and final results, @fquirin? Thanks