Using only the realtime model. #134
-
The maintainer can definitely give you more accurate information, but let me ask you a couple of things first. How are you using the model? Which script are you using to transcribe? Did you write the script yourself, or are you using one of the scripts in the test folder?

From my understanding of how the project works, it's built only around live transcription. A very clear example of what I'm describing can be seen in the video below. You can see that as soon as the script starts processing audio, the sentences first appear in bright yellow. After a little bit, they change to either a deeper yellow or cyan. The bright yellow transcription is coming from the real-time model, which is just a preview, and the deep yellow or cyan transcription is coming from the main model.

If you want to make longer recordings where, instead of transcribing and printing as you're speaking, the script starts recording and then transcribes the whole thing once the recording ends, you can use the script.

Edit: For even longer recordings, it might be better to save the recording in a temp audio file and then feed that to whisper (or faster-whisper or distil-whisper or ...) directly. All these models work out of the box with static transcriptions, and the challenge is to make them work properly with live transcription.
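To illustrate that last point, here is a minimal sketch of static transcription with faster-whisper on a saved file. The model size, device settings, and file name are placeholder assumptions, not something taken from this project:

```python
# Minimal sketch: feed a finished recording to faster-whisper directly.
# "small", "cpu"/"int8", and "recording.wav" are illustrative choices.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

# "recording.wav" stands in for whatever temp file the session was saved to.
segments, info = model.transcribe("recording.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

Since the whole file is available up front, there is no live-transcription bookkeeping at all, which is why these models work out of the box here.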
-
I had an issue where the main model never transcribes. The text returned is always 'realtime', never 'fullSentence', but that is not important to my use case.
For my use case (long sessions without the need for really accurate results), I can work with the real-time model only.
Optionally, if I could limit the real-time model to processing only the last 2-3 minutes of audio, I could remove the lag in real-time transcription that starts to show up once a live session runs longer than 5 minutes (the transcription comes back in bigger phrases or whole sentences rather than the couple of words at a time you get at the start of the session). This happens because the real-time model transcribes from the beginning of the recording up to the most recent audio, so the window starts at 00:00.00 each time. At the moment, I cannot run tests or access previous logs.
Does anyone have any idea how I can modify the current implementation to support this use case, or will I have to fork faster-whisper and modify that?
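For what it's worth, here is a minimal sketch of that sliding-window idea, assuming mono float32 audio at 16 kHz. The names (SAMPLE_RATE, on_audio_chunk, current_window) are hypothetical and not part of this project or faster-whisper:

```python
# Hypothetical sketch: cap the audio handed to the realtime model at the
# last WINDOW_SECONDS of the session, so per-update transcription cost
# stays roughly constant instead of growing with session length.
import collections
import numpy as np

SAMPLE_RATE = 16000      # assumed sample rate of the incoming audio
WINDOW_SECONDS = 180     # keep roughly the last 3 minutes

# A deque with maxlen drops the oldest samples automatically once full.
buffer = collections.deque(maxlen=SAMPLE_RATE * WINDOW_SECONDS)

def on_audio_chunk(chunk: np.ndarray) -> None:
    """Append a new chunk of mono float32 samples to the rolling buffer."""
    buffer.extend(chunk)

def current_window() -> np.ndarray:
    """Return only the most recent audio to pass to the realtime model."""
    return np.asarray(buffer, dtype=np.float32)
```

In practice this would mean trimming whatever frame list the recorder accumulates before it reaches the realtime transcription loop, rather than forking faster-whisper itself, since faster-whisper only sees the audio you pass it.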