go-whisper golang bindings #1

Open
6 tasks done
djthorpe opened this issue Dec 1, 2022 · 11 comments
@djthorpe
Member

djthorpe commented Dec 1, 2022

Create bindings for https://github.com/ggerganov/whisper.cpp

  • Simple golang bindings with tests
  • Some examples (main, sample) based off of these
  • Integrate with ffmpeg for audio conversion
  • Some sort of real-time translation
  • gRPC and/or websocket API
  • Docker image of a speech-to-text service
@djthorpe djthorpe self-assigned this Dec 1, 2022
@djthorpe
Member Author

Made PR:
ggerganov/whisper.cpp#269

@chrisbward

Great work!

Keen on realtime translation and a way of calling out/streaming the output to another app - gRPC seems the best option for this

@djthorpe
Member Author

Yeah thanks.

I'm doing the audio downsampling to 16 kHz at the moment in a different repository (go-media).
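
For readers who want to see what the downsampling step involves, here is a minimal sketch in plain Go. It is not how go-media does it (that code is not shown in this thread); `Resample` is a hypothetical helper that uses naive linear interpolation with no low-pass filter, so it will alias on real audio and is only meant to illustrate the rate conversion.

```go
package main

import "fmt"

// Resample converts mono samples from inRate to outRate using linear
// interpolation. Hypothetical helper for illustration only: there is no
// anti-aliasing filter, so a proper resampler should be used in practice.
func Resample(in []float32, inRate, outRate int) []float32 {
	if len(in) == 0 || inRate <= 0 || outRate <= 0 {
		return nil
	}
	n := int(int64(len(in)) * int64(outRate) / int64(inRate))
	out := make([]float32, n)
	for i := range out {
		// Position of output sample i on the input time axis.
		pos := float64(i) * float64(inRate) / float64(outRate)
		j := int(pos)
		frac := float32(pos - float64(j))
		if j+1 < len(in) {
			out[i] = in[j]*(1-frac) + in[j+1]*frac
		} else {
			out[i] = in[len(in)-1]
		}
	}
	return out
}

func main() {
	in := []float32{0, 1, 2, 3, 4, 5}
	fmt.Println(Resample(in, 48000, 16000)) // [0 3]
}
```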

The realtime transcription and translation should be pretty straightforward, but pretty experimental, even for whisper.cpp

It will take me a while to get to the gRPC microservice :-(

@djthorpe
Member Author

djthorpe commented Jan 6, 2023

Added a "stream" command for the start of real-time streaming, but:

  • Thread safety: Needs some work to ensure the same model can be used in the process method across threads/goroutines
  • Ring buffer: Implement a ring buffer for continuous audio samples
  • Overlaps: Need some word overlaps to ensure we don't lose words between sample windows
  • Silence: Don't process audio when silence is fed in. Ideally chunk windows when there is a largish (>1s) silence
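
The ring buffer and overlap items above could be sketched roughly as follows. This is only an illustration under assumed names (`Ring`, `Write`, `Window` are hypothetical and not taken from the go-whisper code): a fixed-capacity buffer that overwrites the oldest samples, and a window read that re-presents the last `overlap` samples to the next window so words on a boundary are not lost. It is not goroutine-safe on its own; a mutex would be needed around it.

```go
package main

import "fmt"

// Ring is a fixed-capacity buffer of audio samples. When it is full, new
// samples overwrite the oldest ones. It is not safe for concurrent use.
type Ring struct {
	buf   []float32
	start int // index of the oldest sample
	n     int // number of valid samples
}

// NewRing returns a ring buffer holding at most capacity samples.
func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]float32, capacity)}
}

// Write appends samples, dropping the oldest ones on overflow.
func (r *Ring) Write(samples []float32) {
	if len(r.buf) == 0 {
		return
	}
	for _, s := range samples {
		r.buf[(r.start+r.n)%len(r.buf)] = s
		if r.n < len(r.buf) {
			r.n++
		} else {
			r.start = (r.start + 1) % len(r.buf)
		}
	}
}

// Len returns the number of buffered samples.
func (r *Ring) Len() int { return r.n }

// Window copies out the oldest size samples, then advances by size-overlap
// so that the last overlap samples are returned again by the next call.
// It returns nil if fewer than size samples are buffered.
func (r *Ring) Window(size, overlap int) []float32 {
	if size <= 0 || overlap < 0 || overlap >= size || r.n < size {
		return nil
	}
	out := make([]float32, size)
	for i := range out {
		out[i] = r.buf[(r.start+i)%len(r.buf)]
	}
	step := size - overlap
	r.start = (r.start + step) % len(r.buf)
	r.n -= step
	return out
}

func main() {
	r := NewRing(8)
	r.Write([]float32{1, 2, 3, 4, 5, 6})
	fmt.Println(r.Window(4, 2)) // [1 2 3 4]
	fmt.Println(r.Window(4, 2)) // [3 4 5 6]
	fmt.Println(r.Window(4, 2)) // [] (not enough samples yet)
}
```

In a streaming command, the capture goroutine would call `Write` and the transcription loop would call `Window` until it returns nil.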

There are also some issues with the segmenting in the main package (repeated segments come out!) which need fixing.

@djthorpe
Member Author

djthorpe commented Jul 30, 2024

Coming back to this after some time!

Remaining tasks:

  • Streaming output of segments from the server
  • Streaming input of audio from the client and an example of this working
  • Add initial token prompts for segments
  • Have ffmpeg logging go through the whisper logging (single source for all log messages)
  • Not sure MaxConcurrent tasks is really working - need to check
  • Fix Dockerfiles so they work
  • Add Diarization
  • SRT/VTT and text output as well as JSON
  • Add tokens into the verbose_json format
  • Fix the README so it reflects reality
  • Investigate the segmenter having bad output (seems to work, but the language detection makes it look off)
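
For the SRT/VTT output item, the cue formatting itself is small. The sketch below uses hypothetical helper names (`timestamp`, `SRTCue`, `VTTCue`) that are not from the go-whisper code; it relies only on the two formats' timestamp conventions: SRT uses a comma before the milliseconds and numbers each cue, while WebVTT uses a period and needs a `WEBVTT` header line at the top of the file.

```go
package main

import (
	"fmt"
	"time"
)

// timestamp formats d as HH:MM:SS<sep>mmm. Negative values are clamped to 0.
func timestamp(d time.Duration, sep string) string {
	if d < 0 {
		d = 0
	}
	ms := d.Milliseconds()
	return fmt.Sprintf("%02d:%02d:%02d%s%03d",
		ms/3600000, ms/60000%60, ms/1000%60, sep, ms%1000)
}

// SRTCue renders one numbered SubRip cue, terminated by a blank line.
func SRTCue(index int, start, end time.Duration, text string) string {
	return fmt.Sprintf("%d\n%s --> %s\n%s\n\n",
		index, timestamp(start, ","), timestamp(end, ","), text)
}

// VTTCue renders one WebVTT cue, terminated by a blank line. The caller is
// expected to write "WEBVTT\n\n" once before the first cue.
func VTTCue(start, end time.Duration, text string) string {
	return fmt.Sprintf("%s --> %s\n%s\n\n",
		timestamp(start, "."), timestamp(end, "."), text)
}

func main() {
	fmt.Print(SRTCue(1, 0, 2500*time.Millisecond, "Hello world"))
	fmt.Print("WEBVTT\n\n")
	fmt.Print(VTTCue(0, 2500*time.Millisecond, "Hello world"))
}
```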

Lower priority:

  • Integrate VAD (voice activity detection) on segments, probably using https://github.com/baabaaox/go-webrtcvad, which seems to work
  • Smarter segmentation so that it only happens on silence boundaries, not in the middle of words
  • Fix model IDs so they are unique, and so that models can be stored at some sub-path
  • Integrate the bindings back into whisper.cpp repository
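
As a rough illustration of segmenting on silence boundaries, the sketch below finds split points using a plain RMS energy threshold. This is an assumption-laden stand-in, not the go-webrtcvad API and not the go-whisper segmenter: `SplitPoints`, the frame length, and the threshold are all hypothetical, and a real VAD would be more robust against background noise.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the root-mean-square amplitude of samples.
func rms(samples []float32) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// SplitPoints scans samples in frames of frameLen and returns the sample
// index at the middle of every run of silent frames that is at least
// minSilence samples long. A silent run still open at the end of the input
// is not reported, because more audio may extend it.
func SplitPoints(samples []float32, frameLen, minSilence int, threshold float64) []int {
	if frameLen <= 0 {
		return nil
	}
	var points []int
	runStart := -1
	for off := 0; off+frameLen <= len(samples); off += frameLen {
		if rms(samples[off:off+frameLen]) < threshold {
			if runStart < 0 {
				runStart = off
			}
			continue
		}
		if runStart >= 0 && off-runStart >= minSilence {
			points = append(points, runStart+(off-runStart)/2)
		}
		runStart = -1
	}
	return points
}

func main() {
	const rate = 16000
	var samples []float32
	tone := func(n int) {
		for i := 0; i < n; i++ {
			samples = append(samples, float32(0.5*math.Sin(2*math.Pi*440*float64(i)/rate)))
		}
	}
	tone(rate)                                          // 1 s of tone
	samples = append(samples, make([]float32, rate*3/2)...) // 1.5 s of silence
	tone(rate)                                          // 1 s of tone
	// 10 ms frames, split only on silences of at least 1 s.
	fmt.Println(SplitPoints(samples, 160, rate, 0.01))
}
```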

@djthorpe
Member Author

djthorpe commented Jul 31, 2024

Also:

  • Fix client so that it works with text/event-stream for both downloading models and transcription
  • Logging is generally not working. Fix it so that all messages apart from errors are suppressed unless in debug mode. Also log requests
  • Add some metrics in there somewhere
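
For the text/event-stream item, a client has to split the response body into events. The following is a minimal sketch of that parsing using only the standard library; `Event`, `ReadEvents`, and `ParseEvents` are hypothetical names, and the `segment` event name in the example is an assumption, since the server's actual event names are not given in this thread. It handles `event:` and `data:` fields and comment lines, and ignores `id:` and `retry:`.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Event is one server-sent event. Name is empty when the stream did not
// send an "event:" field (the default event type is then "message").
type Event struct {
	Name string
	Data string
}

// ReadEvents parses a text/event-stream body from r and calls fn for each
// event that carries data. Note that bufio.Scanner rejects lines longer
// than 64 KiB unless a larger buffer is configured.
func ReadEvents(r io.Reader, fn func(Event)) error {
	var ev Event
	var data []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			// A blank line terminates the current event.
			if len(data) > 0 {
				ev.Data = strings.Join(data, "\n")
				fn(ev)
			}
			ev, data = Event{}, nil
		case strings.HasPrefix(line, ":"):
			// Comment line, often used as a keep-alive.
		default:
			field, value, _ := strings.Cut(line, ":")
			value = strings.TrimPrefix(value, " ")
			switch field {
			case "event":
				ev.Name = value
			case "data":
				data = append(data, value)
			}
		}
	}
	return sc.Err()
}

// ParseEvents is a convenience wrapper that parses a complete string.
func ParseEvents(s string) ([]Event, error) {
	var out []Event
	err := ReadEvents(strings.NewReader(s), func(e Event) { out = append(out, e) })
	return out, err
}

func main() {
	evs, err := ParseEvents(": keep-alive\nevent: segment\ndata: {\"text\":\"hello\"}\n\n")
	fmt.Println(evs, err)
}
```

With a real HTTP response, `ReadEvents(resp.Body, ...)` would be called after checking that the `Content-Type` header is `text/event-stream`.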

@djthorpe
Member Author

djthorpe commented Jul 31, 2024

Also:

  • Fix resampling of raw audio in the go-media code, so we can again ingest WAV files without "input changed" errors

@djthorpe
Member Author

djthorpe commented Aug 8, 2024

Simplified the Dockerfile, which now uses the images from here as a base:

https://github.com/mutablelogic/docker-llamacpp

This is still not working; I now need to have the ffmpeg shared libraries included in the runtime image. I'm considering whether to just copy over the libraries from the build image, or to install the ffmpeg libraries from source.

@paradoxe35

Hey @djthorpe,

Thank you for your excellent work on ggerganov/whisper.cpp#269. However, it seems the binding you developed isn't compatible with the latest version of whisper.cpp.

Do you have plans to update it soon, or is there an updated version available somewhere?

@djthorpe
Member Author

Hi @paradoxe35 how are you?

Actually this repository contains updated bindings and I would like to merge them into whisper.cpp at some point. You can find them under github.com/mutablelogic/go-whisper/sys/whisper but I haven't tested them recently. Let me know if that's useful to you.

@paradoxe35

Thank you, @djthorpe, for your feedback. Unfortunately, I couldn't run github.com/mutablelogic/go-whisper/sys/whisper. To simplify things, I've decided to use the previous version of whisper.cpp that is compatible with the existing binding.
