A simple, zero-dependency Python wrapper for whisper.cpp, providing an easy-to-use interface for speech recognition with the Whisper model.
Why it's better than faster-whisper and pywhispercpp:
- Zero-dependency: everything ships inside the built wheel, with no Python dependencies (on `av`, `ctranslate2`, etc.) except for `numpy`
- Dead simple API: call `.transcribe()` and get a result
- Acceleration enabled: supports whatever whisper.cpp supports
- Updated: using precompiled whisper.cpp from https://github.com/locaal-ai/occ-ai-dep-whispercpp
- Build time: builds in 2 minutes because it's using a precompiled binary
To install simpler-whisper, you need:
- A C++ compiler (e.g., GCC, Clang, or MSVC)
- CMake (version 3.12 or higher)
- NumPy
Then you can install using pip:
pip install simpler-whisper
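To confirm the install worked, you can try importing the module (a quick sanity check only; no model file is needed at this point):
# Importing the compiled extension; an ImportError here means the build or install failed
from simpler_whisper.whisper import WhisperModel
print(WhisperModel)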
There are three ways to use simpler-whisper:
from simpler_whisper.whisper import WhisperModel
# Load the model (models can be downloaded from https://huggingface.co/ggerganov/whisper.cpp)
model = WhisperModel("path/to/model.bin", use_gpu=True)
# Load and prepare your audio
# You can use av, librosa, or any method that gives you 16kHz mono float32 samples
import av
import numpy as np
container = av.open("audio.mp3")
audio_stream = container.streams.audio[0]
samples = np.concatenate([
    frame.to_ndarray().mean(axis=0) if frame.layout.nb_channels == 2 else frame.to_ndarray().reshape(-1)
    for frame in container.decode(audio_stream)
]).astype(np.float32)
# Transcribe
transcription = model.transcribe(samples)
for segment in transcription:
    print(f"{segment.text} ({segment.t0:.2f}s - {segment.t1:.2f}s)")
The async model creates a thread in the native backend (not blocked by the GIL), allowing transcription to run asynchronously.
from typing import List

from simpler_whisper.whisper import AsyncWhisperModel, WhisperSegment

def handle_result(chunk_id: int, segments: List[WhisperSegment], is_partial: bool):
    text = " ".join([seg.text for seg in segments])
    print(f"Chunk {chunk_id}: {text}")
# Create and start async model
model = AsyncWhisperModel("path/to/model.bin", callback=handle_result, use_gpu=True)
model.start()
# Queue audio chunks for processing
chunk_id = model.transcribe(audio_samples)
# When done
model.stop()
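For example, a long recording could be fed in fixed-size pieces (a sketch only; the 10-second split, the placeholder audio, and the fixed sleep are illustrative, not part of the API):
import time
import numpy as np

SAMPLE_RATE = 16000
audio = np.zeros(60 * SAMPLE_RATE, dtype=np.float32)  # placeholder for real 16 kHz mono audio

# Queue the recording in 10-second pieces; handle_result fires on the background thread
chunk_ids = [model.transcribe(audio[i:i + 10 * SAMPLE_RATE])
             for i in range(0, len(audio), 10 * SAMPLE_RATE)]

time.sleep(30)  # crude wait for the callbacks; a real app would track the returned chunk ids
model.stop()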
The threaded model creates a background thread for real-time transcription that continuously processes incoming audio in fixed-size chunks (e.g., 10 seconds) and reports both partial and final results.
from typing import List

from simpler_whisper.whisper import ThreadedWhisperModel, WhisperSegment

def handle_result(chunk_id: int, segments: List[WhisperSegment], is_partial: bool):
    text = " ".join([seg.text for seg in segments])
    print(f"Chunk {chunk_id}: {text}")
# Create and start threaded model with 10-second chunks
model = ThreadedWhisperModel(
    "path/to/model.bin",
    callback=handle_result,
    use_gpu=True,
    max_duration_sec=10.0
)
model.start()
# Queue audio chunks as they arrive
chunk_id = model.queue_audio(audio_samples)
# When done
model.stop()
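As an illustration, a microphone-style feed can be simulated by queueing small blocks as they "arrive" (a sketch; the one-second blocks and the sleep stand in for real-time capture and are not part of the API):
import time
import numpy as np

SAMPLE_RATE = 16000
recording = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)  # placeholder for captured 16 kHz mono audio

model.start()
for i in range(0, len(recording), SAMPLE_RATE):
    model.queue_audio(recording[i:i + SAMPLE_RATE])  # push ~1 second of new samples
    time.sleep(1.0)  # pretend we are capturing in real time
model.stop()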
- On Windows, the package uses a DLL (whisper.dll), which is bundled with it.
- On Mac and Linux, the package uses static libraries that are linked into the extension.
If you're building from source:
- Clone the repository:
git clone https://github.com/locaal-ai/simpler-whisper.git
cd simpler-whisper
- Install the package in editable mode:
pip install -e .
This will run the CMake build process and compile the extension.
Simpler Whisper supports various build configurations to optimize for different hardware and acceleration methods. You can specify the build configuration using environment variables:
- `SIMPLER_WHISPER_ACCELERATION`: Specifies the acceleration method. Options are:
  - `cpu` (default)
  - `cuda` (for NVIDIA GPUs)
  - `hipblas` (for AMD GPUs)
  - `vulkan` (for cross-platform GPU acceleration)
$env:SIMPLER_WHISPER_ACCELERATION="cuda"
pip install .
pip install .
This project is licensed under the MIT License - see the LICENSE file for details.