This project is a CLI for multi-speaker audio transcription using OpenAI Whisper (text transcription), Pyannote-Audio (speaker diarization) and Spleeter (voice extraction). It can be used to extract audio segments for each speaker and to create transcriptions in various formats (txt, srt, sami, dfx, transc).
It's compatible with Windows, Linux and Mac.
Install system dependencies
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Install Python dependencies
pip3 install tqdm setuptools-rust pycaption simpleaudio simple-term-menu colour plotly mutagen pydub spleeter pyannote.audio git+https://github.com/openai/whisper.git
This project requires a fixed folder structure for your data. Your input data in raw_audio/ or raw_audio_voices/ may be structured in subfolders.
data/
    raw_audio/            Your original audio data (any format)
    raw_audio_voices/     Preprocessed audio data (.wav only)
    diarization/          Output folder of --audio-to-voices and --set-speakers
    text/                 Output folder of --audio-to-text
    voice_splits/         Output folder of --text-to-splits
    output/               Output folder for various results
        slices/           Audio slices ordered by speaker (--slice)
        analysis/         Analysis output (--viewer)
        transcripts/      Transcripts output (--transcribe)
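If these folders do not exist yet, you can create the layout up front. A minimal sketch using Python's pathlib; the data root "data" is an assumption, adjust it or point --data-path at your own root:
from pathlib import Path

# Assumed data root; adjust it or pass your own root via --data-path later.
root = Path("data")
subfolders = [
    "raw_audio", "raw_audio_voices", "diarization", "text", "voice_splits",
    "output/slices", "output/analysis", "output/transcripts",
]
for sub in subfolders:
    (root / sub).mkdir(parents=True, exist_ok=True)  # creates missing parents, skips existing folders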
Follow the setup instructions from Spleeter.
Run the voice extraction process to filter out background audio from the audio files located in raw_audio/.
python -m transcripy --audio-extract-voice
Optional arguments:
--model [spleeter:2stems, spleeter:4stems, spleeter:5stems] \\ Select the Spleeter model (2, 4 or 5 stems)
--data-path [path] \\ Root directory of data (without raw_audio/)
--extract-all \\ Extract all voices
- Use RipX to extract voices from the audio files in data/raw_audio/. Place the results in data/raw_audio_voices/.
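For orientation, the extraction step is based on Spleeter. The following is a minimal standalone sketch of roughly what it does, using Spleeter's Python API; the input file and output folder are placeholder assumptions, not the project's own code:
from spleeter.separator import Separator

# Assumed paths for illustration only.
separator = Separator("spleeter:2stems")   # 2stems separates vocals from accompaniment
separator.separate_to_file(
    "data/raw_audio/example.mp3",          # any format ffmpeg can read
    "data/raw_audio_voices/",              # results land in a subfolder named after the input file
)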
Follow the setup instructions from OpenAI Whisper.
Run the transcription of the audio files (.wav only!) located in raw_audio_voices/ with
python -m transcripy --audio-to-text
Optional arguments:
--model [tiny,base,small,medium,large] \\ Select the Whisper model
--language [lang] \\ Force the transcription language instead of auto-detecting it
--data-path [path] \\ Root directory of data (without raw_audio_voices/)
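For orientation, the transcription relies on the openai-whisper package. A minimal standalone sketch of a comparable call; model size, language and file path are assumptions:
import whisper

# Assumed model size, language and file path for illustration only.
model = whisper.load_model("base")
result = model.transcribe("data/raw_audio_voices/example.wav", language="en")
print(result["text"])                                              # full transcript
for seg in result["segments"]:
    print(f'{seg["start"]:.1f}s-{seg["end"]:.1f}s {seg["text"]}')  # timed segments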
Follow the setup instructions from Pyannote-Audio.
Run the diarization process to detect multiple speakers in the audio files located in raw_audio_voices/.
python -m transcripy --audio-to-voices
Optional arguments:
--model [pyannote/speaker-diarization, pyannote/segmentation, pyannote/speaker-segmentation, pyannote/overlapped-speech-detection, pyannote/voice-activity-detection] \\ Select the pyannote model
--data-path [path] \\ Root directory of data (without raw_audio_voices/)
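Under the hood this step builds on pyannote.audio. A minimal sketch of what a diarization pipeline produces; the file path is an assumption, and recent pyannote models may additionally require a Hugging Face access token:
from pyannote.audio import Pipeline

# Assumed file path for illustration only.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("data/raw_audio_voices/example.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")  # who speaks when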
To rename the speakers of the audio files, run
python -m transcripy --set-speakers
Important: Make sure that you have completed steps 2 and 3.
Create the data you need.
Create transcriptions in various formats with
python -m transcripy --transcribe
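The listed output formats map onto caption libraries such as pycaption (one of the installed dependencies). Purely as an illustration, and not the project's own code, a minimal sketch converting an SRT transcript to SAMI with assumed file names:
from pycaption import SRTReader, SAMIWriter

# Assumed file names for illustration only.
with open("data/output/transcripts/example.srt", encoding="utf-8") as f:
    captions = SRTReader().read(f.read())
with open("data/output/transcripts/example.sami", "w", encoding="utf-8") as f:
    f.write(SAMIWriter().write(captions))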
Create HTML files for visualization of the results with
python -m transcripy --viewer
Slice the audio files into separate slices per speaker with
python -m transcripy --slice
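As a rough idea of what slicing by time range involves (not the project's own code), a minimal sketch with pydub; the file path and segment boundaries are assumptions:
from pydub import AudioSegment

# Assumed file path and speaker-turn boundaries (milliseconds) for illustration only.
audio = AudioSegment.from_wav("data/raw_audio_voices/example.wav")
turn = audio[12_000:15_500]                            # 12.0 s to 15.5 s
turn.export("data/output/slices/SPEAKER_00.wav", format="wav")  # target folder must already exist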
- Download the executable for Voice-Cloning-App
- Start it
- Download a model for your language
- Create a dataset for one speaker with
python -m create-dataset <SPEAKER>
- Load the dataset into Voice-Cloning-App
Follow the setup instructions from Real-Time Voice Cloning, then run
python -m voice-synthesis
See this Jupyter notebook for a different implementation.