This repository provides a FastAPI-based API for audio transcription, alignment, and speaker diarization using WhisperX.
👉 Try the live demo on Hugging Face Spaces
Project structure:

```
fastapi/
├── app/
│   └── main.py          # FastAPI app with /transcribe endpoint
├── test_audios/         # Example audio files for testing
│   ├── BernhardtCrescent.wav
│   ├── BlackStone_en_in.mp4
│   ├── BlackStone_en_in.wav
│   ├── fillicafe.wav
│   └── harvard.wav
├── requirements.txt     # Python dependencies
└── dockerfile           # Docker setup
```
Clone the repository:

```shell
git clone <your-repo-url>
cd fastapi
```

Create and activate a virtual environment:

```shell
python3 -m venv .venv
source .venv/bin/activate
```

Or use conda instead:

```shell
conda create --name whisperx_api python=3.10
conda activate whisperx_api
```

Then install the Python dependencies:

```shell
pip install --upgrade pip
pip install -r requirements.txt
```
- ffmpeg is required for audio processing. On Ubuntu/Debian:

```shell
sudo apt-get update && sudo apt-get install -y ffmpeg git
```
Run the development server:

```shell
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
Alternatively, build and run the container with Docker:

```shell
docker build -t curify_fastapi .
docker run -p 8000:8000 curify_fastapi
```
Upload an audio file from `test_audios/` for transcription:

```shell
curl -X POST "http://localhost:8000/transcribe" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@test_audios/harvard.wav"
```
Visit http://localhost:8000/docs for interactive API documentation.
- `POST /transcribe`: Upload an audio file and receive a transcript with speaker labels and timestamps.
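The exact response schema is defined in `main.py`, but a diarized WhisperX-style result is typically a list of segments, each carrying text, start/end timestamps, and a speaker label. As a hedged sketch, here is how such segments could be grouped per speaker; the field names (`start`, `end`, `text`, `speaker`) are assumptions, not taken from this repository:

```python
# Group transcript segments by speaker label.
# The segment fields ("start", "end", "text", "speaker") mirror typical
# diarized WhisperX output, but are assumptions for illustration.
from collections import defaultdict

def lines_by_speaker(segments):
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"].strip())
    return {speaker: " ".join(texts) for speaker, texts in grouped.items()}

sample = [
    {"start": 0.0, "end": 2.1, "text": "Hello there.", "speaker": "SPEAKER_00"},
    {"start": 2.3, "end": 4.0, "text": "Hi, how are you?", "speaker": "SPEAKER_01"},
    {"start": 4.2, "end": 5.5, "text": "Doing well.", "speaker": "SPEAKER_00"},
]
print(lines_by_speaker(sample))
# {'SPEAKER_00': 'Hello there. Doing well.', 'SPEAKER_01': 'Hi, how are you?'}
```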
Note that ffmpeg is a system binary, not a pip package; install it with your package manager (see above). To convert an MP4 from `test_audios/` to a 16 kHz mono WAV:

```shell
ffmpeg -i BlackStone_en_in.mp4 -ar 16000 -ac 1 BlackStone_en_in.wav
```
- Temporary files are automatically deleted after each request.
- The WhisperX model is loaded once at startup for efficiency.
- Diarization uses a Hugging Face token (edit it in `main.py` if needed).
- For best results, use clear audio files (see `test_audios/` for examples).
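The load-once behavior can be sketched with a cached loader; `get_model` below is a stand-in for the actual WhisperX loading call in `main.py`, not the repository's code:

```python
# Sketch of the load-once pattern: the first call performs the (expensive)
# model load, every later call reuses the cached instance.
# The body of get_model stands in for a whisperx.load_model(...) call.
from functools import lru_cache

calls = {"loads": 0}

@lru_cache(maxsize=1)
def get_model():
    calls["loads"] += 1              # would be the expensive model load
    return {"model": "whisperx-stub"}

first = get_model()
second = get_model()                 # served from the cache
assert first is second               # same object reused across requests
assert calls["loads"] == 1           # loaded exactly once
```

Each request handler then calls `get_model()` instead of reloading the model, which keeps per-request latency down.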