Made in Vancouver, Canada by Picovoice
This repo is a minimalist and extensible framework for benchmarking different speaker diarization engines.
VoxConverse is a well-known dataset in the speaker diarization field, showcasing speakers conversing in multiple languages. In this benchmark, we utilize cloud-based Speech-to-Text engines equipped with speaker diarization capabilities. Hence, for benchmarking purposes, we specifically employ the English subset of the dataset's test section.
- Clone the VoxConverse repository. This repository contains only the labels, in the form of `.rttm` files.
- Download the test set from the links provided in the `README.md` file of the cloned repository and extract the downloaded files.
The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. DER is calculated by summing the durations of three distinct error types: speaker confusion, false alarm, and missed detection. This total is then divided by the total duration of the reference (ground-truth) speech.
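As a toy illustration of this formula, the sketch below computes DER from hypothetical error durations; the numbers are made up for the example and are not results from this benchmark.

```python
# Toy DER computation (hypothetical numbers, not produced by this benchmark).
missed_detection_sec = 12.0         # reference speech the engine did not attribute to anyone
false_alarm_sec = 5.0               # non-speech the engine labeled as speech
speaker_confusion_sec = 8.0         # speech attributed to the wrong speaker
total_reference_speech_sec = 600.0  # total duration of reference (ground-truth) speech

der = (missed_detection_sec + false_alarm_sec + speaker_confusion_sec) / total_reference_speech_sec
print(f"DER: {der:.1%}")  # -> DER: 4.2%
```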
The Jaccard Error Rate (JER) is a more recent metric for evaluating speaker diarization, introduced for the second DIHARD challenge (DIHARD II). It is based on the Jaccard similarity index, which measures the similarity between two sets of segments. In short, JER assigns equal weight to each speaker's contribution, regardless of their speech duration. For a more in-depth description, refer to the DIHARD II paper.
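The snippet below is a simplified illustration of the idea only, assuming speaker activity has already been discretized into frames and that the optimal reference-to-system speaker mapping has already been chosen; it is not the evaluation code used by this benchmark.

```python
# Simplified JER illustration: each speaker's activity is a set of frame indices.
# Assumes reference speakers are already mapped to their best-matching system speakers.
reference = {"spk_a": {0, 1, 2, 3, 4, 5}, "spk_b": {6, 7}}
system = {"spk_a": {0, 1, 2, 3}, "spk_b": {5, 6, 7}}

per_speaker_errors = []
for spk, ref_frames in reference.items():
    sys_frames = system.get(spk, set())
    union = ref_frames | sys_frames
    intersection = ref_frames & sys_frames
    # Jaccard error for this speaker pair: 1 - |intersection| / |union|
    per_speaker_errors.append(1.0 - len(intersection) / len(union))

# JER weights every speaker equally, regardless of how much they spoke.
jer = sum(per_speaker_errors) / len(per_speaker_errors)
print(f"JER: {jer:.1%}")  # -> JER: 33.3%
```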
This metric provides insight into the memory consumption of the diarization engine during its processing of audio files. It presents the total memory utilized, measured in gigabytes (GB).
The Core-Hour metric is used to evaluate the computational efficiency of the diarization engine, indicating the number of hours required to process one hour of audio on a single CPU core.
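For example, an engine that needs 2 hours of wall-clock time on 2 CPU cores to process 1 hour of audio uses 4 Core-Hours per hour of audio. The snippet below is just that arithmetic with illustrative, not measured, numbers, and assumes the work scales linearly across the cores used.

```python
# Illustrative Core-Hour arithmetic (hypothetical numbers, not measured results).
audio_hours = 1.0       # duration of the processed audio
wall_clock_hours = 2.0  # time the engine took to process it
cpu_cores_used = 2      # CPU cores the engine was allowed to use

core_hour = (wall_clock_hours * cpu_cores_used) / audio_hours
print(core_hour)  # -> 4.0 Core-Hours per hour of audio
```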
Note: the Total Memory Usage and Core-Hour metrics are not applicable to cloud-based engines.
This benchmark has been developed and tested on Ubuntu 20.04 using Python 3.8.
- Set up your dataset as described in the Data section.
- Install the requirements:

```console
pip3 install -r requirements.txt
```
- In the commands that follow, replace `${DATASET}` with a supported dataset, `${DATA_FOLDER}` with the path to the dataset folder, and `${LABEL_FOLDER}` with the path to the label folder. For further details, refer to the Data section. Replace `${TYPE}` with `ACCURACY`, `CPU`, or `MEMORY` for the accuracy, CPU, and memory benchmarks, respectively.
```console
python3 benchmark.py \
--type ${TYPE} \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine ${ENGINE} \
...
```
- For the memory benchmark, you should also run `mem_monitor.py` in a separate terminal window. This script monitors the memory usage of the diarization engine.

```console
python3 mem_monitor.py --engine ${ENGINE}
```

When the benchmark is complete, press `Ctrl+C` to stop the memory monitor.
Additionally, specify the desired engine using the `--engine` flag. For instructions on each engine and the required flags, consult the section below.
Create an S3 bucket. Then, substitute `${AWS_PROFILE}` with your AWS profile name and `${AWS_S3_BUCKET_NAME}` with the name of the S3 bucket you created.
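If you prefer to create the bucket programmatically rather than through the AWS console, a minimal boto3 sketch is shown below; it is not part of this repository, the profile and bucket names are placeholders, and it assumes your AWS profile has a default region configured.

```python
# Minimal sketch (not part of this repo) for creating the S3 bucket with boto3.
import boto3

session = boto3.session.Session(profile_name="my-aws-profile")  # hypothetical profile name
s3 = session.client("s3")
# Note: omit CreateBucketConfiguration entirely if your region is us-east-1.
s3.create_bucket(
    Bucket="my-diarization-benchmark-bucket",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": session.region_name},
)
```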
```console
python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AWS_TRANSCRIBE \
--aws-profile ${AWS_PROFILE} \
--aws-s3-bucket-name ${AWS_S3_BUCKET_NAME}
```
Generate a client library for the Speech to Text REST API, as outlined in the documentation.
Then, create an Azure storage account and container, and replace `${AZURE_STORAGE_ACCOUNT_NAME}` with your Azure storage account name, `${AZURE_STORAGE_ACCOUNT_KEY}` with your Azure storage account key, and `${AZURE_STORAGE_CONTAINER_NAME}` with your Azure storage container name. Finally, replace `${AZURE_SUBSCRIPTION_KEY}` with your Azure subscription key and `${AZURE_REGION}` with your Azure region.
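If you want to create the container programmatically, a minimal sketch using the `azure-storage-blob` package is shown below; it is not part of this repository, it assumes the storage account already exists, and the account and container names are placeholders.

```python
# Minimal sketch (not part of this repo) for creating the blob container with azure-storage-blob.
from azure.storage.blob import BlobServiceClient

account_name = "mystorageaccount"     # hypothetical storage account name
account_key = "..."                   # your storage account key
container_name = "diarization-audio"  # hypothetical container name

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)
service.create_container(container_name)
```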
```console
python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-storage-account-name ${AZURE_STORAGE_ACCOUNT_NAME} \
--azure-storage-account-key ${AZURE_STORAGE_ACCOUNT_KEY} \
--azure-storage-container-name ${AZURE_STORAGE_CONTAINER_NAME} \
--azure-subscription-key ${AZURE_SUBSCRIPTION_KEY} \
--azure-region ${AZURE_REGION}
```
Create a Google Cloud Storage bucket. Then, replace `${GCP_CREDENTIALS}` with the path to your GCP credentials file (`.json`) and `${GCP_BUCKET_NAME}` with your GCP bucket name.
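If you prefer to create the bucket programmatically, a minimal sketch using the `google-cloud-storage` package is shown below; it is not part of this repository, and the credentials path and bucket name are placeholders.

```python
# Minimal sketch (not part of this repo) for creating the GCS bucket with google-cloud-storage.
from google.cloud import storage

client = storage.Client.from_service_account_json("path/to/credentials.json")  # your GCP credentials file
client.create_bucket("my-diarization-benchmark-bucket")  # hypothetical bucket name
```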
```console
python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--gcp-credentials ${GCP_CREDENTIALS} \
--gcp-bucket-name ${GCP_BUCKET_NAME}
```
To utilize the enhanced model, replace the `GOOGLE_SPEECH_TO_TEXT` engine with `GOOGLE_SPEECH_TO_TEXT_ENHANCED`.
Replace `${PICOVOICE_ACCESS_KEY}` with your AccessKey obtained from the Picovoice Console.
```console
python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PICOVOICE_FALCON \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
```
Obtain your authentication token to download pretrained models by visiting their Hugging Face page. Then, replace `${PYANNOTE_AUTH_TOKEN}` with the authentication token.
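For context, this token is what pyannote.audio typically expects when downloading its pretrained pipeline from Hugging Face. The sketch below shows that general pattern; it is not the benchmark's own code, the pipeline name may differ across pyannote.audio versions, and the audio path is a placeholder.

```python
# Minimal sketch of how a Hugging Face token is typically passed to pyannote.audio
# when loading a pretrained diarization pipeline; not the benchmark's own code.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",            # pipeline name may differ across pyannote.audio versions
    use_auth_token="YOUR_HUGGING_FACE_TOKEN",  # the value you pass as ${PYANNOTE_AUTH_TOKEN}
)
diarization = pipeline("path/to/audio.wav")    # returns speaker turns with anonymous speaker labels
```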
```console
python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PYANNOTE \
--pyannote-auth-token ${PYANNOTE_AUTH_TOKEN}
```
Measurements were carried out on an Ubuntu 20.04 machine with an AMD CPU (`AMD Ryzen 7 5700X (16) @ 3.400GHz`), 64 GB of RAM, and NVMe storage.
| Engine | DER on VoxConverse (English) |
|---|---|
| Amazon | 11.1% |
| Azure | 15.7% |
| Google | 50.2% |
| Google - Enhanced | 24.0% |
| Picovoice Falcon | 10.3% |
| pyannote.audio | 9.0% |
| Engine | JER on VoxConverse (English) |
|---|---|
| Amazon | 29.8% |
| Azure | 30.1% |
| Google | 83.4% |
| Google - Enhanced | 57.6% |
| Picovoice Falcon | 19.9% |
| pyannote.audio | 27.4% |
To obtain these results, we ran the benchmark across the entire VoxConverse
dataset and recorded the maximum memory
usage during that period. As conversations involve varying lengths and numbers of speakers, this method provides us with
a reliable estimation of the memory usage of each engine.
| Engine | Memory Usage (GB) |
|---|---|
| pyannote.audio | 1.5 |
| Picovoice Falcon | 0.1 |
| Engine | Core-Hour |
|---|---|
| pyannote.audio | 442 |
| Picovoice Falcon | 4 |