
Mismatched diarization results between pyannote/speaker-diarization-3.0 and k2-fsa/speaker-diarization #1708

Open
takipipo opened this issue Jan 14, 2025 · 8 comments

takipipo commented Jan 14, 2025

I attempted to diarize the audio clip using the same model, but I obtained different results. Is this a known issue related to the ONNX format, or did I make a mistake in my process?

I checked the pipeline of pyannote/speaker-diarization-3.0 and selected the same models as those provided in sherpa-onnx.

How to reproduce

pyannote/speaker-diarization-3.0

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token="change_to_your_huggingface_token")

diarization = pipeline("ck-interview-mono.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

Output

start=0.0s stop=5.2s speaker_SPEAKER_00
start=6.0s stop=23.0s speaker_SPEAKER_00
start=23.8s stop=33.2s speaker_SPEAKER_00
start=33.3s stop=41.4s speaker_SPEAKER_00
start=42.2s stop=43.0s speaker_SPEAKER_00
start=43.0s stop=48.0s speaker_SPEAKER_01
start=48.7s stop=50.2s speaker_SPEAKER_01
start=50.5s stop=61.9s speaker_SPEAKER_01
start=62.2s stop=71.3s speaker_SPEAKER_01
start=71.5s stop=72.0s speaker_SPEAKER_00
start=71.9s stop=72.7s speaker_SPEAKER_01
start=73.5s stop=74.6s speaker_SPEAKER_00

k2-fsa/speaker-diarization

Ran on https://huggingface.co/spaces/k2-fsa/speaker-diarization

  1. speaker embedding model: wespeaker_en_voxceleb_resnet34_LM.onnx (26 MB)
  2. speaker segmentation model: pyannote/segmentation-3.0
  3. Number of speakers: 2

Output

0.031 -- 5.228 speaker_00
6.038 -- 23.048 speaker_00
23.825 -- 32.971 speaker_00
33.562 -- 41.375 speaker_00
42.151 -- 47.990 speaker_00
48.732 -- 72.728 speaker_00
73.522 -- 74.602 speaker_00
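
For reference, the Space's settings can also be reproduced locally with sherpa-onnx's offline diarization API. The sketch below is modeled on the sherpa-onnx Python examples rather than the Space's exact code; the model paths are placeholders for wherever the two models were downloaded, and read_wave is the helper discussed later in this thread.

import sherpa_onnx

# Sketch: local equivalent of the Space's settings (2 speakers,
# pyannote segmentation-3.0 + WeSpeaker ResNet34-LM embeddings).
config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"  # placeholder path
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./wespeaker_en_voxceleb_resnet34_LM.onnx"  # placeholder path
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=2),
)
sd = sherpa_onnx.OfflineSpeakerDiarization(config)

audio, sample_rate = read_wave("ck-interview-mono.wav")
assert sample_rate == sd.sample_rate  # the models expect 16 kHz input
result = sd.process(audio).sort_by_start_time()
for r in result:
    print(f"{r.start:.3f} -- {r.end:.3f} speaker_{r.speaker:02d}")
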
takipipo (Author) commented:

Additionally, I conducted a comparison of the embedding models using cosine similarity. The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

Cosine Similarity Calculation

import numpy as np
from scipy.spatial.distance import cdist

import sherpa_onnx
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
inference = Inference(model, window="whole")
audio_fp = "change_to_your_audio_filepath"

embedding_pyannote = inference(audio_fp)

config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(
    model="/Users/kridtaphadsae-khow/.cache/huggingface/hub/models--csukuangfj--speaker-embedding-models/snapshots/0743f301363dec56491a490f6d6cbc9d67f9a3bf/wespeaker_en_voxceleb_resnet34_LM.onnx",
    num_threads=1,
    debug=True,
    provider="cpu",
)
extractor = sherpa_onnx.SpeakerEmbeddingExtractor(config)

# read_wave is the helper from the k2-fsa Space's model.py (see below)
audio, sample_rate = read_wave(audio_fp)
stream = extractor.create_stream()
stream.accept_waveform(sample_rate=sample_rate, waveform=audio)
stream.input_finished()  # flush the stream before computing the embedding
embedding_sherpa = np.asarray(extractor.compute(stream))

distance = cdist(
    np.expand_dims(embedding_pyannote, axis=0),
    np.expand_dims(embedding_sherpa, axis=0),
    metric="cosine",
)
print(distance)
>> array([[0.82130009]])

csukuangfj (Collaborator) commented:

> The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

csukuangfj (Collaborator) commented:

Can you share ck-interview-mono.wav?

takipipo (Author) commented:

> Can you share ck-interview-mono.wav?

audio clip


takipipo commented Jan 15, 2025

> The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.
>
> If it is nearly 0, then you can consider them almost orthogonal.
>
> If it is nearly 1, then you cannot say they are almost orthogonal.

In scipy's implementation, cdist with metric="cosine" returns the cosine distance, so 1 indicates orthogonality, while 0 signifies that the vectors are parallel.

[screenshot of the scipy cosine distance documentation]
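
A quick check of that convention with scipy (toy vectors, purely for illustration):

from scipy.spatial.distance import cosine

# scipy's "cosine" metric is a *distance*: 1 - cosine similarity
print(cosine([1, 0], [0, 1]))  # 1.0 -> orthogonal vectors
print(cosine([1, 0], [2, 0]))  # 0.0 -> parallel vectors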

csukuangfj (Collaborator) commented:

I see what you mean now.

cosine_distance = 1 - similarity_score
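
Applied to the value reported above:

distance = 0.82130009        # scipy cosine distance from the earlier snippet
similarity = 1.0 - distance  # cosine similarity of the two embeddings
print(f"{similarity:.3f}")   # ~0.179, i.e. far from parallel: the embeddings disagree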

csukuangfj (Collaborator) commented:

> audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

What is read_wave?

takipipo (Author) commented:

> audio, sample_rate = read_wave(audio_fp)
>
> Please show the complete code.
>
> What is read_wave?

I used the read_wave function you provided in https://huggingface.co/spaces/k2-fsa/speaker-diarization/blob/main/model.py#L26-L48
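
For readers following along, here is a sketch of what that helper does, assuming a 16-bit mono WAV file; it mirrors the usual sherpa-onnx example pattern, not necessarily the linked file verbatim:

import wave

import numpy as np

def read_wave(wave_filename):
    """Return (samples, sample_rate); samples are float32 in [-1, 1]."""
    with wave.open(wave_filename) as f:
        assert f.getnchannels() == 1, f.getnchannels()  # mono only
        assert f.getsampwidth() == 2, f.getsampwidth()  # 16-bit PCM
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32) / 32768
        return samples_float32, f.getframerate()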
