NeuralDiarizer with the telephonic config mix speakers at the very beginning of shorter audio files (less than 2 minutes duration) #10988

uro-sh · 2024-10-22T12:16:22Z

Describe the bug

I am using NeuralDiarizer with the default diar_infer_telephonic.yaml settings (NeMo version 1.21.0) to diarize real-life phone call recordings.
I see the same issue on almost every shorter audio file (under 2 minutes) I have diarized: the first couple of utterances, spoken by two different speakers, are merged and labeled as if they were spoken by a single speaker.
After this initial glitch, the diarizer's predictions are very precise, so the issue affects only the first couple of sentences.

Are there any recommendations for improving accuracy in this particular case?

Steps/Code to reproduce bug

I am using NeuralDiarizer with the default diar_infer_telephonic.yaml settings file, with this addition:

import json
import os

# Assumed to be defined earlier: `config` (loaded from the default
# diar_infer_telephonic.yaml), `data_dir`, and `output_dir`.

# Write a single-entry input manifest for the mono recording
meta = {
    "audio_filepath": os.path.join(output_dir, "mono_file.wav"),
    "offset": 0,
    "duration": None,
    "label": "infer",
    "text": "-",
    "rttm_filepath": None,
    "uem_filepath": None,
}
with open(os.path.join(data_dir, "input_manifest.json"), "w") as fp:
    json.dump(meta, fp)
    fp.write("\n")

pretrained_vad = "vad_multilingual_marblenet"
pretrained_speaker_model = "titanet_large"
config.num_workers = 0
config.diarizer.manifest_filepath = os.path.join(data_dir, "input_manifest.json")
# Directory to store intermediate files and prediction outputs
config.diarizer.out_dir = output_dir

config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
# compute VAD provided with model_path to vad config
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

config.diarizer.clustering.parameters.enhanced_count_thres = 80
config.diarizer.clustering.parameters.max_speaker_num = 2

# Here, we use our in-house pretrained NeMo VAD model
config.diarizer.vad.model_path = pretrained_vad
config.diarizer.vad.parameters.onset = 0.8
config.diarizer.vad.parameters.offset = 0.6
config.diarizer.vad.parameters.pad_offset = -0.05
# Telephonic speaker diarization model
config.diarizer.msdd_model.model_path = "diar_msdd_telephonic"
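To make the symptom easier to check across many short files, the predicted RTTM output can be inspected for the speaker labels assigned at the start of each recording. The following is only a minimal sketch, assuming the predictions are available as standard RTTM text (`SPEAKER <file> <channel> <start> <duration> <NA> <NA> <speaker> ...`). The helper names `parse_rttm` and `opening_speakers`, the 15-second window, and the sample data are all hypothetical and not part of NeMo.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # predicted speaker label


def parse_rttm(text: str) -> List[Segment]:
    """Parse RTTM text into segments sorted by start time."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append(Segment(start, start + duration, fields[7]))
    return sorted(segments)


def opening_speakers(segments: List[Segment], window_s: float = 15.0) -> List[str]:
    """Return the distinct speaker labels of segments starting within the window."""
    return sorted({s.speaker for s in segments if s.start < window_s})


# Hypothetical RTTM content illustrating the reported symptom:
# both opening utterances carry the same label.
sample_rttm = (
    "SPEAKER mono_file 1 0.500 6.200 <NA> <NA> speaker_0 <NA> <NA>\n"
    "SPEAKER mono_file 1 7.100 3.000 <NA> <NA> speaker_0 <NA> <NA>\n"
    "SPEAKER mono_file 1 20.000 4.000 <NA> <NA> speaker_1 <NA> <NA>\n"
)

segments = parse_rttm(sample_rttm)
for seg in segments:
    print(f"{seg.start:7.2f} - {seg.end:7.2f}  {seg.speaker}")
print("speakers in opening window:", opening_speakers(segments))
```

If the opening window shows a single label on a call where both parties are known to speak early, that matches the merge described above. To run this on real output, read the RTTM file that the diarizer wrote for the recording and pass its contents to `parse_rttm`; the exact location of that file under the configured out_dir is left as an assumption here.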

uro-sh added the bug label Oct 22, 2024

uro-sh changed the title ~~NeuralDiarizer with the telephonic config mix speakers at the very beginning of audio files~~ NeuralDiarizer with the telephonic config mix speakers at the very beginning of shorter audio files (less than 2 minutes duration) Oct 22, 2024

elliottnv assigned titu1994 Oct 23, 2024
