NeuralDiarizer with the telephonic config mix speakers at the very beginning of shorter audio files (less than 2 minutes duration) #10988

Open
uro-sh opened this issue Oct 22, 2024 · 0 comments
Labels
bug Something isn't working

uro-sh commented Oct 22, 2024

Describe the bug

I am using NeuralDiarizer with the default diar_infer_telephonic.yaml settings (NeMo version 1.21.0) to diarize real-life phone call recordings.
I see the same issue for almost every shorter audio file (less than 2 minutes long) I have diarized: the first couple of utterances, spoken by two different speakers, are merged and labeled as if they were spoken by a single speaker.
After this initial glitch, the diarizer's predictions are very precise, so the issue only affects the first couple of sentences.
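
To make the symptom above easier to quantify, here is a minimal sketch that reads a predicted RTTM file and reports which speaker labels appear in the opening seconds and when the first label change occurs. The helper names (`parse_rttm`, `speakers_in_window`, `first_speaker_change`) are hypothetical and not part of NeMo; the only thing assumed about the diarizer output is that it follows the standard RTTM `SPEAKER` line layout.

```python
from collections import namedtuple

# One diarized speech segment: start/end in seconds plus the predicted label.
Segment = namedtuple("Segment", ["start", "end", "speaker"])


def parse_rttm(text):
    """Parse RTTM text into a list of Segments sorted by start time."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # RTTM layout: SPEAKER <file> <chan> <start> <dur> <NA> <NA> <speaker> ...
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append(Segment(start, start + duration, fields[7]))
    return sorted(segments)


def speakers_in_window(segments, window_sec):
    """Return the set of labels for segments that start before window_sec."""
    return {seg.speaker for seg in segments if seg.start < window_sec}


def first_speaker_change(segments):
    """Return the start time of the first segment whose label differs
    from the first segment's label, or None if the label never changes."""
    if not segments:
        return None
    first = segments[0].speaker
    for seg in segments[1:]:
        if seg.speaker != first:
            return seg.start
    return None
```

To use it, read the predicted RTTM into a string (for example with `open(path).read()`) and pass it to `parse_rttm`. The file location is an assumption here: it is expected under the `pred_rttms` subfolder of `config.diarizer.out_dir`. If `speakers_in_window(segments, 10.0)` contains a single label while two people are audibly speaking in the first ten seconds, that matches the merge described in this report.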

Are there any recommendations on how to improve precision for this particular problem?

Steps/Code to reproduce bug

I am using NeuralDiarizer with the default diar_infer_telephonic.yaml settings file, with these additions:

import json
import os

# NOTE: `config`, `output_dir`, and `data_dir` are defined earlier in the script (not shown here).

# Single-entry input manifest describing the recording to diarize
meta = {
    "audio_filepath": os.path.join(output_dir, "mono_file.wav"),
    "offset": 0,
    "duration": None,
    "label": "infer",
    "text": "-",
    "rttm_filepath": None,
    "uem_filepath": None,
}
with open(os.path.join(data_dir, "input_manifest.json"), "w") as fp:
    json.dump(meta, fp)
    fp.write("\n")

pretrained_vad = "vad_multilingual_marblenet"
pretrained_speaker_model = "titanet_large"
config.num_workers = 0
config.diarizer.manifest_filepath = os.path.join(data_dir, "input_manifest.json")
# Directory to store intermediate files and prediction outputs
config.diarizer.out_dir = output_dir

config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
# Compute VAD with the model given in vad.model_path instead of oracle VAD
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

config.diarizer.clustering.parameters.enhanced_count_thres = 80
config.diarizer.clustering.parameters.max_speaker_num = 2

# Pretrained NeMo VAD model
config.diarizer.vad.model_path = pretrained_vad
config.diarizer.vad.parameters.onset = 0.8
config.diarizer.vad.parameters.offset = 0.6
config.diarizer.vad.parameters.pad_offset = -0.05
# Telephonic speaker diarization model
config.diarizer.msdd_model.model_path = "diar_msdd_telephonic"
@uro-sh added the bug (Something isn't working) label Oct 22, 2024
@uro-sh changed the title from "NeuralDiarizer with the telephonic config mix speakers at the very beginning of audio files" to "NeuralDiarizer with the telephonic config mix speakers at the very beginning of shorter audio files (less than 2 minutes duration)" Oct 22, 2024