Evaluation script #3

Open
cromz22 opened this issue Jun 18, 2024 · 3 comments

Comments

@cromz22

cromz22 commented Jun 18, 2024

Thank you for open-sourcing this amazing work!

Do you have any plans for releasing the evaluation scripts?

I would like to reproduce the results reported in the tables of the paper, but the paper does not seem to provide enough details to do so.
For example,

  • How can we obtain discrete units from the pretrained DinoSR model?
    I believe the argmax call in this line produces them, but because this forward function doesn't seem to be intended for evaluation, I'm not sure whether the arguments passed to it can be used as they are.

    onehot_target[range(len(neg_l2_dist)),neg_l2_dist.argmax(-1)] = 1.0

  • The definition of the 5th layer is unclear. Is it the 5th of the 12 Transformer layers, or the 5th of the top 8 layers that were used for DinoSR?

    We focused on the fifth layer of DinoSR

  • What kind of forced alignment method was used? If Montreal Forced Aligner was used, which acoustic/dictionary models were used?

    To compute these metrics, forced alignment is used to acquire the ground truth phone of each feature frame on LibriSpeech dev-clean and dev-other sets
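Whatever aligner was used, the quoted sentence implies a step that maps aligned phone intervals onto feature frames. A minimal sketch of that step is given here, under assumptions that are not confirmed by the paper or the repository: the function name `intervals_to_frame_labels` is hypothetical, the 20 ms frame stride is assumed, each frame takes the phone covering its center, and the toy alignment is made up.

```python
from typing import List, Tuple

def intervals_to_frame_labels(
    intervals: List[Tuple[float, float, str]],
    num_frames: int,
    frame_stride: float = 0.02,  # assumed 20 ms per feature frame
    pad_label: str = "sil",      # assumed label for frames no interval covers
) -> List[str]:
    """Assign each frame the phone whose interval contains the frame center."""
    labels = [pad_label] * num_frames
    for start, end, phone in intervals:
        for t in range(num_frames):
            center = (t + 0.5) * frame_stride
            if start <= center < end:
                labels[t] = phone
    return labels

# Toy alignment as (start_sec, end_sec, phone); not real aligner output.
alignment = [(0.00, 0.04, "HH"), (0.04, 0.10, "AH"), (0.10, 0.16, "L")]
frame_labels = intervals_to_frame_labels(alignment, num_frames=9)
print(frame_labels)
```

The last frame falls outside every interval and keeps the padding label; how such frames were handled in the paper's evaluation is not stated.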

@cantabile-kwok

@cromz22 Hi Shuichiro, I am also using this repo and facing the same problem. Have you managed to work out a way to obtain discrete units from the pretrained DinoSR model since posting this issue? I would be very grateful for any help : )

@cromz22
Author

cromz22 commented Aug 26, 2024

Hi, I have made no progress on this since opening the issue. As stated above, I believe the argmax values in the line quoted there are the discrete units, but I can't be sure.

@cantabile-kwok

cantabile-kwok commented Aug 27, 2024

I read through the code carefully, and I believe you are right. The discrete units should be the argmax values of negative distances between layer outputs and codebooks.
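A minimal sketch of that reading, written in NumPy rather than the repository's PyTorch code: for each frame, compute the negative L2 distance to every codebook entry and take the argmax. The function name `assign_units`, the array shapes, and the toy values are assumptions for illustration. The sketch does not reproduce the model's forward pass or any normalization the repository may apply before computing distances.

```python
import numpy as np

def assign_units(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each frame.

    features: (T, D) outputs of one layer (hypothetical shape)
    codebook: (K, D) codewords for that layer (hypothetical shape)
    """
    # Pairwise squared L2 distances, shape (T, K). Squaring does not
    # change which codeword is nearest, so the argmax is unaffected.
    diff = features[:, None, :] - codebook[None, :, :]
    neg_l2_dist = -np.sum(diff ** 2, axis=-1)
    # The nearest codeword has the largest negative distance.
    return neg_l2_dist.argmax(-1)

# Toy data: 3 codewords and 4 frames in 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.4, 0.4]])
units = assign_units(features, codebook)
print(units.tolist())
```

Each entry of `units` is a codebook index in the range 0 to K-1, one per frame, which is the same quantity the `argmax(-1)` in the quoted line selects before it is turned into a one-hot target.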
