Incorrect number of utterances for the 10min and 1h subsets #53

Open
mzboito opened this issue Apr 21, 2023 · 1 comment
mzboito commented Apr 21, 2023

Hello,

I recently downloaded this dataset and noticed that the 10min and 1h subsets are the same size (in number of utterances).
Both contain 1,571 lines of phonetic transcriptions.

Fetching the corresponding audio files gives two sets that are each 05:29:37 long (HH:MM:SS).
I'm guessing this is a mistake? :)

azinonos commented Feb 26, 2024

I have the same issue. Moreover, when I load the entries in Python, both files yield exactly the same file list, so the two files are identical.
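For anyone who wants to reproduce this check, here is a minimal sketch of how the two list files could be compared. The function names and the one-entry-per-line assumption are mine, not part of the dataset tooling; pass whatever paths your copies of the 10min and 1h .txt files have.

```python
from pathlib import Path


def read_filelist(path):
    """Return the non-empty lines of a list file, stripped of whitespace.

    Assumes one entry per line (an assumption about the .txt layout).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_filelists(path_a, path_b):
    """Summarize how two list files differ."""
    a, b = read_filelist(path_a), read_filelist(path_b)
    return {
        "len_a": len(a),
        "len_b": len(b),
        "identical": a == b,  # same entries in the same order
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }
```

If `identical` is true and both lengths match, the two subsets really are the same list rather than just the same size.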

EDIT:
It seems the 1h folder is split into 6 sub-folders of 10 minutes each. Taking all paths from any one of those sub-folders gives the 10min data, and taking the entire 1h folder (all six sub-folders) gives the 1h data. So the .txt files could be rebuilt from the existing directory structure.
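A rough sketch of that rebuild, under several assumptions: the audio extension (`*.flac` here), the choice of the first sub-folder for the 10min subset, and the helper names are all mine, and this only produces path lists. If the original .txt files also carry phonetic transcriptions, those would still have to be joined back in separately.

```python
from pathlib import Path


def rebuild_subsets(one_hour_dir, pattern="*.flac", ten_min_index=0):
    """Rebuild 10min and 1h path lists from the 1h folder's sub-folders.

    pattern and ten_min_index are assumptions: adjust the glob to the real
    audio extension, and pick whichever sub-folder should serve as 10min.
    """
    root = Path(one_hour_dir)
    subfolders = sorted(p for p in root.iterdir() if p.is_dir())
    if not subfolders:
        raise ValueError(f"no sub-folders found in {root}")

    # One sorted list of relative audio paths per sub-folder.
    per_folder = [
        sorted(f.relative_to(root).as_posix() for f in sub.rglob(pattern))
        for sub in subfolders
    ]
    ten_min = per_folder[ten_min_index]
    one_hour = [path for files in per_folder for path in files]
    return ten_min, one_hour


def write_filelist(paths, out_path):
    """Write one path per line."""
    Path(out_path).write_text("\n".join(paths) + "\n", encoding="utf-8")
```

Calling `rebuild_subsets` on the 1h folder and passing each returned list to `write_filelist` would give two distinct list files, with the 10min list a strict subset of the 1h list.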
