Training Zipformer Transducer on 8k sample rate dataset #1518
hi, for the best
Hi, could you also show the log when you run the training script with
I'm training on 100 hours of Telugu data. I haven't changed any defaults from the example other than the librispeech data module and the tokenizer, which I changed to a 1024-vocab BPE tokenizer trained on the same data with character coverage 1. I manually filtered the data to be >2s and <30s. The 100hr dataset I'm using is not as clean as librispeech and might be slightly more challenging; I've discarded the audio-length vs text-length outliers, but since the dataset is more unclean and challenging, is it a bad idea to use it with the librispeech/ASR/zipformer example? Also, what ground-truth WER indicates a dataset good enough for zipformer training? <3%?
From the log "encoder_embed.conv.0.output is not finite", I suspect there might be some inf values in the input features. Could you check that?
Hi, sorry for the late reply. Here's how I'm computing the input features:
Here's how I checked if there are any inf values in the input features:
The output was 0. So, if this method is valid for finding inf values in this case, I don't think there are any inf values in the input features. Is it possible that the inf values are being created during the training process for some reason?
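For reference, a check like the one described can be sketched in plain NumPy over precomputed feature matrices. This is only an illustration with synthetic arrays; with lhotse you would iterate over a CutSet and load each cut's features instead:

```python
import numpy as np

def count_nonfinite(feature_matrices):
    """Count inf/nan values across a list of (num_frames, num_mel_bins) arrays."""
    total = 0
    for feats in feature_matrices:
        total += int(np.sum(~np.isfinite(feats)))
    return total

# Synthetic example: one clean 80-dim fbank-like matrix and one corrupted one.
clean = np.zeros((100, 80), dtype=np.float32)
bad = np.zeros((100, 80), dtype=np.float32)
bad[3, 7] = np.inf
bad[5, 2] = np.nan

print(count_nonfinite([clean, bad]))  # → 2
```

If this count is 0 over the whole training set, the non-finite values are most likely appearing inside the model during training rather than in the input features.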
I was able to train the model after removing outliers via the duration/characters ratio, using the threshold 0.115455 < dur/char < 0.685000, on a 200hr dataset similar to the one I was trying to train on previously. I removed the utterances outside those thresholds, which I chose after observing the dur/char distribution, because, as far as I understand, the inf grads are created during training by utterances with bad lengths (from this comment). The training completed successfully, and here's the tensorboard for that. But when I used the exact same process for training a bigger 1k-hour dataset that is very similar to the previous datasets, I'm facing this error again when using --inf-check=True after 1 epoch.
I do see that the training for 1k hours is less stable, and the loss does not seem to get as low as 0.1, as per comments on some other issues. The recipe and the method I used are identical for both runs, so this error seems to be caused by dataset quality. The steps I've taken to filter bad data out are the dur/char filtering I mentioned before and removing all utterances above 12 seconds, as they fall in the 0.01 percentile for the data I'm using. I'm just looking for a better understanding of what sort of bad data could lead to this error and what steps I could take to remove those instances. Any input or help regarding this would be really valuable!
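The dur/char filtering described above can be sketched as follows. The utterance tuples here are illustrative; with lhotse you would apply the same predicate to a CutSet using each cut's duration and supervision text:

```python
def keep_utterance(duration_s, text, lo=0.115455, hi=0.685000,
                   min_dur=2.0, max_dur=12.0):
    """Apply the dur/char ratio and duration thresholds described above."""
    if not text or not (min_dur <= duration_s <= max_dur):
        return False
    ratio = duration_s / len(text)
    return lo < ratio < hi

utts = [
    (3.5, "oka chinna vakyam idi"),  # ~0.17 s/char -> kept
    (10.0, "abc"),                   # ~3.33 s/char -> dropped (bad ratio)
    (1.0, "short but valid text"),   # dropped (under 2 s)
]
kept = [u for u in utts if keep_utterance(*u)]
```

The ratio bound catches misaligned transcripts (very long audio with very short text, or vice versa), which is exactly the kind of utterance that tends to blow up transducer losses.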
@yaozengwei, @csukuangfj, @JinZr Any suggestions on what I could do to improve data quality would be really helpful. Should I try training CTC instead? Could that help avoid this error?
Please try feature extraction using 1st-gen Kaldi first, and use lhotse to import the Kaldi data dir format into a lhotse-compatible one; see if the same issue happens with the original Kaldi feature extraction. There's not much we can do with corrupted data other than filtering it out.
Thanks for your prompt response @JinZr. I wanted to ask: do you think it's possible that the data is alright and this issue is due to some other reason (like the dataset being too challenging for RNN-T)? In that case, should I try the zipformer_ctc model?
What's the average duration of the recordings, and are they all extremely noisy?
Best Regards
Jin
The average duration is 3.49s. The distribution is:
count 949908.000000
mean 3.498416
std 1.497223
min 0.780000
25% 2.560000
50% 3.140000
75% 4.000000
90% 5.140000
95% 6.160000
99% 9.080000
max 56.200000
(This is before I filtered out dur/char outliers and audios <2s and >12s.) The data is not necessarily super noisy, but it inherits real-world background acoustic qualities; it's not clean audio, but I believe not all of it is extremely noisy. In this case, if I use similar data but 10x its size (say 10k hours), could that be better?

You could try filtering out utterances with a duration smaller than 1s and larger than 20s first. Training with utterances of 56 seconds would be harmful, especially at the beginning of the training process.
Best Regards
Jin
Apologies, I mentioned this in the edit after responding. The values I shared are before the filtering, I did filter out dur/char outliers and audios <2s and >12s before the last run and it still crashed with the inf grad error. |
You can also just filter out all acoustic features containing nan; that's the only conclusion I can draw from the information you've provided so far. If the RNN-T model couldn't converge well on this kind of data, I don't see why training a CTC model would help.
Also try doing feature extraction with other toolkits like I mentioned earlier, so you can determine whether this is a problem with the toolkit or with the data. Maybe some of the wav files are broken, but I can't say for sure.
Best Regards
Jin
@JinZr, thank you for your guidance! When I removed audios with durations >9s, which is the 99th percentile for the data I'm using, the error stopped. I was also using a max duration of 100s instead of 1000s, and the low max duration could have played a part as well; I'm not sure, but the run with 1k max duration and removing audio <2s and >9s stopped the error. If you have any insights or questions regarding our experiments, I'm happy to answer.
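One way to see why a low max duration can interact badly with long utterances: with a 100s batch cap, a single 56s utterance leaves room for little else in its batch, so gradients from that batch are dominated by one hard example. The toy greedy-packing sketch below is only an illustration of this effect, not icefall's actual DynamicBucketingSampler logic:

```python
def pack_batches(durations, max_batch_duration):
    """Greedily pack utterances into batches capped by total duration."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches

durs = [3.5] * 20 + [56.0]
small = pack_batches(durs, 100.0)   # the 56 s cut ends up alone in a batch
large = pack_batches(durs, 1000.0)  # everything fits in a single batch
```

With the small cap, the outlier utterance forms a near-singleton batch; with the large cap, it is averaged against many ordinary utterances, which tends to be more stable.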
cool, glad to hear everything works fine now. 🎉
best
jin
Hi
I modified the librispeech zipformer example to accommodate an 8kHz dataset of around 100 hours. I created a separate data module that returns the train and valid cuts and dataloaders. I opted for DynamicBucketingSampler and, for the K2SpeechRecognitionDataset, PrecomputedFeatures as the input strategy. At first I tried computing the features via compute_and_store_features() with the lhotse Fbank extractor as
FbankConfig(num_mel_bins=80, sampling_rate=8000, device="cuda")
and it kept returning an error. When I used KaldifeatFbank as the extractor via
extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda'))
and resampled the cutset to 16k before using compute_and_store_features(), there weren't any errors and the training started. But after a few epochs of training, I faced a completely new error, "Too many grads were not finite", and when I passed --inf-check True, I faced another error. I'm having a hard time understanding what's going on. While training Zipformer on an 8k sample-rate dataset, what do I need to be wary of? Am I missing something in my approach? Guidance regarding this would be of great help.
Thanks!
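As a rough illustration of the resampling step mentioned above: lhotse's CutSet.resample handles this properly with band-limited resampling, while the naive linear-interpolation version below is only a conceptual sketch of going from 8kHz to 16kHz:

```python
import numpy as np

def naive_resample(samples, sr_in, sr_out):
    """Linearly interpolate a 1-D signal from sr_in to sr_out Hz.
    Real toolkits (lhotse/torchaudio) use proper band-limited resampling."""
    n_out = int(round(len(samples) * sr_out / sr_in))
    t_in = np.arange(len(samples)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, samples)

# One second of a 440 Hz tone sampled at 8 kHz, upsampled to 16 kHz.
one_second_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
upsampled = naive_resample(one_second_8k, 8000, 16000)
print(len(upsampled))  # → 16000
```

Note that upsampling cannot restore frequency content above 4kHz that 8kHz audio never had; it only makes the sample rate match what a 16kHz feature extractor expects.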