Issues with long cutset during zipformer training #1844
Comments
We have a function in train.py to remove long and short utterances, which is enabled by default.
@csukuangfj In train.py there is train_cuts = train_cuts.filter(remove_short_utt). I was not able to find any option for long utterances.
Which file are you referring to? Please recheck.
@csukuangfj I am using this file: https://github.com/k2-fsa/icefall/blob/master/egs/gigaspeech/ASR/zipformer/train.py
Please refer to librispeech.
icefall/egs/librispeech/ASR/zipformer/train.py Lines 1377 to 1385 in ad966fb
Please read the comment in train.py carefully.
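For readers following along, here is a minimal sketch of a duration filter of the kind discussed above. It is not the code from the librispeech train.py. The 1 s and 20 s bounds are the ones mentioned in this thread, and `FakeCut` and `keep_cut` are hypothetical names used only for illustration (`FakeCut` stands in for a lhotse cut, of which only the `duration` attribute is used).

```python
from dataclasses import dataclass

# Duration bounds in seconds; taken from the values mentioned in this thread.
MIN_DURATION = 1.0
MAX_DURATION = 20.0


@dataclass
class FakeCut:
    """Hypothetical stand-in for a lhotse cut; only `duration` matters here."""
    id: str
    duration: float


def keep_cut(cut, min_duration=MIN_DURATION, max_duration=MAX_DURATION) -> bool:
    """Return True if the cut's duration lies within [min_duration, max_duration]."""
    return min_duration <= cut.duration <= max_duration


cuts = [
    FakeCut("too_short", 0.4),
    FakeCut("ok", 12.5),
    FakeCut("too_long", 162.26),  # duration of the cut reported in this issue
]
kept_ids = [c.id for c in cuts if keep_cut(c)]
print(kept_ids)
```

With a real CutSet, the same predicate could presumably be passed to the filter call already quoted in this thread, i.e. train_cuts = train_cuts.filter(keep_cut), in place of remove_short_utt.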
@csukuangfj OK, thanks for your suggestion. The training has now resumed. Hopefully it completes without any error.
@csukuangfj Is there any way to retain audio longer than 20s or shorter than 1s by modifying the cuts so that it doesn't cause an error?
Hello all,
We are training a zipformer model on about 3400 hours of Tamil data. This is in reference to #1751.
We have an NVIDIA A6000 50GB GPU and are getting the error below:
2024-12-19 16:44:26,604 INFO [asr_datamodule.py:375] About to create dev dataloader
2024-12-19 16:44:26,604 INFO [train.py:1326] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic.py:342: UserWarning: We have exceeded the max_duration constraint during sampling but have only 1 cut. This is likely because max_duration was set to a very low value ~10s, or you're using a CutSet with very long cuts (e.g. 100s of seconds long).
warnings.warn(
2024-12-19 16:58:57,481 ERROR [train.py:1345] Your GPU ran out of memory with the current max_duration setting. We recommend decreasing max_duration and trying again.
Failing criterion: single_longest_cut (=162.26) ...
2024-12-19 16:58:57,482 INFO [train.py:1304] Saving batch to zipformer/exp/batch-6c307511-b2b9-437a-28df-6ec4ce4a2bbd.pt
2024-12-19 16:59:01,314 INFO [train.py:1310] features shape: torch.Size([7, 16226, 80])
2024-12-19 16:59:01,315 INFO [train.py:1314] num tokens: 817
Traceback (most recent call last):
File "./zipformer/train.py", line 1380, in <module>
main()
File "./zipformer/train.py", line 1373, in main
run(rank=0, world_size=1, args=args)
File "./zipformer/train.py", line 1225, in run
scan_pessimistic_batches_for_oom(
File "./zipformer/train.py", line 1341, in scan_pessimistic_batches_for_oom
loss.backward()
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/armuser/k2_249/icefall/egs/tamil/ASR/zipformer/scaling.py", line 313, in backward
x_grad = x_grad - ans * x_grad.sum(dim=ctx.dim, keepdim=True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 37.04 GiB already allocated; 3.94 GiB free; 41.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training command: ./zipformer/train.py --world-size 1 --num-epochs 30 --start-batch 336000 --use-fp16 1 --exp-dir zipformer/exp --max-duration 100
Initially, max duration was set to 150.
Training had completed 4 epochs before the above error occurred. I loaded the saved batch.pt file and got the following:
'sequence_idx': tensor([0, 1, 2, 3, 4, 5, 6], dtype=torch.int32), 'start_frame': tensor([0, 0, 0, 0, 0, 0, 0], dtype=torch.int32), 'num_frames': tensor([16226, 2022, 1926, 1744, 1716, 1676, 1613], dtype=torch.int32), 'cut': [MonoCut(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, supervisions=[SupervisionSegment(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, text='அவ்வையார் விருது தமிழ்நாட்டில் சமூகநலப் பணிகளை அரப்பணிப்புடன் செயலாற்றியதாக 2020ஆம் ஆண்டிற்கான அவ்வையார் விருதுக்கு தேர்வு செய்யப்பட்ட திருவண்ணாமலையைச் சேர்ந்த சமூக சேவகி திருமதி', language=None, speaker='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', gender=None, custom={'origin': 'giga'}, alignment=None)], features=Features(type='kaldifeat-fbank', num_frames=16226, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0, duration=162.26, storage_type='lilcom_chunky', storage_path='/home/armuser/10TBHDD/CUDA_11.6/icefall/egs/tamil/ASR/data/fbank/train_split/tamil_feats_train_00032581.lca', storage_key='964876,45872,45111,44652,45255,45498,44806,45091,45317,44865,45016,44804,44720,44784,44749,45046,44983,44943,45297,44866,45335,45125,45507,44978,44909,44841,44914,44718,44569,45297,44670,45390,44619,20203', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', channels=0), recording=Recording(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', sources=[AudioSource(type='file', channels=[0], source='/media/ASR_database/shruthilipi_data/tamil/newsonair_renamed
/Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97.wav')], sampling_rate=8000, num_samples=1298080, duration=162.26, channel_ids=[0], transforms=None),
Kindly suggest how to go about this issue.
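As a rough illustration of why this batch is so expensive, the sketch below redoes the arithmetic from the num_frames values and the features shape [7, 16226, 80] printed above. It assumes, as that shape suggests, that every cut in the batch is padded to the length of the longest one. The variable names are illustrative only.

```python
# Values copied from the batch dump above.
num_frames = [16226, 2022, 1926, 1744, 1716, 1676, 1613]
frame_shift = 0.01  # seconds per frame, from the Features entry

longest = max(num_frames)

# Every sequence is padded to the longest one, so the model processes
# batch_size * longest frames, not the sum of the real lengths.
padded_frames = len(num_frames) * longest
real_frames = sum(num_frames)

# Share of the padded batch that is padding rather than speech.
padding_fraction = 1 - real_frames / padded_frames

# Duration of the longest cut in seconds.
longest_seconds = longest * frame_shift

print(f"longest cut: {longest_seconds:.2f} s")
print(f"padded frames: {padded_frames}, real frames: {real_frames}")
print(f"padding fraction: {padding_fraction:.1%}")
```

The single 162.26 s cut sets the padded length for the whole batch, so roughly three quarters of the frames in this batch are padding.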