
fix the CTC zipformer2 training #1713

Open
KarelVesely84 wants to merge 1 commit into master
Conversation

KarelVesely84
Contributor

  • too many supervision tokens
  • change the filtering rule to `if (T - 2) < len(tokens): return False`
  • this prevents `inf` from appearing in the CTC loss value (empirically tested)

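For context, a rough sketch of how the changed rule could look inside a cut-filtering function. The function name, the SentencePiece handle `sp`, and the frame-subsampling formula are illustrative assumptions, not necessarily what the icefall training script uses:

```python
def keep_cut(c, sp) -> bool:
    """Return False for cuts whose token sequence is too long for CTC."""
    # Rough number of encoder output frames after ~4x subsampling
    # (assumed formula; the actual script may compute T differently).
    T = ((c.num_frames - 7) // 2 + 1) // 2
    tokens = sp.encode(c.supervisions[0].text, out_type=str)
    # New rule: keep a 2-token margin so the CTC loss cannot become inf.
    if (T - 2) < len(tokens):
        return False
    return True
```

It would typically be applied with something like `train_cuts = train_cuts.filter(lambda c: keep_cut(c, sp))`.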
@KarelVesely84
Contributor Author

KarelVesely84 commented Aug 12, 2024

workflow with error: https://github.com/k2-fsa/icefall/actions/runs/10348851808/job/28642009312?pr=1713

fatal: unable to access 'https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/': Recv failure: Connection reset by peer

but the file location https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/ exists...

Maybe too many tests at the same time? (overloaded Hugging Face?)

@KarelVesely84
Contributor Author

Hi @csukuangfj,
how about this one? Is @yaozengwei currently testing it?

It solves issue #1352.

My theory is that CTC uses 2 extra symbols at the beginning/end of the label sequence.
So, the label-length limit needs to be lowered by 2 symbols to accommodate that.

Best regards
Karel

@csukuangfj
Collaborator

Sorry for the late reply.

Could you analyze the wave that causes inf loss?
Is it too short?

Does it contain only a single word or does it contain nothing at all?

@KarelVesely84
Contributor Author

Hi,
the problematic utterance contained many words:
(num_embeddings, supervision_length, difference a-b) = (34, 33, 1)

text:
['▁O', 'f', '▁all', '▁.', '▁P', 'ar', 'li', 'a', 'ment', '▁,', '▁Co', 'un', 'c', 'il', '▁and', '▁Co', 'm', 'm', 'i', 's', 's', 'ion', '▁are', '▁work', 'ing', '▁to', 'ge', 'ther', '▁to', '▁de', 'li', 'ver', '▁.']

It seems like a better set of BPEs could reduce the number of supervision tokens.
Nevertheless, this would only hide the `inf` problem for CTC.

I believe the two extra tokens for the CTC loss are the <bos/eos>
that get prepended/appended to the supervision sequence,
hence the (T - 2).
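For concreteness, plugging the reported numbers into the old and new conditions (assuming the previous rule was simply `if T < len(tokens)`, which is my assumption rather than something stated here):

```python
T = 34           # num_embeddings reported above
num_tokens = 33  # supervision_length reported above

print(T < num_tokens)        # False -> old rule keeps the cut that produced inf
print((T - 2) < num_tokens)  # True  -> new rule filters the cut out
```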

Best regards
Karel

@csukuangfj
Collaborator

> the problematic utterance contained many words:

Thanks for sharing! Could you also post the duration of the corresponding wave file?

@KarelVesely84
Contributor Author

KarelVesely84 commented Aug 30, 2024

This is the corresponding Cut:

MonoCut(id='20180612-0900-PLENARY-3-59', start=557.34, duration=1.44, channel=0, supervisions=[SupervisionSegment(id='20180612-0900-PLENARY-3-59', recording_id='20180612-0900-PLENARY-3', start=0.0, duration=1.44, channel=0, text='Of all . Parliament , Council and Commission are working together to deliver .', language='en', speaker='None', gender='male', custom={'orig_text': 'of all. Parliament, Council and Commission are working together to deliver.'}, alignment=None)], features=Features(type='kaldi-fbank', num_frames=144, num_features=80, frame_shift=0.01, sampling_rate=16000, start=557.34, duration=1.44, storage_type='lilcom_chunky', storage_path='data/fbank/voxpopuli-asr-en-train_feats/feats-59.lca', storage_key='395124474,12987', recording_id='None', channels=0), recording=Recording(id='20180612-0900-PLENARY-3', sources=[AudioSource(type='file', channels=[0], source='/mnt/matylda6/szoke/EU-ASR/DATA/voxpopuli/raw_audios/en/2018/20180612-0900-PLENARY-3_en.ogg')], sampling_rate=16000, num_samples=139896326, duration=8743.520375, channel_ids=[0], transforms=None), custom={'dataloading_info': {'rank': 3, 'world_size': 4, 'worker_id': None}})

It is a 1.44 sec long cut inside a very long recording (2.42 hrs).
And 1.44 sec is far too little time to pronounce all the words in the reference text:
"Of all . Parliament , Council and Commission are working together to deliver ."

Definitely a data issue.
And if the Cut is filtered out, and the CTC consequently stops breaking, it should be seen as a good thing...

K.

@csukuangfj
Collaborator

Yes, I think it is good to filter out such data.

@KarelVesely84
Contributor Author

Hello, is anything needed from my side for this to be merged?
K.

@csukuangfj
Collaborator

csukuangfj commented Sep 17, 2024

The root cause is bad data. Would it be more appropriate to fix it when preparing the data?

The `-2` is not a constraint for computing the CTC or the transducer loss.

@KarelVesely84
Contributor Author

Well, without that (T - 2) change I was getting an `inf` value from the CTC loss.
There should be no `inf` even if the data are prepared badly.

I also did not find any trace of the extra CTC symbols or anything similar in the scripts.
`torch.nn.functional.ctc_loss()` is getting the same set of symbols as the transducer loss.

Could you try to reproduce the issue by adding a training example with a very lengthy transcript?
(Or I can create a branch to demonstrate it, say by repeating the LibriSpeech transcript 100x, just to make the error appear.)
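For illustration, a minimal standalone sketch (made-up shapes and lengths, not icefall code) of how plain `torch.nn.functional.ctc_loss` returns `inf` when the transcript is too long for the number of output frames:

```python
import torch
import torch.nn.functional as F

# Toy example: far more target tokens than output frames, so no valid CTC
# alignment exists and the loss diverges (zero_infinity defaults to False).
T, N, C, S = 10, 1, 500, 30   # frames, batch size, vocab size, target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction="sum")
print(loss)  # tensor(inf)
```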

Best regards,
Karel
