
Add the data augmentation method mixup #47

Merged: 19 commits merged into main from jing-mixup on Jun 20, 2024

Conversation

@Judyxujj (Contributor) commented Apr 18, 2024

  • This implementation of mixup only mixes up the audio inputs x, not the targets y (see the sketch below).
  • The implementation is partly based on Albert's implementation in the returnn front end (cf. https://github.com/rwth-i6/i6_experiments/blob/main/users/zeyer/returnn/models/rf_mixup.py) and Zoltan's implementation at Apptek.
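
For context, a minimal sketch of this input-only mixup idea; the names `features`, `noise`, and `mix_scale` are illustrative and not the PR's actual API, and the targets y are deliberately left untouched:

```python
import torch


def mixup_inputs_only(features: torch.Tensor, noise: torch.Tensor, mix_scale: float) -> torch.Tensor:
    """Mix other audio sequences into the inputs; the targets y stay unchanged.

    features: [B, T, F] batch of audio features
    noise:    [B, T, F] features of other sequences drawn from a buffer
    mix_scale: weight of the mixed-in audio (the "lambda" discussed below)
    """
    return features + mix_scale * noise
```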

@albertz's comment was marked as resolved.

@michelwi (Contributor) left a comment

In this implementation

  • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
  • 4 different audio tracks are selected from the buffered data
  • each sequence is perturbed with track 1
  • the sequences with n<=2 are perturbed with track 1 and 2
  • the sequences with n<=3 are perturbed with track 1, 2, and 3 (and so on)

Is there any reason to make it so complicated?

What I would have done:

  • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
  • \sum_b n_b different audio tracks are selected from the buffered data (this is M' from the inline comments)
  • each sequence is perturbed with n separate tracks

Bonus implementation: with probability 1 - apply_prob, n_b is set to 0, i.e. no mixup for a specific sequence. (A rough sketch of this variant follows below.)
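
A rough sketch of this suggested variant, assuming all buffered sequences share the same length T; the names (`buffer`, `apply_prob`) and the uniform lambda sampling are assumptions for illustration, not the merged implementation:

```python
import torch


def mixup_suggested(
    features: torch.Tensor,  # [B, T, F] input batch
    buffer: torch.Tensor,    # [N, T, F] previously seen feature sequences
    max_num_mix: int = 4,
    apply_prob: float = 1.0,
    lambda_min: float = 0.1,
    lambda_max: float = 0.4,
) -> torch.Tensor:
    B = features.shape[0]
    # draw n_b mixups per sequence, zeroed out with probability 1 - apply_prob
    n = torch.randint(1, max_num_mix + 1, (B,))
    n = torch.where(torch.rand(B) < apply_prob, n, torch.zeros_like(n))
    m_total = int(n.sum())  # M' = \sum_b n_b
    if m_total == 0:
        return features

    # sample M' different tracks from the buffer, each with its own scale
    idx = torch.randint(0, buffer.shape[0], (m_total,))
    tracks = buffer[idx]                                    # [M', T, F]
    scales = torch.empty(m_total).uniform_(lambda_min, lambda_max)

    # sequence b is perturbed with its own n_b tracks (no sharing across sequences)
    mixed = features.clone()
    start = 0
    for b, end in enumerate(torch.cumsum(n, dim=0).tolist()):
        if end > start:
            mixed[b] = mixed[b] + (scales[start:end, None, None] * tracks[start:end]).sum(dim=0)
        start = end
    return mixed
```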

Comment on lines 17 to 19
lambda_min: minimum lambda value
lambda_max: maximum lambda value
max_num_mix: maximum number of mixups (random int in [1, max_num_mix])

I know that lambda_min is the minimum "lambda", but the docstring should tell me what exactly "lambda" is and what it does (e.g. an SNR? the length of the perturbed audio in seconds? ...).

what is "one mixup"? if I put it to 10, will there be 10 different audio tracks mixed into each target or will only 10 out of the million sentences ever be affected by mixup?

i6_models/primitives/mixup.py: several review threads, now resolved
@Judyxujj's comment was marked as resolved.

@Judyxujj (Contributor, Author) commented May 23, 2024

> In this implementation
>
>   • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
>   • 4 different audio tracks are selected from the buffered data
>   • each sequence is perturbed with track 1
>   • the sequences with n<=2 are perturbed with track 1 and 2
>   • the sequences with n<=3 are perturbed with track 1, 2, and 3 (and so on)
>
> Is there any reason to make it so complicated?
>
> What I would have done:
>
>   • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
>   • \sum_b n_b different audio tracks are selected from the buffered data (this is M' from the inline comments)
>   • each sequence is perturbed with n separate tracks
>
> Bonus implementation: with probability 1 - apply_prob, n_b is set to 0, i.e. no mixup for a specific sequence.

I think there is some misunderstanding here. In the current implementation, for each sequence, n is randomly selected as the number of times to apply mixup, and we also have n corresponding mixup scales (they are normalised by multiplying with lambda_/(sum of the n mixup scales)). Then n feature sequences are randomly selected from the feature buffer, and the n mixup scales are multiplied element-wise with these n feature sequences. I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.
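
A minimal sketch of the normalisation step described here, for a single sequence; the values of `n`, `lambda_`, `buffer`, and `features` are just placeholders, and the real code works batched and takes sequence lengths into account:

```python
import torch

n, lambda_ = 3, 0.3                      # number of mixups and total mixing weight (example values)
T, F = 100, 80
buffer = torch.randn(10, T, F)           # stand-in feature buffer
features = torch.randn(T, F)             # one input sequence

# n raw mixup scales, normalised so that they sum to lambda_
raw_scales = torch.rand(n)
scales = raw_scales * (lambda_ / raw_scales.sum())   # scales.sum() == lambda_

# n feature sequences randomly selected from the buffer, shape [n, T, F]
noise = buffer[torch.randint(0, buffer.shape[0], (n,))]

# element-wise scaling and summation into the input sequence, shape [T, F]
mixed = features + (scales[:, None, None] * noise).sum(dim=0)
```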

Jingjing Xu added 2 commits May 23, 2024 13:54
@michelwi (Contributor)

> I think there is some misunderstanding here. In the current implementation, for each sequence, n is randomly selected as the number of times to apply mixup, and we also have n corresponding mixup scales (they are normalised by multiplying with lambda_/(sum of the n mixup scales)). Then n feature sequences are randomly selected from the feature buffer, and the n mixup scales are multiplied element-wise with these n feature sequences. I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.

My suggestion was to sample M' sequences of features instead of n, where M' = \sum_s n_s.

E.g. if there is a batch of 3 sequences and the sampled n's are [2, 4, 2], then in (my understanding of) the current implementation 4 noise sequences are sampled, while in my suggestion it would be 2+4+2=8.

So am I wrong in my understanding of the current implementation or is my suggestion unclear?

> I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.

We can/should keep correct normalization of the mixup scales, I agree with that.
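
To make the [2, 4, 2] example above concrete, a tiny sketch of the bookkeeping in the suggested variant: M' = 8 noise sequences are sampled and then partitioned so that no two sequences share a track (illustrative names, not the merged code):

```python
import torch

T, F = 100, 80
buffer = torch.randn(50, T, F)           # stand-in feature buffer

n = torch.tensor([2, 4, 2])              # sampled n_b per sequence in the batch
m_total = int(n.sum())                   # M' = 8 noise sequences in total
tracks = buffer[torch.randint(0, buffer.shape[0], (m_total,))]   # [8, T, F]

# each sequence gets its own slice of the sampled tracks: [2, T, F], [4, T, F], [2, T, F]
per_seq_tracks = torch.split(tracks, n.tolist())
```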

@Judyxujj (Contributor, Author) commented May 23, 2024

> My suggestion was to sample M' sequences of features instead of n, where M' = \sum_s n_s.
>
> E.g. if there is a batch of 3 sequences and the sampled n's are [2, 4, 2], then in (my understanding of) the current implementation 4 noise sequences are sampled, while in my suggestion it would be 2+4+2=8.

Okay, now I understand what you meant. But why is it simpler than the current one? We still need to do the scale normalisation (if I understand it correctly, your 'perturbed' is actually the scale normalisation, right?).

@Judyxujj closed this May 23, 2024
@Judyxujj reopened this May 23, 2024
@Judyxujj marked this pull request as ready for review May 23, 2024 12:50
@Judyxujj requested review from albertz and michelwi May 23, 2024 12:50
@michelwi (Contributor)

> Okay, now I understand what you meant. But why is it simpler than the current one?

haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.

@Judyxujj closed this May 23, 2024
@Judyxujj reopened this May 23, 2024
@Judyxujj (Contributor, Author)

> Okay, now I understand what you meant. But why is it simpler than the current one?
>
> haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.

Using a linear combination of the same random sequences with different mixup scales should also give different noises. But I agree that it might be more robust to use different noise sequences. Albert and Zoltan both did mixup in this way. @albertz do you think it would help?

@albertz (Member) commented May 23, 2024

> haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.
>
> Using a linear combination of the same random sequences with different mixup scales should also give different noises. But I agree that it might be more robust to use different noise sequences. Albert and Zoltan both did mixup in this way. @albertz do you think it would help?

But we already have different noises for different sequences? lambda_ is of shape [B,M], i.e. every mixup of every sequence will get a different noise.
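
For reference, a small sketch of how a [B, M]-shaped lambda_ gives every mixup of every sequence its own weight; the shapes are taken from the comment above, everything else (the random values, the broadcasting layout) is illustrative:

```python
import torch

B, M, T, F = 3, 4, 100, 80
features = torch.randn(B, T, F)
noise = torch.randn(B, M, T, F)          # M noise tracks per sequence
lambda_ = torch.rand(B, M)               # one scale per mixup of every sequence

# broadcast the per-mixup scales over time and feature dims, then sum over the M mixups
mixed = features + (lambda_[:, :, None, None] * noise).sum(dim=1)   # [B, T, F]
```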

@michelwi (Contributor)

> i.e. every mixup of every sequence will get a different noise.

Then I must have misread the code and am happy now.

@albertz (Member) commented Jun 19, 2024

From a quick glance, this looks ok now. Just to verify: This is following now exactly the same algorithm as in my RF implementation, or are there some differences? (I would prefer if we do exactly the same as a starting point, and then we can discuss, maybe after some experiments, whether we want to change/extend sth.)

@curufinwe merged commit 56bf9fa into main Jun 20, 2024
4 checks passed
@albertz deleted the jing-mixup branch June 20, 2024 09:55