
Add the data augmentation method mixup #47

Merged: 19 commits merged into main from jing-mixup on Jun 20, 2024

Conversation

@Judyxujj (Contributor) commented Apr 18, 2024

  • This implementation of mixup only mixes up the audio inputs x, not the targets y (see the sketch below).
  • The implementation is partly based on Albert's implementation in the returnn front end (cf. https://github.com/rwth-i6/i6_experiments/blob/main/users/zeyer/returnn/models/rf_mixup.py) and Zoltan's implementation at Apptek.
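
For context, a minimal sketch of this input-only mixup idea; the names `features`, `noise`, and `mix_scale` are illustrative and not the PR's actual API, and the targets y are deliberately left untouched:

```python
import torch


def mixup_inputs_only(features: torch.Tensor, noise: torch.Tensor, mix_scale: float) -> torch.Tensor:
    """Mix other audio sequences into the inputs; the targets y stay unchanged.

    features: [B, T, F] batch of audio features
    noise:    [B, T, F] features of other sequences drawn from a buffer
    mix_scale: weight of the mixed-in audio (the "lambda" discussed below)
    """
    return features + mix_scale * noise
```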

@albertz's comment was marked as resolved.

@michelwi (Contributor) left a comment

In this implementation

  • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
  • 4 different audio tracks are selected from the buffered data
  • each sequence is perturbed with track 1
  • the sequences with n<=2 are perturbed with track 1 and 2
  • the sequences with n<=3 are perturbed with track 1, 2, and 3 (and so on)

Is there any reason to make it so complicated?

What I would have done:

  • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
  • \sum_b n_b different audio tracks are selected from the buffered data (this is M' from the inline comments)
  • each sequence is perturbed with n separate tracks

Bonus implementation: with probability 1 - apply_prob, n_b is set to 0, i.e. no mixup for a specific sequence. (A rough sketch of this variant follows below.)
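
A rough sketch of this suggested variant, assuming all buffered sequences share the same length T; the names (`buffer`, `apply_prob`) and the uniform lambda sampling are assumptions for illustration, not the merged implementation:

```python
import torch


def mixup_suggested(
    features: torch.Tensor,  # [B, T, F] input batch
    buffer: torch.Tensor,    # [N, T, F] previously seen feature sequences
    max_num_mix: int = 4,
    apply_prob: float = 1.0,
    lambda_min: float = 0.1,
    lambda_max: float = 0.4,
) -> torch.Tensor:
    B = features.shape[0]
    # draw n_b mixups per sequence, zeroed out with probability 1 - apply_prob
    n = torch.randint(1, max_num_mix + 1, (B,))
    n = torch.where(torch.rand(B) < apply_prob, n, torch.zeros_like(n))
    m_total = int(n.sum())  # M' = \sum_b n_b
    if m_total == 0:
        return features

    # sample M' different tracks from the buffer, each with its own scale
    idx = torch.randint(0, buffer.shape[0], (m_total,))
    tracks = buffer[idx]                                    # [M', T, F]
    scales = torch.empty(m_total).uniform_(lambda_min, lambda_max)

    # sequence b is perturbed with its own n_b tracks (no sharing across sequences)
    mixed = features.clone()
    start = 0
    for b, end in enumerate(torch.cumsum(n, dim=0).tolist()):
        if end > start:
            mixed[b] = mixed[b] + (scales[start:end, None, None] * tracks[start:end]).sum(dim=0)
        start = end
    return mixed
```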

Comment on lines 17 to 19
lambda_min: minimum lambda value
lambda_max: maximum lambda value
max_num_mix: maximum number of mixups (random int in [1, max_num_mix])

I know that lambda_min is the minimum "lambda", but the docstring should tell me what exactly "lambda" is and what it does (e.g. an SNR? the length of the perturbed audio in seconds? ...).

what is "one mixup"? if I put it to 10, will there be 10 different audio tracks mixed into each target or will only 10 out of the million sentences ever be affected by mixup?

i6_models/primitives/mixup.py: several review threads, now resolved
@Judyxujj's comment was marked as resolved.

@Judyxujj (Contributor, Author) commented May 23, 2024

> In this implementation
>
>   • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
>   • 4 different audio tracks are selected from the buffered data
>   • each sequence is perturbed with track 1
>   • the sequences with n<=2 are perturbed with track 1 and 2
>   • the sequences with n<=3 are perturbed with track 1, 2, and 3 (and so on)
>
> Is there any reason to make it so complicated?
>
> What I would have done:
>
>   • for each sequence a random number n between 1 and max_num_mix (e.g. 4) is drawn
>   • \sum_b n_b different audio tracks are selected from the buffered data (this is M' from the inline comments)
>   • each sequence is perturbed with n separate tracks
>
> Bonus implementation: with probability 1 - apply_prob, n_b is set to 0, i.e. no mixup for a specific sequence.

I think there is some misunderstanding here. In the current implementation, for each sequence, n is randomly selected as the number of times to apply mixup, and we also have n corresponding mixup scales (they are normalised by multiplying with lambda_/(sum of the n mixup scales)). Then n feature sequences are randomly selected from the feature buffer, and the n mixup scales are multiplied element-wise with these n feature sequences. I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.
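
A minimal sketch of the normalisation step described here, for a single sequence; the values of `n`, `lambda_`, `buffer`, and `features` are just placeholders, and the real code works batched and takes sequence lengths into account:

```python
import torch

n, lambda_ = 3, 0.3                      # number of mixups and total mixing weight (example values)
T, F = 100, 80
buffer = torch.randn(10, T, F)           # stand-in feature buffer
features = torch.randn(T, F)             # one input sequence

# n raw mixup scales, normalised so that they sum to lambda_
raw_scales = torch.rand(n)
scales = raw_scales * (lambda_ / raw_scales.sum())   # scales.sum() == lambda_

# n feature sequences randomly selected from the buffer, shape [n, T, F]
noise = buffer[torch.randint(0, buffer.shape[0], (n,))]

# element-wise scaling and summation into the input sequence, shape [T, F]
mixed = features + (scales[:, None, None] * noise).sum(dim=0)
```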

Jingjing Xu added 2 commits May 23, 2024 13:54
@michelwi (Contributor)

> I think there is some misunderstanding here. In the current implementation, for each sequence, n is randomly selected as the number of times to apply mixup, and we also have n corresponding mixup scales (they are normalised by multiplying with lambda_/(sum of the n mixup scales)). Then n feature sequences are randomly selected from the feature buffer, and the n mixup scales are multiplied element-wise with these n feature sequences. I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.

My suggestion was to sample M' sequences of features instead of n, where M' = \sum_s n_s.

E.g. if there is a batch of 3 sequences and the sampled n's are [2, 4, 2], then in (my understanding of) the current implementation 4 noise sequences are sampled, while in my suggestion it would be 2+4+2=8.

So am I wrong in my understanding of the current implementation or is my suggestion unclear?

> I think the current implementation is the same as you suggested, except that we also apply normalisation to the mixup scales.

We can/should keep correct normalization of the mixup scales, I agree with that.
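
To make the [2, 4, 2] example above concrete, a tiny sketch of the bookkeeping in the suggested variant: M' = 8 noise sequences are sampled and then partitioned so that no two sequences share a track (illustrative names, not the merged code):

```python
import torch

T, F = 100, 80
buffer = torch.randn(50, T, F)           # stand-in feature buffer

n = torch.tensor([2, 4, 2])              # sampled n_b per sequence in the batch
m_total = int(n.sum())                   # M' = 8 noise sequences in total
tracks = buffer[torch.randint(0, buffer.shape[0], (m_total,))]   # [8, T, F]

# each sequence gets its own slice of the sampled tracks: [2, T, F], [4, T, F], [2, T, F]
per_seq_tracks = torch.split(tracks, n.tolist())
```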

@Judyxujj (Contributor, Author) commented May 23, 2024

> My suggestion was to sample M' sequences of features instead of n, where M' = \sum_s n_s.
>
> E.g. if there is a batch of 3 sequences and the sampled n's are [2, 4, 2], then in (my understanding of) the current implementation 4 noise sequences are sampled, while in my suggestion it would be 2+4+2=8.

Okay, now I understand what you meant. But why is it simpler than the current one? We still need to do the scale normalisation (if I understand it correctly, your 'perturbed' is actually the scale normalisation, right?).

@Judyxujj closed this May 23, 2024
@Judyxujj reopened this May 23, 2024
@Judyxujj marked this pull request as ready for review May 23, 2024 12:50
@Judyxujj requested review from albertz and michelwi May 23, 2024 12:50
@michelwi (Contributor)

> Okay, now I understand what you meant. But why is it simpler than the current one?

haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.

@Judyxujj closed this May 23, 2024
@Judyxujj reopened this May 23, 2024
@Judyxujj (Contributor, Author)

> Okay, now I understand what you meant. But why is it simpler than the current one?
>
> haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.

Using a linear combination of the same random sequences with different mixup scales should also give different noises. But I agree that it might be more robust to use different noise sequences. Albert and Zoltan both did mixup in this way. @albertz do you think it would help?

@albertz (Member) commented May 23, 2024

> haha, it is not simpler. More complicated actually. But I thought it might be helpful for the training to have different noises for different sequences.
>
> Using a linear combination of the same random sequences with different mixup scales should also give different noises. But I agree that it might be more robust to use different noise sequences. Albert and Zoltan both did mixup in this way. @albertz do you think it would help?

But we already have different noises for different sequences? lambda_ is of shape [B,M], i.e. every mixup of every sequence will get a different noise.
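
For reference, a small sketch of how a [B, M]-shaped lambda_ gives every mixup of every sequence its own weight; the shapes are taken from the comment above, everything else (the random values, the broadcasting layout) is illustrative:

```python
import torch

B, M, T, F = 3, 4, 100, 80
features = torch.randn(B, T, F)
noise = torch.randn(B, M, T, F)          # M noise tracks per sequence
lambda_ = torch.rand(B, M)               # one scale per mixup of every sequence

# broadcast the per-mixup scales over time and feature dims, then sum over the M mixups
mixed = features + (lambda_[:, :, None, None] * noise).sum(dim=1)   # [B, T, F]
```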

@michelwi (Contributor)

> i.e. every mixup of every sequence will get a different noise.

Then I must have misread the code and am happy now.

@albertz (Member) commented Jun 19, 2024

From a quick glance, this looks ok now. Just to verify: This is following now exactly the same algorithm as in my RF implementation, or are there some differences? (I would prefer if we do exactly the same as a starting point, and then we can discuss, maybe after some experiments, whether we want to change/extend sth.)

@curufinwe merged commit 56bf9fa into main Jun 20, 2024
4 checks passed
@albertz deleted the jing-mixup branch June 20, 2024 09:55