Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix previous text prepending #142

Merged
merged 1 commit into from
Jan 8, 2025
Merged

Conversation

bofenghuang
Copy link
Contributor

Hi 👋,

Thank you for continuously adding more features to the Whisper distillation code!

As I reviewed the section on prepending previous text during the preparation of training data, I made the following adjustments based on my interpretation:

  1. Moved the prepending of decoder_prev_token_id to the end to ensure it's always triggered, even when prev_ids aren't cut by the previous two conditions
  2. Updated the total length check to len(prev_ids + token_ids) + 1, which now includes decoder_prev_token_id since it's always added
  3. Removed prev_ids from the trim_length calculation. For instance, with 3 prev_ids and 3 token_ids and a max_label_length of 6, we should retain only the last 2 tokens in prev_ids, calculated as max_label_length - len(token_ids) - 1 = 6 - 3 - 1 = 2

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catches, thanks @bofenghuang!

cc @eustlb feel free to merge if you're happy

@eustlb
Copy link
Collaborator

eustlb commented Jan 8, 2025

Thanks a lot @bofenghuang !
LGTM 😊
I would have written it (for some reason, it is more intuitive for me to count from left to right for prev_ids):

if len(prev_ids + token_ids) > max_label_length - 1:
     trim_length = len(token_ids) + len(prev_ids) - max_label_length + 1

Yet it is equivalent and a detail, so let's merge!

@eustlb eustlb merged commit cc96130 into huggingface:main Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants