Add Distillation with a Chunked, Fused Linear JS-divergence Loss #408
Draft
austin362667 wants to merge 14 commits into linkedin:main from austin362667:feat/distillation/jsd
+1,127 −2
Conversation
Commits (all signed off by Austin Liu <[email protected]>):
- Add Testing Naive Distillation Base
- Add Chunked JSD Tests and Benchmarks
- Fix call
- Fix Test Usage
- Remove beta
- Fix test params
- Fix ignore_index
- Fix weights dimension
- Fix assign dimension
- Fix teacher bias
- Reshape input
- Fix mean
- Remove alpha
- Fix t / Fix t scaling
- Remove teacher tests
- Fix beta
- WIP (many intermediate iterations)
- Clean up
- Format
- Fix
- Fix tol
austin362667 changed the title from "Add Support for Knowledge Distillation with a chunked, fused linear JS-divergence Loss" to "Add Distillation with a Chunked, Fused Linear JS-divergence Loss" on Nov 27, 2024
Summary
Knowledge Distillation
Knowledge Distillation (KD; Hinton et al. 2015, Gou et al. 2020) is a straightforward way to build a smaller, cheaper model (the "student model") that speeds up inference by transferring the skills of an expensive pre-trained model (the "teacher model") into the student.
In knowledge distillation, a student model is trained to replicate the outputs of a teacher model using a distillation loss. Neural networks typically end in a softmax layer; a large language model, for instance, produces a probability distribution over tokens. Let $z_t$ and $z_s$ denote the logits before the softmax layer for the teacher and student models, respectively. The distillation loss reduces the discrepancy between the two softmax outputs at a high temperature $T$. When ground-truth labels $y$ are available, this approach can be combined with a supervised learning objective, such as cross-entropy, to compare the student's outputs with the ground truth. The combined loss function is defined as:

$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{distill}}(z_t, z_s; T) + (1 - \lambda) \, \mathcal{L}_{\mathrm{CE}}(y, z_s)$$

Here, $\lambda$ is a hyperparameter that balances the distillation loss and the supervised objective.
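As a concrete reference, here is a minimal PyTorch sketch of this combined objective. It is illustrative only, not the fused implementation in this PR; the function and argument names are placeholders:

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    # Soft-target loss: KL divergence between the temperature-scaled
    # teacher and student distributions. Scaling by T**2 keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
    # Hard-target loss: standard cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # lam balances the distillation term against the supervised term.
    return lam * soft_loss + (1.0 - lam) * hard_loss
```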
Shared DistillationBase
To support various distillation learning objectives, this PR adds a `LigerFusedLinearDistillationBase`, which is essentially the same design as proposed by @hongpeng-guo in discussion #371 (comment). Thank you @hongpeng-guo for thinking this through. A simplified sketch of the chunked pattern follows.
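For intuition, the chunked fused-linear pattern computes the projection from hidden states to logits and the loss chunk by chunk, so the full `(batch * seq_len, vocab)` logit matrices of student and teacher are never materialized at once. The sketch below is an illustrative simplification with hypothetical names, not the actual base-class code (which also fuses the backward pass into the chunk loop):

```python
import torch

def chunked_distillation_loss(student_hidden, student_weight,
                              teacher_hidden, teacher_weight,
                              loss_fn, chunk_size=1024):
    # student_hidden / teacher_hidden: (batch * seq_len, hidden)
    # student_weight / teacher_weight: (vocab, hidden) lm-head weights
    total, n = 0.0, student_hidden.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Only a (chunk, vocab) slice of the logits exists at any time.
        student_logits = student_hidden[start:end] @ student_weight.t()
        with torch.no_grad():  # the teacher is frozen
            teacher_logits = teacher_hidden[start:end] @ teacher_weight.t()
        total = total + loss_fn(student_logits, teacher_logits) * (end - start)
    return total / n
```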
Jensen-Shannon Divergence Loss
In addition to adding the base class, this PR implements Jensen-Shannon Divergence (JSD) loss as the soft learning objective in the distillation setting. This component can be swapped for other losses (e.g., KL divergence) via `distillation_loss_fn`. JSD is defined as the average of the KL divergences between each distribution and their mean:

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \frac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$

Here, $P$ and $Q$ are the two probability distributions, and $M$ is their average.
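A plain-PyTorch reference implementation of this definition, for clarity only (illustrative, not the PR's kernel code):

```python
import torch
import torch.nn.functional as F

def jsd_loss(student_logits, teacher_logits):
    # P and Q: softmax distributions of teacher and student; M is their average.
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    m = 0.5 * (p + q)
    log_m = torch.log(m)
    # F.kl_div(input, target) computes KL(target || exp(input)), so:
    # JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
    kl_p_m = F.kl_div(log_m, p, reduction="batchmean")
    kl_q_m = F.kl_div(log_m, q, reduction="batchmean")
    return 0.5 * (kl_p_m + kl_q_m)
```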
TODO
Testing Done
Yes.
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence