Knowledge distillation #1035

Open
yangliuxin-nn opened this issue Jan 5, 2025 · 4 comments
Labels
question Further information is requested

Comments

@yangliuxin-nn

Hi team, could you please explain how to perform knowledge distillation to obtain this model, and how we can fine-tune the model based on the distilled version? Thanks a lot!

@yangliuxin-nn added the bug (Something isn't working) label on Jan 5, 2025
@kylesayrs
Collaborator

Hi @yangliuxin-nn,

That model is missing a recipe.yaml file! Despite the missing file, it's likely that this model was compressed using a script similar to examples/trl_mixin/ex_trl_distillation.py. I'm currently reaching out to the research team that compressed this model to confirm this.

@yangliuxin-nn
Author

Thanks @kylesayrs. I have a few questions about examples/trl_mixin/ex_trl_distillation.py:

  1. I noticed 'test_stage' in the YAML configuration - what does this stage represent?
  2. Does it make sense to switch from ConstantPruningModifier to SparseGPT in examples/trl_mixin/ex_trl_distillation.py?
  3. After saving the model, what's the correct way to load it? And should we expect to match the published performance metrics after pruning and distillation?

Many thanks for your time!

@eldarkurtic
Collaborator

Hi @yangliuxin-nn,

We haven't fully open-sourced the dataset mix used to produce our pretrained 2:4 Sparse Llama model, so reproducing it on your side may be challenging, not to mention the significant GPU resources the process requires.

For fine-tuning the sparse model with distillation on your target dataset, we used a custom fork of MosaicML's llm-foundry, but you're welcome to use any framework you're comfortable with. If you decide to use your own fine-tuning framework, you'll need to implement two key features (a rough sketch of both follows the list below):

  1. Masking of sparse weights
  2. Knowledge distillation
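
In plain PyTorch, those two pieces look roughly like this. This is an illustrative sketch only, not our llm-foundry fork; all function names and hyperparameters below are placeholders:

```python
import torch
import torch.nn.functional as F

def collect_sparsity_masks(model):
    """Record which weights are already zero so the pruning pattern can be preserved."""
    return {
        name: (param != 0).detach()
        for name, param in model.named_parameters()
        if param.dim() > 1  # weight matrices only
    }

@torch.no_grad()
def apply_sparsity_masks(model, masks):
    """Call after every optimizer step so pruned weights stay exactly zero."""
    for name, param in model.named_parameters():
        if name in masks:
            param[~masks[name]] = 0.0

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term against the teacher's logits."""
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1.0 - alpha) * kl
```

The loop itself is the usual pattern: compute student and teacher logits on the same batch (teacher in no_grad/eval mode), take kd_loss, backprop, optimizer.step(), then apply_sparsity_masks(student, masks) so the sparsity pattern survives the update.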

Could you share more details about your setup, such as the model, dataset type and size, number of GPUs, fine-tuning framework, etc.? That will help me provide a more tailored response.

@kylesayrs
Collaborator

Hi @yangliuxin-nn,

  1. test_stage is just a cosmetic name for the stage. In this example it's a bit misleading; a better name would be compression_stage.
  2. ConstantPruningModifier is a modifier that maintains existing sparsity. In this example, it preserves the existing sparsity of the base model, "neuralmagic/Llama-2-7b-pruned50-retrained". For more information, see this explanation.
  3. We highly recommend using vLLM to load and run inference with the model.

For your case, I recommend compressing your model in two steps. In the first step, use SparseGPT to prune your model (see examples/llama3_8b_2of4.py). When saving the model, use save_pretrained(save_compressed=False).
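
A rough sketch of step one, loosely following that example. The base model, dataset name, calibration settings, and output path below are placeholders; check the example script for the exact arguments:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One-shot 2:4 structured pruning with SparseGPT
oneshot(
    model=model,
    dataset="open_platypus",          # any calibration dataset works
    recipe=SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save dense-format (uncompressed) weights so they can be loaded normally for finetuning
model.save_pretrained("Llama-3-8B-2of4-sparse", save_compressed=False)
tokenizer.save_pretrained("Llama-3-8B-2of4-sparse")
```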

In the second step, load the model with AutoModel.from_pretrained and perform KD finetuning, as described by @eldarkurtic or by using examples/trl_mixin/ex_trl_distillation.py. Save your model with save_pretrained(save_compressed=True), and then load it with vLLM.
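
Putting step two together, roughly (paths are placeholders; the save_compressed kwarg assumes llm-compressor has wrapped save_pretrained, as it does when you run its oneshot/training entrypoints):

```python
from transformers import AutoModelForCausalLM

# Reload the uncompressed sparse checkpoint produced in step one
model = AutoModelForCausalLM.from_pretrained("Llama-3-8B-2of4-sparse", torch_dtype="auto")

# ... run KD finetuning here, e.g. via examples/trl_mixin/ex_trl_distillation.py ...

# Save in compressed format for inference
# (save_compressed is available because llm-compressor wraps save_pretrained during training)
model.save_pretrained("Llama-3-8B-2of4-sparse-kd", save_compressed=True)

# Load and run with vLLM
from vllm import LLM, SamplingParams
llm = LLM(model="Llama-3-8B-2of4-sparse-kd")
out = llm.generate(["The benefits of 2:4 sparsity are"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```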

@kylesayrs added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jan 17, 2025