
Conversation

@Dhiraj309

📝 Suggested PR Description

What does this PR add?

This PR introduces transformers_distillation, a lightweight library built on top of the 🤗 Transformers ecosystem to make knowledge distillation of language models simple, flexible, and reproducible.

Key features:

  • 📦 Drop-in usage with Hugging Face models — no extra setup needed.
  • 🔧 Built directly on top of transformers.Trainer, so researchers and practitioners can reuse the familiar API with minimal overhead (see the sketch after this list).
  • 🧑‍💻 Designed for small teams or even individual contributors to manage training, evaluation, and experimentation without extra complexity.
  • ✅ Includes working examples (examples/) and tests (tests/) for CausalLM, Seq2SeqLM, and MLM tasks.
  • ⚡ Full compatibility with existing Trainer features like callbacks, logging, evaluation, distributed training (multi-GPU, TPU), and Hugging Face Hub integration.
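
A rough sketch of the pattern behind the Trainer bullet above (not the library's exact implementation; the class name, constructor arguments, and loss weighting are assumptions) is to subclass transformers.Trainer and override compute_loss so the student's hard-label loss is blended with a temperature-scaled KL term computed against a frozen teacher:

```python
# Rough sketch only: subclass transformers.Trainer and blend the student's
# hard-label loss with a temperature-scaled KL term against a frozen teacher.
# The class name and constructor arguments are assumptions, not the PR's API.
import torch
import torch.nn.functional as F
from transformers import Trainer


class DistillTrainer(Trainer):
    def __init__(self, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model.eval()  # teacher stays frozen
        self.temperature = temperature       # softens both distributions
        self.alpha = alpha                   # weight of the distillation term

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        student_loss = outputs.loss  # standard CE against the hard labels

        with torch.no_grad():  # assumes the teacher sits on the training device
            teacher_logits = self.teacher(**inputs).logits

        T = self.temperature
        kd_loss = F.kl_div(
            F.log_softmax(outputs.logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # classic rescaling so gradient magnitudes stay comparable

        loss = self.alpha * kd_loss + (1.0 - self.alpha) * student_loss
        return (loss, outputs) if return_outputs else loss
```

Because everything else is inherited from Trainer, the usual training loop, checkpointing, and mixed-precision handling come for free.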

Example usage

I’ve demonstrated this in a public Kaggle notebook using a small dataset, showing how easy it is to run end-to-end distillation with just a few lines of code; a rough sketch of such a run follows the link below.
(Link: https://www.kaggle.com/code/dignity45/transformer-distill-trainer-knowledge-distillation)
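
For context, a hypothetical end-to-end run could look like the snippet below, reusing the DistillTrainer sketched earlier; the model names, dataset slice, and hyperparameters are placeholders for illustration, not values taken from the notebook or the library:

```python
# Hypothetical end-to-end run reusing the DistillTrainer sketched above.
# Model names, dataset slice, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

teacher = AutoModelForCausalLM.from_pretrained("gpt2-medium")
student = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = enc["input_ids"].copy()  # causal-LM targets
    return enc

train_ds = (
    load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
    .map(tokenize, batched=True, remove_columns=["text"])
)

trainer = DistillTrainer(
    model=student,
    teacher_model=teacher,  # move to the training device beforehand if needed
    temperature=2.0,
    alpha=0.5,
    args=TrainingArguments(
        output_dir="distilled-gpt2",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        logging_steps=50,
    ),
    train_dataset=train_ds,
)
trainer.train()
```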


Why this matters

  • Distillation is becoming increasingly important for efficient model deployment.

  • Current pipelines often require custom code, but this library leverages the Trainer directly, meaning practitioners can (see the sketch after this list):

    • Reuse existing callbacks, logging, and evaluation tools.
    • Integrate seamlessly with Hugging Face Hub.
    • Scale training easily with distributed GPU/TPU setups.
    • Avoid reinventing the wheel.
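
To make those points concrete, the sketch below wires standard Trainer features (evaluation, early stopping, a logging backend, Hub upload) into the hypothetical DistillTrainer from earlier; the variable names are assumed from the previous sketch and nothing here is specific to this PR:

```python
# Standard Trainer features plug in unchanged; nothing below is specific to
# this PR. 'student', 'teacher', 'train_ds', and 'eval_ds' are assumed to
# exist (e.g. as in the previous sketch, plus a validation split).
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="distilled-student",
    eval_strategy="steps",            # periodic evaluation, as with any Trainer
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,                   # must line up with eval for best-model loading
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="tensorboard",          # any supported logging backend
    push_to_hub=True,                 # standard Hub integration (needs a Hub login)
)

trainer = DistillTrainer(
    model=student,
    teacher_model=teacher,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
# Distributed runs work the same way, e.g. `torchrun --nproc_per_node=4 train.py`
# or `accelerate launch train.py`; no distillation-specific changes are needed.
trainer.train()
```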

Next steps / Future directions

  • Potential discussion with the Hugging Face team about whether such functionality could evolve into official support inside transformers.
  • Broader support and benchmarking across distributed setups (multi-GPU, TPU, DeepSpeed, etc.) for larger-scale distillation experiments.

✅ Overall, this PR lowers the barrier for practitioners who want to experiment with knowledge distillation without needing large teams or custom infrastructure, while remaining fully compatible with the Hugging Face ecosystem.

…, MLM, Seq2SeqLM) Models From Scratch, Leveraging Knowledge Distillation
@Rocketknight1
Member

I think this makes more sense as a separate repository, not as part of Transformers itself!

@Dhiraj309
Author

> I think this makes more sense as a separate repository, not as part of Transformers itself!

Thanks for the feedback! 🙏 I agree it makes sense as a separate repo for now, but my longer-term goal is to align it with Trainer (much like Seq2SeqTrainer) so that Transformers could eventually offer a lightweight DistillTrainer.

@Dhiraj309
Author

@Rocketknight1, would you be able to offer any suggestions on this? It would be really helpful.

