[Model] Whisper model implementation #11280
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Some initial comments.
```python
if self.model_config.hf_config.model_type == "whisper":
    # For Whisper models, the text prompt should go to the decoder.
    # If no explicit encoder/decoder inputs, then copy the prompt
    # from the encoder to the decoder. The encoder tokens are later
    # overridden by the audio features.
    dec_token_ids = encoder_inputs["prompt_token_ids"].copy()
else:
    dec_token_ids = self._prepare_decoder_input_ids_for_generation(
        None)
```
Is there a way to determine this without model type information?
I am not sure about generalizing this from a single example. In the long term, it may be better to allow the model definition to specify exactly the mapping between input fields and where they go (e.g. encoder vs. decoder).
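For illustration, here is one hypothetical way such a mapping could be declared on the model definition. The `text_prompt_destination` attribute and `route_text_prompt` helper are made up for this sketch and are not an existing vLLM interface:

```python
import torch.nn as nn


# Hypothetical sketch, not an existing vLLM interface: the model class
# declares where the tokenized text prompt should be routed, so the input
# preprocessor no longer needs to special-case model_type == "whisper".
class WhisperStub(nn.Module):
    # "decoder": copy the text prompt tokens to the decoder; the encoder
    # token slots are later overwritten by the audio features.
    text_prompt_destination = "decoder"


def route_text_prompt(model_cls, encoder_inputs, make_default_decoder_ids):
    """Return decoder token ids based on the model's declared mapping."""
    if getattr(model_cls, "text_prompt_destination", "encoder") == "decoder":
        return list(encoder_inputs["prompt_token_ids"])
    # Default encoder/decoder behavior: start from the decoder start token(s).
    return make_default_decoder_ids()
```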
```python
        config.vocab_size, logit_scale)
        self.sampler = Sampler()

    def forward(
```
To support V1, the model needs to implement `get_input_embeddings` and `get_multimodal_embeddings`.
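For context, here is a rough sketch of the shape these two hooks typically take for decoder-only multimodal models; the toy class, tensor shapes, and exact signatures are approximations rather than vLLM's actual interface. The merge step in `get_input_embeddings` is the part that does not map cleanly onto Whisper, since its audio features feed cross-attention instead of replacing placeholder tokens:

```python
from typing import Optional

import torch
import torch.nn as nn


class ToyMultiModalModel(nn.Module):
    """Toy illustration of the two V1 hooks (signatures approximate)."""

    def __init__(self, vocab_size: int = 32, hidden: int = 8,
                 placeholder_token_id: int = 1) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.modality_encoder = nn.Linear(4, hidden)  # stand-in encoder
        self.placeholder_token_id = placeholder_token_id

    def get_multimodal_embeddings(self, **kwargs) -> Optional[torch.Tensor]:
        # Encode the extra modality into the text embedding space.
        features = kwargs.get("audio_features")
        if features is None:
            return None
        return self.modality_encoder(features)

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        inputs_embeds = self.embed_tokens(input_ids)
        if multimodal_embeddings is not None:
            # Decoder-only multimodal models scatter these embeddings into
            # the placeholder token positions. Whisper cannot follow this
            # pattern: its audio features feed the encoder/cross-attention.
            mask = input_ids == self.placeholder_token_id
            inputs_embeds[mask] = multimodal_embeddings.to(inputs_embeds.dtype)
        return inputs_embeds
```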
Done, but there is some awkwardness:
- `get_multimodal_embeddings` for Whisper also requires the kv-cache and attention metadata, which are not in the base method's interface.
- It looks like `get_input_embeddings` expects to embed multimodal tokens into the text tokens; again, this is not what Whisper does. Currently I just have this method return the decoder token embeddings.

I couldn't find anywhere in vLLM that uses these two methods, so it's hard for me to understand their intended usage. Happy to change the implementation to whatever works better.
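Roughly the shape described above, as a hedged sketch (class, parameter, and attribute names are illustrative, not the PR's actual code):

```python
from typing import Optional

import torch
import torch.nn as nn


class WhisperHooksSketch(nn.Module):
    """Illustrative sketch of the awkwardness described above."""

    def __init__(self, vocab_size: int = 32, hidden: int = 8) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.audio_encoder = nn.Linear(80, hidden)  # stand-in for the encoder

    def get_multimodal_embeddings(self, input_features: torch.Tensor,
                                  kv_caches, attn_metadata) -> torch.Tensor:
        # Deviates from the base interface: running the Whisper encoder here
        # needs the kv-cache and attention metadata for its attention layers.
        return self.audio_encoder(input_features)

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Audio features are consumed via cross-attention rather than merged
        # into the token sequence, so this just returns the decoder token
        # embeddings and ignores multimodal_embeddings.
        return self.embed_tokens(input_ids)
```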
Sorry, I forgot that V1 doesn't support encoder-decoder yet. Let's just add a TODO and come back to this later.
Added TODOs.
Just some initial comments about the model implementation :)
Add Whisper model implementation. Based on #5964 but heavily optimized and integrated with the newer encoder/decoder support in vLLM. Currently only supports audio up to 30 s; there is no VAD model, no beam search, no phoneme model, no forced alignment, etc.
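For reference, a hedged sketch of how transcription might be invoked through the offline API once this lands; the model name, prompt structure, and parameters below are assumptions based on vLLM's encoder/decoder multimodal inputs, not necessarily what this PR exposes:

```python
# Hedged usage sketch: the model name and the exact encoder/decoder prompt
# structure are assumptions and may differ from what the PR finally exposes.
import librosa

from vllm import LLM, SamplingParams

# Clips are limited to 30 s of audio in this PR.
waveform, sr = librosa.load("sample.wav", sr=16000)

llm = LLM(model="openai/whisper-large-v3", max_model_len=448)

prompt = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": (waveform, sr)},
    },
    "decoder_prompt": "<|startoftranscript|>",
}

outputs = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```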
Performance overall looks OK, especially at larger batch sizes.
FIX #180