
[Model] Whisper model implementation #11280

Open · wants to merge 39 commits into main

Conversation

aurickq
Contributor

@aurickq commented Dec 18, 2024

Add Whisper model implementation. Based on #5964 but heavily optimized and integrated with the newer encoder/decoder support in vLLM. Currently only supports audio up to 30s. There is no VAD model, no beam search, no phoneme model, no forced alignment, etc.

Performance overall looks OK, especially at larger batch sizes.

[Screenshot: performance benchmark results, Dec 18, 2024]

FIX #180
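
For context, transcribing a short clip through this implementation might look roughly like the sketch below. The model checkpoint, prompt fields, and multi_modal_data format are assumptions based on vLLM's existing multimodal API rather than something this PR pins down, so treat it as illustrative only.

# Illustrative sketch only: model name, prompt structure, and audio format are
# assumptions; the PR adds the model itself, not necessarily this exact usage.
import numpy as np
from vllm import LLM, SamplingParams

# Dummy 5-second silent clip at 16 kHz; real usage would load actual audio.
waveform = np.zeros(16_000 * 5, dtype=np.float32)

llm = LLM(model="openai/whisper-large-v3")  # hypothetical checkpoint choice

outputs = llm.generate(
    {
        # The text prompt is routed to the decoder; the audio clip (<= 30 s
        # per this PR) becomes the encoder input features.
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": (waveform, 16_000)},
    },
    SamplingParams(temperature=0, max_tokens=200),
)
print(outputs[0].outputs[0].text)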


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

Member

@DarkLight1337 left a comment


Some initial comments.

vllm/worker/enc_dec_model_runner.py (review comment, outdated, resolved)
vllm/transformers_utils/tokenizer_group/tokenizer_group.py (review comment, outdated, resolved)
vllm/model_executor/models/registry.py (review comment, resolved)
Comment on lines +442 to +450
if self.model_config.hf_config.model_type == "whisper":
    # For Whisper models, the text prompt should go to the decoder.
    # If no explicit encoder/decoder inputs, then copy the prompt
    # from the encoder to the decoder. The encoder tokens are later
    # overridden by the audio features.
    dec_token_ids = encoder_inputs["prompt_token_ids"].copy()
else:
    dec_token_ids = self._prepare_decoder_input_ids_for_generation(
        None)
Member

Is there a way to determine this without model type information?

Contributor Author

I am not sure about generalizing this from a single example. In the long term, it may be better to allow the model definition to specify exactly the mapping between input fields and where they go (e.g. encoder vs. decoder).
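
To make that suggestion concrete, here is a purely hypothetical sketch; none of these class or attribute names exist in vLLM, they only illustrate how a model definition could declare where each input field is routed so the runner would not need to check model_type.

# Hypothetical sketch: a declarative field-to-destination mapping on the model
# definition. InputFieldMapping / input_field_mapping are invented names for
# illustration and are not part of vLLM.
from dataclasses import dataclass
from typing import Literal

Destination = Literal["encoder", "decoder"]

@dataclass(frozen=True)
class InputFieldMapping:
    text_prompt: Destination
    multi_modal_data: Destination

class WhisperSketch:
    # Whisper: audio features feed the encoder, text tokens feed the decoder.
    input_field_mapping = InputFieldMapping(
        text_prompt="decoder",
        multi_modal_data="encoder",
    )

# The model runner could then branch on the declared mapping instead of on
# hf_config.model_type == "whisper".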

vllm/model_executor/models/whisper.py (review comment, resolved)
            config.vocab_size, logit_scale)
        self.sampler = Sampler()

    def forward(
Member

To support V1, the model needs to implement get_input_embeddings and get_multimodal_embeddings.

Contributor Author

Done, but there is some awkwardness:

  • get_multimodal_embeddings for Whisper also requires kv-cache and attention metadata, which are not in the base method's interface.
  • It looks like get_input_embeddings expects to embed multimodal tokens into text tokens. Again, this is not what Whisper does. Currently I just have this method return the decoder token embeddings.

I couldn't find anywhere in vLLM that uses these two methods, so it's hard for me to understand their intended usage. Happy to change the implementation to whatever works better.
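
For reference, the mismatch described above could be sketched roughly as follows; the class name, signatures, and the extra kv_caches/attn_metadata parameters are assumptions made for illustration, not the actual vLLM V1 interface.

# Sketch of the awkward fit described above; names and signatures are
# assumptions for illustration only.
import torch
import torch.nn as nn

class WhisperV1Sketch(nn.Module):
    def __init__(self, vocab_size: int = 51866, d_model: int = 1280):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)

    def get_multimodal_embeddings(self, audio_input, kv_caches, attn_metadata):
        # Point 1: running Whisper's audio encoder here needs kv-cache and
        # attention metadata, which the base method's interface does not pass.
        raise NotImplementedError(
            "requires encoder forward with kv_caches/attn_metadata")

    def get_input_embeddings(self, input_ids: torch.Tensor,
                             multimodal_embeddings=None) -> torch.Tensor:
        # Point 2: the base interface expects multimodal embeddings to be
        # merged into the text embeddings; Whisper instead consumes encoder
        # output via cross-attention, so this just returns the decoder token
        # embeddings.
        return self.embed_tokens(input_ids)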

Member

Sorry, I forgot that V1 doesn't support encoder-decoder models yet. Let's just add a TODO and come back to this later.

Contributor Author

Added TODOs.

Collaborator

@Isotr0py left a comment


Just some initial comments about the model implementation :)

vllm/model_executor/models/whisper.py (review comment, outdated, resolved)
vllm/model_executor/models/whisper.py (review comment, outdated, resolved)
vllm/model_executor/models/whisper.py (review comment, outdated, resolved)
@mergify bot added the ci/build label on Dec 18, 2024
requirements-common.txt (review comment, outdated, resolved)
vllm/model_executor/models/whisper.py (review comment, outdated, resolved)
vllm/model_executor/models/whisper.py (review comment, outdated, resolved)
vllm/model_executor/models/whisper.py (review comment, outdated, resolved)

Successfully merging this pull request may close these issues: Whisper support.

5 participants