Mixtral Mixture of Experts example #3075
base: master
Conversation
Can you please move this example under the folder /example/large_models/Huggingface_accelerate?
logger.info("Model %s loading tokenizer", ctx.model_name)
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="balanced",
According to the link, it seems "auto" can cover future changes. Or you can allow customers to define it in model-config.yaml.
Quoted from the link:
"The options "auto" and "balanced" produce the same results for now, but the behavior of "auto" might change in the future if we find a strategy that makes more sense, while "balanced" will stay stable."
- `low_cpu_mem_usage=True` for loading with limited resources using `accelerate`
- 8-bit quantization using `bitsandbytes`
- `Accelerated Transformers` using `optimum`
- TorchServe streaming response
Is this example going to demo micro-batching + streaming or continuous batching + streaming?
)
self.model.resize_token_embeddings(self.model.config.vocab_size + 1)

self.output_streamer = TextIteratorStreamerBatch(
You can use the TorchServe customized TextIteratorStreamerBatch with micro-batching. See the inf2 example:
self.output_streamer = TextIteratorStreamerBatch(
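A rough sketch of what pairing that streamer with micro-batching could look like, reusing the constructor arguments already shown in this diff; the import path and the micro_batching config keys below are assumptions, not the PR's or the inf2 example's exact code:

```python
# Assumed import path; verify against the TorchServe version / inf2 example being referenced.
from ts.handler_utils.hf_batch_streamer import TextIteratorStreamerBatch

def setup_streamer(self, ctx):
    # With micro-batching, the streamer can be created once in initialize() with the
    # configured micro-batch size instead of per request with len(input_ids_batch).
    mb_cfg = (getattr(ctx, "model_yaml_config", None) or {}).get("micro_batching", {})
    micro_batch_size = mb_cfg.get("micro_batch_size", 1)

    self.output_streamer = TextIteratorStreamerBatch(
        self.tokenizer,
        batch_size=micro_batch_size,
        skip_special_tokens=True,
    )
```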
input_ids, attention_mask = self.encode_input_text(input_text["prompt"])
input_ids_batch.append(input_ids)
attention_mask_batch.append(attention_mask)
params.append(input_text["params"])
We are going to use the OpenAI payload style (see https://github.com/lxning/benchmark-locust/blob/ts/llm_bench/load_test.py#L254) since the OpenAI API is the most popular; the params are flattened.
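To illustrate what a flattened, OpenAI-style payload might mean for this preprocess code, here is a sketch; the field names follow the OpenAI completion API and are an assumption, not taken from the linked benchmark:

```python
import json

# Hypothetical flattened request body: generation parameters sit at the top level
# instead of under a nested "params" object.
request_body = json.dumps(
    {
        "prompt": "What is the capital of France?",
        "max_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
    }
)

def preprocess_one(data: dict):
    # Read the prompt and the flattened generation parameters directly.
    prompt = data["prompt"]
    gen_params = {
        "max_new_tokens": data.get("max_tokens", 128),
        "temperature": data.get("temperature", 1.0),
        "top_p": data.get("top_p", 1.0),
    }
    return prompt, gen_params

print(preprocess_one(json.loads(request_body)))
```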
self.output_streamer = TextIteratorStreamerBatch(
    self.tokenizer,
    batch_size=len(input_ids_batch),
    skip_special_tokens=True,
)
generation_kwargs = dict(
    inputs=input_ids_batch,
    attention_mask=attention_mask_batch,
    streamer=self.output_streamer,
    max_new_tokens=params[0]["max_new_tokens"],
    temperature=params[0]["temperature"],
    top_p=params[0]["top_p"],
)
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in self.output_streamer:
    send_intermediate_predict_response(
You can check the inf2 example to update this section and combine micro-batching + streaming.
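For reference, the streaming loop in this diff pushes each decoded chunk back as an intermediate response; a minimal sketch of that pattern is below (the status message is illustrative, and whether it runs inside a micro-batched handler as in the inf2 example is up to this PR):

```python
from ts.protocol.otf_message_handler import send_intermediate_predict_response

def stream_responses(self, context):
    # Each iteration yields the next decoded piece(s) from the background
    # generate() thread; forward them to the client as intermediate responses.
    for new_text in self.output_streamer:
        send_intermediate_predict_response(
            new_text,              # decoded chunk(s) for the current batch
            context.request_ids,   # maps batch index -> request id
            "Intermediate Prediction success",
            200,
            context,
        )
```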
Description
This example shows how to deploy the Mixtral-8x7B model with HuggingFace with the following features:
- `low_cpu_mem_usage=True` for loading with limited resources using `accelerate`
- 8-bit quantization using `bitsandbytes`
- `Accelerated Transformers` using `optimum`
produces the output
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: