Mixtral Mixture of Experts example #3075

Open · wants to merge 5 commits into master

Conversation

@agunapal (Collaborator) commented Apr 4, 2024

Description

This example shows how to deploy the HuggingFace Mixtral-8x7B model with the following features (a minimal loading sketch follows the list):

- `low_cpu_mem_usage=True` for loading with limited resources using `accelerate`
- 8-bit quantization using `bitsandbytes`
- Accelerated Transformers using `optimum`
- TorchServe streaming response
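For illustration, a minimal sketch of what such a load can look like with `transformers`; the checkpoint name and exact arguments here are assumptions, not necessarily what the handler in this PR does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed checkpoint; the actual path comes from the model archive/config.
model_path = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="balanced",   # spread the expert layers across available GPUs
    low_cpu_mem_usage=True,  # accelerate: avoid materializing all weights in CPU RAM
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # bitsandbytes 8-bit
    torch_dtype=torch.float16,
)
```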
Running

    python test_streaming.py

produces the output:

    What is the difference between cricket and baseball?

    - Cricket is a bat-and-ball game played between two teams of eleven players each on a field at the center of which is a rectangular 22-yard-long pitch. Each team takes its turn to bat,
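A streaming client along these lines reproduces that behavior (a sketch assuming the default TorchServe inference port and a registered model named `mixtral`; the PR's actual test_streaming.py may differ):

```python
import requests

# Assumed endpoint and model name; adjust to the registered model.
url = "http://localhost:8080/predictions/mixtral"
payload = {
    "prompt": "What is the difference between cricket and baseball?",
    "params": {"max_new_tokens": 50, "temperature": 0.8, "top_p": 0.9},
}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # TorchServe sends intermediate predictions as HTTP response chunks.
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)
```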

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal marked this pull request as ready for review April 4, 2024 18:03
@agunapal agunapal requested review from chauhang and lxning April 4, 2024 18:03
@lxning (Collaborator) left a comment


Can you please move this example under the folder /example/large_models/Huggingface_accelerate?

        logger.info("Model %s loading tokenizer", ctx.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map="balanced",

According to the link, it seems "auto" can cover future changes. Or you could allow the customer to define the device map in model-config.yaml (a sketch of that follows the quote).

Quoted from the link:
"The options "auto" and "balanced" produce the same results for now, but the behavior of "auto" might change in the future if we find a strategy that makes more sense, while "balanced" will stay stable."

- `low_cpu_mem_usage=True` for loading with limited resource using `accelerate`
- 8-bit quantization using `bitsandbytes`
- `Accelerated Transformers` using `optimum`
- TorchServe streaming response

Is this example going to demo micro-batching + streaming or continuous batching + streaming?

        )
        self.model.resize_token_embeddings(self.model.config.vocab_size + 1)

        self.output_streamer = TextIteratorStreamerBatch(

You can use the TS-customized TextIteratorStreamerBatch with micro-batching. See the inf2 example:

    self.output_streamer = TextIteratorStreamerBatch(
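For reference, a sketch of that wiring, assuming TextIteratorStreamerBatch is importable from `ts.handler_utils.hf_batch_streamer` as in the inf2 example; `micro_batch_size` stands in for whatever batch size the micro-batching handler exposes:

```python
from ts.handler_utils.hf_batch_streamer import TextIteratorStreamerBatch

# One streamer serving the whole micro-batch: each iteration yields one
# decoded fragment per request, keyed by the request's batch index.
self.output_streamer = TextIteratorStreamerBatch(
    self.tokenizer,
    batch_size=micro_batch_size,  # hypothetical; e.g. taken from model-config.yaml
    skip_special_tokens=True,
)
```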

            input_ids, attention_mask = self.encode_input_text(input_text["prompt"])
            input_ids_batch.append(input_ids)
            attention_mask_batch.append(attention_mask)
            params.append(input_text["params"])

We are going to use the OpenAI payload style (see https://github.com/lxning/benchmark-locust/blob/ts/llm_bench/load_test.py#L254), since the OpenAI API is the most popular; the params are flattened (see the sketch below).
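To illustrate the difference, the current nested payload versus a flattened OpenAI-style one could look as follows (the OpenAI-side field names follow its completions API; this is a sketch, not the final schema):

```python
# Current shape in this example: sampling params nested under "params".
current_payload = {
    "prompt": "What is the difference between cricket and baseball?",
    "params": {"max_new_tokens": 50, "temperature": 0.8, "top_p": 0.9},
}

# OpenAI-style shape: sampling params flattened to the top level.
openai_style_payload = {
    "prompt": "What is the difference between cricket and baseball?",
    "max_tokens": 50,
    "temperature": 0.8,
    "top_p": 0.9,
}
```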

Comment on lines +160 to +177
        self.output_streamer = TextIteratorStreamerBatch(
            self.tokenizer,
            batch_size=len(input_ids_batch),
            skip_special_tokens=True,
        )
        generation_kwargs = dict(
            inputs=input_ids_batch,
            attention_mask=attention_mask_batch,
            streamer=self.output_streamer,
            max_new_tokens=params[0]["max_new_tokens"],
            temperature=params[0]["temperature"],
            top_p=params[0]["top_p"],
        )
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        for new_text in self.output_streamer:
            send_intermediate_predict_response(

You can check the inf2 example to update this section to combine micro-batching + streaming.
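Roughly, that combination could look like the following (a sketch modeled on the inf2 example; `micro_batch_req_id_map` is a hypothetical {batch_index: request_id} mapping for the current micro-batch, and send_intermediate_predict_response is TorchServe's intermediate-response helper):

```python
from ts.protocol.otf_message_handler import send_intermediate_predict_response

# Each iteration of the batched streamer yields one decoded fragment per
# request; flush each round back to the clients in this micro-batch.
for new_text in self.output_streamer:
    send_intermediate_predict_response(
        new_text[: len(micro_batch_req_id_map)],  # fragments for this micro-batch
        micro_batch_req_id_map,                   # hypothetical {index: request_id}
        "Intermediate Prediction success",
        200,
        self.context,
    )
```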
