This repository has been archived by the owner on May 28, 2024. It is now read-only.
Hi, I was wondering whether it's possible to allocate a fraction of a GPU per model instead of dedicating one GPU to each deployed model. For example, on an AWS g5.12xlarge instance with 4 GPUs, rather than deploying at most 4 models, it might be possible to deploy eight quantized models by giving each model half a GPU. Changing `num_gpus_per_worker` resulted in an error.
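For context, Ray itself accepts fractional values for GPU resource requests, so "half a GPU per model" is expressible in principle. The fragment below is a hypothetical sketch of what such a scaling config might look like, assuming a `scaling_config` schema with the `num_gpus_per_worker` field mentioned above; the exact keys and whether fractional values are accepted depend on this repository's model-config schema, so verify before use.

```yaml
# Hypothetical model scaling config (field names assumed from the
# num_gpus_per_worker setting discussed above; not verified against
# this repo's schema).
scaling_config:
  num_workers: 1            # workers per model replica
  num_gpus_per_worker: 0.5  # fractional GPU: two models could share one GPU
  num_cpus_per_worker: 4
```

Note that fractional GPU scheduling in Ray only divides the accounting, not the memory: two models packed onto one GPU must together fit in that GPU's memory, which is why quantization matters here.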
Any update on this? I am running a similar test and would like to know the best practice for deploying 8 models on a 4-GPU instance. Also, what is the difference between 1 worker with 4 replicas and 4 workers with 1 replica each? Thanks.