Add RayService vLLM TPU Inference script #1467
base: main
Conversation
Force-pushed from 3ff9092 to 717b6ef.
Commits in this push (each Signed-off-by: Ryan O'Leary <[email protected]>):
- bug fixes
- remove extra ray init
- Read hf token from os
- Fix bugs
- Remove hf token logic
- Fix serve script
Force-pushed from 717b6ef to dfd04dd.
Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?
# See the License for the specific language governing permissions and
# limitations under the License.

# NOTE: this file was inspired from: https://github.com/richardsliu/vllm/blob/rayserve/examples/rayserve_tpu.py
@richardsliu can we get this example merged into the vllm repo?
I opened this one a while back: vllm-project/vllm#8038
I'll ping them again on it.
Here is the summary of changes. You are about to add 4 region tags.
This comment is generated by snippet-bot.
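For reference on what snippet-bot is counting: region tags are just paired START/END comments, and everything between them is extracted verbatim into the referencing docs page. A minimal illustration in Python follows; the tag name and constants below are hypothetical, not one of the four tags this PR actually adds.

```python
# Region tags are paired comments; the docs pipeline pulls out the lines
# between the START and END markers as a standalone snippet.
# [START gke_rayserve_tpu_example]  # hypothetical tag name
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
TENSOR_PARALLELISM = 8
# [END gke_rayserve_tpu_example]
```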
Yeah, that sounds good. I'm still testing out the 405B RayService, but I added the 8B and 70B ones in fe6440c; we can then use
I've tried running Llama-3.1-405B with TPU slice sizes up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

If the user has sufficient quota for TPU chips and SSD in their region, a v4 4x4x8 or a v5e 8x16 slice is large enough to run multi-host inference with Llama-3.1-405B. However, I'm wondering whether I'm missing anything obvious here (given the current amount of TPU support in vLLM) that could allow us to (a) load the model faster and (b) require less disk space when initializing the model.
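For context, these are the engine-argument knobs I'd expect to matter for load time and disk usage. A minimal sketch, assuming vLLM's `download_dir` and `load_format` options behave the same on TPU as elsewhere; the SSD path and parallelism values are placeholders, not what this PR ships:

```python
from vllm.engine.arg_utils import AsyncEngineArgs

# Placeholder values; the SSD mount path and slice size depend on the cluster.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=128,            # e.g. a v5e 8x16 slice (128 chips)
    download_dir="/mnt/disks/ssd/vllm",  # stage HF weights on local SSD
    load_format="safetensors",           # only load .safetensors weights
    dtype="bfloat16",
)
```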
Review threads (outdated, resolved):
- ai-ml/gke-ray/rayserve/llm/llama-3-8b-it/ray-cluster-v5e-tpu.yaml
- ai-ml/gke-ray/rayserve/llm/llama-3.1-70b/ray-cluster-v4-tpu.yaml
Description

This PR adds a simple inference script to be used for a Ray multi-host TPU example serving Meta-Llama-3-70B. Similar to the other scripts in the /llm/ folder, serve_tpu.py builds a Serve deployment for vLLM, which can then be queried with text prompts to generate output. This script will be used as part of a tutorial in the GKE and Ray docs.
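A rough sketch of the general shape of such a deployment, assuming the usual Ray Serve plus vLLM async-engine pattern used by the other /llm/ scripts; the class name, route, and environment variables below are illustrative, not the exact contents of serve_tpu.py:

```python
import os

from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self, model_id: str, tensor_parallel_size: int):
        # The async engine shards the model across the TPU hosts in the slice.
        args = AsyncEngineArgs(model=model_id, tensor_parallel_size=tensor_parallel_size)
        self.engine = AsyncLLMEngine.from_engine_args(args)

    @app.post("/generate")
    async def generate(self, prompt: str, max_tokens: int = 128) -> str:
        # Stream results from the engine and return the final completion text.
        params = SamplingParams(max_tokens=max_tokens)
        final_output = None
        async for output in self.engine.generate(prompt, params, random_uuid()):
            final_output = output
        return final_output.outputs[0].text


# Bound application that a RayService (or `serve run`) can import and deploy.
model = VLLMDeployment.bind(
    model_id=os.environ.get("MODEL_ID", "meta-llama/Meta-Llama-3-70B-Instruct"),
    tensor_parallel_size=int(os.environ.get("TPU_CHIPS", "8")),
)
```

A RayService manifest would then point its Serve config at something like `serve_tpu:model` as the import path; that wiring is an assumption here, not taken from the PR.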
Tasks