diff --git a/jobs/deploy_to_nvidia_nemo_inference_microservice/README.md b/jobs/deploy_to_nvidia_nemo_inference_microservice/README.md
index 6f62968..668b111 100644
--- a/jobs/deploy_to_nvidia_nemo_inference_microservice/README.md
+++ b/jobs/deploy_to_nvidia_nemo_inference_microservice/README.md
@@ -14,11 +14,16 @@ Deployment time varies by model and machine type. The base Llama2-7b config take
 
 ## User Quickstart
 
-1. Create a queue if you don't have one already, and launch an agent:
+1. Create a queue if you don't have one already. See an example queue config below.
+   1. You can set `gpus` to the specific GPUs you want to use, or `all` to use everything.
+   2. Set `runtime` to `nvidia`
+
+   ![image](https://github.com/wandb/launch-jobs/assets/15385696/d349e37a-ce1d-48b3-992f-1b4b617efa19)
+2. Launch an agent on your GPU machine:
    ```bash
    wandb launch-agent -e $ENTITY -p $PROJECT -q $QUEUE
    ```
-2. Submit the deployment job with your desired configs from the [Launch UI](https://wandb.ai/launch). See `configs/` for examples.
+3. Submit the deployment job with your desired configs from the [Launch UI](https://wandb.ai/launch). See `configs/` for examples.
    1. You can also submit via the CLI:
       ```bash
       wandb launch -d gcr.io/playground-111/deploy-to-nemo:latest \
@@ -27,7 +32,12 @@ Deployment time varies by model and machine type. The base Llama2-7b config take
         -q $QUEUE \
         -c $CONFIG_JSON_FNAME
       ```
-3. You can track the deployment process in the Launch UI. Once complete, you can immediately curl the endpoint to test the model. The model name is always `ensemble`.
+   ![image](https://github.com/wandb/launch-jobs/assets/15385696/8bc95b7a-94a6-453e-9c87-f6b25a567604)
+
+5. You can track the deployment process in the Launch UI.
+   ![image](https://github.com/wandb/launch-jobs/assets/15385696/49ca8391-689e-4cb7-9ba9-b5691f2cc7aa)
+
+7. Once complete, you can immediately curl the endpoint to test the model. The model name is always `ensemble`.
    ```bash
    #!/bin/bash
    curl -X POST "http://0.0.0.0:9999/v1/completions" \