LLM Inference on AWS GPU Spot Instances
GSS is a framework for efficient, cost-effective deployment of large language models (LLMs) on AWS spot GPU instances. Built on SkyPilot v0.6, GSS inherits its powerful job orchestration capabilities.
- Optimized LLM Inference: Integration with popular LLM frameworks, auto-batching, and model offloading for maximum GPU utilization.
- Cost Savings: Leverages cheap AWS spot instances for up to 90% cost reduction compared to on-demand instances.
- Simplified Deployment: Launch inference endpoints with a single command and automatic scaling.
- Flexible Inference Modes: Supports both real-time and offline batch inference (see the sketch below).
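For the offline batch mode, inference can run through vLLM's offline LLM API instead of the HTTP server. A minimal sketch, assuming vLLM is installed and using the same model name as the YAML below; the prompt list is purely illustrative:

from vllm import LLM, SamplingParams  # vLLM's offline (batch) inference API

# A batch of prompts processed in one generate() call; vLLM batches them
# internally to keep the GPU saturated.
prompts = [
    "Summarize spot instances in one sentence.",
    "What is tensor parallelism?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)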
Install GSS via pip:
# Clone the repository and install GSS in editable mode
git clone https://github.com/tsaol/GSS.git
cd GSS
pip install -e .
# Install boto3 (the AWS SDK for Python)
pip install boto3
# Configure your AWS credentials
aws configure
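To confirm the credentials are picked up before launching anything, a quick sanity check with boto3:

import boto3

# GetCallerIdentity needs no special permissions and verifies that the
# configured AWS credentials are valid.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])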
Define your model configuration in a YAML file (e.g. llama3.yaml, referenced in the launch command below):
envs:
  # MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: # TODO: Fill with your own Hugging Face token, or use --env to pass.

service:
  replicas: 2
  # An actual request for the readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  cloud: aws
  region: us-east-1
  accelerators: {A10G: 1}
  cpus: 8+
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install vllm==0.4.2
  # Install Gradio for the web UI.
  pip install gradio openai
  pip install flash-attn==2.5.9.post1

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    2>&1 | tee api_server.log &
  # Wait until the API server reports that it is up.
  while ! grep -q 'Uvicorn running on' api_server.log; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done
  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001
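The readiness_probe above issues a real chat completion, so a replica only counts as ready once the model actually answers. The same request can be reproduced from Python with only the standard library (a sketch; replace localhost with your replica's address if probing remotely):

import json
import urllib.request

# Mirror of the readiness_probe request from the YAML above.
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1,
}
req = urllib.request.Request(
    "http://localhost:8081/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))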
Launch the inference endpoint:
# Start the service
HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN
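Once the service is up, send requests to the endpoint URL that sky serve up prints. A minimal client sketch against the OpenAI-compatible API; the endpoint value below is a placeholder for your actual service address:

from openai import OpenAI

# Placeholder: the service endpoint (host:port) reported by `sky serve up`,
# not the address of an individual replica.
ENDPOINT = "http://<service-endpoint>"

# vLLM's OpenAI-compatible server does not require a real API key.
client = OpenAI(base_url=f"{ENDPOINT}/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)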
GSS is open-source under the Apache-2.0 License.