Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs][Serve] Speed up weights loading by AMI and Docker Image #3073

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Feb 2, 2024

TODO: Benchmark and get some numbers
Fix bug using

sudo jq '.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]' /etc/docker/daemon.json > /tmp/daemon.json && sudo mv /tmp/daemon.json /etc/docker/daemon.json
sudo systemctl restart docker

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch the following yaml and works properly
# mixtral.yaml
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: mistralai/Mixtral-8x7B-Instruct-v0.1
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1800
  replicas: 1

# Fields below describe each replica.
resources:
  cloud: gcp
  image_id: docker:cblmemo/mixtral-vllm:latest
  ports: 8080
  accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install vllm==0.3.0 transformers==4.37.2

run: |
  conda activate vllm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --host 0.0.0.0 --port 8080 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Speedup Weights Loading in Large Model Serving
----------------------------------------------

When serving large models, the weights of the model are loaded from the cloud storage / public internet to the VMs. This process can take a long time, especially for large models. To speed up the weights loading, you can use the following methods:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the benefit of using machine image or docker image is more for reducing the setup time, instead of the the model weight downloading time, as they are mostly for packaging the dependencies and not neccessarily means it will speed up the download of the image or the model, which should be limited by the network bandwidth instead.

Should we rewrite the section as reducing the overhead of environment setup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let me rephrase that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there are some speedups if using a machine image? That should be optimized by the cloud provider which makes it faster than a plain network download?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Machine images are intended to be used in a single region. It can be used in another region for launching a VM, but there will involve data transfer from one region to another. Cloud provider should have optimized it, but we probably want to be careful about the wording to avoid having a impression that all the benefits comes from weight loading.

docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved
docs/source/serving/sky-serve.rst Outdated Show resolved Hide resolved

# Here goes the setup and run commands...

This is easier to configure than machine images, but it may have a longer startup time than machine images since it needs to pull the docker image from the registry.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we actually timed these two methods?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling like it is hard to make a fair comparison - it is largely dependent on the base docker/machine image used... Though I'll try to make some benchmarks and see the results 🫡

image_id: docker:docker-image-with-dependency-installed

# Followed by setup and run commands.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could mention something about how this docker image should be built, especially, it could have SkyPilot runtime pre-built. Something like the following would be useful (could you help giving an concrete example for how to install vllm, and download the image in the Dockerfile below for a better reference, i.e. replacing the line # Your dependencies installation and model download code goes here with actual workable commands for serving vllm+mistral):

Your docker image can have all skypilot dependencies pre-installed to further reduce the setup time, you could try building your docker image based from our base image. The `Dockerfile` could look like the following:
```Dockerfile
FROM docker:berkeleyskypilot/skypilot-k8s-gpu:latest

# Your dependencies installation and model download code goes here
```

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this doc @cblmemo! We have users asking for this and it would be nice we can directly point them to this page. : )

@cblmemo
Copy link
Collaborator Author

cblmemo commented Mar 23, 2024

Thanks for adding this doc @cblmemo! We have users asking for this and it would be nice we can directly point them to this page. : )

Just want to update the status of this PR first: I found a mysterious bug that causes an NVML initialization error when using docker container as runtime env. By bisect it seems like those lines are causing the error:

'sudo systemctl stop jupyter > /dev/null 2>&1 || true;'
'sudo systemctl disable jupyter > /dev/null 2>&1 || true;'
'sudo systemctl stop jupyterhub > /dev/null 2>&1 || true;'
'sudo systemctl disable jupyterhub > /dev/null 2>&1 || true;',

That is very strange since those are running on the host but somehow affect the containers. Will investigate more.

@Michaelvll
Copy link
Collaborator

We should consider having this PR updated and merged as well. : )

@cblmemo
Copy link
Collaborator Author

cblmemo commented Apr 26, 2024

We should consider having this PR updated and merged as well. : )

This PR is blocked by the max/ultra disk tier as the current performance is not better than install everything from pip...

@Michaelvll
Copy link
Collaborator

Another user requests this. : )

@cblmemo
Copy link
Collaborator Author

cblmemo commented Aug 26, 2024

Left some benchmark results using PR #3860. In the following table, high = gp3 7,000 IOPS, ultra = io2 20,000 IOPS, max = io2 100,000 IOPS. All tests are running on AWS and the result is the e2e execution time for launching a Llama 2 70b checkpoint w/ the latest version of vLLM, on an A10G:8 instance.

high ultra max
Use AMI > 2 hours 487s 467s
Download from HF 524s 410s -

In conclusion, our high disk tier is indeed not enough for large checkpoint downloading and the ultra tier increases the performance a lot. Though the AMI does not enhance the performance as expected; there might be other bottlenecks like the download speed of the AMI.

@Michaelvll Michaelvll added the P0 label Sep 13, 2024
@Michaelvll
Copy link
Collaborator

We should revamp this PR with our latest findings and support for ultra disk : )

@cblmemo cblmemo changed the title [Docs][Serve] Speed up weights loading [Docs][Serve] Speed up weights loading by AMI and Docker Image Sep 16, 2024
@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 16, 2024

We should revamp this PR with our latest findings and support for ultra disk : )

Done in #3949 . Still keeping this so we could investigate if it is possible to speed up by using AMI & docker image.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants