[Core] Adopting vllm to external KV store Vineyard #1
Steps to set up vineyard and vllm for testing (partial credit to @haiyang from his internal doc):
Prepare v6d
If using the shared memory cache, the following line needs to be commented out before the build (see more detail below), for hacking purposes.
sudo apt-get update
sudo apt-get install -y ca-certificates \
cmake \
doxygen \
libboost-all-dev \
libcurl4-openssl-dev \
libgflags-dev \
libgoogle-glog-dev \
libgrpc-dev \
libgrpc++-dev \
libmpich-dev \
libprotobuf-dev \
libssl-dev \
libunwind-dev \
libz-dev \
protobuf-compiler-grpc \
python3-pip \
wget \
ninja-build
sudo rm -rf /lib/x86_64-linux-gnu/libprotoc.a \
/lib/x86_64-linux-gnu/libprotobuf.a \
/lib/x86_64-linux-gnu/libprotobuf-lite.a \
/lib/x86_64-linux-gnu/libprotobuf.so.23 \
/lib/x86_64-linux-gnu/libprotobuf.so.23.0.4
sudo ldconfig
# we have to build arrow from source to use the system-wide protobuf
cd ~
git clone https://github.com/apache/arrow.git
cd arrow/cpp && git checkout apache-arrow-16.1.0 && mkdir build-release && cd build-release
cmake --preset ninja-release-python -DCMAKE_INSTALL_PREFIX=/usr/ -DProtobuf_PROTOC_LIBRARY=/lib/x86_64-linux-gnu/libprotoc.so.32 ..
cmake --build .
sudo ninja install
sudo pip3 install cython
cd ~/arrow/python
sudo python3 setup.py install
cd ~
git clone https://github.com/v6d-io/v6d
cd v6d && git submodule update --init --recursive
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DUSE_STATIC_BOOST_LIBS=OFF \
-DBUILD_VINEYARD_SERVER=ON \
-DBUILD_VINEYARD_CLIENT=OFF \
-DBUILD_VINEYARD_PYTHON_BINDINGS=ON \
-DBUILD_VINEYARD_PYPI_PACKAGES=OFF \
-DBUILD_VINEYARD_LLM_CACHE=ON \
-DBUILD_VINEYARD_BASIC=OFF \
-DBUILD_VINEYARD_GRAPH=OFF \
-DBUILD_VINEYARD_IO=OFF \
-DBUILD_VINEYARD_HOSSEINMOEIN_DATAFRAME=OFF \
-DBUILD_VINEYARD_TESTS=ON \
-DBUILD_VINEYARD_TESTS_ALL=OFF \
-DBUILD_VINEYARD_PROFILING=OFF
make -j
make vineyard_llm_python -j
sudo make install
sudo pip3 install cython
cd ~/v6d
sudo python3 setup.py install
sudo python3 setup_llm.py install
pip3 install google-benchmark
To start etcd
ETCD_VER=v3.4.33
# choose either URL
GOOGLE_URL=https://storage.googleapis.com/etcd
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GOOGLE_URL}
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
/tmp/etcd-download-test/etcd --version
/tmp/etcd-download-test/etcdctl version
# start a local etcd server
/tmp/etcd-download-test/etcd 1>/dev/null 2>&1 &
# clear
# ETCDCTL_API=3 /tmp/etcd-download-test/etcdctl del "" --from-key=true
Build and run vllm
# start a vineyard server
~/v6d/build/bin/vineyardd --socket /tmp/vineyard_test.sock 1>out.log 2>&1 &
#build and install vllm
cd ~ && git clone https://github.com/aibrix/vllm.git
cd vllm && git checkout lexu/vineyard-adptation
pip3 install -e .
#upgrade pyarrow
python -m pip install pyarrow --upgrade
# build vineyard vllm
cd ~/v6d && sudo python3 setup.py install && sudo python3 setup_llm.py install
export VLLM_USE_VINEYARD_CACHE=1
export VLLM_USE_FLASH_ATTN_DECODING=1
Check out two ways of using cache here
If using file config cache
If using shared memory cache (hacking, WIP)
Make sure this line is commented out before you build v6d, then
Run vllm
To test locally, run:
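Before launching vllm, it can help to confirm that the pieces set up above are actually in place. The following is a hypothetical preflight sketch, not part of this PR: the helper names are made up for illustration, the socket path and environment variable names are taken from the steps above, and the etcd address assumes the default client port 2379 on localhost.

```python
import os
import socket
import stat


def vineyard_socket_ready(path: str) -> bool:
    """Return True if `path` exists and is a UNIX domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(env, socket_path="/tmp/vineyard_test.sock",
              etcd_host="127.0.0.1", etcd_port=2379):
    """Collect human-readable problems; an empty list means ready."""
    problems = []
    # Both variables are exported as 1 in the setup steps.
    for name in ("VLLM_USE_VINEYARD_CACHE", "VLLM_USE_FLASH_ATTN_DECODING"):
        if env.get(name) != "1":
            problems.append(f"{name} is not set to 1")
    if not vineyard_socket_ready(socket_path):
        problems.append(f"no vineyard socket at {socket_path}")
    # Assumption: etcd listens on its default client port.
    if not port_open(etcd_host, etcd_port):
        problems.append(f"etcd not reachable on {etcd_host}:{etcd_port}")
    return problems
```

Calling `preflight(os.environ)` in the shell where the exports were made returns a list of problems to fix, or an empty list if everything looks ready.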
Problems encountered cherry-picking commit sighingnow@d347dab#diff-c89ac25bd066e936e80260d21be63c7d2379cfedc371a9ff288fb5ba02ae1350 to the latest branch:
vllm/worker/model_runner.py
Outdated
@@ -90,6 +90,12 @@ class ModelInputForGPU(ModelRunnerInputBase):
    request_ids_to_seq_ids: Optional[Dict[str, List[int]]] = None
    finished_requests_ids: Optional[List[str]] = None
    virtual_engine: int = 0

    slot_mapping: Optional[torch.Tensor] = None
I cannot find any reference to this here. It does not seem directly related to the vineyard changes?
seq_group_metadata_list should be the only field being added. The rest of the fields were ported from an older version of vLLM, and I'll remove them.
Commenting out the null checks in v6d (https://github.com/v6d-io/v6d/blob/ebe8f077e3d3780a27d49238c501854b6b8e29df/modules/llm-cache/ds/kv_cache_block.cc#L163) allows the code to run with VineyardCacheConfig.
However, running the following request sequence leads to a segmentation fault on the vllm client.
Screenshot of the fault
Updated the instructions above: using the shared memory cache can be done through the following
Let's create a separate branch and pin the commit id. The remaining PRs should be submitted against that branch. Otherwise, this PR will grow endlessly and be hard to collaborate on. @happyandslow @DwyaneShi
feature branch could use
Force-pushed from f888485 to 70f523c
Using v1 configuration and benchmark code from link results in the following segmentation fault while running on v6d (commit link):
Further investigation using gdb (along with the prompt that triggers the error)
Force-pushed from ac2a3ac to 5586db7
# to ensure the tensor model parallel group is initialized.
self.vineyard_llm_cache = None

set_cpu_offload_max_bytes(
remove this duplicate snippet
vllm/worker/model_runner.py
Outdated
if envs.VLLM_USE_VINEYARD_CACHE:
    if not self.scheduler_config.chunked_prefill_enabled:
        logger.warn("Vineyard LLM cache is not enabled, requires chunked prefill")
    elif not envs.VLLM_USE_FLASH_ATTN_DECODING:
If this is required, why do we create an env var for it?
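One way to read this suggestion is to treat VLLM_USE_VINEYARD_CACHE as the single switch and derive the flash-attention decoding flag from it. The sketch below illustrates that idea under that assumption; `env_flag`, `VineyardFlags`, and `resolve_flags` are hypothetical names, not vLLM APIs, and the accepted truthy strings are a choice made for this example.

```python
from dataclasses import dataclass


def env_flag(env, name: str, default: bool = False) -> bool:
    """Interpret an environment variable such as NAME=1 as a boolean."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class VineyardFlags:
    use_vineyard_cache: bool
    use_flash_attn_decoding: bool


def resolve_flags(env) -> VineyardFlags:
    """Derive both flags from the environment mapping `env`."""
    use_cache = env_flag(env, "VLLM_USE_VINEYARD_CACHE")
    # Assumption for this sketch: enabling the cache implies flash-attn
    # decoding, so the second variable only matters when the cache is off.
    use_flash = use_cache or env_flag(env, "VLLM_USE_FLASH_ATTN_DECODING")
    return VineyardFlags(use_cache, use_flash)
```

With this shape, `resolve_flags(os.environ)` would make a separate export of VLLM_USE_FLASH_ATTN_DECODING optional whenever the cache is enabled.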
vllm/worker/model_runner.py
Outdated
model_config=self.model_config,
parallel_config=self.parallel_config,
kv_cache_dtype=self.kv_cache_dtype,
torch_dtype=get_kv_cache_torch_dtype(self.kv_cache_dtype,
torch_dtype can be determined from kv_cache_dtype and self.model_config. Would it be better to make it an internal variable?
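A rough sketch of what "internal variable" could look like: the constructor takes only the cache dtype and the model dtype, and computes the storage dtype itself. To stay self-contained this uses plain strings instead of torch dtypes; `ExternalCacheSettings`, `resolve_storage_dtype`, and the fp8-to-uint8 table are assumptions made for illustration, not the behaviour of get_kv_cache_torch_dtype.

```python
from dataclasses import dataclass, field

# Assumed mapping for this sketch only: quantized cache formats are
# stored as raw bytes.
_QUANTIZED_STORAGE = {"fp8": "uint8", "fp8_e4m3": "uint8", "fp8_e5m2": "uint8"}


def resolve_storage_dtype(kv_cache_dtype: str, model_dtype: str) -> str:
    """Pick the storage dtype from the cache dtype, falling back to the model's."""
    if kv_cache_dtype == "auto":
        return model_dtype
    if kv_cache_dtype in _QUANTIZED_STORAGE:
        return _QUANTIZED_STORAGE[kv_cache_dtype]
    raise ValueError(f"unsupported kv_cache_dtype: {kv_cache_dtype!r}")


@dataclass
class ExternalCacheSettings:
    kv_cache_dtype: str
    model_dtype: str
    # Derived in __post_init__, so callers cannot pass an inconsistent value.
    torch_dtype: str = field(init=False)

    def __post_init__(self) -> None:
        self.torch_dtype = resolve_storage_dtype(self.kv_cache_dtype,
                                                 self.model_dtype)
```

The point of the design is that the caller no longer passes torch_dtype at all, which removes one argument from the call site shown in the diff.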
vllm/worker/model_runner.py
Outdated
)
if self.vineyard_llm_cache:
    logger.info("Using Vineyard LLM cache")
else:
In this case, it only logs a warning. What if I enable the cache but disable chunked prefill? Should we raise an error and stop it from reaching the next steps here? In that case, you would not need that many self.vineyard_llm_cache checks.
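The fail-fast behaviour suggested here could look roughly like the following. This is a minimal sketch, not the PR's code: `check_vineyard_cache_config` is a hypothetical helper, and the error message wording is invented.

```python
def check_vineyard_cache_config(use_vineyard_cache: bool,
                                chunked_prefill_enabled: bool) -> bool:
    """Return True if the external cache should be constructed.

    Raises ValueError when the cache is requested but chunked prefill is
    disabled, instead of logging a warning and continuing without a cache.
    """
    if not use_vineyard_cache:
        return False
    if not chunked_prefill_enabled:
        raise ValueError(
            "VLLM_USE_VINEYARD_CACHE is set, but chunked prefill is disabled; "
            "enable chunked prefill or unset VLLM_USE_VINEYARD_CACHE")
    return True
```

Running the check once during initialization means later code can rely on a single boolean rather than re-testing self.vineyard_llm_cache at every step.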
intermediate_tensors = None
if not get_pp_group().is_first_rank:
    intermediate_tensors = self.model.make_empty_intermediate_tensors(
        batch_size=batch_size,
        dtype=self.model_config.dtype,
        device=self.device)
self.execute_model(model_input, kv_caches, intermediate_tensors)
if self.vineyard_llm_cache and kv_caches[0] is not None:
This method seems to be profile_run; do we need to update the KV cache here?
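The condition in the snippet above already skips the update when the first KV cache entry is None. Assuming profiling passes placeholder (None) KV caches, which this thread does not confirm, the update would never fire during profile_run. A sketch of making that intent explicit, with a hypothetical `is_profiling` argument and helper name:

```python
from typing import Optional, Sequence


def should_update_external_cache(external_cache: Optional[object],
                                 kv_caches: Sequence[Optional[object]],
                                 is_profiling: bool = False) -> bool:
    """Decide whether to push KV tensors to the external cache."""
    if is_profiling:
        # Profiling runs use dummy inputs; nothing worth caching.
        return False
    if external_cache is None:
        return False
    # Mirrors the `kv_caches[0] is not None` test in the diff above.
    return len(kv_caches) > 0 and kv_caches[0] is not None
```

If the placeholder assumption holds, the update branch could simply be dropped from profile_run rather than guarded.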
    torch_dtype=torch_dtype,
)

def prefetch_seq_kv_caches(
I will double-check the logic here later.
Based on another version of vllm: sighingnow@d347dab
Cherry-pick from commit d347dab
Signed-off-by: Tao He <[email protected]>
(cherry picked from commit 1545f6bf7edcd667e305d3fbcadd913066f04747)
resolving vllm update diff
temporarily comment out torch.distributed for single node env
add VineyardCacheConfig with https://github.com/v6d-io/v6d/blob/ebe8f077e3d3780a27d49238c501854b6b8e29df/modules/llm-cache/ds/kv_cache_block.cc#L163 commented out; cache_ops fix
remove CacheConfig from argument (configure through ENV)
v6d: fix integration w/ v1 APIs
Signed-off-by: Haiyang Shi <[email protected]>
Change model_runner to latest version
cherry pick model_runner from d347dab source sighingnow@d347dab
fix reshape_and_cache_flash argument
add cache prefetch/update to work_base
clean up
Fix after rebase to 029c71d
remove tensor copy from cache managed address to pin memory
clean up
Force-pushed from 3fd0048 to 45554d1
let's merge this one
Adopting vllm to use external KV services for KV cache (vineyard v6d)
This PR adapts vllm using the vllm integration provided by the vineyard team: sighingnow@d347dab#diff-c89ac25bd066e936e80260d21be63c7d2379cfedc371a9ff288fb5ba02ae1350