[BladeLLM] Support dispatch feature for BladeLLM #86

Merged 4 commits on Dec 18, 2024 (changes shown from 3 commits)
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
# Proto files
*_pb2.py
*_pb2_grpc.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
28 changes: 27 additions & 1 deletion Makefile
@@ -17,7 +17,7 @@ init:

.PHONY: install
install:
@pip install -e .
@pip install -e .[vllm]

.PHONY: lint
lint: check_pylint_installed check_pytest_installed
@@ -27,6 +27,30 @@ lint: check_pylint_installed check_pytest_installed
--disable=protected-access,super-init-not-called,unused-argument,redefined-outer-name,invalid-name \
-s n --jobs=128 ./tests

.PHONY: clean
clean: proto-clean

###################################### proto begin ######################################

.PHONY: proto
proto:
@find . -type d -name "proto" | while read dir; do \
dir_base=$$(dirname $$dir); \
find $$dir -name "*.proto" | while read proto_file; do \
echo "Compiling $$proto_file"; \
PYTHONWARNINGS="ignore::DeprecationWarning" python -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. $$proto_file; \
done; \
done;

.PHONY: proto-clean
proto-clean:
@find . -name "*_pb2_grpc.py" | xargs rm -f
@find . -name "*_pb2.py" | xargs rm -f

####################################### proto end #######################################
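The `proto` target above walks every directory named `proto`, then compiles each `.proto` file it finds with `grpc_tools.protoc`. A rough Python equivalent of the discovery-and-command-building step (a sketch only; the actual build is the Makefile shell loop above):

```python
import os


def find_proto_files(root: str) -> list:
    """Mirror `find . -type d -name proto`: collect every *.proto file
    that lives directly under a directory named `proto`."""
    matches = []
    for dirpath, _, filenames in os.walk(root):
        if os.path.basename(dirpath) == "proto":
            matches.extend(
                os.path.join(dirpath, f) for f in filenames if f.endswith(".proto")
            )
    return sorted(matches)


def protoc_command(proto_file: str) -> list:
    """Build the same grpc_tools.protoc invocation that `make proto` runs."""
    return [
        "python", "-m", "grpc_tools.protoc",
        "--proto_path=.", "--python_out=.", "--grpc_python_out=.",
        proto_file,
    ]
```

Running `protoc_command(f)` through `subprocess` for each discovered file would reproduce the target's effect, emitting the `*_pb2.py` / `*_pb2_grpc.py` files that `proto-clean` (and the new `.gitignore` rules) remove.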

###################################### test begin #######################################

.PHONY: test
test: check_pytest_installed
@pytest -v --ignore=third_party/ --ignore=tests/e2e_test --disable-warnings
@@ -55,6 +79,8 @@ bench_test:
migration_test:
@pytest -v -x -s --tb=long ./tests/e2e_test/test_migration.py

####################################### test end ########################################

#################### pygloo install for gloo migration backend begin ####################

BAZEL_CMD = bazel
22 changes: 22 additions & 0 deletions configs/bladellm.yml
@@ -0,0 +1,22 @@
SERVER:
RAY_CLUSTER_PORT: 6379
LAUNCH_RAY_CLUSTER: True
REQUEST_OUTPUT_QUEUE_TYPE: "rayqueue"

MANAGER:
DISABLE_FIXED_NODE_INIT_INSTANCE: False
DISABLE_INIT_INSTANCE_BY_MANAGER: True

LOAD_METRIC: 'remaining_steps'
DISPATCH_POLICY: 'load'

ENABLE_MIGRATION: False
ENABLE_DEFRAG: True
REQUEST_MIGRATION_POLICY: 'SR'

MIGRATION_BACKEND: 'grpc'
MIGRATION_BUFFER_BLOCKS: 512

ENABLE_SCALING: False

LOG_INSTANCE_INFO: False
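The config above sets `DISPATCH_POLICY: 'load'`, which routes each incoming request to the least-loaded instance as measured by `LOAD_METRIC`. A minimal sketch of that selection rule (hypothetical helper, not llumnix's actual scheduler, which lives in the manager):

```python
def dispatch_by_load(instance_loads: dict) -> str:
    """Pick the instance with the lowest reported load.

    `instance_loads` maps instance id -> load value; this sketch
    assumes lower load means more spare capacity, which is how a
    'load' dispatch policy would consume a normalized load metric.
    """
    if not instance_loads:
        raise ValueError("no instances available")
    return min(instance_loads, key=instance_loads.get)
```

For example, with loads `{"i0": 0.7, "i1": 0.2, "i2": 0.5}` the request would go to `i1`.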
File renamed without changes.
24 changes: 20 additions & 4 deletions docs/Arguments.md
@@ -32,8 +32,11 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
[--profiling-result-file-path PROFILING_RESULT_FILE_PATH]
[--gpu-type GPU_TYPE]
[--polling-interval POLLING_INTERVAL]
[--migration-backend {gloo,nccl,rpc}]
[--migration-backend {gloo,nccl,rayrpc,grpc,kvtransfer}]
[--migration-buffer-blocks MIGRATION_BUFFER_BLOCKS]
[--migration-backend-transfer-type {cuda_ipc,rdma,}]
[--migration-backend-kvtransfer-naming-url MIGRATION_BACKEND_KVTRANSFER_NAMING_URL]
[--migration-backend-server-address MIGRATION_BACKEND_SERVER_ADDRESS]
[--migration-backend-init-timeout MIGRATION_BACKEND_INIT_TIMEOUT]
[--migration-num-layers MIGRATION_NUM_LAYERS]
[--last-stage-max-blocks LAST_STAGE_MAX_BLOCKS]
@@ -144,11 +147,24 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]

`--migration-backend`
- Communication backend of migration.
- Possible choices: gloo, rpc
- Default: "rpc"
- Possible choices: gloo, rayrpc, grpc, kvtransfer. [gloo, rayrpc] are available for vllm and [grpc, kvtransfer] are available for bladellm.
- Default: "rayrpc"

`--migration-backend-transfer-type`
- Transfer type for the grpc and kvtransfer migration backends.
- Possible choices: cuda_ipc, rdma, ""
- Default: ""

`--migration-backend-server-address`
- Address of the gRPC server for the migration backend.
- Default: "127.0.0.1:50051"

`--migration-backend-kvtransfer-naming-url`
- URL of the naming server for the kvtransfer migration backend.
- Default: ""

`--migration-buffer-blocks`
- Number of cache blocks in migration.
- Number of buffer blocks in migration.
- Default: 512

`--migration-backend-init-timeout`
2 changes: 1 addition & 1 deletion docs/Quickstart.md
@@ -22,7 +22,7 @@ cd llumnix
make install
```

The default migration backend is RPC. If you want to use NCCL as the migration backend, run `make cupy-cuda` to install [cupy-cuda](https://pypi.org/search/?q=cupy-cuda) manually, as it is related to the CUDA version.
The default migration backend is rayrpc. If you want to use NCCL as the migration backend, run `make cupy-cuda` to install [cupy-cuda](https://pypi.org/search/?q=cupy-cuda) manually, as it is related to the CUDA version.

If you want to use Gloo as migration backend, **in addition to installing cupy-cuda**, please refer to [this link](https://github.com/ZeldaHuang/pygloo/blob/main/.github/workflows/ubuntu_basic.yml#L24C1-L26C1) to install [Bazel](https://github.com/bazelbuild/bazel) >= 5.1.0. Then, run `make pygloo` to install [pygloo](https://github.com/ZeldaHuang/pygloo).

4 changes: 2 additions & 2 deletions examlpes/offline_inference.py
@@ -6,7 +6,7 @@

from llumnix import launch_ray_cluster, connect_to_ray_cluster, init_manager, init_llumlets
from llumnix import (SamplingParams, ServerInfo, EngineManagerArgs, LLMEngineManager, Llumlet,
EngineArgs, QueueType)
EngineArgs, QueueType, BackendType)
from llumnix.utils import random_uuid
from llumnix.queue.ray_queue_server import RayQueueServer

@@ -40,7 +40,7 @@
llumlets: List[Llumlet] = None
llumlet_ids, llumlets = init_llumlets(
manager_args, engine_args, ray.get_runtime_context().get_node_id(),
QueueType("rayqueue")
QueueType("rayqueue"), BackendType.VLLM, 1,
)


23 changes: 17 additions & 6 deletions llumnix/__init__.py
@@ -11,9 +11,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import vllm
from vllm import *

from llumnix.server_info import ServerInfo
from llumnix.entrypoints.setup import (launch_ray_cluster,
connect_to_ray_cluster,
@@ -23,8 +20,8 @@
from llumnix.llm_engine_manager import LLMEngineManager
from llumnix.llumlet.llumlet import Llumlet
from llumnix.queue.queue_type import QueueType

from .version import __version__
from llumnix.backends.backend_interface import BackendType
from llumnix.version import __version__

__all__ = [
"__version__",
@@ -37,6 +34,20 @@
"LLMEngineManager",
"Llumlet",
"QueueType",
"BackendType",
]

__all__.extend(getattr(vllm, "__all__", []))
try:
import vllm
from vllm import *
__all__.extend(getattr(vllm, "__all__", []))
except ImportError:
pass

# TODO(KuilongCui): import blade_llm after cuda is ready
# try:
# import blade_llm
# from blade_llm import *
# __all__.extend(getattr(blade_llm, "__all__", []))
# except ImportError:
# pass
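The try/except block above re-exports vllm's public names through `llumnix` only when vllm is installed, and the commented-out block plans the same for blade_llm. That guarded re-export pattern can be written as a generic helper (a sketch, not part of the PR; `optional_reexport` is a hypothetical name):

```python
import importlib


def optional_reexport(module_name: str, namespace: dict) -> list:
    """Copy a module's `__all__` names into `namespace` if the module
    is installed; return the re-exported names, or [] if it is not.

    This mirrors the `try: import vllm / from vllm import *` pattern
    used in llumnix/__init__.py.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return []
    names = list(getattr(mod, "__all__", []))
    for name in names:
        namespace[name] = getattr(mod, name)
    return names
```

With this helper, both backends reduce to `__all__.extend(optional_reexport("vllm", globals()))` and the same call for `"blade_llm"` once its import is safe.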
41 changes: 32 additions & 9 deletions llumnix/arg_utils.py
@@ -51,7 +51,7 @@ class LlumnixEntrypointsArgs:
request_output_queue_port: int = None
disable_log_requests_server: bool = None
log_request_timestamps: bool = None
config_file: bool = None
config_file: str = None

def __post_init__(self):
for attr in dataclasses.fields(self):
@@ -132,9 +132,12 @@ class EngineManagerArgs:
log_instance_info: bool = None
profiling_result_file_path: str = None

migration_backend_kvtransfer_naming_url: str = None
migration_backend_server_address: str = None
migration_backend_init_timeout: float = None
migration_backend: str = None
migration_buffer_blocks: int = None
migration_backend_transfer_type: str = None
migration_num_layers: int = None
last_stage_max_blocks: int = None
max_stages: int = None
@@ -177,7 +180,10 @@ def create_migration_config(self) -> MigrationConfig:
self.migration_num_layers,
self.last_stage_max_blocks,
self.max_stages,
self.migration_backend_init_timeout)
self.migration_backend_init_timeout,
self.migration_backend_transfer_type,
self.migration_backend_server_address,
self.migration_backend_kvtransfer_naming_url)
return migration_config

@classmethod
@@ -194,16 +200,23 @@ def check_args(cls, args: 'EngineManagerArgs', parser: argparse.ArgumentParser):
# pylint: disable=protected-access
for action in parser._optionals._actions:
if hasattr(action, 'choices') and action.choices is not None and hasattr(args, action.dest):
assert getattr(args, action.dest) in action.choices, f"{action.dest} should be one of {action.choices}."
cur_arg = getattr(args, action.dest)
assert cur_arg in action.choices, f"{action.dest} should be one of {action.choices}, but {cur_arg} is set."

# vllm only
assert args.migration_backend != 'gloo' or (args.migration_backend == 'gloo' \
and not args.disable_init_instance_by_manager and not args.disable_fixed_node_init_instance), \
("When using gloo as migration backend, "
"do not set --disable-init-instance-by-manager and --disable-fixed-node-init-instance.")

# bladellm only
assert args.migration_backend not in ['kvtransfer'] or (args.migration_backend == 'kvtransfer' \
and args.migration_backend_transfer_type), \
("When using kvtransfer as migration backend, "
"--migration-backend-transfer-type must not be empty.")
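The two asserts in `check_args` encode backend-specific constraints: gloo (vllm) requires manager-driven, fixed-node instance initialization, and kvtransfer (bladellm) requires a non-empty transfer type. Restated as a standalone check (a simplified sketch of the logic above, with a hypothetical function name):

```python
def validate_migration_args(backend: str, transfer_type: str = "",
                            disable_init_by_manager: bool = False,
                            disable_fixed_node: bool = False) -> None:
    # vllm only: gloo needs the manager to place and init instances
    if backend == "gloo" and (disable_init_by_manager or disable_fixed_node):
        raise ValueError(
            "gloo migration backend requires manager-driven, "
            "fixed-node instance initialization")
    # bladellm only: kvtransfer must name a transfer type (cuda_ipc or rdma)
    if backend == "kvtransfer" and not transfer_type:
        raise ValueError(
            "kvtransfer migration backend requires "
            "--migration-backend-transfer-type to be set")
```

Backends outside these two cases (rayrpc, nccl, grpc) pass through unchecked, matching the vacuous branches of the original asserts.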

@staticmethod
def add_cli_args(
parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
parser.add_argument('--disable-fixed-node-init-instance',
action='store_true',
help='disable fixing the placement of instance to current node')
@@ -302,17 +315,27 @@ def add_cli_args(
parser.add_argument('--profiling-result-file-path',
type=str,
help='profiling result file path')

parser.add_argument('--migration-backend',
type=str,
choices=['gloo', 'nccl', 'rpc'],
help='communication backend of migration')
choices=['gloo','nccl','rayrpc','grpc','kvtransfer'],
help='communication backend of migration, [gloo, rayrpc] are available for vllm \
and [grpc, kvtransfer] are available for bladellm')
parser.add_argument('--migration-backend-transfer-type',
type=str,
choices=['cuda_ipc','rdma', ''],
help='transfer type for migration backend grpc and kvtransfer')
parser.add_argument('--migration-backend-server-address',
type=str,
help='address of grpc server for migration backend')
parser.add_argument('--migration-backend-kvtransfer-naming-url',
type=str,
help='url of naming server for kvtransfer migration backend')
parser.add_argument('--migration-backend-init-timeout',
type=float,
help='timeout(s) for initializing migration backend')
parser.add_argument('--migration-buffer-blocks',
type=int,
help='number of cache blocks in migration')
help='number of buffer blocks in migration')
parser.add_argument('--migration-num-layers',
type=int,
help='number of kv-cache layers to transfer in each round during migration')
8 changes: 2 additions & 6 deletions llumnix/backends/backend_interface.py
@@ -27,13 +27,15 @@ class EngineState(str, Enum):
class BackendType(str, Enum):
VLLM = "VLLM"
SIM_VLLM = "SIM_VLLM"
BLADELLM = "BLADELLM"

@staticmethod
def is_sim_backend(status: "BackendType") -> bool:
return status in [
BackendType.SIM_VLLM,
]
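The new `BLADELLM` member joins `BackendType` as a real engine backend, so `is_sim_backend` must stay false for it. A self-contained restatement of the enum as it reads after this hunk (copied for illustration; the source of truth is `llumnix/backends/backend_interface.py`):

```python
from enum import Enum


class BackendType(str, Enum):
    VLLM = "VLLM"
    SIM_VLLM = "SIM_VLLM"
    BLADELLM = "BLADELLM"

    @staticmethod
    def is_sim_backend(status: "BackendType") -> bool:
        # Only the simulated vllm backend counts as a sim backend;
        # VLLM and the new BLADELLM are real engines.
        return status in [BackendType.SIM_VLLM]
```

Callers such as `init_llumlets` can then branch on `BackendType.is_sim_backend(...)` without enumerating every real backend.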

# TODO(KuilongCui): separate backend interface into two parts: DispatchBackendInterface and MigrationBackendInterface
class BackendInterface(ABC):
# Methods for inference
@abstractmethod
@@ -67,12 +69,6 @@ def abort_request(self, request_id: Union[str, Iterable[str]]) -> None:
"""
raise NotImplementedError

@abstractmethod
async def _start_engine_step_loop(self) -> None:
"""Start step loop of backend engine.
"""
raise NotImplementedError

# Methods for migration
@abstractmethod
def get_request_incremental_blocks(self, backend_request: LlumnixRequest, pre_stage_num_blocks: int) -> List[int]:
12 changes: 12 additions & 0 deletions llumnix/backends/bladellm/__init__.py
@@ -0,0 +1,12 @@
# Copyright (c) 2024, Alibaba Group;
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.