Reconcile merge differences [fix Custom All Reduce; remove Torchrun & Cython] #163
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can comment /ready on the PR.
Force-pushed from 922e143 to a61a72a.
Overall good additions and a few bugs caught, thanks.
A couple of suggestions though. Also, let's move on to rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 as the default base, and leave only building rccl (from the rocm-6.2.0 tag) and triton, as it is buggy there lol.
Dockerfile.rocm
Outdated
ENV CCACHE_DIR=/root/.cache/ccache

RUN python3 -m pip install --upgrade pip
# Remove sccache so it doesn't interfere with ccache
What's the motivation here?
Upstream CI added in ccache previously to try to cut down on the vLLM compile time. It's not important to have it locally, but it might be marginally useful for internal CI. We should switch to sccache though, as it's better supported by ROCm components... but truthfully this entire thing is not important. It's changed here because that's what's in upstream.
Dockerfile.rocm
Outdated
fi
# Package upgrades for useful functionality or to avoid dependency issues
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install --upgrade numba scipy huggingface-hub[cli]
Why do we want to bundle huggingface-cli?
No huge reason; it was done previously so we can conveniently do huggingface-cli login to access gated models needed for some tests.
csrc/custom/custom.cu
Outdated
int M = in_a.size(0);
int K = in_a.size(1);
void LLMM1(at::Tensor& in_a, at::Tensor& in_b, at::Tensor& out_c,
           const int64_t rows_per_block = 4) {
Let's remove this default, it doesn't make sense.
csrc/custom/custom.cu
Outdated
int K = in_a.size(1);
void wvSpltK(at::Tensor& in_a, at::Tensor& in_b, at::Tensor& out_c,
             const int64_t N_in, const int64_t CuCount) {
  auto M = in_a.size(0);
Nice
tests/kernels/test_awq_triton.py
Outdated
Upstream has a newer version of awq according to @rasmith
I looked at the upstream PR but it doesn't seem to include this PR: #146? Or is this later PR not necessary?
The upstream PR came after this and should override what we currently have.
vllm/_custom_ops.py
Outdated
@@ -207,6 +182,12 @@ def advance_step(num_seqs: int, num_queries: int, block_size: int,
def awq_dequantize(qweight: torch.Tensor, scales: torch.Tensor,
                   zeros: torch.Tensor, split_k_iters: int, thx: int,
                   thy: int) -> torch.Tensor:
    # print(f"awq_dequantize:qweight.shape = {qweight.shape}"
Same here, we want the clean implementation that was upstreamed
We want to revert to the original, more performant implementation. The current version is a workaround (never upstreamed) for a compiler bug that has since been fixed, so taking the upstream version is the way to go here.
@@ -173,8 +174,9 @@ async def _check_model(request: Union[CompletionRequest,

async def _guided_decode_logits_processor(request, tokenizer):
    decoding_config = runner.engine_config.decoding_config
    guided_decoding_backend = request.guided_decoding_backend \
        or decoding_config.guided_decoding_backend
    assert decoding_config is not None
Strictly speaking not required if it arrives in the request
Yeah, oops, the logic for the assert as written currently isn't right...
For context, this was done because the lint complained that decoding_config could be None (whence it doesn't have the guided_decoding_backend attribute). A perfunctory read seems to indicate that that's not possible because of the default initialization; unfortunately the linter can't trace types that far.
So either we do a # type: ignore, or do this assert only if request.guided_decoding_backend is a dud.
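A minimal sketch of that second option, reusing the names from the diff above (this is a hypothetical restructuring, not the code in this PR):

async def _guided_decode_logits_processor(request, tokenizer):
    # Prefer the backend supplied on the request itself.
    guided_decoding_backend = request.guided_decoding_backend
    if guided_decoding_backend is None:
        # decoding_config is only dereferenced on this branch, so the assert
        # that keeps the linter happy is confined to where it actually matters.
        decoding_config = runner.engine_config.decoding_config
        assert decoding_config is not None
        guided_decoding_backend = decoding_config.guided_decoding_backend
    # ... hand guided_decoding_backend to the guided-decoding machinery as before.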
tests/kernels/test_awq_triton.py
Outdated
@@ -57,7 +58,25 @@ def awq_dequantize_torch(qweight: torch.Tensor, scales: torch.Tensor,

    scales = scales.repeat_interleave(group_size, dim=0)
    zeros = zeros.repeat_interleave(group_size, dim=0)
    return (iweights - zeros) * scales
    return (iweights - zeros) * scales, zeros
This is old, I thought @gshtras was bringing in the new stuff from upstream.
This is old, I thought @gshtras was bringing in the new stuff from upstream.
Yeah, something's a bit strange here... Having a newer version of this line misled me into thinking PR #146 is newer than upstream. Will re-inspect the AWQ merge.
tests/kernels/test_awq_triton.py
Outdated
f" qweight_rows = {qweight_rows} qweight_cols = {qweight_cols}" | ||
f" scales_rows = {scales_rows} scales_cols = {scales_cols}") | ||
weights = torch_awq_dequantize(qweight, scales, qzeros) | ||
return torch.matmul(input, weights) |
This is also old, there is newer stuff upstream. This is also not listed in the points 1-4 in the description. Did you mean to change the awq stuff? If you do, please bring in the new changes from upstream.
vllm/_custom_ops.py
Outdated
@@ -207,6 +182,12 @@ def advance_step(num_seqs: int, num_queries: int, block_size: int,
def awq_dequantize(qweight: torch.Tensor, scales: torch.Tensor,
                   zeros: torch.Tensor, split_k_iters: int, thx: int,
                   thy: int) -> torch.Tensor:
    # print(f"awq_dequantize:qweight.shape = {qweight.shape}"
    #       f"scales = {scales.shape},"
old
vllm/_custom_ops.py
Outdated
@@ -217,6 +198,12 @@ def awq_dequantize(qweight: torch.Tensor, scales: torch.Tensor,

def awq_gemm(input: torch.Tensor, qweight: torch.Tensor, qzeros: torch.Tensor,
             scales: torch.Tensor, split_k_iters: int) -> torch.Tensor:
    # if input.shape[0] > 1:
old
vllm/envs.py
Outdated
@@ -444,6 +444,10 @@ def get_default_config_root():
    lambda: (None if os.getenv("VLLM_TORCH_PROFILER_DIR", None) is None else os
             .path.expanduser(os.getenv("VLLM_TORCH_PROFILER_DIR", "."))),

    # If set, vLLM will use Triton implementations of AWQ.
Again, this is old. Get the new stuff from upstream.
tests/kernels/test_awq_triton.py
Outdated
The upstream PR came after this and should override what we currently have.
csrc/custom_all_reduce.cu
Outdated
std::vector<uint8_t> get_meta_buffer_ipc_handle(torch::Tensor inp) {
  std::vector<uint8_t> data_handle(sizeof(cudaIpcMemHandle_t), 0);
  CUDACHECK(cudaIpcGetMemHandle((cudaIpcMemHandle_t*)data_handle.data(),
torch::Tensor get_meta_buffer_ipc_handle(torch::Tensor& inp) {
Could you elaborate on this? Why the change to a tensor, and why does its Python counterpart still mark the return type as List[str]?
Why change to torch::Tensor: torch_bindings.cpp doesn't compile if the return type is vector<uint8_t>, because Torch Extension does not support binding this type to Python. This problem was hidden before this PR/cherry-pick because we did not compile the CAR extension and its associated bindings on ROCm.
Why List[str]: a problem carried over from the original PR. It looks like they first tried a vector<string> return instead of torch::Tensor and forgot to change the type hint afterwards. I assumed some implicit conversion business was going on during the binding, but on second thought, clearly not. The type hints should be converted to torch.Tensor.
The status quo isn't strictly speaking correct, but it does follow upstream convention and works at present. As per the comments in the original PR, if this is wrong, we are likely to see evidence only when torch.compile is used more liberally and we start seeing problems here.
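For reference, a rough sketch of how the handle can be packed into a uint8 tensor instead of a vector<uint8_t>; this assumes the CUDACHECK macro from the file above and is illustrative, not necessarily the exact code in the PR:

torch::Tensor get_meta_buffer_ipc_handle(torch::Tensor& inp) {
  // Allocate a CPU uint8 tensor large enough to hold the raw IPC handle bytes.
  auto options = torch::TensorOptions().dtype(torch::kUInt8).device(torch::kCPU);
  auto data_handle =
      torch::empty({static_cast<int64_t>(sizeof(cudaIpcMemHandle_t))}, options);
  // Write the IPC handle for inp's device allocation straight into the tensor,
  // which the Torch Extension binding layer can expose to Python directly.
  CUDACHECK(cudaIpcGetMemHandle(
      reinterpret_cast<cudaIpcMemHandle_t*>(data_handle.data_ptr()),
      inp.data_ptr()));
  return data_handle;
}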
Extra context:
Some aspects of the Torch Extension system are not easy to understand and are not well documented. I'm not convinced that the upstream PR gets everything right: for instance, it often does something like the following

custom_ar.def("get_graph_buffer_ipc_meta", &get_graph_buffer_ipc_meta);
custom_ar.impl("get_graph_buffer_ipc_meta", torch::kCPU, &get_graph_buffer_ipc_meta);

But it doesn't seem the second line is necessary: the original .def with the optional function pointer supplied already (implicitly) registers this method for all dispatch keys (including torch::kCUDA, torch::kCPU, etc.). The subsequent .impl registers this method with the CPU dispatch key and hence appears to be redundant (see the comment below for why: the .def with a function pointer actually bypasses the dispatch system, so registered dispatch keys aren't used).
In general we would expect .impl to be redundant if we are not intending to have different implementations of a method depending on the device (e.g. different CPU and GPU methods for computing the sine of a Tensor). Put another way, we should never need to use .impl if we only ever have/expect a single implementation.
For CAR, which is currently NV/AMD GPU-only, the developer knows at compile time whether a Tensor is on CPU or GPU, and accordingly we only have one implementation of each method. So we don't need the .impl-s. This might change in the future if, say, someone generalizes CAR to also work on CPU-only vLLM and hence needs to provide implementations of these methods for the cases where some of the all-reduce buffers are on CPU (instead of being on GPU).
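To make the distinction concrete, here is a hypothetical sketch of the two registration styles; my_ops, my_ops2, and my_sin are made-up names, not from this PR or upstream:

#include <torch/extension.h>

// Style A: .def with a function pointer. The schema is inferred and this single
// implementation is used regardless of device -- no .impl registrations needed.
torch::Tensor my_sin(const torch::Tensor& x) { return x.sin(); }

TORCH_LIBRARY(my_ops, m) {
  m.def("my_sin", &my_sin);
}

// Style B: .def declares only the schema; per-device kernels are registered via
// .impl. Only worthwhile when the implementations actually differ per device.
torch::Tensor my_sin_cpu(const torch::Tensor& x) { return x.sin(); }
torch::Tensor my_sin_cuda(const torch::Tensor& x) { return x.sin(); }

TORCH_LIBRARY(my_ops2, m) {
  m.def("my_sin2(Tensor x) -> Tensor");
}
TORCH_LIBRARY_IMPL(my_ops2, CPU, m) { m.impl("my_sin2", &my_sin_cpu); }
TORCH_LIBRARY_IMPL(my_ops2, CUDA, m) { m.impl("my_sin2", &my_sin_cuda); }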
Some documentation for details about custom operator dispatch: https://pytorch.org/cppdocs/api/file_torch_library.h.html#file-torch-library-h
https://pytorch.org/cppdocs/library.html#classtorch_1_1_library_1a59330802ab9deaff247d32a22092bdfb
- How/when dispatching happens:
When you pass operators with tensors of your custom backend [editor note: corresponding to a particular dispatch key], your overridden implementations will be called instead of the standard implementations
Here is even more detailed information, good for understanding context but some mechanical details about the dispatch system are deprecated.
- Using .def with a function pointer, with/without a schema:
Define an operator for a schema and then register an implementation for it. This is typically what you would use if you aren’t planning on making use of the dispatcher to structure your operator implementation. It’s roughly equivalent to calling def() and then impl(), but if you omit the schema of the operator, we will infer it from the type of your C++ function. All template arguments are inferred.
- Using .def without a function pointer, schema necessary:
Declare an operator with a schema, but don’t provide any implementations for it. You’re expected to then provide implementations using the impl() method. All template arguments are inferred.
- Using .impl:
Register an implementation for an operator. You may register multiple implementations for a single operator at different dispatch keys (see torch::dispatch()). Implementations must have a corresponding declaration (from def()), otherwise they are invalid. If you plan to register multiple implementations, DO NOT provide a function implementation when you def() the operator.
Force-pushed from 646e40d to ee47dc3.
Since 05e67ab has cherry-picked most of the salient changes in this PR, it will be closed and a new one opened for any further changes.
Is there a particular reason we want …
Force-pushed from 82a8e65 to 7fd46eb.
Apart from small miscellaneous fixes, this PR also contains the following major changes:
- Removed the torchrun executor, as it does not provide performance benefits over the now-default multiprocessing executor (see the sketch below).
- Removed the Cython-ization of sampler-related files, as the sampler has seen performance benefits in upstream.
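For context, a hypothetical usage sketch of the multiprocessing executor path; this assumes upstream vLLM's distributed_executor_backend engine argument is available in this fork, and the model name is just an example:

from vllm import LLM, SamplingParams

# Explicitly select the multiprocessing executor (normally unnecessary, since
# the description above notes it is now the default).
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)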