
[BUG] Illegal memory access when fuse_reduction=False #10

Closed
tlrmchlsmth opened this issue Jun 27, 2024 · 5 comments
Labels: bug (Something isn't working)
@tlrmchlsmth
Contributor

Describe the bug
I'm hitting an illegal memory access in vllm-project/vllm#5917 when setting fuse_reduction=False in the fused GEMM+ReduceScatter kernel.

To Reproduce
Clone vllm-project/vllm#5917 and then apply this patch:

diff --git a/vllm/model_executor/layers/linear.py b/vllm/model_executor/layers/linear.py
index aa45cf98..adad2df6 100644
--- a/vllm/model_executor/layers/linear.py
+++ b/vllm/model_executor/layers/linear.py
@@ -852,7 +852,7 @@ class FluxRowParallelLinear(LinearBase):
             # Note: bfloat16 requires fuse_reduction=False.
             # When fuse_reduction=False, I encounter illegal memory accesses in
             # the kernel, which are hard to track down.
-            fuse_reduction=True,
+            fuse_reduction=False,
         )

         # Divide the weight matrix along the last dimension.

Then run:

python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16

Unfortunately, I haven't been able to reproduce this with a minimal example. I also haven't been able to reproduce the problem when running with compute-sanitizer. Some problem sizes work and some don't (for instance, --input-len 1024 seems to work OK, but --input-len 512 does not).
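For reference, here is roughly the unfused computation that fuse_reduction=False should be equivalent to. This is a plain-PyTorch sketch (it does not call the flux kernel, and the shapes, file name, and sizes are assumptions), included only to show the shape of the minimal reproducer I've been trying to build:

# Plain-PyTorch reference for GEMM + reduce-scatter with an unfused reduction.
# Launch with: torchrun --nproc_per_node=2 repro_sketch.py
# Shapes mirror the failing case (M=512, fp16, TP=2) but are otherwise assumptions.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)

    m, n, k_shard = 512, 4096, 14336 // world  # per-rank K shard (hypothetical sizes)
    x = torch.randn(m, k_shard, dtype=torch.float16, device="cuda")
    w = torch.randn(n, k_shard, dtype=torch.float16, device="cuda")

    # Local GEMM over the sharded K dimension produces a partial sum...
    partial = F.linear(x, w)  # (m, n)

    # ...and reduce-scatter sums the partials while splitting rows across ranks,
    # which is the step the fused kernel performs in one shot.
    out = torch.empty(m // world, n, dtype=torch.float16, device="cuda")
    dist.reduce_scatter_tensor(out, partial.contiguous())

    torch.cuda.synchronize()
    if rank == 0:
        print("reference output shape:", tuple(out.shape))
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

The compute-sanitizer runs mentioned above were along the lines of compute-sanitizer --tool memcheck python3 benchmarks/benchmark_latency.py ... (same arguments as the command above) and came back clean.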

Stack trace/logs

(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`, Traceback (most recent call last):
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 63, in start_worker_execution_loop
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 255, in execute_model
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(model_input, self.kv_cache)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/model_runner.py", line 994, in execute_model
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = model_executable(
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 378, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 292, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 241, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 82, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     gate_up, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 300, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 113, in apply
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return F.linear(x, weight, bias)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]
Warmup iterations:   0%|                                                                                                                                                                                                                                                       | 0/10 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 280, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 91, in main
[rank0]:     run_to_completion(profile_dir=None)
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 82, in run_to_completion
[rank0]:     llm.generate(dummy_inputs,
[rank0]:   File "/home/tms/nm-vllm/vllm/utils.py", line 764, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/entrypoints/llm.py", line 304, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/tms/nm-vllm/vllm/entrypoints/llm.py", line 556, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/tms/nm-vllm/vllm/engine/llm_engine.py", line 806, in step
[rank0]:     output = self.model_executor.execute_model(
[rank0]:   File "/home/tms/nm-vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
[rank0]:     return self._driver_execute_model(execute_model_req)
[rank0]:   File "/home/tms/nm-vllm/vllm/executor/multiproc_gpu_executor.py", line 88, in _driver_execute_model
[rank0]:     return self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 255, in execute_model
[rank0]:     output = self.model_runner.execute_model(model_input, self.kv_cache)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/worker/model_runner.py", line 994, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 378, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 292, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 241, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 82, in forward
[rank0]:     gate_up, _ = self.gate_up_proj(x)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 300, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 113, in apply
[rank0]:     return F.linear(x, weight, bias)
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6ee42b25 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6ef6a718 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22c4ae36 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d22c4ef38 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f7d22c545ac in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d22c5531c in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6ee42b25 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6ef6a718 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22c4ae36 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d22c4ef38 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f7d22c545ac in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d22c5531c in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f7d228d7e33 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[1]    1868949 IOT instruction (core dumped)  python3 benchmarks/benchmark_latency.py --model  --num-iters 100 --batch-size
@liwenchangbdbz liwenchangbdbz added the bug Something isn't working label Jun 28, 2024
@zheng-ningxin
Collaborator

Thank you very much for your feedback, @tlrmchlsmth. I was unable to reproduce this bug using the latest commits (nm-vllm: e556f59, flux: c866c43). The command I ran is:

python3 benchmarks/benchmark_latency.py --model /opt/tiger/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 2048 --output-len 1 --enforce-eager --tensor-parallel-size 4 --dtype float16

Could it be an environment-related issue?
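If it is, comparing the output of python3 -m torch.utils.collect_env (PyTorch, CUDA, driver, and NCCL versions) plus the GPU model on both machines would be a cheap first check; that command is standard PyTorch, nothing flux-specific.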

@zheng-ningxin
Collaborator

I changed the sequence length to 512 and am still not able to reproduce the bug.
python3 benchmarks/benchmark_latency.py --model /home/tiger/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16

@zheng-ningxin zheng-ningxin self-assigned this Jul 18, 2024
@wenlei-bao
Collaborator

@zheng-ningxin Let's maybe wait for @tlrmchlsmth to provide the Docker image to reproduce, as mentioned in the other thread.

@tlrmchlsmth
Contributor Author

tlrmchlsmth commented Jul 18, 2024 via email

@tlrmchlsmth
Contributor Author

I am no longer able to reproduce the issue at all on Flux's main.
I've updated my vllm PR and am now seeing a speedup vs main 💥
