GPU crashes when running "D_packed @ Q.to(dtype=D_packed.dtype).T" with no error message #348

Open
Jimmy9507 opened this issue Jun 3, 2024 · 1 comment

Comments

Jimmy9507 commented Jun 3, 2024

Hey,

I tried to run ColBERT model inference via a Triton server on a multi-GPU instance.

GPU 0 works fine. However, the other GPU devices (1, 2, 3, etc.) crash when they reach this line

D_packed @ Q.to(dtype=D_packed.dtype).T

with no error message.

Has anyone seen this error before?

Jimmy9507 changed the title from 'Is it possible to do ColBERT model inferencing via triton server in multiple GPUs instace?' to 'Is it possible to do ColBERT model inferencing via triton server in multiple GPUs instance?' on Jun 3, 2024
Jimmy9507 changed the title from 'Is it possible to do ColBERT model inferencing via triton server in multiple GPUs instance?' to 'GPU crashes when running "D_packed @ Q.to(dtype=D_packed.dtype).T" with no error message' on Jun 5, 2024

Jimmy9507 commented Jun 6, 2024

After digging deeper into the code, it looks like the .device(torch::kCUDA, 0) option in this function

torch::Tensor decompress_residuals_cuda(
    const torch::Tensor binary_residuals, const torch::Tensor bucket_weights,
    const torch::Tensor reversed_bit_map,
    const torch::Tensor bucket_weight_combinations, const torch::Tensor codes,
    const torch::Tensor centroids, const int dim, const int nbits) {
    auto options = torch::TensorOptions()
                       .dtype(torch::kFloat16)
                       .device(torch::kCUDA, 0)  // output device is hard-coded to CUDA device 0
                       .requires_grad(false);

restricts the C++ method decompress_residuals_cuda to GPU device 0 only, so decompress_residuals_cuda crashes when it runs on any other GPU.

After updating it to .device(torch::kCUDA, binary_residuals.device().index()), the crash is resolved.

Should we update it to .device(torch::kCUDA, binary_residuals.device().index())? This would also allow model inference to run on multiple GPUs, which should significantly improve inference throughput.
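For reference, here is a minimal sketch of what the proposed change could look like. The TORCH_CHECK guard is added only for illustration and is not in the current source; everything else follows the snippet above, and the rest of the function body is elided.

torch::Tensor decompress_residuals_cuda(
    const torch::Tensor binary_residuals, const torch::Tensor bucket_weights,
    const torch::Tensor reversed_bit_map,
    const torch::Tensor bucket_weight_combinations, const torch::Tensor codes,
    const torch::Tensor centroids, const int dim, const int nbits) {
    // Illustration only: make sure the residuals actually live on a GPU.
    TORCH_CHECK(binary_residuals.is_cuda(), "binary_residuals must be a CUDA tensor");

    // Allocate the output on the same GPU as the input residuals instead of
    // always on CUDA device 0, so the kernel also works on devices 1, 2, 3, ...
    auto options = torch::TensorOptions()
                       .dtype(torch::kFloat16)
                       .device(torch::kCUDA, binary_residuals.device().index())
                       .requires_grad(false);

    // ... rest of the function (kernel launch, etc.) unchanged ...
}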

I'm wondering whether this is a bug or an intentional design choice.
