2 hardware agnostic front and backend #5
base: master
Conversation
Commits:
'update backend to be hardware agnostic' Rony Leppänen <[email protected]>
'update frontend to be hardware agnostic' Anders Smedegaard Pedersen <[email protected]>
'update Dockerfile.dev to also work for AMD'
'update requirements/ for AMD support' Samu Tamminen <[email protected]>
Other contributions: Bipradip Chowdhury <[email protected]>, Jarkko Lehtiranta <[email protected]>, Jarkko Vainio <[email protected]>, Tero Kemppi <[email protected]>
Review threads (resolved):
frontend/server/src/main/java/org/pytorch/serve/device/utils/XpuUtil.java
frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java (outdated)
frontend/server/src/test/java/org/pytorch/serve/device/utils/CudaUtilTest.java
amdsmi.amdsmi_init()

handle = amdsmi.amdsmi_get_processor_handles()[gpu_index]
mem_used = amdsmi.amdsmi_get_gpu_vram_usage(handle)["vram_used"]
I believe that torch.cuda.mem_get_info should work fine on our systems if we want to follow the same approach.
@jataylo thank you for the comment! I am not sure if we can do that yet; see the example below using torch 2.4.1+rocm6.1:
(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# amd-smi monitor --vram
GPU VRAM_USED VRAM_TOTAL
0 61496 MB 65501 MB
1 61496 MB 65501 MB
2 59062 MB 65501 MB
3 61496 MB 65501 MB
4 13 MB 65501 MB
5 13 MB 65501 MB
6 13 MB 65501 MB
7 13 MB 65501 MB
(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# python -c "import amdsmi; import torch; print(*[(i, amdsmi.amdsmi_get_gpu_vram_usage(amdsmi.amdsmi_get_processor_handles()[i])) for i in range(torch.cuda.device_count())], sep='\n');"
(0, {'vram_total': 65501, 'vram_used': 61496})
(1, {'vram_total': 65501, 'vram_used': 61496})
(2, {'vram_total': 65501, 'vram_used': 59062})
(3, {'vram_total': 65501, 'vram_used': 61496})
(4, {'vram_total': 65501, 'vram_used': 13})
(5, {'vram_total': 65501, 'vram_used': 13})
(6, {'vram_total': 65501, 'vram_used': 13})
(7, {'vram_total': 65501, 'vram_used': 13})
(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# python -c "import torch; import numpy as np; print(*[(i, np.array(torch.cuda.mem_get_info(i)) // 1024**2) for i in range(torch.cuda.device_count())], sep='\n');"
(0, array([ 6146, 65520])) # vram_used 59374
(1, array([ 4046, 65520])) # vram_used 61474
(2, array([ 4046, 65520])) # vram_used 61474
(3, array([ 4046, 65520])) # vram_used 61474
(4, array([65414, 65520])) # vram_used 106
(5, array([65414, 65520])) # vram_used 106
(6, array([65414, 65520])) # vram_used 106
(7, array([65414, 65520])) # vram_used 106
Here the amdsmi handle-based approach seems to provide correct numbers, but when using torch.cuda.mem_get_info() and accessing devices by index, the information does not seem to be correct (note that mem_get_info() returns (free, total) memory, from which the used amounts in the comments above are derived). It seems that the device indices somehow get mixed up.
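For readability, here is the same cross-check as a short script (a sketch equivalent to the one-liners above; amdsmi reports megabytes while mem_get_info() returns bytes):

# Cross-check amdsmi's handle-based VRAM usage against torch.cuda.mem_get_info,
# mirroring the one-liners above. amdsmi reports megabytes; mem_get_info returns bytes.
import amdsmi
import torch

amdsmi.amdsmi_init()
for i in range(torch.cuda.device_count()):
    handle = amdsmi.amdsmi_get_processor_handles()[i]
    used_smi = amdsmi.amdsmi_get_gpu_vram_usage(handle)["vram_used"]
    free, total = torch.cuda.mem_get_info(i)
    used_torch = (total - free) // 1024**2
    # On this system the two columns disagree per device, suggesting an index mix-up.
    print(i, used_smi, used_torch)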
Co-authored-by: Jack Taylor <[email protected]>
Force-pushed from 05211fa to ff4daa8.
extend serve.ModelServerTest.testMetricManager
Add latest ROCM support
Co-authored-by: Jeff Daily <[email protected]>
Generally looks good to me now. I think we should consult upstream feedback once any remaining comments are addressed.
But there still seem to be some unnecessary formatting changes in the Java code that we may want to clean up.
Description
This PR decouples the hardware layer from the front- and backend of TorchServe.
Relates to #740
Requirement Files
Added requirements/torch_rocm62.txt, requirements/torch_rocm61.txt and requirements/torch_rocm60.txt for easy installation of the dependencies needed for AMD support.
Backend
The Python backend currently supports NVIDIA GPUs using hardware-specific libraries. There were also a number of functions that could be refactored using more generalized interfaces.
Changes Made to Backend
Extend print_env_info for AMD GPUs and reimplement a number of functions (see the sketch below).
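As an illustration of the hardware-agnostic direction (a minimal sketch, not the code in this PR; the helper name vram_used_mb and the NVIDIA fallback via torch.cuda.mem_get_info are assumptions), the backend can branch on the torch build:

# Minimal sketch of a vendor-agnostic VRAM query; not the exact implementation in
# this PR. The function name and the NVIDIA fallback path are assumptions.
import torch

def vram_used_mb(gpu_index: int) -> int:
    """Return used VRAM in megabytes for the given GPU index."""
    if torch.version.hip is not None:
        # ROCm build: query through amdsmi handles, as in the diff discussed above.
        import amdsmi
        amdsmi.amdsmi_init()
        handle = amdsmi.amdsmi_get_processor_handles()[gpu_index]
        return amdsmi.amdsmi_get_gpu_vram_usage(handle)["vram_used"]
    # CUDA build: derive used memory from the (free, total) bytes reported by torch.
    free, total = torch.cuda.mem_get_info(gpu_index)
    return (total - free) // (1024 ** 2)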
Frontend
The Java frontend, which acts as the workload manager, had calls to SMIs hard-coded in a few places. This made it difficult for TorchServe to support multiple hardware vendors in a graceful manner.
Changes Made to Frontend
We've introduced a new package org.pytorch.serve.device with the classes SystemInfo and Accelerator. SystemInfo holds an array list of Accelerator objects, which hold static information about the specific accelerators on a machine and the relevant metrics.
Instead of calling the SMIs directly in multiple places in the frontend code, we have abstracted the hardware away by adding an instance of SystemInfo to the pre-existing ConfigManager. Now the frontend can get data from the hardware via the methods on SystemInfo without knowing about the specifics of the hardware and SMIs.
To implement the specifics for each of the vendors that were already partially supported, we have created a number of utility classes that communicate with the hardware via the relevant SMI.
The following steps are taken in the SystemInfo constructor:
1. Detect the vendor by running which {relevant smi} for each of the supported vendors (where is used on Windows systems). This is how vendor detection was done previously; there might be more robust ways.
2. Query the hardware through the vendor-specific utility class, e.g. ROCmUtility for AMD.
3. Detect visible devices from the relevant environment variable: HIP_VISIBLE_DEVICES for AMD, CUDA_VISIBLE_DEVICES for NVIDIA and XPU_VISIBLE_DEVICES for Intel. All devices are detected if the relevant environment variable is not set.
The following is a class diagram showing how the new classes relate to the existing code:
[class diagram]
Documentation
Placed the hardware support documentation in serve/docs/hardware_support/ and added it under "Hardware Support" in the TOC.
Type of change
Feature/Issue validation/testing
We build a new Docker container for each platform using Dockerfile.dev and the build arguments CUDA_VERSION and ROCM_VERSION.
Run containers
Tests
Note: the test test_handler.py::test_huggingface_bert_model_parallel_inference fails due to: ValueError: Input length of input_ids is 150, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.
This indicates that preprocessing uses a different max_length than inference, which can be verified by looking at the handler from when the test was originally implemented: model.generate() has max_length=50 by default, while the tokenizer uses max_length from setup_config (max_length=150). It seems that the bert-based Textgeneration.mar needs an update.
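As a hedged illustration of the mismatch and of the max_new_tokens fix suggested by the error message (not the handler code; gpt2 is a placeholder model, not the bert-based Textgeneration.mar used by the test):

# Illustrative only: preprocessing pads the prompt to max_length=150 (as
# setup_config does in the handler), while generate() caps the total length at
# max_length=50, which triggers the ValueError quoted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("example prompt", padding="max_length", max_length=150,
                   truncation=True, return_tensors="pt")

# model.generate(**inputs, max_length=50) reproduces the failure; bounding only
# the newly generated tokens avoids the conflict with the padded input length:
outputs = model.generate(**inputs, max_new_tokens=50,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))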
Checklist: