
Check NVIDIA driver version to avoid compatibility issues #288

Open
jeremyfowers opened this issue May 11, 2023 · 1 comment
Labels: api (Relating to the benchit API), bug (Something isn't working), p0 (top priority)

@jeremyfowers
Contributor

Trying to run any GPU benchmark at the head of main on GCP or Azure yields an error like this:

azureuser@mla-gpu-test:~$ miniconda3/bin/conda run -n mla benchit mlagility/models/selftest/linear.py --device nvidia

Models discovered during profiling:

linear.py:
	model (executed 1x)
		Model Type:	Pytorch (torch.nn.Module)
		Class:		LinearTestModel (<class 'linear.LinearTestModel'>)
		Location:	/home/azureuser/mlagility/models/selftest/linear.py, line 21
		Parameters:	110 (<0.1 MB)
		Hash:		d5b1df11
		Status:		Unknown benchit error: 'Total Latency'
		Traceback (most recent call last):
		  File "/home/azureuser/mlagility/src/mlagility/analysis/analysis.py", line 133, in call_benchit
		    perf = benchmark_model(
		  File "/home/azureuser/mlagility/src/mlagility/api/model_api.py", line 145, in benchmark_model
		    perf = gpu_model.benchmark(backend=backend)
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 21, in benchmark
		    benchmark_results = self._execute(repetitions=repetitions, backend=backend)
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 84, in _execute
		    mean_latency=self.mean_latency,
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 43, in mean_latency
		    return float(self._get_stat("Total Latency")["mean "].split(" ")[1])
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 34, in _get_stat
		    return performance[stat]
		KeyError: 'Total Latency'

GPU benchmarking is known to work correctly on commit e250ac7
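
For context, the KeyError at the bottom of that traceback comes from `trtmodel.py`'s `_get_stat` returning `performance[stat]` directly, so a failed TensorRT run with no recorded stats surfaces as an opaque `KeyError: 'Total Latency'`. A minimal sketch of a more informative lookup (illustrative only, not the actual mlagility code; it is written as a standalone function rather than a method for simplicity):

```python
# Hedged sketch: turn a missing trtexec statistic into an actionable error
# instead of a bare KeyError. Not the actual mlagility implementation.
def get_stat(performance: dict, stat: str):
    """Look up a TensorRT benchmarking statistic, failing with guidance."""
    if stat not in performance:
        raise RuntimeError(
            f"TensorRT benchmarking did not report '{stat}'. This usually means "
            "the benchmark failed inside the container, for example because the "
            "host NVIDIA driver is incompatible with the container's CUDA version."
        )
    return performance[stat]


# Example: an empty stats dict (what a failed run produces) now raises a
# RuntimeError with guidance instead of KeyError: 'Total Latency'.
# get_stat({}, "Total Latency")
```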

@jeremyfowers added the bug (Something isn't working), api (Relating to the benchit API), and p0 (top priority) labels on May 11, 2023
@ramkrishna2910
Contributor

The decision to use containers for all benchmarking was motivated by their promise of bundling all required dependencies, so that users would not have to worry about version compatibility.
For TensorRT to function, you need compatible versions of CUDA, cuDNN, and the NVIDIA driver, as stated in this compatibility matrix.
The official NVIDIA TensorRT container we use comes packaged with the right versions of CUDA and cuDNN, great!
But since the driver is a kernel-mode component, it cannot ship with the container; it has to be installed on the host system.
So far, all of the systems we have tested this feature on happened to have the correct drivers, except for the T4 system Jeremy used and the T4 system I found on GCP. Once I updated the driver version, everything worked as expected.
Ideally, the TRT container should report this error instead of just crashing. The fix on our end should be to read the driver version and report a proper error telling the user to update the driver. I will add this to the issue.

Follow the steps here to update the drivers.
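
A minimal sketch of what the proposed driver check could look like, assuming the host driver version is read via `nvidia-smi`; the minimum-version constant and function names are placeholders rather than the actual benchit implementation, and the real threshold should come from the TensorRT compatibility matrix linked above:

```python
# Hedged sketch of a pre-benchmark NVIDIA driver check (not benchit's code).
import subprocess

MINIMUM_DRIVER_VERSION = (525, 60)  # placeholder floor; verify against the matrix


def installed_driver_version() -> tuple:
    """Query the host NVIDIA driver version via nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # nvidia-smi prints one line per GPU, e.g. "525.85.12"; use the first GPU.
    return tuple(int(part) for part in output.splitlines()[0].strip().split("."))


def check_driver_version() -> None:
    """Raise a clear error instead of letting the TRT container crash later."""
    try:
        version = installed_driver_version()
    except (OSError, subprocess.CalledProcessError) as e:
        raise RuntimeError(
            "nvidia-smi failed or was not found; is the NVIDIA driver installed?"
        ) from e
    if version[:2] < MINIMUM_DRIVER_VERSION:
        raise RuntimeError(
            f"NVIDIA driver {'.'.join(map(str, version))} is older than the "
            f"minimum required {'.'.join(map(str, MINIMUM_DRIVER_VERSION))}. "
            "Please update the host driver before running GPU benchmarks."
        )
```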

@ramkrishna2910 changed the title from "GPU benchmarking is broken" to "Check NVIDIA driver version to avoid compatibility issues" on May 11, 2023