
Check NVIDIA driver version to avoid compatibility issues #288

Open
jeremyfowers opened this issue May 11, 2023 · 1 comment
Labels: api (Relating to the benchit API), bug (Something isn't working), p0 (top priority)

@jeremyfowers
Contributor

Trying to run any GPU benchmark at the head of main on GCP or Azure yields an error like this:

azureuser@mla-gpu-test:~$ miniconda3/bin/conda run -n mla benchit mlagility/models/selftest/linear.py --device nvidia

Models discovered during profiling:

linear.py:
	model (executed 1x)
		Model Type:	Pytorch (torch.nn.Module)
		Class:		LinearTestModel (<class 'linear.LinearTestModel'>)
		Location:	/home/azureuser/mlagility/models/selftest/linear.py, line 21
		Parameters:	110 (<0.1 MB)
		Hash:		d5b1df11
		Status:		Unknown benchit error: 'Total Latency'
		Traceback (most recent call last):
		  File "/home/azureuser/mlagility/src/mlagility/analysis/analysis.py", line 133, in call_benchit
		    perf = benchmark_model(
		  File "/home/azureuser/mlagility/src/mlagility/api/model_api.py", line 145, in benchmark_model
		    perf = gpu_model.benchmark(backend=backend)
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 21, in benchmark
		    benchmark_results = self._execute(repetitions=repetitions, backend=backend)
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 84, in _execute
		    mean_latency=self.mean_latency,
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 43, in mean_latency
		    return float(self._get_stat("Total Latency")["mean "].split(" ")[1])
		  File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 34, in _get_stat
		    return performance[stat]
		KeyError: 'Total Latency'

GPU benchmarking is known to work correctly on commit e250ac7
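
For context, the KeyError at the bottom of that traceback comes from `trtmodel.py`'s `_get_stat` returning `performance[stat]` directly, so a failed TensorRT run with no recorded stats surfaces as an opaque `KeyError: 'Total Latency'`. A minimal sketch of a more informative lookup (illustrative only, not the actual mlagility code; it is written as a standalone function rather than a method for simplicity):

```python
# Hedged sketch: turn a missing trtexec statistic into an actionable error
# instead of a bare KeyError. Not the actual mlagility implementation.
def get_stat(performance: dict, stat: str):
    """Look up a TensorRT benchmarking statistic, failing with guidance."""
    if stat not in performance:
        raise RuntimeError(
            f"TensorRT benchmarking did not report '{stat}'. This usually means "
            "the benchmark failed inside the container, for example because the "
            "host NVIDIA driver is incompatible with the container's CUDA version."
        )
    return performance[stat]


# Example: an empty stats dict (what a failed run produces) now raises a
# RuntimeError with guidance instead of KeyError: 'Total Latency'.
# get_stat({}, "Total Latency")
```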

@jeremyfowers added the bug (Something isn't working), api (Relating to the benchit API), and p0 (top priority) labels on May 11, 2023
@ramkrishna2910
Contributor

The decision to use containers for all benchmarking was motivated by their promise of bundling all required dependencies, so that users would not have to worry about version compatibility.
For TensorRT to function, you need compatible versions of CUDA, cuDNN, and the NVIDIA driver, as stated in this compatibility matrix.
The official NVIDIA TensorRT container we use comes packaged with the right versions of CUDA and cuDNN, great!
But since the driver is a kernel-mode component, it cannot ship with the container; it has to be installed on the host system.
So far, all of the systems we have tested this feature on happened to have the correct drivers, except for the T4 system Jeremy used and the T4 system I found on GCP. Once I updated the driver version, everything worked as expected.
Ideally, the TRT container should report this error instead of just crashing. The fix on our end should be to read the driver version and report a proper error telling the user to update the driver. I will add this to the issue.

Follow the steps here to update the drivers.
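
A minimal sketch of what the proposed driver check could look like, assuming the host driver version is read via `nvidia-smi`; the minimum-version constant and function names are placeholders rather than the actual benchit implementation, and the real threshold should come from the TensorRT compatibility matrix linked above:

```python
# Hedged sketch of a pre-benchmark NVIDIA driver check (not benchit's code).
import subprocess

MINIMUM_DRIVER_VERSION = (525, 60)  # placeholder floor; verify against the matrix


def installed_driver_version() -> tuple:
    """Query the host NVIDIA driver version via nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # nvidia-smi prints one line per GPU, e.g. "525.85.12"; use the first GPU.
    return tuple(int(part) for part in output.splitlines()[0].strip().split("."))


def check_driver_version() -> None:
    """Raise a clear error instead of letting the TRT container crash later."""
    try:
        version = installed_driver_version()
    except (OSError, subprocess.CalledProcessError) as e:
        raise RuntimeError(
            "nvidia-smi failed or was not found; is the NVIDIA driver installed?"
        ) from e
    if version[:2] < MINIMUM_DRIVER_VERSION:
        raise RuntimeError(
            f"NVIDIA driver {'.'.join(map(str, version))} is older than the "
            f"minimum required {'.'.join(map(str, MINIMUM_DRIVER_VERSION))}. "
            "Please update the host driver before running GPU benchmarks."
        )
```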

@ramkrishna2910 changed the title from "GPU benchmarking is broken" to "Check NVIDIA driver version to avoid compatibility issues" on May 11, 2023