
[Speedster] With Hugging Face notebook code on nebulydocker/nebullvm container: RuntimeError: Expected all tensors to be on the same device #349

Open
trent-s opened this issue Jun 21, 2023 · 5 comments

Comments

@trent-s

trent-s commented Jun 21, 2023

Hi! Thank you for your continued work on this project! I would like to report a possible TensorFlow GPU configuration issue with the documented nebulydocker/nebullvm container that appears to prevent the notebook code from running.

I am trying to use the code from the Hugging Face notebook found at
https://github.com/nebuly-ai/nebuly/blob/main/optimization/speedster/notebooks/huggingface/Accelerate_Hugging_Face_PyTorch_BERT_with_Speedster.ipynb

and I am running it in the current nebulydocker/nebullvm Docker container documented at
https://docs.nebuly.com/Speedster/installation/#optional-download-docker-images-with-frameworks-and-optimizers

Here is the exact Python code I am trying to run (essentially the code from the notebook, with a couple of diagnostic lines added):

#!/usr/bin/python
import os
import torch
from transformers import BertTokenizer, BertModel
import random
from speedster import optimize_model

tensorrt_path = "/usr/local/lib/python3.8/dist-packages/tensorrt"

if os.path.exists(tensorrt_path):
    os.environ['LD_LIBRARY_PATH'] += f":{tensorrt_path}"
else:
    print("Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))

encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]

dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch'},
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["onnx_tensor_rt","onnx_tvm","onnxruntime","tensor_rt", "tvm"],
    device=str(device),
    dynamic_info=dynamic_info,
)

print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optimized_model.device))

encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

print (final_out)

Just in case it is useful, starting up the container looks like this:

$ docker run -ti --rm -v ~/data:/data -v ~/src:/src --gpus=all nebulydocker/nebullvm:latest

=====================
== NVIDIA TensorRT ==
=====================

NVIDIA Release 23.03 (build 54538654)
NVIDIA TensorRT Version 8.5.3
Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh.  To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_opensource.sh -b <branch>
See https://github.com/NVIDIA/TensorRT for more information.

And this is the output I get when running the above code:

2023-06-21 07:44:32.387780: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-21 07:44:32.437353: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-21 07:44:34.329062: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-06-21 07:44:42 | INFO     | Running Speedster on GPU:0
2023-06-21 07:44:46 | INFO     | Benchmark performance of original model
2023-06-21 07:44:47 | INFO     | Original model latency: 0.011019186973571777 sec/iter
============= Diagnostic Run torch.onnx.export version 2.0.0+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

2023-06-21 07:44:53 | INFO     | [1/2] Running PyTorch Optimization Pipeline
2023-06-21 07:44:53 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-06-21 07:44:54 | WARNING  | Unable to trace model with torch.fx
2023-06-21 07:46:04 | INFO     | Optimized model latency: 0.007783412933349609 sec/iter
2023-06-21 07:46:04 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-06-21 07:46:04 | WARNING  | Unable to trace model with torch.fx
2023-06-21 07:47:44 | INFO     | Optimized model latency: 0.007919073104858398 sec/iter
2023-06-21 07:47:44 | INFO     | [2/2] Running ONNX Optimization Pipeline

[Speedster results on Tesla V100-PCIE-32GB]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ TorchScript       ┃               ┃
┃ latency     ┃ 0.0110 sec/batch ┃ 0.0078 sec/batch  ┃ 1.42x         ┃
┃ throughput  ┃ 90.75 data/sec   ┃ 128.48 data/sec   ┃ 1.42x         ┃
┃ model size  ┃ 438.03 MB        ┃ 438.35 MB         ┃ 0%            ┃
┃ metric drop ┃                  ┃ 0                 ┃               ┃
┃ techniques  ┃                  ┃ fp32              ┃               ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛

Max speed-up with your input parameters is 1.42x. If you want to get a faster optimized model, see the following link for some suggestions: https://docs.nebuly.com/Speedster/advanced_options/#acceleration-suggestions

Type of optimized model: <class 'nebullvm.operations.inference_learners.huggingface.HuggingFaceInferenceLearner'> on device: None
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /src/./sample.py:68 in <module>                                                                  │
│                                                                                                  │
│   65 # Warmup for 30 iterations                                                                  │
│   66 for encoded_input in encoded_inputs[:30]:                                                   │
│   67 │   with torch.no_grad():                                                                   │
│ ❱ 68 │   │   final_out = model(**encoded_input)                                                  │
│   69                                                                                             │
│   70 print (final_out)                                                                           │
│   71                                                                                             │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:1013 in forward │
│                                                                                                  │
│   1010 │   │   # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x s  │
│   1011 │   │   head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)          │
│   1012 │   │                                                                                     │
│ ❱ 1013 │   │   embedding_output = self.embeddings(                                               │
│   1014 │   │   │   input_ids=input_ids,                                                          │
│   1015 │   │   │   position_ids=position_ids,                                                    │
│   1016 │   │   │   token_type_ids=token_type_ids,                                                │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:230 in forward  │
│                                                                                                  │
│    227 │   │   │   │   token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.  │
│    228 │   │                                                                                     │
│    229 │   │   if inputs_embeds is None:                                                         │
│ ❱  230 │   │   │   inputs_embeds = self.word_embeddings(input_ids)                               │
│    231 │   │   token_type_embeddings = self.token_type_embeddings(token_type_ids)                │
│    232 │   │                                                                                     │
│    233 │   │   embeddings = inputs_embeds + token_type_embeddings                                │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py:162 in forward                 │
│                                                                                                  │
│   159 │   │   │   │   self.weight[self.padding_idx].fill_(0)                                     │
│   160 │                                                                                          │
│   161 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 162 │   │   return F.embedding(                                                                │
│   163 │   │   │   input, self.weight, self.padding_idx, self.max_norm,                           │
│   164 │   │   │   self.norm_type, self.scale_grad_by_freq, self.sparse)                          │
│   165                                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2210 in embedding                  │
│                                                                                                  │
│   2207 │   │   #   torch.embedding_renorm_                                                       │
│   2208 │   │   # remove once script supports set_grad_enabled                                    │
│   2209 │   │   _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)                    │
│ ❱ 2210 │   return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)        │
│   2211                                                                                           │
│   2212                                                                                           │
│   2213 def embedding_bag(                                                                        │

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Attempting to call the model appears to cause the final RuntimeError:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

This seems like it may be related to optimized_model.device being None.
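
For what it is worth, a quick way to check where things actually end up after optimize_model() returns would be something like this (diagnostic lines I am sketching here, not part of the notebook):

# Hedged diagnostic: check where the plain PyTorch model, one tokenized input,
# and the optimized model report themselves to be after optimization.
print("model weights on:", next(model.parameters()).device)
print("input tensors on:", encoded_inputs[0]["input_ids"].device)
print("optimized_model.device:", optimized_model.device)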

Just FYI, the GPU seems to be accessible in this container:

# nvidia-smi
Wed Jun 21 09:05:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB            Off| 00000000:AF:00.0 Off |                    0 |
| N/A   33C    P0               23W / 250W|      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB            Off| 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P0               24W / 250W|      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
# python -c "import torch; print(torch.cuda.is_available())"
True

Thank you for looking at this.

@SuperSecureHuman
Contributor

Same issue

Looking further, the model seems to end up on the CPU after going through Speedster (?)

Changing the model to cuda manually before inference works, but the model detaching from the GPU is not the expected behaviour.

[screenshot]

[screenshot]

I am not sure if this has something to do with the model being detached.

[screenshot]
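
To be concrete, the manual workaround I mean is just re-moving the original model to the GPU after optimize_model() returns (a sketch, assuming the same variable names as the script above):

# Sketch of the manual workaround: push the original model back onto the GPU
# after optimization, then run the warmup loop as before.
model.to(device)
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)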

@trent-s
Author

trent-s commented Jul 12, 2023

Thank you very much for taking a look at this. That is a good point. The "cannot dlopen some GPU libraries" message sounds serious.

I have a question about the workaround you suggested. I tried calling optimized_model.to(device) to force the model onto the GPU, but as the following output shows, there is no .to() method.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/./hGPT2gpu.py:67 in <module>                                                               │
│                                                                                                  │
│    64                                                                                            │
│    65 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim   │
│    66 print("moving model to gpu")                                                               │
│ ❱  67 optimized_model.to(device)                                                                 │
│    68 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim   │
│    69                                                                                            │
│    70 # print (dir(optimized_model))                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'HuggingFaceInferenceLearner' object has no attribute 'to'

Is there another way to move the model to cuda? Thanks!

@SuperSecureHuman
Contributor

It's an InferenceLearner object.

I am not exactly sure how to move it, but at a higher level the idea would be to get the model out of the inference learner and move it to the GPU, something like the sketch below.
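
Something along these lines might work as a rough starting point (just a sketch; I have not checked the actual attribute names on HuggingFaceInferenceLearner, so the attribute scan is a guess, not a documented API):

import torch

# Sketch only: look through the learner's attributes for wrapped torch modules
# and move any that are found onto the GPU.
def move_learner_to(learner, device="cuda"):
    for name, value in vars(learner).items():
        if isinstance(value, torch.nn.Module):
            value.to(device)
            print(f"moved {name} to {device}")

move_learner_to(optimized_model)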

@trent-s
Author

trent-s commented Jul 12, 2023

Thanks! That sounds like a good suggestion. I will try that!

@leizaf

leizaf commented Jul 27, 2023

This seems related to pytorch/pytorch#72175; the solution is to first export to ONNX on the CPU, and then optimize it on the GPU, along the lines of the sketch below.
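
As a generic illustration of that split (not necessarily how Speedster wires it up internally; the file name, input/output names and ONNX Runtime provider below are assumptions, reusing the variable names from the script above):

import torch
import onnxruntime as ort

# Export on the CPU first.
model_cpu = model.cpu().eval()
sample = {k: v.cpu() for k, v in encoded_inputs[0].items()}
torch.onnx.export(
    model_cpu,
    (sample["input_ids"], sample["attention_mask"], sample["token_type_ids"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "num_tokens"},
        "attention_mask": {0: "batch", 1: "num_tokens"},
        "token_type_ids": {0: "batch", 1: "num_tokens"},
    },
)

# Then run (or further optimize) the exported graph on the GPU.
session = ort.InferenceSession("bert.onnx", providers=["CUDAExecutionProvider"])
outputs = session.run(None, {k: v.numpy() for k, v in sample.items()})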
