🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck; bug caused by elimination exception #1560
Comments
Can you provide us a reproducer script with the model for us to investigate? Also, are you seeing this with the latest release (1.3) and with master too?
So far I have tested it only with the latest release, torch-tensorrt 1.3.0. I have built a minimal example with a pretrained model provided by PyTorch to reproduce the issue (you only need to update the paths in the .cpp file). You might need to set:
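A minimal sketch of what such a reproducer typically looks like with the 1.3.0 C++ API; the model path, input shape, and precision below are placeholders, not the actual attachment:
#include "torch/script.h"
#include "torch_tensorrt/torch_tensorrt.h"
int main() {
  // Placeholder path - point this at a TorchScript (.ts) export of a
  // pretrained model, e.g. a traced torchvision resnet18.
  auto module = torch::jit::load("/path/to/model.ts");
  module.to(torch::kCUDA);
  module.eval();
  // Describe a single static NCHW input for the converter.
  auto compile_settings = torch_tensorrt::torchscript::CompileSpec(
      {torch_tensorrt::Input(std::vector<int64_t>{1, 3, 224, 224})});
  compile_settings.enabled_precisions = {torch::kFloat};
  // This is the call that hangs in the reported bug.
  auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
  // One forward pass to confirm the compiled module runs.
  auto out = trt_mod.forward({torch::randn({1, 3, 224, 224}, torch::kCUDA)});
  return 0;
}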
Hi @peri044,
Torch-TensorRT works for me when I follow this tutorial (https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/), but I get the same issue with the same traceback for my own models. Any ideas to fix this bug? @bjaeger1 @peri044 I use pip to install the related packages as follows:
Environment
@bobby-chiu
I suspect this might be related to issue #1823
Hi @gs-olive - I just tried out the PR #1859. After building the docker image with:
and running the container:
it fails when checking if torch-tensorrt was compiled successfully:
ERROR: …
Output when building the docker image:
Hi @bjaeger1 - I just rebased #1859. The image can now be built with:
DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.9 --build-arg TENSORRT_VERSION=8.6 --build-arg PYTHON_VERSION=3.10 -f docker/Dockerfile -t torch_tensorrt:latest .
Could you try the build again with a fresh pull of that branch?
Hi @gs-olive, thanks for the quick answer! However, the command: … But the docker-image build process states:
Hi @bjaeger1 - thanks for the follow-up. I was able to reproduce the issue and addressed the problem in #2085, which adds the necessary symlink lines to the Dockerfile. The test can then be run with:
bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi
(I had to remove …)
Hi @gs-olive, after adding the mentioned lines for the symlink to the Dockerfile, I was finally able to successfully build torch_tensorrt. Thanks! I currently have difficulties when building my minimal example (linking torch_tensorrt against libtorch fails with undefined references). I'll come back once I've fixed that issue.
@gs-olive, I would assume that the linking error is because the torch_tensorrt I built uses CUDA 12.1 (BASE_IMG=nvidia/cuda:12.1.1-devel-ubuntu22.04), but the libtorch version I downloaded from the official website is built with CUDA 11.8. Trying to build torch_tensorrt with BASE_IMG=nvidia/cuda:11.8.0-devel-ubuntu22.04 is somehow not possible and fails with:
Hi @bjaeger1 - thanks for the follow-up. It seems likely that the mismatched libtorch/CUDA versions are contributing to this issue. We very recently updated the stack; the image can be built with:
DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.8 --build-arg TENSORRT_VERSION=8.6 -f docker/Dockerfile -t torch_tensorrt:latest .
Hi @gs-olive, I prefer to use CUDA 11.8 and wait until the PyTorch binaries are released with CUDA 12.1, instead of building from source. I pulled the cuda_118_rollback branch and built the image. The torch_tensorrt build fails with:
In …
Again, thanks a lot for your effort!
PS: I also tried to add the changes from commit …
Just to check - is this failure occurring during the …?
The …
or:
Update: I just installed the libtorch nightly binary, which comes with CUDA 12.1. I built a torch_tensorrt Docker image with CUDA 12.1 and then built the torch_tensorrt library.
That makes sense, thank you for the details. I was able to reproduce the error, and it seems to be caused by an issue in the … For reference, the image was built with:
DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.8 --build-arg TENSORRT_VERSION=8.6 -f docker/Dockerfile -t torch_tensorrt:latest .
The following command succeeds from within the container, on my machine:
bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi
Please let me know if the latest updates work for you as well.
Hi, with the latest update I was able to build the image and run the container. You only changed … The commands: … When compiling my example there are a lot of undefined references:
When comparing the undefined references to the ones from my previous comment (where I used the nightly libtorch version with CUDA 12.1 and also built torch_tensorrt with CUDA 12.1), the same 3 glibc references are missing, plus additional c10 errors. I guess there is still a (CUDA?) version mismatch somewhere in torch_tensorrt.
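(Side note: undefined c10/torch references like these can also come from a C++11-ABI mismatch rather than a CUDA mismatch. A minimal check, assuming a GCC/libstdc++ toolchain - compile and run this tiny program with the same flags as your example; the printed value has to match the libtorch distribution you link against, 0 for the pre-cxx11 tarball and 1 for the cxx11-abi tarball:)
// abi_check.cpp - prints the libstdc++ string ABI this translation unit was built with.
#include <cstdio>
int main() {
#ifdef _GLIBCXX_USE_CXX11_ABI
  std::printf("_GLIBCXX_USE_CXX11_ABI = %d\n", _GLIBCXX_USE_CXX11_ABI);
#else
  std::printf("_GLIBCXX_USE_CXX11_ABI is not defined (non-libstdc++ toolchain)\n");
#endif
  return 0;
}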
Yes, changing the … was the only update. After discussing the issue with @peri044 - could you try the following from within the container, prior to building your example? The issue may be with the use of …
# Make directory for modeling files
mkdir modeling
cd modeling
# Make compilation/Torch-TRT build file
touch run.cpp
# See below for demo BUILD file
touch BUILD
# Build using Bazel
cd ..
bazel build modeling:my_custom_model --config pre_cxx11_abi
Demo BUILD file:
load("@rules_pkg//:pkg.bzl", "pkg_tar")
config_setting(
name = "use_pre_cxx11_abi",
values = {
"define": "abi=pre_cxx11_abi",
},
)
cc_binary(
name = "my_custom_model",
srcs = [
"run.cpp"
],
linkopts = [
"-ldl",
],
deps = [
"//third_party/args",
"//cpp:torch_tensorrt",
] + select({
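# Select the libtorch repositories matching the ABI the target is built with
# (--config pre_cxx11_abi is expected to set the abi=pre_cxx11_abi define).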
":use_pre_cxx11_abi": [
"@libtorch_pre_cxx11_abi//:libtorch",
"@libtorch_pre_cxx11_abi//:caffe2",
],
"//conditions:default": [
"@libtorch//:libtorch",
"@libtorch//:caffe2",
],
}),
)
I tried your suggestion but the build failed:
Thanks for testing that out. I was able to reproduce that message, but only when the …
Demo run.cpp:
#include "torch/csrc/autograd/grad_mode.h"
#include "torch/csrc/jit/runtime/graph_executor.h"
#include "torch/script.h"
#include "torch_tensorrt/logging.h"
#include "torch_tensorrt/torch_tensorrt.h"
int main(int argc, char** argv) {
// Compile, infer, ...
printf("Output ...\n");
return 0;
}
Then I run the following:
cd /opt/torch_tensorrt
bazel build modeling:my_custom_model --config pre_cxx11_abi
./bazel-bin/modeling/my_custom_model
The above succeeds on my instance of the container.
Hi @gs-olive, on my side your example now also works fine! Analogous to your example, I made another folder in …
Then I run the following:
The model also compiles successfully, but when running the executable, the torch-tensorrt compile function again gets stuck in an infinite loop...
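(A suggestion, as a sketch only: when the compile call hangs like this, raising the Torch-TensorRT log level before calling compile can help show which lowering pass it stalls in. This uses the public logging header that the demo run.cpp already includes; the helper function name is made up for illustration:)
#include "torch_tensorrt/logging.h"
// Call this before torch_tensorrt::torchscript::compile() so the lowering and
// conversion phases log their progress (kDEBUG is verbose, kGRAPH also prints graphs).
void enable_verbose_trt_logging() {
  torch_tensorrt::logging::set_reportable_log_level(torch_tensorrt::logging::Level::kDEBUG);
  torch_tensorrt::logging::set_is_colored_output_on(true);
}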
Thanks for the update, and good to hear it gets through more of the compilation! I just rebased the …
I have really good news! The torch-tensorrt compilation of my minimal example finally works inside the container! The issue of the infinite loop is solved! Great, thanks a lot!
From which folder inside the running container am I supposed to copy the built torch-tensorrt files? I want to be able to compile my minimal example outside the docker container as well. However, when compiling the example on my system outside the container (with CMake), there is a linking error. Locally I have symlinks to the respective libraries:
Great to hear it works on the example now! After discussing with @narendasan, there are a few options to use the example outside the Docker container. One is to copy the …
Lines 34 to 38 in e884820
Then, you can rebuild the bazel target similarly to what you did from the Docker container.
Alternatively, you could rebuild the library on your local machine and add a new target for the …
Short update: only applying our change is not enough; Bazel somehow cannot fetch the libraries:
I therefore added instead:
Also, the local cudnn & torch_tensorrt are not found, so I changed the paths from …
The problem then is the following:
which looks like it is related to this issue: #45, although I am not sure which lines to modify in …
PS: in WORKSPACE I also had to remove the "-" in the workspace name ("Torch-TensorRT" --> "TorchTensorRT")
This is fine as long as 2.1.0.dev20230703+cu118 is the version you used to build torch-tensorrt in the container, or you are building from source (make sure to set the CUDA version to 11.8 for that dependency in the workspace).
For building with Bazel, the easiest (i.e. least error-prone) way to include the cudnn and tensorrt dependencies is to use the tarballs, without unpacking them, as inputs to http_archive. On my systems, after downloading the correct builds from developer.nvidia.com, my workspace looks like this (give or take a few changes related to version upgrades):
workspace(name = "Torch-TensorRT")
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
http_archive(
name = "rules_python",
sha256 = "863ba0fa944319f7e3d695711427d9ad80ba92c6edd0b7c7443b84e904689539",
strip_prefix = "rules_python-0.22.0",
url = "https://github.com/bazelbuild/rules_python/releases/download/0.22.0/rules_python-0.22.0.tar.gz",
)
load("@rules_python//python:repositories.bzl", "py_repositories")
py_repositories()
http_archive(
name = "rules_pkg",
sha256 = "8f9ee2dc10c1ae514ee599a8b42ed99fa262b757058f65ad3c384289ff70c4b8",
urls = [
"https://mirror.bazel.build/github.com/bazelbuild/rules_pkg/releases/download/0.9.1/rules_pkg-0.9.1.tar.gz",
"https://github.com/bazelbuild/rules_pkg/releases/download/0.9.1/rules_pkg-0.9.1.tar.gz",
],
)
load("@rules_pkg//:deps.bzl", "rules_pkg_dependencies")
rules_pkg_dependencies()
http_archive(
name = "googletest",
sha256 = "755f9a39bc7205f5a0c428e920ddad092c33c8a1b46997def3f1d4a82aded6e1",
strip_prefix = "googletest-5ab508a01f9eb089207ee87fd547d290da39d015",
urls = ["https://github.com/google/googletest/archive/5ab508a01f9eb089207ee87fd547d290da39d015.zip"],
)
# External dependency for torch_tensorrt if you already have precompiled binaries.
local_repository(
name = "torch_tensorrt",
path = "/opt/conda/lib/python3.8/site-packages/torch_tensorrt",
)
# CUDA should be installed on the system locally
new_local_repository(
name = "cuda",
build_file = "@//third_party/cuda:BUILD",
path = "/usr/local/cuda-12.1/",
)
#############################################################################################################
# Tarballs and fetched dependencies (default - use in cases when building from precompiled bin and tarballs)
#############################################################################################################
http_archive(
name = "libtorch",
build_file = "@//third_party/libtorch:BUILD",
sha256 = "1ae8366aaf7af7f68f142ba644fe26c837c6fa8347ec6bd9ce605ac60e7f7e5e",
strip_prefix = "libtorch",
urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-cxx11-abi-shared-with-deps-2.1.0.dev20230703%2Bcu121.zip"],
)
http_archive(
name = "libtorch_pre_cxx11_abi",
build_file = "@//third_party/libtorch:BUILD",
sha256 = "9add4832f4da9223866d85810820b816ab3319d5a227066101eeb6cbb76adb4b",
strip_prefix = "libtorch",
urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-shared-with-deps-2.1.0.dev20230703%2Bcu121.zip"],
)
http_archive(
name = "cudnn",
build_file = "@//third_party/cudnn/archive:BUILD",
sha256 = "79d77a769c7e7175abc7b5c2ed5c494148c0618a864138722c887f95c623777c",
strip_prefix = "cudnn-linux-x86_64-8.8.1.3_cuda12-archive",
urls = [
"file:///<ABSOLUTE PATH TO DOWNLOAD ON SYSTEM>/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz",
"https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz",
],
)
http_archive(
name = "tensorrt",
build_file = "@//third_party/tensorrt/archive:BUILD",
sha256 = "0f8157a5fc5329943b338b893591373350afa90ca81239cdadd7580cd1eba254",
strip_prefix = "TensorRT-8.6.1.6",
urls = [
"file:///<ABSOLUTE PATH TO DOWNLOAD ON SYSTEM>/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz",
"https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/8.6.1/tars/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz",
],
)
#########################################################################
# Development Dependencies (optional - comment out on aarch64)
#########################################################################
load("@rules_python//python:pip.bzl", "pip_parse")
pip_parse(
name = "devtools_deps",
requirements = "//:requirements-dev.txt",
)
load("@devtools_deps//:requirements.bzl", "install_deps")
install_deps()
The …
Hi, the reason why my libtorch …
For the cudnn & tensorrt issue I now use, as you told me, the tar files, which work fine! However, when building the example, Bazel reports some issues with libtorch(?):
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
Bug Description
After calling
auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
the process gets stuck in an infinite(?) loop. I can also observe that the GPU load drops back to 0% after about 1s.
According to #1409, the issue should already have been fixed.
Error message
1 __memmove_avx_unaligned 0x7fff79289cc1
2 std::vector<torch::jit::Use>::_M_erase(__gnu_cxx::__normal_iterator<torch::jit::Use *, std::vector<torch::jit::Use>>) 0x7fffab48412f
3 torch::jit::Value::replaceFirstUseWith(torch::jit::Value *) 0x7fffab46ff5d
4 torch::jit::Value::replaceAllUsesWith(torch::jit::Value *) 0x7fffab46ffcb
5 torch::jit::EliminateExceptions(torch::jit::Block *) 0x7fffab63c3c9
6 torch::jit::EliminateExceptions(std::shared_ptr<torch::jit::Graph>&) 0x7fffab63c999
7 torch_tensorrt::core::lowering::LowerGraph(std::shared_ptr<torch::jit::Graph>&, std::vector<c10::IValue>&, torch_tensorrt::core::lowering::LowerInfo) 0x7fffd7426b0d
8 torch_tensorrt::core::lowering::Lower(torch::jit::Module const&, std::string, torch_tensorrt::core::lowering::LowerInfo const&) 0x7fffd742a181
9 torch_tensorrt::core::CompileGraph(torch::jit::Module const&, torch_tensorrt::core::CompileSpec) 0x7fffd732b5a8
10 torch_tensorrt::torchscript::compile(torch::jit::Module const&, torch_tensorrt::torchscript::CompileSpec) 0x7fffd7313a04
11 ModelLoader::optimizeWithTensorRT modelloader.cpp 266 0x5ad43c
12 InferenceDisplay::<lambda()>::<lambda()>::operator() inferencedisplay.cpp 1330 0x58c996
13 std::_Function_handler<void(), InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>::<lambda()>>::_M_invoke(const std::_Any_data &) std_function.h 316 0x58c996
14 std::function<void ()>::operator()() const std_function.h 706 0x5cbcca
15 errorwrapper::loading(std::function<void ()>) errorwrapper.cpp 11 0x5cbcca
16 InferenceDisplay::<lambda()>::operator() inferencedisplay.cpp 1333 0x58e127
17 QtPrivate::FunctorCall<QtPrivate::IndexesList<>, QtPrivate::List<>, void, InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>>::call qobjectdefs_impl.h 146 0x58e127
18 QtPrivate::Functor<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0>::call<QtPrivate::List<>, void> qobjectdefs_impl.h 256 0x58e127
19 QtPrivate::QFunctorSlotObject<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0, QtPrivate::List<>, void>::impl(int, QtPrivate::QSlotObjectBase *, QObject *, void * *, bool *) qobjectdefs_impl.h 439 0x58e127
20 QMetaObject::activate(QObject *, int, int, void * *) 0x7fff7a163f8f
...
Expected behavior
Successful torch-tensorrt optimization of a TorchScript model.
Environment