Gauss kernel initialization: unknown error #55

Open

fabiencastan opened this issue Dec 10, 2018 · 11 comments

@fabiencastan
Member

[17:41:52.100709][info] CUDA-Enabled GPU.
Device information:
	- id:                      0
	- name:                    Quadro K4000
	- compute capability:      3.0
	- total device memory:     3071 MB 
	- device memory available: 1303 MB 
	- per-block shared memory: 49152
	- warp size:               32
	- max threads per block:   1024
	- max threads per SM(X):   2048
	- max block sizes:         {1024,1024,64}
	- max grid sizes:          {2147483647,65535,65535}
	- max 2D array texture:    {65536,65536}
	- max 3D array texture:    {4096,4096,4096}
	- max 2D linear texture:   {65000,65000,1048544}
	- max 2D layered texture:  {16384,16384,2048}
	- number of SM(x)s:        4
	- registers per SM(x):     65536
	- registers per block:     65536
	- concurrent kernels:      yes
	- mapping host memory:     yes
	- unified addressing:      yes
	- texture alignment:       512 byte
	- pitch alignment:         32 byte

[17:41:52.176080][info] Supported CUDA-Enabled GPU detected.
[17:41:53.197578][info] Extracting sift features from view '/xxxx/footage/LionWood/IMG_3678.JPG' [gpu]
Choosing device 0: Quadro K4000
/xxxx/popsift/develop/repo/src/popsift/gauss_filter.cu:245
    cudaMemcpyToSymbol failed for Gauss kernel initialization: unknown error

Does anyone have any idea where this error comes from?

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@simogasp
Member

simogasp commented May 11, 2020

I sometimes get a similar error on an MBP when there are other programs running that use the GPU (typically Chrome). Closing the program(s) makes the system switch to the integrated graphics card, and then I am able to use popsift. Probably a memory size limit?
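
One way to check that theory is to query free device memory right before the failing call; a minimal sketch (not popsift code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main( )
    {
        size_t free_b  = 0;
        size_t total_b = 0;
        // cudaMemGetInfo reports free and total global memory on the current device
        cudaError_t err = cudaMemGetInfo( &free_b, &total_b );
        if( err != cudaSuccess ) {
            fprintf( stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString( err ) );
            return 1;
        }
        printf( "free: %zu MB / total: %zu MB\n", free_b >> 20, total_b >> 20 );
        return 0;
    }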

@taketwo

taketwo commented May 12, 2020

I get the same error:

$ ./popsift-demo -i image.png --print-dev-info
PopSift version: 1.0.0
image.png
Choosing device 0: GeForce RTX 2070
Device information:
    Name: GeForce RTX 2070
    Compute Capability:    7.5
    Total device mem:      8366915584 B 8170816 kB 7979 MB
    Per-block shared mem:  49152
    Warp size:             32
    Max threads per block: 1024
    Max threads per SM(X): 1024
    Max block sizes:       {1024,1024,64}
    Max grid sizes:        {2147483647,65535,65535}
    Number of SM(x)s:      36
    Concurrent kernels:    yes
    Mapping host memory:   yes
    Unified addressing:    yes

/code/my_projects/popsift/src/popsift/gauss_filter.cu:245
    cudaMemcpyToSymbol failed for Gauss kernel initialization: invalid device symbol

According to nvidia-smi, only 600 MiB out of 8000 MiB on the GPU are occupied by other processes.

@simogasp
Member

Thanks for the report.
I guess we need to call in the big guns :-) @griwodz

@taketwo

taketwo commented May 12, 2020

Just a few more data points. Out of curiosity, I commented out that memory copy. This led to a similar error in uploading SIFT constants:

/code/my_projects/popsift/src/popsift/sift_constants.cu:52
    Failed to upload h_consts to device: invalid device symbol

Commenting this one out, I hit the next one at:

/code/my_projects/popsift/src/popsift/common/debug_macros.cu:24
    called from /code/my_projects/popsift/src/popsift/s_pyramid_build.cu:125
    cudaGetLastError failed: invalid device function
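
For reference, the check that fires there is the usual launch-then-poll pattern; a rough sketch (not the actual popsift source):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy( float* out )
    {
        out[ threadIdx.x ] = 1.0f;
    }

    void launch_and_check( float* d_out )
    {
        dummy<<<1, 32>>>( d_out );
        // "invalid device function" from this check usually means the binary
        // contains no device code matching the GPU's compute capability
        cudaError_t err = cudaGetLastError( );
        if( err != cudaSuccess )
            fprintf( stderr, "cudaGetLastError: %s\n", cudaGetErrorString( err ) );
    }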

@griwodz
Member

griwodz commented May 12, 2020

Huh. You are getting different error messages on the K4000 and the RTX 2070. That's weird.

Could you try to move the __device__ __constant__ GaussInfo d_gauss; in gauss_filter.cu out of the namespace popsift? Since the binding is symbolic, it is possible that something has changed and the namespace is now a problem.

Another possibility, but I wouldn't know why that should happen if your system has only 1 CUDA card, is that cudaMemcpyToSymbol cannot figure out which card you are trying to use. The constant memory should exist on all CUDA cards anyway. That could be tested by adding a call to cudaSetDevice(0); at the top of the init_filter function (that would be just for testing, not a long-term solution).
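
Both tests together would look roughly like this (a sketch only: GaussInfo is a stand-in here, and the real init_filter in gauss_filter.cu has a different signature):

    #include <cstdio>
    #include <cuda_runtime.h>

    struct GaussInfo { float filter[ 128 ]; }; // stand-in for the real popsift struct

    __device__ __constant__ GaussInfo d_gauss; // at file scope, outside namespace popsift

    void init_filter( )
    {
        cudaSetDevice( 0 ); // testing only: pin symbol resolution to device 0
        GaussInfo h_gauss = { };
        cudaError_t err = cudaMemcpyToSymbol( d_gauss, &h_gauss, sizeof( GaussInfo ) );
        if( err != cudaSuccess )
            fprintf( stderr, "cudaMemcpyToSymbol: %s\n", cudaGetErrorString( err ) );
    }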

@griwodz
Member

griwodz commented May 12, 2020

The amount of constant memory on a CUDA card is quite limited, but all documentation insists that this is because the constant cache size is limited.
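
The per-device limit can be queried directly (it is 64 KB on every card I have seen); a quick sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main( )
    {
        cudaDeviceProp prop;
        if( cudaGetDeviceProperties( &prop, 0 ) != cudaSuccess )
            return 1;
        // totalConstMem is the total amount of __constant__ memory on the device
        printf( "constant memory on device 0: %zu bytes\n", prop.totalConstMem );
        return 0;
    }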

Do you have any hints on how I can recreate the error (on Linux)?

@taketwo

taketwo commented May 13, 2020

Hi, thanks for your answer. I've tried both moving d_gauss out of the popsift namespace and setting the device explicitly, all to no avail.

I did not mention it before, but I am running PopSift in a Docker container. This morning I tried to build and run it on the host system directly, and there were no issues.

Do you have any hints on how I can recreate the error (on Linux)?

Unfortunately, the Docker image I use is proprietary, so I cannot share it. Instead, I've tried to create a minimal image based on nvidia/cuda with the same Ubuntu/CUDA version. To my surprise, when I compile and run PopSift there, it also has no issues. So, apparently, there is something very special about my proprietary image. I'm investigating further, but if you have any ideas or hints about what could be tried, please let me know.

@griwodz
Member

griwodz commented May 13, 2020

Is it possible that your main Docker container uses a different CUDA SDK than the host machine, but your test container uses the same SDK as the host?

Since late CUDA 10, NVidia has been trying to do something about the compatibility hassle (as described here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html), but I have not looked at those compatibility libraries at all.

@taketwo

taketwo commented May 13, 2020

My "main" container is based on the nvidia/cudagl:10.0-devel-ubuntu18.04 image; I used the same one for my "test" container. So both of them have CUDA 10.0. On my host system I used to have 10.2, however this morning I downgraded it to 10.0 just in case. nvidia-smi still reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+

but as far as I understand, this is because the driver is tied to a certain CUDA version regardless of which CUDA SDK is actually installed. Also, according to the info on the page you posted, this driver is compatible with all 10.x versions.

I'm currently trying to "bisect" the layers of the "main" container to find the one that introduces the problem.

@taketwo

taketwo commented May 13, 2020

I found the cause, and it (seemingly) has nothing to do with CUDA and/or Docker. In my "main" container the lld linker is installed and set up to be used by default. That's all. Switching back to the standard gold linker eliminates the issue.

In case you want to reproduce this and check what's going on, simply install the lld package and add the following to the root CMakeLists.txt of PopSift:

set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fuse-ld=lld")
