Interoperability with half-precision #3

Open
georgebisbas opened this issue Dec 9, 2020 · 0 comments

Dear all,

Thank you for these beautiful examples, they are really helpful!
My apologies if this is not the right place to ask; feel free to close this issue.
I am not opening this issue to report a problem;
rather, I am trying to get OpenACC interoperability working with the cuda_fp16 half-precision intrinsics.
I have looked at both openacc_c_main and openacc_cuda_device for inspiration.

My changes are here, for openacc_cuda_device: master...georgebisbas:wip_fp16

I am working on a V100 and I am using:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

and

pgcc --version
pgcc (aka nvc) 20.7-0 LLVM 64-bit target on x86-64 Linux -tp skylake 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

The code compiles:

$ make openacc_cuda_device
nvc++ -fast -acc -Minfo=all -gpu= cc75 -c openacc_cuda_device.cpp
"openacc_cuda_device.cpp", line 19: warning: variable "tmp" was declared but
          never referenced
    float *x, *y, tmp;
                  ^

main:
     34, Generating copyout(y[:n]) [if not already present]
         Generating create(x[:n]) [if not already present]
     37, Loop is parallelizable
         Generating Tesla code
         37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     37, Complex loop carried dependence of x-> prevents parallelization
         Loop carried dependence of y-> prevents parallelization
         Loop not fused: complex flow graph
         Loop not vectorized: data dependency
         Generated vector simd code for the loop
         Loop unrolled 8 times
     45, Generating Tesla code
         45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     45, Loop not vectorized/parallelized: contains call
nvc++ -o openacc_cuda_device -fast -acc -Minfo=all -gpu= cc75 saxpy_cuda_device.o openacc_cuda_device.o -Mcuda 

but it seems to crash when calling foo:

$ ./openacc_cuda_device 
c = 0.160000

So far I have been able to compile and run, without trouble, the mixed-precision code at https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/mixed-precision,
plain OpenACC code, and the OpenACC+CUDA examples in this repository (openacc-interoperability).

Any insight would be extremely helpful.
Regards,
--George
