Interoperability with half-precision #3

georgebisbas · 2020-12-09T18:58:57Z

Dear all,

Thank you for these beautiful examples, they are really helpful!
My apologies if this is not the right place to ask; feel free to close this issue.
I am not opening this issue because I have a problem,
but rather because I am trying to combine OpenACC interoperability with the cuda_fp16 half-precision intrinsics.
I have looked at both openacc_c_main and openacc_cuda_device for inspiration.

My changes to openacc_cuda_device are here: master...georgebisbas:wip_fp16

I am working on a V100 and I am using:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

and

 pgcc --version

pgcc (aka nvc) 20.7-0 LLVM 64-bit target on x86-64 Linux -tp skylake 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

The code compiles:

$ make openacc_cuda_device
nvc++ -fast -acc -Minfo=all -gpu= cc75 -c openacc_cuda_device.cpp
"openacc_cuda_device.cpp", line 19: warning: variable "tmp" was declared but
          never referenced
    float *x, *y, tmp;
                  ^

main:
     34, Generating copyout(y[:n]) [if not already present]
         Generating create(x[:n]) [if not already present]
     37, Loop is parallelizable
         Generating Tesla code
         37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     37, Complex loop carried dependence of x-> prevents parallelization
         Loop carried dependence of y-> prevents parallelization
         Loop not fused: complex flow graph
         Loop not vectorized: data dependency
         Generated vector simd code for the loop
         Loop unrolled 8 times
     45, Generating Tesla code
         45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     45, Loop not vectorized/parallelized: contains call
nvc++ -o openacc_cuda_device -fast -acc -Minfo=all -gpu= cc75 saxpy_cuda_device.o openacc_cuda_device.o -Mcuda

but it seems to crash when calling foo:

$ ./openacc_cuda_device 
c = 0.160000

So far I have been able to compile and run without trouble the mixed-precision code samples at https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/mixed-precision,
plain OpenACC code, and the OpenACC+CUDA examples in this repository (openacc-interoperability).

Any insight would be extremely helpful.
Regards,
--George
