
Add support for @cuda fastmath #2030

Merged
maleadt merged 4 commits into master from tb/fastmath on Aug 17, 2023
Conversation

maleadt (Member) commented Aug 14, 2023

maleadt added the "enhancement" (New feature or request) label on Aug 14, 2023
codecov bot commented Aug 14, 2023

Codecov Report

Patch coverage is unchanged; project coverage decreased by 0.12%. ⚠️

The comparison base (eb2eaf1) has 61.04% coverage, compared to 60.92% at head (a2d3219).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2030      +/-   ##
==========================================
- Coverage   61.04%   60.92%   -0.12%     
==========================================
  Files         152      152              
  Lines       13291    13297       +6     
==========================================
- Hits         8113     8101      -12     
- Misses       5178     5196      +18     
Files Changed Coverage Δ
src/compiler/execution.jl 85.36% <ø> (ø)

... and 7 files with indirect coverage changes


Zentrik (Contributor) commented Aug 14, 2023

I've added the test here: Zentrik@e87fa74. I don't think I can commit directly to this PR.

maleadt marked this pull request as ready for review on August 15, 2023 07:13
maleadt (Member, Author) commented Aug 15, 2023

Thanks. I guess that test doesn't work on CUDA 11.0, where we generate:

//
// Generated by LLVM NVPTX Back-End
//

.version 7.0
.target sm_80
.address_size 64

	// .globl	julia_sqrt_kernel_30301 // -- Begin function julia_sqrt_kernel_30301
.extern .func julia__throw_boundserror_30309
()
;
                                        // @julia_sqrt_kernel_30301
.visible .func julia_sqrt_kernel_30301(
	.param .b64 julia_sqrt_kernel_30301_param_0
)
{
	.reg .pred 	%p<2>;
	.reg .b32 	%r<3>;
	.reg .f32 	%f<3>;
	.reg .b64 	%rd<8>;

// %bb.0:                               // %top
	ld.param.u64 	%rd1, [julia_sqrt_kernel_30301_param_0];
	mov.u32 	%r1, %tid.x;
	add.s32 	%r2, %r1, 1;
	ld.u64 	%rd2, [%rd1+16];
	max.s64 	%rd3, %rd2, 0;
	cvt.u64.u32 	%rd4, %r2;
	setp.ge.u64 	%p1, %rd3, %rd4;
	@%p1 bra 	LBB0_2;
	bra.uni 	LBB0_1;
LBB0_2:                                 // %L54
	ld.u64 	%rd5, [%rd1];
	mul.wide.u32 	%rd6, %r1, 4;
	add.s64 	%rd7, %rd5, %rd6;
	ld.global.f32 	%f1, [%rd7];
	sqrt.rn.f32 	%f2, %f1;
	st.global.f32 	[%rd7], %f2;
	ret;
LBB0_1:                                 // %L51
	{ // callseq 3273, 0
	.reg .b32 temp_param_reg;
	call.uni
	julia__throw_boundserror_30309,
	(
	);
	} // callseq 3273
                                        // -- End function
}

Zentrik (Contributor) commented Aug 16, 2023

I haven't been able to figure out why we don't generate a sqrt.approx in the PTX, and since the sqrt.rn shows up before we run the NVVM Reflect pass, I don't think it will be easily fixable. In the meantime we could run the test only for CUDA.runtime_version() > v"11.1-", as in the sketch below.

maleadt (Member, Author) commented Aug 17, 2023

I haven't been able to figure out why we don't generate a sqrt.approx in the ptx

The reason is libdevice itself, where this is the definition for __nv_sqrtf on CUDA 11.0:

define float @__nv_sqrtf(float %x) #0 {
  %1 = call float @llvm.nvvm.sqrt.f(float %x)
  ret float %1
}

... vs CUDA 11.1:

define float @__nv_sqrtf(float %x) #0 {
  %1 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i32 0, i32 0)) #6
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %3, label %10

3:                                                ; preds = %0
  %4 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i32 0, i32 0)) #6
  %5 = icmp ne i32 %4, 0
  br i1 %5, label %6, label %8

6:                                                ; preds = %3
  %7 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

8:                                                ; preds = %3
  %9 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

10:                                               ; preds = %0
  %11 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i32 0, i32 0)) #6
  %12 = icmp ne i32 %11, 0
  br i1 %12, label %13, label %15

13:                                               ; preds = %10
  %14 = call float @llvm.nvvm.sqrt.rn.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

15:                                               ; preds = %10
  %16 = call float @llvm.nvvm.sqrt.approx.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

__nvvm_sqrt_f.exit:                               ; preds = %6, %8, %13, %15
  %.0 = phi float [ %7, %6 ], [ %9, %8 ], [ %14, %13 ], [ %16, %15 ]
  ret float %.0
}

So yeah, this is expected: on CUDA 11.0, libdevice lowers __nv_sqrtf directly to llvm.nvvm.sqrt.f without ever consulting __nvvm_reflect, so there is no sqrt.approx branch for the fast-math path to select.

maleadt merged commit fade845 into master on Aug 17, 2023
maleadt deleted the tb/fastmath branch on August 17, 2023 19:09
Labels: enhancement (New feature or request)
Closes: How to set fast math for CUDA