
Add support for @cuda fastmath #2030

Merged
maleadt merged 4 commits into master from tb/fastmath on Aug 17, 2023
Conversation

maleadt (Member) commented Aug 14, 2023

maleadt added the "enhancement" (New feature or request) label on Aug 14, 2023
codecov bot commented Aug 14, 2023

Codecov Report

Patch coverage is unchanged; project coverage decreased by 0.12%. ⚠️

The comparison base (eb2eaf1) has 61.04% coverage, compared to 60.92% at head (a2d3219).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2030      +/-   ##
==========================================
- Coverage   61.04%   60.92%   -0.12%     
==========================================
  Files         152      152              
  Lines       13291    13297       +6     
==========================================
- Hits         8113     8101      -12     
- Misses       5178     5196      +18     
Files Changed Coverage Δ
src/compiler/execution.jl 85.36% <ø> (ø)

... and 7 files with indirect coverage changes


Zentrik (Contributor) commented Aug 14, 2023

I've added the test here: Zentrik@e87fa74. I don't think I can commit directly to this PR.

maleadt marked this pull request as ready for review on August 15, 2023 07:13
maleadt (Member, Author) commented Aug 15, 2023

Thanks. I guess that test doesn't work on CUDA 11.0, where we generate:

//
// Generated by LLVM NVPTX Back-End
//

.version 7.0
.target sm_80
.address_size 64

	// .globl	julia_sqrt_kernel_30301 // -- Begin function julia_sqrt_kernel_30301
.extern .func julia__throw_boundserror_30309
()
;
                                        // @julia_sqrt_kernel_30301
.visible .func julia_sqrt_kernel_30301(
	.param .b64 julia_sqrt_kernel_30301_param_0
)
{
	.reg .pred 	%p<2>;
	.reg .b32 	%r<3>;
	.reg .f32 	%f<3>;
	.reg .b64 	%rd<8>;

// %bb.0:                               // %top
	ld.param.u64 	%rd1, [julia_sqrt_kernel_30301_param_0];
	mov.u32 	%r1, %tid.x;
	add.s32 	%r2, %r1, 1;
	ld.u64 	%rd2, [%rd1+16];
	max.s64 	%rd3, %rd2, 0;
	cvt.u64.u32 	%rd4, %r2;
	setp.ge.u64 	%p1, %rd3, %rd4;
	@%p1 bra 	LBB0_2;
	bra.uni 	LBB0_1;
LBB0_2:                                 // %L54
	ld.u64 	%rd5, [%rd1];
	mul.wide.u32 	%rd6, %r1, 4;
	add.s64 	%rd7, %rd5, %rd6;
	ld.global.f32 	%f1, [%rd7];
	sqrt.rn.f32 	%f2, %f1;
	st.global.f32 	[%rd7], %f2;
	ret;
LBB0_1:                                 // %L51
	{ // callseq 3273, 0
	.reg .b32 temp_param_reg;
	call.uni
	julia__throw_boundserror_30309,
	(
	);
	} // callseq 3273
                                        // -- End function
}

Zentrik (Contributor) commented Aug 16, 2023

I haven't been able to figure out why we don't generate a sqrt.approx in the PTX, and since the sqrt.rn shows up before we run the NVVM Reflect pass, I don't think it will be easily fixable. In the meantime we could run the test only for CUDA.runtime_version() > v"11.1-", as in the sketch below.

maleadt (Member, Author) commented Aug 17, 2023

I haven't been able to figure out why we don't generate a sqrt.approx in the ptx

The reason is libdevice itself, where this is the definition for __nv_sqrtf on CUDA 11.0:

define float @__nv_sqrtf(float %x) #0 {
  %1 = call float @llvm.nvvm.sqrt.f(float %x)
  ret float %1
}

... vs CUDA 11.1:

define float @__nv_sqrtf(float %x) #0 {
  %1 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i32 0, i32 0)) #6
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %3, label %10

3:                                                ; preds = %0
  %4 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i32 0, i32 0)) #6
  %5 = icmp ne i32 %4, 0
  br i1 %5, label %6, label %8

6:                                                ; preds = %3
  %7 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

8:                                                ; preds = %3
  %9 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

10:                                               ; preds = %0
  %11 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @.str.2, i32 0, i32 0)) #6
  %12 = icmp ne i32 %11, 0
  br i1 %12, label %13, label %15

13:                                               ; preds = %10
  %14 = call float @llvm.nvvm.sqrt.rn.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

15:                                               ; preds = %10
  %16 = call float @llvm.nvvm.sqrt.approx.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

__nvvm_sqrt_f.exit:                               ; preds = %6, %8, %13, %15
  %.0 = phi float [ %7, %6 ], [ %9, %8 ], [ %14, %13 ], [ %16, %15 ]
  ret float %.0
}

So yeah, this is expected: on CUDA 11.0, libdevice lowers __nv_sqrtf directly to llvm.nvvm.sqrt.f without ever consulting __nvvm_reflect, so there is no sqrt.approx branch for the fast-math path to select.

maleadt merged commit fade845 into master on Aug 17, 2023
maleadt deleted the tb/fastmath branch on August 17, 2023 19:09
Labels: enhancement (New feature or request)
Closes: How to set fast math for CUDA