
[libspirv] Reuse clc function in generic math/common built-ins #19231

Open
wants to merge 3 commits into base: sycl

Conversation

wenju-he
Contributor

@wenju-he wenju-he commented Jul 1, 2025

Delete unused .inc and .h files. Delete unused tables.cl.

Delete unused .inc and .h files. Delete unused tables.cl.
@wenju-he wenju-he requested a review from a team as a code owner July 1, 2025 09:03
@wenju-he wenju-he requested a review from frasercrmck July 1, 2025 09:03
Contributor

@frasercrmck frasercrmck left a comment


Good stuff, thank you.

I have been meaning to ask, are you testing all of these "generic" changes downstream with the SYCL-CTS? We unfortunately don't have great coverage of the "generic" implementations in pre-commit CI.

@frasercrmck frasercrmck added the libclc libclc project related issues label Jul 1, 2025
@bader
Contributor

bader commented Jul 1, 2025

👍

@wenju-he
Contributor Author

wenju-he commented Jul 2, 2025

I have been meaning to ask, are you testing all of these "generic" changes downstream with the SYCL-CTS? We unfortunately don't have great coverage of the "generic" implementations in pre-commit CI.

@frasercrmck Not yet, but these generic implementations will be tested in downstream SYCL-CTS CI in the near future, except for a few built-ins that have hardware support, e.g. float sin. My plan is:

  1. Compare the performance of the two kinds of generic implementations, e.g.
    _CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE __clc_sin(__CLC_GENTYPE x) {
    and https://github.com/intel/intel-graphics-compiler/blob/7c04a1ec340f1a0b79460cb61982e0499efda1b6/IGC/BiFModule/Implementation/IMF/FP64/sin_d_ha.cl#L431, and upstream the Intel implementation to clc if it performs better.
  2. Switch to the libspirv generic implementation. The generic implementations in this PR will then be tested.

For this PR, I only checked llvm-diff output. There are two main changes:

  1. In libspirv-nvptx64--nvidiacl.bc, uses of __spirv_IsInf and __spirv_IsNan are replaced with the generic implementation. For instance, the diff in _Z17__spirv_ocl_fractfPf is:
in function _Z17__spirv_ocl_fractfPf:
    Old:
    >   %call.i = tail call i32 @__nv_isinff(float noundef %x) #44
    >   %tobool.i.not = icmp eq i32 %call.i, 0
    >   %cond = select i1 %tobool.i.not, float %0, float 0.000000e+00
    >   %call.i1 = tail call i32 @__nv_isnanf(float noundef %x) #44
    >   %tobool.i2.not = icmp eq i32 %call.i1, 0
    >   %cond11 = select i1 %tobool.i2.not, float %cond, float %x
    >   ret float %cond11
    New:
    <   %1 = tail call float @llvm.fabs.f32(float %x)
    <   %2 = fcmp une float %1, 0x7FF0000000000000
    <   %cond.i = select i1 %2, float %0, float 0.000000e+00
    <   %3 = fcmp ord float %x, 0.000000e+00
    <   %cond8.i = select i1 %3, float %cond.i, float %x
    <   ret float %cond8.i
  2. Better vectorization in this PR, since the clc implementation vectorizes better.

steffenlarsen pushed a commit that referenced this pull request Jul 2, 2025
…arg (#146495) (#19255)

There is no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc because
__opencl_c_generic_address_space is not defined for them.

(cherry picked from commit b0e6faa)
This PR fixes pre-commit CI error in #19231: undefined hidden symbol:
__clc_lgamma_r(float, int*)
@wenju-he
Contributor Author

wenju-he commented Jul 3, 2025

Sorry, after looking at llvm-diff again, I found the following unexpected change:

  in block %if.then13.i / %if.then12:
    OLD:
    >   %call.i4 = tail call float @__nv_logf(float noundef %0) #44
    >   %fneg37 = fneg float %call.i4
    NEW:
    <   %elt.abs.i.i191 = tail call noundef float @llvm.fabs.f32(float %sub26.i)
    <   %cmp.i192 = fcmp olt float %elt.abs.i.i191, 6.250000e-02
    <   %add.i193 = fadd float %sub26.i, 2.000000e+00
    <   %div.i194 = fdiv float %sub26.i, %add.i193, !fpmath !6
    <   %add1.i195 = fadd float %div.i194, %div.i194
    <   %mul2.i196 = fmul float %add1.i195, %add1.i195
    <   %13 = tail call noundef float @llvm.fmuladd.f32(float %mul2.i196, float 0x3F899999A0000000, float 0x3FB5555560000000)
    <   %mul4.i197 = fmul float %mul2.i196, %13
    <   %14 = fneg float %sub26.i
    <   %fneg.i198 = fmul float %div.i194, %14
    <   %15 = tail call noundef float @llvm.fmuladd.f32(float %add1.i195, float %mul4.i197, float %fneg.i198)
    <   %astype6.i199 = bitcast float %sub26.i to i32
    <   %and7.i200 = and i32 %astype6.i199, -65536
    <   %astype8.i201 = bitcast i32 %and7.i200 to float
    <   %sub9.i202 = fsub float %sub26.i, %astype8.i201
    <   %add10.i203 = fadd float %sub9.i202, %15
    <   %mul11.i204 = fmul float %add10.i203, 0x3F75476520000000
    <   %16 = tail call noundef float @llvm.fmuladd.f32(float %astype8.i201, float 0x3F75476520000000, float %mul11.i204)
    <   %17 = tail call noundef float @llvm.fmuladd.f32(float %add10.i203, float 1.437500e+00, float %16)
    <   %18 = tail call noundef float @llvm.fmuladd.f32(float %astype8.i201, float 1.437500e+00, float %17)
    <   %shr.i205 = lshr i32 %astype1.i, 23
    <   %sub15.i206 = add nsw i32 %shr.i205, -127
    <   %or.i207 = or i32 %astype1.i, 1065353216
    <   %astype16.i208 = bitcast i32 %or.i207 to float
    <   %sub17.i209 = fadd float %astype16.i208, -1.000000e+00
    <   %astype18.i210 = bitcast float %sub17.i209 to i32
    <   %shr19.i211 = lshr i32 %astype18.i210, 23
    <   %sub20.i212 = add nsw i32 %shr19.i211, -253
    <   %cmp21.i213 = icmp samesign ult i32 %astype1.i, 8388608
    <   %cond.i214 = select i1 %cmp21.i213, i32 %sub20.i212, i32 %sub15.i206
    <   %cond27.i215 = select i1 %cmp21.i213, i32 %astype18.i210, i32 %astype1.i
    <   %conv28.i216 = sitofp i32 %cond.i214 to float
    <   %and29.i217 = and i32 %cond27.i215, 8323072
    <   %and30.i218 = shl i32 %cond27.i215, 1
    <   %shl.i219 = and i32 %and30.i218, 65536
    <   %add31.i220 = add nuw nsw i32 %shl.i219, %and29.i217
    <   %or32.i221 = or disjoint i32 %add31.i220, 1056964608
    <   %astype33.i222 = bitcast i32 %or32.i221 to float
    <   %and34.i223 = and i32 %cond27.i215, 8388607
    <   %or35.i224 = or disjoint i32 %and34.i223, 1056964608
    <   %astype36.i225 = bitcast i32 %or35.i224 to float
    <   %sub37.i226 = fsub float %astype33.i222, %astype36.i225
    <   %shr38.i227 = lshr exact i32 %add31.i220, 16
    <   %idxprom.i296 = zext nneg i32 %shr38.i227 to i64
    <   %arrayidx.i297 = getelementptr inbounds nuw [129 x float], ptr addrspace(4) @LOG_INV_TBL, i64 0, i64 %idxprom.i296
    <   %19 = load float, ptr addrspace(4) %arrayidx.i297, align 4, !tbaa !7
    <   %mul40.i229 = fmul float %sub37.i226, %19
    <   %20 = tail call noundef float @llvm.fmuladd.f32(float %mul40.i229, float 0x3FD5555560000000, float 5.000000e-01)
    <   %mul42.i230 = fmul float %mul40.i229, %mul40.i229
    <   %21 = tail call noundef float @llvm.fmuladd.f32(float %20, float %mul42.i230, float %mul40.i229)
    <   %arrayidx.i307 = getelementptr inbounds nuw [129 x <2 x float>], ptr addrspace(4) @LOG2_TBL, i64 0, i64 %idxprom.i296
    <   %22 = load float, ptr addrspace(4) %arrayidx.i307, align 8
    <   %add46.i233 = fadd float %22, %conv28.i216
    <   %23 = getelementptr inbounds nuw i8, ptr addrspace(4) %arrayidx.i307, i64 4
    <   %24 = load float, ptr addrspace(4) %23, align 4
    <   %25 = tail call noundef float @llvm.fmuladd.f32(float %21, float 0xBFF7154760000000, float %24)
    <   %add48.i234 = fadd float %add46.i233, %25
    <   %cond53.i235 = select i1 %cmp.i192, float %18, float %add48.i234
    <   %cmp66.i240 = icmp eq i32 %astype1.i, 0
    <   %.neg = fmul float %cond53.i235, 0xBFE62E4300000000
    <   %fneg38.i = select i1 %cmp66.i240, float 0x7FF0000000000000, float %.neg
        %sub39.i = fsub float 1.000000e+00, %elt.abs.i1.i
        %cmp40.i = fcmp ole float %elt.abs.i1.i, 0x3FECCCCCC0000000
    >   %cond45 = select i1 %cmp39, float %fneg37, float 0.000000e+00
    <   %cond46.i = select i1 %cmp40.i, float %fneg38.i, float 0.000000e+00

I'll rework the PR to limit the main change to better vectorization only.

@frasercrmck
Contributor

Sorry, after looking at llvm-diff again, I find following change which is unexpected:

I'll rework PR to limit the main change to be better vectorization only.

Thanks for checking.

Yesterday I did start some preliminary investigations into whether __nv_* builtins are in fact better than the generic libclc implementations, especially with respect to vector sizes. The __nv builtins don't vectorize well and in theory the generic ones do, but then NVPTX only really has vectors of up to size 2 anyway. So it's a mixed bag. There do appear to be situations where we'd be better off using the generic libclc implementations. The idea was that hopefully we could find other situations where the generic builtins could be improved.

From what I remember, NV log was performing better so I agree that we should try and keep this PR's changes to a minimum.

wenju-he added 2 commits July 4, 2025 03:57
libclc/libspirv/lib/amdgcn-amdhsa/math/lgamma.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/log.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/sinpi.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/sqrt.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/fmax.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/fmin.cl
libclc/libspirv/lib/ptx-nvidiacl/math/log.cl
libclc/libspirv/lib/ptx-nvidiacl/math/sinpi.cl
libclc/libspirv/lib/ptx-nvidiacl/math/sqrt.cl

Move them into clc to eliminate llvm-diff differences of commit 96f547c

Fix a typo in libspirv/lib/generic/math/sqrt.cl

These functions use amdgcn and ptx clang built-ins, and their vector functions
are scalarized. Therefore, they are not as well vectorized as the
generic implementations.

__nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma are not
moved from libspirv to clc because doing so would cause changes in many other
places, since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions.
@wenju-he
Contributor Author

wenju-he commented Jul 4, 2025

Yesterday I did start some preliminary investigations into whether __nv_* builtins are in fact better than the generic libclc implementations, especially with respect to vector sizes. The __nv builtins don't vectorize well and in theory the generic ones do, but then NVPTX only really has vectors of up to size 2 anyway. So it's a mixed bag. There do appear to be situations where we'd be better off using the generic libclc implementations. The idea was that hopefully we could find other situations where the generic builtins could be improved.

Yes, __nv_* and __ocml_* scalar clang builtins can't be vectorized in libspirv.

From what I remember, NV log was performing better so I agree that we should try and keep this PR's changes to a minimum.

I have fixed the llvm-diff differences caused by the first commit of this PR for the following functions, by moving their uses from libspirv to clc in 3986a2d:

__nv_log
__nv_sinpi
__nv_sqrt/__nv_sqrtf
__ocml_exp_f32/f64
__ocml_fmax_f16/f32/f64
__ocml_fmin_f16/f32/f64
__ocml_lgamma_f32
__ocml_log_f32
__ocml_sinpi_f32
__ocml_sqrt_f16/f32/f64

I'm not able to fix the llvm-diff for __nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma. I can't move them from libspirv to clc because doing so would cause changes in many other places, since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions.
On the other hand, the generic __clc_isinf/__clc_isnan/__clc_fma implementations are vectorized and may be better.
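For context, generic __clc_isinf/__clc_isnan-style checks reduce to straight-line bit tests, which is why they can vectorize where opaque __nv_* libdevice calls must stay scalar. A minimal sketch, assuming IEEE-754 binary32; bits_isinf and bits_isnan are illustrative names, not the libclc source:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-level classification, assuming IEEE-754 binary32:
 * all-ones exponent with zero mantissa is inf, with nonzero mantissa is NaN. */
static int bits_isinf(float x) {
  uint32_t u;
  memcpy(&u, &x, sizeof u);              /* well-defined type pun */
  return (u & 0x7fffffffu) == 0x7f800000u;
}
static int bits_isnan(float x) {
  uint32_t u;
  memcpy(&u, &x, sizeof u);
  return (u & 0x7fffffffu) > 0x7f800000u;
}
```

These are plain integer compares with no calls, so a vectorizer can widen them, whereas a call to __nv_isinff forces scalarization of the surrounding vector code.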

@wenju-he wenju-he requested a review from frasercrmck July 4, 2025 02:08
@frasercrmck
Contributor

I'm not able to fix llvm-diff for __nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma. I can't move them from libspirv to clc because the moving will cause change in many other places since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions. On the other hand, generic __clc_isinf/__clc_isnan/__clc_fma implementations are vectorized and may be better.

I think that's probably fine, thanks.

I do hesitate at introducing these changes to the OpenCL builtins. For one, it increases our downstream changes, which results in a higher maintenance burden. We also don't know whether these __ocml or __nv builtins are conformant for OpenCL, and there's no great way to check that. Presumably the SPIR-V OpenCL extended builtins should have the same precision requirements as OpenCL, but I wouldn't put my money on that, or on the SYCL-CTS doing robust precision testing. We sadly don't yet have a great set-up for when the OpenCL and SPIR-V CLC implementations want to diverge. That's the downside of the CLC approach.

We could probably look at undoing these changes later and having these builtins use the generic implementation across the board.
