
[libspirv] Reuse clc function in generic math/common built-ins #19231

Open
wants to merge 3 commits into base: sycl

Conversation

wenju-he
Contributor

@wenju-he wenju-he commented Jul 1, 2025

Delete unused .inc and .h files. Delete unused tables.cl.

Delete unused .inc and .h files. Delete unused tables.cl.
@wenju-he wenju-he requested a review from a team as a code owner July 1, 2025 09:03
@wenju-he wenju-he requested a review from frasercrmck July 1, 2025 09:03
Contributor

@frasercrmck frasercrmck left a comment


Good stuff, thank you.

I have been meaning to ask, are you testing all of these "generic" changes downstream with the SYCL-CTS? We unfortunately don't have great coverage of the "generic" implementations in pre-commit CI.

@frasercrmck frasercrmck added the libclc libclc project related issues label Jul 1, 2025
@bader
Contributor

bader commented Jul 1, 2025

👍

@wenju-he
Contributor Author

wenju-he commented Jul 2, 2025

I have been meaning to ask, are you testing all of these "generic" changes downstream with the SYCL-CTS? We unfortunately don't have great coverage of the "generic" implementations in pre-commit CI.

@frasercrmck Not yet, but these generic implementations will be tested in downstream SYCL-CTS CI in the near future, except for a few built-ins that have hardware support, e.g. float sin. My plan is:

  1. Compare the performance of the two kinds of generic implementations, e.g.
    _CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE __clc_sin(__CLC_GENTYPE x) {
    and https://github.com/intel/intel-graphics-compiler/blob/7c04a1ec340f1a0b79460cb61982e0499efda1b6/IGC/BiFModule/Implementation/IMF/FP64/sin_d_ha.cl#L431, and upstream the Intel implementation to clc if it performs better.
  2. Switch to the libspirv generic implementation. The generic implementations in this PR will then be tested.

For this PR, I only checked llvm-diff output. There are two main changes:

  1. In libspirv-nvptx64--nvidiacl.bc, uses of __spirv_IsInf and __spirv_IsNan are replaced with the generic implementation. For instance, the diff in _Z17__spirv_ocl_fractfPf is:
in function _Z17__spirv_ocl_fractfPf:
    Old:
    >   %call.i = tail call i32 @__nv_isinff(float noundef %x) #44
    >   %tobool.i.not = icmp eq i32 %call.i, 0
    >   %cond = select i1 %tobool.i.not, float %0, float 0.000000e+00
    >   %call.i1 = tail call i32 @__nv_isnanf(float noundef %x) #44
    >   %tobool.i2.not = icmp eq i32 %call.i1, 0
    >   %cond11 = select i1 %tobool.i2.not, float %cond, float %x
    >   ret float %cond11
    New:
    <   %1 = tail call float @llvm.fabs.f32(float %x)
    <   %2 = fcmp une float %1, 0x7FF0000000000000
    <   %cond.i = select i1 %2, float %0, float 0.000000e+00
    <   %3 = fcmp ord float %x, 0.000000e+00
    <   %cond8.i = select i1 %3, float %cond.i, float %x
    <   ret float %cond8.i
  2. Better vectorization in this PR, since the clc implementation vectorizes better.

steffenlarsen pushed a commit that referenced this pull request Jul 2, 2025
…arg (#146495) (#19255)

There is no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc because
__opencl_c_generic_address_space is not defined for them.

(cherry picked from commit b0e6faa)
This PR fixes pre-commit CI error in #19231: undefined hidden symbol:
__clc_lgamma_r(float, int*)
@wenju-he
Contributor Author

wenju-he commented Jul 3, 2025

Sorry, after looking at llvm-diff again, I found the following unexpected change:

  in block %if.then13.i / %if.then12:
    OLD:
    >   %call.i4 = tail call float @__nv_logf(float noundef %0) #44
    >   %fneg37 = fneg float %call.i4
    NEW:
    <   %elt.abs.i.i191 = tail call noundef float @llvm.fabs.f32(float %sub26.i)
    <   %cmp.i192 = fcmp olt float %elt.abs.i.i191, 6.250000e-02
    <   %add.i193 = fadd float %sub26.i, 2.000000e+00
    <   %div.i194 = fdiv float %sub26.i, %add.i193, !fpmath !6
    <   %add1.i195 = fadd float %div.i194, %div.i194
    <   %mul2.i196 = fmul float %add1.i195, %add1.i195
    <   %13 = tail call noundef float @llvm.fmuladd.f32(float %mul2.i196, float 0x3F899999A0000000, float 0x3FB5555560000000)
    <   %mul4.i197 = fmul float %mul2.i196, %13
    <   %14 = fneg float %sub26.i
    <   %fneg.i198 = fmul float %div.i194, %14
    <   %15 = tail call noundef float @llvm.fmuladd.f32(float %add1.i195, float %mul4.i197, float %fneg.i198)
    <   %astype6.i199 = bitcast float %sub26.i to i32
    <   %and7.i200 = and i32 %astype6.i199, -65536
    <   %astype8.i201 = bitcast i32 %and7.i200 to float
    <   %sub9.i202 = fsub float %sub26.i, %astype8.i201
    <   %add10.i203 = fadd float %sub9.i202, %15
    <   %mul11.i204 = fmul float %add10.i203, 0x3F75476520000000
    <   %16 = tail call noundef float @llvm.fmuladd.f32(float %astype8.i201, float 0x3F75476520000000, float %mul11.i204)
    <   %17 = tail call noundef float @llvm.fmuladd.f32(float %add10.i203, float 1.437500e+00, float %16)
    <   %18 = tail call noundef float @llvm.fmuladd.f32(float %astype8.i201, float 1.437500e+00, float %17)
    <   %shr.i205 = lshr i32 %astype1.i, 23
    <   %sub15.i206 = add nsw i32 %shr.i205, -127
    <   %or.i207 = or i32 %astype1.i, 1065353216
    <   %astype16.i208 = bitcast i32 %or.i207 to float
    <   %sub17.i209 = fadd float %astype16.i208, -1.000000e+00
    <   %astype18.i210 = bitcast float %sub17.i209 to i32
    <   %shr19.i211 = lshr i32 %astype18.i210, 23
    <   %sub20.i212 = add nsw i32 %shr19.i211, -253
    <   %cmp21.i213 = icmp samesign ult i32 %astype1.i, 8388608
    <   %cond.i214 = select i1 %cmp21.i213, i32 %sub20.i212, i32 %sub15.i206
    <   %cond27.i215 = select i1 %cmp21.i213, i32 %astype18.i210, i32 %astype1.i
    <   %conv28.i216 = sitofp i32 %cond.i214 to float
    <   %and29.i217 = and i32 %cond27.i215, 8323072
    <   %and30.i218 = shl i32 %cond27.i215, 1
    <   %shl.i219 = and i32 %and30.i218, 65536
    <   %add31.i220 = add nuw nsw i32 %shl.i219, %and29.i217
    <   %or32.i221 = or disjoint i32 %add31.i220, 1056964608
    <   %astype33.i222 = bitcast i32 %or32.i221 to float
    <   %and34.i223 = and i32 %cond27.i215, 8388607
    <   %or35.i224 = or disjoint i32 %and34.i223, 1056964608
    <   %astype36.i225 = bitcast i32 %or35.i224 to float
    <   %sub37.i226 = fsub float %astype33.i222, %astype36.i225
    <   %shr38.i227 = lshr exact i32 %add31.i220, 16
    <   %idxprom.i296 = zext nneg i32 %shr38.i227 to i64
    <   %arrayidx.i297 = getelementptr inbounds nuw [129 x float], ptr addrspace(4) @LOG_INV_TBL, i64 0, i64 %idxprom.i296
    <   %19 = load float, ptr addrspace(4) %arrayidx.i297, align 4, !tbaa !7
    <   %mul40.i229 = fmul float %sub37.i226, %19
    <   %20 = tail call noundef float @llvm.fmuladd.f32(float %mul40.i229, float 0x3FD5555560000000, float 5.000000e-01)
    <   %mul42.i230 = fmul float %mul40.i229, %mul40.i229
    <   %21 = tail call noundef float @llvm.fmuladd.f32(float %20, float %mul42.i230, float %mul40.i229)
    <   %arrayidx.i307 = getelementptr inbounds nuw [129 x <2 x float>], ptr addrspace(4) @LOG2_TBL, i64 0, i64 %idxprom.i296
    <   %22 = load float, ptr addrspace(4) %arrayidx.i307, align 8
    <   %add46.i233 = fadd float %22, %conv28.i216
    <   %23 = getelementptr inbounds nuw i8, ptr addrspace(4) %arrayidx.i307, i64 4
    <   %24 = load float, ptr addrspace(4) %23, align 4
    <   %25 = tail call noundef float @llvm.fmuladd.f32(float %21, float 0xBFF7154760000000, float %24)
    <   %add48.i234 = fadd float %add46.i233, %25
    <   %cond53.i235 = select i1 %cmp.i192, float %18, float %add48.i234
    <   %cmp66.i240 = icmp eq i32 %astype1.i, 0
    <   %.neg = fmul float %cond53.i235, 0xBFE62E4300000000
    <   %fneg38.i = select i1 %cmp66.i240, float 0x7FF0000000000000, float %.neg
        %sub39.i = fsub float 1.000000e+00, %elt.abs.i1.i
        %cmp40.i = fcmp ole float %elt.abs.i1.i, 0x3FECCCCCC0000000
    >   %cond45 = select i1 %cmp39, float %fneg37, float 0.000000e+00
    <   %cond46.i = select i1 %cmp40.i, float %fneg38.i, float 0.000000e+00

I'll rework the PR to limit the main change to better vectorization only.

@frasercrmck
Contributor

Sorry, after looking at llvm-diff again, I find following change which is unexpected:

I'll rework PR to limit the main change to be better vectorization only.

Thanks for checking.

Yesterday I did start some preliminary investigations into whether __nv_* builtins are in fact better than the generic libclc implementations, especially with respect to vector sizes. The __nv builtins don't vectorize well and in theory the generic ones do, but then NVPTX only really has vectors of up to size 2 anyway. So it's a mixed bag. There do appear to be situations where we'd be better off using the generic libclc implementations. The idea was that hopefully we could find other situations where the generic builtins could be improved.

From what I remember, NV log was performing better so I agree that we should try and keep this PR's changes to a minimum.

wenju-he added 2 commits July 4, 2025 03:57
libclc/libspirv/lib/amdgcn-amdhsa/math/lgamma.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/log.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/sinpi.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/sqrt.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/fmax.cl
libclc/libspirv/lib/amdgcn-amdhsa/math/fmin.cl
libclc/libspirv/lib/ptx-nvidiacl/math/log.cl
libclc/libspirv/lib/ptx-nvidiacl/math/sinpi.cl
libclc/libspirv/lib/ptx-nvidiacl/math/sqrt.cl

Move them into clc to eliminate llvm-diff differences of commit 96f547c

Fix a typo in libspirv/lib/generic/math/sqrt.cl

These functions use amdgcn and ptx clang built-ins, and their vector functions
are scalarized. Therefore, they are not as well vectorized as the
generic implementations.

__nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma are not
moved from libspirv to clc because doing so would cause changes in many other
places, since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions.
@wenju-he
Contributor Author

wenju-he commented Jul 4, 2025

Yesterday I did start some preliminary investigations into whether __nv_* builtins are in fact better than the generic libclc implementations, especially with respect to vector sizes. The __nv builtins don't vectorize well and in theory the generic ones do, but then NVPTX only really has vectors of up to size 2 anyway. So it's a mixed bag. There do appear to be situations where we'd be better off using the generic libclc implementations. The idea was that hopefully we could find other situations where the generic builtins could be improved.

Yes, __nv_* and __ocml_* scalar clang builtins can't be vectorized in libspirv.

From what I remember, NV log was performing better so I agree that we should try and keep this PR's changes to a minimum.

I have fixed the llvm-diff differences caused by the first commit of this PR for the following functions, by moving their uses from libspirv to clc in 3986a2d:

__nv_log
__nv_sinpi
__nv_sqrt/__nv_sqrtf
__ocml_exp_f32/f64
__ocml_fmax_f16/f32/f64
__ocml_fmin_f16/f32/f64
__ocml_lgamma_f32
__ocml_log_f32
__ocml_sinpi_f32
__ocml_sqrt_f16/f32/f64

I'm not able to fix the llvm-diff for __nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma. I can't move them from libspirv to clc because doing so would cause changes in many other places, since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions.
On the other hand, the generic __clc_isinf/__clc_isnan/__clc_fma implementations are vectorized and may be better.
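For context, generic __clc_isinf/__clc_isnan-style checks reduce to straight-line bit tests, which is why they can vectorize where opaque __nv_* libdevice calls must stay scalar. A minimal sketch, assuming IEEE-754 binary32; bits_isinf and bits_isnan are illustrative names, not the libclc source:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-level classification, assuming IEEE-754 binary32:
 * all-ones exponent with zero mantissa is inf, with nonzero mantissa is NaN. */
static int bits_isinf(float x) {
  uint32_t u;
  memcpy(&u, &x, sizeof u);              /* well-defined type pun */
  return (u & 0x7fffffffu) == 0x7f800000u;
}
static int bits_isnan(float x) {
  uint32_t u;
  memcpy(&u, &x, sizeof u);
  return (u & 0x7fffffffu) > 0x7f800000u;
}
```

These are plain integer compares with no calls, so a vectorizer can widen them, whereas a call to __nv_isinff forces scalarization of the surrounding vector code.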

@wenju-he wenju-he requested a review from frasercrmck July 4, 2025 02:08
@frasercrmck
Contributor

I'm not able to fix llvm-diff for __nv_isinfd, __nv_isinff, __nv_isnand, __nv_isnanf and __nv_fma. I can't move them from libspirv to clc because the moving will cause change in many other places since __clc_isinf/__clc_isnan/__clc_fma are widely used in many clc functions. On the other hand, generic __clc_isinf/__clc_isnan/__clc_fma implementations are vectorized and may be better.

I think that's probably fine, thanks.

I do hesitate at introducing these changes to the OpenCL builtins. For one, it increases our downstream changes, which results in a higher maintenance burden. We also don't know whether these __ocml or __nv builtins are conformant for OpenCL, and there's no great way to check that. Presumably the SPIR-V OpenCL extended builtins should have the same precision requirements as OpenCL, but I wouldn't put my money on that, or on the SYCL-CTS doing robust precision testing. We sadly don't yet have a great set-up for when the OpenCL and SPIR-V CLC implementations want to diverge. That's the downside of the CLC approach.

We could probably look at undoing these changes later and having these builtins use the generic implementation across the board.
