Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: change default compiler to clang #2887

Merged
merged 17 commits into from
Sep 18, 2024
Merged

chore: change default compiler to clang #2887

merged 17 commits into from
Sep 18, 2024

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Sep 16, 2024

see #2885

@github-actions github-actions bot added the chore label Sep 16, 2024
@codecov-commenter
Copy link

codecov-commenter commented Sep 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.83%. Comparing base (9c361fe) to head (6b43cfb).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2887      +/-   ##
==========================================
- Coverage   77.98%   77.83%   -0.15%     
==========================================
  Files         231      231              
  Lines       70643    70082     -561     
  Branches    70643    70082     -561     
==========================================
- Hits        55090    54550     -540     
- Misses      12424    12649     +225     
+ Partials     3129     2883     -246     
Flag Coverage Δ
unittests 77.83% <100.00%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems to drop support for gcc when fp16kernels feature is enabled. I'm 👎 on that.

You seem to be changing testing jobs and the source-only the release jobs. If we want to ship fast binaries, we should be changing the Python / Java build jobs, right?

Comment on lines 79 to 81
// We use clang #pragma to yields better vectorization
// See https://github.com/lancedb/lance/pull/2885
.compiler("clang")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can recommend clang, but it seems wrong to force users to use it. Newer versions of GCC should work okay (even if they aren't the fastest), and I think we should let users do that if they want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is a 3-5X performance difference across the board, see #2885, so I don't think this is the point where people should pick a compiler.

.github/workflows/cargo-publish.yml Show resolved Hide resolved
@eddyxu
Copy link
Contributor Author

eddyxu commented Sep 16, 2024

This PR seems to drop support for gcc when fp16kernels feature is enabled. I'm 👎 on that.

I have not finished this PR yet

@eddyxu eddyxu marked this pull request as draft September 16, 2024 17:22
@eddyxu
Copy link
Contributor Author

eddyxu commented Sep 17, 2024

This PR seems to drop support for gcc when fp16kernels feature is enabled.

#pragma clang loop unroll(enable) interleave(enable) vectorize(enable)

f16.c uses #pragma clang, a clang specific #pragma to hint at auto-vectorization in the first place.

GCC ignores it, thus auto-vectorization does not happen in many platforms as seen in #2885. (and a lot of compile warnings are generated if you try it locally).mWe can definitely write a gcc-friendly vectorized C code using a few extra days. Before that, using GCC with fp16kernels does not deliver major benefits.

@eddyxu
Copy link
Contributor Author

eddyxu commented Sep 17, 2024

Ready for review @wjones127

Pipy build passed https://github.com/lancedb/lance/actions/runs/10907625461

@eddyxu eddyxu marked this pull request as ready for review September 17, 2024 17:51
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty crazy the peformance difference we see here. For others reference, here is the assembly for the core loop body in clang-18:

movzx   ecx, word ptr [rdi + 2*rdx]
        vmovd   xmm1, ecx
        vcvtph2ps       xmm1, xmm1
        vfmadd213ss     xmm1, xmm1, xmm0
        movzx   ecx, word ptr [rdi + 2*rdx + 2]
        vmovd   xmm0, ecx
        vcvtph2ps       xmm0, xmm0
        vfmadd213ss     xmm0, xmm0, xmm1
        movzx   ecx, word ptr [rdi + 2*rdx + 4]
        vmovd   xmm1, ecx
        vcvtph2ps       xmm1, xmm1
        movzx   ecx, word ptr [rdi + 2*rdx + 6]
        vmovd   xmm2, ecx
        vcvtph2ps       xmm2, xmm2
        vfmadd213ss     xmm1, xmm1, xmm0
        vfmadd213ss     xmm2, xmm2, xmm1
        movzx   ecx, word ptr [rdi + 2*rdx + 8]
        vmovd   xmm0, ecx
        vcvtph2ps       xmm0, xmm0
        vfmadd213ss     xmm0, xmm0, xmm2
        movzx   ecx, word ptr [rdi + 2*rdx + 10]
        vmovd   xmm1, ecx
        vcvtph2ps       xmm1, xmm1
        vfmadd213ss     xmm1, xmm1, xmm0
        movzx   ecx, word ptr [rdi + 2*rdx + 12]
        vmovd   xmm0, ecx
        vcvtph2ps       xmm2, xmm0
        movzx   ecx, word ptr [rdi + 2*rdx + 14]
        vmovd   xmm0, ecx
        vcvtph2ps       xmm0, xmm0
        vfmadd213ss     xmm2, xmm2, xmm1
        vfmadd213ss     xmm0, xmm0, xmm2
        add     rdx, 8
        cmp     rax, rdx
        jne     .LBB0_10

versus in GCC 13:

vpxor   xmm5, xmm5, xmm5
        vpinsrw xmm8, xmm5, WORD PTR [rdi], 0
        add     rdi, 16
        vpinsrw xmm11, xmm5, WORD PTR [rdi-12], 0
        vpinsrw xmm13, xmm5, WORD PTR [rdi-10], 0
        vcvtph2ps       xmm9, xmm8
        vfmadd132ss     xmm9, xmm0, xmm9
        vpinsrw xmm0, xmm5, WORD PTR [rdi-14], 0
        vpinsrw xmm15, xmm5, WORD PTR [rdi-8], 0
        vcvtph2ps       xmm12, xmm11
        vcvtph2ps       xmm14, xmm13
        vpinsrw xmm1, xmm5, WORD PTR [rdi-6], 0
        vpinsrw xmm7, xmm5, WORD PTR [rdi-4], 0
        vcvtph2ps       xmm10, xmm0
        vcvtph2ps       xmm6, xmm15
        vpinsrw xmm4, xmm5, WORD PTR [rdi-2], 0
        vcvtph2ps       xmm2, xmm1
        vcvtph2ps       xmm3, xmm7
        vcvtph2ps       xmm0, xmm4
        vfmadd231ss     xmm9, xmm10, xmm10
        vfmadd231ss     xmm9, xmm12, xmm12
        vfmadd231ss     xmm9, xmm14, xmm14
        vfmadd231ss     xmm9, xmm6, xmm6
        vfmadd231ss     xmm9, xmm2, xmm2
        vfmadd231ss     xmm9, xmm3, xmm3
        vfmadd132ss     xmm0, xmm9, xmm0
        cmp     rdx, rdi
        jne     .L3

I feel like we should look into one day why there is such a performance gap

@eddyxu eddyxu merged commit 739545f into main Sep 18, 2024
25 checks passed
@eddyxu eddyxu deleted the lei/clang branch September 18, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants