chore: change default compiler to clang #2887
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2887      +/-   ##
==========================================
- Coverage   77.98%   77.83%    -0.15%
==========================================
  Files         231      231
  Lines       70643    70082     -561
  Branches    70643    70082     -561
==========================================
- Hits        55090    54550     -540
- Misses      12424    12649     +225
+ Partials     3129     2883     -246

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
This PR seems to drop support for gcc when the fp16kernels feature is enabled. I'm 👎 on that.
You seem to be changing the testing jobs and the source-only release jobs. If we want to ship fast binaries, we should be changing the Python / Java build jobs, right?
rust/lance-linalg/build.rs (Outdated)

// We use clang #pragma to yield better vectorization
// See https://github.com/lancedb/lance/pull/2885
.compiler("clang")
We can recommend clang, but it seems wrong to force users to use it. Newer versions of GCC should work okay (even if they aren't the fastest), and I think we should let users do that if they want.
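A rough sketch of that middle ground (hypothetical; the real build.rs, flags, and library name differ): build.rs could prefer clang when it is actually installed and otherwise fall back to whatever C compiler the `cc` crate would pick, emitting a warning instead of failing the build.

```rust
// build.rs sketch: recommend clang, but don't require it.
// File names, flags, and the library name here are illustrative.
use std::process::Command;

fn main() {
    let mut build = cc::Build::new();
    build.file("src/simd/f16.c").flag_if_supported("-O3");

    // Use clang only if it can actually be invoked; otherwise keep the
    // `cc` crate's platform default (gcc, MSVC, ...) so builds still succeed.
    let clang_available = Command::new("clang")
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false);

    if clang_available {
        build.compiler("clang");
    } else {
        println!(
            "cargo:warning=clang not found; building fp16 kernels with the \
             default C compiler, which may be noticeably slower (see #2885)"
        );
    }

    build.compile("lance_fp16_kernels");
}
```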
Well, this is a 3-5X performance difference across the board (see #2885), so I don't think this is a decision we should leave to whichever compiler users happen to pick.
I have not finished this PR yet
Manylinux 2014 is [EOL in June 2024](https://github.com/pypa/manylinux).
lance/rust/lance-linalg/src/simd/f16.c, line 65 (at commit 78d90a9):
GCC ignores it, so auto-vectorization does not happen on many platforms, as seen in #2885 (and a lot of compile warnings are generated if you try it locally). We can definitely write gcc-friendly vectorized C code given a few extra days. Before that, using GCC with …
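A hedged follow-on sketch (an assumption, not something verified against the actual f16.c kernels): one GCC-side avenue would be to allow floating-point reassociation when the build falls back to a GCC-family compiler, since GCC generally refuses to auto-vectorize float reductions without it, whereas the clang-only pragma grants that per loop. The flag names below are real GCC options, but whether they close the gap measured in #2885 for these kernels is untested.

```rust
// build.rs fragment (sketch): when we end up on a GCC-family compiler, opt the
// fp16 kernels into reassociation so GCC is at least permitted to vectorize
// the sum/dot reductions. Untested against the kernels discussed in #2885.
fn main() {
    let mut build = cc::Build::new();
    build.file("src/simd/f16.c");

    // In the `cc` crate, is_like_gnu() is true for gcc-family compilers (not clang).
    if build.get_compiler().is_like_gnu() {
        // GCC will not vectorize float reductions unless reassociation is allowed;
        // without it, it keeps the scalar per-element evaluation order.
        // flag_if_supported() keeps older compilers building if a flag is missing.
        build.flag_if_supported("-fassociative-math");
        build.flag_if_supported("-fno-signed-zeros");
        build.flag_if_supported("-fno-trapping-math");
        build.flag_if_supported("-ftree-vectorize");
    }

    build.compile("lance_fp16_kernels");
}
```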
Ready for review @wjones127. PyPI build passed: https://github.com/lancedb/lance/actions/runs/10907625461
It's pretty crazy, the performance difference we see here. For others' reference, here is the assembly for the core loop body in clang-18:
movzx ecx, word ptr [rdi + 2*rdx]
vmovd xmm1, ecx
vcvtph2ps xmm1, xmm1
vfmadd213ss xmm1, xmm1, xmm0
movzx ecx, word ptr [rdi + 2*rdx + 2]
vmovd xmm0, ecx
vcvtph2ps xmm0, xmm0
vfmadd213ss xmm0, xmm0, xmm1
movzx ecx, word ptr [rdi + 2*rdx + 4]
vmovd xmm1, ecx
vcvtph2ps xmm1, xmm1
movzx ecx, word ptr [rdi + 2*rdx + 6]
vmovd xmm2, ecx
vcvtph2ps xmm2, xmm2
vfmadd213ss xmm1, xmm1, xmm0
vfmadd213ss xmm2, xmm2, xmm1
movzx ecx, word ptr [rdi + 2*rdx + 8]
vmovd xmm0, ecx
vcvtph2ps xmm0, xmm0
vfmadd213ss xmm0, xmm0, xmm2
movzx ecx, word ptr [rdi + 2*rdx + 10]
vmovd xmm1, ecx
vcvtph2ps xmm1, xmm1
vfmadd213ss xmm1, xmm1, xmm0
movzx ecx, word ptr [rdi + 2*rdx + 12]
vmovd xmm0, ecx
vcvtph2ps xmm2, xmm0
movzx ecx, word ptr [rdi + 2*rdx + 14]
vmovd xmm0, ecx
vcvtph2ps xmm0, xmm0
vfmadd213ss xmm2, xmm2, xmm1
vfmadd213ss xmm0, xmm0, xmm2
add rdx, 8
cmp rax, rdx
jne .LBB0_10
versus in GCC 13:
vpxor xmm5, xmm5, xmm5
vpinsrw xmm8, xmm5, WORD PTR [rdi], 0
add rdi, 16
vpinsrw xmm11, xmm5, WORD PTR [rdi-12], 0
vpinsrw xmm13, xmm5, WORD PTR [rdi-10], 0
vcvtph2ps xmm9, xmm8
vfmadd132ss xmm9, xmm0, xmm9
vpinsrw xmm0, xmm5, WORD PTR [rdi-14], 0
vpinsrw xmm15, xmm5, WORD PTR [rdi-8], 0
vcvtph2ps xmm12, xmm11
vcvtph2ps xmm14, xmm13
vpinsrw xmm1, xmm5, WORD PTR [rdi-6], 0
vpinsrw xmm7, xmm5, WORD PTR [rdi-4], 0
vcvtph2ps xmm10, xmm0
vcvtph2ps xmm6, xmm15
vpinsrw xmm4, xmm5, WORD PTR [rdi-2], 0
vcvtph2ps xmm2, xmm1
vcvtph2ps xmm3, xmm7
vcvtph2ps xmm0, xmm4
vfmadd231ss xmm9, xmm10, xmm10
vfmadd231ss xmm9, xmm12, xmm12
vfmadd231ss xmm9, xmm14, xmm14
vfmadd231ss xmm9, xmm6, xmm6
vfmadd231ss xmm9, xmm2, xmm2
vfmadd231ss xmm9, xmm3, xmm3
vfmadd132ss xmm0, xmm9, xmm0
cmp rdx, rdi
jne .L3
I feel like we should look into why there is such a performance gap one day.
see #2885