feat: accelerate f16 distance #2885

Draft
wants to merge 5 commits into
base: main

eddyxu
Contributor

@eddyxu eddyxu commented Sep 15, 2024

Ran command:

CC={clang|gcc} cargo bench --bench {l2|cosine|dot} [--features fp16kernels] -- "half::binary16::f16, auto-vectorization"

Env:

  • Ubuntu 24.04 / macOS 15
  • AWS VMs or Apple M2 Max MacBook Pro
  • GCC 13 (Ubuntu), clang 18 (Ubuntu / macOS). GCC is not installed on macOS by default
  • RUSTFLAGS=""
  • rustc 1.81

| CPU | CC | L2 (f16) | Dot (f16) | Cosine (f16) | Branch + feature |
|---|---|---|---|---|---|
| AMD Zen3 | | 867.81 ms | 701.42 ms | 1.3697 s | main |
| AMD Zen3 | gcc 13 | 887.41 ms | 905.89 ms | 920.16 ms | main + fp16kernels |
| AMD Zen3 | clang 18 | 119.64 ms | 118.90 ms | 121.82 ms | main + fp16kernels |
| AMD Zen3 | gcc 13 | 887.04 ms | 878.89 ms | 915.79 ms | lei/f16_bench |
| AMD Zen3 | clang 18 | 120.78 ms | 113.93 ms | 120.68 ms | lei/f16_bench |
| Skylake | clang | 1.5729 s | | | main |
| Skylake | gcc | 1.4302 s | 1.4184 s | 1.4276 s | main + fp16kernels |
| Skylake | clang | 290.73 ms | 260.39 ms | 287.47 ms | main + fp16kernels |
| Skylake | gcc | 1.4337 s | 1.4161 s | 1.4273 s | lei/f16_bench |
| Skylake | clang | 578.46 ms | 582.08 ms | 888.80 ms | lei/f16_bench |
| Sapphire Rapids | | 1.4047 s | 1.1850 s | 2.3802 s | main |
| Sapphire Rapids | gcc | 1.2236 s | 616.14 ms | 1.5293 s | main + fp16kernels |
| Sapphire Rapids | clang | 308.18 ms | 283.11 ms | 293.49 ms | main + fp16kernels |
| Sapphire Rapids | gcc | 887.84 ms | 857.94 ms | 897.96 ms | lei/f16_bench |
| Sapphire Rapids | clang | 274.20 ms | 276.86 ms | 314.43 ms | lei/f16_bench |
| Graviton 3 (m7g.xlarge) | | 2.9608 s | 2.7640 s | 4.7155 s | main |
| Graviton 3 | gcc | 234.97 ms | 218.71 ms | 230.73 ms | main + fp16kernels |
| Graviton 3 | clang | 209.75 ms | 209.26 ms | 239.20 ms | main + fp16kernels |
| Graviton 3 | gcc | 129.63 ms | 120.84 ms | 230.57 ms | lei/f16_bench |
| Graviton 3 | clang | 130.93 ms | 118.42 ms | 235.08 ms | lei/f16_bench |
| Apple M2 Max | clang | 85.693 ms | 64.815 ms | 87.479 ms | main + fp16kernels |
| Apple M2 Max | clang | 416.78 ms | 345.76 ms | 691.80 ms | main |
| Apple M2 Max | clang | 64.450 ms | 63.911 ms | 109.16 ms | lei/f16_bench |

Conclusion:

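  • We need to use clang to build the fp16 kernels. On x86 the GCC builds are several times slower than the clang builds (e.g. 887.41 ms vs 119.64 ms for L2 on AMD Zen3 with main + fp16kernels), while on Graviton 3 the two compilers are roughly on par.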

@github-actions github-actions bot added the enhancement New feature or request label Sep 15, 2024
@eddyxu eddyxu added the WIP work in progress label Sep 15, 2024
Comment on lines +58 to +65
#if defined(__aarch64__)
// on aarch64 with fp16, this is 2x faster.
FP16 sub = x[i] - y[i];
#else
float sub = x[i] - y[i];
#endif
// Use float32 as the accumulator to avoid overflow.
sum += sub * sub;
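
For context, the excerpt above could sit inside a kernel along the following lines. This is a minimal sketch, not the PR's actual code: the function name `sum_sq_diff_f16_sketch`, the `FP16` typedef, and the fallback to `float` when the compiler does not advertise `_Float16` are all assumptions made for illustration. The excerpt does not show whether the real kernel takes a square root, so the sketch returns the plain sum of squared differences.

```c
#include <stddef.h>

/* Assumption: FP16 maps to the compiler's _Float16 where it is available.
 * GCC defines __FLT16_MAX__ on targets that support _Float16; when the
 * macro is missing, this sketch falls back to float so it still compiles. */
#if defined(__FLT16_MAX__)
typedef _Float16 FP16;
#else
typedef float FP16;
#endif

/* Hypothetical name: sum of squared differences of two FP16 vectors. */
float sum_sq_diff_f16_sketch(const FP16 *x, const FP16 *y, size_t n) {
  float sum = 0.0f; /* float32 accumulator, as in the excerpt */
  for (size_t i = 0; i < n; i++) {
#if defined(__aarch64__)
    /* Keep the difference in half precision on aarch64. */
    FP16 sub = x[i] - y[i];
#else
    /* Store the difference as float on every other target. */
    float sub = x[i] - y[i];
#endif
    sum += sub * sub;
  }
  return sum;
}
```

Called on `x = {1, 2, 3, 4}` and `y = {0, 0, 1, 1}`, the differences are 1, 2, 2, 3, so both code paths accumulate 1 + 4 + 4 + 9.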
Contributor

Should we just have simd/generic, simd/x86 and simd/aarch64?

Contributor Author

As 3 different functions?
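
One way to picture the "three functions" reading of this suggestion is a per-architecture function plus a compile-time dispatcher. The sketch below is purely illustrative: the names `l2_f16_generic`, `l2_f16_x86`, `l2_f16_aarch64` and `l2_f16_dispatch` are hypothetical, the x86 variant simply forwards to the generic one as a placeholder, and the `FP16` typedef with its `float` fallback is an assumption rather than the PR's definition.

```c
#include <stddef.h>

/* Assumption: use _Float16 when the compiler advertises it, else float. */
#if defined(__FLT16_MAX__)
typedef _Float16 FP16;
#else
typedef float FP16;
#endif

/* Portable fallback: store each difference as float. */
float l2_f16_generic(const FP16 *x, const FP16 *y, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++) {
    float sub = x[i] - y[i];
    sum += sub * sub;
  }
  return sum;
}

#if defined(__aarch64__)
/* aarch64 variant: keep the difference in half precision. */
float l2_f16_aarch64(const FP16 *x, const FP16 *y, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++) {
    FP16 sub = x[i] - y[i];
    sum += sub * sub;
  }
  return sum;
}
#endif

#if defined(__x86_64__)
/* x86 variant: placeholder that reuses the generic loop. */
float l2_f16_x86(const FP16 *x, const FP16 *y, size_t n) {
  return l2_f16_generic(x, y, n);
}
#endif

/* Pick one implementation at compile time. */
float l2_f16_dispatch(const FP16 *x, const FP16 *y, size_t n) {
#if defined(__aarch64__)
  return l2_f16_aarch64(x, y, n);
#elif defined(__x86_64__)
  return l2_f16_x86(x, y, n);
#else
  return l2_f16_generic(x, y, n);
#endif
}
```

Compared with a single function containing `#if` blocks inside the loop, this layout keeps each architecture's loop readable on its own, at the cost of some duplicated code.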

eddyxu added a commit that referenced this pull request Sep 18, 2024