Improvement of 2D thread-partitioned GEMM for M << N case #5276

Merged

nakagawa-fj
Contributor

@nakagawa-fj commented May 21, 2025

Closes #5270
The 2D thread partitioning in GEMM (PR#4655) requires nthreads_m % 2 == 0. When M << N, this can rule out the best combination of nthreads_m and nthreads_n on architectures such as A64FX (48 cores) or Grace (144 cores), because their core counts have prime factors other than 2 (48 = 2^4 * 3, 144 = 2^4 * 3^2).
Specifically, when the matrix size N is much larger than M, the number of threads in the N direction should be increased.
However, if nthreads_m contains a factor other than 2, such as 3, the condition 'nthreads_m % 2 == 0' prevents nthreads_n from being increased any further.
This modification removes the 'nthreads_m % 2 == 0' restriction and instead selects the combination of nthreads_m and nthreads_n that minimizes the objective function 'n * nthreads_m + m * nthreads_n'.
This change improves the performance of multi-threaded GEMM when M << N.
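
For readers who want to see the effect of the objective function, here is a minimal Python sketch. It is not the code from this PR or from PR#4655. The function names `cost`, `best_partition` and `halving_partition` are hypothetical, the exhaustive search over all divisor pairs is an assumption about how the minimum could be found, and `halving_partition` is only a simplified stand-in for a scheme that may move threads from M to N solely while nthreads_m is even. Any other constraints the real driver applies are ignored.

```python
def cost(m, n, nthreads_m, nthreads_n):
    """Objective function quoted in the PR description."""
    return n * nthreads_m + m * nthreads_n


def best_partition(m, n, nthreads):
    """Assumed search: try every pair with nthreads_m * nthreads_n == nthreads
    and keep the one with the smallest objective value."""
    best, best_cost = None, None
    for nthreads_m in range(1, nthreads + 1):
        if nthreads % nthreads_m:
            continue  # not a divisor, so no valid 2D grid
        nthreads_n = nthreads // nthreads_m
        c = cost(m, n, nthreads_m, nthreads_n)
        if best_cost is None or c < best_cost:
            best, best_cost = (nthreads_m, nthreads_n), c
    return best


def halving_partition(m, n, nthreads):
    """Hypothetical stand-in for a halving-only scheme: threads move from the
    M direction to the N direction only while nthreads_m is even."""
    nthreads_m, nthreads_n = nthreads, 1
    while (nthreads_m % 2 == 0
           and cost(m, n, nthreads_m // 2, nthreads_n * 2)
               < cost(m, n, nthreads_m, nthreads_n)):
        nthreads_m //= 2
        nthreads_n *= 2
    return nthreads_m, nthreads_n


if __name__ == "__main__":
    m, n = 64, 65536  # illustrative M << N shape, not taken from the PR
    for nthreads in (48, 144):
        print(nthreads, halving_partition(m, n, nthreads),
              best_partition(m, n, nthreads))
```

With m = 64 and n = 65536, the halving stand-in stops at (3, 16) for 48 threads and at (9, 16) for 144 threads, because the remaining factor of nthreads_m is odd. The divisor search reaches (1, 48) and (1, 144), which gives all threads to the N direction.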

@martin-frbg added this to the 0.3.30 milestone May 21, 2025
@martin-frbg
Collaborator

Thank you

@martin-frbg merged commit e2e6a4d into OpenMathLib:develop May 21, 2025
82 of 86 checks passed