Refactor Cutlass BF16 Grouped GEMM #4124


Closed

cthi wants to merge 1 commit into pytorch:main from cthi:export-D74760416

cthi commented May 14, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1205

We plan to change the kernel heuristics to improve this kernel's performance. As a first step, do a quick refactor to parallelize kernel compilation, similar to cutlass FP8 rowwise, so that the follow-up diffs stay smaller. No functional changes in this diff.

Differential Revision: D74760416
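
For readers unfamiliar with the pattern, one common way to parallelize kernel compilation is to give each kernel configuration its own translation unit and expose all of them through a shared manifest header, so the build system can compile the files concurrently. The sketch below is only an illustration of that idea and is not taken from this PR: the tile configurations, file names, `Problem` type, and `run_grouped_gemm` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical (tile_m, tile_n, tile_k) configurations; not taken from the PR.
CONFIGS: List[Tuple[int, int, int]] = [
    (128, 128, 128),
    (128, 256, 64),
    (256, 128, 64),
]


def kernel_name(cfg: Tuple[int, int, int]) -> str:
    """Build a unique function/file stem for one configuration."""
    return "bf16_grouped_" + "_".join(str(d) for d in cfg)


def render_sources(configs: List[Tuple[int, int, int]]) -> Dict[str, str]:
    """Return {filename: contents}: one .cu per config plus a manifest header.

    Each .cu file instantiates exactly one configuration, so the files are
    independent compile jobs that a build system can run in parallel.
    """
    files: Dict[str, str] = {}
    decls: List[str] = []
    for cfg in configs:
        name = kernel_name(cfg)
        tm, tn, tk = cfg
        decls.append(f"void {name}(const Problem& p);")
        files[f"{name}.cu"] = (
            '#include "bf16_grouped_common.cuh"  // hypothetical shared template\n'
            f"void {name}(const Problem& p) {{\n"
            f"  run_grouped_gemm<{tm}, {tn}, {tk}>(p);\n"
            "}\n"
        )
    # The manifest only declares the entry points; the heuristic dispatcher
    # would include it and pick one function per problem shape.
    files["bf16_grouped_manifest.cuh"] = (
        "#pragma once\nstruct Problem;\n" + "\n".join(decls) + "\n"
    )
    return files


if __name__ == "__main__":
    for filename in sorted(render_sources(CONFIGS)):
        print(filename)
```

With a layout like this, a later change to the heuristics would only touch the dispatcher and the list of configurations, which is consistent with the stated goal of keeping follow-up diffs small.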

netlify bot commented May 14, 2025

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`a0866b3`
🔍 Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682cdb0f8af3150008f16434
😎 Deploy Preview	https://deploy-preview-4124--pytorch-fbgemm-docs.netlify.app

facebook-github-bot added the cla signed label

Contributor

facebook-github-bot commented May 14, 2025

This pull request was exported from Phabricator. Differential Revision: D74760416

facebook-github-bot added the fb-exported label

cthi force-pushed the export-D74760416 branch from 389da07 to 5f1086f Compare

May 15, 2025 21:56

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

5f1086f

Summary:

X-link: facebookresearch/FBGEMM#1205

We plan to change the kernel heuristics to improve this kernel's performance. As a first step, do a quick refactor to parallelize kernel compilation, similar to [cutlass FP8 rowwise](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/), so that the follow-up diffs stay smaller. No functional changes in this diff.

Differential Revision: D74760416

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

736c55a


cthi force-pushed the export-D74760416 branch from 5f1086f to 736c55a Compare

May 15, 2025 21:57


cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

fb88f9e

Summary:
Pull Request resolved: pytorch#4124


cthi force-pushed the export-D74760416 branch from 736c55a to fb88f9e Compare

May 15, 2025 22:00


cthi force-pushed the export-D74760416 branch from fb88f9e to 9064757 Compare

May 15, 2025 22:06

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)


cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

ffbec71


cthi force-pushed the export-D74760416 branch from 9064757 to 0bb90c3 Compare

May 19, 2025 13:29

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

0bb90c3

Reviewed By: jianyuh

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

445caa2


cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

04921bb


cthi force-pushed the export-D74760416 branch from 0bb90c3 to 04921bb Compare

May 19, 2025 13:30

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

9ee6e1c



cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

144cdaa


cthi force-pushed the export-D74760416 branch from 04921bb to 144cdaa Compare

May 19, 2025 13:34


cthi force-pushed the export-D74760416 branch from 144cdaa to 536e464 Compare

May 19, 2025 13:38

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

536e464


cthi force-pushed the export-D74760416 branch from 536e464 to 45c8cd1 Compare

May 20, 2025 19:30

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

45c8cd1


cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)


cthi force-pushed the export-D74760416 branch from 45c8cd1 to 22f87e5 Compare

May 20, 2025 19:32

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

22f87e5


cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

dfaa057



cthi force-pushed the export-D74760416 branch from 22f87e5 to bbcf138 Compare

May 20, 2025 19:34

cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request


          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

bbcf138



          Refactor Cutlass BF16 Grouped GEMM (pytorch#4124)

a0866b3



cthi force-pushed the export-D74760416 branch from bbcf138 to a0866b3 Compare

May 20, 2025 19:42

facebook-github-bot closed this in

Contributor

facebook-github-bot commented May 21, 2025

This pull request has been merged in 9932686.

facebook-github-bot added the Merged label


Labels

cla signed fb-exported Merged