Refactor Cutlass BF16 Grouped GEMM #4124

Closed
wants to merge 1 commit into from
@cthi cthi commented May 14, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1205

We plan to change the kernel heuristics to improve the performance of this kernel. As a first step, this diff does a quick refactor so that the kernels compile in parallel, similar to CUTLASS FP8 rowwise, which keeps the follow-up diffs smaller. No functional changes in this diff.

Differential Revision: D74760416
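
One plausible reading of "parallelize kernel compilation" is that each kernel instantiation moves into its own translation unit, so the build system can compile them concurrently, while a small dispatcher keeps the shape heuristics in one place. The sketch below illustrates that layout in a single file under that assumption. Every name, tile size, and threshold in it is hypothetical and is not taken from FBGEMM.

```cpp
#include <cstdint>
#include <string>

// Hypothetical signature shared by all kernel instantiations.
// Real kernels would take tensors; a tag string stands in for the launch here.
using KernelFn = std::string (*)(std::int64_t M, std::int64_t N, std::int64_t K);

// --- would live in a shared header: declarations only, cheap to include ---
std::string bf16_grouped_128x16(std::int64_t M, std::int64_t N, std::int64_t K);
std::string bf16_grouped_128x128(std::int64_t M, std::int64_t N, std::int64_t K);
std::string bf16_grouped_256x128(std::int64_t M, std::int64_t N, std::int64_t K);

// --- would live in the dispatcher source file ---
// The heuristic only returns a function pointer, so changing it does not
// force the heavy template instantiations below to be recompiled.
// The thresholds are made up for illustration.
KernelFn select_kernel(std::int64_t M, std::int64_t /*N*/, std::int64_t /*K*/) {
  if (M <= 16) {
    return bf16_grouped_128x16;
  }
  if (M <= 128) {
    return bf16_grouped_128x128;
  }
  return bf16_grouped_256x128;
}

// --- in a real build, each definition would sit in its own source file ---
// One instantiation per file lets the build compile them in parallel.
std::string bf16_grouped_128x16(std::int64_t, std::int64_t, std::int64_t) {
  return "128x16";
}
std::string bf16_grouped_128x128(std::int64_t, std::int64_t, std::int64_t) {
  return "128x128";
}
std::string bf16_grouped_256x128(std::int64_t, std::int64_t, std::int64_t) {
  return "256x128";
}
```

With this split, a later heuristics change would touch only `select_kernel`, which matches the stated goal of keeping the next diffs smaller.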

netlify bot commented May 14, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | a0866b3 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682cdb0f8af3150008f16434 |
| 😎 Deploy Preview | https://deploy-preview-4124--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D74760416

@cthi cthi force-pushed the export-D74760416 branch from 389da07 to 5f1086f Compare May 15, 2025 21:56
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 15, 2025
Summary:

X-link: facebookresearch/FBGEMM#1205

We plan to change the kernel heuristics to improve the performance of this kernel. As a first step, this diff does a quick refactor so that the kernels compile in parallel, similar to [cutlass FP8 rowwise](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/), which keeps the follow-up diffs smaller. No functional changes in this diff.

Differential Revision: D74760416
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 15, 2025
@cthi cthi force-pushed the export-D74760416 branch from 5f1086f to 736c55a Compare May 15, 2025 21:57
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 15, 2025
Summary:
Pull Request resolved: pytorch#4124

@cthi cthi force-pushed the export-D74760416 branch from 736c55a to fb88f9e Compare May 15, 2025 22:00
@cthi cthi force-pushed the export-D74760416 branch from fb88f9e to 9064757 Compare May 15, 2025 22:06
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 15, 2025
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 16, 2025
@cthi cthi force-pushed the export-D74760416 branch from 9064757 to 0bb90c3 Compare May 19, 2025 13:29
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
Summary:

Reviewed By: jianyuh

Differential Revision: D74760416
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
@cthi cthi force-pushed the export-D74760416 branch from 0bb90c3 to 04921bb Compare May 19, 2025 13:30
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
@cthi cthi force-pushed the export-D74760416 branch from 04921bb to 144cdaa Compare May 19, 2025 13:34
@cthi cthi force-pushed the export-D74760416 branch from 144cdaa to 536e464 Compare May 19, 2025 13:38
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 19, 2025
@cthi cthi force-pushed the export-D74760416 branch from 536e464 to 45c8cd1 Compare May 20, 2025 19:30
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 20, 2025
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 20, 2025
@cthi cthi force-pushed the export-D74760416 branch from 45c8cd1 to 22f87e5 Compare May 20, 2025 19:32
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 20, 2025
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 20, 2025
@cthi cthi force-pushed the export-D74760416 branch from 22f87e5 to bbcf138 Compare May 20, 2025 19:34
cthi pushed a commit to cthi/FBGEMM-1 that referenced this pull request May 20, 2025
@facebook-github-bot
Contributor

This pull request has been merged in 9932686.

2 participants