Refactor Cutlass BF16 Grouped GEMM #4124
This pull request was exported from Phabricator. Differential Revision: D74760416
Summary: X-link: facebookresearch/FBGEMM#1205 We plan to make some changes to the kernel heuristics to improve performance on this kernel. Do a quick refactor first to parallelize kernel compilation, similar to [cutlass FP8 rowwise](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/), to keep the next diffs smaller. No functional changes in this diff. Differential Revision: D74760416
Summary: X-link: facebookresearch/FBGEMM#1205 We plan to make some changes to the kernel heuristics to improve performance on this kernel. Do a quick refactor first to parallelize kernel compilation, similar to [cutlass FP8 rowwise](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/), to keep the next diffs smaller. No functional changes in this diff. Reviewed By: jianyuh Differential Revision: D74760416
This pull request has been merged in 9932686.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1205
We plan to make some changes to the kernel heuristics to improve performance on this kernel. Do a quick refactor first to parallelize kernel compilation, similar to cutlass FP8 rowwise, to keep the next diffs smaller. No functional changes in this diff.
Differential Revision: D74760416
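The summary does not show the resulting file layout. As a rough illustration only, one common way to parallelize kernel compilation is to keep the heavy template in a shared header, give each tile configuration its own small source file, and let a dispatcher pick among the instances. The sketch below collapses that layout into a single file so it compiles on its own; every name, tile shape, and threshold in it is hypothetical and not taken from the FBGEMM sources.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// --- shared header (hypothetical): the one expensive template ---
// Stand-in for a CUTLASS grouped GEMM launch. It only returns a tag naming
// the tile shape so the dispatch decision can be observed.
template <int TileM, int TileN, int TileK>
std::string bf16_grouped_impl(int64_t /*m*/, int64_t /*n*/, int64_t /*k*/) {
  return "tile_" + std::to_string(TileM) + "x" + std::to_string(TileN) +
         "x" + std::to_string(TileK);
}

// --- one source file per instance (hypothetical) ---
// In a split layout each wrapper would live in its own .cu file, so the
// build system can compile the instantiations concurrently.
std::string bf16_grouped_64_128_128(int64_t m, int64_t n, int64_t k) {
  return bf16_grouped_impl<64, 128, 128>(m, n, k);
}

std::string bf16_grouped_128_128_128(int64_t m, int64_t n, int64_t k) {
  return bf16_grouped_impl<128, 128, 128>(m, n, k);
}

// --- dispatcher (hypothetical heuristic) ---
// Only this file needs to change when the heuristics are tuned; the
// instance files above stay untouched.
using KernelFn = std::string (*)(int64_t, int64_t, int64_t);

KernelFn select_kernel(int64_t m, int64_t n, int64_t k) {
  assert(m >= 0 && n > 0 && k > 0);
  // Invented threshold: small M goes to the smaller tile.
  if (m <= 64) {
    return bf16_grouped_64_128_128;
  }
  return bf16_grouped_128_128_128;
}
```

With this kind of split, a later change to the heuristics touches only the dispatcher, which fits the stated goal of keeping the follow-up diffs small.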